Artificial Intelligence and Data Mining Course Slides: lect.ppt


Artificial Intelligence and Data Mining Course Slides: l(3)
Chapter 3 Basic Data Mining Techniques
3.1 Decision Trees
(For classification)
2020/11/2
ppt课件
1
Introduction: Classification—A Two-Step Process
• 1. Model construction: build a model that can describe a set of
– This set of examples is used for model construction: training set
– The model can be represented as classification rules, decision trees, or mathematical formulae
buys_computer: no, no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no
1 Example (2): Output: A Decision Tree for “buys_computer”
age?
<=30 | overcast (30..40)
student?
no
George Professor 5 yes
Joseph Assistant Prof 7 yes
(Jeff, Professor, 4)
Tenured?
1 Example (1): Training Dataset
An example from Quinlan’s ID3 (1986)
student credit_rating no fair no excellent no fair no fair yes fair yes excellent yes excellent no fair yes fair yes fair yes excellent no excellent yes fair no excellent

Artificial Intelligence and Data Mining Course Slides: lect-3-12

– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left
– Reach the pre-set accuracy
4/22/2020
age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40
income high high high medium low low low medium low medium medium medium high medium
3.2 Rules simplification and elimination
A Rule for the Tree in Figure 3.4
IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 75%, Figure 3.4)
Attribute Selection by Information Gain
Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$

Artificial Intelligence and Data Mining Course Slides: courseintro-13.ppt

[Chart: values by quarter (Q1, Q2, Q3) for East Coast, South, Midwest, West Coast]
Insight
Based on:
Michael J. A. Berry, Data Miners,

2021/3/11
AI & DM: BUPT
[Figure label: Months as Customer]
✓ Approach – give away gifts: target customers, what gift, what time
Issues to consider: 1. Targeting customers
• Every customer
• High expenditure customers
• Most profitable customers (who are)
• Customers likely to churn (concentrate on the ones
– There are different types of information systems that can support the operation of business: word processor, spread sheets, databases, accounting systems, ERP, decision support systems, expert systems, business intelligence…
Example: why CRM needs DM
✓ CRM for mobile phone company – customer retention (churn)

Artificial Intelligence and Data Mining Course Slides: lect-5-13

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
– d(i,i) = 0
– d(i,j) = d(j,i)
– d(i,j) ≤ d(i,k) + d(k,j)
2019/1/28 AI&DM BUPT 16
– Calculate the standardized measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
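The standardization step can be illustrated with a short Python sketch. It assumes that m_f is the mean of variable f and that s_f is its mean absolute deviation; the definition of s_f does not survive in this text, so that choice is an assumption, and the function name `standardize` is hypothetical. The sample values are borrowed from the Price($) example elsewhere in these slides.

```python
def standardize(values):
    """Return z-scores (x - m) / s for one variable.

    Assumption: s is the mean absolute deviation around the mean m.
    """
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    if s == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - m) / s for x in values]


prices = [7, 20, 22, 50, 51, 53]
z_scores = standardize(prices)
for price, z in zip(prices, z_scores):
    print(f"{price:>3} -> {z:+.3f}")
```

With this choice of s_f the standardized values have mean 0 and mean absolute value 1, so variables measured in different units become comparable before distances are computed.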
4.2 Binary Variables (二值变量)
• A contingency table (相依表)for binary data
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer
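A distance parameterized by a positive integer q over two p-dimensional objects is usually the Minkowski distance; the sketch below assumes that reading. The function name `minkowski` is hypothetical. With q = 2 it gives the Euclidean distance and with q = 1 the Manhattan distance.

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two equal-length numeric sequences."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)


x, y, z = (0, 0), (3, 4), (6, 0)
print(minkowski(x, y, q=2))  # Euclidean
print(minkowski(x, y, q=1))  # Manhattan
# Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
print(minkowski(x, z) <= minkowski(x, y) + minkowski(y, z))
```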
4.1 Interval-valued variables (cont. 2)
                 Object j
                 1      0      sum
Object i   1     a      b      a+b
           0     c      d      c+d
           sum   a+c    b+d    p
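• Simple matching coefficient (if the binary variable is symmetric (对称的)):
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
• Jaccard coefficient (if the binary variable is asymmetric):
$$d(i, j) = \frac{b + c}{a + b + c}$$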
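A small Python sketch shows how the contingency counts a, b, c, d lead to a dissimilarity value. The function names are hypothetical, objects are assumed to be equal-length lists of 0/1 values, and the asymmetric variant (b+c)/(a+b+c), commonly called the Jaccard coefficient, is included on the assumption that it is the second formula on the slide.

```python
def contingency(i, j):
    """Count (a, b, c, d) for two binary vectors: 1/1, 1/0, 0/1, 0/0."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching(i, j):
    """Dissimilarity for symmetric binary variables."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard(i, j):
    """Dissimilarity for asymmetric binary variables (0/0 matches ignored)."""
    a, b, c, _ = contingency(i, j)
    return (b + c) / (a + b + c) if (a + b + c) else 0.0


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
print(contingency(obj_i, obj_j))
print(simple_matching(obj_i, obj_j), jaccard(obj_i, obj_j))
```

The two measures differ only in whether joint absences (d) count as agreement, which matters when a value of 0 carries little information.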
Example
Price($)
7 20 22 50 51 53

Artificial Intelligence and Data Mining Course Slides: 2.datawarehouse

Data Warehouse environment




• the source systems from which data is extracted
• the tools used to extract data for loading the data warehouse
• the data warehouse database itself where the data is stored
• the desktop query and reporting tools used for decision support
Data Warehousing Process Overview
Operational Vs. Multidimensional View Of Sales
Creating A Data Warehouse
The Data Warehouse

The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.
The Data Warehouse

Integrated

The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization. The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.

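Artificial Intelligence (6): Knowledge Discovery and Data Mining (ppt slides)
Artificial Intelligence
School of Computer Science, Beijing Information Science and Technology University, 李宝安
Knowledge Discovery and Data Mining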
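Database technology and computer networks have become the two most important foundations of current computer applications, touching every aspect of human life. The total volume of data in databases and on the Internet worldwide is growing extremely fast. Simple queries and statistics can satisfy some low-level needs, but what people need more is general knowledge, mined from large data resources, that can guide decisions of all kinds. The explosive growth, timeliness, and complexity of data far exceed what people can process by hand, so high-performance automated data analysis tools are urgently needed to process data quickly, comprehensively, thoroughly, and effectively.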
     D       P        D/P     D²/P     D³/P²
B    8.67    3.571    2.427   21.038   51.06
C    14.00   7.155    1.957   7.395    53.61
D    24.67   16.889   1.418   36.459   53.89
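BACON4 applies the heuristics above and finds a monotonic trend between D and P: P increases as D increases, but the corresponding slope term is not constant; it decreases as D increases. This leads BACON4 to define D²/P. The value of this term is not constant either, but it increases as D/P decreases, so the system considers the term D³/P², whose value is nearly constant (the system allows a tolerance, for example 7.5%). From this result BACON4 induces the law.
Once a derived term has been defined, it is treated no differently from a directly observed variable. In the ideal gas law example, the trend detector first identifies a derived term such as PV and then a derived term such as PV/T. Relations among the values of these derived terms can also be found, and new derived terms are derived from them in turn, leading to more complex descriptions of the directly observed variables, such as PV/nT. BACON4 applies the same heuristics recursively to build progressively more complex, higher-level descriptions, and this ability gives the system considerable power in searching for empirical laws.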
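The constancy check at the end of this search can be illustrated with a short Python sketch. It is not BACON4 itself: it only evaluates the three terms named in the text (D/P, D²/P, D³/P²) on the D and P values of rows B, C, and D from the table in these slides, and it assumes that "within 7.5%" means relative deviation from the mean. The names `is_constant` and `first_constant_term` are hypothetical.

```python
OBSERVATIONS = {  # row label -> (D, P), taken from the table in the slides
    "B": (8.67, 3.571),
    "C": (14.00, 7.155),
    "D": (24.67, 16.889),
}

# Candidate derived terms, in the order the text says they are considered.
TERMS = {
    "D/P": lambda d, p: d / p,
    "D^2/P": lambda d, p: d ** 2 / p,
    "D^3/P^2": lambda d, p: d ** 3 / p ** 2,
}


def is_constant(values, tolerance=0.075):
    """True if every value lies within `tolerance` of the mean (assumed test)."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tolerance * abs(mean) for v in values)


def first_constant_term(observations, tolerance=0.075):
    """Return the name of the first candidate term that is nearly constant."""
    for name, term in TERMS.items():
        values = [term(d, p) for d, p in observations.values()]
        if is_constant(values, tolerance):
            return name
    return None


print(first_constant_term(OBSERVATIONS))
```

On these three rows only D³/P² passes the check, which matches the law the text says BACON4 induces.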

Artificial Intelligence and Data Mining Course Slides: 2.datawarehouse
Subject-Oriented
The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.
What is Data Warehouse
The idea of a data warehouse is to put a wide range of operational data from internal and external sources into one place so it can be better utilized by executives, line of business managers and other business analysts.
The Data Warehouse
Time Variant
The Warehouse data represent the flow of data through time. It can even contain projected data.
Non-Volatile
Once data enter the Data Warehouse, they are never removed.
The Data Warehouse
The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.

Artificial Intelligence and Data Mining Course Slides: 2.datawarehouse.ppt
Once the information is gathered, OLAP (on-line analytical processing ) software comes into play by providing the desktop analysis tools for querying, manipulating and reporting the data from the data warehouse.
The Data Warehouse is always growing.
Operational Database vs. Data warehouse
Operational DB: Similar data can have different representations
Data Warehouse: Unified view of all data elements
Data Warehouse
Why Data warehouse
The most common issue companies face when looking at data mining is that the information is not in one place.
The biggest challenge business analysts face in using data mining is how to extract, integrate, cleanse, and prepare data to solve their most pressing business problems.
Data Mart
Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses.

age = 30..40: yes
age >40: credit rating?
student? no → no; yes → yes
credit rating? excellent → no; fair → yes
2020/4/24
AI&DM
6
2 Algorithm for Decision Tree Building
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
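The greedy top-down procedure can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function names `info`, `gain`, and `id3` are hypothetical, the rows are the 14-example buys_computer training set from these slides (with the third age value read as 31..40), and the nested-dict tree representation is an assumption made for brevity.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
TARGET = "buys_computer"
ROWS = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
DATA = [dict(zip(ATTRS + [TARGET], line.split())) for line in ROWS.splitlines()]


def info(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr):
    """Information gained by partitioning rows on attr."""
    expected = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[TARGET] for r in rows if r[attr] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([r[TARGET] for r in rows]) - expected


def id3(rows, attrs):
    """Build a tree as nested dicts; leaves are class labels."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1:          # all samples belong to one class
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}


print(id3(DATA, ATTRS))
```

Because the sketch only branches on attribute values that actually occur in the current subset, the "no samples left" stopping case never arises here; a fuller implementation would handle it with a default class.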
Attribute Selection by Information Gain
Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$
– There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
– There are no samples left
– Reach the pre-set accuracy
predetermined classes
– Preparation: Each tuple/sample is assumed to belong to a predefined class, labeled by the output attribute or class label attribute
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
3.2 Rules simplification and elimination
A Rule for the Tree in Figure 3.4
IF Age <=43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 75%, Figure 3.4)
A Simplified Rule Obtained by Removing Attribute Age
IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No (accuracy = 83.3% (5/6), Figure 3.5)
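classify objects in all subsets S_i is
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
where v is the number of subsets S_i produced by attribute A
• The encoding information that would be gained by branching on A is
$$Gain(A) = I(p, n) - E(A)$$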
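The arithmetic behind these formulas can be checked with a few lines of Python. The sketch below works directly from (p_i, n_i) counts; the counts were tallied from the 14-row buys_computer training set in these slides, and the names `info`, `expected_info`, `gain`, and `COUNTS` are hypothetical.

```python
from math import log2


def info(p, n):
    """I(p, n): information needed to classify p positive and n negative examples."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)


def expected_info(partition):
    """E(A) for a list of (p_i, n_i) pairs, one pair per value of attribute A."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partition)


def gain(partition):
    """Gain(A) = I(p, n) - E(A)."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return info(p, n) - expected_info(partition)


# (yes, no) counts per attribute value, tallied from the training set.
COUNTS = {
    "age": [(2, 3), (4, 0), (3, 2)],            # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],         # high, medium, low
    "student": [(6, 1), (3, 4)],                # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}

for attr, partition in COUNTS.items():
    print(f"Gain({attr}) = {gain(partition):.3f}")
```

Up to rounding in the last digit, the results agree with the values quoted on the slides, and age has the largest gain, so it becomes the root test.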
buys_computer: no, no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no
1 Example (2): Output: A Decision Tree for “buys_computer”
age?
<=30 | overcast (30..40)
student?
Classification Process (2): Use the Model in Prediction
Classifier
Testing Data
Unseen Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

age      p_i  n_i  I(p_i, n_i)
30…40    4    0    0
>40      3    2    0.971
3. Decision Tree Rules
• Automate rule creation • Rules simplification and elimination • A default rule is chosen
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of testing set samples that are correctly classified by the model
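A minimal Python sketch of the accuracy estimate, using the tenure example from these slides. The classifier is the rule shown on the model slide (professor, or more than 6 years); the test samples are the rows whose label survives in this text (Merlisa is omitted because her label is not present here). The names `predict` and `accuracy` are hypothetical.

```python
# (name, rank, years, known label) for test samples with a recoverable label
TEST_SET = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict(rank, years):
    """Rule from the slides: rank = 'professor' OR years > 6 -> tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(samples):
    """Fraction of samples whose predicted label equals the known label."""
    correct = sum(1 for _, rank, years, label in samples
                  if predict(rank, years) == label)
    return correct / len(samples)


print(f"accuracy = {accuracy(TEST_SET):.0%}")
```

The samples passed to `accuracy` should be held out from training; scoring the model on the rows it was built from would overstate its accuracy.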
Classification Process (1): Model Construction
Training Data
Classification Algorithms
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7
3.1 Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
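Rule extraction is a walk over every root-to-leaf path. The Python sketch below assumes the tree is stored as nested dicts (attribute, then value, then subtree or leaf label), which is a representation chosen for illustration; the function name `extract_rules` is hypothetical. The leaf labels for the >40 branch follow the training data in these slides (excellent gives no, fair gives yes).

```python
TREE = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}


def extract_rules(tree, target="buys_computer", conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(tree, dict):  # reached a leaf: emit the finished rule
        lhs = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {lhs} THEN {target} = "{tree}"']
    (attr, branches), = tree.items()  # each internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, target, conditions + ((attr, value),))
    return rules


for rule in extract_rules(TREE):
    print(rule)
```

Each attribute-value pair along a path becomes one conjunct of the rule's antecedent, and the leaf supplies the predicted class.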
Information Gain (信息增益)(ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
age     income  student  credit_rating
<=30    high    no       fair
<=30    high    no       excellent
31…40   high    no       fair
>40     medium  no       fair
>40     low     yes      fair
>40     low     yes      excellent
31…40   low     yes      excellent
<=30    medium  no       fair
<=30    low     yes      fair
>40     medium  yes      fair
<=30    medium  yes      excellent
31…40   medium  no       excellent
31…40   high    yes      fair
>40     medium  no       excellent
yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classifier (Model)
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
– The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
• Note: Test set is independent of training set, otherwise over-fitting will occur
• 2. Model usage: use the model to classify future or unknown