Data Mining, Machine Learning, and Weka: Teaching Outline

• Repeatedly select an attribute on which to split the instances (algorithms: ID3, C4.5)
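A minimal sketch of how the splitting attribute can be chosen. It assumes information gain as the selection criterion (the one ID3 uses; C4.5 refines it with gain ratio) and assumes the 14 rows below match the standard weather.nominal.arff data; the names `ROWS`, `entropy`, and `information_gain` are illustrative only.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumed contents of weather.nominal.arff; the last column is the class (play).
ROWS = [line.split(",") for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Class entropy before the split minus the weighted entropy after it."""
    before = entropy([r[-1] for r in rows])
    after = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[-1] for r in rows if r[column] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)  # attribute chosen for the root split
print(gains, best_attribute)
```

A full tree builder would apply the same choice recursively to each subset until the subsets are pure or no attributes remain.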
Covering algorithms: Constructing rules (algorithm: Prism)
• Take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class.
Inferring rudimentary rules (algorithm: 1R, also written 1-Rule)
Statistical modeling (algorithm: Naïve Bayes)
• Uses all attributes, assuming they are independent of one another and equally important
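The independence assumption can be sketched as follows: the score for each class is its prior multiplied by one conditional frequency per attribute. This is a minimal illustration without smoothing (a zero count would zero out a class), and it assumes the rows below match weather.nominal.arff; `train` and `posterior` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Assumed contents of weather.nominal.arff; the last column is the class (play).
ROWS = [line.split(",") for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def train(rows):
    """Count classes and, per (column, class), how often each value occurs."""
    class_counts = Counter(r[-1] for r in rows)
    value_counts = defaultdict(Counter)
    for r in rows:
        for col, value in enumerate(r[:-1]):
            value_counts[(col, r[-1])][value] += 1
    return class_counts, value_counts


def posterior(instance, class_counts, value_counts):
    """Normalized class probabilities for one instance (no smoothing)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior probability of the class
        for col, value in enumerate(instance):
            score *= value_counts[(col, label)][value] / n  # one factor per attribute
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


class_counts, value_counts = train(ROWS)
probabilities = posterior(["sunny", "cool", "high", "TRUE"], class_counts, value_counts)
prediction = max(probabilities, key=probabilities.get)
print(probabilities, prediction)
```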
Divide and conquer: Constructing decision trees
Concept: Machine Learning
To learn:
– to get knowledge of by study, experience, or being taught;
– to become aware by information or from observation;
– to commit to memory;
– to be informed of, ascertain; to receive instruction
Other algorithms: Neural Network
14.2857 % (proportion of incorrectly classified instances)
Steps of the data mining process: see "Review: Steps of DM"
Input: Concepts, Instances, Attributes
Concept
– Four basic styles of learning
• Classification, association, clustering, numeric prediction
Getting to know your data!
• Data cleaning is a time-consuming and laborious process, but a very important one
• Garbage in, garbage out!
Output: Knowledge representation
Decision tables
Decision trees
Classification rules
• Einstein: Everything should be made as simple as possible,
Implementation: Real machine learning schemes (omitted)
Further reading:
– Ch6.1 Decision tree
– Ch6.2 Classification rules
– Ch6.3 Extending linear classification:
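Concepts: KDD, ML, OLAP, and DM
KDD (Knowledge Discovery in Databases)
is a multi-step process of discovering knowledge.
ML (Machine Learning)
= KD, not limited to data stored in a database
Process: mining, data patterns, representation, validation, prediction
OLAP (Online Analytical Processing)
– Mainly used for numeric prediction and classification (Linear regression)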
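As a rough illustration of numeric prediction with a linear model, the sketch below fits a straight line with one predictor by ordinary least squares. The data points and the name `fit_line` are hypothetical; the cpu.arff example uses several attributes, which this sketch does not attempt to reproduce.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # numeric prediction for an unseen x
print(slope, intercept, predicted)
```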
Instance-based learning
– Algorithms: Nearest-neighbor, k-nearest-neighbor
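A minimal sketch of k-nearest-neighbor classification, assuming numeric attributes, Euclidean distance, and a simple majority vote. The training points and the name `knn_predict` are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train, query, k=1):
    """train is a list of (feature_tuple, label); returns the majority label
    among the k training instances closest to query."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training instances.
TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
    ((4.0, 4.2), "b"), ((3.8, 4.0), "b"), ((4.1, 3.9), "b"),
]

print(knn_predict(TRAIN, (1.1, 1.0), k=1))
print(knn_predict(TRAIN, (3.9, 4.0), k=3))
```

With k=1 this reduces to plain nearest-neighbor; larger k trades sensitivity to noisy instances against blurring of class boundaries.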
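Evaluating credibility*
Three datasets:
• Training data: used to derive the model; the more data, the better the model
• Validation data: used to tune the model's parameters
• Test data: used to compute the error rate of the final model; the more data, the more accurate the estimate
Input: Preparing the input*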
Gathering the data together
– The data must be assembled, integrated, and cleaned up(Data Warehousing)
– Selecting the right type and level of aggregation is usually critical for success
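– Regardless of the type of learning, we call the thing to be learned the concept, and the output of learning the concept description
Instance: a data sample (record)
Attribute: a data field
– Nominal: e.g. outlook: sunny => no
– Ordinal: distances between values cannot be measured, e.g. hot > mild > cool
– Interval: distances between values can be measured, e.g. integers
– Ratio: e.g. 58.1%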
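Principle: test data must never be used in any way to train the model
Problem: how should the data be split when there are only a few samples?
Methods:
• N-fold cross-validation (n = 3 or 10)
• Leave-one-out cross-validation
• Bootstrap (the 0.632 bootstrap): best for very small datasets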
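The splitting methods above can be sketched as index bookkeeping. This is a simplified illustration: it does not shuffle or stratify the data, and the helper names `k_fold_splits` and `bootstrap_train_fraction` are hypothetical. Leave-one-out is the special case where the number of folds equals the number of instances, and the 0.632 figure is the expected share of distinct instances that end up in a bootstrap training sample.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds.
    Every instance is used for testing exactly once."""
    indices = list(range(n_instances))
    for fold in range(k):
        test = indices[fold::k]  # every k-th instance, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def bootstrap_train_fraction(n):
    """Expected fraction of distinct instances picked when sampling
    n times with replacement from n instances: 1 - (1 - 1/n)^n."""
    return 1 - (1 - 1 / n) ** n


for train, test in k_fold_splits(14, 3):
    print(len(train), len(test))

leave_one_out = list(k_fold_splits(14, 14))  # one test instance per fold
print(round(bootstrap_train_fraction(1000), 3))
```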
• The covering approach produces a rule set rather than a decision tree
Algorithms: The basic methods
Mining association rules:
– Parameters: coverage (support), accuracy (confidence)
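A small sketch of the two parameters, taking coverage as the number of instances a rule predicts correctly and accuracy as that number divided by the instances the rule applies to. The rows are assumed to match weather.nominal.arff, and `rule_stats` is a hypothetical helper, not a Weka function.

```python
NAMES = ["outlook", "temperature", "humidity", "windy", "play"]
# Assumed contents of weather.nominal.arff.
ROWS = [dict(zip(NAMES, line.split(","))) for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def matches(row, conditions):
    """True if the row satisfies every attribute=value condition."""
    return all(row[name] == value for name, value in conditions.items())


def rule_stats(rows, antecedent, consequent):
    """Return (coverage, accuracy) for the rule 'if antecedent then consequent'."""
    applies = [r for r in rows if matches(r, antecedent)]
    correct = [r for r in applies if matches(r, consequent)]
    coverage = len(correct)
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy


coverage, accuracy = rule_stats(
    ROWS, {"humidity": "normal", "windy": "FALSE"}, {"play": "yes"}
)
print(coverage, accuracy)
```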
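Linear models (see the cpu.arff example)
is the process of online analysis of a database.
Data mining
is only one important component of KDD/ML.
DM is used to generate hypotheses, whereas OLAP is used to verify them
Concepts: DM and DB
Data preparation accounts for 70% of the work in the data mining process
"Database" + "Data mining" = a database that can talk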
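Data Mining: Practical Machine Learning Techniques and Their Java Implementation
Original book
– English edition: "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", by Ian H. Witten and Eibe Frank (New Zealand)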
Weka
Combining multiple models
– Bagging
– Boosting
– Stacking
– Error-correcting output codes
The future: Looking forward
Large datasets
Visualization: input, output
Incorporating domain knowledge
Test results: Confusion matrix (p. 138) and accuracy
• a b <-- classified as
• 8 1 | a = yes
• 1 4 | b = no
• Correctly Classified Instances 12 85.7143 %
• Incorrectly Classified Instances 2
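The summary figures follow directly from the confusion matrix: correct predictions sit on the diagonal. A short sketch, assuming rows are the actual classes and columns the predicted classes (the layout of the "classified as" header above):

```python
# Rows: actual class (yes, no); columns: predicted class (yes, no).
CONFUSION = [
    [8, 1],
    [1, 4],
]

correct = sum(CONFUSION[i][i] for i in range(len(CONFUSION)))  # diagonal entries
total = sum(sum(row) for row in CONFUSION)
accuracy = correct / total
error_rate = 1 - accuracy

print(f"Correctly classified:   {correct} ({100 * accuracy:.4f} %)")
print(f"Incorrectly classified: {total - correct} ({100 * error_rate:.4f} %)")
```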
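Review: Practical applications of DM
• Marketing: market basket analysis
• Customer relationship management
• Finding prospective customers
• Increasing customer lifetime value
• Maintaining customer loyalty
• Marketing campaign planning
• Predicting the direction of financial markets
• Insurance fraud detection
• Customer credit risk rating
• Detecting fraudulent telephone use
• Analysis of NBA player strengths and weaknesses
• Early warning of likely credit card bad debts
• Classification of celestial objects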
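Review: Steps of DM*
One way of dividing the process into steps
– Understand the data and the work to be done
– Acquire the relevant knowledge and techniques (Acquisition)
– Integrate and check the data (Integration and checking)
– Remove erroneous and inconsistent data (Data cleaning)
– Develop models and hypotheses (Model and hypothesis development)
– Carry out the actual data mining
– Test and verify the analyzed data (Testing and verification)
– Interpret and use the results (Interpretation and use)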
Counting the cost:
• Lift charts (respondents / sample size), ROC curves (p. 141)
The MDL principle (Minimum Description Length)
• Occam’s Razor: Other things being equal, simple theories are preferable to complex ones.
Review: Functional categories of DM
Taxonomy 1
– Classification
– Estimation
– Prediction
– Affinity grouping
– Clustering
Taxonomy 2
– Classification
– Regression
– Time-series forecasting
– Clustering
– Association
– Sequence discovery
Algorithms: The basic methods
Simplicity-first: simple ideas often work very well
• Very simple classification rules perform well on most commonly used datasets (Holte 1993)
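A minimal sketch of such a one-attribute rule learner in the style of 1R: for each attribute, predict the majority class for each of its values, count the errors, and keep the attribute with the fewest. It assumes nominal attributes only and that the rows match weather.nominal.arff; ties are broken by attribute order, which is a choice made here for simplicity.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Assumed contents of weather.nominal.arff; the last column is the class (play).
ROWS = [line.split(",") for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def one_r(rows, attributes):
    """Return (attribute, {value: predicted class}, training errors)."""
    best = None
    for col, name in enumerate(attributes):
        by_value = defaultdict(Counter)  # attribute value -> class counts
        for r in rows:
            by_value[r[col]][r[-1]] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:  # strict '<' keeps the first on ties
            best = (name, rule, errors)
    return best


attribute, rule, errors = one_r(ROWS, ATTRIBUTES)
print(attribute, rule, f"{errors}/{len(ROWS)} errors")
```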
Support vector machines
– Ch6.4 Instance-based learning
– Ch6.5 Numeric prediction
– Ch6.6 Clustering
Improvements: Engineering the input and output
Data engineering
– Attribute selection
– Discretizing numeric attributes
– Automatic data cleaning
Shortcomings when it comes to talking about computers
– It’s virtually impossible to test whether learning has been achieved or not.
– This ties learning to performance rather than knowledge
Attribute types:
• ARFF file format (note: see weather.nominal.arff)
• Two basic types are supported, nominal and numeric; prefer the former wherever possible
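To make the file layout concrete, here is a rough sketch of a reader for a simple ARFF file: `%` starts a comment, `@relation` names the dataset, each `@attribute` line gives a name plus either a `{...}` list of nominal values or a type such as `numeric`, and the rows follow `@data`, with `?` marking a missing value. The fragment and the `parse_arff` helper are hypothetical, and the sketch ignores quoted names, sparse rows, and other parts of the format.

```python
# Hypothetical fragment written in the style of an ARFF file.
ARFF_TEXT = """\
% comment lines start with a percent sign
@relation weather_fragment
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
"""


def parse_arff(text):
    """Return (relation, attributes, data) for a simple, dense ARFF text."""
    relation, attributes, data, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            # '?' denotes a missing value; keep it as None
            data.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal: list the allowed values
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(ARFF_TEXT)
print(relation, attributes, data)
```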
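Attribute values
• Missing values: drop the instance, substitute a value, or mark the field with ?
• Inaccurate values: a single bad value can spoil the whole dataset, so domain knowledge is needed!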
• If a and b then x
Association rules: multiple consequents
• If … then outlook=sunny and humidity=high
Rules with exceptions (P.66)
• If … then … except…else … except…
Trees for numeric prediction
Instance-based representation
Clusters
A simple example: The weather problem*
Weather data: weather.nominal.arff
Run Weka, load the data, and select the Id3 algorithm
Prediction (decision tree)
• outlook = rainy
• | windy = TRUE: no
• | windy = FALSE: yes
Test method: 10-fold cross-validation
– An open source framework for text analysis implemented in Java that is being developed at the University of Waikato in New Zealand.
– http://www.cs.waikato.ac.nz/ml/weka/
– http://www.mkp.com/datamining/
– Metadata often involves relations among attributes
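Text mining
Mining the Web
Review: Contents
Technical fields that DM draws on
Functional categories of DM
Practical applications of DM
Steps of DM
Theories, techniques, and algorithms of DM
Common analysis tools for DM
Review: Technical fields that DM draws on
Database systems, Data Warehouses, OLAP
Machine learning
Statistical and data analysis methods
Visualization
Mathematical programming
High performance computing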