Data Mining: Concepts and Techniques, Second Edition (Jiawei Han), Chapter 11 (PPT)


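Data Mining: Concepts and Techniques, Exercise Answers

Chapter 1 Introduction. 1.1 What is data mining? In your answer, address the following: 1.2 1.6 Define the following data mining functionalities: characterization, discrimination, association and correlation analysis, prediction, clustering, and evolution analysis.

Using a real-life database that you are familiar with, give an example of each data mining functionality.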

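Answer: Characterization is a summarization of the general characteristics or features of a target class of data.

For example, the characteristics of students can be summarized to produce a profile of all first-year computing science students at a university; the profile may include such information as a high grade point average (GPA) and the maximum number of courses taken.

Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or more contrasting classes.

For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs.

The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.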

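Association is the discovery of association rules, which show attribute-value conditions that occur frequently together in a given set of data.

For example, a data mining system may find an association rule such as: major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%, confidence = 98%], where X is a variable representing a student.

The rule indicates that, of the students under study, 12% (support) major in computing science and own a personal computer.

The probability that a student in this group (computing science majors) owns a personal computer is 98% (confidence, or certainty).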
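The support and confidence figures above can be made concrete with a small calculation. The following is a minimal Python sketch; the toy student records and the helper name `support_confidence` are invented for illustration and do not come from the textbook.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all records that satisfy both sides
    confidence = fraction of antecedent-matching records that also
                 satisfy the consequent
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: 10 students (not the data behind the 12% / 98% rule).
students = (
    [{"major": "computing science", "owns_pc": True}] * 3
    + [{"major": "computing science", "owns_pc": False}] * 1
    + [{"major": "biology", "owns_pc": True}] * 2
    + [{"major": "biology", "owns_pc": False}] * 4
)

support, confidence = support_confidence(
    students,
    antecedent=lambda s: s["major"] == "computing science",
    consequent=lambda s: s["owns_pc"],
)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

In this toy set, 3 of the 10 students satisfy both sides of the rule and 3 of the 4 computing science majors own a PC, so support is measured against all records while confidence is measured only against the records matching the left-hand side.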

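Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter builds a model to predict missing or unavailable, and usually numerical, data values.

They are similar in that both are tools for prediction: classification is used to predict the class label of data objects, while prediction is typically used to predict missing numerical data values.

Cluster analysis examines data objects without consulting known class labels.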

Chapter 4 Data Mining Primitives, Languages, and System Architectures (Data Mining: Concepts and Techniques, English edition)

Rule-based hierarchy low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50
9/19/2020
Data Mining: Concepts and Techniques
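The rule-based hierarchy on this slide can be read as a predicate over an item's price and cost. Below is a minimal Python sketch of that reading; the function name `profit_margin_level`, the fallback label `"other"`, and the sample prices are assumptions made for illustration, since the fragment defines only the low-margin rule.

```python
def profit_margin_level(price, cost, threshold=50.0):
    """Apply the slide's rule:
    low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50.
    """
    if (price - cost) < threshold:
        return "low_profit_margin"
    # Assumption: the fragment gives no other levels, so everything else
    # is lumped into a placeholder label.
    return "other"


# Hypothetical items as (name, price, cost) tuples.
items = [("cable", 12.0, 4.0), ("monitor", 320.0, 210.0)]
levels = {name: profit_margin_level(price, cost) for name, price, cost in items}
print(levels)
```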
Measurements of Pattern Interestingness
Association
Mine_Knowledge_Specification ::= mine associations [as pattern_name]
Syntax for specifying the kind of knowledge to be mined (cont.)
from relation(s)/cube(s) [where condition] in relevance to att_or_dim_list order by order_list group by grouping_list having condition
Visualization of Discovered Patterns
Different backgrounds/usages may require different forms of representation E.g., rules, tables, crosstabs, pie/bar chart etc.

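Data Mining: Concepts and Techniques, 3rd Edition: Exercises with Answers

Foreword: Data Mining: Concepts and Techniques is a classic data mining textbook, now in its third edition.

This article collects answers to the third edition's end-of-chapter exercises, in the hope that they help readers who are studying data mining.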

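Answers

Chapter 1 Introduction

Exercise 1.1 The basic steps of data mining are: 1. Data preprocessing 2. Data mining 3. Model evaluation 4. Applying the results

Exercise 1.2 The main data mining tasks are: 1. Descriptive tasks 2. Predictive tasks 3. Association tasks 4. Classification and clustering tasks

Chapter 2 Data Preprocessing

Exercise 2.3 Data cleaning includes the following steps: 1. Handling missing values 2. Detecting and handling outliers 3. Data scrubbing

Exercise 2.4 Methods for handling missing values include: 1. Deleting records with missing values 2. Imputation 3. Leaving missing values untreated

Chapter 3 Data Mining

Exercise 3.1 The main data mining algorithms include: 1. Decision trees 2. Neural networks 3. Support vector machines 4. Association rules 5. Cluster analysis

Exercise 3.6 The main steps of the K-Means algorithm are: 1. Randomly select k points as the initial centroids 2. Assign every point to its nearest centroid 3. Recompute the centroid of each cluster 4. Repeat steps 2-3 until a stopping condition is met

Chapter 4 Model Evaluation and Improvement

Exercise 4.1 Model evaluation methods include: 1. Confusion matrix 2. Precision and recall 3. F1 score 4. ROC curve

Exercise 4.4 Overfitting means that the model is too complex and has learned the noise and random variation in the training set, which leads to poor generalization.

Ways to handle overfitting include: 1. Increasing the number of samples 2. Reducing the model size 3. Regularization 4. Cross-validation

Conclusion: The above are the answers to the end-of-chapter exercises of Data Mining: Concepts and Techniques, 3rd Edition; we hope they help with your studies.

If you have other questions, you can leave a message in the comment section or raise them on relevant forums and similar platforms.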
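The K-Means steps listed under Exercise 3.6 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function name `k_means`, the fixed random seed, the "centroids stopped moving" stopping condition, and the sample points are assumptions, not part of the original answer.

```python
import random


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

Other stopping conditions (a maximum iteration count alone, or a tolerance on centroid movement) are equally valid; the exact-equality check is used here only because it keeps the sketch short.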

Data Mining: Concepts and Techniques

Types of Outliers (I)

Three kinds: global, contextual, and collective outliers
• Global outlier (or point anomaly)
  • Object is Og if it significantly deviates from the rest of the data set
  • Ex. Intrusion detection in computer networks
  • Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
  • Object is Oc if it deviates significantly based on a selected context
  • Ex. 80 °F in Urbana: outlier? (depending on summer or winter?)
  • Attributes of data objects should be divided into two groups
    • Contextual attributes: define the context, e.g., time & location
    • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
  • Can be viewed as a generalization of local outliers, whose density significantly deviates from its local area
  • Issue: How to define or formulate meaningful context?
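The difference between a global and a contextual outlier can be illustrated with the temperature example. The sketch below is an assumption-laden illustration: the slide leaves the deviation measure open, so a z-score test with a threshold of 2 is swapped in, and the readings, the season labels, and the function names are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_outliers(values, threshold=2.0):
    """Return the indices whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def contextual_outliers(records, threshold=2.0):
    """Apply the z-score test separately within each context group.

    records: list of (contextual_attribute, behavioral_attribute) pairs.
    """
    by_context = defaultdict(list)
    for idx, (context, value) in enumerate(records):
        by_context[context].append((idx, value))
    flagged = []
    for members in by_context.values():
        local = z_outliers([v for _, v in members], threshold)
        flagged.extend(members[i][0] for i in local)
    return sorted(flagged)


# Hypothetical readings: (season, temperature in degrees Fahrenheit).
readings = [
    ("summer", 78), ("summer", 82), ("summer", 80), ("summer", 85),
    ("summer", 79), ("summer", 81), ("summer", 83),
    ("winter", 25), ("winter", 30), ("winter", 28), ("winter", 22),
    ("winter", 27), ("winter", 31), ("winter", 80),
]

global_flags = z_outliers([temp for _, temp in readings])
context_flags = contextual_outliers(readings)
print(global_flags, context_flags)
```

Here the season plays the role of the contextual attribute and the temperature is the behavioral attribute: the 80 °F winter reading is unremarkable against all readings pooled together but stands out once it is compared only with other winter readings.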

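Data Warehousing and Data Mining: Course Syllabus

I. Course Introduction
Data warehousing and data mining is an important discipline in modern information technology. This course introduces the basic concepts, principles, and methods of data warehousing and data mining, and develops students' ability to analyze and process large-scale data and to use data mining techniques for knowledge discovery and decision support.

II. Course Objectives
1. Understand the basic concepts and principles of data warehousing and data mining.

2. Master the common methods and techniques of data warehousing and data mining.

3. Be able to design and implement data warehouse and data mining projects independently.

4. Be able to use data mining techniques for knowledge discovery and decision support.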

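III. Course Content and Schedule
1. Data warehouse fundamentals
- Concept and characteristics of a data warehouse
- Data warehouse architecture and components
- Data warehouse design and modeling
2. Data mining fundamentals
- Concepts and tasks of data mining
- Data mining process and methods
- Evaluation and application of data mining
3. Data warehouse and data mining techniques
- Data cleaning and preprocessing
- Data integration and transformation
- Data loading and storage
- Data warehouse querying and analysis
- Data mining algorithms and models
4. Data mining application cases
- Marketing data analysis
- Social network analysis
- Financial risk prediction
- Medical data mining
5. Practical project
Before the course ends, students form groups and carry out a practical project that covers designing and building a data warehouse, performing data mining tasks, and analyzing the results.

IV. Teaching Methods
1. Lectures: classroom teaching introduces the basic concepts, principles, and methods of data warehousing and data mining.

2. Hands-on practice: through experiments and project work, students carry out data warehouse and data mining tasks themselves.

3. Discussion and exchange: students are encouraged to take part in class discussions and share their views and experience, which promotes communication and cooperation among students.

V. Assessment
1. Coursework: class participation, lab reports, project deliverables, and so on.

2. Final examination: tests students' grasp of the theory of data warehousing and data mining.

3. Practical project evaluation: assesses students' design and implementation abilities in the practical project.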

VI. Reference Textbooks
1. Jiawei Han, Micheline Kamber, Jian Pei. "Data Mining: Concepts and Techniques." Morgan Kaufmann, 2011.
2. Ralph Kimball, Margy Ross. "The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling." Wiley, 2013.

VII. Reference Resources
1. Data mining tools: Weka, RapidMiner, Python, etc.

Data Mining: Concepts and Techniques, English Second Edition: Course Design

Data Mining: Concepts and Techniques, Second Edition
Course Design

Introduction
Data mining is the process of discovering hidden patterns and knowledge from large amounts of data. It has become an essential tool for businesses and organizations to gain insights into customer behavior, optimize marketing strategies, and improve decision-making processes. This course is designed for students who are interested in learning the fundamental concepts and techniques of data mining.

Course Objectives
1. To understand the basic concepts and principles of data mining.
2. To learn how to apply data mining techniques to real-world problems.
3. To gain experience in using data mining software and tools.
4. To explore advanced topics in data mining.

Course Outline
Week 1: Introduction to Data Mining
• What is data mining?
• Why is data mining important?
• Data preprocessing
• Sampling
• Data exploration
Week 2: Classification
• Decision trees
• Naive Bayes
• K-Nearest Neighbor (KNN)
• Support Vector Machines (SVM)
Week 3: Association Rule Mining
• Market Basket Analysis
• Apriori algorithm
• FP-Growth algorithm
Week 4: Clustering
• K-Means
• Hierarchical clustering
• DBSCAN
Week 5: Evaluation and Validation
• Cross-validation
• Confusion matrix
• Precision, recall, and F1-score
• ROC curve
Week 6: Text Mining
• Text preprocessing
• Text representation
• Topic modeling
• Sentiment analysis
Week 7: Web Mining
• Web scraping
• PageRank algorithm
• Link analysis
• Web usage mining
Week 8: Advanced Topics
• Deep learning for data mining
• Time series analysis
• Graph mining
• Recommender systems

Course Requirements
• Attendance and active participation in class discussions and activities.
• Completion of individual assignments and group projects.
• Interactive group presentations.
• Final examination.

Conclusion
This course is designed to equip students with foundational knowledge and practical skills in data mining. Through this course, students will learn how to employ various data mining techniques to solve real-world problems, explore advanced topics and applications of data mining, and gain hands-on experience in using data mining software and tools.

Data Mining - Concepts and Techniques CH01

We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing Mining interesting knowledge (rules, regularities, patterns,
Chapter 1. Introduction
Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Are all the patterns interesting? Classification of data mining systems Major issues in data mining
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis
September 14, 2019
Data Mining: Concepts and Techniques
multidimensional summary reports
statistical summary information (data central tendency and variation)

Data Mining Concepts and Techniques

Data Mining: Concepts and Techniques
Slides for Textbook, Chapter 4
Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca
January 17, 2001

Chapter 4: Data Mining Primitives, Languages, and System Architectures
• Data mining primitives: What defines a data mining task?
• A data mining query language
• Design graphical user interfaces based on a data mining query language
• Architecture of data mining systems
• Summary

Why Data Mining Primitives and Languages?
• Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
• Data mining should be an interactive process
  • User directs what to be mined
• Users must be provided with a set of primitives to be used to communicate with the data mining system
• Incorporating these primitives in a data mining query language
  • More flexible user interaction
  • Foundation for design of graphical user interface
  • Standardization of data mining industry and practice

What Defines a Data Mining Task?
• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measurements
• Visualization of discovered patterns

Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria

Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks

Background Knowledge: Concept Hierarchies
• Schema hierarchy
  • E.g., street < city < province_or_state < country
• Set-grouping hierarchy
  • E.g., {20-39} = young, {40-59} = middle_aged
• Operation-derived hierarchy
  • email address: login-name < department < university < country
• Rule-based hierarchy
  • low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50

Measurements of Pattern Interestingness
• Simplicity: e.g., (association) rule length, (decision) tree size
• Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
• Utility: potential usefulness, e.g., support (association), noise threshold (description)
• Novelty: not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)

Visualization of Discovered Patterns
• Different backgrounds/usages may require different forms of representation
  • E.g., rules, tables, crosstabs, pie/bar chart etc.
• Concept hierarchy is also important
  • Discovered knowledge might be more understandable when represented at high level of abstraction
  • Interactive drill up/down, pivoting, slicing and dicing provide different perspective to data
• Different kinds of knowledge require different representation: association, classification, clustering, etc.

A Data Mining Query Language (DMQL)
• Motivation
  • A DMQL can provide the ability to support ad-hoc and interactive data mining
  • By providing a standardized language like SQL
    • Hope to achieve an effect similar to the one SQL has had on relational databases
    • Foundation for system development and evolution
    • Facilitate information exchange, technology transfer, commercialization and wide acceptance
• Design
  • DMQL is designed with the primitives described earlier

Syntax for DMQL
• Syntax for specification of
  • task-relevant data
  • the kind of knowledge to be mined
  • concept hierarchy specification
  • interestingness measure
  • pattern presentation and visualization
• Putting it all together: a DMQL query

Syntax for task-relevant data specification
• use database database_name, or use data warehouse data_warehouse_name
• from relation(s)/cube(s) [where condition]
• in relevance to att_or_dim_list
• order by order_list
• group by grouping_list
• having condition

Specification of task-relevant data

Syntax for specifying the kind of knowledge to be mined
• Characterization
    Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
• Discrimination
    Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
• Association
    Mine_Knowledge_Specification ::= mine associations [as pattern_name]

Syntax for specifying the kind of knowledge to be mined (cont.)
• Classification
    Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
• Prediction
    Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}

Syntax for concept hierarchy specification
• To specify what concept hierarchies to use
    use hierarchy <hierarchy> for <attribute_or_dimension>
• We use different syntax to define different types of hierarchies
  • schema hierarchies
      define hierarchy time_hierarchy on date as [date, month quarter, year]
  • set-grouping hierarchies
      define hierarchy age_hierarchy for age on customer as
        level1: {young, middle_aged, senior} < level0: all
        level2: {20, ..., 39} < level1: young
        level2: {40, ..., 59} < level1: middle_aged
        level2: {60, ..., 89} < level1: senior

Syntax for concept hierarchy specification (cont.)
  • operation-derived hierarchies
      define hierarchy age_hierarchy for age on customer as
        {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
  • rule-based hierarchies
      define hierarchy profit_margin_hierarchy on item as
        level_1: low_profit_margin < level_0: all if (price - cost) < $50
        level_1: medium_profit_margin < level_0: all if ((price - cost) > $50) and ((price - cost) <= $250)
        level_1: high_profit_margin < level_0: all if (price - cost) > $250

Syntax for interestingness measure specification
• Interestingness measures and thresholds can be specified by the user with the statement:
    with <interest_measure_name> threshold = threshold_value
• Example:
    with support threshold = 0.05
    with confidence threshold = 0.7

Syntax for pattern presentation and visualization specification
• We have syntax which allows users to specify the display of discovered patterns in one or more forms
    display as <result_form>
• To facilitate interactive viewing at different concept levels, the following syntax is defined:
    Multilevel_Manipulation ::= roll up on attribute_or_dimension
                              | drill down on attribute_or_dimension
                              | add attribute_or_dimension
                              | drop attribute_or_dimension

Putting it all together: the full specification of a DMQL query
    use database AllElectronics_db
    use hierarchy location_hierarchy for B.address
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchases P, items_sold S, works_at W, branch
    where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
      and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
      and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
      and B.address = "Canada" and I.price >= 100
    with noise threshold = 0.05
    display as table

Other Data Mining Languages & Standardization Efforts
• Association rule language specifications
  • MSQL (Imielinski & Virmani'99)
  • MineRule (Meo, Psaila, and Ceri'96)
  • Query flocks based on Datalog syntax (Tsur et al.'98)
• OLEDB for DM (Microsoft'2000)
  • Based on OLE, OLE DB, OLE DB for OLAP
  • Integrating DBMS, data warehouse and data mining
• CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  • Providing a platform and process structure for effective data mining
  • Emphasizing the deployment of data mining technology to solve business problems

Designing Graphical User Interfaces based on a data mining query language
• What tasks should be considered in the design of GUIs based on a data mining query language?
  • Data collection and data mining query composition
  • Presentation of discovered patterns
  • Hierarchy specification and manipulation
  • Manipulation of data mining primitives
  • Interactive multilevel mining
  • Other miscellaneous information

Data Mining System Architectures
• Coupling data mining system with DB/DW system
  • No coupling: flat file processing, not recommended
  • Loose coupling
    • Fetching data from DB/DW
  • Semi-tight coupling: enhanced DM performance
    • Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some stat functions
  • Tight coupling: a uniform information processing environment
    • DM is smoothly integrated into a DB/DW system, mining query is optimized based on mining query, indexing, query processing methods, etc.

Summary
• Five primitives for specification of a data mining task
  • task-relevant data
  • kind of knowledge to be mined
  • background knowledge
  • interestingness measures
  • knowledge presentation and visualization techniques to be used for displaying the discovered patterns
• Data mining query languages
  • DMQL, MS/OLEDB for DM, etc.
• Data mining system architecture
  • No coupling, loose coupling, semi-tight coupling, tight coupling

References
• E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.
• Microsoft Corp. OLEDB for Data Mining, version 1.0, /data/oledb/dm, Aug. 2000.
• J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. DMKD'96, Montreal, Canada, June 1996.
• T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
• A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.

http://www.cs.sfu.ca/~han
Thank you !!!
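The set-grouping hierarchy for age defined in the DMQL slides (level2 ages, level1 age groups, level0 "all") can be mimicked outside DMQL to show what rolling up along a concept hierarchy does. The following Python sketch is illustrative only; the names `AGE_HIERARCHY` and `roll_up_age` and the sample ages are assumptions, and DMQL itself is not executed here.

```python
from collections import Counter

# Level-1 groups and the level-2 ages they cover, as on the slide:
# {20..39} < young, {40..59} < middle_aged, {60..89} < senior.
AGE_HIERARCHY = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
    "senior": range(60, 90),
}


def roll_up_age(age, level):
    """Generalize an age to a hierarchy level.

    level 2 = raw age, level 1 = age group, level 0 = "all".
    """
    if level == 2:
        return age
    if level == 0:
        return "all"
    if level != 1:
        raise ValueError("level must be 0, 1, or 2")
    for group, ages in AGE_HIERARCHY.items():
        if age in ages:
            return group
    return None  # age not covered by the hierarchy


# Hypothetical customer ages, counted at level 1 (a roll-up from level 2).
ages = [23, 35, 41, 58, 62, 29]
counts = Counter(roll_up_age(a, level=1) for a in ages)
print(counts)
```

Rolling up replaces each raw value with its ancestor in the hierarchy before aggregating, which is why the count table shrinks from six distinct ages to three age groups.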

Data Mining Concepts and Techniques

Data Mining: Concepts and Techniques
Second Edition
Jiawei Han and Micheline Kamber
University of Illinois at Urbana-Champaign

Amsterdam, Boston, Heidelberg, London, New York, Oxford, Paris, San Diego, San Francisco, Singapore, Sydney, Tokyo

Publisher: Diane Cerra
Publishing Services Manager: Simon Crump
Editorial Assistant: Asma Stephan
Cover Design
Cover Image
Cover Illustration
Text Design
Composition: diacriTech
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Multiscience Press
Proofreader: Multiscience Press
Indexer: Multiscience Press
Interior printer: Maple-Vail Book Manufacturing Group
Cover printer: Phoenix Color

Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 2006 by Elsevier Inc. All rights reserved.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@. You may also complete your request on-line via the Elsevier homepage () by selecting "Customer Support" and then "Obtaining Permissions."

Library of Congress Cataloging-in-Publication Data
Application submitted
ISBN 13: 978-1-55860-901-3
ISBN 10: 1-55860-901-6

For information on all Morgan Kaufmann publications, visit our Web site at or

Printed in the United States of America
06 07 08 09 10 5 4 3 2 1

Dedication
To Y. Dora and Lawrence for your love and encouragement
J.H.
To Erik, Kevan, Kian, and Mikael for your love and inspiration
M.K.

Contents

About the Author
Foreword
Preface

Chapter 1 Introduction
  1.1 What Motivated Data Mining? Why Is It Important?
  1.2 So, What Is Data Mining?
  1.3 Data Mining: On What Kind of Data?
    1.3.1 Relational Databases
    1.3.2 Data Warehouses
    1.3.3 Transactional Databases
    1.3.4 Advanced Data and Information Systems and Advanced Applications
  1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
    1.4.1 Concept/Class Description: Characterization and Discrimination
    1.4.2 Mining Frequent Patterns, Associations, and Correlations
    1.4.3 Classification and Prediction
    1.4.4 Cluster Analysis
    1.4.5 Outlier Analysis
    1.4.6 Evolution Analysis
  1.5 Are All of the Patterns Interesting?
  1.6 Classification of Data Mining Systems
  1.7 Data Mining Task Primitives
  1.8 Integration of a Data Mining System with a Database or Data Warehouse System
  1.9 Major Issues in Data Mining
  1.10 Summary
  Exercises
  Bibliographic Notes

Chapter 2 Data Preprocessing
  2.1 Why Preprocess the Data?
  2.2 Descriptive Data Summarization
    2.2.1 Measuring the Central Tendency
    2.2.2 Measuring the Dispersion of Data
    2.2.3 Graphic Displays of Basic Descriptive Data Summaries
  2.3 Data Cleaning
    2.3.1 Missing Values
    2.3.2 Noisy Data
    2.3.3 Data Cleaning as a Process
  2.4 Data Integration and Transformation
    2.4.1 Data Integration
    2.4.2 Data Transformation
  2.5 Data Reduction
    2.5.1 Data Cube Aggregation
    2.5.2 Attribute Subset Selection
    2.5.3 Dimensionality Reduction
    2.5.4 Numerosity Reduction
  2.6 Data Discretization and Concept Hierarchy Generation
    2.6.1 Discretization and Concept Hierarchy Generation for Numerical Data
    2.6.2 Concept Hierarchy Generation for Categorical Data
  2.7 Summary
  Exercises
  Bibliographic Notes

Chapter 3 Data Warehouse and OLAP Technology: An Overview
  3.1 What Is a Data Warehouse?
    3.1.1 Differences between Operational Database Systems and Data Warehouses
    3.1.2 But, Why Have a Separate Data Warehouse?
  3.2 A Multidimensional Data Model
    3.2.1 From Tables and Spreadsheets to Data Cubes
    3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
    3.2.3 Examples for Defining Star, Snowflake, and Fact Constellation Schemas
    3.2.4 Measures: Their Categorization and Computation
    3.2.5 Concept Hierarchies
    3.2.6 OLAP Operations in the Multidimensional Data Model
    3.2.7 A Starnet Query Model for Querying Multidimensional Databases
  3.3 Data Warehouse Architecture
    3.3.1 Steps for the Design and Construction of Data Warehouses
    3.3.2 A Three-Tier Data Warehouse Architecture
    3.3.3 Data Warehouse Back-End Tools and Utilities
    3.3.4 Metadata Repository
    3.3.5 Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
  3.4 Data Warehouse Implementation
    3.4.1 Efficient Computation of Data Cubes
    3.4.2 Indexing OLAP Data
    3.4.3 Efficient Processing of OLAP Queries
  3.5 From Data Warehousing to Data Mining
    3.5.1 Data Warehouse Usage
    3.5.2 From On-Line Analytical Processing to On-Line Analytical Mining
  3.6 Summary
  Exercises
  Bibliographic Notes

Chapter 4 Data Cube Computation and Data Generalization
  4.1 Efficient Methods for Data Cube Computation
    4.1.1 A Road Map for the Materialization of Different Kinds of Cubes
    4.1.2 Multiway Array Aggregation for Full Cube Computation
    4.1.3 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
    4.1.4 Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure
    4.1.5 Precomputing Shell Fragments for Fast High-Dimensional OLAP
    4.1.6 Computing Cubes with Complex Iceberg Conditions
  4.2 Further Development of Data Cube and OLAP Technology
    4.2.1 Discovery-Driven Exploration of Data Cubes
    4.2.2 Complex Aggregation at Multiple Granularity: Multifeature Cubes
    4.2.3 Constrained Gradient Analysis in Data Cubes
  4.3 Attribute-Oriented Induction: An Alternative Method for Data Generalization and Concept Description
    4.3.1 Attribute-Oriented Induction for Data Characterization
    4.3.2 Efficient Implementation of Attribute-Oriented Induction
    4.3.3 Presentation of the Derived Generalization
    4.3.4 Mining Class Comparisons: Discriminating between Different Classes
    4.3.5 Class Description: Presentation of Both Characterization and Comparison
  4.4 Summary
  Exercises
  Bibliographic Notes

Chapter 5 Mining Frequent Patterns, Associations, and Correlations
  5.1 Basic Concepts and a Road Map
    5.1.1 Market Basket Analysis: A Motivating Example
    5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
    5.1.3 Frequent Pattern Mining: A Road Map
  5.2 Efficient and Scalable Frequent Itemset Mining Methods
    5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
    5.2.2 Generating Association Rules from Frequent Itemsets
    5.2.3 Improving the Efficiency of Apriori
    5.2.4 Mining Frequent Itemsets without Candidate Generation
    5.2.5 Mining Frequent Itemsets Using Vertical Data Format
    5.2.6 Mining Closed Frequent Itemsets
  5.3 Mining Various Kinds of Association Rules
    5.3.1 Mining Multilevel Association Rules
    5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
  5.4 From Association Mining to Correlation Analysis
    5.4.1 Strong Rules Are Not Necessarily Interesting: An Example
    5.4.2 From Association Analysis to Correlation Analysis
  5.5 Constraint-Based Association Mining
    5.5.1 Metarule-Guided Mining of Association Rules
    5.5.2 Constraint Pushing: Mining Guided by Rule Constraints
  5.6 Summary
  Exercises
  Bibliographic Notes

Chapter 6 Classification and Prediction
  6.1 What Is Classification? What Is Prediction?
  6.2 Issues Regarding Classification and Prediction
    6.2.1 Preparing the Data for Classification and Prediction
    6.2.2 Comparing Classification and Prediction Methods
  6.3 Classification by Decision Tree Induction
    6.3.1 Decision Tree Induction
    6.3.2 Attribute Selection Measures
    6.3.3 Tree Pruning
    6.3.4 Scalability and Decision Tree Induction
  6.4 Bayesian Classification
    6.4.1 Bayes' Theorem
    6.4.2 Naïve Bayesian Classification
    6.4.3 Bayesian Belief Networks
    6.4.4 Training Bayesian Belief Networks
  6.5 Rule-Based Classification
    6.5.1 Using IF-THEN Rules for Classification
    6.5.2 Rule Extraction from a Decision Tree
    6.5.3 Rule Induction Using a Sequential Covering Algorithm
  6.6 Classification by Backpropagation
    6.6.1 A Multilayer Feed-Forward Neural Network
    6.6.2 Defining a Network Topology
    6.6.3 Backpropagation
    6.6.4 Inside the Black Box: Backpropagation and Interpretability
  6.7 Support Vector Machines
    6.7.1 The Case When the Data Are Linearly Separable
    6.7.2 The Case When the Data Are Linearly Inseparable
  6.8 Associative Classification: Classification by Association Rule Analysis
  6.9 Lazy Learners (or Learning from Your Neighbors)
    6.9.1 k-Nearest-Neighbor Classifiers
    6.9.2 Case-Based Reasoning
  6.10 Other Classification Methods
    6.10.1 Genetic Algorithms
    6.10.2 Rough Set Approach
    6.10.3 Fuzzy Set Approaches
  6.11 Prediction
    6.11.1 Linear Regression
    6.11.2 Nonlinear Regression
    6.11.3 Other Regression-Based Methods
  6.12 Accuracy and Error Measures
    6.12.1 Classifier Accuracy Measures
    6.12.2 Predictor Error Measures
  6.13 Evaluating the Accuracy of a Classifier or Predictor
    6.13.1 Holdout Method and Random Subsampling
    6.13.2 Cross-validation
    6.13.3 Bootstrap
  6.14 Ensemble Methods: Increasing the Accuracy
    6.14.1 Bagging
    6.14.2 Boosting
  6.15 Model Selection
    6.15.1 Estimating Confidence Intervals
    6.15.2 ROC Curves
  6.16 Summary
  Exercises
  Bibliographic Notes

Chapter 7 Cluster Analysis
  7.1 What Is Cluster Analysis?
  7.2 Types of Data in Cluster Analysis
    7.2.1 Interval-Scaled Variables
    7.2.2 Binary Variables
    7.2.3 Categorical, Ordinal, and Ratio-Scaled Variables
    7.2.4 Variables of Mixed Types
    7.2.5 Vector Objects
  7.3 A Categorization of Major Clustering Methods
  7.4 Partitioning Methods
    7.4.1 Classical Partitioning Methods: k-Means and k-Medoids
    7.4.2 Partitioning Methods in Large Databases: From k-Medoids to CLARANS
  7.5 Hierarchical Methods
    7.5.1 Agglomerative and Divisive Hierarchical Clustering
    7.5.2 BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
    7.5.3 ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
    7.5.4 Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
  7.6 Density-Based Methods
    7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
    7.6.2 OPTICS: Ordering Points to Identify the Clustering Structure
    7.6.3 DENCLUE: Clustering Based on Density Distribution Functions
  7.7 Grid-Based Methods
    7.7.1 STING: STatistical INformation Grid
    7.7.2 WaveCluster: Clustering Using Wavelet Transformation
  7.8 Model-Based Clustering Methods
    7.8.1 Expectation-Maximization
    7.8.2 Conceptual Clustering
    7.8.3 Neural Network Approach
  7.9 Clustering High-Dimensional Data
    7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
    7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
    7.9.3 Frequent Pattern-Based Clustering Methods
  7.10 Constraint-Based Cluster Analysis
    7.10.1 Clustering with Obstacle Objects
    7.10.2 User-Constrained Cluster Analysis
    7.10.3 Semi-Supervised Cluster Analysis
  7.11 Outlier Analysis
    7.11.1 Statistical Distribution-Based Outlier Detection
    7.11.2 Distance-Based Outlier Detection
    7.11.3 Density-Based Local Outlier Detection
    7.11.4 Deviation-Based Outlier Detection
  7.12 Summary
  Exercises
  Bibliographic Notes

Chapter 8 Mining Stream, Time-Series, and Sequence Data
  8.1 Mining Data Streams
    8.1.1 Methodologies for Stream Data Processing and Stream Data Systems
    8.1.2 Stream OLAP and Stream Data Cubes
    8.1.3 Frequent-Pattern Mining in Data Streams
    8.1.4 Classification of Dynamic Data Streams
    8.1.5 Clustering Evolving Data Streams
  8.2 Mining Time-Series Data
    8.2.1 Trend Analysis
    8.2.2 Similarity Search in Time-Series Analysis
  8.3 Mining Sequence Patterns in Transactional Databases
    8.3.1 Sequential Pattern Mining: Concepts and Primitives
    8.3.2 Scalable Methods for Mining Sequential Patterns
    8.3.3 Constraint-Based Mining of Sequential Patterns
    8.3.4 Periodicity Analysis for Time-Related Sequence Data
  8.4 Mining Sequence Patterns in Biological Data
    8.4.1 Alignment of Biological Sequences
    8.4.2 Hidden Markov Model for Biological Sequence Analysis
  8.5 Summary
  Exercises
  Bibliographic Notes

Chapter 9 Graph Mining, Social Network Analysis, and Multirelational Data Mining
  9.1 Graph Mining
    9.1.1 Methods for Mining Frequent Subgraphs
    9.1.2 Mining Variant and Constrained Substructure Patterns
    9.1.3 Applications: Graph Indexing, Similarity Search, Classification, and Clustering
  9.2 Social Network Analysis
    9.2.1 What Is a Social Network?
    9.2.2 Characteristics of Social Networks
    9.2.3 Link Mining: Tasks and Challenges
    9.2.4 Mining on Social Networks
  9.3 Multirelational Data Mining
    9.3.1 What Is Multirelational Data Mining?
    9.3.2 ILP Approach to Multirelational Classification
    9.3.3 Tuple ID Propagation
    9.3.4 Multirelational Classification Using Tuple ID Propagation
    9.3.5 Multirelational Clustering with User Guidance
  9.4 Summary
  Exercises
  Bibliographic Notes

Chapter 10 Mining Object, Spatial, Multimedia, Text, and Web Data
  10.1 Multidimensional Analysis and Descriptive Mining of Complex Data Objects
    10.1.1 Generalization of Structured Data
    10.1.2 Aggregation and Approximation in Spatial and Multimedia Data Generalization
    10.1.3 Generalization of Object Identifiers and Class/Subclass Hierarchies
    10.1.4 Generalization of Class Composition Hierarchies
    10.1.5 Construction and Mining of Object Cubes
    10.1.6 Generalization-Based Mining of Plan Databases by Divide-and-Conquer
  10.2 Spatial Data Mining
    10.2.1 Spatial Data Cube Construction and Spatial OLAP
    10.2.2 Mining Spatial Association and Co-location Patterns
    10.2.3 Spatial Clustering Methods
    10.2.4 Spatial Classification and Spatial Trend Analysis
    10.2.5 Mining Raster Databases
  10.3 Multimedia Data Mining
    10.3.1 Similarity Search in Multimedia Data
    10.3.2 Multidimensional Analysis of Multimedia Data
    10.3.3 Classification and Prediction Analysis of Multimedia Data
    10.3.4 Mining Associations in Multimedia Data
    10.3.5 Audio and Video Data Mining
  10.4 Text Mining
    10.4.1 Text Data Analysis and Information Retrieval
    10.4.2 Dimensionality Reduction for Text
    10.4.3 Text Mining Approaches
  10.5 Mining the World Wide Web
    10.5.1 Mining the Web Page Layout Structure
    10.5.2 Mining the Web's Link Structures to Identify Authoritative Web Pages
    10.5.3 Mining Multimedia Data on the Web
    10.5.4 Automatic Classification of Web Documents
    10.5.5 Web Usage Mining
  10.6 Summary
  Exercises
  Bibliographic Notes

Chapter 11 Applications and Trends in Data Mining
  11.1 Data Mining Applications
    11.1.1 Data Mining for Financial Data Analysis
    11.1.2 Data Mining for the Retail Industry
    11.1.3 Data Mining for the Telecommunication Industry
    11.1.4 Data Mining for Biological Data Analysis
    11.1.5 Data Mining in Other Scientific Applications
    11.1.6 Data Mining for Intrusion Detection
  11.2 Data Mining System Products and Research Prototypes
    11.2.1 How to Choose a Data Mining System
    11.2.2 Examples of Commercial Data Mining Systems
  11.3 Additional Themes on Data Mining
    11.3.1 Theoretical Foundations of Data Mining
    11.3.2 Statistical Data Mining
    11.3.3 Visual and Audio Data Mining
    11.3.4 Data Mining and Collaborative Filtering
  11.4 Social Impacts of Data Mining
    11.4.1 Ubiquitous and Invisible Data Mining
    11.4.2 Data Mining, Privacy, and Data Security
  11.5 Trends in Data Mining
  11.6 Summary
  Exercises
  Bibliographic Notes

Appendix An Introduction to Microsoft's OLE DB for Data Mining
  A.1 Model Creation
  A.2 Model Training
  A.3 Model Prediction and Browsing

Bibliography

About the Authors
Jiawei Han is Professor in the Department of Computer Science at
the University of Illinois at Urbana-Champaign. Well known for his research in the areas of data mining and database systems, he has received many recognitions and awards for his contributions in the field, including the ACM Fellow and the 2004 ACM SIGKDD Innovations Award. He serves as Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and on the editorial boards for several scientific journals in the field.

Micheline Kamber is a researcher who enjoys writing in easy-to-understand terms. She has a master’s degree in computer science (specializing in artificial intelligence) from Concordia University, Canada.

Foreword
Jim Gray
Microsoft Research

We are deluged by data: scientific data, medical data, demographic data, financial data, and marketing data. People have no time to look at this data. Human attention has become a precious resource. So, we must find ways to automatically analyze the data, to automatically classify it, to automatically summarize it, to automatically discover and characterize trends in it, and to automatically flag anomalies. This is one of the most active and exciting areas of the database research community. Researchers in areas such as statistics, visualization, artificial intelligence, and machine learning are contributing to this field. The breadth of the field makes it difficult to grasp its extraordinary progress over the last few years.

Jiawei Han and Micheline Kamber have done a wonderful job of organizing and presenting data mining in this very readable textbook. They begin by giving quick introductions to database and data mining concepts with particular emphasis on data analysis.
They review the current product offerings by presenting a general framework that covers them all. They then cover, in a chapter-by-chapter tour, the concepts and techniques that underlie classification, prediction, association, and clustering. These topics are presented with examples, a tour of the best algorithms for each problem class, and pragmatic rules of thumb about when to apply each technique. I found this presentation style to be very readable, and I certainly learned a lot from reading the book.

Jiawei Han and Micheline Kamber have been leading contributors to data mining research. This is the text they use with their students to bring them up to speed on the field. The field is evolving very rapidly, but this book is a quick way to learn the basic ideas and to understand where the field is today. I found it very informative and stimulating, and I expect you will too.

Preface

Our capabilities of both generating and collecting data have been increasing rapidly. Contributing factors include the computerization of business, scientific, and government transactions; the widespread use of digital cameras, publication tools, and bar codes for most commercial products; and advances in data collection tools ranging from scanned text and image platforms to satellite remote sensing systems. In addition, popular use of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information. This explosive growth in stored or transient data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge.

This book explores the concepts and techniques of data mining, a promising and flourishing frontier in data and information systems and their applications. Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or catchable in
large databases, data warehouses, the Web, other massive information repositories, or data streams.

Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We present techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. As a result, this book is not intended as an introduction to database systems, machine learning, statistics, or other such areas, although we do provide the background necessary in these areas in order to facilitate the reader’s comprehension of their respective roles in data mining. Rather, the book is a comprehensive introduction to data mining, presented with effectiveness and scalability issues in focus. It should be useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines listed above.

Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium. This book presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this book was the need to build an organized framework for the study of data mining, a challenging task owing to the extensive multidisciplinary nature of this fast-developing field. We hope that this book will encourage people with different backgrounds and experiences to exchange their views regarding data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.

Organization of the Book

Since the publication of the first edition of this book, great progress has been made in the field of data mining. Many new
data mining methods, systems, and applications have been developed. This new edition substantially revises the first edition of the book, with numerous enhancements and a reorganization of the technical contents of the entire book. In addition, several new chapters are included to address recent developments on mining complex types of data, including stream data, sequence data, graph structured data, social network data, and multirelational data.

The chapters are described briefly as follows, with emphasis on the new material.

Chapter 1 provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of database technology, which has led to the need for data mining, and the importance of its applications. It examines the types of data to be mined, including relational, transactional, and data warehouse data, as well as complex types of data such as data streams, time-series, sequences, graphs, social networks, multirelational data, spatiotemporal data, multimedia data, text data, and Web data. The chapter presents a general classification of data mining tasks, based on the different kinds of knowledge to be mined. In comparison with the first edition, two new sections are introduced: Section 1.7 is on data mining primitives, which allow users to interactively communicate with data mining systems in order to direct the mining process, and Section 1.8 discusses the issues regarding how to integrate a data mining system with a database or data warehouse system. These two sections represent the condensed materials of Chapter 4, “Data Mining Primitives, Languages and Architectures,” in the first edition. Finally, major challenges in the field are discussed.

Chapter 2 introduces techniques for preprocessing the data before mining. This corresponds to Chapter 3 of the first edition. Because data preprocessing precedes the construction of data warehouses, we address this topic here, and then follow with an introduction to data warehouses in the subsequent chapter. This chapter describes various statistical methods for
descriptive data summarization, including measuring both central tendency and dispersion of data. The description of data cleaning methods has been enhanced. Methods for data integration and transformation and data reduction are discussed, including the use of concept hierarchies for dynamic and static discretization. The automatic generation of concept hierarchies is also described.

Chapters 3 and 4 provide a solid introduction to data warehouse, OLAP (On-Line Analytical Processing), and data generalization. These two chapters correspond to Chapters 2 and 5 of the first edition, but with substantial enhancement regarding data warehouse implementation methods. Chapter 3 introduces the basic concepts, architectures and general implementations of data warehouse and on-line analytical processing, as well as the relationship between data warehousing and data mining. Chapter 4 takes a more in-depth look at data warehouse and OLAP technology, presenting a detailed study of methods of data cube computation, including the recently developed star-cubing and high-dimensional OLAP methods. Further explorations of data warehouse and OLAP are discussed, such as discovery-driven cube exploration, multifeature cubes for complex data mining queries, and cube gradient analysis. Attribute-oriented induction, an alternative method for data generalization and concept description, is also discussed.

Chapter 5 presents methods for mining frequent patterns, associations, and correlations in transactional and relational databases and data warehouses. In addition to introducing the basic concepts, such as market basket analysis, many techniques for frequent itemset mining are presented in an organized way. These range from the basic Apriori algorithm and its variations to more advanced methods that improve on efficiency, including the frequent-pattern growth approach, frequent-pattern mining with vertical data format, and mining closed frequent itemsets. The chapter also presents techniques for mining multilevel association
rules, multidimensional association rules, and quantitative association rules. In comparison with the previous edition, this chapter has placed greater emphasis on the generation of meaningful association and correlation rules. Strategies for constraint-based mining and the use of interestingness measures to focus the rule search are also described.

Chapter 6 describes methods for data classification and prediction, including decision tree induction, Bayesian classification, rule-based classification, the neural network technique of backpropagation, support vector machines, associative classification, k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Methods of regression are introduced. Issues regarding accuracy and how to choose the best classifier or predictor are discussed. In comparison with the corresponding chapter in the first edition, the sections on rule-based classification and support vector machines are new, and the discussion of measuring and enhancing classification and prediction accuracy has been greatly expanded.

Cluster analysis forms the topic of Chapter 7. Several major data clustering approaches are presented, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. New sections in this edition introduce techniques for clustering high-dimensional data, as well as for constraint-based cluster analysis. Outlier analysis is also discussed.

Chapters 8 to 10 treat advanced topics in data mining and cover a large body of materials on recent progress in this frontier. These three chapters now replace our previous single chapter on advanced topics. Chapter 8 focuses on the mining of stream data, time-series data, and sequence data (covering both transactional sequences and biological sequences). The basic data mining techniques (such as frequent-pattern mining, classification, clustering, and constraint-based mining) are extended for these types of data. Chapter 9 discusses methods
for graph and structural pattern mining, social network analysis, and multirelational data mining. Chapter 10 presents methods for

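Books on Outlier Handling

Outlier handling is an important topic in data analysis and statistics. It covers detecting and treating anomalous or outlying values in data.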
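As a rough illustration of what detecting outliers can mean in practice, the sketch below applies the widely used 1.5 × IQR rule (Tukey's fences) to a small made-up sample. The function name `iqr_outliers`, the sample values, and the factor `k=1.5` are assumptions chosen for this example and are not taken from any of the books discussed here.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented sample: one value is far away from the rest.
sample = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
print(iqr_outliers(sample))
```

Whether a flagged value should then be removed, capped, or kept is a separate decision that depends on the data and the analysis goal.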

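The following books on outlier handling can help you understand the concept of outliers, detection methods, and treatment techniques in more depth:

1. "Pattern Recognition and Machine Learning", by Christopher M. Bishop. A classic textbook in machine learning that touches on outlier detection and handling as applied in machine learning.

2. "Data Mining: Concepts and Techniques", by Jiawei Han, Micheline Kamber, and Jian Pei. Introduces the basic concepts and techniques of data mining, including methods for outlier detection and handling.

3. "Introduction to Data Mining", by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. An introductory textbook on data mining and data analysis that covers outlier detection and handling.

4. "Applied Multivariate Statistical Analysis", by Richard A. Johnson and Dean W. Wichern. Focuses on methods of multivariate statistical analysis, including the treatment of outliers in multivariate data.

5. "R in Action: Data Analysis and Graphics with R", by Robert I. Kabacoff. A hands-on guide to data analysis and visualization with R that includes material on outlier handling.

6. "Outliers in Statistical Data", by Vic Barnett and Toby Lewis. A classic work on outliers in statistical data, with an in-depth discussion of the methods and theory of outlier detection and handling.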


November 28, 2010 Data Mining: Concepts and Techniques 10
Biomedical Data Analysis
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have around 30,000 genes
Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
Semantic integration of heterogeneous, distributed genome databases
  Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  Data cleaning and data integration methods developed in data mining will help
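To put a number on the "tremendous number of ways" nucleotides can be ordered: with four choices at each position, a sequence of length n can take 4^n distinct forms. The short sketch below computes this and cross-checks it by brute force for a tiny length. The helper name `count_sequences` and the chosen lengths are assumptions made for this illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # adenine, cytosine, guanine, thymine

def count_sequences(length):
    """Number of distinct nucleotide sequences of the given length (4 ** length)."""
    return len(NUCLEOTIDES) ** length

# Cross-check the formula by enumerating every sequence of a short length.
assert count_sequences(3) == len(list(product(NUCLEOTIDES, repeat=3)))

for n in (1, 2, 10, 100):
    print(n, count_sequences(n))
```

Even at length 100, far shorter than the "hundreds of nucleotides" mentioned for a gene, the count already has more than 60 digits.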
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Data Mining in Retail Industry (2)
Ex. 1. Design and construction of data warehouses based on the benefits of data mining
  Multidimensional analysis of sales, customers, products, time, and region
Ex. 2. Analysis of the effectiveness of sales campaigns
Ex. 3. Customer retention: analysis of customer loyalty
  Use customer loyalty card information to register sequences of purchases of particular customers
  Use sequential pattern mining to investigate changes in customer consumption or loyalty
  Suggest adjustments on the pricing and variety of goods
Ex. 4. Purchase recommendation and cross-reference of items
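The customer-retention example relies on purchase sequences registered per loyalty card. A minimal sketch of the basic building block of sequential pattern mining, counting how many customers' sequences contain a given ordered pattern, might look as follows. It is a simplification (single items per purchase, support counting only, no pattern generation), and the names `is_subsequence`, `support`, and the purchase data are invented for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator until the item is found or it is exhausted.
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Fraction of customer sequences that contain the pattern."""
    sequences = list(sequences)
    hits = sum(is_subsequence(pattern, seq) for seq in sequences)
    return hits / len(sequences)

# Invented purchase histories keyed by loyalty-card ID.
purchases = {
    "card_1": ["laptop", "mouse", "printer"],
    "card_2": ["laptop", "printer"],
    "card_3": ["mouse", "laptop"],
    "card_4": ["laptop", "bag", "mouse", "printer"],
}

print(support(["laptop", "printer"], purchases.values()))
```

A full sequential pattern miner would search for all patterns whose support exceeds a chosen threshold; this sketch only evaluates one candidate pattern.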
Data Mining for Telecomm. Industry (1)
A rapidly expanding and highly competitive industry with a great demand for data mining
  Understand the business involved
  Identify telecommunication patterns
  Catch fraudulent activities
  Make better use of resources
  Improve the quality of service
Multidimensional analysis of telecommunication data
  Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
Data Mining for Financial Data Analysis
Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality
Design and construction of data warehouses for multidimensional data analysis and data mining
  View the debt and revenue changes by month, by region, by sector, and by other factors
  Access statistical information such as max, min, total, average, trend, etc.
Loan payment prediction/consumer credit policy analysis
  Feature selection and attribute relevance ranking
  Loan payment performance
  Consumer credit rating
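Viewing revenue "by month, by region, by sector" together with max, min, total, and average amounts to grouping records along chosen dimensions and aggregating a measure. The sketch below shows this idea in plain Python. The records, the `FIELDS` mapping, and the function `summarize` are invented for illustration and do not correspond to any particular data warehouse product.

```python
from collections import defaultdict

# Invented records: (month, region, sector, revenue).
records = [
    ("2010-01", "East", "Retail", 120.0),
    ("2010-01", "West", "Retail", 80.0),
    ("2010-02", "East", "Energy", 200.0),
    ("2010-02", "East", "Retail", 100.0),
]
FIELDS = {"month": 0, "region": 1, "sector": 2}

def summarize(rows, dimensions):
    """Group rows by the chosen dimensions and aggregate the revenue column."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[FIELDS[d]] for d in dimensions)
        groups[key].append(row[3])
    return {
        key: {"max": max(v), "min": min(v), "total": sum(v), "average": sum(v) / len(v)}
        for key, v in groups.items()
    }

print(summarize(records, ["region"]))
print(summarize(records, ["month", "region"]))
```

Changing the list of dimensions corresponds to moving between coarser and finer views of the same data, which is the kind of operation a data cube is designed to answer quickly.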
Data Mining Applications
Data mining is an interdisciplinary field with wide and diverse applications
  There exist nontrivial gaps between data mining principles and domain-specific applications
Some application domains
  Financial data analysis
  Retail industry
  Telecommunication industry
  Biological data analysis
Data Mining: Concepts and Techniques
Chapter 11: Applications and Trends in Data Mining
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
/~hanj
Applications and Trends in Data Mining
Data mining applications
Data mining system products and research prototypes
Additional themes on data mining
Social impacts of data mining
Trends in data mining
Summary
Data Mining for Retail Industry
Retail industry: huge amounts of data on sales, customer shopping history, etc.
Applications of retail data mining
  Identify customer buying behaviors
  Discover customer shopping patterns and trends
  Improve the quality of customer service
  Achieve better customer retention and satisfaction
  Enhance goods consumption ratios
  Design more effective goods transportation and distribution policies
Data Mining for Telecomm. Industry (2)
Fraudulent pattern analysis and the identification of unusual patterns
  Identify potentially fraudulent users and their atypical usage patterns
  Detect attempts to gain fraudulent entry to customer accounts
  Discover unusual patterns which may need special attention
Multidimensional association and sequential pattern analysis
  Find usage patterns for a set of communication services by customer group, by month, etc.
  Promote the sales of specific services
  Improve the availability of particular services in a region
Use of visualization tools in telecommunication data analysis
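One simple way to surface "atypical usage patterns" is to compare each user's usage with the typical usage of the group. The sketch below uses a modified z-score based on the median and the median absolute deviation (MAD), with the conventional constant 0.6745 and cutoff 3.5. This is only one of many possible techniques and is not claimed to be the method the slides have in mind. The function name `atypical_users` and the usage figures are invented.

```python
import statistics

def atypical_users(usage, threshold=3.5):
    """Flag users whose usage deviates strongly from the group median."""
    values = list(usage.values())
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that a single extreme value cannot inflate.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to compare against
    return [
        user for user, v in usage.items()
        if 0.6745 * abs(v - median) / mad > threshold
    ]

# Invented monthly call minutes per user.
minutes = {"u1": 310, "u2": 295, "u3": 305, "u4": 2900, "u5": 300}
print(atypical_users(minutes))
```

A flagged user is only a candidate for closer inspection; unusual usage alone does not establish fraud.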
