数据挖掘课件

  1. 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
  2. 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
  3. 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。


Noisy data comes from the process of data
collection entry transmission

Inconsistent data comes from
Different data sources Functional dependency violation
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry
Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
Accuracy Completeness Consistency Believability Value added Interpretability Accessibility
Why Is Data Dirty?

Incomplete data comes from
different consideration between the time when the data was collected and when it is analyzed. human/hardware/software problems

Other data problems which requires data cleaning
duplicate records incomplete data inconsistent data
How to Handle Noisy Data?

Binning分箱method
first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Clustering
detect and remove outliers

Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)

Smoothing by bin means:
Data Preprocessing
Gaolin School of Computer Science and Technology Xidian University
2013-2-28 lgao@mail.xidian.edu.cn
Contents of last course


What is data mining? Process of data mining Functionalities of data mining Application of data mining Major issues in data mining
Data Cleaning : Noisy Data

Noise: random error or variance in a measured variable Incorrect attribute values may due to
faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention
a global constant : e.g., ―unknown‖, a new class?! the attribute mean the attribute mean for all samples belonging to the same class the most probable value: inference-based such as Bayesian formula or decision tree

Data integration
Integration of multiple databases, data cubes, or files


Data transformation
Normalization and aggregation规范化和聚集
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Binning Methods for Data Smoothing

Sorted data (e.g., by price)
– 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into (equi-depth) bins:
– Bin 1: 4, 8, 9, 15 – Bin 2: 21, 21, 24, 25 – Bin 3: 26, 28, 29, 34

Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately same number of samples Good data scaling Managing categorical attributes can be tricky.

Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Cleaning

Importance

―Data cleaning is one of the three biggest problems in data warehousing
noisy: containing errors or outliers
• e.g., Salary=―-10‖
inconsistent: containing discrepancies in codes or names
• e.g., Age=―42‖ Birthday=―03/07/1997‖ • e.g., Was rating ―1,2,3‖, now rating ―A, B, C‖

Regression回归
smooth by fitting the data into regression functions
Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N. The most straightforward, but outliers may dominate presentation Skewed data is not handled well

Data cleaning tasks
Fill in missing values Identify outliers Smooth out noisy data Correct inconsistent data
Data Cleaning : Missing Data

Data is not always available
Why Is Data Preprocessing Important?

No quality data, no quality mining results!
Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Why Data Preprocessing?

Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest
• e.g., occupation=― ‖
ຫໍສະໝຸດ Baidu
Contents of today: preprocessing



Why preprocess the data? Data cleaning 清理 Data integration 集成 Data transformation 变换 Data reduction 约简 Data discretization 离散化 Summary
the percentage of missing values per attribute varies considerably)


Fill in the missing value manually: tedious + infeasible? Fill in it automatically with

Missing data may need to be inferred
How to Handle Missing Data?

Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when

Broad categories:
intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies


Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse
相关文档
最新文档