Introduction to Data Mining
数据挖掘导论英文版

数据挖掘导论英文版Data Mining IntroductionData mining is the process of extracting valuable insights and patterns from large datasets. It involves the application of various techniques and algorithms to uncover hidden relationships, trends, and anomalies that can be used to inform decision-making and drive business success. In today's data-driven world, the ability to effectively harness the power of data has become a critical competitive advantage for organizations across a wide range of industries.One of the key strengths of data mining is its versatility. It can be applied to a wide range of domains, from marketing and finance to healthcare and scientific research. In the marketing realm, for example, data mining can be used to analyze customer behavior, identify target segments, and develop personalized marketing strategies. In the financial sector, data mining can be leveraged to detect fraud, assess credit risk, and optimize investment portfolios.At the heart of data mining lies a diverse set of techniques and algorithms. These include supervised learning methods, such asregression and classification, which can be used to predict outcomes based on known patterns in the data. Unsupervised learning techniques, such as clustering and association rule mining, can be employed to uncover hidden structures and relationships within datasets. Additionally, advanced algorithms like neural networks and decision trees have proven to be highly effective in tackling complex, non-linear problems.The process of data mining typically involves several key steps, each of which plays a crucial role in extracting meaningful insights from the data. The first step is data preparation, which involves cleaning, transforming, and integrating the raw data into a format that can be effectively analyzed. This step is particularly important, as the quality and accuracy of the input data can significantly impact the reliability of the final results.Once the data is prepared, the next step is to select the appropriate data mining techniques and algorithms to apply. This requires a deep understanding of the problem at hand, as well as the strengths and limitations of the available tools. Depending on the specific goals of the analysis, the data mining practitioner may choose to employ a combination of techniques, each of which can provide unique insights and perspectives.The next phase is the actual data mining process, where the selectedalgorithms are applied to the prepared data. This can involve complex mathematical and statistical calculations, as well as the use of specialized software and computing resources. The results of this process may include the identification of patterns, trends, and relationships within the data, as well as the development of predictive models and other data-driven insights.Once the data mining process is complete, the final step is to interpret and communicate the findings. This involves translating the technical results into actionable insights that can be easily understood by stakeholders, such as business leaders, policymakers, or scientific researchers. Effective communication of data mining results is crucial, as it enables decision-makers to make informed choices and take appropriate actions based on the insights gained.One of the most exciting aspects of data mining is its continuous evolution and the emergence of new techniques and technologies. As the volume and complexity of data continue to grow, the need for more sophisticated and powerful data mining tools and algorithms has become increasingly pressing. Advances in areas such as machine learning, deep learning, and big data processing have opened up new frontiers in data mining, enabling practitioners to tackle increasingly complex problems and extract even more valuable insights from the data.In conclusion, data mining is a powerful and versatile tool that has the potential to transform the way we approach a wide range of challenges and opportunities. By leveraging the power of data and the latest analytical techniques, organizations can gain a deeper understanding of their operations, customers, and markets, and make more informed, data-driven decisions that drive sustainable growth and success. As the field of data mining continues to evolve, it is clear that it will play an increasingly crucial role in shaping the future of business, science, and society as a whole.。
Introduction to Data Mining

Introduction to Data MiningData mining is a process of extracting useful information from large datasets by using various statistical and machine learning techniques. It is a crucial part of the field of data science and plays a key role in helping businesses make informed decisions based on data-driven insights.One of the main goals of data mining is to discover patterns and relationships within data that can be used to make predictions or identify trends. This can help businesses improve their marketing strategies, optimize their operations, and better understand their customers. By analyzing large amounts of data, data mining algorithms can uncover hidden patterns that may not be immediately apparent to human analysts.There are several different techniques that are commonly used in data mining, including classification, clustering, association rule mining, and anomaly detection. Classification involves categorizing data points into different classes based on their attributes, while clustering groups similar data points together. Association rule mining identifies relationships between different variables, and anomaly detection detects outliers or unusual patterns in the data.In order to apply data mining techniques effectively, it is important to have a solid understanding of statistics, machine learning, and data analytics. Data mining professionals must be able to preprocess data, select appropriate algorithms, and interpret the results of their analyses. They must also be able to communicate their findings effectively to stakeholders in order to drive business decisions.Data mining is used in a wide range of industries, including finance, healthcare, retail, and telecommunications. In finance, data mining is used to detect fraudulent transactions and predict market trends. In healthcare, it is used to analyze patient data and improve treatment outcomes. In retail, it is used to optimize inventory management and personalize marketing campaigns. In telecommunications, it is used to analyze network performance and customer behavior.Overall, data mining is a powerful tool that can help businesses gain valuable insights from their data and make more informed decisions. By leveraging the latest advances in machine learning and data analytics, organizations can stay competitive in today's data-driven world. Whether you are a data scientist, analyst, or business leader, understanding the principles of data mining can help you unlock the potential of your data and drive success in your organization.。
数据建模的书

以下是一些关于数据建模的书籍推荐:
1. 《数据仓库与数据挖掘导论》(Introduction to Data Warehousing and Data Mining) - 作者:Vipin Kumar、Michael Steinbach和Anuj Karpatne。
- 这本教材介绍了数据建模的基本概念,包括数据仓库设计、数据集成和数据挖掘技术。
它包含了许多实际案例和示例,适合初学者入门。
2. 《数据仓库工具包》(The Data Warehouse Toolkit) - 作者:Ralph Kimball和Margy Ross。
- 这本经典书籍介绍了数据仓库建模的原则和技巧。
它提供了丰富的维度建模和星型模式设计的实践指南,并包含了大量实用的案例。
3. 《大数据管理与处理》(Big Data Management and Processing) - 作者:Kuan-Ching Li、Jianhua Ma和Jiannong Cao。
- 这本书着重介绍了大数据环境下的数据建模和处理技术。
它覆盖了分布式数据库、并行计算和云计算等主题,适合对大数据领域感兴趣的读者。
4. 《数据建模精粹》(Data Modeling Essentials) - 作者:Graeme Simsion和Graham Witt。
- 这本书详细介绍了数据建模的基本原则和技巧。
它讲解了实体关系模型(ER模型)、规范化、关系数据库设计等内容,适合想要深入学习数据建模的读者。
以上是一些经典的数据建模书籍推荐,希望能对你有所帮助!请注意,我提供的信息仅供参考,具体选择还需根据个人需求和背景来确定。
k均值聚类算法例题

k均值聚类算法例题k均值聚类(k-means clustering)是一种常用的无监督学习算法,用于将一组数据分成k个不同的群集。
本文将通过例题的方式介绍k均值聚类算法,并提供相关参考内容。
例题:假设有一组包含10个点的二维数据集,需要将其分成3个不同的群集。
我们可以使用k均值聚类算法来解决这个问题。
步骤1:初始化聚类中心首先,从数据集中随机选择k个点作为初始聚类中心。
在这个例题中,我们选择3个点作为初始聚类中心。
步骤2:分配数据点到最近的聚类中心对于每个数据点,计算其与每个聚类中心的距离,并将其分配到最近的聚类中心。
距离的计算通常使用欧几里得距离(Euclidean distance)。
步骤3:更新聚类中心对于每个聚类,计算其所有数据点的平均值,并将该平均值作为新的聚类中心。
步骤4:重复步骤2和步骤3重复执行步骤2和步骤3,直到聚类中心不再改变或达到预定的迭代次数。
参考内容:1. 《机器学习实战》(Machine Learning in Action)- 书中的第10章介绍了k均值聚类算法,并提供了相应的Python代码实现。
该书详细介绍了k均值聚类算法的原理、实现步骤以及应用案例,是学习和理解k均值聚类的重要参考书籍。
2. 《Pattern Recognition and Machine Learning》- 该书由机器学习领域的权威Christopher M. Bishop撰写,在第9章介绍了k均值聚类算法。
书中详细介绍了k均值聚类的数学原理,从最优化的角度解释了算法的过程,并提供了相关代码示例。
3. 《数据挖掘导论》(Introduction to Data Mining)- 该书由数据挖掘领域的专家Pang-Ning Tan、Michael Steinbach和Vipin Kumar合著,在第10章中介绍了k均值聚类算法及其变体。
该书提供了理论和应用层面的讲解,包括如何选择最佳的k值、处理异常值和空值等问题。
数据挖掘英语

数据挖掘英语随着信息技术和互联网的不断发展,数据已经成为企业和个人在决策和分析中不可或缺的一部分。
而数据挖掘作为一种利用大数据技术来挖掘数据潜在价值的方法,也因此变得越来越重要。
在这篇文章中,我们将会介绍数据挖掘的相关英语术语和概念。
一、概念1.数据挖掘(Data Mining)数据挖掘是一种从大规模数据中提取出有用信息的过程。
数据挖掘通常包括数据预处理、数据挖掘和结果评估三个阶段。
2.机器学习(Machine Learning)机器学习是一种通过对数据进行学习和分析来改善和优化算法的方法。
机器学习可以被视为是一种数据挖掘的技术,它可以用来预测未来的趋势和行为。
3.聚类分析(Cluster Analysis)聚类分析是一种通过将数据分组为相似的集合来发现数据内在结构的方法。
聚类分析可以用来确定市场细分、客户分组、产品分类等。
4.分类分析(Classification Analysis)分类分析是一种通过将数据分成不同的类别来发现数据之间的关系的方法。
分类分析可以用来识别欺诈行为、预测客户行为等。
5.关联规则挖掘(Association Rule Mining)关联规则挖掘是一种发现数据集中变量之间关系的方法。
它可以用来发现购物篮分析、交叉销售等。
6.异常检测(Anomaly Detection)异常检测是一种通过识别不符合正常模式的数据点来发现异常的方法。
异常检测可以用来识别欺诈行为、检测设备故障等。
二、术语1.数据集(Dataset)数据集是一组数据的集合,通常用来进行数据挖掘和分析。
2.特征(Feature)特征是指在数据挖掘和机器学习中用来描述数据的属性或变量。
3.样本(Sample)样本是指从数据集中选取的一部分数据,通常用来进行机器学习和预测。
4.训练集(Training Set)训练集是指用来训练机器学习模型的样本集合。
5.测试集(Test Set)测试集是指用来测试机器学习模型的样本集合。
01Intro教材PPT

7
CS 412: Course Project [4th credit]
A comprehensive survey on a focused topic Individual surveys, not group work Examples of topics (need to be focused and specific)
9
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign &
有关异常值处理的书
有关异常值处理的书异常值处理是数据分析和统计学中的重要内容,涉及到检测和处理数据中的异常或离群值。
以下是一些与异常值处理相关的书籍,它们可以帮助你深入了解异常值的概念、检测方法和处理技术:1. "统计学习方法"(Pattern Recognition and Machine Learning)作者:Christopher M. Bishop这本书是机器学习领域的经典教材,其中涉及异常值检测和处理在机器学习中的应用。
2. "数据挖掘:概念与技术"(Data Mining: Concepts and Techniques)作者:Jiawei Han,Micheline Kamber,Jian Pei这本书介绍了数据挖掘的基本概念和技术,其中包括异常值检测和处理的方法。
3. "数据分析导论"(Introduction to Data Mining)作者:Pang-Ning Tan,Michael Steinbach,Vipin Kumar这是一本数据挖掘和数据分析的入门教材,涵盖了异常值检测和处理的内容。
4. "Applied Multivariate Statistical Analysis"作者:Richard A. Johnson,Dean W. Wichern这本书着重介绍多元统计分析的方法,其中包括处理多元数据中的异常值问题。
5. "R语言实战"(R in Action: Data Analysis and Graphics with R)作者:Robert I. Kabacoff这是一本关于使用R语言进行数据分析和可视化的实战教材,其中包括异常值处理的内容。
6. "Outliers in Statistical Data"作者:Vic Barnett,Terry Lewis这本书是关于统计数据中异常值的经典著作,深入讨论了异常值检测和处理的方法和理论。
数据挖掘课程大纲
数据挖掘课程大纲课程名称:数据挖掘/ Data Mining课程编号:242023授课对象:信息管理与信息系统专业本科生开课学期:第7学期先修课程:C语言程序设计、数据库应用课程属性:专业教育必修课总学时/学分:48 (含16实验学时)/3执笔人:编写日期:一、课程概述数据挖掘是信息管理与信息系统专业的专业基础课。
课程通过介绍数据仓库和数据挖掘的相关概念和理论,要求学生掌握数据仓库的建立、联机分析以及分类、关联规那么、聚类等数据挖掘方法。
从而了解数据收集、分析的方式,理解知识发现的过程,掌握不同问题的分析和建模方法。
通过本课程的教学我们希望能够使学生在理解数据仓库和数据挖掘的基本理论基础上,能在SQL Server 2005平台上,初步具备针对具体的问题,选择合适的数据仓库和数据挖掘方法解决现实世界中较复杂问题的能力。
Data mining is a professional basic course of information management and information system. Through introducing the related concepts and theories of data warehouse and data mining, it requests students to understand the approaches for the establishment of data warehouse, on-line analysis, classification, association rules, clustering etc. So as to get familiar with the methods of data collection and analysis, understand the process of knowledge discovery, and master the analysis and modeling method of different problems. Through the teaching of this course, students are expected to be equipped with the basic theory of data warehouse and data mining, and the ability to solve complex real life problems on the platform of SQL Server 2005 by selecting the appropriate data warehouse and data mining approaches.二、课程目标1. 了解数据仓库的特点和建立方法;2.学会联机分析;3.掌握分类、关联规那么、聚类等数据挖掘方法;4.理解知识发现的过程。
6-data mining(1)
Part II Data MiningOutlineThe Concept of Data Mining(数据挖掘概念) Architecture of a Typical Data Mining System (数据挖掘系统结构)What can be Mined? (能挖掘什么?)Major Issues(主要问题)in Data MiningData Cleaning(数据清理)3What Is Data Mining?Data mining is the process of discovering interesting knowledge from large amounts of data. (数据挖掘是从大量数据中发现有趣知识的过程) The main difference that separates information retrieval apart from data mining is their goals. (数据挖掘和信息检索的主要差别在于他们的目标) Information retrieval is to help users search for documents or data that satisfy their information needs(信息检索帮用户寻找他们需要的文档/数据)e.g. Find customers who have purchased more than $10,000 in the last month .(查找上个月购物量超过1万美元的客户)Data mining discovers useful knowledge by analyzing data correlations using sophisticated data mining techniques(数据挖掘用复杂技术分析…)e.g. Find all items which are frequently purchased with milk .(查找经常和牛奶被购买的商品)A KDD Process (1) Some people view data mining as synonymous5A KDD Process (2)Learning the application domain (学习应用领域相关知识):Relevant knowledge & goals of application (相关知识和目标) Creating a target data set (建立目标数据集) Data selection, Data cleaning and preprocessing (预处理)Choosing functions of data mining (选择数据挖掘功能)Summarization, classification, association, clustering , etc.Choosing the mining algorithm(s) (选择挖掘算法)Data mining (进行数据挖掘): search for patterns of interest Pattern evaluation and knowledge presentation (模式评估和知识表示)Removing redundant patterns, visualization, transformation, etc.Present results to user in meaningful manner.Use of discovered knowledge (使用所发现的知识)7Concept/class description (概念/类描述)Characterization(特征): provide a summarization of the given data set Comparison(区分): mine distinguishing characteristics(挖掘区别特征)that differentiate a target class from comparable contrasting classes. Association rules (correlation and causality)(关联规则)Association rules are of the form(这种形式的规则): X ⇒Y,Examples: contains(T, “computer”) ⇒contains(T, “software”)[support = 1%, confidence = 50%]age(X, “20..29”) ∧income(X, “20..29K ”) ⇒buys(X, “PC ”)[support = 2%, confidence = 60%]Classification and Prediction (分类和预测)Find models that describe and distinguish classes for future prediction.What kinds of patterns can be mined?(1)What kinds of patterns can be mined?(2)Cluster(聚类)Group data to form some classes(将数据聚合成一些类)Principle: maximizing the intra-class similarity and minimizing the interclass similarity (原则: 最大化类内相似度,最小化类间相似度) Outlier analysis: objects that do not comply with the general behavior / data model. (局外者分析: 发现与一般行为或数据模型不一致的对象) Trend and evolution analysis (趋势和演变分析)Sequential pattern mining(序列模式挖掘)Regression analysis(回归分析)Periodicity analysis(周期分析)Similarity-based analysis(基于相似度分析)What kinds of patterns can be mined?(3)In the context of text and Web mining, the knowledge also includes: (在文本挖掘或web挖掘中还可以发现)Word association (术语关联)Web resource discovery (WEB资源发现)News Event (新闻事件)Browsing behavior (浏览行为)Online communities (网上社团)Mining Web link structures to identify authoritative Web pages finding spam sites (发现垃圾网站)Opinion Mining (观点挖掘)…10Major Issues in Data Mining (1)Mining methodology(挖掘方法)and user interactionMining different kinds of knowledge in DBs (从DB 挖掘不同类型知识) Interactive mining of knowledge at multiple levels of abstraction (在多个抽象层上交互挖掘知识)Incorporation of background knowledge (结合背景知识)Data mining query languages (数据挖掘查询语言)Presentation and visualization of data mining results(结果可视化表示) Handling noise and incomplete data (处理噪音和不完全数据) Pattern evaluation (模式评估)Performance and scalability (性能和可伸缩性) Efficiency(有效性)and scalability(可伸缩性)of data mining algorithmsParallel(并行), distributed(分布) & incremental(增量)mining methods©Wu Yangyang 11Major Issues in Data Mining (2)Issues relating to the diversity of data types (数据多样性相关问题)Handling relational and complex types of data (关系和复杂类型数据) Mining information from heterogeneous databases and www(异质异构) Issues related to applications (应用相关的问题) Application of discovered knowledge (所发现知识的应用)Domain-specific data mining tools (面向特定领域的挖掘工具)Intelligent query answering (智能问答) Process control(过程控制)and decision making(决策制定)Integration of the discovered knowledge with existing knowledge:A knowledge fusion problem (知识融合)Protection of data security(数据安全), integrity(完整性), and privacy12CulturesDatabases: concentrate on large-scale (non-main-memory) data.(数据库:关注大规模数据)To a database person, data-mining is an extreme form of analytic processing. Result is the data that answers the query.(对数据库工作者而言数据挖掘是一种分析处理, 其结果就是问题答案) AI (machine-learning): concentrate on complex methods, small data.(人工智能(机器学习):关注复杂方法,小数据)Statistics: concentrate on models. (统计:关注模型.)To a statistician, data-mining is the inference of models. Result is the parameters of the model (数据挖掘是模型推论, 其结果是一些模型参数)e.g. Given a billion numbers, a statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.©Wu Yangyang 13Data Cleaning (1)Data Preprocessing (数据预处理):Cleaning, integration, transformation, reduction, discretization (离散化) Why data cleaning? (为什么要清理数据?)--No quality data, no quality mining results! Garbage in, Garbage out! Measure of data quality (数据质量的度量标准)Accuracy (正确性)Completeness (完整性)Consistency(一致)Timeliness(适时)Believability(可信)Interpretability(可解释性) Accessibility(可存取性)14Data Cleaning (2)Data in the real world is dirtyIncomplete (不完全):Lacking some attribute values (缺少一些属性值)Lacking certain interest attributes /containing only aggregate data(缺少某些有用属性或只包含聚集数据)Noisy(有噪音): containing errors or outliers(包含错误或异常) Inconsistent: containing discrepancies in codes or names(不一致: 编码或名称存在差异)Major tasks in data cleaning (数据清理的主要任务)Fill in missing values (补上缺少的值)Identify outliers(识别出异常值)and smooth out noisy data(消除噪音)Correct inconsistent data(校正不一致数据) Resolve redundancy caused by data integration (消除集成产生的冗余)15Data Cleaning (3)Handle missing values (处理缺值问题) Ignore the tuple (忽略该元组) Fill in the missing value manually (人工填补) Use a global constant to fill in the missing value (用全局常量填补) Use the attribute mean to fill in the missing value (该属性平均值填补) Use the attribute mean for all samples belonging to the same class to fill in the missing value (用同类的属性平均值填补) Use the most probable value(最大可能的值)to fill in the missing value Identify outliers and smooth out noisy data(识别异常值和消除噪音)Binning method (分箱方法):First sort data and partition into bins (先排序、分箱)Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.(然后用平均值、中值、边界值平滑)©Wu Yangyang 16Data Cleaning (4)Example: Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equi-depth) bins (分成等深的箱):-Bin 1: 4, 8, 9, 15-Bin 2: 21, 21, 24, 25-Bin 3: 26, 28, 29, 34Smoothing by bin means (用平均值平滑):-Bin 1: 9, 9, 9, 9-Bin 2: 23, 23, 23, 23-Bin 3: 29, 29, 29, 29Smoothing by bin boundaries (用边界值平滑):-Bin 1: 4, 4, 4, 15-Bin 2: 21, 21, 25, 25-Bin 3: 26, 26, 26, 34Clustering (。
数据挖掘导论 第二章 数据
Divorced 220K Single Married Single 85K 75K 90K
© Tan,Steinbach, Kumar
Introduction to Data Mining
Ratio
temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Attribute Level
Transformation
Comments
Nominal
Any permutation of values
‹#›
What is Data?
Collection of data objects and their attributes
Attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– ID has no limit but age has a maximum and minimum value
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Measurement of Length
The way you measure an attribute is somewhat may not match the attributes properties.
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Examples: ID numbers, eye color, zip codes
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Examples: calendar dates, temperatures in Celsius or Fahrenheit. Examples: temperature in Kelvin, length, time, counts
Ratio
new_value = a * old_value
Length can be measured in meters or feet.
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values – Examples: zip codes, counts, or the set of words in a collection of documents – Often represented as integer variables. – Note: binary attributes are a special case of discrete attributes
calendar dates, temperature in Celsius or Fahrenheit
mean, standard deviation, Pearson's correlation, t and F tests geometric mean, harmonic mean, percent variation
Ordered
– – Spatial Data Temporal Data Nhomakorabea–
–
Sequential Data
Genetic Sequence Data
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Important Characteristics of Structured Data
5 A B 7 C 8 3 2 1
D 10 4
E
15
5
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Types of Attributes
There are different types of attributes
– Nominal
Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
Example: Attribute values for ID and age are integers But properties of attribute values can be different
Single Married Single Married
– Attribute is also known as variable, field, characteristic, or feature Objects
Divorced 95K Married 60K
A collection of attributes describe an object
Data Mining: Data
Lecture Notes for Chapter 2 Introduction to Data Mining
by Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Types of data sets
Record
– – Data Matrix Document Data
–
Transaction Data
Graph
– – World Wide Web Molecular Structures
– – – –
– – – –
Distinctness: Order: Addition: Multiplication:
= < > + */
Nominal attribute: distinctness Ordinal attribute: distinctness & order Interval attribute: distinctness, order & addition Ratio attribute: all 4 properties
Tid Refund Marital Status 1 2 3 4 5 6 7 8 9 10
10
Taxable Income Cheat 125K 100K 70K 120K No No No No Yes No No Yes No Yes
Yes No No Yes No No Yes No No No
Ratio
temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Attribute Level
Transformation
Comments
Nominal
Any permutation of values
– ID has no limit but age has a maximum and minimum value
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Measurement of Length
The way you measure an attribute is somewhat may not match the attributes properties.
– Object is also known as record, point, case, sample, entity, or instance
Divorced 220K Single Married Single 85K 75K 90K
© Tan,Steinbach, Kumar
Introduction to Data Mining
Introduction to Data Mining 4/18/2004 ‹#›
© Tan,Steinbach, Kumar
Attribute Type
Nominal
Description
The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, )
– Ordinal
– Interval
– Ratio
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
‹#›
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Ordinal
An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
Interval
new_value =a * old_value + b where a and b are constants
Examples
zip codes, employee ID numbers, eye color, sex: {male, female}
Operations
mode, entropy, contingency correlation, 2 test
Ordinal
The values of an ordinal attribute provide enough information to order objects. (<, >)
‹#›
What is Data?
Collection of data objects and their attributes
Attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.