1998 - Data Mining: Statistics and More?


Introduction to Data Mining (English Edition)

Data mining is the process of extracting valuable insights and patterns from large datasets. It applies a range of techniques and algorithms to uncover hidden relationships, trends, and anomalies that can inform decision-making and drive business success. In today's data-driven world, the ability to harness data effectively has become a critical competitive advantage across a wide range of industries.

One of the key strengths of data mining is its versatility: it can be applied to domains from marketing and finance to healthcare and scientific research. In marketing, it can be used to analyze customer behavior, identify target segments, and develop personalized marketing strategies. In the financial sector, it can be leveraged to detect fraud, assess credit risk, and optimize investment portfolios.

At the heart of data mining lies a diverse set of techniques and algorithms. Supervised learning methods, such as regression and classification, predict outcomes based on known patterns in the data. Unsupervised techniques, such as clustering and association rule mining, uncover hidden structures and relationships within datasets. In addition, algorithms such as neural networks and decision trees have proven effective on complex, non-linear problems.

The data mining process typically involves several key steps. The first is data preparation: cleaning, transforming, and integrating the raw data into a format that can be analyzed effectively. This step is particularly important, because the quality and accuracy of the input data strongly affect the reliability of the final results.

Once the data are prepared, the next step is to select the appropriate techniques and algorithms. This requires a solid understanding of the problem at hand and of the strengths and limitations of the available tools; depending on the goals of the analysis, a combination of techniques may be used, each providing its own perspective.

The next phase is the mining itself, in which the selected algorithms are applied to the prepared data. This can involve substantial mathematical and statistical computation as well as specialized software and computing resources. The results may include patterns, trends, and relationships identified in the data, along with predictive models and other data-driven insights.

Once the mining is complete, the final step is to interpret and communicate the findings: translating the technical results into actionable insights that stakeholders, such as business leaders, policymakers, or researchers, can understand. Effective communication of the results is crucial, because it is what allows decision-makers to act on the insights gained.

One of the most notable aspects of data mining is its continuous evolution and the emergence of new techniques and technologies.
As the volume and complexity of data continue to grow, the need for more sophisticated and powerful data mining tools and algorithms becomes increasingly pressing. Advances in machine learning, deep learning, and big-data processing have opened new frontiers, enabling practitioners to tackle more complex problems and extract even more value from data. In conclusion, data mining is a powerful and versatile tool with the potential to transform how a wide range of challenges and opportunities are approached. By combining large data resources with modern analytical techniques, organizations can gain a deeper understanding of their operations, customers, and markets, and make more informed, data-driven decisions that support sustainable growth. As the field continues to evolve, it will play an increasingly important role in shaping the future of business, science, and society.

Professional English Level 8 Core Vocabulary

Introduction: In this document, we explore the core vocabulary required for the Professional English Level 8 examination. This list covers several domains, including business, finance, marketing, and technology, and is intended to help candidates improve their language skills in a professional context.

Business vocabulary:
1. Entrepreneurship: the process of starting and managing a business venture.
2. Strategic planning: the process of defining an organization's objectives and determining the best way to achieve them.
3. Leadership: the ability to guide and inspire others towards a common goal.
4. Innovation: the introduction of new ideas, products, or processes.
5. Collaboration: working together with others to achieve a common objective.
6. Negotiation: the process of reaching an agreement through discussion and compromise.
7. Stakeholder: an individual or group with an interest or concern in a business or project.
8. Sustainability: the practice of using resources in a way that meets present needs without compromising future generations' ability to meet their own needs.

Finance vocabulary:
1. Asset: something of value owned or controlled by a person, organization, or country.
2. Liability: a financial obligation or debt.
3. Revenue: income generated from business activities.
4. Profitability: the ability of a business to generate profit.
5. Cash flow: the movement of money in and out of a business.
6. Investment: the act of putting money into something with the expectation of gaining a return or profit.
7. Risk management: the process of identifying, assessing, and prioritizing risks to minimize their impact on business operations.
8. Capital: financial resources available for investment.

Marketing vocabulary:
1. Market segmentation: dividing a market into distinct groups based on characteristics, needs, or behaviors.
2. Branding: the process of creating a unique name, design, or symbol that identifies and differentiates a product or company.
3. Advertising: the promotion of products or services through various media channels.
4. Consumer behavior: the study of individuals, groups, or organizations and the processes they use to select, secure, use, and dispose of products, services, experiences, or ideas.
5. Market research: the collection and analysis of data to understand and interpret market trends, customer preferences, and competitor strategies.
6. Product placement: the inclusion of branded products or references in entertainment media.
7. Public relations: the management of communication between an organization and its publics.
8. Sales promotion: short-term incentives to encourage the purchase or sale of a product or service.

Technology vocabulary:
1. Artificial intelligence: the simulation of human intelligence in machines that are programmed to think and learn.
2. Big data: large and complex data sets that require advanced techniques to analyze and interpret.
3. Cloud computing: the practice of using a network of remote servers hosted on the internet to store, manage, and process data.
4. Cybersecurity: measures taken to protect computer systems and networks from unauthorized access or attacks.
5. Internet of Things (IoT): the network of physical devices, vehicles, appliances, and other objects embedded with sensors, software, and connectivity to exchange data.
6. Virtual reality: a computer-generated simulation of a three-dimensional environment that can be interacted with in a seemingly real or physical way.
7. Blockchain: a digital ledger in which transactions made in cryptocurrencies are recorded chronologically and publicly.
8. Data mining: the process of discovering patterns in large data sets using techniques at the intersection of statistics and computer science.

Conclusion: Mastering the core vocabulary for Professional English Level 8 is essential for individuals seeking to excel in the professional world. This document has provided an extensive list of words in various domains, including business, finance, marketing, and technology. By incorporating these words into their everyday language, candidates can enhance their communication skills and increase their chances of success in the professional arena.

DATA MINING(CH4)


Data Mining and Knowledge Discovery (2nd Edition)
Li Xiongfei et al., © 2003, 2010
The ID3 Learning Algorithm
• The ID3 decision-tree learning algorithm is a greedy algorithm that constructs the decision tree top-down and recursively.
– Starting from the root node with all of the training samples, select an attribute to partition the samples.
– Each value of the attribute produces a branch, and the corresponding subset of samples is moved to the newly created child node.
– Each child node is processed recursively, until every node contains samples of a single class only.
• Tree building is this recursive process, which ultimately yields a decision tree; pruning is then applied to reduce the effect of noisy data on classification accuracy.
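To make the greedy attribute-selection step concrete, here is a minimal Python sketch of the information-gain calculation ID3 uses at each node. It is an illustrative, from-scratch version; the toy data, column layout, and helper names are invented for the example and are not taken from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on the given attribute."""
    base = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return base - remainder

def choose_attribute(rows, labels, attr_indices):
    """ID3's greedy choice: the attribute with the largest information gain."""
    return max(attr_indices, key=lambda i: information_gain(rows, labels, i))

# Toy, hypothetical training data: attributes = [outlook, windy].
rows   = [["sunny", "no"], ["sunny", "yes"], ["rain", "no"], ["rain", "yes"]]
labels = ["play", "stay", "play", "stay"]
print(choose_attribute(rows, labels, [0, 1]))   # prints 1: splitting on "windy" gives gain 1 bit
```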
Foundations of Information Theory
• Information theory is the body of theory developed by C. E. Shannon to address problems in the transmission of information (communication).
– An information transmission system consists of three parts:
• Source: the sending end
• Sink (destination): the receiving end
• Channel: the medium connecting the two
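As a brief supplement (these are standard textbook definitions, not taken from the slides), the entropy of a source with symbol probabilities $p_1, \dots, p_n$, and the information gain that the ID3 slides above compute from it, can be written as:

```latex
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i ,
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

where $S$ is a sample set and $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$.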
Introduction
• Decision-tree learning is an instance-based inductive learning algorithm and the most widely used logical method.
• A typical decision-tree learning system uses a top-down approach, searching part of the search space for a solution. It is guaranteed to find a simple decision tree, but not necessarily the simplest one.
• The Concept Learning System (CLS), proposed by Hunt et al. in 1966, was the earliest decision-tree algorithm.
• Decision trees are commonly used to build classifiers and predictive models, which can classify or predict unknown data and support data mining. Since the 1960s, decision trees have been widely applied to classification, prediction, and rule extraction.
• After J. R. Quinlan proposed the ID3 (Iterative Dichotomizer 3) algorithm in 1979, decision-tree methods found further use in machine learning and knowledge discovery.
• C4.5 builds on ID3 and can handle continuous attributes.
• ID4 and ID5 are incremental versions of ID3.
• Decision-tree algorithms that emphasize scalability include SLIQ, SPRINT, and RainForest.
• Steps for classifying with a decision tree:
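The slide's enumeration of the steps is cut off in this copy; the usual workflow is to build the tree from training data, optionally prune it, and then use it to classify unseen samples. A minimal scikit-learn sketch of that workflow, on made-up data, might look like this (the feature names and values are hypothetical):

```python
# Illustration of the build / prune / classify workflow with scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X = [[25, 0], [40, 1], [35, 0], [50, 1], [23, 1], [60, 0]]   # [age, owns_house] (made up)
y = [0, 1, 0, 1, 0, 1]                                        # class labels (made up)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# 1. Build the tree from the training samples (entropy criterion, as in ID3/C4.5).
# 2. "Pruning" is approximated here by limiting depth / cost-complexity pruning.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, ccp_alpha=0.0)
clf.fit(X_train, y_train)

# 3. Use the tree to classify unseen samples.
print(clf.predict(X_test))
print("accuracy:", clf.score(X_test, y_test))
```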

Introduction to Data Mining


Data mining is the process of extracting useful information from large datasets using statistical and machine learning techniques. It is a crucial part of data science and plays a key role in helping businesses make informed decisions based on data-driven insights.

One of the main goals of data mining is to discover patterns and relationships within data that can be used to make predictions or identify trends. This can help businesses improve their marketing strategies, optimize their operations, and better understand their customers. By analyzing large amounts of data, data mining algorithms can uncover patterns that may not be immediately apparent to human analysts.

Several techniques are commonly used in data mining, including classification, clustering, association rule mining, and anomaly detection. Classification assigns data points to classes based on their attributes; clustering groups similar data points together; association rule mining identifies relationships between variables; and anomaly detection finds outliers or unusual patterns in the data.

Applying these techniques effectively requires a solid understanding of statistics, machine learning, and data analytics. Data mining professionals must be able to preprocess data, select appropriate algorithms, and interpret the results of their analyses. They must also be able to communicate their findings to stakeholders in order to drive business decisions.

Data mining is used in a wide range of industries, including finance, healthcare, retail, and telecommunications. In finance, it is used to detect fraudulent transactions and predict market trends; in healthcare, to analyze patient data and improve treatment outcomes; in retail, to optimize inventory management and personalize marketing campaigns; in telecommunications, to analyze network performance and customer behavior.

Overall, data mining is a powerful tool that helps organizations gain valuable insights from their data and make better-informed decisions. By leveraging advances in machine learning and data analytics, organizations can stay competitive in a data-driven world. Whether you are a data scientist, analyst, or business leader, understanding the principles of data mining can help you unlock the potential of your data.
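As a small, hedged illustration of three of the technique families named above (classification, clustering, and anomaly detection), here is a scikit-learn sketch on synthetic data; the dataset and parameter choices are invented for the example.

```python
# Three technique families from the text, demonstrated on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

X, y = make_blobs(n_samples=300, centers=2, random_state=42)

# Classification: assign points to known classes based on their attributes.
clf = LogisticRegression().fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Clustering: group similar points together without using the labels.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Anomaly detection: flag unusual points (-1 = outlier, 1 = inlier).
flags = IsolationForest(contamination=0.05, random_state=42).fit_predict(X)
print("flagged outliers:", int((flags == -1).sum()))
```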

Data Mining Concepts
• Characteristics of data mining
– Data mining places no limit on the amount of data, and does not develop blind spots simply because the data set is too large. As long as the analysis tools and functionality are adequate, restrictions on data volume and on the number of variables diminish during the mining process.
– Data mining is not just a matter of databases plus analysis tools and methods. In describing the phenomenon and formulating the problem, professional and expert staff are needed to characterize the problem domain, so that the decision variables fully capture the phenomenon and the core of the problem, and to interpret the resulting data once the analysis is complete.
Evolution of the supporting technology (key techniques and representative vendors):
• Database systems (1970s): hierarchical databases, network databases, relational databases, structured query language (SQL), open database connectivity (ODBC). Vendors: Oracle, Sybase, Informix, IBM, Microsoft.
• On-line analytical processing and data warehousing: OLAP, the multidimensional data model, data warehouses. Vendors: Pilot, Comshare, Arbor, Cognos, Microstrategy, Microsoft.
• Data mining: advanced algorithms, multiprocessor computer systems, mass data storage technology, artificial intelligence. Vendors: Pilot, Lockheed, IBM, SGI.
Statistical analysis versus data mining:
• Model building: with statistical methods, the analyst must assess the importance of each variable one by one before a model can be built; data mining offers many candidate models, so a suitable one can be chosen in a short time.
• Relevant variables: statistical analysis can examine the effect of only one variable on the outcome at a time; data mining can uncover relationships among many variables.
• Can the results be anticipated in advance? Statistical analysis: yes; data mining: no.
• Mode of execution: statistical analysis is problem-driven, and a given question usually needs to be analyzed only once; data mining is an iterative process of repeated cycles and continual refinement.

Translated Foreign-Language Literature on Big Data

Original text: What is Data Mining?

Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:
· data cleaning: to remove noise or irrelevant data;
· data integration: where multiple data sources may be combined;
· data selection: where data relevant to the analysis task are retrieved from the database;
· data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations;
· data mining: an essential process where intelligent methods are applied in order to extract data patterns;
· pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures; and
· knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.

We agree that data mining is a knowledge discovery process. However, in industry, in media, and in the database research milieu, the term "data mining" is becoming more popular than the longer term "knowledge discovery in databases". Therefore, in this book, we choose to use the term "data mining". We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

Based on this view, the architecture of a typical data mining system may have the following major components:
1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.
5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base.
Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.

While there may be many "data mining systems" on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases, should be more appropriately categorized as either a database system, an information retrieval system, or a deductive database system.

Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Therefore, data mining is considered one of the most important frontiers in database systems and one of the most promising new database applications in the information industry.

A classification of data mining systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

1) Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

2) Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

3) Classification according to the kinds of techniques utilized. Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.
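To make the knowledge-discovery steps listed earlier in this excerpt more concrete, here is a small, hedged pandas/scikit-learn sketch of a cleaning, selection, transformation, mining, and evaluation pipeline. The file name, column names, and modeling choices are invented for the illustration.

```python
# A toy knowledge-discovery pipeline: clean -> select -> transform -> mine -> evaluate.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("transactions.csv")            # hypothetical source table

# Data cleaning: drop rows with missing values.
df = df.dropna()

# Data selection: keep only the columns relevant to the task (hypothetical names).
X = df[["amount", "n_items", "hour_of_day"]]
y = df["is_fraud"]

# Data transformation: scale attributes to comparable ranges.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Data mining: fit a simple classification model.
model = LogisticRegression().fit(X_train, y_train)

# Pattern evaluation / knowledge presentation: report how well the mined model performs.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```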

Classic Data Mining Literature

Data Mining Based on the Apriori Algorithm

1. Mining valuable hidden information from huge databases: an outline of data mining

With the development of information technology, represented by computers and networks, more and more enterprises, government agencies, educational institutions, and research units have digitized their information. The continual growth of the information held in databases places higher demands on data storage, management, and analysis.

On one hand, progress in data-collection tools has given humanity enormous quantities of data. Faced with this explosive growth, people need new tools that can automatically transform data into valuable information and knowledge, and data mining has therefore become a new research hotspot. On the other hand, with the rapid development of database technology and the wide adoption of database management systems, the data that people accumulate grows day by day. Much important information may be hidden in this rapidly increasing data, and people hope to perform higher-level analysis of it in order to use it better.

Data mining is a new technology that extracts concealed, previously unknown, and potentially valuable knowledge for decision-making from massive amounts of data. It is devoted to data analysis and understanding and to revealing the knowledge implicit within data, and it is likely to become one of the most profitable targets of future information technology applications. Like other new technologies, data mining must pass through the stages of concept formation, acceptance, widespread research and exploration, gradual application, and finally large-scale application.

Data mining is already closely connected with daily life. We face targeted advertisements every day, and commercial firms use data mining to reduce costs and improve efficiency. Opponents of data mining worry that the information it obtains comes at the price of people's privacy: mining customer data can reveal demographic information that was previously unknown and hidden. Interest in applying data mining in certain domains, for example fraud detection, suspect identification, and prediction of potential terrorists, continues to grow.

Data mining techniques can help people extract interesting knowledge, rules, or higher-level information from the relevant data in a database and analyze it from different angles, so that the data in the database can be used effectively. Data mining can describe not only how the data developed in the past but can also forecast future trends.

Data mining techniques can be classified in many ways. By mining task, they can be divided into association rule mining, classification rule mining, clustering, dependency analysis and dependency modeling, concept description, deviation analysis, trend analysis, pattern analysis, and so on. By the database being mined, they can be divided into relational, object-oriented, spatial, temporal, multimedia, and heterogeneous database mining, and so on. By the technique used, they can be divided into artificial neural networks, decision trees, genetic algorithms, nearest-neighbor methods, visualization, and so on.

The data mining process is generally composed of the following main stages: determining the object of mining, data preparation, model building, the mining itself, and analysis and presentation of the results; mining can be described as the repeated iteration of these stages. The problem data mining needs to address is to obtain meaningful information and induce useful structure to serve as a basis for enterprise decision-making. Its applications are extremely broad: as long as an industry has databases with analytical value and need, mining tools can be used for purposeful analysis. Common applications arise in retail, manufacturing, finance and insurance, telecommunications, and medical services.

Data mining, however, is only a tool, not a panacea. It may discover some potential customers, but it cannot tell you why, nor can it guarantee that those potential customers become real ones. Successful data mining requires a deep understanding of the problem domain being addressed: one must understand the data and the process behind it in order to find reasonable explanations for the mining results. In view of this, studying data mining is worthwhile.

2. Association rules

Association rules describe interesting associations or correlations among itemsets in large amounts of data. As data accumulate, people in many fields are increasingly interested in mining association rules from their databases; discovering interesting associations in large numbers of business transaction records can support the formulation of many business decisions.

The earliest form of association rule discovery was the retailer's market basket analysis, which analyzes customers' purchasing habits by finding associations between the different items customers place in their shopping baskets. Knowing which items are frequently purchased together helps retailers formulate marketing strategies. One application of market basket analysis is store layout. One strategy is to place items that are frequently bought together close to each other, in order to further stimulate joint sales; for example, if customers who buy computers also tend to buy financial-management software, placing the software display near the hardware may help increase sales of both. Another strategy is to place the hardware and software at opposite ends of the store, which may induce customers buying both to pick up other items along the way. Market basket analysis is the earliest and simplest form of association rule discovery, and research on association rules and their applications continues to develop. For example, if a grocery store learns from basket analysis that most customers buy bread and milk in the same shopping trip, it may raise the sales volume of both by discounting bread; if a children's goods store learns that most customers buy milk powder and diapers together, it may place them far apart with other commonly needed children's items in between, inducing customers who buy milk powder and diapers to pick up other goods as well.

Mining association rules from a set of transactions consists of two main steps: (1) find all frequent itemsets; because a frequent itemset is one whose support is at least the minimum support threshold, any rule composed of the items of a frequent itemset also has support at least the minimum support threshold; (2) generate strong association rules: among the rules whose support meets the minimum support threshold, find all rules whose confidence is at least the minimum confidence threshold.

Of these two steps, the first is the key one, and its efficiency determines the efficiency of the whole mining algorithm.

3. The Apriori algorithm

Frequent itemset: an itemset that satisfies the minimum support, i.e., an itemset whose frequency of occurrence among all transactions in the transaction database D is greater than or equal to min_sup (the minimum support threshold). The set of frequent k-itemsets is usually denoted L_k.

The Apriori algorithm, also called the breadth-first or level-wise algorithm, was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994 and is the core of current frequent-itemset discovery algorithms. Apriori uses an iterative, level-wise search in which the frequent k-itemsets are used to find the frequent (k+1)-itemsets. First the set of frequent 1-itemsets, denoted L1, is found; L1 is used to find the set of frequent 2-itemsets, L2; L2 is used to find L3; and so on, until no further frequent k-itemsets can be found. Finding each L_k requires one scan of the database.

The association rule mining problem thus decomposes into two sub-problems: (1) extract from D all frequent itemsets satisfying the minimum support min_sup; (2) use the frequent itemsets to generate all association rules satisfying the minimum confidence min_conf. The first sub-problem is the key to the algorithm, and Apriori solves it with a recursive method based on frequent-itemset theory.
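As a concrete illustration of the two-step procedure described above, here is a compact, from-scratch Apriori sketch in Python. It is a simplified teaching version on a made-up set of shopping baskets, not the original Agrawal-Srikant implementation; both frequent-itemset generation and rule generation are shown.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    n = len(transactions)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup * n}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: combine the items appearing in frequent (k-1)-itemsets.
        items = sorted(set().union(*frequent))
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Prune and count: keep a candidate only if all of its (k-1)-subsets are frequent.
        counts = {}
        for c in candidates:
            if all(frozenset(s) in frequent for s in combinations(c, k - 1)):
                counts[c] = sum(1 for t in transactions if c <= set(t))
        frequent = {s: c for s, c in counts.items() if c >= min_sup * n}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

def rules(frequent, min_conf):
    """Yield strong rules (antecedent, consequent, confidence) from frequent itemsets."""
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = support / frequent[antecedent]
                if conf >= min_conf:
                    yield set(antecedent), set(itemset - antecedent), conf

# Hypothetical shopping-basket data.
baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
           {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
           {"bread", "milk", "beer"}]
freq = apriori(baskets, min_sup=0.4)
for lhs, rhs, conf in rules(freq, min_conf=0.7):
    print(lhs, "->", rhs, f"(confidence {conf:.2f})")
```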

A Powerful Tool for Building Prediction Models: the Nomogram (Logistic Regression)

Background

In clinical practice, prediction models are extremely important. As explained in Section 1 (an essential skill for clinicians: a long-form guide to how clinical model research should be done), if we could predict a patient's condition in advance, we would often make entirely different clinical decisions.

For example, for patients with liver cancer, being able to predict in advance whether microvascular invasion is present might help the surgeon choose between standard resection and extended resection. Preoperative neoadjuvant radiotherapy and chemotherapy are the standard treatment for T1-4 N+ mid-to-low rectal cancer. In clinical practice, however, lymph node status judged from preoperative imaging is not accurate enough, and the proportion of false positives and false negatives is high. Could we, then, accurately predict a patient's lymph node status from known characteristics before radiotherapy and chemotherapy? If such a prediction model could be built, we could make clinical decisions more accurately and avoid the incorrect decisions that follow from misjudgment.

More and more people are becoming aware of the importance of this problem. At present, a great deal of effort has gone into building prediction models or improving existing prediction tools, and among these directions the construction of nomograms is one of the most active areas of research.

Next, let us turn to logistic regression. When to choose logistic regression for building a prediction model depends on the clinical outcome being modeled. If the outcome is a binary variable, an unordered categorical variable, or an ordinal variable (in short, a categorical variable), logistic regression can be used to build the model. Multinomial (unordered) and ordinal logistic regression are generally applied to unordered multi-category or ordinal outcomes, but their results are difficult to interpret. Therefore, we usually convert unordered multi-category or ordinal variables into binary variables and use binary logistic regression to build the model. Both of the outcomes mentioned above, microvascular invasion in liver cancer and lymph node status in rectal cancer before treatment, are dichotomous. Binary logistic regression is the approach most commonly used to construct, evaluate, and validate prediction models.

The principles for screening independent variables are the same as those described in Section 2 (Clinical research: a question you cannot avoid, variable selection in multiple regression analysis). Two further points must be weighed: on one hand, the sample size available against the number of independent variables included in the model; on the other, the accuracy of the model against the convenience of using it. These considerations determine the final number of independent variables that enter the prediction model.
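As a hedged illustration of the binary logistic regression step described above, a minimal model fit might look like the sketch below. The data file, variable names, and formula are hypothetical, and drawing the nomogram itself is usually done with dedicated packages (for example, the nomogram function in R's rms package) rather than in plain Python.

```python
# Minimal sketch: fit a binary logistic regression for a clinical outcome.
# The data file and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rectal_cancer.csv")   # hypothetical dataset

# Outcome: lymph_node_positive (0/1); predictors chosen beforehand according to
# the variable-screening principles discussed in the text.
model = smf.logit("lymph_node_positive ~ age + tumor_size + cea_level", data=df).fit()
print(model.summary())

# Predicted probability for a new, hypothetical patient.
new_patient = pd.DataFrame({"age": [62], "tumor_size": [3.5], "cea_level": [8.2]})
print(model.predict(new_patient))
```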

Core technical vocabulary for data mining

1. bilingual, e.g., Chinese-English bilingual text
2. data warehouse and data mining
3. classification; systematize a classification
4. preprocess, e.g., "The theory and algorithms of automatic fingerprint identification system (AFIS) preprocessing are systematically illustrated."
5. angle
6. organizations; central organizations
7. OLTP: On-Line Transaction Processing
8. OLAP: On-Line Analytical Processing
9. incorporated: included, combined, or formed into a corporation; "A corporation is an incorporated body."
10. unique; a unique technique
11. capabilities; "Evaluate the capabilities of suppliers."
12. features
13. complex
14. information consistency
15. incompatible
16. inconsistent; "Those two are temperamentally incompatible."
17. utility; marginal utility
18. internal integration
19. summarizes
20. application-oriented
21. subject-oriented
22. time-variant
23. tomb data: historical data
24. seldom; "Advice is seldom welcome."
25. previous; the previous quarter
26. implicit; implicit criticism
27. data dredging
28. credit risk
29. inventory forecasting
30. business intelligence (BI)
31. cell
32. data cube
33. attribute
34. granular
35. metadata
36. independent
37. prototype
38. overall
39. mature
40. combination
41. feedback
42. approach
43. scope
44. specific
45. data mart
46. dependent
47. motivate; "Motivated, willing to work under pressure, and able to overcome difficulties."
48. extensive
49. transaction
50. suit (lawsuit); suit pending
51. isolate; "We decided to isolate the patients."
52. consolidation; "So our Party really does need consolidation."
53. throughput; "Design of a Web Site Throughput Analysis System"
54. Knowledge Discovery in Databases (KDD)
55. non-trivial: extracting interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
56. archeology
57. alternative
58. statistics; population statistics
59. feature; a facial feature
60. concise; a remarkably concise report
61. issue; issue price
62. heterogeneous: constructed by integrating multiple, heterogeneous data sources
63. multiple; multiple attachments
64. consistent, encode: ensure consistency in naming conventions, encoding structures, attribute measures, etc.

Data Mining


Data Mining Techniques
Contents
• Introduction to Data Mining
• Association analysis
• Sequential Pattern Mining
• Classification and prediction
• Data Clustering
• Data preprocessing
• Advanced topics
Useful Information
• How to get a paper online?
– DBLP (a good index for good papers)
– CiteSeer
– Just google it
– Send requests to the authors
Course Schedule(1)
Dates: Sep 19, Sep 22, Sep 26, Sep 29, Oct 10, Oct 13
Time: 7:00 pm to 9:00 pm
Sessions: Session 1 through Session 6 (one per date)
• Databases today are huge:
– More than 1,000,000 entities/records/rows
– From 10 to 10,000 fields/attributes/variables
– Gigabytes and terabytes
• Databases are growing at an unprecedented rate

Data Mining: Statistics and More?
David J. Hand

David J. Hand is Professor of Statistics, Department of Statistics, The Open University, Milton Keynes, MK7 6AA, United Kingdom (E-mail: d.j.hand@).

Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis of large databases in order to find previously unsuspected relationships which are of interest or value to the database owners. New problems arise, partly as a consequence of the sheer size of the data sets involved, and partly because of issues of pattern matching. However, since statistics provides the intellectual glue underlying the effort, it is important for statisticians to become involved. There are very real opportunities for statisticians to make significant contributions.

KEY WORDS: Databases; Exploratory data analysis; Knowledge discovery.

1. DEFINITION AND OBJECTIVES

The term data mining is not new to statisticians. It is a term synonymous with data dredging or fishing and has been used to describe the process of trawling through data in the hope of identifying patterns. It has a derogatory connotation because a sufficiently exhaustive search will certainly throw up patterns of some kind: by definition, data that are not simply uniform have differences which can be interpreted as patterns. The trouble is that many of these "patterns" will simply be a product of random fluctuations, and will not represent any underlying structure. The object of data analysis is not to model the fleeting random patterns of the moment, but to model the underlying structures which give rise to consistent and replicable patterns. To statisticians, then, the term data mining conveys the sense of naive hope vainly struggling against the cold realities of chance.

To other researchers, however, the term is seen in a much more positive light. Stimulated by progress in computer technology and electronic data acquisition, recent decades have seen the growth of huge databases, in fields ranging from supermarket sales and banking, through astronomy, particle physics, chemistry, and medicine, to official and governmental statistics. These databases are viewed as a resource. It is certain that there is much valuable information in them, information that has not been tapped, and data mining is regarded as providing a set of tools by which that information may be extracted. Looked at in this positive light, it is hardly surprising that the commercial, industrial, and economic possibilities inherent in the notion of extracting information from these large masses of data have attracted considerable interest. The interest in the field is demonstrated by the fact that the Third International Conference on Knowledge Discovery and Data Mining, held in 1997, attracted around 700 participants.

Superficially, of course, what we are describing here is nothing but exploratory data analysis, an activity which has been carried out since data were first analyzed and which achieved greater respectability through the work of John Tukey. But there is a difference, and it is this difference that explains why statisticians have been slow to latch on to the opportunities. This difference is the sheer size of the data sets now available. Statisticians have typically not concerned themselves with data sets containing many millions or even billions of records. Moreover, special storage and manipulation techniques are required to handle data collections of this size, and the database technology which has grown up to
handle them has been developed by entirely different intellectual communities from statisticians.It is probably no exaggeration to say that most statis-ticians are concerned with primary data analysis.That is, the data are collected with a particular question or set of questions in mind.Indeed,entire subdisciplines,such as ex-perimental design and survey design,have grown up to fa-cilitate the efficient collection of data so as to answer the given questions.Data mining,on the other hand,is entirely concerned with secondary data analysis.In fact we might define data mining as the process of secondary analysis of large databases aimed atfinding unsuspected relationships which are of interest or value to the database owners.We see from this that data mining is very much an inductive exercise,as opposed to the hypothetico-deductive approach often seen as the paradigm for how modern science pro-gresses(Hand in press).Statistics as a discipline has a poor record for timely recognition of important ideas.A common pattern is that a new idea will be launched by researchers in some other dis-cipline,will attract considerable interest(with its promise often being subjected to excessive media hype—which can sometimes result in a backlash),and only then will statis-ticians become involved.By which time,of course,the intellectual proprietorship—not to mention large research grants—has gone elsewhere.Examples of this include work on pattern recognition,expert systems,genetic algorithms, neural networks,and machine learning.All of these might legitimately be regarded as subdisciplines of statistics,but they are not generally so regarded.Of course,statisticians have later made very significant advances in all of these fields,but the fact that the perceived natural home of these areas lies not in statistics but in other areas is demonstrated112The American Statistician,May1998Vol.52,No.2c 1998American Statistical Associationby the key journals for these areas—they are not statistical journals.Data mining seems to be following this pattern.For the health of the discipline of statistics as a whole it is impor-tant,perhaps vital,that we learn from previous experience. Unless we do,there is a real danger that statistics—and statisticians—will be perceived as a minor irrelevance,and as not playing the fundamental role in scientific and wider life that they properly do.There is an urgency for statis-ticians to become involved with data mining exercises,to learn about the special problems of data mining,and to con-tribute in important ways to a discipline that is attracting increasing attention from a broad spectrum of concerns. In Section2of this article we examine some of the major differences in emphasis between statistics and data mining. In Section3we look at some of the major tools,and Section 4concludes.2.WHAT’S NEW ABOUT DATA MINING? 
Statistics,especially as taught in most statistics texts, might be described as being characterized by data sets which are small and clean,which permit straightforward answers via intensive analysis of single data sets,which are static,which were sampled in an iid manner,which were often collected to answer the particular problem being ad-dressed,and which are solely numeric.None of these apply in the data mining context.2.1Size of Data SetsFor example,to a classically trained statistician a large data set might contain a few hundred points.Certainly a data set of a few thousand would be large.But modern databases often contain millions of records.Indeed,nowa-days gigabyte or terabyte databases are by no means uncom-mon.Here are some examples.The American retailer Wal-Mart makes over20million transactions daily(Babcock 1994).According to Cortes and Pregibon(1997)AT&T has 100million customers,and carries200million calls a day on its long-distance network.Harrison(1993)said that Mo-bil Oil aims to store over100terabytes of data concerned with oil exploration.Fayyad,Djorgovski,and Weir(1996) described the Digital Palomar Observatory Sky Survey as involving three terabytes of data,and Fayyad,Piatetsky-Shapiro,and Smyth(1996)said that the NASA Earth Ob-serving System is projected to generate on the order of50 gigabytes of data per hour around the turn of the century.A project of which most readers will have heard,the human genome project,has already collected gigabytes of data. Numbers like these clearly put into context the futility of standard statistical techniques.Something new is called for. Data sets of these sorts of sizes lead to problems with which statisticians have not typically had to concern them-selves in the past.An obvious one is that the data will not allfit into the main memory of the computer,despite the recent dramatic increases in capacity.This means that,if all of the data is to be processed during an analysis,adaptive or sequential techniques have to be developed.Adaptive and sequential estimation methods have been of more central concern to nonstatistical communities—especially to those working in pattern recognition and machine learning. Data sets may be large because the number of records is large or because the number of variables is large.(Of course,what is a record in one situation may be a variable in another—it depends on the objectives of the analysis.) When the number of variables is large the curse of dimen-sionality really begins to bite—with1,000binary variables there are of the order of10300cells,a number which makes even a billion records pale into insignificance.The problem of limited computer memory is just the be-ginning of the difficulties that follow from large data sets. Perhaps the data are stored not as the singleflatfile so beloved of statisticians,but as multiple interrelatedflatfiles. 
Perhaps there is a hierarchical structure,which does not permit an easy scan through the entire data set.It is pos-sible that very large data sets will not all be held in one place,but will be distributed.This makes accessing and sampling a complicated and time-consuming process.As a consequence of the structured way in which the data are necessarily stored,it might be the case that straightforward statistical methods cannot be applied,and stratified or clus-tered variants will be necessary.There are also more subtle issues consequent on the sheer size of the data sets.In the past,in many situations where statisticians have classically worked,the problem has been one of lack of data rather than abundance.Thus,the strat-egy was developed offixing the Type I error of a test at some“reasonable”value,such as1%,5%,or10%,and collecting sufficient data to give adequate power for appro-priate alternative hypotheses.However,when data exists in the superabundance described previously,this strategy be-comes rather questionable.The results of such tests will lead to very strong evidence that even tiny effects exist, effects which are so minute as to be of doubtful practical value.All research questions involve a background level of uncertainty(of the precise question formulation,of the defi-nitions of the variables,of the precision of the observations, of the way in which the data was drawn,of contamination, and so on)and if the effect sizes are substantially less than these other sources,then,no matter how confident one is in their reality,their value is doubtful.In place of statistical significance,we need to consider more carefully substantive significance:is the effect important or valuable or not? 2.2Contaminated DataClean data is a necessary prerequisite for most statistical analyses.Entire books,not to mention careers,have been created around the issues of outlier detection and missing data.An ideal solution,when questionable data items arise, is to go back and check the source.In the data mining con-text,however,when the analysis is necessarily secondary, this is impossible.Moreover,when the data sets are large, it is practically certain that some of the data will be invalid in some way.This is especially true when the data describe human interactions of some kind,such as marketing data,financial transaction data,or human resource data.Con-tamination is also an important problem when large data sets,in which we are perhaps seeking weak relationships, are involved.Suppose,for example,that one in a thousand The American Statistician,May1998Vol.52,No.2113records have been drawn from some distribution other than that we believe they have been drawn from.One-tenth of 1%of the data from another source would have little impact in conventional statistical problems,but in the context of a billion records this means that a million are drawn from this distribution.This is sufficient that they cannot be ignored in the analysis.2.3Nonstationarity,Selection Bias,and DependentObservationsStandard statistical techniques are based on the assump-tion that the data items have been sampled independently and from the same distribution.Models,such as repeated measures methods,have been and are being developed for certain special situations when this is not the case.How-ever,contravention of the idealized iid situation is probably the norm in data mining problems.Very large data sets are unlikely to arise in an iid manner;it is much more likely that some regions of the variable space will be sampled more heavily 
than others at different times(for example, differing time zones mean that supermarket transaction or telephone call data will not occur randomly over the whole of the United States).This may cast doubt on the validity of standard estimates,as well as posing special problems for sequential estimation and search algorithms. Despite their inherent difficulties,the data acquisition as-pects are perhaps one of the more straightforward to model. More difficult are issues of nonstationarity of the popula-tion being studied and selection bias.Thefirst of these,also called population drift(Taylor,Nakhaeizadeh,and Kunisch 1997;Hand1997),can arise because the underlying popula-tion is changing(for example,the population of applicants for bank loans may evolve as the economy heats and cools) or for other reasons(for example,gradual distortion creep-ing into measuring instruments).Unless the time of acqui-sition of the individual records is date-stamped,changing population structures may be undetectable.Moreover,the nature of the changes may be subtle and difficult to detect. Sometimes the situation can be even more complicated than the above may imply because often the data are dynamic. The Wal-Mart transactions or AT&T phone calls occur ev-ery day,not just one day,so that the database is a constantly evolving entity.This is very different from the conventional statistical situation.It might be necessary to process the data in real time.The results of an analysis obtained in Septem-ber,for what happened one day in June may be of little value to the organization.The need for quick answers and the size of the data sets also lead to tough questions about statistical algorithms.Selection bias—distortion of the selected sample away from a simple random sample—is an important and under-rated problem.It is ubiquitous,and is not one which is spe-cific to large data sets,though it is perhaps especially trou-blesome there.It arises,for example,in the choice of pa-tients for clinical trials induced by the inclusion/exclusion criteria;can arise in surveys due to nonresponse;and in psychological research when the subjects are chosen from readily available people,namely young and intelligent stu-dents.In general,very large data sets are likely to have been subjected to selection bias of various kinds—they are likely to be convenience or opportunity samples rather than the statisticians’idealized random samples.Whether selec-tion bias matters or not depends on the objectives of the data analysis.If one hopes to make inferences to the un-derlying population,then any sample distortion can inval-idate the results.Selection bias can be an inherent part of the problem:it arises when developing scoring rules for deciding whether an applicant to be a mail order agent is acceptable.Typically in this situation comprehensive data is available only for those previous applicants who were graded“good risk”by some previous rule.Those graded “bad”would have been rejected and hence their true status never discovered.Likewise,of people offered a bank loan, comprehensive data is available only for those who take up the offer.If these are used to construct the models,to make inferences about the behavior of future applicants,then er-rors are likely to be introduced.On a small scale,Copas and Li(1997)describe a study of the rate of hospitaliza-tion of kidney patients given a new form of dialysis.A plot shows that the log-rate decreases over time.However,it also shows that the numbers assigned to the new treatment change over 
time.Patients not assigned to the new treatment were assigned to the standard one,and the selection was not random but was in the hands of the clinician,so that doubt is cast on the argument that log-rate for the new treatment is improving.What is needed to handle selection bias,as in the case of population drift,is a larger model that also takes account of the sample selection mechanism.For the large data sets that are the focus of data mining studies—which will generally also be complex data sets and for which suf-ficient details of how the data were collected may not be available—this will usually not be easy to construct.2.4Finding Interesting PatternsThe problems outlined previously show why the current statistical paradigm of intensive“hand”analysis of a single data set is inadequate for what faces those concerned with data mining.With a billion data points,even a scatterplot may be useless.There is no alternative to heavy reliance on computer programs set to discover patterns for themselves, with relatively little human intervention.A nice example was given by Fayyad,Djorgovski,and Weir(1996).De-scribing the crisis in astronomy arising from the huge quan-tities of data which are becoming available,they say:“We face a critical need for information processing technology and methodology with which to manage this data avalanche in order to produce interesting scientific results quickly and efficiently.Developments in thefields of Knowledge Dis-covery in Databases(KDD),machine learning,and related areas can provide at least some solutions.Much of the fu-ture of scientific information processing lies in the creative and efficient implementation and integration of these meth-ods.”Referring to the Second Palomar Observatory Sky Survey,the authors estimate that there will be at least5×107 galaxies and2×109stellar objects detectable.Their aim is “to enable and maximize the extraction of meaningful in-formation from such a large database in an efficient and timely manner”and they note that“reducing the images to114Generalcatalog entries is an overwhelming task which inherently requires an automated approach.”Of course,it is not possible simply to ask a computer to “search for interesting patterns”or to“see if there is any structure in the data.”Before one can do this one needs to define what one means by patterns or structure.And be-fore one can do that one needs to decide what one means by“interesting.”Kl¨o sgen(1996,p.252)characterized in-terestingness as multifaceted:“Evidence indicates the sig-nificance of afinding measured by a statistical criterion. Redundancy amounts to the similarity of afinding with re-spect to otherfindings and measures to what degree afind-ing follows from another efulness relates afinding to the goals of the user.Novelty includes the deviation from prior knowledge of the user or system.Simplicity refers to the syntactical complexity of the presentation of afinding, and generality is determined by the fraction of the popula-tion afinding refers to.”In general,of course,what is of interest will depend very much on the application domain. When searching for patterns or structure a compromise needs to be made between the specific and the general.The essence of data mining is that one does not know precisely what sort of structure one is seeking,so a fairly general definition will be appropriate.On the other hand,too gen-eral a definition will throw up too many candidate patterns. 
In the market basket analysis example the existing database was analyzed to identify potentially interesting patterns. However, the objective is not simply to characterize the existing database. What one really wants to do is, first, to make inferences to future likely co-occurrences of items in a basket, and, second and ideally, to make causal statements about the patterns of purchases: if someone can be persuaded to buy item A then they are also likely to buy item B. The simple marginal and conditional probabilities are insufficient to tell us about causal relationships—more sophisticated techniques are required.

Another illustration of the need to compromise between the specific and the general arises when seeking patterns in time series, such as arise in patient monitoring, telemetry, financial markets, traffic flow, and so on. Keogh and Smyth (1997) describe telemetry signals from the Space Shuttle: about 20,000 sensors are measured each second, with the signals from missions that may last several days accumulating. Such data are especially valuable for fault detection. One of the difficulties with time series pattern matching is potential nonlinear transformation of the time scale. By allowing such transformations in the pattern to be matched, one generalizes—but overdoing such generalization will make the exercise pointless. Familiarity with the problem domain and a willingness to try ad hoc approaches seems essential here.

2.5 Nonnumeric Data

Finally, classical statistics deals solely with numeric data. Increasingly nowadays, databases contain data of other kinds. Four obvious examples are image data, audio data, text data, and geographical data. The issues of data mining—of finding interesting patterns and structures in the data—apply just as much here as to simple numerical data. Mining the internet has become a distinct subarea of data mining in its own right.

2.6 Spurious Relationships and Automated Data Analysis

To statisticians, one thing will be immediately apparent from the previous examples. Because the pattern searches will throw up a large number of candidate patterns, there will be a high probability that spurious (chance) data configurations will be identified as patterns. How might this be dealt with? There are conventional multiple comparisons approaches in statistics, in which, for example, the overall experimentwise error is controlled, but these were not designed for the sheer numbers of candidate patterns generated by data mining. This is an area which would benefit from some careful thought. It is possible that a solution will only be found by stepping outside the conventional probabilistic statistical framework—possibly using scoring rules instead of probabilistic interpretations. The problem is similar to that of overfitting of statistical models, an issue which has attracted renewed interest with the development of extremely flexible models such as neural networks.
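A small simulation (not from the paper; the numbers of customers, items, and the thresholds are all invented) illustrates the scale of the difficulty: even when items are bought completely independently, screening every pair of items throws up hundreds of apparently significant associations, and a Bonferroni-style experimentwise correction, the kind of conventional adjustment mentioned above, is one blunt way to thin them out.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Purely random "purchases": 200 items bought independently by 10,000 customers,
# so any association found between items is spurious by construction.
n_customers, n_items = 10_000, 200
purchases = rng.random((n_customers, n_items)) < 0.1

r = np.corrcoef(purchases, rowvar=False)            # phi coefficients between items
z = np.abs(r[np.triu_indices(n_items, k=1)]) * np.sqrt(n_customers)

n_pairs = z.size                                     # 200 * 199 / 2 = 19,900 candidate patterns
naive = np.sum(z > norm.ppf(0.975))                  # tested one at a time at the 5% level
bonferroni = np.sum(z > norm.ppf(1 - 0.025 / n_pairs))

print(f"{n_pairs} candidate item pairs")
print(f"{naive} look 'significant' at the 5% level despite pure noise")
print(f"{bonferroni} survive a Bonferroni-adjusted threshold")
```

At data mining scale such corrections quickly become extremely conservative, which is one reason the conventional multiple-comparisons machinery alone may not be enough.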
Several distinct but related strategies have been developed for easing the problem, and it may be possible to develop analogous strategies for data mining. These strategies include restricting the family of models (c.f. limiting the size of the class of patterns examined), optimizing a penalized goodness-of-fit function (c.f. penalizing the patterns according to the size of the set of possible patterns satisfying the criteria), and shrinking an overfitted model (c.f. imposing tougher pattern selection criteria). Of course, the bottom line is that those patterns and structures identified as potentially interesting will be presented to a domain expert for consideration—to be accepted or rejected in the context of the substantive domain and objectives, and not merely on the basis of internal statistical structure.

It is probably legitimate to characterize some of the analysis undertaken during data mining as automatic data analysis, since much of it occurs outside the direct control of the researcher. To many statisticians this whole notion will be abhorrent. Data analysis is as much an art as a science. However, the imperatives of the sheer volume of data mean that we have no choice. In any case, the issue of where human data analysis stops and automatic data analysis begins is a moot point. After all, even standard statistical tools use extensive search as part of the model-fitting process—think of variable selection in regression and of the search involved in constructing classification trees.

In the 1980s a flurry of work on automatic data analysis occurred under the name of statistical expert systems research (a review of such work was given by Gale, Hand, and Kelly 1993). These were computer programs that interacted with the user and the data to conduct valid and accurate statistical analyses. The work was motivated by a concern about misuse of increasingly powerful and yet easy to use statistical packages. In principle, a statistical expert system would embody a large base of intelligent understanding of the data analysis process, which it could apply automatically (to a relatively small set of data, at least in data mining terms). Compare this with a data mining system, which embodies a small base of intelligent understanding, but which applies it to a large data set. In both cases the application is automatic, though in both cases interaction with the researcher is fundamental. In a statistical expert system the program drives the analysis following a statistical strategy because the user has insufficient statistical expertise to do so. In a data mining application, the program drives the analysis because the user has insufficient resources to manually examine billions of records and hundreds of thousands of potential patterns. Given these similarities between the two enterprises, it is sensible to ask if there are lessons which the data mining community might learn from the statistical expert system experience.
Relevant lessons include the importance of a well-defined potential user population. Much statistical expert systems research went on in the abstract (“let’s see if we can build a system which will do analysis of variance”). Little wonder that such systems vanished without trace, when those who might need and make use of such a system had not been identified beforehand. A second lesson is the importance of sufficiently broad system expertise—a system may be expert at one-way analysis of variance (or identifying one type of pattern in data mining), but, given an inevitable learning curve, a certain frequency of use is necessary to make the system valuable. And, of course, from a scientific point of view, it is necessary to formulate beforehand a criterion by which success can be judged. It seems clear that to have an impact, research on data mining systems should be tied into real practical applications, with a clear problem and objective specification.

3. METHODS

In the previous sections I have spoken in fairly general terms about the objective of data mining as being to find patterns or structure in large data sets. However, it is sometimes useful to distinguish between two classes of data mining techniques, which seek, respectively, to find patterns and models. The position of the dividing line between these is rather arbitrary. However, to me a model is a global representation of a structure that summarizes the systematic component underlying the data or that describes how the data may have arisen. The word “global” here signifies that it is a comprehensive structure, referring to many cases. In contrast, a pattern is a local structure, perhaps relating to just a handful of variables and a few cases. The market basket associations mentioned previously illustrate such patterns: perhaps only a few hundred of the many baskets demonstrate a particular pattern. Likewise, in the time series example, if one is searching for patterns the objective is not to construct a global model, such as a Box–Jenkins model, but rather to locate structures that are of relatively short duration—the patterns sought in technical analysis of stock market behavior provide a good illustration.

With this distinction we can identify two types of data mining method, according to whether they seek to build models or to find patterns. The first type, concerned with building global models, is, apart from the problems inherent from the sizes of the data sets, identical to conventional exploratory statistical methods. It was such “traditional” methods, used in a data mining context, which led to the rejection of the conventional wisdom that a portfolio of long-term mortgage customers is a good portfolio: in fact such customers may be the ones who have been unable to find a more attractive offer elsewhere—the less good customers. Models for both prediction and description occur in data mining contexts—for example, description is often the aim with scientific data while prediction is often the aim with commercial data. (Of course, again there is overlap. I am not intending to imply that only descriptive models are relevant with scientific data, but simply to illustrate application domains.) A distinction is also sometimes made (Box and Hunter 1965; Cox 1990; Hand 1995) between empirical and mechanistic models. The former (also sometimes called operational) seek to model relationships without basing them on any underlying theory. The latter (sometimes called substantive, phenomenological, or iconic) are based on some theory of mechanism for the underlying data generating process.
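As a toy illustration of this distinction (not from the paper; the data and parameter values are invented), the same decaying measurements can be fitted with an empirical polynomial chosen purely for flexibility, or with a mechanistic exponential-decay form whose shape is supplied by theory.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: a quantity decaying over time, observed with multiplicative noise.
t = np.linspace(0, 10, 50)
y = 5.0 * np.exp(-0.4 * t) * rng.lognormal(0.0, 0.05, t.size)

# Empirical (operational) model: a cubic polynomial chosen only because
# it tracks the observed curve, with no theory behind its form.
empirical = np.polyfit(t, y, deg=3)

# Mechanistic (substantive) model: theory says y = A * exp(-k * t),
# so fit a straight line to log(y) and read off A and k.
slope, intercept = np.polyfit(t, np.log(y), deg=1)
A, k = np.exp(intercept), -slope

print("empirical polynomial coefficients:", np.round(empirical, 3))
print(f"mechanistic fit: y ≈ {A:.2f} * exp(-{k:.2f} t)")
```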
Data mining, almost by definition, is chiefly concerned with the former, empirical kind.

We could add a third type of model here, which might be termed prescriptive. These are models which do not so much unearth structure in the data as impose structure on it. Such models are also relevant in a data mining context—though perhaps the interpretation is rather different from most data mining applications. The class of techniques which generally go under the name of cluster analysis provides an example. On the one hand we have methods which seek to discover naturally occurring structures in the data—to carve nature at the joints, as it has been put. And on the other hand we have methods which seek to partition the data in some convenient way. The former might be especially relevant in a scientific context, where one may be interested in characterizing different kinds of entities. The latter may be especially relevant in a commercial context, where one may simply want to group the objects into classes which have a relatively high measure of internal homogeneity—without any notion that the different clusters really represent qualitatively different kinds of entities. Partitioning the data in this latter sense yields a prescriptive model. Parenthetically at this point, we might also note that mixture decomposition, with slightly different aims yet again, but also a data mining tool, is also sometimes included under the term cluster analysis. It is perhaps unfortunate that the term “cluster analysis” is sometimes used for all three objectives.

Methods for building global models in data mining include cluster analysis, regression analysis, supervised classification methods in general, projection pursuit, and, in-
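As a minimal sketch of cluster analysis used in the prescriptive, partitioning sense described above (a toy example, not from the paper; the data, the choice of k-means, and the number of groups are invented for illustration), one can simply impose a fixed number of groups on a set of customer records:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented customer records: two arbitrary numeric attributes per customer.
# There need be no "natural" clusters; we simply impose a convenient grouping.
customers = rng.normal(size=(500, 2))

def kmeans(data, k=4, iters=50):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points (keep it if empty).
        centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(customers)
print("segment sizes:", np.bincount(labels, minlength=4))
```

The resulting segments are a convenient grouping with relatively high internal homogeneity; nothing in the procedure asserts that they correspond to qualitatively different kinds of customer.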
