Data Mining ACM Paper Translations (English originals in the appendix)


Data Mining 01: Intro




Humanity has moved from the stone age through the paper age and the age of magnetic storage to today's era of network technology, and is now entering the age of the Internet of Things. How have these crystallizations of wisdom and civilization been preserved and passed down, generation after generation?

Information acquisition ... information storage ... information retrieval ... information processing and application ...
Data Mining: Concepts and Techniques
— Chapter 1 —
— Introduction —
Department of Computer Science
Hubei University of Technology
Statement

This is a bilingual course: part of the content will be taught in English and part in Chinese.

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. (Newton's laws of motion, relativity)

Data Mining Techniques: Bilingual Literature Translation and Review for a Graduation Thesis

Introduction to Data Mining Techniques: Bilingual Literature Translation (English Original)

Introduction to Data Mining

Abstract: Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios, targeted mailing, forecasting, market basket, and sequence clustering, to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm.
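The tutorial's basic loop, training a mining model with an algorithm and then creating predictions from it, can be sketched outside SQL Server; the snippet below uses scikit-learn as a stand-in for the DMX train-then-predict workflow. It is an illustrative analogue, not the SQL Server 2005 API, and the table and column names are invented.

```python
# Illustrative analogue of the DMX train-then-predict workflow, using scikit-learn.
# This is not the SQL Server 2005 API; the columns below are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical targeted-mailing table, loosely in the spirit of AdventureWorksDW customers.
data = pd.DataFrame({
    "Age":       [23, 45, 31, 52, 38, 27, 60, 41],
    "Income":    [30, 80, 55, 90, 60, 35, 75, 65],   # thousands
    "BikeBuyer": [0, 1, 1, 1, 0, 0, 1, 1],           # the predicted attribute
})

X_train, X_test, y_train, y_test = train_test_split(
    data[["Age", "Income"]], data["BikeBuyer"], test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)  # "create and train the model"
print(model.predict(X_test))                                       # "prediction query" against it
```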
The algorithm finds patterns in the data that you pass it, and it translates them into a mining model — it is the engine behind the process.Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. In order to make the sample database available, you need to select the sample database at the installation time in the “Advanced” dialog for component selection.Adventure WorksAdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.Adventure Works sells products wholesale to specialty shops and to individuals through theInternet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.For more information on Adventure Works Cycles see "Sample Databases and Business Scenarios" in SQL Server Books Online.Database DetailsThe Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:North America (83%)Europe (12%)Australia (7%)The database contains data for three fiscal years: 2002, 2003, and 2004.The products in the database are broken down by subcategory, model, and product.Business Intelligence Development StudioBusiness Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE environment in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.Working in an IDE is beneficial for the following reasons:The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected the server. In this way, changes are reflected directly on the server without having to deploy the solution.SQL Server Management StudioSQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. 
This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the datamining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing, and creating predictions from mining models.Data Transformation ServicesData Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in SQL Server 2005. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly data and want to perform the same cleaning transformations each time in an automated fashion.You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.Mining Model AlgorithmsData mining algorithms are the foundation from which mining models are created. The variety of algorithms included in SQL Server 2005 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see "Data Mining Algorithms" in SQL Server Books Online.Microsoft Decision TreesThe Microsoft Decision Trees algorithm supports both classification and regression and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen tocause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. 
The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.Microsoft ClusteringThe Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.Microsoft Naïve BayesThe Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.Microsoft Time SeriesThe Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years.A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.Microsoft Neural NetworkIn Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. 
The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with the known actual classification of the cases. The errors from the initial classification of the first iteration of the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process optimizes network parameters toward minimizing the error, while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression

The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Microsoft Logistic Regression

The Microsoft Logistic Regression algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.
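As a rough illustration of the two degenerate configurations just described (linear regression as a decision tree with no splits, logistic regression as a neural network with no hidden layer), the sketch below fits ordinary linear and logistic models with scikit-learn. It is an analogue for illustration, not the Analysis Services implementation, and the data values are invented.

```python
# Plain linear and logistic regression, the models the two "degenerate"
# configurations above reduce to. Data values are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_cont = np.array([2.1, 4.2, 5.9, 8.1, 9.8])   # continuous predicted attribute
y_disc = np.array([0, 0, 0, 1, 1])             # discrete predicted attribute

lin = LinearRegression().fit(X, y_cont)
log = LogisticRegression().fit(X, y_disc)
print(lin.coef_, lin.intercept_)       # one regression formula (a single "root node")
print(log.predict_proba([[3.5]]))      # class probabilities from a no-hidden-layer model
```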

Data Mining: Foreign Literature Translation

Applied Intelligence, 2005, 22, 47-60.

A Data Mining Approach for Retail-Bank Customer Attrition Analysis
Author: Xiaohua Hu
Affiliation: College of Information Science, Drexel University, Philadelphia, USA

Abstract: Deregulation of the financial services industry and the wide adoption of new technologies have intensified competition in financial markets. Retaining existing customers and attracting new potential customers is at the core of every financial services company's business strategy, and data mining techniques play an important role in both. In this paper, we apply data mining methods to analyze customer attrition in a retail bank. We discuss challenging issues such as skewed data, unrolling the data over time, and detection of missing fields, as well as the steps of a data mining task for retail-bank attrition analysis. Using the enumeration method as an appropriate approach for attrition analysis, we compare data mining models built with decision trees, selective Bayesian networks, neural networks, and ensembles of these classifiers. Some interesting findings are reported, and our results demonstrate the effectiveness of data mining techniques in retail banking.

Keywords: data mining; classification methods; attrition analysis

1. Introduction

Deregulation of the financial services industry and the wide adoption of new technologies have intensified competition in financial markets. Retaining existing customers and attracting new potential customers is at the core of every financial services company's business strategy, and data mining techniques play an important role in both. Data mining is an iterative process that combines business knowledge, machine learning methods, tools, and large amounts of relevant and accurate information, so that non-intuitive insights hidden in an organization's corporate data can be discovered. The technique can improve existing processes, uncover trends, and help shape the company's policies toward its customers and employees. In the financial field, data mining techniques have been applied successfully, for example to questions such as:

• Who is likely to churn in the next two months?
• Who is likely to become a profitable customer?
• What is the economic behavior of your profitable customers?
• Which products are the different segments likely to buy?
• What is the value of the different groups?
• What are the characteristics of each segment, and what role does each segment play in individual profitability?

In this paper, we focus on applying data mining techniques to help with retail-bank attrition analysis. The goal of attrition analysis is to identify a group of customers with a high probability of churning, so that the company can then run targeted marketing campaigns to shift their behavior in the desired direction (that is, to change their behavior and reduce the churn rate). As in data mining for direct marketing campaigns, it is easy to see that targeting every customer is unprofitable and ineffective.
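A minimal sketch of the kind of side-by-side model comparison described in the abstract (decision tree, Bayesian classifier, neural network, and an ensemble), assuming a tabular churn dataset with a binary label; the data are synthetic, and scikit-learn with plain cross-validated accuracy stands in for the paper's tools and its enumeration-based evaluation.

```python
# Hypothetical comparison of classifiers for churn prediction; data and features
# are synthetic, and scikit-learn is only a stand-in for the paper's tools.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # five invented customer features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # synthetic churn label

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "naive Bayes": GaussianNB(),
    "neural net": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000),
    "ensemble": VotingClassifier(
        [("dt", DecisionTreeClassifier(max_depth=4)), ("nb", GaussianNB())],
        voting="soft"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```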

Data Mining - Clustering: Foreign Literature Translation and Original Text for an Undergraduate Graduation Thesis

Graduation Design (Thesis): Foreign Literature Translation

Chinese title of the source: 聚类分析 (Cluster Analysis)
English title of the source: Clustering
Source:
Date of publication:
School (Department):
Major: Automation
Class:
Name:
Student No.:
Supervisor:
Date of translation: 2017.02.14

Foreign Translation
English title: Data mining - clustering
Translated title: 数据挖掘—聚类分析 (Data Mining - Cluster Analysis)
Major: Automation
Name: ****
Class and student No.: ****
Supervisor: ******
Source of the translated text: Data Mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

● Set of like elements. Elements from different clusters are not alike.
● The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:

● Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
● Dynamic data in the database implies that cluster membership may change over time.
● Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
● There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip.
Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
● Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

● The (best) number of clusters is not known.
● There may not be any a priori knowledge concerning the clusters.
● Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, K_j, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as the creation of a set of clusters K = {K_1, K_2, ..., K_k}.

DEFINITION 5.1. Given a database D = {t_1, t_2, ..., t_n} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, ..., k} where each t_i is assigned to one cluster K_j, 1 ≤ j ≤ k. A cluster K_j contains precisely those tuples mapped to it; that is, K_j = {t_i | f(t_i) = K_j, 1 ≤ i ≤ n, and t_i ∈ D}.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive.
"Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.

We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(t_i, t_l), defined between any two tuples t_i, t_l ∈ D. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property. A distance measure, dis(t_i, t_j), as opposed to similarity, is often used in …
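A minimal sketch of the mapping f : D → {1, ..., k} from Definition 5.1, using Euclidean distance as dis(t_i, t_j) and a few k-means-style reassignment passes; the tuples are invented, and this is only one of many possible clustering functions, not the algorithms discussed later in the chapter.

```python
# Minimal illustration of Definition 5.1: a mapping f from tuples in D to
# cluster ids {1,...,k}, built from Euclidean distances and a few
# k-means-style iterations. Tuples are invented for illustration.
import numpy as np

D = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one natural group
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])  # another natural group
k = 2
centers = D[np.random.default_rng(0).choice(len(D), k, replace=False)]

for _ in range(10):
    # dis(t_i, c_j): distance from every tuple to every current cluster center
    dist = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
    f = dist.argmin(axis=1)                         # f(t_i) = id of nearest cluster
    centers = np.array([D[f == j].mean(axis=0) if np.any(f == j) else centers[j]
                        for j in range(k)])         # keep old center if a cluster empties

print(f + 1)  # cluster ids 1..k, i.e. the partition K = {K_1, ..., K_k}
```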

Data Mining Paper in English

Research on the Application of XML in Web Data Mining

Abstract: The popularity of the Internet rests on access to information, and with the development of HTML technology the amount of data and information grows day by day. Faced with such a vast sea of information, the Web must be mined in order to obtain the information we want and can actually use. Data written in HTML, however, is poorly structured, so Web data mining can hardly meet the needs of search. The emergence of the XML language has greatly changed this situation: because XML is well structured and hierarchical, using it to organize Web page information is far more conducive to data mining. Through an introduction to the XML language, this paper proposes an XML-based Web Miner model and examines the application of XML in Web data mining.

Keywords: HTML; XML; e-commerce; Web data mining

XML Web Application Studies in Data Mining
NIU Yan-cheng1, BAO Ying2
(1. Lanzhou Jiaotong University, Lanzhou 730030, China; 2. Northwest Normal University, Lanzhou 730070, China)

Abstract: The popularization of the Internet is based on the acquisition of information. As HTML technology develops, the amount of data and information keeps growing. Faced with this massive information, we must mine the Web for the information we want and find useful. But data in the HTML language is poorly structured, so Web data mining can hardly meet the needs of searching. The emergence of the XML language has changed that situation greatly. XML has good structural and organizational properties, so using it to organize network information is more conducive to data mining work. The goal of this paper is to propose a Miner model based on XML for the Web through an introduction to the XML language, and to understand the application of XML in Web data mining.

Key words: HTML; XML; e-commerce; web data mining

With the rapid development and popularization of the Internet, we have entered an era of data and information.
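As a small illustration of why XML's explicit structure helps the Web miner described above, the sketch below parses a hypothetical XML product feed with Python's standard xml.etree module and extracts mining-ready records; the element names are invented.

```python
# Parsing a hypothetical XML product feed into mining-ready records.
# Element names are invented; xml.etree.ElementTree is Python's standard parser.
import xml.etree.ElementTree as ET

feed = """
<products>
  <product id="p1"><name>bike</name><price>299.0</price><category>sport</category></product>
  <product id="p2"><name>helmet</name><price>49.5</price><category>sport</category></product>
</products>
"""

records = []
for p in ET.fromstring(feed).findall("product"):
    records.append({
        "id": p.get("id"),
        "name": p.findtext("name"),
        "price": float(p.findtext("price")),
        "category": p.findtext("category"),
    })
print(records)  # structured rows, far easier to mine than scraped HTML
```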

Data Mining Paper (English Version)

Jilin Province's Population Growth and Energy Consumption Analysis

Major: Statistics
Student No.: 0401083710
Name: Niu Fukuan

[Summary] Since the third technological revolution, energy has become the lifeline of the national economy, while the energy on Earth is limited, and competition among the major powers has led to a number of wars related to, or fought simply for, oil. Competition for control of the world's resources and energy contributed to the outbreak of the two world wars. China is currently entering a period of high energy consumption, and the three state-owned oil giants, CNPC, Sinopec and CNOOC, have been "going out" to develop international markets; Jilin Province, as an energy-producing and energy-consuming province of China, is also active in the corresponding energy diplomacy. Under economic globalization and increasingly fierce competition for energy, China's energy policy still has many imperfections, which to a certain extent affect the energy and population development of Jilin Province and of China; to some extent the existing population crisis can even be said to be an energy crisis.

[Keywords] Energy consumption; Population; Growth; Analysis

Data source

The data are selected from the "China Statistical Yearbook 2009": comprehensive annual data for Jilin Province, 1995-2007 (Table 1). The total population at year end is recorded as the annual data sequence {Xt}, and universal life energy consumption (kg of standard coal) as the annual data sequence {Yt}.

Table 1  Annual total population and universal life energy consumption data, 1995-2007

Year    Total population Xt    Energy consumption Yt    lnXt           lnYt
2001    127627                 16629798.1               11.75686723    16.62670671
2002    128453                 17585215.7               11.76331836    16.68256909
2003    129227                 19888035.3               11.76932583    16.80562887
2004    129988                 21344029.6               11.77519742    16.87628261
2005    130756                 23523004.4               11.78108827    16.97348941
2006    131448                 25592925.6               11.78636662    17.05782653
2007    132129                 26861825.7               11.791534      17.10621672

1. Timing diagram

First, timing diagrams are drawn for the annual total population (year-end) series {Xt} and the annual universal life energy consumption (kg of standard coal) series {Yt}, in order to observe whether the two series are stationary; the EVIEWS output is shown below.

Figure 1  Timing diagram of the total population (year-end) sequence
Figure 2  Timing diagram of the universal life energy consumption (kg of standard coal) sequence

Figure 1 is the timing diagram of the sequence {Xt}, and Figure 2 is the timing diagram of the sequence {Yt}. Both figures show a rising trend in the total population (year-end) and in universal life energy consumption, so the annual total population series {Xt} and the annual energy consumption series {Yt} are not stationary, and the two may have a long-term cointegration relationship.

2. Data smoothing

(1) Taking logarithms of the sequences

Figures 1 and 2 show intuitively that the data sequences {Xt} and {Yt} have a significant growth trend and are clearly non-stationary. Therefore, logarithms of the total population sequence {Xt} and of the universal life energy consumption (kg of standard coal) sequence {Yt} are taken to eliminate heteroscedasticity.
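The two operations performed next in the paper, taking logarithms and testing each series for a unit root with the ADF test, can be sketched with statsmodels as below; the series is synthetic rather than the Jilin data, so the statistics will differ from those reported.

```python
# Log transform and ADF unit-root test, the paper's first modeling steps.
# The series is synthetic; statsmodels' adfuller performs the ADF test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
# synthetic, steadily growing series standing in for the population data
population = 120000 * np.exp(np.cumsum(rng.normal(0.005, 0.001, 60)))
log_pop = np.log(population)                      # the logarithm taken in the paper

adf_stat, p_value, *_ = adfuller(log_pop)         # ADF unit-root test
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")  # large p-value: non-stationary
```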
That logx = lnXt, logy = lnYt, with a view to the target sequence into the linear trend trend sequence, by EVIEWS software operations, the number of sequence timing diagram, in which the population sequence {logx} timing diagram shown in Figure 3, the full sequence of energy consumption {logy} timing diagram shown in Figure 4.Figure3 Figure 4Figure 3 shows the total population observed sequence {logx} and universal life energy consumption (kg of standard coal) sequence {logy} index trend has been basically eliminated, the two have obvious long-term cointegration relationship, which is the transfer function modeling an important prerequisite. However, the above sequence of numbers is still non-stationary series. Respectively {logx} and {logy} sequence of ADF unit root test (Table 5 and Table 6), the test results as shown below. (2)Unit root testHere we will be on the province's total population and the whole sequence {Xt} energy consumption (kg of standard coal) sequence data {Yt} be the unit root test, the results obtained by Eviews software operation is as follows:Table 2 Of the total population sequence {logx}Obtained from Table 2: Total population sequence data {Xt} of the ADF is -0.784587, significantly larger than the 1% level in the critical test value of -4.3260, the 5% level greater than the critical value of -3.2195 testing, but also greater than 10% level in the critical test value -2.7557, so the total population of the data sequence {logx} {Xt} is a non-stationary series.Table 3 National energy consumption (kg of standard coal) unit root test {logy}Obtained from Table 3: National energy consumption (kg of standard coal) data {Yt} of the ADF is 0.489677, significantly larger than the 1% level in the critical test value of -4.3260, the 5% level greater than the critical test value of -3.2195, but also 10% greater than the critical level test value -2.7557, so the total population of the sequence {logx} data {Yt} is a non-stationary series.(3) Sequence of differentialBecause of the number of time series after still not a smooth sequence, so the need for further logarithm of the total population after the sequence {logx} and after a few of the universal life energy consumption (kg of standard coal) differential sequence data {logY} differential sequences were recorded as {▽logx} and {▽logy}. Are respectively the second-order differential of the total population of the sequence {▽logX} and second-order differential of the national energy consumption (kg of standard coal) sequence data {▽ logy} the ADF unit root test (Table 7 and Table 8), test results the following table.Table 4Table 4 shows that the total population of second-order differential sequence {▽logx} ADF value is -10.6278, apparently less than 1% level in the critical test value of -6.292057, less than the 5% level in the critical test value -4.450425 also 10% less than the level in the critical test value of -3.701534, second-order differential of the total population of the sequence {▽ logx} is a stationary sequence.Table5 5Table 5 shows that the second-order differential universal life energy consumption (kg of standard coal) {▽logy} of the ADF is -6.395029, apparently less than 1% level in the critical test value of -4.4613, less than the 5 % level of the critical test value of -3.2695, but also less than the 10% level the critical value of -2.7822 testing,universal life, second-order differential consumption of energy (kg of standard coal) {▽ logy} is a stationary sequence.3. 
Cointegration

(1) Cointegration regression

Cointegration theory was put forward by Engle and Granger in the 1980s. Starting from the analysis of non-stationary time series, it explores the long-run equilibrium relationship contained in non-stationary variables and provides a new solution for modeling non-stationary time series.

Since the population time series {Xt} and the universal life energy consumption time series {Yt} are taken in logarithms, and the analysis above shows that {logX} and {logY} are both second-order integrated sequences, a cointegration relationship may exist between them. The results obtained with the EVIEWS software are as follows (Table 6):

D(LNE2) = -0.054819 - 101.8623 D(LOGX2)
t = (-1.069855)  (-1.120827)
R2 = 0.122487   DW = 1.593055

(2) Stationarity check of the residual sequence

From the EVIEWS software, the residual sequence analysis (Table 7, residual series unit root test) gives: the ADF value of the second-order differenced residuals is -5.977460, clearly less than the critical value -4.6405 at the 1% level, less than -3.3350 at the 5% level, and also less than -2.8169 at the 10% level. Therefore, the second-order difference of the residual et is a stationary time series. Expressed as follows:

D(ET,2) = -0.042260 - 1.707007 D(ET(-1),2)
t = (-0.783744)  (-5.977460)
DW = 1.603022   EG = -5.977460

Since EG = -5.977460 and, from the AEG cointegration test critical value table (N = 2, α = 0.05, T = 16), the EG value is less than the critical value, the hypothesis that the residual sequence et is stationary is accepted. Therefore the total population and universal life energy consumption are two variables with a long-term cointegration relationship.

4. Building the ECM model

Through the above analysis, the second-order difference of the logarithm of the total population time series {▽logX} and the second-order difference of the logarithm of the national energy consumption time series {▽logY} are stationary sequences, and the second-order differenced residual et is also stationary. Taking the second-order difference of the logarithm of the national energy consumption time series {▽logY} as the dependent variable, and the second-order difference of the logarithm of the total population time series {▽logX} together with the second-order differenced residual et as explanatory variables, the regression estimated with EVIEWS gives the following results (Table 8, ECM model results). From Table 8 the standard ECM regression model can be written as:

D(logY2) = -0.047266 - 154.4568 D(LNP2) + 0.171676 D(ET2)
t = (-1.469685)  (-2.528562)  (1.755694)
R2 = 0.579628   DW = 1.760658

The regression coefficients of the ECM equation pass the significance test, and the error correction coefficient is positive, in line with the forward correction mechanism. The estimation results show that changes in the province's universal life energy consumption depend not only on the change of the total population, but also on the previous year's deviation of the total population from its equilibrium level. In addition, the regression results show that short-term changes in the total population have a positive impact on universal life energy consumption. Because the short-term adjustment coefficient is significant, deviations of Jilin Province's annual universal life energy consumption from its long-run equilibrium value can be corrected well.
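The Engle-Granger two-step procedure applied above (regress one log series on the other, then test the residuals for a unit root) can be sketched with statsmodels; the two series below are synthetic stand-ins for log population and log energy consumption, not the paper's data.

```python
# Engle-Granger two-step cointegration check on synthetic stand-ins for
# log population (lnXt) and log energy consumption (lnYt).
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
common_trend = np.cumsum(rng.normal(0.01, 0.02, 80))              # shared stochastic trend
log_pop = 11.7 + common_trend + rng.normal(0, 0.01, 80)           # stand-in for lnXt
log_energy = 16.6 + 1.4 * common_trend + rng.normal(0, 0.02, 80)  # stand-in for lnYt

# Step 1: cointegrating regression  log_energy = a + b * log_pop + e
ols = sm.OLS(log_energy, sm.add_constant(log_pop)).fit()

# Step 2: ADF unit-root test on the residuals; stationarity suggests cointegration.
# (statsmodels.tsa.stattools.coint runs the same test with Engle-Granger critical values.)
adf_stat, p_value, *_ = adfuller(ols.resid)
print(f"ADF on residuals: {adf_stat:.3f}, p-value: {p_value:.3f}")
```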
ARMA model(1) Model to identifyAfter differential differenced stationary series into stationary time series, after the analysis can be used ARMR model, the choice of using the model of everyone's life before the first stable after the annual energy consumption time series {logY} to estimate the first full life energy consumption sequence {logY} do autocorrelation and partial autocorrelation, the results of the following:Table 9{logy} of the autocorrelation and partial autocorrelation mapObtained from Table 9, the relevant figure from behind, after K = 1 in a random interval, partial autocorrelation can be seen in K = 1 after a random interval. So we can live on national energy consumption to establish the sequence {logY} ARMA (1,1) model, following on the ARMA (1,1) model parameter estimation, which results in the following table:Table 10ARMA (1,1) model parameter estimationTable 10 obtained by the ARMA (1,1) model parameter estimation is given by: D(LNE,2)=0.014184+0.008803D(LNE,2)t-1-0.858461U t-1(2)ARMA (1,2) model testModel of the residuals obtained for white noise test, if the residuals are not white noise sequence, then the need for ARMA (1,2) model for further improvement; if it is white noise process, the acceptance of the original model. ARMA (1,2) model residuals test results are as follows:Table11 ARMA (1,2) model residuals testTable 11 shows, Q statistic P value greater than 0.05, so the ARMA (1,1) model, the residual series is white noise sequence and accept the ARMA (1,1) model. Our whole life to predict changes in energy consumption, the results are as follows:Figure 5 National energy consumption forecast mapJilin Province of everyone's life through the forecast energy consumption, we can see all the people living consumption of energy is rising every year, which also shows that in the future for many years, Jilin Province, universal life energy consumption will be showing an upward trend. And because of the total population and the existence of universal life energy consumption effects of changes in the same direction, so the total population over the next many years, will continue to increase.6. ProblemsBased on the province's total population and the national energy consumption cointegration analysis of the relationship between population and energy consumption obtained between Jilin Province, there are long-term stability of the interaction and mutual promotion of the long-run equilibrium relationship. The above analysis can be more accurate understanding of the energy consumption of Jilin Province, Jilin Province put forward a better proposal on energy conservation. Moment, Jilin Province facing energy problems:(1) The heavy industry still accounts for a large proportion of;(2)The scale of energy-intensive industry, the rapid growth of production ofenergy saving effect;(3)The coal-based energy consumption is still.7.Recommendation:(1) Population control, and actively cooperate with the national policy of family planning, ease the pressure on the average population can consume.(2) Raise awareness of the importance of energy saving, the implementation of energy-saving target responsibility system, energy efficiency are implemented.Conscientiously implement the State Council issued the statistics of energy saving, monitoring and evaluation program of the three systems. Strict accountability.(3) Speed up industrial restructuring and transformation of economic development. 
Speed up industrial restructuring and transformation of economic development, to overcome the resource, energy and other bottlenecks, and take the high technological content, good economic returns, low resources consumption, little environmental pollution and human resources into full play to the new industrialization path.(4) Should pay attention to quality improvement and optimization of the structure, so that the final implementation of the restructuring to improve the overall quality of industrial and economic growth, quality and efficiency up.(5) To enhance the development and promotion of energy-saving technologies, strengthen energy security, promotion of renewable energy, clean energy.Adhere to technical progress and the deepening of reform and opening up the combination. To enhance the independent innovation capability as the adjustment of industrial structure, changing the growth mode of the central link, speed up the innovation system, efforts to address the constraints of the city development major science and technology. Vigorously promote the recycling economy demonstration pilot enterprises to actively carry out comprehensive utilization of resources and renewable resources recycling. And actively promote solar, wind, biogas, biodiesel and other renewable energy construction.References[1] Wang Yan, Applied time series analysis of the Chinese People's University Press, 2008.12[2] Pang Hao. Econometric Science Press, 2006.1。

Uncertain Data Mining: Foreign Literature Translation

Translation: Uncertain Data Mining: A New Research Direction

Abstract: Data uncertainty often arises in real-world applications due to imprecise measurement, outdated sources, or sampling error. At present, many research results have been published on handling data uncertainty in databases. We argue that when data mining is performed on uncertain data, the uncertainty has to be taken into account in order to obtain high-quality mining results. We call this the "uncertain data mining" problem. In this paper, we propose a framework of possible research directions in this area. We also use the UK-means clustering algorithm as an example to illustrate how the traditional K-means algorithm can be modified to handle data uncertainty in data mining.
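A minimal sketch of the UK-means idea, assuming each uncertain object is represented by samples drawn from its location distribution and assigned to the centroid with the smallest expected distance; the data are invented, and this illustrates the idea rather than reproducing the published algorithm.

```python
# UK-means-style sketch: each uncertain object is a set of samples from its
# location distribution; assignment minimizes the *expected* distance to centroids.
# Data are invented; this illustrates the idea, not the published algorithm.
import numpy as np

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [6.0, 6.0]])
# 10 uncertain objects, each represented by 20 samples of its possible location
objects = [c + rng.normal(0, 0.8, size=(20, 2))
           for c in np.repeat(true_centers, 5, axis=0)]

k = 2
centroids = np.array([objects[0].mean(axis=0), objects[-1].mean(axis=0)])
for _ in range(10):
    # expected distance of each uncertain object to each centroid (mean over samples)
    exp_dist = np.array([[np.linalg.norm(samples - c, axis=1).mean() for c in centroids]
                         for samples in objects])
    labels = exp_dist.argmin(axis=1)
    for j in range(k):
        members = [o.mean(axis=0) for o, lab in zip(objects, labels) if lab == j]
        if members:                      # keep the old centroid if a cluster empties
            centroids[j] = np.mean(members, axis=0)

print(labels)  # cluster assignment of the 10 uncertain objects
```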

1. Introduction

Data are often uncertain in nature, due to measurement imprecision, sampling error, outdated data sources, and other causes. This is especially true in applications that interact with the physical environment, such as location-based mobile services [15] and sensor monitoring [3]. For example, when tracking moving objects (such as vehicles or people), the database cannot record the exact location of every object at every instant, so each object's location changes over time with uncertainty. To provide accurate query and mining results, these various sources of data uncertainty have to be taken into account.

In recent years there has been a great deal of research on managing uncertain data in databases, such as representing uncertainty in databases and querying uncertain data. However, few research results address the problem of mining uncertain data. We note that uncertainty makes data values no longer atomic. To apply traditional data mining techniques, uncertain data have to be summarized into atomic values. Taking the moving-object tracking application as an example again, an object's location can be summarized either by its last recorded position or by an expected position (if the probability distribution of the object's location is taken into account). Unfortunately, the error between the summarized records and the true records may seriously affect the mining results.

Figure 1 illustrates the problem that arises when a clustering algorithm is applied to moving objects with uncertain locations. Figure 1(a) shows the true locations of a set of objects, while Figure 1(b) shows their recorded, already outdated locations. Clusters obtained from the actual locations, if they were available, differ significantly from the clusters obtained from the outdated values.
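The effect shown in Figure 1 can be reproduced in a few lines: clustering the outdated recorded locations groups the objects differently than clustering their true (but unrecorded) locations. The sketch below uses synthetic points and scikit-learn's k-means.

```python
# Synthetic illustration of Figure 1: clusters found from outdated recorded
# locations differ from clusters of the true (but unrecorded) locations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
true_locations = np.vstack([rng.normal([0, 0], 0.5, (10, 2)),
                            rng.normal([5, 5], 0.5, (10, 2))])
drift = rng.normal(0, 2.0, true_locations.shape)   # movement since the last record
recorded_locations = true_locations + drift        # outdated values stored in the database

labels_true = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(true_locations)
labels_recorded = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(recorded_locations)
print(labels_true)
print(labels_recorded)  # often groups objects differently from labels_true
```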

Summary of Data Mining Technology: Foreign Literature Translation for a Graduation Thesis

Summary of Data Mining TechnologyAbstract: With the development of computer and network technology, it is very easy to obtain relevant information. But for the large number of large-scale data, the traditional statistical methods can not complete the analysis of such data. Therefore, an intelligent, comprehensive application of a variety of statistical analysis, database, intelligent language to analyze large data data "data mining" (Date Mining) technology came into being. This paper mainly introduces the basic concept of data mining and the method of data mining. The application of data mining and its development prospect are also described in this paper.Keywords: data mining; method; application; foreground1 IntroductionWith the rapid development of information technology, the scale of the database has been expanding, resulting in a lot of data. The surge of data is hidden behind a lot of important information, people want to be able to conduct a higher level of analysis in order to make better use of these data. In order to provide decision makers with a unified global perspective, data warehouses are established in many areas. But a lot of data often makes it impossible to identify hidden in which can provide support for decision-making information, and the traditional query, reporting tools can not meet the needs of mining this information. Therefore, the need for a new data analysis technology to deal with large amounts of data, and from the extraction of valuable potential knowledge, data mining (Data Mining) technology came into being. Data mining technology is also accompanied by the development of data warehouse technology and gradually improved.2 Data Mining Technology2.1 Definition of data miningData mining refers to the non-trivial process of automatically extracting useful information hidden in the data from the data set. The information is represented by rules, concepts, rules and patterns. It helps decision makers analyze historical data and current data and discover hidden relationships and patterns to predict future behaviors that may occur. The process of data mining is also called the process of knowledge discovery. It is a kind of interdisciplinary and interdisciplinary subject, which involves the fields of database, artificial intelligence, mathematical statistics, visualization and parallel computing. Data mining is a new information processing technology, its main feature is the database of large amounts of data extraction, conversion, analysis and other modelprocessing, and extract the auxiliary decision-making key data. Data mining is an important technology in KDD (Knowledge Discovery in Database). It does not use the standard database query language (such as SQL) to query, but the content of the query to summarize the pattern and the inherent law of the search. Traditional query and report processing are only the result of the incident, and there is no in-depth study of the reasons for the occurrence of data mining is the main understanding of the causes of occurrence, and with a certain degree of confidence in the future forecast for the decision-making behavior to provide favorable stand by.2.2 Methods of data miningData mining research combines a number of different disciplines in the field of technology and results, making the current data mining methods show a variety of forms. 
From the perspective of statistical analysis, the data mining models used in statistical analysis techniques are linear and non-linear analysis, regression analysis, logistic regression analysis, univariate analysis, multivariate analysis, time series analysis, recent sequence analysis, and recent Oracle algorithm and clustering analysis and other methods. Using these techniques, you can examine the data in those unusual forms, and then interpret the data using various statistical models and mathematical models to explain the market rules and business opportunities that are hidden behind those data. Knowledge discovery class Data mining technology is a kind of mining technology which is completely different from the statistical analysis class data mining technology, including artificial neural network, support vector machine, decision tree, genetic algorithm, rough set, rule discovery and association order.2.2.1 Statistical methodsTraditional statistics provide a number of discriminant and regression analysis methods for data mining. Commonly used techniques such as Bayesian reasoning, regression analysis, and variance analysis. Bayesian reasoning is the basic principle of correcting the probability distribution of data sets after knowing new information Tools, to deal with the classification of data mining problems, regression analysis used to find an input variable and the relationship between the output variables of the best model, in the regression analysis used to describe a variable trends and other variables of the relationship between the linear regression, There is also a logarithmic regression for predicting the occurrence of certain events. The variance analysis in the statistical method is generally used to analyze the effects of estimating the regression line's performance and the independent variables on the final regression, which is the result of many mining applications One of the powerful tools.2.2.2 Association rulesThe association rule is a simple and practical analysis rule, which describes the law and pattern of some attributes in one thing at the same time, which is one of the most mature and important technologies in data mining. It is made by R. Agrawal et al. First proposed that the most classical association rule mining algorithm is Apriori, which first digs out all frequent itemsets, and then generates association rules from frequent itemsets. Many mining rules of frequent rule sets are It evolved from the evolution of the rules in the field of data mining is widely used in large data sets to find a meaningful relationship between the data, one of the reasons is that it is not only a choice of a dependent variable, the association rules in the data The most typical application of the mining area is the shopping basket analysis. Most association rule mining algorithms can discover all the associated relationships hidden in the mining data, and the amount of association rules is often very large. However, not all the relationships between the attributes obtained through the association are practical. 
Value, the effective evaluation of these association rules, screening out the user is really interested, meaningful association rules is particularly important.2.2.3 Clustering analysisCluster analysis is based on the criteria associated with the selected samples to be divided into several groups, the same group of samples with high similarity, different groups are different, commonly used techniques have split algorithm, cohesion algorithm, Clustering and incremental clustering. The clustering method is suitable for the internal relationship between the samples, so as to make a reasonable evaluation of the sample structure. In addition, the cluster analysis is also used to detect the isolated points. Sometimes clustering is not intended to get objects together but to make it easier for an object to be separated from other objects. Cluster analysis has been applied to a variety of areas such as economic analysis, pattern recognition, image processing, and especially in business. Clustering analysis can help marketers discover different groups of characteristics that exist in customer groups. The key to clustering analysis In addition to the choice of algorithms, it is the choice of metrics for the sample. The classes that are not derived from the clustering algorithm are effective for decision making. Before applying an algorithm, the clustering trend of the data is usually checked first.2.2.4 Decision tree methodDecision tree learning is a method of approximating discrete objective functions by classifying instances from a root node to a leaf node to classify an instance. The leaf node is the classification of the instance. Each node on the tree illustrates a test of anattribute of the instance, and each subsequent branch of the node corresponds to a possible value of the attribute. The method of sorting the instance is from the root node of the tree, Test the properties specified by this node, and then move down the corresponding branch of the attribute value for the given instance. Decision tree method is to be applied to the classification of data mining.2.2.5 neural networkThe neural network is based on the mathematical model of self-learning, which can analyze a large number of complex data and can complete the extremely complex pattern extraction and trend analysis for human brain or other computer. The neural network can be expressed as guidance The learning can also be a non-guided cluster, whichever is the value entered into the neural network. Artificial neural network is used to simulate the structure of human brain neurons. Based on MP model and Hebb learning rules, three kinds of neural networks are established, which have non-linear mapping characteristics, information storage, parallel processing and global collective action, High degree of self-learning, self-organizing and adaptive ability. The feedforward neural network is represented by the sensor network and BP network, which can be used for classification and prediction. The feedback network is represented by Hopfield network for associative memory and optimization. The self-organizing network is based on ART model, Kohonon The model is represented for clustering.2.2.6 support vector machineSupport vector machine (SVM) is a new machine learning method developed on the basis of statistical learning theory. 
It is based on the principle of structural risk minimization, as far as possible to improve the learning machine generalization ability, has good promotion performance and good classification accuracy, can effectively solve the learning problem, has become a training multi-layer sensor, RBF An Alternative Method for Neural Networks and Polynomial Neural Networks. In addition, the support vector machine algorithm is a convex optimization problem, the local optimal solution must be the global optimal solution, these features are including the neural network, including other algorithms can not and. Support vector machine can be applied to the classification of data mining, regression, the exploration of unknown things and so on. In addition to the above methods, there are ways to convert data and results into visualization techniques, cloud model methods, and inductive logic programs.In fact, any kind of excavation tool is often based on specific issues to select the appropriate mining method, it is difficult to say which method is good, that method is inferior, but depending on the specific problems.2.3 data mining processFor data mining, we can be divided into three main stages: data preparation, data mining, evaluation and expression of results. The results of the evaluation and expression can also be broken down into: assessment, interpretation model model, consolidation, the use of knowledge. Knowledge discovery in the database is a multi-step process, but also the three stages of the repeated process,2.3.1 Data PreparationKDD processing object is a lot of data, these data are generally stored in the database system, the long-term accumulation of the results. But often not suitable for direct knowledge mining on these data, need to do data preparation, generally including the choice of data (select the relevant data), clean (eliminate noise, data), speculate (estimate missing data), conversion (discrete Data conversion between data and continuous value data, packet classification of data values, calculation combinations between data items, etc.), data reduction (reduction of data volume). These jobs are often prepared when the data warehouse is generated. Data preparation is the first step in KDD. Whether data preparation is good will affect the efficiency and accuracy of data mining and the effectiveness of the final model.2.3.2 Data miningData mining is the most critical step KDD, but also technical difficulties. Most of the research KDD personnel are studying data mining technology, using more technology to have decision tree, classification, clustering, rough set, association rules, neural network, genetic algorithm and so on. Data mining According to the goal of KDD, select the parameters of the corresponding algorithm, analyze the data, and get the model model of the possible model layer knowledge.2.3.3 Results evaluation and expressionEvaluation model: the model model obtained above, there may be no practical significance or no use value, it may not be able to accurately reflect the true meaning of the data, even in some cases is contrary to the facts, so need Evaluate, determine which are valid and useful patterns. Evaluation can be based on years of experience, some models can also be used directly to test the accuracy of the data. This step also includes presenting the pattern to the user in an easy-to-understand manner.Consolidate knowledge: the user understands and is considered to be consistent with the actual and valuable model of the model that forms the knowledge. 
2.3.3 Evaluation and expression of results
Evaluating the model: the patterns obtained above may have no practical significance or value, may fail to reflect the true meaning of the data accurately, and in some cases may even contradict the facts, so they need to be evaluated to determine which are valid and useful. Evaluation can draw on years of experience, and some models can also be tested directly against the data for accuracy. This step also includes presenting the patterns to the user in an easy-to-understand form.

Consolidating the knowledge: the patterns that the user understands and judges to be consistent with reality and valuable form the knowledge. Attention must also be paid to checking the consistency of this knowledge against previously obtained knowledge and to resolving any conflicts and contradictions, so that the knowledge is consolidated.

Using the knowledge: knowledge is discovered in order to be used, and making it usable is itself one of the steps of KDD. There are two ways to use knowledge: one is to rely on the relationships or results described by the knowledge itself to support decision-making; the other is to apply the knowledge to new data, which may raise new problems and require the knowledge to be refined further. The KDD process may need to be repeated many times: whenever a step fails to meet the expected goal, one goes back to a previous step, readjusts, and executes again.

3 Data mining applications
The potential applications of data mining are very broad: government decision-making, business management, scientific research, and decision support for industrial enterprises, among other fields.

3.1 Applications in scientific research
From the point of view of methodology, scientific research can be divided into three categories: theoretical science, experimental science and computational science. Computational science is an important hallmark of modern science. Computational scientists work with data every day, analyzing a wide variety of experimental or observational data. With advanced scientific data collection tools such as observation satellites, remote sensors and DNA molecular technology, the volume of data has become enormous and traditional analysis tools are powerless, so powerful, intelligent and automatic data analysis tools are indispensable. Data mining has a famous application system in astronomy: SKICAT (Sky Image Cataloging and Analysis Tool), developed by the Jet Propulsion Laboratory of the California Institute of Technology (the laboratory that designed the Mars rover) together with astronomers to help discover distant quasars. SKICAT is both the first successful data mining application and one of the first successful applications of artificial intelligence in astronomy and space science. Using SKICAT, astronomers have discovered 16 new, distant quasars that help them better study the formation of quasars and the structure of the early universe. The application of data mining in biology is mainly focused on molecular biology, especially genetic engineering; in gene research there is a well-known international project, the Human Genome Project.

3.2 Applications in business
In the business sector, especially in retail, data mining has been used quite successfully. As MIS systems have become widespread in commerce, especially with the use of bar-code technology, large amounts of purchase data can be collected, and the volume keeps surging. Data mining technology can provide managers with appropriate decision-making tools and is therefore of great help in promoting sales and improving competitiveness.
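A typical retail task of the kind described above is market-basket analysis with association rules. The sketch below is only an illustration that assumes the third-party mlxtend library (not mentioned in the original text) and an invented transaction list; the support and confidence thresholds are arbitrary example values.

```python
# Illustrative market-basket sketch for the retail scenario in 3.2,
# assuming the mlxtend library; transactions and thresholds are invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

The resulting table reports support and confidence for each rule, which is the kind of output that then feeds the evaluation step discussed in 2.3.3.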
3.3 Applications in finance
In the financial sector the amount of data is very large: banks, securities companies and other institutions hold enormous volumes of transaction data. Credit card fraud alone costs banks heavily every year, so data mining can be used to analyze customers' creditworthiness. Typical financial analysis areas include investment assessment and stock market forecasting.

3.4 Applications in medicine
Data mining is applied very widely in medicine; from molecular medicine to medical diagnosis, it can be used to improve both efficiency and effectiveness. In drug synthesis, for example, analyzing the chemical structure of drug molecules can determine which atoms or atomic groups in a drug act on a disease, so that when new drugs are synthesized, their molecular structure indicates which diseases they are likely to treat. Data mining can also be used in industry, agriculture, transportation, telecommunications, the military, the Internet and other sectors. It has broad application prospects: it can support decision-making and it can also serve the database management system (DBMS) itself. As a decision support and analysis tool, data mining can be used to construct a knowledge base; within a DBMS, it can be used for semantic query optimization, integrity constraints and inconsistency checking.

4 Development trends of data mining
The diversity of data, of data mining tasks and of data mining methods raises many challenging research topics. The design of data mining languages, the development of efficient and useful data mining methods and systems, interactive and integrated data mining environments, and the application of data mining technology to large real-world problems are the main issues currently facing data mining researchers and system and application developers. The main development trends are: application exploration; scalable data mining methods; integration of data mining with database systems, data warehouse systems and Web database systems; standardization of data mining languages; visual data mining; mining of complex and new data types; Web mining; and privacy protection and information security in data mining.

5 Concluding remarks
Although data mining technology has already been applied to some extent and has achieved remarkable results, many problems remain unresolved, such as data preprocessing, mining algorithms, pattern recognition and interpretation, and visualization. For business applications, the most critical issue is how to combine the spatial and temporal characteristics of business data and to express and interpret the mined knowledge in terms of time and space. As data mining technology matures, it will be applied in ever wider fields and achieve more significant results.

References
[1] HAN Jia-wei, KAMBER M. Data Mining: Concepts and Techniques [M]. FAN Ming, MENG Xiao-feng, transl. Beijing: China Machine Press, 2010: 305-307. (in Chinese)
[2] ZHOU Bin, LIU Ya-ping, WU Quan-yuan. The design and implementation issues of a data mining system for electronic commerce [J]. Computer Engineering, 2012, 26(6): 18-20. (in Chinese)
[3] WANG Jia-cai, CHEN Qi, ZHAO Jie-yu, et al. VISMiner: An interactive visual data mining prototype system [J]. Computer Engineering, 2003, 29(1): 17-19. (in Chinese)
[4] LIU Kan, ZHOU Xiao-zheng, ZHOU Dong-ru. Visual data mining based on parallel coordinates [J]. Computer Engineering and Applications, 2013, 39(5): 193-196. (in Chinese)
[5] NETZ A, CHAUDHURI S, FAYYAD U, et al. Integrating data mining with SQL databases: OLE DB for data mining [A]. Proc 17th Int Conf on Data Engineering [C]. Heidelberg: IEEE, 2001: 379-387.
[6] ZHAO Zhi-hong, LUO Bin, CHEN Shi-fu. A structure of data mining system based on data warehouse [J]. Computer Applications and Software, 2012, 19(4): 27-30. (in Chinese)
[7] QIAN Wei-ning, WEI Li, WANG Yan, et al. A data mining system for very large databases [J]. Journal of Software, 2012, 13(8): 1540-1545. (in Chinese)
[8] Quanyin Zhu, Jin Ding, Yonghua Yin, et al. A hybrid approach for new products discovery of cell phone based on Web mining [J]. Journal of Information and Computational Science, 2012, 9(16): 5039-5046.
[9] Quanyin Zhu, Pei Zhou, Sunqun Cao, et al. A novel RDB-SW approach for commodities price dynamic trend analysis based on Web extracting [J]. Journal of Digital Information Management, 2012, 10(4): 230-235.
[10] Quanyin Zhu, Pei Zhou. The system architecture for the basic information of science and technology experts based on distributed storage and Web mining [C]. Proceedings of the International Conference on Computer Science and Service System, 2012: 661-664.

数据挖掘技术综述
摘要：随着计算机、网络技术的发展，获得有关资料非常简单易行。

外文文献及翻译:什么是数据挖掘

什么是数据挖掘?简单地说,数据挖掘是从大量的数据中提取或“挖掘”知识。

该术语实际上有点儿用词不当。

注意,从矿石或砂子中挖掘黄金叫做黄金挖掘,而不是叫做矿石挖掘。

这样,数据挖掘应当更准确地命名为“从数据中挖掘知识”,不幸的是这个有点儿长。

“知识挖掘”是一个短术语,可能它不能反映出从大量数据中挖掘的意思。

毕竟,挖掘是一个很生动的术语,它抓住了从大量的、未加工的材料中发现少量金块这一过程的特点。

这样,这种用词不当携带了“数据”和“挖掘”,就成了流行的选择。

还有一些术语,具有和数据挖掘类似但稍有不同的含义,如数据库中的知识挖掘、知识提取、数据/模式分析、数据考古和数据捕捞。

许多人把数据挖掘视为另一个常用的术语—数据库中的知识发现或KDD的同义词。

而另一些人只是把数据挖掘视为数据库中知识发现过程的一个基本步骤。

知识发现的过程由以下步骤组成:1)数据清理:消除噪声或不一致数据,2)数据集成:多种数据可以组合在一起,3)数据选择:从数据库中检索与分析任务相关的数据,4)数据变换:数据变换或统一成适合挖掘的形式,如通过汇总或聚集操作,5)数据挖掘:基本步骤,使用智能方法提取数据模式,6)模式评估:根据某种兴趣度度量,识别表示知识的真正有趣的模式,7)知识表示:使用可视化和知识表示技术,向用户提供挖掘的知识。

数据挖掘的步骤可以与用户或知识库进行交互。

把有趣的模式提供给用户,或作为新的知识存放在知识库中。

注意,根据这种观点,数据挖掘只是整个过程中的一个步骤,尽管是最重要的一步,因为它发现隐藏的模式。

我们同意数据挖掘是知识发现过程中的一个步骤。

然而,在产业界、媒体和数据库研究界,“数据挖掘”比那个较长的术语“数据库中知识发现”更为流行。

因此,在本书中,选用的术语是数据挖掘。

我们采用数据挖掘的广义观点:数据挖掘是从存放在数据库中或其他信息库中的大量数据中挖掘出有趣知识的过程。

基于这种观点,典型的数据挖掘系统具有以下主要成分:数据库、数据仓库或其他信息库:这是一个或一组数据库、数据仓库、电子表格或其他类型的信息库。

数据挖掘技术英语论文

Good evening, ladies and gentlemen. I am very glad to stand here and give you a short speech. Today I would like to introduce data mining technology to you: what data mining technology is, and what its advantages and disadvantages are.

Data mining refers to "extracting implicit, previously unknown and valuable information from past data", or "the science of extracting information from large amounts of data or from databases". In general, it requires a strict sequence of steps, including understanding, acquisition, integration, data cleaning, hypothesis formation and interpretation. By following these steps, we can obtain implicit and valuable information from the data. However, in spite of this complete procedure, there are still many shortcomings.

First of all, operators face many problems in its development: the target market segmentation is not clear; the demand for data mining and the evaluation of information are insufficient; product planning and management have difficulty meeting customers' information needs; the attraction to partners is rather weak, and a win-win value chain has not yet formed; and at the level of operations management and business processes, the capabilities of sales teams and group informatization services have not adapted to the development of the business. In a word, there are still a lot of problems to be solved. It requires excellent statistics and technology, as well as a greater ability to refine and summarize.

Secondly, it is easy to rely only on the data. "Let the data speak" is not wrong, but we should keep in mind that if data and tools alone could solve every problem, what would people be needed for? The data itself can only help analysts find significant results; it cannot tell you whether a result is right or wrong. So we must also check the relevant information carefully, in case the data misleads us.

Thirdly, data mining also involves privacy issues. For example, an employer could access medical records to screen out those who have diabetes or serious heart disease in order to reduce insurance expenditure, but this approach leads to ethical and legal problems. Data mining by governments and businesses may touch on national security or commercial confidentiality, which is also a big challenge to confidentiality. In this respect, users need to observe social ethics and governments need to strengthen regulation.

All in all, every technology has its own advantages and disadvantages. We need to learn to recognize them and to use the technology effectively in order to create greater benefits for mankind. There are still many things to be discovered about data mining. That is all. Thank you for listening.


基于自然语言的Apriori关联规则的视觉挖掘方法
摘要：可视化数据挖掘技术可以以图形方式向用户展示数据挖掘过程，从而使用户更易于理解挖掘过程及其结果，在数据挖掘中也非常重要。

然而,现在大多数视觉数据挖掘都是通过可视化的结果而进行的。

同时，图形显示方式并不适合关联规则的可视化处理。

鉴于上述缺点，本文采用自然语言处理方法，以自然语言的形式对Apriori关联规则的整个挖掘过程进行可视化，包括数据预处理、挖掘过程和挖掘结果的可视化显示，为用户提供了一套更直观、更易于理解的集成方案。
关键字：Apriori；关联规则；数据挖掘；可视化
1 引言
视觉数据挖掘技术是可视化技术和数据挖掘技术的结合。

使用计算机图形、图像处理技术等方法将数据挖掘的源数据,中间结果和最终挖掘结果转换成易于理解的图形或图像,然后进行贯穿的理论,方法和技术交互式处理。

根据数据挖掘应用中可视化的不同阶段,数据挖掘的可视化可以分为源数据可视化,挖掘过程可视化和结果可视化。

（1）源数据可视化
源数据可视化方法在数据挖掘之前，以可视化的形式将整个数据集呈现给用户。

目的是使用户能够快速找到有趣的地区,从而实现挖掘目标和目标的下一步。

（2）过程可视化
过程可视化实现起来相当复杂。

主要有两种方法- 一种是在采矿过程中可视化地呈现中间结果,并使用户根据中间结果的反馈方便地调整参数和约束。

另一种方法是以图标和流程图的形式保持整个数据挖掘过程,根据用户可以观察数据源,数据集成,清理和预处理过程以及采矿结果的存储和可视化等等。

（3）结果可视化
数据挖掘结果可视化是指在挖掘过程结束时以图形和图像的形式描述挖掘结果或知识，以提高用户对结果的理解，并使用户更好地评估和利用挖掘结果。

2 国内外视觉数据挖掘研究现状
目前，视觉数据挖掘技术的研究在国内外都处于起步阶段，研究重点是如何使用可视化技术来展示利用各种数据挖掘算法生成的模型。

该方向的主要研究内容是通过一些特殊视觉图形中的关联规则、决策树和聚类等算法向用户显示生成的结果,以帮助用户更好地了解结果数据挖掘模型。

典型的业务应用程序是IBM SPSS Modeler,开源工具包括Weka、Orange、GGobi 和KNIME,以及Google Visual Public Platform:Public Data Explorer。

视觉数据挖掘工具是一种很好的数据分析工具。在行业应用中，使用可视化数据挖掘工具能更直观地展示数据挖掘过程，结合数据挖掘技术，也更有利于对数据挖掘结果进行分析。

目前,关联规则的可视化研究主要集中在可视化数据和关联规则结果上,而挖掘过程可视化存在很多缺陷。

特别是在视觉演示过程中,基本采用图形形式。

在实践中已经发现,图形方法不适合在过程中显示关联规则及其结果。

因为对于关联规则，我们的目的是找到频繁项集，而频繁项集最好以文本形式显示；同时对于最终得到的关联规则，图形方式并不能很好地展示，最好的方法是以基于自然语言的方式显示。

本文提出了基于自然语言的Apriori关联规则的视觉挖掘方案。

该方案的预处理,中间过程和采矿结果各个方面均可视化。

旨在通过最可接受的自然语言作为工具,实现整个采矿过程的视觉演示。

3 基于Apriori关联规则的可视化挖掘的基本思路
本文提出的关联规则可视化挖掘的基本思想，是在数据挖掘的整个过程中进行可视化。此前提出的关联规则可视化挖掘基本上只针对挖掘结果的可视化，很少涉及中间过程和预处理过程的可视化。

对于结果可视化,图形方法是主要采用的显示方式,如使用平行坐标法,有向图法等。

然而,对于关联规则,通过频繁项目集和关联规则的方式进行图形显示似乎无能为力。

关联规则反映的只是规则本身，而规则最直接的表达形式是自然语言；晦涩的公式和图形只有非常专业的人员才能理解，不适合普及。

而且,当然,充分运用反映关联规则的自然语言对实现有一定困难。

在本文中,采用自然语言的形式,以视觉方式展示了整个采矿过程。

可视化过程如图1所示。
图1 关联规则的可视化过程
表1 数学分数变换规则
（1）数据预处理
数据预处理是整个数据挖掘的关键，也是第一步，一般由程序自动完成并显示差异。

本文采用完全互动的预处理操作可视化方法,首先构建用户定义的自然语言转换规则库,易于编辑规则,其最终目标是将属性值转换为自然语言。

例如,表1可以被定义为这样的规则,根据得分值,不同的分数可以被转换成不同的代码。
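下面用一小段 Python 代码示意这种"属性值到代码"的转换，仅为示意性草案：表1的具体分数区间和代码在原文中并未给出，这里的区间、代码以及 pandas 库的使用均为笔者的假设。

```python
# 示意：按照用户定义的转换规则表，把数学分数转换为代码（区间和代码均为假设）。
import pandas as pd

# 假设的转换规则表（字段名称、代码、条件、含义），对应正文中表1的思路
rules = [
    {"field": "math", "code": "A1", "low": 85, "high": 100, "meaning": "优秀"},
    {"field": "math", "code": "A2", "low": 70, "high": 84,  "meaning": "中等"},
    {"field": "math", "code": "A3", "low": 0,  "high": 69,  "meaning": "较差"},
]

def to_code(field, value):
    """在规则表中查找属性和取值，返回对应代码；找不到则作错误处理。"""
    for r in rules:
        if r["field"] == field and r["low"] <= value <= r["high"]:
            return r["code"]
    raise ValueError(f"转换规则缺失: {field}={value}")

scores = pd.Series([92, 78, 55], name="math")
print(scores.map(lambda v: to_code("math", v)).tolist())  # ['A1', 'A2', 'A3']
```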

（2）挖掘过程
挖掘过程的可视化主要体现在中间挖掘结果的视觉显示和用户与系统之间的相互作用。

对于关联规则，中间挖掘结果体现在频繁项集的显示中，以供用户观察挖掘过程正确与否；同时借助交互程序，用户可以及时介入方案的运行。
（3）挖掘结果
挖掘结果可视化主要是基于最大频繁项集来提取关联规则，并通过转换规则将编码形式的关联规则转换为自然语言形式。

用户可以一目了然地了解规则的含义。

4 Apriori关联规则的可视化挖掘实现
A. 数据预处理可视化
构建转换表：转换表（字段名称、代码、条件和含义）。
图2 数据预处理可视化
用户可以在转换规则表中进行编辑，包括添加、删除等。

形成转换规则表后,从数据预处理开始。

如图2所示，首先打开原始数据表，扫描表中的每个属性和值，并在转换规则表中查找对应的属性和值进行转换；如果没有找到相应的属性和值，则进行错误处理，反复执行直到转换完成。
B. 可视化挖掘过程
1) 挖掘参数设定
在挖掘之前，用户选择支持度和置信度，然后开始数据挖掘，其间可以随时观察频繁项集和最大频繁项集的变化；如果出现异常，可以及时终止程序的运行，重新选择参数后再次进行数据挖掘。

2) 中间结果显示
在挖掘过程中，可以显示初始数据项集、频繁项集和最大频繁项集，以便用户观察数据挖掘的整个过程。
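为了说明上述逐层产生频繁项集、显示中间结果的过程，下面给出一个极简的 Apriori 频繁项集生成草案（纯 Python 实现，事务数据为虚构示例，仅用于演示，不代表原文程序）。

```python
# 极简 Apriori 草案：逐层输出频繁项集 L1, L2, ...，便于观察中间结果（数据为虚构）。
from itertools import combinations

transactions = [
    {"B2", "F3", "C1"},
    {"B2", "F3"},
    {"B2", "C1"},
    {"B2", "F3", "D4"},
]
min_support = 0.5          # 支持度阈值（示例取值）
n = len(transactions)

def support(itemset):
    """计算项集在事务集合中的支持度。"""
    return sum(itemset <= t for t in transactions) / n

# L1：频繁1-项集
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
k = 1
while level:
    print(f"L{k}:", [(sorted(s), round(support(s), 2)) for s in level])
    # 由 Lk 自连接生成候选 (k+1)-项集，再按支持度剪枝
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
    level = sorted((c for c in candidates if support(c) >= min_support),
                   key=lambda s: sorted(s))
    k += 1
```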

C. 挖掘结果可视化
1) 根据模糊关联理论建立关联规则的模糊运算符
规则有两个约束条件，一个是支持度，另一个是置信度。

建立关联规则的关键在于置信度的度量，因此本文以置信度为参照。

根据需要,在本文中,置信度取0-100的水平作为边界,所以模糊理论的领域表达为[0,100]。

模糊集的特征函数被称为隶属函数,它是描述逐渐变化的东西和“中介转型”现象的关键。

隶属函数有很多种，常用的有三种形式：正常型、基于环型的类型和环型。

从经验来看，一般建议使用基于环型的类型或环型的隶属函数来描述模糊算子，而本文选择正常型来描述模糊算子。

模糊算子是对模糊程度的定量描述，本文用它来表示关联规则的成立程度。

我们使用"很可能"、"可能"、"比较可能"、"有点可能"来修饰关联规则的成立程度。

其中a为阈值，λ为算子对应的取值，Hλ为定量描述模糊程度的算子。

设A为模糊值，定义算子Hλ为HλA = A^λ，λ的取值及其对应的语义为："很可能"，λ = 4；"可能"，λ = 2；"比较可能"，λ = 0.5；"有点可能"，λ = 0.25。
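下面用一小段 Python 代码示意上述模糊算子 HλA = A^λ 的计算方式，以及按阈值为关联规则的置信度挑选语言修饰词的一种可能做法。原文未给出完整的公式（1），这里的阈值 a 与判定方式均为笔者假设的示意实现。

```python
# 示意：模糊算子 HλA = A**λ，并据此给置信度挑选语言修饰词（阈值与判定方式为假设）。
HEDGES = [("很可能", 4), ("可能", 2), ("比较可能", 0.5), ("有点可能", 0.25)]

def hedge(confidence_percent, a=0.6):
    """confidence_percent 取 [0, 100]；A 为归一化的模糊值，a 为阈值（示例取 0.6）。"""
    A = confidence_percent / 100.0
    for label, lam in HEDGES:          # 从语气最强的修饰词开始尝试
        if A ** lam >= a:              # HλA = A^λ 达到阈值即采用该修饰词
            return label
    return "有点可能"

for c in (95, 80, 60, 40):
    print(c, "->", hedge(c))
```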

基于模糊算子，模糊条件由公式（1）得出，通过公式（1）可以推导出置信度的精确范围。
2) 关联规则的自然语言转换
如图3所示，挖掘形成的关联规则首先以符号形式显示；然后扫描转换表，对规则中的每个符号进行查找，将符号转换为自然语言，最后把以符号显示的规则转换为以自然语言表达的规则。

例如，符号规则 B2 --> F3 转换成：中等职业成就 --> 方向（就业）。
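与上面的例子对应，下面给出把符号规则转换为自然语言规则的一个极简草案。代码与含义的对应关系取自正文示例，其余均为示意。

```python
# 示意：按转换表把符号形式的关联规则转换成自然语言形式（对应关系取自正文示例）。
MEANINGS = {"B2": "中等职业成就", "F3": "方向（就业）"}

def rule_to_natural_language(antecedents, consequents):
    """把规则前件、后件中的代码替换为自然语言含义并拼接输出。"""
    left = "、".join(MEANINGS.get(code, code) for code in antecedents)
    right = "、".join(MEANINGS.get(code, code) for code in consequents)
    return f"{left} --> {right}"

print(rule_to_natural_language(["B2"], ["F3"]))   # 中等职业成就 --> 方向（就业）
```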

表3 挖掘结果的可视化过程
为了测试本文方法的可行性，根据Apriori关联规则挖掘算法，编写了学生成绩与毕业指导关系的数据挖掘程序。以N大学X学院C学院为例：64名学生，5个综合表现属性，1个毕业方向属性，挖掘过程和结果如图4和图5所示。

从图4和图5可以看出,它在整个过程中主要建立自然语言转换规则库,然后将属性值转换为代码,并使用代码进行数据挖掘。

可以观察采矿过程中频繁项集的变化,使用户能够及时调整初始参数。

挖掘结果可以直接以自然语言显示,以提高规则的可读性。

图4 原始数据的预处理
图5 数据挖掘结果
5 结论
针对目前大多数视觉数据挖掘技术集中在数据挖掘结果的可视化、而图形显示又不适合Apriori关联规则的可视化处理这两个缺陷，本文提出了一种基于自然语言的可视化处理方法。

该方法对Apriori关联规则算法的数据预处理、挖掘过程和挖掘结果均进行基于自然语言的可视化处理。

它提供了一套更直观、更易于理解的集成方案。

扩大了视觉数据挖掘过程的应用范围,有利于数据挖掘技术的推广应用。

参考文献：
[1] Xie Qinghua, Zhang Ningrong, Song Yishen, et al. "The Visual Model Method and Technology of Clustering Data Mining", Journal of PLA University of Science and Technology (Natural Science Edition), vol.16, no.1, pp.7-15, 2015.
[2] Zhang Jun, "Research and Implementation of Visual Data Mining Technology", Journal of Chongqing Technology and Business University (Natural Science Edition), vol.30, no.3, pp.58-61, 95, 2013.
[3] Wang Jing, "The Research and Application of Visual Technology in Data Mining", Jilin University Press, Changchun, 2009.
[4] Hu Jun, "The Visual Data Mining Model and Its Application Research", Beijing Jiaotong University Press, Beijing, 2009.
[5] Song Chengzhang, Huang Xiaodong, Li Peng, et al. "Public Sentiment Large Data Analysis Based on Processing and Its Visualization Study", Fujian Computer, no.5, pp.19-21, 2014.
[6] Li Huijun, Li Zhiquan, "The Research on the Visual Clustering Method Based on Improving Radar Map", Journal of Yanshan University, vol.5, no.1, pp.58-62, 2013.
