外文翻译-不确定性数据挖掘:一种新的研究方向


人工智能领域中英文专有名词汇总

名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity 
recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large 
graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 
语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱(共包含二级节点15 个,三级节点93 个)间序列分析)监督学习)领域 二级分类 三级分类。

工业工程 外文期刊 翻译

Adrian Payne & Pennie FrowA Strategic Framework for Customer RelationshipManagementOver the past decade, there has been an explosion of interest in customer relationship management (CRM) by both academics and executives. However, despite an increasing amount of published material,most of which is practitioner oriented, there remains a lack of agreement about what CRM is and how CRM strategy should be developed. The purpose of this article is to develop a process-oriented conceptual framework that positions CRM at a strategic level by identifying the key crossfunctional processes involved in the development of CRM strategy. More specifically, the aims of this article are •To identify alternative perspectives of CRM,•To emphasize the importance of a strategic approach to CRM within a holistic organizational context,•To propose five key generic cross-functional processes that organizations can use to develop and deliver an effective CRM strategy, and•To develop a process-based conceptual framework for CRM strategy development and to review the role and components of each process.We organize this article in three main parts. First, we explore the role of CRM and identify three alternative perspectives of CRM. Second, we consider the need for a cross -functional process-based approach to CRM. We develop criteria for process selection and identify five key CRM processes. Third, we propose a strategic conceptual framework that is constructed of these five processes and examine the components of each process.The development of this framework is a response to a challenge by Reinartz, Krafft, and Hoyer (2004), who criticize the severe lack of CRM research that takes a broader, more strategic focus. The article does not explore people issues related to CRM implementation. Customer relationship management can fail when a limited number of employees are committed to the initiative; thus, employee engagement and change management are essential issues in CRM implementation. In our discussion, we emphasize such implementation and people issues as a priority area for further research.CRM Perspectives and DefinitionThe term “customer relationship management” emerged in the information technology (IT) vendor community and practitioner community in the mid-1990s. It is often used todescribe technology-based customer solutions, such as sales force automation (SFA). In the academic community, the terms “relationship marketing and CRM are often used interchangeably (Parvatiyar and Sheth 2001). However,CRM is more commonly used in the context of technology solutions and has been described as “information-enabled relationship marketing” (Ryals and Payne 2001, p. 3).Zablah, Beuenger, and Johnston (2003, p. 116) suggest that CRM is “a philosophically-related offspring to relationship marketing which is for the most part neglected in the literature,”and they conclude that “further exploration of CRM and its related phenomena is not only warranted but also desperately needed.”A significant problem that many organizations deciding to adopt CRM face stems from the great deal of confusion about what constitutes CRM. In interviews with executives, which formed part of our research process (we describe this process subsequently), we found a wide range of views about what CRM means. To some, it meant direct mail, a loyalty card scheme, or a database, whereas others envisioned it as a help desk or a call center. 
Some said that it was about populating a data warehouse or undertaking data mining; others considered CRM an e-commerce solution,such as the use of a personalization engine on the Internet or a relational database for SFA. This lack of a widely accepted and appropriate definition of CRM can contribute to the failure of a CRM project when an organization views CRM from a limited technology perspective or undertakes CRM on a fragmented basis. The definitions and descriptions of CRM that different authors and authorities use vary considerably, signifying a variety of CRM viewpoints. To identify alternative perspectives of CRM, we considered definitions and descriptions of CRM from a range of sources, which we summarize in the Appendix. We excluded other, similar definitions from this List.Process Identification and the CRM FrameworkWe began by identifying possible generic CRM processes from the CRM and related business literature. We then discussed these tentative processes interactively with the groups of executives. The outcome of this work was a short list of seven processes. We then used the expert panel of experienced CRM executives who had assisted in the development of the process selection schema to nominate the CRM processes that they considered important and to agree on those that were the most relevant and generic. After an initial group workshop, eachpanel member independently completed a list representing his or her view of the key generic processes that met the six previously agreed-on process criteria. The data were fed back to this group, and a detailed discussion followed to help confirm our understanding of the process categories.As a result of this interactive method, five CRM processes that met the selection criteria were identified; all five were agreed on as important generic processes by more than two-thirds of the group in the first iteration. Subsequently, we received strong confirmation of these as key generic CRM processes by several of the other groups of managers. The resultant five generic processes were (1) the strategy development process, (2) the value creation process, (3) the multichannel integration process, (4) the information management process, and (5) the performance assessment process.We then incorporated these five key generic CRM processes into a preliminary conceptual framework. This initial framework and the development of subsequent versions were both informed by and further refined by our interactions with two primary executive groups.客户关系的管理框架在过去的十年里,管理层和学术界对客户关系管理(CRM)的兴趣激增。

聚类分析文献英文翻译

电气信息工程学院外文翻译英文名称:Data mining-clustering译文名称:数据挖掘—聚类分析专业:自动化姓名:****班级学号:****指导教师:******译文出处:Data mining:Ian H.Witten, EibeFrank 著二○一○年四月二十六日Clustering5.1 INTRODUCTIONClustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:●Set of like elements. Elements from different clusters are not alike.●The distance between points in a cluster is less than the distance betweena point in the cluster and any point outside it.A term similar to clustering is database segmentation, where like tuple (record) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this case text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that that determining how to do the clustering is not straightforward.As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first floor type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.When clustering is applied to a real-world database, many interesting problems occur:●Outlier handling is difficult. Here the elements do not naturally fallinto any cluster. They can be viewed as solitary clusters. However, if aclustering algorithm attempts to find larger clusters, these outliers will beforced to be placed in some cluster. This process may result in the creationof poor clusters by combining two existing clusters and leaving the outlier in its own cluster.● Dynamic data in the database implies that cluster membership may change over time.● Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.● There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. 
Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.● Another related issue is what data should be used of clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.We can then summarize some basic features of clustering (as opposed to classification):● The (best) number of clusters is not known.● There may not be any a priori knowledge concerning the clusters.● Cluster results are dynamic.The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster,j k ,1j k ≤≤, is determined as a result of the function definition. Without loss of generality, we will view that the result of solving a clustering problem is that a set of clusters is created: K={12,,...,k k k k }.D EFINITION 5.1.Given a database D ={12,,...,n t t t } of tuples and an integer value k , the clustering problem is to define a mapping f : {1,...,}D k → where each i t is assigned to one cluster j K ,1j k ≤≤. A cluster j K , contains precisely those tuples mapped to it; that is, j K ={|(),1,i i j t f t K i n =≤≤and i t D ∈}.A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric database that fit into memory .There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.The types of clustering algorithms can be furthered classified based on the implementation technique used. Hierarchical algorithms can becategorized as agglomerative or divisive. 
”Agglomerative ” implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled base on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measure.We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.5.2 SIMILARITY AND DISTANCE MEASURESThere are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(,i l t t ), defined between any two tuples, ,i l t t D . This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.A distance measure, dis(,i j t t ), as opposed to similarity, is often used inclustering. The clustering problem then has the desirable property that given a cluster,j K ,,jl jm j t t K ∀∈ and ,(,)(,)i j jl jm jl i t K sim t t dis t t ∉≤.Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster, m K of N points { 12,,...,m m mN t t t }, we make the following definitions [ZRL96]:Here the centroid is the “middle ” of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid . The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and of points in the cluster. We use the notation m M to indicate the medoid for cluster m K .Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters i K and j K , there are several standard alternatives to calculate the distance between clusters. 
A representative list is:● Single link : Smallest distance between an element in onecluster and an element in the other. We thus havedis(,i j K K )=min((,))il jm il i j dis t t t K K ∀∈∉and jm j i t K K ∀∈∉.● Complete link : Largest distance between an element in onecluster and an element in the other. We thus havedis(,i j K K )=max((,))il jm il i j dis t t t K K ∀∈∉and jm j i t K K ∀∈∉.● Average : Average distance between an element in onecluster and an element in the other. We thus havedis(,i j K K )=((,))il jm il i j mean dis t t t K K ∀∈∉and jm j i t K K ∀∈∉.● Centroid : If cluster have a representative centroid, then thecentroid distance is defined as the distance between the centroids.We thus have dis(,i j K K )=dis(,i j C C ), where i C is the centroidfor i K and similarly for j C .Medoid : Using a medoid to represent each cluster, thedistance between the clusters can be defined by the distancebetween the medoids: dis(,i j K K )=(,)i j dis M M5.3 OUTLIERSAs mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.Some clustering techniques do not perform well with the presence of outliers. This problem is illustrated in Figure 5.3. Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier. This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water level values occur very infrequently, and when compared with the normal water level values may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, thesetests are not very realistic for real-world data because real-world data values may not follow well-defined data distributions. Also, most of these tests assume single attribute value, and many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.聚类分析5.1简介聚类分析与分类数据分组类似。
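针对上文 5.2 节给出的几种簇间距离定义(single link、complete link、average、centroid),下面给出一个示意性的 Python 草图(非译文内容,数据与函数名均为假设),演示这些距离如何计算:

import numpy as np

def pairwise_dists(A, B):
    """两个簇(各为若干行向量)中所有点对之间的欧氏距离矩阵。"""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    return pairwise_dists(A, B).min()    # 最小点对距离
def complete_link(A, B):  return pairwise_dists(A, B).max()    # 最大点对距离
def average_link(A, B):   return pairwise_dists(A, B).mean()   # 平均点对距离
def centroid_dist(A, B):  return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))

# 示例:两个小簇
Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[3.0, 4.0], [4.0, 4.0]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj), centroid_dist(Ki, Kj))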

数据挖掘英语

数据挖掘英语随着信息技术和互联网的不断发展,数据已经成为企业和个人在决策和分析中不可或缺的一部分。

而数据挖掘作为一种利用大数据技术来挖掘数据潜在价值的方法,也因此变得越来越重要。

在这篇文章中,我们将会介绍数据挖掘的相关英语术语和概念。

一、概念1.数据挖掘(Data Mining)数据挖掘是一种从大规模数据中提取出有用信息的过程。

数据挖掘通常包括数据预处理、数据挖掘和结果评估三个阶段。

2.机器学习(Machine Learning)机器学习是一种通过对数据进行学习和分析来改善和优化算法的方法。

机器学习可以被视为是一种数据挖掘的技术,它可以用来预测未来的趋势和行为。

3.聚类分析(Cluster Analysis)聚类分析是一种通过将数据分组为相似的集合来发现数据内在结构的方法。

聚类分析可以用来确定市场细分、客户分组、产品分类等。

4.分类分析(Classification Analysis)分类分析是一种通过将数据分成不同的类别来发现数据之间的关系的方法。

分类分析可以用来识别欺诈行为、预测客户行为等。

5.关联规则挖掘(Association Rule Mining)关联规则挖掘是一种发现数据集中变量之间关系的方法。

它可以用来发现购物篮分析、交叉销售等。

6.异常检测(Anomaly Detection)异常检测是一种通过识别不符合正常模式的数据点来发现异常的方法。

异常检测可以用来识别欺诈行为、检测设备故障等。

二、术语1.数据集(Dataset)数据集是一组数据的集合,通常用来进行数据挖掘和分析。

2.特征(Feature)特征是指在数据挖掘和机器学习中用来描述数据的属性或变量。

3.样本(Sample)样本是指从数据集中选取的一部分数据,通常用来进行机器学习和预测。

4.训练集(Training Set)训练集是指用来训练机器学习模型的样本集合。

5.测试集(Test Set)测试集是指用来测试机器学习模型的样本集合。
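作为补充说明(非原文内容),下面用一小段 Python 示意训练集与测试集的划分方式,其中数据为随机生成的假设数据:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 数据集:100 个样本,每个样本 5 个特征
y = rng.integers(0, 2, size=100)       # 标签

# 按 8:2 随机划分训练集和测试集
idx = rng.permutation(100)
train_idx, test_idx = idx[:80], idx[80:]
X_train, y_train = X[train_idx], y[train_idx]   # 训练集:用于训练模型
X_test,  y_test  = X[test_idx],  y[test_idx]    # 测试集:用于评估模型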

翻译专业毕业论文研究方向探索

翻译专业毕业论文研究方向探索翻译是一个与语言和文化密切相关的学科,它作为一门跨学科的综合性学科,涉及到语言学、文学、传媒、社会学等多个领域。

作为翻译专业的学生,选择一个具体的研究方向对于毕业论文的撰写非常重要。

本文将探讨翻译专业毕业论文的研究方向,并提供一些建议。

一、语言对比与翻译技巧语言是翻译的基础,而不同语言之间存在着巨大的差异。

研究语言对比可以深入了解各种语言之间的语法、词汇和翻译技巧。

例如,中英文之间的语序差异、文化隐喻的翻译、习惯用语的转化等。

通过系统地研究语言对比,可以提高翻译者的跨语言沟通能力。

二、跨文化交际与翻译策略翻译过程中,文化因素起着至关重要的作用。

不同文化之间存在着巨大的差异,这也是翻译过程中常常出现问题的地方。

研究跨文化交际与翻译策略可以探讨如何在不同文化背景下进行有效的信息传递和沟通。

例如,如何在翻译中考虑到文化背景、语境以及受众的差异等。

通过深入研究跨文化交际与翻译策略,可以提高翻译的准确性和有效性。

三、翻译技术与计算机辅助翻译随着技术的发展,计算机辅助翻译(CAT)成为翻译领域的一个重要方向。

CAT工具可以帮助翻译者提高翻译效率和准确性。

研究翻译技术与计算机辅助翻译,可以探索如何合理使用翻译工具、如何进行术语管理以及如何利用机器翻译等技术提高翻译效果。

此外,还可以研究自然语言处理技术在翻译过程中的应用。

四、专业文本翻译与行业应用翻译工作广泛应用于各个行业,而不同领域的专业文本存在着各自的特点和难点。

研究专业文本翻译与行业应用可以探讨如何在特定领域内进行准确且流畅的翻译。

例如,法律文件、医学文献、商务合同等。

通过研究专业文本翻译与行业应用,可以提高翻译者在特定领域内的工作能力和竞争力。

五、翻译教育与专业能力培养翻译教育是培养翻译专业人才的关键环节。

研究翻译教育与专业能力培养可以探讨如何有效地进行翻译教学和实践,并培养学生的综合素质与专业技能。

例如,探讨翻译教学方法、实习机制以及评估体系等。

市场调研方法外文文献及翻译

市场调研方法外文文献及翻译1. Market Research Methods: Incorporating Social Media into Traditional Approaches文章介绍了如何在市场调研中运用社交媒体,以帮助企业更好地了解消费者。

研究人员将社交媒体与传统的定量调研和定性调研相结合,以获得更全面的信息。

通过采集社交媒体的数据分析消费者的行为和偏好,以及对产品或服务的反馈意见。

2. Using Eye Tracking in Market Research: A Guide to Best Practices该文献介绍了视觉追踪技术在市场调研中的应用。

作者指出,视觉追踪技术可以帮助研究人员理解消费者在浏览产品或服务时的注意力分配和行为模式。

文章介绍了适用于市场调研的视觉追踪应用程序的最佳实践和测试方法。

3. Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice这篇文章介绍了一种被称为 "共轭分析" 的调研方法,该方法可以帮助研究人员了解消费者在购买某种产品或服务时的偏好和决策过程。

文献称,共轭分析已经成为市场营销领域最为普遍的工具之一。

文章还介绍了最新的研究和在实践中的应用,并探讨了一些特定情况下共轭分析的限制。

4. Qualitative Market Research: An International Journal这个杂志专注于定性市场调研方法。

它包括与确定消费者需求、分析竞争对手、建立品牌等相关的研究。

文章强调定性市场调研可以提供深入的见解和对产品或服务的更清晰的理解,帮助企业做出更明智的营销和业务决策。

每一期都包括来自该领域的专家的文章,并提供案例研究和最佳实践。

5. Use of Artificial Intelligence Techniques in Market Research: A Review该文献介绍了如何使用人工智能技术进行市场调研。

大数据挖掘外文翻译文献

文献信息:文献标题:A Study of Data Mining with Big Data(大数据挖掘研究)国外作者:VH Shastri,V Sreeprada文献出处:《International Journal of Emerging Trends and Technology in Computer Science》,2016,38(2):99-103字数统计:英文2291单词,12196字符;中文3868汉字外文文献:A Study of Data Mining with Big DataAbstract Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets typically whose size is larger than the typical data base. Big data introduces unique computational and statistical challenges. Big Data are at present expanding in most of the domains of engineering and science. Data mining helps to extract useful data from the huge data sets due to its volume, variability and velocity. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective.Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.I.IntroductionBig Data refers to enormous amount of structured data and unstructured data thatoverflow the organization. If this data is properly used, it can lead to meaningful information. Big data includes a large number of data which requires a lot of processing in real time. It provides a room to discover new values, to understand in-depth knowledge from hidden values and provide a space to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is a process discovering interesting knowledge such as associations, patterns, changes, anomalies and significant structures from large amount of data stored in the databases or other repositories.Big Data includes 3 V’s as its characteristics. They are volume, velocity and variety. V olume means the amount of data generated every second. The data is in state of rest. It is also known for its scale characteristics. Velocity is the speed with which the data is generated. It should have high speed data. The data generated from social media is an example. Variety means different types of data can be taken such as audio, video or documents. It can be numerals, images, time series, arrays etc.Data Mining analyses the data from different perspectives and summarizing it into useful information that can be used for business solutions and predicting the future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of searching large volumes of data automatically for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extract only required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trends analysis.Big Data is expanding in all domains including science and engineering fields including physical, biological and biomedical sciences.II.BIG DATA with DATA MININGGenerally big data refers to a collection of large volumes of data and these data are generated from various sources like internet, social-media, business organization, sensors etc. We can extract some useful information with the help of Data Mining. 
It is a technique for discovering patterns as well as descriptive, understandable, models from a large scale of data.V olume is the size of the data which is larger than petabytes and terabytes. The scale and rise of size makes it difficult to store and analyse using traditional tools. Big Data should be used to mine large amounts of data within the predefined period of time. Traditional database systems were designed to address small amounts of data which were structured and consistent, whereas Big Data includes wide variety of data such as geospatial data, audio, video, unstructured text and so on.Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted at times of node failure. It runs Map Reduce for distributed data processing and is works with structured and unstructured data.III.BIG DATA characteristics- HACE THEOREM.We have large volume of heterogeneous data. There exists a complex relationship among the data. We need to discover useful information from this voluminous data.Let us imagine a scenario in which the blind people are asked to draw elephant. The information collected by each blind people may think the trunk as wall, leg as tree, body as wall and tail as rope. The blind men can exchange information with each other.Figure1: Blind men and the giant elephantSome of the characteristics that include are:i.Vast data with heterogeneous and diverse sources: One of the fundamental characteristics of big data is the large volume of data represented by heterogeneous and diverse dimensions. For example in the biomedical world, a single human being is represented as name, age, gender, family history etc., For X-ray and CT scan images and videos are used. Heterogeneity refers to the different types of representations of same individual and diverse refers to the variety of features to represent single information.ii.Autonomous with distributed and de-centralized control: the sources are autonomous, i.e., automatically generated; it generates information without any centralized control. We can compare it with World Wide Web (WWW) where each server provides a certain amount of information without depending on other servers.plex and evolving relationships: As the size of the data becomes infinitely large, the relationship that exists is also large. In early stages, when data is small, there is no complexity in relationships among the data. Data generated from social media and other sources have complex relationships.IV.TOOLS:OPEN SOURCE REVOLUTIONLarge companies such as Facebook, Yahoo, Twitter, LinkedIn benefit and contribute work on open source projects. In Big Data Mining, there are many open source initiatives. The most popular of them are:Apache Mahout:Scalable machine learning and data mining open source software based mainly in Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent patternmining.R: open source programming language and software environment designed for statistical computing and visualization. 
R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand beginning in 1993 and is used for statistical analysis of very large data sets.MOA: Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression; clustering and frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML based definitions and is able to use MOA, Android and Storm.SAMOA: It is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.Vow pal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine networkinterface when doing linear learning, via parallel learning.V.DATA MINING for BIG DATAData mining is the process by which data is analysed coming from different sources discovers useful information. Data Mining contains several algorithms which fall into 4 categories. They are:1.Association Rule2.Clustering3.Classification4.RegressionAssociation is used to search relationship between variables. It is applied in searching for frequently visited items. In short it establishes relationship among objects. Clustering discovers groups and structures in the data.Classification deals with associating an unknown structure to a known structure. Regression finds a function to model the data.The different data mining algorithms are:Table 1. Classification of AlgorithmsData Mining algorithms can be converted into big map reduce algorithm based on parallel computing basis.Table 2. Differences between Data Mining and Big DataVI.Challenges in BIG DATAMeeting the challenges with BIG Data is difficult. The volume is increasing every day. The velocity is increasing by the internet connected devices. The variety is also expanding and the organizations’ capability to capture and process the data is limited.The following are the challenges in area of Big Data when it is handled:1.Data capture and storage2.Data transmission3.Data curation4.Data analysis5.Data visualizationAccording to, challenges of big data mining are divided into 3 tiers.The first tier is the setup of data mining algorithms. The second tier includesrmation sharing and Data Privacy.2.Domain and Application Knowledge.The third one includes local learning and model fusion for multiple information sources.3.Mining from sparse, uncertain and incomplete data.4.Mining complex and dynamic data.Figure 2: Phases of Big Data ChallengesGenerally mining of data from different data sources is tedious as size of data is larger. Big data is stored at different places and collecting those data will be a tedious task and applying basic data mining algorithms will be an obstacle for it. Next we need to consider the privacy of data. The third case is mining algorithms. 
When we are applying data mining algorithms to these subsets of data the result may not be that much accurate.VII.Forecast of the futureThere are some challenges that researchers and practitioners will have to deal during the next years:Analytics Architecture:It is not clear yet how an optimal architecture of analytics systems should be to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, theserving layer, and the speed layer. It combines in the same system Hadoop for the batch layer, and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, and extensible, allows ad hoc queries, minimal maintenance, and debuggable.Statistical significance: It is important to achieve significant statistical results, and not be fooled by randomness. As Efron explains in his book about Large Scale Inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.Distributed mining: Many data mining techniques are not trivial to paralyze. To have distributed versions of some methods, a lot of research is needed with practical and theoretical analysis to provide new methods.Time evolving data: Data may be evolving over time, so it is important that the Big Data mining techniques should be able to adapt and in some cases to detect change first. For example, the data stream mining field has very powerful techniques for this task.Compression: Dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression where we don’t loose anything, or sampling where we choose what is thedata that is more representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we are loosing information, but the gains inspace may be in orders of magnitude. For example Feldman et al use core sets to reduce the complexity of Big Data problems. Core sets are small sets that provably approximate the original data for a given problem. Using merge- reduce the small sets can then be used for solving hard machine learning problems in parallel.Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques, and frameworks to tell and show stories will be needed, as for examplethe photographs, infographics and essays in the beautiful book ”The Human Face of Big Data”.Hidden Big Data: Large quantities of useful data are getting lost since new data is largely untagged and unstructured data. The 2012 IDC studyon Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.VIII.CONCLUSIONThe amounts of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications.Data mining techniques can be applied on big data to acquire some useful information from large datasets. 
They can be used together to acquire some useful picture from the data.Big Data analysis tools like Map Reduce over Hadoop and HDFS helps organization.中文译文:大数据挖掘研究摘要数据已经成为各个经济、行业、组织、企业、职能和个人的重要组成部分。


毕业设计(论文)外文资料翻译
系部:计算机科学与技术系
专业:计算机科学与技术
姓名:
学号:
外文出处:Proceedings of the Workshop on the Sciences of the Artificial, Hualien, Taiwan, 2005

不确定性数据挖掘:一种新的研究方向
Michael Chau1, Reynold Cheng2, and Ben Kao3
1:商学院,香港大学,薄扶林,香港
2:计算机系,香港理工大学,九龙,香港
3:计算机科学系,香港大学,薄扶林,香港

摘要
由于不精确测量、过时的来源或抽样误差等原因,数据不确定性常常出现在真实世界应用中。

目前,在数据库数据不确定性处理领域中,很多研究结果已经被发表。

我们认为,在对不确定性数据执行数据挖掘时,必须将数据不确定性考虑在内,才能获得高质量的挖掘结果。

我们称之为“不确定性数据挖掘”问题。

在本文中,我们为这个领域可能的研究方向提出一个框架。

同时,我们以UK-means聚类算法为例,阐明传统的K-means算法如何经过改进来处理数据挖掘中的数据不确定性。

1.引言
由于测量不精确、抽样误差、数据来源过时等原因,数据往往带有不确定性。

这种情况在需要与物理环境交互的应用中尤为常见,如:移动定位服务[15]和传感器监测[3]。

例如:在追踪移动目标(如车辆或人)的情境中,数据库不可能追踪到所有目标在所有瞬间的准确位置。

因此,每个目标的位置变化过程都伴有不确定性。

为了提供准确的查询和挖掘结果,这些导致数据不确定性的多方面来源都必须被考虑在内。

在最近几年里,在数据库的不确定性数据管理方面已有大量研究,如:数据库中不确定性的表示和不确定性数据的查询。

然而,很少有研究成果能够解决不确定性数据挖掘的问题。

我们注意到,不确定性使数据值不再具有原子性。

要使用传统的数据挖掘技术,不确定性数据必须先被归纳为原子性数值。

再以追踪移动目标的应用为例,一个目标的位置可以归纳为它最后的记录位置,或者在考虑其位置概率分布的情况下归纳为一个期望位置。
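下面给出一个示意性的小例子(非原文内容,假设目标的位置不确定性已被离散化为若干带概率的候选位置),说明如何把不确定的位置归纳为一个期望位置:

import numpy as np

# 假设:某目标的位置不确定性被离散化为 3 个候选位置及其概率(纯属示例数据)
candidate_positions = np.array([[1.0, 2.0],
                                [1.5, 2.5],
                                [2.0, 1.0]])   # 每行是一个可能的 (x, y) 位置
probabilities = np.array([0.5, 0.3, 0.2])      # 各候选位置的概率,总和为 1

# 期望位置 = 各候选位置按概率加权的平均值
expected_position = probabilities @ candidate_positions
print(expected_position)   # 约为 [1.35, 1.95]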

不幸的是,归纳值与真实值之间的误差可能会严重影响挖掘结果。

图1阐明了当一种聚类算法被应用于追踪带有不确定性位置的移动目标时所发生的问题。

图1(a)表示一组目标的真实数据,而图1(b)则表示这些目标已经过时的记录位置。

如果能够获得这些实际位置,那么由它们得到的聚类结果将与由过时数据值得到的聚类结果有明显差异。

如果我们仅仅依靠记录的数据值,那么很多目标可能会被置于错误的数据簇中。

更糟糕的是,簇中每一个被错放的成员都会改变簇的质心,因此导致更多的错误。

图1. 数据图。(a)表示真实数据划分成的三个集群(a、b、c)。

(b)表示有些目标(隐藏的)的记录位置与它们的真实位置不一样,因此形成集群a’、b’、c’和c”。

注意到a’集群比a集群少了一个目标,而b’集群比b集群多了一个目标。

同时,c也被误拆分为c’和c”。

(c)表示在考虑方向不确定性之后推测出的集群a’、b’和c。

这种聚类产生的结果比(b)结果更加接近(a)。

我们建议把不确定性数据的概率密度函数等不确定性信息与现有的数据挖掘方法相结合,使得到的挖掘结果更接近于假如真实数据可用并用于挖掘时所能获得的结果。

本文以数据聚类作为一个激励性的例子,研究如何把不确定性因素融入数据挖掘之中。

我们称之为不确定性数据挖掘问题。

在本文中,我们为这个领域可能的研究方向提出一个框架。

文章接下来的结构如下。

第二章是有关工作综述。

在第三章中,我们定义了不确定性数据聚类问题,并介绍我们提出的算法。

第四章将呈现我们的算法在移动目标数据库上的应用。

详细的实验结果将在第五章给出。

最后在第六章总结论文并提出可能的研究方向。

2.研究背景
近年来,人们对数据不确定性管理有明显的研究兴趣。

数据不确定性可分为两类,即存在性不确定性和数值不确定性。

在第一种类型中,不确定的是目标或数据元组本身是否存在。

例如,关系数据库中的一个元组可能关联着一个概率值,用以表示对该元组存在与否的置信度[1,2]。

在数值不确定性类型中,一个数据项被建模为一个封闭区域,该区域连同其取值的概率密度函数(PDF)一起限定了它可能的取值[3,4,12,15]。

这个模型可以用于量化在不断变化的环境下位置数据或传感器数据的不精确度。

在这个领域里,大量的工作都致力于不精确查找。

例如,在[5]中,解决不确定性数据范围查询的索引方案已经被提出。

在[4]中,同一作者提出了处理最近邻查询的方案。

值得注意的是,这些工作都是把不确定性数据管理的研究成果应用于相对简单的数据库查询,而不是应用于相对复杂的数据分析和挖掘问题。

在数据挖掘研究中,聚类问题已经被很好的研究。

一个标准的聚类过程由五个主要步骤组成:模式表示、模式相似性度量的定义、聚类或分组、数据抽象以及输出结果的评估[10]。

而关于不确定性数据的数据挖掘或聚类,目前只有少量研究成果发表。

Hamdan与Govaert已经研究了如何运用EM算法将混合密度模型拟合到不确定性数据上以进行聚类的问题[8]。

然而,这个模型不能直接应用于其他聚类算法,因为它是专门为EM算法定制的。

在数据区间的聚类也同样被研究。

像街区距离(city-block distance)或闵可夫斯基距离(Minkowski distance)等不同的距离度量已经被用来衡量两个区间之间的相似度。

在这些度量中,大多数并没有考虑区间的概率密度函数。

另外一个相关领域的研究就是模糊聚类。

模糊聚类在模糊逻辑领域已经有很长的研究历史[13]。

在模糊聚类中,一个数据簇是全体目标的一个模糊子集。

每个目标对每个簇都有一个“隶属度”。

换言之,一个目标可以以不同的隶属度同时归属于多个簇。

模糊C均值(fuzzy c-means)聚类算法是使用最广泛的模糊聚类方法之一[2,7]。

不同的模糊聚类方法已经被应用于一般数据或模糊数据,用来产生模糊的数据簇。

他们的研究工作是基于模糊数据模型的,而我们的工作则基于移动目标的不确定性模型。
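作为说明,下面给出一个示意性的 Python 片段(非原文内容),按照常见的模糊C均值隶属度公式计算一个目标对各个簇的隶属度,其中数据点、簇中心和模糊指数 m 均为假设的示例值:

import numpy as np

def fuzzy_memberships(x, centers, m=2.0):
    """按模糊C均值的常用公式计算点 x 对各簇中心的隶属度。"""
    d = np.linalg.norm(centers - x, axis=1)          # 点到各簇中心的距离
    d = np.maximum(d, 1e-12)                          # 避免除零
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)                    # 每个簇一个隶属度,总和为 1

# 示例:一个二维点与三个假设的簇中心
x = np.array([1.0, 1.0])
centers = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]])
print(fuzzy_memberships(x, centers))                  # 输出三个隶属度,其和为 1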

3.不确定数据的分类
在图2中,我们提出一种分类法,来阐述数据挖掘方法如何根据是否考虑数据不确定性进行分类。

有很多通用的数据挖掘技术,如: 关联规则挖掘、数据分类、数据聚类。

当然,这些技术需要经过改进才能用于处理不确定性数据。

此外,我们区分出数据聚类的两种类型:硬聚类和模糊聚类。

硬聚类旨在通过考虑数据的期望值来提高聚类的准确性。

另一方面,模糊聚类则以一种“模糊”的形式来表示聚类结果。

例如,在模糊聚类中,每个数据项都被赋予一个属于各个数据簇的隶属概率。

图2. 不确定性数据挖掘的一种分类

以关联规则挖掘为例,当不确定性被考虑时,一个有意思的问题是如何表示数据集中的每个元组及其关联的不确定性。

而且,由于支持度和其他指标的概念需要重新定义,那些著名的关联规则挖掘算法(如Apriori)也不得不随之改进。

同样地,在数据分类和数据聚类中,传统算法由于未将数据不确定性考虑在内而可能无法正常工作。

对聚类质心、两个目标之间的距离、目标与质心之间的距离等重要度量,都不得不重新定义并作更深入的研究。
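下面用一个示意性的例子(非原文内容,其中“期望支持度”的定义取自不确定性频繁项集挖掘中的常见做法,数据为假设值)说明支持度这类指标在不确定性数据上如何被重新定义:

# 在不确定性事务数据中,每个物品以一定概率出现在某条事务里。
# 一种常见做法是把某个项集的“期望支持度”定义为:
# 各条事务中该项集所有物品出现概率之积,再对所有事务求和。
transactions = [
    {"A": 0.9, "B": 0.8},            # 事务1:A、B 各自出现的概率
    {"A": 0.5, "B": 0.4, "C": 1.0},  # 事务2
    {"B": 0.7},                      # 事务3
]

def expected_support(itemset, transactions):
    total = 0.0
    for t in transactions:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)    # 物品未出现视为概率 0
        total += p
    return total

print(expected_support({"A", "B"}, transactions))   # 0.9*0.8 + 0.5*0.4 + 0 = 0.92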

4.不确定性数据聚类实例
在本章中,我们将介绍我们在不确定性数据聚类方面的研究工作,作为不确定性数据挖掘的一个实例。

这将阐明我们关于如何改进传统数据挖掘算法使其适用于不确定性数据的想法。

4.1 问题定义
用 S 表示 V 维向量 x_i(i = 1, …, n)的集合,这些向量表示聚类应用中所考虑的全部记录的属性值。

每个记录 o_i 与一个概率密度函数 f_i(x) 相关联,该函数是 o_i 的属性值 x 在时刻 t 的概率密度函数。

我们不限制这个不确定性函数如何随时间变化,也不限制各记录的概率密度函数具体是什么。

均匀密度函数就是概率密度函数的一个例子,它描述了“不确定性很大”情景下的最坏情况[3]。

另一个常用的就是高斯分布函数,它能够用于描述测量误差[12,15]。

聚类问题就是基于相似性找到一个由数据簇 C_j(j = 1, …, K)构成的集合 C,其中每个簇 C_j 以其质心 c_j 来表示。

不同的聚类算法对应不同的目标函数,但是大意都是最小化同一数据簇内目标之间的距离,并最大化不同数据簇目标之间的距离。

簇内距离的最小化也可以看作是最小化每个数据点 x_i 与其所属簇 C_j 的质心 c_j 之间的距离。

在本文中,我们只考虑硬聚类,即每个目标只被分配给唯一的一个簇。

4.2 K-means聚类在精确数据中的应用
传统的K-means聚类算法旨在找到一个由 K 个数据簇 C_j(其质心为 c_j)构成的集合 C,以最小化平方误差总和(SSE)。

平方误差总和通常按下式计算:

SSE = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2        (1)

其中 || · || 表示一个数据点 x_i 与簇质心 c_j 之间的距离度量。

例如,平方欧氏距离定义为:

\| x - y \|^2 = \sum_{i=1}^{V} (x_i - y_i)^2        (2)

一个数据簇 C_j 的平均值(质心)由下面的向量公式定义:

c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i        (3)

K-means聚类算法如下:
1. Assign initial values for cluster means c_1 to c_K
2. repeat
3.   for i = 1 to n do
4.     Assign each data point x_i to cluster C_j where || c_j - x_i || is the minimum.
5.   end for
6.   for j = 1 to K do
7.     Recalculate cluster mean c_j of cluster C_j
8.   end for
9. until convergence
10. return C

算法是否收敛可以基于不同的准则来判定。

一些收敛性判别准则的例子包括:(1)平方误差总和小于某一用户设定的临界值;(2)在一次迭代中没有任何目标被重新分配给不同的数据簇;(3)迭代次数达到预先定义的最大值。
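为便于理解上述伪代码,下面给出一个示意性的 Python 实现草图(非原文内容,仅作参考,收敛准则采用“质心不再变化”这一简单假设):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """对精确数据 X(形状为 n×V)做标准 K-means 聚类,返回质心和每个点的簇标号。"""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # 第1步:随机选取初始质心
    for _ in range(max_iter):
        # 第3-5步:把每个数据点分配给距离最近的质心
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 第6-8步:重新计算每个簇的质心(空簇保持原质心)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):                # 第9步:收敛判定
            break
        centers = new_centers
    return centers, labels

# 用法示例(随机生成的二维数据):
X = np.random.default_rng(1).normal(size=(100, 2))
centers, labels = kmeans(X, K=3)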

4.3 K-means聚类在不确定性数据中的应用
为了在聚类过程中考虑数据不确定性,我们提出一种以最小化期望平方误差总和 E(SSE) 为目标的算法。

注意,一个数据对象 x_i 由一个带有概率密度函数 f_i(x_i) 的不确定性区域来刻画。

给定一组数据簇,期望平方误差总和可以按下式计算:

E(SSE) = E\left( \sum_{j=1}^{K} \sum_{x_i \in C_j} \| c_j - x_i \|^2 \right)
       = \sum_{j=1}^{K} \sum_{x_i \in C_j} E\left( \| c_j - x_i \|^2 \right)
       = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \| c_j - x_i \|^2 f_i(x_i) \, dx_i        (4)

数据簇的平均值(质心)则由下式给出:

c_j = E\left( \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \right)
    = \frac{1}{|C_j|} \sum_{x_i \in C_j} E(x_i)
    = \frac{1}{|C_j|} \sum_{x_i \in C_j} \int x_i f_i(x_i) \, dx_i        (5)
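公式(4)中的期望距离 E(||c_j − x_i||^2) 需要对目标的概率密度函数求积分。下面给出一个示意性的数值近似(非原文内容,假设每个不确定目标的 PDF 已被离散化为一组带权重的采样点):

import numpy as np

def expected_sq_dist(center, samples, weights):
    """E(||center - x||^2) 的离散近似:对带权采样点求加权平均。"""
    sq = np.sum((samples - center) ** 2, axis=1)   # 每个采样点到质心的平方距离
    return float(np.dot(weights, sq))              # 按权重(概率)加权求和

# 示例:某目标的位置 PDF 离散化为 3 个采样点(权重之和为 1)
samples = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 1.0]])
weights = np.array([0.5, 0.3, 0.2])
center = np.array([1.0, 1.0])
print(expected_sq_dist(center, samples, weights))   # ≈ 1.45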

基于上述定义,我们在此提出一种新的K-means算法,即UK-means,来实现不确定性数据聚类:
1. Assign initial values for cluster means c_1 to c_K
2. repeat
3.   for i = 1 to n do
4.     Assign each data point x_i to cluster C_j where E(|| c_j - x_i ||) is the minimum.
5.   end for
6.   for j = 1 to K do
7.     Recalculate cluster mean c_j of cluster C_j
8.   end for
9. until convergence
10. return C

UK-means聚类算法与K-means聚类算法的最大不同点在于距离和簇质心的计算。
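同样给出一个示意性的 UK-means 实现草图(非原文内容,仅为说明思路:假设每个不确定对象的 PDF 已被离散化为带权采样点,期望距离与期望质心分别按公式(4)、(5)的离散形式计算):

import numpy as np

def uk_means(objects, K, max_iter=100, seed=0):
    """objects: 列表,每个元素为 (samples, weights),即该不确定对象 PDF 的离散采样点及其概率。
    返回各簇质心和每个对象的簇标号。"""
    rng = np.random.default_rng(seed)
    # 每个对象的期望位置(用于初始化和更新质心,对应公式(5))
    exp_pos = np.array([w @ s for s, w in objects])
    centers = exp_pos[rng.choice(len(objects), size=K, replace=False)]
    for _ in range(max_iter):
        labels = np.empty(len(objects), dtype=int)
        for i, (s, w) in enumerate(objects):
            # 对应公式(4):E(||c_j - x_i||^2) = Σ_k w_k * ||c_j - s_k||^2
            exp_dist = [(w * np.sum((s - c) ** 2, axis=1)).sum() for c in centers]
            labels[i] = int(np.argmin(exp_dist))
        new_centers = np.array([exp_pos[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# 用法示例:两个不确定对象,每个对象的 PDF 离散化为两个带权采样点
objects = [
    (np.array([[0.0, 0.0], [0.2, 0.1]]), np.array([0.6, 0.4])),
    (np.array([[5.0, 5.0], [4.8, 5.2]]), np.array([0.5, 0.5])),
]
centers, labels = uk_means(objects, K=2)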
