Automatically attaching semantic metadata to web services
Glossary of Chinese and English Technical Terms in Artificial Intelligence

Terms with their Chinese equivalents: social networks (社会网络); abductive reasoning (溯因推理); action recognition (行为识别); active learning (主动学习); adaptive systems (自适应系统); adverse drug reactions (药物不良反应); algorithm design and analysis (算法设计与分析); algorithm (算法); artificial intelligence (人工智能); association rule (关联规则); attribute value taxonomy (属性分类规范); autonomous agent (自动代理); autonomous systems (自动系统); background knowledge (背景知识); Bayes methods (贝叶斯方法); Bayesian inference (贝叶斯推断); Bayesian methods (bayes 方法); belief propagation (置信传播); better understanding (内涵理解); big data (大数据); biological network (生物网络); biological sciences (生物科学); biomedical domain (生物医学领域); biomedical research (生物医学研究); biomedical text (生物医学文本); Boltzmann machine (玻尔兹曼机); bootstrapping method (拔靴法); case based reasoning (实例推理); causal models (因果模型); citation matching (引文匹配); classification (分类); classification algorithms (分类算法); cloud computing (云计算); cluster-based retrieval (聚类检索); clustering (聚类); clustering algorithms (聚类算法); cognitive science (认知科学); collaborative filtering (协同过滤); collaborative ontology development (联合本体开发); collaborative ontology engineering (联合本体工程); commonsense knowledge (常识); communication networks (通讯网络); community detection (社区发现); complex data (复杂数据); complex dynamical networks (复杂动态网络); complex network (复杂网络); computational biology (计算生物学); computational complexity (计算复杂性); computational intelligence (智能计算); computational modeling (计算模型); computer animation (计算机动画); computer networks (计算机网络); computer science (计算机科学); concept clustering (概念聚类); concept formation (概念形成); concept learning (概念学习); concept map (概念图); concept model (概念模型); concept modelling (概念模型); conceptual model (概念模型); conditional random field (条件随机场模型); conjunctive queries (合取查询); constrained least squares (约束最小二乘); convex programming (凸规划); convolutional neural networks (卷积神经网络); customer relationship management (客户关系管理); data analysis (数据分析); data center (数据中心); data clustering (数据聚类); data compression (数据压缩); data envelopment analysis (数据包络分析); data fusion (数据融合); data
generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 
形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information 
resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network 
formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance 
feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 
文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱(共包含二级节点15 个,三级节点93 个)间序列分析)监督学习)领域 二级分类 三级分类。
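exception in phase 'semantic analysis': Overview and Explanation

1. Introduction

1.1 Overview
In software development, the compiler is a key tool: it translates the source code we write into an executable program. Compilation proceeds through several phases, one of which is semantic analysis. In this phase the compiler checks whether the parsed code obeys the rules of the programming language beyond its syntax, such as type and scope rules, and produces the corresponding intermediate representation.

1.2 Structure of this article
This article introduces and explains the compiler error message "exception in phase 'semantic analysis'". It first gives a brief overview, then discusses why the error arises and the common situations that can lead to it. Next, it examines the background knowledge and key concepts related to semantic analysis. Finally, it summarizes and gives ways to resolve the problem.

1.3 Purpose
This article aims to help readers understand what "exception in phase 'semantic analysis'" means and why it may occur. By examining the background behind this error message, readers will be better able to understand and resolve the problems that such errors involve. The article also offers some possible solutions and suggestions for correcting or avoiding this class of error. Note that this article does not provide concrete code examples or programming-language-specific details. Instead, it concentrates on the general concept of the error and on ways to resolve it, to strengthen the reader's understanding of errors in the compiler's semantic analysis phase.

2. Main text
The main text gives an overview and explanation of "exception in phase 'semantic analysis'". Semantic analysis is an important compiler phase that checks whether the source code conforms to the language specification beyond its syntax and prepares for later code generation. During compilation, the "exception in phase 'semantic analysis'" exception may be raised while semantic analysis is being performed.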

A survey on sentiment detection of reviews

Huifeng Tang, Songbo Tan*, Xueqi Cheng
Information Security Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, PR China

Keywords: Sentiment detection; Opinion extraction; Sentiment classification

Abstract
Sentiment detection of texts has attracted booming interest in recent years, due to the increased availability of online reviews in digital form and the ensuing need to organize them. To date, four problems predominate in this research community: subjectivity classification, word sentiment classification, document sentiment classification and opinion extraction. There are inherent relations between them. Subjectivity classification can prevent the sentiment classifier from considering irrelevant or even potentially misleading text. Document sentiment classification and opinion extraction have often involved word sentiment classification techniques. This survey discusses related issues and the main approaches to these problems.
© 2009 Published by Elsevier Ltd.

1. Introduction

Today, a very large number of reviews are available on the web, and weblogs are growing fast in the blogosphere. Product reviews exist in a variety of forms on the web: sites dedicated to a specific type of product (such as digital cameras), sites for newspapers and magazines that may feature reviews (like Rolling Stone or Consumer Reports), sites that couple reviews with commerce (like Amazon), and sites that specialize in collecting professional or user reviews in a variety of areas (like ). Less formal reviews are available on discussion boards and mailing list archives, as well as in Usenet via Google Groups. Users also comment on products in their personal web sites and blogs, which are then aggregated by sites such as , , and .

The information mentioned above is a rich and useful source for marketing intelligence, social psychologists, and others interested in
extracting and mining opinions, views, moods, and attitudes: for example, whether a product review is positive or negative, what the mood among bloggers was at a given time, or how the public reacts to a political affair.

To achieve this goal, a core and essential job is to detect the subjective information contained in texts, including viewpoints, preferences, attitudes and sensibilities. This is so-called sentiment detection.

A challenging aspect of this task, which seems to distinguish it from traditional topic-based detection (classification), is that while topics are often identifiable by keywords alone, sentiment can be expressed in a much more subtle manner. For example, the sentence "What a bad picture quality that digital camera has! ... Oh, this new type camera has a good picture, long battery life and beautiful appearance!" compares a negative experience of one product with a positive experience of another product. It is difficult to separate out the core assessment that should actually be correlated with the document. Thus, sentiment seems to require more understanding than the usual topic-based classification.

Sentiment detection dates back to the late 1990s (Argamon, Koppel, & Avneri, 1998; Kessler, Nunberg, & Schütze, 1997; Spertus, 1997), but only in the early 2000s did it become a major subfield of the information management discipline (Chaovalit & Zhou, 2005; Dimitrova, Finn, Kushmerick, & Smyth, 2002; Durbin, Neal Richter, & Warner, 2003; Efron, 2004; Gamon, 2004; Glance, Hurst, & Tomokiyo, 2004; Grefenstette, Qu, Shanahan, & Evans, 2004; Hillard, Ostendorf, & Shriberg, 2003; Inkpen, Feiguina, & Hirst, 2004; Kobayashi, Inui, & Inui, 2001; Liu, Lieberman, & Selker, 2003; Raubern & Muller-Kogler, 2001; Riloff & Wiebe, 2003; Subasic & Huettner, 2001; Tong, 2001; Vegnaduzzo, 2004; Wiebe & Riloff, 2005; Wilson, Wiebe, & Hoffmann, 2005). Until the early 2000s, the two main popular approaches to sentiment detection, especially in real-world applications, were based on machine learning techniques and on semantic analysis techniques. After that, shallow
natural language processing techniques were widely used in this area, especially in document sentiment detection. Current-day sentiment detection is thus a discipline at the crossroads of NLP and IR, and as such it shares a number of characteristics with other tasks such as information extraction and text mining.

Although several international conferences, such as ACL, AAAI, WWW, EMNLP and CIKM, have devoted special issues to this topic, there are no systematic treatments of the subject: there are neither textbooks nor journals entirely devoted to sentiment detection yet.

Expert Systems with Applications 36 (2009) 10760–10773. doi:10.1016/j.eswa.2009.02.063. © 2009 Published by Elsevier Ltd.
* Corresponding author. E-mail addresses: tanghuifeng@ (H. Tang), tansongbo@ (S. Tan), cxq@ (X. Cheng).

This paper first introduces the definitions of several problems that pertain to sentiment detection. Then we present some applications of sentiment detection. Section 4 discusses the subjectivity classification problem. Section 5 introduces the semantic orientation method. The sixth section examines the effectiveness of applying machine learning techniques to document sentiment classification.
The seventh section discusses the opinion extraction problem. The eighth part talks about the evaluation of sentiment detection. The last section concludes with challenges and a discussion of future work.

2. Sentiment detection

2.1. Subjectivity classification

Subjectivity in natural language refers to aspects of language used to express opinions and evaluations (Wiebe, 1994). Subjectivity classification is stated as follows: let $S = \{s_1, \ldots, s_n\}$ be a set of sentences in document $D$. The problem of subjectivity classification is to distinguish sentences used to present opinions and other forms of subjectivity (the subjective sentence set $S_s$) from sentences used to objectively present factual information (the objective sentence set $S_o$), where $S_s \cup S_o = S$. This task is especially relevant for news reporting and Internet forums, in which opinions of various agents are expressed.

2.2. Sentiment classification

Sentiment classification includes two kinds of classification forms, i.e., binary sentiment classification and multi-class sentiment classification. Given a document set $D = \{d_1, \ldots, d_n\}$ and a predefined category set $C = \{\text{positive}, \text{negative}\}$, binary sentiment classification is to classify each $d_i$ in $D$ with a label in $C$. If we set $C^* = \{\text{strong positive}, \text{positive}, \text{neutral}, \text{negative}, \text{strong negative}\}$ and classify each $d_i$ in $D$ with a label in $C^*$, the problem becomes multi-class sentiment classification.

Most prior work on learning to identify sentiment has focused on the binary distinction of positive vs. negative. But it is often helpful to have more information than this binary distinction provides, especially if one is ranking items by recommendation or comparing several reviewers' opinions. Koppel and Schler (2005a, 2005b) show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between
positive and negative examples.

3. Applications of sentiment detection

In this section, we expound some emerging applications of sentiment detection.

3.1. Products comparison

It is common practice for online merchants to ask their customers to review the products that they have purchased. With more and more people using the Web to express opinions, the number of reviews that a product receives grows rapidly. Most research on these reviews has focused on automatically classifying the products into "recommended" or "not recommended" (Pang, Lee, & Vaithyanathan, 2002; Ranjan Das & Chen, 2001; Terveen, Hill, Amento, McDonald, & Creter, 1997). But every product has several features, of which people may be interested in only some. Moreover, a product that has shortcomings in one aspect probably has merits in another (Morinaga, Yamanishi, Tateishi, & Fukushima, 2002; Taboada, Gillies, & McFetridge, 2006).

One application is to analyze the online reviews and present consumers' opinions of different products visually, so that with a single glance the user can clearly see the advantages and weaknesses of each product in the minds of consumers. A potential customer can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him or her to decide which product to buy. For a product manufacturer, the comparison makes it easy to gather marketing intelligence and product benchmarking information.

Liu, Hu, and Cheng (2005) proposed a novel framework for analyzing and comparing consumer opinions of competing products. A prototype system called Opinion Observer was implemented. To enable the visualization, two tasks were performed: (1) identifying product features that customers have expressed their opinions on, based on language pattern mining techniques; such features form the basis for the comparison. (2) For each feature, identifying whether the opinion from each reviewer is positive or negative, if any. Different users can
visualize and compare opinions of different products using a user interface. The user simply chooses the products that he or she wishes to compare, and the system then retrieves the analyzed results of these products and displays them in the interface.

3.2. Opinion summarization

The number of online reviews that a product receives grows rapidly, especially for some popular products. Furthermore, many reviews are long and have only a few sentences containing opinions on the product. This makes it hard for a potential customer to read them to make an informed decision on whether to purchase the product. The large number of reviews also makes it hard for product manufacturers to keep track of customer opinions of their products, because many merchant sites may sell their products and the manufacturer may produce many kinds of products.

Opinion summarization (Ku, Lee, Wu, & Chen, 2005; Philip et al., 2004) summarizes the opinions of articles by reporting sentiment polarities, degree and the correlated events. With opinion summarization, a customer can easily see how existing customers feel about a product, and the product manufacturer can learn why people with different stances like it or what they complain about.

Hu and Liu (2004a, 2004b) carried out work of this kind. Given a set of customer reviews of a particular product, the task involves three subtasks: (1) identifying features of the product that customers have expressed their opinions on (called product features); (2) for each feature, identifying review sentences that give positive or negative opinions; and (3) producing a summary using the discovered information.

Ku, Liang, and Chen (2006) investigated both news and web blog articles. In their research, TREC, NTCIR and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at word, sentence and document level are proposed.
The issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. Opinion summarizations are visualized by representative sentences. Finally, an opinionated curve showing the degree of support and non-support along the timeline is illustrated by an opinion tracking system.

3.3. Opinion reason mining

In the opinion analysis area, finding the polarity of opinions, or aggregating and quantifying the degree assessment of opinions scattered throughout web pages, is not enough. We can go on to a more critical part of in-depth opinion assessment, such as finding reasons in opinion-bearing texts. For example, in film reviews, information such as "found 200 positive reviews and 150 negative reviews" may not fully satisfy the information needs of different people. More useful information would be "This film is great for its novel originality" or "Poor acting, which makes the film awful".

Opinion reason mining tries to identify one of the critical elements of online reviews to answer the question, "What are the reasons that the author of this review likes or dislikes the product?" To answer this question, we should extract not only sentences that contain opinion-bearing expressions, but also sentences that give the reasons why the author of a review wrote it (Cardie, Wiebe, Wilson, & Litman, 2003; Clarke & Terra, 2003; Li & Yamanishi, 2001; Stoyanov, Cardie, Litman, & Wiebe, 2004).

Kim and Hovy (2005) proposed a method for detecting opinion-bearing expressions. In their subsequent work (Kim & Hovy, 2006), they collected a large set of ⟨review text, pros, cons⟩ triplets from , in which each review's author explicitly states pros and cons phrases in their respective categories along with the review text. Their automatic labeling system first collects phrases in the pro and con fields and then searches the main review text in order to collect sentences corresponding to those phrases. Then the system annotates this sentence with the
appropriate "pro" or "con" label. All remaining sentences with neither label are marked as "neither". After labeling all the data, they use it to train their pro and con sentence recognition system.

3.4. Other applications

Thomas, Pang, and Lee (2006) try to determine from the transcripts of US Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. Mullen and Malouf (2006) describe a statistical sentiment analysis method for political discussion group postings that judges whether a post expresses a political viewpoint opposing that of the original post. Moreover, there are some potential applications of sentiment detection, such as online message sentiment filtering, e-mail sentiment classification, analysis of web-blog authors' attitudes, and sentiment web search engines.

4. Subjectivity classification

Subjectivity classification is the task of investigating whether a paragraph presents the opinion of its author or reports facts. Most research has shown a very tight relation between subjectivity classification and document sentiment classification (Pang & Lee, 2004; Wiebe, 2000; Wiebe, Bruce, & O'Hara, 1999; Wiebe, Wilson, Bruce, Bell, & Martin, 2002; Yu & Hatzivassiloglou, 2003). Subjectivity classification can prevent the polarity classifier from considering irrelevant or even potentially misleading text.
Pang and Lee (2004) find that subjectivity detection can compress reviews into much shorter extracts that still retain polarity information at a level comparable to that of the full review.

Much of the research in automated opinion detection has been performed and proposed for discriminating between subjective and objective text at the document and sentence levels (Bruce & Wiebe, 1999; Finn, Kushmerick, & Smyth, 2002; Hatzivassiloglou & Wiebe, 2000; Wiebe, 2000; Wiebe et al., 1999; Wiebe et al., 2002; Yu & Hatzivassiloglou, 2003). In this section, we discuss some approaches used to automatically classify a document as objective or subjective.

4.1. Similarity approach

The similarity approach to classifying sentences as opinions or facts explores the hypothesis that, within a given topic, opinion sentences will be more similar to other opinion sentences than to factual sentences (Yu & Hatzivassiloglou, 2003). It measures sentence similarity based on shared words, phrases, and WordNet synsets (Dagan, Shaul, & Markovitch, 1993; Dagan, Pereira, & Lee, 1994; Leacock & Chodorow, 1998; Miller & Charles, 1991; Resnik, 1995; Zhang, Xu, & Callan, 2002).

To measure the overall similarity of a sentence to the opinion or fact documents, we go through three steps. First, use an IR method to acquire the documents that are on the same topic as the sentence in question. Second, calculate its similarity scores with each sentence in those documents and take the average value. Third, assign the sentence to the category (opinion or fact) for which the average value is the highest. Alternatively, in the frequency variant, we count, for each category, how many of the similarity scores exceed a predetermined threshold.

4.2. Naive Bayes classifier

The Naive Bayes classifier is a commonly used supervised machine learning algorithm. This approach presupposes that all sentences in opinion or factual articles are opinion or fact sentences, respectively. Naive Bayes uses the sentences in opinion and fact documents as the examples of the
two categories. The features include words, bigrams, and trigrams, as well as the parts of speech in each sentence. In addition, the presence of semantically oriented (positive and negative) words in a sentence is an indicator that the sentence is subjective. Therefore, the feature set can include the counts of positive and negative words in the sentence, as well as counts of the polarities of sequences of semantically oriented words (e.g., "++" for two consecutive positively oriented words). It also includes the counts of parts of speech combined with polarity information (e.g., "JJ+" for positive adjectives), as well as features encoding the polarity (if any) of the head verb, the main subject, and their immediate modifiers.

Generally speaking, Naive Bayes assigns a document $d_j$ (represented by a vector $\vec{d}_j$) to the class $c_i$ that maximizes $P(c_i \mid \vec{d}_j)$ by applying Bayes' rule as follows:

$$P(c_i \mid \vec{d}_j) = \frac{P(c_i)\,P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)} \tag{1}$$

where $P(\vec{d}_j)$ is the probability that a randomly picked document $d$ has vector $\vec{d}_j$ as its representation, and $P(c)$ is the probability that a randomly picked document belongs to class $c$.

To estimate the term $P(\vec{d}_j \mid c)$, Naive Bayes decomposes it by assuming that all the features in $\vec{d}_j$ (represented by $f_k$, $k = 1, \ldots, m$) are conditionally independent given the class, i.e.,

$$P(c_i \mid \vec{d}_j) = \frac{P(c_i)\,\prod_{k=1}^{m} P(f_k \mid c_i)}{P(\vec{d}_j)} \tag{2}$$

4.3. Multiple Naive Bayes classifier

The assumption that all sentences in opinion or factual articles are opinion or fact sentences is an approximation. To address this, the multiple Naive Bayes classifier approach applies an algorithm using multiple classifiers, each relying on a different subset of features. The goal is to reduce the training set to the sentences that are most likely to be correctly labeled, thus boosting classification accuracy.

Given separate sets of features $F_1, F_2, \ldots, F_m$, it trains separate Naive Bayes classifiers $C_1, C_2, \ldots, C_m$ corresponding to each feature set.
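Each of these classifiers applies the same decision rule as Eq. (2). The following is a minimal sketch of that rule, not the authors' implementation: it works in log space, drops the denominator because it is the same for every class, and uses add-one smoothing, which is an assumption of this sketch rather than something the survey specifies. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Estimate log P(c) and smoothed log P(f | c) from tokenized documents."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        feature_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(feature_counts[c].values())
        # Add-one (Laplace) smoothing: an assumption of this sketch.
        log_likelihood[c] = {
            f: math.log((feature_counts[c][f] + 1) / (total + len(vocab)))
            for f in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class maximizing log P(c) + sum of log P(f | c).

    P(d) is identical for all classes, so it is omitted.
    Features never seen in training are ignored.
    """
    def score(c):
        return log_prior[c] + sum(
            log_likelihood[c][f] for f in tokens if f in log_likelihood[c]
        )
    return max(log_prior, key=score)


# Toy training data (hypothetical): two opinion and two fact sentences.
train_docs = [
    ["i", "love", "it"],
    ["terrible", "i", "think"],
    ["released", "in", "2009"],
    ["weighs", "two", "kilograms"],
]
train_labels = ["opinion", "opinion", "fact", "fact"]
prior, likelihood = train_naive_bayes(train_docs, train_labels)
```

With this toy data, `classify(["i", "think"], prior, likelihood)` selects the opinion class, because both words occur only in the opinion examples.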
Assuming as ground truth the information provided by the document labels, and that all sentences inherit the status of their document as opinions or facts, it first trains $C_1$ on the entire training set, then uses the resulting classifier to predict labels for the training set. The sentences that receive a label different from the assumed truth are then removed, and $C_2$ is trained on the remaining sentences. This process is repeated iteratively until no more sentences can be removed. Yu and Hatzivassiloglou (2003) report results using five feature sets, starting from words alone and adding in bigrams, trigrams, part-of-speech, and polarity.

H. Tang et al. / Expert Systems with Applications 36 (2009) 10760–10773

4.4. Cut-based classifier

The cut-based classifier approach puts forward the hypothesis that text spans (items) occurring near each other (within discourse boundaries) may share the same subjectivity status (Pang & Lee, 2004). Based on this hypothesis, Pang and Lee supplied their algorithm with pair-wise interaction information, e.g., to specify that two particular sentences should ideally receive the same subjectivity label. This algorithm uses an efficient and intuitive graph-based formulation relying on finding minimum cuts.

Suppose there are $n$ items $x_1, x_2, \ldots, x_n$ to divide into two classes $C_1$ and $C_2$, and that we have access to two types of information:

$\mathrm{ind}_j(x_i)$: individual scores. These are non-negative estimates of each $x_i$'s preference for being in $C_j$ based on just the features of $x_i$ alone;
$\mathrm{assoc}(x_i, x_k)$: association scores. These are non-negative estimates of how important it is that $x_i$ and $x_k$ be in the same class.

The problem then becomes one of maximizing each item's score: its individual score for the class it is assigned to, minus its individual score for the other class, with a further penalty for putting associated items into different classes.
Thus, after some algebra, we arrive at the following optimization problem: assign the $x_i$ to $C_1$ and $C_2$ so as to minimize the partition cost:

$$\sum_{x \in C_1} \mathrm{ind}_2(x) + \sum_{x \in C_2} \mathrm{ind}_1(x) + \sum_{x_i \in C_1,\, x_k \in C_2} \mathrm{assoc}(x_i, x_k) \qquad (3)$$

This situation can be represented in the following manner. Build an undirected graph $G$ with vertices $\{v_1, \ldots, v_n, s, t\}$; the last two are, respectively, the source and sink. Add $n$ edges $(s, v_i)$, each with weight $\mathrm{ind}_1(x_i)$, and $n$ edges $(v_i, t)$, each with weight $\mathrm{ind}_2(x_i)$. Finally, add $\binom{n}{2}$ edges $(v_i, v_k)$, each with weight $\mathrm{assoc}(x_i, x_k)$. A cut $(S, T)$ of $G$ is a partition of its nodes into sets $S = \{s\} \cup S'$ and $T = \{t\} \cup T'$, where $s \notin S'$, $t \notin T'$. Its cost $\mathrm{cost}(S, T)$ is the sum of the weights of all edges crossing from $S$ to $T$. A minimum cut of $G$ is one of minimum cost. Solving the problem thus reduces to finding a minimum cut of $G$.

5. Word sentiment classification

The task of document sentiment classification has usually involved the manual or semi-manual construction of semantic orientation word lexicons (Hatzivassiloglou & McKeown, 1997; Hatzivassiloglou & Wiebe, 2000; Lin, 1998; Pereira, Tishby, & Lee, 1993; Riloff, Wiebe, & Wilson, 2003; Turney & Littman, 2002; Wiebe, 2000), which are built by word sentiment classification techniques. For instance, Das and Chen (2001) used a classifier on investor bulletin boards to see if apparently positive postings were correlated with stock price, in which several scoring methods were employed in conjunction with a manually crafted lexicon. Classifying the semantic orientation of individual words or phrases, such as whether they are positive or negative or have different intensities, generally relies on a pre-selected set of seed words and sometimes on linguistic heuristics (for example, Lin (1998) and Pereira et al. (1993) used linguistic co-locations to group words with similar uses or meanings).

Some studies showed that restricting features to adjectives for word sentiment classification would improve performance (Andreevskaia & Bergler, 2006; Turney & Littman, 2002; Wiebe, 2000). However, more
research showed that most adjectives and adverbs, as well as a small group of nouns and verbs, possess semantic orientation (Andreevskaia & Bergler, 2006; Esuli & Sebastiani, 2005; Gamon & Aue, 2005; Takamura, Inui, & Okumura, 2005; Turney & Littman, 2003).

Automatic methods of sentiment annotation at the word level can be grouped into two major categories: (1) corpus-based approaches and (2) dictionary-based approaches. The first group includes methods that rely on syntactic or co-occurrence patterns of words in large texts to determine their sentiment (e.g., Hatzivassiloglou & McKeown, 1997; Turney & Littman, 2002; Yu & Hatzivassiloglou, 2003 and others). The second group uses WordNet information, especially synsets and hierarchies, to acquire sentiment-marked words (Hu & Liu, 2004a; Kim & Hovy, 2004) or to measure the similarity between candidate words and sentiment-bearing words such as good and bad (Kamps, Marx, Mokken, & de Rijke, 2004).

5.1. Analysis by conjunctions between adjectives

This method attempts to predict the orientation of subjective adjectives by analyzing pairs of adjectives (conjoined by and, or, but, either-or, or neither-nor) which are extracted from a large unlabelled document set. The underlying intuition is that the act of conjoining adjectives is subject to linguistic constraints on the orientation of the adjectives involved (e.g. and usually conjoins two adjectives of the same orientation, while but conjoins two adjectives of opposite orientation). This is shown in the following three sentences (where the first two are perceived as correct and the third is perceived as incorrect) taken from Hatzivassiloglou and McKeown (1997):

"The tax proposal was simple and well received by the public".
"The tax proposal was simplistic but well received by the public".
"The tax proposal was simplistic and well received by the public".

To infer the orientation of adjectives from the analysis of conjunctions, a supervised learning algorithm can be performed in the following steps:

1. All conjunctions of adjectives are extracted from a set
of documents.
2. Train a log-linear regression classifier and then classify pairs of adjectives as having either the same or different orientation. The hypothesized same-orientation or different-orientation links between all pairs form a graph.
3. A clustering algorithm partitions the graph produced in step 2 into two clusters. By using the intuition that positive adjectives tend to be used more frequently than negative ones, the cluster containing the terms of higher average frequency in the document set is deemed to contain the positive terms.

The log-linear model offers an estimate of how good each prediction is, since it produces a value $y$ between 0 and 1, in which 1 corresponds to same-orientation, and one minus the produced value $y$ corresponds to dissimilarity. Same- and different-orientation links between adjectives form a graph. To partition the graph nodes into subsets of the same orientation, the clustering algorithm calculates an objective function $\Phi$ scoring each possible partition $P$ of the adjectives into two subgroups $C_1$ and $C_2$ as,

$$\Phi(P) = \sum_{i=1}^{2}\left(\frac{1}{|C_i|}\sum_{x,y \in C_i,\; x \neq y} d(x,y)\right) \qquad (4)$$

where $|C_i|$ is the cardinality of cluster $i$, and $d(x,y)$ is the dissimilarity between adjectives $x$ and $y$.

In general, because the model was unsupervised, it required an immense word corpus to function.

5.2. Analysis by lexical relations

This method presents a strategy for inferring semantic orientation from semantic association between words and phrases. It follows the hypothesis that two words tend to have the same semantic orientation if they have a strong semantic association. Therefore, it focuses on the use of lexical relations defined in WordNet to calculate the distance between adjectives.

Generally speaking, we can define a graph on the adjectives contained in the intersection between a term set (for example, the TL term set (Turney & Littman, 2003)) and WordNet, adding a link between two adjectives whenever WordNet indicates the presence of
a synonymy relation between them, and defining a distance measure using elementary notions from graph theory. In more detail, this approach can be realized in the following steps:

1. Construct relations at the level of words. The simplest approach here is just to collect all words in WordNet, and relate words that can be synonymous (i.e., they occur in the same synset).
2. Define a distance measure $d(t_1, t_2)$ between terms $t_1$ and $t_2$ on this graph, which amounts to the length of the shortest path that connects $t_1$ and $t_2$ (with $d(t_1, t_2) = +\infty$ if $t_1$ and $t_2$ are not connected).
3. Calculate the orientation of a term by its relative distance (Kamps et al., 2004) from the two seed terms good and bad, i.e.,

$$SO(t) = \frac{d(t, \mathit{bad}) - d(t, \mathit{good})}{d(\mathit{good}, \mathit{bad})} \qquad (5)$$

4. Obtain the result by the following rule: the adjective $t$ is deemed positive if $SO(t) > 0$, and the absolute value of $SO(t)$ determines, as usual, the strength of this orientation (the constant denominator $d(\mathit{good}, \mathit{bad})$ is a normalization factor that constrains all values of $SO$ to belong to the $[-1, 1]$ range).

5.3. Analysis by glosses

The characteristic of this method lies in the fact that it exploits the glosses (i.e. textual definitions) that one term has in an online "glossary", or dictionary. Its basic assumption is that if a word is semantically oriented in one direction, then the words in its gloss tend to be oriented in the same direction (Esuli & Sebastiani, 2005; Esuli & Sebastiani, 2006a, 2006b). For instance, the glosses of good and excellent will both contain appreciative expressions, while the glosses of bad and awful will both contain derogative expressions.

Generally, this method can determine the orientation of a term based on the classification of its glosses. The process is composed of the following steps:

1. A seed set $(S_p, S_n)$, representative of the two categories positive and negative, is provided as input.
2. Search for new terms to enrich $S_p$ and $S_n$. Use lexical relations (e.g. synonymy) with the terms contained in $S_p$ and $S_n$ from a thesaurus, or online dictionary, to find these new
terms, and then append them to $S_p$ or $S_n$.
3. For each term $t_i$ in $S'_p \cup S'_n$ or in the test set (i.e. the set of terms to be classified), a textual representation of $t_i$ is generated by collating all the glosses of $t_i$ as found in a machine-readable dictionary. Each such representation is converted into a vector by standard text indexing techniques.
4. A binary text classifier is trained on the terms in $S'_p \cup S'_n$ and then applied to the terms in the test set.

5.4. Analysis by both lexical relations and glosses

This method determines the sentiment of words and phrases by relying on both the lexical relations (synonymy, antonymy and hyponymy) and the glosses provided in WordNet.

Andreevskaia and Bergler (2006) proposed an algorithm named "STEP" (Semantic Tag Extraction Program). This algorithm starts with a small set of seed words of known sentiment value (positive or negative) and implements the following steps:

1. Extend the small set of seed words by adding synonyms, antonyms and hyponyms of the seed words supplied in WordNet. This step brings on average a 5-fold increase in the size of the original list, with the accuracy of the resulting list comparable to manual annotations.
2. Go through all WordNet glosses, identify the entries that contain in their definitions the sentiment-bearing words from the extended seed list, and add these head words to the corresponding category: positive, negative or neutral.
3. Disambiguate the glosses with a part-of-speech tagger, and eliminate errors of some words acquired in step 1 and from the seed list. At this step, it also filters out all those words that have been assigned contradicting sentiment values.

In this algorithm, for each word we need to compute a Net Overlap Score by subtracting the total number of runs assigning this word a negative sentiment from the total number of runs that consider it positive. In order to make the Net Overlap Score measure usable in sentiment tagging of texts and phrases, the absolute values of this score should be normalized and mapped onto a standard [0,1]
interval. STEP accomplishes this normalization by using the value of the Net Overlap Score as a parameter in the standard fuzzy membership S-function (Zadeh, 1987). This function maps the absolute values of the Net Overlap Score onto the interval from 0 to 1, where 0 corresponds to the absence of membership in the category of sentiment (in this case, these will be the neutral words) and 1 reflects the highest degree of membership in this category. The function can be defined as follows,

$$S(u; a, b, c) = \begin{cases} 0 & \text{if } u \le a \\ 2\left(\dfrac{u - a}{c - a}\right)^{2} & \text{if } a \le u \le b \\ 1 - 2\left(\dfrac{u - c}{c - a}\right)^{2} & \text{if } b \le u \le c \\ 1 & \text{if } u \ge c \end{cases} \qquad (6)$$

where $u$ is the Net Overlap Score for the word and $a$, $b$, $c$ are the three adjustable parameters: $a$ is set to 1, $c$ is set to 15 and $b$, which represents a crossover point, is defined as $b = (a + c)/2 = 8$. Defined this way, the S-function assigns the highest degree of membership (= 1) to words that have a Net Overlap Score $u \ge 15$.

The Net Overlap Score can be used as a measure of a word's degree of membership in the fuzzy category of sentiment: the core adjectives, which had the highest Net Overlap Score, were identified most accurately both by STEP and by human annotators, while the words on the periphery of the category had the lowest scores and were associated with low rates of inter-annotator agreement.

5.5. Analysis by pointwise mutual information

The general strategy of this method is to infer semantic orientation from semantic association. The underlying assumption is that a phrase has a positive semantic orientation when it has good associations (e.g., "romantic ambience") and a negative semantic orientation when it has bad associations (e.g., "horrific events") (Turney, 2002).
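Returning briefly to the S-function of Eq. (6) in Section 5.4, a minimal Python sketch of the normalization is given below. It assumes the standard form of Zadeh's S-function, in which the third branch is measured from c, together with the parameter values a = 1 and c = 15 quoted in the text; the function names and the helper that turns run counts into a score are hypothetical and only illustrate the idea.

```python
def s_function(u, a=1.0, c=15.0):
    """Zadeh's S-function with crossover point b = (a + c) / 2."""
    b = (a + c) / 2.0
    if u <= a:
        return 0.0
    if u <= b:
        return 2.0 * ((u - a) / (c - a)) ** 2
    if u < c:
        return 1.0 - 2.0 * ((u - c) / (c - a)) ** 2
    return 1.0


def membership(positive_runs, negative_runs):
    """Map the absolute Net Overlap Score of a word onto [0, 1]."""
    net_overlap = positive_runs - negative_runs
    return s_function(abs(net_overlap))


# A word tagged positive in 3 runs and negative in 11 runs sits at the crossover point.
print(membership(3, 11))
```

With these parameters a word reaches membership 0.5 at an absolute score of 8 and full membership at 15 or above, matching the description of the crossover point in the text.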
Example-based metonymy recognition for proper nouns

Example-Based Metonymy Recognition for Proper Nouns

Yves Peirsman
Quantitative Lexicology and Variational Linguistics
University of Leuven, Belgium
yves.peirsman@arts.kuleuven.be

Abstract

Metonymy recognition is generally approached with complex algorithms that rely heavily on the manual annotation of training and test data. This paper will relieve this complexity in two ways. First, it will show that the results of the current learning algorithms can be replicated by the 'lazy' algorithm of Memory-Based Learning. This approach simply stores all training instances to its memory and classifies a test instance by comparing it to all training examples. Second, this paper will argue that the number of labelled training examples that is currently used in the literature can be reduced drastically. This finding can help relieve the knowledge acquisition bottleneck in metonymy recognition, and allow the algorithms to be applied on a wider scale.

1 Introduction

Metonymy is a figure of speech that uses "one entity to refer to another that is related to it" (Lakoff and Johnson, 1980, p.35). In example (1), for instance, China and Taiwan stand for the governments of the respective countries:

(1) China has always threatened to use force if Taiwan declared independence. (BNC)

Metonymy resolution is the task of automatically recognizing these words and determining their referent. It is therefore generally split up into two phases: metonymy recognition and metonymy interpretation (Fass, 1997).

The earliest approaches to metonymy recognition identify a word as metonymical when it violates selectional restrictions (Pustejovsky, 1995). Indeed, in example (1), China and Taiwan both violate the restriction that threaten and declare require an animate subject, and thus have to be interpreted metonymically. However, it is clear that many metonymies escape this characterization. Nixon in example (2) does not violate the selectional restrictions of the verb to bomb, and yet, it metonymically refers to the army under Nixon's
command.

(2) Nixon bombed Hanoi.

This example shows that metonymy recognition should not be based on rigid rules, but rather on statistical information about the semantic and grammatical context in which the target word occurs.

This statistical dependency between the reading of a word and its grammatical and semantic context was investigated by Markert and Nissim (2002a) and Nissim and Markert (2003; 2005). The key to their approach was the insight that metonymy recognition is basically a sub-problem of Word Sense Disambiguation (WSD). Possibly metonymical words are polysemous, and they generally belong to one of a number of pre-defined metonymical categories. Hence, like WSD, metonymy recognition boils down to the automatic assignment of a sense label to a polysemous word. This insight thus implied that all machine learning approaches to WSD can also be applied to metonymy recognition.

There are, however, two differences between metonymy recognition and WSD. First, theoretically speaking, the set of possible readings of a metonymical word is open-ended (Nunberg, 1978). In practice, however, metonymies tend to stick to a small number of patterns, and their labels can thus be defined a priori. Second, classic WSD algorithms take training instances of one particular word as their input and then disambiguate test instances of the same word. By contrast, since all words of the same semantic class may undergo the same metonymical shifts, metonymy recognition systems can be built for an entire semantic class instead of one particular word (Markert and Nissim, 2002a).

To this goal, Markert and Nissim extracted from the BNC a corpus of possibly metonymical words from two categories: country names (Markert and Nissim, 2002b) and organization names (Nissim and Markert, 2005). All these words were annotated with a semantic label: either literal or the metonymical category they belonged to. For the country names, Markert and Nissim distinguished between place-for-people, place-for-event and
place-for-product. For the organization names, the most frequent metonymies are organization-for-members and organization-for-product. In addition, Markert and Nissim used a label mixed for examples that had two readings, and othermet for examples that did not belong to any of the pre-defined metonymical patterns.

For both categories, the results were promising. The best algorithms returned an accuracy of 87% for the countries and of 76% for the organizations. Grammatical features, which gave the function of a possibly metonymical word and its head, proved indispensable for the accurate recognition of metonymies, but led to extremely low recall values, due to data sparseness. Therefore Nissim and Markert (2003) developed an algorithm that also relied on semantic information, and tested it on the mixed country data. This algorithm used Dekang Lin's (1998) thesaurus of semantically similar words in order to search the training data for instances whose head was similar, and not just identical, to the test instances. Nissim and Markert (2003) showed that a combination of semantic and grammatical information gave the most promising results (87%).

However, Nissim and Markert's (2003) approach has two major disadvantages. The first of these is its complexity: the best-performing algorithm requires smoothing, backing-off to grammatical roles, iterative searches through clusters of semantically similar words, etc. In section 2, I will therefore investigate if a metonymy recognition algorithm needs to be that computationally demanding. In particular, I will try and replicate Nissim and Markert's results with the 'lazy' algorithm of Memory-Based Learning.

The second disadvantage of Nissim and Markert's (2003) algorithms is their supervised nature.
Because they rely so heavily on the manual annotation of training and test data, an extension of the classifiers to more metonymical patterns is extremely problematic. Yet, such an extension is essential for many tasks throughout the field of Natural Language Processing, particularly Machine Translation. This knowledge acquisition bottleneck is a well-known problem in NLP, and many approaches have been developed to address it. One of these is active learning, or sample selection, a strategy that makes it possible to selectively annotate those examples that are most helpful to the classifier. It has previously been applied to NLP tasks such as parsing (Hwa, 2002; Osborne and Baldridge, 2004) and Word Sense Disambiguation (Fujii et al., 1998). In section 3, I will introduce active learning into the field of metonymy recognition.

2 Example-based metonymy recognition

As I have argued, Nissim and Markert's (2003) approach to metonymy recognition is quite complex. I therefore wanted to see if this complexity can be dispensed with, and if it can be replaced with the much more simple algorithm of Memory-Based Learning. The advantages of Memory-Based Learning (MBL), which is implemented in the TiMBL classifier (Daelemans et al., 2004), are twofold. First, it is based on a plausible psychological hypothesis of human learning. It holds that people interpret new examples of a phenomenon by comparing them to "stored representations of earlier experiences" (Daelemans et al., 2004, p.19). This contrasts to many other classification algorithms, such as Naive Bayes, whose psychological validity is an object of heavy debate. Second, as a result of this learning hypothesis, an MBL classifier such as TiMBL eschews the formulation of complex rules or the computation of probabilities during its training phase. Instead it stores all training vectors to its memory, together with their labels. In the test phase, it computes the distance between the test vector and all these training vectors, and simply returns the most
frequent label of the most similar training examples.

One of the most important challenges in Memory-Based Learning is adapting the algorithm to one's data. This includes finding a representative seed set as well as determining the right distance measures. For my purposes, however, TiMBL's default settings proved more than satisfactory. TiMBL implements the IB1 and IB2 algorithms that were presented in Aha et al. (1991), but adds a broad choice of distance measures. Its default implementation of the IB1 algorithm, which is called IB1-IG in full (Daelemans and Van den Bosch, 1992), proved most successful in my experiments. It computes the distance between two vectors $X$ and $Y$ by adding up the weighted distances $\delta$ between their corresponding feature values $x_i$ and $y_i$:

$$\Delta(X, Y) = \sum_{i=1}^{n} w_i\, \delta(x_i, y_i) \qquad (3)$$

The most important element in this equation is the weight that is given to each feature. In IB1-IG, features are weighted by their Gain Ratio (equation 4), the division of the feature's Information Gain by its split info. Information Gain, the numerator in equation (4), "measures how much information it [feature i] contributes to our knowledge of the correct class label [...] by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of that feature" (Daelemans et al., 2004, p.20). In order not "to overestimate the relevance of features with large numbers of values" (Daelemans et al., 2004, p.21), this Information Gain is then divided by the split info, the entropy of the feature values (equation 5). In the following equations, $C$ is the set of class labels, $H(C)$ is the entropy of that set, and $V_i$ is the set of values for feature $i$.

$$w_i = \frac{H(C) - \sum_{v \in V_i} P(v) \times H(C \mid v)}{si(i)} \qquad (4)$$

$$si(i) = -\sum_{v \in V_i} P(v) \log_2 P(v) \qquad (5)$$

[...]

Footnote 2: This data is publicly available and can be downloaded from /mnissim/mascara.

[Table 1: Results for the mixed country data. TiMBL: my TiMBL results; N&M: Nissim and Markert's (2003) results. Only part of the table survives: the cells 86.6% and 49.5%, and the cells 81.4% and 62.7% in the N&M row; the column headers are incomplete.]

[...] simple learning phase, TiMBL is able to replicate the results from Nissim and Markert (2003; 2005).
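To make the weighted distance of equation (3) and the Gain Ratio weighting concrete, here is a small self-contained Python sketch. It is not TiMBL's implementation: the overlap metric for δ (0 for identical values, 1 otherwise), the choice of k nearest training instances, the toy (function, head) vectors and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information Gain of one feature divided by its split info."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def distance(x, y, weights):
    """Weighted sum of per-feature overlap distances (assumed delta: 0 if equal, else 1)."""
    return sum(w for w, xi, yi in zip(weights, x, y) if xi != yi)


def classify(test, train_vectors, train_labels, weights, k=1):
    """Return the most frequent label among the k closest training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(test, pair[0], weights))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Hypothetical training data: (grammatical function, head) with a reading label.
train = [("subj", "threaten"), ("subj", "declare"), ("pp", "in"), ("pp", "to")]
labels = ["met", "met", "lit", "lit"]
weights = [gain_ratio([vec[i] for vec in train], labels) for i in range(2)]
print(weights, classify(("subj", "bomb"), train, labels, weights))
```

In this toy data both features separate the classes perfectly, but the head feature takes four distinct values, so its split info is larger and its weight is halved. This is the correction for many-valued features that the quoted passage describes.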
As table 1 shows, accuracy for the mixed country data is almost identical to Nissim and Markert's figure, and precision, recall and F-score for the metonymical class lie only slightly lower. TiMBL's results for the Hungary data were similar, and equally comparable to Markert and Nissim's (Katja Markert, personal communication). Note, moreover, that these results were reached with grammatical information only, whereas Nissim and Markert's (2003) algorithm relied on semantics as well.

Next, table 2 indicates that TiMBL's accuracy for the mixed organization data lies about 1.5% below Nissim and Markert's (2005) figure. This result should be treated with caution, however. First, Nissim and Markert's available organization data had not yet been annotated for grammatical features, and my annotation may slightly differ from theirs. Second, Nissim and Markert used several feature vectors for instances with more than one grammatical role and filtered all mixed instances from the training set. A test instance was treated as mixed only when its several feature vectors were classified differently. My experiments, in contrast, were similar to those for the location data, in that each instance corresponded to one vector. Hence, the slightly lower performance of TiMBL is probably due to differences between the two experiments.

These first experiments thus demonstrate that Memory-Based Learning can give state-of-the-art performance in metonymy recognition. In this respect, it is important to stress that the results for the country data were reached without any semantic information, whereas Nissim and Markert's (2003) algorithm used Dekang Lin's (1998) clusters of semantically similar words in order to deal with data sparseness. This fact, together [...]

[Table 2: only part of the table survives: the cells 78.65% and 65.10% in the TiMBL row, and 76.0% followed by an empty cell in the second row; the column headers are incomplete.]

Figure 1: Accuracy learning curves for the mixed country data with and without semantic information.

[...] in more detail. As figure 1 indicates, with respect to overall accuracy, semantic features have a negative influence: the learning curve
with both features climbs much more slowly than that with only grammatical features. Hence, contrary to my expectations, grammatical features seem to allow a better generalization from a limited number of training instances. With respect to the F-score on the metonymical category in figure 2, the differences are much less pronounced. Both features give similar learning curves, but semantic features lead to a higher final F-score. In particular, the use of semantic features results in a lower precision figure, but a higher recall score. Semantic features thus cause the classifier to slightly overgeneralize from the metonymic training examples.

There are two possible reasons for this inability of semantic information to improve the classifier's performance. First, WordNet's synsets do not always map well to one of our semantic labels: many are rather broad and allow for several readings of the target word, while others are too specific to make generalization possible. Second, there is the predominance of prepositional phrases in our data. With their closed set of heads, the number of examples that benefits from semantic information about its head is actually rather small.
Nevertheless, my first round of experiments has indicated that Memory-Based Learning is a simple but robust approach to metonymy recognition. It is able to replace current approaches that need smoothing or iterative searches through a thesaurus, with a simple, distance-based algorithm.

Figure 3: Accuracy learning curves for the country data with random and maximum-distance selection of training examples.

[...] over all possible labels. The algorithm then picks those instances with the lowest confidence, since these will contain valuable information about the training set (and hopefully also the test set) that is still unknown to the system.

One problem with Memory-Based Learning algorithms is that they do not directly output probabilities. Since they are example-based, they can only give the distances between the unlabelled instance and all labelled training instances. Nevertheless, these distances can be used as a measure of certainty, too: we can assume that the system is most certain about the classification of test instances that lie very close to one or more of its training instances, and less certain about those that are further away. Therefore the selection function that minimizes the probability of the most likely label can intuitively be replaced by one that maximizes the distance from the labelled training instances.

However, figure 3 shows that for the mixed country instances, this function is not an option.
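The maximum-distance selection function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance of an unlabelled instance to the labelled set is taken to be its distance to the nearest labelled instance, the unweighted overlap metric stands in for the classifier's own distance measure, and the toy instances and function names are hypothetical.

```python
def overlap_distance(x, y):
    """Number of feature positions in which two instances differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def distance_to_labelled(instance, labelled):
    """Assumed certainty proxy: distance to the closest labelled instance."""
    return min(overlap_distance(instance, seen) for seen in labelled)


def select_most_distant(unlabelled, labelled, batch_size=10):
    """Pick the unlabelled instances that lie furthest from the labelled set."""
    ranked = sorted(unlabelled,
                    key=lambda inst: distance_to_labelled(inst, labelled),
                    reverse=True)
    return ranked[:batch_size]


# Hypothetical (grammatical function, head) instances.
labelled = [("subj", "threaten"), ("pp", "in")]
pool = [("subj", "declare"), ("gen", "economy"), ("pp", "in"), ("pp", "to")]
picked = select_most_distant(pool, labelled, batch_size=2)
print(picked)
```

An instance whose function and head are both unseen is always ranked first, which makes the outlier bias of a purely distance-based selection function easy to see.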
Both learning curves give the results of an algorithm that starts with fifty random instances, and then iteratively adds ten new training instances to this initial seed set. The algorithm behind the solid curve chooses these instances randomly, whereas the one behind the dotted line selects those that are most distant from the labelled training examples. In the first half of the learning process, both functions are equally successful; in the second the distance-based function performs better, but only slightly so.

Figure 4: Accuracy learning curves for the country data with random and maximum/minimum-distance selection of training examples.

There are two reasons for this bad initial performance of the active learning function. First, it is not able to distinguish between informative and unusual training instances. This is because a large distance from the seed set simply means that the particular instance's feature values are relatively unknown. This does not necessarily imply that the instance is informative to the classifier, however. After all, it may be so unusual and so badly representative of the training (and test) set that the algorithm had better exclude it, something that is impossible on the basis of distances only. This bias towards outliers is a well-known disadvantage of many simple active learning algorithms. A second type of bias is due to the fact that the data has been annotated with a few features only. More particularly, the present algorithm will keep adding instances whose head is not yet represented in the training set. This entails that it will put off adding instances whose function is pp, simply because other functions (subj, gen, ...) have a wider variety in heads. Again, the result is a labelled set that is not very representative of the entire training set.

There are, however, a few easy ways to increase the number of prototypical examples in the training set. In a second run of experiments, I used an active learning function that added not only those instances that
were most distant from the labelled training set, but also those that were closest to it. After a few test runs, I decided to add six distant and four close instances on each iteration. Figure 4 shows that such a function is indeed fairly successful. Because it builds a labelled training set that is more representative of the test set, this algorithm clearly reduces the number of annotated instances that is needed to reach a given performance.

Figure 5: Accuracy learning curves for the organization data with random and distance-based (AL) selection of training examples with a random seed set.

Despite its success, this function is obviously not yet a sophisticated way of selecting good training examples. The selection of the initial seed set in particular can be improved upon: ideally, this seed set should take into account the overall distribution of the training examples. Currently, the seeds are chosen randomly. This flaw in the algorithm becomes clear if it is applied to another data set: figure 5 shows that it does not outperform random selection on the organization data, for instance.

As I suggested, the selection of prototypical or representative instances as seeds can be used to make the present algorithm more robust. Again, it is possible to use distance measures to do this: before the selection of seed instances, the algorithm can calculate for each unlabelled instance its distance from each of the other unlabelled instances.
In this way, it can build a prototypical seed set by selecting those instances with the smallest distance on average. Figure 6 indicates that such an algorithm indeed outperforms random sample selection on the mixed organization data. For the calculation of the initial distances, each feature received the same weight. The algorithm then selected 50 random samples from the 'most prototypical' half of the training set. The other settings were the same as above.

With the present small number of features, however, such a prototypical seed set is not yet always as advantageous as it could be. A few experiments indicated that it did not lead to better performance on the mixed country data, for instance. However, as soon as a wider variety of features is taken into account (as with the organization data), the advan- [...]

[...] -pling can help choose those instances that are most helpful to the classifier. A few distance-based algorithms were able to drastically reduce the number of training instances that is needed for a given accuracy, both for the country and the organization names.

If current metonymy recognition algorithms are to be used in a system that can recognize all possible metonymical patterns across a broad variety of semantic classes, it is crucial that the required number of labelled training examples be reduced.
This paper has taken the first steps along this path and has set out some interesting questions for future research. This research should include the investigation of new features that can make classifiers more robust and allow us to measure their confidence more reliably. This confidence measurement can then also be used in semi-supervised learning algorithms, for instance, where the classifier itself labels the majority of training examples. Only with techniques such as selective sampling and semi-supervised learning can the knowledge acquisition bottleneck in metonymy recognition be addressed.

Acknowledgements

I would like to thank Mirella Lapata, Dirk Geeraerts and Dirk Speelman for their feedback on this project. I am also very grateful to Katja Markert and Malvina Nissim for their helpful information about their research.
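The distance-based selection and the prototypical seed set described in the metonymy paper above can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not give its distance metric here, so an unweighted Euclidean distance over numeric feature vectors is assumed, the distance of an instance to the labelled set is taken to be the distance to its nearest labelled neighbour, and the function names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Unweighted Euclidean distance (an assumed metric, every feature weighted equally)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_set(instance, labelled):
    """Distance from an instance to its nearest neighbour in the labelled set (assumption)."""
    return min(euclidean(instance, other) for other in labelled)


def select_batch(unlabelled, labelled, n_distant=6, n_close=4):
    """Return indices of the n_distant farthest and n_close nearest unlabelled instances."""
    ranked = sorted(range(len(unlabelled)),
                    key=lambda i: distance_to_set(unlabelled[i], labelled))
    close = ranked[:n_close]
    distant = ranked[max(0, len(ranked) - n_distant):] if n_distant else []
    # dict.fromkeys keeps the order and drops duplicates when the pool is small.
    return list(dict.fromkeys(distant + close))


def prototypical_seeds(unlabelled, n_seeds, rng):
    """Sample seeds at random from the 'most prototypical' half of the pool,
    i.e. the instances with the smallest mean distance to all other instances."""
    def mean_distance(i):
        others = [euclidean(unlabelled[i], unlabelled[j])
                  for j in range(len(unlabelled)) if j != i]
        return sum(others) / len(others)

    ranked = sorted(range(len(unlabelled)), key=mean_distance)
    half = ranked[:max(n_seeds, len(ranked) // 2)]
    return rng.sample(half, n_seeds)
```

With the defaults, `select_batch` mirrors the "six distant and four close" setting mentioned in the text; both numbers are ordinary parameters and can be varied.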
Foreign-Language Literature: Genetic Algorithms

Appendix I: English Translation

Part 1: Original English Text

Source: the book Nature-Inspired Metaheuristic Algorithms, Chapters 2 and 3. Publisher: Luniver Press, UK. Publication date: 2008.

Chapter 2 Genetic Algorithms

2.1 Introduction

The genetic algorithm (GA), developed by John Holland and his collaborators in the 1960s and 1970s, is a model or abstraction of biological evolution based on Charles Darwin's theory of natural selection. Holland was the first to use crossover and recombination, mutation, and selection in the study of adaptive and artificial systems. These genetic operators form the essential part of the genetic algorithm as a problem-solving strategy. Since then, many variants of genetic algorithms have been developed and applied to a wide range of optimization problems, from graph colouring to pattern recognition, from discrete systems (such as the travelling salesman problem) to continuous systems (e.g., the efficient design of airfoils in aerospace engineering), and from financial markets to multiobjective engineering optimization.

Genetic algorithms have many advantages over traditional optimization algorithms; the two most noticeable are the ability to deal with complex problems and parallelism. Genetic algorithms can deal with various types of optimization, whether the objective (fitness) function is stationary or non-stationary (changing with time), linear or nonlinear, continuous or discontinuous, or subject to random noise. As multiple offspring in a population act like independent agents, the population (or any subgroup) can explore the search space in many directions simultaneously. This feature makes the algorithms ideal for parallel implementation. Different parameters and even different groups of strings can be manipulated at the same time.

However, genetic algorithms also have some disadvantages. The formulation of the fitness function, the population size, the choice of important parameters such as the rates of mutation and crossover, and the selection criterion for the new population should all be handled carefully.
Any inappropriate choice will make it difficult for the algorithm to converge, or it will simply produce meaningless results.

2.2 Genetic Algorithms

2.2.1 Basic Procedure

The essence of genetic algorithms involves the encoding of an optimization function as arrays of bits or character strings to represent the chromosomes, the manipulation of these strings by genetic operators, and selection according to fitness, with the aim of finding a solution to the problem concerned. This is often done by the following procedure:

1) encoding of the objectives or optimization functions;
2) defining a fitness function or selection criterion;
3) creating a population of individuals;
4) running the evolution cycle or iterations: evaluating the fitness of all the individuals in the population, creating a new population by performing crossover, mutation, fitness-proportionate reproduction, etc., replacing the old population, and iterating again using the new population;
5) decoding the results to obtain the solution to the problem.

These steps can schematically be represented as the pseudo code of genetic algorithms shown in Fig. 2.1.

One iteration of creating a new population is called a generation. Fixed-length character strings are used in most genetic algorithms during each generation, although there is substantial research on variable-length strings and coding structures. The coding of the objective function is usually in the form of binary arrays, or real-valued arrays in adaptive genetic algorithms. For simplicity, we use binary strings for encoding and decoding. The genetic operators include crossover, mutation, and selection from the population.

The crossover of two parent strings is the main operator, applied with a higher probability, and is carried out by swapping one segment of one chromosome with the corresponding segment on another chromosome at a random position (see Fig. 2.2). The crossover carried out in this way is a single-point crossover.
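As a small illustration of the single-point crossover just described, here is a minimal Python sketch. It is not the book's code: the function name, the use of Python strings as chromosomes, and the explicit random generator argument are assumptions made for the example.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length bit strings at one random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2:
        raise ValueError("parents need at least two bits")
    # A cut in 1..len-1 guarantees that both swapped segments are non-empty.
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b, cut
```

Returning the cut point alongside the children is only a convenience for inspection; a multi-point variant would repeat the swap at several cut positions.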
Crossover at multiple points is also used in many genetic algorithms to increase the efficiency of the algorithms.

The mutation operation is achieved by flipping randomly selected bits (see Fig. 2.3), and the mutation probability is usually small. The selection of an individual in a population is carried out by the evaluation of its fitness, and it can remain in the new generation if a certain threshold of the fitness is reached or if the reproduction of a population is fitness-proportionate. That is to say, the individuals with higher fitness are more likely to reproduce.

2.2.2 Choice of Parameters

An important issue is the formulation or choice of an appropriate fitness function that determines the selection criterion in a particular problem. For the minimization of a function using genetic algorithms, one simple way of constructing a fitness function is to use the simplest form $F = A - y$, with $A$ being a large constant (though $A = 0$ will do) and $y = f(x)$; thus the objective is to maximize the fitness function and subsequently minimize the objective function $f(x)$. However, there are many different ways of defining a fitness function. For example, we can use the individual fitness assignment relative to the whole population

$$F(x_i) = \frac{f(\xi_i)}{\sum_{j=1}^{N} f(\xi_j)},$$

where $\xi_i$ is the phenotypic value of individual $i$, and $N$ is the population size. An appropriate form of the fitness function will make sure that the solutions with higher fitness are selected efficiently. A poor fitness function may result in incorrect or meaningless solutions.

Another important issue is the choice of various parameters. The crossover probability $p_c$ is usually very high, typically in the range of 0.7 to 1.0. On the other hand, the mutation probability $p_m$ is usually small (usually 0.001 to 0.05). If $p_c$ is too small, then crossover occurs sparsely, which is not efficient for evolution.
If the mutation probability is too high, the solutions could still 'jump around' even when the optimal solution is being approached.

The selection criterion is also important: how should the current population be selected so that the best individuals with higher fitness are preserved and passed on to the next generation? That is often carried out in association with certain elitism. The basic elitism is to select the fittest individual (in each generation), which will be carried over to the new generation without being modified by genetic operators. This ensures that the best solution is achieved more quickly.

Other issues include the multiple sites for mutation and the population size. Mutation at a single site is not very efficient; mutation at multiple sites will increase the evolution efficiency. However, too many mutants will make it difficult for the system to converge, or even make the system go astray to the wrong solutions. In reality, if the mutation rate is too high under high selection pressure, then the whole population might go extinct.

In addition, the choice of the right population size is also very important. If the population size is too small, there is not enough evolution going on, and there is a risk of the whole population going extinct. In the real world, ecological theory suggests that a species with a small population is in real danger of extinction. Even if the system carries on, there is still a danger of premature convergence. In a small population, if a significantly fitter individual appears too early, it may reproduce enough offspring to overwhelm the whole (small) population. This will eventually drive the system to a local optimum (not the global optimum).
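The basic elitism and the fitness-proportionate reproduction mentioned above can be sketched in Python as follows. This is a sketch under assumptions rather than the book's implementation: fitness values are assumed to be non-negative, `vary` is a hypothetical callable standing in for whatever crossover and mutation operators are applied, and all names are illustrative.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Fitness-proportionate selection; assumes non-negative fitness values."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0.0, total)
    running = 0.0
    last_positive = None
    for individual, fit in zip(population, fitnesses):
        running += fit
        if fit > 0:
            last_positive = individual
            if pick <= running:
                return individual
    # Only reached through floating-point round-off.
    return last_positive


def next_generation(population, fitness_fn, vary, rng):
    """Build a population of the same size, copying the fittest individual unchanged."""
    fitnesses = [fitness_fn(ind) for ind in population]
    best_index = max(range(len(population)), key=lambda i: fitnesses[i])
    new_population = [population[best_index]]  # the elite bypasses the genetic operators
    while len(new_population) < len(population):
        parent = roulette_select(population, fitnesses, rng)
        new_population.append(vary(parent))
    return new_population
```

Keeping exactly one elite individual is the simplest choice; carrying over more elites trades diversity for faster convergence, which ties in with the premature-convergence risk discussed above.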
On the other hand, if the population is too large, more evaluations of the objective function are needed, which will require extensive computing time.

Furthermore, more complex and adaptive genetic algorithms are under active research, and the literature on these topics is vast.

2.3 Implementation

Using the basic procedure described in the above section, we can implement genetic algorithms in any programming language. To demonstrate simply how it works, we have implemented a function optimization using a simple GA in both Matlab and Octave.

Consider the generalized De Jong's test function

$$f(\mathbf{x}) = \sum_{i=1}^{n} x_i^{2\alpha}, \qquad |x_i| \le r,$$

where $\alpha$ is a positive integer and $r > 0$ is the half length of the domain. This function has a minimum of $f_{\min} = 0$ at $\mathbf{x} = 0$. For a given value of $\alpha$, with $r = 100$ and $n = 5$ as well as a population size of 40 16-bit strings, the variations of the objective function during a typical run are shown in Fig. 2.4. Any two runs will give slightly different results due to the stochastic nature of genetic algorithms, but better estimates are obtained as the number of generations increases.

The well-known Easom function

$$f(x) = -\cos(x)\, e^{-(x-\pi)^2}, \qquad x \in [-10, 10],$$

has a global maximum $f_{\max} = 1$ at $x_* = \pi$ (see Fig. 2.5). Now we can use the following Matlab/Octave program to find its global maximum. In our implementation, we have used fixed-length 16-bit strings. The probabilities of crossover and mutation are respectively

$$p_c = 0.95, \qquad p_m = 0.05.$$

As it is a maximization problem, we can use the simplest fitness function $F = f(x)$. The outputs from a typical run are shown in Fig.
2.6 where the top figure shows the variations of the best estimates as they approach $x_* = \pi$, while the lower figure shows the variations of the fitness function.

% Genetic Algorithm (Simple Demo) Matlab/Octave Program
% Written by X S Yang (Cambridge University)
% Usage: gasimple or gasimple('x*exp(-x)');
function [bestsol, bestfun, count]=gasimple(funstr)
global solnew sol pop popnew fitness fitold f range;
if nargin<1,
    % Easom Function with fmax=1 at x=pi
    funstr='-cos(x)*exp(-(x-3.1415926)^2)';
end
range=[-10 10]; % Range/Domain
% Converting to an inline function
f=vectorize(inline(funstr));
% Generating the initial population
rand('state',0); % Reset the random generator
popsize=20; % Population size
MaxGen=100; % Max number of generations
count=0;    % counter
nsite=2;    % number of mutation sites
pc=0.95;    % Crossover probability
pm=0.05;    % Mutation probability
nsbit=16;   % String length (bits)
% Generating initial population
popnew=init_gen(popsize,nsbit);
fitness=zeros(1,popsize); % fitness array
% Display the shape of the function
x=range(1):0.1:range(2); plot(x,f(x));
% Initialize solution <- initial population
for i=1:popsize,
    solnew(i)=bintodec(popnew(i,:));
end
% Start the evolution loop
for i=1:MaxGen,
    % Record as the history
    fitold=fitness; pop=popnew; sol=solnew;
    for j=1:popsize,
        % Crossover pair
        ii=floor(popsize*rand)+1; jj=floor(popsize*rand)+1;
        % Cross over
        if pc>rand,
            [popnew(ii,:),popnew(jj,:)]=...
                crossover(pop(ii,:),pop(jj,:));
            % Evaluate the new pairs
            count=count+2;
            evolve(ii); evolve(jj);
        end
        % Mutation at n sites
        if pm>rand,
            kk=floor(popsize*rand)+1; count=count+1;
            popnew(kk,:)=mutate(pop(kk,:),nsite);
            evolve(kk);
        end
    end % end for j
    % Record the current best
    bestfun(i)=max(fitness);
    bestsol(i)=mean(sol(bestfun(i)==fitness));
end
% Display results
subplot(2,1,1); plot(bestsol); title('Best estimates');
subplot(2,1,2); plot(bestfun); title('Fitness');

% ------------- All sub functions ----------
% generation of initial population
function pop=init_gen(np,nsbit)
% String length=nsbit+1 with pop(:,1) for the Sign
pop=rand(np,nsbit+1)>0.5;

% Evolving the new generation
function evolve(j)
global solnew popnew fitness fitold pop sol f;
solnew(j)=bintodec(popnew(j,:));
fitness(j)=f(solnew(j));
if fitness(j)>fitold(j),
    pop(j,:)=popnew(j,:);
    sol(j)=solnew(j);
end

% Convert a binary string into a decimal number
function [dec]=bintodec(bin)
global range;
% Length of the string without sign
nn=length(bin)-1;
num=bin(2:end); % get the binary
% Sign=+1 if bin(1)=0; Sign=-1 if bin(1)=1.
Sign=1-2*bin(1);
dec=0;
% floating point/decimal place in the binary
dp=floor(log2(max(abs(range))));
for i=1:nn,
    dec=dec+num(i)*2^(dp-i);
end
dec=dec*Sign;

% Crossover operator
function [c,d]=crossover(a,b)
nn=length(a)-1;
% generating random crossover point
cpoint=floor(nn*rand)+1;
c=[a(1:cpoint) b(cpoint+1:end)];
d=[b(1:cpoint) a(cpoint+1:end)];

% Mutation operator
function anew=mutate(a,nsite)
nn=length(a); anew=a;
for i=1:nsite,
    j=floor(rand*nn)+1;
    anew(j)=mod(a(j)+1,2);
end

The above Matlab program can easily be extended to higher dimensions. In fact, there is no need to do any programming (if you prefer) because there are many software packages (either freeware or commercial) for genetic algorithms. For example, Matlab itself has an extra optimization toolbox.

Biology-inspired algorithms have many advantages over traditional optimization methods such as steepest descent, hill-climbing, and calculus-based techniques, due to their parallelism and their ability to locate very good approximate solutions in extremely large search spaces. Furthermore, more powerful new-generation algorithms can be formulated by combining existing and new evolutionary algorithms with classical optimization methods.

Chapter 3 Ant Algorithms

From the discussion of genetic algorithms, we know that we can improve the search efficiency by using randomness, which will also increase the diversity of the solutions so as to avoid being trapped in local optima. The selection of the best individuals is also equivalent to using memory.
In fact, there are other forms of selection, such as using a chemical messenger (pheromone), which is commonly used by ants, honey bees, and many other insects. In this chapter, we will discuss nature-inspired ant colony optimization (ACO), which is a metaheuristic method.

3.1 Behaviour of Ants

Ants are social insects and they live together in organized colonies whose population size can range from about 2 to 25 million. When foraging, a swarm of ants or mobile agents interact or communicate in their local environment. Each ant can lay scent chemicals or pheromone so as to communicate with others, and each ant is also able to follow the route marked with pheromone laid by other ants. When ants find a food source, they will mark it with pheromone and also mark the trails to and from it. From the initial random foraging route, the pheromone concentration varies and the ants follow the route with higher pheromone concentration, and the pheromone is enhanced by the increasing number of ants. As more and more ants follow the same route, it becomes the favoured path. Thus, some favourite routes (often the shortest or most efficient) emerge. This is actually a positive feedback mechanism.

Emergent behaviour exists in an ant colony, and such emergence arises from simple interactions among individual ants. Individual ants act according to simple and local information (such as pheromone concentration) to carry out their activities. Although there is no master ant overseeing the entire colony and broadcasting instructions to the individual ants, organized behaviour still emerges automatically. Such emergent behaviour is therefore similar to other self-organized phenomena which occur in many processes in nature, such as the pattern formation in animal skins (tiger and zebra skins).

The foraging pattern of some ant species (such as the army ants) can show extraordinary regularity. Army ants search for food along some regular routes with an angle of about $123^\circ$ apart.
We do not know how they manage to follow such regularity, but studies show that they could move into an area, build a bivouac, and start foraging. On the first day, they forage in a random direction, say, the north, and travel a few hundred metres, then branch out to cover a large area. The next day, they will choose a different direction, which is about $123^\circ$ from the direction on the previous day, and cover a large area. On the following day, they again choose a different direction about $123^\circ$ from the second day's direction. In this way, they cover the whole area over about 2 weeks, and then they move out to a different location to build a bivouac and forage again.

The interesting thing is that they do not use the angle of $360^\circ/3 = 120^\circ$ (this would mean that on the fourth day, they would search the empty area already foraged on the first day). The beauty of this $123^\circ$ angle is that it leaves an angle of about $10^\circ$ from the direction on the first day. This means they cover the whole circle in 14 days without repeating (or covering a previously foraged area). This is an amazing phenomenon.

3.2 Ant Colony Optimization

Based on these characteristics of ant behaviour, scientists have developed a number of powerful ant colony algorithms, with important progress made in recent years. Marco Dorigo pioneered the research in this area in 1992. In fact, if we use only some features of the behaviour of ants and add some new characteristics, we can devise a class of new algorithms.

The basic steps of ant colony optimization (ACO) can be summarized as the pseudo code shown in Fig. 3.1.

Two important issues here are the probability of choosing a route and the evaporation rate of pheromone. There are a few ways of solving these problems, although it is still an area of active research. Here we introduce the current best method.
For a network routing problem, the probability of ants at a particular node $i$ choosing the route from node $i$ to node $j$ is given by

$$p_{ij} = \frac{\phi_{ij}^{\alpha}\, d_{ij}^{\beta}}{\sum_{i,j=1}^{n} \phi_{ij}^{\alpha}\, d_{ij}^{\beta}},$$

where $\alpha > 0$ and $\beta > 0$ are the influence parameters, and their typical values are $\alpha \approx \beta \approx 2$. $\phi_{ij}$ is the pheromone concentration on the route between $i$ and $j$, and $d_{ij}$ the desirability of the same route. Some knowledge about the route, such as the distance $s_{ij}$, is often used so that $d_{ij} \propto 1/s_{ij}$, which implies that shorter routes will be selected due to their shorter travelling time, and thus the pheromone concentrations on these routes are higher.

This probability formula reflects the fact that ants would normally follow the paths with higher pheromone concentrations. In the simpler case when $\alpha = \beta = 1$, the probability of choosing a path by ants is proportional to the pheromone concentration on the path. The denominator normalizes the probability so that it is in the range between 0 and 1.

The pheromone concentration can change with time due to the evaporation of pheromone. Furthermore, the advantage of pheromone evaporation is that the system can avoid being trapped in local optima. If there is no evaporation, then the path randomly chosen by the first ants will become the preferred path, because their pheromone attracts the other ants. For a constant rate $\gamma$ of pheromone decay or evaporation, the pheromone concentration usually varies with time exponentially

$$\phi(t) = \phi_0\, e^{-\gamma t},$$

where $\phi_0$ is the initial concentration of pheromone and $t$ is time. If $\gamma t \ll 1$, then we have $\phi(t) \approx (1 - \gamma t)\,\phi_0$. For the unitary time increment $\Delta t = 1$, the evaporation can be approximated by $\phi^{t+1} \leftarrow (1 - \gamma)\,\phi^{t}$. Therefore, we have the simplified pheromone update formula:

$$\phi_{ij}^{t+1} = (1 - \gamma)\,\phi_{ij}^{t} + \delta\phi_{ij}^{t},$$

where $\gamma \in [0, 1]$ is the rate of pheromone evaporation. The increment $\delta\phi_{ij}^{t}$ is the amount of pheromone deposited at time $t$ along route $i$ to $j$ when an ant travels a distance $L$. Usually $\delta\phi_{ij}^{t} \propto 1/L$. If there are no ants on a route, then the pheromone deposit is zero.

There are other variations to these basic procedures.
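The route-choice probability and the simplified pheromone update described above can be illustrated with a short Python sketch. It rests on a few assumptions that are not spelled out in the text: the normalization is taken over the candidate routes leaving the current node, desirability is supplied by the caller (for instance as the reciprocal of the route length), and the names `alpha`, `beta` (influence parameters) and `gamma` (evaporation rate) are labels chosen for this example.

```python
def route_probabilities(pheromone, desirability, alpha=2.0, beta=2.0):
    """Probability of taking each candidate route from the current node.

    pheromone[k] and desirability[k] describe the k-th candidate route;
    the weights pheromone**alpha * desirability**beta are normalized to sum to 1.
    """
    weights = [(phi ** alpha) * (d ** beta)
               for phi, d in zip(pheromone, desirability)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one route needs a positive weight")
    return [w / total for w in weights]


def update_pheromone(phi, gamma, deposit=0.0):
    """One time step: evaporate a fraction gamma, then add the new deposit."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * phi + deposit
```

For two routes with equal pheromone and lengths 2 and 1, taking desirability as reciprocal length and both influence parameters equal to 1 gives the shorter route twice the probability of the longer one. A route that no ant uses receives a zero deposit, so its concentration simply decays.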
A possible acceleration scheme is to use some bounds on the pheromone concentration and to allow only the ants with the current global best solution(s) to deposit pheromone. In addition, a certain ranking of solution fitness can also be used. These are hot topics of current research.

3.3 Double Bridge Problem

A standard test problem for ant colony optimization is the simplest double bridge problem with two branches (see Fig. 3.2), where route (2) is shorter than route (1). The angles of these two routes are equal at both point A and point B, so that the ants have an equal chance (a 50-50 probability) of choosing each route randomly at the initial stage at point A.

Initially, fifty percent of the ants would go along the longer route (1). The pheromone evaporates at a constant rate, but the pheromone concentration on route (1) will become smaller because the route is longer and thus takes more time to travel through. Conversely, the pheromone concentration on the shorter route will increase steadily. After some iterations, almost all the ants will move along the shorter route. Figure 3.3 shows the initial snapshot of 10 ants (5 on each route initially) and the snapshot after 5 iterations (equivalent to 50 ants having moved along this section). In fact there are 11 ants, and one has not decided which route to follow as it has just come near the entrance. Almost all the ants (about 90% in this case) move along the shorter route.

Here we use only two routes at the node; it is straightforward to extend this to multiple routes at a node. It is expected that only the shortest route will be chosen ultimately. As any complex network system is always made of individual nodes, this algorithm can be extended to solve complex routing problems reasonably efficiently.
In fact, ant colony algorithms have been successfully applied to the Internet routing problem, the travelling salesman problem, combinatorial optimization problems, and other NP-hard problems.

3.4 Virtual Ant Algorithm

As we know that ant colony optimization has successfully solved NP-hard problems such as the travelling salesman problem, it can also be extended to solve the standard optimization problems of multimodal functions. The only problem now is to figure out how the ants will move on an n-dimensional hyper-surface. For simplicity, we will discuss the 2-D case, which can easily be extended to higher dimensions. On a 2-D landscape, ants can move in any direction (from $0^\circ$ to $360^\circ$), but this will cause some problems. For example, how should the pheromone at a particular point be updated, given that there are infinitely many points? One solution is to track the history of each ant's moves and record the locations consecutively; another approach is to use a moving neighbourhood or window. The ants 'smell' the pheromone concentration of their neighbourhood at any particular location.

In addition, we can limit the number of directions the ants can move in by quantizing the directions. For example, ants are only allowed to move left and right, and up and down (only 4 directions). We will use this quantized approach here, which will make the implementation much simpler. Furthermore, the objective function or landscape can be encoded into virtual food so that ants will move to the best locations where the best food sources are. This will make the search process even simpler. This simplified algorithm is called the Virtual Ant Algorithm (VAA), developed by Xin-She Yang and his colleagues in 2006, which has been successfully applied to topological optimization problems in engineering.

The Keane function, which has multiple peaks, is a standard test function. Without any constraint, this function is symmetric and has two highest peaks at (0, 1.39325) and (1.39325, 0).
To make the problem harder, it is usually optimized under two constraints. This makes the optimization difficult because it is now nearly symmetric about x = y and the peaks occur in pairs where one is higher than the other. In addition, the true maximum lies on a constraint boundary.

Figure 3.4 shows the surface variations of the multi-peaked function. If we use 50 roaming ants and let them move around for 25 iterations, then the pheromone concentrations (also equivalent to the paths of ants) are displayed in Fig. 3.4. We can see that the highest pheromone concentration within the constraint boundary corresponds to the optimal solution.

It is worth pointing out that ant colony algorithms are the right tool for combinatorial and discrete optimization. They have advantages over other stochastic algorithms such as genetic algorithms and simulated annealing in dealing with dynamical network routing problems.

For continuous decision variables, their performance is still under active research. For the present example, it took about 1500 evaluations of the objective function to find the global optimum. This is not as efficient as other metaheuristic methods, especially compared with particle swarm optimization. This is partly because the handling of the pheromone takes time. Is it possible to eliminate the pheromone and just use the roaming ants? The answer is yes. Particle swarm optimization is just the right kind of algorithm for such further modifications, which will be discussed later in detail.

Part 2: Chinese Translation

Chapter 2 Genetic Algorithms

2.1 Introduction

The genetic algorithm, proposed by John Holland and his colleagues in the 1960s and 1970s, is an abstract model of biological evolution developed on the basis of Charles Darwin's theory of natural selection.
Principles of Transcriptome Assembly with StringTie

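1. What is transcriptome assembly

Transcriptome assembly is the process of reassembling sequenced fragments into transcripts, based on RNA sequencing data. It helps researchers understand transcriptome structure, gene expression, splicing variation, and related information within cells. StringTie is a commonly used transcriptome assembly tool that can identify and quantify diverse transcripts efficiently and accurately.

2. How StringTie works

Data preprocessing

Before transcriptome assembly, the raw RNA-seq data must be preprocessed. This includes removing low-quality bases from the reads and removing adapter sequences. StringTie can handle data from multiple samples, whose assemblies can later be merged for a better overall result.

Transcript assembly

The RNA-seq reads are first aligned to the reference genome with a spliced aligner, producing alignment files (SAM or BAM format) that serve as StringTie's input. StringTie then looks for potential transcripts on the reference genome. Specifically, it uses the distribution of the aligned fragments to decide whether a fragment belongs to a transcript, and uses the base-level alignment information to determine the transcript's boundaries. StringTie also takes into account information such as transcript coverage and splicing patterns to improve assembly accuracy.

Transcript quantification

After assembly, StringTie quantifies the transcripts, that is, it computes an estimate of each transcript's expression level. Expression levels can be expressed in units such as FPKM (Fragments Per Kilobase of transcript per Million mapped reads) or TPM (Transcripts Per Million). This step helps researchers understand how the expression of different genes differs across samples.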
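To make the two expression units above concrete, here is a minimal Python sketch of the standard FPKM and TPM definitions computed from fragment counts. It is an illustration only and does not reproduce StringTie's internal estimation; the function names and the toy numbers in the usage note are assumptions.

```python
def fpkm(counts, lengths_bp, total_mapped_fragments=None):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts) if total_mapped_fragments is None else total_mapped_fragments
    # count / (length in kb) / (total in millions) == count * 1e9 / (length_bp * total)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]
```

For example, with hypothetical counts `[100, 200, 700]` and transcript lengths `[1000, 2000, 1000]` base pairs, the first two transcripts get the same TPM because their counts per base are equal. TPM values always sum to one million within a sample, which is why they are often easier to compare across samples than FPKM.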
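Transcript merging

If several samples need to be assembled, StringTie supports merging the transcripts from multiple samples. This step produces a more complete and comprehensive transcript set, improving coverage and accuracy for complex genomes.

3. Advantages and application scenarios of StringTie

Advantages

• High accuracy: StringTie uses several kinds of information for transcript assembly and quantification, and can achieve high accuracy.
• High adaptability: StringTie supports assembling and merging transcriptomes from multiple samples, and can accommodate different experimental designs and research needs.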
Foreign-Literature Translation: A Fast Multi-objective Genetic Algorithm Based on a Tree Structure

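Appendix 4

A Fast Multi-objective Genetic Algorithm Based on a Tree Structure

Introduction: Generally speaking, solving multi-objective scientific and engineering problems is a very difficult task. In these multi-objective optimization problems (MOPs), the objectives often conflict in a high-dimensional problem space, and multi-objective optimization also requires more computational resources. Some classical optimization methods suggest converting a multi-objective problem into a single-objective one, in which case many runs are required to find multiple solutions. This makes an algorithm that returns a set of candidate solutions preferable to one that returns only a single solution based on a weighting of the objectives. For this reason, over the past 20 years there has been growing interest in applying evolutionary algorithms (EAs) to multi-objective optimization. Many multi-objective evolutionary algorithms (MOEAs) have been proposed; they use the concept of Pareto dominance to guide the search and return a set of non-dominated solutions as the result.

Unlike single-objective optimization, where the optimum is taken as the final solution, multi-objective optimization has two goals: (1) convergence to the Pareto-optimal set, and (2) maintaining the diversity of solutions within the Pareto-optimal set. Many strategies and methods have been proposed to address these two sometimes conflicting tasks. A common problem with these methods is that they tend to be intricate. To obtain better solutions for both tasks, complicated strategies are usually employed, and many parameters need to be tuned based on experience and on information already known about the problem. In addition, many MOEAs have a computational complexity as high as $O(GMN^2)$ or require even more processing time ($G$ is the number of generations, $M$ is the number of objective functions, and $N$ is the population size; these symbols keep the same meaning below).

In this paper, we propose a fast multi-objective genetic algorithm based on a tree structure. The data structure is a binary tree that stores the three-valued dominance relation between solutions in multi-objective optimization (i.e., dominating, dominated, and non-dominated); we therefore name it the dominance tree (DT). Owing to some unique properties, the dominance tree implicitly contains density information about the individuals in the population and clearly reduces the number of comparisons between individuals. Computational-complexity experiments also show that the dominance tree is an effective tool for handling the population. The dominance-tree-based evolutionary algorithm (DTEA) unifies within the dominance tree the convergence and diversity strategies, i.e., the two goals of MOEAs, and because it has only a few parameters, the algorithm is easy to use.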
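The three-valued dominance relation mentioned above (dominating, dominated, non-dominated) can be illustrated with a short Python sketch. It assumes that every objective is to be minimized and that solutions are given as tuples of objective values; the function names are hypothetical, and the pairwise filter shown here is the plain quadratic comparison, not the dominance tree itself.

```python
def dominance(a, b):
    """Return 1 if a dominates b, -1 if b dominates a, 0 if neither (minimization)."""
    a_better = any(x < y for x, y in zip(a, b))
    b_better = any(y < x for x, y in zip(a, b))
    if a_better and not b_better:
        return 1
    if b_better and not a_better:
        return -1
    return 0


def non_dominated(points):
    """Keep the points that no other point dominates (all pairs are compared)."""
    return [p for p in points
            if not any(dominance(q, p) == 1 for q in points)]
```

Comparing every pair costs on the order of M times N squared objective comparisons for N solutions and M objectives, which is the kind of per-generation cost that a structure like the dominance tree is meant to cut down.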
Multi-task Semantic Matching with Self-supervised Learning

Acta Scientiarum Naturalium Universitatis Pekinensis, Vol. 58, No. 1 (Jan. 2022)
doi: 10.13209/j.0479-8023.2021.101

Multi-task Semantic Matching with Self-supervised Learning

CHEN Yuan 1, QIU Xinying 1,2,†

1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006; 2. Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou 510006; † Corresponding author, E-mail: ******************

Supported by the National Social Science Fund of China (17BGL068) and the Natural Science Foundation of Guangdong Province (2018A030313777). Received: 2021-06-08; revised: 2021-08-14.

Abstract In semantic matching, the interaction information between pairs of texts is critical in predicting a matching score for the pairs. This paper proposes a multi-task learning framework with self-supervised learning for the deep learning semantic matching problem. Specifically, a self-supervised model is designed for the paired sentences to regenerate each other with a sequence-to-sequence generation method. A multi-task learning framework then integrates the representation from the self-supervised generation with that of the deep matching model to predict the similarity score of the texts. Experiments with 9 deep matching models show that adding the self-supervised module improves each of the original models to a varying degree, indicating that the proposed framework can effectively improve deep text semantic matching models.

Key words self-supervised learning; semantic matching; multi-task learning

Text semantic matching studies how to measure the semantic equivalence or semantic similarity between two texts, and is one of the fundamental tasks of natural language processing.
Automatically attaching semantic metadata to Web services

Andreas Heß, Nicholas Kushmerick
Computer Science Department, University College Dublin, Ireland
{andreas.hess,nick}@ucd.ie

Abstract

Emerging Web standards promise a network of heterogeneous yet interoperable Web Services. Web Services would greatly simplify the development of many kinds of data integration and knowledge management applications. Unfortunately, this vision requires that services provide large amounts of semantic metadata "glue". As a first step to automatically generating such metadata, we describe how machine learning and clustering techniques can be used to attach semantic metadata to Web forms and services.

1 Introduction

Emerging Web standards such as UDDI [], SOAP [/TR/soap], WSDL [/TR/wsdl] and DAML-S [/services] promise an ocean of Web Services, networked components that can be invoked remotely using standard XML-based protocols. For example, significant e-commerce players such as Amazon and Google export Web Services giving public access to their content databases.
The key to automatically invoking and composing Web Services is to associate machine-understandable semantic metadata with each service. While the details are beyond the scope of this paper, the various Web standards involve metadata at various levels of abstraction, from high-level advertisements that facilitate discovering relevant services, to low-level input/output specifications of particular operations.

A central challenge to the Web Services initiative is the lack of tools to (semi-)automatically generate the necessary metadata. In this paper we explore the use of machine learning techniques to automatically create such metadata from training data. Such an approach complements existing uses of machine learning to facilitate the Semantic Web, such as for information extraction [6; 8; 2] and for mapping between heterogeneous data schemata [4].

Specifically, we describe and evaluate supervised learning techniques for attaching semantic metadata to Web forms and fields (Sec. 2), as well as unsupervised clustering techniques for discovering semantically related Web Services (Sec. 3).

2 Web form classification

Problem formulation. Web form instances are structured objects: a form comprises one or more fields, and each field in turn comprises one or more terms. More precisely, a form $F_i$ is a sequence of fields, written $F_i = [f_i^1, f_i^2, \ldots]$, and each field $f_i^j$ is a bag of terms, written $f_i^j = [t_i^j(1), t_i^j(2), \ldots]$.

We assume two taxonomies for attaching semantic metadata to forms and fields. First, we assume a domain taxonomy $\mathcal{D}$. Domains capture the overall purpose of a form, such as "searching for a book", "finding a job", "querying an airline timetable", etc. We use SMALLCAPS to indicate domains, so we might have $\mathcal{D} = \{\text{SEARCHBOOK}, \text{FINDJOB}, \text{QUERYFLIGHT}, \ldots\}$.
Second, we assume a datatype taxonomy $\mathcal{T}$. Datatypes do not relate to low-level encoding issues such as “string” or “integer”, but rather to the expected semantic category of a field’s data, such as “book title”, “salary”, “destination airport”, etc. Sans-serif style indicates datatypes, so we might have $\mathcal{T} = \{\textsf{BookTitle}, \textsf{Salary}, \textsf{DestAirport}, \ldots\}$.

The Web form learning problem is as follows. The input is a set of labelled forms and fields; that is, a set $\{F_1, F_2, \ldots\}$ of forms together with a domain $D_i \in \mathcal{D}$ for each form $F_i$, and a datatype $T_i^j \in \mathcal{T}$ for each field $f_i^j \in F_i$. The output is a form classifier; that is, a function that maps an unlabelled form $F_i$ to a predicted domain $D_i \in \mathcal{D}$, and a predicted datatype $T_i^j \in \mathcal{T}$ for each field $f_i^j \in F_i$.

Generative model. Our solution to the Web form classification problem is based on a stochastic generative model of a hypothetical “Web service designer” creating a Web page to host a particular service.

First, the designer selects a domain $D_i \in \mathcal{D}$ according to some probability distribution $\Pr[D_i]$. For example, in our Web form data described below, forms for finding books are very frequent relative to forms for finding colleges, so $\Pr[\textsc{SearchBook}] \gg \Pr[\textsc{FindCollege}]$.

Second, the designer selects datatypes $T_i^j \in \mathcal{T}$ appropriate to $D_i$, by selecting according to some distribution $\Pr[T_i^j \mid D_i]$. For example, presumably $\Pr[\textsf{BookTitle} \mid \textsc{SearchBook}] \gg \Pr[\textsf{DestAirport} \mid \textsc{SearchBook}]$, because services for finding books usually involve a book’s title, but rarely involve airports. On the other hand, $\Pr[\textsf{BookTitle} \mid \textsc{QueryFlight}] \ll \Pr[\textsf{DestAirport} \mid \textsc{QueryFlight}]$.

Finally, the designer writes the Web page that implements the form by coding each field in turn. More precisely, for each selected datatype $T_i^j$, the designer uses terms $t_i^j(k)$ drawn according to some distribution $\Pr[t_i^j(k) \mid T_i^j]$. For example, presumably $\Pr[\mathit{title} \mid \textsf{BookTitle}] \gg \Pr[\mathit{city} \mid \textsf{BookTitle}]$, because the term title is much more likely than city to occur in a field requesting a book title. On the other hand, presumably $\Pr[\mathit{title} \mid \textsf{DestAirport}] \ll \Pr[\mathit{city} \mid \textsf{DestAirport}]$.
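The three steps of the generative model can be illustrated with a short sampling sketch. It is only a minimal illustration under stated assumptions: the probability tables `P_DOMAIN`, `P_TYPE` and `P_TERM` are hypothetical toy values, and the number of fields and terms per field is passed in as a fixed parameter, since the model as described does not say how these counts are chosen.

```python
import random

# Hypothetical toy parameters; illustrative values only, not taken from the paper.
P_DOMAIN = {"SearchBook": 0.8, "QueryFlight": 0.2}            # Pr[D]
P_TYPE = {                                                    # Pr[T | D]
    "SearchBook": {"BookTitle": 0.9, "DestAirport": 0.1},
    "QueryFlight": {"BookTitle": 0.1, "DestAirport": 0.9},
}
P_TERM = {                                                    # Pr[t | T]
    "BookTitle": {"title": 0.7, "book": 0.2, "city": 0.1},
    "DestAirport": {"city": 0.6, "airport": 0.3, "title": 0.1},
}


def draw(dist, rng):
    """Sample one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_form(n_fields, n_terms, rng):
    """Return (domain, [(datatype, [terms]), ...]) drawn from the toy model."""
    domain = draw(P_DOMAIN, rng)                    # step 1: choose a domain
    fields = []
    for _ in range(n_fields):
        dtype = draw(P_TYPE[domain], rng)           # step 2: choose a datatype
        terms = [draw(P_TERM[dtype], rng)           # step 3: choose the terms
                 for _ in range(n_terms)]
        fields.append((dtype, terms))
    return domain, fields


if __name__ == "__main__":
    print(sample_form(n_fields=2, n_terms=3, rng=random.Random(0)))
```

Learning then amounts to recovering tables like these from labelled forms, and classification to inverting the sampling process.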
Parameter estimation. The learning task is to estimate the parameters of the stochastic generative model from a set of training data. The training data comprises a set of $N$ Web forms $\mathcal{F} = \{F_1, \ldots, F_N\}$, where for each form $F_i$ the learning algorithm is given the domain $D_i \in \mathcal{D}$ and the datatypes $T_i^j$ of the fields $f_i^j \in F_i$.

The parameters to be estimated are the domain probabilities $\hat{\Pr}[D]$ for $D \in \mathcal{D}$, the conditional datatype probabilities $\hat{\Pr}[T \mid D]$ for $D \in \mathcal{D}$ and $T \in \mathcal{T}$, and the conditional term probabilities $\hat{\Pr}[t \mid T]$ for term $t$ and $T \in \mathcal{T}$. We estimate these parameters based on their frequency in the training data:

$$\hat{\Pr}[D] = \frac{N_{\mathcal{F}}(D)}{N}, \qquad \hat{\Pr}[T \mid D] = \frac{M_{\mathcal{F}}(T, D)}{M_{\mathcal{F}}(D)}, \qquad \hat{\Pr}[t \mid T] = \frac{W_{\mathcal{F}}(t, T)}{W_{\mathcal{F}}(T)},$$

where $N_{\mathcal{F}}(D)$ is the number of forms in the training set $\mathcal{F}$ with domain $D$; $M_{\mathcal{F}}(D)$ is the total number of fields in all forms of domain $D$; $M_{\mathcal{F}}(T, D)$ is the number of fields of datatype $T$ in all forms of domain $D$; $W_{\mathcal{F}}(T)$ is the total number of terms of all fields of datatype $T$; and $W_{\mathcal{F}}(t, T)$ is the number of occurrences of term $t$ in all fields of datatype $T$.

Classification. Our approach to Web form classification involves converting a form into a Bayesian network. The network is a tree that reflects the generative model: there is a root node representing the form’s domain, children representing the datatype of each field, and grandchildren encoding the terms used to code each field.

In more detail, a Web form to be classified is converted into a three-layer tree-structured Bayesian network as follows. The first (root) layer contains just a single node $\mathit{domain}$ that takes on values from the set of domains $\mathcal{D}$. The second layer consists of one child $\mathit{datatype}_i$ of $\mathit{domain}$ for each field in the form being classified, where each $\mathit{datatype}_i$ takes on values from the datatype set $\mathcal{T}$. The third (leaf) layer comprises a set of children $\{\mathit{term}_i^1, \ldots, \mathit{term}_i^K\}$ for each $\mathit{datatype}_i$ node, where $K$ is the number of terms in the field. The term nodes take on values from the vocabulary set $V$, defined as the set of all terms that have occurred in the training data. Fig. 1 illustrates
the network that would be constructed for a form with three fields and $K$ terms for each field. (In the figure each field contains the same number $K$ of terms for simplicity; in fact, the number of term nodes reflects the actual number of terms in the parent field.)

The conditional probability tables associated with each node correspond directly to the learned parameters mentioned earlier. That is, $\Pr[\mathit{domain} = D] \equiv \hat{\Pr}[D]$, $\Pr[\mathit{datatype}_i = T \mid \mathit{domain} = D] \equiv \hat{\Pr}[T \mid D]$, and $\Pr[\mathit{term}_i^k = t \mid \mathit{datatype}_i = T] \equiv \hat{\Pr}[t \mid T]$. Note that the conditional probability tables are identical for all datatype nodes, and for all term nodes.

Figure 1: The three-layer tree-structured Bayesian network used to classify a form containing three fields.

Given such a Bayesian network, classifying a form $F_i = [f_i^1, f_i^2, \ldots]$ involves “observing” the terms in each field (i.e., setting the probability $\Pr[\mathit{term}_i^k = t_i^j(k)] \equiv 1$ for each term $t_i^j(k) \in f_i^j$), and then computing the maximum-likelihood form domain and field datatypes consistent with that evidence.

Evaluation. We have evaluated our approach using a collection of 129 Web forms comprising 656 fields in total, for an average of 5.1 fields/form. As shown in Fig. 2, the domain taxonomy $\mathcal{D}$ used in our experiments contains 6 domains, and the datatype taxonomy $\mathcal{T}$ comprises 71 datatypes.

The forms were gathered by manually browsing Web form indices for relevant forms. Each form was then inspected by hand to assign a domain to the form as a whole, and a datatype to each field.
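The frequency-based parameter estimates and the tree-structured classification step described earlier in this section can be sketched as follows. This is a minimal sketch under several assumptions that the text does not state: the joint most probable assignment is computed by maximising over the datatype of each field independently for every candidate domain (which is exact for a tree whose leaves are all observed), add-one smoothing (`eps`) is introduced so that unseen terms do not produce zero probabilities, and the names `estimate` and `classify` and the toy training forms are hypothetical.

```python
import math
from collections import Counter, defaultdict


def estimate(forms):
    """Count N_F(D), M_F(T, D) and W_F(t, T) from labelled forms.

    `forms` is a list of (domain, [(datatype, [terms...]), ...]).
    """
    n_dom = Counter()               # N_F(D): forms per domain
    m_td = defaultdict(Counter)     # M_F(T, D): fields of type T in domain D
    w_tT = defaultdict(Counter)     # W_F(t, T): occurrences of t in type T
    for domain, fields in forms:
        n_dom[domain] += 1
        for dtype, terms in fields:
            m_td[domain][dtype] += 1
            w_tT[dtype].update(terms)
    return n_dom, m_td, w_tT


def classify(fields, model, eps=1.0):
    """Return (domain, [datatype per field]) maximising the joint probability.

    `fields` is a list of term lists. `eps` is an assumed smoothing constant.
    """
    n_dom, m_td, w_tT = model
    types = sorted(w_tT)
    vocab_size = len({t for counts in w_tT.values() for t in counts}) + 1
    n_forms = sum(n_dom.values())

    def log_terms(terms, dtype):
        total = sum(w_tT[dtype].values())
        return sum(math.log((w_tT[dtype][t] + eps) / (total + eps * vocab_size))
                   for t in terms)

    best = None
    for domain in n_dom:
        score = math.log(n_dom[domain] / n_forms)
        m_total = sum(m_td[domain].values())
        chosen = []
        for terms in fields:
            # Given the domain, each field's datatype can be chosen separately.
            field_score, dtype = max(
                (math.log((m_td[domain][T] + eps) / (m_total + eps * len(types)))
                 + log_terms(terms, T), T)
                for T in types)
            score += field_score
            chosen.append(dtype)
        if best is None or score > best[0]:
            best = (score, domain, chosen)
    return best[1], best[2]


# Hypothetical toy training data, for illustration only.
TRAIN = [
    ("SearchBook", [("BookTitle", ["book", "title"]), ("Author", ["author", "name"])]),
    ("SearchBook", [("BookTitle", ["title"])]),
    ("QueryFlight", [("DestAirport", ["destination", "city"]),
                     ("DateDepart", ["depart", "date"])]),
]

if __name__ == "__main__":
    model = estimate(TRAIN)
    print(classify([["title", "book"], ["author"]], model))
```

Because the domain node is the only link between fields, fixing the domain decouples the per-field decisions; evidence from one field can therefore change the datatype predicted for another only through the choice of domain.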
After the forms were gathered, they were segmented into fields. We discuss the details below. For now, it suffices to say that we use HTML tags such as <input> and <textarea> to identify the fields that will appear to the user when the page is rendered. After a form has been segmented into fields, certain irrelevant fields (e.g., submit/reset buttons) are discarded. The remaining fields are then assigned a datatype.

A final subtlety is that some fields are not easily interpreted as “data”, but rather indicate minor modifications to either the way the query is interpreted, or the output presentation. For example, there is a “help” option on one search service that augments the requested data with suggestions for query refinement. We discarded such fields on a case-by-case basis; a total of 12.1% of the fields were discarded in this way.

The final data-preparation step is to convert the HTML fragments into the “form = sequence of fields; field = bag of terms” representation. The HTML is first parsed into a sequence of tokens. Some of these tokens are HTML field tags (e.g., <input>, <select>, <textarea>). The form is segmented into fields by associating the remaining tokens with the nearest field. For example, “<form> a <input name=f1> b c <textarea name=f2> d </form>” would be segmented as “a <input name=f1> b” and “c <textarea name=f2> d”.

Domain taxonomy D and number of forms for each domain:
SEARCHBOOK (44), FINDCOLLEGE (2), SEARCHCOLLEGEBOOK (17), QUERYFLIGHT (34), FINDJOB (23), FINDSTOCKQUOTE (9)

Datatype taxonomy T (illustrative sample):
Address, NAdults, Airline, Author, BookCode, BookCondition, BookDetails, BookEdition, BookFormat, BookSearchType, BookSubject, BookTitle, NChildren, City, Class, College, CollegeSubject, CompanyName, Country, Currency, DateDepart, DateReturn, DestAirport, DestCity, Duration, Email, EmployeeLevel, ...

Figure 2: Subsets of the domain and datatype taxonomies used in the experiments.

The intent is that this segmentation process will associate with each field a bag of terms that provides evidence of the field’s datatype. For example, our
classification algorithm will learn to distinguish labels like “Book title” that are associated with BookTitle fields, from labels like “Title (Dr, Ms, ...)” that indicate PersonTitle.

Finally, we convert HTML fragments like “Enter name: <input name=name1 type=text size=20><br>” that correspond to a particular field, into the field’s bag of terms representation. We process each fragment as follows. First, we discard HTML tags, retaining the values of a set of “interesting” attributes, such as an <input> tag’s name attribute. The result is “Enter name: name1”. Next, we tokenize the string at punctuation and space characters, convert all characters to lower case, apply Porter’s stemming algorithm, discard stop words, and insert a special symbol encoding the field’s HTML type (text, select, radio-button, etc). This yields the token sequence [enter, name, name1, TypeText]. Finally, we apply a set of term normalizations, such as replacing terms comprising just a single digit (letter) with a special symbol SingleDigit (SingleLetter), and deleting leading/trailing numbers. In this example the final result is the sequence [enter, name, name, TypeText].

Results. We begin by comparing our approach to two simple bag-of-terms baselines using a leave-one-out methodology.
For domain classification, the baseline uses a single bag of all terms in the entire form. For datatype classification, the baseline approach is the naive Bayes algorithm over each field’s bag of terms.

For domain prediction, our algorithm has an F1 score of 0.87 while the baseline scores 0.82. For datatype prediction, our algorithm has an F1 score of 0.43 while the baseline scores 0.38. We conclude that our “holistic” approach to form and field prediction is more accurate than a greedy baseline approach of making each prediction independently.

While our approach is far from perfect, we observe that form classification is extremely challenging, due both to noise in the underlying HTML, and the fact that our domain and datatype taxonomies contain many classes compared to traditional (usually binary!) text classification tasks.

While fully-automated form classification is our ultimate goal, an imperfect form classifier can still be useful in interactive, partially-automated scenarios in which a human gives the domain or (some of) the datatypes of a form to be labelled, and the classifier labels the remaining elements.

Figure 3: Domain prediction F1 as a function of the fraction α of fields’ datatypes supplied by the user (horizontal axis: degree of assistance α; vertical axis: F1).

Our first experiment measures the improvement in datatype prediction if the form’s domain is also provided to the Bayesian network as evidence. In this case our algorithm has an F1 score of 0.51, compared to 0.43 mentioned earlier.

Our second experiment measures the improvement in domain prediction if evidence is provided for a randomly chosen fraction α of the fields’ datatypes, for 0 ≤ α ≤ 1. α = 0 corresponds to the fully automated situation in which no datatype evidence is provided; α = 1 requires that a person provide the datatype of every field. As shown in Fig. 3, the domain classification F1 score increases rapidly as α approaches 1.
Our third investigation of semi-automated prediction involves the idea of ranking the predictions rather than requiring that the algorithm make just one prediction. In many semi-automated scenarios, the fact that the second- or third-ranked prediction is correct can still be useful even if the first is wrong. To formalize this notion, we calculate F1 based on treating the algorithm as correct if the true class is in the top R predictions as ranked by posterior probability. Fig. 4 shows the F1 score for predicting both domains and datatypes, as a function of R. R = 1 corresponds to the cases described so far. We can see that relaxing R even slightly results in a dramatic increase in F1 score.

So far we have assumed unstructured datatype and domain taxonomies. However, domains and datatypes exhibit a natural hierarchical structure (e.g., “forms for finding something” vs. “forms for buying something”; or “fields related to book information” vs. “fields related to personal details”). It seems reasonable that in partially-automated settings, predicting a similar but wrong class is more useful than a dissimilar class.
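The rank-threshold criterion described above, in which a prediction counts as correct when the true class appears among the top R classes by posterior probability, can be sketched as follows. This is only an illustration: it reports the fraction of examples counted as correct rather than the F1 score used in the experiments, and the function names and posterior values are hypothetical.

```python
def top_r_correct(posterior, true_class, r):
    """True if `true_class` is among the `r` most probable classes."""
    ranked = sorted(posterior, key=posterior.get, reverse=True)
    return true_class in ranked[:r]


def top_r_accuracy(examples, r):
    """Fraction of (posterior, true_class) pairs counted as correct at rank r."""
    hits = sum(top_r_correct(p, y, r) for p, y in examples)
    return hits / len(examples)


# Hypothetical posteriors over three domains, for illustration only.
EXAMPLES = [
    ({"SearchBook": 0.6, "FindJob": 0.3, "QueryFlight": 0.1}, "FindJob"),
    ({"SearchBook": 0.2, "FindJob": 0.1, "QueryFlight": 0.7}, "QueryFlight"),
]

if __name__ == "__main__":
    for r in (1, 2):
        print(r, top_r_accuracy(EXAMPLES, r))
```

In this toy data the first example is wrong at R = 1 but correct at R = 2, which is the kind of near miss that relaxing the threshold is meant to credit.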
To explore this issue, our research assistants converted their domain and datatype taxonomies into trees, creating additional abstract nodes to obtain reasonable and compact hierarchies. We used distance in these trees to measure the “quality” of a prediction, instead of a binary “right/wrong”. For domain predictions, our algorithm’s prediction is on average 0.40 edges away from the correct class, while the baseline algorithm’s predictions are 0.55 edges away. For datatype prediction, our algorithm’s average distance is 2.08 edges while the baseline algorithm averages 2.51. As above, we conclude that our algorithm outperforms the baseline.

Figure 4: F1 as a function of rank threshold R (horizontal axis: rank threshold R; vertical axis: F1; series: form domain, field datatype).

3 Web Service clustering

Clustering. As a second approach towards our goal of automatically creating Web Services metadata, we explored the use of unsupervised clustering algorithms to automatically group services into semantically related categories. Note that a given Web Service can export more than one operation. Each operation would correspond to a single Web form. Let $\mathcal{C} = \{C_1, C_2, \ldots\}$ be the categories discovered by clustering a collection of Web Services. Then each $C_i$ corresponds formally to a subset of the domain ontology $\mathcal{D}$: $C_i \in 2^{\mathcal{D}}$ for each $i$. Ultimately, we intend to use the category $C_i$ associated with a particular Web Service as additional top-down evidence that could be exploited by the Bayesian network used to classify each of its operations.

Clustering in information retrieval has been applied primarily on unstructured documents. As Web Service descriptions are structured documents, we developed a clustering algorithm that takes advantage of this structure.

UDDI. As Web Services become more common, there is an increasing need for search and discovery tools. UDDI registries promise to solve the discovery problem, but UDDI’s search capabilities are rather weak. UDDI is more like the Domain Name Service
than a search engine: it provides an abstraction over the low-level details by avoiding hard-coded access addresses, because applications can query the UDDI registry for technical details given a Web Service name. On the other hand, UDDI has richer features than a Web search engine, such as support for the classification of Web Services into a taxonomy like the North American Industry Classification System (NAICS). These taxonomies provide a means of classifying a Web Service as a whole. This is still far less than a complete semantic annotation, and the situation is complicated due to multiple competing taxonomies. Sometimes not even a part-whole relationship is exploited when searching the registry for a service that belongs to a certain industry classification. For instance, a Web Service that advertised itself as a credit card service might not be found if the user searches for a banking service, although it might be exactly what the user wants.

While creating the technical descriptions of Web Services such as WSDL is more or less an automatic procedure, creating semantic descriptions (even merely classification into an industry taxonomy) involves additional work by the provider. Our goal is to provide a semi-automated tool to simplify this burden, by using clustering techniques to discover a set $\mathcal{C}$ of categories that correspond to, for example, an industry taxonomy like NAICS.

Web Service corpus. For our clustering experiments, we focused on UDDI data structures. However, we obtained a collection of 425 Web Services not through a UDDI registry, but from SALCentral, a WSDL indexing service. As shown in Fig. 5, the information available is a Web Service description in the WSDL format and the name and description of the service. (We did not use the service provider as a text source.) We parsed the port types, operations and messages from the WSDL and extracted names and comments. We did not extract standard XML schema data types like string or integer.

These Web Services were then manually classified into a
hierarchical taxonomy. To avoid bias, the person who classified the services was a research student with no previous experience with Web Services. The person had the same information as given on SALCentral, and was allowed to inspect the WSDL description if necessary. The person was advised to adaptively create new classes while classifying the Web Services and was allowed to arrange the classes in a hierarchy. As shown in Fig. 6, the 425 Web Services were classified into 24 top-level classes. The Web Services were not evenly distributed over the 24 classes: some top-level categories contain only two Web Services while others contain over 50 services in several subclasses.

Figure 5: Data structure for our Web Service corpus.

Cartoon, Communication, Converter, Country Information, Courier Information, Database Provider, Data Management, Developers, Engineering, Finder, Flights, Games, Genbank, Graphics, License, Mathematics, Money, Music, News, Parser, Poll Creation, Server Info, User Groups, Web

Figure 6: Web Service categories C (only top-level categories are shown).

Clustering algorithms. We tested four clustering algorithms on our collection of Web Services: a hierarchical group average algorithm (e.g. [11] or [10]), the Word-IC algorithm [12], a variant of the group average clusterer that we call Common-Term, and a clusterer that exploits the structured nature of the data we have gathered.

The Common-Term algorithm differs from the standard group average clustering in the way the centroid document vector is computed. Instead of using all terms from all the sub-clusters, only the terms that occur in all sub-clusters form the centroid. As with the Word-IC algorithm, our hope is that this leads to short and concise cluster labels.

For each of the three unstructured clustering algorithms, we convert a Web Service $A$ into a bag of words $\mathrm{text}(A)$, which contains all the information about a particular service. We use the Porter stemming algorithm and a stop-word list to reduce the dimensionality of the task, and we use the standard cosine approach to measuring the similarity $\sigma(a, b)$ between two “documents” $a$ and $b$. For the three unstructured clustering algorithms, we define the similarity $\sigma(A, B)$ between two services $A$ and $B$ as the similarity of their bags of words: $\sigma(A, B) = \sigma(\mathrm{text}(A), \mathrm{text}(B))$.

The structured clusterer measures similarity in terms of both the similarity of the Web Services’ names and descriptions, and also the similarity between the input/output messages. The similarity between messages of two Web Services $A$ and $B$ is defined as the maximum similarity between all messages $\mathrm{msgs}(A)$ of the first service and all messages $\mathrm{msgs}(B)$ of the other service. The intuition is that similar services tend to not only have similar textual descriptions, but also similar names for their messages. The combined similarity $\sigma(A, B)$ is defined as the product of the textual similarity of the services and the similarity of the messages:

$$\sigma(A, B) = \sigma(\mathrm{text}(A), \mathrm{text}(B)) \cdot \max_{n \in \mathrm{msgs}(A),\ m \in \mathrm{msgs}(B)} \sigma(n, m),$$

where for the structured clusterer $\mathrm{text}(A)$ excludes terms in the messages associated with service $A$’s operations.

The structured clusterer is still work in progress, as we are currently experimenting with different ways to exploit the structured nature of Web Services. For example, other ways of combining the similarities, like using a weighted average, are possible. It is also possible to cluster on the messages first and then use the cluster distance of the messages as a similarity measure. The disadvantage of the latter is that every service can have dozens of messages, and so this preliminary clustering step is infeasible without additional heuristics.
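The combined similarity used by the structured clusterer can be sketched as follows. This is a minimal sketch assuming raw term counts as vector weights (the text says only that the standard cosine approach is used) and assuming stemming and stop-word removal have already been applied; the function names and the example services are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bags of words given as lists of terms."""
    a, b = Counter(a), Counter(b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def structured_similarity(text_a, msgs_a, text_b, msgs_b):
    """Text similarity multiplied by the best-matching pair of messages.

    `text_*` is a list of terms; `msgs_*` is a list of term lists, one per message.
    """
    best_msg = max((cosine(n, m) for n in msgs_a for m in msgs_b), default=0.0)
    return cosine(text_a, text_b) * best_msg


if __name__ == "__main__":
    # Hypothetical services, for illustration only.
    a_text, a_msgs = ["stock", "quote"], [["get", "quote", "request"]]
    b_text, b_msgs = ["stock", "price"], [["get", "quote", "response"]]
    print(structured_similarity(a_text, a_msgs, b_text, b_msgs))
```

Because the two factors are multiplied, the result is zero whenever either the descriptions or the messages share no terms, so such a pair of services can never be merged on the strength of the other factor alone.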
Evaluation. We compared the four clustering algorithms with the manually-generated reference clusters mentioned above, using three ways to measure the similarity between two clusterings over a set of objects. The first measure was introduced by Zamir et al. [12]. This measure looks at a cluster in the target distribution and classifies each pair of services as a true positive or a false positive.

The second similarity measure is the conditional probability of two Web Services being in the same cluster in our reference classification, given they are in the same machine-generated cluster and vice versa, and the conditional probability of two Web Services being in different clusters in the reference given they are in different clusters in the machine clustering and vice versa. To get an overall measure, we multiplied these probabilities.

The probability of two Web Services being in the same reference class given they are in the same cluster alone is similar to the precision in traditional information retrieval, so we also report this value. This measure is not affected by unclustered services, while Zamir’s measure explicitly penalises unclustered services.

Fig. 7 shows the quality of the clusters generated compared to the reference clusters, according to the three evaluation metrics, for each of the four algorithms. Not surprisingly, none of the algorithms does particularly well, because the clustering problem is quite challenging. In many cases even humans disagree on the correct classification. For example, SALCentral manually organizes Web Services into their own taxonomy, and the final column in Fig. 7 shows that these reference clusters bear little resemblance to ours. Furthermore, we have 24 categories in our reference classification, which is a rather high number.

Our structured algorithm is penalised by Zamir’s metric, because it produces many unclustered services. The reason for this is that the similarity between two documents in the Structured algorithm reaches zero if either the text or the
messages have zero similarity. Thus, some Web Services will never be clustered in the structured algorithm.

While the clusters do not agree well with our reference clusters, anecdotally we find that the discovered clusters do generally make intuitive sense. An exception concerns synonyms: all algorithms make mistakes clustering Web Services containing the word “quote”, where “quote of the day” and “stock quote” services are inadvertently merged. The Common-Term and Structured clustering algorithms achieve a high precision, but create very many top-level clusters, and, in the case of the Structured clusterer, a very high number of unclustered documents. The Common-Term algorithm also performs best when looking at Zamir’s quality measure, although all algorithms only reach negative values.

                                 Word-IC   Common-Term   Group Average   Structured   SALCentral
Zamir’s quality measure          -0.4738   -0.2339       -0.5390         -0.3960      -0.4163
Combined probabilities            0.0223    0.0215        0.0343          0.0285       0.0329
Precision                         0.1024    0.3317        0.1257          0.4136       0.1774
Number of unclustered services    0721170
Number of top level clusters      147285214

Figure 7: Cluster quality compared to reference cluster; see the Evaluation paragraph above for a discussion of the SALCentral column.

4 Discussion

Future work. We are currently extending our classification and clustering algorithms in several directions. For example, our approaches ignore valuable sources of evidence, such as the actual data passed to/from a Web Service, and it would be interesting to incorporate such evidence into our algorithms. Our clustering algorithm could be extended in a number of ways, such as using statistical methods like latent semantic analysis as well as thesauri like WordNet to improve the discovery and clustering of Web Services.

We are currently integrating the clustering and classification algorithms. As described earlier, the clusters can provide additional evidence for classification, in that clustering on the Web Service level can give hints on the domains of the service’s operations. Although the classification
approach has been evaluated only on HTML forms, we anticipate that the method can be extended to real Web Services.

Related work. There has been some work on semantic matching of Web Services [9; 1], but these approaches require semantic metadata such as DAML-S.

Clustering is a well-known technique in information retrieval, although it has not yet been applied to Web Services. Besides the traditional group-average or single-link approaches, newer algorithms like the Word-IC algorithm mentioned above or the Scatter/Gather algorithms [3] exist. However, these algorithms have been applied mostly to unstructured data.

The search capabilities of UDDI are still very restricted; different extensions are available or under development (e.g. UDDIe [/user/A.Shaikhali/uddie]). Kerschberg et al. are planning to apply the techniques they introduced in WebSifter [5] to UDDI.

When we actually want to simultaneously invoke multiple similar Web Services and aggregate the results, we encounter the problem of XML schema mapping. This problem is addressed in different ways in [4; 7].

Conclusions. The emerging Web Services protocols represent exciting new directions for the Web, but interoperability requires that each service be described by a large amount of semantic metadata “glue”. We have presented two approaches to automatically generating such metadata, and evaluated them on a collection of Web Services and forms.

Although we are far from being able to fully automatically create semantic metadata, we believe that the methods we have presented here are a reasonable first step. Our preliminary results indicate that some of the requisite semantic metadata can be semi-automatically generated using machine learning, information retrieval and clustering techniques.

Acknowledgments. This research was supported by grants SFI/01/F.1/C015 from Science Foundation Ireland, and N00014-00-1-0021 from the US Office of Naval Research.
References

[1] J. Cardoso. Quality of Service and Semantic Composition of Workflows. PhD thesis, University of Georgia, 2002.
[2] F. Ciravegna. Adaptive information extraction from text by rule induction and generalization. In 17th International Joint Conference on Artificial Intelligence, 2001.
[3] D. Cutting, J. Pedersen, D. Karger, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318–329, 1992.
[4] A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proc. SIGMOD Conference, 2001.
[5] L. Kerschberg, W. Kim, and A. Scime. Intelligent web search via personalizable meta-search agents. pages 1345–1358, 2002.
[6] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1–2):15–68, 2000.
[7] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In Proc. of the International Conference on Data Engineering (ICDE), 2002.
[8] I. Muslea, S. Minton, and C. Knoblock. A Hierarchical Approach to Wrapper Induction. In Proc. 3rd Int. Conf. Autonomous Agents, pages 190–197, 1999.
[9] M. Paolucci, T. Kawamura, T. Payne, and K. Sycara. Semantic matchmaking of web services capabilities. In Int. Semantic Web Conference, 2002.
[10] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[11] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
[12] Oren Zamir, Oren Etzioni, Omid Madani, and Richard M. Karp. Fast and intuitive clustering of web documents. In Knowledge Discovery and Data Mining, pages 287–290, 1997.