Edge Weight-Based Word Similarity Computation in WordNet

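Computer Engineering and Applications, 2018, 54(1)

1 Introduction

Word similarity computation is a fundamental research topic in natural language processing, and it is widely applied in fields such as biomedicine [1], cognitive science, and psychology [2-3]. Current methods fall into two main categories. The first gathers statistics from a large-scale corpus and computes similarity from the probability distribution of a word's contextual information. The second relies on some form of world knowledge, typically the hierarchical relations in a knowledge-complete semantic dictionary [4]. Internationally, knowledge-based research mainly builds on WordNet.

Four kinds of methods have been proposed for knowledge-based word similarity computation: path-based methods, information content-based methods, feature-based methods, and hybrid methods. Path-based methods are simple and direct. They measure the similarity of two concepts by the shortest path distance between the corresponding nodes in the taxonomy: the shorter the shortest path, the more similar the concepts. Rada et al. [5], Wu et al. [6], Leacock et al. [7], Li et al. [8], Liu et al.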
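To make the path-based idea concrete, the following is a minimal sketch rather than the method proposed in this paper. It assumes a hypothetical toy is-a taxonomy (`TOY_HYPERNYMS`, not taken from WordNet) and uses the common conversion 1 / (1 + d) from shortest path length d to a similarity score; both the taxonomy and that conversion are illustrative assumptions.

```python
from collections import deque

# Hypothetical toy is-a taxonomy (child -> parent); an assumption for
# illustration only, not data taken from WordNet.
TOY_HYPERNYMS = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if
    either node is unknown or the two nodes are not connected."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Assumed conversion: similarity = 1 / (1 + shortest path length)."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)


if __name__ == "__main__":
    graph = build_graph(TOY_HYPERNYMS)
    for pair in [("dog", "cat"), ("dog", "tree"), ("dog", "dog")]:
        print(pair, path_similarity(graph, *pair))
```

In this toy taxonomy "dog" and "cat" are two edges apart while "dog" and "tree" are four edges apart, so the first pair scores higher. Every edge counts equally here, which is the kind of uniformity that an edge-weighted scheme is meant to relax.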

GUO Xiaohua 1, PENG Qi 2, DENG Han 1, ZHU Xinhua 1

1. College of Computer Science & Information Technology, Guangxi Normal University, Guilin, Guangxi 541004, China

2. Department of Network Center, Guangxi Normal University, Guilin, Guangxi 541004, China

GUO Xiaohua, PENG Qi, DENG Han, et al. Edge weight-based word similarity computation in WordNet. Computer Engineering and Applications, 2018, 54(1): 172-178.

Abstract: Current word similarity algorithms commonly suffer from a single information source, nonlinearly inflated results, and a mismatch between computational performance and efficiency. To address these defects, a word similarity computation method based on edge weights in WordNet is proposed. Building on path and depth, the method uses edge weights to mitigate the hierarchical inhomogeneity of the WordNet structure, introduces an encoding that uniquely identifies the similarity between two concepts, and applies a cosine function to correct the nonlinear deviation of the results. Experimental results show that, on the MC30 and RG65 test sets, the Pearson correlation coefficients between the similarity values computed by this method and the corresponding human judgments both reach 0.87. In addition, the method maintains a high level of both computational performance and efficiency.

Key words: word similarity; edge weight; WordNet; encoding

Document code: A    CLC number: TP391    doi: 10.3778/j.issn.1002-8331.1607-0159

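Foundation items: National Natural Science Foundation of China (No. 61462010, No. 61363036); Natural Science Youth Foundation of Guangxi Normal University.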

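About the authors: GUO Xiaohua (born 1992), female, master's student, main research interest: natural language processing; PENG Qi (born 1988), male, assistant researcher and lecturer, main research interest: natural language processing; DENG Han (born 1991), female, master's student, main research interest: natural language processing; ZHU Xinhua (born 1965), corresponding author, male, professor and graduate supervisor, main research interest: natural language processing, E-mail: 1304765153@https://www.360docs.net/doc/5911539254.html.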

Received: 2016-07-12    Revised: 2016-09-20    Article number: 1002-8331(2018)01-0172-07

CNKI online-first publication: 2017-01-11, https://www.360docs.net/doc/5911539254.html,/kcms/detail/11.2127.TP.20170111.1014.024.html
