Chinese Nested Named Entity Recognition Corpus Construction

Journal of Chinese Information Processing, Vol. 32, No. 8, August 2018

Article ID: 1003-0077(2018)08-0019-08


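LI Yanqun 1,2, HE Yunqi 1,2, QIAN Longhua 1,2, ZHOU Guodong 1,2

(1. Natural Language Processing Laboratory, Soochow University, Suzhou, Jiangsu 215006, China;

2. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)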

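Abstract: Nested named entities carry rich information about entities and the semantic relations between them, and help improve the efficiency of information extraction. Because no unified, standard Chinese nested named entity corpus exists, current research on Chinese nested named entities is hard to compare. Building on existing named entity corpora, this paper constructs two Chinese nested named entity corpora with a semi-automatic method. First, the annotations in existing Chinese named entity corpora are used to automatically construct as many nested named entities as possible; the results are then adjusted manually to meet the annotation requirements for Chinese nested entities, yielding high-quality Chinese nested named entity recognition corpora. Preliminary within-corpus and cross-corpus experiments on nested entity recognition show that Chinese nested named entity recognition remains a fairly difficult problem that needs further study.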

Keywords: Chinese nested named entity recognition; conditional random fields; information extraction; corpus

CLC number: TP391 Document code: A

Chinese Nested Named Entity Recognition Corpus Construction

LI Yanqun 1,2, HE Yunqi 1,2, QIAN Longhua 1,2, ZHOU Guodong 1,2

(1. Natural Language Processing Laboratory, Soochow University, Suzhou, Jiangsu 215006, China;

2. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)

Abstract: Nested named entities contain rich entities and semantic relations between them, which helps improve the effectiveness of information extraction. Due to the lack of uniform and standard Chinese nested named entity corpora, it is currently difficult to compare research on Chinese nested named entities. Based on existing named entity corpora, this paper proposes a semi-automatic method to construct two Chinese nested named entity corpora. First, we use the annotation information in the Chinese named entity corpora to automatically construct as many nested named entities as possible, and then manually adjust them to meet our annotation requirements for Chinese nested entities, in order to build high-quality Chinese nested named entity corpora. Preliminary experiments on nested named entity recognition, both within and across the corpora, show that Chinese nested named entity recognition is still a quite difficult problem and requires further research.

Key words: Chinese nested named entity recognition; conditional random fields; information extraction; corpus

Received: 2017-10-19 Revised: 2018-01-11

Funding: National Natural Science Foundation of China (61373096, 61331011, 61673290)

0 Introduction

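The goal of information extraction is to extract entities and their relations from unstructured text and convert them into structured representations, thereby providing the data foundation for knowledge base construction [1-5]. Nested named entities carry rich entity information as well as the relations among those entities, and their structure is comparatively simple; nested named entity recognition has therefore become one of the topics worth studying in information extraction.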
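To make the notion of nesting concrete, the following minimal sketch represents entity mentions as character spans and lists every (outer, inner) pair in which one mention lies inside another. The `Span` type, the `nested_pairs` helper, the label set (`ORG`, `LOC`), and the example string are illustrative assumptions, not the annotation scheme of the corpora described in this paper.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type; this label set is hypothetical


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies inside outer
    and the two spans do not cover exactly the same characters."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contained = outer.start <= inner.start and inner.end <= outer.end
            same_extent = (inner.start, inner.end) == (outer.start, outer.end)
            if contained and not same_extent:
                pairs.append((outer, inner))
    return pairs


# Hypothetical example: an organization name that contains a shorter
# organization name, which in turn contains a location name.
text = "苏州大学计算机科学与技术学院"
spans = [
    Span(0, 14, "ORG"),  # the full school name
    Span(0, 4, "ORG"),   # 苏州大学 (Soochow University)
    Span(0, 2, "LOC"),   # 苏州 (Suzhou)
]

for outer, inner in nested_pairs(spans):
    print(text[inner.start:inner.end], "is nested in", text[outer.start:outer.end])
```

The quadratic scan is chosen for readability; for a sentence-level annotation set it is usually small enough, and a sorted sweep could replace it if needed.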

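Current approaches to nested named entity recognition all rely on supervised machine learning and therefore require corpora of a certain size. GENIA V3.02 [6] is a named entity corpus in the biomedical domain that includes nested entities and is widely used in biomedical named entity recognition research. The corpus contains 2,000 MEDLINE abstracts and 94,014 entity mentions, about 17% of which are nested inside other entities. EPPI [7] is another biomedical corpus, annotated with proteins and their interactions; drawn from PubMed and PubMed Central, it contains 217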
