Automatic identification of address description in unstructured Chinese natural language

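1 Introduction

Natural language is the main tool people use to communicate and exchange ideas, and natural language processing is an indispensable part of modern information science and technology research [1]. In the era of the Internet and big data, there is a massive amount of easily accessible address description data in Chinese natural language, such as the sentences on life-service websites that state the locations of various points of interest (i.e., geographic objects such as businesses, schools, banks, gas stations, and hospitals). These sentences reflect the linguistic and cognitive habits the public follows when describing spatial locations, and they contain rich spatial information. Using text-mining tech…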

ZHAO Weifeng 1,2, ZHANG Qin 1

1. College of Geology Engineering and Geomatics, Chang'an University, Xi'an 710054, China

2. State Key Laboratory of Geo-information Engineering, Xi'an 710054, China

ZHAO Weifeng, ZHANG Qin. Automatic identification of address description in unstructured Chinese natural language. Computer Engineering and Applications, 2016, 52(23): 19-24.

Abstract: Address descriptions written in natural language are abundant and easy to obtain on the Internet, and they carry a wealth of spatial information. Because such text is unstructured, this paper proposes a two-step approach that automatically extracts word and syntactic information from a corpus of address descriptions in Chinese natural language, in support of deeper mining of the associated spatial knowledge. In the first step, a gazetteer-independent Chinese word segmentation algorithm is designed according to the statistical regularities of character-string co-occurrence in the address corpus. A predefined list of common words that indicate or restrict other words in address statements can be introduced into this algorithm to improve the segmentation and to assist part-of-speech tagging. In the second step, a finite state machine model is defined to represent the common syntax of Chinese address descriptions and is then used to automatically match and recognize the syntactic structures of the segmented and tagged address statements. Experiments on statistical segmentation and syntactic recognition, based on a large-scale real corpus collected from the Internet, demonstrate the usability and effectiveness of the approach.

Key words: address description; natural language; Chinese word segmentation; syntactic recognition
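
The two steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the segmentation rule (cut wherever two adjacent characters co-occur fewer than `min_count` times in the corpus), the word list `PREDEFINED`, the tag names, the `TRANSITIONS` table, and the three sample statements are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical predefined word list: indicator/qualifier words and their tags.
PREDEFINED = {"路": "SUFFIX", "对面": "RELATION", "附近": "RELATION"}
_LONGEST_FIRST = sorted(PREDEFINED, key=len, reverse=True)

# Hypothetical address syntax: NAME SUFFIX [NAME [RELATION]].
TRANSITIONS = {
    ("START", "NAME"): "NAME",
    ("NAME", "SUFFIX"): "ROAD",
    ("ROAD", "NAME"): "POI",
    ("POI", "RELATION"): "RELATIVE",
}
ACCEPTING = {"ROAD", "POI", "RELATIVE"}


def count_bigrams(corpus):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for statement in corpus:
        for left, right in zip(statement, statement[1:]):
            counts[left + right] += 1
    return counts


def segment(statement, counts, min_count=2):
    """Cut at predefined words and wherever adjacent characters rarely co-occur."""
    tokens, buffer, i = [], "", 0
    while i < len(statement):
        word = next((w for w in _LONGEST_FIRST if statement.startswith(w, i)), None)
        if word is not None:
            if buffer:  # close the token collected so far
                tokens.append(buffer)
                buffer = ""
            tokens.append(word)
            i += len(word)
            continue
        char = statement[i]
        if buffer and counts[buffer[-1] + char] < min_count:
            tokens.append(buffer)  # weak co-occurrence: start a new token
            buffer = ""
        buffer += char
        i += 1
    if buffer:
        tokens.append(buffer)
    return tokens


def tag(tokens):
    """Tag predefined words with their class; everything else is a NAME."""
    return [PREDEFINED.get(token, "NAME") for token in tokens]


def recognize(tags):
    """Run the finite state machine; return the accepting state or None."""
    state = "START"
    for t in tags:
        state = TRANSITIONS.get((state, t))
        if state is None:
            return None
    return state if state in ACCEPTING else None


corpus = ["长安路邮局对面", "长安路邮局附近", "雁塔路医院对面"]
counts = count_bigrams(corpus)
tokens = segment(corpus[0], counts)
print(tokens, recognize(tag(tokens)))  # ['长安', '路', '邮局', '对面'] RELATIVE
```

With this threshold, names that occur only once in the toy corpus (such as 雁塔) are split into single characters and the third statement is rejected by the state machine, which suggests why a statistical method of this kind depends on a large corpus.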

Document code: A    CLC number: TP391    doi: 10.3778/j.issn.1002-8331.1512-0386

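Foundation items: National Natural Science Foundation of China (No.41301513); Open Research Fund of the State Key Laboratory of Geo-information Engineering (No.SKLGIE 2014-M-4-2); Fundamental Research Funds for the Central Universities (No.2014G1261056).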

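About the authors: ZHAO Weifeng (born 1981), male, Ph.D., lecturer; his research interests include intelligent navigation systems, spatial location services, and text data mining; E-mail: wfzhao1018@. ZHANG Qin (born 1958), female, Ph.D., professor; her research interests include the theory and methods of surveying data processing, nonlinear data processing theory, crustal deformation monitoring theory, and GPS positioning theory and applications.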

Received: 2015-12-30    Revised: 2016-08-23    Article ID: 1002-8331(2016)23-0019-06

Computer Engineering and Applications 计算机工程与应用
