Research and Construction of an Endangered Language Spoken Corpus: A Case Study of the Lizu Language

Computer Engineering and Applications 计算机工程与应用

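2018, 54(2)

1 Introduction

According to Ethnologue, as of 2014, 424 of the 7106 recorded languages were on the verge of extinction and a further 203 were close to endangered [1]. UNESCO defines an endangered language as follows: when its native speakers stop speaking it, or use it in fewer and fewer domains, and no longer pass it on to the next generation, the language is called an endangered language. Preserving the original state of such languages and the cultural information they carry, and retaining this precious endangered linguistic and cultural heritage, is therefore an urgent goal today [2].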

The Lizu (吕苏) are a branch of the Tibetan people; within a village or ethnic community, only Lizu …

CAO Lei (操镭) 1, YIN Weibin (尹蔚彬) 2, SUN Qinyao (孙沁瑶) 1, WANG Zhi (王志) 3, YU Chongchong (于重重) 1, LI Daowei (李道玮) 1

1. College of Computer & Information Engineering, Beijing Technology & Business University, Beijing 100048, China

2. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China

3. College of History and Culture, Sichuan University, Chengdu 610064, China

CAO Lei, YIN Weibin, SUN Qinyao, et al. Research and construction of endangered language spoken corpus: case study on Lizu language. Computer Engineering and Applications, 2018, 54(2): 234-238.

Abstract: An endangered language spoken corpus is built to systematically preserve a language that is close to disappearing, to retain its vitality and the local culture it carries, and to make it available for study and research. Such a corpus mainly stores the original audio files, International Phonetic Alphabet (IPA) annotation, Chinese word-for-word gloss annotation, and Chinese translation annotation. Taking the endangered Lizu language as an example, this paper studies and builds an endangered language spoken corpus comprehensively and systematically, and implements automatic word segmentation and keyword extraction on the annotated corpus, providing an example for the later construction of a general endangered language corpus.

Key words: endangered language; spoken language corpus; Lizu language

Document code: A    CLC number: TP391    doi: 10.3778/j.issn.1002-8331.1608-0042

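Foundation items: Major Program of the National Social Science Foundation of China (No.14ZDB156); 2016 Graduate Research Capability Improvement Program; Humanities and Social Sciences Research Planning Fund of the Ministry of Education (No.16YJAZH072).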

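About the authors: CAO Lei (born 1991), female, master's student; research interests: pattern recognition and machine learning, data mining. YIN Weibin (born 1969), female, Ph.D., associate research fellow; research interests: Tibeto-Burman languages, language ecology of Tibetan areas. SUN Qinyao (born 1992), male, master's student; research interests: pattern recognition and machine learning, data mining. WANG Zhi (born 1988), male, doctoral student; research interests: anthropological theory and methods, Tibetan history. YU Chongchong (born 1971), female, Ph.D., professor; research interests: pattern recognition and machine learning, data mining; E-mail: chongzhy@… LI Daowei (born 1994), male; research interests: pattern recognition and machine learning.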

Received: 2016-08-01    Revised: 2016-12-05    Article ID: 1002-8331(2018)02-0234-05

CNKI online-first publication: 2017-03-23, …/kcms/detail/11.2127.TP.20170323.0832.022.html
