Do We Need Chinese Word Segmentation for Statistical Machine Translation?


Natural Language Processing for Chinese vs. English


Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans using natural language. The field has advanced rapidly in recent years, with researchers worldwide working to improve the accuracy and effectiveness of NLP systems. This article compares NLP for Chinese with NLP for English.

Chinese NLP:
1. Character-based: A key difference is that Chinese is written with characters, whereas English is written with an alphabet. Chinese NLP systems therefore need to understand and process individual characters rather than space-delimited words.
2. Word segmentation: Chinese does not put spaces between words, so word segmentation is a crucial step in Chinese NLP. It involves deciding where one word ends and the next begins, which is challenging precisely because of the missing spaces.
3. Tones: Chinese is a tonal language, so the tone with which a word is spoken can change its meaning. NLP systems that deal with spoken Chinese (or pinyin input) need to recognize and account for these tonal differences to process the language accurately.

English NLP:
1. Word-based: Because English text is already divided into words, NLP systems can operate on words rather than individual characters. This makes some tasks, such as named entity recognition, easier in English.
2. Sentence structure: English has a more rigid sentence structure than Chinese, which makes tasks such as parsing and syntactic analysis more straightforward: most English sentences follow a specific subject-verb-object order, whereas Chinese word order is more flexible.
3. Verb conjugation: English verbs change form according to tense, person, and number. NLP systems need to recognize and interpret these verb forms to understand and generate English text accurately.

In conclusion, while Chinese and English NLP share much, such as the use of machine learning algorithms and linguistic resources, there are key differences that researchers must consider when developing NLP systems for each language. Understanding these differences helps advance the field and improve the performance of NLP systems in both Chinese and English.

Chinese Word Segmentation


What’s a word? (Cont.)

Gao Jianfeng defines Chinese words in his paper as one of the following four types:
(1) entries in a lexicon, (2) morphologically derived words, e.g. 研究研究, (3) factoids, e.g. dates, times, and currency amounts, and (4) named entities.

Agenda
Why Segment?
Significant Problems, Classic Methods
Overview of IRSEG
The Mathematics Model
Best Path / N-Best Paths Algorithm
Unknown Words
Ambiguities, Solutions
W* = arg max_W P(W) · P(S | W)
The Mathematics Model (Cont.)

According to the law of large numbers, a word's probability can be estimated by its relative frequency:

P(wi) ≈ ki / Σ_{j=1}^{M} kj

where ki is the number of times wi appears in the training samples and M is the total number of words.
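As a worked example with invented counts (for illustration only): if 中国 occurs 1,200 times in training samples containing 1,000,000 word tokens in total, then P(中国) ≈ 1200 / 1,000,000 = 0.0012.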

Adding Spaces to a String with a Regular Expression


ROUGE is an evaluation method used in natural language processing; when using it, the text being evaluated sometimes needs a space inserted between every pair of Chinese characters.

The original instructions read: "The recommended ROUGE metrics are Recall and F scores of character-based ROUGE-1, ROUGE-2 and ROUGE-SU4. Character-based evaluation means that we do not need to perform Chinese word segmentation when running the ROUGE toolkit. Instead, we only need to separate each Chinese character by using a blank space." A regular expression gives a very convenient way to do this: replaceAll("(.{1})","$1 "). Here .{1} matches any single character, the parentheses define a capture group, and $1 refers to that first group, so for the string ABCDEF every character matched by .{1} is replaced by that same character followed by a space.

The result is A B C D E F.

Extension: String str = "abcdefghijklmn"; String str_next = str.replaceAll("(.{1})(.{2})(.{1})","$2$3"); What is the result? You have probably already worked it out: bcdfghjklmn. The pattern (.{1})(.{2})(.{1}) matches a, bc, d; then e, fg, h; then i, jk, l. $2 corresponds to bc, fg, jk and $3 to d, h, l, while the trailing mn is too short to match and is left unchanged.
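For completeness, here is a small runnable Java example of the two replaceAll calls above (the class and variable names are mine):

```java
public class CharSpacer {
    public static void main(String[] args) {
        String text = "ABCDEF";
        // Insert a space after every character, as required for
        // character-based ROUGE evaluation of Chinese text.
        String spaced = text.replaceAll("(.{1})", "$1 ").trim();
        System.out.println(spaced); // A B C D E F

        // The extension example from the post: keep only groups 2 and 3.
        String str = "abcdefghijklmn";
        String strNext = str.replaceAll("(.{1})(.{2})(.{1})", "$2$3");
        System.out.println(strNext); // bcdfghjklmn
    }
}
```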

Simple Features for Chinese Word Sense Disambiguation


Simple Features for Chinese Word Sense DisambiguationHoa Trang Dang,Ching-yi Chia,Martha Palmer,and Fu-Dong ChiouDepartment of Computer and Information ScienceUniversity of Pennsylvaniahtd,chingyc,mpalmer,chioufd@AbstractIn this paper we report on our experiments on au-tomatic Word Sense Disambiguation using a max-imum entropy approach for both English and Chi-nese verbs.We compare the difficulty of the sense-tagging tasks in the two languages and investigatethe types of contextual features that are useful foreach language.Our experimental results suggestthat while richer linguistic features are useful forEnglish WSD,they may not be as beneficial for Chi-nese.1IntroductionWord Sense Disambiguation(WSD)is a centralopen problem at the lexical level of Natural Lan-guage Processing(NLP).Highly ambiguous wordspose continuing problems for NLP applications.They can lead to irrelevant document retrieval in In-formation Retrieval systems,and inaccurate transla-tions in Machine Translation systems(Palmer et al.,2000).For example,the Chinese wordover many different languages,data for the Chinese lexical sample task was not made available in time for any systems to compete.Instead,we report on two experiments that we ran using our own lexicon and two separate Chinese corpora that are very sim-ilar in style(news articles from the People’s Repub-lic of China),but have different types and levels of annotation–the Penn Chinese Treebank(CTB)(Xia et al.,2000),and the People’s Daily News(PDN) corpus from Beijing University.We discuss the util-ity of different types of annotation for successful au-tomatic word sense disambiguation.2English ExperimentOur maximum entropy WSD system was de-signed to combine information from many differ-ent sources,using as much linguistic knowledge as could be gathered automatically by current NLP tools.In order to extract the linguistic features nec-essary for the model,all sentences werefirst auto-matically part-of-speech-tagged using a maximum entropy tagger(Ratnaparkhi,1998)and parsed us-ing the Collins parser(Collins,1997).In addi-tion,an automatic named entity tagger(Bikel et al., 1997)was run on the sentences to map proper nouns to a small set of semantic classes.Chodorow,Leacock and Miller(Chodorow et al., 2000)found that different combinations of topical and local features were most effective for disam-biguating different words.Following their work,we divided the possible model features into topical fea-tures and several types of local contextual features. 
Topical features looked for the presence of key-words occurring anywhere in the sentence and any surrounding sentences provided as context(usually one or two sentences).The set of200-300keywords is specific to each lemma to be disambiguated,and is determined automatically from training data so as to minimize the entropy of the probability of the senses conditioned on the keyword.The local features for a verb in a particular sen-tence tend to look only within the smallest clause containing.They include collocational features requiring no linguistic preprocessing beyond part-of-speech tagging(1),syntactic features that cap-ture relations between the verb and its complements (2-4),and semantic features that incorporate infor-mation about noun classes for subjects and objects (5-6):1.the word,the part of speech of,the partof speech of words at positions-1and+1rela-tive to,and words at positions-2,-1,+1,+2, relative to2.whether or not the sentence is passive3.whether there is a subject,direct object,indi-rect object,or clausal complement(a comple-ment whose node label is S in the parse tree) 4.the words(if any)in the positions of subject,direct object,indirect object,particle,preposi-tional complement(and its object)5.a Named Entity tag(PERSON,ORGANIZA-TION,LOCATION)for proper nouns appear-ing in(4)6.WordNet synsets and hypernyms for the nounsappearing in(4)2.1English ResultsThe maximum entropy system’s performance on the verbs from the evaluation data for S ENSEVAL-1(Kilgarriff and Rosenzweig,2000)rivaled that of the best-performing systems.We looked at the effect of adding topical features to local features that either included WordNet class features or used just lexical and named entity features.In addition, we experimented to see if performance could be improved by undoing passivization transformations to recover underlying subjects and objects.This was expected to increase the accuracy with which verb arguments could be identified,helping in cases where selectional restrictions on arguments played an important role in differentiating between senses. The best overall variant of the system for verbs did not use WordNet class features,but included topical keywords and passivization transformation, giving an average verb accuracy of72.3%.If only the best combination of feature sets for each verb is used,then the maximum entropy mod-els achieve73.7%accuracy.These results are not significantly different from the reported results of the best-performing systems(Yarowsky,2000). Our system was competitive with the top perform-ing systems even though it used only the training data provided and none of the information from the dictionary to identify multi-word constructions. 
Later experiments show that the ability to correctly identify multi-word constructions improves perfor-mance substantially.We also tested the WSD system on the verbs from the English lexical sample task for S ENSEVAL-2.1Feature Type(local only)Accuracy48.3collocation53.9+syntax59.0+syntax+semanticsdraw,dress,drift,drive,face,ferret,find,keep,leave,live, match,play,pull,replace,see,serve,strike,train,treat,turn, use,wander,wash,work.other senses for other parts of speech,with an av-erage of6dictionary senses per word.Thefirst 20words were chosen by randomly selecting sev-eralfiles totaling5000words from the100K-word Penn Chinese Treebank,and choosing only those words that had more than one dictionary verb sense and that occurred more than three times in these files.The remaining8words were chosen by se-lecting all words that had more than one dictio-nary verb sense and that occurred more than25 times in the CTB.The definitions for the words were based on the CETA(Chinese-English Transla-tion Assistance)dictionary(Group,1982)and other hard-copy dictionaries.Figure1shows an exam-ple dictionary entry for the most common sense of jian4.For each word,a sense entry in the lexi-con included the definition in Chinese as well as in English,the part of speech for the sense,a typ-ical predicate-argument frame if the sense is for a verb,and an example sentence.With these defini-tions,each word was independently sense-tagged by two native Chinese-speaking annotators in a double-blind manner.Sense-tagging was done primarily us-ing raw text,without segmentation,part of speech, or bracketing information.Afterfinishing sense tag-ging,the annotators met to compare and to discuss their results,and to modify the definitions if neces-sary.The gold standard sense-taggedfiles were then made after all this discussion.In a manner similar to our English approach,we included topical features as well as collocational, syntactic,and semantic local features in the maxi-mum entropy models.Collocational features could be extracted from data that had been segmented into words and tagged for part of speech:the target wordthe part of speech tag of the target wordthe words(if any)within2positions of the tar-get wordthe part of speech of the words(if any)immedi-ately preceding and following the target wordwhether the target word follows a verb<entry id="00007" word=",<word>2/Feature TypeStd Dev collocation (no part of speech)1.093.494.494.3collocation +topic 1.0+syntax +topic0.9+syntax +semantics +topic0.9Table 2:Overall accuracy of maximum entropy sys-tem using different subsets of features for Penn Chi-nese Treebank words (manually segmented,part-of-speech-tagged,parsed).3.1Penn Chinese TreebankAll sentences containing any of the 28target words were extracted from the Penn Chinese Treebank,yielding between 4and 1143occurrence (160av-erage)for each of the target words.The manual segmentation,part-of-speech tags,and bracketing of the CTB were used to extract collocational and syntactic features.The overall accuracy of the system on the 28words in the CTB was 94.4%using local colloca-tional and syntactic features.This is significantly better than the baseline of 76.7%obtained by tag-ging all instances of a word with the most frequent sense of the word in the CTB.Considering only the 23words for which more than one sense occurred in the CTB,overall system accuracy was 93.9%,com-pared with a baseline of 74.7%.Figure 2shows the results broken down by word.As with the English data,we experimented with different types of 
features.Table 2shows the per-formance of the system using different subsets of features.While the system’s accuracy using syntac-tic features was higher than using only collocationalfeatures (significant at),the improve-Word pinying (translation) Events Senses Baseline Acc. Std Dev-------------------------------------------------------------------------------chu1 (to go out/to come out) 34 5 50.0 50.0 11.1dao3 (to come/to arrive) 219 10 36.5 82.7 7.1hui4 (will/be able to) 86 6 58.1 91.9 6.0ke3 (may/can) 57 1 100 100 0.0rang4 (to let/to allow) 9 1 100 100 0.0shuo1 (to say in spoken words) 306 6 86.9 95.1 2.0wei2/wei4 (to be/to mean) 473 7 32.8 86.1 2.4zai4 (to exist/to be at(in, on)) 1143 4 96.9 99.3 0.4yao4 (must/should/to intend to) 106 6 65.1 62.3 8.93The PDN corpus can be found at/research/corpus/dwldform1.asp.The annotation guidelines are not exactly thesame as for the Penn CTB,and can be found at/research/corpus/coprus-annotation.htm.Feature Type Std Dev collocation(no part of speech) 2.270.371.772.7 collocation+topic 3.2+syntax+topic 3.9+syntax+semantics+topic 3.7Table3:Overall accuracy of maximum entropy sys-tem using different subsets of features for People’s Daily News words(automatically segmented,part-of-speech-tagged,parsed).Feature Type Std Dev collocation(no part of speech) 4.374.7 collocation+topic 3.1 Table4:Overall accuracy of maximum entropy sys-tem using different subsets of features for People’s Daily News words(manually segmented,part-of-speech-tagged).in the CTB corpus.About200sentences for each word were selected randomly from PDN and sense-tagged as with the CTB.We automatically annotated the PDN data to yield the same types of annotation that had been available in the CTB.We used a maximum-matching algorithm and a dictionary compiled from the CTB(Sproat et al.,1996;Xue,2001)to do seg-mentation,and trained a maximum entropy part-of-speech tagger(Ratnaparkhi,1998)and TAG-based parser(Bikel and Chiang,2000)on the CTB to do tagging and parsing.4Then the same feature extrac-tion and model-training was done for the PDN cor-pus as for the CTB.The system performance is much lower for the PDN than for the CTB,for several reasons.First, the PDN corpus is more balanced than the CTB, which contains primarilyfinancial articles.A wider range of usages of the words was expressed in PDN than in CTB,making the disambiguation task more difficult;the average number of senses for the PDN words was8.2(compared to3.5for CTB),and theambiguation.Our experience in English has shown that the ability to identify multi-word constructions significantly improves sense-tagging performance. Multi-character Chinese words,which are identified by word segmentation,may be the analogy to En-glish multi-word constructions.5AcknowledgmentsThis work has been supported by National Sci-ence Foundation Grants,NSF-9800658and NSF-9910603,and DARPA grant N66001-00-1-8915at the University of Pennsylvania.The authors would also like to thank the anonymous reviewers for their valuable comments.ReferencesAdam L.Berger,Stephen A.Della Pietra,and Vin-cent J.Della Pietra.1996.A maximum entropy approach to natural language -putational Linguistics,22(1).Daniel M.Bikel and David Chiang.2000.Two sta-tistical parsing models applied to the chinese tree-bank.In Proceedings of the Second Chinese Lan-guage Processing Workshop,Hong Kong. 
Daniel M.Bikel,Scott Miller,Richard Schwartz, and Ralph Weischedel.1997.Nymble:A high-performance learning name-finder.In Proceed-ings of the Fifth Conference on Applied Natural Language Processing,Washington,DC.Martin Chodorow,Claudia Leacock,and George A. Miller.2000.A topical/local classifier for word sense identifiputers and the Human-ities,34(1-2),April.Special Issue on SENSE-V AL.Michael Collins.1997.Three generative,lexi-calised models for statistical parsing.In Pro-ceedings of the35th Annual Meeting of the As-sociation for Computational Linguistics,Madrid, Spain,July.Hoa Trang Dang and Martha -bining contextual features for word sense disam-biguation.In Proceedings of the Workshop on Word Sense Disambiguation:Recent Successes and Future Directions,Philadelphia,PA.Philip Edmonds and Scott Cotton.2001. SENSEVAL-2:Overview.In Proceedings of SENSEVAL-2:Second International Work-shop on Evaluating Word Sense Disambiguation Systems,Toulouse,France,July.Chinese-English Translation Assistance Group. 1982.Chinese Dictionaries:an Extensive Bib-liography of Dictionaries in Chinese and Other Languages.Greenwood Publishing Group. Nancy Ide and Jean Veronis.1998.Introduction to the special issue on word sense disambiguation: The state of the putational Linguistics, 24(1).Adam Kilgarriff and Martha Palmer.2000.In-troduction to the special issue on SENSEVAL. Computers and the Humanities,34(1-2),April. Special Issue on SENSEVAL.A.Kilgarriff and J.Rosenzweig.2000.Framework and results for English puters and the Humanities,34(1-2),April.Special Issue on SENSEVAL.Ruo-Ping Mo.1992.A conceptual structure that is suitable for analysing chinese.Technical Report CKIP-92-04,Academia Sinica,Taipei,Taiwan. M.Palmer,Chunghye Han,Fei Xia,Dania Egedi, and Joseph Rosenzweig.2000.Constraining lex-ical selection across languages using tags.In Anne Abeille and Owen Rambow,editors,Tree Adjoining Grammars:formal,computational and linguistic aspects.CSLI,Palo Alto,CA. Martha Palmer,Christiane Fellbaum,Scott Cotton, Lauren Delfs,and Hoa Trang Dang.2001.En-glish tasks:All-words and verb lexical sample. In Proceedings of SENSEVAL-2:Second Interna-tional Workshop on Evaluating Word Sense Dis-ambiguation Systems,Toulouse,France,July. Adwait Ratnaparkhi.1998.Maximum Entropy Models for Natural Language Ambiguity Resolu-tion.Ph.D.thesis,University of Pennsylvania. Richard Sproat,Chilin Shih,William Gale,and Nancy Chang.1996.A stochasticfinite-state word segmentation algorithm for -putational Linguistics,22(3).Fei Xia,Martha Palmer,Nianwen Xue,Mary Ellen Okurowski,John Kovarik,Fu-Dong Chiou, Shizhe Huang,Tony Kroch,and Mitch Mar-cus.2000.Developing guidelines and ensuring consistency for chinese text annotation.In Pro-ceedings of the second International Conference on Language Resources and Evaluation,Athens, Greece.Nianwen Xue.2001.Defining and Automatically Identifying Words in Chinese.Ph.D.thesis,Uni-versity of Delaware.David Yarowsky.2000.Hierarchical decision lists for word sense puters and the Humanities,34(1-2),April.Special Issue on SENSEVAL.。
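The feature set in the paper above is mostly shallow context. As a rough illustration only (not the authors' code), here is a Java sketch that collects the collocational features listed in item (1): the target word, its part of speech, the POS of the adjacent words, and the words within two positions of the target. The tokens and POS tags are assumed to come from an external tokenizer and tagger, and the feature naming is my own.

```java
import java.util.*;

// Rough sketch of the collocational feature template (item 1 above) for a
// maximum entropy WSD classifier. Feature strings would be fed to a maxent
// toolkit; padding is used at sentence boundaries.
public class CollocationFeatures {
    public static List<String> extract(String[] words, String[] pos, int target) {
        List<String> feats = new ArrayList<>();
        feats.add("w0=" + words[target]);
        feats.add("p0=" + pos[target]);
        for (int offset : new int[]{-2, -1, 1, 2}) {
            int i = target + offset;
            String w = (i >= 0 && i < words.length) ? words[i] : "<PAD>";
            feats.add("w" + offset + "=" + w);
            if (offset == -1 || offset == 1) {                // POS only for the immediate neighbours
                String p = (i >= 0 && i < pos.length) ? pos[i] : "<PAD>";
                feats.add("p" + offset + "=" + p);
            }
        }
        return feats;
    }

    public static void main(String[] args) {
        String[] words = {"the", "board", "will", "meet", "again", "tomorrow"};
        String[] pos   = {"DT", "NN", "MD", "VB", "RB", "NN"};
        System.out.println(extract(words, pos, 3));
        // [w0=meet, p0=VB, w-2=board, w-1=will, p-1=MD, w1=again, p1=RB, w2=tomorrow]
    }
}
```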

Introduction to Chinese Word Segmentation


Main Segmentation Methods (1)
String-matching (dictionary-based) segmentation: the character string to be analyzed is matched, according to some strategy, against the entries of a "sufficiently large" machine dictionary. If a string is found in the dictionary, the match succeeds and the string can be cut off as a word; otherwise it is not cut. This is simple to implement and practical, but the biggest drawback of such mechanical segmentation is that the completeness of the dictionary cannot be guaranteed.
a. Forward maximum matching (left to right)
b. Backward maximum matching (right to left)
c. Minimum segmentation (minimize the number of words per sentence)
d. Bidirectional matching (scan both left-to-right and right-to-left)
A sketch of forward maximum matching appears below.
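The following is a minimal Java sketch of forward maximum matching under a small in-memory dictionary; the dictionary contents and the fallback to single characters are my own assumptions, not a specific published tool.

```java
import java.util.*;

public class ForwardMaxMatch {
    // Greedily take the longest dictionary word starting at each position.
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = null;
            // Try the longest candidate first, shrinking until a dictionary hit.
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            if (match == null) match = text.substring(i, i + 1); // fall back to a single character
            words.add(match);
            i += match.length();
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("南京市", "南京", "市长", "长江大桥", "长江", "大桥"));
        System.out.println(segment("南京市长江大桥", dict, 4));
        // [南京市, 长江大桥]
    }
}
```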
Methods for Recognizing Unknown Words
Statistical methods: estimate character and word frequencies from the co-occurrence counts of adjacent words. Advantages: few resources needed, fast, efficient. Disadvantages: lower accuracy, considerable system overhead, and collecting a reasonable, representative statistical source is itself difficult.
Rule-based methods: the core is a set of rules formulated from linguistic principles and knowledge. Advantages: relatively accurate recognition. Disadvantages: it is hard to enumerate all rules, rules often conflict with one another, and the system becomes large, complex and resource-hungry without being very efficient.
Combining the two: take the strengths of each, e.g. add statistical information to the rules, or apply filtering rules after the statistical step, to improve the overall recognition of new words.
Unknown (Out-of-Vocabulary) Words
Although an ordinary dictionary covers most words, a considerable number of words can never be exhaustively included in the system dictionary; these are called unknown words (OOV) or new words. Categories:
Proper nouns: person names, place names, organization names, trademarks. Internet slang: 给力, 神马. Reduplicated words: 高高兴兴, 研究研究. Derived words: 一次性用品. Domain-specific terms: 互联网, 排气量.
Main Segmentation Methods (3)
Statistics-based segmentation: the basic principle is to decide whether a character string forms a word from its frequency in a corpus. Purely dictionary-free segmentation has its limits: it often extracts strings that co-occur frequently but are not words, such as 这一, 之一 and 提供了. Practical statistical segmentation systems therefore also use a basic segmentation dictionary (of common words) for string matching, combining string frequency statistics with dictionary matching. This keeps the speed and efficiency of matching-based segmentation while exploiting the ability of dictionary-free statistics to recognize new words from context and resolve ambiguities automatically.

PEP (People's Education Press) Junior Grade 9 English, Teaching Courseware: Unit 5 What are the shirts made of?

…and strive for the Chinese Dream of the great rejuvenation of the nation, further strengthening students' sense of patriotic pride and inspiring their love for the Party and the country.
Thank you!
Homework: Write a composition about yourself now and in the past.
Unit 5
What are the shirts made of?
Section B Reading — 曹宛彤 (Cao Wantong)
Three minutes before class: let's learn paper cutting together
Video Enjoyment!
What is the video about?
Today’s learning task:
We’re going to read a passage to introduce Chinese traditional arts.
1. Read the whole passage again. 2. Complete 2e after class.
(Affective aims: through the passage's help with and understanding of daily life, let students recognize the improvement in the quality of people's lives, and realize that today, under the leadership of the Communist Party of China, the country is strong and its national strength increasingly leads the world; under the Party's leadership the whole nation is striving to realize the rejuvenation of the Chinese nation.)
Decide which group to join. Give your reasons according to your opinion.
This unit is organized around traditional cultural elements from different countries, with making paper cuttings, sky lanterns and pottery as the central topics. Students learn to use the passive voice in the simple past tense to talk about the traditional culture around them, and come to understand that the beauty in human life has created a rich material civilization. By becoming familiar with the folk artworks in everyday use around them, students broaden their horizons, enrich their experience, and learn about the status and influence of Chinese traditional culture in the world.

1.1.2 Common SEO Terms

Doing SEO for a website is not just about achieving a relatively high ranking; more importantly, it is about enabling every page of the site to attract traffic and generate conversions.

This requires the webmaster to pay attention to the details.

For novice webmasters, getting the details right starts with mastering the technical terms related to SEO.

The common SEO terms are introduced below.

1. Web crawler (spider): a web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Page-crawling strategies can be divided into depth-first, breadth-first and best-first.

Figure 1-2 shows the paths a web crawler follows when fetching pages.

Figure 1-2: web crawler fetch paths (nodes A through I). The depth-first search strategy starts from the initial page, selects one URL to enter, analyzes the URLs on that page and selects one of them, fetching link after link, and only processes the next route after the current route has been completely handled.

Taking Figure 1-2 as an example, the fetch paths are: A-B; A-C; A-D; A-G-H-I; A-E-F.

The breadth-first search strategy means that the crawler finishes the current level of the search before moving on to the next level.

Taking Figure 1-2 as an example, the fetch path is: A-B-C-D-E-G; F-H; I.

The best-first search strategy uses a page-analysis algorithm to predict how similar a candidate URL is to the target page, or how relevant it is to the topic, and fetches the most highly rated pages first.

Taking Figure 1-2 as an example, if page B has the highest similarity score, followed by page F and then page G, the crawler fetches page B first.
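As a minimal illustration of the breadth-first ordering described above, the Java sketch below uses an adjacency list assumed from the depth-first paths listed for Figure 1-2 (the figure itself is not reproduced here, so the link structure is an assumption):

```java
import java.util.*;

public class BfsCrawlOrder {
    public static void main(String[] args) {
        // Link structure assumed from the depth-first paths A-B, A-C, A-D, A-G-H-I, A-E-F.
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C", "D", "G", "E"));
        links.put("G", Arrays.asList("H"));
        links.put("E", Arrays.asList("F"));
        links.put("H", Arrays.asList("I"));

        // Breadth-first: finish the current level before descending to the next one.
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        frontier.add("A");
        seen.add("A");
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        System.out.println(seen); // [A, B, C, D, G, E, H, F, I]
    }
}
```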

2. Chinese word segmentation (中文分词): Chinese word segmentation means splitting a continuous run of Chinese characters, which has no internal spaces, into separate, meaningful words.

Word segmentation is a step peculiar to Chinese search engines: in English and other Latin-script text, words are naturally separated by spaces, so there is no need for segmentation, whereas in Chinese only characters, sentences and paragraphs can be delimited by obvious boundary markers; individual words have no formal delimiter.

Chinese Information Processing

中文信息处理英语The rapid growth and widespread adoption of digital technologies in recent decades have transformed the way we access, process, and communicate information. One area where this impact has been particularly profound is the field of Chinese information processing. As the world's most populous country and a major economic powerhouse, China's unique linguistic and cultural landscape has presented both challenges and opportunities in the digital age.At the heart of Chinese information processing lies the intricate nature of the Chinese writing system. Unlike the alphabetic scripts used in many Western languages, Chinese utilizes a logographic system where each character represents a distinct word or concept. This complexity poses unique challenges for computer-based text input, storage, and retrieval. Nonetheless, significant advancements in technology have enabled the seamless integration of Chinese language processing into various digital platforms and applications.One of the key developments in this field has been the evolution of Chinese input methods. Traditional methods, such as the Cangjie andWubi input systems, required users to memorize complex sequences of keystrokes to generate Chinese characters. However, the advent of pinyin-based input, where users type the Romanized phonetic representation of a character, has revolutionized the way people interact with Chinese digital content. This approach, combined with intelligent predictive algorithms and machine learning, has greatly improved the efficiency and accessibility of Chinese text input, making it more intuitive for both native and non-native users.Beyond input methods, the processing of Chinese text has also seen remarkable progress. The development of natural language processing (NLP) techniques, such as word segmentation, named entity recognition, and sentiment analysis, has enabled the automated extraction of meaningful information from vast amounts of Chinese textual data. These advancements have paved the way for a wide range of applications, from intelligent search engines and language translation services to sentiment analysis tools and content recommendation systems.One particularly noteworthy application of Chinese information processing is in the realm of machine translation. The inherent complexities of the Chinese language, including its tonal nature, idiomatic expressions, and lack of grammatical markers, have long posed challenges for accurate and fluent translation. However, the integration of neural machine translation (NMT) models, whichleverage deep learning algorithms, has significantly improved the quality and fluency of Chinese-to-English and English-to-Chinese translations. As a result, cross-cultural communication and collaboration have become more seamless, facilitating the exchange of ideas and knowledge between China and the rest of the world.The impact of Chinese information processing extends beyond language-specific applications. The vast amount of digital data generated in China, coupled with the country's technological advancements, has also contributed to the development of innovative data analytics and artificial intelligence (AI) solutions. 
Chinese tech giants, such as Baidu, Alibaba, and Tencent, have invested heavily in research and development to harness the power of big data and machine learning to address a wide range of challenges, from urban planning and transportation optimization to healthcare and education.One notable example of this is the application of AI in Chinese healthcare. The integration of natural language processing and computer vision techniques has enabled the development of intelligent medical diagnosis systems that can analyze medical records, radiological images, and even patient-doctor conversations to provide accurate and timely insights. These advancements have the potential to revolutionize the healthcare industry, improving patient outcomes and reducing the burden on medical professionals.Another area where Chinese information processing has made significant strides is in the realm of social media and digital communication. The widespread adoption of platforms like WeChat, Weibo, and TikTok in China has generated vast amounts of user-generated content, which has been leveraged for targeted advertising, content recommendation, and social network analysis. The ability to process and analyze this data in real-time has enabled Chinese tech companies to stay at the forefront of the digital landscape, providing personalized and engaging experiences for their users.Despite these advancements, the field of Chinese information processing is not without its challenges. Issues such as data privacy, algorithmic bias, and the ethical implications of AI-driven decision-making have become increasingly prominent. As the technology continues to evolve, it is crucial that developers and policymakers work collaboratively to address these concerns and ensure that the benefits of Chinese information processing are distributed equitably and responsibly.Furthermore, the rapid pace of technological change has also highlighted the need for continuous education and skill development in this field. As new techniques and tools emerge, professionals in areas such as natural language processing, machinelearning, and data analytics must constantly update their knowledge and adapt their skillsets to stay relevant and competitive.In conclusion, the field of Chinese information processing has witnessed remarkable progress in recent years, driven by advancements in digital technologies and the growing importance of China on the global stage. From improved language input and processing to innovative applications in healthcare, social media, and beyond, the impact of these developments has been far-reaching. As the world becomes increasingly interconnected, the continued evolution and responsible application of Chinese information processing will undoubtedly play a crucial role in shaping the future of global communication, collaboration, and problem-solving.。

Chinese Word Segmentation Using Minimal Linguistic Knowledge

Chinese Word Segmentation Using Minimal Linguistic KnowledgeAitao ChenSchool of Information Management and SystemsUniversity of California at BerkeleyBerkeley,CA94720,USAaitao@AbstractThis paper presents a primarily data-driven Chi-nese word segmentation system and its perfor-mances on the closed track using two corpora atthefirst international Chinese word segmentationbakeoff.The system consists of a new words rec-ognizer,a base segmentation algorithm,and pro-cedures for combining single characters,suffixes,and checking segmentation consistencies.1IntroductionAt thefirst Chinese word segmentation bakeoff,we partici-pated in the closed track using the Academia Sinica corpus(for short)and the Beijing University corpus(forshort).We will refer to the segmented texts in the trainingcorpus as the training data,and to both the unsegmentedtesting texts and the segmented texts(the reference texts)as the testing data.For details on the word segmentationbakeoff,see(Sproat and Emerson,2003).2Word segmentationNew texts are segmented in four steps which are describedin this section.New words are automatically extracted fromthe unsegmented testing texts and added to the base dictio-nary consisting of words from the training data before thetesting texts are segmented,line by line.2.1Base segmentation algorithmGiven a dictionary and a sentence,our base segmenta-tion algorithmfinds all possible segmentations of the sen-tence with respect to the dictionary,computes the prob-ability of each segmentation,and chooses the segmenta-tion with the highest probability.If a sentence of char-acters,,has a segmentation of words,,then the probability of the segmentationis estimated as, where denotes a segmentation of a sentence.The prob-ability of a word is estimated from the training corpus as, where is the number of times that character occurs in the training data,and is the number of times that character is in a word of two or more characters.We do not want to combine the single characters that oc-cur as words alone more often than not.For both the PK training data and the AS training data,we divided the train-ing data into two parts,two thirds for training,and one third for system development.We found that setting the thresh-old of the in-word probability to0.85or around works best on the development data.After the initial segmentation of a sentence,the consecutive single-characters are com-bined into one word if their in-word probabilities are over the threshold of0.85.The text fragmentcontains a new word which is not in the PK training data.After the initial segmentation,the text is segmented into/////,which is subsequently changed into//after combining the three consecutive characters.The in-word probabilities for the three characters and are0.94,0.98,and 0.99,respectively.2.3Combining suffixesA small set of characters,such as and fre-quently occur as the last character in words.We selected 145such characters from the PK training corpus,and113 from the AS corpus.After combining single characters,we combine a suffix character with the word preceding it if the preceding word is at least two-character long.2.4Consistency checkThe last step is to perform consistency checks.A seg-mented sentence,after combining single characters and suf-fixes,is checked against the training data to make sure that a text fragment in a testing sentence is segmented in the same way as in the training data if it also occurs in the training data.From the PK training corpus,we cre-ated a phrase segmentation table consisting of word 
quad-grams,trigrams,bigrams,and unigrams,together with their segmentations and frequencies.Our phrase table created from the AS corpus does not include word quad-grams to reduce the size of the phrase table.For example,from the training text///we create the following entries(only some are listed to save space): text fragment freq segmentationAfter a new sentence is processed by thefirst three steps, we look up every word quad-grams of the segmented sen-tence in the phrase segmentation table.When a word quad-gram is found in the phrase segmentation table with a differ-ent segmentation,we replace the segmentation of the word quad-gram in the segmented sentence by its segmentation found in the phrase table.This process is continued to word trigrams,word bigrams,and word unigrams.The idea is that if a text fragment in a new sentence is found in the training data,then it should be segmented in the same way as in the training data.As an example,in the PK testingdata,the sentence is segmented into////// /after thefirst three steps(the two characters and are not,but should be,combined because the in-word probability of character which is0.71,is below thepre-defined threshold of0.85).The word bigram is found in the phrase segmentation table with a differ-ent segmentation,//So the segmentation /is changed to the segmentation// in thefinal segmented sentence.In essence,when a text fragment has two or more segmentations,its surrounding context,which can be the preceding word,the following word,or both,is utilized to choose the most appropriate segmentation.When a text fragment in a testing sentence never occurred in the same context in the training data,then the most frequent segmentation found in the training data is chosen.Consider the text again,in the testing data, is segmented into//by our base algorithm.In this case,never occurred in the context of or The consistency check step changes//into/// since is segmented into/515times,but is treated as one word105times in the training data.3New words recognitionWe developed a few procedures to identify new words in the testing data.Ourfirst procedure is designed to recog-nize numbers,dates,percent,time,foreign words,etc.We defined a set of characters consisting of characters such as the digits‘0’to‘9’(in ASCII and GB),the letters‘a’to ’z’,‘A’to‘Z’(in ASCII and GB),‘’,and the like.Any consecutive sequence of the characters that are in this pre-defined set of characters is extracted and post-processed.A set of rules is implemented in the post-processor.One such rule is that if an extracted text fragments ends with the character and contains any character inthen remove the ending character and keep the remaining fragment as a word.For example,our recognizer will extract the text fragment andsince all the characters are in the pre-defined set of charac-ters.The post-processor will strip off the trailing character and return and as words.For per-sonal names,we developed a program to extract the names preceding texts such as and a pro-gram to detect and extract names in a sequence of names separated by the Chinese punctuation“”,such asa program to extractdict P1pkd10.8380.050 2pkd20.8920.347 3pkd30.9200.507 4pkd30.9350.610 5pkd30.9400.655 6pkd30.9380.647steps R F10.9500.9430.97010.9500.9470.9681-20.9510.9510.9641-30.9490.9510.9611-40.9660.9610.980 Table2:Results for the closed track using the AS corpus. 
corpus R Fasd10.9120.000PK0.9090.8670.972 Table3:Performances of the maximum matching(forward) using words from the training data.pus,pkd2consists of the words in pkd1and the words con-verted from pkd1by changing the GB encoding to ASCII encoding for the numeric digits and the English letters,and pkd3consists of the words in pkd2and the words automat-ically extracted from the PK testing texts using the proce-dures described in section3.The columns labeled R,P and F give the recall,precision,and F score,respectively.The columns labeled and show the recall on out-of-vocabulary words and the recall on in-vocabulary words, respectively.All evaluation scores reported in this paper are computed using the score program written by Richard Sproat.We refer readers to(Sproat and Emerson,2003)for details on the evaluation measures.For example,row4in table1gives the results using pkd3dictionary when a sen-tence is segmented by the base algorithm,and then the sin-gle characters in the initial segmentation are combined,but suffixes are not attached and consistency check is not per-formed.The last row in table2presents our official results for the closed track using the AS corpus.The asd1dictio-nary contains only the words from the AS training corpus, while the asd2consists of the words in asd1and the new words automatically extracted from the AS testing texts us-ing the new words recognition described in section3.The results show that new words recognition and joining single characters contributed the most to the increase in precision, while the consistency check contributed the most to the in-crease in recall.Table3gives the results of the maximum matching using only the words in the training data.While the difference between the F-scores of the maximum match-ing and the base algorithm is small for the PK corpus,the F-score difference for the AS corpus is much larger.Our base algorithm performed substantially better than the max-imum matching for the AS corpus.The performances of our base algorithm on the testing data using the words from the training data are presented in row1in table1for the corpus,and row1in table2for the corpus.5DiscussionsIn this section we will examine in some details the problem of segmentation inconsistencies within the training data,within the testing data,and between training data and test-ing data.Due to space limit,we will only report ourfind-ings in the PK corpus though the same kinds of inconsis-tencies also occur in the AS corpus.We understand that it is difficult,or even impossible,to completely eliminatesegmentation inconsistencies.However,perhaps we couldlearn more about the impact of segmentation inconsisten-cies on a system’s performance by taking a close look at theproblem.We wrote a program that takes as input a segmented cor-pus and prints out the shortest text fragments in the corpus that have two or more segmentations.For each text frag-ment,the program also prints out how the text fragment is segmented,and how many times it is segmented in a partic-ular way.While some of the text fragments,such asand truly have two different segmentations,depend-ing on the contexts in which they occur or the meaningsof the text fragments,others are segmented inconsistently. We ran this program on the PK testing data and found21unique shortest text fragments,which occur87times in to-tal,that have two different segmentations.Some of the text fragments,such as are inconsistently segmented. 
The fragment occurs twice in the testing data and is segmented into/in one case,but treated as one word in the other case.We found1,500unique shortest textfragments in the PK training data that have two or more seg-mentations,and97unique shortest text fragments that are segmented differently in the training data and in the test-ing data.For example,the text is treated as one word in the training data,but is segmented into// /in the testing data.We found11,136unique short-est text fragments that have two or more segmentations inthe AS training data,21unique shortest text fragments thathave two or more segmentations in the AS testing data,and 38unique shortest text fragments that have different seg-mentations in the AS training data and in the AS testing data.Segmentation inconsistencies not only exists withintraining and testing data,but also between training and test-ing data.For example,the text fragment occurs35 times in the PK training data and is consistently segmented into”/but the same text fragment,occurring twice in the testing data,is segmented into//in both cases.The text occurs67times in the training data and is treated as one word in all67cases,but the same text,occurring4times in the testing data,is seg-mented into/in all4cases.The text occurs 16times in the training data,and is treated as one word in all cases,but in the testing data,it is treated as one word in three cases and segmented into/in one case.The text is segmented into/in8cases,but treated as one word in one case in the training data.A couple of text fragments seem to be incorrectly segmented.The textin the testing data is segmented into /and the text segmented into/ Our segmented texts of the PK testing data differ fromthe reference segmented texts for580text fragments(427 unique).Out of these580text fragments,126text frag-ments are among the shortest text fragments that have onesegmentation in the training data,but another in the test-ing data.This implies that up to21.7%of the mistakes committed by our system may have been impacted by thesegmentation inconsistencies between the PK training data and the PK testing data.Since there are only38uniqueshortest text fragments found in the AS corpus that are seg-mented differently in the training data and the testing data, the inconsistency problem probably had less impact on ourAS results.Out of the same580text fragments,359text fragments(62%)are new words in the PK testing data.For example,the proper name which is a new word, is incorrectly segmented into/by our system.An-other example is the new word which is treated as one word in the testing data,but is segmented into/ //by our system.Some of the longer text frag-ments that are incorrectly segmented may also involve newwords,so at least62%,but under80%,of the incorrectly segmented text fragments are either new words or involve new words.6ConclusionWe have presented our word segmentation system and the results for the closed track using the corpus and the corpus.The new words recognition,combining single char-acters,and checking consistencies contributed the most to the increase in precision and recall over the performance of the base segmentation algorithm,which works better than maximum matching.For the closed track experiment using the corpus,we found that62%of the text fragments that are incorrectly segmented by our system are actually new words,which clearly shows that to further improve the performance of our system,a better new words recognition algorithm is necessary.Our failure analysis also indicates that 
up to 21.7% of the mistakes made by our system for the PK closed track may have been impacted by the segmentation inconsistencies between the training and testing data.

References

Richard Sproat and Tom Emerson. 2003. The First International Chinese Word Segmentation Bakeoff. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, July 11-12, 2003, Sapporo, Japan.
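The combining-single-characters step in Section 2.2 of the paper above merges consecutive single-character tokens whose in-word probability exceeds a threshold of 0.85. Below is a minimal Java sketch of that step; the data structures, example counts and example tokens are invented for illustration, as the paper does not publish code.

```java
import java.util.*;

// Sketch of combining consecutive single-character tokens whose in-word
// probability exceeds a threshold (0.85 in the paper).
public class SingleCharCombiner {
    // in-word probability = (times the character occurs inside a word of 2+ characters)
    //                       / (times the character occurs at all in the training data)
    static double inWordProb(char c, Map<Character, int[]> counts) {
        int[] n = counts.getOrDefault(c, new int[]{0, 0});
        return n[1] == 0 ? 0.0 : (double) n[0] / n[1];
    }

    static List<String> combine(List<String> segmented, Map<Character, int[]> counts, double threshold) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String tok : segmented) {
            boolean mergeable = tok.length() == 1 && inWordProb(tok.charAt(0), counts) > threshold;
            if (mergeable) {
                run.append(tok);                   // extend the current run of single characters
            } else {
                if (run.length() > 0) { out.add(run.toString()); run.setLength(0); }
                out.add(tok);
            }
        }
        if (run.length() > 0) out.add(run.toString());
        return out;
    }

    public static void main(String[] args) {
        Map<Character, int[]> counts = new HashMap<>();
        counts.put('利', new int[]{46, 50});       // invented counts: {in-word occurrences, total occurrences}
        counts.put('用', new int[]{49, 52});
        List<String> initial = Arrays.asList("充分", "利", "用", "资源");
        System.out.println(combine(initial, counts, 0.85));
        // [充分, 利用, 资源]
    }
}
```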

Chinese Word Segmentation


In Chinese, modifiers come before the words they modify.

他说的确实在理 ("What he says is indeed reasonable")
他/说/的确/实在/理  vs.  他/说/的/确实/在理

Bidirectional matching
Shortest-path algorithm

The minimum-word-count segmentation problem is equivalent to searching for a shortest path in a directed graph.
Lattice example for 发展中国家 ("developing countries"): numbered nodes sit between the characters 发 展 中 国 家, and each dictionary word spans an edge between two nodes.
Statistics-based shortest-path segmentation

In the basic shortest-path formulation, every edge has length 1.
When several shortest paths exist, usually only one result is kept.

南京市长江大桥 ("Nanjing Yangtze River Bridge")
南京市/长江大桥  vs.  南京/市长/江大桥 ("the mayor of Nanjing, Jiang Daqiao")
Ambiguity examples (continued)

当结合成分子时 ("when they combine into molecules")
当/结合/成分/子时  当/结合/成/分子/时  当/结/合成/分子/时  当/结/合成分/子时
Categories of Chinese segmentation ambiguity

Crossing (overlapping) ambiguity
If both AB and BC are dictionary words, then any string containing the substring "ABC" necessarily allows two segmentations: "AB/ C/" and "A/ BC/". For example, 网球场 ("tennis court") produces a crossing ambiguity: 网球/场/ vs. 网/球场/.
Path 1: 0-1-3-5
Path 2: 0-2-3-5
Which path should we take?
Maximum-probability segmentation

S: 有意见分歧 ("there are differences of opinion")
W1: 有/ 意见/ 分歧/   W2: 有意/ 见/ 分歧/
Which is larger: P(W1 | S) or P(W2 | S)?
By Bayes' rule, P(W | S) = P(S | W) · P(W) / P(S) ∝ P(W), since P(S | W) = 1 for any segmentation W of S and P(S) is the same for all candidates.
P(W) = P(w1, w2, ..., wn) ≈ P(w1) · P(w2) · ... · P(wn), assuming the words are independent of one another.
Keeping only one path is unfair to the other paths that also satisfy the requirement.

Here each word is given a weight, i.e. the edges no longer all have the same length.
The simplest weight is the word frequency (which must come from real, scientifically valid counts).
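A minimal Java sketch of maximum-probability segmentation viewed as a shortest-path search over the word lattice: each dictionary word is an edge weighted by -log P(word), so the minimum-cost path corresponds to the segmentation with the highest probability product. The word probabilities below are invented for illustration, not real corpus frequencies.

```java
import java.util.*;

public class MaxProbSegmenter {
    // Dynamic programming over the word lattice: best[i] = minimum total cost
    // (sum of -log P(word)) of segmenting the prefix of length i.
    public static List<String> segment(String s, Map<String, Double> prob) {
        int n = s.length();
        double[] best = new double[n + 1];
        int[] prev = new int[n + 1];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - 5); j < i; j++) {   // consider words up to 5 characters
                String w = s.substring(j, i);
                Double p = prob.get(w);
                if (p == null && i - j == 1) p = 1e-8;       // crude smoothing for unseen single characters
                if (p == null || best[j] == Double.POSITIVE_INFINITY) continue;
                double cost = best[j] - Math.log(p);
                if (cost < best[i]) { best[i] = cost; prev[i] = j; }
            }
        }
        LinkedList<String> words = new LinkedList<>();
        for (int i = n; i > 0; i = prev[i]) words.addFirst(s.substring(prev[i], i));
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> prob = new HashMap<>();
        prob.put("有", 0.018); prob.put("有意", 0.0005); prob.put("意见", 0.001);
        prob.put("见", 0.002); prob.put("分歧", 0.0008);
        System.out.println(segment("有意见分歧", prob));
        // [有, 意见, 分歧] -- the higher-probability reading W1, preferred over W2
    }
}
```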

Do We Need Chinese Word Segmentation for Statistical Machine Translation?Jia Xu and Richard Zens and Hermann NeyChair of Computer Science VIComputer Science DepartmentRWTH Aachen University,Germany{xujia,zens,ney}@cs.rwth-aachen.deAbstractIn Chinese texts,words are not separated by white spaces.This is problematic for many nat-ural language processing tasks.The standard approach is to segment the Chinese character sequence into words.Here,we investigate Chi-nese word segmentation for statistical machine translation.We pursue two goals:thefirst one is the maximization of thefinal translation qual-ity;the second is the minimization of the man-ual effort for building a translation system. The commonly used method for getting the word boundaries is based on a word segmenta-tion tool and a predefined monolingual dictio-nary.To avoid the dependence of the trans-lation system on an external dictionary,we have developed a system that learns a domain-specific dictionary from the parallel training corpus.This method produces results that are comparable with the predefined dictionary. Further more,our translation system is able to work without word segmentation with only a minor loss in translation quality.1IntroductionIn Chinese texts,words composed of single or multiple characters,are not separated by white spaces,which is different from most of the west-ern languages.This is problematic for many natural language processing tasks.Therefore, the usual method is to segment a Chinese char-acter sequence into Chinese“words”.Many investigations have been performed concerning Chinese word segmentation.For example,(Palmer,1997)developed a Chinese word segmenter using a manually segmented corpus.The segmentation rules were learned automatically from this corpus.(Sproat and Shih,1990)and(Sun et al.,1998)used a method that does not rely on a dictionary or a manually segmented corpus.The characters of the unsegmented Chinese text are grouped into pairs with the highest value of mutual informa-tion.This mutual information can be learned from an unsegmented Chinese corpus.We will present a new method for segment-ing the Chinese text without using a manually segmented corpus or a predefined dictionary.In statistical machine translation,we have a bilin-gual corpus available,which is used to obtain a segmentation of the Chinese text in the fol-lowing way.First,we train the statistical trans-lation models with the unsegmented bilingual corpus.As a result,we obtain a mapping of Chinese characters to the corresponding English words for each sentence pair.By using this map-ping,we can extract a dictionary automatically. With this self-learned dictionary,we use a seg-mentation tool to obtain a segmented Chinese text.Finally,we retrain our translation system with the segmented corpus.Additionally,we have performed experiments without explicit word segmentation.In this case,each Chinese character is interpreted as one“word”.Based on word groups,our ma-chine translation system is able to work without a word segmentation,while having only a minor translation quality relative loss of less than5%. 
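The introduction above (and Section 4.2 below) describes the self-learned dictionary: a contiguous sequence of Chinese characters is treated as one word if all of its characters are aligned to the same English word. Below is a minimal Java sketch of that extraction step; the example characters and alignment indices are invented for illustration, whereas in the paper the character-to-word alignment comes from GIZA++ training.

```java
import java.util.*;

// Sketch of dictionary extraction from a character-to-word alignment:
// consecutive Chinese characters aligned to the same English word are
// collected as one dictionary entry, together with a frequency count.
public class AlignDictExtractor {
    public static Map<String, Integer> extract(String[] chineseChars, int[] alignedTo) {
        Map<String, Integer> dict = new HashMap<>();
        int start = 0;
        for (int i = 1; i <= chineseChars.length; i++) {
            boolean boundary = i == chineseChars.length || alignedTo[i] != alignedTo[start];
            if (boundary) {
                String word = String.join("", Arrays.copyOfRange(chineseChars, start, i));
                dict.merge(word, 1, Integer::sum);   // keep frequency counts for the segmenter
                start = i;
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        // Assumed example: "产业 结构调整" aligned to English "industry restructuring" (indices 0 and 1).
        String[] chars = {"产", "业", "结", "构", "调", "整"};
        int[] alignedTo = {0, 0, 1, 1, 1, 1};
        System.out.println(extract(chars, alignedTo));
        // {产业=1, 结构调整=1}
    }
}
```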
2Review of the Baseline System for Statistical Machine Translation2.1PrincipleIn statistical machine translation,we are given a source language(‘French’)sentence f J1= f1...f j...f J,which is to be translated into a target language(‘English’)sentence e I1= e1...e i...e I.Among all possible target lan-guage sentences,we will choose the sentencewith the highest probability:ˆe I1=argmaxe I1P r(e I1|f J1)(1)=argmaxe I1P r(e I1)·P r(f J1|e I1)(2)The decomposition into two knowledge sources in Equation2is known as the source-channel approach to statistical machine translation (Brown et al.,1990).It allows an independent modeling of target language model P r(e I1)and translation model P r(f J1|e I1)1.The target lan-guage model describes the well-formedness of the target language sentence.The translation model links the source language sentence to the target language sentence.The argmax opera-tion denotes the search problem,i.e.the gener-ation of the output sentence in the target lan-guage.We have to maximize over all possible target language sentences.The resulting architecture for the statistical machine translation approach is shown in Fig-ure1with the translation model further decom-posed into lexicon and alignment model.Figure1:Architecture of the translation ap-proach based on Bayes decision rule.2.2Alignment ModelsThe alignment model P r(f J1,a J1|e I1)introduces a‘hidden’alignment a=a J1,which describes 1The notational convention will be as follows:we use the symbol P r(·)to denote general probability distri-butions with(nearly)no specific assumptions.In con-trast,for model-based probability distributions,we use the generic symbol p(·).a mapping from a source position j to a target position a j.The relationship between the trans-lation model and the alignment model is given by:P r(f J1|e I1)=a J1P r(f J1,a J1|e I1)(3)In this paper,we use the models IBM-1,IBM-4from(Brown et al.,1993)and the Hidden-Markov alignment model(HMM)from(Vogel et al.,1996).All these models provide different de-compositions of the probability P r(f J1,a J1|e I1).A detailed description of these models can be found in(Och and Ney,2003).A Viterbi alignmentˆa J1of a specific model is an alignment for which the following equation holds:ˆa J1=argmaxa J1P r(f J1,a J1|e I1).(4)The alignment models are trained on a bilin-gual corpus using GIZA++(Och et al.,1999; Och and Ney,2003).The training is done it-eratively in succession on the same data,where thefinal parameter estimates of a simpler model serve as starting point for a more complex model.The result of the training procedure is the Viterbi alignment of thefinal training iter-ation for the whole training corpus.2.3Alignment Template ApproachIn the translation approach from Section2.1, one disadvantage is that the contextual informa-tion is only taken into account by the language model.The single-word based lexicon model does not consider the surrounding words.One way to incorporate the context into the trans-lation model is to learn translations for whole word groups instead of single words.The key elements of this translation approach(Och et al.,1999)are the alignment templates.These are pairs of source and target language phrases with an alignment within the phrases.The alignment templates are extracted from the bilingual training corpus.The extraction al-gorithm(Och et al.,1999)uses the word align-ment information obtained from the models in Section2.2.Figure2shows an example of a word aligned sentence pair.The word align-ment is represented with the black boxes.The 
figure also includes some of the possible align-ment templates,represented as the larger,un-filled rectangles.Note that the extraction algo-rithm would extract many more alignment tem-plates from this sentence pair.In this example, the system input was the sequence of Chinese characters without any word segmentation.As can be seen,a translation approach that is based on phrases circumvents the problem of word seg-mentation to a certain degree.This method will be referred to as“translation with no segmen-tation”(see Section 5.2).theyFigure2:Example of a word aligned sentence pair and some possible alignment templates.In the Chinese–English DARPA TIDES eval-uations in June2002and May2003,carried out by NIST(NIST,2003),the alignment template approach performed very well and was ranked among the best translation systems.Further details on the alignment template ap-proach are described in(Och et al.,1999;Och and Ney,2002).3Task and Corpus StatisticsIn Section 5.3,we will present results for a Chinese–English translation task.The domain of this task is news articles.As bilingual train-ing data,we use a corpus composed of the En-glish translations of a Chinese Treebank.This corpus is provided by the Linguistic Data Con-sortium(LDC),catalog number LDC2002E17. In addition,we use a bilingual dictionary with 10K Chinese word entries provided by Stephan Vogel(LDC,2003b).Table1shows the corpus statistics of this task.We have calculated both the number of words and the number of characters in the cor-pus.In average,a Chinese word is composed of1.49characters.For each of the two lan-guages,there is a set of20special characters, such as digits,punctuation marks and symbols like“()%$...”The training corpus will be used to train a word alignment and then extract the alignment templates and the word-based lexicon.The re-sulting translation system will be evaluated on the test corpus.Table1:Statistics of training and test corpus. For each of the two languages,there is a set of20 special characters,such as digits,punctuation marks and symbols like“()%$...”Chinese English Train Sentences4172Characters172874832760Words116090145422Char.Vocab.3419+2026+20Word Vocab.93919505 Test Sentences993Characters42100167101Words28247262254Segmentation Methods4.1Conventional MethodThe commonly used segmentation method is based on a segmentation tool and a monolingual Chinese dictionary.Typically,this dictionary has been produced beforehand and is indepen-dent of the Chinese text to be segmented.The dictionary contains Chinese words and their fre-quencies.This information is used by the seg-mentation tool tofind the word boundaries.In the LDC method(see Section5.2)we have used the dictionary and segmenter provided by the LDC.More details can be found on the LDC web pages(LDC,2003a).This segmenter is based on two ideas:it prefers long words over short words and it prefers high frequency words over low frequency words.4.2Dictionary Learning from Alignments In this section,we will describe our method of learning a dictionary from a bilingual corpus. 
As mentioned before,the bilingual training corpus listed in Section3is the only input to the system.Wefirstly divide every Chinese charac-ters in the corpus by white spaces,then train the statistical translation models with this un-segmented Chinese text and its English trans-lation,details of the training method are de-scribed in Section2.2.To extract Chinese words instead of phrases as in Figure2,we configure the training pa-rameters in GIZA ++,the alignment is then re-stricted to a multi-source-single-target relation-ship,i.e.one or more Chinese characters are translated to one English word.The result of this training procedure is an alignment for each sentence pair.Such an align-ment is represented as a binary matrix with J ·I elements.An example is shown in Figure 3.The un-segmented Chinese training sentence is plotted along the horizontal axes and the corresponding English sentence along the vertical axes.The black boxes show the Viterbi alignment for this sentence pair.Here,for example the first two Chinese characters are aligned to “industry”,the next four characters are aligned to “restruc-turing”.industryrestructuringmadevigorousprogress Figure 3:Example of an alignment without word segmentation.The central idea of our dictionary learning method is:a contiguous sequence of Chinese characters constitute a Chinese word,if they are aligned to the same English word .Using this idea and the bilingual corpus,we can au-tomatically generate a Chinese word dictionary.Table 2shows the Chinese words that are ex-tracted from the alignment in Figure 3.Table 2:Word entries in Chinese dictionary learned from the alignment in Figure3.We extract Chinese words from all sentence pairs in the training corpus.Therefore,it is straightforward to collect word frequency statis-tics that are needed for the segmentation tool.Once,we have generated the dictionary,we can produce a segmented Chinese corpus using the method described in Section 4.1.Then,we retrain the translation system using the seg-mented Chinese text.4.3Word Length StatisticsIn this section,we present statistics of the word lengths in the LDC dictionary as well as in the self-learned dictionary extracted from the align-ment.Table 3shows the statistics of the word lengths in the LDC dictionary as well as in the learned dictionary.For example,there are 2368words consisting of a single character in learned dictionary and 2511words in the LDC dictionary.These single character words rep-resent 16.9%of the total number of entries in the learned dictionary and 18.6%in the LDC dictionary.We see that in the LDC dictionary more than 65%of the words consist of two characters and about 30%of the words consist of a single char-acter or three or four characters.Longer words with more than four characters constitute less than 1%of the dictionary.In the learned dic-tionary,there are many more long words,about 15%.A subjective analysis showed that many of these entries are either named entities or idiomatic expressions.Often,these idiomatic expressions should be segmented into shorter words.Therefore,we will investigate methods to overcome this problem in the future.Some suggestions will be discussed in Section 6.Table 3:Statistics of word lengths in the LDC dictionary and in the learned dictionary.word LDC dictionary learned dictionary length frequency [%]frequency [%]1233418.6236816.92814965.1548639.2311889.5189913.64759 6.1208414.95700.6791 5.76200.2617 4.4760.0327 2.3≥8110.0424 3.0total 12527100139961005Translation Experiments5.1Evaluation CriteriaSo 
5 Translation Experiments

5.1 Evaluation Criteria

So far, in machine translation research, a single generally accepted criterion for the evaluation of experimental results does not exist. We have therefore used three automatic criteria. For the test corpus, we have four references available. Hence, we compute all the following criteria with respect to multiple references.

• WER (word error rate): The WER is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the reference sentence.

• PER (position-independent word error rate): A shortcoming of the WER is that it requires a perfect word order. The word order of an acceptable sentence can be different from that of the target sentence, so that the WER measure alone could be misleading. The PER compares the words in the two sentences while ignoring the word order.

• BLEU score: This score measures the precision of unigrams, bigrams, trigrams and fourgrams with respect to a reference translation, with a penalty for too short sentences (Papineni et al., 2001). The BLEU score measures accuracy, i.e. large BLEU scores are better.

5.2 Summary: Three Translation Methods

In the experiments, we compare the following three translation methods:

• Translation with no segmentation: Each Chinese character is interpreted as a single word.

• Translation with learned segmentation: It uses the self-learned dictionary.

• Translation with LDC segmentation: The predefined LDC dictionary is used.

The core contribution of this paper is the method we call "translation with learned segmentation", which consists of three steps:

• The input is a sequence of Chinese characters without segmentation. After the training using GIZA++, we extract a monolingual Chinese dictionary from the alignment. This is discussed in Section 4.2; an example is given in Figure 3 and Table 2.

• Using this learned dictionary, we segment the sequence of Chinese characters into words. In other words, the LDC method is used, but the LDC dictionary is replaced by the learned dictionary (see Section 4.1).

• Based on this word segmentation, we perform another training using GIZA++. Then, after training the models IBM1, HMM and IBM4, we extract bilingual word groups, which are referred to as alignment templates.

5.3 Evaluation Results

The evaluation is performed on the LDC corpus described in Section 3. The translation performance of the three systems is summarized in Table 4 for the three evaluation criteria WER, PER and BLEU. We observe that the translation quality with the learned segmentation is similar to that with the LDC segmentation. The WER of the system with the learned segmentation is somewhat better, but PER and BLEU are slightly worse. We conclude that it is possible to learn a domain-specific dictionary for Chinese word segmentation from a bilingual corpus. The translation system is therefore independent of a predefined dictionary, which may be unsuitable for a certain task.

The translation system using no segmentation performs slightly worse. For example, for the WER there is a loss of about 2% relative compared to the system with the LDC segmentation.
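The error-rate criteria of Section 5.1 can be made concrete with a short sketch. The results in Table 4 below were not produced with this code; the function names, the PER formulation and the multi-reference handling are assumptions chosen to illustrate the definitions.

```python
# Illustrative implementations of WER and PER; both are normalized by the
# reference length, and the best (lowest) score over multiple references
# is reported, matching the multiple-reference setup described above.

from collections import Counter

def word_error_rate(hyp, ref):
    """Levenshtein distance between word lists, divided by the reference length."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (h != r))   # substitution (free if the words match)
            prev, d[j] = d[j], cur
    return d[len(ref)] / len(ref)

def position_independent_error_rate(hyp, ref):
    """Like WER but ignoring word order: compare the two bags of words."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    errors = max(len(hyp), len(ref)) - matches
    return errors / len(ref)

def best_over_references(score, hyp, refs):
    """Report the most favorable score over all available references."""
    return min(score(hyp, ref) for ref in refs)
```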
Table 4: Translation performance of the different segmentation methods (all numbers in percent).

                    error rates        accuracy
method              WER      PER       BLEU
no segment.         73.3     56.5      27.6
learned segment.    70.4     54.6      29.1
LDC segment.        71.9     54.4      29.2

5.4 Effect of Segmentation on Translation Results

In this section, we present three examples of the effect that segmentation may have on translation quality. For each of the three examples in Figure 4, we show the segmented Chinese source sentence using either the LDC dictionary or the self-learned dictionary, the corresponding translation and the human reference translation.

In the first example, the LDC dictionary leads to a correct segmentation, whereas with the learned dictionary the segmentation is erroneous. The second and third token should be combined ("Hong Kong"), whereas the fifth token should be separated ("stabilize in the long term"). In this case, the wrong segmentation of the Chinese source sentence does not result in a wrong translation. A possible reason is that the translation system is based on word groups and can recover from these segmentation errors.

In the second example, the segmentation with the LDC dictionary produces at least one error. The second and third token should be combined ("this"). It is possible to combine the seventh and eighth token into a single word, because the eighth token only marks the tense. The segmentation with the learned dictionary is correct. Here, the two segmentations result in different translations.

In the third example, both segmentations are incorrect, and these segmentation errors affect the translation results. In the segmentation with the LDC dictionary, the first Chinese character should be segmented as a separate word. The second and third character, and maybe even the fourth character, should be combined into one word (this is an example of an ambiguous segmentation). The fifth and sixth character should be combined into a single word. In the segmentation with the learned dictionary, the fifth and sixth token (the seventh and eighth character) should be combined ("isolated"). We see that this term is missing in the translation. Here, the segmentation errors result in translation errors.

Figure 4: Translation examples using the learned dictionary and the LDC dictionary.

6 Discussion and Future Work

We have presented a new method for Chinese word segmentation. It avoids the use of a predefined dictionary and instead learns a corpus-specific dictionary from the bilingual training corpus.

The idea is to extract a self-learned dictionary from the trained alignment models. This method has the advantage that all word entries in the dictionary occur in the training data, so its content is much closer to the training text than that of a predefined dictionary, which can never cover all possible word occurrences. The closer the content of the test corpus is to that of the training corpus, the higher the quality of the dictionary and the better the expected translation performance.

The experiments showed that the translation quality with the learned segmentation is competitive with the LDC segmentation. Additionally, we have shown the feasibility of a Chinese–English statistical machine translation system that works without any word segmentation. There is only a minor loss in translation performance. Further improvements could be possible by tuning the system toward this specific task.

We expect that our method could be improved by considering the word length as discussed in Section 4.3. As shown in the word length statistics, long words with more than four characters occur only occasionally. Most of them are named entity words, which are written in English in upper case. Therefore, we can apply a simple rule: we accept a long Chinese word only if the corresponding English word is in upper case. This should result in an improved dictionary. An alternative way is to use the word length statistics in Table 3 as a prior distribution. In this case, long words would get a penalty, because their prior probability is low.
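One possible realization of these two ideas, sketched here purely as an assumption of how such a post-processing step might look (the rules themselves are future work and not part of the evaluated system), is a filter over the learned dictionary combined with a word-length prior estimated from counts such as those in Table 3.

```python
# Sketch of the two proposed refinements: (1) keep a long Chinese entry only
# if the English word it was extracted from is capitalized, which is typical
# for named entities, and (2) turn word-length counts into a prior that
# penalizes unlikely word lengths. All names and thresholds are assumptions.

import math

def filter_long_words(learned_dict, aligned_english, max_len=4):
    """learned_dict: {chinese_word: frequency};
    aligned_english: {chinese_word: English word it was aligned to}."""
    kept = {}
    for word, freq in learned_dict.items():
        english = aligned_english.get(word, "")
        if len(word) <= max_len or (english and english[0].isupper()):
            kept[word] = freq
    return kept

def length_prior(length_counts):
    """Relative frequency of each word length, e.g. taken from Table 3."""
    total = float(sum(length_counts.values()))
    return {length: count / total for length, count in length_counts.items()}

def penalized_score(word, freq, prior, floor=1e-4):
    """Log score of a dictionary entry including a word-length penalty."""
    p_len = prior.get(len(word), floor)
    return math.log(freq) + math.log(p_len)
```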
Because the extraction of our dictionary is based on bilingual information, it might be interesting to combine it with methods that use monolingual information only.

For Chinese–English, a large number of bilingual corpora is available at the LDC. Using additional corpora, we can therefore expect to obtain an improved dictionary.

References

P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June.

LDC. 2003a. LDC Chinese resources home page. /Projects/Chinese/LDC ch.htm.

LDC. 2003b. LDC resources home page. /Projects/TIDES/mt2004cn.htm.

NIST. 2003. Machine translation home page. /speech/tests/mt/index.htm.

F. J. Och and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 295–302, Philadelphia, PA, July.

F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, March.

F. J. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, University of Maryland, College Park, MD, June.

D. D. Palmer. 1997. A trainable rule-based algorithm for word segmentation. In Proc. of the 35th Annual Meeting of ACL and 8th Conference of the European Chapter of ACL, pages 321–328, Madrid, Spain, August.

K. A. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, September.

R. W. Sproat and C. Shih. 1990. A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4:336–351.

M. Sun, D. Shen, and B. K. Tsou. 1998. Chinese word segmentation without using lexicon and hand-crafted training data. In Proc. of the 36th Annual Meeting of ACL and 17th Int. Conf. on Computational Linguistics (COLING-ACL 98), pages 1265–1271, Montreal, Quebec, Canada, August.

S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING '96: The 16th Int. Conf. on Computational Linguistics, pages 836–841, Copenhagen, Denmark, August.
