Automatic extraction of subcategorization from corpora



Part-of-Speech Tagging of Transcribed Speech

Margot Mieskes, Michael Strube
EML Research gGmbH, Heidelberg, Germany
http://www.eml-research.de/nlp

Abstract

We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text, to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The corrections were first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators we reached results comparable to those reported for the taggers on text. Based on our experience we present guidelines for producing reliably POS-tagged corpora in new domains.

1. Introduction

Part-of-Speech (POS) tagging is a prerequisite for many high-level Natural Language Processing (NLP) tasks. A number of POS taggers have been developed and made available to the research community. The majority of them have been trained on written texts, mostly newspaper texts. Only in a few instances has POS tagging been applied to transcribed speech. Examples can be found in Godfrey et al. (1992), Heeman & Allen (1999) and Zechner (2001). All three mainly deal with dialogues. Only Zechner (2001) reports results for multiparty dialogues as well.

Some work has been done on applying POS taggers to new domains with little or no manual annotation. A small amount of manually annotated training data was used by Clark et al. (2003) and by Collins (2002), who used small amounts of data to do co-training and to explore the performance of a Hidden Markov Model based perceptron, respectively. Nakagawa et al. (2002) and van Halteren et al. (1998) used no manually annotated data for retraining but applied multiple POS taggers directly and used voting techniques on the results. A third method is applied by Zavrel & Daelemans (2000), who use the results from several taggers trained on a small amount of training data as input for a learner.

There are problems in applying the approaches described above to our task, the application of POS taggers to transcribed multiparty dialogues. The researchers who tagged transcribed speech did not evaluate the taggers they used before retraining was done. The research exploring ways to retrain taggers with little or no data was performed on written text. But Wermter & Hahn (2004) showed that texts even from different domains can be very similar. They used two POS taggers, which were trained on newspaper texts, and applied them to medical texts. The evaluation on manually annotated medical texts gave good results. The authors explain this by a similarity in the uni-, bi- and trigram POS distributions of newspaper and medical texts.

In our work we make use of four different POS taggers (see Section 2). We apply them to transcribed multiparty spoken dialogues. We report results before (see Section 3) and after the taggers were retrained on manually annotated data, and specifically we evaluated their behaviour on different (increasing) amounts of data (see Section 4). This led us to compile guidelines on how to efficiently create data for retraining taggers to be used on a new domain.

2. Taggers

Four taggers were considered. The TnT tagger [1] (Brants, 2000) uses a Hidden Markov Model (HMM) based on n-grams and lexical information. Two taggers come from the Stanford Java tagging library [2] (left3words (Toutanova & Manning, 2000) and bidirectional (Toutanova et al., 2003)).
The first uses a maximum entropy model and the context of three words to the left. The second considers the context to the left and to the right by applying contextual HMMs (CHMM). Finally, the Brill (TBL) tagger [3] (Brill, 1994) uses transformation-based learning.

[1] http://www.coli.uni-saarland.de/˜thorsten/tnt/
[2] /software/tagger.shtml
[3] /˜brill/

Each of these taggers was originally trained and tested on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), which consists only of written text. The results reported for the taggers on this corpus vary between 96.5% and 97.24%. For TnT only the performance on known and unknown words is given separately, with 97.7% and 89.0%. Therefore, using these taggers on the transcribed speech in the ICSI Meeting Recorder Corpus (ICSI Corpus) (Janin et al., 2003) will inevitably result in considerably lower accuracy rates. Some of the reasons that account for this: First, the vocabulary of financial news is different from that of dialogues which mostly deal with speech and language technology. Second, the style differs between newspaper text and colloquial speech; examples of these differences include disfluencies and explicit paragraph separation, but also sentence length and sentence complexity. Finally, non-native speakers take part in the meetings. It nevertheless seemed reasonable to do the POS annotation semi-automatically, using these taggers as the basis for manual correction. In order to get one tag for each token, a majority decision over the four automatic taggers (Maj4 in the following) was computed. This was used as input for the human annotators, who corrected the data.

3. Manual Annotation

The whole ICSI Corpus contains 75 meetings. 12 were randomly chosen to be annotated by three human annotators. In addition to the 12 meetings we used one meeting to train the human annotators, which was not considered in the evaluation. We used the MMAX2 annotation tool [4], which allows easy access to and manipulation of the tags. The tagset was based on the Penn Treebank tagset (Santorini, 1990), which was also used for the Switchboard POS annotation (Godfrey et al., 1992). We added some tags to deal with phenomena that are of specific interest to the project in which this evaluation was carried out, specifically RELP, which is used to distinguish relative pronouns from wh-determiners (WDT). Additionally, we introduced one tag, INP, to cover all punctuation signs.

[4] http://mmax.eml-research.de

The 12 meetings were used to check inter-rater agreement for the three annotators. From the remaining 62 meetings, 25 were annotated individually by one of the three annotators. Based on the manual annotation a gold standard was created by assigning a majority-decision tag to each token in the meetings. This majority decision was manually corrected by a senior annotator. Inter-rater reliability was very high (κ = .96), showing that the automatically assigned tags can be manually corrected highly reliably. Therefore, we assume that the individually annotated meetings are of equally high quality as the gold standard.
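Both the Maj4 input given to the annotators (Section 2) and the gold standard built from the three human annotations rest on a simple token-level majority vote. A minimal sketch in Python follows; the tie-breaking rule (fall back to a fixed tagger priority) is an assumption for illustration, since the paper does not specify how ties were resolved:

```python
from collections import Counter

def majority_tag(tags, priority):
    """Pick the most frequent tag for one token; break ties by tagger priority."""
    counts = Counter(tags)
    best = max(counts.values())
    tied = {t for t, c in counts.items() if c == best}
    # fall back to the highest-priority tagger whose tag is among the tied ones
    for i in priority:
        if tags[i] in tied:
            return tags[i]

def combine(tagger_outputs, priority=(0, 1, 2, 3)):
    """tagger_outputs: one tag sequence per tagger, all over the same tokens."""
    return [majority_tag(col, priority) for col in zip(*tagger_outputs)]

# e.g. combine([tnt_tags, tbl_tags, left3_tags, bidirect_tags])
print(combine([["DT", "NN"], ["DT", "NN"], ["DT", "VB"], ["IN", "NN"]]))
# ['DT', 'NN']
```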
Gathering the data took about two months, with the three annotators working for about 240 hours in total. The costs were reasonable, at about EUR 5000 in total.

Nine meetings were used for evaluating the automatic annotation. Three meetings were taken from the gold standard data (Test1 in the following) and six from the individually annotated data (Test2).

Table 1: Error rates (%) for automatically annotated data

Data    TBL    TnT    Left3   Bidirect   Maj4
Test1   11.3   11.2   11.1    11.1       10.5
Test2   13.8   14.3   13.6    13.2       13.4

Table 1 shows the results for the automatically tagged data. The results for Test1, which is part of the gold standard data but was not used for retraining, are better than those for Test2, which was also not used in retraining but was not part of the gold standard either. In this evaluation we did not consider tags that were unknown to the taggers, because they were introduced by our annotation scheme.

4. Retraining

Nine meetings from the gold standard were used for training. Since they are spread across the whole corpus they contain a large variety of words and language use. Additionally, we had 15 meetings which were manually annotated. Retraining was done based on 6 different setups containing increasing amounts of data. Setup 1 contained the data from the gold standard and consisted of 124,158 tokens. In every further setup 3 meetings, each annotated individually by one of the three annotators, were added. The final setup contained 24 meetings and 282,686 tokens in total.

Our aim was to find a good trade-off between good annotation results and reasonable effort for the manual annotation. It is also important to note that for most taggers the amount of time needed for retraining increased with increasing data set size. Additionally, the relationship between the amount of training data and the results is unknown; it is assumed that the more data is available, the better the results will get. The fastest to train was the TnT tagger, which took only a few minutes. The TBL and Stanford taggers took hours. Despite this difference, TnT's results were comparable to those of the other three taggers.

The training parameters for each of the taggers remained unchanged throughout the different training setups. Especially the TBL tagger would allow several parameters to be set according to data set size and desired results; we left these parameters as suggested by the author. Test1 contains about 40K tokens and Test2 about 77K tokens.

Table 2: Average error rates (%) for all taggers in each of the setups

                  Set1   Set2   Set3   Set4   Set5   Set6
Tokens (K)        124    162    197    221    253    283
TnT      Test1    3.4    3.4    3.3    3.3    3.4    3.4
         Test2    5.4    5.1    4.9    4.6    4.5    4.5
TBL      Test1    3.9    3.5    3.5    3.5    3.6    3.5
         Test2    8.4    5.5    5.0    4.7    4.4    4.4
Left3    Test1    3.2    3.0    3.0    3.2    3.2    3.2
         Test2    5.2    4.7    4.5    4.3    4.2    4.1
Bidirect Test1    3.2    3.0    3.0    3.2    3.2    3.2
         Test2    5.2    4.7    4.5    4.3    4.2    4.1

Table 2 shows the results for all taggers after they have been trained on the manual data in the different setups. The first part shows the results for TnT after being trained on each of the setups. The results are very good and improve by about 1% in total. The gain in each step is rather small, the biggest being about 0.3%. In the last three steps the gain is very small and finally non-existent. The last noticeable step is from Set3 to Set4. This indicates that a training set size of between 197K and 221K tokens gives good tagging results with reasonable effort for manual data.

The second part shows the results for TBL. The results are similar to TnT but the gains across the setups are higher.
Especially from the first to the second setup the Test2 error rate decreases, by 2.9%. The later steps are smaller and level out towards the end. Here the last noticeable step is from Set4 to Set5, which indicates that TBL needs more training data than TnT.

The third and fourth parts should be considered together, because the results are very similar if not identical. Again the first few steps are bigger than the last few. After Set4 the error rate does not decrease much. This suggests that the break-even point between decreasing error rate and increasing training-data size lies somewhere between Set3 and Set4.

Compared to the original tagging (Table 1), the improvement for each of the taggers presented here is in the range of 8-10%.

In addition, we were interested in whether the improvement can also be demonstrated in the majority decision, or whether the majority decision even requires less training data while keeping the error rate low. Since the two Stanford taggers performed similarly, a majority over all four taggers would show results very similar to these taggers. Therefore, we considered only three taggers in the final evaluation. We removed the Bidirectional tagger from further consideration, motivated by the fact that it takes longer to train and to tag while its results are very similar to those of Left3.

Table 3: Average error rates (%) for Maj3 on each of the setups

        Set1   Set2   Set3   Set4   Set5   Set6
Test1   3.0    2.9    2.9    3.0    3.0    2.9
Test2   5.1    4.7    4.4    4.1    4.0    3.9

Table 3 shows the results for the majority decision over the three remaining taggers (Maj3 in the following). As can be seen, at Set4 the results are already as good as those of any single tagger in any setup; this is also the last noticeable step. Although Set5 and Set6 show the best overall results, the gains there are not as remarkable as in the steps before. With this amount of training data Maj3 outperforms all single taggers in all setups.

In general, one can observe that the gain throughout the setups is about 1%. TBL improves by about 4%, but this is mainly due to the difference between the first and second setup (2.9%); between the second and the last setup the difference is only 1.1%, the same as for the other taggers. Maj3 gained slightly more (1.2%) and also outperforms the single taggers by 0.2%. Furthermore, the individual results are very close to those reported for the taggers on text (see Section 2): Maj3 achieves 97.1% accuracy on the best setup, close to the best tagger on text (97.2%).

5. Discussion

The work presented here was done on transcribed speech from multiparty meetings, which so far has not been explored in detail. We used four common POS taggers to automatically annotate the transcripts. These results were manually corrected by human annotators, which is considerably faster than assigning POS tags from scratch. The results of the manual annotation were then used to retrain the POS taggers.

It turned out that about 221K tokens are sufficient to get results comparable to those reported for the POS taggers applied to text. Redoing the majority decision improved the results on this amount of data by about 0.2%.
Using the full amount of training data improved the best results by a further 0.3%, also compared to the best results of the single taggers, which were achieved on texts.

It has been argued in the past that in some cases retraining is not necessary for POS tagging (Wermter & Hahn, 2004). The authors report an analysis of uni-, bi- and trigrams of the data on which the taggers were trained (news texts) and of the data on which the taggers were tested (medical texts). They found that the n-grams were very similar. Following this analysis we compared the WSJ corpus with the ICSI corpus.

Table 4: Differences between WSJ and ICSI uni-, bi- and trigram distributions

x-gram  Rank  WSJ         %      ICSI          %
uni     1     NN          14.01  INP           19.00
        2     INP         11.24  PRP            9.20
        3     IN          10.51  DT             7.66
        4     NNP          9.83  UH             7.54
        5     DT           8.67  NN             7.18
        6     ...                IN             7.14
        12    PRP          2.69  ...
        19    ...                NNP            1.09
        33    UH           0.01  ...
bi      1     DT+NN        4.13  UH+INP         4.95
        2     NNP+NNP      3.68  INP+UH         4.93
        3     NN+IN        3.50  PRP+VBP        3.29
        4     IN+DT        3.46  INP+PRP        3.29
        5     JJ+NN        2.91  DT+NN          2.74
tri     2     $+CD+CD      0.81  INP+PRP+VBP    1.63
        3     DT+NN+NN     0.78  UH+INP+UH      1.46
        4     .+DT+NN      0.66  CD+CD+CD       1.39
        5     DT+NN+,      0.65  INP+RB+INP     0.01

Table 4 shows the most common uni-, bi- and trigrams for the Wall Street Journal (WSJ) and the Meeting Recorder data (ICSI). The five most common unigrams for WSJ are NN, INP, IN, NNP and DT, whereas for ICSI they are INP, PRP, DT, UH and NN. UH is very rare in the WSJ corpus. These differences also appear in the analysis of bi- and trigrams. For WSJ the dominant tags are DT and NN in various combinations and variations, whereas for ICSI the dominant tags are UH, PRP and INP, also in various combinations. For the unigrams we also show at which position in the frequency table the tags most frequent in WSJ occur in ICSI, and vice versa. The bi- and trigrams underline the difference between these two corpora: among the five most frequent unigrams three tags are shared, in the bigrams only one combination is left, and for trigrams none. It has to be noted that the combination CD+CD+CD is an artefact of the ICSI data, as most meetings start or finish with all speakers recording a sequence of numbers. The same holds for the combination $+CD+CD in WSJ.

Table 5 shows those categories which benefit most from retraining. Most other categories improve as well, but on a smaller scale. Only a few categories do not improve at all or even achieve worse results after training. The categories in Table 5 can be characterized as either occurring rarely in the original training data (WSJ), e.g. UH, FW and PDT, or forming a very large group, e.g. NN. Some categories achieved good results (≥95% correct) with the original POS taggers, among them CC, CD, MD, NNS, PRP, PRP$, TO, VBP, VBZ and WRB. These categories either belong to a fixed group of words or are governed by certain (fixed) rules (e.g. VBZ).

Among the categories that were tagged unreliably are particles (RP), which achieve a precision of about 80% and a recall of about 72%; singular proper nouns (NNP), with a precision of about 86% and a recall of about 90%; but also wh-determiners (WDT), which only achieve a precision of about 66% and a recall of about 52%. A more detailed discussion will be provided after further analysis of the data.

Table 5: Categories which improve most through retraining (accuracy in %)

Tag   before  after
FW    52.1    85.2
JJ    78.6    89.2
NN    83.9    94.4
NNP   46.0    88.9
PDT   33.6    90.5
POS   33.9    84.5
RP    77.5    80.1
UH    56.6    99.0
VB    89.4    94.5
WDT   14.3    79.0
WP    86.2    94.2

6. Conclusions

In Section 1 we presented some approaches to performing POS tagging with as little manual work as possible, as well as approaches to improving the results of POS tagging in general (Clark et al., 2003; Collins, 2002; Nakagawa et al., 2002).
Several remarks should be made here: First, these works were based on text, like the Wall Street Journal portion of the Penn Treebank. Second, all of them are computationally very demanding. Finally, only the results presented by Clark et al. (2003) have been better than the results we report here, but on a considerably larger amount of data.

The results presented in our work have several methodological implications for further approaches to POS tagging of new domains. Human annotation is very expensive, in time and money. It is therefore desirable to explore how much manually annotated data is necessary to get good, comparable results. Furthermore, the computational effort needed to achieve these results should be reasonable, too. Two main results were found in our work: first, a good trade-off point between the effort put into manually annotating data and the results of retraining POS taggers on this data; second, a computationally cheap method for getting good results automatically.

In future work, more sophisticated methods of merging the results of different taggers could be applied in order to improve the results, e.g. those mentioned by van Halteren et al. (1998), who used a pairwise voting system, or Zavrel & Daelemans (2000), who used a learning system based on the results from the POS taggers.

Data availability. The manually annotated data as well as the automatically tagged data can be downloaded from the project's homepage: http://www.eml-research.de/nlp/diana-summ.php.

Acknowledgments. This work has been supported by the DFG under grant STR 545/2-1 within the DIANA-Summ project and by the Klaus Tschira Foundation.

References

Brants, T. (2000). TnT: A statistical part-of-speech tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, Wash., 29 April - 4 May 2000, pp. 224-231.

Brill, E. (1994). Some advances in transformation based part-of-speech tagging. In Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, Wash., 1-4 August 1994, pp. 722-727.

Clark, S., J. R. Curran & M. Osborne (2003). Bootstrapping POS taggers using unlabelled data. In Proceedings of the Seventh CoNLL Conference held at HLT-NAACL 2003, Edmonton, Alberta, Canada, 27 May - 1 June 2003, pp. 49-55.

Collins, M. (2002). Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, Pa., 6-7 July 2002.

Godfrey, J. J., E. Holliman & J. McDaniel (1992). Switchboard: Telephone speech corpus for research and development. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Francisco, Cal., USA, pp. 517-520.

Heeman, P. A. & J. F. Allen (1999). Speech repairs, intonational phrases, and discourse markers: Modeling speakers' utterances in spoken dialog. Computational Linguistics, 25(4):527-571.

Janin, A., D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke & C. Wooters (2003). The ICSI meeting corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hong Kong, China, 6-10 April 2003, pp. 364-367.

Marcus, M. P., B. Santorini & M. A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

Nakagawa, T., T. Kudo & Y. Matsumoto (2002). Revision learning and its application to part-of-speech tagging. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Penn., 7-12 July 2002, pp. 497-504.
Santorini, B. (1990). Part-of-Speech Tagging Guidelines for the Penn Treebank Project. /tree-bank/home.html.

Toutanova, K., D. Klein, C. D. Manning & Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, 27 May - 1 June 2003, pp. 252-259.

Toutanova, K. & C. D. Manning (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Hong Kong, pp. 63-70.

van Halteren, H., J. Zavrel & W. Daelemans (1998). Improving data driven wordclass tagging by system combination. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montréal, Québec, Canada, 10-14 August 1998, pp. 491-497.

Wermter, J. & U. Hahn (2004). Really, is medical sublanguage that different? Experimental counter-evidence from tagging medical and newspaper corpora. In Medinfo '04: 11th World Congress on Medical Informatics, pp. 560-564.

Zavrel, J. & W. Daelemans (2000). Bootstrapping a tagged corpus through combination of existing heterogeneous taggers. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece, May 2000, pp. 1-4.

Zechner, K. (2001). Automatic Summarization of Spoken Dialogues in Unrestricted Domains. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA.

Corpora

3 Corpus Design
A corpus has three aspects. A. The language data itself:

Attribute         Values
Scale             million words | tens of millions of words | hundreds of millions of words | ...
Domain            politics | economics | sports | psychology | ...
Genre             literature | practical writing | news | ...
Era               synchronic | diachronic
Register          written | spoken
Language(s)       monolingual | bilingual | multilingual (bilingual parallel corpus | bilingual comparable corpus)
Linguistic level  phonetic (syllables, prosody) | grammatical (words, sentences, ...)
Second-Generation Corpora: the COBUILD corpus and the Longman corpus. Both are at the scale of tens of millions of words and were built for lexicography, i.e. application-driven.

COBUILD corpus: built in the 1980s by the University of Birmingham in collaboration with Collins Publishers, reaching 20 million running words. The Collins COBUILD dictionary (1987), produced from this corpus, was widely acclaimed.

Longman corpus: built in the 1980s, comprising three corpora: the LLELC (Longman/Lancaster English Language Corpus), the LSC (Longman Spoken Corpus) and the LCLE (Longman Corpus of Learners' English). The goal was to compile English learners' dictionaries serving foreign learners of English, at a scale of 50 million running words.
Selected markup symbols from the London-Lund Corpus of Spoken English:

Marker   Meaning
#        end of tone group
^        onset
/        rising nuclear tone
\        falling nuclear tone
^        rise-fall nuclear tone
Retrieval tools | user interface | data interfaces | ...
Selecting Corpus Texts

Principles for selecting texts (the random-selection and statistical-sample principles are illustrated in the sketch below):
- quality ("fine works") principle
- influence principle
- random-selection principle
- high-circulation principle
- typicality principle
- ease-of-acquisition principle
- statistical-sample validity principle
- conformity-to-language-norms principle
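The random-selection and statistical-sample principles are commonly operationalized as stratified random sampling over text categories. A minimal Python sketch under that assumption; the category quotas and the inventory format are invented for illustration:

```python
import random

def stratified_sample(inventory, quotas, seed=42):
    """inventory: maps a category (genre/domain) to its candidate texts;
    quotas: maps the category to how many texts the corpus design allots."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    return {cat: rng.sample(inventory[cat], n) for cat, n in quotas.items()}

inventory = {"news": ["n1", "n2", "n3", "n4"], "fiction": ["f1", "f2", "f3"]}
print(stratified_sample(inventory, {"news": 2, "fiction": 1}))
```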

TCN-Transformer-CTC for End-to-End Speech Recognition


收稿日期:20210814;修回日期:20211008 基金项目:国家自然科学基金面上项目(61672263)作者简介:谢旭康(1998),男,湖南邵阳人,硕士研究生,主要研究方向为语音识别、机器学习等;陈戈(1996),女,河南信阳人,硕士研究生,主要研究方向为语音识别、语音增强等;孙俊(1971),男(通信作者),江苏无锡人,教授,博导,博士,主要研究方向为人工智能、计算智能、机器学习、大数据分析、生物信息学等(junsun@jiangnan.edu.cn);陈祺东(1992),男,浙江湖州人,博士,主要研究方向为演化计算、机器学习等.

TCNTransformerCTC的端到端语音识别谢旭康,陈 戈,孙 俊,陈祺东(江南大学人工智能与计算机学院,江苏无锡214122)

Abstract: Transformer-based end-to-end speech recognition systems have become widespread, but the multi-head self-attention mechanism in the Transformer is insensitive to the positional information of the input sequence, and its flexible alignment generalizes poorly on noisy speech. To address these problems, this paper first proposes using a temporal convolutional network (TCN) to strengthen the model's capture of positional information, and on that basis fuses in connectionist temporal classification (CTC), yielding the TCN-Transformer-CTC model. Without using any language model, experiments on the open-source Mandarin speech corpus AISHELL-1 show that TCN-Transformer-CTC achieves a relative character error rate reduction of 10.91% over the Transformer, reaching a final character error rate of 5.31%, which confirms that the proposed model is competitive. Keywords: end-to-end speech recognition; Transformer; temporal convolutional network; connectionist temporal classification. CLC number: TN912.34; Document code: A; Article ID: 1001-3695(2022)03009069905; doi: 10.19734/j.issn.10013695.2021.08.0323

TCNTransformerCTCforendtoendspeechrecognitionXieXukang,ChenGe,SunJun,ChenQidong(SchoolofArtificialIntelligence&ComputerScience,JiangnanUniversity,WuxiJiangsu214122,China)

Corpus Feature Extraction

Corpus feature extraction refers to extracting, from a large-scale corpus, features to be used in machine learning or natural language processing tasks.

These features can be words, phrases, sentence structures, semantic relations, and so on.

Feature extraction is one of the key steps in natural language processing, because it provides machine learning models with useful input, enabling them to understand and generate natural language better.

Corpus feature extraction typically follows these steps:

1. Preprocessing: preprocess the raw corpus, including stop-word removal, lemmatization and tokenization.

Stop words are words that occur frequently in natural language but contribute little to meaning; they can hurt model performance and are therefore removed.

Tokenization is the process of splitting text into individual words or phrases.

2. Feature selection: select the features relevant to the task.

This can be done manually or automatically.

Manual methods require a human to screen for task-relevant features, whereas automatic methods extract features algorithmically.

3. Feature extraction: build models with the extracted features.

This can be done with various machine learning techniques, such as bag-of-words models, TF-IDF or Word2Vec (see the sketch after this list).

These techniques convert text into numerical representations suitable for training and prediction with machine learning models.
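As an illustration of step 3 (and of the preprocessing in step 1), here is a minimal sketch using scikit-learn's TfidfVectorizer, which tokenizes, filters English stop words and produces TF-IDF vectors in one pass; the toy documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the corpus is large and the corpus is balanced",
    "feature extraction turns text into numeric vectors",
]
# stop_words='english' removes common function words; tokenization is built in
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # sparse matrix: docs x vocabulary
print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray().round(2))                  # TF-IDF weights per document
```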

When doing corpus feature extraction, the following factors must be considered:

* Corpus size: as the corpus grows, more features can be extracted, which improves model performance.

However, an overly large corpus also increases computational cost and time.

* Task type: different task types call for different feature extraction methods.

For example, named entity recognition requires extracting entity words and semantic relations, while text classification can extract features through text representations and clustering.

* Domain and language: corpora from different domains and in different languages have different feature distributions and structures.

These factors must therefore be taken into account during feature extraction.

With well-designed corpus feature extraction, one obtains more accurate and more efficient machine learning models, improving the effectiveness and efficiency of natural language processing tasks.

In practice, building large-scale corpora improves the accuracy and richness of the extracted features, better supporting the development and application of natural language processing technology.

In addition, with the development of deep learning, methods based on deep neural networks (such as convolutional and recurrent neural networks) have become important tools for corpus feature extraction.

These methods extract rich and varied features from text more effectively, further improving the effectiveness and efficiency of natural language processing tasks.
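For instance, Word2Vec (mentioned above) learns dense word features directly from a tokenized corpus. A minimal sketch using gensim, with invented toy sentences:

```python
from gensim.models import Word2Vec

sentences = [
    ["corpus", "feature", "extraction", "for", "nlp"],
    ["word", "embeddings", "are", "dense", "features"],
]
# sg=1 selects skip-gram; vector_size is the embedding dimensionality
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print(model.wv["corpus"][:5])                   # first 5 dimensions
print(model.wv.most_similar("corpus", topn=2))  # nearest words in the space
```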

Utterance Topic Extraction in Spoken Dialogue Systems

1. Introduction

Human-machine spoken dialogue systems are an important research direction for bringing speech recognition technology into practical use.

The goal of a spoken dialogue system is to let a person express their thoughts in natural language and exchange information with a computer about a particular domain [1].

In recent years many countries have invested heavily in spoken dialogue systems: the United States has DARPA's Communicator program, and Europe has the ARISE, REWARD and VERBMOBIL programs, among others.

Many well-known universities and research institutes are working in this area, such as the SLS lab at MIT, the ISL lab at CMU, Lucent Bell Labs, the ATR labs in Japan, the CSLU center at OGI, and Philips [2].

In China, the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University, the Chinese University of Hong Kong, National Taiwan University and several other groups are also engaged in this research.

A spoken dialogue system can be divided into four layers: the human-computer interaction layer, the natural language processing layer, the dialogue management layer, and the application layer.

Many current spoken dialogue systems focus their natural language processing layer on the syntactic and semantic levels; one problem with this is that it cannot capture the overall internal coherence of a dialogue [3].

Moreover, because of ellipsis, anaphora, structural ambiguity and other phenomena in spoken language, the analysis results are often ambiguous.

This requires a discourse analysis module that uses the dialogue context and relevant domain knowledge to disambiguate, so as to arrive at the final semantic representation [4].

The dialogue history stored after discourse analysis can also help the system predict what the user will say next, enabling dynamic switching of the language processing model and thereby improving the system's recognition accuracy.

Discourse analysis covers two aspects: extracting the topic and the user intention from individual utterances, and describing the transitions between topics and intentions with an appropriate data structure [5].

Discourse analysis strategies fall into knowledge-based methods and corpus-based methods.

Knowledge-based methods use a set of rules to extract topics and user intentions from the dialogue, and describe the state transitions with rules.

These rules are designed mainly from the generalizations of linguists [6].

Corpus-based strategies, in contrast, rely on two probabilities: P(T|W) and P(I|W).

P(T|W) is the conditional probability of topic T, and P(I|W) the conditional probability of user intention I, given the set of word tokens W that have occurred in a dialogue.

These two probabilities are estimated from an annotated corpus and are used to extract topics and recognize user intentions.
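The text does not fix the estimator for P(T|W); one common instantiation is a naive Bayes decomposition with add-one smoothing, sketched below. The topic labels and training pairs are invented for illustration:

```python
from collections import Counter, defaultdict
import math

def train(pairs):
    """pairs: list of (word_list, topic) from an annotated dialogue corpus."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, topic in pairs:
        topic_counts[topic] += 1
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab

def best_topic(words, topic_counts, word_counts, vocab):
    """argmax_T P(T|W), with P(T|W) proportional to P(T) * prod_w P(w|T)."""
    total = sum(topic_counts.values())
    scores = {}
    for topic, tc in topic_counts.items():
        score = math.log(tc / total)
        denom = sum(word_counts[topic].values()) + len(vocab)  # add-one
        for w in words:
            score += math.log((word_counts[topic][w] + 1) / denom)
        scores[topic] = score
    return max(scores, key=scores.get)

pairs = [(["flight", "to", "boston"], "travel"), (["weather", "today"], "weather")]
model = train(pairs)
print(best_topic(["flight", "today"], *model))
```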

Corpus Linguistics Terminology (English-Chinese)

Aboutness: 所言之事
Absolute frequency: 绝对频数
Alignment (of parallel texts): (平行或对应)语料的对齐
Alphanumeric: 字母数字类的
Annotate: 标注(动词)
Annotation: 标注(名词)
Annotation scheme: 标注方案
ANSI/American National Standards Institute: 美国国家标准学会
ASCII/American Standard Code for Information Exchange: 美国信息交换标准码
Associate (of keywords): (主题词的)联想词
AWL/Academic word list: 学术词表
Balanced corpus: 平衡语料库
Base list: 底表、基础词表
Bigram: 二元组、二元序列、二元结构
Bi-hapax: 两次词
Bilingual corpus: 双语语料库
CA/Contrastive Analysis: 对比分析
Case-sensitive: 大小写敏感、区分大小写
Chi-square (χ2) test: 卡方检验
Chunk: 词块
CIA/Contrastive Interlanguage Analysis: 中介语对比分析
CLAWS/Constituent Likelihood Automatic Word-tagging System: CLAWS词性赋码系统
Clean text policy: 干净文本原则
Cluster: 词簇、词丛
Colligation: 类联接、类连接、类联结
Collocate n./v.: 搭配词;搭配
Collocability: 搭配强度、搭配力
Collocation: 搭配、词语搭配
Collocational strength: 搭配强度
Collocational framework/frame: 搭配框架
Comparable corpora: 类比语料库、可比语料库
ConcGram: 同现词列、框合结构
Concordance (line): 索引(行)
Concordance plot: (索引)词图
Concordancer: 索引工具
Concordancing: 索引生成、索引分析
Context: 语境、上下文
Context word: 语境词
Contingency table: 连列表、联列表、列连表、列联表
Co-occurrence/Co-occurring: 共现
Corpora: 语料库(复数)
Corpus Linguistics: 语料库语言学
Corpus: 语料库
Corpus-based: 基于语料库的
Corpus-driven: 语料库驱动的
Corpus-informed: 语料库指导的、参考了语料库的
Co-select/Co-selection/Co-selectiveness: 共选(机制)
Co-text: 共文
DDL/Data Driven Learning: 数据驱动学习
Diachronic corpus: 历时语料库
Discourse: 话语、语篇
Discourse prosody: 话语韵律
Documentation: 备检文件、文检报告
EAGLES/Expert Advisory Groups on Language Engineering Standards: EAGLES文本规格
Empirical Linguistics: 实证语言学
Empiricism: 经验主义
Encoding: 字符编码
Error-tagging: 错误标注、错误赋码
Extended unit of meaning: 扩展意义单位
File-based search/concordancing: 批量检索
Formulaic sequence: 程式化序列
Frequency: 频数、频率
General (purpose) corpus: 通用语料库
Granularity: 颗粒度
Hapax legomenon/hapax: 一次词
Header/Text head: 文本头、头标、头文件
HMM/Hidden Markov Model: 隐马尔科夫模型
Idiom Principle: 习语原则
Index/Indexing: (建)索引
In-line annotation: 文内标注、行内标注
Key keyword: 关键主题词
Keyness: 主题性、关键性
Keyword: 主题词
KWIC/Key Word in Context: 语境中的关键词、语境共现(方式)
Learner corpus: 学习者语料库
Lemma: 词目、原形词、词元
Lemma list: 词形还原对应表
Lemmata: 词目、原形词、词元(复数)
Lemmatization: 词形还原、词元化
Lemmatizer: 词形还原(词元化)工具
Lexical bundle: 词束
Lexical density: 词汇密度
Lexical item: 词项、词语项目
Lexical priming: 词汇触发理论
Lexical richness: 词汇丰富度
Lexico-grammar/Lexical grammar: 词汇语法
Lexis: 词语、词项
LL/Log likelihood (ratio): 对数似然比、对数似然率
Longitudinal/Developmental corpus: 跟踪语料库、发展语料库、历时语料库
Machine-readable: 机读的
Markup: 标记、置标
MDA/Multi-dimensional approach: 多维度分析法
Metadata: 元信息
Meta-metadata: 元元信息
MF/MD (Multi-feature/Multi-dimensional) approach: 多特征/多维度分析法
Mini-text: 微型文本
Misuse: 误用
Monitor corpus: (动态)监察语料库
Monolingual corpus: 单语语料库
Multilingual corpus: 多语语料库
Multimodal corpus: 多模态语料库
MWU/Multiword unit: 多词单位
MWE/Multiword expression: 多词单位
MI/Mutual information: 互信息、互现信息
N-gram: N元组、N元序列、N元结构、N元词、多词序列
NLP/Natural Language Processing: 自然语言处理
Node: 节点(词)
Normalization: 标准化
Normalized frequency: 标准化频率、标称频率、归一频率
Observed corpus: 观察语料库
Ontology: 知识本体、本体
Open Choice Principle: 开放选择原则
Overuse: 超用、过多使用、使用过度、过度使用
Paradigmatic: 纵聚合(关系)的
Parallel corpus: 平行语料库、对应语料库
Parole linguistics: 言语语言学
Parsed corpus: 句法标注的语料库
Parser: 句法分析器
Parsing: 句法分析
Pattern/patterning: 型式
Pattern grammar: 型式语法
Pedagogic corpus: 教学语料库
Phraseology: 短语、短语学
POSgram: 赋码序列、码串
POS tagging/Part-of-Speech tagging: 词性赋码、词性标注、词性附码
POS tagger: 词性赋码器、词性赋码工具
Prefab: 预制语块
Probabilistic: (基于)概率的、概率性的、盖然的
Probability: 概率
Rationalism: 理性主义
Raw text/Raw corpus: 生文本(语料)
Reference corpus: 参照语料库
Regex/RE/RegExp/Regular Expressions: 正则表达式
Register variation: 语域变异
Relative frequency: 相对频率
Representative/Representativeness: 代表性(的)
Rule-based: 基于规则的
Sample n./v.: 样本;取样、采样、抽样
Sampling: 取样、采样、抽样
Search term: 检索项
Search word: 检索词
Segmentation: 切分、分词
Semantic preference: 语义倾向
Semantic prosody: 语义韵
SGML/Standard Generalized Markup Language: 标准通用标记语言
Skipgram: 跨词序列、跨词结构
Span: 跨距
Special purpose corpus: 专用语料库、专门用途语料库、专题语料库
Specialized corpus: 专用语料库
Standardized TTR/Standardized type-token ratio: 标准化类符/形符比、标准化类/形比、标准化型次比
Stand-off annotation: 分离式标注
Stop list: 停用词表、过滤词表
Stop word: 停用词、过滤词
Synchronic corpus: 共时语料库
Syntagmatic: 横组合(关系)的
Tag: 标记、码、标注码
Tagger: 赋码器、赋码工具、标注工具
Tagging: 赋码、标注、附码
Tag sequence: 赋码序列、码串
Tagset: 赋码集、码集
Text: 文本
TEI/Text Encoding Initiative: 文本编码计划
The Lexical Approach: 词汇中心教学法
The Lexical Syllabus: 词汇大纲
Token: 形符、词次
Token definition: 形符界定、单词界定
Tokenization: 分词
Tokenizer: 分词工具
Transcription: 转写
Translational corpus: 翻译语料库
Treebank: 树库
Trigram: 三元组、三元序列、三元结构
T-score: T值
Type: 类符、词型
TTR/Type-token ratio: 类符/形符比、类/形比、型次比
Underuse: 少用、使用不足
Unicode: 通用码
Unit of meaning: 意义单位
WaC/Web as Corpus: 网络语料库
Wildcard: 通配符
Word definition: 单词界定
Word form: 词形
Word family: 词族
Word list: 词表
XML/EXtensible Markup Language: 可扩展标记语言
Zipf's Law: 齐夫定律
Z-score: Z值

Automatic Extraction of Subcategorization from Corpora
Ted Briscoe, Computer Laboratory, University of Cambridge, Pembroke Street, Cambridge CB2 3QG, UK (ejb@)
John Carroll, Cognitive and Computing Sciences, University of Sussex, Brighton BN1 9QH, UK (john.carroll@)

cmp-lg/9702002 4 Feb 97

Abstract
We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount [1].

[1] This work was supported by UK DTI/SALT project 41/5808 'Integrated Language Database', CEC Telematics Applications Programme project LE1-2111 'SPARKLE: Shallow PARsing and Knowledge extraction for Language Engineering', and by SERC/EPSRC Advanced Fellowships to both authors. We would like to thank the COMLEX Syntax development team for allowing us access to pre-release data (for an early experiment), and for useful feedback.

1 Motivation

Predicate subcategorization is a key component of a lexical entry, because most, if not all, recent syntactic theories 'project' syntactic structure from the lexicon. Therefore, a wide-coverage parser utilizing such a lexicalist grammar must have access to an accurate and comprehensive dictionary encoding (at a minimum) the number and category of a predicate's arguments, and ideally also information about control with predicative arguments, semantic selection preferences on arguments, and so forth, to allow the recovery of the correct predicate-argument structure. If the parser uses statistical techniques to rank analyses, it is also critical that the dictionary encode the relative frequency of distinct subcategorization classes for each predicate.

Several substantial machine-readable subcategorization dictionaries exist for English, either built largely automatically from machine-readable versions of conventional learners' dictionaries, or manually by (computational) linguists (e.g. the Alvey NL Tools (ANLT) dictionary, Boguraev et al. (1987); the COMLEX Syntax dictionary, Grishman et al. (1994)). Unfortunately, neither approach can yield a genuinely accurate or comprehensive computational lexicon, because both rest ultimately on the manual efforts of lexicographers/linguists and are, therefore, prone to errors of omission and commission which are hard or impossible to detect automatically (e.g. Boguraev & Briscoe, 1989; see also section 3.1 below for an example). Furthermore, manual encoding is labour intensive and, therefore, it is costly to extend it to neologisms, information not currently encoded (such as relative frequency of different subcategorizations), or other (sub)languages. These problems are compounded by the fact that predicate subcategorization is closely associated with lexical sense, and the senses of a word change between corpora, sublanguages and/or subject domains (Jensen, 1991).

In a recent experiment with a wide-coverage parsing system utilizing a lexicalist grammatical framework, Briscoe & Carroll (1993) observed that half of parse failures on unseen test data were caused by inaccurate subcategorization information in the ANLT dictionary. The close connection between sense and subcategorization and between subject domain and sense makes it likely that a fully accurate 'static' subcategorization dictionary of a language is unattainable in any case. Moreover, although Schabes (1992) and others have proposed 'lexicalized' probabilistic grammars to improve the accuracy of parse ranking, no wide-coverage parser has yet been constructed incorporating probabilities of different subcategorizations for individual predicates, because of the problems of accurately estimating them.

These problems suggest that automatic construction or updating of subcategorization dictionaries from textual corpora is a more promising avenue to pursue. Preliminary experiments acquiring a few verbal subcategorization classes have been reported by Brent (1991, 1993), Manning (1993), and Ushioda et al. (1993). In these experiments the maximum number of distinct subcategorization classes recognized is sixteen, and only Ushioda et al. attempt to derive relative subcategorization frequencies for individual predicates.

We describe a new system capable of distinguishing 160 verbal subcategorization classes, a superset of those found in the ANLT and COMLEX Syntax dictionaries. The classes also incorporate information about control of predicative arguments and alternations such as particle movement and extraposition. We report an initial experiment which demonstrates that this system is capable of acquiring the subcategorization classes of verbs and the relative frequencies of these classes with accuracy comparable to the less ambitious extant systems. We achieve this performance by exploiting a more sophisticated robust statistical parser which yields complete though 'shallow' parses, a more comprehensive subcategorization class classifier, and a priori estimates of the probability of membership of these classes. We also describe a small-scale experiment which demonstrates that subcategorization class frequency information for individual verbs can be used to improve parsing accuracy.
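To make the relative-frequency claim concrete, here is a minimal sketch (not the authors' system) of how per-verb relative frequencies of subcategorization classes could be tallied once a shallow parser and a frame classifier have produced (verb, frame) observations; the frame labels and input format are invented for illustration:

```python
from collections import Counter, defaultdict

def relative_frequencies(observations):
    """observations: iterable of (verb_lemma, subcat_class) pairs, e.g. the
    output of a shallow parser followed by a frame classifier."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    entries = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        entries[verb] = {f: n / total for f, n in frames.items()}
    return entries

obs = [("believe", "NP"), ("believe", "S_COMP"), ("believe", "S_COMP")]
print(relative_frequencies(obs)["believe"])
# {'NP': 0.333..., 'S_COMP': 0.666...}
```

A production system would additionally filter out low-confidence frames (the paper's a priori class-membership estimates serve that purpose), but the dictionary entry itself is just such a normalized frequency distribution.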