The Chinese Penn Treebank Tag Set and the Meanings of Its Tags (中文宾州树库标记及其含义)
A Hierarchical-Clustering Approach to Semantic Role Labeling of Sparse Predicates

A Hierarchical-Clustering Approach to Semantic Role Labeling of Sparse Predicates
Yang Haitong (杨海彤)

Abstract: In Chinese semantic role labeling (SRL), labeling performance on sparse (rarely seen) predicates falls far below that on other predicates, yet in practice an SRL system must handle large numbers of sparse predicates, so the sparse-predicate problem severely limits the practical usefulness of SRL systems. To address this, a method based on agglomerative hierarchical clustering is proposed: the clustering links sparse predicates to common ones, so that a sparse predicate can be generalized to a semantically similar common predicate, alleviating the sparse-predicate problem. Experiments on the Chinese Proposition Bank show that the method handles sparse predicates in Chinese SRL effectively.

Journal: Computer Engineering and Design (计算机工程与设计)
Year (volume), issue: 2018, 39(11)
Pages: 6 pages (pp. 3384-3388, 3407)
Keywords: semantic role labeling; sparse predicates; agglomerative hierarchical clustering; common predicates; semantics
Author: Yang Haitong (杨海彤)
Affiliation: School of Computer Science, Central China Normal University, Wuhan 430079, Hubei, China
Language of publication: Chinese
CLC number: TP311

0 Introduction

Semantic role labeling (SRL) is a shallow semantic analysis technique in natural language processing.
Taking the sentence as its unit of analysis, it identifies the semantic relations between the predicates in a sentence and their related constituents, thereby obtaining a shallow representation of the meaning the sentence expresses.
Here is an example of semantic role labeling:

[警方]Agent [正在]Time [调查]Pred [事故原因]Patient

Here 调查 ("investigate") is the predicate, denoting an event; 警方 ("the police") is the agent; 事故原因 ("the cause of the accident") is the patient; and 正在 ("currently") marks the time at which the event occurs.
As this shows, semantic role labeling can extract all of the important information about the event a sentence expresses.
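One convenient way to hold such an analysis in a program is a small predicate-arguments record. The structure below is our own illustration; only the role names and spans come from the example above.

```python
# A minimal container for the labeled example above; the dict layout is an
# illustrative choice, not a standard SRL output format.
srl = {
    "predicate": "调查",        # "investigate"
    "arguments": [
        ("警方", "Agent"),       # "the police"
        ("正在", "Time"),        # "currently"
        ("事故原因", "Patient"),  # "the cause of the accident"
    ],
}

def spans_with_role(analysis, role):
    # Pick out the argument spans carrying a given role label.
    return [span for span, r in analysis["arguments"] if r == role]

agents = spans_with_role(srl, "Agent")  # → ["警方"]
```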
Because it provides concise, accurate, and useful analyses, SRL has received broad attention from the research community in recent years and has been applied successfully to tasks such as information extraction [1], question answering [2], and machine translation [3]. This concise yet effective semantic analysis capability has also drawn many researchers into SRL research itself.
Reference [4] analyzed in detail which features are effective for Chinese SRL, backed by extensive experiments. Reference [5] proposed a joint learning model for Chinese syntactic parsing and SRL. Reference [6] combined four basic SRL systems and obtained good results.
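The back-off idea in the abstract, linking each sparse predicate to a semantically similar common predicate through agglomerative (bottom-up) hierarchical clustering, can be sketched as follows. The context vectors and corpus frequencies are toy values invented for illustration; the paper's actual features and clustering configuration are not reproduced here.

```python
# Sketch: cluster predicates by average-linkage agglomerative clustering over
# cosine similarity, then back off each predicate to the most frequent
# predicate in its cluster. All numbers below are invented for illustration.
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def agglomerative(vectors, n_clusters):
    # Bottom-up merging: repeatedly join the pair of clusters with the
    # highest average pairwise cosine similarity.
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best, best_sim = None, -2.0
        for i, j in combinations(range(len(clusters)), 2):
            sim = sum(cosine(vectors[a], vectors[b])
                      for a in clusters[i] for b in clusters[j])
            sim /= len(clusters[i]) * len(clusters[j])
            if sim > best_sim:
                best, best_sim = (i, j), sim
        i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy context vectors and frequencies: two "investigate"-like predicates and
# two "buy"-like predicates, one common and one sparse in each group.
vectors = {
    "调查": (0.9, 0.1, 0.0), "侦查": (0.85, 0.15, 0.05),
    "购买": (0.1, 0.9, 0.2), "采购": (0.15, 0.8, 0.25),
}
freq = {"调查": 500, "侦查": 3, "购买": 400, "采购": 2}

backoff = {}
for cluster in agglomerative(vectors, n_clusters=2):
    head = max(cluster, key=freq.get)  # most frequent predicate in cluster
    for p in cluster:
        backoff[p] = head
# Sparse 侦查 generalizes to common 调查; sparse 采购 to common 购买.
```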
The LDC Chinese Treebank (CTB)

CTB Overview
• Building on CTB, the University of Pennsylvania also completed the Chinese Proposition Bank 1.0 (Babko-Malaya et al. 2004; Xue and Palmer 2003), which annotates predicate-argument structure, and the Chinese Discourse Treebank (Xue 2005), which annotates discourse connectives. These resources will greatly advance applications such as machine translation, information retrieval, and information extraction.
Syntactic tags in CTB: slide examples
• Company names: (NP (NR 中国) (NN 人民) (NN 银行)); (NP (NP (NR 中国)) (NN 民族) (NN 企业))
• Titles: (NP (NR …) (NN 总理))
• Dates and places: (NP (NT 一九九九) (NT 四月) (NT 十五日)); (NP (NR 河北省) (NR 保定市))
• NP modifiers can be of several types: QPs, e.g. (NP (QP (CD 30多) (CLP …)) …); ADJPs, e.g. (NP (ADJP (JJ 主要)) (NN 负责人)); DPs, e.g. (NP (DP (DT 任何)) (NN 人)) and (NP (DP (DT 全体)) (NN 外交) (NN 官员)). ADJPs modifying a noun head always project an NP.
CTB Overview
• (3) Depending on application needs, the tree structures can be converted into skeletal parse trees, dependency trees, and other representations. Base-phrase and grammatical-function annotations can also be extracted automatically from the treebank, linking the existing constituency annotation scheme to partial-parsing schemes for Chinese and widening the range of uses of the treebank corpus (Zhou Qiang 2004: 4).
CTB Overview
• (4) Years of research and teaching within the phrase-structure-grammar tradition have produced a rich pool of trained people, so treebank proofreaders are easy to find and can do the job without extensive training. This greatly lowers the cost of developing a large-scale treebank (Zhou Qiang 2004: 3).
CTB Overview
• In its annotation scheme, CTB has, since CTB 1.0 (1998-2002), essentially followed the Penn Treebank (PTB-2) scheme for English: starting from the fairly flat skeletal parses of the original PTB, it adds functional tags that mark the grammatical functions of the main syntactic constituents of a sentence (Zhou Qiang 2004: 2). The corpus currently annotates 500,000 words of newswire text.
The LDC Chinese Treebank (CTB)

Chinese POS Tagging Guidelines in CTB
• VE: 有 (you3) as the main verb. Only 有 [have], 没(有) [not (have)], and 无 [not have] are tagged as VE, and only when they are the main verb.
Chinese POS Tagging Guidelines in CTB
• Other verbs: VV. This covers the rest of the verbs: modals and raising predicates (e.g., 可能 [maybe, probably]), control verbs (e.g., 要 [want], 想 [want to]), action verbs (e.g., 走 [walk]), psych verbs (e.g., 喜欢 [like], 了解 [understand], 憎恨 [hate]), and so on.
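The VE/VV lexical rule quoted in these slides can be encoded directly. Note that real CTB tagging is context-sensitive (it matters whether the word actually heads the clause), which the `is_main_verb` flag here only stands in for.

```python
# Sketch of the VE/VV distinction described above: 有/没有/无 get VE when
# they are the main verb; every other verb falls into the catch-all VV.
VE_LEMMAS = {"有", "没有", "无"}

def verb_tag(verb, is_main_verb=True):
    # is_main_verb is a stand-in for the syntactic context a real tagger uses.
    if verb in VE_LEMMAS and is_main_verb:
        return "VE"
    return "VV"
```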
What Is a Treebank
• A treebank is a corpus annotated with structural information. Although a sentence appears on the surface as a linear sequence of words, its internal constituents are organized hierarchically, and this hierarchy is conventionally represented with the formal device of a tree. If ambiguity is taken into account, one sentence may correspond to several trees. A treebank is the collection of a large number of sentences paired with their corresponding trees.
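The point that a sentence's constituent structure is a tree can be made concrete with a bracketed nested-tuple encoding. The toy parse below is our own example, not an actual CTB tree.

```python
# A constituency tree as nested (label, children...) tuples; leaves are
# plain strings. Toy example: "中国 发展 经济" (China develops the economy).
tree = ("IP",
        ("NP", ("NR", "中国")),
        ("VP", ("VV", "发展"), ("NP", ("NN", "经济"))))

def leaves(t):
    # Collect terminal words from a nested (label, children...) tuple.
    if isinstance(t, str):
        return [t]
    out = []
    for child in t[1:]:
        out.extend(leaves(child))
    return out
```

An ambiguous sentence would simply map to more than one such tuple.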
CTB Overview
• The Penn annotations are, of course, open to debate. Analyzing Chinese within an English-derived grammatical framework sometimes conflicts with native-speaker intuitions; the annotation granularity is at times rather coarse, which causes errors when converting to a dependency treebank; and some structures deserve finer subdivision.
• 1. What is a treebank
• 2. CTB overview
• 3. Chinese POS tagging guidelines in CTB
• 4. Syntactic tags in CTB
• 5. CTBParser

Automatic Verb Classification Using Multilingual Resources

Vivian Tsang, Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, ON, Canada M5S 3G4, vyctsang@
Suzanne Stevenson, Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, ON, Canada M5S 3H5, suzanne@

Abstract

We propose the use of multilingual corpora in the automatic classification of verbs. We extend the work of (Merlo and Stevenson, 2001), in which statistics over simple syntactic features extracted from textual corpora were used to train an automatic classifier for three lexical semantic classes of English verbs. We hypothesize that some lexical semantic features that are difficult to detect superficially in English may manifest themselves as easily extractable surface syntactic features in another language. Our experimental results combining English and Chinese features show that a small bilingual corpus may provide a useful alternative to using a large monolingual corpus for verb classification.

1 Introduction

Recently, a number of researchers have devised corpus-based approaches for automatically learning the lexical semantic class of verbs (e.g., McCarthy and Korhonen, 1998; Lapata and Brew, 1999; Schulte im Walde, 2000; Merlo and Stevenson, 2001). Automatic verb classification yields important potential benefits for the creation of lexical resources.
Lexical semantic classes incorporate both syntactic and semantic information about verbs, such as the general sense of the verb (e.g., change-of-state or manner-of-motion) and the allowable mapping of verbal arguments to syntactic positions (e.g., whether an experiencer argument can appear as the subject or the object of the verb) (Levin, 1993). By automatically learning the assignment of verbs to lexical semantic classes, each verb inherits a great deal of information about its possible usage in an NLP system, without that information having to be explicitly hand-coded.

In this paper, we explore the use of multilingual corpora in the automatic learning of verb classification. We extend the work of (Merlo and Stevenson, 2001), in which statistics over simple syntactic features extracted from syntactically annotated corpora were used to train an automatic classifier for a set of sample lexical semantic classes of English verbs. This work had two potential limitations: first, only a small number (five) of syntactic features that correlate with semantic class were proposed; second, a very large corpus (65M words) was needed to extract sufficiently discriminating statistics.

We address both of these issues in the current study by exploiting the use of a parallel English-Chinese corpus. Our motivating hypothesis is that some lexical semantic features that are difficult to detect superficially in English may manifest themselves as surface syntactic features in another language. If this is indeed the case, then we should be able to augment the initial set of English features with features over the translated verbs in the other language (in our case, Chinese).

Our hypothesis that a non-English verb feature set can be useful in English verb classification is inspired by SLA (Second Language Acquisition) research on learning English verbs.
As the name suggests, SLA research studies how humans acquire a second language. "Transfer effects," the impact of one's native language when learning a second language (Ellis, 1997), are of particular interest to us. Recent research has shown that properties of a non-English native lexicon can influence human learning of English verb class distinctions (e.g., Helms-Park, 1997; Inagaki, 1997; Juffs, 2000). Carrying this idea of "transfer" over to the machine learning setting, we hypothesize that features from a second language may provide an additional source of information that complements the English features, making it possible that a smaller corpus (a bitext) can be a useful alternative to using a large monolingual corpus for verb classification.

2 The Verb Classes and English Features

Merlo and Stevenson (2001) tested their approach on the major classes of optionally intransitive verbs in English. All the classes allow the same subcategorizations (transitive and intransitive), entailing that they cannot be discriminated by subcategorization alone. Thus, successful classification demonstrates the induction of semantic information from syntactic features.

In our work, we focus on two of these classes: the change-of-state verbs, such as open, and the verbs of creation and transformation, such as perform (classes 45 and 26, respectively, from Levin, 1993). Both classes are optionally intransitive, but differ in the alternation between the transitive and intransitive forms.
The transitive form of a change-of-state verb is a causative form of the intransitive (the door opened / the cat opened the door), while the transitive/intransitive alternates of a creation/transformation verb arise from simple object optionality (the actors performed the skit / the actors performed).

Merlo and Stevenson (2001) used 5 numeric features that encoded summary statistics over the usage of each verb across the corpus (65M words of Wall Street Journal, WSJ). The features captured subcategorization and aspectual frequencies (of transitivity, passive voice, and the VBN POS tag), as well as statistics that approximated thematic properties of NP arguments (animacy and causativity) from simple syntactic indicators. We adopt these same features in our work, and augment them with Chinese features as described next.

3 Chinese Features

We selected the following Chinese features for our task, based on the properties of the change-of-state and creation/transformation classes. Each numbered item refers to a collection of related features. We describe how we expect each type of feature to vary across the two classes.

1. Chinese POS tags for verbs: We used the CKIP (Chinese Knowledge Information Processing Group) POS tagger to assign one of 15 verb tags to each verb. Additionally, each of these tags can be mapped into the UPenn Chinese Treebank standard (Fei Xia, email communication), which characterizes each verb as "active" or "stative". We note that change-of-state verbs are more likely to be adjectivized than creation/transformation verbs; furthermore, this adjectival property is not unlike the stative property in Chinese. We expect then to see the Chinese translations of English change-of-state verbs more often assigned a stative verb tag.

2. Passive particles: The adjectival nature of change-of-state verbs may also be reflected in a higher proportion of passive use, since the adjectival use is a passive use. In Chinese, a passive construction is indicated by a passive particle preceding the main verb. For example, the passive sentence:

This store is closed.

can be translated as:

Zhe4ge4 (this) shang1dian4 (store) bei4 (passive particle) guan1bi4 (closed).

We thus expect to find that translations of change-of-state verbs have a higher frequency of occurrence with a passive particle in Chinese.

3. Periphrastic (causative) particles: In Chinese, some causative sentences use an external (periphrastic) particle to indicate that the subject is the causal agent of the event specified by the verb. For example, one possible translation of

I cracked an egg.

can be

Wo3 (I) jiang1 (made, periphrastic particle) dan4 (egg) da3lan4 (crack).

Since change-of-state verbs have a causative alternate, and creation/transformation verbs do not, we expect to see a more frequent use of such particles in the translated equivalents of the change-of-state verbs.

4. Morpheme information: The types of features discussed so far involve the POS tag of the translated verb, or additional syntactic particles it occurs with. We also hypothesize that the semantic class membership of an English verb may influence its word-level translation into Chinese. That is, the sublexical component, the precise morphemic constitution of the translated Chinese verb, may reflect properties of the class of the English verb. The following features are an attempt to exploit this potential source of information:
• Average number of morphemes in the translated verb.
• Different categories of morphemes in the translated verb. (We count occurrences of all combinations of pairs of POS tags V, N, and A.)
• Semantic specificity of the translated verb. (Is it semantically more specific than the English verb, e.g., by including additional morphemes?)
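A rough sketch of the particle features in items 2 and 3: check for a passive or periphrastic marker just before the target verb in a segmented clause. The paper derives these from CKIP tags; the raw string matching here, and the extra causative markers 使/让, are our own simplifications.

```python
# Toy extraction of the passive (被) and periphrastic-causative (将, plus
# 使/让 as common causatives, added by us) particle features: look in a
# small window immediately before the target verb.
PASSIVE = {"被"}
PERIPHRASTIC = {"将", "使", "让"}

def particle_features(tokens, verb_index, window_size=2):
    # tokens: a segmented Chinese clause; verb_index: position of the verb.
    window = tokens[max(0, verb_index - window_size):verb_index]
    return {
        "passive": any(t in PASSIVE for t in window),
        "periphrastic": any(t in PERIPHRASTIC for t in window),
    }

# "这 个 商店 被 关闭" -- the store-is-closed example from the text.
feats = particle_features(["这", "个", "商店", "被", "关闭"], 4)
```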
The four general types of features we describe above lead to 17 Chinese features in total, which we use alone or in combination with the original 5 features proposed by Merlo and Stevenson (2001).

4 Experimental Materials and Method

In our experiments, we use the Hong Kong Laws Parallel Text (HKLaws) from the Linguistic Data Consortium, a sentence-aligned bilingual corpus with 6.5M words of English and 9M characters of Chinese. We tagged the Chinese portion of the corpus using the CKIP tagger, and the English portion using Ratnaparkhi's tagger (Ratnaparkhi, 1996). Note that the English portion of HKLaws is about 10% of the size of the corpus used by Merlo and Stevenson (2001) in their original experiments, so we are restricted to a much smaller source of data.

Given the relatively small size of our corpus, and its narrow domain, we were only able to find a sample of 16 change-of-state and 16 creation/transformation verbs in English of sufficient frequency; see the appendix for the list of verbs used.[1] The English features for these 32 verbs were automatically extracted using regular expressions over the tagged English portion of the corpus.

The Chinese features were calculated as follows. For each English verb, we manually determined the Chinese translation in each aligned sentence to yield a collection of all (aligned) translations of the verb. This is the "aligned translation set." We also extracted all occurrences of the Chinese verbs in the aligned translation set across the corpus, yielding the "unaligned translation set", i.e., the possible Chinese translations of an English target verb even when they did not occur as the translation of that verb.

The required counts for the Chinese features were collected partly automatically (Chinese verb POS tags, passive particles, periphrastic particles, and morpheme length) and partly by hand (semantic specificity and morpheme POS combinations). The value of a Chinese feature for a given verb is the normalized frequency of occurrence of the feature across all occurrences of that verb in the given translation set. The resulting frequencies for the aligned translation set form the aligned dataset, and those for the unaligned translation set form the unaligned dataset.

The motivation for collecting unaligned data is to examine an alternative method for combining multilingual data. Note that parallel corpora, especially those that are sentence-aligned, are difficult to construct. Most parallel corpora we found are considerably smaller than some of the more popular monolingual ones. Given that more monolingual corpora are available, we want to explore the possibility of using non-parallel texts from multiple languages (hence, necessarily unaligned data), rather than solely looking at bilingual corpora.

[1] In the set of creation/transformation verbs, we include one item not from that class, but with similar syntactic behavior: the verb pack. We included this verb because we could not find another creation/transformation verb in the HKLaws corpus. We could have used another optionally intransitive (non-causative) class from Levin's classification, but wanted to focus on these two classes in order to provide maximum comparability to the ongoing work by Stevenson and Merlo, who are currently investigating these classes.

In order to compare our results to the monolingual method on a large corpus (as in Merlo and Stevenson, 2001), we also collected the 5 English features for our verbs from the 65M word WSJ corpus. As a result, we have a total of four datasets: the English HKLaws dataset, the English WSJ dataset, the aligned Chinese HKLaws dataset, and the unaligned Chinese HKLaws dataset. This allows us to look at the four datasets individually (the two English and two Chinese sets), and to pair up the English and Chinese datasets in four different ways (each English set paired with each Chinese set).

The data for each of our machine learning experiments consists of a vector of the relevant (English and/or Chinese) features for each
verb:

Template: [verb, Eng. feats., Chi. feats., class]
Example: [altered, 0.04, ..., 1, change-of-state]

Combining all the English and Chinese features yields a total of 22 features. We use the resulting vectors as the training data for a classifier using the same decision tree algorithm as in (Merlo and Stevenson, 2001), C5.0. We used both 8-fold cross-validation (repeated 50 times) and leave-one-out training methodologies for our experiments.[2]

For our 8-fold cross-validation experiments, we empirically tested the tuning options available in C5.0. Except for the tree pruning percentage, we found the available options offer little to no improvement over the default settings. We set the pruning factor to 30% for the best overall performance over a variety of different combinations of features. (According to the manual, the default is 25%; a larger pruning factor results in less pruning of the decision tree.)

The cross-validation experiments train on a large number of random subsets of the data, for which we report average accuracy and standard error. The goal of the cross-validation experiments is to evaluate the contribution of different features to learning and, if possible, to find the best feature combination(s). To do so, we varied the precise set of features used in each experiment. Since we have a total of 17 features, an exhaustive search of 2^17 ≈ 131 thousand experiments is near impossible. Instead, we analysed the performance of individual monolingual features alone, and their performance when combined with the features from the other language.

[2] An 8-fold cross-validation experiment divides the data into eight parts (folds) and runs eight times, each time training on a different 7/8 of the data and testing on the remaining 1/8. We chose 8 folds simply because that evenly divides our 32 verbs. In leave-one-out experiments, we leave out one vector for testing and use the remaining vectors for training, repeated 32 times (once for each verb).

The leave-one-out experiments complement the cross-validation methodology: there are a small number of tests, but we have the result of classifying each verb rather than average performance data on random subsets. Our goal for the leave-one-out experiments is to compare the precision and recall across the two classes. A feature is selected for the leave-one-out experiments if it contributed highly to performance in the cross-validation experiments.

5 Experimental Results

We report here the key results of our cross-validation and leave-one-out experiments. (For additional results and details, see Tsang, 2001.) Since our task is a two-way classification with equal-sized classes, the chance accuracy is 50%. Although the theoretical maximum accuracy is 100%, it is worth noting that, for their three-way verb classification task, Merlo and Stevenson (2001) experimentally determined a best performance of 87% among a group of human experts, indicating that a more realistic upper bound for the machine-learning task falls well below 100%.

5.1 8-Fold Cross-Validation

Our cross-validation experiments fall into three general sets. In each, we use various combinations of the datasets (English HKLaws, English WSJ, Chinese aligned and unaligned), as explained in detail below. First, we analysed the contribution of the English features to learning by testing all English features together, and all English features individually. These tests form our baseline results using monolingual English data. Second, we similarly analysed the contribution of the Chinese features by testing all Chinese features together and all Chinese features individually. Finally, since our overall goal is to observe possible information gain by augmenting English data with non-English data, we present results in which we add selected Chinese features to the set of English features.

Table 1: Accuracy (%Acc.) and Standard Error (%SE) in the 8-Fold Cross-Validation Experiments, Using English Features Only

  Features                        %Acc.  %SE
  HKLaws, All English Features    41.3   0.7
  HKLaws, Transitivity            49.5   0.5
  WSJ, All English Features       66.3   0.6
  WSJ, Animacy                    72.5   0.4

Table 2: Accuracy (%Acc.) and Standard Error (%SE) in the 8-Fold Cross-Validation Experiments, Using Chinese Features Only

  Aligned Features             %Acc.  %SE    Unaligned Features           %Acc.  %SE
  HKLaws, All Chi. Features    75.4   0.6    HKLaws, All Chi. Features    74.1   0.6
  HKLaws, UPenn VA-Tag         75.1   0.4    HKLaws, UPenn VV-Tag         71.5   0.5

Table 1 shows the results of our experiments evaluating the English features. Using the HKLaws dataset, English features alone achieved a best performance of no better than chance (49.5% accuracy, SE 0.5%). Using the WSJ dataset, all the English features together achieved an accuracy of 66.3% (SE 0.6%), although the best performance was achieved by a single English feature alone (animacy), with an accuracy of 72.5% (SE 0.4%). We note then that the English HKLaws dataset alone is not sufficiently informative for the classification task. The best accuracy achieved with the WSJ data, 72.5%, will serve as our monolingual baseline, i.e., the performance we would like to beat with our multilingual data.
Next, we turn to our evaluation of Chinese features alone; the results are reported in Table 2. We see that, in contrast to the English HKLaws dataset, the Chinese features alone performed very well. For the aligned and unaligned Chinese HKLaws datasets, using all Chinese features achieved an accuracy of 75.4% and 74.1%, respectively, as shown in line 1 of the table; the two results are not significantly different at the p<0.05 level. Using the verb POS tags alone in the aligned set, e.g., the UPenn VA (stative) tag in line 2 of the table, achieves comparable performance of 75.1%, SE 0.4% (again, not statistically different from the first two results). (The best single feature in the unaligned dataset is also one of the verb tags, achieving only a slightly lower accuracy of 71.5%, SE 0.5%.)

Thus, we have the surprising result that Chinese features alone, from a fairly small dataset, are far superior to the English features from the same bilingual corpus (75.4% versus 49.5% best accuracy, respectively). In fact, the Chinese features alone outperform the monolingual baseline of 72.5%, which uses English features from a much larger corpus. (The difference between the best English-only and best Chinese-only accuracies is small, but statistically significant at the p<0.05 level.)
Finally, we want to look at the performance of all English features (from either corpus) augmented with selected Chinese features (aligned or unaligned, from the HKLaws corpus). The results are shown in Table 3. In general, combining English with Chinese features performed very well. Using English HKLaws data, the best feature combination (using the Chinese CKIP POS tags) achieved a performance of 77.9% accuracy (SE 0.8%), for a reduction of 56% of the baseline error rate. (See line 1 of Table 3; the results for aligned and unaligned data are not significantly different.) Note that, although numerically larger, these results do not differ significantly from the Chinese-only results. We conclude that for the English HKLaws dataset, the Chinese features greatly help the English features, and the English features do not hurt the performance of the Chinese features.

We also augmented the English WSJ dataset with the Chinese HKLaws dataset; the best accuracy is 80.6% (SE 0.6%), for an error rate reduction of 61% (see line 2 of Table 3). This best performance is achieved using the UPenn VA tag in the aligned corpus, shown above to be highly useful on its own. Here, the performance of the combined dataset, using both English and Chinese features, is significantly better than both the English monolingual baseline (of 72.5%) and the Chinese features alone (best accuracy of 75.4%) (p<0.05).

Table 3: Accuracy (%Acc.) and Standard Error (%SE) in the 8-Fold Cross-Validation Experiments, Using a Combination of English and Chinese Features

  Aligned Features                               %Acc.  %SE
  HKLaws only, All Eng. Features + CKIP Tags     77.5   0.7
  WSJ+HKLaws, All Eng. Features + UPenn VA-Tag   80.6   0.6

  Unaligned Features                             %Acc.  %SE
  HKLaws only, All Eng. Features + CKIP Tags     77.9   0.8
  WSJ+HKLaws, All Eng. Features + Peri. Part.    76.2   0.6

Table 4: F-measure (F), Accuracy (%Acc.), and Number of Errors (#E) in the Leave-One-Out Experiments (1 = CKIP Tags; 2 = Passive Particles; 3 = Periphrastic Particles; CoS = change-of-state; C/T = creation/transformation)

                   Aligned                       Unaligned
  Features      CoS F  C/T F  %Acc. (#E)     CoS F  C/T F  %Acc. (#E)
  Chi. Only     0.77   0.79   78.1 (7)       0.82   0.80   81.3 (6)
  Eng. Only     0.63   0.63   62.5 (12)      (aligned = unaligned)
  +1            0.80   0.82   81.3 (6)       0.63   0.63   62.5 (12)
  +2            0.58   0.61   59.4 (13)      0.73   0.76   75.0 (8)
  +3            0.52   0.55   53.1 (15)      0.80   0.82   81.3 (6)
  +1,2          0.79   0.83   81.3 (6)       0.83   0.86   84.4 (5)
  +2,3          0.48   0.57   53.1 (15)      0.69   0.69   68.8 (10)
  +1,3          0.79   0.83   81.3 (6)       0.57   0.67   62.5 (12)
  +1,2,3        0.79   0.83   81.3 (6)       0.62   0.74   68.8 (10)
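The two error-rate reductions quoted above (56% and 61%) can be checked arithmetically; both figures are reproduced when the reference point is the all-English HKLaws accuracy of 49.5%, which therefore appears to be the baseline intended in these calculations.

```python
# Sanity-checking the quoted error-rate reductions. Both match if computed
# against the English-only HKLaws accuracy (49.5%); this choice of baseline
# is our inference from the numbers, not stated explicitly in the text.
def error_reduction(baseline_acc, new_acc):
    """Percent reduction in error rate going from baseline_acc to new_acc."""
    base_err = 100.0 - baseline_acc
    return 100.0 * (base_err - (100.0 - new_acc)) / base_err

r_hklaws = error_reduction(49.5, 77.9)  # Eng. HKLaws + CKIP tags, ~56%
r_wsj = error_reduction(49.5, 80.6)     # Eng. WSJ + UPenn VA tag, ~61%
```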
We conclude that combining multilingual data has a significant performance benefit over monolingual data from either language. In particular, in augmenting English-only data with Chinese data, we achieve higher accuracies than using either the English HKLaws subcorpus or the much larger WSJ corpus alone. On the other hand, we found that Chinese features alone achieve very good accuracies, close to the performance of the combined datasets, indicating that the Chinese features are highly informative in and of themselves.

Finally, we note that, although the English features from the smaller bilingual corpus were not useful in classification on their own, the combination of English and Chinese features from that corpus performed comparably to the combination of English WSJ features with the Chinese features. Thus, a smaller bilingual corpus may be effectively used either alone or in combination with a larger monolingual corpus.

5.2 Leave-One-Out Methodology

For the leave-one-out experiments, we only report results using English WSJ data in conjunction with the Chinese HKLaws data, since that yielded the best performance. We focus here on augmenting the English dataset with Chinese features that seem particularly promising. Recall that since the leave-one-out method yields the result of classifying each individual verb, we can further analyse the performance within and across the two classes with this multilingual data.

For these tests, we selected the three Chinese features CKIP Tags, Passive Particles, and Periphrastic Particles, because they consistently had above-chance performance, and/or improved performance when combined with other features, in the cross-validation experiments. The results are shown in Table 4.
The italicized sections highlight the feature sets with the best overall accuracies. On the left panel, showing the results with aligned Chinese data, the addition to the English features of any feature combination that includes CKIP Tags gives the (same) best overall accuracy. On the right panel, showing the unaligned data, the addition of CKIP Tags and Passive Particles gives the best overall performance. We see again that with the right feature combination, using multilingual data is superior to using English-only data.

Since we know the number of errors per class, we were able to calculate the precision and recall of each of the two classes as well. Due to space limitations, we only report the F-measure in Table 4. For each class, we calculated a balanced F score as 2PR/(P+R), where P and R are the precision and recall. The two classes yield similar F scores in almost all cases, and the trend is no different from that of the overall accuracy. Observe that in the italicized sections of the table (the best overall performance), the F scores are larger than those in the monolingual section (first two lines of the table). We conclude that adding Chinese features to English features has a performance benefit over the monolingual features alone for both verb classes, as well as overall.

6 Related Work

Our work is the first use of a bilingual corpus-based technique for the automatic learning of verb classification, though we are not the first to utilize multilingual resources for lexical acquisition tasks generally. For example, Siegel and McKeown (2000) suggested the use of parallel corpora in learning the aspectual classification (i.e., state or event) of English verbs. Ide (2000) and Resnik and Yarowsky (2000) made use of parallel corpora for word sense disambiguation. That is, a parallel (English-non-English) corpus was used as a source for lexicalizing some fine-grained English senses.
Other work using multilingual resources that is highly related to ours are the studies by Fung (1998) and by Melamed et al. (1997; 1998), in which a bilingual corpus was used to extract bilingual lexical entries. An important assumption is that the bilingual corpus is sentence- or segment-alignable, which allows for the calculation of some co-occurrence score between any two possible translations. One common theme in these papers is that, given any arbitrary tokens and some text coordinate system, the closer the two tokens' coordinates are, the more likely they are translational equivalents. Although we did not use an automatic method to find translations of verbs, our aligned data collection technique is similar in spirit. We also make one further implication that is absent in these papers: in one subcorpus of a bitext, the distribution of the different senses and usages of a word should be reflected in, and correlated with, the distribution of its translations in the other subcorpus. We have suggested that some Chinese features are related to some English features; therefore, these Chinese features should also make a similar n-way distinction between the English verb classes.

7 Conclusions

We conclude that the use of multilingual corpora, either alone or in combination with monolingual data, can be an effective aid in verb classification. The Chinese features that worked best were the (active/stative) POS tags, and the passive and causative particles: easily extractable features indicating properties that are difficult to detect in English using only simple syntactic counts. This supports our hypothesis that a second language providing surface-level features that complement the available English features can extend the possible feature set for verb classification, allowing the use of smaller parallel corpora in place of, or in addition to, larger monolingual data sets.

We have presented some preliminary results demonstrating the benefit of using multilingual data. However, we conducted our
experiments only on a small test set of 32 verbs in one language pair. To test the generality of our hypothesis, we plan to duplicate our experiments using a larger test set, and to expand our investigation to other language pairs. In fact, given our success with even unaligned data, we conjecture that our approach may be greatly enhanced by using multiple monolingual corpora from different languages which differentially express semantic features relevant to verb classification.

Acknowledgements

We gratefully acknowledge the financial support of the US National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, and the University of Toronto. We thank Paola Merlo for helpful discussions on the work.

Appendix

Change-of-state verbs: alter, change, clear, close, compress, contract, cool, decrease, diminish, dissolve, divide, drain, flood, multiply, open, reproduce.

Creation and transformation verbs: build, clean, compose, direct, hammer, knit, organise, pack, paint, perform, play, produce, recite, stitch, type, wash.

References

Rod Ellis. 1997. Second Language Acquisition. Oxford University Press, Oxford.
Pascale Fung. 1998. A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In Lecture Notes in Artificial Intelligence, pages 1-17. Springer.
Rena Helms-Park. 1997. Building an L2 Lexicon: The Acquisition of Verb Classes Relevant to Causativization in English by Speakers of Hindi-Urdu and Vietnamese. Ph.D. thesis, University of Toronto, Toronto, Canada.
Nancy Ide. 2000. Cross-lingual sense determination: Can it work? Computers and the Humanities, 34:223-234.
Shunji Inagaki. 1997. Japanese and Chinese learners' acquisition of the narrow-range rules for the dative alternation in English. Language Learning, 47(4):637-669.
Alan Juffs. 2000. An overview of the second language acquisition of links between verb semantics and morpho-syntax. In John Archibald, editor, Second Language Acquisition and Linguistic Theory, pages 170-179. Blackwell Publishers.
Maria Lapata and Chris Brew. 1999. Using subcategorization to resolve verb class ambiguity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 266-274, College Park, MD.
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
Diana McCarthy and Anna-Leena Korhonen. 1998. Detecting verbal participation in diathesis alternations. In Proceedings of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics (COLING-ACL 1998), pages 1493-1495, Montreal, Canada.
I. Dan Melamed and Mitchell P. Marcus. 1998. Automatic construction of Chinese-English translation lexicons. Technical Report 98-28, University of Pennsylvania, Philadelphia, PA.
I. Dan Melamed. 1997. A portable algorithm for mapping bitext correspondence. In Proceedings of the 35th Conference of the Association for Computational Linguistics, Madrid, Spain.
Paola Merlo and Suzanne Stevenson. 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics. To appear.
Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, Philadelphia, PA.
Philip Resnik and David Yarowsky. 2000. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113-133.
Sabine Schulte im Walde. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of COLING 2000, pages 747-753, Saarbrücken, Germany.
Eric V. Siegel and Kathleen R. McKeown. 2000. Learning methods to combine linguistic indicators: Improving aspectual classification and revealing linguistic insights. Computational Linguistics, 26(4):595-628.
Vivian Tsang. 2001. Second language information transfer in automatic verb classification: A preliminary investigation. Master's thesis, University of Toronto, Toronto, Canada.
A Survey of Chinese Semantic Role Labeling

陈菜芳, School of Liberal Arts, Nanjing Normal University

Abstract: Semantic role labeling (SRL) is one way of realizing shallow semantic analysis. It has been applied successfully in question answering, machine translation, and information extraction, and it is currently one of the more active research directions in natural language understanding. This paper introduces the corpus resources for Chinese SRL, describes the current state of Chinese SRL research, and looks ahead to future work.
Keywords: shallow semantic analysis; semantic role labeling resources; semantic role labeling

0 Introduction
Automatic semantic role labeling assigns the semantic roles governed by the predicates of a sentence; it is a method of shallow semantic analysis of the sentence. SRL is widely applied in building large-scale semantic knowledge bases, in question answering, machine translation, and information extraction, so further research on it matters for the overall development of natural language processing technology. This overview of Chinese SRL covers three topics: first, the relevant corpus resources; second, the current state of research; and finally, an outlook on future work.

1 Corpus resources for Chinese semantic role labeling
Semantic role labeling depends on annotated corpus resources. The best-known English SRL resources include FrameNet, PropBank, and NomBank. Chinese SRL resources were largely developed from, or built with reference to, these English resources.

The Chinese Proposition Bank (CPB) is broadly similar to the English PropBank. CPB defines some twenty role labels and annotates only the core verbs of each sentence. A verb takes at most six core arguments, labeled Arg0 through Arg5; all remaining roles are adjuncts, marked with the prefix ArgM followed by a suffix indicating the adjunct's semantic type. CPB annotates nearly every verb in the Penn Chinese Treebank together with its semantic roles, and most Chinese SRL research in China is based on this resource. The Chinese NomBank extends the annotation frameworks of the English Proposition Bank and NomBank to cover Chinese nominal predicates.
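The CPB scheme just described, with core arguments Arg0 to Arg5 plus ArgM adjuncts, can be sketched as a small validation routine. The names below (check_proposition, CORE_ROLES) are illustrative, not part of CPB's actual file format; the example sentence 警方 正在 调查 事故原因 is the one used in this compendium's SRL illustration.

```python
import re

# Valid CPB-style role labels: core arguments Arg0..Arg5 and
# adjuncts ArgM-* (the suffix marks the adjunct's semantic type).
CORE_ROLES = {f"Arg{i}" for i in range(6)}
ROLE_PATTERN = re.compile(r"^(Arg[0-5]|ArgM-[A-Z]+)$")

def check_proposition(predicate, roles):
    """Validate a predicate's role labels against the CPB scheme.

    `roles` maps a role label to the text span that fills it.
    Returns the set of core arguments used; raises on an unknown label.
    """
    for label in roles:
        if not ROLE_PATTERN.match(label):
            raise ValueError(f"unknown role label: {label}")
    return {r for r in roles if r in CORE_ROLES}

# 警方 (agent) 正在 (temporal adjunct) 调查 (predicate) 事故原因 (patient)
prop = {
    "Arg0": "警方",
    "ArgM-TMP": "正在",
    "Arg1": "事故原因",
}
core = check_proposition("调查", prop)
print(sorted(core))  # ['Arg0', 'Arg1']
```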
The Peking University (PKU) Tag Set

Code: name (mnemonic)
Ag: adjectival morpheme (形语素). Adjective code a, placed before the morpheme code g as A.
a: adjective (形容词). First letter of English "adjective".
ad: adverbial adjective (副形词), an adjective used directly as an adverbial. Adjective code a plus adverb code d.
an: nominal adjective (名形词), an adjective with noun functions. Adjective code a plus noun code n.
b: distinguishing word (区别词). From the initial of bié (别).
c: conjunction (连词). First letter of English "conjunction".
Dg: adverbial morpheme (副语素). Adverb code d, placed before the morpheme code g as D.
d: adverb (副词). Second letter of "adverb", since its first letter is taken by adjectives.
e: interjection (叹词). First letter of English "exclamation".
f: localizer (方位词). From the initial of fāng (方).
g: morpheme (语素). Most morphemes can serve as the root of a compound word; from the initial of gēn (根, "root").
h: prefix (前接成分). First letter of English "head".
i: idiom (成语). First letter of English "idiom".
j: abbreviation (简称略语). From the initial of jiǎn (简).
k: suffix (后接成分).
l: fixed expression (习用语). Not yet a full idiom, hence somewhat "temporary"; from the initial of lín (临).
m: numeral (数词). Third letter of English "numeral", since n and u are taken.
Ng: nominal morpheme (名语素). Noun code n, placed before the morpheme code g as N.
n: noun (名词). First letter of English "noun".
nr: person name (人名). Noun code n plus the initial of rén (人).
ns: place name (地名). Noun code n plus the place-word code s.
nt: organization (机构团体). Noun code n plus t, the initial of tuán (团).
nz: other proper noun (其他专名). Noun code n plus z, the first letter of the initial of zhuān (专).
o: onomatopoeia (拟声词). First letter of English "onomatopoeia".
p: preposition (介词). First letter of English "preposition".
q: measure word (量词). First letter of English "quantity".
r: pronoun (代词). Second letter of English "pronoun", since p is taken by prepositions.
s: place word (处所词). First letter of English "space".
Tg: temporal morpheme (时语素). Time-word code t, placed before the morpheme code g as T.
t: time word (时间词). First letter of English "time".
u: particle (助词). Second letter of English "auxiliary", since a is taken by adjectives.
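The codes above lend themselves to a simple lookup table. The sketch below is a hypothetical helper (the dict covers only a subset of the listed tags) for glossing text annotated in the common word/tag slash format:

```python
# A subset of the PKU codes listed above (code -> English name).
PKU_TAGS = {
    "a": "adjective", "c": "conjunction", "d": "adverb",
    "f": "localizer", "m": "numeral", "n": "noun",
    "nr": "person name", "ns": "place name", "nt": "organization",
    "p": "preposition", "q": "measure word", "r": "pronoun",
    "s": "place word", "t": "time word", "u": "particle",
}

def gloss(tagged_sentence):
    """Gloss a sentence in the 'word/tag word/tag ...' format."""
    pairs = []
    for token in tagged_sentence.split():
        # rpartition keeps any '/' inside the word itself intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, PKU_TAGS.get(tag, "unknown")))
    return pairs

print(gloss("中国/ns 人民/n 银行/n"))
# [('中国', 'place name'), ('人民', 'noun'), ('银行', 'noun')]
```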
On Construction of a Chinese Corpus Based on Semantic Dependency Relations

Journal of Chinese Information Processing, Vol. 17, No. 1 (2003). Article ID: 1003-0077(2003)01-0046-08.
On Construction of a Chinese Corpus Based on Semantic Dependency Relations
YOU Fang (1), LI Juanzi (2), WANG Zuoying (1)
(1) Dept. of Electronic Engineering, Tsinghua University, Beijing 100084; (2) Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084

Abstract: Corpora are important resources for knowledge acquisition in the field of natural language processing. For the purpose of sentence understanding, we are constructing a large-scale Chinese corpus based on semantic dependency relations. This paper introduces the tagging formalism we adopt, the tagging set we choose, the tagging tool we develop, and the method we use to guarantee good tagging consistency. The corpus under discussion is at a scale of 1 million words. Each sentence in the corpus, which already carries word-sense annotations, is further tagged with its semantic structure using 70 semantic and syntactic dependency relations. The highlight of this corpus is its ability to effectively describe the various relations between Chinese words in real text, profiting from the use of HowNet (《知网》) as a reference and from its combination with concrete language use. The construction of this corpus will provide stronger knowledge-base support for sentence understanding, content-based information retrieval, and other applications.
Keywords: computer application; Chinese information processing; corpus; semantic dependency relations; HowNet; event roles and features
CLC number: TP391. Document code: A.

1. Introduction
The greatest obstacle facing natural language processing is the scarcity of lexical, syntactic, and semantic knowledge; building large-scale corpora annotated with this kind of information is an effective way to break this bottleneck.
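A corpus of this kind records, for each word, its head and the semantic relation between them. The following sketch shows one minimal way to represent and sanity-check such annotations; the relation names are illustrative and are not the paper's actual 70-relation inventory.

```python
# A sentence as (word, head_index, relation) triples; head_index -1
# marks the root. Relation names here are illustrative only.
RELATIONS = {"agent", "patient", "time", "root"}

sentence = [
    ("警方", 2, "agent"),
    ("正在", 2, "time"),
    ("调查", -1, "root"),
    ("事故原因", 2, "patient"),
]

def validate(triples, relations):
    """Check that every relation is known, every head index is in
    range, and exactly one word is the root."""
    n = len(triples)
    for word, head, rel in triples:
        assert rel in relations, f"unknown relation: {rel}"
        assert -1 <= head < n, f"bad head index for {word}"
    roots = [w for w, h, _ in triples if h == -1]
    assert len(roots) == 1, "exactly one root expected"
    return True

print(validate(sentence, RELATIONS))  # True
```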
Treebank-Based Dependency Parsing of Chinese

Authors: LIU Hai-Tao, ZHAO Yi-Yi (Institute of Applied Linguistics, Communication University of China, Beijing 100024)
Journal: Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2009, 22(1)

A related conference paper by Liu Haitao, "Treebank- and Machine-Learning-Based Dependency Parsing of Chinese" (2007), observes that treebank-based, machine-learning approaches to language processing are a research focus in natural language processing, and explores the possibility of using linguistic means to improve parsing accuracy.
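Work on dependency treebanks often measures dependency distance, the linear distance between a dependent and its head. A minimal sketch, assuming 0-based word positions (the function name is illustrative):

```python
def mean_dependency_distance(arcs):
    """Mean absolute distance between dependents and their heads.

    `arcs` is a list of (dependent_position, head_position) pairs,
    one per non-root word.
    """
    if not arcs:
        return 0.0
    return sum(abs(d - h) for d, h in arcs) / len(arcs)

# 警方(0) 正在(1) 调查(2) 事故原因(3), with 调查 as root:
# dependents at positions 0, 1, 3 all attach to position 2,
# giving distances 2, 1, 1.
arcs = [(0, 2), (1, 2), (3, 2)]
print(mean_dependency_distance(arcs))  # 1.3333333333333333
```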
The Chinese Penn Treebank Tag Set
1 Part-Of-Speech tags: 33 tags
Tag: description (examples)
AD: adverb (副词)
AS: aspect marker (体标记), e.g. 了, 在, 着, 过
BA: 把 (and 将) in the ba-construction
CC: coordinating conjunction, e.g. 和
CD: cardinal number, e.g. 一百
CS: subordinating conjunction, e.g. 若, 如果
DEC: 的 in a relative clause
DEG: associative 的
DER: 得 in V-de constructions and V-de-R
DEV: 地 before a VP
DT: determiner, e.g. 这
ETC: tag for 等, 等等 in coordination phrases
FW: foreign word, e.g. ISO
IJ: interjection (感叹词)
JJ: noun modifier other than a noun
LB: 被 (and 给) in the long bei-construction
LC: localizer (方位词), e.g. 里
M: measure word, including classifiers (量词), e.g. 个
MSP: other particle, e.g. 所
NN: common noun (普通名词)
NR: proper noun (专有名词)
NT: temporal noun (时间名词)
OD: ordinal number, e.g. 第一
ON: onomatopoeia (拟声词), e.g. 哈哈
P: preposition, excluding 把 and 被
PN: pronoun (代词)
PU: punctuation (标点)
SB: 被 (and 给) in the short bei-construction
SP: sentence-final particle (句尾小品词), e.g. 吗
VA: predicative adjective (谓语形容词), e.g. 红
VC: copula (系动词), e.g. 是
VE: 有 as the main verb
VV: other verb
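Given the inventory above, a tagged corpus can be checked mechanically for out-of-tagset labels. A small hypothetical helper:

```python
# The 33 CTB part-of-speech tags listed above.
CTB_POS = {
    "AD", "AS", "BA", "CC", "CD", "CS", "DEC", "DEG", "DER", "DEV",
    "DT", "ETC", "FW", "IJ", "JJ", "LB", "LC", "M", "MSP", "NN",
    "NR", "NT", "OD", "ON", "P", "PN", "PU", "SB", "SP", "VA",
    "VC", "VE", "VV",
}

def unknown_tags(tagged_pairs):
    """Return the tags in (word, tag) pairs that are not CTB POS tags."""
    return sorted({tag for _, tag in tagged_pairs if tag not in CTB_POS})

pairs = [("警方", "NN"), ("正在", "AD"), ("调查", "VV"), ("事故", "XX")]
print(unknown_tags(pairs))  # ['XX']
```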
2 Syntactic tags: 23 tags

2.1 Phrase tags: 17 tags
ADJP: adjective phrase
ADVP: adverbial phrase headed by AD (adverb)
CLP: classifier phrase
CP: clause headed by C (complementizer)
DNP: phrase formed by XP + DEG (的)
DP: determiner phrase
DVP: phrase formed by XP + DEV (地)
FRAG: fragment
IP: simple clause headed by I (INFL)
LCP: phrase formed by XP + LC (localizer)
LST: list marker, e.g. "--"
NP: noun phrase
PP: prepositional phrase
PRN: parenthetical
QP: quantifier phrase
UCP: unidentical coordination phrase
VP: verb phrase
2.2 Verb compound tags: 6 tags
VCD: coordinated verb compound, e.g. (VCD (VV 观光) (VV 游览))
VCP: verb compound formed by VV + VC, e.g. (VCP (VV 估计) (VC 为))
VNV: verb compound in the A-not-A or A-one-A form, e.g. (VNV (VV 看) (CD 一) (VV 看)), (VNV (VE 有) (AD 没) (VE 有))
VPT: potential form V-de-R or V-bu-R, e.g. (VPT (VV 卖) (AD 不) (VV 完)), (VPT (VV 出) (DER 得) (VV 起))
VRD: verb resultative compound, e.g. (VRD (VV 反映) (VV 出)), (VRD (VV 卖) (VV 完))
VSB: verb compound formed by a modifier plus a head, e.g. (VSB (VV 举债) (VV 扩张))
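The bracketed examples above use the Penn-style parenthesized format. A minimal recursive-descent parser for such strings (a sketch; real treebank files need more robust handling of punctuation tokens and escapes):

```python
import re

def parse(s):
    """Parse a Penn-style bracketed tree like '(VRD (VV 卖) (VV 完))'
    into nested [label, child, ...] lists; leaves are plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return [label] + children

    return node()

print(parse("(VRD (VV 卖) (VV 完))"))
# ['VRD', ['VV', '卖'], ['VV', '完']]
```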
3 Functional tags: 26 tags
ADV: adverbial
APP: appositive
BNF: beneficiary
CND: condition
DIR: direction
EXT: extent
FOC: focus
HLN: headline
IJ: interjective
IMP: imperative
IO: indirect object
LGS: logical subject
LOC: locative
MNR: manner
OBJ: direct object
PN: proper noun
PRD: predicate
PRP: purpose or reason
Q: question
SBJ: subject
SHORT: short (abbreviated) form
TMP: temporal
TPC: topic
TTL: title
WH: wh-phrase
VOC: vocative (a special form of a noun, pronoun, or adjective used when addressing or invoking a person or thing)
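In treebank files these functional tags are hyphen-appended to phrase labels, as in NP-SBJ or NP-TMP. A small helper to split them (a sketch; it also drops numeric coindexation suffixes such as the 1 in NP-SBJ-1):

```python
def split_label(label):
    """Split a node label like 'NP-SBJ' into (phrase_tag, functional_tags),
    discarding purely numeric coindexation parts."""
    parts = label.split("-")
    return parts[0], [p for p in parts[1:] if not p.isdigit()]

print(split_label("NP-SBJ"))     # ('NP', ['SBJ'])
print(split_label("NP-PN-TMP"))  # ('NP', ['PN', 'TMP'])
print(split_label("VP"))         # ('VP', [])
```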
4 Empty categories (null elements): 7 tags
*OP*: operator, used in relative constructions
*pro*: dropped argument
*PRO*: used in control structures
*RNR*: right node raising
*T*: trace of A'-movement (e.g. topicalization)
*: trace of A-movement
*?*: other unknown empty categories
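When extracting surface text from CTB trees, the empty-category tokens above must be removed, since they have no overt form. A sketch (assumes the tokens have already been pulled out of the tree; traces may carry indices such as *T*-1):

```python
import re

# Matches the empty-category tokens listed above, including bare '*'
# for A-movement traces and optional coindexation like '*T*-1'.
EMPTY_RE = re.compile(r"^\*(OP|pro|PRO|RNR|T|\?)?\*?(-\d+)?$")

def surface_tokens(tokens):
    """Drop empty-category placeholders, keeping only overt words."""
    return [t for t in tokens if not EMPTY_RE.match(t)]

print(surface_tokens(["*PRO*", "调查", "*T*-1", "原因"]))
# ['调查', '原因']
```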