Using Discourse Analysis to Improve Text Categorization in MEDLINE

Using Discourse Analysis to Improve Text Categorization in MEDLINE
Using Discourse Analysis to Improve Text Categorization in MEDLINE

Using Discourse Analysis to Improve Text Categorization in MEDLINE Patrick Ruch a, Antoine Geissbühler a, Julien Gobeill, Frederic Lisacek c, Imad Tbahriti ab, Anne-Lise Veuthey b, Alan R. Aronson d

a Medical Informatics Service, University and Hospital of Geneva, Geneva, Switzerland

b Swiss-Prot, Swiss Institute of Bioinformatics, Geneva, Switzerland

c Protein Informatics Group, Swiss Institute of Bioinformatics, Geneva, Switzerland

d Lister Hill Center, National Library of Medicine, Bethesda, MD, USA

Abstract

PROBLEM: Automatic keyword assignment has been largely studied in medical informatics in the context of the MEDLINE database, both for helping search in MEDLINE and in order to provide an indicative “gist” of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been proposed using different methods or combination of methods, including machine learning (na?ve Bayes, neural networks…), linguistically-motivated methods (syntactic pars-ing, semantic tagging, or information retrieval. METHODS: In the present study, we propose to evaluate the impact of the argumentative structures of scientific articles to improve the categorization effectiveness of a categorizer, which combines linguistically-motivated and information retrieval methods. Our argumentative categorizer, which uses representation levels inherited from the field of discourse analysis, is able to classify sentences of an abstract in four classes: PURPOSE; METHODS; RESULTS and CONCLUSION. For the evalua-tion, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the result of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer. RESULTS: The most effective combination (+2%, p<0.003) strongly over-weights the METHODS section and moderately the RESULTS and CONCLUSION section. CONCLUSION: Although mod-est, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries. Keywords: Text Mining; Abstracting and Indexing; Informa-tion Storage and Retrieval; Natural Language Processing; Libraries, Medical; Artificial Intelligence.

Introduction

Systems for text mining are becoming increasingly important in biomedicine because of the exponential growth of written information in this domain. The mass of scientific literature needs to be filtered and categorized to provide for the most efficient use of the data. The problem of accessing this in-creasing volume of data demands the development of systems that can extract pertinent information from unstructured texts, hence the importance of key words extraction, as well as key sentence extraction. While the former task has been largely addressed in text categorization studies [1], the latter has been more rarely studied. In this report, we propose to merge the two ideas to improve the first task. We intend to use sentence extraction and sentence ranking methods to improve text cate-gorization in MEDLINE based on Medical Subject Headings (MeSH). Selecting a set of appropriate sentences likely to improve MeSH assignment is a complex task because, more than key words, the importance of a sentence is dependent on the point of view of every reader. However, as for key words, which can be more or less comprehensively listed in a con-trolled vocabulary, we believe it is possible to propose task-specific criteria, which can help to define such sentences. Our model is based on the implicit discourse-level information found in scientific reports; in particular, our model uses the argumentative sections as they often appear – sometimes ex-plicitly but usually implicitly – in abstracts. The argumenta-tive categories we are experimenting to rank the sentences originate from professional guidelines (ANSI Z39.14-1979). Indeed, articles in experimental sciences tend to respect strict argumentative patterns with at least four sections: purpose-methods-results-conclusion. These four moves –leaving aside minor variation of labels- are reported to be very stable across different scientific genres and experimental sciences in gen-eral (chemistry, anthropology, computer sciences, linguis-tics…) [4]. They are also confirmed in biomedical abstracts and articles [5][6][29]. Following recent trends, which show that argumentative criteria can be useful to improve various text mining and information retrieval applications such as related articles search [7], blind feedback [27] for ad hoc search, or feature and passage selection for automatic index-ing in full-text articles [30], our objective is to evaluate the benefit of using discourse analysis representation levels for keyword assignment in MEDLINE.

Background: Argumentative Categorization

In this section, we present a set of background methods and already reported results, which are useful to understand the rationales supporting our approach. Our argumentative cate-gorizer is formally defined as a mono class classifier: for each piece of text the system will have to decide whether it is a PURPOSE, or a METHOD, or a RESULT, or a CONCLUSION. Sentences are natural candidate segments for such a classification [8] because they are more self-contained than phrases; however anaphoric phenomena may demand larger segments.

Summarization has a related task

Modern summarization systems use annotated corpora in or-der to acquire appropriate knowledge; based on textual fea-tures summarization tools are able to conduct general summa-rization tasks. Thus, Kupiec et al. [10] report that in abstracts produced by professionals, 80% of sentences are also found in the source document. In such systems, the complex abstract-ing task is recast into a more modest sentence selection prob-lem. To do so, experts identify a set of relevant sentences from large corpora; these sentences are then used to train the learning system. Complementary to machine learning ap-proaches, Teufel et al. in [23], who design a task very similar to our one, combine a set of manually crafted triggered ex-pressions, such as finally, we have shown that, we conclude that… to classify sentences into seven argumentative classes. Basic classifiers, features selection and weighting

Choosing a priori an appropriate classifier for a given task is a fairly difficult task; therefore, empirical comparisons are often necessary. Among state-of-the-art classifiers for text categorization, such as k-NN [11], SVM [12][28], neural net-works [13], and rule-induction systems [14], Bayesian classi-fiers [15] show a linear complexity, while most top perform-ing algorithms have quadratic complexity; therefore they are often more adapted for rapid application development and exploratory studies [16] [24]. The basic features in text cate-gorization are usually word-based. Possible variants are stems, which often implies stop-word removal, and/or sequences of stems, such as bi- or trigrams of stems. Examples of stem-ming algorithms are provided in Table 1.

Table 1: String Normalization Methods

Token Lovins

Porter

S-stemmer

genetic genetically genetics gene

genes homogeneous plaid

play genet

genet

genet

gene

gene

gene

plai

plai

genet

genetically

genet

gene

gene

homogen

plai

plai

genetic

genetically

genetic

gene

gene

homogeneous

plai

plai

Our preliminary studies confirmed that elaborate string nor-malization and stop-word removal strategies such as Porter and Lovins did not outperform simpler approaches, such as plural normalization, which process morphological variations of plural forms (-s, -ies). This strategy appears sufficient to help the classifiers to generalize without removing interesting features, such as verb tense (as suggested in [4]), which is usually removed by stemming (cf. Table 1).

The last step concerns feature weighting. Indeed naive Bayes classifiers combine the log-likelihood of each feature in order to select the most probable category in the category space; however the real frequency observed in training corpora can follow different refinement and smoothing processes, known as feature weighting [11].

PURPOSE: While gemcitabine (GEM) is widely accepted for the treatment of advanced pancreatic cancer, capecitabine (CAP) has shown single agent activity and promising effi-cacy in combination with GEM. […] METHODS: Patients had advanced pancreatic adenocarcinoma, no prior systemic chemotherapy other than that given concurrently with radia-tion therapy, at lease one measurable disease, and adequate organ functions. […] RESULTS: The objective RR among 45 patients was 40.0% (95% CI; 25.1-54.9), including 1CR (2.2%). The median TTP and OS were 5.4 months (95% CI; 1.8-9.0) and 10.4 months (95% CI; 6.2-14.5), respectively. […] CONCLUSIONS: The combination of GEM with dose escalated 14-day CAP is well tolerated and offers encourag-ing activity in the treatment of advanced pancreatic cancer. […]

Figure 1: Partial example of an explicitly structured abstracts in MEDLINE.

We tested 3 weighting methods: tf-idf (term frequency-inverse document frequency), chi-2, and df-thresholding (only fea-tures appearing frequenty in each class are selected). Our con-clusion is that chi-2 and df-thresholding perform similarly, while tf-idf weighting should be avoided. Indeed, tf-idf weighting is appropriate for weighting content-bearing fea-tures, while argumentative content is supported by functional words. Three types of features are linearly combined to get a final probability ranking per class: stems, stem bigrams and stem trigrams. As in [15], length normalization of sentences as been applied in order to overcome biases introduced by too long or too short sentences.

Training and Test Data

In text classification tasks, two types of strategies are compet-ing: expert-driven and data-driven approaches. While the for-mer, which rely on a domain expert, are often time and la-bour-intensive, the latter are directly dependent on the avail-ability of large training sets. Fortunately, training data for our task can be acquired in a cheap way. Most abstracts in MEDLINE are unstructured (i.e. provided without explicit argumentative markers, such as METHODS, PURPOSES…); but fortunately, a significant fraction of these abstracts contain explicit argumentative markers. Using PubMed and its Boo-lean query interface, we collected a set of 12000 MEDLINE citations containing strings such as “PURPOSE:”, “METHODS:”, “RESULTS”, “CONCLUSION:”. This fully automatic data collection process introduces some argumenta-tive noise since some of the explicit markers gather additional argumentative content. Thus, explicit markers such as

“BACKGROUND AND PURPOSES:” were also collected as pure “PURPOSES:” markers by this simple method. The col-lection was then split into two sets:

? set A (10800 abstracts) was used for training and

validation purposed,

?

set B (1200) was used for the final assessment.

In addition to sets A and B, we also collected a smaller set (C) of marker-free abstracts (100 items). Then, two human anno-tators were asked to manually annotate this set, using the four selected argumentative classes. In contrast with the automati-cally acquired sets, here we do not assume that argumentative segments and sentences are overlapping items. As shown in Figure 2, some sentences in set C receive more than one label, because they may express two different argumentative moves in the same sentence. In such cases, we do not attempt to iden-tify segment boundaries (as explored in [18] and [19]) and instead ask the system to provide any of the relevant classes. The interannotator agreement on the C set for argumentative segments is 0.81, when measured by kappa statistics, which indicates that agreement is good.

As mentioned above, our goal is to extract conclusion sen-tences, but because the information is available in our training data, this binary task has been modified into a four-class prob-lem: {PURPOSE, METHODS, RESULTS, CONCLUSION}. We expect that working with more classes will help the sys-tem to discriminate between classes that have been reported to be lexically similar [4][21], such as PURPOSE and CONCLUSION. In the data sets used for the evaluation (B and C), explicit argumentative markers have been removed.

Skin surface proteolytic activity in the living

animal was detected by a sensi-tive, non-invasive methodol. Developed in our labora-tory. A non-leaky well was con-structed on the shaved back of an anesthetized guinea pig. The well contained the reaction mixture including the substrate 125I-S-carboxymethylated insulin B-chain

(ICMI). The prote-olytic activity was shown to be time-dependent. The activity was strongly inhibited by pepstatin A, indi-cating the involvement of aspartic proteinase(s) such as

cathepsin D and/or

E. Pretreatment of the skin with propyl-ene glycol blocked the proteolytic ctiveity. The present study demonstrates the presence of

proteolytic activity located on skin surface using a unique, non-invasive method for in situ pro-teinase detn. In the living animal. Figure 2: Example of an automatically structured abstracts. Four argumentative classes are annotated with XML tags; Gene and Protein Names (GPN and ASP_GPN), are also an-notated.

Combining positions of segments

Optionally, we also investigated the sentence position’s im-pact on the classification effectiveness through assigning a relative position to each sentence. Thus, if there were ten sen-tences in an abstract: the first sentence has a relative position

of 0.1, while the sentence in position 5 receives a relative po-sition of 0.5, and the last sentence has a relative position of 1. The following distributional heuristics are encoded in a distri-butional model: 1) if a sentence has a relative score strictly inferior to 0.4 and is classified as CONCLUSION , then its class becomes PURPOSE ; 2) if a sentence has a relative score strictly superior to 0.6 and is classified as PURPOSE , then its class is rewritten as CONCLUSION .

Categorization effectiveness

Table 2 indicates the categorization effectiveness of our ar-gumentative categorizer. In Table 2, we evaluate the effect of positional information on the categorizer. We also evaluate the performance of the system on explicitly and non-explicitly structured abstracts. Finally, with an F-score (i.e the harmonic mean, with recall and precision having the same importance; cf [22]) above 85%, recall and precision measures are com-petitive with the trigger-based approach proposed by Teufel et al [23] (F-score ~ 68%), and the SVM learner used in McKnight and Srinivasan [25] (F-score ~ 80%). While recall exhibits excellent levels of performance, precision could still be improved. Thus, expected conclusion segments are well classified, but we observe that some non-conclusion sentences are also classified as conclusion (false positives).

Without sentence positions

PURP METH RESU CONC PURP 80.65% 0% 3.23% 16% METH 8% 78% 8% 6% RESU 18.58% 5.31% 52.21% 23.89% CONC 18.18% 0% 2.27% 79.55%

With sentence positions

PURP METH RESU CONC PURP 93.55% 0% 3.23% 3% METH 8% 78% 8% 6%

RESU 12.43% 5.31% 74.25% 13.01%

CONC 2.27% 0% 2.27% 95.45%

Table 2: Confusion matrices expressing the classification ef-fectiveness of the argumentative categorizer, with and without using positional information. Materials and Methods To conduct our key word assignment experiments and evalua-tions, we used the OHSUGEN collection [25], which contains

more than 300,000 MEDLINE records. From these records,

we randomly selected 500 citations with abstracts and key-words to tune our system and 1000 citations with abstracts and keywords to provide the final results. All experiments were conducted with the MeSH 2000 version, which roughly corresponds to the time period covered by the collection. Our baseline system, which was originally developed for the Bio-Creative challenge 2003 to perform text categorization tasks with Gene Ontology categories, combine a vector space ranker and a pattern matcher. Both stems (Porter) and linguis-tically-motivated features were used by the system [35]. The system achieved competitive results in the context of the Bio-

Cretative challenge; cf. [34] for a comparative presentation and [35] for a comprehensive presentation and evaluation.

In our MeSH categorizer, the ranking was based on the cate-gorization status value returned by the system. Thus, the best candidate categories (i.e. the relevance estimate) obtained the highest score, which directly expresses a similarity between a category and the input abstract. To overweight a given argu-mentative section, we simply modified the term frequency of the feature in the abstract. Thus, if we wanted to emphasize features appearing in the METHODS section, then every fea-ture in a sentence classified as METHODS by the argumenta-tive categorizer received a boosting factor. The fine-tuning of the optimal model, which includes the calculus of the optimal boosting factor, was based on direct search. We worked with integer values, ranging from 1 to 7 and we strove to maximize the mean average precision of our system, which is the best metric to express the full ordering skill of the system. Results

Figure 3 provides the expected list of Medical Subject Head-ings for the abstract in Figure 1.

Assigned MeSH:

Adult; Cystic Fibrosis/genetics; DNA/analysis*; Genotype; HLA-DQ Antigens/genetics; Humans; Research Support, Non-U.S. Gov't; Reverse Transcriptase Polymerase Chain Reaction; Sequence Analysis, DNA/methods*; Spectrum Analysis, Raman

Top-12 categories proposed by the categorizer:

12. dna mutational analysis

11. dna

10. genetics

9. fibrosis [not in NP index]

8. alleles

7. sequence analysis, dna

6. genotype

5. mutation

4. fluorescence

3. oligonucleotides

2. polymerase chain reaction

1. cystic fibrosis

Figure 3: Expected and predicted Medical Subject Headings for abstracts in Figure 1 (PMID: 12404725). Major headings are listed with a star. We do not separate between major and minor headings. Qualifiers are ignored in our benchmark.

We observe that several headings, which relates to age, or population-related groups (Adult), and to grants (Research Support, Non-U.S. Gov't) cannot be inferred from the abstract; therefore, as is well known, both the average precision and the recall of MeSH categorizers are generally low. In contrast, some categories, in particular major headings, which are more important for annotators, can be relevantly inferred from ab-stracts. Table 3 provides results of the final evaluation. In par-ticular, we observe that the best combination emphasizes the PURPOSE section (x5 times) and the METHODS and RESULTS sections (each with a multiplicative factor of 2). The overall resulting improvement (+2%) is statistically sig-nificant, although quite modest (p<0.003) [32]. We also see that another simpler yet quite effective combination can be obtained by boosting just the PURPOSE section (x3 times). It is also interesting to observe that the best combination regard-ing the mean average precision (MAP), i.e. 0.217, is not the best one for precision at high ranks, as expressed by the Preci-sion at recall = 0. In contrast, the best precision at high ranks (0.93) is achieved by trading recall for precision. Thus, in that setting, only 3064 relevant categories are proposed while 3068 were proposed with the baseline system.

Table 3: Results of the argumentative boosting on the effec-tiveness of the MeSH categorizer (C CONCLUSION, P=PURPOSE; M=METHODS, R=RESULTS, MAP= Mean average precision).

Parameters

Kc=C,P,M,R

Relevant

Retrieved

Prec. at Rec.=0 MAP

Kc

=

1,5,2,2 3068 0.927 (+0.8%) 0.217 (+2%)

Kc=1,3,1,1 3064 0.93 (+1.1%) 0.216 (+1.4%)

Kc=1,1,1,1

(Baseline)

3068 0.92 (100%) 0.213 (100%) Conclusion

Our results (precision at high ranks ~ 93%) suggest that ar-gumentative contents as available in abstracts stored in digital libraries are helpful for text categorization tasks, such as auto-matic assignment of Medical Subject Headings (MeSH). The reported improvement is statistically significant. Although modest (+2%), the improvement confirms that discourse analysis methods are useful for a growing number of text mining applications. Indeed, while attempts to apply lin-guistically-motivated approaches based on syntactic tools (part-of-speech tagging, shallow or deep parsing) to informa-tion retrieval and text categorization were rather inconclusive, methods inherited from discourse analysis could provide a more scalable and dependable improvement. Finally, it would be interesting to evaluate the benefit of argumentative features using more elaborate approaches such as those working with novelty detection [33], full-text articles [30], or with more advanced learners [31].

Acknowledgements

The first author has been supported by a visiting faculty grant at the National Library of Medicine of the National Institute of Health (Bethesda, MD). The Oak Ridge Institute for Sci-ence and Education (ORISE), managed and operated by Oak Ridge Associated Universities (ORAU), provided the grant through an interagency agreement with the Department of Energy. The author would like to thank the LHC/NCBI team, in particular Anantha Bangalore, Olivier Bodenreider, May Cheh, Dina Demner, Susan Humphrey, Cliff Gay, Will Roger, Larry Smith, Lorry Tanabe, and John Wilbur. The study was also funded by the EAGL project (SNSF 3252B0-.105755).

References

[1] F Sebastiani, Machine learning in automated text catego-

rization. ACM Computing Surveys 34(1): 1-47, 2002.

[2] A Aronson, O Bodenreider, H Chang, S Humphrey, J

Mork, S Nelson, T Rindflesch, W Wilbur, The NLM In-

dexing Initiative. Proc AMIA Symp. 2000;:17-21.

[3]P Ruch, R Baud, and A Geissbühler. Learning-free Text

Categorization, AIME 2003, LNCS 2780.

[4] C Orasan, Patterns in scientific abstracts, in Proceedings

of Corpus Linguistics Conference, Lancaster University, Lancaster, UK, pp. 433 - 443, 2001.

[5]J Swales, Genre Analysis: English in academic and re-

search settings, Cambridge University Press, 1990. [6] F Salanger-Meyer, Discoursal movements in medical

English abstracts and their linguistic exponents: a genre analysis study, INTERFACE: Journal of Applied Lin-

guistics 4(2), 1990, pp. 107 - 124

[7]Tbahriti, C Chichester, F Lisacek and P Ruch. Using

Argumentation to Retrieve Articles with Similar Cita-

tions: an Inquiry into Improving Related Articles Search in the MEDLINE Digital Library. Int J Med Inform.

2006 Jun;75(6):488-95

[8]Ruch, C Chichester, G Cohen, G Coray, F Ehrler, H

Ghorbel, H Müller, V Pallotta. Report on the TREC 2003 Experiment: Genomic Track, TREC 2003.

[9]J Kupiec, J Pedersen and F Chen: A Trainable Document

Summarizer, SIGIR 1995, p. 55-60.

[10] Y Yang, An evaluation of statistical approaches to text

categorization. Journal of Information Retrieval 1 (1999) p. 67-88.

[11]S Dumais, J Platt, D Sahami: Inductive learning algo-

rithms and representations for text categorization. In: CIKM, ACM (1998) p. 148-155.

[12] K Ming Adam Chai, Hai Leong Chieu, Hwee Tou Ng,

Bayesian online classifiers for text classification and fil-

tering, SIGIR 2002, p. 89-96.

[13] D Beeferman, A Berger, and J Lafferty, Statistical mod-

els for text segmentation. Machine Learning, (34):177-

210, 1999. Special Issue on Natural Language Learning

(C. Cardie and R. Mooney, eds).

[14] D Lewis and M Ringuette, M.: A comparison of two

learning algorithms for text categorization. In: SDAIR.

1994, p. 81-93

[15]P Domingos and M Pazzani, On the Optimality of the

Simple Bayesian Classifier under Zero-One Loss, Ma-

chine Learning, 29 (2-3), 1997, p. 103-130.

[16]Y Yang, J Pedersen, A Comparative Study on Feature

Selection in Text Categorization, ICML, 14th Interna-

tional Conference on Machine Learning, 1997, p. 114-

121.

[17] K Nigam, J Lafferty and A McCallum. Using maximum

entropy for text classification. In IJCAI Workshop on Machine Learning for Information Filtering, pages 61-

67, 1999.

[18] M Hearst, Multi-Paragraph Segmentation of Expository

Text. Proceedings of the 32nd Meeting of the Associa-

tion for Computational Linguistics, 1994, p. 79-94. [19]J Reynar and A Ratnaparkhi. A maximum entropy ap-

proach to identifying sentence boundaries. ANLP, pages 16-19, 1997. ACL

[20] S Teufel and M Moens, Argumentative classification of

extracted sentences as a first step towards flexible ab-

stracting. Mani, M. Maybury (eds), Advances in auto-

matic text summarization, MIT, 1999.

[21] C van Rijsbergen, Information Retrieval. Buttersworth,

1979.

[22]S Teufel and M Moens, Sentence Extraction and rhe-

torical classification for flexible abstracts, AAAI Spring Symposium on Intelligent Text summarization, 1998, p.

89-97.

[23] P Ruch, L Perret and J Savoy. Feature Combination for

Extraction of Gene Functions. ECIR 2005. Springer LNCS, 112-126.

[24]L McKnight and P Srinivasan. Categorization of Sen-

tence Types in Medical Abstracts. AMIA 2003.

[25]P Ruch, R Baud, P Bouillon, and G Robert. Minimal

Commitment and Full Lexical Disambiguation: Balanc-

ing Rules and Hidden Markov Models. CoNLL-2000, pages 111-115.

[26]W Hersh. Report on the TREC 2004 Genomics track.

SIGIR Forum (2005)

[27] P Ruch, I Tbahriti, J Gobeill, AR Aronson: Argumenta-

tive Feedback: A Linguistically-Motivated Term Expan-

sion for Information Retrieval. ACL 2006

[28]Y Mizuta and N Collier: Zone Identification in Biology

Articles as a Basis for Information Extraction, In Por-

ceedings of JNLPBA (NLPBA/BioNLP), 28-29 August, Geneva, Switzerland.

[29]F Couto, M Silva and P Coutinho, P. FIGO: Findings

GO Terms in Unstructured Text (2004). BioCreative Notebook Papers.

[30]C Gay, M Kayaalp, A Aronson. Semi-Automatic Index-

ing of Full Text Biomedical Articles. AMIA 2005 (2005).

[31]M Ruiz and P Srinivasan. Hierarchical text categoriza-

tion using neural networks. Information Retrieval, 5(1), pp. 87-118, 2002.

[32]J Zobel. How reliable are large-scale information re-

trieval experiments? ACM-SIGIR\/, pages 307—314, 1998.

[33]F Lisacek, C Chichester, A Kaplan, S Agnes. Discover-

ing paradigm shift patterns in biomedical abstracts: Ap-

plication to neurodegenerative diseases. SMBM 2005. [34]F Ehrler, A Jimeno Yepes, A Geissbühler and P Ruch.

Data-poor Categorization and Passage Retrieval for Gene Ontology Annotation in Swiss-Prot, BMC Bioin-

formatics, Special Issue on BioCreative: A Critical As-

sessment of Text Mining Methods in Molecular Biology,

vol. 6 (suppl. 1), 2005.

[35]P Ruch: Automatic assignment of biomedical categories:

toward a generic approach. Bioinformatics 22(6): 658-

664 (2006).

Address for correspondence

Patrick Ruch, University Hospitals of Geneva

24 Micheli du Crest, CH-1211 Geneva, Switzerland

@: patrick.ruch@sim.hcuge.ch , tel: +41 22 372 61 64

【免费下载】胡壮麟语言学名词解释总结

胡壮麟语言学名词解释总结 1.design feature: are features that define our human languages,such as arbitrariness,duality,creativity,displacement,cultural transmission,etc. 2.function: the use of language tocommunicate,to think ,https://www.360docs.net/doc/0c1400465.html,nguage functions inclucle imformative function,interpersonal function,performative function, emotive function,phatic communion,recreational function and metalingual function. 3.etic: a term in contrast with emic which originates from American linguist Pike’s distinction of phonetics and phonemics.Being etic mans making far too many, as well as behaviously inconsequential,differentiations,just as was ofter the case with phonetic vx.phonemic analysis in linguistics proper. 4.emic: a term in contrast with etic which originates from American linguist Pike’s distinction of phonetics and phonemics.An emic set of speech acts and events must be one that is validated as meaningful via final resource to the native members of a speech communith rather than via qppeal to the investigator’s ingenuith or intuition alone. 5.synchronic: a kind of description which takes a fixed instant(usually,but not necessarily,the present),as its point of observation.Most grammars are of this kind. 6.diachronic:study of a language is carried through the course of its history. 7.prescriptive: a kind of linguistic study in which things are prescribed how ought to be,https://www.360docs.net/doc/0c1400465.html,ying down rules for language use. 8.descriptive: a kind of linguistic study in which things are just described. 9.arbitrariness: one design feature of human language,which refers to the face that the forms of linguistic signs bear no natural relationship to their meaning. 10.duality: one design feature of human language,which refers to the property of having two levels of are composed of elements of the secondary.level and each of the two levels has its own principles of organization. 11.displacement: one design feature of human language,which means human language enable their users to symbolize objects,events and concepts which are not present c in time and space,at the moment of communication. 12.phatic communion: one function of human language,which refers to the social interaction of language. 13.metalanguage: certain kinds of linguistic signs or terms for the analysis and description of particular studies. 14.macrolinguistics: the interacting study between language and language-related disciplines such as psychology,sociology,ethnograph,science of law and artificial intelligence etc.Branches of macrolinguistics include psycholinguistics,sociolinguistics, anthropological linguistics,et https://www.360docs.net/doc/0c1400465.html,petence: language user’s underlying knowledge about the system of rules. 16.performance: the actual use of language in concrete situation. https://www.360docs.net/doc/0c1400465.html,ngue: the linguistic competence of the speaker. 18.parole: the actual phenomena or data of linguistics(utterances). 19.Articulatory phonetics: the study of production of speechsounds. 20.Coarticulation: a kind of phonetic process in which simultaneous or overlapping articulations are involved..Coarticulation can be further divided into anticipatory coarticulation and perseverative coarticulation. 21.Voicing: pronouncing a sound (usually a vowel or a voiced consonant) by vibrating the vocal cords. 22.Broad and narrow transcription: the use of a simple set of symbols in transcription is called broad transcription;while,the use of more specific symbols to show more phonetic detail is referred to as narrow transcription.

胡壮麟名词解释

胡壮麟《语言学教程》术语表 第一章 phonology音系学 grammar语法学 morphology形态学 syntax句法学 lexicology词汇学 general linguistics普通语言学theoretical linguistics理论语言学historical linguistic s历史语言学descriptive linguistics描写语言学empirical linguistics经验语言学dialectology方言学 anthropology人类学 stylistics文体学 signif ier能指 signif ied所指 morphs形素 morphotactics语素结构学/形态配列学 syntactic categori es句法范畴syntactic classes句法类别序列 sub-structure低层结构 super-structure上层结构 open syllable开音节 closed syllable闭音节 checked syllable成阻音节 rank 等级 level层次 ding-dong theory/nativistic theory本能论 sing-song theory唱歌说 yo-he-ho theory劳动喊声说 pooh-pooh theory感叹说 ta-ta theory模仿说 animal cry theory/bow-wow theory模声说 Prague school布拉格学派 Bilateral opposition双边对立Mutilateral opposition多边对立Proportional opposition部分对立Isolated opposition孤立对立 Private opposition表缺对立 Graded opposition渐次对立Equipollent opposition均等对立Neutralizable opposition可中立对立Constant opposition恒定对立Systemic-f unctional grammar系统功能语法 Meaning potential意义潜势Conversational implicature会话含义Deictics指示词 Presupposition预设 Speech acts言语行为 Discourse analysis话语分析Contetualism语境论 Phatic communion寒暄交谈Metalanguage原语言 Applied linguistic s应用语言学Nominalism唯名学派Psychosomatics身学 第二章trachea/windpipe气管 tip舌尖 blade舌叶/舌面 front舌前部 center舌中部 top舌顶 back舌后部 dorsum舌背 root舌跟 pharynx喉/咽腔 laryngeals喉音 laryngealization喉化音 vocal cords声带 vocal tract声腔 initiator启动部分 pulmonic airstream mechanism肺气流 机制 glottalic airstream mechanism喉气流 机制 velaric airstream mechanism腭气流机 制 Adam’s apple喉结 Voiceless sound清音 Voiceless consonant请辅音 Voiced sound浊音 Voiced consonant浊辅音 Glottal stop喉塞音 Breath state呼吸状态 Voice state带音状态 Whisper state耳语状态 Closed state封闭状态 Alveolar ride齿龈隆骨 Dorsum舌背 Ejective呼气音 Glottalised stop喉塞音 Impossive内爆破音 Click/ingressive吸气音 Segmental phonology音段音系学 Segmental phonemes音段音位 Suprasegmental超音段 Non-segmental非音段 Plurisegmental复音段 Synthetic language综合型语言 Diacritic mark附加符号 Broad transcription宽式标音 Narrow transcription窄式标音 Orthoepy正音法 Orthography正字法 Etymology词源 Active articulator积极发音器官 Movable speech organ能动发音器官 Passive articulator消极发音器官 Immovable speech organ不能动发音 器官 Lateral边音 Approximant [j,w]无摩擦延续音 Resonant共鸣音 Central approximant中央无摩擦延续 音 Lateral approximant边无摩擦延续音 Unilateral consonant单边辅音 Bilateral consonant双边辅音 Non-lateral非边音 Trill [r]颤音trilled consonant颤辅音 rolled consonant滚辅音 Labal-velar唇化软腭音 Interdent al齿间音 Post-dental后齿音 Apico-alveol ar舌尖齿龈音 Dorso-alveol ar舌背齿龈音 Palato-alveolar后齿龈音 Palato-alveolar腭齿龈音 Dorso-palat al舌背腭音 Pre-palat al前腭音 Post-palatal后腭音 Velarization软腭音化 Voicing浊音化 Devoicing清音化 Pure vowel纯元音 Diphthong二合元音 Triphthong三合元音 Diphthongization二合元音化 Monophthongization单元音化 Centring diphthong央二合元音 Closing diphthong闭二合元音 Narrow diphthong窄二合元音 Wide diphthong宽二合元音 Phonetic similarity语音相似性 Free variant自由变体 Free variation自由变异 Contiguous assimilation临近同化 Juxtapostional assimilation邻接同化 Regressive assimilation逆同化 Anticipatory assimilation先行同化 Progressive assimilation顺同化 Reciprocal assimilation互相同化 Coalescent assimilation融合同化 Partial assimilation部分同化 Epenthesis插音 Primary stress主重音 Secondary stress次重音 Weak stress弱重音 Stress group重音群 Sentence stress句子重音 Contrastive stress对比重音 Lexical stress词汇重音 Word stress词重音 Lexical tone词汇声调 Nuclear tone核心声调 Tonetics声调学 Intonation contour语调升降曲线 Tone units声调单位 Intonology语调学 Multilevel phonology多层次音系学 Monosyllabic word多音节词 Polysyllabic word单音节次 Maximal onset principle最大节首辅 音原则 第三章词汇 liaison连音 contract ed f orm缩写形式 frequency count词频统计 a unit of vocabulary词汇单位 a lexical item词条 a lexeme词位 hierarchy层次性 lexicogrammar词汇语法 morpheme语素 nonomorphemic words单语素词 polymorphemic words多语素词 relative uninterruptibility相对连续性 a minimum f ree f orm最小自由形式 the maximum f ree f orm最大自由形式 variable words 可变词 invariable words不变词 paradigm聚合体 grammatical words(function words)语 法词/功能词 lexical words(cont ent words)词汇词/ 实义词 closed-cl ass words封闭类词 opened-class words开放类词 word class词类 particles小品词 pro-f orm代词形式 pro-adjective(so)代形容词 pro-verb(do/did)代副词 pro-adverb(so)代动词 pro-locative(there)代处所词/代方位词 determiners限定词 predeterminers前置限定词 central determiners中置限定词 post determiners后置限定词 ordinal number序数词 cardinal number基数词 morpheme词素 morphology形态学 free morpheme自由词素 bound morpheme黏着词素 root词根 aff ix词缀 stem词干 root morpheme词根语素 pref ix前缀 inf ix中缀 suff ix后缀 bound root morpheme黏着词根词素 inf lectional aff ix屈折词缀 derivational aff ix派生词缀 inf lectional morphemes屈折语素 derivational morphemes派生语素 word-f ormation构词 compound复合词 endocentri c compound向心复合词 exocentri c compound离心复合词 nominal endocentric compound名词性 向心复合词 adjective endocentric compound形容 词性向心复合词 verbal compound动词性复合词 synthetic compound综合性复合词 derivation派生词 morpheme语素 phoneme音位

语言学重要知识点(胡壮麟版)

Language is a means of verbal communication. It is a system of arbitrary vocal symbols used for human communication. 1.Design features of language The features that define our human languages can be called design features which can distinguish human language from any animal system of communication.Arbitrariness refers to the fact that the forms of linguistic signs bear no natural relationship to their meanings.eg.the dog barks wowwow in english but 汪汪汪in chinese.Duality refers to the property of having two levels of structures, such that units of the primary level are composed of elements of the secondary level and each of the two levels has its own principles of organization.eg.dog-woof(but not w-oo-f)Creativity means that language is resourceful because of its duality and its recursiveness. Eg. An experiment of bee communication.Displacement means that human languages enable their users to symbolize objects, events and concepts which are not present (in time and space) at the moment of communication. 3. Origin of language The bow-wow theory In primitive times people imitated the sounds of the animal calls in the wild environment they lived and speech developed from that.The pooh-pooh theory In the hard life of our primitive ancestors, they utter instinctive sounds of pains, anger and joy which gradually developed into language. The “yo-he-ho” theory As primitive people worked together, they produced some rhythmic grunts which gradually developed into chants and then into language. 4.Linguistics is the scientific study of language. It studies not just one language of any one community, but the language of all human beings. 5. Main branches of linguistics ?Phonetics is the study of speech sounds, it includes three main areas: articulatory phonetics, acoustic phonetics, and auditory phonetics. ?Phonology studies the rules governing the structure, distribution, and sequencing of speech sounds and the shape of syllables. ?Morphology studies the minimal units of meaning – morphemes and word-formation processes. ?Syntax refers to the rules governing the way words are combined to form sentences in a language, or simply, the study of the formation of sentences. ?Semantics examines how meaning is encoded in a language. It is concerned with both meanings of words as lexical items and levels of language below the word and above it. ?Pragmatics is the study of meaning in context. It concerned with the way language is used to communicate rather than with the way language is structured. 6.Important distinctions in linguistics 1)Descriptive vs. prescriptive For example, ―Don’t say X.‖ is a prescriptive command; ―People don’t say X.‖ is a descriptive statement. The distinction lies in prescribing how things ought to be and describing how things are.Lyons 2)Synchronic vs. diachronic A synchronic study takes a fixed instant (usually at present) as its point of observation. Saussure’s diachronic description is the study of a language through the course of its history. E.g. a study of the features of the English used in Shakespeare’s time would be synchronic, and a study of the changes English has undergone since then would be a diachronic study. 3)Langue & parole langue: the linguistic competence of the speaker. parole: the actual phenomena or data of linguistics(utterances). Saussure 4)Competence and performance According to Chomsky,a language user’s underlying knowledge about the system of rules is called the linguistic competence, and the actual use of language in concrete situations is called performance. Competence 7.consonant is produced by constricting or obstructing the vocal tract at some places to divert, impede, or

胡壮麟语言学教程(修订版)一至三单元课后名词解释中英对照

语言学教程chapter1-3 1.design feature: are features that define our human languages,such as arbitrariness,duality,creativity,displacement,cultural transmission,etc. 本质特征:决定了我们语言性质的特征。如任意性、二重性、创造性、移位性等等。 2.function: the use of language to communicate,to think ,https://www.360docs.net/doc/0c1400465.html,nguage functions inclucle imformative function,interpersonal function,performative function, emotive function,phatic communion,recreational function and metalingual function. 功能:运用语言进行交流、思考等等。语言的功能包括信息功能、人际功能、施为功能、感情功能。3.etic: a term in contrast with emi c which originates from American linguist Pike’s distinction of phonetics and phonemics.Being etic means making far too many, as well as behaviously inconsequential,differentiations,just as was ofter the case with phonetic vx.phonemic analysis in linguistics proper. 非位的:相对于“位学的”源于美国语言学家派克对于语音学和音位学的区分。 4.emic: a term in contrast with etic which originates from American linguist Pike’s distinction of phonetics and phonemics.An emic set of speech acts and events must be one that is validated as meaningful via final resource to the native members of a speech communith rather than via a ppeal to the investigator’s ingenuith or intuition alone. 位学的:相对于“非位的”源于美国语言学家派克对于语音学和音位学的区分。言语行为和事件中的位学系统必须是有效而有意义的,是通过言语社会中的本族语者而不仅仅是调查者的聪明和直觉获得的。5.synchronic: a kind of description which takes a fixed instant(usually,but not necessarily,the present),as its point of observation.Most grammars are of this kind. 共时:以一个固定的时间(通常,但非必须,是现在)为它的观察角度的描写。大多数的语法书属于此类型。 6.diachronic:study of a language is carried through the course of its history. 历时:在语言的历史过程中研究语言。 7.prescriptive: a kind of linguistic study in which things are prescribed how ought to be,https://www.360docs.net/doc/0c1400465.html,ying down rules for language use. 规定式:规定事情应该是怎样的。如制定语言运用规则。 8.descriptive: a kind of linguistic study in which things are just described. 描写式:描述事情是怎样的。 9.arbitrariness: one design feature of human language,which refers to the face that the forms of linguistic signs bear no natural relationship to their meaning. 任意性:人类语言的本质特征之一。它指语言符号的形式与意义之间没有自然的联系。 10.duality: one design feature of human language,which refers to the property of having two levels of are composed of elements of the secondary.level and each of the two levels has its own principles of organization. 二重性:人类语言的本质特征之一。拥有两层结构的这种特性,底层结构是上层结构的组成成分,每层都有自身的组合规则。 11.displacement: one design feature of human language,which means human language enable their users to symbolize objects,events and concepts which are not present (in time and space),at the moment of communication.

胡壮麟语言学术语英汉对照翻译表-(1)(DOC)

胡壮麟语言学术语英汉对照翻译表 1. 语言的普遍特征: 任意性arbitrariness 双层结构duality 既由声音和意义结构 多产性productivity 移位性displacement:我们能用语言可以表达许多不在场的东西 文化传播性cultural transmission 2。语言的功能: 传达信息功能informative 人济功能:interpersonal 行事功能:Performative 表情功能:Emotive 寒暄功能:Phatic 娱乐功能recreatinal 元语言功能metalingual 3. 语言学linguistics:包括六个分支 语音学Phonetics 音位学phonology 形态学Morphology 句法学syntax 语义学semantics 语用学pragmatics 4. 现代结构主义语言学创始人:Ferdinand de saussure 提出语言学中最重要的概念对之一:语言与言语language and parole ,语言之语言系统的整体,言语则只待某个个体在实际语言使用环境中说出的具体话语 5. 语法创始人:Noam Chomsky 提出概念语言能力与语言运用competence and performance 1. Which of the following statements can be used to describe displacement. one of the unique properties of language: a. we can easily teach our children to learn a certain language b. we can use both 'shu' and 'tree' to describe the same thing. c. we can u se language to refer to something not present d. we can produce sentences that have never been heard befor e. 2.What is the most important function of language? a. interpersonal b. phatic c. informative d.metallingual 3.The function of the sentence "A nice day, isn't it ?"is __ a informative b. phatic c. directive d. performative

胡壮麟版语言学名词解释

Define the following terms: 1.design feature:are features that define our human languages,such as arbitrariness,duality,creativity,displacement,cultural transmission,etc. 2.function: the use of language tocommunicate,to think ,https://www.360docs.net/doc/0c1400465.html,nguage functions inclucle imformative function,interpersonal function,performative function,interpersonal function,performative function,emotive function,phatic communion,recreational function and metalingual function. 3. etic:a term in contrast with emic which originates f rom American linguist Pike’s distinction of phonetics and phonemics.Being etic mans making far too many, as well as behaviously inconsequential,differentiations,just as was ofter the case with phonetic vx.phonemic analysis in linguistics proper. 4. emic:a term in contrast with etic which originates from American linguist Pike’s distinction of phonetics and phonemics.An emic set of speech acts and events must be one that is validated as meaningful via final resource to the native members of a speech commun ith rather than via qppeal to the investigator’s ingenuith or intuition alone. 5. synchronic: a kind of description which takes a fixed instant(usually,but not necessarily,the present),as its point of observation.Most grammars are of this kind. 6. diachronic:study of a language is carried through the course of its history. 7. prescriptive: the study of a language is carried through the course of its history. 8. prescriptive: a kind of linguistic study in which things are prescribed how ought to be,https://www.360docs.net/doc/0c1400465.html,ying down rules for language use. 9. descriptive: a kind of linguistic study in which things are just described. 10. arbitrariness: one design feature of human language,which refers to the face that the forms of linguistic signs bear no natural relationship to their meaning. 11. duality: one design feature of human language,which refers to the property of having two levels of are composed of elements of the secondary.level and each of the two levels has its own principles of organization. 12. displacement: one design feature of human language,which means human language enable their users to symbolize objects,events and concepts which are not present c in time and space,at the moment of communication. 13. phatic communion: one function of human language,which refers to the social interaction of language. 14. metalanguage: certain kinds of linguistic signs or terms for the analysis and description of particular studies. 15. macrolinguistics: he interacting study between language and language-related disciplines such as psychology,sociology,ethnograph,science of law and artificial intelligence etc.Branches of macrolinguistics include psycholinguistics,sociolinguistics, anthropological linguistics,et 16. competence: language user’s underlying knowledge about the system of rules. 17. performance: the actual use of language in concrete situation. 18. langue: the linguistic competence of the speaker. 19. parole: the actual phenomena or data of linguistics(utterances). 20.Articulatory phonetics: the study of production of speechsounds. 21.Coarticulation: a kind of phonetic process in which simultaneous or overlapping articulations are

相关文档
最新文档