Language-dependent and language-independent approaches to cross-lingual text retrieval
The Undeutsch Hypothesis: A Reply

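Question: Why did linguists propose the Undeutsch hypothesis? In linguistic research, the Undeutsch hypothesis holds that every human language or dialect has unique characteristics that cannot be accurately described in, or translated into, any other language.
The hypothesis stems from linguists' observation and study of the differences between languages.
To understand it better, we answer several questions in turn.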
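Question 1: What is the Undeutsch hypothesis?
In linguistics, the Undeutsch hypothesis holds that every language or dialect has unique characteristics that cannot be accurately described or translated by other languages.
According to the hypothesis, languages differ in phonetics, grammar, vocabulary and other respects.
These differences make translating, communicating in, understanding and learning different languages more complex and more difficult.
Question 2: Why do languages have unique characteristics?
Language is the basic tool of human communication. Each language evolves gradually over time and develops its own sound system, grammatical structure and vocabulary.
Because this evolution is shaped by geography, history, culture and other factors, differences between languages are unavoidable.
For example, because sound systems differ, certain phonemes of one language may not exist in another, which makes them hard to render in an accurate translation.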
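Question 3: How has the Undeutsch hypothesis influenced linguistic research?
The hypothesis has prompted deeper reflection in linguistic research.
First, it has encouraged linguists to carry out fieldwork and in-depth analysis of many languages in order to explore their differences and what they have in common.
Through such research, linguists can better understand the similarities and differences between languages and give more accurate guidance for cross-language communication and translation.
Second, the hypothesis has guided developments in translation and language teaching.
Translation requires the translator to understand the distinctive features of the source language and to render them accurately using the means of expression of the target language.
In language teaching, teachers need to take account of learners' mother-tongue background and cultural differences, and target the difficulties that speakers of a given mother tongue typically face.
Question 4: Does the hypothesis help communication in multilingual societies?
Although languages differ, the Undeutsch hypothesis does not stand in the way of social contact and communication across languages.
In fact, humans are able to learn and understand several languages, and by learning other languages we can better understand both the differences and the common ground between cultures.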
TEM-8 Linguistics Exam Points

Introduction to Linguistics

I. Design Features of Language
The design features of language include:
1. Arbitrariness
2. Productivity
3. Duality
4. Displacement
5. Cultural transmission

II. The Main Branches of Linguistics
1. Phonetics: studies the characteristics of speech sounds and provides methods for their description, classification and transcription.
2. Phonology: studies how the distinctive sounds that occur in a language, and their patterns, form a sound system that expresses meaning.
3. Morphology: studies the internal structure of words and the rules of word formation.
4. Syntax: studies how words are combined into sentences and the rules that govern sentence formation.
5. Semantics: studies meaning in language.
6. Pragmatics: studies how meaning is understood, conveyed and produced in context.
7. Macrolinguistics: mainly includes sociolinguistics, psycholinguistics, anthropological linguistics and computational linguistics.

III. Different Approaches to Linguistics
1. Structural Linguistics
1.1 The Prague School
1.2 The Copenhagen School
1.3 American Structuralism
All three schools were influenced by Saussure; for example, all of them distinguish langue from parole and synchronic from diachronic study.
English-Chinese Comparison of Linguistics Terms

1. Universal features of language: arbitrariness; duality (language is structured at the levels of both sound and meaning); productivity; displacement (language can express many things that are not present); cultural transmission.
2. Functions of language: informative, interpersonal, performative, emotive, phatic, recreational, metalingual.
3. Linguistics has six branches: phonetics, phonology, morphology, syntax, semantics, pragmatics.
4. Founder of modern structural linguistics: Ferdinand de Saussure, who proposed one of the most important conceptual pairs in linguistics, langue and parole. Langue is the language system as a whole; parole is the concrete utterances an individual produces in an actual context of language use.
5. Founder of generative grammar: Noam Chomsky, who proposed the distinction between competence and performance.

Exercises
1. Which of the following statements can be used to describe displacement, one of the unique properties of language?
a. We can easily teach our children to learn a certain language.
b. We can use both 'shu' and 'tree' to describe the same thing.
c. We can use language to refer to something not present.
d. We can produce sentences that have never been heard before.
2. What is the most important function of language?
a. interpersonal
b. phatic
c. informative
d. metalingual
3. The function of the sentence "A nice day, isn't it?" is __
a. informative
b. phatic
c. directive
d. performative
4. The distinction between competence and performance is proposed by __
a. Saussure
b. Halliday
c. Chomsky
d. the Prague School
5. Who put forward the distinction between langue and parole?
a. Saussure
b. Chomsky
c. Halliday
d. anonymous

Section 2: Phonetics
1. The vocal organs consist of the vocal cords and three resonating cavities.
2. Consonant: there is an obstruction of the air stream at some point of the vocal tract.
3. Manner of articulation of consonants: complete obstruction (plosives, nasals); partial obstruction (fricatives, affricates, etc.).
4. Consonants are also distinguished by voicing and by aspiration.
5. Vowels are classified by tongue position, tongue height and the shape of the lips.
6. Diphthongs involve vowel glides.

Exercises
1. Articulatory phonetics mainly studies __.
a. the physical properties of the sounds produced in speech
b. the perception of sounds
c. the combination of sounds
d. the production of sounds
2. The distinction between vowels and consonants lies in
a. the place of articulation
b. the obstruction of the airstream
c. the position of the tongue
d. the shape of the lips
3. What is the common factor of the three sounds p, k, t?
a. voiceless
b. spread
c. voiced
d. nasal
4. What phonetic feature distinguishes the p in "please" and the p in "speak"?
a. voicing
b. aspiration
c. roundness
d. nasality
5. Which of the following is not a distinctive feature in English?
a. voicing
b. nasal
c. approximation
d. aspiration
6. The phonological features of the consonant k are __
a. voiced stop
b. voiceless stop
c. voiced fricative
d. voiceless fricative
7. p is different from k in __
a. the manner of articulation
b. the shape of the lips
c. the vibration of the vocal cords
d. the place of articulation
8. Vibration of the vocal cords results in __
a. aspiration
b. nasality
c. obstruction
d. voicing

Section 3: Phonology
1. The difference between phonology and phonetics: phonetics focuses on the natural properties of speech sounds and is mainly concerned with all the sounds humans can produce in all languages; phonology emphasizes the social function of speech sounds, and its object is the sounds of one particular language that can be combined into words and sentences.
Individual Differences in Second Language Acquisition

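As is well known, a normally developing infant acquires its mother tongue with a success rate of one hundred percent. Moving from babbling to expressing wishes fairly fully and communicating takes only six to seven years, and the process is easy and pleasant.
Second language acquisition is a different matter: the process is not only long but full of difficulties, and its success rate is usually lower than that of first language acquisition.
It is especially worth noting that even with the same textbook, the same teacher, the same teaching methods and the same environment, learners' final results often show large individual differences.
Why is this? According to current research, the factors that influence second language acquisition are many and varied, and they fall roughly into two classes: internal and external.
Internal factors are generally those that lie within the learner, such as attitude, motivation, intelligence, language aptitude, personality and learning strategies.
External factors are those that do not depend on the learner's own will, such as age, social environment, family and gender.
This article focuses on some of the internal factors that affect language learning, because certain internal factors change with the external environment, so teachers can regulate them and steer them in a direction that favours language learning.
I. Attitude and Motivation
Attitude and motivation are considered the most important affective factors determining success or failure in second language learning.
Attitude generally covers three aspects: the attitude towards the second-language community and its members, the attitude towards the language being learned, and the general attitude towards language learning.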
An English Essay on Language

Language is a powerful tool that enables communication, expression, and connection among individuals. In an English composition about the power of language, one can explore various aspects of how language shapes our world and influences our lives.

The Importance of Language in Communication
Language is the primary means through which we communicate our thoughts, feelings, and ideas. It allows us to share experiences, learn from one another, and build relationships. In the essay, you could discuss how language bridges cultural gaps, enabling people from different backgrounds to understand and appreciate each other's perspectives.

Language as a Reflection of Culture
Every language is a reflection of the culture from which it originates. It carries the history, traditions, and values of a society. In your composition, you might delve into how language is not only a means of communication but also a carrier of cultural identity, shaping our worldview and influencing our behavior.

The Role of Language in Education
Language plays a crucial role in education, as it is the medium through which knowledge is imparted. Your essay could explore how language proficiency is essential for academic success and how learning a new language can open doors to new educational opportunities and a broader understanding of the world.

Language and Identity
Language is closely tied to one's identity. It is a part of who we are and where we come from. In your composition, you could discuss how language can be a source of pride and a marker of belonging to a particular community or group.

The Evolution of Language
Language is not static; it evolves over time, influenced by social, technological, and environmental factors. Your essay could examine the dynamic nature of language and how new words and expressions are created to reflect changing societal norms and innovations.

The Challenges of Language
While language is a powerful tool, it can also present challenges, such as misunderstandings and miscommunications. Your composition might address the difficulties of language barriers and the importance of learning multiple languages to foster global understanding and cooperation.

The Power of Persuasive Language
Language has the power to persuade and influence. In your essay, you could explore how the choice of words and the structure of sentences can sway opinions and affect decision-making processes.

The Impact of Language on Mental Health
Language can also have a profound impact on mental health. Positive and supportive language can uplift and encourage, while negative or derogatory language can harm and demotivate. Your composition could touch on the importance of using language mindfully to foster mental wellbeing.

The Future of Language
With advancements in technology, the way we use language is changing. Your essay might speculate on the future of language, considering the role of artificial intelligence, machine translation, and the potential for new forms of communication.

Conclusion
In conclusion, your English composition on the power of language could emphasize its multifaceted role in our lives, from facilitating communication to shaping our identities and influencing our perceptions of the world. By exploring these themes, you can demonstrate the profound impact that language has on individuals and societies alike.
Hu Zhuanglin's Linguistics Notes

1. What is language?
Language is a system of arbitrary vocal symbols used for human communication. It is a system, since linguistic elements are arranged systematically rather than randomly. It is arbitrary in the sense that there is usually no intrinsic connection between a word (like “book”) and the object it refers to. This explains and is explained by the fact that different languages have different “books”: “book” in English, “livre” in French, “shu” in Chinese. It is symbolic, because words are associated with objects, actions, ideas etc. by nothing but convention. Namely, people use the sounds or vocal forms to symbolize what they wish to refer to. It is vocal, because sound or speech is the primary medium for all human languages. Writing systems came much later than the spoken forms. The fact that small children learn and can only learn to speak (and listen) before they write (and read) also indicates that language is primarily vocal rather than written. The term “human” in the definition is meant to specify that language is human specific.

2. What are design features of language?
“Design features” here refer to the defining properties of human language that distinguish it from any system of animal communication. They are arbitrariness, duality, productivity, displacement, cultural transmission and interchangeability.

3. What is arbitrariness?
By “arbitrariness”, we mean there is no logical connection between meanings and sounds. A dog might be called a pig if only the first person or group of persons had used that word for a pig. Language is therefore largely arbitrary. But language is not absolutely arbitrary. First, there seem to be some sound-meaning associations, if we think of echo words like “bang”, “crash”, “roar”, which are motivated in a certain sense. Secondly, some compounds (words compounded to be one word) are not entirely arbitrary either. “Type” and “write” are opaque or unmotivated words, while “typewriter” is less so, or more transparent or motivated than the words that make it up. So we can say “arbitrariness” is a matter of degree.

4. What is duality?
Linguists use “duality” (of structure) to refer to the fact that in all languages so far investigated, one finds two levels of structure or patterning. At the first, higher level, language is analyzed in terms of combinations of meaningful units (such as morphemes, words etc.); at the second, lower level, it is seen as a sequence of segments which lack any meaning in themselves, but which combine to form units of meaning. According to Hu Zhuanglin et al., language is a system of two sets of structures, one of sounds and the other of meaning. This is important for the workings of language. A small number of sounds can be grouped and regrouped into a large number of semantic units (words), and these units of meaning can be arranged and rearranged into an infinite number of sentences (note that we have dictionaries of words, but no dictionary of sentences!). Duality makes it possible for a person to talk about anything within his knowledge. No animal communication system enjoys this duality.

5. What is productivity?
Productivity refers to the ability to construct and understand an indefinitely large number of sentences in one’s native language, including those that one has never heard before but that are appropriate to the speaking situation. No one has ever said or heard “A red-eyed elephant is dancing on the small hotel bed with an African gibbon”, but anyone can say it when necessary and can understand it in the right register. Different from artistic creativity, though, productivity never goes outside the language, and is thus also called “rule-bound creativity” (by N. Chomsky).

6. What is displacement?
“Displacement”, as one of the design features of human language, refers to the fact that one can talk about things that are not present as easily as about things that are present.
In other words, one can refer to real and unreal things, things of the past, of the present, of the future. Language itself can be talked about too. When a man, for example, is crying to a woman about something, it might be something that had occurred, or something that is occurring, or something that is to occur. When a dog is barking, however, you can be sure it is barking for something or at someone that exists here and now. It couldn’t be bow-wowing sorrowfully for a bone that is yet to be lost. The bee’s system, nonetheless, has a small share of “displacement”, but it is an unspeakably tiny share.

7. What is cultural transmission?
This means that language is not biologically transmitted from generation to generation, but that the details of the linguistic system must be learned anew by each speaker. It is true that the capacity for language in human beings (N. Chomsky called it the “language acquisition device”, or LAD) has a genetic basis, but the particular language a person learns to speak is a cultural fact rather than a genetic one like the dog’s barking system. If a human being is brought up in isolation he cannot acquire language. The Wolf Child reared by a pack of wolves turned out to speak the wolf’s roaring “tongue” when he was saved. He learned thereafter, with no small difficulty, the ABC of a certain human language.

8. What is interchangeability?
Interchangeability means that any human being can be both a producer and a receiver of messages. Though some people suggest that there is sex differentiation in actual language use, in other words, that men and women may say different things, in principle there is no sound, word or sentence that a man can utter and a woman cannot, or vice versa. On the other hand, one person can be the speaker while the other is the listener, and as the turn moves on to the listener, he becomes the speaker and the first speaker listens. It is turn-taking that makes social communication possible and acceptable. Some male birds, however, utter some calls which females do not (or cannot). When a dog barks, all the neighboring dogs bark. Then people around can hardly tell which dog (or dogs) is “speaking” and which is listening.

9. Why do linguists say language is human specific?
First of all, human language has six “design features” which animal communication systems do not have, at least not in the true sense of them. Secondly, linguists have done a lot trying to teach animals such as chimpanzees to speak a human language but have achieved nothing inspiring. Washoe, a female chimpanzee, was brought up like a human child by Beatrix and Allen Gardner. She was taught American Sign Language, and learned a little that made the teachers happy but did not make linguistic circles happy, for few believed in teaching chimpanzees. Thirdly, a human child reared among animals cannot speak a human language, not even when he is taken back and taught to do so.

10. What functions does language have?
Language has at least seven functions: phatic, directive, informative, interrogative, expressive, evocative and performative. According to Wang Gang (1988, p. 11), language has three main functions: a tool of communication, a tool whereby people learn about the world, and a tool by which people create art. M. A. K. Halliday, representative of the London School, recognizes three “macro-functions”: ideational, interpersonal and textual.

11. What is the phatic function?
The “phatic function” refers to language being used for setting up a certain atmosphere or maintaining social contacts (rather than for exchanging information or ideas). Greetings, farewells, and comments on the weather in English and on clothing in Chinese all serve this function. Much of the phatic language (e.g. “How are you?” “Fine, thanks.”) is insincere if taken literally, but it is important.
If you don’t say “Hello” to a friend you meet, or if you don’t answer his “Hi”, you ruin your friendship.

12. What is the directive function?
The “directive function” means that language may be used to get the hearer to do something. Most imperative sentences perform this function, e.g., “Tell me the result when you finish.” Other syntactic structures or sentences of other sorts can, according to J. Austin and J. Searle’s “indirect speech act theory” at least, serve the purpose of direction too, e.g., “If I were you, I would have blushed to the bottom of my ears!”

13. What is the informative function?
Language serves an “informative function” when used to tell something, characterized by the use of declarative sentences. Informative statements are often labelled as true (truth) or false (falsehood). According to P. Grice’s “Cooperative Principle”, one ought not to violate the “Maxim of Quality” when informing at all.

14. What is the interrogative function?
When language is used to obtain information, it serves an “interrogative function”. This includes all questions that expect replies. Statements, imperatives etc., according to the “indirect speech act theory”, may have this function as well, e.g., “I’d like to know you better.” This may bring forth a lot of personal information. Note that rhetorical questions are an exception, since they demand no answer, at least not the reader’s or listener’s answer.

15. What is the expressive function?
The “expressive function” is the use of language to reveal something about the feelings or attitudes of the speaker. Subconscious emotional exclamations are good examples, like “Good heavens!” “My God!” Sentences like “I’m sorry about the delay” can serve as good examples too, though in a subtle way. While language used for the informative function passes judgment on the truth or falsehood of statements, language used for the expressive function evaluates, appraises or asserts the speaker’s own attitudes.

16. What is the evocative function?
The “evocative function” is the use of language to create certain feelings in the hearer. Its aim is, for example, to amuse, startle, antagonize, soothe, worry or please. Jokes (not practical jokes, though) are supposed to amuse or entertain the listener; advertising to urge customers to purchase certain commodities; propaganda to influence public opinion. Obviously, the expressive and the evocative functions often go together, i.e., you may express, for example, your personal feelings about a political issue but end up evoking the same feeling in, or imposing it on, your listener. The same holds the other way round.

17. What is the performative function?
This means people speak to “do things” or perform actions. On certain occasions the utterance itself as an action is more important than the words or sounds that constitute the uttered sentence. A judge’s sentence of imprisonment, a president’s declaration of war or independence, etc., are performatives.

18. What is linguistics?
“Linguistics” is the scientific study of language. It studies not just one language of any one society, but the language of all human beings. A linguist, though, does not have to know and use a large number of languages, but has to investigate how each language is constructed. He is also concerned with how a language varies from dialect to dialect and from class to class, how it changes from century to century, how children acquire their mother tongue, and perhaps how a person learns or should learn a foreign language. In short, linguistics studies the general principles whereupon all human languages are constructed and operate as systems of communication in their societies or communities.

19. What makes linguistics a science?
Since linguistics is the scientific study of language, it ought to base itself upon the systematic investigation of language data, which aims at discovering the true nature of language and its underlying system.
To make sense of the data, a linguist usually has conceived some hypotheses about the language structure, to be checked against the observed or observable facts. In order to make his analysis scientific, a linguist is usually guided by four principles: exhaustiveness, consistency, economy and objectivity. Exhaustiveness means he should gather all the materials relevant to the study and give them an adequate explanation, in spite of the complexity. He is to leave no linguistic “stone” unturned. Consistency means there should be no contradiction between different parts of the total statement. Economy means a linguist should pursue brevity in the analysis when it is possible. Objectivity implies that since some people may be subjective in the study, a linguist should be (or at least sound) objective, matter-of-fact and faithful to reality, so that his work constitutes part of linguistic research.

20. What are the major branches of linguistics?
The study of language as a whole is often called general linguistics. But a linguist sometimes is able to deal with only one aspect of language at a time, hence the rise of various branches: phonetics, phonology, morphology, syntax, semantics, pragmatics, sociolinguistics, applied linguistics, psycholinguistics etc.

21. What are synchronic and diachronic studies?
The description of a language at some point of time (as if it stopped developing) is a synchronic study (synchrony). The description of a language as it changes through time is a diachronic study (diachrony). An essay entitled “On the Use of THE”, for example, may be synchronic if the author does not recall the past of THE, and it may also be diachronic if he claims to cover a large range or period of time wherein THE has undergone tremendous alteration.

22. What is speech and what is writing?
No one needs a repetition of the general principle of linguistic analysis, namely, the primacy of speech over writing.
Speech is primary, first, because it existed long before writing systems came into being. Genetically, children learn to speak before learning to write. Secondly, written forms just represent in this way or that the speech sounds: individual sounds, as in English and French, or syllables, as in Japanese. In contrast to speech, the spoken form of language, writing as a written code gives language new scope and uses that speech does not have. Firstly, messages can be carried through space, so that people can write to each other. Secondly, messages can be carried through time, so that people of our time can read Beowulf, Samuel Johnson, and Edgar A. Poe. Thirdly, oral messages are readily subject to distortion, either intentional or unintentional, while written messages allow and encourage repeated unalterable reading. Most modern linguistic analysis is focused on speech, unlike the work of grammarians of the last century and earlier.

23. What are the differences between the descriptive and the prescriptive approaches?
A linguistic study is “descriptive” if it only describes and analyses the facts of language, and “prescriptive” if it tries to lay down rules for “correct” language behavior. Linguistic studies before this century were largely prescriptive because many early grammars were based on “high” (literary or religious) written records. Modern linguistics is mostly descriptive, however. It believes that whatever occurs in natural speech (hesitation, incomplete utterance, misunderstanding, etc.) should be described in the analysis, and not be marked as incorrect, abnormal, corrupt, or lousy. These, along with changes in vocabulary and structures, need to be explained also.

24. What is the difference between langue and parole?
F. de Saussure uses “langue” to refer to the abstract linguistic system shared by all the members of a speech community, and “parole” to refer to the actual or actualized language, or the realization of langue. Langue is abstract, parole specific to the speaking situation; langue is not actually spoken by an individual, parole is always a naturally occurring event; langue is relatively stable and systematic, parole is a mass of confused facts and thus not suitable for systematic investigation. What a linguist ought to do, according to Saussure, is to abstract langue from instances of parole, i.e. to discover the regularities governing all instances of parole and make them the subject of linguistics. The langue-parole distinction is of great importance and has had great influence on later linguists.

25. What is the difference between competence and performance?
According to N. Chomsky, “competence” is the ideal language user’s knowledge of the rules of his language, and “performance” is the actual realization of this knowledge in utterances. The former enables a speaker to produce and understand an indefinite number of sentences and to recognize grammatical mistakes and ambiguities. A speaker’s competence is stable while his performance is often influenced by psychological and social factors. So a speaker’s performance does not always match or equal his supposed competence. Chomsky believes that linguists ought to study competence rather than performance. In other words, they should discover what an ideal speaker knows of his native language. Chomsky’s competence-performance distinction is similar to, though not exactly the same as, F. de Saussure’s langue-parole distinction. Langue is a social product and a set of conventions for a community, while competence is deemed a property of the mind of each individual. Saussure looks at language more from a sociological or sociolinguistic point of view than N. Chomsky, since the latter deals with his issues psychologically or psycholinguistically.

26. What is linguistic potential? What is actual linguistic behaviour?
These two terms, or the potential-behavior distinction, were introduced by M. A. K. Halliday in the 1960s, from a functional point of view. There is a wide range of things a speaker can do in his culture, and similarly there are many things he can say, for example, to many people, on many topics. What he actually says (i.e. his “actual linguistic behavior”) on a certain occasion to a certain person is what he has chosen from many possible linguistic items, each of which he could have said (linguistic potential).

27. In what way do langue, competence and linguistic potential agree? In what way do they differ? And their counterparts?
Langue, competence and linguistic potential have some similar features, but they are innately different. Langue is a social product and a set of speaking conventions; competence is a property or attribute of each ideal speaker’s mind; linguistic potential is the whole linguistic corpus or repertoire available, from which the speaker chooses items for the actual utterance situation. In other words, langue is an invisible but reliable abstract system, competence means “knowing”, and linguistic potential is a set of possibilities for “doing” or “performing actions”. They are similar in that they all refer to the constant underlying the utterances that constitute what Saussure, Chomsky and Halliday respectively called parole, performance and actual linguistic behavior. Parole, performance and actual linguistic behavior enjoy more similarities than differences.

28. What is phonetics?
“Phonetics” is the science which studies the characteristics of human sound-making, especially those sounds used in speech, and provides methods for their description, classification and transcription. Speech sounds may be studied in different ways, and thus by three different branches of phonetics.
(1) Articulatory phonetics: the branch of phonetics that examines the way in which a speech sound is produced, to discover which vocal organs are involved and how they coordinate in the process. (2) Auditory phonetics: the branch of phonetic research from the hearer’s point of view, looking into the impression which a speech sound makes on the hearer as mediated by the ear, the auditory nerve and the brain. (3) Acoustic phonetics: the study of the physical properties of speech sounds, as transmitted between mouth and ear. Most phoneticians, however, are interested in articulatory phonetics.

29. How are the vocal organs formed?
The vocal organs, or speech organs, are organs of the human body whose secondary use is in the production of speech sounds. The vocal organs can be considered as consisting of three parts: the initiator of the air-stream, the producer of voice and the resonating cavities.

30. What is place of articulation?
It refers to the place in the mouth where, for example, the obstruction occurs, resulting in the utterance of a consonant. Whatever sound is pronounced, at least some vocal organs will get involved, e.g. lips, hard palate etc., so a consonant may be one of the following: (1) bilabial: [p, b, m]; (2) labiodental: [f, v]; (3) dental: [θ, ð]; (4) alveolar: [t, d, l, n, s, z]; (5) retroflex; (6) palato-alveolar: [ʃ, ʒ]; (7) palatal: [j]; (8) velar: [k, g]; (9) uvular; (10) glottal: [h]. Some sounds involve the simultaneous use of two places of articulation. For example, the English [w] has both an approximation of the two lips and that of the tongue and the soft palate, and may be termed “labial-velar”.

31. What is the manner of articulation?
The “manner of articulation” literally means the way a sound is articulated. At a given place of articulation, the airstream may be obstructed in various ways, resulting in various manners of articulation. These are the following: (1) plosive: [p, b, t, d, k, g]; (2) nasal: [m, n, ŋ]; (3) trill; (4) tap or flap; (5) lateral: [l]; (6) fricative: [f, v, s, z]; (7) approximant: [w, j]; (8) affricate: [tʃ, dʒ].

32. What is IPA? When did it come into being?
The IPA, abbreviation of “International Phonetic Alphabet”, is a compromise system making use of symbols from all sources, including diacritics indicating length, stress and intonation, and indicating phonetic variation. Ever since it was developed in 1888, the IPA has undergone a number of revisions.

33. What is narrow transcription and what is broad transcription?
In his Handbook of Phonetics, Henry Sweet made a distinction between “narrow” and “broad” transcriptions, which he called “Narrow Romic” and “Broad Romic”. The former was meant to symbolize all the possible speech sounds, including even the most minute shades of pronunciation, while Broad Romic or broad transcription was intended to indicate only those sounds capable of distinguishing one word from another in a given language.

34. What is phonology? What is the difference between phonetics and phonology?
“Phonology” is the study of sound systems: the inventory of distinctive speech sounds that occur in a language and the patterns into which they fall. Minimal pairs, phonemes, allophones, free variation, complementary distribution, etc., are all to be investigated by a phonologist. Phonetics is the branch of linguistics that studies the characteristics of speech sounds and provides methods for their description, classification and transcription.
A phonetician is mainly interested in the physical properties of speech sounds, whereas a phonologist studies what he believes are meaningful sounds, related to their semantic features, morphological features, and the way they are conceived and printed in the depth of the mind. Phonological knowledge permits a speaker to produce sounds which form meaningful utterances, to recognize a foreign “accent”, to make up new words, to add the appropriate phonetic segments to form plurals and past tenses, and to know what is and what is not a sound in one’s language.

35. What is a phone? What is a phoneme? What is an allophone?
A “phone” is a phonetic unit or segment. The speech sounds we hear and produce during linguistic communication are all phones. When we hear the words [pit], [tip] and [spit] pronounced, the similar phones we hear are all [p], yet they are three different [p]s, which can be told apart in narrow transcription by means of diacritics. Phones may or may not distinguish meaning. A “phoneme” is a phonological unit; it is a unit that is of distinctive value. As an abstract unit, a phoneme is not any particular sound, but rather it is represented or realized by a certain phone in a certain phonetic context. For example, the phoneme [p] is represented differently in [pit], [tip] and [spit]. The phones representing a phoneme are called its “allophones”, i.e., different phones that do not make a word so phonetically different as to create a new word or a new meaning. So the different [p]s in the above words are the allophones of the same phoneme [p]. How a phoneme is represented by a phone, or which allophone is to be used, is determined by the phonetic context in which it occurs. The choice of an allophone is not random. In most cases it is rule-governed; these rules are to be found out by a phonologist.

36.
What are minimal pairs?
When two different phonetic forms are identical in every way except for one sound segment which occurs in the same place in the string, the two forms (i.e., words) are said to form a “minimal pair”, e.g., “pill” and “bill”, “pill” and “till”, “till” and “dill”, “till” and “kill”, etc. All these words together constitute a minimal set. They are identical in form except for the initial consonants. There are many minimal pairs in English, which makes it relatively easy to determine what the English phonemes are. It is of great importance to find the minimal pairs when a phonologist is dealing with the sound system of an unknown language.

37. What is free variation?
If two sounds occurring in the same environment do not contrast, namely, if the substitution of one for the other does not generate a new word form but merely a different pronunciation of the same word, the two sounds are said to be in “free variation”. The plosives, for example, may not be exploded when they occur before another plosive or a nasal (e.g., act, apt, good morning). The minute distinctions may, if necessary, be transcribed with diacritics. These unexploded and exploded plosives are in free variation. Sounds in free variation should be assigned to the same phoneme.

38. What is complementary distribution?
When two sounds never occur in the same environment, they are in “complementary distribution”. For example, the aspirated English plosives never occur after [s], and the unaspirated ones never occur initially. Sounds in complementary distribution may be assigned to the same phoneme. The allophones of [l], for example, are also in complementary distribution. The clear [l] occurs only before a vowel, the voiceless equivalent of [l] occurs only after a voiceless consonant, such as in the words “please”, “butler”, “clear”, etc., and the dark [l] occurs only after a vowel or as a syllabic sound after a consonant, such as in the words “feel”, “help”, “middle”, etc.

39.
What is the assimilation rule? What is the deletion rule?
The “assimilation rule” assimilates one segment to another by “copying” a feature of a sequential phoneme, thus making the two phones more similar. This rule accounts for the varying pronunciation of the nasal [n] that occurs within a word. The rule is that within a word the nasal consonant [n] assumes the same place of articulation as the following consonant. The negative prefix “in-” serves as a good example. It may be pronounced as [in], [iŋ] or [im] when occurring in different phonetic contexts, e.g., indiscrete: [in] (alveolar); inconceivable: [iŋ] (velar); input: [ˈimput] (bilabial).
The “deletion rule” tells us when a sound is to be deleted although it is orthographically represented. While the letter “g” is mute in “sign”, “design” and “paradigm”, it is pronounced in their corresponding derivatives: “signature”, “designation” and “paradigmatic”. The rule can then be stated as: delete a [g] when it occurs before a final nasal consonant. This accounts for some of the seeming irregularities of English spelling.

40. What is suprasegmental phonology? What are suprasegmental features?
“Suprasegmental phonology” refers to the study of phonological properties of linguistic units larger than the segment called the phoneme, such as syllable, length and pitch, stress, and intonation.

41. What is morphology?
“Morphology” is the branch of grammar that studies the internal structure of words, and the rules by which words are formed. It is generally divided into two fields: inflectional morphology and lexical/derivational morphology.

42. What is inflection/inflexion?
“Inflection” is the manifestation of grammatical relationships, such as number, person, finiteness, aspect, and case, through the addition of inflectional affixes, which do not change the grammatical class of the items to which they are attached.

43. What is a morpheme?
What is an allomorph?
The “morpheme” is the smallest unit in terms of the relationship between expression and content, a unit which cannot be divided without destroying or drastically altering the meaning, whether it is lexical or grammatical. The word “boxes”, for example, has two morphemes: “box” and “-es”, neither of which permits further division or analysis if we don’t wish to sacrifice meaning. Therefore a morpheme is considered the minimal unit of meaning. Allomorphs, like allophones in relation to phonemes, are the alternate shapes (and thus phonetic forms) of the same morpheme.
Summary of Introduction to Linguistics (in English)

Chapter 1: Invitation to Linguistics

1. Definition of language:
Language is a system of vocal (and written) symbols with meaning attached that is used for human communication of thoughts and feelings.

2. Design features of language:
① Arbitrariness: The forms of linguistic signs generally bear no natural relationship to the meanings they carry.
② Duality: Human language has two levels of structure: the primary meaningful level of morphemes, words, phrases and sentences, and the secondary meaningless level of sounds. The units of the primary level are composed of elements of the secondary level, and each of the two levels has its own principles of organization.
③ Creativity: Language is resourceful because of its duality and recursiveness.
④ Displacement: Human languages enable their users to symbolize objects, events and concepts which are not present in time and space at the moment of communication.

3. Functions of language:
1) Informative function
2) Interpersonal function
3) Performative function
4) Emotive function
5) Phatic function
6) Recreational function
7) Metalingual function: the use of language to describe or explain language itself

4. Main branches of linguistics:
The main branches of linguistics (microlinguistics) and the interdisciplinary fields of linguistics (macrolinguistics).
1) Main branches of linguistics:
(1) Phonetics; (2) Phonology; (3) Morphology / Lexicology; (4) Syntax; (5) Semantics; (6) Pragmatics: the study of particular utterances in particular situations, and of how language is understood and used in different communicative contexts.
Language-dependent and Language-independent Approaches to Cross-Lingual Text Retrieval

Jaap Kamps, Christof Monz, Maarten de Rijke, and Börkur Sigurbjörnsson
Language & Inference Technology Group, University of Amsterdam
Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands
E-mail: {kamps,christof,mdr,borkur}@science.uva.nl

Abstract. We investigate the effectiveness of language-dependent approaches to document retrieval, such as stemming and decompounding, and contrast them with language-independent approaches, such as character n-gramming. In order to reap the benefits of more than one type of approach, we also consider the effectiveness of the combination of both types of approaches. We focus on document retrieval in nine European languages: Dutch, English, Finnish, French, German, Italian, Russian, Spanish, and Swedish. We look at four different cross-lingual information retrieval tasks: monolingual, bilingual, multilingual, and domain-specific retrieval. The experimental evidence is obtained using the 2003 test suite of the cross-language evaluation forum (CLEF).

1 Introduction

Researchers in Information Retrieval (IR) have experimented with a great variety of approaches to document retrieval for European languages. Differences between these approaches range from the text representation used (e.g., whether to apply morphological normalization or not, or which type of query formulation to use), to the choice of search strategy (e.g., which weighting scheme to use, or whether to use blind feedback). We focus on approaches using different document representations, but using the same retrieval settings and weighting scheme. In particular, we focus on different approaches to morphological normalization or tokenization. We conducted experiments on nine European languages (Dutch, English, Finnish, French, German, Italian, Russian, Spanish, and Swedish). There are notable differences between these languages, such as the complexity of inflectional and derivational morphology [1].

A recent overview of monolingual document
retrieval can be found in [2]. The options considered in [2] include word-based runs (indexing the tokens as they occur in the documents), stemming (using stemmers from the Snowball family of stemming algorithms), lemmatizing (using the lemmatizer built into the TreeTagger part-of-speech tagger), and compound splitting (for compound forming languages such as Dutch, Finnish, German, and Swedish). Additionally, there are experiments with adding character n-grams (of length 4 and 5). The main lessons learned in [2] were twofold. First, there is no language for which the best performing run significantly improves over the “compound split and stem” run (treating splitting as a no-op for non-compound forming languages). Second, the hypothesis that adding 4-gramming is the best strategy is refuted for Spanish only. Notice that these comparisons did not involve combinations of runs, but only runs based on a single index.

The aim of this paper is to redo some of the experiments of [2], to investigate the combination of approaches, and to extend these experiments to a number of cross-lingual retrieval tasks (we give details below). In particular, we will investigate the effectiveness of language-dependent approaches to document retrieval, i.e., approaches that require detailed knowledge of the particular language at hand. The best known example of a language-dependent approach is the use of stemming algorithms. The effectiveness of stemming in English is a recurring issue in a number of studies [3,4]. Here we consider the effectiveness of stemming for nine European languages. Another example of a language-dependent approach is the use of decompounding strategies for compound-rich European languages, such as Dutch and German [5]. Compounds formed by the concatenation of words are rare in English, although exceptions like database exist.
We will also investigate the effectiveness of language-independent approaches to document retrieval, i.e., approaches that do not depend on knowledge of the language at hand. The best known example of language-independent approaches is the use of character n-gramming techniques. Finally, we will investigate whether both approaches to document retrieval can be fruitfully combined [6]. Hoping to establish the robustness and effectiveness of these approaches for a whole range of cross-lingual retrieval tasks, we supplement the monolingual retrieval experiments with bilingual retrieval experiments, with multilingual experiments, and with domain-specific experiments. Experimental evaluation is done on the test suite of the Cross-Language Evaluation Forum [7].

The paper is organized as follows. In Section 2 we describe the FlexIR system as well as the approaches used for all of the cross-lingual retrieval tasks. In Section 3 we discuss our experiments for monolingual retrieval (in Section 3.1), bilingual retrieval (in Section 3.2), multilingual retrieval (in Section 3.3), and domain-specific retrieval (in Section 3.4). Finally, in Section 4, we offer some conclusions drawn from our experiments.

2 System Description

2.1 Retrieval Approach

All retrieval runs used FlexIR, an information retrieval system developed at the University of Amsterdam [5]. The main goal underlying FlexIR’s design is to facilitate flexible experimentation with a wide variety of retrieval components and techniques. FlexIR is implemented in Perl and supports many types of preprocessing, scoring, indexing, and retrieval tools.

Retrieval Model. FlexIR supports several retrieval models, including the standard vector space model, language models, and probabilistic models. All runs reported in the paper use the vector space model with the Lnu.ltc weighting scheme [8] to compute the similarity between a query and a document. For all the experiments, we fixed the slope at 0.2; the pivot was set to the average number of unique words per document.

Morphological
Normalization. We apply a range of language-dependent and language-independent approaches to morphological normalization or tokenization.

Words: We consider as a baseline the straightforward indexing of the words as encountered in the collection. We do some limited sanitizing: diacritics are mapped to the unmarked character, and all characters are put in lower-case. Thus a string like ‘Information Retrieval’ is indexed as ‘information retrieval’ and a string like the German ‘Raststätte’ (English: motorway restaurant) is indexed as ‘raststatte.’

Stemming: The stemming or lemmatization of words is the most popular language-dependent approach to document retrieval. We use the set of stemmers implemented in the Snowball language [9]. Thus a string like ‘Information Retrieval’ is indexed as the stems ‘inform retriev.’ An overview of stemming algorithms can be found in [10]. The string processing language Snowball is specifically designed for creating stemming algorithms for use in Information Retrieval. It is partly based on the familiar Porter stemmer for English [11], and provides stemming algorithms for all the nine European languages that we consider in this paper. We perform the same sanitizing operations as for the word-based run.

Decompounding: For the compound rich languages, Dutch, German, Finnish, and Swedish, we apply a decompounding algorithm. We treat all words occurring in the CLEF corpus as potential base words for decompounding, and also use their associated collection frequencies. We ignore words of length less than four characters as potential compound parts, thus a compound must consist of at least eight characters. As a safeguard against oversplitting, we only regard compound parts that have a higher collection frequency than the compound itself. We consider linking elements -s-, -e-, and -en- for Dutch; -s-, -n-, -e-, and -en- for German; -s-, -e-, -u-, and -o- for Swedish; and none for Finnish. We prefer a split with no linking element over a split with a linking element, and a split with a single
character linker over a two character linker. Each document in the collection is analyzed and if a compound is identified, the compound is kept in the document and all of its parts are added to the document. Thus a string like the Dutch ‘boekenkast’ (English: bookshelf) is indexed as ‘boekenkast boek kast.’ Compounds occurring in a query are analyzed in a similar way: the parts are simply added to the query. Since we expand both the documents and the queries with compound parts, there is no need for compound formation [12].

n-Gramming: Character n-gramming is the most popular language-independent approach to document retrieval. Our n-grams were not allowed to cross word boundaries. This means that the string ‘Information Retrieval’ is indexed as the fourteen 4-gram tokens ‘info nfor form orma rmat mati atio tion retr etri trie riev ieva eval’. We experimented with two n-gram approaches. First, we replaced the words with their n-grams. Second, we added the n-grams to the documents but kept the original words as well.

Character n-grams are an old technique for improving retrieval effectiveness.
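The decompounding and n-gramming steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the FlexIR implementation (which is written in Perl): the function names, the toy frequency table, the restriction to a single binary split, the leftmost-split tie-break, and the choice to keep words shorter than n as whole tokens are all assumptions made for the example.

```python
import unicodedata


def sanitize(text):
    """Lower-case and map marked characters to their unmarked form."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_ngrams(text, n=4, keep_words=False):
    """Character n-grams that never cross word boundaries.

    keep_words=False replaces each word by its n-grams;
    keep_words=True keeps the word and adds its n-grams.
    Assumption: a word shorter than n is kept as a single token.
    """
    tokens = []
    for word in sanitize(text).split():
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if keep_words or not grams:
            tokens.append(word)
        tokens.extend(grams)
    return tokens


def split_compound(word, freq, linkers=("", "s", "e", "en"), min_part=4):
    """Return (head, tail) for the preferred binary split, or None.

    freq maps words to collection frequencies (hypothetical data here).
    Shorter linking elements are tried first, so a split without a linker
    wins over a one-character linker, which wins over a two-character one.
    Assumption: among equally preferred splits the leftmost one is taken.
    """
    own = freq.get(word, 0)
    for linker in sorted(linkers, key=len):
        last_start = len(word) - min_part - len(linker)
        for i in range(min_part, last_start + 1):
            if word[i:i + len(linker)] != linker:
                continue
            head, tail = word[:i], word[i + len(linker):]
            # Both parts must be known words that are more frequent than
            # the compound itself (the safeguard against oversplitting).
            if freq.get(head, 0) > own and freq.get(tail, 0) > own:
                return head, tail
    return None


def expand_compounds(tokens, freq, **kwargs):
    """Keep every token and append the parts of each recognised compound."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        parts = split_compound(token, freq, **kwargs)
        if parts:
            expanded.extend(parts)
    return expanded


if __name__ == "__main__":
    print(" ".join(char_ngrams("Information Retrieval")))
    toy_freq = {"boekenkast": 3, "boek": 120, "kast": 45}  # invented counts
    print(expand_compounds(["boekenkast"], toy_freq))
```

With the invented counts above, the compound is expanded to its two parts as in the ‘boekenkast’ example. If ‘boeken’ were also listed as a frequent word, the preference for splits without a linking element would select ‘boeken’ + ‘kast’ instead, which shows how strongly the outcome depends on the collection vocabulary.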
An excellent overview of n-gramming techniques for cross-lingual information retrieval is given in [13]. Again, we perform the same sanitizing operations as for the word-based run.

Character Encodings. Until CLEF 2003, the languages of the CLEF collections all used the Latin alphabet. The addition of the new CLEF language, Russian, is challenging because of the use of a non-Latin alphabet. The Cyrillic characters used in Russian can appear in a variety of font encodings. The collection and topics are encoded using the UTF-8 or Unicode character encoding. We converted the UTF-8 encoding into a 1-byte per character encoding, KOI8 or KOI8-R (for Kod Obmena Informatsii or Code of Information Exchange).¹ We did all our processing, such as lower-casing, stopping, stemming, and n-gramming, on documents and queries in this KOI8 encoding. Finally, to ensure the proper indexing of the documents using our standard architecture, we converted the resulting documents into the Latin alphabet using the Volapuk transliteration. We processed the Russian queries similarly to the documents.
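The encoding conversion described above can be illustrated with Python's built-in codecs. This is only a sketch under stated assumptions: the authors used the Perl package Convert::Cyrillic, whereas this example swaps in Python's standard `koi8_r` codec; the function name is hypothetical; and the final Volapuk transliteration step is not shown.

```python
def to_koi8r(utf8_bytes):
    """Decode UTF-8 input, lower-case it, and re-encode it as KOI8-R.

    KOI8-R stores each Cyrillic letter in a single byte, whereas UTF-8
    needs two bytes per Cyrillic letter.
    """
    text = utf8_bytes.decode("utf-8").lower()
    # errors="replace" substitutes '?' for characters KOI8-R cannot represent.
    return text.encode("koi8_r", errors="replace")


if __name__ == "__main__":
    sample = "Информация".encode("utf-8")
    converted = to_koi8r(sample)
    print(len(sample), len(converted))   # byte lengths before and after
    print(converted.decode("koi8_r"))    # round trip back to text
```

Lower-casing is done on the decoded text here, before re-encoding; the source states only that lower-casing was performed on the KOI8 data, so the order of these two steps is a simplification.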
Stopwords. Both topics and documents were stopped using the stopword lists from the Snowball stemming tool [9]; for Finnish we used the Neuchâtel stop list [14]. Additionally, we removed topic specific phrases such as ‘Find documents that discuss ...’ from the queries. We did not use a stop stem or n-gram list, but we first used a stop word list, and then stemmed/n-grammed the topics and documents.

Blind Feedback. Blind feedback was applied to expand the original query with related terms. Term weights were recomputed by using the standard Rocchio method [15], where we considered the top 10 documents to be relevant and the bottom 500 documents to be non-relevant. We allowed at most 20 terms to be added to the original query.

Combination Methods. For each of the CLEF 2003 languages we created base runs using a variety of indexing methods (see below). We then combined these base runs using one of two methods, either a weighted or an unweighted combination. An extensive overview of combination methods for cross-lingual information retrieval is given in [16].

The weighted combination was produced as follows. First, we normalized the retrieval status values (RSVs), since different runs may have radically different RSVs. For each run we rescaled these values to [0, 1] using:

\[ RSV'_i = \frac{RSV_i - \min_i}{\max_i - \min_i}; \]

this is the Min Max Norm considered in [17]. Next, we assigned new weights to the documents using a linear interpolation factor \lambda representing the relative weight of a run:

\[ RSV_{new} = \lambda \cdot RSV_1 + (1 - \lambda) \cdot RSV_2. \]

For \lambda = 0.5 this is similar to the simple (but effective) combSUM function used by Fox and Shaw [18]. The interpolation factors \lambda were obtained from experiments on the CLEF 2002 data sets (whenever available). When we combine more than two runs, we give all runs the same relative weight, resulting effectively in the familiar combSUM method.

¹ We used the excellent Perl package Convert::Cyrillic for conversion between character encodings and for lower-casing Cyrillic characters.

Statistical Significance. Finally, to determine whether the
observed differences between two retrieval approaches are statistically significant, we used the bootstrap method, a non-parametric inference test [19,20]. We take 100,000 resamples, and look for significant improvements (one-tailed) at significance levels of 0.95, 0.99, and 0.999.

3 Experiments

In this section, we describe our experiments for the monolingual task, the bilingual task, the multilingual task, and the domain-specific task.

3.1 Monolingual Retrieval

For the monolingual task, we conducted experiments with a number of language-dependent and language-independent approaches to document retrieval. All our monolingual runs used the title and description fields of the topics.

Baseline. Our baseline run straightforwardly indexes the words as encountered in the collection (with case-folding and mapping marked characters to the unmarked symbol). The mean-average-precision (MAP) scores are shown in Table 1. The baseline run is a fairly high-performing run for most languages. In particular, Dutch with a MAP of 0.4800 performs relatively well.

Table 1. Word-based run.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485

Stemming. For all nine languages, we use a stemming algorithm from the Snowball family [9] (see Section 2). The results are shown in Table 2. The results are mixed. On the one hand, we see a decrease in retrieval effectiveness for Dutch, English, and Russian. On the other hand, we see an increase in retrieval effectiveness for Finnish, French, German, Italian, Spanish, and Swedish. The improvements for Finnish, German, and Spanish are statistically significant.

Table 2. Snowball stemming algorithm.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485
Stems         0.4652   0.4273   0.3998   0.4511   0.4504   0.4726   0.2536   0.4678   0.3707
% Ch.         -3.1     -4.7     +25.9    +4.6     +19.0    +2.1     -0.6     +6.2     +6.4
Stat.         -        -        sig.     -        sig.     -        -        sig.     -

Decompounding. Compounds are split using the method described in Section 2. We decompound documents and
queries for the four compound-rich languages: Dutch, Finnish, German, and Swedish. After decompounding, we apply the same stemming procedure as above. The results are shown in Table 3. The results for decompounding are positive overall. We now see an improvement for Dutch, and further improvement for Finnish, German, and Swedish.

Table 3. Decompounding.
              Dutch    Finnish  German   Swedish
Words         0.4800   0.3175   0.3785   0.3485
Split+Stem    0.4984   0.4453   0.4840   0.3957
% Ch.         +3.8     +40.3    +27.9    +13.5
Stat.         -        sig.     sig.     -

Our results indicate that for all four compound forming languages, Dutch, Finnish, German, and Swedish, we should decompound before stemming. We treat the resulting (compound-split and) stem runs as a single language-dependent approach, where we only decompound the four compound-rich languages. The results are shown in Table 4. These resulting (compound-split and) stem runs improve for all languages, except for English and the low-performing Russian.

Table 4. (Compound splitting and) stemming algorithms.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485
Split+Stem    0.4984   0.4273   0.4453   0.4511   0.4840   0.4726   0.2536   0.4678   0.3957
% Ch.         +3.8     -4.7     +40.3    +4.6     +27.9    +2.1     -0.6     +6.2     +13.5
Stat.         -        -        sig.     -        sig.     -        -        sig.     -

n-Gramming. Both topic and document words are n-grammed, using the settings discussed in Section 2. For all languages we use 4-grams, that is, character n-grams of length 4. The results for replacing the words with n-grams are shown in Table 5.

Table 5. 4-Gramming.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485
4-Grams       0.4488   0.3731   0.4676   0.4142   0.4639   0.3883   0.2871   0.4545   0.3751
% Ch.         -6.5     -16.8    +47.3    -4.0     +22.6    -16.2    +12.5    +3.2     +7.6
Stat.         -        sig.     sig.     -        sig.     sig.     -        -        -

We see a decrease in performance for four languages: Dutch, English, French, and Italian, and an improvement for the other five languages: Finnish, German, Russian, Spanish, and Swedish. The increase in retrieval effectiveness is statistically significant for Finnish and German, the decrease in performance
is significant for English and Italian. The results are mixed, and the technique of character n-gramming is far from being a panacea.

We explore a second language-independent approach, by adding the n-grams to the free-text of the documents, rather than replacing the free-text with n-grams. The results of adding n-grams are shown in Table 6.

Table 6. 4-Gramming while retaining words.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485
Word+4-Gr.    0.4996   0.4119   0.4905   0.4616   0.5005   0.4227   0.3030   0.4733   0.4187
% Ch.         +4.1     -8.1     +54.5    +7.0     +32.2    -8.7     +18.8    +7.4     +20.1
Stat.         -        sig.     sig.     -        sig.     -        sig.     sig.     sig.

The runs improve over pure n-grams for all the nine languages. With respect to the words baseline, we see a decrease in performance for English and Italian, and an improvement for the other seven languages: Dutch, Finnish, French, German, Russian, Spanish, and Swedish. The deviating behavior for Italian may be due to the different ways of encoding marked characters in the Italian sub-collections [7]. Improvements are significant for five of the languages, namely Finnish, German, Russian, Spanish, and Swedish. However, the decrease in performance for English remains significant too.

Combining. It is clear from the results above that there is no unequivocal best strategy for monolingual document retrieval. For English, our baseline run scores best. For Italian, the stemmed run scores best. For the other seven languages, Word+4-Gramming scores best. Here, we consider the combination of language-dependent and language-independent approaches to document retrieval. We apply a weighted combination method, also referred to as linear fusion. From the experiments above we select the approaches that exhibit the best overall performance:
The best language-dependent approach is to decompound for Dutch, Finnish, German, and Swedish, and then apply a stemming algorithm.
The best language-independent approach is to add n-grams while retaining the original words.
In particular, we combine the (compound split
and) stem run of Table 3 with the Word+4-Gram run of Table 6. The interpolation factors used are based on experiments using the CLEF 2002 test suite (whenever available). We used the following relative weights of the n-gram run: 0.25 (Dutch), 0.4 (English), 0.51 (Finnish), 0.66 (French), 0.36 (German), 0.405 (Italian), 0.60 (Russian), 0.35 (Spanish), and 0.585 (Swedish).

Table 7. Combination of (Compound-splitting and) Stemming and adding 4-Grams.
              Dutch    English  Finnish  French   German   Italian  Russian  Spanish  Swedish
Words         0.4800   0.4483   0.3175   0.4313   0.3785   0.4631   0.2551   0.4405   0.3485
Combination   0.5072   0.4575   0.5236   0.4888   0.5091   0.4781   0.2988   0.4841   0.4371
% Ch.         +5.7     +2.1     +64.9    +13.3    +34.5    +3.2     +17.1    +9.9     +25.4
Stat.         -        -        sig.     sig.     sig.     -        sig.     sig.     sig.

The results are shown in Table 7. We find only positive results: all languages improve over the baseline, even English! Even though both English runs scored lower than the baseline (one of them even significantly lower), the combination improves over the baseline. The improvements for six of the languages, Finnish, French, German, Russian, Spanish, and Swedish, are significant. All languages except Russian improve over the best run using a single index.

3.2 Bilingual Retrieval

We restrict our attention here to bilingual runs using the English topic set.
All our bilingual runs used the title and description fields of the topics. We experimented with the WorldLingo machine translation [21] for translations into Dutch, French, German, Italian, and Spanish. For translation into Russian we used the PROMT-Reverso machine translation [22]. For translations into Swedish, we used the first-mentioned translation in the Babylon on-line dictionary [23]. Since we use the English topic set, the results for English are the monolingual runs discussed above in Section 3.1. We also ignore English to Finnish retrieval for lack of an acceptable automatic translation method. Thus, we focus on seven European languages.

We created the exact same set of runs as for the monolingual retrieval task described above: a word-based baseline run; a stemmed run with decompounding for Dutch, German, and Swedish; a words+4-gram run; and a weighted combination of words+4-gram and (split and) stem runs. We use the following relative weights of the words+4-gram run: 0.6 (Dutch), 0.7 (French), 0.5 (German), 0.6 (Italian), 0.6 (Russian), 0.5 (Spanish), and 0.8 (Swedish).

Table 8. Bilingual runs using EN topic set. Best scores are marked with *. We compare the best scoring run with the word-based baseline run.
               Dutch     French    German    Italian   Russian   Spanish   Swedish
Words          0.3554    0.3547    0.3378    0.3810    0.1379    0.3246    0.1187
(Split+)Stem   0.4043*   0.3567    0.3968    0.3860    0.2270*   0.3588    0.1898
Word+4-Grams   0.3690    0.3762    0.4228    0.3801    0.1983    0.3775    0.2371
Combination    0.3971    0.3951*   0.4479*   0.3927*   0.2195    0.3888*   0.2478*
% Change       +13.8     +11.4     +32.6     +3.1      +64.6     +19.8     +108.8
Stat. Sign.    sig.      -         sig.      -         sig.      sig.      sig.

Table 8 shows our MAP scores for the English to Dutch, French, German, Italian, Russian, Spanish, and Swedish bilingual runs. For our official runs for the 2003 bilingual task, we refer the reader to [24]. Adding 4-grams improves retrieval effectiveness over the word-based baseline for all languages except Italian (which exhibits a marginal drop in performance). The stemmed runs (decompounded for Dutch, German, and Swedish) improve for all seven languages. The Dutch stemmed and decompounded run and the Russian stemmed run turn out to be particularly effective, and outperform the respective n-gram and combination runs. A conclusion on the effectiveness of the Russian stemmer based only on the earlier monolingual evidence would have been premature. Although the stemmer failed to improve retrieval effectiveness for the monolingual Russian task, it is effective for the bilingual Russian task. For the other five languages (French, German, Italian, Spanish, and Swedish) the combination of stemming and n-gramming results in the best bilingual performance. The best performing run significantly improves over the word-based baseline for five of the seven languages: Dutch, German, Russian, Spanish, and Swedish.

The results on the English topic set are, as expected, somewhat lower than the monolingual runs. Table 9 shows the decrease in effectiveness of the best bilingual run compared to the best monolingual run for the respective target language. The difference ranges from a 12% decrease (German) to a 43% decrease (Swedish) in MAP score. The big gap in performance for Swedish is most likely a result of the use of a translation dictionary, rather than a proper machine translation. The results for the other languages seem quite acceptable, considering that we used a simple, straightforward machine translation for the bilingual tasks [21]. The bilingual results do, in general, confirm the results obtained for the monolingual task. This increases our confidence in the effectiveness and robustness of the language-dependent and language-independent approaches employed for building the indexes.

Table 9. Decrease in effectiveness for bilingual runs.
                  Dutch     French    German    Italian   Russian   Spanish   Swedish
Best monolingual  0.5072    0.4888    0.5091    0.4781    0.3030    0.4841    0.4371
Best bilingual    0.4043    0.3951    0.4479    0.3927    0.2270    0.3888    0.2478
% Change          -20.3     -19.2     -12.0     -17.9     -25.1     -19.7     -43.3

3.3 Multilingual Retrieval

We used the English topic set for our multilingual runs, using only the title and description fields of the topics. We use the English monolingual run (see Section 3.1) and the English to Dutch, French, German, Italian, Spanish, and Swedish bilingual runs (see Section 3.2) to construct our multilingual runs. There are two different multilingual tasks. The small multilingual task uses four languages: English, French, German, and Spanish. The large multilingual task extends this set with four additional languages: Dutch, Finnish, Italian, and Swedish. Recall from our bilingual experiments in Section 3.2 that we do not have an English to Finnish bilingual run, and that our English to Swedish bilingual runs perform somewhat lower due to the use of a translation dictionary. This prompted the following three sets of experiments:
1. on the four languages of the small multilingual task (English, French, German, and Spanish),
2. on the six languages for which we have an acceptable machine translation (also including Dutch and Italian), and
3. on the seven languages (also including Swedish, but no Finnish documents) for which we have, at least, an acceptable bilingual dictionary.

For each of these experiments, we build a number of combined runs, where we use the unweighted combSUM rule introduced by [18]. First, we combine a single, uniform run per language, in all cases the bilingual words+4-gram run (see Sections 3.1 and 3.2). Second, we again use a single run per language, the weighted combination of the words+4-gram and (Split+)Stem run (see Sections 3.1 and 3.2). Third, we form a big pool of runs, two per language: the Word+4-Grams
runs and the (Split+)Stem runs.

Table 10 shows our multilingual MAP scores for the small multilingual task (covering four languages) and for the large multilingual task (covering eight languages). For all multilingual experiments, first making a weighted combination per language outperforms the unweighted combination of all Word+4-Grams runs and all (Split+)Stem runs. However, as we add languages, we see that the unweighted combination of all Word+4-Grams runs and all (Split+)Stem runs performs almost as well as the weighted combinations.

Table 10. Overview of MAP scores for multilingual runs.

                                      Multi-4   Multi-8            Multi-8
                                                (without FI/SV)    (without FI)
Word+4-Gram                           0.2953    0.2425             0.2475
Combined Word+4-Gram/(Split+)Stem     0.3341    0.2806             0.2860
Both Word+n-Gram and (Split+)Stem     0.3292    0.2764             0.2843

Our results show that multilingual retrieval on a subpart of the collection (leaving out one or two languages) can still be an effective strategy. However, the results also indicate that the inclusion of further languages does consistently improve MAP scores.

3.4 Domain-specific Retrieval

For our domain-specific retrieval experiments, we used the German Information Retrieval Test-database (GIRT). We focus on monolingual experiments using the German topics and the German collection. We used the title and description fields of the topics, and the title and abstract fields of the collection. We also experimented with a reranking strategy based on the keywords assigned to the documents; the resulting rerank runs additionally use the controlled-vocabulary fields in the collection.

We make three different indexes mimicking the settings used for our monolingual German experiments discussed in Section 3.1. First, we make a word-based index as used in our baseline runs. Second, we make a stemmed index in which we did not use a decompounding strategy. Third, we build a Word+4-Grams index.

Table 11 contains our MAP scores for the GIRT monolingual task. The results for the GIRT tasks show the effectiveness of stemming and n-gramming approaches over a plain word index. Notice also that the performance of German domain-specific retrieval is somewhat lower than that of German monolingual retrieval.

Table 11. Overview of MAP scores for GIRT runs.

                   GIRT     % Change   Stat. sign.
Words (baseline)   0.2360
Stems              0.2832   +20.0
Word+4-Grams       0.3449   +46.1

The main aim of our domain-specific experiments is to find ways to exploit the manually assigned keywords in the collection. These keywords are based on the controlled-vocabulary thesaurus maintained by GESIS [25]. In particular, we experiment with an improved version of the keyword-based reranking strategy introduced in [6]. We calculate vectors for the keywords based on their (co)occurrences in the collection. The main innovation is in the use of higher dimensional vectors for the keywords, for which we use the best reduction onto a 100-dimensional Euclidean space. The reranking strategy is as follows. We calculate vectors for all initially retrieved documents, by simply taking the mean of the vectors of the keywords assigned to the documents. We calculate a vector for a topic by taking the relevance-weighted mean of the vectors of the top 10 retrieved documents. We now have a vector for each of the topics, and for each of the retrieved documents. Thus, ignoring the RSV of the retrieved documents, we can simply rerank all documents by the Euclidean distance between the document and topic vectors. Next, we combine the original text-based similarity scores with the keyword-based distances using the unweighted combSUM rule of [18].

The results of the reranking strategy are shown in Table 12.

Table 12. Overview of MAP scores for GIRT runs. We compare the rerank runs with the respective original runs.

               GIRT baseline   Rerank   % Change   Stat. sign.
Words          0.2360          0.2863   +21.31%
Stems          0.2832          0.3361   +18.68%
Word+4-Grams   0.3449          0.3993   +15.77%

For all three index approaches, the results are positive. There is a significant improvement of retrieval effectiveness due to the keyword-based reranking method.
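The reranking steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the reported runs. Several details are assumptions that the text does not specify: the function names (`weighted_mean`, `min_max`, `keyword_rerank`) are hypothetical, the "relevance-weighted" mean is taken to use the RSV as weight, distances are negated and min-max normalized before being summed with the normalized text scores, every document is assumed to have at least one keyword, and RSVs are assumed to be positive.

```python
import math


def weighted_mean(vectors, weights):
    """Component-wise weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def min_max(scores):
    """Scale a {doc: score} map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def keyword_rerank(rsv, doc_keywords, keyword_vecs, top_k=10):
    """Rerank retrieved documents using keyword vectors plus combSUM."""
    # Document vector: plain mean of the vectors of its assigned keywords.
    doc_vecs = {}
    for d in rsv:
        vecs = [keyword_vecs[k] for k in doc_keywords[d]]
        doc_vecs[d] = weighted_mean(vecs, [1.0] * len(vecs))

    # Topic vector: mean of the top-k document vectors,
    # weighted by RSV (assumed reading of "relevance-weighted").
    top = sorted(rsv, key=rsv.get, reverse=True)[:top_k]
    topic_vec = weighted_mean([doc_vecs[d] for d in top],
                              [rsv[d] for d in top])

    # Smaller distance means a better match, so negate it (assumption).
    keyword_sim = {d: -math.dist(doc_vecs[d], topic_vec) for d in rsv}

    # Unweighted combSUM: sum of the normalized scores of both runs.
    text, kw = min_max(rsv), min_max(keyword_sim)
    combined = {d: text[d] + kw[d] for d in rsv}
    return sorted(combined, key=combined.get, reverse=True)


# Toy data (hypothetical): 2-dimensional keyword vectors instead of 100.
keyword_vecs = {"labor": [1.0, 0.0], "market": [0.9, 0.1], "family": [0.0, 1.0]}
doc_keywords = {"d1": ["labor", "market"], "d2": ["family"], "d3": ["labor"]}
rsv = {"d1": 3.0, "d2": 2.0, "d3": 1.0}

ranking = keyword_rerank(rsv, doc_keywords, keyword_vecs)
print(ranking)  # ['d1', 'd3', 'd2']
```

In the toy example the text-only ranking is d1, d2, d3. Document d3 shares its keyword with the top-ranked document, so its vector lies close to the topic vector and the combined score moves it ahead of d2.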
The obtained improvement is additional to the improvement due to blind feedback, and consistent even for high performing base runs.

4 Conclusions

This paper investigated the effectiveness of language-dependent and language-independent approaches to cross-lingual text retrieval. The experiments described in this paper indicate the following. First, morphological normalization does improve retrieval effectiveness, especially for languages that have a more complex morphology than English. We also showed that n-gram-based approaches can be a viable option in the absence of linguistic resources to support deep morphological normalization. Although no panacea, the combination of runs provides a method that may help improve base runs, even high quality base runs. The interpolation factors required for the best gain in performance seem to be fairly robust across topic sets. Moreover, the effectiveness of the unweighted combination of runs is usually close to the weighted combination, and the difference seems to diminish with the number of runs being combined. Our bilingual experiments showed that a simple machine translation strategy can be effective for bilingual