A stemming procedure and stopword list for general French corpora
English Majors' Word List: Idioms and Expressions

Connecticut: Connecticut (a U.S. state)
retail: to sell directly to consumers; retail
shroud: to cover; to veil
slumber: sleep; peaceful sleep
boulevard: a broad street; avenue
precinct: a district; an area
sedan: a sedan car
shriek: to scream sharply
stagger: to walk unsteadily; totter
slump: to fall sharply; to drop
deliberation: careful thought; study
sheepishly: timidly; bashfully
knowingly: deliberately; with full awareness
subtle: delicate; not obvious
specify: to state explicitly; to designate
attire: dress; formal clothing
spurn: to reject scornfully; discard
Mac: buddy; pal (informal form of address)
buddy: pal; mate (informal form of address)
executive: an administrator; a manager
attaché case: a briefcase
pathological: of pathology; relating to disease
rise to the occasion: to cope with a situation resourcefully; to adapt as circumstances demand
mechanism: a mechanism; machinery
look out: to watch out; be careful
pick up: to lift; to learn through practice
jury: a jury
tip off: to fall off by tipping; to give a hint or warning
turn on: to switch on; to excite
ambiguous: unclear in meaning; open to more than one interpretation
single out: to pick out; select
embarrassment: awkwardness; discomfiture
make up: to invent; to make peace; to compensate
staff: to provide with personnel
slip into: to slide or tuck into; to enter quietly
cringe: to shrink back; recoil
go into: to enter; to take up (a profession)
arbitrary: based on whim; capricious
Artificial Intelligence and Machine Learning Practice Questions (Paper No. 221)

1. [Single choice] The class-label column used in classification is ( ).
A) numeric class values  B) distinct categories  C) values with meaningful order and magnitude
Answer: B

2. [Single choice] Principal component analysis is used for ( ).
A) feature dimensionality reduction  B) feature expansion  C) feature-subset computation
Answer: A

3. [Single choice] Training a classification model requires ( ).
A) a training set  B) training and test sets  C) training, validation, and test sets
Answer: C

4. [Single choice] If we say a "linear regression" model fits the training samples perfectly (zero training error), which of the following statements is correct?
A) The test error is always zero  B) The test error cannot be zero  C) None of the above
Answer: C
Explanation: Zero training error tells us nothing about whether the test error is zero. It is worth noting that a model that fits the training samples perfectly has very likely overfitted and will not generalize well.

5. [Single choice] A Task is the unit of work that runs on an Executor. On which of the following does it run ( )?
A) Driver program  B) Spark master  C) worker node  D) Cluster manager
Answer: C

6. [Single choice] A hidden Markov model can be applied to which of the following kinds of data ( )?
A) gene data  B) movie-review data  C) stock-market prices  D) all of the above
Answer: D
Explanation: This question tests which class of problems the hidden Markov model suits. The hidden Markov model (HMM) is a probabilistic model of time sequences: a hidden Markov chain randomly generates an unobservable sequence of states, and each state in turn generates an observation, producing an observed random sequence. HMMs are therefore suited to time-series problems.

7. [Single choice] Which description of "emergence" in big data is NOT correct ( )?
A) Security emergence is a big-data emergence phenomenon
B) Small data may have no value, yet the big data composed of small data is very valuable; this is called value emergence
C) Small data may have no quality problems, but big data will develop quality problems; this is called quality emergence
D) Small data may not involve privacy, but big data may seriously threaten personal privacy; this is called privacy emergence
Answer: C

8. [Single choice] Which function is used to receive data entered by the user ( )?
A) accept()  B) input()  C) readline()  D) login()
Answer: B

9. [Single choice] The learning goal of ( ) is to produce a decision tree with strong generalization ability, i.e., one that handles unseen examples well.
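Questions 2 and 3 above pair naturally in practice: features are reduced with PCA and the model is evaluated on held-out splits. A minimal scikit-learn sketch of both ideas (synthetic data; every name here is illustrative, not from the exam):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(300, 20)                   # 300 samples, 20 raw features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels

    # Question 2: PCA is used for feature dimensionality reduction.
    X_low = PCA(n_components=5).fit_transform(X)

    # Question 3: training uses train / validation / test sets.
    X_tmp, X_test, y_tmp, y_test = train_test_split(X_low, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

    clf = LogisticRegression().fit(X_train, y_train)
    print("validation accuracy:", clf.score(X_val, y_val))
    print("test accuracy:", clf.score(X_test, y_test))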

CS276B: Text Information Retrieval, Mining, and Exploitation
Lecture 5, 23 January 2003

Recap / Today's topics:
Feature selection for text classification; measuring classification performance; nearest-neighbor categorization; term clustering; dimension reduction (PCA / SVD).

Word representations vs. dimension reduction:
- Word representations: one dimension for each word (binary, count, or weight).
- Dimension reduction: each dimension is a unique linear combination of all words (linear case).
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("ruanda"). Why?

Results:
- Similar results for LLSF (regression).
- Why is selecting common terms a good strategy?
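The contrast above, one axis per word versus axes that are linear combinations of all words, is what a truncated SVD of a term-document matrix produces. A minimal sketch with a toy matrix (all data illustrative):

    import numpy as np

    # Toy term-document matrix: rows = words, columns = documents.
    terms = ["election", "vote", "politics", "goal", "match"]
    A = np.array([
        [2, 3, 0, 0],
        [1, 2, 0, 1],
        [3, 1, 0, 0],
        [0, 0, 2, 3],
        [0, 1, 3, 2],
    ], dtype=float)

    # Truncated SVD: keep k latent dimensions instead of one per word.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the k-dim latent space
    print(doc_vectors.round(2))                # each latent axis mixes all words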
Linguistics, Chapter 4 Syntax: Exercises with Answers

Reference answers to the Chapter 4 exercises in 《新编简明英语语言学教程》 (2nd edition).

Chapter 4 Syntax

1. What is syntax?
Syntax is a branch of linguistics that studies how words are combined to form sentences and the rules that govern the formation of sentences.

2. What is a phrase structure rule?
The grammatical mechanism that regulates the arrangement of elements (i.e. specifiers, heads, and complements) that make up a phrase is called a phrase structure rule. The phrase structure rules for NP, VP, AP, and PP can be written as follows:
NP → (Det) N (PP) ...
VP → (Qual) V (NP) ...
AP → (Deg) A (PP) ...
PP → (Deg) P (NP) ...
We can formulate a single general phrase structure rule in which X stands for the head N, V, A or P.

3. What is a category? How do we determine a word's category?
Category refers to a group of linguistic items which fulfill the same or similar functions in a particular language, such as a sentence, a noun phrase or a verb.
To determine a word's category, three criteria are usually employed, namely meaning, inflection and distribution.
For a fuller answer, add the following: word categories often bear some relationship to meaning. The meanings associated with nouns and verbs can be elaborated in various ways. The property or attribute of the entities denoted by nouns can be elaborated by adjectives. For example, when we say "that pretty lady", we are attributing the property "pretty" to the lady designated by the noun. Similarly, the properties and attributes of the actions, sensations and states designated by verbs can typically be denoted by adverbs. For example, in "Jenny left quietly" the adverb "quietly" indicates the manner of Jenny's leaving.
The second criterion for determining a word's category is inflection. Words of different categories take different inflections. Nouns such as boy and desk take the plural affix -s. Verbs such as work and help take the past tense affix -ed and the progressive affix -ing. Adjectives like quiet and clever take the comparative affix -er and the superlative affix -est. Although inflection is very helpful in determining a word's category, it does not always suffice. Some words do not take inflections. For example, nouns like moisture and fog do not usually take the plural suffix -s, and adjectives like frequent and intelligent do not take the comparative and superlative affixes -er and -est.
The last and more reliable criterion for determining a word's category is its distribution, that is, what type of elements can co-occur with a certain word. For example, nouns typically appear with a determiner (the girl, a card), verbs with an auxiliary (should stay, will go), and adjectives with a degree word (very cool, too bright).
A word's distributional facts, together with information about its meaning and its inflectional capabilities, help identify its syntactic category.

4. What is a coordinate structure and what properties does it have?
The structure formed by joining two or more elements of the same type with the help of a conjunction is called a coordinate structure.
Coordination exhibits four important properties:
1) There is no limit on the number of coordinated categories that can appear prior to the conjunction.
2) A category at any level (a head or an entire XP) can be coordinated.
3) Coordinated categories must be of the same type.
4) The category type of the coordinate phrase is identical to the category type of the elements being conjoined.

5. What elements does a phrase contain and what role does each element play?
A phrase usually contains the following elements: head, specifier and complement. Sometimes it also contains another kind of element termed modifier.
The role each element plays:
Head: the head is the word around which a phrase is formed.
Specifier: the specifier has both special semantic and syntactic roles. Semantically, it helps to make the meaning of the head more precise. Syntactically, it typically marks a phrase boundary.
Complement: complements are themselves phrases and provide information about entities and locations whose existence is implied by the meaning of the head.
Modifier: modifiers specify optionally expressible properties of heads.

6. What is deep structure and what is surface structure?
There are two levels of syntactic structure. The first, formed by the XP rule in accordance with the head's subcategorization properties, is called deep structure (or D-structure). The second, corresponding to the final syntactic form of the sentence which results from appropriate transformations, is called surface structure (or S-structure).

(For the following questions only a preliminary constituent analysis is given; tree diagrams are not drawn. For reference only.)

7. Indicate the category of each word in the following sentences.
a) The old lady got off the bus carefully. (Det A N V P Det N Adv)
b) The car suddenly crashed onto the river bank. (Det N Adv V P Det N)
c) The blinding snowstorm might delay the opening of the schools. (Det A N Aux V Det N P Det N)
d) This cloth feels quite soft. (Det N V Deg A)

8. The following phrases include a head, a complement, and a specifier. Draw the appropriate tree structure for each.
a) rich in minerals: XP(AP) → head (rich) A + complement (in minerals) PP
b) often read detective stories: XP(VP) → specifier (often) Qual + head (read) V + complement (detective stories) NP
c) the argument against the proposals: XP(NP) → specifier (the) Det + head (argument) N + complement (against the proposals) PP
d) already above the window: XP(PP) → specifier (already) Deg + head (above) P + complement (the window) NP
e) The apple might hit the man. S → NP (The apple) + Infl (might) + VP (hit the man)
f) He often reads detective stories. S → NP (He) + VP (often reads detective stories)

9. The following sentences contain modifiers of various types. For each sentence, first identify the modifier(s), then draw the tree structures. (Noun modifiers were marked in italics and verb modifiers underlined in the original.)
a) A crippled passenger landed the airplane with extreme caution.
b) A huge moon hung in the black sky.
c) The man examined his car carefully yesterday.
d) A wooden hut near the lake collapsed in the storm.

10. The following sentences all contain conjoined categories. Draw a tree structure for each of the sentences. (The conjoined categories were underlined in the original.)
a) Jim has washed the dirty shirts and pants.
b) Helen put on her clothes and went out.
c) Mary is fond of literature but tired of statistics.

11. The following sentences all contain embedded clauses that function as complements of a verb, an adjective, a preposition or a noun. Draw a tree structure for each sentence.
a) You know that I hate war.
b) Gerry believes the fact that Anna flunked the English exam.
c) Chris was happy that his father bought him a Rolls-Royce.
d) The children argued over whether bats had wings.

12. Each of the following sentences contains a relative clause. Draw the deep structure and the surface structure trees for each of these sentences.
a) The essay that he wrote was excellent.
b) Herbert bought a house that she loved.
c) The girl whom he adores majors in linguistics.

13. The derivations of the following sentences involve the inversion transformation. Give the deep structure and the surface structure of each of these sentences. (Surface structure first; the deep structure was given in italics in the original.)
a) Surface: Would you come tomorrow? Deep: you would come tomorrow
b) Surface: What did Helen bring to the party? Deep: Helen brought what to the party
c) Surface: Who broke the window? Deep: who broke the window
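The phrase structure rules in question 2 can be experimented with directly. A toy sketch using NLTK's context-free grammar tools; the grammar and sentence are illustrative, not from the textbook:

    import nltk

    # A tiny grammar in the spirit of NP -> (Det) N (PP), VP -> V (NP), PP -> P NP.
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N | Det N PP
    VP -> V NP | V NP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N -> 'girl' | 'argument' | 'proposals'
    V -> 'heard'
    P -> 'against'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = "the girl heard the argument against the proposals".split()
    for tree in parser.parse(sentence):
        print(tree)  # prints each constituent tree licensed by the rules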
Zhangzhou, Fujian Province: 2023-2024 Academic Year, Grade 11 Second Semester, July Final English Exam

I. Reading Comprehension

With the seasons changing, now is the perfect time to start planning summer experiences at outdoor theatres in parks, city squares, woodlands and even the side of a cliff! It's time to start planning for summer! There are plenty of outdoor theatres to choose from, and we've put together a list of the best ones to visit across the UK.

Regent's Park Open Air Theatre, London
No matter the weather, as far as I'm concerned when Regent's Park starts its theatre season, summer has truly begun. Surrounded by dense whistling trees and singing birds, you may as well be a thousand miles from the surrounding city. This year sees a typically varied programme with Twelfth Night opening, Fiddler on the Roof closing and plenty in between.

Thorington Theatre, Suffolk
A new and beautiful addition to the outdoor theatre world, opening in 2021, Thorington is built in a natural amphitheatre (圆形露天竞技场) hidden in the Suffolk woodlands. This year's programme includes family favourites such as Teddy Bear's Picnic and The Little Mermaid, Shakespeare's The Tempest as well as the mandatory A Midsummer Night's Dream.

Willow Globe, Powys
An attractive theatre, the Willow Globe's design is based on the London Globe Theatre, but it's about a third of the size and formed not from wood, but willow trees woven together. Their programme is largely filled with all things Shakespeare: straight plays, of course, but also plenty of fun introductions and interpretations to bring the Bard's words to new or reluctant audiences.

Minack Theatre, Cornwall
Situated on the cliff of Porthcurno, Minack Theatre is one you hardly need an excuse to visit; seeing anything at all, backdropped by the azure Cornish sea, will be an unforgettable experience. This year sees touring productions of Little Shop of Horrors, The Massive Tragedy of Madame Bovary and some others. If you want a show particularly fitting for this unique surrounding, check out The Pirates of Penzance playing in September.

1. Which theatre would you go to if you are a Teddy Bear lover?
A. Regent's Park Open Air Theatre.  B. Thorington Theatre.  C. Willow Globe.  D. Minack Theatre.
2. What do Willow Globe and the London Globe Theatre have in common?
A. Their design.  B. Their size.  C. Their building materials.  D. Their programmes.
3. What is the main purpose of the text?
A. To recommend outdoor theatres.  B. To organize outdoor theatre events.  C. To show the architectural features of outdoor theatres.  D. To explore the historical background of outdoor theatres.

My family moved to Melbourne where my father re-entered academia and my mother reshaped my life with devotion and patience. Abandoning dreams of a career may have been in keeping with her era, but my mother threw herself into motherhood. My tennis shoes always shone thanks to her care when I slept. At lunch, my friends crowded around my lunch, the tastiest with enough to spare.

My brother and I left home around age 16 for a better education. Our big house fell empty. There was no help and even no phone connection. Her "coping strategy" was to remind herself that she wanted us to have what she did not. Back then I was occupied with my own needs; today, I realise my mother paid for my education with her tears.

If her self-sacrifice lifted me to greater heights, her second quality built the foundation. When I was young, my mother displayed in me an absolute faith and absolute love. In her eyes, I could do no wrong that I couldn't fix. She didn't say overly sweet things to declare her love, but just being with her made me feel like I was the most important person in the world. My mother allowed me to live a fuller life than hers without bargain, or guilt. To this she added extraordinary generosity. Today I credit my strength and stability to the certainty that, regardless of who else judges me, my mother will not.

Currently I have my own kids. They are children of a different time. Today's mothers have competing demands related to professional and personal goals. Sometimes, I must put my work above their needs. My mom tried to be my "everything" parent while I am what is called the good-enough mother at my best. We may differ in our methods and tools, but the essence of motherhood remains constant.

4. How does the author show her mom's quality in the first paragraph?
A. By quoting a remark.  B. By giving examples.  C. By making a comparison.  D. By analyzing the cause.
5. Which words can best describe the author's mother?
A. Optimistic and persevering.  B. Hardworking and demanding.  C. Devoted and selfless.  D. Cheerful and reliable.
6. What does the underlined phrase "good-enough mother" probably refer to in the last paragraph?
A. A mother who pursues perfection in childcare.  B. A mother who devotes herself wholly to her work.  C. A mother who places her kids' needs above all else.  D. A mother who tries to balance her parenting with her career.
7. What can be the most suitable title for the text?
A. The Art of Being a Mother  B. Lessons of Love and Strength  C. Motherhood: The Greatest Adventure  D. Motherhood: A Journey of Sacrifice and Strength

Much of the conservation and climate change spotlight falls on tropical (热带的) forests. Given this, people might forget that forests in the temperate (温带的) areas — those found in large parts of North America, Europe and higher latitudes in Asia and Australia — also have the power to help limit climate change. Although preserving tropical rainforests is essential to climate progress, policy makers cannot neglect the important role of temperate forests. This Earth Week, we must turn our attention and dollars to these stretches of trees, or we will face the loss of an important tool in managing global warming.

Temperate forests represent about 25 percent of Earth's arboreal lands. As temperatures have changed, temperate trees face threats of harmful invasive pests (侵入的害虫) from other regions and loss of forest lands from urban and farmland expansion. We believe the greatest emerging threat to temperate forests is wildfires that occur beyond normal historic frequency and severity. But surprisingly, widespread fire suppression (抑制), especially in dry forests in the West, has allowed a build-up of dangerous fuels like deadwood. These fuels, combined with the drought caused by climate change, have led to increasingly frequent and severe fires that kill enormous numbers of trees and release a large quantity of CO2 to the atmosphere in bad fire years in the United States.

We need to reduce land-clearing for housing and agriculture, then allow trees to regrow where they have been removed, and protect and better care for the few temperate forests that still contain stands of very old trees. These old forests are some of the most carbon-dense ecosystems and possess unique biodiversity. Therefore, governments and landowners must make sure middle-aged forests that regrew after cutting will develop into the old-growth forests of tomorrow.

We need to take advantage of current public funding for forest conservation and management and, at the same time, promote private investment to support restorative measures and sustainable forestry to capture the climate potential of temperate forests in the U.S. and elsewhere.

8. What is emphasized in the first paragraph?
A. Temperate forests' impact on climate change.  B. Distribution of temperate forests.  C. Conservation of tropical forests.  D. Causes of global warming.
9. What does the underlined word "neglect" mean in the first paragraph?
A. Ignore.  B. Limit.  C. Perform.  D. Assess.
10. What caused the more frequent and severe wildfires in temperate forests?
A. Increasing urbanization.  B. Fire control practices.  C. The invasive species.  D. Farmland expansion.
11. Which of the following is a measure to protect temperate forests?
A. Banning tree cutting.  B. Restricting investment.  C. Conserving old forests.  D. Protecting farm land.

Nowadays we are living in the age of anxiety. 12 When you have faith in yourself, you are best placed to handle the challenges life presents. Strong self-belief is your greatest advantage in life. So how do you build self-belief and overcome self-doubt?

Build your team
Choose your companions carefully. 13 You need people who "get you", and see your true potential. When the going gets tough, you may need to lean on their belief in you and borrow it until your own has recovered and bounced back! Above all, do not get isolated (孤寂的). We need to connect with others and feel a sense of belonging and community.

Move
One of the simplest ways to change your internal state from feeling doubtful to one of enthusiasm and self-belief is to move your body. 14 You can often find a solution to a problem and a fresh new perspective from getting out of your mind and into physical activity.

Meditate (冥思)
The philosopher Blaise Pascal said, "All of humanity's problems originate from man's inability to sit quietly in a room alone." You could argue that his words are more relevant in today's times than then! 15 Otherwise, you could be constantly "on" and your nervous system could be flooded with stress, making it hard for your system to come back to balance.

Stop the criticism
16 There is no one who can do more damage to your self-belief than — you! Show compassion to your "mistakes" and perceived failings. See that no one creates success without learning along the way. Also, let go of what you may be blaming yourself for, either in your professional life or personal.

A. Forgive others, or you may stay stuck in blame.
B. They will be feeding your self-belief or starving it.
C. But we have the power to control what happens around.
D. You can be a great partner or a great enemy for yourself.
E. But your attitude and your self-belief are your cure for anxiety.
F. As little as 10 minutes of swift walking increases your energy and positive mood.
G. You need a break from the 24-hour newsfeeds and alarms from your smart phones.

II. Cloze

Only two tickets to the big basketball game. Three pairs of eyes all 17 the tickets in Dad's hand. Marcus, the oldest, asked, "Dad, which of us gets to go with you?" Dad scratched his head, acknowledging the need to find a fair 18 to pick one of us by the next morning and figure out who 19 it the most.

The next morning, we were 20 only by a note on the table, which read, "After finishing breakfast, don't forget to 21 Saturday chores (家务活) early." My two brothers swiftly fled out of the house, leaving behind a mess on the table. Well, it looks like Saturday morning chores 22 right here, I thought to myself.

As I was doing the chores, I 23 out of the kitchen window and saw Marcus practicing shooting the basketball while Caleb cheered him on. I carried the 24 bag out to the dustbin outside. The moment I opened the lid (盖子) on the garbage container, a flash of white on the inside of the heavy black plastic lid caught my 25 . A white envelope was taped 26 , instead of casually, to the underside of the lid, 27 the word "Congratulations!" on its front. Inside of the envelope was a ticket to the basketball game 28 to a folded piece of paper, reading, "To the one who deserves to go"!

That evening turned out to be as 29 as I'd imagined: Two seats at Center Court, and a dad and his daughter cheering their team to victory. It was a long-remembered 30 in individual responsibility from a dad who let his kids make their own choices and earn their own 31 .

17. A. looked into  B. searched for  C. scanned through  D. focused on
18. A. bet  B. way  C. trial  D. trade
19. A. needs  B. deserves  C. mentions  D. checks
20. A. panicked  B. amused  C. greeted  D. touched
21. A. complete  B. list  C. avoid  D. leave
22. A. continue  B. end  C. work  D. start
23. A. pointed  B. climbed  C. glanced  D. shouted
24. A. hand  B. rubbish  C. leather  D. sleeping
25. A. fancy  B. breath  C. arm  D. attention
26. A. professionally  B. normally  C. permanently  D. securely
27. A. hiding  B. bearing  C. whispering  D. reflecting
28. A. attached  B. admitted  C. applied  D. related
29. A. special  B. busy  C. boring  D. casual
30. A. fight  B. performance  C. competition  D. lesson
31. A. reputation  B. profits  C. rewards  D. living

III. Grammar Fill-in
Read the passage below and fill each blank with one appropriate word or the correct form of the word given in brackets.
Principles of Full-Text Search

Before introducing full-text search, let us briefly describe the two ways of searching full-text data.

Serial scanning: suppose we want to find every file whose content contains a certain string. Serial scanning means looking at the documents one by one: for each document, we read it from beginning to end; if the document contains the string, it is one of the files we are looking for, and we then move on to the next file until all files have been scanned. Windows file search, for instance, can also search file contents this way, only it is quite slow: with an 80 GB hard disk, finding a file whose content contains a given string can easily take several hours. The grep command under Linux also works this way. This method may seem primitive, but for small amounts of data it remains the most direct and convenient approach. For large numbers of files, however, it is very slow.
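For concreteness, a minimal Python sketch of serial scanning, functionally what grep does (the root path and search string are illustrative):

    import os

    def serial_scan(root, needle):
        """Scan every file under root, front to back, for the needle string."""
        hits = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        if needle in f.read():  # read the whole document and test
                            hits.append(path)
                except OSError:
                    pass                        # skip unreadable files
        return hits

    print(serial_scan("./docs", "full-text search"))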
Full-text search (Full-text Search): part of the information in the full-text data is extracted and reorganized so that it acquires some structure; this structured data is then searched, which makes search comparatively fast. The information extracted from the unstructured data and then reorganized is called the index. This process of first building an index and then searching the index is called full-text search.

The figure below describes the general process of full-text search. Full-text search consists of two broad processes: index creation (Indexing) and index search (Search).
Index creation: the process of extracting information from all the structured and unstructured data of the real world and building an index.
Index search: the process of receiving a user's query, searching the index that was built, and returning the results.

Full-text search therefore raises three important questions:
1. What is stored in the index? (Index)
2. How is the index created? (Indexing)
3. How is the index searched? (Search)
We examine each question in turn below.

1. What exactly is stored in the index?
What does the index actually need to store? First consider why serial scanning is slow: the root cause is that the information we want to search for is inconsistent with the way information is stored in the unstructured data. The unstructured data stores, for each file, the strings it contains; in other words, given a file, finding its strings is relatively easy. It is a mapping from files to strings. What search needs is the reverse: given a string, find the files that contain it, a mapping from strings to files. This reverse mapping is exactly what the index must store.
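A minimal Python sketch of that reverse mapping, a toy inverted index from terms to the documents containing them (the documents are illustrative):

    docs = {
        1: "full text search builds an index",
        2: "serial scanning reads every file",
        3: "the index maps terms to files",
    }

    # Index creation: map each term to the set of document ids containing it.
    inverted = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            inverted.setdefault(term, set()).add(doc_id)

    # Index search: a query term is answered by one dictionary lookup,
    # instead of rescanning every document.
    print(sorted(inverted["index"]))  # -> [1, 3]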
A Fuzzy Approach for Text Mining (IJMSC-V1-N4-4)
I.J. Mathematical Sciences and Computing, 2015, 4, 34-43
Published Online November 2015 in MECS
DOI: 10.5815/ijmsc.2015.04.04
Available online at /ijmsc

A Fuzzy Approach for Text Mining
Deepa B. Patil (a), Yashwant V. Dongre (b)
(a) Vishwakarma Institute of Information Technology, 3/4 Kondhwa (Bk), Pune-411048, India
(b) Vishwakarma Institute of Information Technology, 3/4 Kondhwa (Bk), Pune-411048, India
Corresponding author. Tel.: +91-9822893671; e-mail address: deepa_p100@yahoo.co.in

Abstract
Document clustering is an integral and important part of text mining. There are two types of clustering, namely hard clustering and soft clustering. In hard clustering, a data item belongs to only one cluster, whereas in soft clustering a data point may fall into more than one cluster. Thus, soft clustering leads to fuzzy clustering, wherein each data point is associated with a membership function that expresses the degree to which individual data points belong to the cluster. Accuracy is desired in information retrieval, and it can be achieved by fuzzy clustering. In the work presented here, a fuzzy approach for text classification is used to classify documents into appropriate clusters using the Fuzzy C Means (FCM) clustering algorithm. The Enron email dataset is used for experimental purposes. Using the FCM clustering algorithm, emails are classified into different clusters. The results obtained are compared with the output produced by the k means clustering algorithm. The comparative study showed that the fuzzy clusters are more appropriate than hard clusters.

Index Terms: Fuzzy clustering, fuzzy c means clustering algorithm, text mining

© 2015 Published by MECS Publisher. Selection and/or peer review under responsibility of the Research Association of Modern Education and Computer Science.

1. Introduction
Document clustering or data clustering divides data items into groups so that items in the same group are most similar and, at the same time, most dissimilar to the data items in other clusters. Depending on the nature of the data and the purpose of clustering, different measures of similarity can be used to place items into different clusters or groups. The main objective of a similarity measure is to control cluster formation. Some methods find the similarity between two objects by the distance between them. Such distances can be defined using Euclidean distance, Cosine similarity, the dice coefficient, the extended Jaccard coefficient, etc.

Data can be partitioned into hard clusters or soft clusters. In hard clustering, data items are divided into separate clusters, where each data element belongs to only one cluster. In soft clustering or fuzzy clustering, data items may belong to more than one cluster, with different degrees of membership. For example, if we want to partition the heights of human beings as tall, medium and short, we can consider a person tall if his/her height is above 5'6", medium if his/her height falls between 4'6" and 6', and short if his/her height is below 5'. A person whose height is 5'9" falls in both groups, in medium as well as in tall. The height 5'9" is closer to the value 6', so the person is classified as tall. Such a classification is more appropriate and accurate using fuzzy clustering than hard clustering, wherein there is a chance of a data item falling into the wrong cluster.
The work described here focuses on a comparative study of hard clustering vs soft clustering using the k means clustering algorithm and the Fuzzy C Means (FCM) clustering algorithm with five similarity/distance measures, namely Euclidean distance, Cosine similarity, the dice coefficient, the extended Jaccard coefficient and the Similarity Measure for Text Processing (SMTP). For experimental purposes the Enron email data set [1] is used, which is available free on the World Wide Web. A small email data set was also created for comparative analysis. Fuzzy classification can be rule based or keyword based. This work focuses on classifying emails based on keywords. Many a time a person is unable to locate a piece of information he/she knows is out there but can't find it. This can be extremely frustrating, particularly if you know you've seen it before.

Corporate emails most of the time contain some typical keywords. For example, employees of the research and development department of a pharmaceutical company may receive emails with particular keywords describing names of diseases, drugs, contents in drugs, composition, the name of another pharmaceutical company, symptoms of a disease, side effects of a particular drug, etc. Finding a particular email or emails containing a required disease name does not require fuzzy rules, as fuzzy rules are not applicable in such situations. So the main focus was on keyword-based fuzzy classification.

2. Literature Survey
A lot of similarity measures exist to calculate the similarity between two given documents. Euclidean distance [2] is one of the popular similarity measures. Cosine similarity [3] is a measure which takes the cosine of the angle between two given document vectors. The Jaccard coefficient [4] is a statistic used for comparing the similarity of two document sets. It is defined as the size of the intersection divided by the size of the union of the sample data sets. An information-theoretic measure for document similarity called IT_Sim is described in [5], [6]; it is a phrase-based measure which computes the similarity based on the Suffix Tree Document Model. Pairwise-adaptive similarity [7] is a measure which selects a number of features dynamically out of document d1 and document d2. In [4], [8] the Hamming distance is used; the Hamming distance between two document vectors is the number of positions where the corresponding symbols differ. In [9] a non-symmetric similarity measure called the Kullback-Leibler divergence is described. It is the difference between the probability distributions associated with two vectors. In [10] an advanced similarity measure is proposed, known as the Similarity Measure for Text Processing (SMTP), which gives more weight to the presence or absence of features (words) than to the frequency of features (words).

Lotfi A. Zadeh is the initiator of fuzzy logic. He introduced fuzzy sets [11] in 1965. Fuzzy sets are based on fuzzy logic. In a keynote speech, Zadeh himself said that fuzzy logic is not fuzzy, but is a precise logic of imprecision. In fuzzy logic, a variable may have any real value between 0 and 1, unlike in boolean logic where the value of a variable is either 0 or 1. Fuzzy logic can have linguistic variables, for example age, which may take non-numeric values such as young, middle-aged or old. Amongst the early applications of fuzzy logic, notable ones are its use in high speed trains to improve precision of ride and economy, in handwriting recognition, and to improve fuel consumption in automobiles.
Fuzzy logic has many applications in the field of engineering as well as in non-engineering fields. For example, fuzzy logic can be applied in the fields of artificial intelligence, image processing and control theory, medical diagnosis systems, and stock trading applications, to name a few. Fuzzy logic used in a washing machine can control the washing process, such as the intake of water, the temperature of the water (hot, cold, lukewarm), the wash time, the spin speed and the rinse performance. Thus, the use of fuzzy logic in a washing machine helps to increase its lifespan. Zadeh explored more about fuzzy logic in [12]. In [13] Zadeh et al. contributed more about fuzzy sets and fuzzy logic. Kosko [14] made important contributions to the development of fuzzy logic in the field of artificial intelligence.

Many fuzzy clustering algorithms have been proposed by various researchers, namely fuzzy C-means, fuzzy K-nearest neighbor, the fuzzy ISODATA algorithm, potential-based clustering, and many others [15]. The Fuzzy C-means (FCM) clustering algorithm is one of the most popular and widely used fuzzy clustering algorithms. FCM was originally proposed by Dunn [16] and later modified by Bezdek [17]. FCM determines, and iteratively updates, the membership values of each data point in each of the clusters. So a data point is a member of all clusters, with varying degrees of membership. The logic of FCM is extensively used in varied fields of research [18, 19, 20, 21]. There are several variants of the FCM algorithm. Sikka et al. [22] developed a modified FCM known as MFCM to estimate the tissue and tumor areas in a brain MRI scan. Krinidis and Chatzis [23] proposed a Fuzzy Local Information C-Means (FLICM) algorithm. A modified FCM algorithm was developed by Belhassen and Zaidi [24] to overcome the problems faced by the conventional FCM algorithm.

3. Proposed System
The following figure describes the proposed system in detail.

Fig. 1. System Architecture

The following steps are carried out to achieve the final results. In the figure above, the steps are numbered 1 to 6.

Document preprocessing: document preprocessing is carried out in two steps, namely stopword removal and stemming.
- Stopword removal: words which occur very frequently are not useful for the purpose of information retrieval. It is observed that a word which appears in almost 80% of the documents in the document set is useless for information retrieval. Such a word is referred to as a stopword. Examples of such words are: a, an, the, are, on, to, about, above, up to, onto, etc. In the stopword removal process, stopwords are removed from the document collection.
- Stemming: a stem is the fraction of a word which is left after removal of its affixes. Stemming is the process of removing plurals, gerund forms and past tenses from a word. An example of a stem is the word 'calculate', which is the stem for variants such as calculated, calculating, calculation and calculations.

Find term frequency: after preprocessing the document collection, each word that remains in the document collection is called a 'term'. The frequency of each term in each document in the collection is calculated. For example, suppose an email contains the following text:

"First year engineering admissions (FE) are commencing from 2nd July; so, you are requested to update college website on priority basis. Please update the Placement Report on our College website. Pls do the needful as early as possible as FE admissions are commencing from 2nd July and people will surf our website more frequently."
After stopword removal and stemming, the document will contain the following terms:

first year engineering admission (FE) commenc July request update college website priority update placement report college website needful possible FE admission commenc July people surf website frequent

The frequency of each term in the document is calculated. In the example above, the frequency of the word 'website' is 3.

Term selection: not all the terms in the collection are useful for information retrieval. Terms with a low frequency count may not be considered; in the example above, terms such as 'FE' and 'needful' with a low frequency count are not useful for information retrieval, so such terms are omitted. In the term selection step, terms with a high frequency count are retained for information retrieval.

Generate document vectors: in this step, a document vector for each document is generated. Consider the following two emails:

Email 1: First year engineering admissions (FE) are commencing from 2nd July; so, you are requested to update college website on priority basis. Please update the Placement Report on our College website. Pls do the needful as early as possible as FE admissions are commencing from 2nd July and people will surf our website more frequently.

Email 2: Kindly update the designation "Asst. Professor" on related pages on college website. Also add email id in the description deepa.patil@viit.ac.in and wherever needed on our college website. Pls update website as early as possible as DTE visit is scheduled on Friday 27th this month.

For the above two emails, the following terms are identified (displayed below in alphabetical order): college, designation, engineering, email, placement, website.

For these terms, the document vector for each email is:
document vector for email 1: <2,0,1,0,1,3>
document vector for email 2: <2,1,0,1,0,3>

Find the similarity matrix: calculate the similarity matrix for all documents in the document collection using the five distance measures, namely Euclidean distance, Cosine similarity, the dice coefficient, the extended Jaccard coefficient and the Similarity Measure for Text Processing (SMTP), with the help of the document vectors generated in the step above. A separate similarity matrix is generated for each distance measure.

Apply clustering algorithms: apply the clustering algorithms k means and FCM on each of the similarity matrices generated in the step above. The final outcome is clustered emails.
The outputs produced by the two clustering algorithms using the above-mentioned distance measures are then compared.

4. Algorithmic Details

4.1. K Means Clustering Algorithm
The steps of the k means clustering algorithm are:
Step 1: Arbitrarily choose k objects from the data set as the initial cluster centers (centroids).
Step 2: Repeat
Step 2.1: Determine the distance, using the distance measure, between each object and each of the centroids.
Step 2.2: Assign each object to the cluster to which it is most similar.
Step 2.3: Update the cluster centers.
Step 3: Until no change.

4.2. Fuzzy C Means Clustering Algorithm
The aim of the fuzzy c means clustering algorithm is to minimize the objective function

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \|x_i - c_j\|^2    (1)

The steps of the FCM clustering algorithm are:
Step 1: Initialize the membership matrix U = [u_{ij}], U^{(0)}.
Step 2: At step k, calculate the center vectors C^{(k)} = [c_j] from U^{(k)} using

c_j = \frac{\sum_{i=1}^{N} u_{ij}^m x_i}{\sum_{i=1}^{N} u_{ij}^m}    (2)

Step 3: Update U^{(k)} to U^{(k+1)} using

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \|x_i - c_j\| / \|x_i - c_k\| \right)^{2/(m-1)}}    (3)

Step 4: If \|U^{(k+1)} - U^{(k)}\| < \varepsilon then STOP; otherwise return to Step 2.

In the algorithm above:
- u_{ij} is the degree of membership of x_i in cluster j.
- c_j is the center of cluster j.
- The loop iteration stops when

\max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon    (4)

- m is the fuzziness coefficient; m determines how much clusters can overlap each other. m lies between 1 and ∞; the higher the value of m, the more data points lie in the fuzzy band. Usually m = 2 is chosen initially.
- ε is the termination tolerance; the algorithm stops when \|U^{(k+1)} - U^{(k)}\| < \varepsilon. The usual choice of ε is 0.001.
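The loop in Section 4.2 is short enough to sketch directly. A minimal NumPy implementation of the updates in equations (2)-(4); the toy data, the function name fcm and its defaults are illustrative, not from the paper:

    import numpy as np

    def fcm(X, C, m=2.0, eps=0.001, max_iter=100, seed=0):
        """Fuzzy C-Means: returns membership matrix U (N x C) and centers (C x d)."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        U = rng.random((N, C))
        U /= U.sum(axis=1, keepdims=True)      # memberships of each point sum to 1
        for _ in range(max_iter):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]       # equation (2)
            dist = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
            dist = np.fmax(dist, 1e-12)                          # avoid divide-by-zero
            inv = dist ** (-2.0 / (m - 1.0))
            U_new = inv / inv.sum(axis=1, keepdims=True)         # equation (3)
            if np.abs(U_new - U).max() < eps:                    # stopping rule (4)
                return U_new, centers
            U = U_new
        return U, centers

    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])
    U, centers = fcm(X, C=2)
    print(U.argmax(axis=1))  # hard assignment: cluster with highest membership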
5. Result Analysis

5.1. Dataset Used
For experimental purposes, the Enron email data set [1] is used, which can be downloaded free. The data set is cleaned and made available on the World Wide Web for research purposes. A small email data set was also created for comparative analysis.

5.2. Analysis
The result analysis is done on the basis of the similarity measures and clustering algorithms used. For experimental purposes, four clusters are considered. First, the K means clustering algorithm is applied on the similarity matrices generated for the five distance measures, namely Euclidean distance, Cosine similarity, the dice coefficient, the extended Jaccard coefficient and the Similarity Measure for Text Processing (SMTP). Experts' results are generated; that is, the emails belonging to each cluster for each similarity measure are identified and compared with the system-generated email clusters. The following screen shot represents the output for the K Means algorithm used with SMTP. It can be observed in the screen shot that each cluster contains some emails whose numbers are displayed against the cluster id.

Fig. 2. Output of K means with SMTP

Accuracy is calculated using precision. Precision is defined as
(number of relevant records retrieved / (number of irrelevant records retrieved + number of relevant records retrieved)) * 100.

The following table shows the accuracy obtained with the five similarity measures used with the k means clustering algorithm.

Table 1. Accuracy for four clusters for five similarity measures using the K Means algorithm

Similarity Measure           | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Cosine Similarity            | 36.1111   | 6.4516    | 20.8333   | 28.5714
Euclidean Dist               | 24.6575   | 25.00     | 2.38095   | 20.00
Dice Coefficient             | 23.3333   | 8.8235    | 23.9130   | 3.8461
Extended Jaccard Coefficient | 24.2424   | 20.00     | 27.4509   | 26.3157
SMTP                         | 53.8461   | 23.0769   | 83.3333   | 31.5789

It can be observed from the table above that accuracy is highest for SMTP; SMTP gives the best results for all four clusters. The main difference between SMTP and the other four similarity measures used is that SMTP considers the presence or absence of words (features), while the others consider the frequency count, i.e. the number of times each term appears in a given document. The advantages of SMTP can be discussed as follows. SMTP gives more weight to the presence or absence of features than to the difference between the two values associated with a present feature. It also considers that the degree of similarity should increase when the difference between two non-zero values of a specific term decreases, and that the degree of similarity should decrease when the number of presences or absences of terms increases. SMTP takes into consideration one more important aspect: two documents are least similar to each other if none of the terms have non-zero values in both documents. SMTP is a symmetric similarity measure. The last and most important fact is that it considers the standard deviation of a term or feature, which is taken into account for its contribution to the similarity between the two documents.

The result analysis is also done for the FCM algorithm applied to the similarity matrices generated for the five distance measures: Euclidean distance, Cosine similarity, the dice coefficient, the extended Jaccard coefficient and SMTP. Again, experts' results are generated and compared with the system-generated results. The following screen shot represents the output for the FCM algorithm used with SMTP. Again, for experimental purposes, 4 clusters are considered.

Fig. 3. Output of FCM with SMTP

In the screen shot above, the membership of each email in each of the four clusters is displayed. An email is assigned to the cluster for which it has the highest membership. Accuracy is calculated using precision. The following table shows the accuracy of the FCM algorithm with the five similarity measures.

Table 2. Accuracy for four clusters for five similarity measures using the FCM algorithm

Similarity Measure           | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Cosine Similarity            | 37.1871   | 8.8765    | 22.76     | 29.8139
Euclidean Dist               | 25.7654   | 28.8712   | 5.00      | 21.2198
Dice Coefficient             | 27.1987   | 10.8876   | 24.90     | 4.00
Extended Jaccard Coefficient | 25.4874   | 22.65     | 31.3451   | 28.98
SMTP                         | 55.1238   | 24.00     | 86.387    | 35.87

As can be observed from Table 2, FCM outperformed K Means in almost all four clusters. The results are represented in the following graph.

Fig. 4. Comparison between the FCM algorithm and the K Means algorithm with five distance measures

6. Conclusions
The main aim of the FCM clustering algorithm is to minimize the objective function given in (1). The algorithm forms clusters by iteratively searching for a set of fuzzy clusters and the associated cluster centers that represent the structure of the data as well as possible. Both algorithms use distance measures, but FCM clustering uses distance measures along with the fuzziness coefficient, which controls the degree of membership of each data item in a particular cluster. Also, because every document has some membership value in each of the clusters, no useful document will ever be excluded from the search results in the case of fuzzy clustering; moreover, fuzzy classification takes care of outliers in a much better way. The clusters formed using FCM clustering are more accurate than the clusters formed using k means clustering. The final conclusion is that FCM clustering gives better results than k means clustering for the given data set.

Acknowledgement
Inspiration and guidance are invaluable in every aspect of life, especially in the field of academics, and I have received them from my respected guide Prof. Y.V. Dongre.
I would like to thank him for his endless contributions of time, effort, valuable guidance and encouragement. I also wish to thank everyone who has contributed to this work directly or indirectly.

References
[1] [Online]. Available: https:///~./enron/
[2] T. W. Schoenharl & G. Madey. Evaluation of measurement techniques for the validation of agent-based simulations against streaming data. Proc. ICCS 2008, Krakow, Poland.
[3] J. Han & M. Kamber. Data Mining: Concepts and Techniques. 2nd ed. San Francisco, CA, USA: Elsevier; 2006.
[4] C.G. Gonzalez, W. Bonventi, Jr. & A.L.V. Rodrigues. Density of closed balls in real-valued and autometrized Boolean spaces for clustering applications. Proc. 19th Brazilian Symp. Artif. Intel. 2008; pp. 8-22.
[5] J. A. Aslam & M. Frost. An information-theoretic measure for document similarity. Proc. 26th SIGIR 2003; pp. 449-450.
[6] D. Lin. An information theoretic definition of similarity. Proc. 15th Int. Conf. Mach. Learn. 1998, San Francisco, CA, USA.
[7] J. D'hondt, J. Vertommen, P.A. Verhaegen, D. Cattrysse & R.J. Duflou. Pairwise-adaptive dissimilarity measure for document clustering. Inf. Sci. 2010; Vol. 180, No. 12, pp. 2341-2358.
[8] R.W. Hamming. Error detecting and error correcting codes. Bell Syst. Tech. J. 1950; Vol. 29, No. 2, pp. 147-160.
[9] S. Kullback & R.A. Leibler. On information and sufficiency. Annu. Math. Statist. 1951; Vol. 22, No. 1, pp. 79-86.
[10] Yung-Shen Lin, Jung-Yi Jiang & Shie-Jue Lee. Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering 2014; Vol. 26, No. 7.
[11] Zadeh, L.A. Fuzzy sets. Information and Control 8(3): 338-353, 1965; doi:10.1016/s0019-9958(65)90241-x.
[12] Zadeh, L.A. Fuzzy Logic. Stanford Encyclopedia of Philosophy. Stanford University, 2006.
[13] Zadeh, L.A. et al. Fuzzy Sets, Fuzzy Logic, Fuzzy Systems. World Scientific Press, 1996; ISBN 981-02-2421-4.
[14] Kosko, B. Fuzzy Thinking: The New Science of Fuzzy Logic. 1994; Hyperion.
[15] Pratihar, D.K. Soft Computing. Narosa Publishing House, New Delhi, India.
[16] Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybernet. 1973; Vol. 3, pp. 32-57.
[17] Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[18] Pal, N.R. & Bezdek, J.C. On Cluster Validity for the Fuzzy C-Means Model. IEEEFS 1995; Vol. 3, No. 3, p. 370.
[19] Albayrak, S. & Armasyali, F. Fuzzy C-Means Clustering on Medical Diagnostic System. Proc. Int. XII Turkish Symp. on Artif. Intel. NN, 2003.
[20] Zhang, D.Q. & Chen, S.C. A Novel Kernelized Fuzzy C-Means Algorithm With Application in Medical Image Segmentation. Artif. Intel. Med. 2004; Vol. 32, pp. 37-50.
[21] Migaly, S., Abonyi, J. & Szeifert, F. Fuzzy Self-Organizing Map Based on Regularized Fuzzy C-Means Clustering. Advances in Soft Computing, Engineering Design and Manufacturing. J.M. Benitez, O. Cordon, F. Hoffmann, et al. (Eds.), Springer Engineering Series, 2002; pp. 99-108.
[22] Sikka, K., Sinha, N., Singh, P.K. & Mishra, A.K. A Fully Automated Algorithm Under Modified FCM Framework for Improved Brain MR Image Segmentation. Magnetic Resonance Imaging, 2009; Vol. 27, No. 7, pp. 994-1004.
[23] Krinidis, S. & Chatzis, V. A Robust Fuzzy Local Information C-Means Clustering Algorithm. IEEE Trans. on Image Processing 2010; Vol. 19, No. 5, pp. 1328-1337.
[24] Belhassen, S. & Zaidi, H. A Novel Fuzzy C-Means Algorithm for Unsupervised Heterogeneous Tumor Quantification. PET. Medical Physics 2010; Vol. 37, No. 3, pp. 1309-1324.

Authors' Profiles
Ms. Deepa B. Patil is a postgraduate student in Computer Engineering at Vishwakarma Institute of Information Technology under Savitribai Phule Pune University, Pune, Maharashtra State, India.
Prof. Yashwant V. Dongre is Assistant Professor at Vishwakarma Institute of Information Technology (VIIT) under Savitribai Phule Pune University, Pune, Maharashtra State, India. His areas of interest include database management, data mining and information retrieval. He has several journal papers to his credit published in prestigious journals.

How to cite this paper: Deepa B. Patil, Yashwant V. Dongre, "A Fuzzy Approach for Text Mining", International Journal of Mathematical Sciences and Computing (IJMSC), Vol. 1, No. 4, pp. 34-43, 2015. DOI: 10.5815/ijmsc.2015.04.04
The NASA STI Program Office... in Profile
NASA/CR-2002-211768
Very Large Scale Optimization
Garrett Vanderplaats
Vanderplaats Research and Development, Inc., Colorado Springs, Colorado

National Aeronautics and Space Administration
Langley Research Center, Hampton, Virginia 23681-2199
Prepared for Langley Research Center under Contract NAS1-00102

TECHNICAL TRANSLATION. English-language translations of foreign scientific and technical material pertinent to NASA's mission.

Specialized services that complement the STI Program Office's diverse offerings include creating custom thesauri, building customized databases, organizing and publishing research results... even providing videos. For more information about the NASA STI Program Office, see the following:
• Access the NASA STI Program Home Page at
• Email your question via the Internet to help@
• Fax your question to the NASA STI Help Desk at (301) 621-0134
• Telephone the NASA STI Help Desk at (301) 621-0390
• Write to: NASA STI Help Desk, NASA Center for AeroSpace Information, 7121 Standard Drive, Hanover, MD 21076-1320
Judging English Sentence Similarity
1. Task
This project provides a series of English sentence pairs; the two sentences in each pair have a certain degree of semantic similarity. Each sentence pair receives a score between 0 and 5 measuring the semantic similarity of the two sentences; the higher the score, the closer the meanings.

2. Basic implementation

2.1 Data processing
(1) Tokenization.
(2) Stopword removal: stopwords are words that are entirely useless or carry no meaning, such as particles and interjections. Stopwords are high-frequency words like a/an/and/are/then; such high-frequency words severely distort frequency-based scoring formulas, so they must be filtered out.
(3) Lemmatization and stemming (Stemming): this is a treatment peculiar to Western languages. English words, for example, have singular/plural variants and -ing and -ed variants, which should be treated as the same word when computing relevance. For instance, apple and apples, or doing and done, are the same word; the point of extracting the stem is to merge these variant forms.
(4) The code for the steps above is as follows:

    import nltk
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    def data_cleaning(data):
        data["s1"] = data["s1"].str.lower()
        data["s2"] = data["s2"].str.lower()
        # Tokenization
        tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
        data["s1_token"] = data["s1"].apply(tokenizer.tokenize)
        data["s2_token"] = data["s2"].apply(tokenizer.tokenize)
        # Stopword removal
        stop_words = stopwords.words('english')
        def word_clean_stopword(word_list):
            return [word for word in word_list if word not in stop_words]
        data["s1_token"] = data["s1_token"].apply(word_clean_stopword)
        data["s2_token"] = data["s2_token"].apply(word_clean_stopword)
        # Lemmatization
        lemmatizer = WordNetLemmatizer()
        def word_reduction(word_list):
            return [lemmatizer.lemmatize(word) for word in word_list]
        data["s1_token"] = data["s1_token"].apply(word_reduction)
        data["s2_token"] = data["s2_token"].apply(word_reduction)
        # Stemming
        stemmer = nltk.stem.SnowballStemmer('english')
        def word_stemming(word_list):
            return [stemmer.stem(word) for word in word_list]
        data["s1_token"] = data["s1_token"].apply(word_stemming)
        data["s2_token"] = data["s2_token"].apply(word_stemming)
        return data

2.2 Traditional methods
(1) Bag of words (a detailed description is linked in the original post):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    def count_vector(words):
        count_vectorizer = CountVectorizer()
        emb = count_vectorizer.fit_transform(words)
        return emb, count_vectorizer

    bow_data = data
    bow_data["words_bow"] = bow_data["s1"] + bow_data["s2"]
    bow_test = bow_data[bow_data.score.isnull()]
    bow_train = bow_data[~bow_data.score.isnull()]
    list_test = bow_test["words_bow"].tolist()
    list_train = bow_train["words_bow"].tolist()
    list_labels = bow_train["score"].tolist()

    X_train, X_test, y_train, y_test = train_test_split(list_train, list_labels, test_size=0.2, random_state=42)
    X_train_counts, count_vectorizer = count_vector(X_train)
    X_test_counts = count_vectorizer.transform(X_test)
    test_counts = count_vectorizer.transform(list_test)
    # print(X_train_counts.shape, X_test_counts.shape, test_counts.shape)

(2) TF-IDF (a detailed description is linked in the original post):

    from sklearn.feature_extraction.text import TfidfVectorizer
    import scipy as sc

    def tfidf(data):
        tfidf_vectorizer = TfidfVectorizer()
        train = tfidf_vectorizer.fit_transform(data)
        return train, tfidf_vectorizer

    tf_data = data
    tf_data["words_tf"] = tf_data["s1"] + tf_data["s2"]
    tf_test = tf_data[tf_data.score.isnull()]
    tf_train = tf_data[~tf_data.score.isnull()]
    list_tf_test = tf_test["words_tf"].tolist()
    list_tf_train = tf_train["words_tf"].tolist()
    list_tf_labels = tf_train["score"].tolist()

    X_train, X_test, y_train, y_test = train_test_split(list_tf_train, list_tf_labels, test_size=0.2, random_state=42)
    X_train_tfidf, tfidf_vectorizer = tfidf(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    test_tfidf = tfidf_vectorizer.transform(list_test)

Training and prediction can then be carried out with basic regression algorithms.

3. Three basic word2vec approaches

3.1 Training the Word2Vec model
A word-vector model is trained on the given corpus and later used to represent sentences as word vectors, and sentence similarity is scored with cosine similarity. Unlike the methods above, the word2vec approach is unsupervised learning, so the score given in the source data is not used.

(1) First, the corpus used:

    path_data = "text_small"
    path_train_lab = "train_ai-lab.txt"
    path_test_lab = "test_ai-lab.txt"
    path_other_lab = "sicktest"

    def get_sentences():
        """Collect the sentences in the files, to be used as the corpus."""
        # prep_sentence (sentence cleanup) is defined elsewhere in the original post.
        sentences = []
        with open(path_train_lab) as file:
            for line in file:
                item = line.split('\t')
                sentences.append(prep_sentence(item[1]))
                sentences.append(prep_sentence(item[2]))
        with open(path_test_lab) as file:
            for line in file:
                item = line.split('\t')
                sentences.append(prep_sentence(item[1]))
                sentences.append(prep_sentence(item[2]))
        # # Add extra corpus material
        # with open(path_other_lab) as file:
        #     for line in file:
        #         item = line.split('\t')
        #         sentences.append(prep_sentence(item[0]))
        #         sentences.append(prep_sentence(item[1]))
        # sentences += word2vec.Text8Corpus(path_data)
        return sentences

(2) Training the model:

    from gensim.models import Word2Vec

    def train_w2v_model(sentences):
        """Train the word2vec model on the given corpus."""
        # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=)
        model = Word2Vec(sentences, size=200, min_count=1, iter=2000, window=10)
        model.save("w2v.mod")

3.2 Three basic word2vec approaches
(1) Computing sentence similarity directly with cosine distance: the word vectors of all the words in a sentence are summed and averaged to form a sentence vector, and the cosine distance between the two sentence vectors is taken as the final result. Without any extra processing this already reaches a score of roughly 0.7.
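The averaging-and-cosine step of 3.2 (1) is described but not shown in code above. A minimal sketch, assuming the model trained by train_w2v_model above and pre-tokenized sentences; the function names and the 0-5 rescaling are illustrative:

    import numpy as np
    from gensim.models import Word2Vec

    def sentence_vector(model, words):
        """Average the word vectors of the in-vocabulary words of a sentence."""
        vecs = [model.wv[w] for w in words if w in model.wv]
        if not vecs:
            return np.zeros(model.vector_size)
        return np.mean(vecs, axis=0)

    def cosine_score(model, words1, words2):
        v1, v2 = sentence_vector(model, words1), sentence_vector(model, words2)
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return 5 * cos  # map the cosine onto the 0-5 scoring range

    model = Word2Vec.load("w2v.mod")
    print(cosine_score(model, ["a", "man", "is", "playing", "guitar"],
                       ["a", "person", "plays", "a", "guitar"]))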
Shanghai Education Press English, Primary Grade 6, First Semester: Final Exam Questions with Reference Answers
I. Listening (12 questions, 2 points each, 24 points in total)

1. Listen to the dialogue between two students in the library and answer the following question:
A. What are the students doing in the library?
B. How does the girl know the boy?
C. Why does the boy ask the girl to help him find a book?
Answer: A. The students are borrowing books from the library.
Explanation: The conversation starts with the girl asking the boy if he needs help with borrowing books, indicating that they are in the library and the students are involved in the process of borrowing books.

2. Listen to the short passage about a famous landmark in Shanghai and complete the following sentence:
The Bund is a well-known _______ in Shanghai.
A. bridge  B. park  C. skyscraper  D. building
Answer: D. building
Explanation: In the passage, the speaker mentions that The Bund is a famous building along the Huangpu River in Shanghai, which is a significant landmark.

3. Listen and choose the correct answer.
A Stemming Procedure and Stopword List for General French Corpora

Jacques Savoy
Institut interfacultaire d'informatique
Université de Neuchâtel
Pierre-à-Mazel 7
CH - 2000 Neuchâtel (Switzerland)

To appear in the Journal of the American Society for Information Science, 50(10), 1999, 944-952.

Abstract
Due to the increasing use of network-based systems, there is a growing interest in access to and search mechanisms for text databases in languages other than English. To adapt searching systems to those foreign languages with characteristics similar to the English language, all we need to do for the most part is to establish a general stopword list and a stemming procedure. This article presents the tools needed to establish these in French language databases and some retrieval experiments that have been carried out using two medium-sized French language test collections. These experiments were conducted to evaluate the retrieval effectiveness of the propositions described.

Introduction
The browser technologies currently available for use on CD-ROMs and also local and wide-area networks (Internet and WWW) allow us to store, distribute and manage larger volumes of documents, many of which are not always written in English. To provide access and search mechanisms for these sources of information accessed through digital libraries (Lesk, 1997) or web browsers, we need to readapt portions of certain existing retrieval systems so that they can handle languages other than English.

Most European languages (e.g., French, Slovene, Italian) share many of the characteristics of Shakespeare's language (e.g., word boundaries marked in a conventional manner, variant word forms generated by adding suffixes at the end of a root, etc.). Any adaptation therefore means the elaboration of a general stopword list and a fast stemming procedure. The stopword list contains non-significant words that are removed from a document or a request before beginning the indexing process. The stemming procedure tries to remove inflectional and derivational suffixes in order to conflate word variants into the same stem or root. In resolving this problem for the French language, it is important to remember that French and other European languages involve a more complex morphology than does English (Sproat, 1992). Previous examples of such adaptations are reported in (Popovic & Willett, 1992; Buckley et al., 1995), where stemming procedures are proposed for the Slovene and Spanish languages respectively.

The aim of this article is therefore to propose a general stopword list and a simple stemming procedure required for French corpora. Moreover, as a result of recent cooperation between various research groups, two medium-sized French test collections (see Appendix 2) have been created. These corpora, together with various current search strategies, were used to corroborate or invalidate prior assumptions or algorithms. This means that our findings are based on more solid arguments than on conclusions derived from a single retrieval model working on a small text collection (e.g., less than 500 records).

The rest of this paper is organized as follows. The first part describes the approach we used to establish a general stop list for French corpora. The second part details our "quick and dirty" inflectional stemming procedure based on a few general linguistic considerations.
The third part summarizes and comments upon some of the experimental results that are used to justify both the suggested stopword list and the stemming procedure developed, based on two French language test collections.

General Stopword List
For the purposes of this research, we consider a word to be each uninterrupted sequence composed of letters (a..z), digits (0..9) or two special characters (@ and _). Thus, the phrase "la machine IBM-360" counts as four words but "la machine IBM360" as only three. In French, the apostrophe «'» is very often used as a word delimiter (e.g., "l'avenir" is composed of two words, namely the article "l" (the) and the noun "avenir" (future)). Exceptions are the noun "aujourd'hui" (today), various English name transcriptions (e.g., McDonald's or K'NEX) and the comma used as a separator in numbers (e.g., 3,000,000 is written as 3'000'000 in French typography (Corthésy et al., 1993)).

We defined a general stopword list for those words which serve no purpose for retrieval but are used very frequently in composing the documents, and these stopword lists are developed for two main reasons. Firstly, we hope that each match between a query and a document will be based on good indexing terms. Thus, retrieving a document because it contains words like "be", "your" and "the" in the corresponding request does not constitute an intelligent search strategy. These non-significant words represent noise, and may actually damage retrieval performance because they do not discriminate between relevant and nonrelevant documents. Secondly, we expect to reduce the size of the inverted file, hopefully in the range of 30% to 50%.

Although the objectives seem clear, we do not have a clear theoretical foundation upon which we can define a methodology for the development of a stop list, so a certain arbitrariness is required (Fox, 1990). For example, the SMART system has 571 English words in its stopword list, Fox (1990) suggests 421 words, while DIALOG Information Services (Harter, 1986, p. 88) proposes using only nine terms (namely "an", "and", "by", "for", "from", "of", "the", "to" and "with").

In establishing a general stopword list for French, we followed the guidelines described in (Fox, 1990). Firstly, we sorted all the word forms appearing in our French corpora according to their frequency of occurrence and we extracted the 200 most frequently occurring words. Secondly, we inspected this list to remove all numbers (e.g., "1992", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections (the French articles were extracted from a newspaper, as described in Appendix 2). For example, the word "France" (ranked at the 66th position on the list) as well as the noun "Président" (ranked at the 69th position) were removed from the list. Also removed were other nouns such as "janvier" (January), "Paris", "francs", "millions" or "Jean" (John) as well as adjectives (e.g., "premier", usually appearing in the expression "premier ministre" (prime minister), or "deux" (two)). From our point of view, such words can be useful as indexing terms in only some circumstances. Thirdly, we included some non-information-bearing words, even if they did not appear in the first 200 most frequent words.
In the resulting stopword list there were thus a large number of pronouns, articles, prepositions and conjunctions. As in various English stopword lists, there were also some verbal forms ("être" (to be), "ont" (have), "sont" (are)). However, there was only one noun: "aujourd'hui" (today), included as the two words "aujourd" and "hui" because the apostrophe is considered a word boundary.

We did not include various frequently used words such as "monde" (world, appearing in the 81st position among the 200 most frequent words in our corpora), "politique" (political, 78th), "ans" (years, 71st), "ville" (city, 158th), "ministre" (minister, 79th), "jour" and "jours" (day and days, 190th and 191st) or "vie" (life, 152nd). The presence of homographs represents another debatable issue, and to some extent we had to make arbitrary decisions concerning their inclusion in a stopword list. For example, the French word "son" can be translated as "sound" or "his", and the French term "or" as "thus/therefore" or "gold".

The general stopword list suggested for French contains 215 words and is included in Appendix 1. When using this stopword list, the size of the inverted file was reduced by about 21% for one test collection, and about 35% for the second corpus. Ordering the words according to their occurrence frequency also confirms Zipf's law: based on our French corpora, the 10 most frequent words represent 23.2% of all occurrences in these text databases, while the 20 most frequent words cover 32.4% of all forms appearing in the documents.

Stemming Procedure

After removing high frequency words, an indexing procedure tries to conflate word variants into the same stem or root using a stemming algorithm. For example, the words "thinking", "thinkers" or "thinks" may be reduced to the stem "think". In information retrieval, grouping words having the same root under the same stem (or indexing term) may increase the success rate when matching documents to a query (van Rijsbergen, 1979, Chapter 2; Salton, 1989; Frakes, 1992). Such an automatic procedure may therefore be a valuable tool in enhancing retrieval effectiveness, assuming that words with the same stem refer to the same idea or concept and must therefore be indexed under the same form.

When defining a stemming algorithm, a first approach removes only inflectional suffixes; for English, such a procedure conflates singular and plural word forms and removes the past participle ending «-ed» and the gerund or present participle ending «-ing». More sophisticated schemes have also been proposed for English corpora to remove derivational suffixes (e.g., «-ize», «-ably», «-ship»). For example, Lovins' stemmer (Lovins, 1968) is based on a list of over 260 suffixes, while Porter's algorithm looks for about 60 suffixes (Porter, 1980). Most of these suffix-stripping algorithms are controlled by both quantitative constraints (e.g., a minimal stem length must be respected for a given suffix removal operation) and qualitative constraints (e.g., the ending must satisfy a certain condition). Finally, a set of recoding rules may be applied in order to alter stems and improve conflation (e.g., "hopping" minus «-ing» gives "hop" and not "hopp"). Various implementation strategies have also been suggested (Frakes, 1992).
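As a toy illustration of these quantitative constraints and recoding rules (not taken from any of the stemmers cited above), the following sketch removes «-ing» only when a minimal stem length would remain, then undoes the doubled consonant:

    def strip_ing(word, min_stem=3):
        """Remove the ending «-ing» under a quantitative constraint (a
        minimal remaining stem length), then apply a doubled-consonant
        recoding rule."""
        if word.endswith("ing") and len(word) - 3 >= min_stem:
            stem = word[:-3]
            # Recoding rule: "hopping" -> "hopp" -> "hop"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
        return word

    print(strip_ing("hopping"))   # hop
    print(strip_ing("thinking"))  # think
    print(strip_ing("sing"))      # sing (the remaining stem would be too short)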
In defining an inflectional stemmer for French, there are a greater number of irregularities to consider (Grevisse & Goosse, 1988). Although English contains morphological irregularities (e.g., box/boxes, mouse/mice, keep/kept), there are even more in French and in other languages (e.g., Slovene, Italian). These include inflectional suffixes governed by gender variations (masculine vs. feminine) and number variations (singular vs. plural) for both nouns and adjectives. For verbs, we must add variations in tense and person. The resulting set of rules and exceptions is quite large; as an extreme example, the verb "être" (to be) possesses 40 different possible forms. As another example, the stopword list we suggest contains the variations in gender and number of various pronouns ("mien" in masculine singular, "miens" in masculine plural, "mienne" in feminine singular, and "miennes" in feminine plural) (Sproat, 1992).

In order to resolve this problem, Krovetz (1993) suggests using a stemming procedure based on both inflectional and derivational suffixes, within which the suffix-stripping process is under the control of an English dictionary. Hull (1996) presents a similar approach based on various linguistic tools. For French, Savoy (1993) proposes a suffixing algorithm also based on grammatical categories, although such an approach requires a French dictionary, an electronic resource that is not freely available. Moreover, the suggested procedure is time-consuming compared to various approaches designed for the English language (e.g., Porter's stemmer) or for the Slovene language (Popovic & Willett, 1992).

Figure 1 below gives a detailed description of our "quick and dirty" stemming procedure for the French language. The principal feature of this stemming procedure is that it is based on only a few general morphological rules. In French, the main inflectional rule adds a final «-s» to denote the plural form of both nouns and adjectives. Another common morpheme indicating the plural is a final «-x» (as in "hibou/hiboux" (owl/owls) or, in a slightly more complex circumstance, for nouns ending in «-al» such as "cheval/chevaux" (horse/horses)). The suggested algorithm does not account for person and tense variations, nor for the morphological variations used by verbs. Our procedure therefore corresponds to the English "S stemmer", which conflates singular and plural word forms (Harman, 1991).

    For words of five or more letters:
       if the final letter is «-x» then
          if the word ends with «-aux» then replace «-aux» by «-al»
                                               (e.g., chevaux -> cheval)
          otherwise, remove the final «-x»     (e.g., hiboux -> hibou)
       otherwise (words not ending with «-x»):
          if the final letter is «-s» then remove the final «-s»
                                               (e.g., chantés -> chanté)
          if the final letter is «-r» then remove the final «-r»
                                               (e.g., chanter -> chante)
          if the final letter is «-e» then remove the final «-e»
                                               (e.g., chante -> chant)
          if the final letter is «-é» then remove the final «-é»
                                               (e.g., chanté -> chant)
          if the final two letters are the same, remove the final letter
                       (a simple recoding rule, e.g., baronn -> baron)
    Words of four or fewer letters are not altered.

Figure 1: Weak stemmer for the French language

Using our stemming procedure, the French words "baronnes" (baronesses), "barons" and "baron" will all be reduced to the same stem "baron".
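Figure 1 translates almost line for line into code. The sketch below is our own transcription and assumes lowercase input; it is not the exact program evaluated in the next section.

    def french_stem(word):
        """Weak French stemmer following Figure 1: removes number (and
        some gender) inflections; tense and person variations are left
        untouched."""
        if len(word) <= 4:              # words of four or fewer letters
            return word                 # are not altered
        if word.endswith("aux"):
            return word[:-3] + "al"     # chevaux -> cheval
        if word.endswith("x"):
            return word[:-1]            # hiboux -> hibou
        if word.endswith("s"):
            word = word[:-1]            # chantés -> chanté
        if word.endswith("r"):
            word = word[:-1]            # chanter -> chante
        if word.endswith("e"):
            word = word[:-1]            # chante -> chant
        if word.endswith("é"):
            word = word[:-1]            # chanté -> chant
        if len(word) > 1 and word[-1] == word[-2]:
            word = word[:-1]            # recoding rule: baronn -> baron
        return word

    for w in ("baronnes", "barons", "baron", "chevaux", "hiboux"):
        print(w, "->", french_stem(w))  # baron* all conflate to "baron"

Because the rules apply in sequence, "baronnes" first loses its final «-s», then the «-e», and finally the doubled «n», which reproduces the conflation described above.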
Of course, various counter-examples can also be found, such as "français" and "françaises" (the adjective "French" in its masculine and feminine plural forms), which cannot be reduced to the same root ("français" is reduced to "françai", a non-French word, and "françaises" to "français"). Moreover, automatic stemming procedures do not always obtain the exact semantic root of a given form, so we are faced with various conflation errors (see examples from various English stemming procedures in (Krovetz, 1993)). Working with "real" and large text collections reveals other problems, such as the conflation of misspelled terms or the removal of suffixes from proper nouns appearing in a document or a request.

Experimental Results

To evaluate the retrieval effectiveness of our suggested stopword list and stemming procedure, we used two French test collections. The first corpus, OFIL, contains selected articles from the French newspaper Le Monde (11,016 documents, 26 queries). INIST is our second test collection, composed of very short abstracts of scientific articles (163,308 documents and 30 queries). Various statistics regarding both test collections can be found in Appendix 2.

As a means of evaluation, we used the non-interpolated average precision at 11 recall values provided by the TREC2_EVAL software, based on 1,000 retrieved items per request (Harman, 1995). To decide whether one search strategy is better than another, we need a decision rule. The following rule of thumb may be used: a difference of at least 5% in average precision is generally considered significant, and a 10% difference is considered material (Sparck Jones & Bates, 1977, p. A25). For a more precise decision, we might also apply statistical inference methods such as Wilcoxon's signed rank test (Salton & McGill, 1983, Section 5.2; Hull, 1993) or hypothesis testing based on the bootstrap methodology (Savoy, 1997).

Evaluation of stemming and nonstemming searches

In evaluating various search strategies, we considered the OKAPI probabilistic model (Robertson et al., 1995) and various vector-processing schemes (retrieval status value computed according to the inner product (Salton, 1989, p. 318)). Following Buckley et al. (1995), we used three letters to denote the weighting method for documents, combined with three letters for the weighting method for queries. The exact formulation of each indexing scheme is described in more detail in Appendix 3. For example, one can find the simple coordinate match (doc = BNN, query = BNN), within which the retrieval status value of each document corresponds to the number of terms it has in common with the query. Another simple indexing strategy, which uses only the occurrence frequency of each term in the document or the request, is described by the label "doc = NNN, query = NNN". In addition to these two well-known indexing weighting schemes, we also employed more complex indexing formulae (e.g., LTN, LTC, ATN), within which an indexing term weight depends on both its frequency of occurrence within a document and its importance within the entire collection (idf component). Finally, we also used the LNU and OKAPI weighting schemes, which take account of document length.

To provide a more precise interpretation of these retrieval effectiveness results, in the following tables we have underlined statistically significant differences based on a one-sided Wilcoxon signed rank test with a significance level fixed at 5%.
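Such a per-query test can be reproduced with standard statistical software. Here is a minimal sketch using SciPy's wilcoxon function (recent versions support the one-sided alternative); the per-query average precision values are invented for illustration.

    from scipy.stats import wilcoxon

    # Hypothetical per-query average precision under two retrieval schemes;
    # in our experiments such values come from TREC2_EVAL runs.
    scheme_a = [0.32, 0.18, 0.41, 0.25, 0.37, 0.29, 0.44, 0.21]
    scheme_b = [0.35, 0.22, 0.40, 0.31, 0.42, 0.28, 0.49, 0.27]

    # One-sided test at the 5% level: is scheme B better than scheme A?
    stat, p_value = wilcoxon(scheme_b, scheme_a, alternative="greater")
    print(f"W = {stat}, p = {p_value:.4f}, significant: {p_value < 0.05}")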
Our baseline performance, shown in the second column of Table 1, is achieved by a retrieval scheme that does not use a stopword list and ignores our weak suffix-stripping procedure.

Table 1a: Average precision of various indexing strategies (OFIL collection)

Table 1b: Average precision of various indexing strategies (INIST collection)

From the data depicted in Table 1, it can be seen that retrieval performance depends on the test collection: average precision for the INIST collection is lower than for the OFIL corpus. Based on the statistics shown in Appendix 2, we may point out that the average document length is much shorter for the INIST corpus than for the OFIL collection (52.0 words per document vs. 379.8). Short documents contain less evidence, resulting in poorer retrieval effectiveness. Moreover, the number of documents included in the INIST test collection is 14 times greater than the size of the OFIL collection.

The last two rows of Table 1 display the two poorest retrieval performances, achieved by retrieval schemes ignoring collection-wide information ("doc = NNN, query = NNN"; "doc = BNN, query = BNN"). On the other hand, the OKAPI probabilistic model yields very good retrieval performance for both test collections.

When presenting the results obtained by the various vector-processing strategies, we rank them according to the retrieval performance achieved on the OFIL corpus when using both the suggested stopword list and stemming procedure (last column of Table 1). In the first row, we add the OKAPI model (representing a probabilistic retrieval model), which has good retrieval performance overall. Looking at the INIST corpus, it can be seen that the ranking is not consistent across the two test collections, leading to the conclusion that the performance of a given search scheme depends on the characteristics of the underlying test collection.

The second column of Table 1 depicts the average performance obtained without using either the stopword list or the stemming procedure. The overall retrieval effectiveness is poor compared to the other columns, leading to the general conclusion that, for retrieval purposes, both stemming and removing highly frequent words are beneficial overall.

To study the relative merits of the stopwording and stemming procedures, the third column of Table 1 depicts the average performance obtained with stopwording but without our weak suffix-stripping procedure. The data show that stopwording is strongly advantageous for both collections when using the OKAPI search strategy. In our setup, we removed any search keyword having a negative indexing weight, which corresponds to very frequent words. Such a context is also strongly advantageous for the two poorest retrieval schemes. With the third search strategy ("doc = LTC, query = LTC"), there appears to be no advantage in using a stopword list. For the remaining strategies, stopwording seems to be beneficial, but the extent of the effect is varied and rather inconsistent across test collections.

The fourth column shows the average performance achieved by the various retrieval schemes with our suffix-removal procedure but without removing the highly frequent words included in the stopword list. The stemming procedure seems particularly beneficial for the INIST collection.
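The per-query measure underlying Table 1, non-interpolated average precision, can be computed directly from a ranked answer list. The following sketch uses the standard definition; the ranking and relevance judgments are invented for the example.

    def average_precision(ranking, relevant):
        """Non-interpolated average precision for one query: the mean of
        the precision values observed at the rank of each relevant
        retrieved document."""
        hits, precisions = 0, []
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    # Relevant documents d1 and d4 retrieved at ranks 1 and 3:
    print(average_precision(["d1", "d2", "d4", "d5"], {"d1", "d4"}))  # 0.8333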
Our stemming procedure can also be evaluated by looking at the average precision results depicted in the last column of Table 1. The comparative performance between the conflated and nonconflated document representations indicates that a stemming procedure significantly improves system performance; this finding is confirmed by other studies based on English language corpora (Krovetz, 1993; Hull, 1996) and partially by Harman's study (1991), in which the differences in average precision are close to 5%, the limit value of our significance level.

Leaving the two poorest strategies aside, stemming is highly beneficial for the INIST collection, but only modestly beneficial for OFIL. This is presumably related to the different document lengths and collection sizes. In OFIL documents (an average length of about 379.8 words), key concepts are likely to be mentioned several times, so both singular and plural forms will be represented: a search term is therefore likely to match the document whether its form is singular or plural. In the INIST collection (a mean document length of 52.0 words), this applies much less frequently, so the benefits of stemming are greater. For the two least effective strategies, stemming is significantly damaging for OFIL documents under the simple coordinate search strategy ("doc = BNN, query = BNN"), but is neutral or advantageous for INIST documents.

As usual, average performance may hide performance irregularities among requests. We therefore performed a more detailed per-query analysis of the performance achieved by the OKAPI model for stemming vs. nonstemming searches on both test collections, without stopwording. For the OFIL corpus, the stemming procedure performs better for 19 of the queries and worse for the remaining 7 (an average precision of 32.21 vs. 33.65, +4.47%). For the INIST collection, the stemming search performs better for 25 requests and worse for the remaining 5 (an average precision of 10.70 vs. 14.83, +38.60%). Based on the Wilcoxon signed rank test (significance level fixed at 5%), the null hypothesis stating that both retrieval schemes produce similar retrieval performance must be accepted for the OFIL collection. This null hypothesis is rejected for the INIST corpus (the difference in average precision between the two retrieval schemes is therefore significant). A similar conclusion can be drawn when using the bootstrap methodology.

In another experiment, we studied the retrieval effectiveness of various French stemming procedures. We developed another stemming algorithm which also removes the most frequent French derivational suffixes, identified by a quantitative study of the frequency of various endings. When we compared this strategy with our weak suffix-stripping approach, the difference in average precision was not significant (about 1.1%) and was in favor of our simple weak stemmer. These results tend to confirm other studies carried out on English stemming (Frakes, 1992, Section 8.3; Harman, 1991; Krovetz, 1993), in which the differences between various stemming procedures were not significant. However, since French morphology is far more complex than English morphology, a direct comparison cannot be made.
According to Popovic & Willett (1992), when trying to remove a large number of suffixes in a morphologically complex natural language, a simple stemming procedure seems to be more useful and effective than a more complex one, which results in more conflation errors.

And the accents?

In most European languages, one of the first problems encountered is the requirement of storing each character using 8 bits (e.g., using the ISO LATIN standard) instead of the standard ASCII code. In French, as in most European languages, accents are used to indicate the precise pronunciation and to distinguish some homographs (e.g., "où" means "where" and "ou" "or"; "mais" means "but" and "maïs" "corn").

Thus, according to the strict rules of composition (Corthésy et al., 1993), words containing letters with accents must be written with the accents, even when these words appear in capitals. The word "Québec" must therefore always be written with its accent (even in a title, as in "QUÉBEC"). However, if only the first letter of a word is a capital, any accent on it must be removed (e.g., "état" must be composed as "Etat").

As with every rule of usage, this principle is not always respected, and words in a title written in capitals usually appear without any accents. To account for this usage, the stopword list contains, for example, both the correct form of the verb "to be" ("être") and the form without an accent ("etre", which is no longer a French word).

To evaluate the relative importance of accents for retrieval purposes, we modified the queries that included accented words. For those terms, we automatically included in the request a copy of the corresponding word without its accents. For example, an original request written as "chômage et économie" (unemployment and economics) will be treated as "chômage chomage et économie economie". Our prior assumption was that such a modification could be valuable because a search keyword included without its accents would now match an identical word appearing in a title (and written in capitals without accents). Of course, we also assumed that a match with a word included in a title can be considered an important match. On the other hand, we knew that the exact meaning of a phrase is often affected when the accents are removed, as with the noun phrase "un dossier critiqué" (a criticized case) versus "un dossier critique" (a critical case).

Table 2a: Evaluation of various indexing strategies (OFIL collection)

Table 2b: Evaluation of various indexing strategies (INIST collection)
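The query modification evaluated in these tables can be sketched as follows; using Unicode decomposition to strip the accents is our own assumption, not necessarily the mechanism used in the original experiments.

    import unicodedata

    def strip_accents(word):
        """Drop diacritics via Unicode decomposition (é -> e, ô -> o)."""
        decomposed = unicodedata.normalize("NFD", word)
        return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    def expand_query(query):
        """After each accented query word, insert its unaccented copy."""
        expanded = []
        for word in query.split():
            expanded.append(word)
            bare = strip_accents(word)
            if bare != word:
                expanded.append(bare)
        return " ".join(expanded)

    print(expand_query("chômage et économie"))
    # -> chômage chomage et économie economie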