Python basic exercises


Advanced computational linguistics

1. Collect the most frequent words in 5 genres of the Brown Corpus:

news, adventure, hobbies, science_fiction, romance

To collect the most frequent words from the given genres, we can follow these steps:

>>> import nltk
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> news_text = brown.words(categories=['news', 'adventure', 'hobbies', 'science_fiction', 'romance'])

>>> from nltk.probability import FreqDist
>>> fdist = FreqDist([w.lower() for w in news_text])
>>> voca = fdist.keys()   # NLTK 2 returned keys sorted by frequency; in NLTK 3 use fdist.most_common()
>>> voca[:50]
['the', ',', '.', 'and', 'of', 'to', 'a', 'in', 'he', "''", '``', 'was', 'for',
'that', 'it', 'his', 'on', 'with', 'i', 'is', 'at', 'had', '?', 'as', 'be',
'you', ';', 'her', 'but', 'she', 'this', 'from', 'by', '--', 'have', 'they',
'said', 'not', 'are', 'him', 'or', 'an', 'one', 'all', 'were', 'would',
'there', '!', 'out', 'will']

>>> voca1 = fdist.items()   # in NLTK 3 use fdist.most_common(50) to get (word, count) pairs in frequency order
>>> voca1[:50]
[('the', 18635), (',', 17215), ('.', 16062), ('and', 8269), ('of', 8131),
('to', 7125), ('a', 7039), ('in', 5549), ('he', 3380), ("''", 3237),
('``', 3237), ('was', 3100), ('for', 2725), ('that', 2631), ('it', 2595),
('his', 2237), ('on', 2162), ('with', 2157), ('i', 2034), ('is', 2014),
('at', 1817), ('had', 1797), ('?', 1776), ('as', 1725), ('be', 1610),
('you', 1600), (';', 1394), ('her', 1368), ('but', 1296), ('she', 1270),
('this', 1248), ('from', 1174), ('by', 1157), ('--', 1151), ('have', 1099),
('they', 1093), ('said', 1081), ('not', 1051), ('are', 1019), ('him', 955),
('or', 950), ('an', 911), ('one', 903), ('all', 894), ('were', 882),
('would', 850), ('there', 807), ('!', 802), ('out', 781), ('will', 775)]

This shows that "the" is the most frequent word across the five genres.
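The counting step above can be sketched without the Brown Corpus data: NLTK's FreqDist is essentially a collections.Counter, so a toy token list (a stand-in assumption for the real brown.words(...) output) shows the same lowercase-then-count pattern in a way that runs on any Python 3 installation:

```python
from collections import Counter

# Toy token list standing in for brown.words(categories=[...])
tokens = ["The", "dog", "saw", "the", "cat", "and", "the", "cat", "ran", "."]

# Lowercase before counting, exactly as in the transcript's FreqDist call
fdist = Counter(w.lower() for w in tokens)

# most_common(n) returns (word, count) pairs in descending frequency order,
# the portable equivalent of slicing fdist.items() in old NLTK
top = fdist.most_common(2)
print(top)  # [('the', 3), ('cat', 2)]
```

The same `most_common` method exists on NLTK 3's FreqDist, which is why it is the recommended replacement for slicing `keys()` or `items()`.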

2. Exclude or filter out all words that have a frequency lower than 15 occurrences. (hint: using conditional frequency distribution)

Building on the first task, we can keep only the words whose frequency of occurrence is >= 15.

>>> filteredText = list(filter(lambda word: fdist[word] >= 15, fdist.keys()))
>>> filteredText[:50]   # first 50 words
['the', ',', '.', 'and', 'of', 'to', 'a', 'in', 'he', "''", '``', 'was', 'for',
'that', 'it', 'his', 'on', 'with', 'i', 'is', 'at', 'had', '?', 'as', 'be',
'you', ';', 'her', 'but', 'she', 'this', 'from', 'by', '--', 'have', 'they',
'said', 'not', 'are', 'him', 'or', 'an', 'one', 'all', 'were', 'would',
'there', '!', 'out', 'will']
>>> filteredText[-50:]   # last 50 words
['musical', 'naked', 'names', 'oct.', 'offers', 'orders', 'organizations',
'parade', 'permit', 'pittsburgh', 'prison', 'professor', 'properly',
'regarded', 'release', 'republicans', 'responsible', 'retirement', 'sake',
'secrets', 'senior', 'sharply', 'shipping', 'sir', 'sister', 'sit', 'sought',
'stairs', 'starts', 'style', 'surely', 'symphony', 'tappet', "they'd", 'tied',
'tommy', 'tournament', 'understanding', 'urged', 'vice', 'views', 'village',
'vital', 'waddell', 'wagner', 'walter', 'waste', "we'd", 'wearing', 'winning']
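The threshold filter can be demonstrated on toy data (a stand-in for the corpus counts above; the threshold is lowered to fit the small example). Note that in Python 3, filter() returns a lazy iterator, so it must be wrapped in list() before slicing:

```python
from collections import Counter

# Toy counts standing in for the Brown Corpus FreqDist
tokens = ["to", "be", "or", "not", "to", "be", "be"]
fdist = Counter(tokens)

threshold = 2  # the exercise uses 15; lowered here for the toy data
# In Python 3, filter() returns an iterator, so wrap it in list()
frequent = list(filter(lambda w: fdist[w] >= threshold, fdist))
print(frequent)  # ['to', 'be']
```

An equivalent and often more readable form is a list comprehension: `[w for w in fdist if fdist[w] >= threshold]`.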

3. Then exclude or filter out all stopwords from the lists you have created. (hint: using conditional frequency distribution)

To filter out the stopwords, we define a small function using NLTK's stopwords corpus for English.

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
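A self-contained sketch of the same function, runnable without downloading the NLTK data: a short inline stopword set stands in for nltk.corpus.stopwords.words('english'), and the toy token list is a made-up example, not from the transcript:

```python
# Tiny stand-in for nltk.corpus.stopwords.words('english'),
# which requires the NLTK data package to be downloaded
STOPWORDS = {"the", "a", "and", "of", "to", "in"}

def content_fraction(text):
    # Proportion of tokens that are NOT stopwords.
    # Note: under Python 2 this division would need float() to avoid
    # integer division; under Python 3 it returns a float as intended.
    content = [w for w in text if w.lower() not in STOPWORDS]
    return len(content) / len(text)

tokens = ["The", "cat", "sat", "in", "the", "garden"]
print(content_fraction(tokens))  # 0.5  (3 content words out of 6 tokens)
```

Applied to the filtered word list from task 2, the same comprehension (`w.lower() not in stopwords`) removes the stopwords while keeping the frequency-based filter intact.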
