Analysis and Practice of "Natural Language Processing" Course Teaching

Key words: natural language processing; practical teaching; cognition-driven; programming consolidation; artificial intelligence
CLC number: G642    Document code: A    Article ID: 1009-3044(2021)18-0160-02

1 Introduction
The "Natural Language Processing" course is an elective for the artificial intelligence major. It is a discipline that integrates linguistics, computer science, and mathematics; it studies the theories and methods that enable effective communication between humans and computers in natural language, and it is an important direction within both computer science and artificial intelligence [1-2]. The course is strongly theoretical and its body of knowledge is large. Its main teaching content includes lexical analysis, syntactic analysis, semantic analysis, text classification, dialogue systems, and statistical machine translation. Traditional teaching methods only acquaint students with the theory of natural language processing and make it hard for them to connect theory with practice and apply it flexibly; moreover, a purely theoretical teaching model lowers students' interest and motivation and does not cultivate creative thinking. To address these problems with the traditional teaching system, this paper builds on earlier reform-oriented practical teaching studies [3-7] and proposes a new "cognition-driven + programming-consolidation" teaching method, which meets the modern requirement that teachers keep pace with the times and teach students according to their aptitude.

4.1 "Cognition-driven" teaching
The "cognition-driven" teaching method is a teaching method based on the students' own cognition. Unlike traditional methods, which are carried out from the teacher's perspective, it is carried out from the student's perspective: each student's current level of understanding is the starting point for planning the learning content, and students explore a sub-field of natural language processing according to their own current understanding of the subject, while the teacher plays the role of observer and evaluator. On the one hand, the "cognition-driven" method raises students' interest and motivation and cultivates independent thinking and creativity during learning; on the other hand, it helps the teacher understand each student's knowledge base and, following the principle of teaching students according to their aptitude, design different teaching strategies for different students.
For example, when the "text processing" topic is taught, every student is asked to describe, from their current understanding, what text processing is and how a text should be processed. Students with a strong mathematics background may formalize the text-processing procedure with mathematical formulas, while students with strong programming skills may describe the workflow with a pseudocode algorithm skeleton.
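As an illustration only, the kind of "algorithm skeleton" such a student might sketch could look like the following minimal Python fragment (the function name and the toy sentence are invented for this example):

import re
from collections import Counter

def process_text(text):
    text = text.lower()                    # normalization
    tokens = re.findall(r"[a-z']+", text)  # a very crude tokenizer
    return Counter(tokens)                 # term-frequency statistics

print(process_text("The cat sat on the mat. The mat was flat."))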
Natural language processing books in PDF

Natural Language Processing (NLP) is an important research direction in computer science and artificial intelligence, covering methods and techniques for understanding, processing, and generating human language.
For learners and practitioners interested in NLP, reading the relevant books is an important way to get started quickly and to study the field in depth.
In this article I recommend several excellent natural language processing books and provide PDF download links for them, so that readers can conveniently obtain the study materials they need.
1. Speech and Language Processing (3rd edition), by Daniel Jurafsky & James H. Martin. Speech and Language Processing is one of the classic textbooks in natural language processing and covers a wide range of NLP topics, including linguistic foundations, text classification, sentiment analysis, and information extraction.
The book introduces the basic concepts and techniques of NLP in an accessible way, and its rich examples and case studies help readers understand and practice the material.
You can obtain the PDF of Speech and Language Processing via the following link: [PDF download link]
2. Foundations of Statistical Natural Language Processing, by Christopher D. Manning & Hinrich Schütze. Foundations of Statistical Natural Language Processing is a classic textbook on statistical natural language processing.
The book systematically introduces the basic principles and methods of solving language-processing problems with statistical approaches, with a large number of mathematical formulas and derivations.
Reading it gives the reader a thorough understanding of the theoretical foundations of statistical NLP.
You can obtain the PDF of Foundations of Statistical Natural Language Processing via the following link: [PDF download link]
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein & Edward Loper. Natural Language Processing with Python is a natural language processing textbook built around Python as its working tool.
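In the spirit of that book, a first NLP experiment in Python might look like the minimal sketch below; it assumes the nltk package is installed and that the 'punkt' and 'averaged_perceptron_tagger' data have been fetched with nltk.download:

import nltk

sentence = "Natural language processing makes computers read text."
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
print(nltk.pos_tag(tokens))            # attach a part-of-speech tag to each token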
Natural Language Processing Techniques

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language. In recent years, NLP techniques have made significant advancements in various applications such as sentiment analysis, chatbots, machine translation, and speech recognition. In this article, we will explore some of the most commonly used NLP techniques and their applications.
1. Tokenization: Tokenization is the process of breaking down a text into individual words, phrases, or symbols known as tokens. This technique is essential for many NLP tasks as it helps to convert unstructured text data into a structured format that can be easily processed by machines. Tokenization can be done at different levels, such as word level, sentence level, or character level.
2. Part-of-Speech (POS) tagging: POS tagging is the process of assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. This technique helps in understanding the syntactic structure of a sentence and is crucial for tasks like named entity recognition, sentiment analysis, and machine translation.
3. Named Entity Recognition (NER): Named Entity Recognition is the task of identifying and classifying named entities (such as names of people, organizations, locations, etc.) in a text. NER is widely used in information extraction, question answering systems, and social media analysis.
4. Sentiment Analysis: Sentiment analysis is the process of determining the sentiment expressed in a piece of text, whether it is positive, negative, or neutral. This technique is commonly used in social media monitoring, customer feedback analysis, and brand reputation management.
5. Machine Translation: Machine translation is the task of translating text from one language to another automatically. NLP techniques such as neural machine translation have significantly improved the accuracy and fluency of machine translation systems.
6. Text Classification: Text classification is the process of categorizing text data into predefined categories or classes. This technique is widely used in spam detection, topic categorization, and sentiment analysis.
7. Information Extraction: Information extraction is the process of automatically extracting structured information from unstructured text data. This technique is used in various domains such as web scraping, document summarization, and question answering systems.
8. Summarization: Text summarization is the task of generating a concise and coherent summary of a longer text. NLP techniques such as extractive and abstractive summarization have been widely used in news summarization, document summarization, and keyword extraction.
9. Word Embeddings: Word embeddings are vector representations of words in a continuous vector space. This technique allows us to capture semantic relationships between words and is crucial for tasks like named entity recognition, sentiment analysis, and machine translation.
10. Speech Recognition: Speech recognition is the task of automatically converting spoken language into text. NLP techniques such as acoustic modeling and language modeling have significantly improved the accuracy and performance of speech recognition systems.
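As a concrete illustration of technique 6 (text classification), the following minimal sketch trains a tiny spam/ham classifier; it assumes scikit-learn is installed, and the four training sentences and their labels are made-up examples:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "meeting at 10am tomorrow",
               "cheap pills online", "project report attached"]
train_labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # bag of words + naive Bayes
model.fit(train_texts, train_labels)
print(model.predict(["free pills, claim your prize"]))     # most likely ['spam']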
In conclusion, natural language processing techniques have revolutionized the way we interact with machines and have enabled a wide range of applications in various domains. As NLP continues to evolve and innovate, we can expect even more advanced applications and capabilities in the future.
"Natural Language Processing" Course Syllabus

I. Basic course information
1. Course number: CS229
2. Course title (Chinese/English): 自然语言处理 / Natural Language Processing
3. Class hours / credits: 32 / 2
4. Prerequisites: Programming Languages
5. Intended students: third- and fourth-year undergraduates (ACM class)
7. Textbook and references:
James Allen. Natural Language Understanding (2nd ed.). The Benjamin/Cummings Publishing Company, Inc., 1995.
Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
II. Nature and tasks of the course
Natural Language Processing is an elective course for the Computer Science and Technology major.
Its main task is to acquaint students with the principal research topics and key techniques of natural language processing, to introduce research results in the field, and to prepare students for research and development work in natural language processing.
In addition, by guiding students to read papers from computational linguistics conferences, to summarize and evaluate them, and to present, question, and discuss them, the course gives students a deeper understanding of how the concepts taught relate to currently popular methods and techniques.
On this basis, students are required to complete a course project on a natural language processing topic, so that they can use what they have learned to search the literature, survey the latest domestic and international theory and technology in a chosen research area, and finally put it into practice.
III. Teaching content and basic requirements
1. Overview (4)
1.1 History of Natural Language Processing (NLP)
1.2 Different Levels of Language Analysis
1.3 Applied Approaches in NLP Systems
1.4 NLP Applications
2. Lexicons and Lexical Analysis (8)
2.1 Lexicon: A Language Resource
2.2 A Lexicon for English Words: WordNet
2.3 Generative Lexicon
2.4 Finite State Models and Morphological Analysis
2.5 Collocation
2.6 Statistical n-gram language models
3. Syntactic Processing (14)
3.1 Basic English Syntax
3.2 Grammars and Parsing
3.3 Features and Augmented Grammars
3.4 Grammars for Natural Language
3.5 Toward Efficient Parsing
3.6 Ambiguity Resolution: Statistical Methods
4. Semantic Interpretation (10, optional)
4.1 Semantics and Logical Form
4.2 Linking Syntax and Semantics
4.3 Ambiguity Resolution
4.4 Other Strategies for Semantic Interpretation
5. Learning Approaches for Natural Language Processing (8 hrs)
5.1 Main machine learning approaches: maximum entropy, k-nearest neighbor, support vector machines
5.2 Sequence labeling: HMM, Maximum Entropy Markov Model and CRFs
5.3 A case study: train a part-of-speech tagger from a labeled corpus (see the sketch at the end of this syllabus)
6. An Introduction to Human Languages
7. Student Workshop
IV. Lab (hands-on) content and basic requirements
1. Read assigned research papers on natural language processing, to develop students' ability to read the professional literature.
2. Hold a student workshop in which some of the students summarize and evaluate the papers they have read, then present them for questions and discussion.
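For unit 5.3, a minimal version of the case study could be sketched as follows; this is an illustration only (it assumes NLTK and its 'treebank' sample corpus are installed), and a real offering might instead use the HMMs or CRFs listed in unit 5.2:

import nltk
from nltk.corpus import treebank

tagged = treebank.tagged_sents()          # labeled corpus of (word, tag) sentences
train, test = tagged[:3000], tagged[3000:]

# Train a simple unigram tagger that falls back to the most common tag.
tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))
print(round(tagger.accuracy(test), 3))    # tagger.evaluate(test) on older NLTK versions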
Natural Language Processing

Testing against Natural Language Requirements
Harry M. Sneed
Anecon GmbH, Vienna, Austria
Email: Harry.Sneed@t-online.at
Abstract: Testing against natural language requirements is the standard approach for system and acceptance testing. This test is often performed by an independent test organization unfamiliar with the application area. The only things the testers have to go by are the written requirements. So it is essential to be able to analyze those requirements and to extract test cases from them. In this paper an automated approach to requirements-based testing is presented and illustrated on an industrial application.
Keywords: Acceptance Testing, System Testing, Requirements Analysis, Test Case Generation, Natural Language Processing
A test is always a test against something. According to the current test literature this something can be the code itself, the design documentation, the data interfaces, the requirements or the unwritten expectations of the users [1]. In the first case, one speaks of code-based testing, where the test cases are actually extracted from an analysis of the code. In the second case, one speaks of design-based testing, where test cases are taken from the design documents, e.g. the UML diagrams. In the third case, we speak of data-based testing, where test cases are generated from the data structures, e.g. the SQL schema or the XML schema. In the fourth case, we speak of requirements-based testing, where we extract the test cases from the requirements documents. This is also known as functional testing. In the fifth and final case, we speak of user-based testing, in which a representative user invents test cases as he goes along. This is also referred to as creative testing [2].
Another form of testing is regression testing, in which a new version of a previous system is tested against the older version. Here the test cases are taken from the old data or from the behavior of the old system. In both cases one is comparing the new with the old, either entirely or selectively [3].
1. Functional testing
In this paper the method of requirements-based testing is described, i.e. testing against the functional and non-functional requirements as defined in an official document. This type of testing is used primarily as a final system test or as an acceptance test. Bill Howden referred to this as functional testing [4]. It assumes that other kinds of testing, such as code-based unit testing and/or design-based integration testing, have already taken place, so that the software is executable and fairly reliable. It is then the task of the requirements-based test to demonstrate that the system does what it should do according to the written agreement between the user organization and the developing organization. Very often this test is performed by an independent test organization so as to eliminate any bias. The test organization is called upon not only to test, but also to interpret the meaning of the requirements. In this respect, the requirements are similar to laws and the testers perform the role of a judge, whose job it is to interpret the laws and apply them to a particular case [5].
What laws and requirements have in common is that they are both written in natural language and they are both fuzzy. Thus, they are subject to multiple interpretations. Judges are trained to interpret laws. Testers are not always prepared to interpret requirements. However, in practice this is the essence of their job.
Having an automated tool to dissect the requirement texts and to distinguish between different types of requirement statements is a first step in the direction of automated requirements testing. The Text Analyzer is intended to be such a tool.
2. The nature of natural language requirements
Before examining the functions of a requirement analysis tool, it is first necessary to investigate the nature of requirement documents. There may be certain application areas where requirements are written in a formal notation. There are languages for this, such as VDM, SET and Z, and more recently OCL, the Object Constraint Language propagated by the OMG [6]. However, in the field of information technology such formal methods have never really been accepted. There, the bulk of the requirements are still written in prose text.
Christof Ebert distinguishes between unstructured text, structured text and semi-formal text [7]. In a structured text the requirements are broken down into prescribed chapters and sections with specific meanings. A good example of this is the ANSI/IEEE-830 Guide to Requirements Specification. It prescribes a nested hierarchy of topics, including the distinction between functional and non-functional requirements [8]. Functional requirements are defined in terms of their purpose, their sequence, their preconditions and post conditions as well as their inputs and outputs. Inputs are lists of individual arguments and outputs are lists of results. Arguments and results may even be defined in respect to their value ranges. This brings the functional specification very close to a functional test case.
Non-functional requirements are to be defined in terms of their pass and fail criteria. Rather than depicting what flows in and out of a function, a measurable goal is set, such as a response time of less than 3 seconds. Each non-functional requirement may have one or more criteria which have to be fulfilled in order for the requirement to be fulfilled. In addition to the functional and non-functional requirements of the product, the ANSI/IEEE standard also stipulates that constraints, risks and other properties of the project be defined. The end result is a highly structured document with 7 sections. Provided that standard titles or standard numbering is used, a text analysis tool should easily recognize what is being described even if the description itself is not interpretable. With such a structured document a tester has an easier job of extracting test cases. The job becomes even easier if the structured requirements are supplemented by acceptance criteria as proposed by Christiana Rupp and others [9]. After every functional and non-functional requirement a rule is defined for determining whether the requirement is fulfilled or not. Such a rule could be that in case of a withdrawal from an account, the balance has to be less than the previous balance by the withdrawal amount:
Account = Account@pre - Withdrawal;
An acceptance criterion is equivalent to a post-condition assertion, so it can be readily copied into a test case definition.
Semi-formal requirements go one step further. They have their text content placed in a specific format, such as the use case format. Use cases are typical semi-formal descriptions. They have standardized attributes which the requirement writer must fill out, attributes like trigger, rule, precondition, post condition, paths, steps and relations to other use cases.
In the text these attributes will always have the same name so that a tool can readily recognize them. Most of the use cases are defined in standard frameworks or boxes which make it even easier to process them [10].
A good semi-formal requirements document will also have links between the use cases and the functional requirements. Each requirement will consist of a few sentences and will have some kind of number or mnemonic identifier to identify it. This identifier will then be referred to by the use case. One use case can fulfill one or more functional requirements. One attribute of the use case will be a list of such pointers to the requirements it fulfills [11].
At the upper end of a semi-formal requirement specification, arithmetic expressions or logical conditions may be formulated. Within an informal document there can be scattered formal elements. These should be recognizable to an analysis tool.
In the current world of information technology, requirement documents range from structured to semi-formal. Even the most backward users will have some form of structured requirements document in which it is possible to distinguish between individual functional requirements as well as between constraints and non-functional requirements. More advanced users will have structured, semi-formal documents in which individual requirements are numbered, use cases are specified with standardized attributes, and processing rules are defined in tables. Really sophisticated requirement documents, such as can be found in requirements engineering tools like Doors and Rational RequisitePro, will also have links between requirements, rules, objects and processes, i.e. use cases [12].
3. The testing strategy
A software system tester in industry is responsible for demonstrating that a system does what it is supposed to do. To accomplish this, he must have an oracle to refer to. The concept of an automated oracle for functional testing was introduced by Howden in 1980 [13]. As foreseen by Howden then, the test oracle was to be a formal specification in terms of pre and post conditions. However, the oracle could also be a natural language text, provided the text is structured and has some degree of formality. In regression testing the oracle is the input and output data of the previous version. In unit testing it is the pre and post conditions of the methods and the invariant states of the objects. In integration testing it is the specification of the interfaces, and in system testing it is the requirements document [14]. Thus, it is the task of the system tester to extract test cases from the functional and non-functional requirements. Using this as a starting point, he then proceeds to carry out seven steps on the way to achieving confidence in the functionality of a software system. These seven steps are:
1. identifying the test cases
2. creating a test design
3. specifying the test cases
4. generating the test cases
5. setting up the test environment
6. executing the test
7. evaluating the test.
3.1 Identifying the test cases
Having established what is to be tested against, i.e. the test oracle, it is first up to the tester to analyze that object and to identify the potential test cases. This is done by scanning through the document and selecting all statements about the behavior of the target system which need to be confirmed. These statements can imply actions or states, or they define conditions which have to be fulfilled if an action is to take place or a state is to hold [15].
Producing a customer reminder is an action of the system.
The fact that the customer account is overdrawn is a state. The rule that when a customer account is overdrawn the system should produce a customer reminder is a condition. All three are candidates for a test case. Testing whether the system produces a customer reminder is one test case. Testing whether the customer account can be overdrawn is another test case, and testing whether the system produces a customer reminder when the customer account is overdrawn is a test case which combines the other two.
In scanning the requirements document the tester must make sure to recognize each action to be performed, each state which may occur and each condition under which an action is performed or a state occurs. From these statements the functional test cases are extracted. But not only the functional test cases. Statements like "the response time must be under 3 seconds" and "the system must recognize any erroneous input data" are non-functional requirements which must be tested. Every statement about the system, whether functional or non-functional, is a potential test case. The tester must recognize and record them [16].
3.2 Creating a test design
Of course, this is only the beginning of a system test. Once the test cases have been defined they must be ordered by time and place and grouped by test session. A test session encompasses a series of test cases performed within one dialog session or one batch process. In one session several requirements and several related use cases are executed. The test cases can be run sequentially or in parallel. The result of this test case ordering by execution sequence is part of the test design.
3.3 Specifying the test cases
Following the test design is the test case specification. This is where the attributes of the test cases are filled out in detail, down to the level of the input and output data ranges. Each test case will already have an identifier, a purpose, a link to the requirements, objects and use cases it tests, as well as a source, a type and a status. It may even have a pre and post condition, depending on how exact the requirements are. Now it is up to the tester to design the predecessor test cases, the physical interface or database being tested, and to assign data values. Normally the general test case description will be in a master table, whereas the input and output values will be in sub-tables, one for the test inputs and one for the expected outputs. In assigning the data, the tester will employ such techniques as equivalence classes, representative values, boundary values and progression or degression intervals. Which technique is used depends on the type of data. In the end there will be for each test case a set of arguments and results [17].
3.4 Generating the test data
Provided the test data definitions are made with a formal syntax, the test data itself can then be automatically generated. The tester may only have to oversee and guide the test data generation process. The basis for the test data generation will be the interface descriptions such as HTML forms, XML schemas, WSDL specifications and SQL database schemas. The values extracted from the test case specifications are united with the structural information provided by the data definition formats to create test objects, i.e. GUI instances, maps, records, database tables and other forms of test data [18].
3.5 Setting up the test environment
In the 5th step the test environment is prepared. Test databases must be allocated and filled with the generated data.
Test work stations are loaded with the client software and the input test objects. The network is activated. The server software is initialized. The source code may be instrumented for tracing execution and test coverage.
3.6 Executing the test
Now the actual test can be started, one session at a time or several sessions in parallel, depending on the type of system under test. The system tester will either be submitting the input data manually or operating a tool for submitting the data automatically. The latter approach is preferable since it is not only much faster, but also more reliable and above all repeatable. While the test is running the execution paths are being monitored and the test coverage of the code is being measured.
3.7 Evaluating the test
After each test session or test run the tester should perform an analysis of the test results. This entails several sub-tasks. One sub-task will be to report any incidents which may have occurred during the test session. Another task will be to record and document the functional test coverage. A third and vital task is to confirm the correctness of the data results, i.e. the post conditions. This can and should be done automatically by comparing the actual results with the expected results as specified in the test cases. Any deviations between the actual and the specified data results should be reported. Finally the tester will want to record various test metrics such as the number of test cases executed, the number of requirements tested, the number of data validated, the number of errors recorded and the degree of test coverage achieved [19].
4. Automating the requirement analysis
As can be gathered from this summary of the system tester's tasks, there are many tasks which lend themselves to automation. Both test data generation and test data validation can be automated. Automated test execution has been going on for years and there are several tools for performing this. What are weakly automated are the test case specification and the test design. Not automated at all are the activities of setting up the test environment and identifying the test cases [20].
The focus of this paper is on the latter activity, i.e. identifying and extracting test cases. It is the first and most important task in functional system testing. Since the test we are discussing here is a requirements-based test, the test cases must be identified in and extracted from the requirements document.
The tool for doing that is the text analyzer developed by the author. The same tool goes on to create a test design, thus covering the first two steps of the system testing process. The Text Analyzer was conceived to do what a tester should do when he begins a requirements-based system test. It scans through the requirements text to pick out potential test cases.
4.1 Recognizing and selecting essential objects
The key to requirements analysis is to have a natural language processor which extracts information from the text based on key words and sentence structure. This is referred to as text mining, a technique used by search engines on the internet [21]. The original purpose of text mining was to automatically index documents for classification and retrieval. The purpose here is to extract test cases from natural language text.
Test cases relate to the objects of a system. Objects in a requirement document are either acted upon or their state is checked. Therefore, the first step of the text analysis is to identify the pertinent objects. For this all of the nouns must be identified.
This is not an easy task, especially in the English language, since nouns can often be verbs or compound words such as master record. In this respect other languages such as German and Hungarian are more precise. In German, nouns begin with a capital letter, which makes object recognition even easier.
A pre-scanner can examine the text to identify and record all nouns. However, only the human analyst can determine which nouns are potential objects based on the context in which they are used. To this end all of the nouns are displayed in a check box and the user can uncheck all nouns which he perceives to be irrelevant. The result is a list of pertinent nouns which can be recorded as the essential objects. Depending on the scope of the requirements document, their number can be anywhere from 100 to 1000.
Besides that, object selection is apt to trigger a lengthy and tedious discussion among the potential users about which objects are relevant and which are not. In presenting the list of potential objects it becomes obvious how arbitrary software systems are. In order to come up with an oracle to test against, the users must first come to a consensus on what the behavior of the system should be. Confronting them with the contradictions in their views helps to establish that consensus [22].
4.2 Defining key words in context
As a prerequisite for the further text analysis, the user must identify the key words used in the requirement text. These key words can be any string of characters, but they must be assigned a predefined meaning. This is done through a key word table. There are currently some 20 predefined notions which can be assigned to a key word in the text. These are:
SKIP = ignore lines beginning with this word
REQU = this word indicates a requirement
MASK = this word indicates a user interface
INFA = this word indicates a system interface
REPO = this word indicates a report
INPT = this word indicates a system input
OUTP = this word indicates a system output
SERV = this word indicates a web service
DATA = this word indicates a data store
ACT = this word indicates a system actor
TRIG = this word indicates a trigger
PRE = this word indicates a precondition
POST = this word indicates a post condition
PATH = this word indicates a logical path or sequence of steps
EXCP = this word indicates an exception condition
ATTR = this word indicates any user assigned text attribute
RULE = this word indicates a business rule
PROC = this word indicates a business process
GOAL = this word indicates a business goal
OBJT = this word is the name of an object
By means of the key words, the analyzer is able to recognize certain requirement elements embedded in the prose text.
4.3 Recognizing and extracting potential test cases
The next step is for the tool to make a second scan of the document. This time only sentences in which an essential object occurs are processed; the others are skipped over. Each sentence selected is examined as to whether it is an action, a state query, or a condition. The sentence is an action when the object is the target of a verb. The sentences "The customer account is updated daily" and "The system updates the customer account" are both actions. The account is the object and updates is the action.
The test case will be to test whether the system really updates the account.
The sentence "The account is overdrawn when the balance exceeds the credit limit" is a state which needs to be tested, and the sentence "If an account is overdrawn, it should be frozen until a payment comes in" is a condition combining an object state with an object action. The object is the account. The state is overdrawn. The action is should be frozen. There are actually two tests here. One is to confirm that an account can become overdrawn. The other is to confirm that an account is frozen when it is overdrawn.
To qualify as a statement to be tested, a sentence must contain at least one relevant object. In the sentence "If his credit rating is adequate, the customer can order a book." there are three relevant objects - credit, customer and book - so this qualifies the sentence to be processed further. The clause "if his credit rating is adequate" indicates that this is a condition which needs to be tested. There are many words which can be used to identify a condition. Besides the word if there are other words like should, provided, when, etc.; there are also word patterns like in case of and as long as. When they occur the statement is assumed to be a condition.
If the sentence is not a condition it may be a state declaration. A state declaration is when a relevant object is declared to be in a given state, i.e.
The customer must be registered.
The word customer is a selected object and the word pattern be registered indicates a state that the object is in. Predicate verbs such as be, is, are, were, etc. denote that this is a state declaration.
If the sentence is neither a condition nor a state, it may be an action. An action is indicated by a verb which acts upon a selected object, e.g.
The system checks the customer order.
Here the order is a relevant object and checks is a verb which acts upon it. Normally these verbs will end with an "s" if they are in the present tense and with "ed" if they are in the past tense, which makes it easier to recognize them. The advantage of requirement texts as opposed to texts in general is that they are almost always written in the third person, thus reducing the number of verb patterns to be checked. Sentences which qualify as a statement to be tested are extracted from the text and stored in the test case table. Assuming that all sentences are embedded in the text of a section, a requirement or a use case, it is possible to assign test cases to individual requirements, use cases or simply to titled sections. If a test case originates from a requirement it receives the number or title of that requirement. If the test cases are created from a use case, then they bear the title of that use case. If these structural elements are missing the test case is simply assigned to the last text title. Relations are also established between test cases and objects. Test cases extracted from a particular sentence will have a reference to the objects referred to in that sentence.
A generated test case will have an id, a purpose, a trigger, a pre-condition and a post-condition. The id of the test case is generated from the system name and a running number.
The condition "if the customer's credit rating is adequate, he can order a book" implies two preconditions:
1. the customer's credit rating is adequate
2. the customer's credit rating is not adequate
There are also two post conditions:
1. the customer has ordered a book
2. the customer has not ordered a book
This shows that for every conditional clause there should be two test cases:
- one which fulfils the condition, and
- another which does not fulfil the condition.
They both have the same trigger, namely the customer orders a book.
These are samples of functional test cases. Non-functional test cases are all either states or conditions. The sentence "The system should be able to process at least 2000 transactions per hour" is a state denoted by the verb should be. The sentence "In case of a system crash, the system has to be restarted within 2 minutes" is a condition determined by the predicate in case of, followed by an action restarted. Both requirements must be tested. The tool itself can only distinguish between functional and non-functional test cases based on the objects acted on or whose state is checked. Here again the user must interact by marking those objects, such as system, which are not part of the actual application.
4.4 Storing the potential test cases
The result of the text analysis is a table of potential system test cases. If the requirements document is structured so that the individual requirements are recognizable, the test cases will be ordered by requirement. If there are use case definitions, the test cases extracted from a particular use case will be associated with that use case. Otherwise, the test cases will be grouped by subtitles.
In the end every test case, whether functional or non-functional, will have at least the following attributes:
- a test case id
- a test case purpose = the sentence from which the case was taken
- a test case type = {action | state | condition}
- a precondition
- a post condition
- a trigger
- a reference to the objects involved
- a reference to the requirements being tested
- a reference to the use case being tested
5. Generating a test design
It is not enough to extract potential test cases. The test cases also need to be assigned to an overall test framework. The test framework is derived from the structure of the requirements document. Requirements should be enhanced by events. An event is something which occurs at one place at one time. Use cases are such events. An account withdrawal is an example of a use case event. A money transfer is another event. Printing out an account statement is yet another event. Events are triggered by a user, by the system itself or by some other system.
In system testing it is essential to test every event, first independently of the other events and then in conjunction with them. An event will have at least two test cases - a positive and a negative outcome - but it may have many. In the case of an account withdrawal, the user may enter a bad PIN number, he may have an invalid card, the amount to be withdrawn may exceed the daily limit or his account may be frozen. There are usually 4 to 20 test cases for each use case.
In generating a test design the text analyzer tool orders the test cases by event. The event is the focus of a testing session. Requirements and essential objects are assigned to an event so that it becomes clear which functions and which objects are tested within a session. If recognizable in the requirements text, the user or system interface for an event is also assigned.
This grouping of all relevant information pertaining to an event is then presented in an XML document for viewing and editing by the tester. In so doing, the text analyzer has not only extracted the potential test cases from the requirements, it has also generated a test design based on the events specified.
6. Experience with automated requirements analysis
The German language version of the text analyzer was first employed in a web application for the state of Saxony at the beginning of 2005. The requirements of that application were split up among 4 separate documents with 4556 lines of text. Some 677 essential objects were identified. Specified with these objects were 644 actions, 103 states and 114 rules. This led to 1103 potential test cases in 127 use cases. The generated test case table served as a basis for the test case specification. As might be expected, several test cases were added, so in the end there were 1495 test cases to be tested. These test cases revealed 452 errors in the system under test as opposed to the 96 errors discovered in production, giving an error discovery rate of 89%. This demonstrated that the automatic extraction of test cases from requirements documents, complemented by manual test case enhancement, is a much cheaper and more efficient way of exposing errors than a purely manual test case selection process [23]. Besides that, it achieves higher functional test coverage. In this project over 95% of the potential functions were covered.
Since this first trial in the Saxon e-Government project, the German language version has been employed in no less than 12 projects to generate test cases from the requirements text, including a project to automate the administration of the Austrian Game Commission, a project to introduce a standard software package for administering the German waterways, and a project to develop a university web site for employment opportunities.
The English language version has only recently been completed, but has already been used in 3 projects: once for analyzing the use cases of a mobile phone billing system, secondly for analyzing the requirements of an online betting system, and thirdly to generate test cases for a Coca Cola bottling and distribution system. In the case of the mobile billing system, a subsystem with 7 use cases was analyzed in which there were 78 actions and 71 rules for 68 objects, rendering 185 test cases. The online betting system had 111 requirements, of which 89 were functional and 22 were non-functional. There were 69 states, 126 actions and 112 rules for 116 specified objects, from which 304 test cases were extracted. The specification of the Coca Cola distribution system is particularly interesting because it used neither a list of requirements nor a set of use cases, but instead a table of outputs to a relational database. In the first column of the table was the name of the output data, in the second the data description, in the third the algorithm for creating the data and in the fourth the condition for triggering the algorithm. A typical output specification is depicted in Table 1.
Table 1: Output Specification
Name: A400Total
Definition: Number of Bottles
Source: XX20 Quantity from Mobile Device
Condition: Transtype < 5 (Sampling); Transtype > 7 (Breakage); ARTIDF = 1A or 1R
For this one output 6 test cases were generated:
1. Transtype < 5 & ARTIDF = 1A
2. Transtype < 5 & ARTIDF = 1R
3. Transtype < 5 & ARTIDF != 1A & ARTIDF != 1R
4. Transtype > 7 & ARTIDF = 1A
5. Transtype > 7 & ARTIDF = 1R
6. Transtype > 7 & ARTIDF != 1A & ARTIDF != 1R
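The combinatorial expansion behind Table 1 can be illustrated with a small sketch; the condition strings below are taken from the table, while the code itself is only an illustration of the idea, not the author's tool:

from itertools import product

transtype_cases = ["Transtype < 5", "Transtype > 7"]
artidf_cases = ["ARTIDF = 1A", "ARTIDF = 1R", "ARTIDF != 1A & ARTIDF != 1R"]

# Every combination of the two condition groups yields one potential test case.
for i, (t, a) in enumerate(product(transtype_cases, artidf_cases), start=1):
    print(f"test case {i}: {t} & {a}")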
Natural language

Brief History
• In the early 1930s the French scientist G. B. Artsrouni put forward the idea of using machines for translation.
• In 1933 the Soviet inventor P. P. Troyanskii designed a machine for translating one language into another and registered his invention on 5 September of that year; however, because the technology of the 1930s was still very limited, his translation machine was never built.
• In 1946 the first modern electronic computer, ENIAC, was born.
• While discussing the range of applications of electronic computers, the American scientist W. Weaver and the British engineer A. D. Booth proposed in 1947 the idea of using computers for automatic language translation.
• In 1949 W. Weaver published his "Translation" memorandum, formally putting forward the idea of machine translation.
• It may be enriched by review of business process and system documentation, functional or technical specifications, data dictionaries, subject matter experts, or other sources of data knowledge.
• Each knowledge source is organized as a condition part that specifies when it is applicable and an action part that processes relevant blackboard elements and generates new ones.
• Computational cases distilled by data-mining techniques can take part in the design optimization process as a kind of knowledge source, lifting CAE from the level of design verification to the level of design driving.
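The second sentence above describes the classic blackboard organization of a knowledge source; a minimal sketch of that structure (all class and field names invented for illustration) could look like this:

class KnowledgeSource:
    def condition(self, blackboard):
        # Condition part: say when this knowledge source is applicable.
        return "raw_text" in blackboard and "tokens" not in blackboard

    def action(self, blackboard):
        # Action part: process relevant blackboard elements and generate new ones.
        blackboard["tokens"] = blackboard["raw_text"].split()

blackboard = {"raw_text": "natural language processing"}
ks = KnowledgeSource()
if ks.condition(blackboard):
    ks.action(blackboard)
print(blackboard["tokens"])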
Natural Language Processing (NLP)

Classifying library books, classifying web pages, information filtering, ...
Information Retrieval (IR)
Retrieve relevant documents from a document collection based on keywords (Google search, other search engines, ...); also topic-oriented text acquisition.
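As a toy illustration of keyword-based retrieval (the three documents and the query below are made-up; real engines use inverted indexes and relevance ranking):

docs = {
    "d1": "machine translation converts text between languages",
    "d2": "library books are classified by subject",
    "d3": "statistical machine translation uses parallel corpora",
}
query = {"machine", "translation"}

# Return every document that contains at least one query keyword.
hits = [doc_id for doc_id, text in docs.items() if query & set(text.split())]
print(hits)  # ['d1', 'd3']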
Question Answering (QA)
Part-of-speech tagging
Label each word in a sentence with a class from a predefined set of categories.
Named entity recognition
Identify person names, place names, organization names, and other named entities in a sentence.
Word segmentation (for Chinese, Japanese, etc.)
Identify the words in a sentence.
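For Chinese word segmentation, one commonly used Python segmenter is jieba (mentioned here only as an illustration, not as a tool named by the original slides); a minimal example:

import jieba  # pip install jieba

print(jieba.lcut("自然语言处理是人工智能的重要方向"))
# e.g. ['自然语言', '处理', '是', '人工智能', '的', '重要', '方向']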
Lemmatization (for English)
Reduce the words in a sentence to their base forms, which then serve as the index to the word's other information (dictionary entries, word-specific rules).
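A minimal lemmatization sketch using NLTK's WordNet lemmatizer (assumes the 'wordnet' data has been downloaded; the part-of-speech hint changes the result):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))              # 'mouse'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'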
Word-formation characteristics
Inflection: changes in word endings and word forms that leave the part of speech unchanged.
(2nd ed.), Tsinghua University Press, 2002
Zhao Tiejun et al., Principles of Machine Translation, Harbin Institute of Technology Press, 2000
Zong Chengqing et al. (trans.), Statistical Machine Translation, Publishing House of Electronics Industry, 2012
Peter F. Brown, et al., A Statistical Approach to MT, Computational Linguistics, 1990, 16(2)
Jiang Zhenghua, chairman of the Central Committee of the Chinese Peasants' and Workers' Democratic Party, presided over the meeting. He said that more than 100 members of the party are taking part in this year's "two sessions" as deputies and committee members; every member should earnestly fulfill the duties of a deputy or committee member, hold a good session, conscientiously implement the spirit of the "two sessions" in the work of 1998, strengthen the party's own development, push the cause further forward, and make new contributions to building socialism with Chinese characteristics.
Before the meeting, the Central Committee of the Chinese Peasants' and Workers' Democratic Party invited party members attending the "two sessions" from all provinces, autonomous regions and municipalities to a social gathering.
Natural Language Processing (NLP)
NLP (Natural Language Processing) for NLP (Natural Language Programming)

NLP (Natural Language Processing) for NLP (Natural Language Programming)
Rada Mihalcea (1), Hugo Liu (2), and Henry Lieberman (2)
(1) Computer Science Department, University of North Texas, rada@
(2) Media Arts and Sciences, Massachusetts Institute of Technology, {hugo,henry}@
Abstract. Natural Language Processing holds great promise for making computer interfaces that are easier to use for people, since people will (hopefully) be able to talk to the computer in their own language, rather than learn a specialized language of computer commands. For programming, however, the necessity of a formal programming language for communicating with a computer has always been taken for granted. We would like to challenge this assumption. We believe that modern Natural Language Processing techniques can make possible the use of natural language to (at least partially) express programming ideas, thus drastically increasing the accessibility of programming to non-expert users. To demonstrate the feasibility of Natural Language Programming, this paper tackles what are perceived to be some of the hardest cases: steps and loops. We look at a corpus of English descriptions used as programming assignments, and develop some techniques for mapping linguistic constructs onto program structures, which we refer to as programmatic semantics.
1 Introduction
Natural Language Processing and Programming Languages are both established areas in the field of Computer Science, each of them with a long research tradition. Although they are both centered around a common theme - "languages" - over the years, there has been only little interaction (if any) between them [footnote 1]. This paper tries to address this gap by proposing a system that attempts to convert natural language text into computer programs. While we overview the features of a natural language programming system that attempts to tackle both the descriptive and procedural programming paradigms, in this paper we focus on the aspects related to procedural programming. Starting with an English text, we show how a natural language programming system can automatically identify steps, loops, and comments, and convert them into a program skeleton that can be used as a starting point for writing a computer program, expected to be particularly useful for those who begin learning how to program.
We start by overviewing the main features of a descriptive natural language programming system, METAFOR, introduced in recent related work [6]. We then describe in detail the main components of a procedural programming system as introduced in this paper. We show how some of the most difficult aspects of procedural programming, namely steps and loops, can be handled effectively using techniques that map natural language onto program structures. We demonstrate the applicability of this approach on a set of programming assignments automatically mined from the Web.
Footnote 1: Here, the obvious use of programming languages for coding natural language processing systems is not considered as a "meaningful" interaction.
2 Background
Early work in natural language programming was rather ambitious, targeting the generation of complete computer programs that would compile and run. For instance, the "NLC" prototype [1] aimed at creating a natural language interface for processing data stored in arrays and matrices, with the ability of handling low level operations such as the transformation of numbers into type declarations as e.g. float-constant(2.0), or turning natural language
statements like add y1to y2into the programmatic expres-sion y1+y2.Thesefirst attempts triggered the criticism of the community[3],and eventually discouraged subsequent research on this topic.More recently,however,researchers have started to look again at the problem of nat-ural language programming,but this time with more realistic expectations,and with a different,much larger pool of resources(e.g.broad spectrum commonsense knowledge [9],the Web)and a suite of significantly advanced publicly available natural language processing tools.For instance,Pane&Myers[8]conducted a series of studies with non-programming fifth grade users,and identified some of the programming models implied by the users’natural language descriptions.In a similar vein,Lieberman&Liu[5]have conducted a feasibility study and showed how a partial understanding of a text,coupled with a dialogue with the user,can help non-expert users make their intentions more precise when designing a computer program.Their study resulted in a system called M ETAFOR [6],[7],able to translate natural language statements into class descriptions with the associated objects and methods.Another closely related area that received a fair bit of attention in recent years is the construction of natural language interfaces to databases,which allows users to query structured data using natural language questions.For instance,the system described in[4],or previous versions of it as described in[10],implements rules for mapping natural to“formal”languages using syntactic and semantic parsing of the input text.The system was successfully applied to the automatic translation of natural language text into RoboCup coach language[4],or into queries that can be posed against a database of U.S.geography or job announcements[10].3Descriptive Natural Language ProgrammingWhen storytellers speak fairy tales,theyfirst describe the fantasy world–its characters, places,and situations–and then relate how events unfold in this world.Programming, resembling storytelling,can likewise be distinguished into the complementary tasks of description and proceduralization.While this paper tackles primarily the basics ofNLP(Natural Language Processing)for NLP(Natural Language Programming)321 building procedures out of steps and loops,it would be fruitful to also contextualize pro-cedural rendition by discussing the architecture of the descriptive world that procedures animate.Among the various paradigms for computer programming–such as logical,declar-ative,procedural,functional,object-oriented,and agent-oriented–the object-oriented and agent-oriented formats most closely embody human storytelling intuition.Consider the task of programming a MUD2world by natural language description,and the sen-tence There is a bar with a bartender who makes drinks[6].Here,bar is an instance of the object class bar,and bartender is an instance of the agent(a class with methods)class bartender,with the capability makeDrink(drink).Gener-alizing from this example,characters are reified as agent classes,things and places become object classes,and character capabilities become class methods.A theory of programmatic semantics for descriptive natural language programming is presented in[7];here,we overview its major features,and highlight some of the differences between descriptive and procedural rendition.These features are at the core of the Metafor[6]natural language programming system that can render code following the descriptive paradigm,starting with a natural language text.3.1Syntactic 
CorrespondencesThere are numerous syntactic correspondences between natural language and descrip-tive structures.Most of today’s natural languages distinguish between various parts of speech that taggers such as Brill’s[2]can parse–noun chunks are things,verbs are ac-tions,adjectives are properties of things,adverbs are parameters of actions.Almost all natural languages are built atop the basic construction called independent clause,which at its heart has a who-does-what structure,or subject-verb-directObject-indirectObject (SVO)construction.Although the ordering of subject,verb,and objects differ across verb-initial(VSO and VOS,e.g.Tagalog),verb-medial(SVO,e.g.Thai and English), and verb-final languages(SOV,e.g.,Japanese),these basic three ingredients are rather invariant across languages,corresponding to an encoding of agent-method and method-argument relationships.This kind of syntactic relationships can be easily recovered from the output of a syntactic parser,either supervised,if a treebank is available,or un-supervised for those languages for which manually parsed data does not exist.Note that the syntactic parser can also resolve other structural ambiguity problems such as prepositional attachment.Moreover,other ambiguity phenomena that are typically encountered in language,e.g.pronoun resolution,noun-modifier relationships,named entities,can be also tackled using current state-of-the-art natural language processing techniques,such as coreference tools,named entity annotators,and others.Starting with an SVO structure,we can derive agent-method and method-argument constructions that form the basis of descriptive programming.Particular attention needs to be paid to the I SA type of constructions that indicate inheritance.For instance,the statement Pacman is a character who...indicates a super-class character for the more specific class Pacman.2A MUD(multi-user dungeon,dimension,or dialogue)is a multi-player computer game that combines elements of role-playing games,hack and slash style computer games,and social instant messaging chat rooms(definition from ).322R.Mihalcea,H.Liu,and H.Lieberman3.2Scoping DescriptionsScoping descriptions allow conditional if/then rules to be inferred from natural lan-guage.Conditional sentences are explicit declarations of if/then rules,e.g.When the customer orders a drink,make it,or Pacman runs away if ghosts approach.Condition-als are also implied when uncertain voice is used,achieved through modals as in e.g. Pacman may eat ghosts,or adverbials like sometimes–although in the latter case the an-tecedent to the if/then is underspecified or omitted,as in Sometimes Pacman runs away.Fig.1.The descriptive and procedural representations for the conditional statement When cus-tomer orders a drink,the bartender makes itAn interesting interpretative choice must be made in the case of conditionals,as they can be rendered either descriptively as functional specifications,or procedurally as if/then constructions.For example,consider the utterance When customer orders a drink,the bartender makes it.It could be rendered descriptively as shown on the left of Figure1,or it could be proceduralized as shown on the right of the samefigure. 
Depending upon the surrounding discourse context of the utterance,or the desired rep-resentational orientation,one mode of rendering might be preferred over the other.For example,if the storyteller is in a descriptive mood and the preceding utterance was there is a customer who orders drinks,then most likely the descriptive rendition is more ap-propriate.3.3Set-Based Dynamic ReferenceSet-based dynamic reference suggests that one way to interpret the rich descriptive se-mantics of compound noun phrases is to map them into mathematical sets and set-based operations.For example,consider the compound noun phrase a random sweet drink from the menu.Here,the head noun drink is being successively modified by from the menu,sweet,and random.One strategy in unraveling the utterance’s programmatic im-plications is to view each modifier as a constraintfilter over the set of all drink instances. Thus the object aRandomSweetDrinkFromTheMenu implies a procedure that cre-ates a set of all drink instances,filters for just those listed in theMenu,filters for those having the property sweet,and then applies a random choice to the remaining drinks to select a single one.Set-based dynamic reference lends great conciseness and power toNLP(Natural Language Processing)for NLP(Natural Language Programming)323 natural language descriptions,but a caveat is that world semantic knowledge is often needed to fully exploit their semantic potential.Still,without such additional knowl-edge,several descriptive facts can be inferred from just the surface semantics of a ran-dom sweet drink from the menu–there are things called drinks,there are things called menus,drinks can be contained by menus,drinks can have the property sweet,drinks can have the property random or be selected ter in this paper,we harness the power of set-based dynamic reference to discover implied repetition and loops.Occam’s Razor would urge that code representation should be as simple as possible, and only complexified when necessary.In this spirit,we suggest that automatic pro-gramming systems should adopt the simplest code interpretation of a natural language description,and then complexify,or dynamically refactor,the code as necessary to ac-commodate further descriptions.For example,consider the following progression of descriptions and the simplest common denominator representation implied by all utter-ances up to that step.a)There is a bar.(atom)b)The bar contains two customers.(unimorphic list of type Customer)c)It also has a waiter.(unimorphic list of type Person)d)It has some stools.(polymorphic list)e)The bar opens and closes.(class/agent)f)The bar is a kind of store.(agent with inheritance)g)Some bars close at6pm,others at7pm.(forks into two subclasses)Applying the semantic patterns of syntactic correspondence,representational equiv-alence,set-based dynamic reference,and scoping description to the interpretation of natural language description,object-oriented code skeletons can be produced.These description skeletons then serve as a code model which procedures can be built out of. 
Mixed-initiative dialog interaction between computer and storyteller can disambiguate difficult utterances,and the machine can also use dialog to help a storyteller describe particular objects or actions more thoroughly.The Metafor natural language programming system[6]implementing the features highlighted in this section was evaluated in a user study,where13non-programmers and intermediate programmers estimated the usefulness of the system as a brainstorming tool.The non-programmers found that Metafor reduced their programming task time by22%,while for intermediate programmers thefigure was11%.This result supports the initial intuition from[5]and[8]that natural language programming can be a useful tool,in particular for non-expert programmers.It remains an open question whether Metafor will represent a stepping stone to real programming,or will lead to a new programming paradigm obviating the need for a formal programming language.Either way,we believe that Metafor can be useful as a tool in itself,even if it is yet to see which way it will lead.4Procedural Natural Language ProgrammingIn procedural programming,a computer program is typically composed of sequences of action statements that indicate the operations to be performed on various data structures. Correspondingly,procedural natural language programming is targeting the generationLieberman324R.Mihalcea,H.Liu,and H.Fig.2.Side by side:the natural language(English)and programming language(Perl)expressions for the same problemof computer programs following the procedural paradigm,starting with a natural lan-guage text.For example,starting with the natural language text on the left side offigure2,we would ideally like to generate a computer program as the one shown on the right side of thefigure3.While this is still a long term goal,in this section we show how we can automatically generate computer program skeletons that can be used as a starting point for creating procedural computer programs.Specifically,we focus on the description of three main components of a system for natural language procedural programming:–The stepfinder,which has the role of identifying in a natural language text the action statements to be converted into programming language statements.–The loopfinder,which identifies the natural language structures that indicate repe-tition.–Finally,the comment identification components,which identifies the descriptive statements that can be turned into program comments.3Although the programming examples shown throughout this section are implemented using Perl,other programming languages could be used equally well.NLP(Natural Language Processing)for NLP(Natural Language Programming)325Starting with a natural language text,the system isfirst analyzing the text with the goal of breaking it down into steps that will represent action statements in the output program.Next,each step is run through the comment identification component,which will mark the statements according to their descriptive role.Finally,for those steps that are not marked as comments,the system is checking if a step consists of a repet-itive statement,in which case a loop statement is produced using the corresponding loop variable.The following sections provide details on each of these components(step finder,loopfinder,comment identification),as well as a walk-through example illustrat-ing the process of converting natural language texts into computer program skeletons.4.1The Step FinderThe role of this component is to read an input natural language text and 
4.1 The Step Finder

The role of this component is to read an input natural language text and break it down into steps that can be turned into programming statements. For instance, starting with the natural language text You should count how many times each number is generated and write these counts out to the screen (see figure 2), two main steps should be identified: (1) [count how many times each number is generated], and (2) [write these counts out to the screen].

First, the text is pre-processed, i.e. tokenized and part-of-speech tagged using Brill's tagger [2]. Some language patterns specific to program descriptions are also identified at this stage, including phrases such as write a program, create an applet, etc., which are not necessarily intended as action statements to be included in a program, but rather as general directives given to the programmer.

Next, steps are identified as statements containing one verb in the active voice. We therefore identify all verbs that could potentially be turned into program functions, such as e.g. read, write, count. We attempt to find the boundaries of these steps: a new step will start either at the beginning of a new sentence, or whenever a new verb in the active voice is found (typically in a subordinate clause).

Finally, the object of each action is identified, consisting of the direct object of the active voice verb previously found, if such a direct object exists. We use a shallow parser to find the noun phrase that plays the role of a direct object, and then identify the head of this noun phrase as the object of the corresponding action.

The output of the step finder process is therefore a series of natural language statements that are likely to correspond to programming statements, each of them with their corresponding action that can be turned into a program function (as represented by the active voice verb), and the corresponding action object that can be turned into a function parameter (as represented by the direct object). As a convention, we use both the verb and the direct object to generate a function name. For example, the verb write with the parameter number will generate the function call writeNumber(number).
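A toy approximation of these heuristics is sketched below. A hand-written verb list and a stop-word filter stand in for Brill's tagger and the shallow parser, so the sketch only illustrates the verb plus direct-object naming convention; none of the names or word lists come from the paper.

```python
import re

# Toy approximation of the step finder: split the text on candidate active verbs
# and build function names from verb + direct-object head. The verb list and the
# crude object heuristic are illustrative assumptions, not the paper's code.
ACTION_VERBS = {"read", "write", "count", "generate", "print", "compute"}
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "out", "these", "each", "all"}

def find_steps(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    steps, current = [], []
    for tok in tokens:
        if tok in ACTION_VERBS and current:        # a new active verb opens a new step
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return [s for s in steps if s[0] in ACTION_VERBS]

def function_call(step):
    verb = step[0]
    # crude stand-in for the head of the direct-object noun phrase
    obj = next((t for t in step[1:] if t not in STOP_WORDS), None)
    return f"{verb}{obj.capitalize()}({obj})" if obj else f"{verb}()"

for step in find_steps("Read the values and write the sum to the screen."):
    print(function_call(step))   # prints readValues(values) then writeSum(sum)
```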
4.2 The Loop Finder

An important property of any program statement is the number of times the statement should be executed. For instance, the requirement to generate 10000 random numbers (see figure 2) implies that the resulting action statement of [generate random numbers] should be repeated 10000 times.

The role of the loop finder component is to identify such natural language structures that indicate repetitive statements. The input to this process consists of steps, fed one at a time, from the series of steps identified by the step finder process, together with their corresponding actions and parameters. The output is an indication of whether the current action should be repeated or not, together with information about the loop variable and/or the number of times the action should be repeated.

First, we seek explicit markers of repetition, such as each X, every X, all X. If such a noun phrase is found, then we look for the head of the phrase, which will be stored as the loop variable corresponding to the step that is currently processed. For example, starting with the statement write all anagrams occurring in the list, we identify all anagrams as a phrase indicating repetition, and anagram as the loop variable.

If an explicit indicator of repetition is not found, then we look for plural nouns as other potential indicators of repetition. Specifically, we seek plural nouns that are the head of their corresponding noun phrase. For instance, the statement read the values contains one plural noun (values) that is the head of its corresponding noun phrase, which is thus selected as an indicator of repetition, and it is also stored as the loop variable for this step. Note however that a statement such as write the number of integers will not be marked as repetitive, since the plural noun integers is not the head of a noun phrase, but a modifier.

In addition to the loop variable, we also seek an indication of how many times the loop should be repeated, if such information is available. This information is usually furnished as a number that modifies the loop variable, and we thus look for words labeled with a cardinal part-of-speech tag. For instance, in the example generate 10000 random numbers, we first identify numbers as an indicator of repetition (noun plural), and then find 10000 as the number of times this loop should be repeated. Both the loop variable and the loop count are stored together with the step information.

Finally, another important role of the loop finder component is the unification process, which seeks to combine several repetitive statements under a common loop structure, if they are linked by the same loop variable. For example, the actions [generate numbers] and [count numbers] will both be identified as repetitive statements with a common loop variable number, and thus they will be grouped together under the same loop structure.
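The repetition heuristics can be approximated in the same spirit. The sketch below looks for the explicit markers each/every/all, falls back to a plural head noun, and picks up a cardinal count when one is present; the singularization rule, the crude modifier test, and the function names are assumptions made here for illustration, not the system's implementation.

```python
# Toy loop finder: detect explicit repetition markers, plural head nouns, and an
# optional cardinal count. All rules below are rough illustrative assumptions.
EXPLICIT_MARKERS = {"each", "every", "all"}
NOT_NOUNS = {"is", "as", "this", "its"}

def singular(noun):
    return noun[:-1] if noun.endswith("s") else noun

def is_modifier(tokens, i):
    # crude test: a plural preceded by "of" modifies another noun ("number of integers")
    return i > 0 and tokens[i - 1] == "of"

def find_loop(tokens):
    loop_var, count = None, None
    for i, tok in enumerate(tokens):
        if tok in EXPLICIT_MARKERS and i + 1 < len(tokens):
            loop_var = singular(tokens[i + 1])            # e.g. "all anagrams" -> anagram
            break
    if loop_var is None:
        for i, tok in enumerate(tokens):
            if (tok.isalpha() and tok.endswith("s")
                    and tok not in NOT_NOUNS and not is_modifier(tokens, i)):
                loop_var = singular(tok)                  # plural head noun, e.g. "values"
                break
    if loop_var is not None:
        cardinals = [t for t in tokens if t.isdigit()]
        count = int(cardinals[0]) if cardinals else None  # e.g. "10000 random numbers"
    return (loop_var, count) if loop_var else None

print(find_loop("generate 10000 random numbers".split()))   # ('number', 10000)
print(find_loop("write the number of integers".split()))    # None
```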
4.3 Comment Identification

Although not playing a role in the execution of a program, comments are an important part of any computer program, as they provide detailed information on the various programming statements. The comment identification step has the role of identifying those statements in the input natural language text that have a descriptive role, i.e. they provide additional specifications on the statements that will be executed by the program.

Starting with a step as identified in the step finding stage, we look for phrases that could indicate a descriptive role of the step. Specifically, we seek the following natural language constructs: (1) Sentences preceded by one of the expressions for example, for instance, as an example, which indicate that the sentence that follows provides an example of the expected behavior of the program. (2) Statements including a modal verb in a conditional form, such as should, would, might, which are also indicators of expected behavior. (3) Statements with a verb in the passive voice, if this is the only verb in the statement. (4) Finally, statements indicating assumptions, consisting of sentences that start with a verb like assume, note, etc. (Note that although modal and passive verbs could also introduce candidate actions, since for now we target program skeletons and not fully-fledged programs that would compile and run, we believe that it is important to separate the main actions from the lower level details. We therefore ignore the "suggested" actions introduced by modal or passive verbs, and explicitly mark them as comments.)

All the steps found to match one of these conditions are marked as comments, and thus no attempt will be made to turn them into programming statements. An example of a step that will be turned into a comment is For instance, 23 is an odd number, which is a statement that has the role of illustrating the expected behavior of the program rather than asking for a specific action, and thus it is marked as a comment.

The output of the comment identification process is therefore a flag associated with each step, indicating whether the step can play the role of a comment. Note that although all steps, as identified by the step finding process, can play the role of informative comments in addition to the programming statements they generate, only those steps that are not explicitly marked as comments by the comment identification process can be turned into programming statements. In fact, the current system implementation will list all the steps in a comment section (see the sample output in figure 2), but it will not attempt to turn any of the steps marked as "comments" into programming statements.
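The four descriptive-role tests translate almost directly into pattern checks. Below is a minimal sketch of such a classifier; the regular expressions and the very rough passive-voice shortcut are assumptions made for illustration, not the heuristics actually used by the system.

```python
import re

# Toy comment identifier implementing the four tests described above.
EXAMPLE_MARKERS = re.compile(r"^\s*(for example|for instance|as an example)\b", re.I)
CONDITIONAL_MODALS = re.compile(r"\b(should|would|might)\b", re.I)
ASSUMPTION_VERBS = re.compile(r"^\s*(assume|note)\b", re.I)
PASSIVE_HINT = re.compile(r"\b(is|are|was|were|be|been)\s+\w+ed\b", re.I)

def has_active_verb(step_text):
    # crude stand-in for the POS-based active-voice test
    return bool(re.search(r"\b(read|write|count|generate|print|compute)\b", step_text, re.I))

def is_comment(step_text):
    if EXAMPLE_MARKERS.search(step_text):                          # (1) example sentences
        return True
    if CONDITIONAL_MODALS.search(step_text):                       # (2) conditional modals
        return True
    if PASSIVE_HINT.search(step_text) and not has_active_verb(step_text):
        return True                                                # (3) passive-only statements
    if ASSUMPTION_VERBS.search(step_text):                         # (4) assumptions
        return True
    return False

print(is_comment("For instance, 23 is an odd number"))        # True
print(is_comment("write these counts out to the screen"))     # False
```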
4.4 A Walk-Through Example

Consider again the example illustrated in figure 2. The generation of a computer program skeleton follows the three main steps highlighted earlier: step identification, comment identification, and loop finding.

First, the step finder identifies the main steps that could potentially be turned into programming statements. Based on the heuristics described in section 4.1, the natural language text is broken down into the following steps: (1) [generate 10000 random numbers between 0 and 99 inclusive], (2) [count how many times each number is generated], (3) [write these counts out to the screen], with the functions/parameters: generateNumber(number), count(), and writeCount(count).

Next, the comment finder does not identify any descriptive statements for this input text, and thus none of the steps found by the step finder are marked as comments. By default, all the steps are listed in the output program in a comment section.

Finally, the loop finder inspects the steps and tries to identify the presence of repetition. Here, we find a loop in the first step, with the loop variable number and loop count 10000, a loop in the second step using the same loop variable number, and finally a loop in the third step with the loop variable count. Another operation performed by the loop finder component is unification, and in this case the first two steps are grouped under the same loop structure, since they have a common loop variable (number). The output generated by the natural language programming system for the example in figure 2 is shown in figure 3.

4.5 Evaluation and Results

One of the potential applications of such a natural language programming system is to assist those who begin learning how to program, by providing them with a skeleton of computer programs as required in programming assignments. Inspired by these applications, we collect a corpus of homework assignments as given in introductory programming classes, and attempt to automatically generate computer program skeletons for these programming assignments.

The corpus is collected using a Web crawler that searches the Web for pages containing the keywords programming and examples, and one of the keyphrases write a program, write an applet, create a program, create an applet. The result of the search process is a set of Web pages likely to include programming assignments. Next, in a post-processing phase, the Web pages are cleaned up of HTML tags, and paragraphs containing the search keyphrases are selected as potential descriptions of programming problems. Finally, the resulting set is manually verified and any remaining noisy entries are removed. The final set consists of 120 examples of programming assignments, with three examples illustrated in Table 1.

For the evaluation, we randomly selected a subset of 25 programming assignments from the set of Web-mined examples, and used them to create a gold standard testbed. For each of the 25 program descriptions, we manually labeled the main steps (which should result in programming statements) and the repetitive structures (which should result in loops). Next, from the automatically generated program skeletons, we identified all those steps and loops that were correct according to the gold standard, and
NATURAL LANGUAGE PROCESSING AND THE AUTOMATIC ACQUISITION OF KNOWLEDGE: A SIMULATIVE APPROACH

Danilo FUM
Laboratorio di Psicologia E.E. - Università di Trieste
via Tigor 22, I - 34124 Trieste (Italy)

ABSTRACT
The paper presents the general design and the first results of a research project whose long term goal is to develop and implement ALICE, an experimental system capable of augmenting its knowledge base by processing natural language texts. ALICE (an acronym for Automatic Learning and Inference Computerized Engine) is an attempt to model the cognitive processes that occur in humans when they learn a series of descriptive texts and reason about what they have learned. In the paper a general overview of the system is given with the description of its specifics, basic methodologies, and general architecture. How parsing is performed in ALICE is illustrated by following the analysis of a sample text.
… restricted arguments and specific phenomena whose explanations too often look suspiciously ad hoc. Unfortunately, those who addressed the full problem of 'meaningful verbal learning' (e.g. Ausubel, 1963) stated their theories so vaguely that it is almost impossible to express them in the form of effective procedures and to implement them in computer programs.

In the last few years the situation has changed and several projects (Frey, Reyle, and Rohrer, 1983; Haas and Hendrix, 1983; Nishida, Kosaka, and Doshita, 1983; Norton, 1983) are now devoted to developing computer systems which could automatically extract information from written texts. Practical applications, besides theoretical interest, motivate this kind of research. In expert system technology, for example, the process of discovering what is known to the experts of the field in which the program must perform requires tedious and costly interactions between the knowledge engineer and those experts. Automatic acquisition of knowledge by text understanding could represent a way to partially reduce the labor and fatigue involved in the transfer of expertise.

The paper presents the general design and the first results of a research project whose long term goal is to develop and implement ALICE, an experimental system capable of augmenting its knowledge base by processing natural language texts and reasoning about them. Particular attention is given to the simulative aspects of the project. ALICE (an acronym for Automatic Learning and Inference Computerized Engine) is an attempt to model the cognitive processes that occur in humans when they learn a series of descriptive texts and reason about what they have learned. Comparisons with what is known about human cognitive behavior are therefore explicitly taken into account in devising algorithms and data structures for the system.

In the next section a general overview of the system is provided with the description of its specifics, basic methodologies, and general architecture. The third section briefly describes the parser used in ALICE, and how parsing is performed is illustrated in section four by following the analysis of a small sample text. Section five concludes the paper by giving a summary of the main ideas and some implementational details.
2. ALICE: A GENERAL OVERVIEW

2.1 Specifics

The main goal of the ALICE project is to examine how it is possible to build a machine which could, in a psychologically plausible way, learn new facts about a given domain by analysing natural language texts. ALICE can operate in two different ways: in learning mode and in consult mode.

In learning mode ALICE is given in input a series of sentences in Italian forming simple introductory scientific passages. The domains chosen for the initial experimentation are elementary chemistry and electronics. The system understands the input texts and integrates the information extracted from them with that previously stored in its knowledge base. For checking purposes the system outputs the sentence-by-sentence internal representation that is added to the knowledge base.

When working in consult mode, ALICE receives in input a question concerning the processed texts and returns the portion of the knowledge base containing the information needed to answer it. It should be noted that the system has no generation capabilities; it does not output natural language sentences but only the internal representation of a small part of its knowledge base. Another limitation of the system is that it can deal with questions only in a piecemeal fashion. ALICE, in other words, lacks the dialogic capabilities needed to build a graceful man-machine interface. User modelling, mixed-initiative dialogue, co-operative behavior etc. are simply outside the scope of the project.

ALICE obviously cannot understand all the sentences that it is possible to express in a given language. Unrestricted language comprehension is currently beyond our capabilities. As work in artificial intelligence and computational linguistics has taught us, it is very difficult to build programs that could successfully cope with linguistic materials. This is due to the fact that language is essentially a knowledge-based process. In understanding natural language it is necessary to rely heavily on world knowledge even to do very elementary operations: disambiguate the meaning of a word, identify an anaphoric referent, capture the syntactic structure of a sentence. Paradoxically, it has been said that one cannot learn anything unless (s)he almost knows it
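To make the two operating modes concrete, the following sketch outlines the kind of interface they imply. The class, the method names, and the toy triple-based internal representation are assumptions made for illustration; the paper does not specify ALICE's actual data structures at this point.

```python
# Minimal sketch of the learning/consult interface implied by the description above.
# The triple-based knowledge base and all names here are illustrative assumptions.

class KnowledgeEngine:
    def __init__(self):
        self.kb = []                         # list of (subject, relation, object) triples

    def learn(self, sentences, extract):
        """Learning mode: integrate the representation of each sentence into the KB."""
        for sentence in sentences:
            triples = extract(sentence)      # stand-in for parsing and interpretation
            self.kb.extend(triples)
            print(sentence, "->", triples)   # sentence-by-sentence check, as in ALICE

    def consult(self, topic):
        """Consult mode: return the portion of the KB relevant to a question topic."""
        return [t for t in self.kb if topic in t]

# toy extractor: "X is a Y." -> (x, isa, Y)
def toy_extract(sentence):
    words = sentence.rstrip(".").split()
    if len(words) == 4 and words[1:3] == ["is", "a"]:
        return [(words[0].lower(), "isa", words[3])]
    return []

engine = KnowledgeEngine()
engine.learn(["Helium is a gas.", "Copper is a conductor."], toy_extract)
print(engine.consult("copper"))              # [('copper', 'isa', 'conductor')]
```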