Design and Implementation of a Lucene-Based Chinese Retrieval System with Fuzzy Passage Matching
Design and Implementation of a Lucene-Based Secondary Full-Text Retrieval System

[1] Zheng Yiyuan. Research on a J2EE-Based Site Search Engine [D]. Shanghai Jiao Tong University, 2005(1): 8-13.
[2] Qiu Zhe, Fu Taotao. Developing Your Own Search Engine: Lucene 2.0 + Heritrix [M]. Beijing: Posts & Telecom Press, 2007(6): 235-246.
The system provides a deeper level of retrieval for PDF documents: a search result can be located on the specific page of a book, and the exact positions of the keyword are marked on that page. This level of retrieval cannot be achieved with the Lucene API alone. This paper therefore defines a secondary index organization, whose record format is "Book_id#keyword#page#comma-separated X,Y coordinates#context in which the keyword occurs". When a keyword occurs several times on a page, the multiple coordinate pairs are separated by "|"; the coordinate unit is the pixel, measuring the horizontal (rightward) and vertical (downward) distances from the top-left corner of the document. The multiple context snippets are likewise separated by "|". The following is an example of one secondary index record stored in a text file:
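As a sketch, a record in this layout could be parsed as follows; the sample record, class name and field names are illustrative, not taken from the paper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Reader for the '#'-separated secondary-index record layout described above.
public class SecondaryIndexRecord {
    final String bookId;
    final String keyword;
    final int page;
    final List<int[]> coords = new ArrayList<>(); // pixel offsets from the page's top-left corner
    final List<String> contexts;

    SecondaryIndexRecord(String line) {
        String[] f = line.split("#");             // five '#'-separated fields
        bookId = f[0];
        keyword = f[1];
        page = Integer.parseInt(f[2]);
        for (String pair : f[3].split("\\|")) {   // multiple hits on one page use '|'
            String[] xy = pair.split(",");
            coords.add(new int[]{Integer.parseInt(xy[0]), Integer.parseInt(xy[1])});
        }
        contexts = Arrays.asList(f[4].split("\\|"));
    }
}
```

Note that this assumes neither the context snippets nor the book id contain a literal '#'; a production format would need escaping.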
[3] Wang Xuesong. Developing Search Engines with Lucene + Nutch [M]. Beijing: Posts & Telecom Press, 2008(8): 125-145.
[4] Yu Dan. A New Understanding of Recall and Precision [J]. Journal of Southwest Minzu University, 2009; 2(210): 283-285.
[5] Li Zirun, Yu Qingsong, Chen Shengdong. Applied Research on Information Retrieval Technology Based on Full-Text Search Engines [J]. Computer and Digital Engineering, 2008, 36(9): 81-85.
1.2 Database design
The database is mainly used to store the secondary index, and its table structure is relatively simple; currently only two tables are designed: a book table and a secondary index table. The book table stores the basic information of the books to be retrieved at the secondary level, while the secondary index table stores the books' secondary index information. The table structures are shown in Tables 1 and 2:
Design and Implementation of a Lucene-Based Search-Keyword Assistance System

Journal of Nantong Textile Vocational Technology College (Comprehensive Edition), Vol. 11, No. 1
Search engines generally require the user to enter keywords, but for unfamiliar domains the user cannot give accurate keywords, and without accurate keywords the needed information cannot be found quickly on the Web. Search engines such as Yahoo, Sohu, Google, Peking University's Tianwang and Baidu meet this need to a certain extent.
Received: 2010-08. Author: Song Yongsheng (b. 1984), male, from Xuzhou, Jiangsu; staff member of the Modern Educational Technology Center, Nantong Textile Vocational Technology College; main research interests: mobile development and search engines.
(1) Index module. A search engine generally collects information with a web crawler and stores it locally. The collected information comes in many formats, and each format requires its own preprocessing; to simplify development, this paper stores the collected information locally as plain text. Before searching, an index must be built. Lucene cannot index physical files directly; it can only recognize and process objects of the Document type [3]. A physical file is therefore first converted into a Document, and the IndexWriter class is then used to build the index. During indexing, word segmentation is performed, stop words and common words are removed, keywords are extracted, and the positions where the keywords occur are recorded. On top of the traditional inverted index, Lucene implements segmented indexing, so a small-file index can be built for newly added files, which speeds up index construction.
(2) Search module. The user enters keywords in the search box; query conditions are built from these keywords and the query is executed.
Research and Implementation of a Lucene-Based Full-Text Retrieval System

[Abstract] Lucene is an open-source full-text retrieval engine toolkit with which a full-text retrieval system can be developed quickly. Using Lucene, we developed a full-text retrieval system; through its special index structure it realizes the full-text indexing mechanism that traditional databases are not good at, and provides retrieval capability over unstructured information.
[Keywords] Lucene; information retrieval; full-text retrieval; index

1. Introduction
The rapid development of computer and network technology has made the Internet the largest, richest and most varied information repository in human history.
How to find the needed information quickly, comprehensively and accurately in this ocean of information has become a focus of attention and a hot research topic. This information can roughly be divided into two classes: structured data and unstructured data (text files, Word documents, PDF documents, HTML documents, and so on). Existing database retrieval mainly targets structured data and is relatively simple to implement. For unstructured data, that is, full-text data, complex data transactions and inefficient high-level interfaces make retrieval slow. As users' demands on information retrieval keep growing, full-text retrieval is increasingly welcomed for its speed and accuracy. Lucene is a full-text retrieval engine toolkit written in Java that can easily be embedded into all kinds of applications to provide application-specific full-text indexing and search. The release and growth of this open-source project provide any application with retrieval capability over unstructured information.
2. Full-text retrieval strategy
A thick book usually has a keyword index at the back (e.g., Beijing: pages 12, 34; Shanghai: pages 3, 77 ...), which helps the reader quickly find the pages with relevant content. A database index speeds up queries on the same principle. But because database indexes are not designed for full-text search, the index is not used at all for a query such as LIKE '%keyword%': the search degenerates into a page-by-page traversal, so LIKE is extremely harmful to performance in services that involve fuzzy queries. If several keywords must be matched, as in LIKE '%keyword1%' AND LIKE '%keyword2%' ..., the efficiency is easy to imagine.
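The contrast between a LIKE scan and an index lookup can be made concrete with a minimal in-memory inverted index: each term maps to a sorted posting list of document ids, and a multi-keyword query intersects those lists instead of scanning every document. The class and method names below are illustrative, not Lucene's:

```java
import java.util.*;

public class MiniInvertedIndex {
    // term -> sorted set of document ids containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Index one document: record each distinct term -> docId.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND query: intersect the posting lists of all query terms.
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> hits = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```

A query touches only the (usually short) posting lists of its terms, which is why an inverted index avoids the page-by-page traversal that LIKE '%keyword%' forces on a database.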
Design and Implementation of a Chinese Word Segmenter Based on Lucene

Peng Huanfeng
[Abstract] Because the Chinese word segmenters bundled with Lucene segment poorly, this paper analyzes existing dictionary mechanisms and designs a segmenter based on a full-hash, whole-word binary-matching algorithm, integrated into Lucene. By hashing whole words, the algorithm reduces the number of string comparisons and improves segmentation efficiency.
The segmenter's dictionary file is easy to maintain and can be customized to the requirements of different applications, thereby improving retrieval efficiency.
[Journal] Microcomputer & Its Applications [Year (Volume), Issue] 2011, 30(18) [Pages] 3 (pp. 62-64) [Keywords] Lucene; hash; whole-word binary matching; maximum matching [Author] Peng Huanfeng [Affiliation] School of Computer Engineering, Nanjing Institute of Technology, Nanjing, Jiangsu 211167 [Language] Chinese [CLC] TP391.1
The development of information technology has produced massive amounts of electronic data. Users' demands on information retrieval keep rising, and search engine technology has developed rapidly and is being applied in more and more fields.
Lucene's Fuzzy Matching Principle

1. Overview of Lucene fuzzy matching
Lucene is an open-source full-text retrieval engine toolkit that provides powerful text search and analysis capabilities. In practice, fuzzy matching is often needed to handle users' typos, misspellings or synonyms. Fuzzy matching is a very important Lucene feature: it helps users find the relevant documents and improves both the accuracy and the completeness of a search.
2. The algorithmic principle of Lucene fuzzy matching
(1) The Levenshtein distance algorithm. Levenshtein distance, also called edit distance, is an algorithm that measures how similar two strings are. In Lucene, fuzzy matching is implemented mainly with the Levenshtein distance algorithm. It determines the similarity of two strings by computing the distance between them: specifically, the minimum number of insertion, deletion and substitution operations needed to turn one string into the other.
(2) How fuzzy queries are implemented. In Lucene, fuzzy queries are issued through the FuzzyQuery class. With FuzzyQuery a maximum edit distance can be specified, allowing documents containing sufficiently similar terms to match. FuzzyQuery performs fuzzy matching based on the Levenshtein distance, finding terms whose edit distance from the query term is less than or equal to the specified value.
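As a sketch of the mechanism just described: the classic dynamic-programming Levenshtein distance, plus a filter that keeps only the terms within a maximum edit distance of the query term. This only mimics what FuzzyQuery does conceptually (recent Lucene versions compile the query into a Levenshtein automaton instead of scoring every term); the names here are illustrative:

```java
import java.util.*;

public class FuzzyMatch {
    // Classic DP Levenshtein distance: d[i][j] is the minimum number of
    // insertions, deletions and substitutions turning a[0..i) into b[0..j).
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Keep only the index terms within maxEdits of the query term,
    // conceptually what a FuzzyQuery does when expanding its term.
    static List<String> fuzzyTerms(String query, Collection<String> terms, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String t : terms)
            if (levenshtein(query, t) <= maxEdits) out.add(t);
        return out;
    }
}
```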
3. Application scenarios of fuzzy matching
(1) Handling user input errors. When a user enters typos or misspellings in the search box, fuzzy matching helps the system find the relevant documents and offer corrections, improving search accuracy and user experience.
(2) Synonym matching. In natural language processing, one concept may be expressed in many different ways. Fuzzy matching can match words with similar meanings, making the search more comprehensive.
(3) Handling word-form variation. Inflection is common in natural language: the same word may take different forms. Fuzzy matching can map different forms of a word to one another, making the search more comprehensive and accurate.
4. Optimization strategies for fuzzy matching
(1) Set an appropriate edit-distance threshold. The threshold must be chosen for the concrete application scenario: if the allowed edit distance is too small, relevant variants will be missed; if it is too large, too many irrelevant documents will be matched.
Design and Implementation of a Lucene-Based Search Engine System

2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016)
Research and Implementation of a Search Engine Based on Lucene
Wan Pu, Wang Lisha
Physics Institute, Zhaotong University, Zhaotong, Yunnan, PR China

Keywords: search engine; Lucene; web spider; Chinese word segmentation

Abstract. Starting from in-depth research on the basic principles and architecture of search engines, and through secondary development of the Lucene package, this paper designs a complete search engine framework and realizes its core modules. The system makes up for deficiencies of the existing Lucene framework and improves the accuracy of the search engine, and so has practical and commercial value.

Introduction
With the continuous expansion of network coverage and the development of network technology, network information resources have spread and increased rapidly. Large amounts of information come from all walks of life, covering different disciplines, areas, fields and languages, and exist as text, images, audio, video, databases and other forms. Internet pages number in the hundreds of millions, so finding the needed information among them has become an important research topic in Internet technology. To help users find the information they need, the search engine came into being. A search engine is a tool that helps Internet users query information: it collects and discovers information on the Internet according to a certain strategy, then understands, extracts, organizes and processes it, thereby providing users with retrieval services and information navigation. Search engines provide great help in obtaining network information resources quickly, accurately and efficiently.
It is a web-based tool developed to meet people's need to search for network information; it is the navigation aid for Internet information queries and the bridge between users and network information.

Principles of search engines
The basic principle of a search engine is to start from existing resources, determine from their summaries and links the new information points to be searched, traverse those points with the crawling programs the engine provides, and finally index, classify and organize the documents found at those points into the index database [1]. Logically, this recursive traversal can bring all reachable information into the index database. When a user enters the keywords of the content to be found, the search program reads the information already traversed and stored in the index database, matches it against the keywords, and returns the corresponding or related information to the user in some organized form.

A. Search engine workflow
A search engine that meets users' needs generally consists of information collection, information preprocessing and retrieval service (Fig. 1: Search engine workflow).
Information collection: the page collector takes input from the URL database, parses the Web server address in each URL, establishes a connection, sends requests and receives data, stores the obtained page data in the raw page library, extracts the link information from it into the page structure library, and puts the URLs still to be crawled into the URL database, so that the whole process iterates until the URL database is empty [2]. Information preprocessing: after collection, the preserved page information is saved in a specific format.
So the first step in this part is to index the original pages, which also provides the page snapshot function for the search engine. Next, the pages in the index library are segmented, transforming each page into a set of words; the mapping from pages to index words is then inverted to form the inverted file, while the distinct index words appearing in the pages are gathered into the vocabulary; in addition, the structural information among pages is analyzed to judge the importance of information, and page meta-information is built. Retrieval service: the data handed to the service stage includes the index page library, the inverted file and the page meta-information. The query agent accepts the user's query phrase; after segmentation, it retrieves from the index vocabulary and the inverted file the documents containing the query phrase, together with history logs and other information, computes the importance of the result set, and finally sorts the results and returns them to the user [3]. With these components a search engine system can be built: when the user inputs keywords or phrases related to the information and resources sought, the system traverses pages from Internet link addresses according to its designed search procedure, saves the results to the index database, then processes and integrates the indexed data, optimizes the results, sorts them by a priority algorithm, and stores them in the index database. When a user types a keyword, the search engine looks up the matching pages or data in the index database and presents them to the user through the user interface.

B. Key technologies of a search engine system
A typical search engine consists of three modules: the web spider, the indexer and the searcher.
The web spider first takes a URL from the to-be-visited queue, fetches the corresponding page from the Web and analyzes it, extracts all URL links and adds them to the to-be-visited queue, and moves the visited URL to the visited queue [4]; this procedure repeats continuously, and all collected pages are saved to storage for further processing. Initially the to-be-visited queue contains only seed URLs, the starting points from which the spider traverses the network. Relatively large and popular website addresses are usually chosen as seeds, because such pages often contain many links to other pages. Web spiders use the HTTP protocol to read pages and automatically visit network resources along the hyperlinks of HTML documents. The network can be treated as a directed graph, with each page a node and each hyperlink a directed edge (Fig. 2: Web spider working principle).
The indexer's function is to analyze the information the spider collected, extract the index terms, and generate the index table used to represent the documents. The index table generally takes some form of inverted list, that is, it maps index terms to the corresponding documents. The index table may also record the positions where index terms occur in a document, so the indexer can compute adjacency or proximity relations among terms. The indexer can use a centralized or distributed indexing algorithm [5]. The effectiveness of a search engine depends largely on the quality of the index. When a user issues a query, the search engine does not retrieve data from the Web in real time; the searched data is web data collected in advance. To access the collected pages quickly, some indexing mechanism must be used. Page data can be represented by a series of keywords which, from the retrieval point of view, describe the content of the page; finding the keywords finds the page.
Conversely, if the page index is built on keywords, the relevant pages can be retrieved quickly. Concretely, the keywords are stored in an index file; each keyword has a pointer list in which every pointer points to a page related to that keyword, and all the pointer lists together constitute the posting file.
The searcher's function is to quickly find documents in the index database according to the user's query, evaluate the relevance of documents to the query, sort the results for output, and implement some user relevance-feedback mechanism. The commonly used information retrieval models are the set-theoretic, algebraic, probabilistic and hybrid models. The searcher is the module that interacts directly with the user; among the several interface implementations, the Web mode is the most common. Through these interfaces the searcher receives a user query, performs word processing on it, and obtains the query keywords; the Web data matching the keywords is then retrieved, sorted, and returned to the user [6].

Search engine based on Lucene
Lucene is a Java-based full-text retrieval toolkit. It is not a complete search application, but provides indexing and search capability for applications. Lucene is an open-source project of the Apache Jakarta family and currently the most popular open Java full-text retrieval package; many applications' search functions are already based on it [7]. Lucene can index data of text type, so as long as you convert your data into text, Lucene can index and search the documents. For example, HTML or PDF documents to be indexed must first be converted into text format and then given to Lucene for indexing; the created index file is saved on disk or in memory, and is then queried according to the conditions the user enters.
Not restricting the format of the documents to be indexed makes Lucene applicable to almost all search applications.

C. Technical analysis of Lucene
Lucene's architecture has strong object-oriented features. It first defines a platform-independent index file format; it designs the core components of the system as abstract classes, with the platform-specific parts implemented as concrete subclasses; platform-dependent parts such as file storage are also wrapped in classes. After this object-oriented treatment, the result is a search engine system with low coupling and high efficiency that is easy to extend through secondary development. The Lucene system structure is shown in Fig. 3.
In the Lucene file format, data types are defined on the basis of bytes, which guarantees platform independence and is the main reason the Lucene index file format is platform-independent. A Lucene index consists of one or more segments, each composed of a number of documents. A Document object can be treated as a virtual document, for example a web page, an e-mail message or a text file, so large amounts of data can be retrieved. A Document contains one or more fields with different names; a field represents the document or some metadata related to it, and each field corresponds to a piece of data that may be queried or retrieved during search. A field consists of a name and a value. A Term is the basic unit of search; like a field, it is a pair of string elements corresponding to the field name and value. The conceptual structure of Lucene index files is shown in Fig. 4.
Segments let new documents be added to the index quickly: documents are added to a newly created segment, which is only periodically merged with the other existing segments. This increases efficiency because it minimizes physical modification of the stored index files.
One of Lucene's advantages is incremental indexing: after a new document is added to the index, its content can be searched immediately. Support for incremental indexing makes Lucene suitable for environments that process large amounts of information, where rebuilding the whole index would be inefficient. Mapping the concept onto storage, an index is a directory (folder) whose files are its contents; the files are grouped by the segment they belong to, and files in the same group share the same file name with different extensions. In addition there are three files without extensions, segments, deletable and lock, which respectively record all the segments, record the deleted files, and control synchronization of reading and writing. Each segment contains a set of files with different extensions but the same name, namely the name stored in the segments file.
Lucene's system structure is object-oriented: developers do not need to know the internal structure and implementation of Lucene, but simply call the application interfaces Lucene provides, and can extend the functionality they need according to the actual situation. In indexing, Lucene differs from most search engines: it creates a new index file when building an index and, under different update strategies, merges the new index file with the existing index files, which greatly improves indexing efficiency. Lucene also supports batch indexing and index optimization; since incremental indexes come in small batches, it has obvious advantages for indexing large amounts of data.
Lucene accepts index input through a common data structure, so it can adapt flexibly to a variety of data sources, such as databases, office documents, PDF documents and HTML documents; when indexing, only an appropriate parser is needed to convert the data source into the corresponding data structure. Although Lucene has powerful search and indexing capabilities, it is not a complete search engine: it cannot collect Internet pages, and its ranking still needs improvement [8]. The ranking of search results is very important for a search engine; users usually only look at the first page of results, so placing the pages most valuable to users at the top is an important topic of search engine research.

D. A search engine based on Lucene
A search engine mainly consists of the collecting, indexing and retrieval systems, while the user interface is the means of displaying search results to users. The web spider extracts pages from the network according to a certain strategy and recursively downloads the crawled pages. For the pages the spider has collected, the indexing system uses the analysis system for word segmentation to obtain the corresponding index entries and, for each type of document, uses the corresponding parser to parse the text, then builds the index file and stores it in the index database.
The user inputs search keywords through the user interface; the retrieval system analyzes them and submits them to the word segmentation system for processing, matches the resulting keywords against the indexed words, sorts the pages containing the same or similar keywords by a specific algorithm, and finally returns the search results to the user interface.
The indexing mechanism in a Lucene system must have analysis capability. Lucene itself can analyze txt and html files, but because the Internet carries many file formats, the corresponding parsing packages must be added to analyze the various document types. A Lucene analyzer consists of two parts: the word segmenter, called the Tokenizer, and the filter, known as the TokenFilter. An analyzer often consists of one tokenizer and several filters, where the filters mainly process the segmented tokens. What is written into the index, and what users retrieve, are terms; a term is the text left after the analyzer's segmentation and related processing. The tokenizer returns a raw, segmented term through the next() method it provides, and a filter returns a filtered term through the same method, having no segmentation function itself. Since a filter's constructor receives a TokenStream instance, two situations arise: first, filters can be nested together to form a nested pipeline of filters; second, a filter can be combined with a tokenizer to filter the tokens the tokenizer produces. This nesting forms the core structure of the Lucene analyzer.
The retrieval function is the last link in building the search engine, and the important factors in measuring it are response speed and result ranking.
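Before turning to retrieval: the nested Tokenizer/TokenFilter pipeline described above can be sketched as follows. The interface and class names here are simplified analogues for illustration, not the real Lucene API:

```java
import java.util.*;

// A token stream yields one token per call; null means exhausted.
interface TokenStream { String next(); }

// Tokenizer: produces raw tokens from text.
class WhitespaceTokenizer implements TokenStream {
    private final Iterator<String> it;
    WhitespaceTokenizer(String text) { it = Arrays.asList(text.split("\\s+")).iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// Filter: wraps another stream and transforms each token.
class LowerCaseFilter implements TokenStream {
    private final TokenStream in;
    LowerCaseFilter(TokenStream in) { this.in = in; }
    public String next() { String t = in.next(); return t == null ? null : t.toLowerCase(); }
}

// Filter: wraps another stream and drops stop words.
class StopFilter implements TokenStream {
    private final TokenStream in;
    private final Set<String> stop;
    StopFilter(TokenStream in, Set<String> stop) { this.in = in; this.stop = stop; }
    public String next() {
        for (String t = in.next(); t != null; t = in.next())
            if (!stop.contains(t)) return t;
        return null;
    }
}

public class MiniAnalyzer {
    // Filters nested around a tokenizer, forming the pipeline structure.
    static List<String> analyze(String text) {
        TokenStream ts = new StopFilter(
                new LowerCaseFilter(new WhitespaceTokenizer(text)),
                new HashSet<>(Arrays.asList("is", "a", "the")));
        List<String> out = new ArrayList<>();
        for (String t = ts.next(); t != null; t = ts.next()) out.add(t);
        return out;
    }
}
```

Because each filter's constructor takes the stream it wraps, filters compose freely, which is exactly the nesting property the text attributes to Lucene's analyzer design.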
When a user enters search keywords, the word segmentation system analyzes and segments them; similarity is then computed against the morpheme vectors in the index database, and the search results that match successfully are returned to the user. The retrieval part of the system consists of the Lucene search-statement analysis system and the search-result clustering analysis system. The former interprets the keywords the user enters, segmenting the query with the reverse maximum matching algorithm; if the segmented results need stop-word filtering, ambiguous fields are resolved using word segmentation probabilities to obtain the actual semantic words and build the search terms. The Lucene search system then runs the query and submits the results to the clustering analysis system for processing, which finds highly correlated pages and generates result pages automatically. Finally, the analysis system detects similar documents among the Lucene search results.

Conclusion
With the rapid development of the Internet, the amount of information grows exponentially, but the ultimate goal is to let users access information conveniently; this mission falls on the search engine. How to return the needed information and high-quality web content to users poses ever higher requirements and challenges for search engines. Because the Lucene scoring algorithm does not well reflect a page's position within its website, this paper designed an improved solution in the index and retrieval modules that unites a document's basic score, its location information within the website, and its characteristics, to improve the accuracy of result ranking and thereby the accuracy of the search.

References
[1] Monz C.
Proceedings of the 25th European Conference on Information Retrieval Research [C]. Berlin/Heidelberg: Springer, 2003: 571-579.
[2] Nicholas Lester, Justin Zobel, Hugh E. Williams. Efficient Online Index Maintenance for Contiguous Inverted Lists [J]. Inf. Process. Manage., 2006, 42(4): 916-933.
[3] George Samaras, Odysseas Papapetrou. Distributed Location Aware Web Crawling [C]. In Proceedings of the 13th International World Wide Web Conference. New York, USA: ACM Press, 2004: 468-469.
[4] Hai Zhao, Changning Huang. Effective tag set selection in Chinese word segmentation via conditional random field modeling [C]. In: Proceedings of PACLIC 20, Wuhan, November 2006: 84-94.
[5] Arvind Arasu, Jasmine Novak, Andrew Tomkins, John Tomlin. PageRank Computation and the Structure of the Web: Experiments and Algorithms [C]. In Proceedings of the 11th International World Wide Web Conference, 2002.
[6] Giuseppe Antonio Di Lucca, Anna Rita Fasolino, Porfirio Tramontana. Reverse engineering Web applications: the WARE approach [J]. Journal of Software Maintenance and Evolution: Research and Practice, 2004, 11(3): 15.
[7] Giuseppe Pirro, Domenico Talia. An Approach to Ontology Mapping Based on the Lucene Search Engine Library [C]. Proceedings of the 18th International Conference on Database and Expert Systems Applications, 2007: 156-158.
[8] Laurence Hirsch, Robin Hirsch, Masoud Saeedi. Evolving Lucene Search Queries for Text Classification [C]. Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, 2007: 166.
Research Proposal for the Design and Implementation of a Lucene-Based Electronic Archive Retrieval System

1. Background
In modern society the importance of electronic archives is increasingly recognized, and building electronic archive management systems has become necessary work for institutions, enterprises and other organizations. However, as the number of electronic archives keeps growing, managing them comprehensively and effectively under the traditional manual approach has become very difficult. Achieving fast, accurate retrieval over large numbers of electronic archives is therefore an urgent problem. To solve it, this work designs and implements an electronic archive retrieval system based on Lucene. Lucene is an open-source full-text retrieval engine with efficient, stable retrieval performance that can meet the retrieval needs of large amounts of text data. With Lucene we can achieve high-speed search and accurate matching over archive data and improve the efficiency of archive management work.
2. Objectives and significance
The main goal of this work is to design and implement a Lucene-based electronic archive retrieval system with automatic indexing, retrieval and ranking of text files. The system should search large amounts of text data efficiently, improving retrieval speed and accuracy and helping users quickly find the archives they need. The significance of this work is threefold:
1. Improve the efficiency and accuracy of electronic archive management and solve the problems of traditional manual management.
2. Provide users with an efficient and accurate archive retrieval service through the Lucene full-text retrieval engine.
3. Provide reference for the subsequent design and implementation of electronic archive management systems.
3. Research content and methods
This work designs and implements a Lucene-based electronic archive retrieval system, covering the following:
1. Requirements analysis of the archive management system: understand users' actual needs and define the functional requirements and performance targets of the retrieval system.
2. Study and application of the Lucene full-text retrieval engine: introduce the principles and applications of Lucene and master its construction, indexing and retrieval techniques.
3. Design and implementation of the retrieval system: implement the system in Java on top of Lucene, including indexing, retrieval and ranking of the archive data.
4. Testing and analysis: test the completed system, evaluate retrieval efficiency and accuracy, and seek further optimization and improvement.
Design and Implementation of a Lucene-Based Full-Text Search Engine

Fig. 1 Structural organization of the Lucene system

2 Analysis of Lucene's system structure
2.2 The index package org.apache.lucene.index is the core of the whole system. It mainly provides read/write interfaces to the index repository: through this package one can create a repository, add and delete records, and read records. The essence of full-text retrieval is to build an index for every segmented word; a query then only traverses the index rather than the whole text, which greatly improves retrieval efficiency, so the quality of index creation directly determines the quality of the whole system. Lucene's index tree is of very high quality and efficiency; the main classes in this package are In...
Fig. 1 shows the structural organization of the Lucene system.
2.1 Analyzer. The analyzer is mainly used for word segmentation: after a document is input and passes through the Analyzer, only the useful parts remain in the output, and the rest is discarded. The analyzer exposes an abstract interface, so language analysis can be customized. Lucene provides two fairly general analyzers by default, SimpleAnalyzer and StandardAnalyzer; neither supports Chinese by default, so to add segmentation rules for the Chinese language these two analyzers must be modified.
(Managing editor: Chen Hebang) Journal of Zhejiang Sci-Tech University, Vol. 26, No. 1, Jan. 2009. Article ID: 1673-3851(2009)01-0109-05. Received: 2008-05-29. Author: Huang Jue (b. 1982), female, from Hangzhou, Zhejiang, assistant researcher; research interests: search engines, digital libraries, software engineering.
Design and Implementation of a Lucene-Based Chinese Retrieval System with Fuzzy Passage Matching
Huang Jue, Huang Zhiyuan
(College of Science & Technology and Art, Zhejiang Sci-Tech University, Hangzhou 311121)
Abstract: To improve the precision and effectiveness of Chinese information retrieval in libraries, a Lucene-based Chinese retrieval system with fuzzy passage matching is designed. It adopts the word segmentation techniques of natural language processing, so that queries can be submitted directly as natural language; and, for the practical problem of passage matching, a new model for judging result validity is designed, which makes the similarity of retrieval results more scientific and accurate. Statistics over repeated experiments show that the validity of search results can be improved by 12%.
Keywords: Lucene; passage; Chinese retrieval; validity judgment
CLC: TP393   Document code: A

0 Introduction
The application of information retrieval technology in libraries is pivotal. Yet library users searching for material often meet this situation: they remember a passage from an article or book, but cannot recall the title, author, publisher or other identifying information. Relying on their memory of the passage, they pick some keywords to query, but cannot find the target quickly and accurately. A few digital resources have built-in search engines with full-text retrieval that let the user enter a passage as the query, but recall and precision are unsatisfactory: either the input must match the document content exactly, or the results do not correspond well to what the user is interested in. First, retrieval based on keywords and logical expressions cannot fully reflect the user's need: the keywords a user enters often bear no relation to one another, so the query itself cannot clearly express the user's real meaning [1]. Second, simple keyword matching tends to output a large number of documents of which very few are truly relevant, so users waste much time and energy on irrelevant results. It is therefore necessary to build a Chinese retrieval system based on fuzzy passage matching that gives users more detailed and effective help.
Few development platforms for retrieval systems are available at present; this paper adopts a practical retrieval engine architecture, Lucene, which is compact, powerful and easy to embed in applications. On top of Lucene's sound architecture, this paper combines a forward-maximum-matching Chinese word segmentation algorithm and, by improving Lucene's scoring mechanism, builds a new model for secondary judgment of document validity; a multidimensional nonlinear function computes the similarity of search results, and the results are sorted by validity. Compared with existing library Chinese retrieval systems, the improvements are: entering any passage of the full text (it need only be roughly the same) finds the related books or articles; and retrieval precision and result-set validity are higher than in ordinary retrieval systems.
1 Design of the Lucene-based fuzzy passage matching Chinese retrieval system
1.1 The open-source search engine Lucene
Lucene is a subproject of the Apache Software Foundation's Jakarta project, an open-source full-text retrieval engine toolkit. It is not a complete full-text retrieval engine but an architecture for one [2], providing a complete query engine and index engine and part of a text analysis engine.
Fig. 1 Lucene-based Chinese retrieval engine
1.2 Design of the Lucene-based fuzzy passage matching Chinese retrieval engine
As shown in Fig. 1, the design first requires building lexical analysis logic for the target language by implementing the interfaces Lucene defines in org.apache.lucene.analysis, to give Lucene the Chinese language processing capability this system needs. By default Lucene implements only simple lexical analysis for English and German (splitting on spaces and removing common grammatical words, such as is, am, are in English). The interfaces to implement are defined in Analyzer.java and Tokenizer.java in org.apache.lucene.analysis. Next, text analysis logic matching the formats of the files to be indexed must be provided; the test corpus used by this system includes txt text and PDF documents, whose contents usually must be added to the index field by field. This requires deriving one's own document class from the Document class defined in org.apache.lucene.document, which can then be handed to the org.apache.lucene.index module to write the index files.
In addition, Lucene's scoring and sorting logic must be modified. By default, Lucene handles scoring and ranking with its internal relevance method, a simple linear model; it can be changed as needed, and this part is implemented in org.apache.lucene.search. After these steps, the target system's retrieval engine is essentially complete.
2 Improvements to the system's algorithms
The Lucene-based fuzzy passage matching Chinese retrieval system adopts word segmentation techniques from natural language processing, so queries can be submitted directly as natural language without the user extracting keywords from the passage; and for the practical problem of fuzzy passage matching, a new result-validity judgment model is designed, which makes result similarity more scientific and accurate.
2.1 Forward-maximum-matching Chinese word segmentation
Lucene by default implements only simple English and German lexical analysis. English consists of words separated by spaces, each with its own meaning; Chinese words are not separated by spaces, so Lucene's analyzers cannot segment Chinese sentences correctly, and a Chinese segmentation module must be added. This paper uses the forward maximum matching algorithm, with the 1997 People's Daily wordlist as the dictionary.
Forward maximum matching scans a string S from front to back and, at each position, finds the longest match in the dictionary [3]. Suppose S = C1C2C3C4... is to be segmented by forward maximum matching; the algorithm is:
a) take one character C1, look up C1 in the dictionary, and record whether it forms a word;
b) take another character C2 and check whether the dictionary contains a word prefixed with C1C2;
c) if not, C1 is a single character and this round of segmentation ends;
d) if so, check whether a multi-character word beginning with C1C2 exists;
e) if not, this round of segmentation ends;
f) otherwise take another character Ci (i >= 3) and check whether the dictionary contains a word prefixed with C1C2...Ci;
g) if not, return the most recent prefix C1C2...Ci-1 that forms a word, and go to step i);
h) otherwise set i = i + 1 and go to step f);
i) start the next round of segmentation from character Ci.
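Steps a) through i) amount to taking, at each position, the longest dictionary word that starts there. A compact sketch with a toy dictionary (the paper uses the 1997 People's Daily wordlist; the dictionary and sentence here are illustrative):

```java
import java.util.*;

public class ForwardMaxMatch {
    // Forward maximum matching: at each position take the longest
    // dictionary word starting there; fall back to a single character.
    static List<String> segment(String s, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = Math.min(i + maxLen, s.length());
            String best = s.substring(i, i + 1);  // single character by default
            for (int j = end; j > i + 1; j--) {   // try the longest candidate first
                String cand = s.substring(i, j);
                if (dict.contains(cand)) { best = cand; break; }
            }
            out.add(best);
            i += best.length();
        }
        return out;
    }
}
```

This longest-first scan yields the same segmentation as the incremental prefix-extension procedure above, provided maxLen is at least the length of the longest dictionary word.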
2.2 A model for secondary judgment of search result validity
Lucene's text-similarity computation for document search is a simple linear model, suitable for ordinary document search but not entirely appropriate for the document-database search considered here [4-5]. The search keywords in this paper have some obvious characteristics. First, the more frequently a keyword occurs in a document the better the match, but similarity should grow quickly while the keyword occurs only a few times, and once the frequency is fairly high, further occurrences should make little difference. Second, when several keywords are searched together, the keywords are mutually related; one keyword occurring with very high frequency should not by itself raise the document's overall match score, which would make the similarity insensitive to the other keywords, or even let a single keyword dominate the search. Finally, when several keywords are searched, the closer their frequencies are to one another, the better the document should rank. Therefore, a new model is built on top of Lucene's first-pass search results for judging document similarity, with a multidimensional nonlinear function computing the similarity of every first-pass result, so that all results can be sorted by validity.
Let ωj be the frequency of keyword j (from one Lucene search result) in the document, and introduce two empirical constants a and b. The validity of the searched text is judged by:

α = Σ(j=1..n) ωj^(1/a) − (b/n) Σ(j=1..n) |ωj − ω̄|    (1)

In Eq. (1), the constant a adjusts how quickly the validity gain slows as keyword frequency increases: the larger a is, the faster the gain slows. The constant b adjusts how much the relation among the keywords' frequencies affects validity. The two empirical constants can be tuned to the specific situation.
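As a sketch, one reading of Eq. (1) can be implemented directly. Since the printed formula is partly garbled, the exact form below (a concave ω^(1/a) reward per keyword, minus a b-weighted mean-absolute-deviation penalty) is an assumption consistent with the description of a and b, not a verified reproduction of the paper's scoring:

```java
public class Effectiveness {
    // freq: per-keyword occurrence counts in one candidate document.
    // a: larger a flattens the reward curve faster (diminishing returns).
    // b: weight of the penalty for imbalance among keyword frequencies.
    static double score(double[] freq, double a, double b) {
        int n = freq.length;
        double mean = 0;
        for (double w : freq) mean += w;
        mean /= n;
        double reward = 0, spread = 0;
        for (double w : freq) {
            reward += Math.pow(w, 1.0 / a); // concave frequency reward
            spread += Math.abs(w - mean);   // deviation from the mean frequency
        }
        return reward - b * spread / n;
    }
}
```

Under this reading, a document whose keywords occur with balanced frequencies (e.g. 4 and 4) outranks one with the same total but skewed frequencies (e.g. 5 and 3), matching the third characteristic listed above.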
Below, the function graphs of keyword frequency versus validity are given for the one- and two-dimensional cases of the validity formula: with a = 2 and b = 0.3, the one-dimensional validity function is shown in Fig. 2, and the two-dimensional validity function in Fig. 3. The query results of the secondary validity judgment model are shown in Fig. 4. Each document result consists of several parts: the document path, the Lucene document score (for comparison), the number of query terms hit in the document, the terms hit, the number of occurrences of each hit term, and the document score given by the secondary validity judgment model [7].
Fig. 4 Query results of the secondary validity judgment model

3 Application test and results
Documents from five different subject areas were selected for testing; each test used ten documents.
Taking one test as an example: ten computer-related documents, eight in txt format and two in pdf format, covering Chinese word segmentation, hardware information, software information and so on, in the form of text, tables and figures. The test results follow.
3.1 Chinese segmentation test
The corpus was indexed; a portion of the segmentation result is shown in Fig. 5.
Fig. 5 Segmentation results
As Fig. 5 shows, the fuzzy passage matching Chinese retrieval system extracts pdf text, and the Chinese segmenter splits the Chinese content of table objects in the document into semantically meaningful Chinese words, which are stored in the index. English words, numbers, telephone numbers and e-mail addresses are still indexed as complete tokens.
3.2 Test of the secondary validity judgment model
With the query "电脑 手机 分词" ("computer / mobile phone / word segmentation"), segmented into the terms 电脑, 手机 and 分词, the index was searched to obtain the result set matching these three terms, together with the document scores; the comparison with Lucene is shown in Table 1.
Table 1 Comparison of validity judgment

Document          Hit terms (occurrences)  Secondary validity model score  Lucene score
g.txt             电脑(6) 手机(3)           -15.818464218019528             0.7524695
h.txt             电脑(3) 手机(3)           -16.535903153233832             0.74803925
i.txt             电脑(4) 手机(4)           -16.000004768371582             0.7198012
c.txt             分词(24)                  -91.90102833456305              0.15122233
adobe-dealer.pdf  电脑(4) 手机(1)           -17.000004768371582             0.04723695

As Table 1 shows, the document scores computed by the secondary validity judgment model differ in some obvious ways from those computed by Lucene's scoring mechanism: in h.txt, 电脑 and 手机 each occur 3 times; in i.txt, each occurs 4 times.
Lucene's scoring mechanism ranks h.txt above i.txt; the secondary validity judgment model reaches the opposite conclusion.