Search Efficiency in Indexing Structures for Similarity Searching


Efficient index structure for accessing the hierar


Patent title: Efficient index structure for accessing the hierarchical data in a relational database system
Inventors: ジャラリ,ニーマ,セドラー,エリック,アガーウォル,ニプン,マーシー,ラビ
Application number: JP2003533164
Filing date: 20020926
Publication number: JP4351530B2
Publication date: 20091028
Patent content provided by Intellectual Property Publishing House.

Abstract: Described is a hierarchical index that captures the hierarchical relationships of a hierarchy emulated by a relational database system. The hierarchical index is implemented using a database table whose rows serve as entries of the index. Another table has rows that are associated with nodes in the hierarchy. Each entry in the hierarchical index maps to a row that corresponds to a node in the hierarchy. A node in the hierarchy may be a parent node with one or more child nodes. In this case, the corresponding entry in the hierarchical index contains identifiers of other entries in the index, where those entries correspond to rows associated with the child nodes of the parent node.

Applicant: オラクル・インターナショナル・コーポレイション
Address: アメリカ合衆国、94065 カリフォルニア州、レッドウッド・ショアーズ、オラクル・パークウェイ、500
Nationality: US
Agents: 深見 久郎, 森田 俊雄, 仲村 義平, 堀井 豊, 野田 久登, 酒井 將行
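The structure described in the abstract above can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions, not the patented implementation: the two tables are modeled as Python dictionaries, and the names `node_rows`, `index_entries`, `resolve`, and the idea of keying child identifiers by child name are all hypothetical choices made for the example.

```python
# Hypothetical stand-in for the table whose rows are associated with hierarchy nodes.
node_rows = {
    1: {"name": "/"},
    2: {"name": "docs"},
    3: {"name": "a.txt"},
    4: {"name": "img"},
}

# Hypothetical stand-in for the index table: each entry maps to one node row
# and holds the identifiers of the index entries of its child nodes.
index_entries = {
    10: {"row_id": 1, "children": {"docs": 11, "img": 12}},
    11: {"row_id": 2, "children": {"a.txt": 13}},
    12: {"row_id": 4, "children": {}},
    13: {"row_id": 3, "children": {}},
}


def resolve(path, root_entry_id=10):
    """Follow child identifiers entry by entry; return the node row id or None."""
    entry = index_entries[root_entry_id]
    for part in (p for p in path.split("/") if p):
        child_id = entry["children"].get(part)
        if child_id is None:
            return None  # no such child under the current parent entry
        entry = index_entries[child_id]
    return entry["row_id"]
```

With this layout, resolving a path touches one index entry per level of the hierarchy, which is the access pattern the abstract suggests the index is meant to support.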

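Definition of the Term "Search Essay" in English

Search is the process of finding a particular internet information resource on the network and presenting it to the user. It is carried out mainly through search engines.

An essay is a written form of expression that expounds and argues a given topic.

In English, the term for the definition of a search essay is "Essay on the Definition of Search".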

Search
Search is the process of retrieving a particular internet information resource and presenting it to users. This process is primarily carried out by search engines, which crawl the web, index the content, and match user queries with relevant information.

Essay
An essay is a written expression that aims to elaborate and discuss a specific topic or theme. It is a structured piece of writing that presents arguments, analyses, and conclusions.

Essay on the Definition of Search
A search essay is a composition that defines and explores the concept of search, its significance, and its impact on various aspects of life. Such an essay covers the technical aspects, ethical considerations, and social implications of search, while also highlighting its benefits and challenges.

Introduction
The introduction of the essay provides an overview of the significance of search in the contemporary digital age. It emphasizes the importance of search engines and their role in providing accurate and relevant information to users.

Technical Aspects
This section explores the technical aspects of search. It discusses the web crawling, indexing, and ranking algorithms that search engines use to retrieve information efficiently. It also highlights the role of artificial intelligence in enhancing search capabilities.

Ethical Considerations
The essay then turns to ethical considerations related to search. It examines issues such as privacy, data security, and bias in search engine results. It discusses the responsibility of search engines and the need for transparency and fairness.

Social Implications
The impact of search on society is another crucial aspect explored in the essay. It analyzes the influence of search on knowledge acquisition, information consumption, and decision-making processes. It also examines the role of search in shaping public opinion and its impact on democracy.

Benefits and Challenges
The essay assesses the benefits and challenges associated with search. It highlights the advantages of quick access to information, global connectivity, and improved efficiency. It also acknowledges challenges such as information overload, fake news, and algorithmic bias.

Conclusion
The conclusion summarizes the key points discussed throughout the essay. It emphasizes the integral role of search engines in modern society while recognizing the need for continuous improvement and ethical consideration.

References
The essay ends with a list of references citing the sources used to support its arguments and claims. These references provide additional reading for those interested in exploring the topic further.

By exploring the English definition of the search essay, readers can gain a deeper understanding of the concept of search, its technology, and its impact on society.

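Research on an Index Structure Applied to Search Engines

Liu Chang, Zhang Hui (School of Computer Science, Beihang University, Beijing 100083)

Abstract (translated from the Chinese): The index structure is the core of a search engine and directly affects its retrieval performance. This paper proposes a new index structure that makes full use of the latent regularities in the number and ordering of string prefixes and effectively reuses earlier match information during lookup, which improves retrieval efficiency.

Keywords: index structure; search engine; inverted file
CLC number: TP391.1

Study of Index Structure which Supports the High Efficiency of Searching
Liu Chang, Zhang Hui (Dept. of Computer Science and Technology, BUAA, Beijing 100083)

Abstract: Index structure is the core of a search engine; it directly influences the performance of the whole search engine. In this paper, a new index structure is presented, which takes full advantage of the latent rules about the suffix number and the order of strings and makes full use of the match information obtained during the search process, and consequently improves search efficiency.
Key words: structured text, index structure, inverted files
Class number: TP391.1

1 Introduction
A good index structure keeps retrieval operations fast even when the data being searched is very large, and is a decisive factor in achieving efficient retrieval [1].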

English Essay: Digital Archive Management and Use in Archives

In today's digital age, the management and use of digital archives in archival institutions have become increasingly vital. Digital archives are the digitized versions of physical records, documents, and materials stored in archives, ensuring their preservation, accessibility, and usability in the modern world. This transformation from traditional paper-based archives to digital formats offers numerous advantages and has revolutionized the way information is stored, managed, and used across various sectors.

Firstly, digital archives facilitate efficient storage and preservation of historical and contemporary records. Unlike physical documents, which are susceptible to damage from environmental factors such as humidity, temperature, and pests, digital records can be stored securely on electronic devices and in cloud storage systems. This method ensures the longevity of records without compromising their integrity, thereby safeguarding valuable historical information for future generations.

Moreover, the accessibility of digital archives enhances research capabilities and promotes knowledge dissemination. Researchers, historians, and scholars can access a vast array of information remotely, reducing the need for physical visits to archives. This accessibility transcends geographical boundaries, allowing individuals worldwide to explore and analyze historical data that was previously restricted due to location or logistical constraints.

Furthermore, digital archives facilitate efficient information retrieval and management. Advanced search functionalities and indexing systems enable users to locate specific documents swiftly and accurately. This capability significantly enhances operational efficiency within archival institutions, as archivists spend less time searching for records and more time curating and digitizing additional materials.

Additionally, the digitization of archives promotes collaboration and interdisciplinary research initiatives. By sharing digitized collections across institutions and disciplines, scholars can collaborate on projects, conduct comparative studies, and uncover new insights into historical events and social trends. This interdisciplinary approach fosters innovation and enriches scholarly discourse in fields ranging from history and anthropology to sociology and political science.

Furthermore, digital archives contribute to the democratization of information and cultural heritage preservation. Through online platforms and digital repositories, archival materials can reach a broader audience, including educators, students, policymakers, and the general public. This accessibility promotes cultural awareness, encourages civic engagement, and facilitates the preservation of diverse cultural identities and traditions.

Moreover, digital archives support adaptive reuse and creative applications of archival materials. Cultural institutions and creative industries repurpose digitized content to develop educational resources, multimedia exhibitions, documentaries, and digital art installations. These innovative uses not only engage audiences in new and immersive ways but also promote the economic sustainability of archival institutions through partnerships with creative professionals and commercial ventures.

In conclusion, the digitization of archives represents a transformative shift in archival practices, offering unprecedented opportunities for storage, preservation, accessibility, and use of historical and contemporary records. By harnessing the power of digital technologies, archival institutions can uphold their missions to preserve cultural heritage, advance scholarly research, and foster global understanding and collaboration. As we continue to embrace digital innovation, the future of archival management promises to be dynamic, inclusive, and enriching for generations to come.

The fulltext Field: Processing English Text

Fulltext Search of English Text

Introduction
Fulltext searching is a powerful technique for finding relevant information within a large body of text. In the context of English language processing, fulltext search can be used to retrieve documents that contain specific words or phrases, even if those words or phrases appear in different forms or are misspelled.

Tokenization and Normalization
The first step in fulltext search is to tokenize the text into individual words or phrases. This can be done using a variety of techniques, such as whitespace splitting or regular expressions. Once the text has been tokenized, the tokens are normalized to remove common variations such as case and punctuation, and are then stemmed.

Stemming
Stemming is the process of reducing words to their root form. This can be done using a variety of algorithms, such as the Porter stemmer or the Lancaster stemmer. Stemming can help to improve the recall of fulltext search queries by matching words that have different suffixes.

Stop Words
Stop words are common words that do not add significant meaning to a query. Examples of stop words include "the", "and", and "of". Stop words can be removed from queries to improve efficiency and reduce noise.

Indexing
Once the text has been tokenized, normalized, and stemmed, it is indexed. An index is a data structure that maps words or phrases to the documents in which they appear. This allows for fast and efficient searching.

Querying
To perform a fulltext search, a user submits a query. The query is typically a string of words or phrases that represent the information the user is seeking. The query is then processed using the same techniques as the text, and the results are ranked based on their relevance to the query.

Relevance Ranking
A number of factors can be used to determine the relevance of a document to a query. These factors include:
Term frequency: The number of times a term appears in a document.
Document frequency: The number of documents in the corpus that contain a term.
Inverse document frequency: A measure of how rare a term is in the corpus; it decreases as the document frequency increases.
Proximity: The distance between terms in a document.
Boosting: A way to give certain terms or documents more weight in the ranking.

Applications
Fulltext search has a wide range of applications, including:
Document retrieval: Finding documents that contain specific keywords or phrases.
Web search: Searching the web for information.
E-commerce: Searching for products on an e-commerce website.
Spam filtering: Identifying and blocking spam emails.
Machine translation: Translating text from one language to another.
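The pipeline described above (tokenize, normalize, remove stop words, stem, index, then rank by term frequency and inverse document frequency) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list is abbreviated, `stem` is a crude suffix stripper standing in for a real algorithm such as the Porter stemmer, and the score is a plain sum of tf × log(N/df); proximity and boosting are left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "and", "of", "a", "in", "to", "is"}  # abbreviated, illustrative list


def stem(token):
    """Crude suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyze(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of tf * idf over the analyzed query terms."""
    scores = defaultdict(float)
    for term in analyze(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: "Searching the web for cats",
    2: "The cat searched and searched",
    3: "Indexing of documents",
}
index = build_index(docs)
print(search(index, len(docs), "searches"))
```

Because the query goes through the same `analyze` function as the documents, "searches", "searched", and "searching" all reduce to the same index term, which is the recall benefit the stemming paragraph describes.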

Search Engines: Information Retrieval in Practice, Slides, Chapter 1 (PDF)

Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult
Documents vs. Records
Example bank database query
– Find records with balance > $50,000 in branches located in Amherst, MA.
– Matches easily found by comparison with field values of records
designing and implementing them are major issues for search engines
Search Engines
Information Retrieval in Practice
All slides Addison Wesley, 2008
Search and Information Retrieval
Search on the Web is a daily activity for many people throughout the world
Search and communication are most popular uses of the computer
Applications involving search are everywhere
The field of computer science that is most involved with R&D for search is information retrieval (IR)
– Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed

Title: The Importance of Information Retrieval System Technology (IRST)

Introduction:
Information Retrieval System Technology (IRST) is an essential tool for organizations and individuals in today's digital age. With the exponential growth of available information, the need for efficient and effective information retrieval has become paramount. This document explores the significance of IRST, its key features, benefits, and its impact on various fields.

1. Definition and Components of IRST:
Information Retrieval System Technology (IRST) is a software system designed to facilitate the retrieval of relevant information from large databases or collections of documents. It comprises three main components: document indexing, query processing, and relevancy ranking. Document indexing involves the categorization and organization of documents to enable quick and accurate retrieval. Query processing refers to the system's ability to interpret user queries and match them with indexed documents. Relevancy ranking aims to display the most relevant results based on user queries and indexing protocols.

2. Benefits of IRST:
2.1 Time Efficiency:
IRST plays a crucial role in streamlining the search process and saving valuable time. By employing efficient indexing techniques and relevancy ranking algorithms, users can obtain the desired information promptly, reducing the time spent searching through vast amounts of data.
2.2 Enhanced Accuracy:
One of the primary benefits of IRST is its ability to increase the accuracy of information retrieval. Through advanced indexing methods, users can access precisely what they are looking for, minimizing irrelevant results and false positives.
2.3 Improved Decision-Making:
IRST provides users with quick access to relevant and reliable information, enabling better decision-making. In fields such as medicine, finance, and academia, where accurate information is vital, the use of IRST can significantly enhance the quality of decisions made.

3. Applications of IRST:
3.1 Information Management in Libraries:
IRST has revolutionized information management in libraries by offering efficient cataloging and retrieval of books, journals, and other resources. Librarians can index and categorize books, making it easier for users to find specific information.
3.2 Business Intelligence:
In the corporate world, IRST finds applications in business intelligence. It enables companies to retrieve data from various sources, analyze market trends, and make informed business decisions. IRST assists in competitor analysis, market research, and data mining.
3.3 E-Commerce:
In e-commerce, IRST is crucial for providing users with accurate and relevant search results, improving user experience, and ultimately boosting sales. By employing sophisticated indexing and relevancy ranking algorithms, e-commerce platforms can match user queries with the most suitable products or services.

4. Challenges and Future Trends:
4.1 Scalability:
As the volume of digital information continues to grow, IRST faces challenges in handling large-scale databases and ensuring quick retrieval. Future developments in IRST should focus on scalability to meet the growing demands of information retrieval.
4.2 Multilingual Information Retrieval:
With the globalization of businesses and the internet, the need for efficient multilingual information retrieval has become more apparent. Future IRST advancements may involve incorporating machine translation and cross-language retrieval techniques to bridge language barriers.
4.3 Personalized Retrieval:
As user preferences and habits continue to shape the digital landscape, personalization in information retrieval becomes crucial. Future trends in IRST may involve incorporating user profiling and machine learning techniques to tailor search results according to individual preferences.

Conclusion:
Information Retrieval System Technology (IRST) is an indispensable tool that improves the efficiency and accuracy of information retrieval. Its benefits extend across various fields, including libraries, business intelligence, and e-commerce. However, scalability, multilingual retrieval, and personalization are challenges IRST must address in the future to keep up with ever-growing information demands. As technology continues to advance, IRST will play an increasingly vital role in ensuring quick access to relevant information.

3DR-tree Model Improvement Based on Enhance of Index Performance

Zhang Zhi Tong
Faculty of Technology, Harbin University, China

Keywords: spatio-temporal database; 3DR-tree; index; node splitting; tree splitting

Abstract. 3DR-tree is an index method that uses the traditional R-tree to index moving objects. Its defect is that a cube is generated whenever an object remains still for a period of time. For objects that remain still for a long time, many strip cubes are generated and the MBRs become overlong or overlarge, which greatly increases overlap and reduces index performance. This paper improves the 3DR-tree model to enhance index performance. On the basis of 3DR-tree, it improves the historical data index performance of 3DR-tree through node splitting and provides 3DR-tree with an online data index function by means of tree splitting.

Introduction

3DR-tree [1] is an index method that uses the traditional R-tree [2] to index moving objects. When a 3DR-tree index is built, the left and right end values of each object's interval in every dimension must be known. Therefore, 3DR-tree processes only offline data, i.e. data items with defined time interval values. It cannot process line segments with an open time interval.

In a spatio-temporal database, the tense attributes of moving objects are divided into historical tense, online tense, and future tense. Accordingly, index structures are divided into indexes of historical information, current information, and future information of moving objects. 3DR-tree is reconstructed based on this classification. In this paper, we use a "triple model" [3] to describe the evolution of a spatio-temporal object.

Node Splitting

3DR-tree regards time as another spatial dimension of a spatial object and then uses R-tree to perform spatial indexing. It uses three-dimensional spatial data to represent a (two-dimensional) spatio-temporal object, which is comparatively intuitive and simple. For example, in two-dimensional space a spatial object is represented by its minimum bounding rectangle (MBR). Accordingly, a spatio-temporal object can be represented by an MBR in three-dimensional space. The bottom of the three-dimensional MBR is the MBR in two-dimensional space, and the height of the three-dimensional MBR is the lifecycle of the spatio-temporal object.

Considering the node form of 3DR-tree, query effectiveness can be improved by reducing the volume of the minimum circumscribed cube.

First, introduce a manual split into the evolution of a spatio-temporal object. Assume that a spatio-temporal object evolves from the moment t1 to the moment t2. Introduce a split at the moment ts, where t1 < ts < t2. Delete the object manually at the moment ts, and then insert two new objects with the same spatial status. These two objects have the same ID as the original object. As a result, the original spatio-temporal object o[t1,t2] is replaced by two new spatio-temporal objects o[t1,ts] and o[ts,t2]. The number of spatio-temporal objects increases, but the total invalid MBR space decreases, i.e. the volume of the minimum circumscribed cube is reduced by segmentation. Figure 1 shows an interval object in three-dimensional space that evolves from the moment t1 to the moment t2, with a manual split introduced at the midpoint ts between t1 and t2. The original MBR contains a large invalid space. Comparing the MBRs of the two newly introduced spatio-temporal objects with the MBR of the original object shows that the invalid spaces E1 and E2 are saved. Therefore, the volume of the circumscribed cube is reduced.

Fig. 1 MBR projecting in YOX when splitting

Second, study the effect of the data set density and the number of spatial objects on index performance. We use the R-tree forecast model [4] put forward by Yannis Theodoridis to estimate index performance.

Assume that f is the capacity of each R-tree node, the query window is Q = (qx, qy, qz), and D is the density of the data object set, i.e. the number of objects (represented by their MBRs) at a random point of the given space. The number of objects is denoted m. For the query Q, the number of I/O operations required is approximately:

(1)

In this definition, the size of each dimension of the space is [0,1). If qi = 0, then DA(q) = Dl. It can be seen that reducing the density of large and small sets can reduce I/O operations. If the query window is small, an increase in the number of objects has little effect on I/O operations.

According to this analysis of the R-tree forecast model, the effect of the number of spatial objects on I/O operations is clearly lower than that of the data set density. Therefore, introducing the splitting operation will enhance index performance.

Because leaf nodes do not cause splitting of bottom-layer nodes, and for efficiency reasons, leaf nodes are not affected by recursive splitting. However, the splitting operation on nonleaf nodes may propagate downward.

The splitting process can be expressed as the following algorithm:

    BOOL Request_Spliting(n)
        If (n.leaf <= 2)    // Judge whether a node is a leaf node
            Return FALSE
        Else
            Return TRUE
    for (i = 0, j = 1; i <= count(n.leaf); i++, j++)
        temp = max(tj - ti)    // Select the node with the longest time interval from all nonleaf nodes
    ts = (t1 + t2) / 2    // Calculate corresponding time splitting points

The process split(n) divides the strip e into two parts along the time axis, i.e. e(o_id, s, [t1, t2]) is divided into e1(o_id, s, [t1, ts]) and e2(o_id, s, [ts, t2]). This process splits a strip cube into a series of short cubes along the time axis, which reduces the zone area, overlapping area, and perimeter and thus enhances the index performance of 3DR-tree. Figure 2 shows this process. Because the time axis is divided more finely, the area of the minimum circumscribed rectangle that moving objects occupy in the YOX plane is clearly reduced, as shown in the figure. This indicates that index efficiency can be enhanced effectively by the node splitting method.

Fig. 2 Splitting of square along T-axis

Tree Splitting

In the history tree, a large amount of blank space degrades search efficiency: for example, the search region may intersect a node's MBR although no actual object lies in the intersection, so unwanted nodes are visited [4]. For spatio-temporal data processed in 3DR-tree, the time dimension keeps expanding upward as the update time t of the spatio-temporal object data increases, so a large proportion of each MBR is occupied by blank space. It is therefore reasonable to treat blank space as a cost function at insertion time in order to control the growth of blank space in the nodes of the 3DR-tree structure. This modification yields more compact nodes, reduces the likelihood that the search space intersects the blank space of a node MBR, and improves search efficiency.

Two methods are used to extend 3DR-tree so that it can index online spatio-temporal data.

In the first scheme, the minimum circumscribed cube (MBR) of an R-tree node grows with time. That is, if a node includes online spatio-temporal data items, its circumscribed cube grows linearly with time. Figure 3 shows this process. The rectangle R in the figure includes two data items o1 and o2. o1 is closed, while o2 is unclosed and grows with time. The right endpoint of the time dimension of o2 is unclosed, i.e. o2 is online data. Therefore, the minimum circumscribed rectangle of o1 and o2 is also unclosed, and the right endpoint of its time dimension keeps growing. To insert data into the index or to query it, all unclosed right time endpoints must be set to the operation execution time. In this way, all unclosed hypercubes can be treated as closed, and the R-tree algorithm can be used. RST-tree is a good example of this approach, but it is a dual-tense index. The defect of this method is that even when there are only a few online data items in the index, the circumscribed cubes of nodes still produce invalid space and overlap, which degrades index performance. In addition, this method has an excessive insertion cost.

Fig. 3 MBR extending with time

The other method is to store online data and historical data separately [5]. Two trees are used to index all data. The first tree indexes online data and mainly inspects the spatial attributes of the data, with tense information as auxiliary information. The second tree uses the improved 3DR-tree to index historical data. When spatio-temporal data become historical data, the time interval attribute is changed from [ti, now] to [ti, tj], and the data move from tree I to tree II. This method is efficient for spatio-temporal data with a long evolution period.

Conclusion

3DR-tree regards time as another dimension of space and is applicable to spatio-temporal objects whose position and range change little or not at all. However, it does not take into account the particularity of the time dimension and processes only offline data. In addition, many strip cubes are generated for objects that remain still for a long period of time, which greatly reduces index performance.

Therefore, we combine node splitting and tree splitting to design and extend the 3DR-tree index structure. In addition, the historical data index and the online data index are established separately on a uniform 3DR-tree according to the different characteristics of historical and active spatio-temporal data. This realizes a spatio-temporal index mechanism that indexes both historical data and online data effectively and removes the limitation that the original 3DR-tree indexes only historical data.

References

[1] TAO Y, PAPADIAS D. MV3R-tree: A Spatio-Temporal Access Method for Timestamp and Interval Queries[C]. Proceedings of the 27th International Conference on Very Large Databases, San Francisco, 2001: 431-440.
[2] TAO Y, PAPADIAS D, SUN J. The TPR*-tree: An Optimized Spatio-Temporal Access Method for Predictive Queries[C]. Proceedings of the 29th VLDB Conference, Berlin, 2006: 790-801.
[3] CHON D, AGRAWAL D, ABBADI A E. Storage and Retrieval of Moving Objects[C]. In Proc. of the Intl. Conf. on Mobile Data Management, Hong Kong, China, 2007: 173-184.
[4] CAI M, REVESZ P. Parametric R-Tree: An Index Structure for Moving Objects[C]. In Proc. of the Intl. Conf. on Management of Data, Pune, India, 2005: 57-64.
[5] PEUQUET D, DUAN N. An Event-Based Spatiotemporal Data Model (ESTDM) for Temporal Analysis of Geographical Data[J]. International Journal of Geographical Information Systems, 1995, 9(1): 7-24.
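The node-splitting step of the 3DR-tree paper above, in which an entry e(o_id, s, [t1, t2]) is cut at ts = (t1 + t2)/2 into e1 and e2 and the entry with the longest time interval is chosen for splitting, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Entry` class, the representation of the spatial status `s` as a rectangle tuple, and the helper names `split_entry` and `split_longest` are hypothetical, and no R-tree node management is modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    o_id: int     # object identifier, shared by both halves after a split
    s: tuple      # spatial status, assumed here to be (xmin, ymin, xmax, ymax)
    t1: float     # start of the time interval
    t2: float     # end of the time interval


def split_entry(e):
    """Split e(o_id, s, [t1, t2]) at the midpoint ts into e1 and e2."""
    ts = (e.t1 + e.t2) / 2
    return replace(e, t2=ts), replace(e, t1=ts)


def split_longest(entries):
    """Replace the entry with the longest time interval by its two halves."""
    longest = max(entries, key=lambda e: e.t2 - e.t1)
    rest = [e for e in entries if e is not longest]
    return rest + list(split_entry(longest))


strip = Entry(o_id=7, s=(0.0, 0.0, 1.0, 1.0), t1=0.0, t2=10.0)
short = Entry(o_id=8, s=(2.0, 2.0, 3.0, 3.0), t1=4.0, t2=5.0)
entries = split_longest([strip, short])
```

Calling `split_longest` repeatedly turns one long strip into a series of shorter cubes along the time axis, which is the effect the paper attributes to split(n).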


a r X i v :c s /0403014v 2 [c s .D B ] 12 M a r 2004Search Efficiency in Indexing Structures for Similarity SearchingGirish Motwani Sandhya GDepartment of Computer Science and AutomationIndian Institute of ScienceNovember 24,2003AbstractSimilarity searching finds application in a wide vari-ety of domains including multilingual databases,com-putational biology,pattern recognition and text re-trieval.Similarity is measured in terms of a distance function (edit distance )in general metric spaces,which is expensive to compute.Indexing techniques can be used reduce the number of distance computations.We present an analysis of various existing similarity index-ing structures for the same.The performance obtained using the index structures studied was found to be un-satisfactory .We propose an indexing technique that combines the features of clustering with M tree(MTB)and the results indicate that this gives better perfor-mance.1IntroductionWith the advent of new application domains such as multilingual databases,computational biology,text retrieval,pattern recognition and function ap-proximation,there is a need for proximity searching,that is,searching for elements similar to a given query element.Similarity is modeled using a distance function;this distance function along with a set of objects defines a metric puting distance function can be expensive,for example,one of the requirements in multilingual database systems is to find similar strings,where the distance(edit distance )between the strings is computed using an O(mn)algorithm where m,n are the length of the strings compared.This necessitates the use of an efficient in-dexing technique which would result in fewer distance computations at query time.Having an indexing structure serves the dual purpose of decreasing both CPU and I/O costs.Existing index structures such as B+trees used in exact matching proves inadequate for the above requirements.Various indexing structures have been proposed for similarity 
searching in metric spaces.We present the performance analysis of these structures in terms of the percentage of database scanned by varying edit distances from 10%to 100%.After providing a preliminary background in Section 2,we move on to the description of the existing index structures in Section 3.Section 4describes the experimental set up and the analysis is presented in Section 5.Section 6concludes the paper.2PreliminariesA metric space comprises of a collection of objects and an assosciated distance function satisfying the follow-ing properties.•Symmetryd (a,b )=d (b,a )•Non-negativityd (a,b )>0if (a =b )and d (a,b )=0if (a =b )•Triangle inequaltiyd (a,b )≤d (a,c )+d (c,b )a,b,c are objects of the metric space.Edit distance(Levenshtein distance )satisfies the above mentioned properties.The edit distance between two strings is defined as the total number of simple edit operations such as additions,deletions and substitutions required to transform one string to another.For example,consider the strings paris and spire .The edit distance between these two strings is 4,as the transformation of paris to spire requires one addition,one deletion and two substitutions.Edit distance computation is expensive since the alogorithmic complexity is O(mn)where m,n are thelength of the strings compared.One of the common queries in applications requir-ing similarity search is tofind all elements within a given edit distance to a given query string.Indexing structures for similarity search make use of the trian-gle inequality to prune the search space.Consider an element p with an assosciated subset of elements X such that∀x∈X,d(p,x)<=kWe want tofind all strings within edit distance e from given query string q.That is reject all strings x such thatd(q,x)>e(1) From the triangle inequality,d(q,p)≤d(q,x)+d(x,p). 
Hence $d(q,x) \ge d(q,p) - d(x,p)$, which, since $d(x,p) \le k$, reduces to

$$d(q,x) \ge d(q,p) - k \tag{2}$$

From equations (1) and (2), the criterion reduces to

$$d(q,p) - k > e \tag{3}$$

If the inequality is satisfied, the entire subset X is eliminated from consideration. However, we need to compute the O(mn) edit distance for all the elements in the subsets that do not satisfy the above criterion. [8] proposes the bag distance, which is given as

[...]

element with its parent routing object is also stored. This helps in reducing some of the distance computations, as shown in [4].

For a given query and edit distance, search starts at the root and recursively traverses the left subtree if

$$d(p_1,q) - e \le r_1 \tag{6}$$

and the right subtree if a similar condition holds for $p_2$.

3.5 M Tree

The bisector tree can be extended to an m-ary tree [4] by using m routing objects in the internal node instead of two. We select m routing objects for the first level. Together with each routing object is stored a covering radius, that is, the maximum distance from the routing object to any object in its subtree. A new element is compared against the m routing objects and inserted into the best subtree, defined as the one whose covering radius expands the least; ties are broken by selecting the closest representative. Thus, associated with each routing object $p_i$ is a region of the metric space

$$Reg(p_i) = \{\, u \in U \mid d(p_i,u) \le r_i \,\}$$

where $r_i$ is the covering radius. Each subtree is then partitioned recursively. In the internal node, $p_i$ and $r_i$ are stored together with a pointer to the associated subtree. To further reduce distance computations, the M tree also stores the precomputed distance between each routing object and its parent.

For a given query string and search distance, the search algorithm starts at the root node and recursively traverses all the paths for which the associated routing objects satisfy the following inequalities:

$$|d(p_i^p,q) - d(p_i^p,p_i)| \le r_i + e \tag{7}$$

and

$$d(p_i,q) \le r_i + e \tag{8}$$

where $p_i^p$ denotes the parent routing object of $p_i$. In equation (7), we take advantage of the precomputed distance between the
routing object and its parent.

3.6 VP Tree

The Vantage Point tree (VP tree) [6] is a binary tree in which pivot elements called vantage points partition the data space into spherical cuts at each level to enable effective filtering in similarity search queries. It is built using a top-down approach and proceeds as follows. A vantage point $S_v$ is chosen from the dataset and the distances between the vantage point and the elements in its subtree are computed. The elements are then grouped into the left and right subtrees based on the median of the distances, i.e., those elements whose distance from the vantage point is less than or equal to the median are inserted in the left subtree and the others are inserted in the right subtree. This partitioning continues until the elements in the subtree fit in a leaf. The median value M is retained at each internal node to aid in the insertion and search process. In addition, each element in both the internal and leaf nodes holds the distance entries for every ancestor, which helps in cutting down the number of distance computations at query time. An optimized tree can be obtained by using heuristics to select better vantage points.

Search for a given query string q starts at the root node. The distance between q and the vantage point at the node ($S_v$) is computed and the left subtree is recursively traversed if

$$d(q,S_v) - e \le M \tag{9}$$

Similarly, the right subtree is traversed recursively if the following inequality holds.

$$d(q,S_v) + e \ge M \tag{10}$$

Once a leaf node is reached, the query string needs to be compared with all the elements in the leaf node, but some of the distance computations can be saved using the ancestral distance information.

3.7 MVP Tree

The VP tree can be easily generalized to a multiway tree structure called the Multiple Vantage Point tree (MVP tree) [7]. A notable feature of the MVP tree is that multiple vantage points can be chosen at each internal node and each of them can partition the data space into m groups.
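The VP tree of Section 3.6, which the MVP tree generalizes, can be sketched in a few lines. The following Python is a minimal illustration under stated assumptions, not the authors' C implementation: the names `build_vp_tree`, `range_search` and `abs_diff` are hypothetical, the first element of each subset is simply taken as the vantage point (no best spread heuristic), ancestral distances are omitted, and integers under absolute difference stand in for strings under edit distance. The two tests in `range_search` correspond to inequalities (9) and (10).

```python
import statistics


def abs_diff(a, b):
    """Toy metric for the demo; an edit distance function could be passed instead."""
    return abs(a - b)


def build_vp_tree(items, dist, leaf_size=1):
    """Build a VP tree as nested dicts (hypothetical layout for illustration)."""
    if len(items) <= max(leaf_size, 1):
        return {"leaf": list(items)}
    vantage, rest = items[0], items[1:]          # assumption: first item is the vantage point
    distances = [dist(vantage, x) for x in rest]
    median = statistics.median(distances)
    left = [x for x, d in zip(rest, distances) if d <= median]
    right = [x for x, d in zip(rest, distances) if d > median]
    return {
        "vantage": vantage,
        "median": median,
        "left": build_vp_tree(left, dist, leaf_size),
        "right": build_vp_tree(right, dist, leaf_size),
    }


def range_search(node, query, e, dist, out=None):
    """Collect every stored element within distance e of query."""
    if out is None:
        out = []
    if "leaf" in node:
        out.extend(x for x in node["leaf"] if dist(query, x) <= e)
        return out
    d = dist(query, node["vantage"])
    if d <= e:                                   # the vantage point is itself a data element
        out.append(node["vantage"])
    if d - e <= node["median"]:                  # inequality (9): left subtree may hold matches
        range_search(node["left"], query, e, dist, out)
    if d + e >= node["median"]:                  # inequality (10): right subtree may hold matches
        range_search(node["right"], query, e, dist, out)
    return out


data = [3, 17, 8, 25, 11, 4, 30, 9, 14, 21]
tree = build_vp_tree(data, abs_diff)
matches = sorted(range_search(tree, 10, 2, abs_diff))  # [8, 9, 11]
```

Any metric can be passed as `dist`. Each node of this VP tree stores a single median; an MVP node instead splits on several vantage points at once.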
Hence it is required to store multiple cut-off values instead of a single median value at each internal node. The various parameters that can be tuned to improve the efficiency of the MVP tree are

• the number of vantage points at each internal node (v),
• the number of partitions created by each vantage point (m),
• the number of ancestral distances associated with each element in the leaf (p).

The insertion procedure starts by selecting a vantage point $S_{v1}$ from the dataset. The elements under the subtree of $S_{v1}$ are ordered with respect to their distances from $S_{v1}$ and partitioned into m groups. The m−1 cut-off values are recorded at the internal node. The next vantage point $S_{v2}$ is a data point in the rightmost (m−1) partitions that is farthest from $S_{v1}$, and it divides each of the m partitions into m subgroups. It can be observed that the nth vantage point is selected from the rightmost (m−n+1) partitions and that the fan-out at each internal node is $m^v$. This is continued until all elements in the subgroup fit in a leaf node. At the leaf, each element keeps information about its distance from its first p ancestors.

Given a query string q and an edit distance e, q is compared with the v vantage points at each internal node, starting at the root. Let the distance between the vantage point $S_{vi}$ and q be $d(S_{vi},q)$ and let $M_i$ be the cut-off value between subtrees $T_i$ and $T_{i+1}$. $T_i$ is recursively traversed if both the inequalities

$$d(S_{vi},q) - e \le M_i \tag{11}$$

and

$$d(S_{vi},q) + e \ge M_{i-1} \tag{12}$$

hold. For traversing the first subtree, only (11) needs to be satisfied. Similarly, only inequality (12) is used to traverse the last subtree. A detailed description of the search procedure can be found in [7].

3.8 Clustering

Another technique used in similarity searching to reduce search cost is clustering. Clustering partitions the collection of data elements into groups called clusters such that similar entities fall into the same group. Similarity is measured using the distance function, which satisfies the triangle inequality. A representative called
clusteroid is chosen from each cluster. While searching, the query string is compared against the clusteroid and the associated cluster can be eliminated from consideration if criterion (3) holds, which helps in reducing the search cost.

[3] proposes BUBBLE for clustering data sets in arbitrary metric spaces. The two distance measures used in the algorithm are given as

RowSum: Let $O = \{O_1, O_2, \ldots, O_n\}$ be a set of data elements in a metric space with distance function d. The RowSum of an object $o \in O$ is defined as

$$RowSum(o) = \sum_{j=1}^{n} d^2(o, O_j)$$

The clusteroid C is defined as the object $C \in O$ such that $\forall o \in O: RowSum(C) \le RowSum(o)$.

Average Inter-Cluster Distance: Let $O_1 = \{O_{11}, \ldots, O_{1n_1}\}$ and $O_2 = \{O_{21}, \ldots, O_{2n_2}\}$ be two clusters with $n_1$ and $n_2$ elements respectively. The average inter-cluster distance is defined as

$$D_2 = \left( \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d^2(O_{1i}, O_{2j})}{n_1 n_2} \right)^{1/2}$$

Insertion in BUBBLE starts by creating a CF* tree, which is a height-balanced tree. Each non-leaf node has entries of the form $(CF^*_i, child_i)$ where $CF^*_i$ is the cluster feature, i.e., the summarized representation of the subtree pointed to by $child_i$.
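The RowSum measure and the clusteroid it defines can be computed directly from their definitions. The sketch below is only an illustration: the function names `row_sum`, `clusteroid` and `abs_diff` are hypothetical, the computation is the naive quadratic one rather than anything BUBBLE does incrementally, and integers under absolute difference are assumed as a stand-in for strings under edit distance.

```python
def abs_diff(a, b):
    """Toy metric for the demo; an edit distance function could be passed instead."""
    return abs(a - b)


def row_sum(obj, cluster, dist):
    """Sum of squared distances from obj to every element of the cluster."""
    return sum(dist(obj, other) ** 2 for other in cluster)


def clusteroid(cluster, dist):
    """Element of the cluster with the smallest RowSum."""
    return min(cluster, key=lambda obj: row_sum(obj, cluster, dist))


cluster = [1, 2, 3, 10]
center = clusteroid(cluster, abs_diff)  # 3, whose RowSum is 4 + 1 + 0 + 49 = 54
```

The outlier 10 pulls the clusteroid towards it: 3 is chosen rather than the middle of the tight group, 2, whose RowSum is 66.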
The leaf node entries are of the form $(CF^*_i, cluster_i)$ where $CF^*_i$ is the clusteroid and $cluster_i$ points to the associated cluster. When an element x is to be inserted, it is compared against all the CF* entries in the internal node using the average inter-cluster distance $D_2$, and the child pointer associated with the closest CF* entry is followed. On reaching a leaf node, the cluster closest to x is the one having the minimum RowSum value. If the distance between x and the closest clusteroid is less than a threshold value T, x is inserted in that cluster, a new clusteroid is selected and the CF* entries in the path from the root to this leaf node are updated. If the distance is greater than T, a new cluster is formed. In our implementation, each element entry in the cluster contains its distance to the clusteroid to reduce the number of distance computations.

For a given query string and search distance, the query is compared with all the clusteroids. If a clusteroid does not satisfy criterion (3), the elements of its cluster need to be searched for similar strings. The precomputed distances can be used to eliminate some distance computations.

Figure 1: Performance Comparison of Similarity Indexing Structures (x-axis: edit distance; y-axis: number of comparisons as a percentage of the database)

3.9 MTB

In the case of the M tree, a new element x is compared with the routing objects at the internal node and inserted into the best subtree. The best subtree is defined as the one for which the insertion of this element causes the least increase in the covering radius of the associated routing object. In the case of ties, the closest representative is selected. This continues until we reach a leaf node. This may cause physically close elements to fall into different subtrees. Along the path, the covering radii of the selected routing objects are updated if x is farther from p than any other element in its subtree. Thus there are no bounds on the covering radii associated with the routing objects. A
possible optimization is to bound the elements in the leaf nodes to be within a given THRESHOLD of the routing objects in their parent node. Also, the new element is inserted into the subtree associated with the closest routing object, thereby keeping the physically close elements together. These two optimizations result in a new indexing structure, which we call M Tree with Bounds (MTB). Thus, in the case of MTB, the insertion of an element that causes the covering radius of the routing object of the lowest-level internal node to exceed the THRESHOLD results in a partition of the leaf node entries such that the THRESHOLD requirements are maintained. Searching is similar to that of the basic M tree implementation.

4 Experimental Setup

We have performed an analysis of the various similarity indexing structures described in the previous section. The metric used for comparing the performance is the percentage of the database scanned for a given query and search distance, which is a measure of the CPU cost incurred.

The experimental analysis was performed on a Pentium III (Coppermine) 768 MHz Celeron machine running Linux 2.4.18-14 with 512 MB RAM. All the indexing structures were implemented in C. The O(mn) dynamic programming algorithm to compute the edit distance between a pair of strings was used in the experiments. The dataset used for the analysis was an English dictionary dataset comprising 100,000 words. The average word length of the dataset is around 9 characters. Six query sets, each of 500 entries, were chosen at random from the dataset for the experiments. The results presented are obtained by averaging over the results for these query sets. The page size is assumed to be 4K bytes.

5 Analysis

In this section, we provide the analysis and the experimental results on the performance of the various similarity indexing structures. The implementation details of the various index structures are presented in the next subsection, followed by the results.

5.1 Implementation Details

Assuming a page
size of 4K bytes, the bucket size is taken to be 512 entries for the BK tree, FQ tree and FH tree, as each entry is 8 bytes. The routing data elements are chosen at random from the dataset. The leaf node for the VP tree as proposed in [6] has a single entry. The routing element is selected using the best spread heuristic [6]. For MVP trees, we ran the experiments for different values of the parameters m, v and p, and the values 2, 2 and 10 were found to give better performance. For p = 10, the number of leaf node entries is found to be 110. The vantage point is selected at random for the MVP tree. In the case of the bisector tree and M tree, the two farthest elements are chosen as pivot elements at the time of the split of a full node. For the M tree, we ran the experiment with the number of entries in the internal node, m, taking the values 5 and 254. In Clustering and Indexing with bounds, the THRESHOLD value used in our runs was chosen to be 5.

Figure 2: Splitting characteristic of BK tree (diagram: a root with branches 1, 2, ..., n leading to subsets c1, c2, ..., cn)

5.2 Experimental Results

5.2.1 Search complexity

In all the indexing structures, the criterion (3), which is obtained from the triangle inequality, is used to prune the search space. As the search distance is increased, the number of pivots (or routing objects) that fail to satisfy the criterion (3) also increases, resulting in an increase in the percentage of the database scanned. Figure 1 shows the performance of the various similarity indexing structures with variation in the search distance.

It can be seen that the FQ tree and FH tree perform better than the BK tree. This can be attributed to two reasons. First, the number of pivot element comparisons is smaller in the case of the FQ tree and FH tree, as these trees have one fixed key per level.
In the BK tree, by contrast, there are as many distinct pivot elements per level as there are nodes at that level. Second, the FQ tree and FH tree use a better splitting technique, resulting in more partitions than in the BK tree. Hence, some of the partitions can be eliminated using (3). Consider the case when a subset $C_i$, as shown in Figure 2, is to be split in the BK tree. The pivot element selected is some $c \in C_i$. Thus the maximum number of partitions that can result is 2i. However, in the case of the FQ tree, since a fixed pivot element is selected for each level, the chosen pivot is away from the subset, which may result in more partitions. It is shown in [6] that this results in better performance. In the FH tree all the leaves are at the same level. Also, since we have already performed the comparison between the query and the pivot of an intermediate level, we eliminate for free the need to consider a leaf. Hence the FH tree performs slightly better than the FQ tree.

Our implementation of the VP tree uses the best spread heuristic [6] for selecting the vantage points. In addition, each internal node maintains the lower and upper bounds of the distances of the elements in the left and right subtrees. These bounds can be used to cut down the distance computations using the triangle inequality. Because of these factors the performance is better than that of the BK tree. However, just as in the BK tree, the vantage point is selected from the subset that is being partitioned and there are multiple distinct vantage points at any given level, so the FQ tree and FH tree show better performance.

As can be seen from the plots in Figure 1, the MVP tree outperforms the VP tree. Each leaf node entry in the MVP tree stores its distance to the first 10 ancestors.
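How stored ancestral distances save work at a leaf can be illustrated as follows. By the triangle inequality, |d(q, a) − d(x, a)| ≤ d(q, x) for any ancestor a, so a leaf element x whose stored distance to some ancestor differs from the query's distance to that ancestor by more than e cannot match and is skipped without computing d(q, x). The sketch is an assumption-laden illustration rather than the implementation used in the experiments: `filter_leaf` and `abs_diff` are hypothetical names, only two ancestors are used instead of ten, and integers under absolute difference stand in for strings under edit distance.

```python
def abs_diff(a, b):
    """Toy metric for the demo; an edit distance function could be passed instead."""
    return abs(a - b)


def filter_leaf(query, query_to_ancestors, leaf_entries, e, dist):
    """Scan a leaf, using stored ancestor distances to skip distance computations.

    leaf_entries holds (element, [d(element, a_1), ..., d(element, a_p)]) pairs;
    query_to_ancestors holds [d(query, a_1), ..., d(query, a_p)], already
    computed on the way down the tree.
    Returns the matching elements and the number of distances actually computed.
    """
    results, computed = [], 0
    for element, element_to_ancestors in leaf_entries:
        lower_bound = max(
            (abs(dq - dx) for dq, dx in zip(query_to_ancestors, element_to_ancestors)),
            default=0,
        )
        if lower_bound > e:        # d(query, element) > e is guaranteed: skip
            continue
        computed += 1
        if dist(query, element) <= e:
            results.append(element)
    return results, computed


ancestors = [0, 100]
leaf = [(x, [abs_diff(x, a) for a in ancestors]) for x in [5, 20, 48, 52, 90]]
query = 50
query_to_ancestors = [abs_diff(query, a) for a in ancestors]
hits, computed = filter_leaf(query, query_to_ancestors, leaf, 3, abs_diff)
# hits == [48, 52]; only 2 of the 5 leaf elements needed a distance computation
```

With a real string metric the lower bound is looser than in this one-dimensional toy, so fewer elements are skipped, but each skipped element still saves one O(mn) computation.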
These precomputed distances help in reducing the search cost as compared to the VP tree. In addition, the MVP tree needs two vantage points to partition the data space into four regions whereas the VP tree requires three vantage points for the same. This further reduces the number of distance computations at the internal nodes at search time. The left partition obtained using vantage point $S_{v1}$ is partitioned again using the farthest point $S_{v2}$, which is present in the right partition. Also, for smaller values of the edit distance (≤ 0.4), the internal node comparisons dominate the results. In the case of the MVP tree, since there are multiple keys at each internal node, more distance computations result than in the FH and FQ trees, which have one fixed key per level. This explains the crossover in the curves of the FQ tree, FH tree and MVP tree at search distance 0.4.

The clustering technique partitions the dataset into a fixed number of clusters $N_c$. This number varies inversely with the THRESHOLD, i.e., the cluster radius. At search time, the query string is compared against each of the cluster representatives, the clusteroids.
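The cluster-level pruning can be sketched as below. This is a simplified illustration under assumptions, not the BUBBLE-based implementation evaluated here: `cluster_range_search` and `abs_diff` are hypothetical names, clusters are given as flat (clusteroid, radius, other members) triples rather than through a CF* tree, per-element precomputed distances are ignored, and integers under absolute difference stand in for strings. A cluster is discarded when criterion (3) holds with p the clusteroid and k the cluster radius.

```python
def abs_diff(a, b):
    """Toy metric for the demo; an edit distance function could be passed instead."""
    return abs(a - b)


def cluster_range_search(query, clusters, e, dist):
    """Range search over clusters given as (clusteroid, radius, other_members).

    Every member m of a cluster is assumed to satisfy dist(clusteroid, m) <= radius.
    Returns the matches and the number of distance computations performed.
    """
    results, computations = [], 0
    for center, radius, members in clusters:
        d_qc = dist(query, center)
        computations += 1
        if d_qc - radius > e:      # criterion (3) holds: the whole cluster is pruned
            continue
        if d_qc <= e:              # the clusteroid is itself a data element
            results.append(center)
        for m in members:          # cluster not pruned: scan its members sequentially
            computations += 1
            if dist(query, m) <= e:
                results.append(m)
    return results, computations


clusters = [
    (10, 3, [8, 12, 13]),
    (50, 5, [46, 53, 55]),
    (90, 2, [89, 91]),
]
hits, computations = cluster_range_search(11, clusters, 2, abs_diff)
# hits == [10, 12, 13]; 6 distance computations for 11 stored elements
```

Every clusteroid is compared with the query before any pruning can take place.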
These comparisons are performed irrespective of the search distance. For a THRESHOLD of five, the clustering algorithm partitioned the dataset into 17,912 clusters. This explains the comparatively large number of searches for smaller values of the search radius in Figure 1. For clusteroids that do not satisfy the test in (3), the associated cluster elements are sequentially compared against the query string.

Figure 3: Performance Comparison of M Tree, Clustering and MTB (plot titled "Comparison of M Tree and M Tree with bounds"; series "MTree", "MTB" and "cluster"; x-axis: edit distance, 0.1 to 1; y-axis: number of comparisons as a percentage of the database, 0 to 100)

In the case of the bisector tree, the insertion of a new data element may result in an increase in the covering radius of the routing object. The covering radii depend upon the order in which the elements are inserted and may have large values. Due to this, at search time, a number of routing objects satisfy the test in equation (7). Thus, the bisector tree shows poor performance as compared to the other indexing structures. With M trees, even though the new element is inserted into a subtree such that the resulting increase in the covering radius is the least, there are no bounds on the covering radius value. So the performance is identical to that of the bisector tree. The poor performance can also be attributed to the presence of a larger number of routing objects to partition the data space.

It can be observed from the graph in Figure 3 that MTB, which combines the features of the M tree and clustering, shows better performance. This can be attributed to the two optimizations used, which result in well-formed clusters. For lower values of the search distance, the overhead of the comparisons with a large number of routing objects at the internal nodes results in poor performance.

The graph in Figure 4 shows a comparison of the various indexing structures when bag distance computation is used to reduce some of the edit distance
computations. The graph shows the edit distance computations needed with search distances varying from 10% to 100%.

Figure 4: Performance Comparison of Indexing Structures using bag distance (x-axis: edit distance; y-axis: number of comparisons as a percentage of the database)

5.2.2 Search Time

Table 1 lists the average search time (ms) per query taken by the various indexing structures. It can be observed that MTB takes comparatively less time. Bag distance computation helps in reducing the time complexity.

Index Structure    Search time
BK tree            0.4164
FQ tree            0.4124
FH tree            0.4090
VP tree            0.3041
Cluster            0.1465

Table 1: Time complexity

6 Conclusions and Future Work

We have presented a performance study of the search efficiency of similarity indexing structures. MTB, which combines the features of clustering and the M tree, is found to perform better than all the other indexing structures for most search distances. Bag distance computation, which is cheaper than edit distance computation, was used in the experiments. Its use resulted in reduced time complexity. Further, in applications where the required search distance is low and the string lengths are large, even better performance might result.

It can be observed that index structures like the MVP tree, which make use of precomputed distances to ancestors to prune the search space, perform better than others. In similarity searching, since multiple paths are traversed, keeping a fixed key per level as in the FQ tree minimizes the search cost by reusing the precomputed distance at that level. Thus, reusing the precomputed distances results in better performance. Some index structures were shown to perform better with smaller values of the edit distance (e ≤ 0.3) whereas some others perform better at higher values. It would be advantageous to maintain multiple index structures and, depending upon the edit distance, select the appropriate one. Using cheaper distance computation algorithms can also significantly reduce the CPU cost.
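As an illustration of a cheaper distance used as a filter, the sketch below pairs the O(mn) edit distance with a bag distance. The bag distance formula is not reproduced in the text above, so the definition used here is an assumption: the larger of the two multiset differences between the character bags of the strings, which never exceeds the edit distance and can therefore safely reject candidates. The names `edit_distance`, `bag_distance` and `range_query` are hypothetical, the scan is a plain linear one with no index, and the search distance is an absolute number of edits rather than the normalized 10% to 100% values used in the experiments.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance by the O(mn) dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bag_distance(a, b):
    """Assumed definition: max size of the two multiset differences of characters."""
    bag_a, bag_b = Counter(a), Counter(b)
    return max(sum((bag_a - bag_b).values()), sum((bag_b - bag_a).values()))


def range_query(words, query, e):
    """Linear scan that computes the edit distance only when the bag filter passes."""
    results, edit_calls = [], 0
    for w in words:
        if bag_distance(query, w) > e:   # lower bound already exceeds e: reject cheaply
            continue
        edit_calls += 1
        if edit_distance(query, w) <= e:
            results.append(w)
    return results, edit_calls


words = ["paris", "pairs", "spire", "london", "parts"]
hits, edit_calls = range_query(words, "paris", 2)
# hits == ["paris", "pairs", "parts"]; edit distance computed for 4 of the 5 words
```

For paris and spire the bag distance is 1 while the edit distance is 4, so the filter lets spire through and the full computation is still needed to reject it; london is rejected by the filter alone.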
The quality of partitioning is largely dependent on the heuristic used for selecting the pivots. As future work, we propose to analyse the performance of various index structures with different heuristics.

7 Acknowledgement

We thank A Kumaran of Database Systems Lab, IISc for his advice during the work.

References

[1] W. A. Burkhard, R. M. Keller, Some approaches to best-match file searching, Comm. of the ACM, 16(4):230–236, 1973.
[2] R. Baeza-Yates, W. Cunto, U. Manber, S. Wu, Proximity matching using Fixed-queries trees, The 5th Combinatorial Pattern Matching, volume 807 of Lecture Notes in Computer Science, pages 198–212, 1994.
[3] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, J. French, Clustering large datasets in arbitrary metric spaces, In the Proceedings of International Conference on Data Engineering, 1999.
[4] P. Ciaccia, M. Patella, P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, In Proceedings of the 23rd VLDB International Conference, Athens, Greece, September 1997.
[5] Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates, Jose Luis Marroquin, Searching in Metric Spaces, ACM Computing Surveys, 1999.
[6] Peter N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete Algorithms, p. 311–321, January 25-27, 1993.
[7] Tolga Bozkaya, Meral Ozsoyoglu, Indexing Large Metric Spaces For Similarity Search Queries, ACM Transactions on Database Systems, Vol. 24, No. 3, Pages 361–404, September 1999.
[8] Ilaria Bartolini, Paolo Ciaccia, Marco Patella, String Matching with Metric Trees Using an Approximate Distance, SPIRE 2002: 271–283.
[9] I. Kalantari, G. McDonald, A data structure and an algorithm for the nearest point problem, IEEE Transactions on Software Engineering, 9(5), 1983.
