Efficient LCA-Based XML Keyword Retrieval Algorithm


[CN109977288A] XML Keyword Query Method Based on CUDA Parallelism and LCA [Patent]
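CN 109977288 A
Claims: 1 page; description: 8 pages; drawings: 4 pages

Claims

1. An XML keyword query method based on CUDA parallelism and LCA, characterized in that the XML document is stored using a tree data model; at query time, the nodes of the n keywords are each Dewey-encoded to obtain the Dewey code sets S1, S2, S3, …, Sn corresponding to the keywords; an improved LCA algorithm computes the LCAs that form the query result; and CUDA-based GPU parallel computation accelerates the improved LCA algorithm. The improved LCA algorithm is as follows:

Inference 1: Suppose that in the same XML document tree there are two subtrees S1 and S2, where S1 contains a single node v and S2 contains only two nodes v1 and v2. If the Dewey codes of v, v1, v2 satisfy v < v1 < v2, then LCA(v, v1) and LCA(v, v2) are the same node, or LCA(v, v1) is a descendant of LCA(v, v2).

Inference 2: Suppose that in the same XML document tree there are two subtrees S1 and S2, where S1 contains a single node v and S2 contains only two nodes v1 and v2. If the Dewey codes of v, v1, v2 satisfy v > v1 > v2, then LCA(v, v1) and LCA(v, v2) are the same node, or LCA(v, v1) is a descendant of LCA(v, v2).

Inference 3: Given two sets of Dewey codes Q1 = {x1, x2, x3, …, xn} and Q2 = {y1, y2, y3, …, yn}, if the codes in Q2 are arranged in ascending order, then the LCA of Q1 and Q2 is the set Q = {LCA({x1}, Q2), LCA({x2}, Q2), LCA({x3}, Q2), …, LCA({xn}, Q2)}.

2. The XML keyword query method based on CUDA parallelism and LCA according to claim 1, characterized in that the XML document is parsed using SAX parsing.

3. The XML keyword query method based on CUDA parallelism and LCA according to claim 1, characterized in that, when the nodes of the n keywords are Dewey-encoded, nodes with the same name are stored in ascending order using a B+ index.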
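The three inferences can be illustrated with a small CPU-side sketch. It is not the patented CUDA implementation: it assumes Dewey codes are held as integer tuples, reads LCA({x}, Q2) as the deepest LCA of x with any element of Q2 (an interpretation, since the claim text does not spell this out), and the function names `lca`, `deepest_lca`, and `lca_of_sets` are hypothetical. Inferences 1 and 2 are what justify looking only at the two neighbours of x in the sorted Q2.

```python
from bisect import bisect_left

def lca(a, b):
    """LCA of two Dewey codes (tuples) is their longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def deepest_lca(x, sorted_q2):
    """Deepest LCA of x with any code in a non-empty, ascending sorted_q2.

    By Inferences 1 and 2, only the nearest predecessor and successor
    of x in Dewey order can yield the deepest LCA.
    """
    i = bisect_left(sorted_q2, x)
    neighbours = sorted_q2[max(i - 1, 0): i + 1]
    return max((lca(x, y) for y in neighbours), key=len)

def lca_of_sets(q1, q2):
    """Inference 3: combine the per-element results for every x in Q1."""
    sorted_q2 = sorted(q2)
    return {deepest_lca(x, sorted_q2) for x in q1}

q1 = [(0, 1, 2)]
q2 = [(0, 3), (0, 0, 1), (0, 1, 0)]
print(lca_of_sets(q1, q2))
```

Each element of Q1 is processed independently, which is the property a GPU version would presumably exploit by assigning one thread per element.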

An XLCA-Based XML Keyword Search Method
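Xu Jianjun (许建军); Wang Wei (汪卫); Shi Baile (施伯乐)
[Journal] Journal of Chinese Computer Systems (《小型微型计算机系统》)
[Year (Volume), Issue] 2008 (029) 001
[Abstract] Keyword search is an effective way for most ordinary users to find information, because they need not learn a complex query language or understand the structure of the underlying data. This paper studies keyword search over XML documents. It first points out that the SLCA-based result-set definition used in earlier work is incomplete, and then proposes an XLCA-based result-set definition that covers all possible results. Based on this definition, it presents a compact index structure and a corresponding search algorithm, and implements both approaches. Experiments show that the proposed method improves considerably in both performance and scalability.
[Pages] 5 (P52-56)
[Affiliation] Department of Computing and Information Technology, Fudan University, Shanghai 200433 (all three authors)
[Language] Chinese
[CLC number] TP311.13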

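Thesis Proposal: Research on SLCA-Based Keyword Query over XML

I. Background

XML, as a data exchange and storage format, is widely used on the Internet, in databases, and in network data transmission. With large and complex XML data, however, extracting useful information has become an active research topic. Keyword query is a commonly used approach: users combine keywords to search XML documents and obtain the information they need.

As XML data grows, traditional keyword query methods no longer meet practical needs. They traverse the entire XML document to find matches, which consumes considerable time and computing resources. As XML applications spread, query efficiency matters more and more. An efficient, scalable XML keyword query method is therefore needed.

II. Research content

This project studies an XML keyword query method based on the SLCA (Smallest Lowest Common Ancestor) algorithm. Building on the traditional SLCA algorithm, it constructs a keyword-based SLCA query algorithm intended to reduce query time and computing-resource consumption.

The specific research content includes:
1. Design of an SLCA-based XML keyword query algorithm
2. Query efficiency analysis and performance optimization
3. Algorithm implementation and test validation

III. Significance

This research aims to provide an efficient and scalable solution for XML keyword query. Combining keywords with the SLCA algorithm is expected to reduce query time and computing-resource consumption and to improve query efficiency and accuracy.

The algorithm also has broad application prospects, for example in XML document query and network data transmission, and therefore has both research significance and practical value.

IV. Methods

The research will use literature review, algorithm analysis, experiment design, and comparative testing. It will apply knowledge of data structures, algorithm analysis, and XML technology to study and design the algorithm, and will verify its effectiveness and feasibility through experiments and tests.

V. Expected results

1. Design an SLCA-based XML keyword query algorithm that is efficient and scalable.
2. Verify the effectiveness and feasibility of the algorithm through experiments and tests.
3. Improve the efficiency and accuracy of XML keyword query.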

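A Keyword-Based XML Document Query Algorithm

A keyword-based XML document query algorithm is a method for quickly finding specific content in an XML document. It uses keywords to locate the nodes that contain that content, which helps users find the relevant material quickly.

XML (Extensible Markup Language) is an extensible markup language that represents data with elements and attributes. An XML document consists of elements (also called nodes) and attributes. Each node can contain other nodes and attributes, so a document can form a complex tree structure. Because an XML document is explicitly structured, searching it is more convenient and efficient than searching a plain text document.

The basic idea of the algorithm is as follows: first split the keyword string into individual words, then traverse the XML document; whenever a node's attributes contain a keyword, add that node's content to the result list. Finally, return the results to the user.

The implementation steps are:

1. Determine the keywords: decide what content is to be queried.
2. Split the keywords: split the keyword string into individual words.
3. Traverse the XML document: walk the document to retrieve the specific content.
4. Test the keywords: check whether a node's attributes contain a keyword; if so, add the node's content to the result list.
5. Return the results: return the query results to the user.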
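The five steps above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only a minimal illustration under assumptions: the function name `keyword_search`, the case-insensitive substring match, and the sample document are all hypothetical choices, not part of the described algorithm.

```python
import xml.etree.ElementTree as ET

def keyword_search(xml_text, query):
    """Return (tag, text) for every node whose attribute values contain a keyword."""
    keywords = query.lower().split()           # steps 1-2: determine and split keywords
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter():                   # step 3: traverse every node
        values = " ".join(node.attrib.values()).lower()
        if any(k in values for k in keywords):  # step 4: test the node's attributes
            results.append((node.tag, (node.text or "").strip()))
    return results                             # step 5: return the results

doc = (
    '<bib>'
    '<paper topic="XML search">Keyword search</paper>'
    '<paper topic="graphs">Graph mining</paper>'
    '</bib>'
)
print(keyword_search(doc, "xml"))
```

A full traversal like this costs time proportional to the document size on every query, which is the inefficiency that index-based methods such as inverted lists over Dewey codes are meant to avoid.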

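The algorithm is described as having several advantages. First, compared with other document query algorithms, it is said to be faster and to retrieve more information. Second, it can find nodes at any level of the XML document and retrieve related content from multiple nodes. Finally, it supports XPath queries, so it can return the nodes that satisfy an XPath condition.

In short, a keyword-based XML document query algorithm is an effective way to find specific content in an XML document: it lets users locate the relevant content quickly, with good query efficiency and accuracy.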

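Thesis Proposal: Research and Implementation of an XML Keyword Retrieval Algorithm

I. Background and significance

With the rapid development of the Internet, Web services have come into wide use, and those built on XML (eXtensible Markup Language) are especially important. XML is a markup language for describing data. Its flexibility, extensibility, and readability have made it one of the most popular data exchange formats on the Internet.

An XML document, however, holds a large amount of information, and retrieving the information that matches a user's need quickly and accurately is the key problem of XML document retrieval. Much research on XML document retrieval already exists, and keyword-based retrieval is the mainstream approach.

This work therefore studies XML keyword retrieval algorithms and implements one as a practical retrieval system, so that users can find the information they need quickly and accurately.

II. Research content

1. Analyze the state of research on XML document retrieval, including progress in China and abroad and the open problems.
2. Index the nodes of XML documents to improve retrieval efficiency.
3. Design and implement a keyword-based XML document retrieval algorithm, optimized for queries with multiple keywords.
4. Design and implement a practical XML document retrieval system in which users search through a software interface and the results are displayed.
5. Test retrieval efficiency and precision, and optimize slow queries and highly concurrent requests to improve system performance and reliability.

III. Methods and implementation steps

1. Survey the literature and systematically analyze the state of research on XML document retrieval and its open problems.
2. Design and implement an indexing algorithm that indexes the nodes of XML documents.
3. Design and implement a keyword retrieval algorithm for keyword-based XML document retrieval.
4. Design and implement the XML document retrieval system, including the user interface, back-end processing, and data storage.
5. Carry out performance testing and exception handling, and optimize the system.

IV. Expected results and significance

This work will design and implement a keyword-based XML document retrieval system that addresses the bottlenecks of XML document retrieval and offers users a reliable, fast, and accurate retrieval service.

V. Schedule

1. Preliminary research (2 months), including the literature review and requirements analysis.
2. System design and implementation (5 months), including the design and implementation of the indexing algorithm, the keyword retrieval algorithm, and the XML document retrieval system.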

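Efficient XML Keyword Query Refinement with Meaningful Results Generation
(高效的XML关键字查询改写和结果生成技术)

Huang Jing, Lu Jiaheng, and Meng Xiaofeng
(School of Information, Renmin University of China, Beijing 100872) (huangjingruc@)

Received:
Foundation items: National Natural Science Foundation of China (60833005, 60573091); National 863 Program (2007AA01Z155, 2009AA011904, 2009AA01Z133); Doctoral Program Foundation of the Ministry of Education (200800020002); Key Project of the Ministry of Education (109004)
Corresponding author: Lu Jiaheng (jiahenglu@)
Proceedings of the 26th China Database Academic Conference: 1-7, 2009.10

Abstract: Keyword search provides users with a friendly way to query XML data, but a user's keyword query may often be an imperfect description of their intention. Even when the information need is well described, a search engine may not be able to return the results matching the query as stated. The task of refining the user's original query to achieve better result quality is first defined as the problem of keyword query refinement in XML keyword search, and guidelines are designed to decide whether query refinement is necessary. Four refinement operations are defined, namely term deletion, merging, split, and substitution. Since there may be more than one query refinement candidate, the paper proposes the definition of refinement cost, which is used as a measure of semantic distance between the original query and the refined query, together with a dynamic programming solution to compute refinement cost. In order to find the best refined queries and generate their associated results within a one-time scan of the node lists, a stack-based algorithm is proposed, followed by a generalized partition-based optimization, which improves the efficiency considerably. Finally, extensive experiments show the efficiency and effectiveness of the query refinement approach.

Keywords: XML; Keyword Search; Query Refinement; Query Rewriting; Query Suggestion; SLCA

Abstract (translated from the Chinese): When users issue keyword queries, they may not express their intent accurately, and even when they do, the query engine may not return accurate results. To address this, the paper studies how to refine queries effectively in XML keyword search and generate meaningful results. It proposes four query refinement operations and the notion of refinement cost, and gives a dynamic programming method to compute that cost. To find the optimal refinement, it gives a stack-based algorithm for query refinement and result generation, together with a partition-based optimization. Extensive experiments validate the proposed methods.

CLC number: TP391

0 Introduction

Keyword query gives users a friendly and convenient way to search, and how to use keyword queries to obtain the needed information from XML data has recently become a hot research topic [1-5]. That work mainly studies how to filter out irrelevant results to improve precision. This paper addresses another aspect: when a query returns no results or too few, how to rewrite the original query so that the new query achieves good recall. This situation is common in keyword query. Users may not express their intent accurately, and the input may contain misspellings or irrelevant terms, so some keywords match no node in the document and no result is returned.

Fig. 1 Example XML document with Dewey codes

Even when the user expresses the intent accurately, the search engine may fail to return the desired results. For example, for the XML document in Fig. 1, a user enters the query "paper, XML" to find papers about XML. The document contains three answers, inproceedings (0.0.1.0), inproceedings (0.1.1.0), and article (0.1.1.2), but because the keyword "paper" matches no node in the document, no result is returned. According to the survey in [6], roughly 10%-15% of keyword queries contain errors, and 40%-51% of queries must be modified before they return the needed information. We define the problem of rewriting the user's original query to obtain better results as query refinement.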

FastMatch: An Efficient XML Keyword Query Algorithm
(FastMatch:一种高效的XML关键字查询算法)

Application Research of Computers (计算机应用研究), Vol. 29 No. 6, June 2012

… more than once, so it is inefficient in practice. To solve this problem, this paper proposed a method that used fast group to reduce the number of scans of the inverted lists, and then proposed an algorithm named FastMatch based on the method. This algorithm found all the results meeting certain conditions by scanning all nodes in the inverted lists only once. The experimental results verify the high performance of this method.

Key words: XML; keyword search; efficient; fast group; FastMatch

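Efficient XML Keyword Query Based on Synonym Rules
(基于同义词规则的高效XML关键字查询)

Zhang Linlin, Lu Jiaheng
Key Laboratory of Data Engineering and Knowledge Engineering (Ministry of Education), Renmin University of China, Beijing 100872

Abstract (translated from the Chinese): Keyword search is a concise and friendly way to query; it does not require users to understand a complex XML query language or the structure of the XML document. Existing methods, however, are often limited to the keywords the user typed and cannot effectively capture the semantics of the user's query. This paper proposes TM-IL, an efficient query algorithm based on the SLCA query semantics. It uses rules for synonyms, near-synonyms, abbreviations, and the like to rewrite the user's query and thereby improve query quality. The method does not depend on a particular query semantics, so it can be applied broadly to other query techniques.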
Keywords: computer software; XML keyword search; synonym rules; SLCA
CLC number: TP311.5

Effective Keyword Search with Synonym Rules over XML Document
Zhang Linlin, Lu Jiaheng
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872

Abstract: Keyword search is a friendly way for users to find the information they are interested in from XML documents without having to learn a complex query language or needing prior knowledge of the structure of the underlying data. However, existing methods are usually limited to the input keywords. In this paper, we introduce the notions of synonyms, acronyms, abbreviations, and so on to capture user query intentions. We propose an SLCA-based keyword search with synonym rules over XML documents that is orthogonal to various XML keyword search techniques. In addition, we use it to give an effective and efficient SLCA-based keyword search.

Key words: Computer Software, XML Keyword Search, Synonym Rule, SLCA

Foundation item: supported by the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20090004120002).
Author biographies: Zhang Linlin (1987-), male, master; research interests: big data management and cloud computing, XML data management. Corresponding author: Lu Jiaheng (1975-), male, professor; research interests: big data management and cloud computing, XML data management, string similarity measurement.

0 Introduction

It is becoming increasingly popular to publish data on the Web in the form of XML documents. Keyword search is well suited to XML trees. It allows users to find the information they are interested in without having to learn a complex query language or needing prior knowledge of the structure of the underlying data [1–6]. XML keyword search enforces a conjunctive search semantics (i.e., all the keywords should be covered in each query result), such as LCA (Lowest Common Ancestor) [5] and its variants [1, 2, 4]. Among those proposals, SLCA (Smallest LCA) is widely adopted [2], where each SLCA result contains all query keywords but has no descendant whose subtree also contains all keywords. Unfortunately, all of them assume
each keyword in the query is intended as part of it. However, a user query may often be an imperfect description of the real information need. Even when the information need is well described, a search engine may not be able to return the results matching the user's query intention, as illustrated by the following example.

Example 1. Consider a query Q = "Jennie paper" issued on the bibliographic document in Figure 1, which is modeled using the conventional labeled tree model. The query most likely intends to find all papers written by Jennie. According to the SLCA semantics defined in [1], it will return the most specific relevant answers: the subtrees rooted at nodes 0.1.2 and 0.3.1. However, the terms inproceedings and article are synonyms of paper. Therefore, we should also take nodes 0.1.1, 0.1.3, and 0.3.2 into consideration to predict the user's intention.

Figure 1: bib.xml (each node is shown with its Dewey number)

bib 0
  administrator 0.0
    Jenny 0.0.0
  conference 0.1
    name 0.1.0
      … 0.1.0.0
    inproceedings 0.1.1
      title 0.1.1.0
        data base 0.1.1.0.0
      author 0.1.1.1
        Jenny 0.1.1.1.0
    paper 0.1.2
      title 0.1.2.0
        datum base 0.1.2.0.0
      author 0.1.2.1
        Jennie 0.1.2.1.0
      abstract 0.1.2.2
        base 0.1.2.2.0
    article 0.1.3
      title 0.1.3.0
        database 0.1.3.0.0
      author 0.1.3.1
        Jennie 0.1.3.1.0
  workshop 0.2
    chair 0.2.0
      … 0.2.0.0
  conference 0.3
    name 0.3.0
      database 0.3.0.0
    paper 0.3.1
      title 0.3.1.0
        data base 0.3.1.0.0
      author 0.3.1.1
        Jennie 0.3.1.1.0
    inproceedings 0.3.2
      title 0.3.2.0
        base 0.3.2.0.0
      author 0.3.2.1
        Jennie 0.3.2.1.0
      abstract 0.3.2.2
        data 0.3.2.2.0

As Example 1 illustrates, we should take advantage of the equivalence between strings to improve the quality of keyword search over XML documents. There are many cases where strings that are syntactically far apart still represent the same real-world object. This happens in a variety of situations such as synonyms, acronyms, and abbreviations. For instance, "bike" is a synonym of "bicycle", "ICDE" is an acronym of "International Conference on Data Engineering", and "DB" is an abbreviation of "database". We just use "synonym" as a placeholder for any
kinds of equivalence expressions. In this paper, we assume there exists a collection of predefined synonym rules. Such rules can be obtained from users' existing dictionaries, from explicit input, by data and text mining [7], or by query log analysis [8]. For ease of presentation, we describe the rules without instantiating any particular type. The following example shows how to use these rules to rewrite input queries and thereby capture the user's potential purpose.

Example 2. Consider the query in Example 1. The answer should be [0.1.2, 0.3.1] according to the SLCA semantics. Suppose we are given a set of synonym rules of the form paper → inproceedings and paper → article. Formally, the first rule means that an occurrence of paper can be replaced by inproceedings. Therefore, the query can be expanded to three equivalent strings: {"Jennie paper", "Jennie inproceedings", "Jennie article"}. Thus, the answers should be [0.1.2, 0.1.3, 0.3.1, 0.3.2].

In this paper, we study how to design an effective framework with synonym rules that not only supports XML keyword search but also improves the quality of existing methods. We take SLCA as the underlying keyword search semantics, but our framework is orthogonal to other XML keyword search methods. While synonym rules are expressive and powerful, they also introduce many new challenges. Our major contributions towards XML keyword search with synonym rules are summarized as follows:

• We introduce an effective SLCA-based keyword search over XML documents with synonym rules to capture the user's potential query intentions.
• We give a formal definition of expansion rules. We also explain how to rewrite a query based on given synonym rule sets and analyze all the possible transformations in depth.
• We discuss theoretically how to use this framework to support SLCA-based XML keyword search and give some optimization techniques. We also design a Transformation Matching based IL Algorithm (TM-IL algorithm), which efficiently and effectively returns the SLCAs of the input keywords.

The rest of
the paper is organized as follows. In Section 2, we formally define synonym rules and show how to use these rules to rewrite users' queries. Section 3 and Section 4 present how to enhance the quality of SLCA-based XML keyword search, and show that our expansion-based framework is orthogonal to the choice of the underlying XML keyword search technique, which is not limited to SLCA or its variations but also covers related ranking strategies. In Section 5, we give a brief conclusion and outline future work.

1 Query Rewriting

This section presents how we obtain and use rules to expand queries. A user's input query is a string, which can be modeled as an ordered sequence of grams, where each gram is a (smaller) string. We can easily convert a string into an ordered gram sequence by splitting it on delimiters (white space) or with a regular expression. For example, the string "International Conf on Data Eng" can be converted to the ordered sequence of grams <International, Conf, on, Data, Eng>.

1.1 Synonym Rules

A synonym rule r is a pair of strings of the form lhs → rhs, which means that the two strings are equivalent with a bias. Each of lhs and rhs is a string and cannot be empty.

Example 3. Some examples of synonym rules:
• William → Bill
• bike → bicycle
• VLDB → Very Large Data Bases

Here we use "→" to express the relevance between lhs and rhs: lhs can be replaced by rhs. The inverse is not true, however, because abbreviations and acronyms may lose information compared with the original strings.

There are multiple ways to obtain synonym sets. First, we can extract synonyms from existing data in various domains such as biology¹, postal address services (e.g., USPS²), and academic publications (e.g., computer science³ and medicine⁴). Second, synonyms can be obtained by applying data mining methods to datasets to get equivalence classes that represent the same real-world objects. This strategy is domain independent, but requires more human intervention to get more accurate
results. Third, synonyms can be programmatically generated. For example, we can generate rules that connect the integer and textual representations of numbers, such as 36th and Thirty-Sixth.

Footnotes: ¹ http://www.expasy.ch/sprot ² United States Postal Service, . ³ /CFP.aspx ⁴ /

1.2 Query Rewriting

We now describe how to rewrite queries given a set of synonym rules R. The following example illustrates the procedure of expanding one string into another according to synonym rules.

Example 4. Consider the query Q = "k1 k2 k3" and the synonym rule set R = {r1: "k3" → "k4", r2: "k2 k3" → "k2 k5", r3: "k1 k2" → "k6 k7"}. This means that "k3", "k2 k3", and "k1 k2" can be replaced by "k4", "k2 k5", and "k6 k7" respectively. Consequently, we get the extended strings as follows:

• "k1 k2 k3" —r1→ "k1 k2 k4"
• "k1 k2 k4" —r3→ "k6 k7 k4"
• "k1 k2 k3" —r2→ "k1 k2 k5"
• "k1 k2 k3" —r3→ "k6 k7 k3"

Finally, Q is converted to {"k1 k2 k3", "k1 k2 k4", "k6 k7 k4", "k1 k2 k5", "k6 k7 k3"}, where each string results from the matching transformations listed above.

Example 4 shows that "k1 k2 k5" cannot be transformed to "k6 k7 k5", although "k1 k2" can be replaced by "k6 k7" (r3). Here we only allow 1-level transformation, which means that a gram generated by a substitution is prohibited from participating in a subsequent transformation. Without a restriction on the transformation level, the string transformation might never stop. Even when we restrict the transformation level to 1, it is still NP-complete to determine whether one string can be transformed into another by applying rules.

2 Keyword Search with Synonym Rules

In this section, we review existing related proposals on keyword search and SLCA. Then we describe our methods based on these notations.

2.1 Notation

We model an XML document as a tree using the conventional labeled ordered tree model. Each node $n$ in the tree corresponds to an XML element and is labeled with a tag denoted by $\lambda(n)$. Given two nodes $u$ and $v$, $u \prec v$ denotes that $u$ is an ancestor of $v$, and $u \preceq v$ denotes that $u \prec v$ or $u = v$. We assign to each node $v$ a numerical id $pre(v)$ that is compatible
with preorder numbering, in the sense that if a node $v_1$ precedes a node $v_2$ in the preorder left-to-right depth-first traversal of the tree, then $pre(v_1) < pre(v_2)$. The usual $<$ relationship is also compatible with Dewey numbers [1]. For example, 0.1.0.0.0 < 0.1.1.1.

We begin by formally introducing the concepts of Lowest Common Ancestor (LCA) and Smallest Lowest Common Ancestor (SLCA).

Definition 1 (LCA). Given $m$ nodes $n_1, n_2, \ldots, n_m$, $u$ is called the LCA of these nodes iff $u$ is an ancestor of each node $n_i$ for $1 \le i \le m$ and there is no node $v$ with $u \prec v$ that is also an ancestor of each node $n_i$. This is denoted $u = lca(n_1, n_2, \ldots, n_m)$. Given sets of nodes $S_1, S_2, \ldots, S_m$, the LCA of the sets is the set of LCAs over every combination of nodes drawn from $S_1$ through $S_m$:

$$lca(S_1, S_2, \ldots, S_m) = \{\, u \mid u = lca(n_1, n_2, \ldots, n_m),\ n_1 \in S_1, n_2 \in S_2, \ldots, n_m \in S_m \,\}$$

Definition 2 (SLCA). Given sets of nodes $S_1, S_2, \ldots, S_m$, $u$ is called an SLCA of $S_1, S_2, \ldots, S_m$ iff $u \in lca(S_1, S_2, \ldots, S_m)$ and $\forall v \in lca(S_1, S_2, \ldots, S_m)$, $u \nprec v$. This is denoted:

$$slca(S_1, S_2, \ldots, S_m) = \{\, u \mid u \in lca(S_1, S_2, \ldots, S_m) \wedge \forall v \in lca(S_1, S_2, \ldots, S_m),\ u \nprec v \,\}$$

Given a set of nodes $S$, the basic idea of SLCA is to return the closest ancestor nodes that have no descendant that is also an ancestor of each node in $S$.
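Definitions 1 and 2 can be checked directly on Dewey numbers with a brute-force sketch. It assumes Dewey numbers are parsed into integer tuples, so that tuple comparison matches the Dewey order and the LCA of nodes is the longest common prefix of their numbers. The helper names and the sample sets are illustrative, and the enumeration over all combinations is exponential in the number of sets, so this is a reference for the semantics rather than an efficient algorithm such as IL.

```python
from itertools import product

def parse(dewey):
    """'0.1.3' -> (0, 1, 3); tuple order then matches Dewey order."""
    return tuple(int(part) for part in dewey.split("."))

def lca(*nodes):
    """LCA of Dewey tuples: their longest common prefix."""
    prefix = []
    for parts in zip(*nodes):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def is_ancestor(u, v):
    """u is a proper ancestor of v iff u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u

def lca_sets(*sets):
    """Definition 1 for sets: LCA of every combination of nodes."""
    return {lca(*combo) for combo in product(*sets)}

def remove_ancestors(nodes):
    """Keep only nodes that are not an ancestor of another node in the set."""
    return {u for u in nodes if not any(is_ancestor(u, v) for v in nodes)}

def slca(*sets):
    """Definition 2: the LCAs that have no descendant LCA."""
    return remove_ancestors(lca_sets(*sets))

s_a = {parse("0.1.3.1.0"), parse("0.3.1.1.0")}
s_b = {parse("0.1.3.0.0"), parse("0.3.0.0")}
print(sorted(slca(s_a, s_b)))
```

Here the LCA set is {0, 0.1.3, 0.3}; the root 0 is dropped because it is an ancestor of the other two, leaving 0.1.3 and 0.3 as the SLCAs.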
Given two nodes $v_1$ and $v_2$ with Dewey numbers $dw_1$ and $dw_2$, $lca(v_1, v_2)$ is the node whose Dewey number is the longest common prefix of $dw_1$ and $dw_2$. For example, the LCA of nodes xx and xx is the node xx in Figure 1. It is easy to see that $slca(S_1, S_2, \ldots, S_m) = RemoveAncestor(lca(S_1, S_2, \ldots, S_m))$.

Given a query Q that contains a list of $m$ keywords $k_1, k_2, \ldots, k_m$ on an input XML document D (for ease of presentation, we do not distinguish between query($k_1, k_2, \ldots, k_m$) and query("$k_1 k_2 \ldots k_m$")), the answers of $slca(k_1, k_2, \ldots, k_m)$ are the result nodes of $slca(S_1, S_2, \ldots, S_m)$, where $S_i$ denotes the sorted keyword list of $k_i$ for $1 \le i \le m$, i.e., the list of nodes whose label directly contains $k_i$, sorted by id.

2.2 SLCA-Based Keyword Search with Synonym Rules

Given a query $Q = \{k_1, k_2, \ldots, k_m\}$, the Indexed Lookup Eager Algorithm (IL) [1] first gets the sorted inverted list $S_i$ for each $k_i$, and then orders the inverted lists by size so that $S_1$ is the smallest list. The IL algorithm computes their SLCA as $slca(slca(S_1, \ldots, S_{m-1}), S_m)$. When computing any $slca(S_i, S_j)$ for $1 \le i < j \le m$ and $i = j - 1$, the algorithm efficiently removes the ancestor nodes according to the following two lemmas.

Lemma 1. For any two nodes $v_i$, $v_j$ and a set $S$, if $pre(v_i) < pre(v_j)$ and $pre(slca(v_i, S)) > pre(slca(v_j, S))$, then $slca(v_j, S) \prec slca(v_i, S)$. (An ancestor always has the smaller preorder id, so the node with the smaller id is the ancestor here.)

Lemma 2. Given any two nodes $v_i$, $v_j$ and a set $S$ such that $pre(v_i) < pre(v_j)$ and $pre(slca(v_i, S)) < pre(slca(v_j, S))$, if $slca(v_i, S)$ is not an ancestor of $slca(v_j, S)$, then for any $v$ such that $pre(v) > pre(v_j)$, $slca(v, S) \nprec slca(v_j, S)$.

Given a set of synonym rules R and a query Q, we denote the SLCA as $slca(Q, R)$. With synonym rules, we should first get all the possible strings that can be produced by applying the rules. Then we compute the query result for each generated string, and obtain the final answer by removing the ancestors among these query results. This is formally defined by the following expression, where $transform(Q, R)$ denotes all the strings
generated by applying rules.

$$slca(Q, R) = removeAncestor\big(slca(Q_1) \cup \cdots \cup slca(Q_k)\big), \quad Q_i \in transform(Q, R),\ 1 \le i \le k$$

Figure 2: Possible transformations, panels (a) to (d)

Next, we give an in-depth analysis of each possible transformation. We first give the sorted inverted lists for the keywords used later, as shown in Table 1.

Table 1: Part of the sorted inverted keyword lists for Figure 1

Keyword     Sorted inverted list
Jennie      S1 = [0.1.2.1.0, 0.1.3.1.0, 0.2.0.0, 0.3.1.1.0]
database    S2 = [0.1.0.0, 0.1.3.0.0, 0.3.0.0]
data        S3 = [0.1.1.0.0, 0.3.1.0.0, 0.3.2.1.0]
base        S4 = [0.1.1.0.0, 0.1.2.0.0, 0.1.2.2.0, 0.3.1.0.0, 0.3.2.0.0]
datum       S5 = [0.1.2.0.0]
Jenny       S6 = [0.0.0, 0.1.1.1.0]

Condition 1, Split: Consider the query Q = "database Jennie" on Figure 1 and the synonym rule set R = {"database" → "data base"}. After transformation, the query is expanded to Q1 = "database Jennie" and Q2 = "data base Jennie". Therefore, we should first compute slca(Q1) and slca(Q2) and then remove the ancestor nodes to get the final result. The keyword lists for "data" (S3), "base" (S4), "database" (S2), and "Jennie" (S1) are listed in Table 1.
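The set transform(Q, R) under the 1-level restriction can be sketched as follows. The sketch assumes that grams are whitespace-separated words and that a rule is a pair (lhs, rhs) of non-empty strings; the function name `transform` follows the notation above, while its signature is an assumption. At each position the original gram is either kept or a matching lhs is replaced by its rhs, and replaced grams are never rescanned, which is what keeps the expansion to one level.

```python
def transform(query, rules):
    """All strings reachable from `query` by 1-level rule application.

    rules: iterable of (lhs, rhs) string pairs. Substituted grams are
    copied to the output and never matched again.
    """
    tokens = query.split()
    parsed = [(lhs.split(), rhs.split()) for lhs, rhs in rules if lhs.split()]
    results = set()

    def expand(i, out):
        if i == len(tokens):
            results.add(" ".join(out))
            return
        # Option 1: keep the original gram at position i.
        expand(i + 1, out + [tokens[i]])
        # Option 2: apply any rule whose lhs matches starting at position i.
        for lhs, rhs in parsed:
            if tokens[i:i + len(lhs)] == lhs:
                expand(i + len(lhs), out + rhs)

    expand(0, [])
    return results

rules = [("k3", "k4"), ("k2 k3", "k2 k5"), ("k1 k2", "k6 k7")]
print(sorted(transform("k1 k2 k3", rules)))
```

Run on the rule set of Example 4, this yields the same five strings as the example, and "k6 k7 k5" is absent because r2 and r3 overlap on "k2". The number of generated strings can grow exponentially with the number of applicable rules, which is the cost the optimizations in the paper aim to contain.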
For slca(Q2), we should first compute slca(S3, S4) and then compute its SLCA with S1. The procedure can be expressed as slca(Q, R) = ra(slca(S2, S1), slca(S3, S4, S1)), where ra() denotes the remove-ancestor operation. Since "database" and "data base" represent the same object, the expression can be rewritten as slca(Q, R) = slca(ra(slca(S3, S4), S2), S1), which can be modeled as a tree as shown in Figure 2(a). The optimization used here is to execute the ra operation as early as possible, because slca is much more time-consuming than ra: nodes pruned early by ra have no chance to participate in the next slca operation. Then we get the final result by slca(Q, R) = slca(ra([0.1.1.0.0, 0.3.1.0.0, 0.3.2], [0.1.0.0, 0.1.3.0.0, 0.3.0.0]), S1) = slca([0.1.1.0.0, 0.3.1.0.0, 0.3.2, 0.1.0.0, 0.1.3.0.0, 0.3.0.0], [0.1.2.1.0, 0.1.3.1.0, 0.2.0.0.0, 0.3.1.1.0, 0.3.2.1.0]) = [0.1.3, 0.3.1, 0.3.2].

Condition 2, Merge: Consider the query Q = "data base Jennie" on Figure 1 and the synonym rule set R = {"data base" → "database"}. After transformation we get the new strings Q1 = "data base Jennie" and Q2 = "database Jennie". The procedure is very similar to Condition 1: slca(Q, R) = RemoveAncestor(slca(Q1), slca(Q2)) = [0.1.3, 0.3.1, 0.3.2]. The corresponding tree model is shown in Figure 2(b).

Condition 3, Mix: Consider the query Q = "data base Jennie" on Figure 1 and the synonym rule set R = {"data base" → "database", "data" → "datum"}. The newly generated queries are Q1 = "data base Jennie", Q2 = "database Jennie", and Q3 = "datum base Jennie". According to the tree model shown in Figure 2(c), the procedure is slca(Q, R) = RemoveAncestor(slca(Q1), slca(Q2), slca(Q3)). Thus the result is slca(Q, R) = [0.1.2, 0.1.3, 0.3.1, 0.3.2].

Condition 4, Overlap: Consider the query Q = "data base Jennie" on Figure 1 and the synonym rule set R = {"data base" → "database", "base Jennie" → "Jenny"}. The new queries are Q1 = "data base Jennie", Q2 = "database Jennie", and Q3 = "data Jenny". The result is [0.1.1, 0.1.3, 0.3.1, 0.3.2] (Figure 2(d)).

We have listed all the possible conditions we might meet when
introducing the synonym rule semantics. Split means that one substring is expanded into more strings, by size; Merge refers to the inverse process. Mix and Overlap refer to the cases where Split and Merge apply to the same substring. For example, "data base" can be expanded to "data base", "database", and "datum base" according to R = {"data base" → "database", "data" → "datum"}. However, the overhead of computing the SLCA based on synonym rules is very high: the cost is exponential in the number of useful synonym rules.

3 Optimizations

In this section, we give some optimization techniques to speed up the computation of SLCAs when synonym rules are introduced.

Given a string s and a set of synonym rules R, the goal is to find all possible matching transformations. The observation is that if lhs is a substring of s, then lhs is a prefix of some suffix of s. We can construct a trie over all the distinct lhs in R. After that, for a given string, we use each of its suffixes to look up the trie; that is, we traverse the input string from rear to head at the word level. For each suffix, we then scan from head to rear to check whether any prefix of this suffix has a match.
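The trie lookup described above can be sketched as follows. It is a simplified illustration: the trie is a nested dictionary keyed by words, the sentinel `_RHS` and the function names are hypothetical, and the suffixes are enumerated from head to rear rather than from rear to head, which finds the same set of matches.

```python
_RHS = object()  # sentinel key holding the right-hand sides stored at a trie node

def build_trie(rules):
    """Word-level trie over the lhs of every (lhs, rhs) rule."""
    root = {}
    for lhs, rhs in rules:
        node = root
        for word in lhs.split():
            node = node.setdefault(word, {})
        node.setdefault(_RHS, []).append(rhs)
    return root

def find_matches(tokens, trie):
    """Return (start, end, rhs) for every lhs matching tokens[start:end].

    Each suffix tokens[start:] is walked down the trie, so every lhs
    that is a prefix of some suffix is reported.
    """
    matches = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            for rhs in node.get(_RHS, []):
                matches.append((start, end + 1, rhs))
    return matches

rules = [("data base", "database"), ("base Jennie", "Jenny")]
trie = build_trie(rules)
print(find_matches("data base Jennie".split(), trie))
```

With the rules of Condition 4, the lookup reports that "data base" matches at positions 0 to 2 and "base Jennie" at positions 1 to 3, which is the overlap that condition describes.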
The details are straightforward and we defer them to the full version of the paper.

Traditional strategies first get the sorted inverted list of each keyword that occurs in the XML document and then maintain a B+ structure to speed up lookups, as shown in Figure 3(a). In contrast, in addition to the B+ structure, we also store the SLCA of the right-hand side of each rule: given a rule "lhs" → "rhs", we pre-compute slca("rhs") and store the result in the trie, which we call the synonym trie, as shown in Figure 3(b). Note that we only store the SLCA of strings with a non-empty result. The benefit is that when we get a transformation, we also get the SLCA of "rhs".

Figure 3: Index structure. (a) B+ tree built from the data of Figure 1; (b) synonym trie

3.1 Transformation Matching based IL Algorithm

Next we explain how to use the synonym trie to speed up the query procedure. We combine the transformation matching operation with the IL algorithm, and call the result the Transformation Matching based Indexed Lookup Eager Algorithm (TM-IL). The algorithm is based on transformation matching operations. As shown in Algorithm 1, ♯6-10 is the IL algorithm, while ♯11-16 handles the matching transformations. When there is a match, we query the SLCA result of the matching substring (♯11, TMIL) and then pass the query result to the next iteration (♯15, TMIL). However, if the query result is null, which may happen because there is i) no match, ii) no keyword in the XML document for the query, or iii) a null SLCA for the matching keywords, the algorithm goes into the next loop. Finally, if we reach the head of input
string s, the algorithm calls computeSlca(canList) to compute and return the SLCA of the candidate inverted lists (♯20, TMIL).

For example, consider again the query used in Section 2.2, Condition 4, applied to the data of Figure 1. "base Jennie" is detected first (♯7, TMIL); then we get slca("Jenny") = S6 as shown in Table 1, which is added to the candidate list canList and passed to the next iteration (♯10, TMIL). In the next iteration, s = "data", canList = {S6}, and rList = empty; the algorithm gets the inverted list S3 for "data" at ♯7, computes slca(S3, S6) at ♯8, and then passes the result to the next iteration (♯9). The result for "data Jenny" is added to the result list rList at ♯20 for the next iteration. After all the return operations, the final results are combined (♯4, TMILCall).

Algorithm 1 TM-IL Algorithm

Procedure TMILCall(s)
1: result = empty; // receive final results
2: rList = empty; // store intermediate keyword lists
3: canList = empty; // store candidate keyword lists
4: return result.addAll(TMIL(s, canList, rList)); // combine the results and remove ancestors

Procedure TMIL(s, canList, rList)
1: if s != null then
2:   for (i = |s|−1; i ≥ 0; i−−) do
3:     s1 = s.substring(i, |s|);
4:     for (j = 1; j ≤ |s1|; j++) do
5:       s2 = s1.substring(0, j);
6:       if (j == 1) then
7:         queryResult = queryBPTree(s2);
8:         canList = slca(queryResult, canList)
9:         rList.addAll(TMIL(s.substring(0, i), canList, rList))
10:      end if
11:      queryResult = querySynonymTrie(s2); // query slca(s2) from the synonym trie
12:      if (queryResult != null) then
13:        declare canList' and add queryResult to this list.
14:        canList'.add(IL(s1.substring(j, |s1|))); // IL denotes the Indexed Lookup Eager Algorithm
15:        rList.addAll(TMIL(s.substring(0, i), canList', rList))
16:      end if
17:    end for
18:  end for
19: else
20:   return computeSlca(canList);
21: end if

Procedure computeSlca(canList)
1: for each list in canList, compute their slca by IL();

4 Conclusion and Future Work

In this paper we propose an SLCA-based keyword search with synonym rules over XML documents. We formally define the semantics of synonym rules and analyze the possible matching transformations. In addition, we
also apply this method to compute SLCAs over XML documents effectively and efficiently. To the best of our knowledge, this is the first work that takes synonym rules into account for XML keyword search. Our theoretical analysis shows that our method is orthogonal to other XML keyword search techniques. There are several avenues for future work. For instance, we can use synonyms to improve the quality of ranking-based XML keyword search, such as XSearch [4] and XRank [5].

References

[1] Yu Xu and Yannis Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD Conference, pages 537–538, 2005.
[2] Yu Xu and Yannis Papakonstantinou. Efficient LCA based keyword search in XML data. In CIKM, pages 1007–1010, 2007.
[3] Zhifeng Bao, Tok Wang Ling, Bo Chen, and Jiaheng Lu. Effective XML keyword search with relevance oriented ranking. In ICDE, pages 517–528, 2009.
[4] Sara Cohen, Jonathan Mamou, Yaron Kanza, and Yehoshua Sagiv. XSearch: A semantic search engine for XML. In VLDB, pages 45–56, 2003.
[5] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRank: Ranked keyword search over XML documents. In SIGMOD Conference, pages 16–27, 2003.
[6] Guoliang Li, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. Effective keyword search for valuable LCAs over XML documents. In CIKM, pages 31–40, 2007.
[7] Yoshimasa Tsuruoka, John McNaught, Jun'ichi Tsujii, and Sophia Ananiadou. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20):2768–2774, 2007.
[8] G. Craig Murray and Jaime Teevan. Query log analysis: social and technological challenges. SIGIR Forum, 41(2):112–120, 2007.


Computer Engineering (计算机工程), Vol. 36 No. 23, December 2010
Section: Software Technology and Database
Article ID: 1000-3428(2010)23-…    Document code: A    CLC number: TP311

High-efficient XML Keyword Retrieval Algorithm Based on LCA
(基于LCA的高效XML关键字检索算法)

HAN Meng, CHEN Qun, WANG Peng
(School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an 710129, China)

[Abstract] This paper gives and analyzes some properties of Exclusive Lowest Common Ancestors (ELCA) based on the semantics of ELCA. On the basis of these properties, it introduces an XML keyword retrieval algorithm, the Bottom-up Hierarchical Filtering Algorithm (BHFA). There are two instances following the BHFA idea, called BHFA I and BHFA II respectively. The algorithm looks for the Lowest Common Ancestors (LCA) at each level of the hierarchy and obtains the results through layer-by-layer selection according to the properties of ELCA. Experimental results show that BHFA outperforms the existing mainstream algorithms in most cases.

[Key words] XML retrieval algorithm; keyword retrieval; Lowest Common Ancestors (LCA)

Abstract (translated from the Chinese): Based on the semantics of ELCA, this paper analyzes a number of ELCA properties and explains why ELCA result-finding algorithms have high complexity. On this basis it proposes the BHFA algorithm, with two implementations, BHFA I and BHFA II. The algorithm computes the LCAs distributed across the levels of the tree and, following the properties of ELCA, filters them bottom-up and from left to right to obtain the results. Experiments show that its query performance is better than that of existing algorithms in the great majority of cases.

1 Overview

A notable feature of keyword retrieval is that it offers a simple keyword query interface, so users can obtain the information they want without being familiar with the XML document structure or a complex query language. At present, XML keyword retrieval based on the LCA (Lowest Common Ancestors) concept has been studied extensively. Suppose the set of keywords entered by the user is Q = {w1, w2, …, wk}. Under the ELCA (Exclusive Lowest Common Ancestors) semantics, … the set of nodes that still contain every keyword at least once after excluding all keywords contained in their descendant nodes …. For example, in Fig. 1, the result of the query Q = {"XML", "Chen"} is {0.1, …}. The root node bib is not an ELCA result, because once the conference node, which contains all keywords, is excluded, bib no longer contains all keywords. By the definition of the ELCA semantics, 0.1 is an ELCA because, apart from the paper node that contains all keywords, the conference node still …