On-Line Construction of Compact Directed Acyclic Word Graphs
Mining Maximal Dense Subgraphs in Uncertain PPI Networks

Mining maximal dense subgraphs in uncertain PPI networks
L U Ja c i HANG e q n,ME I i— a ,S Xu — u NG ,WANG a Ya Mio
Abstract: Several studies have shown that the prediction of protein functions using PPI data is promising. However, the PPI data generated from experiments are noisy, incomplete and inaccurate, which promotes representing PPI datasets as uncertain PPI graphs. This paper proposed a novel algorithm to mine maximal dense subgraphs efficiently in uncertain PPI networks. It adopted a depth…
…a depth-first-search strategy and a vertex-expansion mining algorithm, which can effectively mine maximal dense subgraphs from uncertain PPI networks. The algorithm uses several efficient pruning techniques to improve the time efficiency of mining. Experimental results on yeast PPI data show that, in both accuracy and efficiency, the algorithm…
A Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph

Chinese Compound Word Extraction Algorithm Based on Word Co-occurrence Directed Graph
(2. School of Computer Science, Wuyi University, Jiangmen 529020, China)
[Abstract] Word segmentation systems do not include compound words in their dictionaries, so they cannot recognize compound words. To…
…way to extract domain-specific phrases from them, which, after manual correction, yield the ontology concept…
An analysis was made of 1,600 political-economy papers from the text collection provided by the NLP group of the Shanghai (International) Database Research Center; the statistics largely agree with the research conclusions above. Therefore, a part-of-speech filter table is built: when extracting word strings, if an atomic word's part of speech is in the filter table, that atomic word is filtered out.
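The part-of-speech filtering step just described can be sketched as follows; the tag names in the filter table are hypothetical placeholders, not the paper's actual list:

```python
# Hypothetical sketch of the POS-filtering step: when extracting candidate
# word strings, atomic words whose POS tag appears in a filter table
# (e.g. particles, prepositions, pronouns) are dropped. Tag names are
# illustrative, not taken from the original paper.
FILTER_POS = {"u", "p", "r"}  # assumed tags: particles, prepositions, pronouns

def filter_atoms(tagged_words):
    """Keep only atomic words whose POS is not in the filter table."""
    return [word for word, pos in tagged_words if pos not in FILTER_POS]

tagged = [("经济", "n"), ("的", "u"), ("发展", "v")]
print(filter_atoms(tagged))  # → ['经济', '发展']
```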
Words in Chinese text can be divided into two classes: atomic words and compound words. Atomic words are short words used in the language to combine into other new words; they generally do not follow the principle of compositional meaning, are relatively stable, and rarely produce new words. A compound word is composed of multiple atomic words, follows the principle of compositional meaning, and expresses a complete concept. Since most words in a text are compound words, and compound word recognition is widely…

Reference [2], drawing on human cognitive-psychological patterns, proposes a combination-word extraction algorithm based on a word-sequence frequency directed network, in order to recognize combination words in free text.
A Knowledge Schematization Method Based on Link and Semantic Relationship

计算机研究与发展 (Journal of Computer Research and Development), 54(8): 1655-1664, 2017. DOI: 10.7544/issn1000-1239.2017.20170177

一种基于链接和语义关联的知识图示化方法 (A Knowledge Schematization Method Based on Link and Semantic Relationship)

Yang Lin 1, Zhang Libo 1,2, Luo Tiejian 1, Wan Qiyang 1, and Wu Yanjun 2
1 (University of Chinese Academy of Sciences, Beijing 101408)
2 (Institute of Software, Chinese Academy of Sciences, Beijing 100190)
(icode@)

Abstract: How to present knowledge in a more acceptable form has been a difficult problem. In most traditional conceptualization methods, educators always summarize and describe knowledge directly. Some education experiences have demonstrated that schematization, which depicts knowledge by its adjacent knowledge units, is more comprehensible to learners. In conventional knowledge representation methods, knowledge schematization must be artificially completed. In this paper, a possible approach is proposed to finish knowledge schematization automatically. We explore the relationship between the given concept and its adjacent concepts on the basis of the Wikipedia concept topology (WCT) and then present an innovative algorithm to select the most related concepts. In addition, the state-of-the-art neural embedding model Word2Vec is utilized to measure the semantic correlation between concepts, aiming to further enhance the effectiveness of knowledge schematization. Experimental results show that the use of Word2Vec is able to improve the effectiveness of selecting the most correlated concepts.
Moreover, our approach is able to effectively and efficiently extract knowledge structure from WCT and provide available suggestions for students and researchers.

Key words: knowledge schematization; concept topology; word embedding; knowledge representation; Wikipedia

摘要 (Abstract in Chinese, translated): Organizing massive knowledge into a form humans can absorb more easily has long been a hard problem in data analysis. Most traditional analysis methods summarize and describe knowledge itself directly (conceptualization), while educational practice has shown that depicting a knowledge point through its adjacent knowledge units (schematization) makes it easier for humans to absorb. In the current classical computer knowledge-representation methods, knowledge schematization is completed mainly by manual work. This paper proposes a method for completing knowledge schematization automatically with a computer: relying on the Wikipedia concept topology, it explores the relation between a concept and its adjacent concepts and proposes a link-based algorithm for automatically selecting the most related concepts; the recent neural network model Word2Vec is used to quantify semantic similarity between concepts, further improving the related-concept algorithm and the schematization effect. Experimental results show that the link-based related-concept algorithm achieves good accuracy and that the Word2Vec model effectively improves the ranking of related concepts. The proposed method can accurately and effectively analyze knowledge structure, organize the threads of knowledge, and provide practical and effective suggestions for researchers and learners.

关键词 (Key words): knowledge schematization; concept topology; word embedding; knowledge representation; Wikipedia. CLC number: TP305.

Received 2017-03-20; revised 2017-05-12. This work was supported by the Foundation of Chinese Academy of Sciences for System Optimization (Y42901VED2, Y42901VEB1, Y42901VEB2). Corresponding author: Zhang Libo (zsmj@hotmail.com).

Today, as the human knowledge base grows ever larger and its classification ever finer and more specialized, no individual can fully master all human knowledge. People naturally turned to computers to process, store and exploit massive knowledge. On the one hand, people hope to build machines that fully exploit the human knowledge base in their service, which gave rise to the field of artificial intelligence; on the other hand, people try to use computers to organize and mine huge knowledge data in order to raise the level of human education. Either way, one important question cannot be avoided: how should a computer represent human knowledge? Knowledge representation in artificial intelligence leans toward describing knowledge structure, for example logic representation, frame representation and ontology representation [1]; such work aims at creating data structures a computer can accept for describing knowledge. Computer representations of big knowledge data intended for educating humans themselves are more direct: their core purpose is to organize knowledge into a form humans accept more easily, and they explore how a knowledge system should be presented.

Much practice has shown that representing a given knowledge point through its related knowledge units is more acceptable to humans (for example, the flourishing of the mind map); we call this the schematization of knowledge [2]. In traditional education, a human educator cannot take the whole knowledge system into account and can only focus on depicting an individual knowledge point, which we call conceptualization [3]; even when links between related knowledge points are used in teaching, these links are mostly decided by experienced human educators and lack universality and objectivity. The advantage of the computer lies in its ability to process massive data, which can compensate for a human educator's limited command of knowledge. This paper studies in depth how to use a computer to schematize knowledge points reasonably and objectively.

The main contributions of this paper are threefold: 1) by building the Wikipedia concept topology (WCT) and analyzing its topological structure, we propose a link-based concept-relatedness algorithm; 2) we compute semantic similarity between concepts with the distributed semantic vector methods of natural language processing, combine it with the link-based relatedness algorithm, and finally obtain a method that actively analyzes knowledge points and represents them schematically; 3) we propose the idea of layering and classifying concepts by the network structure of knowledge, which is significant for computer semantic analysis.

1 Related Work

In the overall structural framework of human knowledge, concepts are the most important basic knowledge units, and building and quantifying the relatedness of concepts is a key means of knowledge schematization. Concept relatedness is computed mainly with algorithms from recommender systems and natural language processing. The mainstream recommendation algorithms today are content-based recommendation [4], collaborative filtering [5] and rule-based recommendation [6]; existing recommendation algorithms are mostly applied to social networks and rarely to knowledge networks. This paper attempts to extend the similarity algorithms of recommender systems to concept-relatedness computation.

Previous concept-relatedness algorithms for Wikipedia include concept-vector-based, path-based, probability-based and link-based methods. Classic vector-based methods include the explicit semantic analysis of Gabrilovich and Markovitch [7], which judges the relatedness of two articles by comparing the weighted vectors formed by their more important concepts, and the concept vector proposed by Shimkawa et al. [8], which judges relatedness by comparing the vectors formed by the categories the concepts are projected to. Strube and Ponzetto proposed path-based measures [9], which project concepts to their categories and compare the shortest paths among the categories. Probability-based methods judge relatedness from the probability distribution of simulated human clicks (i.e., the importance of concepts), for example the random-walk algorithm of Yeh et al. [10] and an improved random-walk algorithm of Dallmann et al. [11]. Link-based methods measure distance through the links of the concept network, e.g., the normalized link distance and link vector similarity of Milne and Witten [12].

Although there are many algorithms for computing concept relatedness [11-15], most of them suit concepts at medium or long distance, i.e., clearly different concepts, while the concepts we need to compare mostly belong to the same category. We therefore propose a new link-based relatedness algorithm to judge concept relatedness, described in detail in Section 2.4.

In addition, when considering concept relatedness one naturally also considers relatedness between concept semantics. Research on semantic association in natural language processing shows that words can be represented by distributed semantic vectors trained by neural networks, i.e., word embeddings [16]. A word embedding is a semantic vector representation of a term in a low-dimensional space and can be used to measure semantic association between concepts. This paper uses the word-embedding model Word2Vec [16] to quantify semantic similarity between concepts, so as to improve the accuracy and coverage of the purely link-based method for computing related concepts.

2 Knowledge Schematization Model

2.1 Dataset and preprocessing

Internet encyclopedias are the largest and most complete freely consultable human knowledge bases today. Among them, Wikipedia, the largest multilingual Internet encyclopedia jointly edited across many countries, has an accuracy close to that of the professionally edited Encyclopaedia Britannica [17], but develops faster because editing is open. As of July 25, 2016, English Wikipedia had in total 5,201,640 articles and 39,827,021 pages, covering most fields of human knowledge. With entries as its minimal knowledge units and link jumps expressing the relations between pieces of knowledge, it has a tight structure and universal connectivity, highly consistent with human thinking, so this paper takes Wikipedia as the basic dataset. We use the offline XML dump released by Wikipedia on June 1, 2016 as raw data; the decompressed XML file is 53.4 GB and contains all page texts and hyperlink data of Wikipedia before June 2016.

Human knowledge consists of knowledge points and the relations between them; the concept is the smallest unit of a knowledge point, and in the Wikipedia dataset a concept is expressed by a title [18]. In a knowledge system, the depiction of each concept has two parts: 1) the information the concept itself contains; 2) its relations to other, different concepts. In Wikipedia these correspond to an entry's own content description and its link associations with other entries. To analyze the relations between concepts, we therefore need not only to build a network topology over the Wikipedia concepts but also to analyze the text of each entry semantically. We process the XML data in five steps:

1) extract the text of every page (the title and its description) and all links of the page;
2) delete non-content pages (category pages, help pages, etc.) and pages with empty targets, and merge every redirect page with the page it points to;
3) capitalize the first letter of every title (unifying the naming convention) and match titles with their texts, obtaining the Wikipedia content corpus;
4) to build the topology, make every one-way link bidirectional, so that the Wikipedia titles and links can be viewed respectively as the vertex and edge sets of a graph;
5) compute the distribution of connected components of the graph with breadth-first search and remove the irrelevant components, obtaining the Wikipedia topology graph.

The distribution of connected components obtained in step 5 is shown in Table 1; the largest connected component contains more than 99.9% of the titles of Wikipedia, showing that the concept and knowledge network of Wikipedia has the property of universal connectivity. The other connected components mostly contain unimportant content such as special characters, so we ignore them. We take the 12,269,222 titles of the largest connected component, together with the links between them, as our concept topology graph; the corresponding concepts and hyperlinks form the dataset of our study.

We finally obtained two datasets: the content data of Wikipedia (CDW) and the Wikipedia concept topology (WCT). The former contains 12,269,222 titles and the content corresponding to these titles. The latter contains the directed graph G_d(V, E) and the corresponding undirected graph G_u(V, E), where V is the set of vertices corresponding to all concepts and E the set of directed or undirected edges corresponding to all links.

Table 1. Distribution of connected components: the largest connected component contains 12,269,222 nodes; all other components are of negligible size.

2.2 Concept relatedness

Starting from the way humans usually think: when absorbing an unknown concept, people usually use concepts related to it to aid understanding. Note that relatedness here includes not only "closeness" in the simple sense but also relatedness along other dimensions. Take word memorization, a form of knowledge acquisition, as an example: rote memorization is very inefficient, and research in education has found that memorizing related words along various dimensions, including word roots, synonyms and antonyms, and words of the same domain, can greatly improve learning; that is, people remember a series of related words much more deeply (as shown in Fig. 1). Building a knowledge graph according to concept relatedness therefore fits human thinking better and is more conducive to efficient knowledge understanding.

To study the relatedness between concepts, we must complete two tasks: 1) select the concepts possibly associated with the given concept, forming the related-concept set V_r; 2) rank the related-concept set by some relatedness algorithm and select the most related concepts for building the knowledge graph. Combining the network character and universal connectivity of the whole Wikipedia topology, this paper uses link-distance-based methods to analyze the distance between other concepts and the given concept and so determine the related-concept set V_r.

2.3 Related-concept set

Every concept's Wikipedia page contains many links pointing to other concepts, and each concept is also linked from the pages of other concepts; the concepts with direct page links are clearly the most likely to be the most related ones and are the focus of our analysis. However, the structure of Wikipedia means that we cannot directly exclude the relatedness of other concepts, because the number of hyperlinks on one page is limited and clearly cannot display all the concepts related to a given concept. For example, "Earth" and "Galaxy" are intuitively two highly related concepts, yet in the Wikipedia concept topology they are not directly connected but are linked indirectly through "Solar System", as shown in Fig. 2 (direct and indirect links in Wikipedia). Therefore, when building the related-concept set we must include the concepts connected to the given concept directly or indirectly (adjacent concepts of different levels), keep the distance information between them (the adjacency level) while building the set through link distance, and use it as an important parameter for computing the relatedness between concepts.

Suppose the given concept is c. We take the undirected Wikipedia concept topology as the object of study. If the vertices corresponding to concepts a and b are joined by an edge in the undirected graph, we say that a and b are adjacent. The set of concepts directly adjacent to c is defined as the 1-level adjacency set V_1(c); the concepts adjacent to concepts of V_1(c) but not belonging to V_1(c) form the 2-level adjacency set V_2(c); and so on, defining the n-level adjacency set V_n(c).
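The level-n adjacency sets just defined can be computed by a breadth-first expansion; a minimal sketch, where the dict-of-sets graph representation is an assumption for illustration:

```python
def adjacency_levels(adj, c, max_level=2):
    """Compute the 1-level, ..., max_level-level adjacency sets V_n(c) of a
    concept c in an undirected graph {vertex: set(neighbours)}: V_1(c) are the
    direct neighbours, and V_{i+1}(c) the neighbours of V_i(c) not seen in any
    earlier level."""
    levels = {0: {c}}
    seen = {c}
    frontier = {c}
    for n in range(1, max_level + 1):
        nxt = set()
        for v in frontier:
            nxt |= adj[v] - seen   # neighbours not assigned to a lower level
        seen |= nxt
        levels[n] = nxt
        frontier = nxt
    return levels

# Toy graph: Earth - Solar System - Galaxy (the indirect link of Fig. 2).
g = {"Earth": {"Solar System"},
     "Solar System": {"Earth", "Galaxy"},
     "Galaxy": {"Solar System"}}
lv = adjacency_levels(g, "Earth")
print(lv[1], lv[2])  # → {'Solar System'} {'Galaxy'}
```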
The adjacency level of a concept in V_n(c) is denoted a_c(v). By the definitions above and the universal connectivity of the topology G_u, any concept v different from c necessarily belongs to some adjacency set V_i(c) and has a unique adjacency level:

V_1(c) = {v ∈ V | (c, v) ∈ E},  (1)
V_{i+1}(c) = {v ∈ V | ∃ v_i ∈ V_i(c), (v_i, v) ∈ E and ∀ j ≤ i, v ∉ V_j(c)}, i ≥ 1.  (2)

Table 2. Size of adjacency sets (AS):
Concept | 1st AS | 2nd AS | 3rd AS
Patch | 43 | 519 | >1 000 000
Ubuntu | 1 161 | 63 180 | >1 000 000
Security | 894 | 174 958 | >1 000 000
Dictionary | 942 | 131 101 | >1 000 000
Lemma | 93 | 3 340 | >100 000

We therefore need to limit the adjacency level of the given concept's adjacency sets in order to determine the size of the related-concept set. We randomly selected 500 concepts and studied the distribution of their adjacency sets at each level (partial results in Table 2). It can be seen that as the adjacency level grows, the size of the adjacency sets increases extremely fast; this is determined by the universal connectivity of Wikipedia and also fits the wide-ranging connections between concepts. The 3-level adjacency sets of most concepts exceed 10^6 in size, far beyond our needs, while the 1-level adjacency sets of some concepts are too small to cover enough related concepts for building the network. Therefore this paper takes the adjacency concepts of the given concept c up to level 2 as the related-concept set V_r, i.e.:

V_r = V_1(c) ∪ V_2(c).  (3)

2.4 Link-based concept-relatedness ranking algorithm

Having obtained the adjacency sets of the given concept as its related-concept set through the network topology, we must rank the concepts in the related-concept set V_r by relatedness. Because of the complexity of the concept topology, it is hard for a single method to consider both the characteristics of the concept itself and the wholeness of the network, so in this section we adopt ranking methods emphasizing different characteristics of concepts and then fit the results of the different methods together to obtain the link-based concept-relatedness ranking.

2.4.1 Similarity ranking based on adjacent concepts

A concept can to some extent be described by the concepts near it; these concepts can be viewed as features of the given concept, and different concepts generally have different surrounding concepts. Imitating the similarity judgment of recommender systems, we can compare the 1-level adjacency set V_1(v) of a concept v with the 1-level adjacency set V_1(c) of the given concept c to compute their relatedness. As shown in Fig. 3 (similarity computation on 1-level adjacency sets), white nodes are the elements of V_1(v), grey nodes the elements of V_1(c), and black nodes the common elements of V_1(v) and V_1(c); the concepts v and c have a certain relatedness (because the concepts describing them overlap, and the size of the relatedness can be computed from the share of the overlap). We introduce the Jaccard similarity coefficient J_c(v) [16] to measure the relatedness of v and c:

J_c(v) = |V_1(v) ∩ V_1(c)| / |V_1(v) ∪ V_1(c)| = |V_1(v) ∩ V_1(c)| / (|V_1(v)| + |V_1(c)| − |V_1(v) ∩ V_1(c)|).  (4)

The larger J_c(v), i.e., the more the features of v and c coincide, the stronger the relatedness of v and c. For a given concept c, we rank every concept in its related-concept set V_r by relatedness, obtaining a most-related-concept ranking, as in Table 3 (for the concept "Eigenvalues and Eigenvectors").

To verify the accuracy of the similarity ranking, we use another Wikipedia-based dataset, Clickstream, as the test set. This dataset records, month by month, the number of times users visited the pages of different Wikipedia concepts through different channels (search engines, within-site jumps, or direct visits). As of July 25, 2016, Clickstream provided five releases for English Wikipedia in total (201501, 201502, 201602, 201603 and 201604). To make full use of the dataset and eliminate the influence of hot words on visits, we combined the five releases, kept only the within-site link-jump data, and processed them into an easily accessed and queried format; we omit the details. From the Clickstream dataset we can obtain the total number of bidirectional visits between two concepts, which serves as the basis for ranking each concept of the related-concept set V_r by its relatedness to the given concept, as in Table 4 (again for the concept "Eigenvalues and Eigenvectors").

Table 3. Ranking of "Eigenvalues and Eigenvectors" by Jaccard similarity coefficient (JSC):
Rank | Concept | JSC score
1 | Matrix | 105
2 | Mathematics | 101
3 | Complex Number | 82
4 | Vector Space | 81
5 | Real Number | 70

Table 4. Ranking of "Eigenvalues and Eigenvectors" by Clickstream:
Rank | Concept | Click count
1 | Determinant | 3 936
2 | Eigendecomposition of a Matrix | 3 401
3 | Linear Map | 3 157
4 | Matrix | 2 380
5 | Eigenvalue Algorithm | 2 279

2.4.2 Similarity ranking based on bidirectional link distance

Comparing Tables 3 and 4, we see that the Jaccard similarity coefficient considers the characteristics of the concept itself and so obtains a fairly reasonable ranking, but since it uses only the 1-level adjacency sets, the results still lack breadth. To improve the ranking and make full use of the Wikipedia network topology, this paper proposes another relatedness algorithm, the normalized bidirectional link distance (NBLD).

Cilibrasi and Vitanyi proposed the normalized Google distance for judging the relatedness of two words with Google [20]; subsequently Milne and Witten proposed the normalized link distance (NLD) and applied it to Wikipedia [12]. The NLD algorithm computes the relatedness of two concepts from the number of links pointing into their pages. NLD considers only in-links, yet in fact both the in-linked and out-linked concepts of a given concept's page are related to it. The ranking obtained by using NLD directly covers only about 17.03% of the concepts in the related-concept set V_r. This paper therefore proposes a normalized bidirectional link distance algorithm:

N_bld,c(v) = [log(max(|V_1(c)|, |V_1(v)|)) − log(|V_1(c) ∩ V_1(v)|)] / [log(|W|) − log(min(|V_1(c)|, |V_1(v)|))],  (5)

where W is the set of all concepts under study, and V_1(c) and V_1(v) are the 1-level adjacency sets of c and v respectively. The NBLD algorithm raises the coverage of related concepts by more than 40%; taking the concept "Eigenvalues and Eigenvectors" as an example, see Table 5.

Table 5. NLD and NBLD rankings of "Eigenvalues and Eigenvectors":
(a) NLD: 1 Diagonal Matrix 0.2202; 2 Diagonalizable Matrix 0.2242; 3 Hermitian Matrix 0.2404; 4 Invertible Matrix 0.2410; 5 Spectral Theorem 0.2426.
(b) NBLD: 1 Eigendecomposition of a Matrix 0.1941; 2 Diagonal Matrix 0.1976; 3 Diagonalizable Matrix 0.1990; 4 Spectral Theorem 0.2008; 5 Singular Value Decomposition 0.2050.

2.4.3 Fitting the link-based similarity algorithms

Comparing Tables 4 and 5, the NBLD algorithm has higher related-concept coverage than NLD, but it emphasizes the wholeness of the concept network too much and neglects the characteristics of the concept itself, so its results are still not ideal. We therefore fit the Jaccard similarity coefficient and the NBLD algorithm together so that each compensates for the other's weakness, obtaining the link-based similarity ranking algorithm; at the same time we introduce an adjacency-level decay coefficient, so that the distance between an adjacent concept v and the given concept c has a weighted influence on the final result. For each concept v in the adjacency set V_r and the given concept c, this paper proposes the relatedness ranking algorithm

Corr_link,c(v) = γ^{a_c(v)} · X_c(v),  (6)

where γ is the adjacency-level decay coefficient with range (0, 1] and X_c(v) denotes the fitted combination of J_c(v) and N_bld,c(v): γ = 1 means that 1-level and 2-level adjacent concepts carry equal weight, and the closer γ is to 0, the more importance is placed on 1-level adjacent concepts. By training on the whole WCT dataset we can determine the optimal value of γ.

2.5 Semantics-based improvement of the relatedness ranking

Section 2.4 proposed a link-based relatedness ranking algorithm that automatically generates the related-concept ranking of a specified concept c from the structure of the Wikipedia network topology. Besides the network structure between Wikipedia concepts, analyzing the semantic relatedness between concepts can further improve the accuracy of the concept-relatedness ranking.

Research in natural language processing (NLP) shows that words can be represented by distributed semantic vectors computed by neural networks [16]. The distributed semantic vector representation, i.e., word embedding, has been applied in many natural language processing tasks. The semantic association between two words can be measured by computing the cosine similarity of their distributed semantic vectors. An important property of word-embedding models is that the similarity representation of words is not limited to simple syntactic regularities; semantic information can be obtained through operations on semantic vectors. For example, applying simple vector addition and subtraction to the vector representation of the word "King", "King" − "Man" + "Woman", yields a vector extremely similar to the vector representation of the word "Queen". Moreover, in the semantic vector space, two words with similar context structures also have similar semantic vectors. In sum, word embeddings can measure the semantic association between concepts conveniently and accurately, and this semantic association is diverse and not limited to semantic similarity between concepts.

This paper proposes a semantics-based similarity algorithm (word-embedding based, WEB), which uses a word-embedding model to quantify the semantic similarity between concepts and optimize the relatedness ranking results. The main challenge in using word embeddings for knowledge-graph representation is how to generate a concept's semantic vector representation from the semantic vector representations of words.

We train the semantic vector representations of concepts with the continuous bag-of-words model (CBOW) of the Word2Vec [16] architecture. The architecture of the CBOW model is shown in Fig. 4: the neural network of the model consists of three layers, the input layer, the hidden layer and the output layer. The basic idea is to predict the word w(t) from its context, context(w) = {w(t−n), ..., w(t−1), w(t+1), ..., w(t+n)}, where the context consists of the n words before and after; the number of context words is called the window size. The likelihood function of the model is

arg max_θ Σ_{(c(w), w) ∈ D} log p(w | c(w); θ),  (7)

where w and c(w) denote the selected word and its context text respectively, (c(w), w) is one training sample, and D is the set of training samples. θ is the parameter set to be optimized, containing the distributed semantic vector of each word, and the training algorithm is stochastic gradient ascent. It should be mentioned that p(w | c(w); θ) is a softmax regression model; the CBOW model has two implementations of softmax regression, hierarchical softmax and negative sampling, and in this paper negative sampling is used. Following the relevant literature [16, 21], the context window size of the CBOW model is set to 10 and the dimension of the word vectors to 100. The distributed semantic vector of a concept is computed as

v_concept = Σ_{word ∈ concept} v_word.  (8)

We likewise compute the similarity of two concepts by cosine similarity:

Corr_SEM,c(v) = (v_c · v_v) / (‖v_c‖ ‖v_v‖),  (9)

where v_c and v_v are the semantic vector representations of concepts c and v respectively. Finally, linear weighting is used to optimize the link-distance-based concept-similarity algorithm of Section 2.4:

Corr_c(v) = α Corr_link,c(v) + (1 − α) Corr_SEM,c(v),  (10)

where α is the parameter determining the weights of the link-based and semantics-based similarities. The optimization effect of the semantic-association algorithm is shown in the chapter on experimental results and evaluation.

3 Experimental Analysis

3.1 Evaluation metric

We evaluate the algorithms proposed in this paper with the normalized discounted cumulative gain (NDCG) [22]. NDCG@K is widely used to evaluate ranking quality:

NDCG@K = (1/IDCG) Σ_{i=1}^{K} (2^{r_i} − 1) / log_2(i + 1),  (11)

where r_i = 1 if the concept at rank i agrees with the standard ranking and r_i = 0 otherwise, and IDCG is the manually annotated standard ranking.

3.2 Experimental setup

To better evaluate the proposed method, extensive comparison experiments were carried out, involving ten methods:
1) JSC: concept-relatedness ranking with the JSC algorithm alone;
2) NBLD: the normalized bidirectional link distance algorithm proposed in this paper;
3) JSC+NBLD: the link-based concept-relatedness ranking algorithm proposed in this paper;
4) JSC+WEB: the JSC algorithm optimized with the WEB algorithm;
5) NBLD+WEB: the NBLD algorithm optimized with the WEB algorithm;
6) the path-based algorithm proposed by Finkelstein et al. [13];
7) the ESA algorithm proposed by Gabrilovich et al. [23];
8) the distributed-network-based relatedness ranking algorithm proposed by Agirre et al. [24];
9) the hyperlink-based similarity algorithm proposed by Milne et al. [12];
10) our method: the link-based concept-relatedness ranking algorithm optimized with the WEB algorithm.

3.3 Parameter settings

The proposed method involves two important parameters in the experiments: the adjacency-level decay coefficient γ of the link-based concept-relatedness algorithm and the weighting coefficient α of the WEB optimization. After training, the decay coefficient is γ = 0.7 (parameter optimization step 0.1) and the weighting coefficient is α = 0.78 (parameter optimization step 0.01).

3.4 Evaluation of experimental results

Comparison experiments were run on the ten algorithms, including our method. The rankings of related concepts are evaluated by NDCG@10 and NDCG@50, with the results in Table 6.

Table 6. Performance evaluation of the different algorithms:
Algorithm | NDCG@10 | NDCG@50
JSC | 0.65 | 0.30
NBLD | 0.70 | 0.32
JSC+NBLD | 0.71 | 0.52
JSC+WEB | 0.69 | 0.41
NBLD+WEB | 0.74 | 0.48
Finkelstein [13] | 0.70 | 0.49
Gabrilovich [23] | 0.73 | 0.53
Agirre [24] | 0.67 | 0.52
Milne [12] | 0.67 | 0.34
Our algorithm | 0.79 | 0.57
Algebra: Chinese-English glossary

(0,2) 插值||(0,2) interpolation
0#||zero-sharp; read in Chinese as 零井 or 零开.
0+||zero-dagger; read in Chinese as 零正.
1-因子||1-factor
3-流形||3-manifold; also called "三维流形" (three-dimensional manifold).
AIC准则||AIC criterion, Akaike information criterion
Ap权||Ap-weight
A稳定性||A-stability, absolute stability
A最优设计||A-optimal design
BCH码||BCH code, Bose-Chaudhuri-Hocquenghem code
BIC准则||BIC criterion, Bayesian modification of the AIC
BMOA函数||analytic function of bounded mean oscillation; full name "有界平均振动解析函数".
BMO鞅||BMO martingale
BSD猜想||Birch and Swinnerton-Dyer conjecture; full name "伯奇与斯温纳顿-戴尔猜想".
B样条||B-spline
C*代数||C*-algebra; read as "C-star algebra".
C0类函数||function of class C0; also called "连续函数类" (class of continuous functions).
CAT准则||CAT criterion, criterion for autoregressive
CM域||CM field
CN群||CN-group
CW复形的同调||homology of CW complex
CW复形||CW complex
CW复形的同伦群||homotopy group of CW complexes
CW剖分||CW decomposition
Cn类函数||function of class Cn; also called "n次连续可微函数类" (class of n-times continuously differentiable functions).
Cp统计量||Cp-statistic

A phase transition for the diameter of the configuration model

Remco van der Hofstad∗, Gerard Hooghiemstra† and Dmitri Znamenski‡

August 31, 2007

Abstract

In this paper, we study the configuration model (CM) with i.i.d. degrees. We establish a phase transition for the diameter when the power-law exponent τ of the degrees satisfies τ ∈ (2,3). Indeed, we show that for τ > 2 and when vertices with degree 2 are present with positive probability, the diameter of the random graph is, with high probability, bounded from below by a constant times the logarithm of the size of the graph. On the other hand, assuming that all degrees are at least 3 or more, we show that, for τ ∈ (2,3), the diameter of the graph is, with high probability, bounded from above by a constant times the log log of the size of the graph.

1 Introduction

Random graph models for complex networks have received a tremendous amount of attention in the past decade. See [1, 22, 26] for reviews on complex networks and [2] for a more expository account. Measurements have shown that many real networks share two fundamental properties.
The first is the fact that typical distances between vertices are small, which is called the 'small world' phenomenon (see [27]). For example, in the Internet, IP-packets cannot use more than a threshold of physical links, and if the distances in terms of the physical links would be large, e-mail service would simply break down. Thus, the graph of the Internet has evolved in such a way that typical distances are relatively small, even though the Internet is rather large. The second and maybe more surprising property of many networks is that the number of vertices with degree k falls off as an inverse power of k. This is called a 'power law degree sequence', and resulting graphs often go under the name 'scale-free graphs' (see [15] for a discussion where power laws occur in the Internet).

The observation that many real networks have the above two properties has incited a burst of activity in network modelling using random graphs. These models can, roughly speaking, be divided into two distinct classes of models: 'static' models and 'dynamic' models. In static models, we model with a graph of a given size a snap-shot of a real network. A typical example of this kind of model is the configuration model (CM) which we describe below. A related static model, which can be seen as an inhomogeneous version of the Erdős-Rényi random graph, is treated in great generality in [4]. Typical examples of the 'dynamical' models are the so-called preferential attachment models (PAM's), where added vertices and edges are more likely to be attached to vertices that already have large degrees. PAM's often focus on the growth of the network as a way to explain the power law degree sequences.

∗Department of Mathematics and Computer Science, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands. E-mail: rhofstad@win.tue.nl
†Delft University of Technology, Electrical Engineering, Mathematics and Computer Science, P.O. Box 5031, 2600 GA Delft, The
Netherlands. E-mail: G.Hooghiemstra@ewi.tudelft.nl
‡EURANDOM, P.O. Box 513, 5600 MB Eindhoven, The Netherlands. E-mail: znamenski@eurandom.nl

Physicists have predicted that distances in PAM's behave similarly to distances in the CM with similar degrees. Distances in the CM have attracted considerable attention (see e.g. [14, 16, 17, 18]), but distances in PAM's far less (see [5, 19]), which makes it hard to verify this prediction. Together with [19], the current paper takes a first step towards a rigorous verification of this conjecture. At the end of this introduction we will return to this observation, but let us first introduce the CM and present our diameter results.

1.1 The configuration model

The CM is defined as follows. Fix an integer N. Consider an i.i.d. sequence of random variables D_1, D_2, ..., D_N. We will construct an undirected graph with N vertices where vertex j has degree D_j. We will assume that L_N = Σ_{j=1}^{N} D_j is even. If L_N is odd, then we will increase D_N by 1. This single change will make hardly any difference in what follows, and we will ignore this effect. We will later specify the distribution of D_1.

To construct the graph, we have N separate vertices and incident to vertex j, we have D_j stubs or half-edges. The stubs need to be paired to construct the graph. We number the stubs in a given order from 1 to L_N. We start by pairing at random the first stub with one of the L_N − 1 remaining stubs. Once paired, two stubs form a single edge of the graph. Hence, a stub can be seen as the left- or the right-half of an edge. We continue the procedure of randomly choosing and pairing the stubs until all stubs are connected. Unfortunately, vertices having self-loops, as well as multiple edges between vertices, may occur, so that the CM is a multigraph. However, self-loops are scarce when N → ∞, as shown e.g. in [7].

The above model is a variant of the configuration model [3], which, given a degree sequence, is the random graph with that given degree sequence. The degree sequence of a graph is the vector of which the k-th coordinate equals the fraction of vertices with degree k. In our model, by the law of large numbers, the degree sequence is close to the distribution of the nodal degree D of which D_1, ..., D_N are i.i.d. copies. The probability mass function and the distribution function of the nodal degree law are denoted by

P(D = k) = f_k, k = 1, 2, ...,   and   F(x) = Σ_{k=1}^{⌊x⌋} f_k,  (1.1)

where ⌊x⌋ is the largest integer smaller than or equal to x. We pay special attention to distributions of the form

1 − F(x) = x^{1−τ} L(x),  (1.2)

where τ > 2 and L is slowly varying at infinity. This means that the random variables D_j obey a power law, and the factor L is meant to generalize the model. We denote the expectation of D by µ, i.e.,

µ = Σ_{k=1}^{∞} k f_k.  (1.3)

1.2 The diameter in the configuration model

In this section we present the results on the bounds on the diameter. We use the abbreviation whp for a statement that occurs with probability tending to 1 if the number of vertices of the graph N tends to ∞.

Theorem 1.1 (Lower bound on diameter) For τ > 2, assuming that f_1 + f_2 > 0 and f_1 < 1, there exists a positive constant α such that whp the diameter of the configuration model is bounded below by α log N.

A more precise result on the diameter in the CM is presented in [16], where it is proved that under rather general assumptions on the degree sequence of the CM, the diameter of the CM divided by log N converges to a constant. This result is also valid for related models, such as the Erdős-Rényi random graph, but the proof is quite difficult. While Theorem 1.1 is substantially weaker, the fact that a positive constant times log N appears is most interesting, as we will discuss now in more detail. Indeed, the result in Theorem 1.1 is most interesting in the case when τ ∈ (2,3). By [18, Theorem 1.2], the typical distance for τ ∈ (2,3) is proportional to log log N, whereas we show here that the diameter is bounded below by a positive constant times log N when f_1 + f_2 > 0 and f_1 < 1. Therefore, we see that the average distance and the diameter are of a different order of magnitude.
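The stub-pairing construction of Section 1.1 can be sketched as follows; this is a minimal illustration of pairing stubs uniformly at random (self-loops and multi-edges are kept, and an odd total degree is fixed by incrementing the last degree, as in the text):

```python
import random

def configuration_model(degrees, rng=None):
    """Pair stubs uniformly at random, as in Section 1.1; returns the edge
    list of the resulting multigraph (self-loops and multiple edges may
    occur). If the total degree is odd, the last degree is increased by 1."""
    rng = rng or random.Random(0)
    degrees = list(degrees)
    if sum(degrees) % 2 == 1:
        degrees[-1] += 1
    # Vertex j contributes D_j stubs; a uniform shuffle followed by pairing
    # consecutive stubs is equivalent to pairing stubs uniformly at random.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

edges = configuration_model([3, 3, 2, 2])
print(len(edges))  # → 5 (L_N / 2 edges)
degree_check = [0, 0, 0, 0]
for u, v in edges:
    degree_check[u] += 1
    degree_check[v] += 1
print(degree_check)  # → [3, 3, 2, 2] (degrees are preserved)
```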
The pairs of vertices where the distance is of the order log N are thus scarce. The proof of Theorem 1.1 reveals that these pairs are along long lines of vertices with degree 2 that are connected to each other. Also in the proof of [16], one of the main difficulties is the identification of the precise length of these long thin lines.

Our second main result states that when τ ∈ (2,3), the above assumption that f_1 + f_2 > 0 is necessary and sufficient for log N lower bounds on the diameter. In Theorem 1.2 below, we assume that there exists a τ ∈ (2,3) such that, for some c > 0 and all x ≥ 1,

1 − F(x) ≥ cx^{1−τ},  (1.4)

which is slightly weaker than the assumption in (1.2). We further define for integer m ≥ 2 and a real number σ > 1,

C_F = C_F(σ, m) = 2/|log(τ−2)| + 2σ/log m.  (1.5)

Then our main upper bound on the diameter when (1.4) holds is as follows:

Theorem 1.2 (A log log upper bound on the diameter) Fix m ≥ 2, and assume that P(D ≥ m+1) = 1, and that (1.4) holds. Then, for every σ > (3−τ)^{−1}, the diameter of the configuration model is, whp, bounded above by C_F log log N.

1.3 Discussion and related work

Theorem 1.2 has a counterpart for preferential attachment models (PAM) proved in [19]. In these PAM's, at each integer time t, a new vertex with m ≥ 1 edges attached to it, is added to the graph.
The new edges added at time t are then preferentially connected to older edges, i.e., conditionally on the graph at time t−1, which is denoted by G(t−1), the probability that a given edge is connected to vertex i is proportional to d_i(t−1) + δ, where δ > −m is a fixed parameter and d_i(t−1) is the degree of vertex i at time t−1. A substantial literature exists, see e.g. [10], proving that the degree sequence of PAM's in rather great generality satisfies a power law (see e.g. the references in [11]). In the above setting of linear preferential attachment, the exponent τ is equal to [21, 11]

τ = 3 + δ/m.  (1.6)

A log log t upper bound on the diameter holds for PAM's with m ≥ 2 and −m < δ < 0, which, by (1.6), corresponds to τ ∈ (2,3) [19]:

Theorem 1.3 (A log log upper bound on the diameter of the PAM) Fix m ≥ 2 and δ ∈ (−m, 0). Then, for every σ > 1/(3−τ), and with

C_G(σ) = 4/|log(τ−2)| + 4σ/log m,

the diameter of the preferential attachment model is, with high probability, bounded above by C_G log log t, as t → ∞.

Observe that the condition m ≥ 2 in the PAM corresponds to the condition P(D ≥ m+1) = 1 in the CM, where one half-edge is used to attach the vertex, while in PAM's, vertices along a path have degree at least three when m ≥ 2. Also note from the definition of C_G and C_F that distances in PAM's tend to be twice as big compared to distances in the CM. This is related to the structure of the graphs. Indeed, in both graphs, vertices of high degree play a crucial role in shortest paths.
In the CM vertices of high degree are often directly connected to each other, while in the PAM, they tend to be connected through a later vertex which links to both vertices of high degree.

Unfortunately, there is no log t lower bound in the PAM for δ > 0 and m ≥ 2, or equivalently τ > 3. However, [19] does contain a (1−ε) log t / log log t lower bound for the diameter when m ≥ 1 and δ ≥ 0. When m = 1, results exist on log t asymptotics of the diameter, see e.g. [6, 24].

The results in Theorems 1.1–1.3 are consistent with the non-rigorous physics predictions that distances in the PAM and in the CM, for similar degree sequences, behave similarly. It is an interesting problem, for both the CM and PAM, to determine the exact constant C ≥ 0 such that the diameter of the graph of N vertices divided by log N converges in probability to C. For the CM, the results in [16] imply that C > 0; for the PAM, this is not known.

We now turn to related work. Many distance results for the CM are known. For τ ∈ (1,2) distances are bounded [14], for τ ∈ (2,3), they behave as log log N [25, 18, 9], whereas for τ > 3 the correct scaling is log N [17]. Observe that these results induce lower bounds for the diameter of the CM, since the diameter is the supremum of the distance, where the supremum is taken over all pairs of vertices. Similar results for models with conditionally independent edges exist, see e.g.
[4, 8, 13, 23]. Thus, for these classes of models, distances are quite well understood. The authors in [16] prove that the diameter of a sparse random graph, with specified degree sequence, has, whp, diameter equal to c log N (1+o(1)), for some constant c. Note that our Theorems 1.1–1.2 imply that c > 0 when f_1 + f_2 > 0, while c = 0 when f_1 + f_2 = 0 and (1.4) holds for some τ ∈ (2,3).

There are few results on distances or diameter in PAM's. In [5], it was proved that in the PAM and for δ = 0, for which τ = 3, the diameter of the resulting graph is equal to (log t / log log t)(1+o(1)). Unfortunately, the matching result for the CM has not been proved, so that this does not allow us to verify whether the models have similar distances.

This paper is organized as follows. In Section 2, we prove the lower bound on the diameter formulated in Theorem 1.1 and in Section 3 we prove the upper bound in Theorem 1.2.

2 A lower bound on the diameter: Proof of Theorem 1.1

We start by proving the claim when f_2 > 0. The idea behind the proof is simple. Under the conditions of the theorem, one can, whp, find a path Γ(N) in the random graph such that this path consists exclusively of vertices with degree 2 and has length at least 2α log N. This implies that the diameter is at least α log N, since the above path could be a cycle. Below we define a procedure which proves the existence of such a path.

Consider the process of pairing stubs in the graph. We are free to choose the order in which we pair the free stubs, since this order is irrelevant for the distribution of the random graph. Hence, we are allowed to start with pairing the stubs of the vertices of degree 2. Let N(2) be the number of vertices of degree 2 and S_N(2) = (i_1, ..., i_{N(2)}) ∈ N^{N(2)} the collection of these vertices. We will pair the stubs and at the same time define a permutation Π(N) = (i*_1, ..., i*_{N(2)}) of S_N(2), and a characteristic χ(N) = (χ_1, ..., χ_{N(2)}) on Π(N), where χ_j is either 0 or 1. Π(N) and χ(N) will be defined inductively in such a way that for any vertex i*_k ∈ Π(N), χ_k = 1, if and only if vertex i*_k is connected to vertex
i*_{k+1}. Hence, χ(N) contains a substring of at least 2α log N ones precisely when the random graph contains a path Γ(N) of length at least 2α log N.

We initialize our inductive definition by i*_1 = i_1. The vertex i*_1 has two stubs, we consider the second one and pair it to an arbitrary free stub. If this free stub belongs to another vertex j ≠ i*_1 in S_N(2) then we choose i*_2 = j and χ_1 = 1, otherwise we choose i*_2 = i_2, and χ_1 = 0. Suppose for some 1 < k ≤ N(2), the sequences (i*_1, ..., i*_k) and (χ_1, ..., χ_{k−1}) are defined. If χ_{k−1} = 1, then one stub of i*_k is paired to a stub of i*_{k−1}, and another stub of i*_k is free, else, if χ_{k−1} = 0, vertex i*_k has two free stubs. Thus, for every k ≥ 1, the vertex i*_k has at least one free stub. We pair this stub to an arbitrary remaining free stub. If this second stub belongs to vertex j ∈ S_N(2)\{i*_1, ..., i*_k}, then we choose i*_{k+1} = j and χ_k = 1, else we choose i*_{k+1} as the first stub in S_N(2)\{i*_1, ..., i*_k}, and χ_k = 0. Hence, we have defined that χ_k = 1 precisely when vertex i*_k is connected to vertex i*_{k+1}.

We show that whp there exists a substring of ones of length at least 2α log N in the first half of χ(N), i.e., in χ^{1/2}(N) = (χ_1, ..., χ_{⌊N(2)/2⌋}). For this purpose, we couple the sequence χ^{1/2}(N) with a sequence B^{1/2}(N) = {ξ_k}, where ξ_k are i.i.d. Bernoulli random variables taking value 1 with probability f_2/(4µ), and such that, whp, χ_k ≥ ξ_k for all k ∈ {1, ..., ⌊N(2)/2⌋}. We write P_N for the law of the CM conditionally on the degrees D_1, ..., D_N. Then, for any 1 ≤ k ≤ ⌊N(2)/2⌋, the P_N-probability that χ_k = 1 is at least

(2N(2) − C_N(k)) / (L_N − C_N(k)),  (2.1)

where, as before, N(2) is the total number of vertices with degree 2, and C_N(k) is one plus the total number of paired stubs after k−1 pairings. By definition of C_N(k), for any k ≤ N(2)/2, we have

C_N(k) = 2(k−1) + 1 ≤ N(2).  (2.2)

Due to the law of large numbers we also have that whp

N(2) ≥ f_2 N/2,   L_N ≤ 2µN.  (2.3)

Substitution of (2.2) and (2.3) into (2.1) then yields that the right side of (2.1) is at least N(2)/L_N ≥ f_2/(4µ). Thus, whp, we can stochastically dominate all coordinates of the random sequence χ^{1/2}(N) with an i.i.d. Bernoulli sequence B^{1/2}(N) of ⌊Nf_2/2⌋ independent
trials with success probability f_2/(4µ) > 0. It is well known (see e.g. [12]) that the probability of existence of a run of 2α log N ones converges to one whenever 2α log N ≤ ε log(Nf_2/2)/|log(f_2/(4µ))|, for some 0 < ε < 1. We conclude that whp the sequence B^{1/2}(N) contains a substring of 2α log N ones. Since whp χ^{1/2}(N) ≥ B^{1/2}(N), where the ordering is componentwise, whp the sequence χ(N) also contains the same substring of 2α log N ones, and hence there exists a required path consisting of at least 2α log N vertices with degree 2. Thus, whp the diameter is at least α log N, and we have proved Theorem 1.1 in the case that f_2 > 0.

We now complete the proof of Theorem 1.1 when f_2 = 0 by adapting the above argument. When f_2 = 0, and since f_1 + f_2 > 0, we must have that f_1 > 0. Let k* > 2 be the smallest integer such that f_{k*} > 0. This k* must exist, since f_1 < 1. Denote by N*(2) the total number of vertices of degree k* of which the first k*−2 stubs are connected to a vertex with degree 1. Thus, effectively, after the first k*−2 stubs have been connected to vertices with degree 1, we are left with a structure which has 2 free stubs. These vertices will replace the N(2) vertices used in the above proof. It is not hard to see that whp N*(2) ≥ f*_2 N/2 for some f*_2 > 0. Then, the argument for f_2 > 0 can be repeated, replacing N(2) by N*(2) and f_2 by f*_2. In more detail, for any 1 ≤ k ≤ ⌊N*(2)/(2k*)⌋, the P_N-probability that χ_k = 1 is at least

(2N*(2) − C*_N(k)) / (L_N − C*_N(k)),  (2.4)

where C*_N(k) is the total number of paired stubs after k−1 pairings of the free stubs incident to the N*(2) vertices. By definition of C*_N(k), for any k ≤ N*(2)/(2k*), we have

C*_N(k) = 2k*(k−1) + 1 ≤ N*(2).  (2.5)

Substitution of (2.5), N*(2) ≥ f*_2 N/2 and the bound on L_N in (2.3) into (2.4) gives us that the right side of (2.4) is at least N*(2)/L_N ≥ f*_2/(4µ). Now the proof of Theorem 1.1 in the case where f_2 = 0 and f_1 ∈ (0,1) can be completed as above. We omit further details.
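The run-existence fact used above, that n i.i.d. Bernoulli(p) trials whp contain a run of r consecutive ones when r is a small multiple of log n / |log p|, can be checked with an exact dynamic program over the length of the current trailing run (a sketch, not part of the paper's argument):

```python
def prob_run(n, r, p):
    """Probability that n i.i.d. Bernoulli(p) trials contain a run of r
    consecutive ones, by dynamic programming over the trailing-run length."""
    # state[k] = P(trailing run has length k and no run of r has appeared yet)
    state = [1.0] + [0.0] * (r - 1)
    hit = 0.0  # probability that a run of length r has already appeared
    for _ in range(n):
        new = [0.0] * r
        for k, mass in enumerate(state):
            new[0] += mass * (1 - p)      # a failure resets the run
            if k + 1 == r:
                hit += mass * p           # a run of length r is completed
            else:
                new[k + 1] += mass * p
        state = new
    return hit

# Three fair coin flips contain "11" in 3 of the 8 outcomes.
print(round(prob_run(3, 2, 0.5), 4))  # → 0.375
```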
3 A log log upper bound on the diameter for $\tau\in(2,3)$

In this section, we investigate the diameter of the CM when $P(D\geq m+1)=1$, for some integer $m\geq 2$. We assume (1.4) for some $\tau\in(2,3)$. We will show that under these assumptions $C_F\log\log N$ is an upper bound on the diameter of the CM, where $C_F$ is defined in (1.5).

The proof is divided into two key steps. In the first, in Proposition 3.1, we give a bound on the diameter of the core of the CM, consisting of all vertices with degree at least a certain power of $\log N$. This argument is very close in spirit to the one in [25], the only difference being that we have simplified the argument slightly. After this, in Proposition 3.4, we derive a bound on the distance between vertices with small degree and the core. We note that Proposition 3.1 only relies on the assumption in (1.4), while Proposition 3.4 only relies on the fact that $P(D\geq m+1)=1$, for some $m\geq 2$.

The proof of Proposition 3.1 can easily be adapted to a setting where the degrees are fixed, by formulating the appropriate assumptions on the number of vertices with degree at least $x$ for a sufficient range of $x$. This assumption would replace (1.5). Proposition 3.4 can easily be adapted to a setting where there are no vertices of degree smaller than or equal to $m$. This assumption would replace the assumption $P(D\geq m+1)=1$, for some $m\geq 2$. We refrain from stating these extensions of our results, and start by investigating the core of the CM.

We take $\sigma>\frac{1}{3-\tau}$ and define the core $\mathrm{Core}_N$ of the CM to be
$$\mathrm{Core}_N=\{i: D_i\geq(\log N)^{\sigma}\},\qquad(3.1)$$
i.e., the set of vertices with degree at least $(\log N)^{\sigma}$. Also, for a subset $A\subseteq\{1,\ldots,N\}$, we define the diameter of $A$ to be equal to the maximal shortest path distance between any pair of vertices of $A$. Note, in particular, that if there are pairs of vertices in $A$ that are not connected, then the diameter of $A$ is infinite. Then, the diameter of the core is bounded in the following proposition:

Proposition 3.1 (Diameter of the core). For every $\sigma>\frac{1}{3-\tau}$, the diameter of $\mathrm{Core}_N$ is, whp, bounded above by
$$\frac{2\log\log N}{|\log(\tau-2)|}(1+o(1)).\qquad(3.2)$$

Proof. We note that (1.4) implies that whp the largest degree $D_{(N)}=\max_{1\leq i\leq N}D_i$ satisfies $D_{(N)}\geq u_1$, where
$$u_1=N^{\frac{1}{\tau-1}}(\log N)^{-1},\qquad(3.3)$$
because, when $N\to\infty$,
$$P(D_{(N)}>u_1)=1-P(D_{(N)}\leq u_1)=1-(F(u_1))^N\geq 1-(1-cu_1^{1-\tau})^N=1-\Big(1-\frac{c(\log N)^{\tau-1}}{N}\Big)^N\sim 1-\exp(-c(\log N)^{\tau-1})\to 1.\qquad(3.4)$$
Define
$$\mathcal{N}(1)=\{i: D_i\geq u_1\},\qquad(3.5)$$
so that, whp, $\mathcal{N}(1)\neq\emptyset$. For some constant $C>0$, which will be specified later, and $k\geq 2$, we define recursively
$$u_k=C\log N\,(u_{k-1})^{\tau-2},\qquad\text{and}\qquad \mathcal{N}(k)=\{i: D_i\geq u_k\}.\qquad(3.6)$$
We start by identifying $u_k$:

Lemma 3.2 (Identification of $u_k$). For each $k\in\mathbb{N}$,
$$u_k=C^{a_k}(\log N)^{b_k}N^{c_k},\qquad(3.7)$$
with
$$c_k=\frac{(\tau-2)^{k-1}}{\tau-1},\qquad b_k=\frac{1}{3-\tau}-\frac{4-\tau}{3-\tau}(\tau-2)^{k-1},\qquad a_k=\frac{1-(\tau-2)^{k-1}}{3-\tau}.\qquad(3.8)$$

Proof. We will identify $a_k$, $b_k$ and $c_k$ recursively. We note that, by (3.3), $c_1=\frac{1}{\tau-1}$, $b_1=-1$, $a_1=0$. By (3.6), we can, for $k\geq 2$, relate $a_k,b_k,c_k$ to $a_{k-1},b_{k-1},c_{k-1}$ as follows:
$$c_k=(\tau-2)c_{k-1},\qquad b_k=1+(\tau-2)b_{k-1},\qquad a_k=1+(\tau-2)a_{k-1}.\qquad(3.9)$$
As a result, we obtain
$$c_k=(\tau-2)^{k-1}c_1=\frac{(\tau-2)^{k-1}}{\tau-1},\qquad(3.10)$$
$$b_k=b_1(\tau-2)^{k-1}+\sum_{i=0}^{k-2}(\tau-2)^i=\frac{1-(\tau-2)^{k-1}}{3-\tau}-(\tau-2)^{k-1},\qquad(3.11)$$
$$a_k=\frac{1-(\tau-2)^{k-1}}{3-\tau}.\qquad(3.12)$$

The key step in the proof of Proposition 3.1 is the following lemma:

Lemma 3.3 (Connectivity between $\mathcal{N}(k-1)$ and $\mathcal{N}(k)$). Fix $k\geq 2$, and $C>4\mu/c$ (see (1.3) and (1.4), respectively). Then, the probability that there exists an $i\in\mathcal{N}(k)$ that is not directly connected to $\mathcal{N}(k-1)$ is $o(N^{-\gamma})$, for some $\gamma>0$ independent of $k$.

Proof. We note that, by definition,
$$\sum_{i\in\mathcal{N}(k-1)}D_i\geq u_{k-1}|\mathcal{N}(k-1)|.\qquad(3.13)$$
Also,
$$|\mathcal{N}(k-1)|\sim\mathrm{Bin}\big(N,1-F(u_{k-1})\big),\qquad(3.14)$$
and we have that, by (1.4),
$$N[1-F(u_{k-1})]\geq cN(u_{k-1})^{1-\tau},\qquad(3.15)$$
which, by Lemma 3.2, grows as a positive power of $N$, since $c_k\leq c_2=\frac{\tau-2}{\tau-1}<\frac{1}{\tau-1}$. We use the concentration-of-probability result
$$P(|X-E[X]|>t)\leq 2e^{-\frac{t^2}{2(E[X]+t/3)}},\qquad(3.16)$$
which holds for binomial random variables [20], and which gives that the probability that $|\mathcal{N}(k-1)|$ fails to be bounded below by $N[1-F(u_{k-1})]/2$ is exponentially small in $N$. As a result, we obtain that for every $k$, whp,
$$\sum_{i\in\mathcal{N}(k)}D_i\geq\frac{c}{2}N(u_k)^{2-\tau}.\qquad(3.17)$$
We note (see e.g. [18, (4.34)]) that for any two sets of vertices $A$ and $B$, we have that
$$\mathbb{P}_N(A\text{ not directly connected to }B)\leq e^{-\frac{D_AD_B}{L_N}},\qquad(3.18)$$
where, for any $A\subseteq\{1,\ldots,N\}$, we write
$$D_A=\sum_{i\in A}D_i.\qquad(3.19)$$
On the event where $|\mathcal{N}(k-1)|\geq N[1-F(u_{k-1})]/2$ and where $L_N\leq 2\mu N$, we then obtain by (3.18) and Boole's inequality that the $\mathbb{P}_N$-probability that there exists an $i\in\mathcal{N}(k)$ such that $i$ is not directly connected to $\mathcal{N}(k-1)$ is bounded by
$$Ne^{-\frac{u_kNu_{k-1}[1-F(u_{k-1})]}{2L_N}}\leq Ne^{-\frac{cu_k(u_{k-1})^{2-\tau}}{4\mu}}=N^{1-\frac{cC}{4\mu}},\qquad(3.20)$$
where we have used (3.6). Taking $C>4\mu/c$ proves the claim.

We now complete the proof of Proposition 3.1. Fix
$$k^*=\Big\lceil\frac{2\log\log N}{|\log(\tau-2)|}\Big\rceil.\qquad(3.21)$$
As a result of Lemma 3.3, we have whp that the diameter of $\mathcal{N}(k^*)$ is at most $2k^*$, because the distance between any vertex in $\mathcal{N}(k^*)$ and the vertex with degree $D_{(N)}$ is at most $k^*$. Therefore, we are done when we can show that
$$\mathrm{Core}_N\subseteq\mathcal{N}(k^*).\qquad(3.22)$$
For this, we note that
$$\mathcal{N}(k^*)=\{i:D_i\geq u_{k^*}\},\qquad(3.23)$$
so that it suffices to prove that $u_{k^*}\geq(\log N)^{\sigma}$, for any $\sigma>\frac{1}{3-\tau}$. According to Lemma 3.2,
$$u_{k^*}=C^{a_{k^*}}(\log N)^{b_{k^*}}N^{c_{k^*}}.\qquad(3.24)$$
Because for $x\to\infty$, and $2<\tau<3$,
$$x(\tau-2)^{\frac{2\log x}{|\log(\tau-2)|}}=x\cdot x^{-2}=o(\log x),\qquad(3.25)$$
we find with $x=\log N$ that
$$\log N\cdot(\tau-2)^{\frac{2\log\log N}{|\log(\tau-2)|}}=o(\log\log N),\qquad(3.26)$$
implying that $N^{c_{k^*}}=(\log N)^{o(1)}$, $(\log N)^{b_{k^*}}=(\log N)^{\frac{1}{3-\tau}+o(1)}$, and $C^{a_{k^*}}=(\log N)^{o(1)}$. Thus,
$$u_{k^*}=(\log N)^{\frac{1}{3-\tau}+o(1)},\qquad(3.27)$$
so that, by picking $N$ sufficiently large, we can ensure $\frac{1}{3-\tau}+o(1)\leq\sigma$. This completes the proof of Proposition 3.1.
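The closed form of Lemma 3.2 can be sanity-checked numerically. The sketch below (illustrative; the values of $\tau$, $C$ and $N$ are arbitrary, not taken from the paper) iterates the recursion $u_k=C\log N\,(u_{k-1})^{\tau-2}$ and compares it with $C^{a_k}(\log N)^{b_k}N^{c_k}$:

```python
import math

def u_closed(k, C, N, tau):
    """Closed form of Lemma 3.2: u_k = C^{a_k} (log N)^{b_k} N^{c_k}."""
    r = (tau - 2) ** (k - 1)
    c_k = r / (tau - 1)
    a_k = (1 - r) / (3 - tau)
    b_k = a_k - r   # equals 1/(3-tau) - (4-tau)/(3-tau) * r, cf. (3.8)
    return C ** a_k * math.log(N) ** b_k * N ** c_k

def u_recursive(k, C, N, tau):
    """The recursion (3.3)/(3.6): u_1 = N^{1/(tau-1)}/log N, u_k = C log N u_{k-1}^{tau-2}."""
    u = N ** (1 / (tau - 1)) / math.log(N)
    for _ in range(k - 1):
        u = C * math.log(N) * u ** (tau - 2)
    return u

if __name__ == "__main__":
    C, N, tau = 3.0, 10 ** 8, 2.5   # illustrative parameter values only
    for k in range(1, 8):
        print(k, u_recursive(k, C, N, tau), u_closed(k, C, N, tau))
```

The two computations agree to machine precision for every k, and one can also watch $u_k$ decrease toward the $(\log N)^{1/(3-\tau)}$ scale used in (3.27).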
For an integer $m\geq 2$, we define
$$C(m)=\sigma/\log m.\qquad(3.28)$$

Proposition 3.4 (Maximal distance between periphery and core). Assume that $P(D\geq m+1)=1$, for some $m\geq 2$. Then, for every $\sigma>(3-\tau)^{-1}$, the maximal distance between any vertex and the core is, whp, bounded from above by $C(m)\log\log N$.

Proof. We start from a vertex $i$ and will show that the probability that the distance between $i$ and $\mathrm{Core}_N$ is at least $C(m)\log\log N$ is $o(N^{-1})$. This proves the claim. For this, we explore the neighborhood of $i$ as follows. From $i$, we connect the first $m+1$ stubs (ignoring the other ones). Then, successively, we connect the first $m$ stubs from the closest vertex to $i$ that we have connected to and have not yet explored. We call the arising process, when we have explored up to distance $k$ from the initial vertex $i$, the $k$-exploration tree. When we never connect two stubs between vertices we have connected to, then the number of vertices we can reach in $k$ steps is precisely equal to $(m+1)m^{k-1}$. We call an event where a stub on the $k$-exploration tree connects to a stub incident to a vertex in the $k$-exploration tree a collision. The number of collisions in the $k$-exploration tree is the number of cycles or self-loops in it. When $k$ increases, the probability of a collision increases. However, for $k$ of order $\log\log N$, the probability that more than two collisions occur in the $k$-exploration tree is small, as we will prove now:

Lemma 3.5 (Not more than one collision). Take $k=\lceil C(m)\log\log N\rceil$. Then, the $\mathbb{P}_N$-probability that there exists a vertex of which the $k$-exploration tree has at least two collisions, before hitting the core $\mathrm{Core}_N$, is bounded by $(\log N)^dL_N^{-2}$, for $d=4C(m)\log(m+1)+2\sigma$.

Proof. For any stub in the $k$-exploration tree, the probability that it will create a collision before hitting the core is bounded above by $(m+1)m^{k-1}(\log N)^{\sigma}L_N^{-1}$. The probability that two stubs will both create a collision is, by similar arguments, bounded above by $\big((m+1)m^{k-1}(\log N)^{\sigma}L_N^{-1}\big)^2$. The total number of possible pairs
of stubs in the $k$-exploration tree is bounded by $[(m+1)(1+m+\cdots+m^{k-1})]^2\leq[(m+1)m^k]^2$, so that, by Boole's inequality, the probability that the $k$-exploration tree has at least two collisions is bounded by
$$\big((m+1)m^k\big)^4(\log N)^{2\sigma}L_N^{-2}.\qquad(3.29)$$
When $k=\lceil C(m)\log\log N\rceil$, we have that $\big((m+1)m^k\big)^4(\log N)^{2\sigma}\leq(\log N)^d$, where $d$ is defined in the statement of the lemma.

Finally, we show that, for $k=\lceil C(m)\log\log N\rceil$, the $k$-exploration tree will, whp, connect to the $\mathrm{Core}_N$:

Lemma 3.6 (Connecting exploration tree to core). Take $k=\lceil C(m)\log\log N\rceil$. Then, the probability that there exists an $i$ such that the distance of $i$ to the core is at least $k$ is $o(N^{-1})$.

Proof. Since $\mu<\infty$, we have that $L_N/N\sim\mu$. Then, by Lemma 3.5, the probability that there exists a vertex for which the $k$-exploration tree has at least 2 collisions before hitting the core is $o(N^{-1})$. When the $k$-exploration tree from a vertex $i$ does not have two collisions, then there are at least $(m-1)m^{k-1}$ stubs in the $k$th layer that have not yet been connected. When $k=\lceil C(m)\log\log N\rceil$, this number is at least equal to $(\log N)^{C(m)\log m+o(1)}$. Furthermore, the expected number of stubs incident to the $\mathrm{Core}_N$ is at least $N(\log N)^{\sigma}P(D_1\geq(\log N)^{\sigma})$, so that whp the number of stubs incident to $\mathrm{Core}_N$ is at least (compare (1.4))
$$\tfrac{1}{2}N(\log N)^{\sigma}P(D_1\geq(\log N)^{\sigma})\geq\tfrac{c}{2}N(\log N)^{\frac{2-\tau}{3-\tau}}.\qquad(3.30)$$
By (3.18), the probability that we connect none of the stubs in the $k$th layer of the $k$-exploration tree to one of the stubs incident to $\mathrm{Core}_N$ is bounded by
$$\exp\Big(-\frac{cN(\log N)^{\frac{2-\tau}{3-\tau}+C(m)\log m}}{2L_N}\Big)\leq\exp\Big(-\frac{c}{4\mu}(\log N)^{\frac{2-\tau}{3-\tau}+\sigma}\Big)=o(N^{-1}),\qquad(3.31)$$
because whp $L_N/N\leq 2\mu$, and since $\frac{2-\tau}{3-\tau}+\sigma>1$.

Propositions 3.1 and 3.4 prove that whp the diameter of the configuration model is bounded above by $C_F\log\log N$, with $C_F$ defined in (1.5). This completes the proof of Theorem 1.2.

Acknowledgements. The work of RvdH and DZ was supported in part by Netherlands Organisation for Scientific Research (NWO).

References

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex
networks. Rev. Mod. Phys. 74, 47-97, (2002).

[2] A.-L. Barabási. Linked: The New Science of Networks. Perseus Publishing, Cambridge, Massachusetts, (2002).

[3] B. Bollobás. Random Graphs, 2nd edition, Academic Press, (2001).

[4] B. Bollobás, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms 31, 3-122, (2007).

[5] B. Bollobás and O. Riordan. The diameter of a scale-free random graph. Combinatorica, 24(1):5-34, (2004).

[6] B. Bollobás and O. Riordan. Shortest paths and load scaling in scale-free trees. Phys. Rev. E., 69:036114, (2004).

[7] T. Britton, M. Deijfen, and A. Martin-Löf. Generating simple random graphs with prescribed degree distribution. J. Stat. Phys., 124(6):1377-1397, (2006).

[8] F. Chung and L. Lu. The average distances in random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA, 99(25):15879-15882 (electronic), (2002).

[9] R. Cohen and S. Havlin. Scale free networks are ultrasmall. Physical Review Letters 90, 058701, (2003).

[10] C. Cooper and A. Frieze. A general model of web graphs. Random Structures Algorithms, 22(3):311-335, (2003).

[11] M. Deijfen, H. van den Esker, R. van der Hofstad and G. Hooghiemstra. A preferential attachment model with random initial degrees. Preprint (2007). To appear in Arkiv för Matematik.

[12] P. Erdős and A. Rényi. On a new law of large numbers. J. Analyse Math. 23, 103-111, (1970).

[13] H. van den Esker, R. van der Hofstad and G. Hooghiemstra. Universality for the distance in finite variance random graphs. Preprint (2006). Available from http://ssor.twi.tudelft.nl/∼gerardh/

[14] H. van den Esker, R. van der Hofstad, G. Hooghiemstra and D. Znamenski. Distances in random graphs with infinite mean degrees. Extremes 8, 111-141, (2006).
Solving the feedback vertex set problem on undirected graphs

Feedback problems consist of removing a minimal number of vertices of a directed or undirected graph in order to make it acyclic. The problem is known to be NP-complete. In this paper we consider the variant on undirected graphs. The polyhedral structure of the Feedback Vertex Set polytope is studied. We prove that this polytope is full-dimensional and show that some inequalities are facet-defining. We describe a new large class of valid constraints, the subset inequalities. A branch-and-cut algorithm for the exact solution of the problem is then outlined, and separation algorithms for the inequalities studied in the paper are proposed. A local search heuristic is described next. Finally, we create a library of 1400 randomly generated instances with the geometric structure suggested by the applications, and we computationally compare the two algorithmic approaches on our library. Key words: feedback vertex set, branch-and-cut, local search heuristic, tabu search.
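As a concrete illustration of the problem itself (a toy sketch, not the branch-and-cut or local search algorithms of the paper): a vertex set S is a feedback vertex set of an undirected graph exactly when deleting S leaves a forest, which a single union-find pass can verify; exhaustive search over vertex subsets then yields a minimum feedback vertex set on small instances.

```python
from itertools import combinations

def is_forest(n, edges, removed):
    """True iff the graph on vertices 0..n-1, with `removed` deleted, is acyclic.

    An undirected graph is acyclic iff it is a forest; checked via union-find.
    """
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        if u in removed or v in removed:
            continue
        ru, rv = find(u), find(v)
        if ru == rv:          # this edge closes a cycle
            return False
        parent[ru] = rv
    return True

def min_fvs(n, edges):
    """Minimum feedback vertex set by exhaustive search (exponential; toy sizes only)."""
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if is_forest(n, edges, set(subset)):
                return set(subset)

if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (2, 0)]
    print(min_fvs(3, triangle))   # any single vertex breaks the 3-cycle
```

On a triangle the minimum has size 1; on K4 it has size 2, since removing any single vertex still leaves a triangle.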
Constructing Word Graphs Based on HowNet
Journal of North China Institute of Water Conservancy and Hydroelectric Power, Vol. 29, No. 3, Jun. 2008. Article ID: 1002-5634(2008)03-0053-04.

Constructing Word Graphs Based on HowNet

ZHANG Rui-xia (1), XIAO Han (2)
(1. North China Institute of Water Conservancy and Hydroelectric Power, Zhengzhou 450011, China; 2. Zhengzhou Teacher's College, Zhengzhou 450044, China)

Biography: ZHANG Rui-xia (b. 1979), female, from Zhengzhou, Henan; teaching assistant, M.S.; research interests: natural language processing and artificial intelligence.

Abstract: To represent Chinese semantic information, Knowledge Graphs are improved. With HowNet as the semantic resource, Chinese words are classified by studying how concepts are represented in HowNet, and word graphs of different forms are constructed for the different word types. The smallest semantic units of Chinese are thereby represented, laying a foundation for semantic analysis in Chinese information processing.
Key words: Chinese information processing; knowledge representation; Knowledge Graphs; HowNet; word graphs
CLC number: TP391. Document code: A.

Knowledge representation is a core problem in natural language processing. Semantic components, semantic networks, semantic frames and logical representations are the usual knowledge representation methods, and among them semantic networks are particularly important. Conceptual graphs belong to the semantic-network family: they use nodes to represent concepts and directed arcs to represent the relations between concepts. A Knowledge Graph is a special kind of conceptual graph; its special feature is that only a very limited set of relation types is used to express the relations between nodes. Reference [1] used Knowledge Graphs to represent the logic words of natural language processing, [2] discussed the classification and representation of quantifiers in Knowledge Graphs, and [3] proposed semantic analysis based on Knowledge Graphs, but so far there has been no systematic and comprehensive study of representing Chinese semantics with Knowledge Graphs. We therefore improve Knowledge Graphs and, with HowNet as the semantic resource, classify Chinese words by studying how concepts are represented in HowNet and construct word graphs of different forms for the different word types.

1 Knowledge Graphs

A Knowledge Graph is a knowledge representation method belonging to the semantic-network family; it uses nodes to represent concepts and directed arcs to represent the relations between concepts.

Definition 1. Let C be a set of concepts and T a set of relation types. G = <N, A, ln, la> is a Knowledge Graph, where N is the set of nodes; A is the set of arcs; ln is the mapping from nodes to concepts, ln: N -> C; and la is the mapping from arcs to relation types, la: A -> T. The relation types are EQU, SUB, ALI, DIS, CAU, ORD, PAR, SKO, FPAR, NEGPAR, POSPAR and NECPAR [4-7].

The basic idea of Knowledge Graphs is that the meaning of every word can be represented by a Knowledge Graph called a "word graph"; merging word graphs yields "phrase graphs", merging phrase graphs yields "sentence graphs", and merging sentence graphs finally yields "text graphs". Constructing word graphs is therefore the basis and core of applying Knowledge Graphs, and this paper proposes a method of constructing word graphs with HowNet as the semantic knowledge resource.

2 HowNet

HowNet is a common-sense knowledge base whose basic units are the concepts denoted by bilingual Chinese-English word pairs together with the features of those concepts, and whose basic content is the relations between concepts and between their features. It has been widely applied in Chinese information processing [8]. The 2005 edition of the HowNet concept dictionary describes word concepts in a hierarchically nested form, which is well suited to representing Chinese semantics with Knowledge Graphs.

3 Improving Knowledge Graphs

3.1 Improving the arcs

The greatest strength of Knowledge Graphs is that the arcs have only 12 relation types, which greatly benefits high-level reasoning; Chinese, however, is complex and variable, and analyzing it directly with Knowledge Graphs is somewhat coarse-grained. We therefore extend the arc relation types and define rules that convert the extended relation types into the basic relation types. In addition, to quantify the process of semantic analysis, a weight is attached to each arc.

3.1.1 Extended arc relation types

Following HowNet, the arc relation types of Knowledge Graphs are extended with 90 relation types that express semantic roles, such as relevant, agent, possessor, patient, content, result, possession, isa and partof. These are called the extended relation types of arcs, and the 12 original relation types are called the basic relation types.

3.1.2 Converting extended relation types into basic relation types

The final result of Chinese semantic analysis is expressed in two forms: a Knowledge Graph using both the extended and the basic relation types, and a Knowledge Graph using only the basic relation types. Figure 1 gives the rules for automatically converting extended relation types into basic relation types (Event: event sememe; Entity: entity sememe; Value: attribute-value sememe). In Figure 1, parts (a), (c), (e) and (g) are Knowledge Graphs expressed with extended relation types, and parts (b), (d), (f), (h) and (j) are the corresponding Knowledge Graphs expressed with basic relation types. In Figure 1(a) the arc relation type may also be relevant, experiencer or possessor; in Figure 1(c) it may also be target, possession, content, beneficiary, isa or partof; in Figure 1(e) it may also be instrument, material, manner, frequency, quantity, time, location, LocationThru, LocationIni, TimeFin, TimeIni, StateIni, duration or degree, and the node standing for Entity may also stand for Value; in Figure 1(g) the node standing for Value may also stand for an attribute sememe (Attribute); in Figure 1(i) the arc relation type may also be location, time or duration.

3.2 Improving the nodes

According to the practical problem, the nodes of a Knowledge Graph are divided into word nodes, sememe nodes and subgraph nodes, and every node is given an identity tag, a role tag and an image tag.

Definition 2. Word node: a node that represents a word.

Definition 3. Sememe node: a node that represents a sememe in the concept of a word. According to the roles they play while Knowledge Graphs are merged, sememe nodes are divided into head sememe nodes and non-head sememe nodes. A head sememe node plays the main role in the merging process and stands for the main meaning the whole graph expresses; non-head sememe nodes play an auxiliary role and form the context of the head sememe node.

Definition 4. Subgraph node: a node that points to another Knowledge Graph. If a subgraph node points to NULL, the Knowledge Graph it points to does not exist yet. By introducing subgraph nodes, the two-dimensional representation of Knowledge Graphs is extended to a three-dimensional one, which shows clear hierarchy and inheritance and represents Chinese semantic information better.

Definition 5. Identity tag of a node: an integer that distinguishes different nodes.

Definition 6. Role tag of a node: an integer that states which role the node plays in the Knowledge Graph.

Definition 7. Image tag of a node: a character string that stands for the content the node represents.

Classifying and tagging the nodes provides a formal basis for traversing, searching and merging Knowledge Graphs, which greatly increases their operability.

4 Constructing word graphs

4.1 Basis of the construction

Word graphs are constructed from the concept definitions (DEF) in the HowNet dictionary. From the way a DEF is written, we first summarize its formal description:

<dynamic role> ::= <HowNet dynamic role>
<secondary feature role> ::= <HowNet secondary feature role>
<role> ::= <dynamic role> | <secondary feature role>
<main sememe> ::= <entity sememe> | <event sememe> | <attribute sememe> | <attribute-value sememe>
<secondary sememe> ::= <secondary feature sememe>
<relation expression> ::= <role>={<main sememe>} | <role>={<secondary sememe>} | <role>=<DEF> | <role>={<sememe mark>} | "EventRole"={<dynamic role>}
<sememe mark> ::= "~" | "$" | "*" | "?"
<DEF> ::= {<main sememe>} | {<secondary sememe>:<relation expression>} | {<main sememe>:<DEF>[,<relation expression>][,<DEF>][,<relation expression>]} | {<main sememe>:<relation expression>[,<DEF>][,<relation expression>][,<DEF>]}

From this formal description, DEFs fall into two classes. One class, called general concepts, describes what we call general words; the other, called functional concepts, describes function words. A general concept has three features: (i) its DEF contains at least one main sememe; (ii) its DEF is a description string between the symbols "{" and "}"; (iii) its DEF is hierarchical and nested. The description of a functional concept contains no main sememe, and the corresponding words play the part of the function words of Chinese grammar. These features provide the basis for constructing word graphs automatically.

4.2 Constructing word graphs for general words

Because a DEF is hierarchical and nested, a stack can be used in the construction. The structure of a stack element is shown in Figure 2 (Figure 2: structure of a stack element, consisting of a character and the identity tag of a node). Let def be the stack, def_top the top element, def_second the second element from the top, record a stack element, and let the variable role record the current semantic role. The word graph is then constructed by the following algorithm.

Scan the DEF. On reading "{", build a stack element record, let its first field be "{" (record:one = "{") and its second field be 0 (record:two = 0), and push record.

While the stack def is not empty, do:

1. Continue scanning the DEF. On reading ":", build a sememe node for the sememe just read; if the stack contains only one element, this sememe node is the head sememe node, otherwise it is a non-head sememe node; update the value of def_top:two.
2. Continue scanning the DEF. If the string read is a semantic role, record it in role.
3. Continue scanning the DEF. If a sememe is read:
   (1) If what is read is a sememe mark: let record = def_top and pop def_top. If the mark is "~", build an arc, with relation type role and weight 1, between the two sememe nodes referred to by record:two and by the new def_top:two; otherwise, build a subgraph node and an arc, with relation type role and weight 1, between this subgraph node and the sememe node referred to by def_top:two. Set role to the empty string. Continue scanning; if the character read is not "}", report an error.
   (2) If what is read is not a sememe mark, build a non-head sememe node and update def_top:two; if role is not the empty string, build an arc, with relation type role and weight 1, between the sememe nodes referred to by def_top:two and by def_second:two, and set role to the empty string.
4. Continue scanning the DEF. On reading "}", pop def_top.

If the stack def is already empty but the DEF has not been scanned to the end, report an error.

For example, for the word 眼睛 (eye), DEF = {部件 (part): whole={动物 (animal)}, PartPosition={眼 (eye)}}; its word graph, built by the algorithm above, is shown in Figure 3. For 工人 (worker), DEF = {人 (human): HostOf={职位 (occupation)}, domain={工 (industrial)}}; its word graph is shown in Figure 4. For 洗衣机 (washing machine), DEF = {工具 (tool): {洗涤 (wash): instrument={~}, patient={衣物 (clothing)}}}; its word graph is shown in Figure 5. For 插座 (socket), DEF = {配件 (fitting): whole={用具 (implement): {连接 (connect): instrument={~}, patient={电 (electricity)}}}}; its word graph is shown in Figure 6. For 不可分割 (inseparable), DEF = {不会 (unable): scope={分发 (distribute): possession={$}}}; its word graph is shown in Figure 7.

4.3 Constructing word graphs for function words

Function words are special: in general they cannot act as syntactic constituents and mostly express grammatical meaning, so the construction of their word graphs is also special. Functional concepts have two description forms: {功能词 (FunctWord): adjunct={<secondary sememe>}} and {功能词 (FunctWord): EventRole=<dynamic role>}. For the first form, only a head sememe node is built; for example 也 (also), DEF = {功能词: adjunct={也}}, whose word graph is shown in Figure 8. For the second form, two subgraph nodes are built, both pointing to NULL, and the arc between the two nodes carries the relation type of the corresponding dynamic role in the DEF; for example 被 (the passive marker), DEF = {功能词: EventRole={agent}}, whose word graph is shown in Figure 9.
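The stack-based construction above is, at its core, a parse of the nested DEF string. The sketch below is a hypothetical, simplified helper (not the paper's algorithm: it handles only the role={...} form of relation expressions, treats sememe marks like "~" as ordinary heads, and omits arc weights) showing how the nesting can be recovered with a recursive parser:

```python
def parse_def(s, i=0):
    """Parse a DEF block '{head:role1={...},role2={...}}' starting at s[i].

    Returns ((head, roles_dict), next_index), where roles_dict maps each
    semantic role to its recursively parsed child DEF.
    """
    assert s[i] == "{", "DEF must start with '{'"
    i += 1
    j = i
    while s[j] not in ":}":          # read the head sememe
        j += 1
    head, roles, i = s[i:j], {}, j
    while s[i] != "}":
        i += 1                       # skip ':' (first time) or ',' (afterwards)
        j = i
        while s[j] != "=":           # read the role name
            j += 1
        role = s[i:j]
        child, i = parse_def(s, j + 1)
        roles[role] = child          # s[i] is now ',' or the closing '}'
    return (head, roles), i + 1

if __name__ == "__main__":
    tree, _ = parse_def("{part:whole={animal},PartPosition={eye}}")
    print(tree)  # ('part', {'whole': ('animal', {}), 'PartPosition': ('eye', {})})
```

The returned tree mirrors the hierarchical DEF structure that the stack algorithm walks: each head sememe becomes a node, and each role labels an arc to a child subtree.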
Figure 8: word graph of 也. Figure 9: word graph of 被 (subgraph 1, arc agent, subgraph 2).

By constructing word graphs, word-sense information is stored in the form of graphs, which makes it hierarchical and clear and lays a good foundation for constructing phrase graphs and sentence graphs.

5 Conclusion

In natural language processing, an appropriate representation of semantic knowledge is fundamental. We therefore improved Knowledge Graphs and constructed word graphs with HowNet as the semantic resource, so that the smallest semantic units of Chinese are represented clearly, laying a foundation for semantic analysis. Applications have shown that the proposed semantic representation helps to bring semantic analysis into syntactic parsing and improves the accuracy of syntactic parsing.

References
[1] Zhang Lei, Li Xueliang, Liu Xiaodong. Logic words in natural language processing [J]. Mini-Micro Systems, 2000, 21(4): 520-523.
[2] Liu Xiaodong, Li Xueliang, Zhang Lei. Classification and representation of quantifiers in knowledge graphs [J]. Mini-Micro Systems, 2000, 21(5): 153-157.
[3] Zhang Lei, Li Xueliang, Liu Xiaodong. Semantic analysis based on knowledge graphs [J]. Journal of Northwest University, 2002, 32(2): 153-156.
[4] C Hoede, X Li. Word graphs: the first set [C]// P W Eklund, G Ellis, G Mann. Conceptual Structures: Knowledge Representation as Interlingua, Auxiliary Proceedings of the 4th International Conference on Conceptual Structures. Bondi Beach, Sydney, Australia, 1996: 81-93.
[5] C Hoede, X Liu. Word graphs: the second set [C]// M L Mugnier, M Chein. Conceptual Structures: Theory, Tools and Applications, Proceedings of the 6th International Conference on Conceptual Structures. Montpellier, 1998: 375-389.
[6] C Hoede, L Zhang. Word graphs: the third set [C]// H S Delugach, G Stumme. Conceptual Structures: Broadening the Base, Proceedings of the 9th International Conference on Conceptual Structures. CA, USA, 2001: 15-27.
[7] Lei Zhang. Knowledge Graph Theory and Structural Parsing [M]. Twente University Press, PO Box 217, 7500 AE Enschede, The Netherlands, 2002.
[8] Dong Zhendong, Dong Qiang, Hao Changling. Theoretical findings of HowNet [J]. Journal of Chinese Information Processing, 2007, 21(4): 3-9.

A Method of Constructing Word Graphs Based on HowNet
ZHANG Rui-xia (1), XIAO Han (2)
(1. North China Institute of Water Conservancy and Hydroelectric Power, Zhengzhou 450011, China; 2. Zhengzhou Teacher's College, Zhengzhou 450044, China)
Abstract: To represent semantic information of Chinese, Knowledge Graphs are improved, and words are classified by researching "DEF" in HowNet; then all kinds of Word Graphs are constructed according to different kinds of words, so the least semantic units of Chinese are represented, and the groundwork of semantic analysis is founded in Chinese information processing.
Key words: Chinese information processing; knowledge representation; Knowledge Graphs; HowNet; Word Graphs
dot language manual
Drawing graphs with dot
Emden R. Gansner and Eleftherios Koutsofios and Stephen North
November 2, 2010

Abstract

dot draws directed graphs as hierarchies. It runs as a command line program, web visualization service, or with a compatible graphical interface. Its features include well-tuned layout algorithms for placing nodes and edge splines, edge labels, "record" shapes with "ports" for drawing data structures; cluster layouts; and an underlying file language for stream-oriented graph tools. Below is a reduced module dependency graph of an SML-NJ compiler that took 0.23 seconds of user time on a 3 GHz Intel Xeon.

1 Basic Graph Drawing

dot draws directed graphs. It reads attributed graph text files and writes drawings, either as graph files or in a graphics format such as GIF, PNG, SVG, PDF, or PostScript.

dot draws graphs in four main phases. Knowing this helps you to understand what kind of layouts dot makes and how you can control them. The layout procedure used by dot relies on the graph being acyclic. Thus, the first step is to break any cycles which occur in the input graph by reversing the internal direction of certain cyclic edges. The next step assigns nodes to discrete ranks or levels. In a top-to-bottom drawing, ranks determine Y coordinates. Edges that span more than one rank are broken into chains of "virtual" nodes and unit-length edges. The third step orders nodes within ranks to avoid crossings. The fourth step sets X coordinates of nodes to keep edges short, and the final step routes edge splines. This is the same general approach as most hierarchical graph drawing programs, based on the work of Warfield [War77], Carpano [Car80] and Sugiyama [STT81]. We refer the reader to [GKNV93] for a thorough explanation of dot's algorithms.

dot accepts input in the DOT language (cf. Appendix D). This language describes three main kinds of objects: graphs, nodes, and edges. The main (outermost) graph can be directed (digraph) or undirected graph. Because dot makes layouts of directed graphs, all the following examples use
digraph. (A separate layout utility, neato, draws undirected graphs [Nor92].) Within a main graph, a subgraph defines a subset of nodes and edges.

Figure 1 is an example graph in the DOT language. Line 1 gives the graph name and type. The lines that follow create nodes, edges, or subgraphs, and set attributes. Names of all these objects may be C identifiers, numbers, or quoted C strings. Quotes protect punctuation and white space. A node is created when its name first appears in the file. An edge is created when nodes are joined by the edge operator ->. In the example, line 2 makes edges from main to parse, and from parse to execute. Running dot on this file (call it graph1.gv)

$ dot -Tps graph1.gv -o graph1.ps

yields the drawing of Figure 2. The command line option -Tps selects PostScript (EPSF) output. graph1.ps may be printed, displayed by a PostScript viewer, or embedded in another document.

1: digraph G {
2:   main -> parse -> execute;
3:   main -> init;
4:   main -> cleanup;
5:   execute -> make_string;
6:   execute -> printf
7:   init -> make_string;
8:   main -> printf;
9:   execute -> compare;
10: }

Figure 1: Small graph

Figure 2: Drawing of small graph

It is often useful to adjust the representation or placement of nodes and edges in the layout. This is done by setting attributes of nodes, edges, or subgraphs in the input file. Attributes are name-value pairs of character strings. Figures 3 and 4 illustrate some layout attributes. In the listing of Figure 3, line 2 sets the graph's size to 4,4 (in inches). This attribute controls the size of the drawing; if the drawing is too large, it is scaled uniformly as necessary to fit.

Node or edge attributes are set off in square brackets. In line 3, the node main is assigned shape box. The edge in line 4 is straightened by increasing its weight (the default is 1). The edge in line 6 is drawn as a dotted line. Line 8 makes edges from execute to make_string and printf. In line 10 the default edge color is set to red. This affects any edges created after this point in the file. Line 11 makes a bold edge labeled 100 times. In line 12, node make_string is given a multi-line
label. Line 13 changes the default node to be a box filled with a shade of blue. The node compare inherits these values.

2 Drawing Attributes

The main attributes that affect graph drawing are summarized in Appendices A, B and C. For more attributes and a more complete description of the attributes, you should refer to the Graphviz web site, specifically /doc/info/attrs.html

2.1 Node Shapes

Nodes are drawn, by default, with shape=ellipse, width=.75, height=.5 and labeled by the node name. Other common shapes include box, circle, record and plaintext. A list of the main node shapes is given in Appendix H. The node shape plaintext is of particular interest in that it draws a node without any outline, an important convention in some kinds of diagrams. In cases where the graph structure is of main concern, and especially when the graph is moderately large, the point shape reduces nodes to display minimal content.

When drawn, a node's actual size is the greater of the requested size and the area needed for its text label, unless fixedsize=true, in which case the width and height values are enforced.

Node shapes fall into two broad categories: polygon-based and record-based.(1) All node shapes except record and Mrecord are considered polygonal, and are modeled by the number of sides (ellipses and circles being special cases), and a few other geometric properties. Some of these properties can be specified in a graph. If regular=true, the node is forced to be regular. The parameter

(1) There is a way to implement custom node shapes, using shape=epsf and the shapefile attribute, and relying on PostScript output. The details are beyond the scope of this user's guide.
Please contact the authors for further information.

1: digraph G {
2:   size="4,4";
3:   main [shape=box];  /* this is a comment */
4:   main -> parse [weight=8];
5:   parse -> execute;
6:   main -> init [style=dotted];
7:   main -> cleanup;
8:   execute -> {make_string; printf}
9:   init -> make_string;
10:  edge [color=red];  // so is this
11:  main -> printf [style=bold,label="100 times"];
12:  make_string [label="make a\nstring"];
13:  node [shape=box,style=filled,color=".7 .3 1.0"];
14:  execute -> compare;
15: }

Figure 3: Fancy graph

Figure 4: Drawing of fancy graph

peripheries sets the number of boundary curves drawn. For example, a doublecircle has peripheries=2. The orientation attribute specifies a clockwise rotation of the polygon, measured in degrees.

The shape polygon exposes all the polygonal parameters, and is useful for creating many shapes that are not predefined. In addition to the parameters regular, peripheries and orientation, mentioned above, polygons are parameterized by number of sides sides, skew and distortion. skew is a floating point number (usually between -1.0 and 1.0) that distorts the shape by slanting it from top-to-bottom, with positive values moving the top of the polygon to the right. Thus, skew can be used to turn a box into a parallelogram. distortion shrinks the polygon from top-to-bottom, with negative values causing the bottom to be larger than the top. distortion turns a box into a trapezoid. A variety of these polygonal attributes are illustrated in Figures 6 and 5.

Record-based nodes form the other class of node shapes. These include the shapes record and Mrecord. The two are identical except that the latter has rounded corners. These nodes represent recursive lists of fields, which are drawn as alternating horizontal and vertical rows of boxes. The recursive structure is determined by the node's label, which has the following schema:

rlabel   -> field ( '|' field )*
field    -> boxLabel | '{' rlabel '}'
boxLabel -> [ '<' string '>' ] [ string ]

Literal braces, vertical bars and angle brackets must be escaped. Spaces are interpreted as separators between tokens, so they must be
escaped if they are to appear literally in the text. The first string in a boxLabel gives a name to the field, and serves as a port name for the box (cf. Section 3.1). The second string is used as a label for the field; it may contain the same escape sequences as multi-line labels (cf. Section 2.2). The example of Figures 7 and 8 illustrates the use and some properties of records.

1: digraph G {
2:   a -> b -> c;
3:   b -> d;
4:   a [shape=polygon,sides=5,peripheries=3,color=lightblue,style=filled];
5:   c [shape=polygon,sides=4,skew=.4,label="hello world"]
6:   d [shape=invtriangle];
7:   e [shape=polygon,sides=4,distortion=.7];
8: }

Figure 5: Graph with polygonal shapes

Figure 6: Drawing of polygonal node shapes

1: digraph structs {
2:   node [shape=record];
3:   struct1 [shape=record,label="<f0> left|<f1> mid\dle|<f2> right"];
4:   struct2 [shape=record,label="<f0> one|<f1> two"];
5:   struct3 [shape=record,label="hello\nworld|{b|{c|<here> d|e}|f}|g|h"];
6:   struct1 -> struct2;
7:   struct1 -> struct3;
8: }

Figure 7: Records with nested fields

2.2 Labels

As mentioned above, the default node label is its name. Edges are unlabeled by default. Node and edge labels can be set explicitly using the label attribute, as shown in Figure 4.

Though it may be convenient to label nodes by name, at other times labels must be set explicitly. For example, in drawing a file directory tree, one might have several directories named src, but each one must have a unique node identifier.
The setting labeljust=r moves the label to the right.The default font is14-point Times-Roman,in black.Other font families, sizes and colors may be selected using the attributes fontname,fontsize and fontcolor.Font names should be compatible with the target interpreter.It is best to use only the standard font families Times,Helvetica,Courier or Symbol as these are guaranteed to work with any target graphics language.For example, Times-Italic,Times-Bold,and Courier are portable;AvanteGarde-DemiOblique isn’t.For bitmap output,such as GIF or JPG,dot relies on having these fonts avail-able during layout.Most precompiled installations of Graphviz use the fontconfig library for matching font names to available fontfiles.fontconfig comes with a set of utilities for showing matches and installing fonts.Please refer to the font-config documentation,or the external Graphviz FontFAQ or for further details.If Graphviz is built without fontconfig(which usually means you compiled it from source code on your own),the fontpath attribute can specify a list of directo-ries3which should be searched for the fontfiles.If this is not set,dot will use the DOTFONTPATH environment variable or,if this is not set,the GDFONTPATH environment variable.If none of these is set,dot uses a built-in list.Edge labels are positioned near the center of the ually,care is taken to prevent the edge label from overlapping edges and nodes.It can still be difficult, in a complex graph,to be certain which edge a label belongs to.If the decorate attribute is set to true,a line is drawn connecting the label to its edge.Sometimes avoiding collisions among edge labels and edges forces the drawing to be bigger than desired.If labelfloat=true,dot does not try to prevent such overlaps, allowing a more compact drawing.An edge can also specify additional labels,using headlabel and taillabel, which are be placed near the ends of the edge.The characteristics of these la-bels are specified using the attributes 
labelfontname, labelfontsize and labelfontcolor. These labels are placed near the intersection of the edge and the node and, as such, may interfere with them. To tune a drawing, the user can set the labelangle and labeldistance attributes. The former sets the angle, in degrees, by which the label is rotated from the angle the edge makes incident with the node. The latter sets a multiplicative scaling factor to adjust the distance that the label is from the node.

(2) The escape sequence \N is an internal symbol for node names.
(3) For Unix-based systems, this is a concatenated list of pathnames, separated by colons. For Windows-based systems, the pathnames are separated by semi-colons.

2.3 HTML-like Labels

In order to allow a richer collection of attributes at a finer granularity, dot accepts HTML-like labels using HTML syntax. These are specified using strings that are delimited by <...> rather than double-quotes. Within these delimiters, the string must follow the lexical, quoting, and syntactic conventions of HTML. By using the <TABLE> element, these labels can be viewed as an extension of and replacement for shape=record. With these, one can alter colors and fonts at the box level, and include images. The PORT attribute of a <TD> element provides a port name for the cell (cf. Section 3.1).

Although HTML-like labels are just a special type of label attribute, one frequently uses them as though they were a new type of node shape similar to records.
Thus, when these are used, one often sees shape=none and margin=0. Also note that, as a label, these can be used with edges and graphs as well as nodes. Figures 9 and 10 give an example of the use of HTML-like labels.

2.4 Graphics Styles

Nodes and edges can specify a color attribute, with black the default. This is the color used to draw the node's shape or the edge. A color value can be a hue-saturation-brightness triple (three floating point numbers between 0 and 1, separated by commas); one of the color names listed in Appendix J (borrowed from some version of the X window system); or a red-green-blue (RGB) triple[4] (three hexadecimal numbers between 00 and FF, preceded by the character '#'). Thus, the values "orchid", "0.8396,0.4862,0.8549" and "#DA70D6" are three ways to specify the same color. The numerical forms are convenient for scripts or tools that automatically generate colors. Color name lookup is case-insensitive and ignores non-alphanumeric characters, so warmgrey and Warm_Grey are equivalent.

[4] A fourth form, RGBA, is also supported, which has the same format as RGB with an additional fourth hexadecimal number specifying alpha channel or transparency information.

Figure 8: Drawing of records

    digraph html {
      abc [shape=none, margin=0, label=<
      <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
        <TR><TD ROWSPAN="3"><FONT COLOR="red">hello</FONT><BR/>world</TD>
            <TD COLSPAN="3">b</TD>
            <TD ROWSPAN="3" BGCOLOR="lightgrey">g</TD>
            <TD ROWSPAN="3">h</TD>
        </TR>
        <TR><TD>c</TD>
            <TD PORT="here">d</TD>
            <TD>e</TD>
        </TR>
        <TR><TD COLSPAN="3">f</TD>
        </TR>
      </TABLE>>];
    }

Figure 9: HTML-like labels

Figure 10: Drawing of HTML-like labels

We can offer a few hints regarding the use of color in graph drawings. First, avoid using too many bright colors. A "rainbow effect" is confusing. It is better to choose a narrower range of colors, or to vary saturation along with hue. Second, when nodes are filled with dark or very saturated colors, labels seem to be more readable with fontcolor=white and fontname=Helvetica. (We
also have PostScript functions for dot that create outline fonts from plain fonts.) Third, in certain output formats, you can define your own color space. For example, if using PostScript for output, you can redefine nodecolor, edgecolor, or graphcolor in a library file. Thus, to use RGB colors, place the following line in a file lib.ps:

    /nodecolor {setrgbcolor} bind def

Use the -l command line option to load this file:

    dot -Tps -l lib.ps file.gv -o file.ps

The style attribute controls miscellaneous graphics features of nodes and edges. This attribute is a comma-separated list of primitives with optional argument lists. The predefined primitives include solid, dashed, dotted, bold and invis. The first four control line drawing in node boundaries and edges and have the obvious meaning. The value invis causes the node or edge to be left undrawn.

The style for nodes can also include filled, diagonals and rounded. filled shades the inside of the node using the color fillcolor. If this is not set, the value of color is used. If this also is unset, light grey[5] is used as the default. The diagonals style causes short diagonal lines to be drawn between pairs of sides near a vertex. The rounded style rounds polygonal corners.

User-defined style primitives can be implemented as custom PostScript procedures. Such primitives are executed inside the gsave context of a graph, node, or edge, before any of its marks are drawn. The argument lists are translated to PostScript notation. For example, a node with style="setlinewidth(8)" is drawn with a thick outline. Here, setlinewidth is a PostScript built-in, but user-defined PostScript procedures are called the same way. The definition of these procedures can be given in a library file loaded using -l as shown above.

Edges have a dir attribute to set arrowheads. dir may be forward (the default), back, both, or none. This refers only to where arrowheads are drawn, and does not change the underlying graph. For example, setting dir=back causes an arrowhead to be drawn at the tail and no arrowhead at
the head, but it does not exchange the endpoints of the edge. The attributes arrowhead and arrowtail specify the style of arrowhead, if any, which is used at the head and tail ends of the edge. Allowed values are normal, inv, dot, invdot, odot, invodot and none (cf. Appendix I). The attribute arrowsize specifies a multiplicative factor affecting the size of any arrowhead drawn on the edge. For example, arrowsize=2.0 makes the arrow twice as long and twice as wide.

[5] The default is black if the output format is MIF, or if the shape is point.

In terms of style and color, clusters act somewhat like large box-shaped nodes, in that the cluster boundary is drawn using the cluster's color attribute and, in general, the appearance of the cluster is affected by the style, color and fillcolor attributes.

If the root graph has a bgcolor attribute specified, this color is used as the background for the entire drawing, and also serves as the default fill color.

2.5 Drawing Orientation, Size and Spacing

Two attributes that play an important role in determining the size of a dot drawing are nodesep and ranksep. The first specifies the minimum distance, in inches, between two adjacent nodes on the same rank. The second deals with rank separation, which is the minimum vertical space between the bottoms of nodes in one rank and the tops of nodes in the next. The ranksep attribute sets the rank separation, in inches. Alternatively, one can have ranksep=equally. This guarantees that all of the ranks are equally spaced, as measured from the centers of nodes on adjacent ranks. In this case, the rank separation between two ranks is at least the default rank separation. As the two uses of ranksep are independent, both can be set at the same time. For example, ranksep="1.0 equally" causes ranks to be equally spaced, with a minimum rank separation of 1 inch.

Often a drawing made with the default node sizes and separations is too big for the target printer or for the space allowed for a figure in a document. There are several ways to try to deal
with this problem. First, we will review how dot computes the final layout size.

A layout is initially made internally at its "natural" size, using default settings (unless ratio=compress was set, as described below). There is no bound on the size or aspect ratio of the drawing, so if the graph is large, the layout is also large. If you don't specify size or ratio, then the natural size layout is printed.

The easiest way to control the output size of the drawing is to set size="x,y" in the graph file (or on the command line using -G). This determines the size of the final layout. For example, size="7.5,10" fits on an 8.5x11 page (assuming the default page orientation) no matter how big the initial layout.

ratio also affects layout size. There are a number of cases, depending on the settings of size and ratio.

Case 1. ratio was not set. If the drawing already fits within the given size, then nothing happens. Otherwise, the drawing is reduced uniformly enough to make the critical dimension fit.

If ratio was set, there are four subcases.

Case 2a. If ratio=x where x is a floating point number, then the drawing is scaled up in one dimension to achieve the requested ratio expressed as drawing height/width. For example, ratio=2.0 makes the drawing twice as high as it is wide. Then the layout is scaled using size as in Case 1.

Case 2b. If ratio=fill and size=x,y was set, then the drawing is scaled up in one dimension to achieve the ratio y/x. Then scaling is performed as in Case 1. The effect is that all of the bounding box given by size is filled.

Case 2c. If ratio=compress and size=x,y was set, then the initial layout is compressed to attempt to fit it in the given bounding box. This trades off layout quality, balance and symmetry in order to pack the layout more tightly. Then scaling is performed as in Case 1.

Case 2d. If ratio=auto and the page attribute is set and the graph cannot be drawn on a single page, then size is ignored and dot computes an "ideal" size.
In particular, the size in a given dimension will be the smallest integral multiple of the page size in that dimension which is at least half the current size. The two dimensions are then scaled independently to the new size.

If rotate=90 is set, or orientation=landscape, then the drawing is rotated 90° into landscape mode. The X axis of the layout would be along the Y axis of each page. This does not affect dot's interpretation of size, ratio or page.

At this point, if page is not set, then the final layout is produced as one page. If page=x,y is set, then the layout is printed as a sequence of pages which can be tiled or assembled into a mosaic. Common settings are page="8.5,11" or page="11,17". These values refer to the full size of the physical device; the actual area used will be reduced by the margin settings. (For printer output, the default is 0.5 inches; for bitmap output, the X and Y margins are 10 and 2 points, respectively.) For tiled layouts, it may be helpful to set smaller margins. This can be done by using the margin attribute. This can take a single number, used to set both margins, or two numbers separated by a comma to set the x and y margins separately. As usual, units are in inches. Although one can set margin=0, unfortunately, many bitmap printers have an internal hardware margin that cannot be overridden.

The order in which pages are printed can be controlled by the pagedir attribute. Output is always done using a row-based or column-based ordering, and pagedir is set to a two-letter code specifying the major and minor directions. For example, the default is BL, specifying a bottom-to-top (B) major order and a left-to-right (L) minor order. Thus, the bottom row of pages is emitted first, from left to right, then the second row up, from left to right, finishing with the top row, from left to right. The top-to-bottom order is represented by T and the right-to-left order by R.

If center=true and the graph can be output on one page, using the default page size of 8.5 by 11 inches if page is not set, the graph is
repositioned to be centered on that page.

A common problem is that a large graph drawn at a small size yields unreadable node labels. To make larger labels, something has to give. There is a limit to the amount of readable text that can fit on one page. Often you can draw a smaller graph by extracting an interesting piece of the original graph before running dot. We have some tools that help with this.

sccmap     decompose the graph into strongly connected components
tred       compute transitive reduction (remove edges implied by transitivity)
gvpr       graph processor to select nodes or edges, and contract or remove the rest of the graph
unflatten  improve aspect ratio of trees by staggering the lengths of leaf edges

With this in mind, here are some things to try on a given graph:

1. Increase the node fontsize.
2. Use smaller ranksep and nodesep.
3. Use ratio=auto.
4. Use ratio=compress and give a reasonable size.
5. A sans serif font (such as Helvetica) may be more readable than Times when reduced.

2.6 Node and Edge Placement

Attributes in dot provide many ways to adjust the large-scale layout of nodes and edges, as well as fine-tune the drawing to meet the user's needs and tastes. This section discusses these attributes[6].

Sometimes it is natural to make edges point from left to right instead of from top to bottom. If rankdir=LR in the top-level graph, the drawing is rotated in this way. TB (top to bottom) is the default. The mode rankdir=BT is useful for drawing upward-directed graphs. For completeness, one can also have rankdir=RL.

[6] For completeness, we note that dot also provides access to various parameters which play technical roles in the layout algorithms. These include mclimit, nslimit, nslimit1, remincross and searchsize.

In graphs with time-lines, or in drawings that emphasize source and sink nodes, you may need to constrain rank assignments. The rank of a subgraph may be set to same, min, source, max or sink. A value same causes all the nodes in the subgraph to occur on the same rank. If set to min, all the nodes in the subgraph are
guaranteed to be on a rank at least as small as any other node in the layout[7]. This can be made strict by setting rank=source, which forces the nodes in the subgraph to be on some rank strictly smaller than the rank of any other nodes (except those also specified by min or source subgraphs). The values max or sink play an analogous role for the maximum rank. Note that these constraints induce equivalence classes of nodes. If one subgraph forces nodes A and B to be on the same rank, and another subgraph forces nodes C and B to share a rank, then all nodes in both subgraphs must be drawn on the same rank. Figures 11 and 12 illustrate using subgraphs for controlling rank assignment.

In some graphs, the left-to-right ordering of nodes is important. If a subgraph has ordering=out, then out-edges within the subgraph that have the same tail node will fan out from left to right in their order of creation. (Also note that flat edges involving the head nodes can potentially interfere with their ordering.)

There are many ways to fine-tune the layout of nodes and edges. For example, if the nodes of an edge both have the same group attribute, dot tries to keep the edge straight and avoid having other edges cross it.

The weight of an edge provides another way to keep edges straight. An edge's weight suggests some measure of an edge's importance; thus, the heavier the weight, the closer together its nodes should be. dot causes edges with heavier weights to be drawn shorter and straighter.

Edge weights also play a role when nodes are constrained to the same rank.
Edges with non-zero weight between these nodes are aimed across the rank in the same direction (left-to-right, or top-to-bottom in a rotated drawing) as far as possible. This fact may be exploited to adjust node ordering by placing invisible edges (style="invis") where needed.

The end points of edges adjacent to the same node can be constrained using the samehead and sametail attributes. Specifically, all edges with the same head and the same value of samehead are constrained to intersect the head node at the same point. The analogous property holds for tail nodes and sametail.

During rank assignment, the head node of an edge is constrained to be on a higher rank than the tail node. If the edge has constraint=false, however, this requirement is not enforced.

In certain circumstances, the user may desire that the end points of an edge never get too close. This can be obtained by setting the edge's minlen attribute.

[7] Recall that the minimum rank occurs at the top of a drawing.
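The equivalence classes induced by rank=same constraints, described above, behave like a classic union-find (disjoint-set) problem: each rank=same subgraph merges its nodes into one class. The sketch below models that behavior in Python; it is an illustration, not dot's actual implementation, and the node names are invented.

```python
class UnionFind:
    """Minimal union-find to model the rank equivalence classes
    induced by rank=same subgraphs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

uf = UnionFind()
# Two hypothetical rank=same subgraphs: {A, B} and {C, B}.
for group in [["A", "B"], ["C", "B"]]:
    for node in group[1:]:
        uf.union(group[0], node)

print(uf.find("A") == uf.find("C"))  # True: A, B and C must share one rank
```

Because B appears in both subgraphs, all three nodes end up in a single class, exactly as the text describes.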
Distance-Constraint Reachability Computation in Uncertain Graphs

Ruoming Jin† Lin Liu† Bolin Ding‡ Haixun Wang§
†Kent State University ‡UIUC §Microsoft Research Asia
†{jin,liu}@, ‡bding3@, §haixunw@

ABSTRACT

Driven by emerging network applications, querying and mining uncertain graphs has become increasingly important. In this paper, we investigate a fundamental problem concerning uncertain graphs, which we call the distance-constraint reachability (DCR) problem: given two vertices s and t, what is the probability that the distance from s to t is less than or equal to a user-defined threshold d in the uncertain graph? Since this problem is NP-hard, we focus on efficiently and accurately approximating DCR online. Our main results include two new estimators for the probabilistic reachability. One is a Horvitz-Thompson type estimator based on an unequal probability sampling scheme, and the other is a novel recursive sampling estimator, which effectively combines a deterministic recursive computational procedure with a sampling process to boost the estimation accuracy. Both estimators can produce much smaller variance than the direct sampling estimator, which considers each trial to be either 1 or 0. We also present methods to make these estimators more computationally efficient. A comprehensive experimental evaluation on both real and synthetic datasets demonstrates the efficiency and accuracy of our new estimators.

1. INTRODUCTION

Driven by emerging network applications, querying and mining uncertain graphs has become increasingly important [19, 29, 30]. In this paper, we investigate a fundamental research problem in uncertain graphs: the distance-constraint reachability (DCR) query problem. In a deterministic directed graph, the reachability query, which asks whether one vertex can reach another, is the basis for a variety of database (XML/RDF) and network applications (e.g., social and biological networks) [15, 27]. For uncertain graphs, reachability is not a simple Yes/No question, but instead, a
probabilistic one. In the most common uncertain graph model, edges are independent of one another, and each edge is associated with a probability that indicates the likelihood of its existence [19, 29]. This gives rise to using the possible world semantics to model uncertain graphs [19, 1]. A possible graph of an uncertain graph G is a possible instance of G. A possible graph contains a subset of the edges of G, and it has a weight which is the product of the probabilities of all the edges it has. The reachability from vertex s to vertex t is expressed as the probability that s can reach t in all the possible graphs of G.

Consider a simple example in Fig. 1. We show an uncertain graph G and three of its possible graphs G1, G2 and G3, each with a weight. We can see that s can reach t in G1 and G2 but not in G3. If we enumerate all the possible graphs of G and add up the weights of those possible graphs where s can reach t, we get the probability that s can reach t in G (the probability is 0.5104).

Figure 1: Running Example. (a) Uncertain graph G; (b) G1 with weight 0.0009072; (c) G2 with weight 0.0009072; (d) G3 with weight 0.0006048.

Finding the (shortest-path) distance between two nodes is another important operation in uncertain graphs [28, 19]. The shortest-path distance is a key factor in determining the influence or relationship between two vertices in a graph. Generally, the smaller the distance, the stronger the influence, trust, or relationship [13, 23]. Therefore, in many applications, we are only interested in the reachability between two nodes if their distance is under a given threshold [28]. Taking the distance measure into consideration, we define a distance-constraint reachability (DCR) query as follows: given two vertices s and t in an uncertain graph G, what is the probability that the distance from s to t is less than or equal to a user-defined threshold d in the possible graphs of G? For the example in Fig. 1, if the threshold d is selected to be 2, then we consider that s cannot reach t in G2 (under this distance constraint).
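The possible-world computation described above can be carried out by brute force on small instances: enumerate every subset of edges, weight it by the product of edge (non-)existence probabilities, and add up the weights of worlds where s reaches t. The sketch below uses an invented three-edge graph, not the graph of Fig. 1, so the resulting probability differs from the 0.5104 quoted above.

```python
from itertools import product

def reach_probability(edges, probs, s, t):
    """Sum the weights of all 2^m possible graphs in which s can reach t.
    edges: list of (u, v) directed edges; probs: existence probability per edge."""
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        weight = 1.0
        adj = {}
        for (u, v), p, present in zip(edges, probs, mask):
            weight *= p if present else 1.0 - p
            if present:
                adj.setdefault(u, set()).add(v)
        # DFS reachability test in this possible world
        stack, seen = [s], {s}
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if t in seen:
            total += weight
    return total

# Toy uncertain graph: s->t, s->a, a->t, each existing with probability 0.5.
edges = [("s", "t"), ("s", "a"), ("a", "t")]
print(reach_probability(edges, [0.5, 0.5, 0.5], "s", "t"))  # 0.625
```

The exponential 2^m enumeration is exactly why the paper pursues estimators instead of exact computation on real graphs.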
The importance of distance-constraint reachability (DCR) queries is multi-fold. First, DCR queries can contribute to a wide range of real-world applications, ranging from social network analysis to biological networks to ontology [23, 5, 4, 10, 11, 18]. For instance, in a trust social network, the trust ranking between any two persons can be formulated as a distance-constraint reachability problem; and in a protein-protein interaction network, a DCR query can be applied to compute the functional similarity between two proteins and the chance that they belong to a common protein complex []. Second, DCR is a core operator which forms the basis of other, more advanced queries. For instance, in the recent k-Nearest Neighbor query studied in uncertain graphs, DCR operators from the query center s to its surrounding vertices are repetitively applied [19]. Finally, simple reachability is a special case of distance-constraint reachability (considering the case where the threshold d is larger than the length of the longest path in the uncertain graph G, or simply the sum of all the edge weights in G). The distance constraint can provide more informative results on top of simple reachability.

1.1 Problem Statement

Uncertain Graph Model: Consider an uncertain directed graph G = (V, E, p, w), where V is the set of vertices, E is the set of edges, p : E → (0, 1] is a function that assigns each edge e a probability that indicates the likelihood of e's existence, and w : E → (0, ∞) associates each edge with a weight (length). Note that we assume the existence of an edge e is independent of any other edge.
In our example (Figure 1), we assume each edge has unit length (unit weight). Let G = (V_G, E_G) be the possible graph which is realized by sampling each edge in G according to the probability p(e) (denoted as G ⊑ G). Clearly, we have E_G ⊆ E, and the possible graph G has sampling probability

Pr[G] = ∏_{e ∈ E_G} p(e) · ∏_{e ∈ E \ E_G} (1 − p(e)).

There are a total of 2^m possible graphs (for each edge e, there are two cases: e exists in G or not). In our example (Figure 1), graph G has 2^9 possible graphs, and as an example of the graph sampling probability, we have

Pr[G1] = p(s,a) p(a,b) p(a,t) p(s,c) (1 − p(s,b)) (1 − p(b,t)) (1 − p(s,c)) (1 − p(b,c)) (1 − p(c,b))
       = 0.5 × 0.3 × 0.5 × 0.7 × (1 − 0.2) × (1 − 0.6) × (1 − 0.1) × (1 − 0.4) × (1 − 0.9)
       = 0.0009072

Distance-Constraint Reachability: A path from vertex v_0 to vertex v_p in G is a vertex (or edge) sequence (v_0, v_1, ..., v_p) such that (v_i, v_{i+1}) is an edge in E_G (0 ≤ i ≤ p − 1). A path is simple if no vertex appears more than once in the sequence. We are concerned with simple paths throughout the paper. Given two vertices s and t in G, a path starting from s and ending at t is referred to as an s-t-path. We say vertex t is reachable from vertex s in G if there is an s-t-path in G. The distance or length of an s-t-path is the sum of the lengths of all the edges on the path. The distance from s to t in G, denoted as dis(s,t|G), is the length of the shortest path from s to t, i.e., the minimal distance over all s-t-paths.
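The Pr[G] product above is straightforward to compute. A minimal sketch follows; since the extracted factor list for Pr[G1] is ambiguous about exact edge names, the edge labels below are hypothetical stand-ins chosen only so that the numeric factors match the 0.0009072 example.

```python
def possible_graph_prob(edge_prob, present):
    """Weight of one possible graph: product of p(e) over present edges
    times (1 - p(e)) over absent edges (edges are independent)."""
    prob = 1.0
    for e, p in edge_prob.items():
        prob *= p if e in present else (1.0 - p)
    return prob

# Hypothetical nine-edge probability assignment reproducing the factors above.
edge_prob = {"sa": 0.5, "ab": 0.3, "at": 0.5, "sc": 0.7,
             "sb": 0.2, "bt": 0.6, "ct": 0.1, "bc": 0.4, "cb": 0.9}
g1 = {"sa", "ab", "at", "sc"}          # edges present in the possible graph G1
print(round(possible_graph_prob(edge_prob, g1), 7))  # 0.0009072
```

With nine independent edges there are 2^9 = 512 possible graphs, and these per-graph weights sum to 1 over all of them.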
Given distance constraint d, we say vertex t is d-reachable from s if the distance from s to t in G is less than or equal to d.

DEFINITION 1. (s-t distance-constraint reachability) The problem of computing s-t distance-constraint reachability in an uncertain graph G is to compute the probability of the possible graphs G in which vertex t is d-reachable from s, where d is the distance constraint. Specifically, let

I^d_{s,t}(G) = 1 if dis(s,t|G) ≤ d, and 0 otherwise.

Then the s-t distance-constraint reachability in uncertain graph G with respect to parameter d is defined as

R^d_{s,t}(G) = Σ_{G ⊑ G} I^d_{s,t}(G) · Pr[G].   (1)

Note that the problem of computing s-t distance-constraint reachability is a generalization of computing s-t reachability without the distance constraint, which is often referred to as the two-point reliability problem [20]. Simply speaking, it computes the total sampling probability of possible graphs G ⊑ G in which vertex t is reachable from vertex s. Using the aforementioned distance-constraint reachability notation, we may simply choose an upper bound such as W = Σ_{e ∈ E} w(e) (the total weight of the graph), and then R^W_{s,t}(G) is equivalent to the simple s-t reachability.

Computational Complexity and Estimation Criteria: The simple s-t reachability problem is known to be #P-complete [25, 6], even for special cases, e.g., planar graphs and DAGs, and so is its generalization, s-t distance-constraint reachability. Thus, we cannot expect the existence of a polynomial-time algorithm to find the exact value of R^d_{s,t}(G) unless P = NP. The distance-constraint reachability problem is much harder than the simple s-t reachability problem, as we have to consider the shortest-path distance between s and t in all possible graphs. Indeed, the existing s-t reachability computing approaches have mainly focused on small graphs (on the order of tens of vertices) and cannot be directly extended to our problem (Section 5). Given this, the key problem this paper addresses is how to efficiently and accurately
approximate the s-t distance-constraint reachability online.

Now, let us look at the key criteria for evaluating the quality of an approximation approach (or the quality of an estimator). Let R̂ be a general estimator for R^d_{s,t}(G). Intuitively, R̂ should be as close as possible to R^d_{s,t}(G). Mathematically, this property can be captured by the mean squared error (MSE), E(R̂ − R^d_{s,t}(G))^2, which measures the expected difference between an estimator and the true value. It can be decomposed into two parts:

E(R̂ − R^d_{s,t}(G))^2 = Var(R̂) + (E[R̂] − R^d_{s,t}(G))^2 = Var(R̂) + (Bias(R̂))^2

An estimator is unbiased if the expectation of the estimator is equal to the true value (Bias(R̂) = 0), i.e., E(R̂) = R^d_{s,t}(G) for our problem. The variance of the estimator, Var(R̂), measures the average deviation from its expectation. For an unbiased estimator, the variance is simply the MSE. In other words, the variance of an unbiased estimator is the indicator for measuring its accuracy. In addition, the variance is also frequently used for constructing the confidence interval of an estimate, and the smaller the variance, the more accurate the confidence interval estimate [24]. All estimators studied in this paper will be proven to be unbiased estimators of R^d_{s,t}(G). Thus, the key criterion to discriminate among them is their variance [24, 12].

Besides the accuracy of the estimator, its computational efficiency is also important. This is especially so for answering s-t distance-constraint reachability queries online. To sum up, in this paper, our goal is to develop an unbiased estimator of R^d_{s,t}(G) with minimal variance and low computational cost.
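For intuition, Definition 1 can be evaluated by brute force on tiny instances: enumerate every edge subset, weight it by Pr[G], and accumulate the weight whenever the shortest s-t distance is within d. The sketch below uses an invented three-edge unit-length graph; its 2^m cost is exactly what the estimators developed later avoid.

```python
from itertools import combinations
import heapq

def shortest_dist(edges, s, t):
    """Dijkstra over a list of (u, v, length) edges; inf if t is unreachable."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    dist, heap = {s: 0.0}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist.get(t, float("inf"))

def exact_dcr(edges, probs, s, t, d):
    """R^d_{s,t}: sum Pr[G] over all 2^m possible graphs with dis(s,t|G) <= d."""
    m, total = len(edges), 0.0
    for k in range(m + 1):
        for idx in combinations(range(m), k):
            present = [edges[i] for i in idx]
            pr = 1.0
            for i in range(m):
                pr *= probs[i] if i in idx else 1.0 - probs[i]
            if shortest_dist(present, s, t) <= d:
                total += pr
    return total

# Toy graph: s->t, s->a, a->t, unit lengths, each edge with probability 0.5.
edges = [("s", "t", 1), ("s", "a", 1), ("a", "t", 1)]
print(exact_dcr(edges, [0.5, 0.5, 0.5], "s", "t", 2))  # 0.625
print(exact_dcr(edges, [0.5, 0.5, 0.5], "s", "t", 1))  # 0.5 (only the direct edge fits)
```

Tightening d from 2 to 1 excludes the two-hop path, showing how the distance constraint changes the answer.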
Minimal DCR Equivalent Subgraph: Before we proceed, we note that given vertices s and t, only a subset of the vertices and edges in G is needed to compute the s-t distance-constraint reachability. Specifically, given vertices s and t, the minimal equivalent DCR subgraph is G_s = (V_s, E_s, p, w) ⊆ G, where

V_s = {v ∈ V | dis(s,v|G) + dis(v,t|G) ≤ d},
E_s = {e = (u,v) ∈ E | dis(s,u|G) + w(e) + dis(v,t|G) ≤ d}.

Basically, V_s and E_s contain those vertices and edges that appear on some s-t path whose distance is less than or equal to d. Clearly, we have R^d_{s,t}(G_s) = R^d_{s,t}(G). A fast linear method utilizing BFS (Breadth-First Search) can help extract the minimal equivalent DCR subgraph []. Since we only need to work on G_s, in the remainder of the paper, we simply use G for G_s when no confusion can arise.

2. BASIC MONTE-CARLO METHODS

In this section, we introduce two basic Monte-Carlo methods for estimating R^d_{s,t}(G), the s-t distance-constraint reachability.

2.1 Direct Sampling Approach

A basic approach to approximating the s-t distance-constraint reachability is sampling: 1) we first sample n possible graphs G1, G2, ..., Gn of G according to the edge probability p; and 2) we then compute the shortest-path distance in each sample graph G_i, and thus I^d_{s,t}(G_i). Given this, we have the basic sampling estimator R̂_B:

R^d_{s,t}(G) ≈ R̂_B = (Σ_{i=1}^{n} I^d_{s,t}(G_i)) / n

The basic sampling estimator R̂_B is an unbiased estimator of the s-t distance-constraint reachability, i.e., E(R̂_B) = R^d_{s,t}(G). Its variance can be written as [12]

Var(R̂_B) = (1/n) R^d_{s,t}(G)(1 − R^d_{s,t}(G)) ≈ (1/n) R̂_B (1 − R̂_B)

The basic sampling method can be rather computationally expensive. Even though we only need to work on the minimal DCR equivalent subgraph G_s, its size can still be large, and in order to generate a possible graph G, we have to toss a coin for each edge in E_s. In addition, in each sampled graph G, we have to invoke a shortest-path distance computation to compute I^d_{s,t}(G), which again is costly.

We may speed up the basic sampling method by extending the
shortest-path distance method, such as Dijkstra's or the A* algorithm [21], for sampling estimation. Recall that in both algorithms, when a new vertex v is visited, we have to immediately visit all its neighbors (corresponding to visiting all outgoing edges of v) in order to maintain their corresponding estimated shortest-path distances from the source vertex s. Given this, we need not sample all edges at the beginning; instead, we sample an edge only when it will be used in the computational procedure. Specifically, only when a vertex is just visited do we sample all its adjacent (outgoing) edges; then we perform the distance update operations for the end vertices of those sampled edges in the graph. We stop this process either when the target vertex t is reached or when the minimal shortest distance of every unvisited vertex is more than threshold d. A similar procedure based on Dijkstra's algorithm is applied in [19] for discovering the K nearest neighbors in an uncertain graph.

2.2 Path-Based Approach

In this subsection, we introduce the path- (or cut-) based approach for estimating R^d_{s,t}(G). To facilitate our discussion, we first formally introduce the d-path. An s-t path in G with length less than or equal to distance constraint d is referred to as a d-path between s and t. The d-path is closely related to the s-t distance-constraint reachability: if vertex t is d-reachable from vertex s in a graph G, i.e., dis(s,t|G) ≤ d, then there is a d-path between s and t. If not, i.e., dis(s,t|G) > d, then for any s-t path in G, its length is greater than d (there is no d-path).

Given this, the complete set of all d-paths in G (the complete possible graph with respect to G, which includes all the edges in G), denoted as P = {P1, P2, ..., PL}, can be used for computing the s-t distance-constraint reachability:

R^d_{s,t}(G) = Pr[P1 ∨ P2 ∨ ··· ∨ PL]
            = Σ_i Pr[Pi] − Σ_{i<j} Pr[Pi ∩ Pj] + ··· + (−1)^{L+1} Pr[P1 ∩ P2 ∩ ··· ∩ PL]

Given this, we can apply the Monte-Carlo algorithm proposed in [16] to estimate Pr[P1 ∨ P2 ∨ ··· ∨ PL] within absolute error ε with
probability at least 1 − δ.

To sum up, the path-based estimation approach contains two steps: 1) enumerating all d-paths from s to t in G (see Subsection A.1); 2) estimating Pr[P1 ∨ P2 ∨ ··· ∨ PL] using the Monte-Carlo algorithm of [16]. This estimator, denoted as R̂_P, is an unbiased estimator of R^d_{s,t}(G) [16], and its variance can be written as [12]:

Var(R̂_P) = (1/n) R^d_{s,t}(G) (Σ_{i=1}^{L} Pr[Pi] − R^d_{s,t}(G))

Thus, depending on whether Σ_{i=1}^{L} Pr[Pi] is bigger or smaller than 1, the variance of R̂_P can be bigger or smaller than that of R̂_B.

The key issue with this approach is the computational requirement to enumerate and store all d-paths between s and t. This can be both computationally and memory expensive (the number of d-paths can be exponential). In addition, we note that instead of computing d-paths, we can compute all d-cuts for R^d_{s,t}(G). An edge set C_d of G is a d-cut between s and t if G \ C_d has a distance greater than d, i.e., dis(s,t|G \ C_d) > d. A minimal d-cut is a d-cut in which removing any edge from the edge set will introduce a d-path from s to t. Let {C1, C2, ..., CK} be the complete set of minimal d-cuts in G. Then it is easy to see that R^d_{s,t}(G) = 1 − Pr[C1 ∨ C2 ∨ ··· ∨ CK].
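When the number of d-paths L is small, the inclusion-exclusion expansion above can even be evaluated exactly, since edges are independent and the probability of an intersection of path events is the product of p(e) over the union of their edge sets. A sketch (the two-path toy instance is invented, with all edge probabilities 0.5):

```python
from itertools import combinations

def dcr_by_paths(d_paths, edge_prob):
    """Exact Pr[P1 v ... v PL] by inclusion-exclusion over d-path events.
    Each d-path is a frozenset of edge names. Only feasible for small L."""
    L, total = len(d_paths), 0.0
    for k in range(1, L + 1):
        sign = (-1) ** (k + 1)
        for subset in combinations(d_paths, k):
            union = frozenset().union(*subset)       # edges all paths in the subset need
            pr = 1.0
            for e in union:
                pr *= edge_prob[e]
            total += sign * pr
    return total

# Toy graph from before: d-paths {st} and {sa, at}, all edge probabilities 0.5.
edge_prob = {"st": 0.5, "sa": 0.5, "at": 0.5}
paths = [frozenset({"st"}), frozenset({"sa", "at"})]
print(dcr_by_paths(paths, edge_prob))  # 0.625
```

This matches the brute-force value for the same toy graph; the 2^L term count is why the paper resorts to the Monte-Carlo algorithm of [16] for realistic L.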
Thus, similar to the path-based approach, a two-step procedure can be used for computing R^d_{s,t}(G) based on d-cuts. However, since the number of minimal d-cuts can be very large, using d-cuts is as expensive as using d-paths for R^d_{s,t}(G).

Can we derive a faster and more accurate estimator for R^d_{s,t}(G) than these two estimators, R̂_B and R̂_P? In the next section, we provide a positive answer to this question.

3. NEW SAMPLING ESTIMATORS

In this section, we introduce new estimators based on unequal probability sampling (UPS) and an optimal recursive sampling estimator. To achieve that, we first introduce a divide-and-conquer strategy which serves as the basis of the fast computation of s-t distance-constraint reachability (Subsection 3.1).

3.1 A Divide-and-Conquer Exact Algorithm

Computing the exact s-t distance-constraint reachability R^d_{s,t}(G) is the basis for approximating it quickly and accurately. The naive algorithm to compute R^d_{s,t}(G) is to enumerate each G ⊑ G and, in each G, compute the shortest-path distance between s and t to test whether dis(s,t|G) ≤ d. The total running time of this algorithm is O(2^{|E|}(|E| + |V| log |V|)), assuming Dijkstra's algorithm is used for the distance computation[1]. Here, we introduce a much faster exact algorithm to compute R^d_{s,t}(G). Though this algorithm still has exponential computational complexity, it significantly reduces the search space by avoiding enumerating all 2^m possible graphs of G.
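As a preview of the divide-and-conquer procedure developed in the remainder of this subsection (Algorithm 1), here is a compact Python sketch. The three-edge unit-length toy graph and the naive pick-the-first-remaining-edge rule are invented for illustration; the d-path and d-cut tests are done by shortest-path computations over the included edges E1 and over E \ E2, respectively.

```python
import heapq

def algorithm_R(edges, probs, s, t, d, E1=frozenset(), E2=frozenset()):
    """Recursive exact R^d_{s,t} via prefix groups (a sketch of Algorithm 1).
    edges: list of (u, v, length); E1/E2: index sets of included/excluded edges."""
    def dist_using(idx_set):
        # Dijkstra restricted to the given edge indices
        adj = {}
        for i in idx_set:
            u, v, w = edges[i]
            adj.setdefault(u, []).append((v, w))
        dist, heap = {s: 0.0}, [(0.0, s)]
        while heap:
            dd, u = heapq.heappop(heap)
            if dd > dist.get(u, float("inf")):
                continue
            for v, w in adj.get(u, []):
                if dd + w < dist.get(v, float("inf")):
                    dist[v] = dd + w
                    heapq.heappush(heap, (dd + w, v))
        return dist.get(t, float("inf"))

    all_idx = frozenset(range(len(edges)))
    if dist_using(E1) <= d:                 # E1 already contains a d-path
        return 1.0
    if dist_using(all_idx - E2) > d:        # E2 already contains a d-cut
        return 0.0
    e = next(iter(all_idx - E1 - E2))       # any remaining uncertain edge
    return (probs[e] * algorithm_R(edges, probs, s, t, d, E1 | {e}, E2)
            + (1 - probs[e]) * algorithm_R(edges, probs, s, t, d, E1, E2 | {e}))

edges = [("s", "t", 1), ("s", "a", 1), ("a", "t", 1)]
print(algorithm_R(edges, [0.5, 0.5, 0.5], "s", "t", 2))  # 0.625
```

The recursion prunes whole prefix groups at once, so it typically touches far fewer than 2^m cases even though the worst case remains exponential.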
The basic idea is to recursively partition all 2^m possible graphs of G into groups so that the reachability of each group can be computed easily. To specify the grouping of possible graphs, we introduce the following notation:

DEFINITION 2. ((E1,E2)-prefix group) The (E1,E2)-prefix group of possible graphs from uncertain graph G, denoted as G(E1,E2), includes all the possible graphs of G which include all edges in edge set E1 ⊆ E and do not contain any edge in edge set E2 ⊆ E, i.e.,

G(E1,E2) = {G ⊑ G | E1 ⊆ E_G ∧ E2 ∩ E_G = ∅}

[1] Here, G is actually the minimal DCR equivalent subgraph G_s.

We refer to E1 and E2 as the inclusion edge set and the exclusion edge set, respectively. Note that for a nonempty prefix group, the inclusion edge set E1 and the exclusion edge set E2 are disjoint (E1 ∩ E2 = ∅). In Figure 1, if we want to specify those possible graphs which all include edge (s,a) and do not contain edges (s,b) and (b,t), then we may refer to those graphs as the ({(s,a)}, {(s,b),(b,t)})-prefix group. To facilitate our discussion, we introduce the generating probability of the prefix group G(E1,E2) as:

Pr[G(E1,E2)] = ∏_{e ∈ E1} p(e) · ∏_{e ∈ E2} (1 − p(e))

This indicates the overall sampling probability of any possible graph in the prefix group. Given this, the s-t distance-constraint reachability of an (E1,E2)-prefix group is defined as

R^d_{s,t}(G(E1,E2)) = ( Σ_{G ∈ G(E1,E2)} I^d_{s,t}(G) · Pr[G] ) / Pr[G(E1,E2)]   (2)

Basically, it is the overall likelihood that t is d-reachable from s conditional on the fixed prefix G(E1,E2). It is easily derived that R^d_{s,t}(G) = R^d_{s,t}(G(∅,∅)). The following lemma characterizes the s-t distance-constraint reachability of (E1,E2)-prefix groups and forms the basis for its efficient computation. Its proof is omitted for simplicity.

LEMMA 1. (Factorization Lemma) For any (E1,E2)-prefix group of uncertain graph G and any uncertain edge e ∈ E \ (E1 ∪ E2),

R^d_{s,t}(G(E1,E2)) = p(e) R^d_{s,t}(G(E1 ∪ {e}, E2)) + (1 − p(e)) R^d_{s,t}(G(E1, E2 ∪ {e}))

In addition, we note that for any (E1,E2)-prefix group of uncertain graph G, if E1 contains a d-path from s to t, then R^d_{s,t}(G(E1,E2)) = 1; if
E2contains a d-cut between s and t,then,R d s,t(G(E1,E2))= 0.Also,E1containing a d-path and E2containing a d-cut cannot be both true at the same time though both can be false at the same time.Algorithm1R(G,E1,E2)Parameter:G:Uncertain Graph;Parameter:E1:Inclusion Edge List;Parameter:E2:Exclusion Edge List;1:if E1contains a d-path from s to t then2:return1;3:else if E2contains a d-cut from s to t then4:return0;5:end if6:select an edge e∈E\(E1∪E2){Find a remaining uncertain edge} 7:return p(e)R(G,E1∪{e},E2)+(1−p(e))R(G,E1,E2∪{e})Algorithm1describes the divide-and-conquer computation pro-cedure for R d s,t(G)based on Lemmas1.To compute R d s,t(G),we will invoke the procedure R(G,∅,∅).Based on the factorization lemma(Lemma1),this procedurefirst partitions the entire set of possible graphs of uncertain graph G into two parts(prefix groups) using any edge e in G:R d s,t(G(∅,∅))=p(e)R d s,t(G({e},∅))+(1−p(e))R d s,t(G(∅,{e})). Then,it applies the same approach to partition each prefix group of possible graphs recursively(Line6−7)until prefix group G(E1,E2) with either E1containing a d-path or E2containing a d-cut(Line 1−5).The computational process of the recursive procedure R can be represented in a full binary enumeration tree.In the tree,each node corresponds to a prefix group G(E1,E2)(also an invoke of the pro-cedure R).Each internal node has two children,one correspond-ing on including an uncertain edge e,another excluding it.In other words,the prefix group is partitioned into two new prefix groups: G(E1∪{e},E2)and G(E1,E2∪{e}).Further,we may consider each edge in the tree is weighted with probability p(e)for edge inclusion and1−p(e)for edge exclusion.In addition,the leaf node can be classified into two categories,L which contains all the leaf nodes with E1containing a d-path,and L which contains the remaining leaf nodes,i.e.,all those leaf nodes with E2include a d-cut.Figure2(a)illustrates the enumeration tree.The computational complexity of this procedure is 
determined by the average recursive depth (average prefix length), i.e., the average number of edges $|E_1 \cup E_2|$ we have to select in order to determine whether $t$ is $d$-reachable from $s$ for all the possible graphs in the prefix group. If the average recursive depth is $a$, then a total of $O(2^a)$ prefix groups needs to be enumerated, which can be significantly smaller than the complete $O(2^m)$ possible graphs of $\mathcal{G}$. In Section A, we introduce an approach for selecting the uncertain edge $e$ (Line 6) for each prefix group $\mathcal{G}(E_1,E_2)$ so as to minimize the average recursive depth.

In the following two subsections, we discuss how to transform the exact reachability computation algorithm R into an accurate approximation scheme for $R^d_{s,t}(\mathcal{G})$.

3.2 Tree-based Estimation and Unequal Probability Sampling Framework

In this subsection, we study an estimation framework for $R^d_{s,t}(\mathcal{G})$ using its recursive binary enumeration tree representation and an unequal probability sampling scheme [24].

Unequal Probability Sampling (UPS) Framework: To estimate $R^d_{s,t}(\mathcal{G})$, we apply the unequal probability sampling scheme. We consider each leaf node in the enumeration tree to be associated with a weight, namely the generating probability of the corresponding prefix group, $\Pr[\mathcal{G}(E_1,E_2)]$. Next, we sample each leaf node $\mathcal{G}(E_1,E_2)$ with probability $q(\mathcal{G}(E_1,E_2))$, where the leaf sampling probabilities $q(\mathcal{G}(E_1,E_2))$ sum to 1. Note that in general the leaf sampling probability $q$ can differ from the leaf weight in the unequal probability sampling framework.

Given this, we now study the well-known unequal probability sampling estimator, the Hansen-Hurwitz estimator [24]: assuming we sample $n$ leaf nodes $1, 2, \dots, n$ in the enumeration tree, and letting $Pr_i$ be the weight associated with the $i$-th sampled leaf node and $q_i$ its leaf sampling probability, the Hansen-Hurwitz estimator (denoted $\hat{R}_{HH}$) for $R^d_{s,t}(\mathcal{G})$ is:

$$\hat{R}_{HH} = \frac{1}{n} \sum_{i=1}^{n} \frac{Pr_i \, I^d_{s,t}(G_i)}{q_i} \qquad (3)$$

In other words, we may consider that each leaf node in $L$ contributes $Pr_i$ and each leaf node in $\bar{L}$ contributes 0 to the estimation. It is easy to show that the Hansen-Hurwitz estimator $\hat{R}_{HH}$ is an unbiased estimator of $R^d_{s,t}(\mathcal{G})$, and its variance can be derived as

$$Var(\hat{R}_{HH}) = \frac{1}{n}\Big(\sum_{i \in L} q_i \Big(\frac{Pr_i}{q_i} - R^d_{s,t}(\mathcal{G})\Big)^2 + \sum_{i \in \bar{L}} q_i \, R^d_{s,t}(\mathcal{G})^2\Big)$$

Applying the Lagrange method, we can easily find that the optimal sampling probability for minimal variance $Var(\hat{R}_{HH})$ is achieved when $q_i = Pr_i$, and the minimal variance is $Var(\hat{R}_{HH}) = \frac{1}{n} R^d_{s,t}(\mathcal{G})(1 - R^d_{s,t}(\mathcal{G}))$. This result suggests that the best leaf sampling probability $q$ for minimizing the variance of $\hat{R}_{HH}$ is the one equal to the leaf weight (the generating probability of the prefix group) in $L$.

Figure 2: Divide-and-conquer method. (a) Enumeration tree of the recursive computation of $R^d_{s,t}(\mathcal{G})$; (b) divide and conquer.

Given this, we can sample a leaf node in the enumeration tree as follows: simply toss a coin at each internal node in the tree to determine whether edge $e$ should be included (in $E_1$) with probability $p(e)$ or excluded (in $E_2$) with probability $1-p(e)$; continue this process until a leaf node is reached. Basically, we perform a random walk starting from the root node and stopping at a leaf node of the enumeration tree, and at each internal node we randomly select the edge based on the $p(e)$ defined in the uncertain graph.

Interestingly, we note that this UPS estimator is equivalent to the direct sampling estimator, as each leaf node is counted as either 1 or 0 (like a Bernoulli trial): $\hat{R}_{HH} = \hat{R}_B$. In other words, the direct sampling scheme is simply a special (and optimal) case of the Hansen-Hurwitz estimator! This leads to the following observation: for any optimal Hansen-Hurwitz estimator ($\hat{R}_{HH}$) or direct sampling estimator ($\hat{R}_B$), the variance is determined only by $n$ and has no relationship to the enumeration tree size. This seems rather counter-intuitive, as the smaller the tree size (or the smaller the number of leaf nodes), the better chance (information) we have for estimating $R^d_{s,t}(\mathcal{G})$.

A Better UPS Estimator: We now introduce another UPS estimator, the Horvitz-Thompson estimator ($\hat{R}_{HT}$), which can provide smaller variance than the Hansen-Hurwitz estimator $\hat{R}_{HH}$ and the direct sampling estimator $\hat{R}_B$ under mild conditions. Assume we sample $n$ leaf nodes in the enumeration tree and among them there are $l$ distinct ones $1, 2, \dots, l$ ($l$ is also referred to as the effective sample size). Let the inclusion probability $\pi_i$ be the probability that leaf $i$ is included in the sample, which is defined as $\pi_i = 1 - (1 - q_i)^n$, where $q_i$ is the leaf sampling probability. The Horvitz-Thompson estimator for $R^d_{s,t}(\mathcal{G})$ is:

$$\hat{R}_{HT} = \sum_{i=1}^{l} \frac{Pr_i \, I^d_{s,t}(G_i)}{\pi_i} \qquad (4)$$

Note that if $q_i$ is very small, then $\pi_i \approx n q_i$. The Horvitz-Thompson estimator $\hat{R}_{HT}$ is an unbiased estimator of the population total $R^d_{s,t}(\mathcal{G})$. Its variance can be derived as [24]

$$Var(\hat{R}_{HT}) = \sum_{i \in L} \Big(\frac{1-\pi_i}{\pi_i}\Big) Pr_i^2 + \sum_{i,j \in L,\, i \neq j} \Big(\frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j}\Big) Pr_i \, Pr_j,$$

where $\pi_{ij}$ is the probability that both leaves $i$ and $j$ are included in the sample: $\pi_{ij} = 1 - (1-q_i)^n - (1-q_j)^n + (1 - q_i - q_j)^n$. Using Taylor expansions and the Lagrange method, we can find that the variance is approximately minimized when $q_i = Pr_i$. This basically suggests that the same leaf sampling strategy used for the Hansen-Hurwitz estimator (the random walk from the root to a leaf) can be applied to the Horvitz-Thompson estimator as well. However, different from the Hansen-Hurwitz estimator, the Horvitz-Thompson estimator uses each distinct leaf only once. Though in general the variances of the Hansen-Hurwitz estimator and the Horvitz-Thompson estimator are not analytically comparable, in our tree-based sampling framework and under reasonable approximation, we are able to prove that the latter has smaller variance.

THEOREM 1. ($Var(\hat{R}_{HT}) \le Var(\hat{R}_{HH})$) When $n Pr_i \ll 1$ for every sampled leaf node $i$, $Var(\hat{R}_{HH}) - Var(\hat{R}_{HT}) = O\big(\sum_{i \in L} Pr_i^2\big)$.
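As a concrete illustration of this sampling framework, the following minimal Python sketch draws $n$ random-walk leaf samples (so that $q_i = Pr_i$) and computes both the Hansen-Hurwitz (direct sampling) and Horvitz-Thompson estimates. It is not the paper's implementation: it assumes unit-length edges (a $d$-path is a path of at most $d$ hops), represents the uncertain graph as a dict mapping directed edges to existence probabilities, and the function names (`hops`, `sample_leaf`, `estimate`) are illustrative.

```python
import random
from collections import deque

def hops(edges, s, t):
    """BFS hop distance from s to t over the given edge set (inf if unreachable)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist.get(t, float("inf"))

def sample_leaf(prob, s, t, d, rng):
    """Random walk from the root of the enumeration tree to a leaf: decide
    edges one at a time until E1 holds a d-path or E2 holds a d-cut.
    Returns (leaf key, generating probability Pr_i, indicator I)."""
    E = sorted(prob)            # fixed edge order, so identical leaves match
    E1, E2, pr = [], [], 1.0
    for e in E:
        if hops(E1, s, t) <= d:                 # E1 already contains a d-path
            return (tuple(E1), tuple(E2)), pr, 1
        if hops(set(E) - set(E2), s, t) > d:    # E2 already contains a d-cut
            return (tuple(E1), tuple(E2)), pr, 0
        if rng.random() < prob[e]:              # include e with probability p(e)
            E1.append(e); pr *= prob[e]
        else:                                   # exclude e otherwise
            E2.append(e); pr *= 1 - prob[e]
    return (tuple(E1), tuple(E2)), pr, 1 if hops(E1, s, t) <= d else 0

def estimate(prob, s, t, d, n, seed=0):
    """Hansen-Hurwitz (= direct sampling) and Horvitz-Thompson estimates of
    R^d_{s,t} from n random-walk leaf samples, with q_i = Pr_i."""
    rng = random.Random(seed)
    hits, leaves = 0, {}
    for _ in range(n):
        key, pr_i, ind = sample_leaf(prob, s, t, d, rng)
        hits += ind
        leaves[key] = (pr_i, ind)   # each distinct leaf is kept once
    r_hh = hits / n
    r_ht = sum(pr_i * ind / (1 - (1 - pr_i) ** n)
               for pr_i, ind in leaves.values())
    return r_hh, r_ht
```

On a hypothetical four-edge "diamond" graph with paths $s \to a \to t$ and $s \to b \to t$, all $p(e) = 0.5$ and $d = 2$, the true reachability is $1 - (1 - 0.25)^2 = 0.4375$, and both estimates converge to it as $n$ grows.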
The proof of the theorem can be found in the complete technical report. This result suggests that for small sample size $n$ and/or when the generating probabilities of the leaf nodes are very small, the Horvitz-Thompson estimator is guaranteed to have smaller variance. In Section 4, the experimental results will further demonstrate the effectiveness of this estimator. One reason this estimator is effective is that it works directly on the distinct leaf nodes, which partly reflect the tree structure. In the next subsection, we introduce a novel recursive estimator that more aggressively utilizes the tree structure to minimize the variance.

3.3 Optimal Recursive Sampling Estimator

In this subsection, we explore how to reduce the variance based on the factorization lemma (Lemma 1). We then describe a novel recursive approximation procedure that combines the deterministic procedure with the sampling process to minimize the estimator variance.

Variance Reduction: Recall that for the root node of the enumeration tree, we have the following result based on the factorization lemma (Lemma 1):

$$R^d_{s,t}(\mathcal{G}) = p(e)\,R^d_{s,t}(\mathcal{G}(\{e\},\emptyset)) + (1-p(e))\,R^d_{s,t}(\mathcal{G}(\emptyset,\{e\}))$$

To facilitate our discussion, let $\tau = R^d_{s,t}(\mathcal{G})$, $\tau_1 = R^d_{s,t}(\mathcal{G}(\{e\},\emptyset))$, and $\tau_2 = R^d_{s,t}(\mathcal{G}(\emptyset,\{e\}))$. Now, instead of directly sampling all the leaf nodes from the root (as suggested in the last subsection), we consider estimating $\tau_1$ and $\tau_2$ independently and then combining them to estimate $\tau$. Specifically, for $n$ total leaf samples, we deterministically allocate $n_1$ of them to the left subtree (including edge $e$, for estimating $\tau_1$) and $n_2$ of them to the right subtree (excluding edge $e$, for estimating $\tau_2$).
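This recursive idea can be sketched in Python under the same illustrative assumptions as before (unit-length edges, a dict from directed edges to probabilities); the single-sample base case and the exact allocation rule below are simplifications for illustration, not the paper's procedure.

```python
import random
from collections import deque

def hops(edges, s, t):
    """BFS hop distance from s to t over the given edge set (inf if unreachable)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist.get(t, float("inf"))

def recursive_estimate(prob, s, t, d, n, E1=(), E2=(), rng=None):
    """Recursive stratified estimator: at each enumeration-tree node, split the
    sample budget n between the two subtrees in proportion p(e) : 1 - p(e), and
    combine the subtree estimates via the factorization lemma.  A budget of one
    degenerates to a single random-walk (direct) sample."""
    rng = rng or random.Random(0)
    E = sorted(prob)
    if hops(E1, s, t) <= d:                     # E1 contains a d-path
        return 1.0
    if hops(set(E) - set(E2), s, t) > d:        # E2 contains a d-cut
        return 0.0
    e = next(x for x in E if x not in E1 and x not in E2)
    p = prob[e]
    if n <= 1:                                  # leaf budget: one coin flip
        if rng.random() < p:
            return recursive_estimate(prob, s, t, d, 1, E1 + (e,), E2, rng)
        return recursive_estimate(prob, s, t, d, 1, E1, E2 + (e,), rng)
    n1 = max(1, min(n - 1, round(p * n)))       # deterministic allocation
    tau1 = recursive_estimate(prob, s, t, d, n1, E1 + (e,), E2, rng)
    tau2 = recursive_estimate(prob, s, t, d, n - n1, E1, E2 + (e,), rng)
    return p * tau1 + (1 - p) * tau2
```

Note that for a large enough budget $n$ relative to the enumeration tree, every internal node receives a positive allocation on both sides, no coin flips occur, and the recursion degenerates to the exact computation of Algorithm 1.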
Jiawei Han's Data Mining Lecture Slides 04
Bottom-up computation: BUC (Beyer & Ramakrishnan, SIGMOD'99)
H-cubing technique (Han, Pei, Dong & Wang: SIGMOD'01)
Star-cubing algorithm (Xin, Han, Li & Wah: VLDB'03)
Data Mining: Concepts and Techniques 12
7/31/2013
H-Cubing: Using H-Tree Structure
Bottom-up computation
Exploring an H-tree structure
If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
No simultaneous aggregation
Multi-Way Array Aggregation
Array-based "bottom-up" algorithm
Using multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
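The bottom-up, support-pruned style of cube computation that these slides attribute to BUC can be sketched as a toy in Python; this is an illustration (COUNT measure only, tuples as dicts, hypothetical `buc` signature), not the algorithm as published.

```python
def buc(tuples, dims, prefix=(), min_sup=2, out=None):
    """Minimal BUC-style iceberg cube with a COUNT measure: enumerate group-bys
    bottom-up (from the empty group-by toward finer ones), never expanding a
    partition whose support is below min_sup (Apriori-style pruning)."""
    if out is None:
        out = {}
    out[prefix] = len(tuples)                 # aggregate for the current cell
    for i, dim in enumerate(dims):
        parts = {}
        for row in tuples:                    # partition on this dimension
            parts.setdefault(row[dim], []).append(row)
        for val, part in parts.items():
            if len(part) >= min_sup:          # prune: refine only frequent cells
                buc(part, dims[i + 1:], prefix + ((dim, val),), min_sup, out)
    return out
```

Because a cell below min_sup is never refined, whole subtrees of the group-by lattice are skipped; this is exactly the iceberg pruning that, per the last slide, multi-way array aggregation cannot perform.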