Document Clustering Using Locality Preserving Indexing

Xiaofei He Department of Computer Science The University of Chicago 1100 East 58th Street, Chicago, IL 60637, USA Phone: (733) 288-2851 xiaofei@
Jiawei Han Department of Computer Science University of Illinois at Urbana Champaign 2132 Siebel Center, 201 N. Goodwin Ave, Urbana, IL 61801, USA Phone: (217) 333-6903 Fax: (217) 265-6494 hanj@
document clustering [28][27]. They model each cluster as a linear combination of the data points, and each data point as a linear combination of the clusters, and they compute the linear coefficients by minimizing the global reconstruction error of the data points using Non-negative Matrix Factorization (NMF). Thus, the NMF method still focuses on the global geometrical structure of the document space. Moreover, the iterative update method for solving the NMF problem is computationally expensive.

In this paper, we propose a novel document clustering algorithm based on Locality Preserving Indexing (LPI). Different from LSI, which aims to discover the global Euclidean structure, LPI aims to discover the local geometrical structure and can have more discriminating power. Thus, the documents related to the same semantics are close to each other in the low-dimensional representation space. Also, LPI is derived by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the document manifold. The Laplace-Beltrami operator takes the second-order derivatives of functions on the manifold and evaluates their smoothness; therefore, it can discover the nonlinear manifold structure to some extent. Some theoretical justifications can be traced back to [15][14]. The original LPI is not optimal in the sense of computation in that the obtained basis functions might contain a trivial solution. The trivial solution contains no information and is thus useless for document indexing. A modified LPI is proposed to obtain better document representations. In this low-dimensional space, we then apply traditional clustering algorithms such as k-means to cluster the documents into semantically different classes.

The rest of this paper is organized as follows: In Section 2, we give a brief review of LSI and LPI. Section 3 introduces our proposed document clustering algorithm. Some theoretical analysis is provided in Section 4. The experimental results are shown in Section 5. Finally, we give concluding remarks and future work in Section 6.
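As a concrete illustration of the last step, the clustering stage can be sketched as follows. This is a minimal sketch, assuming scikit-learn is available and using SpectralEmbedding as a rough stand-in for LPI (both build a nearest-neighbor graph and embed documents with Laplacian eigenvectors); the TF-IDF features, neighborhood size, embedding dimension, and number of clusters are illustrative assumptions, not the setup used in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

def cluster_documents(docs, n_clusters=5, n_dims=10, n_neighbors=15):
    # Represent each document as a TF-IDF vector.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    # Embed documents in a low-dimensional space built from a k-nearest-neighbor
    # graph, so that documents with similar neighborhoods stay close together.
    embedding = SpectralEmbedding(n_components=n_dims,
                                  affinity="nearest_neighbors",
                                  n_neighbors=n_neighbors).fit_transform(tfidf.toarray())
    # Cluster the embedded documents with k-means, as described above.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)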
Self-adaptive differential evolution algorithm for numerical optimization

Abstract—In this paper, we propose an extension of the Self-adaptive Differential Evolution algorithm (SaDE) to solve optimization problems with constraints. In comparison with the original SaDE algorithm, the replacement criterion was modified for handling constraints. The performance of the proposed method is reported on the set of 24 benchmark problems provided by the CEC2006 special session on constrained real-parameter optimization.
2006 IEEE Congress on Evolutionary Computation Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006
Self-adaptive Differential Evolution Algorithm for Constrained Real-Parameter Optimization
"DE/rand/1":  V_{i,G} = X_{r1,G} + F · (X_{r2,G} − X_{r3,G})

"DE/best/1":  V_{i,G} = X_{best,G} + F · (X_{r1,G} − X_{r2,G})
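For concreteness, the two mutation strategies above can be sketched in code. This is a minimal illustration of the classical DE operators, assuming the population is a NumPy array of parameter vectors; the index sampling and the fixed scale factor F are generic assumptions and do not reproduce SaDE's self-adaptive strategy selection.

import numpy as np

def de_rand_1(pop, i, F=0.5, rng=None):
    # "DE/rand/1": V_i = X_r1 + F * (X_r2 - X_r3), with r1, r2, r3 distinct and different from i.
    if rng is None:
        rng = np.random.default_rng()
    r1, r2, r3 = rng.choice([j for j in range(len(pop)) if j != i], size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def de_best_1(pop, i, fitness, F=0.5, rng=None):
    # "DE/best/1": V_i = X_best + F * (X_r1 - X_r2), where X_best has the best objective value.
    if rng is None:
        rng = np.random.default_rng()
    best = int(np.argmin(fitness))
    r1, r2 = rng.choice([j for j in range(len(pop)) if j != i], size=2, replace=False)
    return pop[best] + F * (pop[r1] - pop[r2])

In SaDE, the choice among such strategies is adapted during the run based on their recent success rates.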
A Short Proof of the Hajnal-Szemerédi Theorem on Equitable Coloring

A Short Proof of the Hajnal-Szemerédi Theorem on Equitable Coloring

H. A. Kierstead* and A. V. Kostochka†
September 10, 2006

1 Introduction

An equitable k-coloring of a graph G is a proper k-coloring for which any two color classes differ in size by at most one. Equitable colorings naturally arise in some scheduling, partitioning, and load balancing problems [1,15,16]. Pemmaraju [13] and Janson and Ruciński [6] used equitable colorings to derive deviation bounds for sums of dependent random variables that exhibit limited dependence. In 1964 Erdős [3] conjectured that any graph with maximum degree ∆(G) ≤ r has an equitable (r+1)-coloring. This conjecture was proved in 1970 by Hajnal and Szemerédi [5] with a surprisingly long and complicated argument. Recently, Mydlarz and Szemerédi [11] found a polynomial time algorithm for such colorings.

In search of an easier proof, Seymour [14] strengthened Erdős' conjecture by asking whether every graph with minimum degree δ(G) ≥ (k/(k+1))|G| contains the k-th power of a hamiltonian cycle. (If |G| = (r+1)(s+1) and ∆(G) ≤ r then δ(Ḡ) ≥ (s/(s+1))|G|; each (s+1)-interval of an s-th power of a hamiltonian cycle in Ḡ is an independent set in G.) The case k = 1 is Dirac's Theorem and the case k = 2 is Pósa's Conjecture. Fan and Kierstead [4] proved Pósa's Conjecture with cycle replaced by path. Komlós, Sárközy and Szemerédi [7] proved Seymour's conjecture for graphs with sufficiently many (in terms of k) vertices. Neither of these partial results has a simple proof. In fact, [7] uses the Regularity Lemma, the Blow-up Lemma and the Hajnal-Szemerédi Theorem.

A different strengthening was suggested recently by Kostochka and Yu [9,10]. In the spirit of Ore's theorem on hamiltonian cycles [12], they conjectured that every graph in which d(x) + d(y) ≤ 2r for every edge xy has an equitable (r+1)-coloring.

* Department of Mathematics and Statistics, Arizona State University, Tempe, AZ 85287, USA. E-mail address: kierstead@. Research of this author is supported in part by the NSA grant MDA904-03-1-0007.
† Department of Mathematics, University of Illinois, Urbana, IL 61801, USA and Institute of Mathematics, Novosibirsk, 630090, Russia. E-mail address: kostochk@. Research of this author is supported in part by the NSF grant DMS-0400498.

In this paper we present a short proof of the Hajnal-Szemerédi Theorem and present another polynomial time algorithm that constructs an equitable (r+1)-coloring of any graph G with maximum degree ∆(G) ≤ r. Our approach is similar to the original proof, but a discharging argument allows for a much simpler conclusion. Our techniques have paid further dividends. In another paper we will prove the above conjecture of Kostochka and Yu [9,10] in a stronger form: with 2r+1 in place of 2r. They also yield partial results towards the Chen-Lih-Wu Conjecture [2] about equitable r-colorings of r-regular graphs and towards a list analogue of the Hajnal-Szemerédi Theorem (see [8] for definitions).

Most of our notation is standard; possible exceptions include the following. For a vertex y and a set of vertices X, N_X(y) := N(y) ∩ X and d_X(y) := |N_X(y)|. If µ is a function on edges then µ(A,B) := Σ_{xy ∈ E(A,B)} µ(x,y), where E(A,B) is the set of edges linking a vertex in A to a vertex in B. For a function f : V → Z, the restriction of f to W ⊆ V is denoted by f|_W. Functions are viewed formally as sets of ordered pairs. So if u ∉ V then g := f ∪ {(u,γ)} is the extension of f to V ∪ {u} such that g(u) = γ.

2 Main proof

Let G be a graph with s(r+1) vertices satisfying ∆(G) ≤ r. A nearly equitable (r+1)-coloring of G is a proper coloring f whose color classes all have size s except for one small class V⁻ = V⁻(f) with size s−1 and one large class V⁺ = V⁺(f) with size s+1.

Given such a coloring f, define the auxiliary digraph H = H(G,f) as follows. The vertices of H are the color classes of f. A directed edge VW belongs to E(H) iff some vertex y ∈ V has no neighbors in W. In this case we say that y is movable to W. Call W ∈ V(H) accessible if V⁻ is reachable from W in H. So V⁻ is trivially accessible. Let 𝒜 = 𝒜(f) denote the family of accessible classes, A := ∪𝒜 and B := V(G) \ A. Let m := |𝒜| − 1 and q := r − m. Then |A| = (m+1)s − 1 and |B| = (r−m)s + 1. Each vertex y ∈ B cannot be moved to 𝒜 and so satisfies

d_A(y) ≥ m + 1 and d_B(y) ≤ q − 1.   (1)

Lemma 1. If G has a nearly equitable (r+1)-coloring f whose large class V⁺ is accessible, then G has an equitable (r+1)-coloring.

Proof. Let P = V_1, ..., V_k be a path in H(G,f) from V_1 := V⁺ to V_k := V⁻. This means that for each j = 1, ..., k−1, V_j contains a vertex y_j that has no neighbors in V_{j+1}. So, if we move y_j to V_{j+1} for j = 1, ..., k−1, then we obtain an equitable (r+1)-coloring of G.

Suppose V⁺ ⊆ B. If A = V⁻ then |E(A,B)| ≤ r|V⁻| = r(s−1) < 1 + rs = |B|, a contradiction to (1). Thus m + 1 = |𝒜| ≥ 2. Call a class V ∈ 𝒜 terminal if V⁻ is reachable from every class W ∈ 𝒜 \ {V} in the digraph H − V. Trivially, V⁻ is non-terminal. Every non-terminal class W partitions 𝒜 \ {W} into two parts S_W and T_W ≠ ∅, where S_W is the set of classes that can reach V⁻ in H − W. Choose a non-terminal class U so that 𝒜′ := T_U ≠ ∅ is minimal. Then every class in 𝒜′ is terminal and no class in 𝒜′ has a vertex movable to any class in (𝒜 \ 𝒜′) \ {U}. Set t := |𝒜′| and A′ := ∪𝒜′. Thus every x ∈ A′ satisfies

d_A(x) ≥ m − t.   (2)

Call an edge zy with z ∈ W ∈ 𝒜′ and y ∈ B a solo edge if N_W(y) = {z}. The ends of solo edges are called solo vertices, and vertices linked by solo edges are called special neighbors of each other. Let S_z denote the set of special neighbors of z and S_y denote the set of special neighbors of y in A′. Then at most r − (m + 1 + d_B(y)) color classes in 𝒜′ have more than one neighbor of y. Hence

|S_y| ≥ t − q + 1 + d_B(y).   (3)

Lemma 2. If there exists W ∈ 𝒜′ such that no solo vertex in W is movable to a class in 𝒜 \ {W}, then q + 1 ≤ t. Furthermore, every vertex y ∈ B is solo.

Proof. Let S be the set of solo vertices in W and D := W \ S. Then every vertex in N_B(S) has at least one neighbor in W and every vertex in B \ N_B(S) has at least two neighbors in W. It follows that |E(W,B)| ≥ |N_B(S)| + 2(|B| − |N_B(S)|). Since no vertex in S is movable, every z ∈ S satisfies d_B(z) ≤ q. By (2), every vertex x ∈ W satisfies d_B(x) ≤ t + q. Thus, using s = |W| = |S| + |D|,

qs + q|D| + 2 = 2(qs + 1) − q|S| ≤ |E(W,B)| ≤ q|S| + (t + q)|D| ≤ qs + t|D|.

It follows that q + 1 ≤ t. Moreover, by (3) every y ∈ B satisfies |S_y| ≥ t − q + d_B(y) ≥ 1. Thus y is solo.

Lemma 3. There exists a solo vertex z ∈ W ∈ 𝒜′ such that either z is movable to a class in 𝒜 \ {W} or z has two nonadjacent special neighbors in B.

Proof. Suppose not. Then by Lemma 2 every vertex in B is solo. Moreover, S_z is a clique for every solo vertex z ∈ A′. Consider a weight function µ on E(A′,B) defined by

µ(xy) := q/|S_x| if xy is solo, and 0 if xy is not solo.

For z ∈ A′ we have µ(z,B) = |S_z| · q/|S_z| = q if z is solo; otherwise µ(z,B) = 0. Thus µ(A′,B) ≤ q|A′| = qst. On the other hand, consider y ∈ B. Let c_y := max{|S_z| : z ∈ S_y}, say c_y = |S_z|, z ∈ S_y. Using that S_z is a clique and (1), c_y − 1 ≤ d_B(y) ≤ q − 1. So c_y ≤ q.
Together with (3) this yields

µ(A′,y) = Σ_{z ∈ S_y} q/|S_z| ≥ |S_y| · q/c_y ≥ (t − q + c_y) · q/c_y = (t − q)q/c_y + q ≥ t.

Thus µ(A′,B) ≥ t|B| = t(qs + 1) > qst ≥ µ(A′,B), a contradiction.

We are now ready to prove the Hajnal-Szemerédi Theorem.

Theorem 4. If G is a graph satisfying ∆(G) ≤ r, then G has an equitable (r+1)-coloring.

Proof. We may assume that |G| is divisible by r+1. To see this, suppose that |G| = s(r+1) − p, where p ∈ [r]. Let G′ := G + K_p. Then |G′| is divisible by r+1 and ∆(G′) ≤ r. Moreover, the restriction of any equitable (r+1)-coloring of G′ to G is an equitable (r+1)-coloring of G.

Argue by induction on ‖G‖. The base step ‖G‖ = 0 is trivial, so consider the induction step ‖G‖ ≥ 1. Let e = xy be an edge of G. By the induction hypothesis there exists an equitable (r+1)-coloring f_0 of G − e. We are done, unless some color class V contains both x and y. Since d(x) ≤ r, there exists another class W such that x is movable to W. Doing so yields a nearly equitable (r+1)-coloring f of G with V⁻(f) = V \ {x} and V⁺(f) = W ∪ {x}. We now show by a secondary induction on q(f) that G has an equitable (r+1)-coloring.

If V⁺ ∈ 𝒜 then we are done by Lemma 1; in particular, the base step q = 0 holds. Otherwise, by Lemma 3 there exists a class W ∈ 𝒜′, a solo vertex z ∈ W and a vertex y_1 ∈ S_z such that either z is movable to a class X ∈ 𝒜 \ {W} or z is not movable in 𝒜 and there exists another vertex y_2 ∈ S_z which is not adjacent to y_1. By (1) and the primary induction hypothesis, there exists an equitable q-coloring g of B⁻ := B \ {y_1}. Let A⁺ := A ∪ {y_1}.

Case 1: z is movable to X ∈ 𝒜. Move z to X and y_1 to W \ {z} to obtain a nearly equitable (m+1)-coloring φ of A⁺. Since W ∈ 𝒜′(f), V⁺(φ) = X ∪ {z} ∈ 𝒜(φ). By Lemma 1, A⁺ has an equitable (m+1)-coloring φ′. Then φ′ ∪ g is an equitable (r+1)-coloring of G.

Case 2: z is not movable to any class in 𝒜. Then d_{A⁺}(z) ≥ d_A(z) + 1 ≥ m + 1. Thus d_{B⁻}(z) ≤ q − 1. So we can move z to a color class Y ⊆ B⁻ of g to obtain a new coloring g′ of B* := B⁻ ∪ {z}. Also move y_1 to W to obtain an (m+1)-coloring ψ of A* := V(G) \ B*.
Set ψ′ := ψ ∪ g′. Then ψ′ is a nearly equitable coloring of G with A* ⊆ A(ψ′). Moreover, y_2 is movable to W* := W ∪ {y_1} \ {z}. Thus q(ψ′) < q(f) and so, by the secondary induction hypothesis, G has an equitable (r+1)-coloring.

3 A polynomial algorithm

Our proof clearly yields an algorithm. However, it may not be immediately clear that its running time is polynomial. The problem lies in the secondary induction, where we may apply Case 2 O(r) times, each time calling the algorithm recursively. Lemma 2 is crucial here; it allows us to claim that when we are in Case 2 (doing lots of work) we make lots of progress.

As above, G is a graph satisfying ∆(G) ≤ r and |G| =: n =: s(r+1). Let f be a nearly equitable (r+1)-coloring of G.

Theorem 5. There exists an algorithm P that from input (G,f) constructs an equitable (r+1)-coloring of G in c(q+1)n³ steps.

Proof. We shall show that the construction in the proof of Theorem 4 can be accomplished in the stated number of steps. Argue by induction on q. The base step q = 0 follows immediately from Lemma 1 and the observation that the construction of H and the recoloring can be carried out in (1/4)cn³ steps. Now consider the induction step. In (1/4)cn³ steps construct 𝒜, 𝒜′, B, W, z, etc. Using the induction hypothesis on the input (G[B⁻], f|_{B⁻}), construct the coloring g of B⁻ in c(q(f|_{B⁻}) + 1)(qs)³ ≤ cqn³ steps. In (1/4)cn³ steps determine whether Case 1 or Case 2 holds. If Case 1 holds, construct the recoloring φ′ in (1/4)cn³ steps. This yields an equitable (r+1)-coloring g ∪ φ′ in a total of (3/4)cn³ + cqn³ ≤ c(q+1)n³ steps.

If Case 2 holds then, by Lemma 2, q + 1 ≤ t. Thus we used only (1/8)cqn³ steps to construct g. Use an additional (1/4)cn³ steps to extend g to ψ′. Notice that W* is non-terminal in ψ′. Thus we can choose 𝒜′(ψ′) so that A′(ψ′) ⊆ B. If Case 1 holds for ψ′ then, as above, we can construct an equitable coloring in an additional (1/4)cn³ + (1/8)cqn³ steps. So the total number of steps is at most c(q+1)n³. Otherwise, by Lemma 2, q(ψ′) < (1/2)q. Thus by the induction hypothesis we can finish in cqn³/16 additional steps. Then the total number of steps is less than c(q+1)n³.

Theorem 6. There is an algorithm P′ of complexity O(n⁵) that constructs an equitable (r+1)-coloring of any graph G satisfying ∆(G) ≤ r and |G| = n.

Proof. As above, we may assume that n is divisible by r+1. Let V(G) = {v_1, ..., v_n}. Delete all edges from G to form G_0 and let f_0 be an equitable coloring of G_0. Now, for i = 1, ..., n−1, do the following:
(i) Add back all the edges of G incident with v_i to form G_i;
(ii) If v_i has no neighbors in its color class in f_{i−1}, then set f_i := f_{i−1}.
(iii) Otherwise, move v_i to a color class that has no neighbors of v_i to form a nearly equitable coloring f′_{i−1} of G_i. Then apply P to (G_i, f′_{i−1}) to get an equitable (r+1)-coloring f_i of G_i.

Then f_{n−1} is an equitable (r+1)-coloring of G_{n−1} = G. Since we have only n−1 stages and each stage runs in O(n⁴) steps, the total complexity is O(n⁵).

Acknowledgement. We would like to thank J. Schmerl for many useful comments.
References

[1] J. Blazewicz, K. Ecker, E. Pesch, G. Schmidt, J. Weglarz, Scheduling computer and manufacturing processes, 2nd ed., Berlin: Springer, 485 p. (2001).
[2] B.-L. Chen, K.-W. Lih, and P.-L. Wu, Equitable coloring and the maximum degree, Europ. J. Combinatorics, 15 (1994), 443–447.
[3] P. Erdős, Problem 9, in "Theory of Graphs and Its Applications" (M. Fiedler, Ed.), p. 159, Czech. Acad. Sci. Publ., Prague, 1964.
[4] G. Fan and H. A. Kierstead, Hamiltonian square paths, J. Combin. Theory Ser. B, 67 (1996), 167–182.
[5] A. Hajnal and E. Szemerédi, Proof of a conjecture of P. Erdős, in "Combinatorial Theory and its Application" (P. Erdős, A. Rényi, and V. T. Sós, Eds.), pp. 601–623, North-Holland, London, 1970.
[6] S. Janson and A. Ruciński, The infamous upper tail, Random Structures and Algorithms, 20 (2002), 317–342.
[7] J. Komlós, G. Sárközy and E. Szemerédi, Proof of the Seymour conjecture for large graphs, Annals of Combinatorics, 1 (1998), 43–60.
[8] A. V. Kostochka, M. J. Pelsmajer, and D. B. West, A list analogue of equitable coloring, J. of Graph Theory, 44 (2003), 166–177.
[9] A. V. Kostochka and G. Yu, Extremal problems on packing of graphs, Oberwolfach Reports, No. 1 (2006), 55–57.
[10] A. V. Kostochka and G. Yu, Ore-type graph packing problems, to appear in Combinatorics, Probability and Computing.
[11] M. Mydlarz and E. Szemerédi, Algorithmic Brooks' Theorem, manuscript.
[12] O. Ore, Note on Hamilton circuits, Amer. Math. Monthly, 67 (1960), 55.
[13] S. V. Pemmaraju, Equitable colorings extend Chernoff-Hoeffding bounds, Proceedings of the 5th International Workshop on Randomization and Approximation Techniques in Computer Science (APPROX-RANDOM 2001), 2001, 285–296.
[14] P. Seymour, Problem section, in "Combinatorics: Proceedings of the British Combinatorial Conference 1973" (T. P. McDonough and V. C. Mavron, Eds.), pp. 201–202, Cambridge Univ. Press, Cambridge, UK, 1974.
[15] B. F. Smith, P. E. Bjorstad, and W. D. Gropp, Domain decomposition. Parallel multilevel methods for elliptic partial differential equations, Cambridge: Cambridge University Press, 224 p. (1996).
[16] A. Tucker, Perfect graphs and an application to optimizing municipal services, SIAM Review, 15 (1973), 585–590.
Foreign literature translation: Digital Image Processing and Pattern Recognition Techniques for the Detection of Cancer

Introduction (original English text)

Digital image processing and pattern recognition techniques for the detection of cancer

Cancer is the second leading cause of death for both men and women in the world, and is expected to become the leading cause of death in the next few decades. In recent years, cancer detection has become a significant area of research activity in the image processing and pattern recognition community. Medical imaging technologies have already made a great impact on our capabilities of detecting cancer early and diagnosing the disease more accurately. In order to further improve the efficiency and veracity of diagnosis and treatment, image processing and pattern recognition techniques have been widely applied to the analysis and recognition of cancer, the evaluation of the effectiveness of treatment, and the prediction of the development of cancer.

The aim of this special issue is to bring together researchers working on image processing and pattern recognition techniques for the detection and assessment of cancer, and to promote research in image processing and pattern recognition for oncology. A number of papers were submitted to this special issue and each was peer-reviewed by at least three experts in the field. From these submitted papers, 17 were finally selected for inclusion in this special issue. These selected papers cover a broad range of topics that are representative of the state of the art in computer-aided detection or diagnosis (CAD) of cancer. They cover several imaging modalities (such as CT, MRI, and mammography) and different types of cancer (including breast cancer, skin cancer, etc.), which we summarize below.

Skin cancer is the most prevalent among all types of cancers. Three papers in this special issue deal with skin cancer. Yuan et al. propose a skin lesion segmentation method based on region fusion and narrow-band energy graph partitioning. The method can deal with challenging situations with skin lesions, such as topological changes, weak or false edges, and asymmetry. Tang proposes a snake-based approach using multi-direction gradient vector flow (GVF) for the segmentation of skin cancer images. A new anisotropic diffusion filter is developed as a preprocessing step. After the noise is removed, the image is segmented using a GVF snake. The proposed method is robust to noise and can correctly trace the boundary of the skin cancer even if there are other objects near the skin cancer region. Serrano et al. present a method based on Markov random fields (MRF) to detect different patterns in dermoscopic images. Different from previous approaches to automatic dermatological image classification with the ABCD rule (Asymmetry, Border irregularity, Color variegation, and Diameter greater than 6 mm or growing), this paper follows a new trend of looking for specific patterns in lesions which could lead physicians to a clinical assessment.

Breast cancer is the most frequently diagnosed cancer other than skin cancer and a leading cause of cancer deaths in women in developed countries. In recent years, CAD schemes have been developed as a potentially efficacious solution to improving radiologists' diagnostic accuracy in breast cancer screening and diagnosis. The predominant approach of CAD in breast cancer, and in medical imaging in general, is to use automated image analysis to serve as a "second reader", with the aim of improving radiologists' diagnostic performance.
Thanks to intense research and development efforts, CAD schemes have now been introduced in screening mammography, and clinical studies have shown that such schemes can result in higher sensitivity at the cost of a small increase in recall rate. In this issue, we have three papers in the area of CAD for breast cancer. Wei et al. propose an image-retrieval-based approach to CAD, in which retrieved images similar to the one being evaluated (called the query image) are used to support a CAD classifier, yielding an improved measure of malignancy. This involves searching a large database for the images that are most similar to the query image, based on features that are automatically extracted from the images. Dominguez et al. investigate the use of image features characterizing the boundary contours of mass lesions in mammograms for classification of benign vs. malignant masses. They study and evaluate the impact of these features on diagnostic accuracy with several different classifier designs when the lesion contours are extracted using two different automatic segmentation techniques. Schaefer et al. study the use of thermal imaging for breast cancer detection. In their scheme, statistical features are extracted from thermograms to quantify bilateral differences between left and right breast regions, which are used subsequently as input to a fuzzy-rule-based classification system for diagnosis.

Colon cancer is the third most common cancer in men and women, and also the third most common cause of cancer-related death in the USA. Yao et al. propose a novel technique to detect colonic polyps using CT colonography. They use ideas from geographic information systems to employ topographical height maps, which mimic the procedure used by radiologists for the detection of polyps. The technique can also be used to measure consistently the size of polyps. Hafner et al. present a technique to classify and assess colonic polyps, which are precursors of colorectal cancer. The classification is performed based on the pit pattern in zoom-endoscopy images. They propose a novel color wavelet cross co-occurrence matrix which employs the wavelet transform to extract texture features from color channels.

Lung cancer occurs most commonly between the ages of 45 and 70 years, and has one of the worst survival rates of all the types of cancer. Two papers are included in this special issue on lung cancer research. Pattichis et al. evaluate new mathematical models that are based on statistics, logic functions, and several statistical classifiers to analyze reader performance in grading chest radiographs for pneumoconiosis. The technique can potentially be applied to the detection of nodules related to early stages of lung cancer. El-Baz et al. focus on the early diagnosis of pulmonary nodules that may lead to lung cancer. Their methods monitor the development of lung nodules in successive low-dose chest CT scans. They propose a new two-step registration method to align globally and locally two detected nodules. Experiments on a relatively large data set demonstrate that the proposed registration method contributes to precise identification and diagnosis of nodule development.

It is estimated that almost a quarter of a million people in the USA are living with kidney cancer and that the number increases by 51,000 every year. Linguraru et al. propose a computer-assisted radiology tool to assess renal tumors in contrast-enhanced CT for the management of tumor diagnosis and response to treatment. The tool accurately segments, measures, and characterizes renal tumors, and has been adopted in clinical practice. Validation against manual tools shows high correlation.

Neuroblastoma is a cancer of the sympathetic nervous system and one of the most malignant diseases affecting children. Two papers in this field are included in this special issue. Sertel et al. present techniques for classification of the degree of Schwannian stromal development as either stroma-rich or stroma-poor, which is a critical decision factor affecting the prognosis. The classification is based on texture features extracted using co-occurrence statistics and local binary patterns. Their work is useful in helping pathologists in the decision-making process. Kong et al. propose image processing and pattern recognition techniques to classify the grade of neuroblastic differentiation on whole-slide histology images. The presented technique is promising for facilitating the grading of whole-slide images of neuroblastoma biopsies with high throughput.

This special issue also includes papers which are not directly focused on the detection or diagnosis of a specific type of cancer but deal with the development of techniques applicable to cancer detection. Ta et al. propose a framework of graph-based tools for the segmentation of microscopic cellular images. Based on the framework, automatic or interactive segmentation schemes are developed for color cytological and histological images. Tosun et al. propose an object-oriented segmentation algorithm for biopsy images for the detection of cancer. The proposed algorithm uses a homogeneity measure based on the distribution of the objects to characterize tissue components. Colon biopsy images were used to verify the effectiveness of the method; the segmentation accuracy was improved as compared to its pixel-based counterpart. Narasimha et al. present a machine-learning tool for automatic texton-based joint classification and segmentation of mitochondria in MNT-1 cells imaged using an ion-abrasion scanning electron microscope. The proposed approach has minimal user intervention and can achieve high classification accuracy. El Naqa et al. investigate intensity-volume histogram metrics as well as shape and texture features extracted from PET images to predict a patient's response to treatment. Preliminary results suggest that the proposed approach could potentially provide better tools and discriminant power for functional imaging in clinical prognosis.

We hope that the collection of the selected papers in this special issue will serve as a basis for inspiring further rigorous research in CAD of various types of cancer. We invite you to explore this special issue and benefit from these papers. On behalf of the Editorial Committee, we take this opportunity to gratefully acknowledge the authors and the reviewers for their diligence in abiding by the editorial timeline. Our thanks also go to the Editors-in-Chief of Pattern Recognition, Dr. Robert S. Ledley and Dr. C. Y. Suen, for their encouragement and support for this special issue.
A Fast and Accurate Plane Detection Algorithm for Large Noisy Point Clouds Using Filtered Normals

A Fast and Accurate Plane Detection Algorithm for Large Noisy Point Clouds Using Filtered Normals and Voxel Growing

Jean-Emmanuel Deschaud, François Goulette
Mines ParisTech, CAOR - Centre de Robotique, Mathématiques et Systèmes
60 Boulevard Saint-Michel, 75272 Paris Cedex 06
jean-emmanuel.deschaud@mines-paristech.fr, francois.goulette@mines-paristech.fr

Abstract

With the improvement of 3D scanners, we produce point clouds with more and more points, often exceeding millions of points. Then we need a fast and accurate plane detection algorithm to reduce data size. In this article, we present a fast and accurate algorithm to detect planes in unorganized point clouds using filtered normals and voxel growing. Our work is based on a first step in estimating better normals at the data points, even in the presence of noise. In a second step, we compute a score of local plane in each point. Then, we select the best local seed plane and in a third step start a fast and robust region growing by voxels, which we call voxel growing. We have evaluated and tested our algorithm on different kinds of point clouds and compared its performance to other algorithms.

1. Introduction

With the growing availability of 3D scanners, we are now able to produce large datasets with millions of points. It is necessary to reduce data size, to decrease the noise and at the same time to increase the quality of the model. It is interesting to model planar regions of these point clouds by planes. In fact, plane detection is generally a first step of segmentation but it can be used for many applications. It is useful in computer graphics to model the environment with basic geometry. It is used for example in modeling to detect building facades before classification. Robots do Simultaneous Localization and Mapping (SLAM) by detecting planes of the environment. In our laboratory, we wanted to detect small and large building planes in point clouds of urban environments with millions of points for modeling. As mentioned in [6], the accuracy of the plane detection is important for later steps of the modeling pipeline. We also want to be fast, to be able to process point clouds with millions of points. We present a novel algorithm based on region growing with improvements in normal estimation and in the growing process. Our method is generic enough to work on different kinds of data, like point clouds from fixed scanners or from Mobile Mapping Systems (MMS). We also aim at detecting building facades in urban point clouds or little planes like doors, even in very large data sets. Our input is an unorganized noisy point cloud and, with only three "intuitive" parameters, we generate a set of connected components of planar regions. We evaluate our method as well as explain and analyse the significance of each parameter.
2. Previous Works

Although there are many methods of segmentation in range images like in [10] or in [3], three have been thoroughly studied for 3D point clouds: region growing, the Hough transform from [14] and Random Sample Consensus (RANSAC) from [9].

The application of recognising structures in urban laser point clouds is frequent in the literature. Bauer in [4] and Boulaassal in [5] detect facades in dense 3D point clouds by a RANSAC algorithm. Vosselman in [23] reviews surface growing and 3D Hough transform techniques to detect geometric shapes. Tarsh-Kurdi in [22] detect roof planes in 3D building point clouds by comparing results of the Hough transform and RANSAC algorithms. They found that RANSAC is more efficient than the first one. Chao Chen in [6] and Yu in [25] present algorithms of segmentation in range images for the same application of detecting planar regions in an urban scene. The method in [6] is based on a region growing algorithm in range images and merges results in one labelled 3D point cloud. [25] uses a method different from the three we have cited: they extract a hierarchical subdivision of the input image built like a graph where leaf nodes represent planar regions.

There are also other methods like Bayesian techniques. In [16] and [8], they obtain smoothed surfaces from noisy point clouds with objects modeled by probability distributions, and it seems possible to extend this idea to point cloud segmentation. But techniques based on Bayesian statistics need to optimize a global statistical model and then it is difficult to process point clouds larger than one million points.

We present below an analysis of the two main methods used in the literature: RANSAC and region growing. The Hough transform algorithm is too time consuming for our application. To compare the complexity of the algorithms, we take a point cloud of size N with only one plane P of size n. We suppose that we want to detect this plane P and we define n_min, the minimum size of the planes we want to detect. The size of a plane is the area of the plane. If the data density is uniform in the point cloud then the size of a plane can be specified by its number of points.

2.1. RANSAC

RANSAC is an algorithm initially developed by Fischler and Bolles in [9] that allows the fitting of models without trying all possibilities. RANSAC is based on the probability to detect a model using the minimal set required to estimate the model. To detect a plane with RANSAC, we choose 3 random points (enough to estimate a plane). We compute the plane parameters with these 3 points. Then a score function is used to determine how good the model is for the remaining points. Usually, the score is the number of points belonging to the plane. With noise, a point belongs to a plane if the distance from the point to the plane is less than a parameter γ. In the end, we keep the plane with the best score. The probability of getting the plane in the first trial is p = (n/N)³. Therefore the probability to get it in T trials is p = 1 − (1 − (n/N)³)^T. Using equation (1) and supposing n_min/N ≪ 1, we know the number T_min of minimal trials to have a probability p_t to get planes of size at least n_min:

T_min = log(1 − p_t) / log(1 − (n_min/N)³) ≈ log(1/(1 − p_t)) · (N/n_min)³.   (1)

For each trial, we test all data points to compute the score of a plane. The RANSAC algorithm complexity lies in O(N·(N/n_min)³) when n_min/N ≪ 1, and T_min → 0 when n_min → N. Then RANSAC is very efficient in detecting large planes in noisy point clouds, i.e. when the ratio n_min/N is close to 1, but very slow to detect small planes in large point clouds, i.e. when n_min/N ≪ 1. After selecting the best model, another step is to extract the largest connected component of each plane. Connected components mean that the minimum distance between each point of the plane and the other points is smaller (for distance) than a fixed parameter.

Schnabel et al. [20] bring two optimizations to RANSAC: the point selection is done locally and the score function has been improved. An octree is first created from the point cloud. Points used to estimate plane parameters are chosen locally at a random depth of the octree. The score function is also different from RANSAC: instead of testing all points for one model, they test only a random subset and find the score by interpolation. The algorithm complexity lies in O(Nr·4N/(d·n_min)) where r is the number of random subsets for the score function and d is the maximum octree depth. Their algorithm improves the plane detection speed but its complexity lies in O(N²) and it becomes slow on large data sets. And again we have to extract the largest connected component of each plane.

2.2. Region Growing

Region growing algorithms work well in range images like in [18]. The principle of region growing is to start with a seed region and to grow it by neighborhood when the neighbors satisfy some conditions. In range images, we have the neighbors of each point with pixel coordinates. In the case of unorganized 3D data, there is no information about the neighborhood in the data structure. The most common method to compute neighbors in 3D is to compute a Kd-tree to search the k nearest neighbors. The creation of a Kd-tree lies in O(N log N) and the search of the k nearest neighbors of one point lies in O(log N). The advantage of these region growing methods is that they are fast when there are many planes to extract, robust to noise, and extract the largest connected component immediately. But they only use the distance from point to plane to extract planes and, like we will see later, it is not accurate enough to detect correct planar regions.

Rabbani et al. [19] developed a method of smooth area detection that can be used for plane detection. They first estimate the normal of each point like in [13]. The point with the minimum residual starts the region growing. They test the k nearest neighbors of the last point added: if the angle between the normal of the point and the current normal of the plane is smaller than a parameter α, then they add this point to the smooth region. With a Kd-tree for the k nearest neighbors, the algorithm complexity is in O(N + n log N).
The complexity seems to be low but in the worst case, when n/N is close to 1, for example for facade detection in point clouds, the complexity becomes O(N log N).

3. Voxel Growing

3.1. Overview

In this article, we present a new algorithm adapted to large data sets of unorganized 3D points and optimized to be accurate and fast. Our plane detection method works in three steps. In the first part, we compute a better estimation of the normal in each point by a filtered weighted plane fitting. In a second step, we compute the score of local planarity in each point. We select the best seed point that represents a good seed plane and in the third part, we grow this seed plane by adding all points close to the plane. The growing step is based on a voxel growing algorithm. The filtered normals, the score function and the voxel growing are innovative contributions of our method.

As an input, we need dense point clouds related to the level of detail we want to detect. As an output, we produce connected components of planes in the point cloud. This notion of connected components is linked to the data density. With our method, the connected components of planes detected are linked to the parameter d of the voxel grid.

Our method has 3 "intuitive" parameters: d, area_min and γ; "intuitive" because they are linked to physical measurements. d is the voxel size used in voxel growing and also represents the connectivity of points in detected planes. γ is the maximum distance between a point of a plane and the plane model; it represents the plane thickness and is linked to the point cloud noise. area_min represents the minimum area of planes we want to keep.

3.2. Details

3.2.1 Local Density of Point Clouds

In a first step, we compute the local density of point clouds like in [17]. For that, we find the radius r_i of the sphere containing the k nearest neighbors of point i. Then we calculate ρ_i = k/(π·r_i²). In our experiments, we find that k = 50 is a good number of neighbors. It is important to know the local density because many laser point clouds are made with a fixed resolution angle scanner and are therefore not evenly distributed. We use the local density in section 3.2.3 for the score calculation.

3.2.2 Filtered Normal Estimation

Normal estimation is an important part of our algorithm. The paper [7] presents and compares three normal estimation methods. They conclude that the weighted plane fitting or WPF is the fastest and the most accurate for large point clouds. WPF is an idea of Pauly et al. in [17] that the fitting plane of a point p must take into consideration the nearby points more than other distant ones. The normal least square is explained in [21] and is the minimum of Σ_{i=1..k} (n_p·p_i + d)². The WPF is the minimum of Σ_{i=1..k} ω_i(n_p·p_i + d)², where ω_i = θ(‖p_i − p‖) and θ(r) = e^{−2r²/r_i²}. For solving n_p, we compute the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix C_w = Σ_{i=1..k} ω_i·ᵗ(p_i − b_w)(p_i − b_w), where b_w is the weighted barycenter. For the three methods explained in [7], we get a good approximation of normals in smooth areas but we have errors in sharp corners. In figure 1, we have tested the weighted normal estimation on two planes with uniform noise and forming an angle of 90˚. We can see that the normal is not correct on the corners of the planes and in the red circle.

To improve the normal calculation, which improves the plane detection especially on borders of planes, we propose a filtering process in two phases. In a first step, we compute the weighted normals (WPF) of each point like we described above by minimizing Σ_{i=1..k} ω_i(n_p·p_i + d)². In a second step, we compute the filtered normal by using an adaptive local neighborhood. We compute the new weighted normal with the same sum minimization but keeping only points of the neighborhood whose normals from the first step satisfy |n_p·n_i| > cos(α). With this filtering step, we have the same results in smooth areas and better results in sharp corners. We call our normal estimation filtered weighted plane fitting (FWPF).

Figure 1. Weighted normal estimation of two planes with uniform noise and with 90˚ angle between them.

We have tested our normal estimation by computing normals on synthetic data with two planes and different angles between them and with different values of the parameter α. We can see in figure 2 the mean error in normal estimation for WPF and FWPF with α = 20˚, 30˚, 40˚ and 90˚. Using α = 90˚ is the same as not doing the filtering step. We see in figure 2 that α = 20˚ gives a smaller error in normal estimation when the angle between planes is smaller than 60˚ and α = 30˚ gives the best results when the angle between planes is greater than 60˚. We have considered the value α = 30˚ as the best choice because it gives the smaller mean error in normal estimation when the angle between planes varies from 20˚ to 90˚. Figure 3 shows the normals of the planes with 90˚ angle and better results in the red circle (normals are at 90˚ with the plane).

Figure 2. Comparison of mean error in normal estimation of two planes with α = 20˚, 30˚, 40˚ and 90˚ (= no filtering).

Figure 3. Filtered weighted normal estimation of two planes with uniform noise and with 90˚ angle between them (α = 30˚).

3.2.3 The score of local planarity

In many region growing algorithms, the criterion used for the score of the local fitting plane is the residual, like in [18] or [19], i.e. the sum of the squares of distances from points to the plane. We have a different score function to estimate local planarity. For that, we first compute the neighbors N_i of a point p with points i whose normals n_i are close to the normal n_p. More precisely, we compute N_i = {points i in the k neighbors of p such that |n_i·n_p| > cos(α)}. It is a way to keep only the points which are probably on the local plane before the least square fitting. Then, we compute the local plane fitting of point p with the N_i neighbors by least squares like in [21]. The set N_i′ is the subset of N_i of points belonging to the plane, i.e. the points for which the distance to the local plane is smaller than the parameter γ (to consider the noise). The score s of the local plane is the area of the local plane, i.e. the number of points "in" the plane divided by the local density ρ_i (seen in section 3.2.1): the score s = card(N_i′)/ρ_i. We take into consideration the area of the local plane as the score function and not the number of points or the residual in order to be more robust to the sampling distribution.

3.2.4 Voxel decomposition

We use a data structure that is the core of our region growing method. It is a voxel grid that speeds up the plane detection process. Voxels are small cubes of length d that partition the point cloud space. Every point of data belongs to a voxel and a voxel contains a list of points. We use the Octree Class Template in [2] to compute an octree of the point cloud. The leaf nodes of the graph built are voxels of size d. Once the voxel grid has been computed, we start the plane detection algorithm.

3.2.5 Voxel Growing

With the estimator of local planarity, we take the point p with the best score, i.e. the point with the maximum area of local plane. We have the model parameters of this best seed plane and we start with an empty set E of points belonging to the plane. The initial point p is in a voxel v_0. All the points in the initial voxel v_0 for which the distance from the seed plane is less than γ are added to the set E. Then, we compute new plane parameters by least square refitting with the set E. Instead of growing with k nearest neighbors, we grow with voxels. Hence we test points in the 26 neighboring voxels. This is a way to search the neighborhood in constant time instead of O(log N) for each neighbor like with a Kd-tree. In a neighbor voxel, we add to E the points for which the distance to the current plane is smaller than γ and the angle between the normal computed in each point and the normal of the plane is smaller than a parameter α: |cos(n_p, n_P)| > cos(α), where n_p is the normal of the point p and n_P is the normal of the plane P. We have tested different values of α and we empirically found that 30˚ is a good value for all point clouds. If we added at least one point to E for this voxel, we compute new plane parameters from E by least square fitting and we test its 26 neighboring voxels. It is important to perform the plane least square fitting at each voxel addition because the seed plane model is not good enough with noise to be used in all the voxel growing, but only in surrounding voxels. This growing process is faster than classical region growing because we do not compute least squares for each point added but only for each voxel added.

The least square fitting step must be computed very fast. We use the same method as explained in [18] with incremental update of the barycenter b and covariance matrix C like equation (2). We know from [21] that the barycenter b belongs to the least square plane and that the normal of the least square plane n_P is the eigenvector of the smallest eigenvalue of C.

b_0 = 0_{3×1},  C_0 = 0_{3×3},
b_{n+1} = (1/(n+1))·(n·b_n + p_{n+1}),
C_{n+1} = C_n + (n/(n+1))·ᵗ(p_{n+1} − b_n)(p_{n+1} − b_n),   (2)

where C_n is the covariance matrix of a set of n points, b_n is the barycenter vector of a set of n points and p_{n+1} is the (n+1)-th point vector added to the set.

This voxel growing method leads to a connected component set E because the points have been added by connected voxels. In our case, the minimum distance between one point and E is less than the parameter d of our voxel grid. That is why the parameter d also represents the connectivity of points in detected planes.

3.2.6 Plane Detection

To get all planes with an area of at least area_min in the point cloud, we repeat these steps (best local seed plane choice and voxel growing) with all points by descending order of their score. Once we have a set E whose area is bigger than area_min, we keep it and classify all points in E.

4. Results and Discussion

4.1. Benchmark analysis

To test the improvements of our method, we have employed the comparative framework of [12] based on range images. For that, we have converted all images into 3D point clouds. All point clouds created have 260k points. After our segmentation, we project labelled points on a segmented image and compare with the ground truth image. We have chosen our three parameters d, area_min and γ by optimizing the result of the 10 perceptron training image segmentations (the perceptron is a portable scanner that produces a range image of its environment). Best results have been obtained with area_min = 200, γ = 5 and d = 8 (units are not provided in the benchmark). We show the results of the 30 perceptron image segmentations in Table 1.
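As a concrete illustration of the voxel-growing machinery described in Section 3, the incremental least-squares update of equation (2) can be sketched as follows. This is a minimal sketch under the assumption that points are NumPy 3-vectors; the class layout and the eigendecomposition call are illustrative, not the authors' implementation.

import numpy as np

class IncrementalPlaneFit:
    # Maintain the barycenter b and covariance C of the growing set E (equation 2),
    # so that the plane can be refitted cheaply after each voxel is added.
    def __init__(self):
        self.n = 0
        self.b = np.zeros(3)
        self.C = np.zeros((3, 3))

    def add_point(self, p):
        d = p - self.b
        self.C += (self.n / (self.n + 1)) * np.outer(d, d)
        self.b = (self.n * self.b + p) / (self.n + 1)
        self.n += 1

    def plane(self):
        # The plane passes through the barycenter; its normal is the eigenvector
        # of C associated with the smallest eigenvalue.
        _, vecs = np.linalg.eigh(self.C)
        return vecs[:, 0], self.b

In the growing loop, a candidate point p would then be accepted only if its distance |n_P · (p − b)| to the current plane is below γ and its normal agrees with n_P within α.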
GT Regions is the mean number of ground truth planes over the 30 ground truth range images. Correct detection, over-segmentation, under-segmentation, missed and noise are the mean numbers of correct, over-segmented, under-segmented, missed and noise planes detected by the methods. The tolerance of 80% is the minimum percentage of points we must have detected, compared to the ground truth, to have a correct detection. More details are in [12]. UE is a method from [12], UFPR is a method from [10]. It is important to notice that UE and UFPR are range image methods and our method is not well suited for range images but for 3D point clouds. Nevertheless, it is a good benchmark for comparison and we see in Table 1 that the accuracy of our method is very close to the state of the art in range image segmentation.

To evaluate the different improvements of our algorithm, we have tested different variants of our method. We have tested our method without normals (only with the distance from points to plane), without voxel growing (with a classical region growing by k neighbors), without our FWPF normal estimation (with WPF normal estimation), and without our score function (with the residual score function). The comparison is visible in Table 2. We can see the difference in computing time between region growing and voxel growing. We have tested our algorithm with and without normals and we found that the accuracy cannot be achieved without normal computation. There is also a big difference in the correct detection between WPF and our FWPF normal estimation, as we can see in figure 4. Our FWPF normal brings a real improvement in border estimation of planes. Black points in the figure are non-classified points.

Figure 5. Correct detection of our segmentation algorithm when the voxel size d changes.

We would like to discuss the influence of parameters on our algorithm. We have three parameters: area_min, which represents the minimum area of the plane we want to keep, γ, which represents the thickness of the plane (it is generally closely tied to the noise in the point cloud and especially the standard deviation σ of the noise), and d, which is the minimum distance from a point to the rest of the plane.
These three parameters depend on the point cloud features and the desired segmentation. For example, if we have a lot of noise, we must choose a high γ value. If we want to detect only large planes, we set a large area_min value. We also focus our analysis on the robustness of the voxel size d in our algorithm, i.e. the ratio of points vs voxels. We can see in figure 5 the variation of the correct detection when we change the value of d. The method seems to be robust when d is between 4 and 10 but the quality decreases when d is over 10. It is due to the fact that for a large voxel size d, some planes from different objects are merged into one plane.

Table 1. Average results of different segmenters at 80% compare tolerance.
Method      GT Regions  Correct detection  Over-segmentation  Under-segmentation  Missed  Noise  Duration (s)
UE          14.6        10.0               0.2                0.3                 3.8     2.1    -
UFPR        14.6        11.0               0.3                0.1                 3.0     2.5    -
Our method  14.6        10.9               0.2                0.1                 3.3     0.7    308

Table 2. Average results of variants of our segmenter at 80% compare tolerance.
Our method                  GT Regions  Correct detection  Over-segmentation  Under-segmentation  Missed  Noise  Duration (s)
without normals             14.6        5.67               0.1                0.1                 9.4     6.5    70
without voxel growing       14.6        10.7               0.2                0.1                 3.4     0.8    605
without FWPF                14.6        9.3                0.2                0.1                 5.0     1.9    195
without our score function  14.6        10.3               0.2                0.1                 3.9     1.2    308
with all improvements       14.6        10.9               0.2                0.1                 3.3     0.7    308

4.1.1 Large scale data

We have tested our method on different kinds of data. We have segmented urban data in figure 6 from our Mobile Mapping System (MMS) described in [11]. The mobile system generates 10k pts/s with a density of 50 pts/m² and very noisy data (σ = 0.3 m). For this point cloud, we want to detect building facades. We have chosen area_min = 10 m², d = 1 m to have large connected components and γ = 0.3 m to cope with the noise.

We have tested our method on the point cloud from the Trimble VX scanner in figure 7. It is a point cloud of size 40k points with only 20 pts/m², with less noise because it is a fixed scanner (σ = 0.2 m). In that case, we also wanted to detect building facades and keep the same parameters except γ = 0.2 m because we had less noise. We see in figure 7 that we have detected two facades. By setting a larger voxel size d value like d = 10 m, we detect only one plane. We choose d like area_min and γ according to the desired segmentation and to the level of detail we want to extract from the point cloud.

We also tested our algorithm on the point cloud from the LEICA Cyrax scanner in figure 8. This point cloud has been taken from the AIM@SHAPE repository [1]. It is a very dense point cloud from multiple fixed positions of the scanner with about 400 pts/m² and very little noise (σ = 0.02 m). In this case, we wanted to detect all the little planes to model the church in planar regions. That is why we have chosen d = 0.2 m, area_min = 1 m² and γ = 0.02 m.

In figures 6, 7 and 8, we have, on the left, the input point cloud and, on the right, we only keep points detected in a plane (planes are in random colors). The red points in these figures are seed plane points. We can see in these figures that planes are very well detected even with high noise.
Table 3 shows the information on the point clouds, the results with the number of planes detected and the duration of the algorithm. The time includes the computation of the FWPF normals of the point cloud. We can see in Table 3 that our algorithm performs linearly in time with respect to the number of points. The choice of parameters has little influence on computing time. The computation time is about one millisecond per point whatever the size of the point cloud (we used a PC with a QuadCore Q9300 and 2 GB of RAM). The algorithm has been implemented using only one thread and in-core processing. Our goal is to compare the improvement of plane detection between classical region growing and our region growing with better normals for more accurate planes and voxel growing for faster detection. Our method seems to be compatible with out-of-core implementations like described in [24] or in [15].

Table 3. Results on different data.
                   MMS Street   VX Street   Church
Size (points)      398k         42k         7.6M
Mean Density       50 pts/m²    20 pts/m²   400 pts/m²
Number of Planes   20           21          42
Total Duration     452 s        33 s        6900 s
Time/point         1 ms         1 ms        1 ms

5. Conclusion

In this article, we have proposed a new method of plane detection that is fast and accurate even in the presence of noise. We demonstrate its efficiency with different kinds of data and its speed on large data sets with millions of points. Our voxel growing method has a complexity of O(N) and it is able to detect large and small planes in very large data sets and can extract them directly as connected components.

Figure 4. Ground truth, our segmentation without and with filtered normals.
Figure 6. Plane detection in a street point cloud generated by MMS (d = 1 m, area_min = 10 m², γ = 0.3 m).

References
[1] AIM@SHAPE repository.
[2] Octree class template, /code/octree.html.
[3] A. Bab-Hadiashar and N. Gheissari. Range image segmentation using surface selection criterion. 2006. IEEE Transactions on Image Processing.
[4] J. Bauer, K. Karner, K. Schindler, A. Klaus, and C. Zach. Segmentation of building models from dense 3D point-clouds. 2003. Workshop of the Austrian Association for Pattern Recognition.
[5] H. Boulaassal, T. Landes, P. Grussenmeyer, and F. Tarsha-Kurdi. Automatic segmentation of building facades using terrestrial laser data. 2007. ISPRS Workshop on Laser Scanning.
[6] C. C. Chen and I. Stamos. Range image segmentation for modeling and object detection in urban scenes. 2007. 3DIM 2007.
[7] T. K. Dey, G. Li, and J. Sun. Normal estimation for point clouds: A comparison study for a Voronoi based method. 2005. Eurographics Symposium on Point-Based Graphics.
[8] J. R. Diebel, S. Thrun, and M. Brunig. A Bayesian method for probable surface reconstruction and decimation. 2006. ACM Transactions on Graphics (TOG).
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM.
[10] P. F. U. Gotardo, O. R. P. Bellon, and L. Silva. Range image segmentation by surface extraction using an improved robust estimator. 2003. Proceedings of Computer Vision and Pattern Recognition.
[11] F. Goulette, F. Nashashibi, I. Abuhadrous, S. Ammoun, and C. Laurgeau. An integrated on-board laser range sensing system for on-the-way city and road modelling. 2007. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences.
[12] A. Hoover, G. Jean-Baptiste, et al. An experimental comparison of range image segmentation algorithms. 1996. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[13] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points. 1992. International Conference on Computer Graphics and Interactive Techniques.
[14] P. Hough. Method and means for recognizing complex patterns. 1962. US Patent.
[15] M. Isenburg, P. Lindstrom, S. Gumhold, and J. Snoeyink. Large mesh simplification using processing sequences. 2003.
Segmentation - University of M

Robustness
– Outliers: Improve the model either by giving the noise “heavier tails” or allowing an explicit outlier model
– M-estimators
Assuming that somewhere in the collection of processes close to our model is the real process, and it just happens to be the one that makes the estimator produce the worst possible estimates
– Proximity, similarity, common fate, common region, parallelism, closure, symmetry, continuity, familiar configuration
Segmentation by clustering
Partitioning vs. grouping Applications
Minimize  Σ_i ρ(r_i(x_i, θ); σ),  where  ρ(u; σ) = u² / (σ² + u²)
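A minimal sketch of this robust error function is shown below; the scale σ and the residual values are illustrative.

def rho(u, sigma=1.0):
    # Robust M-estimator error: grows like u^2 for small residuals,
    # but saturates near 1 for outliers, limiting their influence on the fit.
    return u * u / (sigma * sigma + u * u)

print(rho(0.5))   # small residual: behaves roughly quadratically
print(rho(50.0))  # outlier: penalty saturates near 1 instead of dominating the fit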
Segmentation by fitting a model (3)
RANSAC (RANdom SAmple Consensus)
– Searching for a random sample that leads to a fit on which many of the data points agree
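The loop described in this bullet can be sketched as follows for fitting a 2-D line; the minimal-sample model, inlier threshold, and iteration count are illustrative assumptions.

import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.05, rng=None):
    # Fit a line a*x + b*y + c = 0 by repeatedly sampling minimal subsets
    # and keeping the hypothesis that the most points agree with.
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        p, q = points[rng.choice(len(points), size=2, replace=False)]
        a, b = q[1] - p[1], p[0] - q[0]
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        a, b = a / norm, b / norm
        c = -(a * p[0] + b * p[1])
        # Count the points whose distance to the candidate line is small.
        inliers = int((np.abs(points @ np.array([a, b]) + c) < inlier_tol).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (a, b, c)
    return best_model, best_inliers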
Allocate each data point to the cluster whose center is nearest
Azero: a large-scale graph processing system with dynamic load balancing (lecture slides)

Graph processing workflow
[Diagram: the graph data is partitioned among workers that execute under the BSP model, coordinated by a manager, and the results are written to the output.]
Computation model
[Diagram: one BSP superstep across workers w1 to w5: local computing, then communication, then barrier synchronization.]
Vertex computation process
Superstep k1
public long getNumVertexes();
public void voteToHalt();
PageRank Example
public class PageRank implements VertexProcessor {
    public void compute(WorkerContext context) {
        // Get the vertex being processed in this superstep.
        Vertex v = context.getCurrentVertex();
        if (context.getNumSuperSteps() >= 1) {
            // Sum the PageRank contributions received from in-neighbors
            // during the previous superstep.
            double sum = 0;
            for (List<String> list : context.getMessages().values()) {
                sum += Double.valueOf(list.get(0));
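The slide is cut off above; for completeness, here is a language-agnostic sketch of what a full Pregel/BSP-style PageRank vertex program typically computes in one superstep. The vertex structure, damping factor, and message format are generic assumptions and are not the Azero API.

from dataclasses import dataclass, field

@dataclass
class SimpleVertex:
    rank: float
    out_edges: list = field(default_factory=list)

def pagerank_superstep(vertex, messages, num_vertices, superstep, damping=0.85):
    # Combine the contributions received from in-neighbors in the previous superstep.
    if superstep >= 1:
        vertex.rank = (1 - damping) / num_vertices + damping * sum(messages)
    # Emit this vertex's contribution to each out-neighbor for the next superstep.
    share = vertex.rank / max(len(vertex.out_edges), 1)
    return [(dst, share) for dst in vertex.out_edges]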
Memetic Algorithms (Cultural Gene Algorithm) lecture slides

process, we can now have access to the larger picture. The functioning of a MA consists of the iteration of this basic generational step.
Pseudo code:
Process MA () → Individual[ ]
The basic model of MAs
MA(popSize, …)
The initial parameters of the algorithm include popSize, the size of the initial population.
variables
  pop : Individual[];
begin
  pop ← Generate-Initial-Population();
  repeat
    pop ← Do-Generation(pop);
    if Converged(pop) then
      pop ← Restart-Population(pop);
The development of MAs - 1st generation
Hybrid Algorithms
A marriage between a population-based global search (often in the form of an evolutionary algorithm) and a cultural evolutionary stage.
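To make that generational step concrete, the following Java sketch runs one memetic generation per iteration: evolutionary (global) variation followed by a short hill-climbing local search on each offspring. The sphere objective, the operators, and every parameter value are illustrative assumptions, not taken from the slides.

import java.util.Arrays;
import java.util.Random;

public class MemeticSketch {
    static final Random RNG = new Random(42);
    static final int DIM = 5, POP = 20;

    // Toy objective to minimise: the sphere function.
    static double fitness(double[] x) {
        double s = 0;
        for (double v : x) s += v * v;
        return s;
    }

    // Cultural / individual-learning stage: a short hill-climbing local search.
    static double[] localSearch(double[] x) {
        double[] best = x.clone();
        for (int step = 0; step < 20; step++) {
            double[] cand = best.clone();
            cand[RNG.nextInt(DIM)] += 0.1 * RNG.nextGaussian();
            if (fitness(cand) < fitness(best)) best = cand;
        }
        return best;
    }

    // One generational step: global (evolutionary) variation, then local refinement of each offspring.
    static double[][] doGeneration(double[][] pop) {
        double[][] next = new double[POP][];
        for (int i = 0; i < POP; i++) {
            // Binary tournament selection.
            double[] a = pop[RNG.nextInt(POP)], b = pop[RNG.nextInt(POP)];
            double[] parent = fitness(a) < fitness(b) ? a : b;
            // Gaussian mutation plays the role of the global search operator here.
            double[] child = parent.clone();
            for (int d = 0; d < DIM; d++) child[d] += 0.3 * RNG.nextGaussian();
            next[i] = localSearch(child);
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] pop = new double[POP][DIM];
        for (double[] ind : pop)
            for (int d = 0; d < DIM; d++) ind[d] = RNG.nextDouble() * 4 - 2;
        for (int gen = 0; gen < 30; gen++) pop = doGeneration(pop);
        double best = Arrays.stream(pop).mapToDouble(MemeticSketch::fitness).min().getAsDouble();
        System.out.println("best fitness after 30 generations: " + best);
    }
}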
Local Graph Partitioning using PageRank Vectors

Reid Andersen (University of California, San Diego), Fan Chung (University of California, San Diego), Kevin Lang (Yahoo! Research)

Abstract
A local graph partitioning algorithm finds a cut near a specified starting vertex, with a running time that depends largely on the size of the small side of the cut, rather than the size of the input graph. In this paper, we present an algorithm for local graph partitioning using personalized PageRank vectors. We develop an improved algorithm for computing approximate PageRank vectors, and derive a mixing result for PageRank vectors similar to that for random walks. Using this mixing result, we derive an analogue of the Cheeger inequality for PageRank, which shows that a sweep over a single PageRank vector can find a cut with conductance φ, provided there exists a cut with conductance at most f(φ), where f(φ) is Ω(φ²/log m), and where m is the number of edges in the graph. By extending this result to approximate PageRank vectors, we develop an algorithm for local graph partitioning that can be used to find a cut with conductance at most φ, whose small side has volume at least 2^b, in time O(2^b log³ m/φ²). Using this local graph partitioning algorithm as a subroutine, we obtain an algorithm that finds a cut with conductance φ and approximately optimal balance in time O(m log⁴ m/φ³).

1 Introduction
One of the central problems in algorithmic design is the problem of finding a cut with a small conductance. There is a large literature of research papers on this topic, with applications in numerous areas. Spectral partitioning, where an eigenvector is used to produce a cut, is one of the few approaches to this problem that can be analyzed theoretically. The Cheeger inequality [4] shows that the cut obtained by spectral partitioning has conductance within a quadratic factor of the optimum.
Spectral partitioning can be applied recursively, with the resulting cuts combined in various ways, to solve more complicated problems; for example, recursive spectral algorithms have been used to find k-way partitions, spectral clusterings, and separators in planar graphs [2, 8, 13, 14]. There is no known way to lower bound the size of the small side of the cut produced by spectral partitioning, and this adversely affects the running time of recursive spectral partitioning.
Local spectral techniques provide a faster alternative to recursive spectral partitioning by avoiding the problem of unbalanced cuts. Spielman and Teng introduced a local partitioning algorithm called Nibble, which finds relatively small cuts near a specified starting vertex, in time proportional to the volume of the small side of the cut. The small cuts found by Nibble can be combined to form balanced cuts and multiway partitions in almost linear time, and the Nibble algorithm is an essential subroutine in algorithms for graph sparsification and solving linear systems [15]. The analysis of the Nibble algorithm is based on a mixing result by Lovász and Simonovits [9, 10], which shows that cuts with small conductance can be found by simulating a random walk and performing sweeps over the resulting sequence of walk vectors.
In this paper, we present a local graph partitioning algorithm that uses personalized PageRank vectors to produce cuts. Because a PageRank vector is defined recursively (as we will describe in Section 2), we can consider a single PageRank vector in place of a sequence of random walk vectors, which simplifies the process of finding cuts and allows greater flexibility when computing approximations. We show directly that a sweep over a single approximate PageRank vector can produce cuts with small conductance. In contrast, Spielman and Teng show that when a good cut can be found from a series of walk distributions, a similar cut can be found from a series of approximate walk distributions. Our method of analysis allows us to find cuts using approximations with larger amounts of error, which improves the running time.
The analysis of our algorithm is based on the following results:
• We give an improved algorithm for computing approximate PageRank vectors. We use a technique introduced by Jeh-Widom [7], and further developed by Berkhin in his Bookmark Coloring Algorithm [1]. The algorithms of Jeh-Widom and Berkhin compute many personalized PageRank vectors simultaneously, more quickly than they could be computed individually. Our algorithm computes a single approximate PageRank vector more quickly than the algorithms of Jeh-Widom and Berkhin by a factor of log n.
• We prove a mixing result for PageRank vectors that is similar to the Lovász-Simonovits mixing result for random walks. Using this mixing result, we show that if a sweep over a PageRank vector does not produce a cut with small conductance, then that PageRank vector is close to the stationary distribution. We then show that for any set C with small conductance, and for many starting vertices contained in C, the resulting PageRank vector is not close to the stationary distribution, because it has significantly more probability within C. Combining these results yields a local version of the Cheeger inequality for PageRank vectors: if C is a set with conductance Φ(C) ≤ f(φ), then a sweep over a PageRank vector pr(α, χ_v) finds a set with conductance at most φ, provided that α is set correctly depending on φ, and that v is one of a significant number of good starting vertices within C. This holds for a function f(φ) that satisfies f(φ) = Ω(φ²/log m).
Using the results described above, we produce a local partitioning algorithm PageRank-Nibble which improves both the running time and approximation ratio of Nibble. PageRank-Nibble takes as input a starting vertex v, a target conductance φ, and an integer b ∈ [1, log m]. When v is a good starting vertex for a set C with conductance Φ(C) ≤ g(φ), there is at least one value of b where PageRank-Nibble produces a set S with the following properties: the conductance of S is at most φ, the volume of S is at least 2^(b−1) and at most (2/3)vol(G), and the intersection of S and C satisfies vol(S ∩ C) ≥ 2^(b−2). This holds for a function g(φ) that satisfies g(φ) = Ω(φ²/log² m). The running time of PageRank-Nibble is O(2^b log³ m/φ²), which is nearly linear in the volume of S. In comparison, the Nibble algorithm requires that C have conductance O(φ³/log² m), and runs in time O(2^b log⁴ m/φ⁵). PageRank-Nibble can be used interchangeably with Nibble, leading immediately to faster algorithms with improved approximation ratios in several applications. In particular, we obtain an algorithm PageRank-Partition that finds cuts with small conductance and approximately optimal balance: if there exists a set C satisfying Φ(C) ≤ g(φ) and vol(C) ≤ (1/2)vol(G), then the algorithm finds a set S such that Φ(S) ≤ φ and (1/2)vol(C) ≤ vol(S) ≤ (5/6)vol(G), in time O(m log⁴ m/φ³). This holds for a function g(φ) that satisfies g(φ) = Ω(φ²/log² m).

2 Preliminaries
In this paper we consider an undirected, unweighted graph G, where V is the vertex set, E is the edge set, n is the number of vertices, and m is the number of undirected edges. We write d(v) for the degree of vertex v, let D be the degree matrix (the diagonal matrix with D_{i,i} = d(v_i)), and let A be the adjacency matrix. We will consider distributions on V, which are vectors indexed by the vertices in V, with the additional requirement that each entry be nonnegative. A distribution p is considered to be a row vector, so we can write the product of p and A as pA.

2.1 Personalized PageRank vectors
PageRank was introduced by Brin and Page [12, 3]. For convenience, we introduce a lazy variation of PageRank, which we define to be the unique solution pr(α, s) of the equation
pr(α, s) = αs + (1 − α) pr(α, s) W,   (1)
where α is a constant in (0, 1] called the teleportation constant, s is a distribution called the preference vector, and W is the lazy random walk transition matrix W = (1/2)(I + D⁻¹A). In the Appendix, we show that this is equivalent to the traditional definition of PageRank (which uses a regular random walk step instead of a lazy step) up to a change in α.
The PageRank vector that is usually associated with search ranking has a preference vector equal to the uniform distribution (1/n)·1. PageRank vectors whose preference vectors are concentrated on a smaller set of vertices are often called personalized PageRank vectors. These were introduced by Haveliwala [6], and have been used to provide personalized search ranking and context-sensitive search [1, 5, 7]. The preference vectors used in our algorithms have all probability on a single starting vertex.
Here are some useful properties of PageRank vectors (also see [6] and [7]). The proofs are given in the Appendix.
Proposition 1. For any starting distribution s, and any constant α in (0, 1], there is a unique vector pr(α, s) satisfying pr(α, s) = αs + (1 − α) pr(α, s) W.
Proposition 2. For any fixed value of α in (0, 1], there is a linear transformation R_α such that pr(α, s) = s R_α. Furthermore, R_α is given by the matrix
R_α = α Σ_{t=0}^{∞} (1 − α)^t W^t,   (2)
which implies that a PageRank vector is a weighted average of lazy walk vectors,
pr(α, s) = α Σ_{t=0}^{∞} (1 − α)^t s W^t.   (3)
It follows that pr(α, s) is linear in the preference vector s.
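As a small numerical illustration of Proposition 2 (this sketch is ours, not part of the paper), the Java program below accumulates the series α Σ_{t≥0} (1 − α)^t s W^t for a preference vector χ_v concentrated on a single vertex of a tiny path graph; the graph, the value of α, and the truncation after 200 terms are illustrative choices.

public class PageRankSeriesSketch {
    public static void main(String[] args) {
        // Undirected path graph 0-1-2-3 given by adjacency lists.
        int[][] adj = {{1}, {0, 2}, {1, 3}, {2}};
        int n = adj.length;
        double alpha = 0.15;
        double[] walk = new double[n];   // current term chi_v W^t
        walk[0] = 1.0;                   // preference vector chi_v with v = 0
        double[] pr = new double[n];     // accumulated PageRank vector
        double weight = alpha;           // alpha * (1 - alpha)^t
        for (int t = 0; t < 200; t++) {
            for (int u = 0; u < n; u++) pr[u] += weight * walk[u];
            // One lazy walk step: keep half the mass, spread half evenly over the neighbours.
            double[] next = new double[n];
            for (int u = 0; u < n; u++) {
                next[u] += walk[u] / 2;
                for (int nb : adj[u]) next[nb] += walk[u] / (2.0 * adj[u].length);
            }
            walk = next;
            weight *= (1 - alpha);
        }
        // pr now approximates pr(alpha, chi_v); its entries sum to roughly 1.
        System.out.println(java.util.Arrays.toString(pr));
    }
}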
2.2 Conductance
The volume of a subset S ⊆ V of vertices is vol(S) = Σ_{x∈S} d(x). We remark that vol(V) = 2m, and we will sometimes write vol(G) in place of vol(V). The edge boundary of a set is defined to be ∂(S) = {{x, y} ∈ E | x ∈ S, y ∉ S}, and the conductance of a set is
Φ(S) = |∂(S)| / min(vol(S), 2m − vol(S)).

2.3 Distributions
Two distributions we will use frequently are the stationary distribution,
ψ_S(x) = d(x)/vol(S) if x ∈ S, and 0 otherwise,
and the indicator function,
χ_v(x) = 1 if x = v, and 0 otherwise.
The amount of probability from a distribution p on a set S of vertices is written p(S) = Σ_{x∈S} p(x). We will sometimes refer to the quantity p(S) as an amount of probability even if p(V) is not equal to 1. As an example of this notation, the PageRank vector with teleportation constant α and preference vector χ_v is written pr(α, χ_v), and the amount of probability from this distribution on a set S is written [pr(α, χ_v)](S). The support of a distribution is Supp(p) = {v | p(v) ≠ 0}.

2.4 Sweeps
A sweep is an efficient technique for producing cuts from an embedding of a graph, and is often used in spectral partitioning [11, 14]. We will use the following degree-normalized version of a sweep. Given a distribution p, with support size N_p = |Supp(p)|, let v_1, ..., v_{N_p} be an ordering of the vertices such that p(v_i)/d(v_i) ≥ p(v_{i+1})/d(v_{i+1}). This produces a collection of sets, S_j^p = {v_1, ..., v_j} for each j ∈ {0, ..., N_p}, which we call sweep sets. We let
Φ(p) = min_{j∈[1,N_p]} Φ(S_j^p)
be the smallest conductance of any of the sweep sets. A cut with conductance Φ(p) can be found by sorting p and computing the conductance of each sweep set, which can be done in time O(vol(Supp(p)) log n).

2.5 Measuring the spread of a distribution
We measure how well a distribution p is spread in the graph using a function p[k] defined for all integers k ∈ [0, 2m]. This function is determined by setting p[k] = p(S_j^p) for those values of k where k = vol(S_j^p), and the remaining values are set by defining p[k] to be piecewise linear between these points. In other words, for any integer k ∈ [0, 2m], if j is the unique index such that vol(S_j^p) ≤ k ≤ vol(S_{j+1}^p), then
p[k] = p(S_j^p) + (k − vol(S_j^p)) · p(v_{j+1})/d(v_{j+1}).
This implies that p[k] is an increasing function of k, and a concave function of k. It is not hard to see that p[k] is an upper bound on the amount of probability from p on any set with volume k; for any set S, we have p(S) ≤ p[vol(S)].
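A compact Java sketch of the degree-normalized sweep from Section 2.4 (the sketch and its six-vertex example graph, two triangles joined by a single edge, are our illustrations, not from the paper). It orders vertices by p(v)/d(v), maintains vol(S_j^p) and |∂(S_j^p)| incrementally, and returns the smallest conductance Φ(S_j^p); on the example the best sweep cut separates the two triangles and has conductance 1/7.

import java.util.Arrays;
import java.util.Comparator;

public class SweepSketch {
    // Smallest conductance Phi(S_j) over the sweep sets of the distribution p.
    static double sweepConductance(int[][] adj, double[] p) {
        int n = adj.length;
        int twoM = 0;
        for (int[] nbrs : adj) twoM += nbrs.length;          // vol(V) = 2m
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Order vertices by decreasing p(v)/d(v).
        Arrays.sort(order, Comparator.comparingDouble(v -> -p[v] / adj[v].length));
        boolean[] inS = new boolean[n];
        int vol = 0, boundary = 0;
        double best = Double.POSITIVE_INFINITY;
        for (int j = 0; j < n - 1; j++) {                     // skip the full vertex set
            int v = order[j];
            inS[v] = true;
            vol += adj[v].length;
            // Adding v: edges to vertices outside S join the boundary, edges to vertices already in S leave it.
            for (int nb : adj[v]) boundary += inS[nb] ? -1 : 1;
            best = Math.min(best, (double) boundary / Math.min(vol, twoM - vol));
        }
        return best;
    }

    public static void main(String[] args) {
        // Two triangles {0,1,2} and {3,4,5} joined by the edge {2,3}.
        int[][] adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2, 4, 5}, {3, 5}, {3, 4}};
        double[] p = {0.3, 0.3, 0.3, 0.05, 0.03, 0.02};       // mass concentrated on one triangle
        System.out.println(sweepConductance(adj, p));          // prints 0.142857... = 1/7
    }
}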
3 Computing approximate PageRank vectors
To approximate a PageRank vector pr(α, s), we compute a pair of distributions p and r with the following property:
p + pr(α, r) = pr(α, s).   (4)
If p and r are two distributions with this property, we say that p is an approximate PageRank vector, which approximates pr(α, s) with the residual vector r. We will use the notation p = apr(α, s, r) to refer to an approximate PageRank vector obeying the equation above. Since the residual vector is nonnegative, it is always true that apr(α, s, r) ≤ pr(α, s), for any residual vector r. In this section, we give an algorithm that computes an approximate PageRank vector with a small residual vector and small support, with running time independent of the size of the graph.
Theorem 1. ApproximatePageRank(v, α, ε) runs in time O(1/(εα)), and computes an approximate PageRank vector p = apr(α, χ_v, r) such that the residual vector r satisfies max_{u∈V} r(u)/d(u) < ε, and such that vol(Supp(p)) ≤ 1/(εα).
We remark that this algorithm is based on the algorithms of Jeh-Widom [7] and Berkhin [1], both of which can be used to compute similar approximate PageRank vectors in time O(log n/(εα)). The extra factor of log n in the running time of these algorithms is overhead from maintaining a heap or priority queue, which we eliminate.
The proof of Theorem 1 is based on a series of facts which we describe below. Our algorithm is motivated by the following observation of Jeh-Widom:
pr(α, s) = αs + (1 − α) pr(α, sW).   (5)
Notice that the equation above is similar to, but different from, the equation used in Section 2 to define PageRank. This observation is simple, but it is instrumental in our algorithm, and it is not trivial. To prove it, first reformulate the linear transformation R_α that takes a starting distribution to its corresponding PageRank vector, as follows:
R_α = α Σ_{t=0}^{∞} (1 − α)^t W^t = αI + (1 − α) W R_α.
Applying this rearranged transformation to a starting distribution s yields equation (5):
pr(α, s) = s R_α = αs + (1 − α) s W R_α = αs + (1 − α) pr(α, sW).
This provides a flexible way to compute an approximate PageRank vector. We maintain a pair of distributions: an approximate PageRank vector p and its associated residual vector r. Initially, we set p = 0 and r = χ_v. We then apply a series of push operations, based on equation (5), which alter p and r. Each push operation takes a single vertex u, moves an α fraction of the probability from r(u) onto p(u), and then spreads the remaining (1 − α) fraction within r, as if a single step of the lazy random walk were applied only to the vertex u. Each push operation maintains the invariant
p + pr(α, r) = pr(α, χ_v),   (6)
which ensures that p is an approximate PageRank vector for pr(α, χ_v) after any sequence of push operations. We now formally define push_u, which performs this push operation on the distributions p and r at a chosen vertex u.
push_u(p, r):
1. Let p′ = p and r′ = r, except for the following changes:
   (a) p′(u) = p(u) + α r(u).
   (b) r′(u) = (1 − α) r(u)/2.
   (c) For each v such that (u, v) ∈ E: r′(v) = r(v) + (1 − α) r(u)/(2 d(u)).
2. Return (p′, r′).
Lemma 1. Let p′ and r′ be the result of the operation push_u on p and r. Then
p′ + pr(α, r′) = p + pr(α, r).
The proof of Lemma 1 can be found in the Appendix. During each push, some probability is moved from r to p, where it remains, and after sufficiently many pushes r can be made small. We can bound the number of pushes required by the following algorithm.
ApproximatePageRank(v, α, ε):
1. Let p = 0 and r = χ_v.
2. While max_{u∈V} r(u)/d(u) ≥ ε:
   (a) Choose any vertex u where r(u)/d(u) ≥ ε.
   (b) Apply push_u at vertex u, updating p and r.
3. Return p, which satisfies p = apr(α, χ_v, r) with max_{u∈V} r(u)/d(u) < ε.
Lemma 2. Let T be the total number of push operations performed by ApproximatePageRank, and let d_i be the degree of the vertex u used in the i-th push. Then
Σ_{i=1}^{T} d_i ≤ 1/(εα).
Proof. The amount of probability on the vertex pushed at time i is at least ε d_i, therefore |r|₁ decreases by at least αε d_i during the i-th push. Since |r|₁ = 1 initially, we have αε Σ_{i=1}^{T} d_i ≤ 1, and the result follows.
To implement ApproximatePageRank, we determine which vertex to push at each step by maintaining a queue containing those vertices u with r(u)/d(u) ≥ ε. At each step, push operations are performed on the first vertex in the queue until r(u)/d(u) < ε for that vertex, which is then removed from the queue. If a push operation raises the value of r(x)/d(x) above ε for some vertex x, that vertex is added to the back of the queue. This continues until the queue is empty, at which point every vertex has r(u)/d(u) < ε. We will show that this algorithm has the properties promised in Theorem 1. The proof is contained in the Appendix.
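A compact Java sketch of the push operation and the queue-based ApproximatePageRank loop just described; the sketch, its method names, the six-vertex example graph, and the parameter values are our own illustrations, not code from the paper.

import java.util.ArrayDeque;
import java.util.Arrays;

public class ApproximatePageRankSketch {
    // Computes p = apr(alpha, chi_v, r) with max_u r(u)/d(u) < eps.
    static double[] apr(int[][] adj, int v, double alpha, double eps) {
        int n = adj.length;
        double[] p = new double[n], r = new double[n];
        r[v] = 1.0;                                   // residual starts as chi_v
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        queue.add(v);
        while (!queue.isEmpty()) {
            int u = queue.peek();
            if (r[u] / adj[u].length < eps) { queue.poll(); continue; }
            // push_u: move an alpha fraction of r(u) to p(u), then spread the rest
            // as one lazy-walk step applied only at u.
            double ru = r[u];
            p[u] += alpha * ru;
            r[u] = (1 - alpha) * ru / 2;
            for (int nb : adj[u]) {
                boolean wasBelow = r[nb] / adj[nb].length < eps;
                r[nb] += (1 - alpha) * ru / (2.0 * adj[u].length);
                // Enqueue a neighbour the moment it crosses the threshold.
                if (wasBelow && r[nb] / adj[nb].length >= eps) queue.add(nb);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Two triangles joined by one edge, as in the earlier sweep example.
        int[][] adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2, 4, 5}, {3, 5}, {3, 4}};
        double[] p = apr(adj, 0, 0.1, 1e-4);
        System.out.println(Arrays.toString(p));       // most of the mass stays on the first triangle
    }
}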
4 A mixing result for PageRank vectors
In this section, we prove a mixing result for PageRank vectors that is an analogue of the Lovász-Simonovits mixing result for random walks. For an approximate PageRank vector apr(α, s, r), we give an upper bound on apr(α, s, r)[k] that depends on the smallest conductance found by a sweep over apr(α, s, r). In contrast, the mixing result of Lovász and Simonovits bounds the quantity p^(t)[k] for the lazy random walk distribution p^(t) in terms of the smallest conductance found by sweeps over the previous walk distributions p^(0), ..., p^(t−1). The recursive property of PageRank allows us to consider a single vector instead of a sequence of random walk vectors, simplifying the process of finding cuts.
We use this mixing result to show that if an approximate PageRank vector apr(α, s, r) has significantly more probability than the stationary distribution on any set, the sweep over apr(α, s, r) produces a cut with small conductance.
Theorem 2. If there exists a set S of vertices and a constant δ ≥ 2/√m satisfying
apr(α, s, r)(S) − vol(S)/vol(G) > δ,
then Φ(apr(α, s, r)) < √(18α ln m / δ).
The proof of this theorem, and the more general mixing result from which it is derived, is described at the end of this section. The proof requires a sequence of lemmas, which we present below.
Every approximate PageRank vector, no matter how large the residual vector, obeys the following inequality. It is a one-sided version of the equation used to define PageRank.
Lemma 3. If apr(α, s, r) is an approximate PageRank vector, then
apr(α, s, r) ≤ αs + (1 − α) apr(α, s, r) W.
The proof of Lemma 3 can be found in the Appendix. Notice that this inequality relates apr(α, s, r) to apr(α, s, r)W. We will soon prove a result, Lemma 4, which describes how probability mixes in the single walk step between apr(α, s, r) and apr(α, s, r)W. We will then combine Lemma 4 with the inequality from Lemma 3 to relate apr(α, s, r) to itself, removing any reference to apr(α, s, r)W.
We now present definitions required for Lemma 4. Instead of viewing an undirected graph as a collection of undirected edges, we view each undirected edge {u, v} as a pair of directed edges (u, v) and (v, u). For each directed edge (u, v) we let p(u, v) = p(u)/d(u). For any set of directed edges A, we define p(A) = Σ_{(u,v)∈A} p(u, v). When a lazy walk step is applied to the distribution p, the amount of probability that moves from u to v is (1/2) p(u, v). For any set S of vertices, we have the set of directed edges into S, and the set of directed edges out of S, defined by in(S) = {(u, v) ∈ E | v ∈ S} and out(S) = {(u, v) ∈ E | u ∈ S}, respectively.
Lemma 4. For any distribution p, and any set S of vertices,
pW(S) ≤ (1/2)(p(in(S) ∪ out(S)) + p(in(S) ∩ out(S))).
The proof of Lemma 4 can be found in the Appendix. We now combine this result with the inequality from Lemma 3 to relate apr(α, s, r) to itself. In contrast, the proof of Lovász and Simonovits [9, 10] relates the walk distributions p^(t) and p^(t+1), where p^(t+1) = p^(t) W, and p^(0) = s.
Lemma 5. If p = apr(α, s, r) is an approximate PageRank vector, then for any set S of vertices,
p(S) ≤ αs(S) + (1 − α)(1/2)(p(in(S) ∪ out(S)) + p(in(S) ∩ out(S))).
Furthermore, for each j ∈ [1, n − 1],
p[vol(S_j^p)] ≤ α s[vol(S_j^p)] + (1 − α)(1/2)(p[vol(S_j^p) − |∂(S_j^p)|] + p[vol(S_j^p) + |∂(S_j^p)|]).
The proof of Lemma 5 is included in the Appendix. The following lemma uses the result from Lemma 5 to place an upper bound
on apr(α, s, r)[k]. More precisely, it shows that if a certain upper bound on apr(α, s, r)[k] − k/2m does not hold, then one of the sweep sets from apr(α, s, r) has both small conductance and a significant amount of probability from apr(α, s, r). This lower bound on probability will be used in Section 6 to control the volume of the resulting sweep set.
Theorem 3. Let p = apr(α, s, r) be an approximate PageRank vector with |s|₁ ≤ 1. Let φ and γ be any constants in [0, 1]. Either the following bound holds for any integer t and any k ∈ [0, 2m]:
p[k] − k/2m ≤ γ + αt + √(min(k, 2m − k)) (1 − φ²/8)^t,
or else there exists a sweep cut S_j^p with the following properties:
1. Φ(S_j^p) < φ,
2. p(S_j^p) − vol(S_j^p)/2m > γ + αt + √(min(vol(S_j^p), 2m − vol(S_j^p))) (1 − φ²/8)^t, for some integer t,
3. j ∈ [1, |Supp(p)|].
The proof can be found in the Appendix.
We can rephrase the sequence of bounds from Theorem 3 to prove the theorem promised at the beginning of this section. Namely, we show that if there exists a set of vertices, of any size, that contains a constant amount more probability from apr(α, s, r) than from the stationary distribution, then the sweep over apr(α, s, r) finds a cut with conductance roughly √(α ln m). We remark that this applies to any approximate PageRank vector, regardless of the size of the residual vector: the residual vector only needs to be small to ensure that apr(α, s, r) is large enough that the theorem applies. The proof is given in the appendix.

5 Local partitioning using approximate PageRank vectors
In this section, we show how sweeps over approximate PageRank vectors can be used to find cuts with nearly optimal conductance. Unlike traditional spectral partitioning, where a sweep over an eigenvector produces a cut with conductance near the global minimum, the cut produced by a PageRank vector depends on the starting vertex v, and also on α. We first identify a sizeable collection of starting vertices for which we can give a lower bound on apr(α, χ_v, r)(C).
Theorem 4. For any set C and any constant α, there is a subset C_α ⊆ C, with vol(C_α) ≥ vol(C)/2, such that for any vertex v ∈ C_α, the approximate PageRank vector apr(α, χ_v, r) satisfies
apr(α, χ_v, r)(C) ≥ 1 − Φ(C)/α − vol(C) · max_{u∈V} r(u)/d(u).
We will outline the proof of Theorem 4 at the end of this section. Theorem 4 can be combined with the mixing results from Section 4 to prove the following theorem, which describes a method for producing cuts from an approximate PageRank vector.
Theorem 5. Let φ be a constant in [0, 1], let α = φ²/(135 ln m), and let C be a set satisfying
1. Φ(C) ≤ φ²/(1350 ln m),
2. vol(C) ≤ (2/3)vol(G).
If v ∈ C_α, and if apr(α, χ_v, r) is an approximate PageRank vector with residual vector r satisfying max_{u∈V} r(u)/d(u) ≤ 1/(10 vol(C)), then Φ(apr(α, χ_v, r)) < φ.
We prove Theorem 5 by combining Theorem 4 and Theorem 2. A detailed proof is provided in the Appendix. As an immediate consequence of Theorem 5, we obtain a local Cheeger inequality for personalized PageRank vectors, which applies when the starting vertex is within a set that achieves the minimum conductance in the graph.
Theorem 6. Let Φ(G) be the minimum conductance of any set with volume at most vol(G)/2, and let C_opt be a set achieving this minimum. If pr(α, χ_v) is a PageRank vector where α = 10Φ(G), and v ∈ (C_opt)_α, then Φ(pr(α, χ_v)) < √(1350 Φ(G) ln m).
Theorem 6 follows immediately from Theorem 5 by setting φ = √(1350 Φ(G) ln m).
To prove Theorem 4, we will show that a set C with small conductance contains a significant amount of probability from pr(α, χ_v), for many of the vertices v in C. We first show that this holds for an average of the vertices in C, by showing that C contains a significant amount of probability from pr(α, ψ_C).
Lemma 6. The PageRank vector
pr(α, ψ_C) satisfies [pr(α, ψ_C)](C̄) ≤ Φ(C)/(2α).
The proof of Lemma 6 will be given in the Appendix. To prove Theorem 4 from Lemma 6, we observe that for many vertices in C, pr(α, χ_v) is not much larger than pr(α, ψ_C), and then bound the difference between apr(α, χ_v, r) and pr(α, χ_v) in terms of the residual vector r. A detailed proof can be found in the Appendix.

6 An algorithm for nearly linear time graph partitioning
In this section, we extend our local partitioning techniques to find a set with small conductance, while providing more control over the volume of the set produced. The result is an algorithm called PageRank-Nibble that takes a scale b as part of its input, runs in time proportional to 2^b, and only produces a cut when it finds a set with conductance φ and volume roughly 2^b. We prove that PageRank-Nibble finds a set with these properties for at least one value of b ∈ [1, ⌈log m⌉], provided that v is a good starting vertex for a set of conductance at most g(φ), where g(φ) = Ω(φ²/log² m).
PageRank-Nibble(v, φ, b):
Input: a vertex v, a constant φ ∈ (0, 1], and an integer b ∈ [1, B], where B = ⌈log₂ m⌉.
1. Let α = φ²/(225 ln(100√m)).
2. Compute an approximate PageRank vector p = apr(α, χ_v, r) with residual vector r satisfying max_{u∈V} r(u)/d(u) ≤ 2^(−b)/(48B).
3. Check each set S_j^p with j ∈ [1, |Supp(p)|], to see if it obeys the following conditions:
Conductance: Φ(S_j^p) < φ,
Volume: 2^(b−1) < vol(S_j^p) < (2/3)vol(G),
Probability change: p[2^b] − p[2^(b−1)] > 1/(48B).
4. If some set S_j^p satisfies all of these conditions, return S_j^p. Otherwise, return nothing.
Theorem 7. PageRank-Nibble(v, φ, b) can be implemented with running time O(2^b log³ m/φ²).
Theorem 8. Let C be a set satisfying Φ(C) ≤ φ²/(22500 log²(100m)) and vol(C) ≤ (1/2)vol(G), and let v be a vertex in C_α for α = φ²/(225 ln(100√m)). Then, there is some integer b ∈ [1, ⌈log₂ m⌉] for which PageRank-Nibble(v, φ, b) returns a set S. Any set S returned by PageRank-Nibble(v, φ, b) has the following properties:
1. Φ(S) < φ,
2. 2^(b−1) < vol(S) < (2/3)vol(G),
3. vol(S ∩ C) > 2^(b−2).
The proofs of Theorems 7 and 8 are included in the Appendix.
PageRank-Nibble improves both the running time and approximation ratio of the Nibble algorithm of Spielman and Teng, which runs in time O(2^b log⁴ m/φ⁵), and requires Φ(C) = O(φ³/log² m). PageRank-Nibble can be used interchangeably with Nibble in several important applications. For example, both PageRank-Nibble and Nibble can be applied recursively to produce cuts with nearly optimal balance. An algorithm PageRank-Partition with the following properties can be created in essentially the same way as the algorithm Partition in [15], so we omit the details.
Theorem 9. The algorithm PageRank-Partition takes as input a parameter φ, and has expected running time O(m log(1/p) log⁴ m/φ³). If there exists a set C with vol(C) ≤ (1/2)vol(G) and Φ(C) ≤ φ²/(1845000 log² m), then with probability at least 1 − p, PageRank-Partition produces a set S satisfying Φ(S) ≤ φ and (1/2)vol(C) ≤ vol(S) ≤ (5/6)vol(G).

References
[1] Pavel Berkhin. Bookmark-coloring approach to personalized PageRank computing. Internet Mathematics, to appear.
[2] Christian Borgs, Jennifer T. Chayes, Mohammad Mahdian, and Amin Saberi. Exploring the community structure of newsgroups. In KDD, pages 783–787, 2004.
[3] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[4] F. Chung. Spectral graph theory, volume 92 of the CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[5] D. Fogaras and B. Racz. Towards scaling fully personalized PageRank. In Proceedings of the 3rd Workshop on Algorithms and
Models for the Web-Graph (WAW), pages 105–117, October 2004.
[6] Taher H. Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng., 15(4):784–796, 2003.
[7] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th World Wide Web Conference (WWW), pages 271–279, 2003.
[8] Ravi Kannan, Santosh Vempala, and Adrian Vetta. On clusterings: good, bad and spectral. J. ACM, 51(3):497–515, 2004.
[9] László Lovász and Miklós Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In FOCS, pages 346–354, 1990.
[10] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Struct. Algorithms, 4(4):359–412, 1993.
[11] M. Mihail. Conductance and convergence of Markov chains: a combinatorial treatment of expanders. In Proc. of 30th FOCS, pages 526–531, 1989.
[12] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[13] Horst D. Simon and Shang-Hua Teng. How good is recursive bisection? SIAM Journal on Scientific Computing, 18(5):1436–1445, 1997.
[14] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: planar graphs and finite element meshes. In IEEE Symposium on Foundations of Computer Science, pages 96–105, 1996.
[15] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In ACM STOC-04, pages 81–90, New York, NY, USA, 2004. ACM Press.