Fast Algorithms for Comprehensive N-point Correlation Estimates


fastcluster 1.2.3: Fast Hierarchical Clustering Routines for R and Python — Package Manual


Package‘fastcluster’October13,2022Encoding UTF-8Type PackageVersion1.2.3Date2021-05-24Title Fast Hierarchical Clustering Routines for R and'Python'Copyright Until package version1.1.23:©2011Daniel Müllner<>.All changes from version1.1.24on:©Google Inc.<>.Enhances stats,flashClustDepends R(>=3.0.0)Description This is a two-in-one package which provides interfaces to both R and'Python'.It implements fast hierarchical,agglomerativeclustering routines.Part of the functionality is designed as drop-inreplacement for existing routines:linkage()in the'SciPy'package'scipy.cluster.hierarchy',hclust()in R's'stats'package,and the'flashClust'package.It provides the same functionality with thebenefit of a much faster implementation.Moreover,there arememory-saving routines for clustering of vector data,which go beyondwhat the existing packages provide.For information on how to installthe'Python'files,see thefile INSTALL in the source distribution.Based on the present package,Christoph Dalitz also wrote a pure'C++'interface to'fastcluster':<https://lionel.kr.hs-niederrhein.de/~dalitz/data/hclust/>. License FreeBSD|GPL-2|file LICENSEURL /fastcluster.htmlNeedsCompilation yesAuthor Daniel Müllner[aut,cph,cre],Google Inc.[cph]Maintainer Daniel Müllner<*******************>Repository CRANDate/Publication2021-05-2423:50:06UTC12fastclusterR topics documented:fastcluster (2)hclust (3)hclust.vector (5)Index7 fastcluster Fast hierarchical,agglomerative clustering routines for R and PythonDescriptionThe fastcluster package provides efficient algorithms for hierarchical,agglomerative clustering.Inaddition to the R interface,there is also a Python interface to the underlying C++library,to be foundin the source distribution.DetailsThe function hclust provides clustering when the input is a dissimilarity matrix.A dissimilaritymatrix can be computed from vector data by dist.The hclust function can be used as a drop-in re-placement for existing routines:stats::hclust and flashClust::hclust alias flashClust::flashClust.Once the fastcluster library is loaded at the beginning of the code,every program that uses hierar-chical clustering can benefit immediately and effortlessly from the performance gainWhen the package is loaded,it overwrites the function hclust with the new code.The function hclust.vector provides memory-saving routines when the input is vector data.Further information:•R documentation pages:hclust,hclust.vector•A comprehensive User’s manual:fastcluster.pdf.Get this from the R command line withvignette( fastcluster ).•JSS paper:https:///v53/i09/.•See the author’s home page for a performance comparison:/fastcluster.html.Author(s)Daniel MüllnerReferences/fastcluster.htmlSee Alsohclust,hclust.vectorExamples#Taken and modified from stats::hclust##hclust(...)#new method#hclust.vector(...)#new method#stats::hclust(...)#old methodrequire(fastcluster)require(graphics)hc<-hclust(dist(USArrests),"ave")plot(hc)plot(hc,hang=-1)##Do the same with centroid clustering and squared Euclidean distance,##cut the tree into ten clusters and reconstruct the upper part of the##tree from the cluster centers.hc<-hclust.vector(USArrests,"cen")#squared Euclidean distanceshc$height<-hc$height^2memb<-cutree(hc,k=10)cent<-NULLfor(k in1:10){cent<-rbind(cent,colMeans(USArrests[memb==k,,drop=FALSE]))}hc1<-hclust.vector(cent,method="cen",members=table(memb))#squared Euclidean distanceshc1$height<-hc1$height^2opar<-par(mfrow=c(1,2))plot(hc,labels=FALSE,hang=-1,main="Original Tree")plot(hc1,labels=FALSE,hang=-1,main="Re-start 
from10clusters")par(opar)hclust Fast hierarchical,agglomerative clustering of dissimilarity dataDescriptionThis function implements hierarchical clustering with the same interface as hclust from the stats package but with much faster algorithms.Usagehclust(d,method="complete",members=NULL)Argumentsd a dissimilarity structure as produced by dist.method the agglomeration method to be used.This must be(an unambiguous abbrevi-ation of)one of"single","complete","average","mcquitty","ward.D","ward.D2","centroid"or"median".members NULL or a vector with length the number of observations.DetailsSee the documentation of the original function hclust in the stats package.A comprehensive User’s manual fastcluster.pdf is available as a vignette.Get this from the Rcommand line with vignette( fastcluster ).ValueAn object of class hclust .It encodes a stepwise dendrogram.Author(s)Daniel MüllnerReferences/fastcluster.htmlSee Alsofastcluster,hclust.vector,stats::hclustExamples#Taken and modified from stats::hclust##hclust(...)#new method#stats::hclust(...)#old methodrequire(fastcluster)require(graphics)hc<-hclust(dist(USArrests),"ave")plot(hc)plot(hc,hang=-1)##Do the same with centroid clustering and squared Euclidean distance,##cut the tree into ten clusters and reconstruct the upper part of the##tree from the cluster centers.hc<-hclust(dist(USArrests)^2,"cen")memb<-cutree(hc,k=10)cent<-NULLfor(k in1:10){cent<-rbind(cent,colMeans(USArrests[memb==k,,drop=FALSE]))}hc1<-hclust(dist(cent)^2,method="cen",members=table(memb))opar<-par(mfrow=c(1,2))plot(hc,labels=FALSE,hang=-1,main="Original Tree")plot(hc1,labels=FALSE,hang=-1,main="Re-start from10clusters")par(opar)hclust.vector Fast hierarchical,agglomerative clustering of vector dataDescriptionThis function implements hierarchical,agglomerative clustering with memory-saving algorithms. 
Usagehclust.vector(X,method="single",members=NULL,metric= euclidean ,p=NULL) ArgumentsX an(N×D)matrix of’double’values:N observations in D variables.method the agglomeration method to be used.This must be(an unambiguous abbrevia-tion of)one of"single","ward","centroid"or"median".members NULL or a vector with length the number of observations.metric the distance measure to be used.This must be one of"euclidean","maximum", "manhattan","canberra","binary"or"minkowski".Any unambiguoussubstring can be given.p parameter for the Minkowski metric.DetailsThe function hclust.vector provides clustering when the input is vector data.It uses memory-saving algorithms which allow processing of larger data sets than hclust does.The"ward","centroid"and"median"methods require metric="euclidean"and cluster the data set with respect to Euclidean distances.For"single"linkage clustering,any dissimilarity measure may be chosen.Currently,the same metrics are implemented as the dist function provides.The callhclust.vector(X,method= single ,metric=[...])gives the same result ashclust(dist(X,metric=[...]),method= single )but uses less memory and is equally fast.For the Euclidean methods,care must be taken since hclust expects squared Euclidean distances.Hence,the callhclust.vector(X,method= centroid )is,aside from the lesser memory requirements,equivalent tod=dist(X)hc=hclust(d^2,method= centroid )hc$height=sqrt(hc$height)The same applies to the"median"method.The"ward"method in hclust.vector is equivalent to hclust with method"ward.D2",but to method"ward.D"only after squaring as above.More details are in the User’s manual fastcluster.pdf,which is available as a vignette.Get this from the R command line with vignette( fastcluster ).Author(s)Daniel MüllnerReferences/fastcluster.htmlSee Alsofastcluster,hclustExamples#Taken and modified from stats::hclust##Perform centroid clustering with squared Euclidean distances,##cut the tree into ten clusters and reconstruct the upper part of the##tree from the cluster centers.hc<-hclust.vector(USArrests,"cen")#squared Euclidean distanceshc$height<-hc$height^2memb<-cutree(hc,k=10)cent<-NULLfor(k in1:10){cent<-rbind(cent,colMeans(USArrests[memb==k,,drop=FALSE]))}hc1<-hclust.vector(cent,method="cen",members=table(memb))#squared Euclidean distanceshc1$height<-hc1$height^2opar<-par(mfrow=c(1,2))plot(hc,labels=FALSE,hang=-1,main="Original Tree")plot(hc1,labels=FALSE,hang=-1,main="Re-start from10clusters")par(opar)Index∗clusterfastcluster,2hclust,3hclust.vector,5∗multivariatefastcluster,2hclust,3hclust.vector,5dist,2,5double,5fastcluster,2,4,6fastcluster-package(fastcluster),2 flashClust::flashClust,2 flashClust::hclust,2hclust,2,3,3,4–6hclust.vector,2,4,5,5,6stats,3,4stats::hclust,2,47。
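For Python users, the same routines are available through the package's Python interface. The following is a small illustrative sketch (not taken from the manual), assuming fastcluster, NumPy, and SciPy are installed: linkage() mirrors scipy.cluster.hierarchy.linkage(), and linkage_vector() is the memory-saving routine corresponding to hclust.vector(), so it supports only the "single", "ward", "centroid", and "median" methods. The random data and the choice of ten clusters are placeholders.

import numpy as np
import fastcluster
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(1000, 4)                          # 1000 observations in 4 variables

# Dissimilarity input, as in hclust(dist(X), "ave"): a drop-in for SciPy's linkage().
Z1 = fastcluster.linkage(pdist(X), method="average")

# Memory-saving clustering of vector data, the analogue of hclust.vector().
Z2 = fastcluster.linkage_vector(X, method="ward", metric="euclidean")

# Both return standard SciPy linkage matrices, so SciPy tools apply directly.
labels = fcluster(Z2, t=10, criterion="maxclust")    # cut the tree into ten clusters
print(Z1.shape, Z2.shape, np.bincount(labels)[1:])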

fastICA 1.2-4: FastICA Algorithms for Independent Component Analysis — Package Manual


Package‘fastICA’November27,2023Version1.2-4Date2023-11-27Title FastICA Algorithms to Perform ICA and Projection PursuitAuthor J L Marchini,C Heaton and B D Ripley<***************>Maintainer Brian Ripley<***************>Depends R(>=4.0.0)Suggests MASSDescription Implementation of FastICA algorithm to perform IndependentComponent Analysis(ICA)and Projection Pursuit.License GPL-2|GPL-3NeedsCompilation yesRepository CRANDate/Publication2023-11-2708:34:50UTCR topics documented:fastICA (1)ica.R.def (5)ica.R.par (6)Index7 fastICA FastICA algorithmDescriptionThis is an R and C code implementation of the FastICA algorithm of Aapo Hyvarinen et al.(https: //www.cs.helsinki.fi/u/ahyvarin/)to perform Independent Component Analysis(ICA)and Projection Pursuit.1UsagefastICA(X,p,alg.typ=c("parallel","deflation"),fun=c("logcosh","exp"),alpha=1.0,method=c("R","C"),row.norm=FALSE,maxit=200,tol=1e-04,verbose=FALSE,w.init=NULL)ArgumentsX a data matrix with n rows representing observations and p columns representing variables.p number of components to be extractedalg.typ if alg.typ=="parallel"the components are extracted simultaneously(the default).if alg.typ=="deflation"the components are extracted one at atime.fun the functional form of the G function used in the approximation to neg-entropy (see‘details’).alpha constant in range[1,2]used in approximation to neg-entropy when fun== "logcosh"method if method=="R"then computations are done exclusively in R(default).The code allows the interested R user to see exactly what the algorithm does.ifmethod=="C"then C code is used to perform most of the computations,whichmakes the algorithm run faster.During compilation the C code is linked to anoptimized BLAS library if present,otherwise stand-alone BLAS routines arecompiled.row.norm a logical value indicating whether rows of the data matrix X should be standard-ized beforehand.maxit maximum number of iterations to perform.tol a positive scalar giving the tolerance at which the un-mixing matrix is considered to have converged.verbose a logical value indicating the level of output as the algorithm runs.w.init Initial un-mixing matrix of dimension c(p,p).If NULL(default) then a matrix of normal r.v.’s is used.DetailsIndependent Component Analysis(ICA)The data matrix X is considered to be a linear combination of non-Gaussian(independent)compo-nents i.e.X=SA where columns of S contain the independent components and A is a linear mixing matrix.In short ICA attempts to‘un-mix’the data by estimating an un-mixing matrix W where XW =S.Under this generative model the measured‘signals’in X will tend to be‘more Gaussian’than the source components(in S)due to the Central Limit Theorem.Thus,in order to extract the independent components/sources we search for an un-mixing matrix W that maximizes the non-gaussianity of the sources.In FastICA,non-gaussianity is measured using approximations to neg-entropy(J)which are more robust than kurtosis-based measures and fast to compute.The approximation takes the formJ(y)=[E{G(y)}−E{G(v)}]2where v is a N(0,1)r.v.log cosh(αu)and G(u)=−exp(u2/2).The following choices of G are included as options G(u)=1αAlgorithmFirst,the data are centered by subtracting the mean of each column of the data matrix X.The data matrix is then‘whitened’by projecting the data onto its principal component directionsi.e.X->XK where K is a pre-whitening matrix.The number of components can be specified bythe user.The ICA algorithm then estimates a matrix W s.t XKW=S.W is chosen to maximize the neg-entropy approximation under 
the constraints that W is an orthonormal matrix.This constraint en-sures that the estimated components are uncorrelated.The algorithm is based on afixed-point iteration scheme for maximizing the neg-entropy.Projection PursuitIn the absence of a generative model for the data the algorithm can be used tofind the projection pursuit directions.Projection pursuit is a technique forfinding‘interesting’directions in multi-dimensional datasets.These projections and are useful for visualizing the dataset and in density estimation and regression.Interesting directions are those which show the least Gaussian distribu-tion,which is what the FastICA algorithm does.ValueA list containing the following componentsX pre-processed data matrixK pre-whitening matrix that projects data onto thefirst p principal compo-nents.W estimated un-mixing matrix(see definition in details)A estimated mixing matrixS estimated source matrixAuthor(s)J L Marchini and C HeatonReferencesA.Hyvarinen and E.Oja(2000)Independent Component Analysis:Algorithms and Applications,Neural Networks,13(4-5):411-430See Alsoica.R.def,ica.R.parExamples#---------------------------------------------------#Example1:un-mixing two mixed independent uniforms#---------------------------------------------------S<-matrix(runif(10000),5000,2)A<-matrix(c(1,1,-1,3),2,2,byrow=TRUE)X<-S%*%Aa<-fastICA(X,2,alg.typ="parallel",fun="logcosh",alpha=1,method="C",row.norm=FALSE,maxit=200,tol=0.0001,verbose=TRUE)par(mfrow=c(1,3))plot(a$X,main="Pre-processed data")plot(a$X%*%a$K,main="PCA components")plot(a$S,main="ICA components")#--------------------------------------------#Example2:un-mixing two independent signals#--------------------------------------------S<-cbind(sin((1:1000)/20),rep((((1:200)-100)/100),5))A<-matrix(c(0.291,0.6557,-0.5439,0.5572),2,2)X<-S%*%Aa<-fastICA(X,2,alg.typ="parallel",fun="logcosh",alpha=1,method="R",row.norm=FALSE,maxit=200,tol=0.0001,verbose=TRUE)par(mfcol=c(2,3))plot(1:1000,S[,1],type="l",main="Original Signals",xlab="",ylab="")plot(1:1000,S[,2],type="l",xlab="",ylab="")plot(1:1000,X[,1],type="l",main="Mixed Signals",xlab="",ylab="")plot(1:1000,X[,2],type="l",xlab="",ylab="")plot(1:1000,a$S[,1],type="l",main="ICA source estimates",xlab="",ylab="")plot(1:1000,a$S[,2],type="l",xlab="",ylab="")#-----------------------------------------------------------#Example3:using FastICA to perform projection pursuit on a#mixture of bivariate normal distributions#-----------------------------------------------------------if(require(MASS)){x<-mvrnorm(n=1000,mu=c(0,0),Sigma=matrix(c(10,3,3,1),2,2)) x1<-mvrnorm(n=1000,mu=c(-1,2),Sigma=matrix(c(10,3,3,1),2,2)) X<-rbind(x,x1)a<-fastICA(X,2,alg.typ="deflation",fun="logcosh",alpha=1,ica.R.def5 method="R",row.norm=FALSE,maxit=200,tol=0.0001,verbose=TRUE)par(mfrow=c(1,3))plot(a$X,main="Pre-processed data")plot(a$X%*%a$K,main="PCA components")plot(a$S,main="ICA components")}ica.R.def R code for FastICA using a deflation schemeDescriptionR code for FastICA using a deflation scheme in which the components are estimated one by one.This function is called by the fastICA function.Usageica.R.def(X,p,tol,fun,alpha,maxit,verbose,w.init)ArgumentsX data matrixp number of components to be extractedtol a positive scalar giving the tolerance at which the un-mixing matrix is consideredto have converged.fun the functional form of the G function used in the approximation to negentropy.alpha constant in range[1,2]used in approximation to negentropy when fun=="logcosh"maxit maximum number of iterations to performverbose a 
logical value indicating the level of output as the algorithm runs.w.init Initial value of un-mixing matrix.DetailsSee the help on fastICA for details.ValueThe estimated un-mixing matrix W.Author(s)J L Marchini and C HeatonSee AlsofastICA,ica.R.par6ica.R.par ica.R.par R code for FastICA using a parallel schemeDescriptionR code for FastICA using a parallel scheme in which the components are estimated simultaneously.This function is called by the fastICA function.Usageica.R.par(X,p,tol,fun,alpha,maxit,verbose,w.init)ArgumentsX data matrix.p number of components to be extracted.tol a positive scalar giving the tolerance at which the un-mixing matrix is consideredto have converged.fun the functional form of the G function used in the approximation to negentropy.alpha constant in range[1,2]used in approximation to negentropy when fun=="logcosh".maxit maximum number of iterations to perform.verbose a logical value indicating the level of output as the algorithm runs.w.init Initial value of un-mixing matrix.DetailsSee the help on fastICA for details.ValueThe estimated un-mixing matrix W.Author(s)J L Marchini and C HeatonSee AlsofastICA,ica.R.defIndex∗multivariatefastICA,1∗utilitiesica.R.def,5ica.R.par,6fastICA,1,5,6ica.R.def,3,5,6ica.R.par,3,5,67。
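As a rough Python counterpart to Example 1 above, the sketch below uses scikit-learn's FastICA, which is a separate implementation from this R package but belongs to the same family of fixed-point ICA algorithms. It is illustrative only; the sample size and mixing matrix mirror the R example, and the recovered sources are determined only up to permutation, sign, and scale.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.uniform(size=(5000, 2))                      # two independent uniform sources
A = np.array([[1.0, 1.0], [-1.0, 3.0]])              # mixing matrix from Example 1
X = S @ A                                            # observed mixtures

ica = FastICA(n_components=2, algorithm="parallel", fun="logcosh", max_iter=200, tol=1e-4)
S_est = ica.fit_transform(X)                         # estimated sources (analogue of a$S)
A_est = ica.mixing_                                  # estimated mixing matrix (analogue of a$A)
print(S_est.shape, A_est.shape)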

A Sequential Algorithm for Generating Random Graphs

Mohsen Bayati1, Jeong Han Kim2, and Amin Saberi1
arXiv:cs/0702124v4, 16 Jun 2007
1 Stanford University, {bayati,saberi}@   2 Yonsei University, jehkim@yonsei.ac.kr
(FPRAS) for generating random graphs; this we can do in almost linear time. An FPRAS provides an arbitrarily close approximation in time that depends only polynomially on the input size and the desired error. (For precise definitions of this, see Section 2.) Recently, sequential importance sampling (SIS) has been suggested as a more suitable approach for designing fast algorithms for this and other similar problems [18, 13, 35, 6]. Chen et al. [18] used the SIS method to generate bipartite graphs with a given degree sequence. Later Blitzstein and Diaconis [13] used a similar approach to generate general graphs. Almost all existing work on the SIS method is justified only through simulations, and for some special cases counterexamples have been proposed [11]. However, the simplicity of these algorithms and their great performance in several instances suggest that further study of the SIS method is necessary.

Our Result. Let d_1, ..., d_n be non-negative integers given for the degree sequence and let Σ_{i=1}^n d_i = 2m. Our algorithm is as follows: start with an empty graph and sequentially add edges between pairs of non-adjacent vertices. In every step of the procedure, the probability that an edge is added between two distinct vertices i and j is proportional to d̂_i d̂_j (1 − d_i d_j / 4m), where d̂_i and d̂_j denote the remaining degrees of vertices i and j. We will show that our algorithm produces an asymptotically uniform sample with running time of O(m d_max) when the maximum degree is of O(m^{1/4−τ}) and τ is any positive constant. Then we use a simple SIS method to obtain an FPRAS for any ε, δ > 0 with running time O(m d_max ε^{−2} log(1/δ)) for generating graphs with d_max = O(m^{1/4−τ}). Moreover, we show that for d = O(n^{1/2−τ}), our algorithm can generate an asymptotically uniform d-regular graph. Our results improve the bounds of Kim and Vu [34] and Steger and Wormald [45] for regular graphs.

Related Work. McKay and Wormald [37, 39] give asymptotic estimates for the number of graphs within the range d_max = O(m^{1/3−τ}). But the error terms in their estimates are larger than what is needed to apply Jerrum, Valiant and Vazirani's [25] reduction to achieve asymptotic sampling. Jerrum and Sinclair [26], however, use a random walk on the self-reducibility tree and give an FPRAS for sampling graphs with maximum degree of o(m^{1/4}). The running time of their algorithm is O(m^3 n^2 ε^{−2} log(1/δ)) [44]. A different random walk studied by [27, 28, 10] gives an FPRAS for random generation for all degree sequences for bipartite graphs and almost all degree sequences for general graphs. However, the running time of these algorithms is at least O(n^4 m^3 d_max log^5(n^2/ε) ε^{−2} log(1/δ)). For the weaker problem of generating asymptotically uniform samples (not an FPRAS), the best algorithm was given by McKay and Wormald's switching technique on the configuration model [38]. Their algorithm works for graphs with d_max^3 = O(m^2 / Σ_i d_i^2) and d_max = o(m^{1/3}), with average running time of O(m + (Σ_i d_i^2)^2). This leads to O(n^2 d^4) average running time for d-regular graphs with d = o(n^{1/3}). Very recently and independently from our work, Blanchet [12] has used McKay's estimate and the SIS technique to obtain an FPRAS with running time O(m^2) for sampling bipartite graphs with given
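A literal, unoptimized reading of the procedure described above can be sketched as follows. This Python sketch is illustrative and is not the authors' code: it adds one edge at a time with probability proportional to d̂_i d̂_j (1 − d_i d_j / 4m), and it omits the importance-sampling weights and the restart and failure handling that the full analysis and the FPRAS require. All names are ours, and the sketch assumes a degree sequence with small maximum degree so that all weights stay positive.

import numpy as np

def sequential_graph(degrees, rng=None):
    """Sequentially sample a graph with the given degree sequence (illustrative only).

    degrees: target degrees d_1, ..., d_n with even sum 2m.
    Returns a set of edges, or None if the procedure gets stuck
    (the full algorithm restarts in that case).
    """
    rng = rng or np.random.default_rng()
    d = np.asarray(degrees, dtype=float)
    n, m = len(d), d.sum() / 2.0
    remaining = d.copy()                      # the d-hat values: remaining degrees
    edges = set()
    while remaining.sum() > 0:
        pairs, probs = [], []
        for i in range(n):
            for j in range(i + 1, n):
                if remaining[i] > 0 and remaining[j] > 0 and (i, j) not in edges:
                    pairs.append((i, j))
                    # Weight from the rule described above: d-hat_i * d-hat_j * (1 - d_i d_j / 4m).
                    probs.append(remaining[i] * remaining[j] * (1.0 - d[i] * d[j] / (4.0 * m)))
        if not pairs:
            return None
        probs = np.asarray(probs)
        i, j = pairs[rng.choice(len(pairs), p=probs / probs.sum())]
        edges.add((i, j))
        remaining[i] -= 1.0
        remaining[j] -= 1.0
    return edges

print(sequential_graph([3, 3, 2, 2, 2, 2]))   # one sample for a small degree sequence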

A Simple, Fast Dominance Algorithm


In practice, both of these algorithms are fast. In our experiments, they process from 50,000 to 200,000 control-flow graph (cfg) nodes per second. While Lengauer-Tarjan has faster asymptotic complexity, it requires unreasonably large cfgs, on the order of 30,000 nodes, before this asymptotic advantage catches up with a well-engineered iterative scheme. Since the iterative algorithm is simpler, easier to understand, easier to implement, and faster in practice, it should be the technique of choice for computing dominators on cfgs.
key words:
Dominators, Dominance Frontiers
Introduction

The advent of static single assignment form (ssa) has rekindled interest in dominance and related concepts [13]. New algorithms for several problems in optimization and code generation have built on dominance [8, 12, 25, 27]. In this paper, we re-examine the formulation of dominance as a forward data-flow problem [4, 5, 19]. We present several insights that lead to a simple, general, and efficient implementation in an iterative data-flow framework. The resulting algorithm, an iterative solver that uses our representation for dominance information, is significantly faster than the Lengauer-Tarjan algorithm on graphs of a size normally encountered by a compiler, less than one thousand nodes. As an integral part of the process, our iterative solver computes immediate dominators for each node in the graph, eliminating one problem with previous iterative formulations. We also show that a natural extension of
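The iterative scheme discussed above is commonly implemented as a reverse-postorder sweep with a two-finger "intersect" step over the partially built dominator tree. The sketch below is an illustrative Python rendering of that idea under our own CFG encoding (a dict of successor lists); it is not code from the paper, and the names are ours.

def compute_idom(succ, entry):
    """Iterative immediate-dominator computation on a CFG given as {node: [successors]}."""
    # Build predecessor lists and a reverse postorder (RPO) numbering from the entry node.
    preds = {n: [] for n in succ}
    for n, ss in succ.items():
        for s in ss:
            preds[s].append(n)
    order, seen = [], set()
    def dfs(n):
        seen.add(n)
        for s in succ[n]:
            if s not in seen:
                dfs(s)
        order.append(n)
    dfs(entry)
    rpo = list(reversed(order))                  # entry comes first
    number = {n: i for i, n in enumerate(rpo)}

    idom = {n: None for n in rpo}
    idom[entry] = entry

    def intersect(a, b):
        # Walk the two dominator-tree "fingers" upward until they meet.
        while a != b:
            while number[a] > number[b]:
                a = idom[a]
            while number[b] > number[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for n in rpo[1:]:                        # skip the entry node
            processed = [p for p in preds[n] if idom[p] is not None]
            new_idom = processed[0]
            for p in processed[1:]:
                new_idom = intersect(p, new_idom)
            if idom[n] != new_idom:
                idom[n] = new_idom
                changed = True
    return idom

# Tiny example CFG: a diamond with a back edge from D to A.
cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['A']}
print(compute_idom(cfg, 'A'))   # every node is immediately dominated by the entry node A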

Principles of the Fast Marching Algorithm


The fast marching algorithm (FMA) is a numerical technique used for solving the Eikonal equation, which describes the propagation of wavefronts. This algorithm is widely used in various fields such as computer graphics, medical imaging, and computational physics.

The basic principle of the fast marching algorithm is to iteratively update the travel time (or distance) from a given starting point to all other points in the computational domain. This is done by considering the local characteristics of the wavefront and updating the travel time based on the minimum arrival time from neighboring points.

The algorithm starts by initializing the travel time at the starting point to zero and setting the travel time at all other points to infinity. Then, it iteratively updates the travel time at each grid point based on the neighboring points, ensuring that the travel time increases monotonically as the wavefront propagates outward.

At each iteration, the algorithm selects the grid point with the minimum travel time among the set of points that have not yet been finalized. It then updates the travel times of that point's neighbors based on the local wavefront characteristics and the travel times already computed around them. This process is repeated until the travel times at all points have been computed.

One of the key advantages of the fast marching algorithm is its computational efficiency. By exploiting the properties of the Eikonal equation and the characteristics of the wavefront, the algorithm can compute the travel times in a relatively short amount of time, making it suitable for real-time or interactive applications.

In conclusion, the fast marching algorithm is a powerful numerical technique for solving the Eikonal equation and computing wavefront propagation. Its efficiency and versatility make it a valuable tool in various fields, enabling the simulation and analysis of wave propagation phenomena in a wide range of applications.
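As a concrete illustration of the procedure just described, here is a minimal Python sketch for the simplest setting: a uniform 2-D grid with unit propagation speed, so the Eikonal equation reduces to |∇T| = 1. It keeps tentative travel times in a min-heap, freezes the smallest one at each step, and updates neighbors with the standard first-order upwind formula. The grid layout, names, and boundary handling are assumptions of this sketch, not part of the text above.

import heapq
import numpy as np

def fast_marching(shape, sources, h=1.0):
    """First-order fast marching for |grad T| = 1 on a 2-D grid with spacing h."""
    T = np.full(shape, np.inf)
    frozen = np.zeros(shape, dtype=bool)
    heap = []
    for s in sources:
        T[s] = 0.0
        heapq.heappush(heap, (0.0, s))

    def tentative(i, j):
        # Smallest frozen neighbor value along each axis (inf if none is frozen yet).
        tx = min(T[i - 1, j] if i > 0 and frozen[i - 1, j] else np.inf,
                 T[i + 1, j] if i + 1 < shape[0] and frozen[i + 1, j] else np.inf)
        ty = min(T[i, j - 1] if j > 0 and frozen[i, j - 1] else np.inf,
                 T[i, j + 1] if j + 1 < shape[1] and frozen[i, j + 1] else np.inf)
        a, b = sorted((tx, ty))
        if b - a >= h:                        # only one usable direction: one-sided update
            return a + h
        # Two-sided update: solve (t - a)^2 + (t - b)^2 = h^2 for the larger root t.
        return 0.5 * (a + b + np.sqrt(2.0 * h * h - (a - b) ** 2))

    while heap:
        _, (i, j) = heapq.heappop(heap)
        if frozen[i, j]:
            continue
        frozen[i, j] = True                   # freeze the smallest tentative travel time
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < shape[0] and 0 <= nj < shape[1] and not frozen[ni, nj]:
                t_new = tentative(ni, nj)
                if t_new < T[ni, nj]:
                    T[ni, nj] = t_new
                    heapq.heappush(heap, (t_new, (ni, nj)))
    return T

T = fast_marching((50, 50), sources=[(0, 0)])
print(T[49, 49])   # roughly the Euclidean distance from the source corner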

Author: Zhong Liuqiang


Author: Zhong Liuqiang. Dissertation title: Fast Algorithms and Adaptive Methods for Solving Edge-Element Discretizations of Two Classes of Maxwell's Equations. About the author: Zhong Liuqiang, male, born in October 1980, studied under Professor Jinchao Xu at Xiangtan University from September 2006 and received his Ph.D. in June 2009.

Abstract: Research on and applications of electromagnetic fields now reach every area of science and technology. Yet for the many complex problems that arise in practice, such as complicated wave-propagation environments and the analysis and design of complex electromagnetic devices, classical analytical methods are powerless, and experimental approaches have not provided complete solutions either.

With the development of computer technology and numerical methods, computational electromagnetics provides an important new means of tackling the increasingly complex modeling, simulation, and design-optimization problems of practical electromagnetic engineering, and it opens a new avenue for both theoretical study and engineering application of electromagnetic fields.

The edge finite element method is a basic discretization for the numerical solution of Maxwell's equations. It effectively overcomes the defect of classical continuous nodal finite elements, which can produce non-physical solutions for certain electromagnetic boundary-value or eigenvalue problems, and it has therefore been used ever more widely in engineering applications.

Because these discrete systems are typically large and highly ill-conditioned, constructing fast solution algorithms for them is essential. In addition, many Maxwell problems have strong singularities; computing on uniformly refined meshes then causes the number of degrees of freedom to grow excessively, and adaptive methods are an effective way to overcome this defect, so studying adaptive finite element methods for Maxwell's equations is of real importance. Both topics are current hot spots in computational electromagnetics and involve many difficult problems. This thesis systematically studies fast algorithms and adaptive edge-element methods for edge-element discretizations of two classes of typical Maxwell's equations.

The main contents and results are as follows. First, for higher-order edge-element discretizations of H(curl) elliptic systems, fast iterative methods and efficient preconditioners are designed and analyzed.

Most existing work on fast algorithms for edge-element discretizations of H(curl) elliptic systems targets linear edge elements of the first Nédélec family.

In some situations, however, higher-order Nédélec edge elements have advantages over linear edge elements, such as reduced numerical dissipation of the error and better approximation properties.

Algorithms in English - A Reply


A reply on how to use a greedy algorithm to solve the optimal loading problem (knapsack problem).

[Introduction] The greedy algorithm is an algorithmic idea based on locally optimal choices. It can be applied to the optimal loading problem: given a knapsack of fixed capacity, how should items be chosen so that their total value is maximized?

This article walks through solving the optimal loading problem with a greedy algorithm step by step, to help readers better understand and apply greedy methods.

[Step 1: Problem statement] First, let us state the requirements of the optimal loading problem precisely.

We are given a knapsack of capacity C and N items, each with its own weight w and value v.

The goal is to choose items to place in the knapsack, without exceeding its capacity, so that the total value of the chosen items is as large as possible.

[Step 2: Greedy selection strategy] The core idea of a greedy algorithm is to make locally optimal choices in the hope of arriving at a globally optimal solution.

For the optimal loading problem, we can adopt the greedy strategy of "largest value per unit weight".

That is, we repeatedly choose the item with the highest value per unit weight and place it in the knapsack, until no further item fits.

[Step 3: Algorithm implementation] Based on the greedy selection strategy, the algorithm proceeds with the following steps (a runnable sketch follows the worked example below): 1. For each item, compute its value per unit weight vu = v / w from its weight w and value v.

2. Sort the items by vu in descending order.

3. Initialize the knapsack's current total value val = 0 and its remaining capacity rc = C.

4. Traverse the sorted item list one item at a time: a. If the current item's weight is no more than the remaining capacity, place it in the knapsack and update the total value val and remaining capacity rc.

b. If the current item's weight exceeds the remaining capacity, skip it and move on to the next item.

5. Return the final total value val as the solution to the loading problem.

[Step 4: Worked example] Next, we demonstrate the greedy approach to the optimal loading problem on a small example.

Suppose the knapsack capacity C is 10 and the following 4 items are available:
Item 1: weight w1 = 2, value v1 = 5
Item 2: weight w2 = 3, value v2 = 8
Item 3: weight w3 = 4, value v3 = 9
Item 4: weight w4 = 5, value v4 = 10
Following the greedy strategy, first compute each item's value per unit weight vu:
Item 1: vu1 = v1 / w1 = 5 / 2 = 2.5
Item 2: vu2 = v2 / w2 = 8 / 3 ≈ 2.67
Item 3: vu3 = v3 / w3 = 9 / 4 = 2.25
Item 4: vu4 = v4 / w4 = 10 / 5 = 2.0
Then sort the items by vu in descending order: Item 2 > Item 1 > Item 3 > Item 4.
Finally, load the knapsack following the algorithm from Step 3: initialize the total value val = 0 and the remaining capacity rc = 10.
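The steps above translate directly into a short Python function; the sketch below reproduces the worked example, selecting items 2, 1, and 3 for a total value of 22 at weight 9. Note that this ratio-based greedy rule is a heuristic for 0-1 loading: on this very instance, taking items 1, 2, and 4 (weight exactly 10) would reach a value of 23, which is why exact solutions usually use dynamic programming instead.

def greedy_load(capacity, items):
    """Greedy loading by value per unit weight.  items: list of (weight, value)."""
    # Sort by value density vu = v / w, largest first.
    order = sorted(items, key=lambda wv: wv[1] / wv[0], reverse=True)
    val, rc, chosen = 0, capacity, []
    for w, v in order:
        if w <= rc:                 # item fits: take it and update value / remaining capacity
            chosen.append((w, v))
            val += v
            rc -= w
    return val, chosen

# The worked example: capacity 10, items given as (weight, value).
items = [(2, 5), (3, 8), (4, 9), (5, 10)]
print(greedy_load(10, items))       # (22, [(3, 8), (2, 5), (4, 9)])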

An English Essay on an International Academic Conference in Computer Science


Title: Attending the International Conference on Computer Science: A Transformative Experience

As an avid enthusiast and aspiring researcher in the field of computer science, I recently had the immense privilege of attending the prestigious International Conference on Computer Science (hypothetical name for illustrative purposes). This annual gathering of scholars, industry leaders, and innovators from around the globe served as a melting pot of ideas, advancements, and collaborations that profoundly impacted my understanding of the ever-evolving landscape of our discipline.

The conference, held in a vibrant metropolis renowned for its technological prowess, kicked off with a keynote address by a renowned computer scientist, who painted a vivid picture of the future of computing. Their visionary insights into emerging technologies such as quantum computing, artificial intelligence, and cybersecurity not only sparked my imagination but also underscored the urgency for continued research and innovation in these areas.

Throughout the three-day event, I participated in a myriad of technical sessions, workshops, and poster presentations. Each one was a testament to the ingenuity and dedication of the international computer science community. From discussions on the latest algorithms for big data analytics to debates on the ethical implications of AI, the conference provided a comprehensive platform for sharing knowledge and fostering interdisciplinary dialogue.

One of the highlights for me was the opportunity to present my own research work during a dedicated session. Standing before a packed auditorium, I shared my findings on a novel approach to improving the efficiency of machine learning models. The positive feedback and constructive criticism I received from my peers and mentors were invaluable, and they have already sparked new ideas for future research directions.

Moreover, the conference was a perfect venue for networking and establishing valuable connections. I had the chance to engage in one-on-one conversations with industry experts, academic luminaries, and fellow researchers from diverse backgrounds. These interactions not only broadened my professional network but also inspired me to think beyond my current research focus and explore new horizons.

The social events organized by the conference committee further enhanced the overall experience. From the welcoming reception to the closing banquet, the atmosphere was always warm, friendly, and conducive to informal discussions and idea sharing. These moments allowed me to forge friendships and build lasting relationships with people from all corners of the world.

In conclusion, attending the International Conference on Computer Science was a transformative experience that enriched my knowledge, expanded my horizons, and ignited my passion for research. It reaffirmed my belief in the power of collaboration and the limitless potential of computer science to shape our future. As I return to my work with renewed vigor and inspiration, I am eager to contribute to this vibrant field and help drive it forward towards even greater heights.


Fast Algorithms for Comprehensive N-point CorrelationEstimatesWilliam B.March Georgia Institute ofTechnology266Ferst Dr.Atlanta,GA,USA march@Andrew J.ConnollyUniversity ofWashington391015th Ave.NE Seattle,WA,USAajc@Alexander G.GrayGeorgia Institute ofTechnology266Ferst Dr.Atlanta,GA,USAagray@ABSTRACTThe n-point correlation functions(npcf)are powerful spatial statistics capable of fully characterizing any set of multidi-mensional points.These functions are critical in key data analyses in astronomy and materials science,among other fields,for example to test whether two point sets come from the same distribution and to validate physical models and theories.For example,the npcf has been used to study the phenomenon of dark energy,considered one of the ma-jor breakthroughs in recent scientific discoveries.Unfortu-nately,directly estimating the continuous npcf at a single value requires O(N n)time for N points,and n may be2,3, 4or even higher,depending on the sensitivity required.In order to draw useful conclusions about real scientific prob-lems,we must repeat this expensive computation both for many different scales in order to derive a smooth estimate and over many different subsamples of our data in order to bound the variance.We present thefirst comprehensive approach to the entire n-point correlation function estimation problem,including fast algorithms for the computation at multiple scales and for many subsamples.We extend the current state-of-the-art tree-based approach with these two algorithms.We show an order-of-magnitude speedup over the current best approach with each of our new algorithms and show that they can be used together to obtain over500x speedups over the state-of-the-art in order to enable much larger datasets and more accurate scientific analyses than were possible previously.Categories and Subject DescriptorsJ.2[Physical Sciences and Engineering]:Astronomy;G.4[Mathematical Software]:Algorithm design and anal-ysisKeywordsN-point Correlation Functions,Jackknife ResamplingPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.To copy otherwise,to republish,to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.KDD’12,August12–16,2012,Beijing,China.Copyright2012ACM978-1-4503-1462-6/12/08...$15.00.1.INTRODUCTIONIn this paper,we discuss a hierarchy of powerful statistics:the n-point correlation functions,which constitute a widely-used approach for detailed characterizations of multivariatepoint sets.These functions,which are analogous to the mo-ments of a univariate distribution,can completely describe any point process and are widely applicable.Applications in astronomy.The n-point statistics havelong constituted the state-of-the-art approach in many sci-entific areas,in particular for detailed characterization ofthe patterns in spatial data.They are a fundamental tool in astronomy for characterizing the large scale structure ofthe universe[20],fluctuations in the cosmic microwave back-ground[28],the formation of clusters of galaxies[32],and thecharacterization of the galaxy-mass bias[17].They can beused to compare observations to theoretical models throughperturbation theory[1,6].A high-profile example of this was a study showing large-scale evidence for dark energy[8]–this study was written up as the Top 
Scientific Break-through of2003in Science[26].In this study,due to themassive potential implications to fundamental physics of theoutcome,the accuracy of the n-point statistics used and thehypothesis test based on them were a considerable focus of the scientific scrutiny of the results–underscoring both thecentrality of n-point correlations as a tool to some of themost significant modern scientific problems,as well as theimportance of their accurate estimation.Materials science and medical imaging.The ma-terials science community also makes extensive use of the n-point correlation functions.They are used to form three-dimensional models of microstructure[13]and to character-ize that microstructure and relate it to macroscopic proper-ties such as the diffusion coefficient,fluid permeability,andelastic modulus[31,30].The n-point correlations have alsobeen used to create feature sets for medical image segmen-tation and classification[22,19,2].Generality of npcf.In addition to these existing ap-plications,the n-point correlations are completely general.Thus,they are a powerful tool for any multivariate or spa-tial data analysis problem.Ripley[23]showed that any point process consisting of multidimensional data can be completely determined by the distribution of counts in cells. The distribution of counts in cells can in turn be shown to be completely determined by the set of n-point correlation func-tions[20].While ordinary statistical moments are defined in terms of the expectation of increasing powers of X,the n-point functions are determined by the cross-correlationsof counts in increasing numbers of nearby regions.Thus, we have a sequence of increasingly complex statistics,anal-ogous to the moments of ordinary distributions,with which to characterize any point process and which can be estimated fromfinite data.With this simple,rigorous characterization of our data and models,we can answer the key questions posed above in one statistical framework.Computational challenge.Unfortunately,directly es-timating the n-point correlation functions is extremely com-putationally expensive.As we will show below,estimating the npcf potentially requires enumerating all n-tuples of data points.Since this scales as O(N n)for N data points,this is prohibitively expensive for even modest-sized data sets and low-orders of correlation.Higher-order correlations are often necessary to fully understand and characterize data [32].Furthermore,the npcf is a continuous quantity.In or-der to understand its behavior at all the scales of interest for a given problem,we must repeat this difficult computa-tion many times.We also need to estimate the variance of our estimated npcf.This in general requires a resampling method in order to make the most use of our data.We must therefore repeat the O(N n)computation not only for many scales,but for many different subsamples of the data.In the past,these computational difficulties have restricted the use of the n-point correlations,despite their power and generality.The largest3-point correlation estimation thus far for the distribution of galaxies used only approximately 105galaxies[16].Higher-order correlations have been even more restricted by computational considerations.Large data.Additionally,data sets are growing rapidly, and spatial data are no exception.The National Biodiversity Institute of Costa Rica has collected over3million observa-tions of tropical species along with geographical data[11]. 
In astronomy,the Sloan Digital Sky Survey[25]spent eight years imagining the northern sky and collected tens of ter-abytes of data.The Large Synoptic Survey Telescope[14], scheduled to come online later this decade,will collect as much as20terabytes per night for ten years.These mas-sive datasets will render n-point correlation estimation even more difficult without efficient algorithms.Our contributions.These computational considerations have restricted the widespread use of the npcf in the past. We introduce two new algorithms,building on the previ-ous state-of-the-art algorithm[9,18],to address this entire computational challenge.We present thefirst algorithms to efficiently overcome two important computational bottle-necks.•We estimate the npcf at many scales simulta-neously,thus allowing smoother and more accurateestimates.•We efficiently handle jackknife resampling by shar-ing work across different parts of the computation,al-lowing more effective variance estimation and more ac-curate results.For each of these problems,we present new algorithms ca-pable of sharing work between different parts of the compu-tation.We prove a theorem which allows us to eliminate a critical redundancy in the computation over multiple scales. We also cast the computation over multiple subsamples in a novel way,allowing a much more efficient algorithm.Each of these new ideas allows an order-of-magnitude speedup over the existing state of the art.These algorithms are there-fore able to render n-point correlation function estimationtractable for many large datasets for thefirst time by al-lowing N to increase and allow more sensitive and accuratescientific results for thefirst time by allowing multiple scalesand resampling regions.Our work is thefirst to deal directlywith the full computational task.Overview.In Section2,we define the n-point correla-tion functions and describe their estimators in detail.Thisleads to the O(J·M·N n)computational challenge men-tioned above.We then introduce our two new algorithmsfor the entire npcf estimation problem in Section3.Weshow experimental results in Section4.Finally,we conclude in Section5by highlighting future work and extensions ofour method.1.1Related WorkDue to the computational difficulty associated with esti-mating the full npcf,many alternatives to the full npcf havebeen developed,including those based on nearest-neighbordistances,quadrats,Dirichlet cells,and Ripley’s K function (and related functions)(See[24]and[3]for an overview andfurther references).Counts-in-cells[29]and Fourier spacemethods are commonly used for astronomical data.How-ever,these methods are generally less powerful than the fullnpcf.For instance,the counts-in-cells method cannot becorrected for errors due to the edges of the sample window. 
Fourier transform-based methods suffer from ringing effectsand suboptimal variance.See[27]for more details.Since we deal exclusively with estimating the exact npcf,we only compare against other methods for this task.Theexisting state-of-the-art methods for exact npcf estimationuse multiple space-partitioning trees to overcome the O(N n) scaling of the n-point point estimator.This approach wasfirst introduced in[9,18].It has been parallelized using theNtropy framework[7].We present the serial version here andleave the description of our ongoing work on parallelizing ourapproach to future work.2.N-POINT CORRELATION FUNCTIONS We now define the n-point correlation functions.We pro-vide a high-level description;for a more thorough defintion,see[20,27].Once we have given a simple description of thenpcf,we turn to the main problem of this paper:the com-putational task of estimating the npcf from real data.We give several common estimators for the npcf,and highlightthe underlying counting problem in each.We also discussthe full computational task involved in a useful estimate ofthe npcf for real scientific problems.Problem setting.Our data are drawn from a point pro-cess.The data consist of a set of points D in a subset of R d.Note that we are not assuming that the locations of in-dividual points are independent,just that our data set is afair sample from the underlying ensemble.We assume thatdistant parts of our sample window are uncorrelated,so thatby averaging over them,we can approximate averages over multiple samples from the point process.Following standard practice in astronomy,we assume thatthe process is homogeneous and isotropic.Note that the n-point correlations can be defined both for more general pointprocesses and for continuous randomfields.The estimatorsfor these cases are similar to the ones described below and can be improved by similar algorithmic techniques.Defining the npcf.We now turn to an informal,in-tuitive description of the hierarchy of n -point correlations.Since we have assumed that properties of the point process are translation and rotation invariant,the expected number of points in a given volume is proportional to a global den-sity ρ.If we consider a small volume element dV ,then the probability of finding a point in dV is given by:dP =ρdV(1)with dV suitably normalized.If the density ρcompletely characterizes the process,we refer to it as a Poisson process.Two-point correlation.The assumption of homogene-ity and isotropy does not require the process to lack struc-ture.The positions of points may still be correlated.The joint probability of finding objects in volume elements dV 1and dV 2separated by a distance r is given by:dP 12=ρ2dV 1dV 2(1+ξ(r ))(2)where the dV i are again normalized.The two-point corre-lation ξ(r )captures the increased or decreased probability of points occurring at a separation r .Note that the 2-point correlation is characterized by a single scale r and is a con-tinuously varying function of this distance.Three-point correlation.Higher-order correlations de-scribe the probabilities of more than two points in a given configuration.We first consider three small volume ele-ments,which form a triangle (See Fig.1(b)).The joint probability of simultaneously finding points in volume ele-ments dV 1,dV 2,and dV 3,separated by distances r 12,r 13,and r 23,is given by:dP 123=ρ3dV 1dV 2dV 3[1+ξ(r 12)+ξ(r 23)+ξ(r 13)+ζ(r 12,r 23,r 13)](3)The quantity in square brackets is sometimes called the com-plete (or full)3-point correlation function and ζis the 
re-duced 3-point correlation function .We will often refer to ζas simply the 3-point correlation function,since it will be the quantity of computational interest to us.Note that unlike the 2-point correlation,the 3-point correlation depends both on distance and configuration.The function varies contin-uously both as we increase the lengths of the sides of the triangle and as we vary its shape,for example by fixing two legs of the triangle and varying the angle between them.Higher-order correlations.Higher-order correlation functions (such as the 4-point correlation in Fig.1(c))are defined in the same fashion.The probability of finding n -points in a given configuration can be written as a summa-tion over the n -point correlation functions.For example,in addition to the reduced 4-point correlation function η,the complete 4-point correlation depends on the six 2-point terms (one for each pairwise distance),four 3-point terms (one for each triple of distances),and three products of two 2-point functions.The reduced four-point correlation is a function of all six pairwise distances.In general,we will denote the n -point correlation function as ξ(n )(·),where the argument is understood to be `n 2´pairwise distances.We refer to this set of pairwise distances as a configuration ,or in the computational context,as a matcher (see below).2.1Estimating the NPCFWe have shown that the n -point correlation function is a fundamental spatial statistic and have sketched the defi-nitions of the n -point correlation functions in terms of theunderlying point process.We now turn to the central taskof this paper:the problem of estimating the n -point corre-lation from real data.We describe several commonly used estimators and identify their common computationaltask.(a)2-point (b)3-point (c)4-pointFigure 1:Visual interpretation of the n -point corre-lation functions.We begin by considering the task of computing an esti-mate b ξ(n )(r )for a given configuration.For simplicity,we consider the 2-point function first.Recall that ξ(r )captures the increased (or decreased)probability of finding a pair of points at a distance r over finding the pair in a Poisson distributed set.This observation suggests a simple Monte Carlo estimator for ξ(r ).We generate a random set of points R from a Poisson distribution with the same (sample)den-sity as our data and filling the same volume.We then com-pare the frequency with which points appear at a distance close to r in our data versus in the random set.Simple estimator.Let DD(r )denote the number of pairs of points (x i ,x j )in our data,normalized by the total number of possible pairs,whose pairwise distance d (x i ,x j )is close to r (in a way to be made precise below).Let RR(r )be the number of points whose pairwise distances are in the same interval (again normalized)from the random sample (DD stands for data-data,RR for random-random).Then,a simple estimator for the two-point correlation is [20,27]:ˆξ(r )=DD(r )RR(r )−1(4)This estimator captures the intuitive behavior we expect.If pairs of points at a distance near r are more common in our data than in a completely random (Poisson)distribution,we are likely to obtain a positive estimate for ξ.This simple estimator suffers from suboptimal variance and sensitivity to noise.The Landy-Szalay estimator [12]ˆξ(r )=DD (r )−2DR (r )+RR (r )RR (r )(5)overcomes these difficulties.Here the notation DR (r )de-notes the number of pairs (x i ,y j )at a distance near r where x i is from the data and y j is from the Poisson sample (DR 
–data-random pairs).Other widely used estimators use the same underlying quantities –pairs of points satisfying a given distance constraint [12,10].Three-point estimator.The 3-point correlation func-tion depends on the pairwise distances between three points,rather than a single distance as before.We will therefore need to specify three distance constraints,and estimate the function for that configuration.The Landy-Szalay estima-tor for the 2-point function can be generalized to any value of n ,and retains its improved bias and variance [29].Weagain generate points from a Poisson distribution,and the estimator is also a function of quantities of the form D (n ),or D (i )R (n −i ).These refer to the number of unique triples of points,all three from the data or two from the data and one from the Poisson set,with the property that their three pairwise distances lie close to the distances in the matcher.All estimators count tuples of points.Any n -point correlation can be estimated using a sum of counts of n -tuples of points of the form D (i )R (n −i )(r ),where i ranges from zero to n .The argument is a vector of distances of length `n 2´,one for each pairwise distance needed to specify the configuration.We count unique tuples of points whose pairwise distances are close to the distances in the matcher in some ordering.2.2The Computational TaskNote that all the estimators described above depend on the same fundamental quantities:the number of tuples of points from the data/Poisson set that satisfy some set of dis-tance constraints.Thus,our task is to compute this number given the data and a suitably large Poisson set.Enumer-ating all n -tuples requires O (N n )work for N data points.Therefore,we must seek a more efficient approach.We first give some terminology for our algorithm description below.We then present the entire computational task.Matchers.We specify an n -tuple with `n 2´constraints,one for each pairwise distance in the tuple.We mentioned above that we count tuples of points whose pairwise dis-tances are“close”to the distance constraints.Each of the `n2´pairwise distance constraints consists of a lower and upperbound:r (l )ij and r (u )ij .In the context of our algorithms,we refer to this collection of distance constraints r as a matcher .We refer to the entries of the matcher as r (l )ij and r (u )ij ,where the indices i and j refer to the volume elements introduced above (Fig.1).We sometimes refer to an entry as simply r ij ,with the upper and lower bounds being understood.Satisfying matchers.Given an n -tuple of points and a matcher r ,we say that the tuple satisfies the matcher if there exists a permutation of the points such that each pair-wise distance does not violate the corresponding distance constraint in the matcher.More formally:Definition 1.Given an n -tuple of points (p 1,...,p n )and a matcher r ,we say that the tuple satisfies the matcher if there exists (at least one)permutation σof [1,...,n ]such thatr (l )σ(i )σ(j )< p i −p j <r (u )σ(i )σ(j )(6)for all indices i,j ∈[1,...n ]such that i <j .The computational task.We can now formally define our basic computational task:Definition putational Task 1:Compute the counts of tuples D (i )R (j )(r ).Given a data set D ,random set R ,and matcher r ,and 0≤i ≤n ,compute the number of unique n -tuples of points with i points from D and n −i points from R ,such that the tuple satisfies the puting these quantities directly requires enumerating all unique n -tuples of points,which takes O (N n )work and is prohibitively slow 
for even two-point correlations.Multiple matchers.The estimators above give us avalue b ξ(n )(r )at a single configuration.However,the n -point correlations are continuous quantities.In order to fully char-acterize them,we must compute estimates for a wide range of configurations.In order to do this,we must repeat the computation in Defn.2for many matchers,both of different scales and configurations.Definition putational Task 2:Multiple match-ers.Given a data set D ,random set R ,and a collection of M matchers {r m },compute D (i )R (j )(r m )for each 1≤m ≤M .This task requires us to repeat Task 1M times,where M controls the smoothness of our overall estimate of the npcf and our quantitative picture of its overall behavior.Thus,it is generally necessary for M to be large.Resampling.Simply computing point estimates of any statistic is generally insufficient for most scientific applica-tions.We also need to bound the variance of our estima-tor and compute error bars.In general,we must make the largest possible use of the available data,rather than with-holding much of it for variance estimation.Thus,a resam-pling method is necessary.Jackknife resampling is a widely used variance estimation method and is popular with astronomical data [15].It is also used to study large scale structure by identifying variations in the npcf across different parts of the sample window [16].We divide the data set into subregions.We eliminate each region from the data in turn,then compute our estimate of the npcf.We repeat this for each subset,and use the resulting estimates to bound the variance.This leads to our third and final computational task.Definition putational Task 3:Jackknife re-sampling.We are given a data set D ,random set R ,a set of M matchers r m ,and a partitioning of D into J subsets D k .For each 1≤k ≤J ,construct the set D (−k )=D/D k .Then,compute D (i )(−k )R (j )(r ).This task requires us to repeat Task 1J times on sets of size D −D/J .Note that J controls the quality of our vari-ance estimation,with larger values necessary for a better estimate.The complete computational task.We can now iden-tify the complete computational task for n -point correlation estimation.Given our data and random sets,a collection of M matchers,and a partitioning of the data into J subre-gions,we must perform Task 3.This in turn requires us to perform Task 2J times.Each iteration of Task 2requires M computations of Task 1.Therefore,the entire compu-tation requires O (J ·M ·N n )time if done in a brute-force fashion.In the next section,we describe our algorithmic approach to simultaneously computing all three parts of the computation,thus allowing significant savings in time.3.ALGORITHMSWe have identified the full computational task of n -pointcorrelation estimation.We now turn to our new algorithm.We begin by addressing previous work on efficiently comput-ing the counts D (i )R (j )(r )described above (Computational Task 1.)We first describe the multi-tree algorithm for com-puting these counts.We then give our new algorithm for directly solving computations with multiple matchers (Com-putational Task 2)and our method for efficiently computing counts for resampling regions (Computational Task 3).3.1Basic AlgorithmWe build on previous,tree-based algorithms for the n -point correlation estimation problem [9,18].The key idea is to employ multiple kd -trees to improve on the O (N n )scaling of the brute-forceapproach.(a)Comparing twonodes.(b)Comparing threenodes.Figure 2:Computing node-node bounds for prun-ing.kd 
-trees.The kd -tree [21,5]is a binary space partition-ing tree which maintains a bounding box for all the points in each node.The root consists of the entire set.Children are formed recursively by splitting the parent’s bounding box along the midpoint of its largest dimension and partitioning the points on either side.We can build a kd -tree on both the data and random sets as a pre-processing step.This requires only O (N log N )work and O (N )space.We employ the bounding boxes to speed up the naive com-putation by using them to identify opportunities for prun-ing .By computing the minimum and maximum distances between a pair of kd -tree nodes (Fig.2),we can identify cases where it is impossible for any pair of points in the nodes to satisfy the matcher.Dual-tree algorithm.For simplicity,we begin by con-sidering the two-point correlation estimation (Alg.1).Recall that the task is to count the number of unique pairs of points that satisfy a given matcher.We consider two tree nodes at a time,one from each set to be correlated.We compute the upper and lower bounds on distances between points in these nodes using the bounding boxes.We can then com-pare this to the matcher’s lower and upper bounds.If the distance bounds prove that all pairs of points are either too far or too close to possibly satisfy the matcher,then we do not need to perform any more work on the nodes.We can thus prune all child nodes,and save O (|T 1|·|T 2|)work.If we cannot prune,then we split one (or both)nodes,and re-cursively consider the two (or four)resoling pairs of nodes.If our recursion reaches leaf nodes,we compare all pairs of points exhaustively.We begin by calling the algorithm on the root nodes of the tree.If we wish to perform a DR count,we call the algorithm on the root of each tree.Note also that we only want to count unique pairs of points.Therefore,we can prune if T 2comes before T 1in an in-order tree traversal.This ensures that we see each pair of points at most once.Multi-tree algorithm.We can extend this algorithm to the general n case.Instead of considering pairs of tree nodes,we compare an n -tuple of nodes in each step of the algorithm.This multi-tree algorithm uses the same basic idea –use bounding information between pairs of tree nodes to identify sets of nodes whose points cannot satisfy the matcher.We need only make two extensions to Alg.1.First,we must do more work to determine if a particularAlgorithm 1DualTree2pt (Tree node T 1,Tree node T 2,matcher r )if T 1and T 2are leaves thenfor all points p 1∈T 1,p 2∈T 2doif r (l )12< p 1−p 2 <r (u )12then result +=15:end ifend forelse if d min (T 1,T 2)>r (u )12or d max (T 1,T 2)<r (l )12then Prune else 10:DualTree2pt (T 1.left ,T 2.left)DualTree2pt (T 1.left ,T 2.right)DualTree2pt (T 1.right ,T 2.left)DualTree2pt (T 1.right ,T 2.right)end if Algorithm 2MultiTreeNpt (Tree node T 1,...,Tree node T n ,matcher r )if all nodes T i are leaves thenfor all points p 1∈T 1,...,points p n ∈T n do if TestPointTuple (p 1,...,p n ,r )then result +=15:end ifend forelse if not TestNodeTuple (T 1,...,T n ,r )then Prune else 10:Let T i be the largest nodeMultiTreeNpt (T 1,...,T i .left ,...,T n ,r )MultiTreeNpt (T 1,...,T i .right ,...,T n ,r )end iftuple of points satisfies the matcher.We accomplish this in Alg.3by iterating over all permutations of the indices.Each permutation of indices corresponds to an assignment of pairwise distances to entries in the matcher.We can quickly check if this assignment is valid,and we only count tuples that have at least one valid 
assignment.The second exten-sion is a similar one for checking if a tuple of nodes can be pruned (Alg.4).We again iterate through all permu-tations and check if the distance bounds obtained from the bounding boxes fall within the upper and lower bounds of the matcher entry.As before,for an D (i )R (j )count,we call the algorithm on i copies of the data tree root and j copies of the random tree root.3.2Multi-Matcher AlgorithmThe algorithms presented above all focus on computing in-dividual counts of points –putational Task 1,from Sec.2.2.This approach improves the overall dependence on the number of data points –N –and the order of the cor-relation –n .However,this does nothing for the other two parts of the overall computational complexity.We now turn to our novel algorithm to count tuples for many matchers simultaneously,thus addressing Computational Task 2.Intuitively,computing counts for multiple matchers will repeat many calculations.For simplicity,consider the two-point correlation case illustrated in Fig.3(a).We must count the number of pairs that satisfy two matchers,r 1and r 2(assume that the upper and lower bounds for each are very。
