Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks
Design and Analysis of Parallel Algorithms - Ch. 2

Example (pointer jumping): n = 7, on the linked list a→b→c→d→e→f→g. (Final ranks, from the figure: r[a..g] = 6, 5, 4, 3, 2, 1, 0.)
(1) Initially p[a]=b, p[b]=c, p[c]=d, p[d]=e, p[e]=f, p[f]=g, p[g]=g,
and r[a]=r[b]=r[c]=r[d]=r[e]=r[f]=1, r[g]=0.
(2) After one jumping step: p[a]=c, p[b]=d, p[c]=e, p[d]=f, p[e]=p[f]=p[g]=g.
Time analysis: t(n) = m × O(1) = O(log n), p(n) = n/2, c(n) = O(n log n); not cost-optimal.
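The following is a sequential simulation of the pointer-jumping rounds above (the dictionary names `succ` and `rank` are my own, not from the slides): every round doubles each pointer while accumulating ranks, so m = ⌈log₂ n⌉ rounds suffice, matching the t(n) = O(log n) analysis.

```python
# Sketch of pointer jumping (list ranking), simulating the parallel
# rounds sequentially. succ[v] is the successor pointer p[v]; the tail
# points to itself. rank[v] starts at 1 (0 for the tail) and ends up
# holding v's distance to the end of the list.
def list_rank(succ):
    rank = {v: 0 if succ[v] == v else 1 for v in succ}
    while any(succ[v] != succ[succ[v]] for v in succ):
        # One synchronous round: read all old values first, then write,
        # mimicking the par-do semantics of a PRAM step.
        new_rank = {v: rank[v] + rank[succ[v]] for v in succ}
        new_succ = {v: succ[succ[v]] for v in succ}
        rank, succ = new_rank, new_succ
    return rank

succ = dict(zip("abcdefg", "bcdefgg"))  # the n = 7 example above
print(list_rank(succ))  # {'a': 6, 'b': 5, ..., 'g': 0}
```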
2.1 The Balanced-Tree Method
Prefix sums
Problem definition
Given n elements {x1, x2, …, xn}, the prefix sums are the n partial sums Si = x1 * x2 * … * xi, 1 ≤ i ≤ n, where * is an associative binary operation such as + or ×.
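To make the definition concrete, here is a sequential simulation (my own illustration, not from the slides) of a data-parallel prefix-sum scheme: in round t, element i combines with the partial result 2^t positions to its left, so ⌈log₂ n⌉ rounds suffice, with each round costing O(1) time given one processor per element.

```python
# Inclusive prefix sums S_i = x_1 * ... * x_i by recursive doubling.
# Sequentially simulated; each round would be one parallel step.
def prefix_sums(x, op=lambda a, b: a + b):
    s = list(x)
    step = 1
    while step < len(s):
        # Read the old values before writing, as a synchronous step would.
        prev = s[:]
        for i in range(step, len(s)):
            s[i] = op(prev[i - step], prev[i])
        step *= 2
    return s

print(prefix_sums([1, 2, 3, 4, 5, 6, 7]))  # [1, 3, 6, 10, 15, 21, 28]
```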
2.1 The Balanced-Tree Method
Design idea
• The leaves of the tree hold the input and the internal nodes act as processing nodes; values are combined level by level from the leaves toward the root (e.g., to find the maximum).
• The same tree structure can then be used to compute prefix sums.
2.1 The Balanced-Tree Method
Algorithm 2.1: Finding the maximum on an SIMD-SM machine
Begin
  for k = m-1 downto 0 do
    for j = 2^k to 2^(k+1) - 1 par-do
      A[j] = max{A[2j], A[2j+1]}
    end for
  end for
End
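A direct simulation of Algorithm 2.1 (the 1-based heap layout and the function name are my choices): the leaves A[2^m .. 2^(m+1)-1] hold the n = 2^m inputs, level k of the tree is filled in one par-do step, and the root holds the maximum after m steps. The widest level uses n/2 processors, consistent with the analysis t(n) = O(log n), p(n) = n/2, c(n) = O(n log n) above.

```python
# Simulate Algorithm 2.1: bottom-up max on a balanced binary tree.
# A[1] is the root; leaves A[2**m .. 2**(m+1)-1] hold the n = 2**m inputs.
def tree_max(values):
    n = len(values)
    m = n.bit_length() - 1
    assert n == 1 << m, "sketch assumes n is a power of two"
    A = [None] * (2 * n)          # 1-based heap layout, A[0] unused
    A[n:2 * n] = values           # leaves
    for k in range(m - 1, -1, -1):            # for k = m-1 downto 0
        for j in range(1 << k, 1 << (k + 1)): # par-do over level k
            A[j] = max(A[2 * j], A[2 * j + 1])
    return A[1]

print(tree_max([3, 1, 4, 1, 5, 9, 2, 6]))  # 9
```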
Fast parallel algorithms for short-range molecular dynamics

Fast Parallel Algorithms for Short-Range Molecular Dynamics
Steve Plimpton
Parallel Computational Sciences Department 1421, MS 1111
Sandia National Laboratories, Albuquerque, NM 87185-1111
(505) 845-7873, sjplimp@
Keywords: molecular dynamics, parallel computing, N-body problem
1 Introduction
Classical molecular dynamics (MD) is a commonly used computational tool for simulating the properties of liquids, solids, and molecules [1, 2]. Each of the N atoms or molecules in the simulation is treated as a point mass and Newton's equations are integrated to compute their motion. From the motion of the ensemble of atoms a variety of useful microscopic and macroscopic information can be extracted, such as transport coefficients, phase diagrams, and structural or conformational properties. The physics of the model is contained in a potential energy functional for the system from which individual force equations for each atom are derived.

MD simulations are typically not memory intensive since only vectors of atom information are stored. Computationally, the simulations are "large" in two domains: the number of atoms and the number of timesteps. The length scale for atomic coordinates is Angstroms; in three dimensions many thousands or millions of atoms must usually be simulated to approach even the sub-micron scale. In liquids and solids the timestep size is constrained by the demand that the vibrational motion of the atoms be accurately tracked. This limits timesteps to the femtosecond scale, and so tens or hundreds of thousands of timesteps are necessary to simulate even picoseconds of "real" time. Because of these computational demands, considerable effort has been expended by researchers to optimize MD calculations for vector supercomputers [24, 30, 36, 45, 47] and even to build special-purpose hardware for performing MD simulations [4, 5]. The current state-of-the-art is such that simulating ten- to hundred-thousand-atom systems for picoseconds takes hours of CPU time on machines such as the Cray Y-MP.

The fact that MD computations are inherently parallel has been extensively discussed in the literature [11, 22]. There has been considerable effort in the last few years by researchers to exploit this parallelism on various machines. The majority of the work that has included implementations of proposed algorithms has been for single-instruction/multiple-data (SIMD) parallel machines such as the CM-2 [12, 52], or for multiple-instruction/multiple-data (MIMD) parallel machines with a few dozens of processors [26, 37, 39, 46]. Recently there have been efforts to create scalable algorithms that work well on hundred- to thousand-processor MIMD machines [9, 14, 20, 41, 51]. We are convinced that the message-passing model of programming for MIMD machines is the only one that provides enough flexibility to implement all the data structure and computational enhancements that are commonly exploited in MD codes on serial and vector machines. Also, we have found that it is only the current generation of massively parallel MIMD machines with hundreds to thousands of processors that have the computational power to be competitive with the fastest vector machines for MD calculations.

In this paper we present three parallel algorithms which are appropriate for a general class of MD problems that has two salient characteristics. The first characteristic is that forces are limited in range, meaning each atom interacts only with other atoms that are geometrically nearby. Solids and liquids are often modeled this way due to electronic screening effects or simply to avoid the computational cost of including long-range Coulombic forces. For short-range MD the computational effort per timestep scales as N, the number of atoms.
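To illustrate why a bounded force cutoff makes the per-timestep cost O(N), here is a hedged sketch (not Plimpton's code; the function name, the Lennard-Jones pair potential, and the parameters rc, eps, sigma are my illustrative choices) of a cell-list force evaluation, the kind of serial kernel that short-range MD codes build on: binning atoms into cells of side at least the cutoff limits each atom's candidate partners to a constant number of neighboring cells.

```python
# Sketch: O(N) short-range force evaluation with a cell list.
# Atoms interact only within a cutoff rc, so binning atoms into cells
# of side >= rc limits each atom's partners to 27 neighboring cells.
# Illustrative only (Lennard-Jones pair force, no periodic boundaries).
import itertools
from collections import defaultdict

def lj_forces(pos, rc=2.5, eps=1.0, sigma=1.0):
    cells = defaultdict(list)
    for idx, p in enumerate(pos):
        cells[tuple(int(c // rc) for c in p)].append(idx)
    f = [[0.0, 0.0, 0.0] for _ in pos]
    for cell, members in cells.items():
        # Gather atoms in this cell and the 26 surrounding cells.
        neigh = [j for d in itertools.product((-1, 0, 1), repeat=3)
                 for j in cells.get(tuple(c + o for c, o in zip(cell, d)), [])]
        for i in members:
            for j in neigh:
                if j <= i:
                    continue  # count each pair once (Newton's 3rd law)
                dr = [a - b for a, b in zip(pos[i], pos[j])]
                r2 = sum(c * c for c in dr)
                if r2 < rc * rc:
                    sr6 = (sigma * sigma / r2) ** 3
                    coef = 24 * eps * sr6 * (2 * sr6 - 1) / r2
                    for k in range(3):
                        f[i][k] += coef * dr[k]
                        f[j][k] -= coef * dr[k]
    return f
```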
Computing Dihedral Angles with LAMMPS

Introduction: In materials science and chemistry, the dihedral angle is one of the important parameters used to describe molecular structure.
It is defined by four bonded atoms A, B, C, and D: it is the angle between the plane containing A, B, C and the plane containing B, C, D, i.e., the torsion about the B-C bond.
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a molecular dynamics simulation package that can be used to compute and model the structure and properties of materials.
This article describes how to compute dihedral angles with LAMMPS.
1. About LAMMPS
LAMMPS is an open-source molecular dynamics package widely used in materials science, chemistry, biophysics, and related fields.
It simulates the motion of atoms and molecules in time and space and can be used to compute mechanical, thermal, and electrical properties of materials.
It offers good accuracy and high simulation efficiency, and is applied across many research areas.
2. Definition of the dihedral angle
The dihedral angle is one of the important parameters describing molecular structure.
For four atoms A, B, C, and D, the dihedral angle is the angle between the plane defined by A, B, C and the plane defined by B, C, D; equivalently, it is the torsion angle about the B-C bond.
By the usual convention it is 0 degrees when the four atoms are coplanar in the cis (eclipsed) arrangement and 180 degrees in the trans (anti) arrangement.
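A minimal NumPy sketch of this definition (my own illustration; LAMMPS evaluates dihedral terms internally), using the standard signed-torsion formula built from the two plane normals:

```python
# Sketch: dihedral angle phi for four atoms A-B-C-D, defined as the
# angle between the plane through A,B,C and the plane through B,C,D
# (equivalently, the torsion about the B-C bond).
import numpy as np

def dihedral(a, b, c, d):
    b1, b2, b3 = b - a, c - b, d - c
    n1 = np.cross(b1, b2)          # normal of plane A,B,C
    n2 = np.cross(b2, b3)          # normal of plane B,C,D
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    x, y = np.dot(n1, n2), np.dot(m, n2)
    return np.degrees(np.arctan2(y, x))   # signed angle in (-180, 180]

# trans (anti) arrangement: phi = 180 degrees
print(dihedral(np.array([0., 1., 0.]), np.array([0., 0., 0.]),
               np.array([1., 0., 0.]), np.array([1., -1., 0.])))
```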
3. Steps for computing dihedral angles
In LAMMPS, dihedral angles can be computed through the following steps:
1. Define the molecular structure: first, define the structure of the molecule, including atom coordinates and connectivity (bond and dihedral topology). This can be supplied in a LAMMPS input format such as a data file.
2. Add the relevant commands: in the LAMMPS input script, add the commands for dihedral calculations. The "dihedral_style" command selects how dihedral interactions are evaluated; commonly used styles include those of the CHARMM, OPLS, and AMBER-type force fields.
3. Run the calculation: run LAMMPS on the input script. LAMMPS simulates the molecule according to the commands in the input file and evaluates the dihedral terms.
4. Analyze the results: the computed dihedral values can be written to output files with LAMMPS output commands for later analysis and visualization.
4. Example application
Taking a protein molecule as an example, we can use LAMMPS to compute the dihedral angles in the protein.
First, prepare the protein structure file; the atomic coordinates can be downloaded from the PDB database.
Skyline Queries with Noisy Comparisons (PODS 2015)

Skyline Queries with Noisy Comparisons

Benoit Groz*† (benoit.groz@lri.fr), Tova Milo* (milo@cs.tau.ac.il)
*Tel Aviv University, School of Computer Science, Tel Aviv, Israel
†Univ Paris-Sud, LRI, UMR 8623, and INRIA, Orsay F-91405, France

ABSTRACT
We study in this paper the computation of skyline queries - a popular tool for multicriteria data analysis - in the presence of noisy input. Motivated by crowdsourcing applications, we present the first algorithms for skyline evaluation in a computation model where the input data items can only be compared through noisy comparisons. In this model comparisons may return wrong answers with some probability, and confidence can be increased through independent repetitions of a comparison. Our goal is to minimize the number of comparisons required for computing or verifying a candidate skyline, while returning the correct answer with high probability. We design output-sensitive algorithms, namely algorithms that take advantage of the potentially small size of the skyline, and analyze the number of comparison rounds of our solutions. We also consider the problem of predicting the most likely skyline given some partial information in the form of noisy comparisons, and show that optimal prediction is computationally intractable.

1. INTRODUCTION
The rapid expansion of data generated by web users, sensor networks, and other noisy/uncertain data sources raises new challenges for decision support systems. We focus in this paper on the computation of skyline queries - a popular tool for multicriteria data analysis - in the presence of noisy input. Given a set of data items, the skyline is the subset of items (a.k.a. Pareto optima) that are not "dominated", where an item is dominated if there is another item that is superior for every criterion. For instance, consider a scenario in which we wish to identify which cities offer the highest salaries together with high quality education. The skyline and dominated items are as illustrated in Figure 1.

Skyline queries are traditionally viewed as a problem of computing maximal vectors in a multidimensional space R^d; data items correspond to points and each criterion corresponds to one dimension. Much research has been devoted to efficient skyline computation in the presence of exact data.
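For intuition, here is a minimal noiseless sketch (my own, not from the paper) of the definition just given: an item belongs to the skyline iff no other item is at least as good on every criterion and strictly better on at least one.

```python
# Sketch: naive noiseless skyline (Pareto optima) in O(d * n^2).
# Items are tuples of d numeric criteria; larger is better.
def dominates(w, v):
    # w dominates v: w is >= v everywhere and > v somewhere.
    return all(a >= b for a, b in zip(w, v)) and any(a > b for a, b in zip(w, v))

def skyline(items):
    return [v for v in items if not any(dominates(w, v) for w in items)]

cities = [(90, 40), (70, 80), (60, 85), (50, 50)]  # (salary, education)
print(skyline(cities))  # [(90, 40), (70, 80), (60, 85)]
```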
[Figure 1: Skyline example - cities plotted by education quality and salary, with skyline items and dominated items marked.]

For the noisy case, the main approaches so far deal with the computational complexity of skyline queries when the input is a set of points with uncertain location. In contrast, we consider here a different setting where no information is given a priori about point location and points can only be compared, through noisy comparisons, along each dimension. We study the complexity of skyline computation here in terms of the required number of such comparisons.

This setting is motivated by the processing of skyline queries in crowdsourcing scenarios. In such settings, uncertainty is inherent, numerical estimates are not always relevant, and the focus lies in general on the cost of interaction with the crowd rather than computational complexity. To illustrate, let us consider a crowdsourcing scenario based on the example from Figure 1. Information about the average salary or schooling in different cities may be missing and people may not be able to return numerical estimates. Instead, comparing different cities may be more natural (Marcus et al. [34], for instance, show that comparisons provide more accurate rankings than ratings in certain crowdsourcing experiments). Therefore to compute the skyline with the help of the crowd we can ask people questions of the form "is the education system superior in city x or city y?" or "can I expect a better salary in city x or city y?". Of course, people are likely to make mistakes, and so each question is typically posed to multiple people. Our objective is to minimize the number of questions that need to be issued to the crowd, while returning the correct skyline with high probability.
We refer to our computation model as the noisy comparison model. We assume that items are fully ordered along each dimension. The order is unknown but items can be compared through oracles (<_i), i ≤ d, where <_i compares a pair of items on dimension i. Each call to the comparison oracle would intuitively be implemented in our crowd scenario by asking a new person to compare two items on a particular dimension. In order to take into account noisy answers, we model queries to the comparison oracle as i.i.d. random boolean variables that may return an erroneous answer with probability bounded away from 1/2, e.g., p < 1/3. The assumption here is that there is an underlying ground truth, but the oracle may make mistakes. We thus design algorithms to mitigate those mistakes. Our cost model reports the number of oracle calls required for skyline computation, rather than computational complexity. Our assumption that error is bounded away from 1/2 makes sense for real-life scenarios as it is hard to distinguish error probabilities close to 1/2 from statistical noise.

Of course the model is a simplification of actual crowd behavior, and evaluating skyline queries on a real crowd raises many further issues which we leave for future work, such as estimating the error rate of workers and dealing with varying error rates. Nevertheless the formal results in this paper may serve as a yardstick on what could be expected from the performance of algorithms in refined crowdsourcing models.

Contributions. In this paper, we provide the first algorithms for skyline queries in the noisy comparison model, and evaluate their performance with respect to the following parameters:
- the number of input items n
- the dimension (number of criteria) d
- the error probability tolerated for the result δ
- the (unknown) skyline cardinality k

Specifically, we show that if we want our algorithms to return the correct answer with probability at least 1 − δ:
• we can check if a candidate set of k items is the skyline with O(dnk log(1/δ)) or O(dn log(dk/δ)) comparisons,
• the skyline can be computed with O(dkn log(dk/δ)), O(dk²n log(k/δ)), or O(dn log(dn/δ)) comparisons,
• Ω(n log(k/δ)) comparisons are necessary to check a candidate skyline in the worst case (hence also to compute the skyline).

These algorithms rely on sorting, binary search, and maxima procedures from the literature. The complexity of the corresponding problems in presence of noisy comparisons has indeed been established in [16] as Θ(n log(n/δ)), Θ(log(n/δ)), and Θ(n log(1/δ)). Our results thus show that naïvely sorting the data along all dimensions and computing the skyline based on the corresponding orders is optimal for constant d when k = Ω(n), and we provide more efficient solutions when this is not the case.

We also analyze the number of rounds (#rounds) required by our algorithms when all comparisons whose execution has been decided at some point of the algorithm are processed in a single round in parallel. This measure is in particular relevant for crowdsourcing scenarios where questions are issued in batches, hence the number of successive batches provides some measure of the time required to complete the scenario. Obtaining low #rounds for sorting-based algorithms proved challenging. The first (and so far unique) efficient parallel sorting algorithm in presence of noise derives from the notoriously complex AKS comparator circuit network, which achieves an optimal #rounds of O(log n) rounds. We also design a simpler algorithm that does not rely on the AKS network, runs in (optimal) O(n log(n/δ)) noisy comparisons, and uses O(n^α) rounds for any (arbitrarily small) constant α > 0.
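To make the model concrete, here is a hedged sketch (the function names and the repetition constant are mine, not the paper's) of the standard amplification the paper builds on: a comparison with constant error p ≤ 1/3 is repeated O(log(1/δ)) times and majority-voted, which by a Chernoff bound yields any desired tolerance δ.

```python
# Sketch: boosting a noisy comparison to tolerance delta by majority
# vote. noisy_less(v, w, i) is an assumed base oracle answering
# "v <=_i w?" incorrectly with probability p <= 1/3; roughly
# c * log(1/delta) independent repetitions suffice (c is illustrative).
import math, random

def compare(noisy_less, v, w, i, delta, c=18):
    reps = max(1, math.ceil(c * math.log(1 / delta)))
    yes = sum(noisy_less(v, w, i) for _ in range(reps))
    return yes * 2 > reps

# A toy base oracle over numeric tuples with error probability 1/3.
def make_oracle(p=1/3):
    def noisy_less(v, w, i):
        truth = v[i] <= w[i]
        return truth if random.random() > p else not truth
    return noisy_less

oracle = make_oracle()
print(compare(oracle, (1, 5), (2, 3), 0, delta=1e-3))  # True w.h.p.
```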
To achieve our results, this paper slightly extends several results from the skyline [28] and fault-tolerant sorting [32] literature to fit our arbitrary dimension and arbitrary precision setting. Finally, to complete the picture, we also consider an incremental scenario where some noisy comparisons were already performed, and the task is to process or complement the collected information in order to compute the most likely skyline. We thus prove that results from [19] about maxima computation can be extended to show that in presence of arbitrary sorting information (a multiset of noisy comparison results), it is Θ_2^p-hard for every dimension d to:
• compute the most likely skyline,
• decide which additional comparison will most increase our confidence on the skyline.
This setting may in particular be relevant for our crowdsourcing scenario if one endeavours to make the best of the comparison data available instead of computing this information from scratch.

Organization. Section 2 introduces formally our model and the problems we investigate. The results from the literature that we exploit in this paper (e.g., sorting and searching with noisy comparisons) are introduced in Section 2, whereas Section 7 provides a broader overview of related work. Section 3 investigates the complexity of verifying a candidate skyline, and those techniques are exploited in Section 4 to devise algorithms for computing skylines. The latency of our algorithms is analyzed in Section 5. Finally, Section 6 is devoted to our hardness results for the optimal exploitation of available information in skyline computation.

2. TECHNICAL PRELIMINARIES
Comparison-based results for computing skylines in the noiseless case typically rely on sorting and max algorithms and bounds [31]. In this section we first present our computation model. Then we survey skyline problems in the noiseless case. Finally we discuss sorting and max algorithms and bounds with noisy comparisons.

2.1 Model and Notations
Let S denote a set of n items. We assume these items admit a full (but not necessarily strict) order ≤_i along d dimensions i ∈ {1, ..., d}. These implicit orders are not known and can only be discovered through queries of the form "is item v' superior to item v along dimension i, i.e., v ≤_i v'?". We also write v ⊑ v' to denote that v ≤_i v' for each i ≤ d. When v ⊑ v' and there is some i ≤ d such that v <_i v', we say that v' dominates v, which we denote by v ≺ v'. The notation extends to any set of items C: v ≺ C iff there exists some v' ∈ C such that v ≺ v'. Finally, we denote the lexicographic order among items with <_lex: v <_lex v' iff there is j ≤ d satisfying both (1) for all i < j, v ≤_i v', and (2) v <_j v'.

Definition 1. Given a set of d-dimensional items S, the skyline of S is the set of items that are not dominated (we assume that two items cannot coincide):
Sky(S) = {v ∈ S | ∀v' ∈ S \ {v}, ∃i ≤ d. v >_i v'}.

Remark 1. Skylines are a generalization to multiple dimensions of the maximum problem: for d = 1, the skyline Sky(S) is the maximal item of S. We therefore only consider the case d ≥ 2 in the proofs.

Noisy comparison model. The noisy comparison model assumes we are given access to an oracle that takes as input a pair of items v, v' together with a dimension i ≤ d, and answers with probability at least¹ 1 − p whether v ≤_i v', where p is a fixed constant p < 1/2 (say, p ≤ 1/3). We assume that oracle queries are independent, so that repeating a query decreases the probability of error. The noiseless comparison model corresponds to the particular case where p = 0.

We shall also consider the more limited noisy boolean variable model
from [16] where the algorithm takes as input a set of boolean variables and an oracle providing, with error probability p ≤ 1/3, the correct value of the variables. The latter can be considered as a noisy comparison model where each item can only be compared to a specific item representing the variable "0".

The problems we consider in these models take as input a parameter δ called the tolerance, and the algorithms must return the correct answer with error probability at most δ. Henceforth we shall abbreviate error probability as err.pr. and omit to mention that the error probability is obviously allowed to be smaller than δ.

Complexity measure. Our focus is on the worst case oracle complexity; the number of calls to the comparison oracle. But at each step of their execution, our algorithms may require extensive computation, based on previous comparison answers, to decide which comparisons should be asked next or which answer should be returned. Unless specified otherwise, our upper bounds will therefore deal with computational complexity. The latter of course bounds from above oracle complexity.

Problems of interest. The problem that we wish to investigate is the skyline computation problem. Before we tackle it, we shall address the simpler problem of checking a candidate skyline. We shall in particular investigate output-sensitive algorithms, namely algorithms whose complexity depends on the number k of items in the skyline. Of course, we do not assume prior knowledge of this number, so the algorithm has to guess the value of k.

(Footnote 1: The algorithms are robust to an adversary oracle that could return a correct answer instead of an incorrect one.)

2.2 Complexity of noiseless skylines
Before going into the noisy model we first recall results for the noiseless case. Most results in the literature analyze the computational complexity of skyline queries rather than oracle complexity. But the lower bounds generally count the comparisons required to compute the skyline, and conversely a few algorithms guarantee a low oracle complexity under some restrictions.

Of course the oracle complexity is at most O(dn log n) since sorting the input along all dimensions solves any problem w.r.t. oracle complexity. For d = 2 a tight lower bound of f_sort(n) + n − 1 on the oracle complexity was proved by Yao [45], where f_sort(n) denotes the number of comparisons required to sort n items. For d = 3, an upper bound of 2n log₂ n + O(n) comparisons follows from Kung et al.'s algorithm [31]; a lower constant factor than the naïve sorting approach, but one checks easily that Kung et al.'s algorithm does not guarantee better constant factors for oracle complexity than naïve sorting beyond d = 3. A bound of n log₂ k + O(n√(log k)) was recently established [11], matching the information-theoretic lower bound of n log₂ k comparisons [28]. We are not aware of results on the oracle complexity of skylines for higher dimension. The question of computing an asymptotic equivalent for the oracle complexity of skylines beyond d = 2 is actually left open in [11]. For arbitrarily large d, a few algorithms nevertheless outperform the naïve sorting approach in terms of computational and (thereby also) oracle complexity when k is small enough.

A standard skyline algorithm allows to compute the skyline in O(dnk). For this we can for instance maintain a partial skyline S_i for i = 0, ..., k containing the i greatest skyline points for lexicographic order, together with the set R_i of points that are not dominated by S_i (S_0 = ∅, R_0 is the whole input). At step i we compute S_{i+1} in O(dn) from S_i by adding the largest item of
R_i for lexicographic order, then compute R_{i+1} from R_i, also in O(dn), by removing all items that are dominated by this largest item. This algorithm is essentially the one we shall adopt in presence of noisy comparisons, except that we will not maintain R_i due to the higher cost of this screening operation in presence of noise. When R_i is not available, the computation of S_{i+1} from S_i and the set of all input items has a higher cost, though.

A shrewder algorithm with low computational complexity for small values of k and d has been proposed by Kirkpatrick and Seidel [28], based on a Divide and Conquer paradigm. The authors only investigate the complexity for constant d, but we outline in the Appendix an analysis for arbitrary dimensions:

Theorem 1 (adapted from [28]). The skyline can be computed in O(d²n log^{d−2} k) (computational complexity).²

(Footnote 2: Throughout the paper, we abuse notations and write d − 2 for max(1, d − 2), d − 3 for max(1, d − 3), etc.)

2.3 Computing in the noisy comparison model
While skyline computation has not been previously studied (to the best of our knowledge) in the noisy comparison model, several other operators were investigated: the OR problem decides if one of n boolean input variables is true, whereas MAX returns the maximum of the n items, SORTING returns the input items in sorted order, and TOP-k returns the k largest items (in arbitrary order). Finally, BINARY SEARCH takes as input (1) an ordered list S of n items and (2) another item v, and returns the successor of v in S.

Lemma 1 ([16]). The problems above can be computed within the following bounds for computational complexity, which are tight even for oracle complexity:
OR: Θ(n log(1/δ))
MAX: Θ(n log(1/δ))
SORTING: Θ(n log(n/δ))
BINARY SEARCH: Θ(log(n/δ))
TOP-k: Θ(n log(min(k, n − k)/δ))

Actually, it is obvious that any algorithm for the noiseless case with complexity f(n) can be turned into an algorithm with tolerance δ in presence of noise by repeating each comparison requested by the algorithm log(f(n)/δ) times and taking a majority vote: each noiseless comparison will thus be simulated by a comparison with tolerance δ/f(n), hence an overall error of δ by union bound [16]. The bounds of Lemma 1 show that for the problems considered we can do better.

The algorithms above assume the input oracle has constant error probability. While our access to data is limited to noisy comparisons, we shall consider any procedure computing some boolean condition (with the help of such comparisons) as an additional oracle that can be exploited by other queries. Our algorithms thus compute compositions of boolean queries, etc. In order to optimize the cost of such compositions, we investigate the cost of trust-preserving algorithms, whose tolerance is determined by that of the input oracle(s): for all δ < 1/3 the output must be correct with err.pr. at most δ if the input oracle(s) has err.pr. δ. Trust-preserving algorithms can be pipelined; as observed in [36] for the similar ε-fault-tolerant model, a trust-preserving algorithm for the composition of functions (e.g., boolean functions) can be obtained through the composition of trust-preserving algorithms for these functions. In the case of OR, Newman provided a simple trust-preserving algorithm in linear time.

Lemma 2 ([36]). OR can be computed in O(n) with a trust-preserving algorithm that returns the variable of minimal index among the true ones (if any).

We can derive from this simple algorithm a trust-preserving algorithm for MAX. Assume w.l.o.g. the input oracle has err.pr. δ < 1/6. We observe that we can simulate a comparison with err.pr. δ/2 by majority of 3 comparisons having
err.pr. δ. Furthermore the maximum of 4 items can be computed with err.pr. δ/2 in c = O(1) comparisons.

Lemma 3. MAX can be computed in O(n) with a trust-preserving algorithm, as illustrated in Algorithm 1.

Algorithm 1: Algorithm T(n, δ)
1 Partition input items into groups of 4 (the last group may be smaller), and compute with err.pr. δ/2 the max within each group.
2 Apply recursively T(n/4, δ/2) to these n/4 candidate maxima (if n > 4).

Proof. The cost of T(n, δ) satisfies the equation below:
C(n, δ) = c·⌈n/4⌉ + C(n/4, δ/2) ≤ c·n + 3·C(n/4, δ) = O(n).
We show by induction that T(n, δ) errs with probability at most δ: the probability that the maximum has been unduly eliminated in step 1 is δ/2, and the probability that it is eliminated in step 2 is also δ/2 by induction hypothesis, hence an overall tolerance δ.

Alternatively a slightly stronger result can be obtained as follows: [18] shows that MAX can be computed with a deterministic noisy tournament tree in O(log log n) rounds, with O(n) comparisons. A simple analysis of their proof shows the algorithm to be trust-preserving (we only need to maintain the dependency on δ throughout their proof).

In our skyline algorithm we repeatedly compute the maximal item (for <_lex) that is not dominated by larger items. For this, we will use the following lemma:

Lemma 4. Let S denote a set of n items, P a boolean property on S, and < a total order on S. Given an oracle that decides P with tolerance δ in time α(δ), and a similar oracle with tolerance δ for < in β(δ), one can compute max{v ∈ S | v satisfies P} with tolerance δ in O((α(δ) + β(δ))·n).

Proof. One can simply view the problem as the search of the maximum for a modified total order <_P, defined from P by:
• v <_P v' if v < v' ∧ P(v')
• v <_P v' if ¬P(v) ∧ P(v')
• when ¬P(v) ∧ ¬P(v'), say v and v' are ties (any arbitrary choice would do)
We can clearly simulate in O(α(δ) + β(δ)) an oracle with tolerance δ for <_P, using the oracles for < and P. Furthermore, the maximum for <_P is the item we are looking for, therefore we can solve our problem by executing the max-algorithm of Lemma 3 on the order <_P.
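As a concrete illustration of the recursion in Lemma 3 (the helper names are mine, and `cmp_delta` stands in for a majority-boosted comparison oracle as above): each level halves the tolerance while shrinking the instance by a factor of 4, so the total number of comparisons stays linear.

```python
# Sketch of the trust-preserving MAX recursion (Lemma 3 / Algorithm 1).
# cmp_delta(v, w, delta) is an assumed oracle answering "v <= w?" with
# error probability at most delta; max4 finds the max of <= 4 items by
# pairwise comparisons whose errors sum to at most delta.
def max4(items, delta, cmp_delta):
    best = items[0]
    for v in items[1:]:
        # err.pr. delta/3 per comparison keeps the group under delta
        if cmp_delta(best, v, delta / 3):
            best = v
    return best

def noisy_max(items, delta, cmp_delta):
    if len(items) <= 4:
        return max4(items, delta, cmp_delta)
    groups = [items[i:i + 4] for i in range(0, len(items), 4)]
    winners = [max4(g, delta / 2, cmp_delta) for g in groups]
    return noisy_max(winners, delta / 2, cmp_delta)  # recurse on n/4 items
```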
3. SKYLINE VERIFICATION PROBLEM
Dominance tests are a cornerstone of our skyline algorithms. We therefore begin our exposition of skyline algorithms with two procedures for dominance testing.

Lemma 5. Let C ⊆ S, v ∈ S.
1. We can check v ≺ C with err.pr. δ in O(d|C| log(1/δ)).
2. When the order of C is known along each dimension, we can check whether v ≺ C in O(d log(d|C|/δ)) oracle complexity, with err.pr. δ.

Proof. (1) The first procedure views dominance testing as a composition of OR queries: v ≺ C is equivalent to ∨_{w∈C} ∧_{i≤d} v ≤_i w. Each dominance test v ⊑ w = ∧_{i=1..d} v ≤_i w can be checked in O(d) with err.pr. 1/3 using the OR algorithm from Lemma 2. Using these tests as the basic oracle of the OR algorithm, we can check ∨_{w∈C} v ⊑ w in O(d|C| log(1/δ)).
(2) Alternatively, assume we know the ordering <_i for C in every dimension i ≤ d. We can then compute with err.pr. δ/d for each i ∈ {1, ..., d} the successor for <_i of v in C. Using the binary search algorithm of Lemma 1, each successor can be computed with err.pr. δ/d in O(log(d|C|/δ)). We then deduce whether v ≺ C without further oracle comparisons (though with possibly large computational cost).

We are now ready to address the problem of checking a candidate skyline. We develop two algorithms that mostly differ on which procedure from Lemma 5 they adopt for dominance queries. The algorithms simply check the two properties (1) C = Sky(C) and (2) Sky(S) ⊆ C, each with err.pr. δ/2. Both properties can be viewed as boolean combinations of dominance tests:
• C = Sky(C) iff for all v ≠ v' ∈ C, ¬(v ≺ v')
• C ⊇ Sky(S) iff every v ∈ S \ C satisfies v ≺ C
Our first algorithm uses the first procedure of Lemma 5 for the dominance tests, whereas our second algorithm, presented below as Algorithm 2, first sorts C along all dimensions and then relies on binary search to process dominance queries.

Algorithm 2: Skyline verification(S, C, δ)
1 Sort C along each dimension with err.pr. δ/(2d)
2 if C ≠ Sky(C) according to these orderings
3   return false
4 else check that every v ∈ S \ C satisfies v ≺ C, with err.pr. δ/2

Theorem 2. Let C ⊆ S; we can check if C = Sky(S)
1. in O(dn|C| log(1/δ)) (computational complexity),
2. or with O(dn log(d|C|/δ)) oracle complexity.

Proof. We first observe that the two conditions above are necessary and sufficient to guarantee Sky(S) = C.
(1) Each dominance test v ≺ C can be checked in O(d|C|) with err.pr. 1/3 using the first procedure of Lemma 5. Using these tests as the basic oracle for the OR algorithm, we check C ⊇ Sky(S) in O(n|C| log(1/δ)) calls to this dominance test, which yields O(dn|C| log(1/δ)). We use similarly the OR algorithm to check C = Sky(C) with the same complexity. Computational and oracle complexity are the same for this algorithm.
(2) Line 1 of Algorithm 2 runs in O(d|C| log(d|C|/δ)) according to Lemma 1 (which in turn summarizes results from [16]). In line 2 we then check C = Sky(C) without any further call to the comparison oracle. When the orders computed in line 1 are correct, we can simulate in O(d log(d|C|/δ)) (oracle complexity) an oracle that for each item v ∈ S checks with err.pr. δ/2 if v is dominated by C, according to Lemma 5. Using this oracle, we can then check the dominance of every v ∈ S \ C with err.pr. δ/2 in O(n·d log(d|C|/δ)) by Newman's trust-preserving OR algorithm (see Lemma 2). This yields an overall oracle complexity of O(dn log(d|C|/δ)) for the algorithm. The computational complexity, however, may be higher due to the dominance tests and the skyline computation on C (see discussion in Section 7).

We do not have tight bounds for checking skylines in the general case, but we next prove that the second bound in Theorem 2 is optimal for constant d, whereas the first one is optimal for constant |C|.

Proposition 1. Let C ⊆ S. Checking if C = Sky(S) has oracle complexity Ω(n log |C| + dn log(1/δ)).

Proof. The Ω(n log₂|C|) lower bound is actually a particular case of stronger bounds from the literature [4, 28]. We can also
prove it directly from the information-theoretic bound in the noiseless case with d = 2, adapting the argument of Yao: even when items can be dominated by at most one item from C, there are at least |C|^{n−|C|}·|C|! possible ways to order the items of C and assign each remaining item to its dominating point in C. Any algorithm checking the skyline must gather information sufficient to distinguish two such configurations, and therefore performs Ω(n log₂|C|) oracle comparisons (details are left for the Appendix).

The Ω(dn log(1/δ)) lower bound derives from an immediate reduction of OR: assume that we wish to compute the disjunction of d × n noisy variables x_{i,j} (i ≤ n, j ≤ d). Let C = {v₀} denote the unique tuple with value 1 on dimension d + 1 and 0 on the others. For each i ∈ {1, ..., n}, let v_i denote the tuple (x_{i,1}, ..., x_{i,d}, 0). The disjunction is true if and only if C ≠ Sky(S), which concludes our reduction.

4. COMPUTING SKYLINE
After the last section discussing skyline candidate verification, we now investigate the complexity of skyline computation. The oracle complexity of skyline computation (or actually any problem) is bounded from above by the complexity of sorting.

Algorithm 3: Full sort skyline algorithm(S, δ)
1 Sort S along each dimension with err.pr. δ/d
2 Deduce the skyline, assuming all orders are correct.

Theorem 3. Algorithm 3 computes Sky(S) with oracle complexity O(dn log(dn/δ)).

Proof. Each dimension can be sorted with err.pr. δ/d in O(n log(dn/δ)) according to Lemma 1. By union bound, all orders are then correct with err.pr. δ, so that any standard algorithm for noiseless skylines can compute the skyline based on these orders without further oracle calls.

By reduction from skyline verification (Proposition 1), this is again optimal for constant d when the skyline contains k = Ω(n^c) (c > 0) items. We next turn our attention to output-sensitive algorithms, namely algorithms that perform better when k is small. Recall that k denotes the number of items belonging to the skyline. Our output-sensitive algorithms for computing the skyline rely on the following auxiliary procedure:

Algorithm 4: Skysample(S, k̂, δ)
1 S₀ ← ∅
2 for i in 0, ..., k̂ − 1
3   Compute z ← max_{<lex} {v | v ⊀ S_i} with err.pr. δ/k̂
4   if z = ∅ return S_i
5   else S_{i+1} ← S_i ∪ {z}
6 return S_{k̂}
Oracle for <_lex(v, v', δ):
7 Find l ← min{i | v <_i v'} with err.pr. δ/2
8 Find L ← min{i | v >_i v'} with err.pr. δ/2
9 if (L = Null or l < L)
10   return true
11 else
12   return false

Proposition 2. Depending on the procedure adopted for dominance testing in line 3, Algorithm 4 computes the first min(|Sky(S)|, k̂) points of the skyline in decreasing lexicographic order
1. in O(dk²n log(k̂/δ)) (computational complexity),
2. or with O(dkn log(dk̂/δ)) oracle complexity.

Proof. Let S_i denote the set comprising the i items of Sky(S) having the highest rank for <_lex. Lines 7 to 12 show how an oracle for lexicographic comparison can be simulated in O(d) with err.pr. 1/3, using Newman's OR algorithm (see Lemma 2). Using this oracle for lexicographic order and the oracle for dominance testing of Lemma 5 as basic oracles for the max-algorithm of Lemma 4, lines 3 to 5 compute iteratively S_{i+1} from S_i with the requested complexity:
(1) Using the first procedure for dominance testing of Lemma 5 we get the new skyline item with err.pr. δ/k̂ in O(d·i·n log(k̂/δ)). Overall, we thus get k skyline items with err.pr. δ in O(Σ_{i<k} d·i·n log(k̂/δ)) = O(dk²n log(k̂/δ)).
(2) Using the second procedure of Lemma 5 yields the new skyline item with err.pr. δ/k̂ in O(dn log(dk̂/δ)). Overall, we can thus get with err.pr. δ a set of k skyline items in O(Σ_{i<k} dn log(dk̂/δ)) = O
(dkn log(dk̂/δ)).

Remark: For d = 2, dominance tests are essentially trivial, so the problem can be solved in a simpler way, in the sense that the algorithm needs only use the MAX/OR algorithms and not binary insertion. The complexity of the first algorithm is lowered to O(kn log(k̂/δ)).

We next present our main algorithm for computing skylines with noisy comparisons. The idea is to exploit the Skysample algorithm from Proposition 2, but since the value of k is not known in advance, we must be careful to set a small enough value of k̂ ≥ k in order to guarantee low complexity. We thus use Chan's trick [9] to "guess" k by binary search with increasing candidate values: the i-th call to Skysample uses k_i = 2^{2^i} instead of k, and δ/2^i instead of δ.

Theorem 4. Algorithm 5 computes Sky(S)
1. in O(dk²n log(k/δ)) (computational complexity),
2. or with O(dkn log(dk/δ)) oracle complexity.

Algorithm 5: Skyline computation(S, δ)
1 i ← 1; k_i ← 4; compl ← false
2 while compl = false
3   R ← Skysample(S, k_i, δ/2^i)
4   if |R| < k_i
5     compl ← true
6   else
7     i ← i + 1
8     k_i ← 2^{2^i}
9 return R

Proof. The probability of returning a wrong answer at round i is δ/2^i. By union bound, the error probability sums up to at most Σ_{i ≤ ⌈log log k⌉} δ/2^i ≤ δ overall.
(1) Using the first procedure for dominance testing in Proposition 2, the computational (hence oracle) complexity of the algorithm is at most O(dn·Σ_{i=1}^{⌈log log k⌉} k_i² log(k_i·2^i/δ)) + O(dk²n log(k/δ)) = O(dk²n log(k/δ)).
(2) Alternatively, if we use the second procedure for dominance testing instead, the oracle complexity of Algorithm 5 is in O(dn·Σ_{i=1}^{⌈log log k⌉} k_i log(d·k_i·2^i/δ)) + O(dkn log(dk/δ)), which amounts to O(dkn log(dk/δ)).

5. DELAY IN TERMS OF ROUNDS
In the algorithms above, some oracle calls depend on the result of previous calls. This may become an issue in, e.g., crowdsourcing scenarios where tasks involve substantial delays. We therefore analyze the number of rounds required by each algorithm when all comparisons are processed simultaneously in a same round, unless their execution is determined by the outcome of some comparison that has not been processed yet, in which case it is left for future rounds. A formal definition of this number of rounds, #rounds, can be found in [18]. Before presenting our results we survey some previous work on #rounds for binary search and sorting, and improve some of these results, then use the improved bounds to analyze our skyline algorithms.

5.1 Parallel algorithms with noise: sorting
Feige et al. [16] already investigate the question of parallelism for MAX and SORTING with noisy comparisons when using a maximum of n processors. In our setting we instead focus on the model of Newman [36] and Goyal and Saks [18]. This model, which they apply to the MAX and OR problems, does not restrict the number of simultaneous comparisons. This model seems more relevant for our crowd scenario as we can recruit additional workers when needed. In particular, Newman [36] shows a trust-preserving OR algorithm in O(n log(1/δ)) with O(log* n) rounds. Goyal and Saks [18] prove a corresponding Ω(log* n) lower bound and also propose a trust-preserving MAX algorithm in O(n) with O(log log n) rounds (the algorithm is presented for constant tolerance, but a careful analysis of their proof shows it is error preserving). The O(n log n) sorting algorithm of Feige et al. [16] mentioned in Lemma 1 relies on binary search to sort items incrementally by insertion, and therefore requires Ω(n log(n/δ)) rounds as their binary search algorithm requires O(log(n/δ)). We first show that #rounds of binary search can be lowered to O(log n).
IEEE Conference Rankings

Rank 1 (top tier):
SIGCOMM: ACM Conf on Comm Architectures, Protocols & Apps
INFOCOM: Annual Joint Conf IEEE Comp & Comm Soc
SPAA: Symp on Parallel Algms and Architecture
PODC: ACM Symp on Principles of Distributed Computing
PPoPP: Principles and Practice of Parallel Programming
RTSS: Real Time Systems Symp
SOSP: ACM SIGOPS Symp on OS Principles
OSDI: Usenix Symp on OS Design and Implementation
CCS: ACM Conf on Comp and Communications Security
IEEE Symposium on Security and Privacy
MOBICOM: ACM Intl Conf on Mobile Computing and Networking
USENIX Conf on Internet Tech and Sys
ICNP: Intl Conf on Network Protocols
PACT: Intl Conf on Parallel Arch and Compil Tech
RTAS: IEEE Real-Time and Embedded Technology and Applications Symposium
ICDCS: IEEE Intl Conf on Distributed Comp Systems

Rank 2 (second tier):
CC: Compiler Construction
IPDPS: Intl Parallel and Dist Processing Symp
IC3N: Intl Conf on Comp Comm and Networks
ICPP: Intl Conf on Parallel Processing
SRDS: Symp on Reliable Distributed Systems
MPPOI: Massively Par Proc Using Opt Interconns
ASAP: Intl Conf on Apps for Specific Array Processors
Euro-Par: European Conf. on Parallel Computing
Fast Software Encryption
Usenix Security Symposium
European Symposium on Research in Computer Security
WCW: Web Caching Workshop
LCN: IEEE Annual Conference on Local Computer Networks
IPCCC: IEEE Intl Phoenix Conf on Comp & Communications
CCC: Cluster Computing Conference
ICC: Intl Conf on Comm
WCNC: IEEE Wireless Communications and Networking Conference
CSFW: IEEE Computer Security Foundations Workshop

Rank 3 (third tier):
MPCS: Intl. Conf. on Massively Parallel Computing Systems
GLOBECOM: Global Comm
ICCC: Intl Conf on Comp Communication
NOMS: IEEE Network Operations and Management Symp
CONPAR: Intl Conf on Vector and Parallel Processing
VAPP: Vector and Parallel Processing
ICPADS: Intl Conf. on Parallel and Distributed Systems
Public Key Cryptosystems
Annual Workshop on Selected Areas in Cryptography
Australasia Conference on Information Security and Privacy
Int. Conf on Inform and Comm. Security
Financial Cryptography
Workshop on Information Hiding
Smart Card Research and Advanced Application Conference
ICON: Intl Conf on Networks
NCC: Nat Conf Comm
IN: IEEE Intell Network Workshop
Softcomm: Conf on Software in Tcomms and Comp Networks
INET: Internet Society Conf
Workshop on Security and Privacy in E-commerce

Un-ranked (others):
PARCO: Parallel Computing
SE: Intl Conf on Systems Engineering (**)
PDSECA: Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications
CACS: Computer Audit, Control and Security Conference
SREIS: Symposium on Requirements Engineering for Information Security
SAFECOMP: International Conference on Computer Safety, Reliability and Security
IREJVM: Workshop on Intermediate Representation Engineering for the Java Virtual Machine
EC: ACM Conference on Electronic Commerce
EWSPT: European Workshop on Software Process Technology
HotOS: Workshop on Hot Topics in Operating Systems
HPTS: High Performance Transaction Systems
Hybrid Systems
ICEIS: International Conference on Enterprise Information Systems
IOPADS: I/O in Parallel and Distributed Systems
IRREGULAR: Workshop on Parallel Algorithms for Irregularly Structured Problems
KiVS: Kommunikation in Verteilten Systemen
LCR: Languages, Compilers, and Run-Time Systems for Scalable Computers
MCS: Multiple Classifier Systems
MSS: Symposium on Mass Storage Systems
NGITS: Next Generation Information Technologies and Systems
OOIS: Object Oriented Information Systems
SCM: System Configuration Management
Security Protocols Workshop
SIGOPS European Workshop
SPDP: Symposium on Parallel and Distributed Processing
TreDS: Trends in Distributed Systems
USENIX Technical Conference
VISUAL: Visual Information and Information Systems
FoDS: Foundations of Distributed Systems: Design and Verification of Protocols
RV: Post-CAV Workshop on Runtime Verification
ICAIS: International ICSC-NAISO Congress on Autonomous Intelligent Systems
ITiCSE: Conference on Integrating Technology into Computer Science Education
CSCS: CyberSystems and Computer Science Conference
AUIC: Australasian User Interface Conference
ITI: Meeting of Researchers in Computer Science, Information Systems Research & Statistics
European Conference on Parallel Processing
RODLICS: WSES International Conference on Robotics, Distance Learning & Intelligent Communication Systems
International Conference On Multimedia, Internet & Video Technologies
PaCT: Parallel Computing Technologies Workshop
PPAM: International Conference on Parallel Processing and Applied Mathematics
International Conference On Information Networks, Systems And Technologies
AmiRE: Conference on Autonomous Minirobots for Research and Edutainment
DSN: The International Conference on Dependable Systems and Networks
IHW: Information Hiding Workshop
GTVMT: International Workshop on Graph Transformation and Visual Modeling Techniques

Major IEEE conferences in communications:
GLOBECOM (IEEE Global Communications Conference)
ICASSP (IEEE International Conference on Acoustics, Speech, and Signal Processing)
ICC (IEEE International Conference on Communications)
INFOCOM (IEEE Conference on Computer Communications)
MILCOM (IEEE Military Communications Conference)
OFC (IEEE Optical Fiber Communication Conference and Exposition)
PIMRC (IEEE International Symposium on Personal, Indoor and Mobile Radio Communications)
SPAWC (IEEE Workshop on Signal Processing Advances in Wireless Communications)
VTC (IEEE Vehicular Technology Conference)
WCNC (IEEE Wireless Communications & Networking Conference)
CSNDSP (IEEE Communication Systems, Networks & Digital Signal Processing)
CTW (IEEE Communications Theory Workshop)
ICACT (IEEE International Conference Advanced Communication Technology)
ICCCAS (IEEE International Conference of Communications, Circuits and Systems)
INCC (IEEE International Networking and Communications Conference)
ISCC (IEEE Symposium on Computers and Communications)
ISSSTA (IEEE International Symposium on Spread Spectrum Techniques and Applications)
ISWCS (IEEE International Symposium on Wireless Communication Systems)
ISIT (IEEE International Symposium on Information Theory)
CCNC (IEEE Consumer Communications & Networking Conference)
Parallel Algorithms for Computing Temporal Aggregates

Parallel Algorithms for Computing Temporal Aggregates

Jose Alvin G. Gendrano, Bongki Moon, Richard T. Snodgrass
Dept. of Computer Science, University of Arizona, Tucson, AZ 85721
jag, bkmoon, rts@

Bruce C. Huang
IBM Storage Systems Division, 9000 S. Rita Road, Tucson, AZ 85744
brucelee@

Jim M. Rodrigue
Raytheon Missile Systems Co., 1151 East Hermans Road, Tucson, AZ 85706
jmrodrigue@

(This work was sponsored in part by National Science Foundation grants CDA-9500991 and IRI-9632569, and National Science Foundation Research Infrastructure program EIA-9500991. The authors assume all responsibility for the contents of the paper.)

Abstract
The ability to model the temporal dimension is essential to many applications. Furthermore, the rate of increase in database size and response time requirements has outpaced advancements in processor and mass storage technology, leading to the need for parallel temporal database management systems. In this paper, we introduce a variety of parallel temporal aggregation algorithms for a shared-nothing architecture based on the sequential Aggregation Tree algorithm. Via an empirical study, we found that the number of processing nodes, the partitioning of the data, the placement of results, and the degree of data reduction effected by the aggregation impacted the performance of the algorithms. For distributed results placement, we discovered that Time Division Merge was the obvious choice. For centralized results and high data reduction, Pairwise Merge was preferred regardless of the number of processing nodes, but for low data reduction, it only performed well up to 32 nodes. This led us to a centralized variant of Time Division Merge which was best for larger configurations having low data reduction.

1. Introduction
Aggregate functions are an essential component of data query languages, and are heavily used in many applications such as data warehousing. Unfortunately, aggregate computation is traditionally expensive, especially in a temporal database where the problem is complicated by having to compute the intervals of time for which the aggregate value holds. For example, finding the (time-varying) maximum salary of professors in the Computer Science Department involves computing the temporal extent of each maximum value, which requires determining the tuples that overlap each temporal instant.

In this paper, we present several new parallel algorithms for the computation of temporal aggregates on a shared-nothing architecture [8]. Specifically, we focus on the Aggregation Tree algorithm [7] and propose several approaches to parallelize it. The performance of the parallel algorithms relative to various data set and operational characteristics is of our main interest.

The rest of this paper is organized as follows. Section 2 gives a review of related work and presents the sequential algorithm on which we base our parallel algorithms. Our proposed algorithms for computing parallel temporal aggregates are then described in Section 3. Section 4 presents empirical results obtained from the experiments performed on a shared-nothing Pentium cluster. Finally, Section 5 concludes the paper and gives an outlook to future work.

2. Background and Related Work
Simple algorithms for evaluating scalar aggregates and aggregate functions were discussed by Epstein [5]. A different approach employing program transformation methods to systematically generate efficient iterative programs for aggregate queries has also been suggested [6]. Tumas extended Epstein's algorithms to handle temporal aggregates [9]; these were further extended by Kline [7]. While the
resulting algorithms were quite effective in a uniprocessor environment, all suffer from poor scale-up performance, which identifies the need to develop parallel algorithms for computing temporal aggregates.

Early research on developing parallel algorithms focused on the framework of general-purpose multiprocessor machines. Bitton et al. proposed two parallel algorithms for processing (conventional) aggregate functions [1]. The Subqueries with a Parallel Merge algorithm computes partial aggregates on each partition and combines the partial results in a parallel merge stage to obtain a final result. Another algorithm, Project by list, exploits the ability of the parallel system architecture to broadcast tuples to multiple processors. The Gamma database machine project [4] implemented similar scalar aggregates and aggregate functions on a shared-nothing architecture. More recently, parallel algorithms for handling temporal aggregates were presented [11], but for a shared-memory architecture.

Table 1. Sample Database and Its Temporal Aggregation
(a) Data Tuples:
Name     Salary  Begin  End
Richard  40K     18     ∞
Karen    45K     8      20
Nathan   35K     7      12
Nathan   37K     18     21
(b) Result:
Count  Begin  End
1      7      8
2      8      12
1      12     18
3      18     20
2      20     21
1      21     ∞

The parallel temporal aggregation algorithms proposed in this paper are based on the (sequential) Aggregation Tree algorithm (SEQ) designed by Kline [7]. The aggregation tree is a binary tree that tracks the number of tuples whose timestamp periods contain an indicated time span. Each node of the tree contains a start time, an end time, and a count. When an aggregation tree is initialized, it begins with a single node spanning the whole timeline with a count of 0 (see the initial tree in Figure 1).

In the following example [7], there are 4 tuples to be inserted into an empty aggregation tree (see Table 1(a)). The start time value, 18, of the first entry to be inserted splits the initial tree, resulting in the updated aggregation tree shown in Figure 1. Because the original node and the new node share the same end date of ∞, a count of 1 is assigned to the new leaf node. The aggregation tree after inserting the rest of the records in Table 1(a) is shown in Figure 1.

[Figure 1. Example run of the Sequential (SEQ) Aggregation Tree Algorithm, showing the tree after adding [18, ∞) and after all insertions.]

To compute the number of tuples for the period [18, 20) in this example, we simply take the count from the leaf node (which is 2), and add its parents' count values. Starting from the root, the sum of the parents' counts is 1, and adding the leaf count gives a total of 3. The temporal aggregate results are given in Table 1(b).

Though SEQ correctly computes temporal aggregates, it is still a sequential algorithm, bounded by the resources of a single processor machine. This makes a parallel method for computing temporal aggregates desirable.
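The following sketch (class and method names are mine, not Kline's code) mirrors the SEQ description above, assuming half-open periods [start, end): inserting a period splits leaves at its endpoints and increments the count of every node whose span it fully covers, and an instant query sums the counts along one root-to-leaf path. It reproduces the [18, 20) example.

```python
# Sketch of the sequential Aggregation Tree (SEQ) for temporal COUNT.
class Node:
    def __init__(self, start, end, count=0):
        self.start, self.end, self.count = start, end, count
        self.left = self.right = None

    def insert(self, s, e):
        s, e = max(s, self.start), min(e, self.end)  # clip to this span
        if s >= e:
            return
        if (s, e) == (self.start, self.end):
            self.count += 1                  # fully covered: bump count
            return
        if self.left is None:                # split a leaf at a boundary
            mid = s if s > self.start else e
            self.left = Node(self.start, mid)
            self.right = Node(mid, self.end)
        self.left.insert(s, e)
        self.right.insert(s, e)

    def count_at(self, t):
        # Walk the root-to-leaf path covering instant t, summing counts.
        node, total = self, 0
        while node is not None:
            total += node.count
            node = None if node.left is None else (
                node.left if t < node.left.end else node.right)
        return total

INF = float("inf")
root = Node(0, INF)
for s, e in [(18, INF), (8, 20), (7, 12), (18, 21)]:  # Table 1(a)
    root.insert(s, e)
print(root.count_at(18))  # 3 tuples valid during [18, 20)
```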
3. Parallel Processing of Temporal Aggregates
In this section, we propose five parallel algorithms for the computation of temporal aggregates. We start with two simple parallel extensions of the SEQ algorithm, the Single Aggregation Tree (abbreviated SAT) and Single Merge (SM) algorithms. We then go on to introduce the Time Division Merge with Centralizing step (TDM+C) and Pairwise Merge (PM) algorithms, which both require more coordination, but are expected to scale better. Finally, we present the Time Division Merge (TDM) algorithm, a variant of TDM+C, which distributes the resulting relation, as differentiated from the centralized results produced by the other algorithms.

3.1. Single Aggregation Tree (SAT)
The first algorithm, SAT, extends the Aggregation Tree algorithm by parallelizing disk I/O. Each worker node reads its data partition in parallel, constructs the valid-time periods for each tuple, and sends these periods up to the coordinator. The central coordinator receives the periods from all the worker nodes, builds the complete aggregation tree, and returns the final result to the client.

3.2. Single Merge (SM)
The second parallel algorithm, SM, is more complex than SAT, in that it includes computational parallelism along with I/O parallelism. Each worker node builds a local aggregation tree, in parallel, and sends its leaf nodes to the coordinator. Unlike the SAT coordinator, which inserts periods into an aggregation tree, the SM coordinator merges each of the leaves it receives using a variant of merge-sort. The use of this efficient merging algorithm is possible since the worker nodes send their leaves in a temporally sorted order. Finally, after all the worker nodes finish sending their leaves, the coordinator returns the final result to the client.

3.3. Time Division Merge with Coordinator (TDM+C)
Like SM, the third parallel algorithm also extends the aggregation tree method by employing both computational and I/O parallelism (see Figure 2). The main steps for this algorithm are outlined in Figure 3.

[Figure 2. Time Division Merge with Centralizing Step (TDM+C) Algorithm: local trees are built, repartitioned by time, merged, and centralized.]

Figure 3. Major Steps for the TDM+C Algorithm:
Step 1. Client request
Step 2. Build local aggregation trees
Step 3. Calculate local partition sets
Step 4. Calculate global partition set
Step 5. Exchange data and merge
Step 6. Merge local results
Step 7. Return results to client

3.3.1 Overall Algorithm
TDM+C starts when the coordinator receives a temporal aggregate request from a client. Each worker node is instructed to build a local aggregation tree using its data partition, knowing the number of worker nodes participating in the query.

After each worker node constructs its local aggregation tree, the tree is augmented in the following manner. The node traverses its aggregation tree in DFS order, propagating the count values to the leaf nodes. The leaf nodes now contain the full local count for the periods they represent, and any parent nodes are discarded. After all worker nodes complete their aggregation trees, they exchange minimum (earliest) start time and maximum (latest) end time values to ascertain the overall timeline of the query.

[Figure 4. Timeline divided into partitions, forming a global partition set.]

The leaves of a local aggregation tree are evenly split into local partitions, each consisting of a period and a tuple count. Because each partition is split to have the same (or nearly the same) number of tuples, local partitions can have different durations. The local partition set from each processing node is then sent to the coordinator.

The coordinator takes all local partition sets¹ and computes global partitions (how this is done is discussed in the next section). After computing the global time partition set, the coordinator then naively assigns the period of each global partition to a worker node, and broadcasts the global partition set and respective assignments to all the nodes. The worker nodes then use this information to decide which local aggregation tree leaves to send, and to which worker nodes to send them. Note that periods which span more than one global partition period are split and each part is assigned accordingly (split periods do not affect the result correctness).

Each worker node merges the leaves it receives with the leaves it already has to compute the temporal aggregate for their assigned global partitions. When all the worker nodes
merging,the coordinator polls them for their results in sequential order.The coordinator concatenates the results and sends the final result to the client.1Atotal oflocal partitions are created byworker nodes.0591030350800100015005000100000505050 151515 303030Figure 5.Local Partition Sets from ThreeWorker Nodes3.3.2Calculating the Global Partition SetWe examine in more detail the computation of the global partition set by the coordinator.Recall that the coordinator receives from each worker node a local partition set,con-sisting of contiguous partitions.The goal is to temporally distribute the computation of thefinal result,with each node processing roughly the same number of leaf nodes.As an example,Figure5presents local partitions from worker nodes.The number between each hash mark seg-menting a local timeline represents the number of leaf nodes within that local partition.The total number of leaf nodes from the nodes is.The best plan is having leaf nodes to be processed by each node.Figure4illustrates the computation of the global partition set.We modified the SEQ algorithm to compute the global partition set,using the local partition information sent by the worker nodes.We treat the worker node local parti-tion sets as periods,inserting them into the modified ag-gregation tree.From Figure5,thefirst period to be in-serted is[5,9)(50),the fourth is[0,30)(15),and the seventh is[0,10)(30),and the ninth(last)is[1000,5000)(30).This use of the Aggregation Tree is entirely separate from the use of this same structure in computing the aggregate.Here we are concerned only with determining a division of the time-line into contiguous periods,each with approximately the same number of leaves.There are three main differences between our Modified Aggregation Tree algorithm used in this portion of TDM+C and the original Aggregation Tree[7],used in step2of Figure3.First,the“count”field of this aggregation tree node is incremented by the count value of the local parti-tion being inserted,rather than.Second,a parent node must have a count value of.When a leaf node is split and becomes a parent node,its count is split proportionally be-tween the two new leaf nodes based on the durations of their respective time periods.This new parent count becomes. Third,during an insertion traversal for a record,if the search traversal diverges to both subtrees,the record count is split proportionally between the2sub-trees.Inserted Records [5,9)(50), [9,800)(50), and [800,1500)(50)(a)First3Local Partitions(b)After partition4is addedFigure6.Intermediate Aggregation Tree As an example,suppose we inserted thefirst three lo-cal partitions,and now we are inserting the fourth one [0,30)(15).The current modified aggregation tree,before inserting the fourth local partition,is shown in Figure6a. Notice that for leaf node[5,9)(50),the count value is set to instead of(first difference).The second and third differences are exemplified when the fourth local partition is added.At the root node,we see that the period for this fourth partition overlaps the periods of the left sub-tree and the right sub-tree.In the original aggregation tree,we simply added to a node’s count in the left sub-tree and the right sub-tree at the appropriate places. 
Here, we see the third difference. We split this partition count of 15 in proportion to the durations of the left and right sub-trees. The root's left sub-tree covers the period [0,5), a duration of 5 time units; the fourth local partition period is [0,30), or 30 time units. We compute the left sub-tree's share of this local partition's count as 15 × 5/30 = 2.5, while the right sub-tree's share is the remaining 12.5. In this case, the left sub-tree leaf node [0,5) now has a count of 2.5 (see Figure 6b). We now pass down the root's right sub-tree, increasing the leaf node count from [5,9)(50) to [5,9)(52), as its share of the newly added partition's count, 15 × 4/30 = 2, is added using the same proportional calculation. At leaf node [9,800)(50), the inserted partition's count is down to the remaining 10.5, since 2 was taken by node [5,9)(52) and 2.5 by the left sub-tree.

Now, the second difference comes into play. Two new leaf nodes are created by splitting [9,800)(50). The new leaves are [9,30) and [30,800). Leaf [9,30) receives all of the inserted partition's remaining count of 10.5. The count of 50 from [9,800)(50) is then divvied up amongst the two new leaf nodes: the left leaf node receives 21/791 of the 50 (about 1.3), while the right leaf node receives the remaining 48.7 or so. So the new left leaf node is now [9,30)(12), where 12 comes from 10.5 + 1.3 (rounded), and the new right leaf node shows as [30,800)(49). Again, see Figure 6b for the result. Table 2 shows the leaf node values, rounded to integers, once all local partitions from Figure 5 are inserted.

    Count   Begin   End
    17      0       5
    64      5       9
    3       9       10
    12      10      30
    44      30      350
    43      350     800
    21      800     1000
    40      1000    1500
    32      1500    5000
    9       5000    10000

Table 2. All leaf node values in tabular format once all 9 partitions from Figure 5 are inserted

Now that the coordinator has the global span leaf counts and the optimal number of leaf nodes to be processed by each node, it can figure out the global partition set. For each node (except the last one), we keep adding the span leaf counts until the sum matches or surpasses the optimal number of leaf nodes. When the sum is more than the optimal number, we break up the leaf node that causes the sum to exceed the optimum, dividing its count in proportion to the period duration.

As an example, refer to Table 2. We know that the optimal number of periods per global partition is 95 (285 leaves spread over 3 nodes). We add the leaf node counts from the top until we reach node [10,30)(12). The sum at this point is 96, or 1 more than optimal. We break up [10,30)(12) into two leaf nodes such that the first leaf node's period contains a count of 11 and the newly created leaf node contains the remaining 1. Using the same idea of proportional count division, we can see that [10,28)(11) and [28,30)(1) are the two new leaf nodes. So the first global time partition has the period [0,28) and a count of 95. The computation for the second global time partition starts at [28,30)(1). Continuing on, the global time partitions for this example are [0,28), [28,866), and [866,10000).
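The cut-selection loop described above is easy to make concrete. Below is a C++ sketch of our own (the names are hypothetical, and integer truncation stands in for whatever rounding the original system used); feeding it Table 2's rows with 3 nodes reproduces the example, yielding the interior cut points 28 and 866 and hence the global partitions [0,28), [28,866), and [866,10000).

    #include <cstdint>
    #include <vector>

    struct SpanLeaf { double count; int64_t begin, end; };  // one row of Table 2

    // Cut the global timeline into `nodes` partitions of roughly equal leaf
    // count; a leaf that straddles a cut is split in proportion to duration.
    std::vector<int64_t> globalCuts(std::vector<SpanLeaf> leaves, int nodes) {
        double total = 0;
        for (const auto& l : leaves) total += l.count;
        const double optimal = total / nodes;          // e.g. 285 / 3 = 95

        std::vector<int64_t> cuts;                     // interior cut points
        double acc = 0;                                // count in the current partition
        for (size_t i = 0; i < leaves.size() && (int)cuts.size() < nodes - 1; ) {
            if (acc + leaves[i].count < optimal) {     // take the whole leaf
                acc += leaves[i].count;
                ++i;
            } else {                                   // split this leaf
                double need = optimal - acc;
                double frac = need / leaves[i].count;
                int64_t cut = leaves[i].begin +
                    (int64_t)(frac * (leaves[i].end - leaves[i].begin));
                cuts.push_back(cut);
                leaves[i].count -= need;               // remainder stays behind
                leaves[i].begin = cut;
                acc = 0;                               // start the next partition
            }
        }
        return cuts;
    }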
The reader should be aware that this global time partition resolution algorithm is not perfect: the actual number of local aggregation tree leaves assigned to each worker node may not be identical. The reason is that the algorithm uses the local partition sets, which are just guides for the global partitioning. When a local partition has 50 leaf nodes in period [9,800), the global partitioning scheme assumes a uniform distribution, while the actual leaf node distribution may be heavily skewed.

3.3.3 Expected Performance

We expect better scalability for TDM+C as compared to the SAT and SM algorithms because of the data redistribution and its load-balancing effect. However, there are two global synchronization steps that may limit the performance obtained. First, all of the local partition sets must be completed before the global time set partitioning can begin. Second, all of the worker nodes must complete their merges and send their results to the coordinator before the client can receive the final result. The next algorithm, PM, attempts to obtain better performance by replacing the two global synchronization steps with localized synchronization steps.

Step 1. Client request
Step 2. Build local aggregation trees
Step 3. While not final aggregation tree: merge between 2 nodes
Step 4. Return results to client

Figure 7. Major Steps for the PM Algorithm

3.4. Pairwise Merge (PM)

The fourth parallel algorithm, PM (see Figure 7), differs from TDM+C in two ways. First, the coordinator is more involved than in TDM+C. Second, instead of all the worker nodes merging simultaneously, as in TDM+C, pairs of worker nodes merge whenever the opportunity presents itself. Which two worker nodes are paired is determined dynamically by the query coordinator.

A worker node is available for merging when its local aggregation tree has been built. The worker node informs the query coordinator that it has completed its aggregation tree. The query coordinator then arbitrarily picks another worker node that had previously completed its aggregation tree, thereby allowing the two worker nodes to merge their leaves. The query coordinator instructs the worker node with the smaller number of leaf nodes to send its leaves to the other node, the "buddy worker node", which does the merging of leaves. Once a worker node finishes transmitting leaves to its buddy worker node, it is no longer a participant in the query. This buddying-up continues until the query coordinator ascertains that only one worker node is left, which contains the completed aggregation tree. The query coordinator then directs the sole remaining worker node to submit the results directly to the client. Figure 8 provides a conceptual picture of this "buddy" system.

A portion of a PM aggregation tree may be merged multiple times with other aggregation trees. The merge algorithm is a merge-sort variant operating on two sorted lists as input (the local list and the received list); this merge is near linear, O(n), in the number n of leaf nodes to be merged.

Figure 8. Pairwise Merge (PM) Algorithm (the sole remaining worker returns the result)
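The coordinator's dynamic pairing can be sketched in a few lines. The C++ fragment below is our own illustration (names are hypothetical, console prints stand in for messages, and a real coordinator would hear back from the survivor only after its merge completes rather than re-queue it immediately):

    #include <iostream>
    #include <optional>

    struct Worker { int id; long leafCount; };

    class PMCoordinator {
        std::optional<Worker> waiting_;   // at most one unpaired ready worker
        int active_;                      // workers still participating
    public:
        explicit PMCoordinator(int numWorkers) : active_(numWorkers) {}

        // Called when a worker (or the survivor of a buddy pair) is ready.
        void onWorkerReady(Worker w) {
            if (!waiting_) { waiting_ = w; return; }
            Worker other = *waiting_;
            waiting_.reset();
            // The node with fewer leaves ships them; its buddy merges and survives.
            Worker sender = (w.leafCount < other.leafCount) ? w : other;
            Worker merger = (w.leafCount < other.leafCount) ? other : w;
            std::cout << "worker " << sender.id << " sends " << sender.leafCount
                      << " leaves to buddy " << merger.id << "\n";
            --active_;                           // the sender leaves the query
            merger.leafCount += sender.leafCount;
            if (active_ == 1)
                std::cout << "worker " << merger.id
                          << " returns the result to the client\n";
            else
                onWorkerReady(merger);           // survivor is ready to pair again
        }
    };

    int main() {
        PMCoordinator coord(4);
        // Workers finish their local aggregation trees in arbitrary order.
        coord.onWorkerReady({0, 120});
        coord.onWorkerReady({2, 80});
        coord.onWorkerReady({1, 200});
        coord.onWorkerReady({3, 150});
    }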
3.5. Time Division Merge (TDM)

The fifth parallel algorithm, TDM, is identical to TDM+C, except that it has distributed rather than centralized result placement. This algorithm simply eliminates the final coordinator result-collection phase and completes with each worker node holding a distinct piece of the final aggregation tree. A distributed result is useful when the temporal aggregate operation is a subquery in a much larger distributed query, as it allows further localized processing on each node's aggregation sub-result in a distributed and possibly more efficient manner.

4. Empirical Evaluation

For the purposes of our evaluation, we chose the temporal aggregate operation COUNT, since it does not require that the attribute itself be sent. This simplifies the data structures maintained while still exhibiting the characteristics of a temporal aggregate computation. Based on this temporal aggregate operation, we perform a variety of performance evaluations on the five parallel algorithms presented. The matrix in Table 3 summarizes the experiments we have done.

    Experiment   Algorithms Covered          NumProcessors
    1            SAT, PM, SM, TDM, TDM+C     2, 4, 8, 16, 32, 64
    2            SAT, PM, SM, TDM, TDM+C     2, 4, 8, 16, 32, 64
    3            SAT, PM, SM, TDM, TDM+C     2, 4, 8, 16, 32, 64
    4            PM, SM, TDM, TDM+C          16

Table 3. Experimental Case Matrix Summary

4.1. Experimental Environment

The experiments were conducted on a 64-node shared-nothing cluster of 200 MHz Pentium machines, each with 128 MB of main memory and a 2 GB hard disk. The machines were physically mounted on two racks of 32 machines. Connecting the machines was a 100 Mbps switched Ethernet network, having a point-to-point bandwidth of 100 Mbps and an aggregate bandwidth of 2.4 Gbps in all-to-all communication.

Each machine was booted with version 2.0.30 of the Linux kernel. For message passing between the Pentium nodes, we used the LAM implementation of the MPI communication standard [2]. With the LAM implementation, we observed an average communication latency of 790 microseconds and an average transfer rate of about 5 Mbytes/second.

4.2. Experimental Parameters

To help precisely define the parameters for each set of tests, we established an experiment classification scheme. Table 4 lists the different parameters, and the set of parameter values for each experiment.

Synthetic datasets were generated to model relations which store time-varying information for each employee in a database. Each tuple has three attributes: an SSN attribute filled with random digits, a StartDate attribute, and an EndDate attribute. The SSN attribute refers to an entry in a hypothetical employee relation, while the StartDate and EndDate attributes are temporal instants which together construct a valid-time period. The data generation method varies from one experiment to another and is described later.

NumProcessors depends on the type of performance measurement. Scale-up experiments used 2, 4, 8, 16, 32, and 64 processing nodes, while the variable-reduction experiment used a fixed set of 16 nodes.
To see the effects of data partitioning on the performance of the temporal algorithms, the synthetic tables were partitioned horizontally, either by SSN or by StartDate. The SSN and StartDate partitioning schemes were attempts to model range partitioning based on temporal and non-temporal attributes [3].

The tuple size was fixed at 41 bytes/tuple. The tuple size was intentionally kept small and unpadded so that the generated datasets could have more tuples before their size made them difficult to work with. (The total database size for the scale-up experiment at 64 processing nodes was 64 partitions × 65536 tuples × 41 bytes = 171,966,464 bytes.)

    Parameter             Exp 4.3           Exp 4.4           Exp 4.5           Exp 4.6
    NumProcessors (P)     2,4,8,16,32,64    2,4,8,16,32,64    2,4,8,16,32,64    16
    Partitioning          by SSN            by SSN            by StartDate      by StartDate
    Tuple size            41 bytes          41 bytes          41 bytes          41 bytes
    Partition size        65536 tuples      65536 tuples      65536 tuples      65536 tuples
    Total database size   P × 65536         P × 65536         P × 65536         16 × 65536
    Data reduction        0%                100%              0%                0/20/40/60/80/100%

Table 4. Experiment Parameters

All experiments except the single speed-up test used a fixed database partition size of 65,536 tuples. This was done to facilitate cross-referencing of results between different tests; because of this, the 16-node results of the scale-up experiments are directly comparable to the results of the 16-node data reduction experiment. The total database size reflects the total number of tuples across all the nodes participating in a particular experiment run. For scale-up tests, the total database size increased with the number of processing nodes.

Finally, the amount of data reduction is 100 minus the ratio (in percent) between the number of resulting leaves in the final aggregation tree and the original number of tuples in the dataset. A reduction of 100 percent means that a 100-tuple dataset produces 1 leaf in the final aggregation tree, because all the tuples have identical StartDates and EndDates.

4.3. Baseline Scale-Up Performance: No Reduction and SSN Partitioning

We set up our first experiment to compare the scale-up properties of the proposed algorithms on a dataset with no reduction. We will also use the measurements taken from this experiment as a baseline for later comparisons with subsequent experiments. The second column of Table 4 gives the parameters for this particular experiment.

For this experiment, a synthetic dataset containing 4M tuples was generated. Each tuple had a randomized SSN attribute and was associated with a distinct period of unit length. The dataset was then sorted by SSN (since the SSN fields are generated randomly, this has the effect of randomizing the tuples in terms of their StartDate and EndDate fields) and then distributed to the 64 processing nodes.

To measure the scale-up performance of the proposed algorithms, a series of 6 runs having 2, 4, 8, 16, 32, and 64 nodes, respectively, were carried out. Note that since we fixed the size of the dataset on each node, increasing the number of processors meant increasing the total database size. Timing results from this experiment are plotted in Figure 9 and lead us to the following conclusions.

Figure 9. Scale-Up Results (4M tuple Dataset with No Reduction and SSN Partitioning): time in seconds versus number of worker nodes for SAT, SM, PM, TDM, and TDM+C

SM performs better than SAT. Intuitively, since the dataset exhibits no reduction, both SM and SAT send all periods from the worker nodes to the coordinator. The reason behind SM's performance advantage comes from the computational parallelism provided by building local aggregation trees on each worker node. Aside from potentially reducing the number of leaves passed on to the coordinator, this process of building local trees sorts the periods in temporal order.
This sorting makes compiling the results more efficient than SAT's strategy of having to insert each valid-time period into the final aggregation tree (the SM coordinator uses a merge-sort variant in compiling and constructing the final results).

SAT exhibits the worst scale-up performance. This result is not surprising, since the only advantage SAT has over the original sequential algorithm comes from parallelized I/O. This single advantage does not make up for the additional communication overhead and the coordinator bottleneck (in SAT, all the periods are sent to the coordinator, which builds a single, but large, aggregation tree).

The performance difference between TDM and TDM+C increases with the number of nodes. For this observation, it is important to remember that TDM+C is simply TDM plus an additional result-collection phase that sends all final leaves to the coordinator, one worker node at a time. The performance difference increases with the number of nodes because of the non-reducible nature of the dataset and the fact that scale-up experiments work with more data as the number of nodes increases.

Among the algorithms that provide monolithic results, PM has the best scale-up performance up to 32 nodes. This is attributed to the multiple merge levels needed by PM: a PM computation needs at least ceil(log2 P) merge levels, where P is the number of processing nodes. On the other hand, the TDM+C algorithm merges local trees only once, but has three synchronization steps, as described in Section 3. With this analysis in mind, we expected PM to perform better than or as well as TDM+C for 2, 4, and 8 nodes, which have 1, 2, and 3 merge levels, respectively. We then expected TDM+C to outperform PM as more nodes were added, but we were surprised to realize that PM still performed better than TDM+C up to perhaps 50 nodes.

To find out what was going on behind the scenes, we used the LAM XMPI package [2] to visually track the progression of messages within the various TDM+C and PM runs. This led us to the reason why TDM+C performed worse than PM for 2 to 32 nodes: TDM+C was slowed more by increased waiting time due to load imbalance (computation skew) as compared to PM.

4.4. Scale-Up Performance: 100% Reduction and SSN Partitioning

This experiment is designed to measure the effect of a significant amount of reduction (100% in this case) on the scale-up properties of the proposed algorithms. Table 4 gives the parameters for this experiment. This experiment is modeled after the first one, but with a synthetic dataset having 100% reduction. This dataset was generated by creating 4M tuples associated with the same period and having randomized SSN attributes. The synthetic dataset was then rearranged randomly (the aggregation tree algorithm performs at its worst case when the dataset is sorted by time [7]) and split into 64 partitions, each having 65,536 tuples.

This experiment, like the first one, is a scale-up experiment, and hence was conducted in much the same way. Timing results from this experiment are plotted in Figure 10 and lead us to the following observations.

All algorithms benefit from the 100% data reduction. Comparing results from the baseline experiment with results from the current experiment leads us to this observation. Because of the high degree of data reduction, the aggregation trees do not grow as large as in the first experiment.
With smaller trees, insertions of new periods take less time because there are fewer branches to traverse before reaching the insertion points. Because all of the presented algorithms use aggregation trees, they all experience increased performance.

Figure 10. Scale-Up Results (4M tuple Dataset with 100% Reduction and SSN Partitioning): time in seconds versus number of worker nodes for SAT, SM, PM, TDM, and TDM+C

With 100% reduction, PM and TDM+C catch up to TDM. Aside from leading to smaller aggregation trees, a high degree of data reduction decreases the number of aggregation tree leaves exchanged between nodes. TDM does not send its leaves to a central node for result collection, so it does not transfer as many leaves as its peers. Because of this, TDM is not impacted by the amount of data reduction as much as either PM or TDM+C, which end up performing as well as TDM.

4.5. Scale-Up Performance: No Reduction and Time Partitioning

This experiment is designed to measure the effect of time partitioning on the scale-up properties of the proposed algorithms. The settings for this experiment are summarized in Table 4. The dataset for this experiment was generated in a manner similar to the first one, but with StartDate rather than SSN partitioning. This was done by sorting the whole dataset by the StartDate attribute and then splitting it into 64 partitions of 64K tuples each.

Time partitioning did not significantly help any of the algorithms. We originally expected TDM and TDM+C to benefit from the time partitioning, but we also realized that for this to happen, the partitioning must closely match the way the global time divisions are calculated. Because we randomly assigned partitions to the nodes, TDM did not benefit from the time partitioning; in fact, it even performed a little poorer in all but the 16-node run. We attribute the small performance gaps to differences in how the partitioning strategies, interacting with the number of nodes, made TDM redistribute mildly varying numbers of leaves across the runs. As for SM and PM, they exhibited no conclusive improvement because they are simple enough to work without considering how tuples are distributed across the various partitions.
LAMMPS: Hydrostatic Pressure

Introduction

LAMMPS is a classical molecular dynamics software package, widely used in materials science and biophysics.
One important application is computing the hydrostatic pressure of a system.
This article takes a close look at how to compute the hydrostatic pressure with LAMMPS, and introduces the relevant theory and methods.
Theoretical Background

In a molecular dynamics simulation, the hydrostatic pressure is the pressure experienced by the system; once the system reaches equilibrium, the pressure remains constant.
The hydrostatic pressure can be obtained by computing the system's pressure, which is defined as the force per unit area.
In LAMMPS, the hydrostatic pressure can be obtained by averaging the pressure tensor.
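For reference, the quantity being averaged is standard molecular dynamics theory rather than anything LAMMPS-specific: the hydrostatic (scalar) pressure is one third of the trace of the pressure tensor, whose components combine a kinetic term and a virial term. In LaTeX form, with $V$ the volume and $m_i$, $v_i$, $r_i$, $f_i$ the per-atom masses, velocities, positions, and forces:

    P = \frac{1}{3}\,\mathrm{tr}\,\mathbf{P} = \frac{1}{3}\left(P_{xx} + P_{yy} + P_{zz}\right),
    \qquad
    P_{ab} = \frac{1}{V}\left(\sum_i m_i v_{ia} v_{ib} + \sum_i r_{ia} f_{ib}\right)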
The LAMMPS Pressure Command

In LAMMPS, the compute pressure command is used to calculate the pressure.
The command is given a compute ID and a group of atoms, together with the parameters needed for the pressure calculation (in particular, the temperature compute that supplies the kinetic contribution).
An example command:

    compute pressure all pressure thermo_temp

Here, the first pressure is the ID of the compute, all applies it to the whole system, the second pressure is the compute style, and thermo_temp is the temperature compute used for the kinetic part of the pressure.
Steps for Computing the Hydrostatic Pressure

To compute a system's hydrostatic pressure, proceed as follows:
1. Create the simulation system: use LAMMPS to build a simulation system containing water molecules.
2. Define the simulation region: use the LAMMPS region command.
3. Compute the pressure: use the compute pressure command.
4. Average the pressure: set the output format with the thermo_style command and the reporting interval with thermo; thermo_modify adjusts the output behavior (see the note on time averaging after this list).
5. Run the simulation: run LAMMPS until the system reaches an equilibrium state.
6. Output the results: read the averaged pressure from the thermo output.
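As referenced in step 4, plain thermo output reports instantaneous values; to accumulate a genuine time average of the pressure, LAMMPS's fix ave/time command can be used. A one-line sketch, assuming the compute ID pressure defined in the example script below:

    # sample c_pressure every 10 steps, average 100 samples, write every 1000 steps
    fix avgP all ave/time 10 100 1000 c_pressure file avg_pressure.dat

The running average is then written to avg_pressure.dat together with the timestep.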
Example Code

Below is a sample LAMMPS input script for computing the hydrostatic pressure. The original script omitted a lattice, masses, and a pair style, without which LAMMPS cannot create atoms or run; minimal placeholder choices (marked in the comments) have been added so the sketch is self-contained:

    # LAMMPS input script for calculating the static (hydrostatic) pressure
    # Step 1: create the simulation system
    units real
    atom_style full
    # Step 2: define the simulation region and fill it with atoms
    region box block 0 10 0 10 0 10
    create_box 1 box
    lattice sc 3.1                  # placeholder lattice so create_atoms has sites
    create_atoms 1 box
    mass 1 18.0                     # placeholder: one water-like particle type
    pair_style lj/cut 8.0           # placeholder interaction parameters
    pair_coeff 1 1 0.15 3.2
    # Step 3: compute the pressure (compute ID "pressure", group "all")
    compute pressure all pressure thermo_temp
    # Step 4: report temperature and pressure every 100 steps
    thermo_style custom step temp press
    thermo_modify flush yes
    thermo 100
    # Step 5: run the simulation toward equilibrium
    velocity all create 300 12345
    fix 1 all nve
    run 1000
    # Step 6: output the result; press is the instantaneous scalar pressure
    # (depending on the LAMMPS version, evaluating it between runs may require
    # that thermo output has just computed it)
    variable avg_press equal press
    print "Pressure: ${avg_press}"

Conclusion

This article has described how to compute the hydrostatic pressure with LAMMPS.
Parallel Computation: the cv::parallel_for_() Function

Parallel computation with cv::parallel_for_(): if the thread count is set to n, only n-1 threads actually run in parallel; the n-th thread waits for the other threads to finish before executing, so n=1 and n=2 are both effectively serial.
You can also leave the thread count unset, in which case a default number of threads is opened.
Usage:

    // Cube every element of a Mat (processed column by column).
    #include <opencv2/opencv.hpp>
    #include <cmath>
    #include <iostream>
    using namespace cv;
    using namespace std;

    // ---- [1] Inherit from ParallelLoopBody and overload operator() ----
    // Custom class deriving from the parallel loop-body class (ParallelLoopBody)
    class myLoopBody : public ParallelLoopBody
    {
    public:
        myLoopBody(Mat& _src)                       // custom constructor
        {
            src = &_src;
        }
        // Must have exactly this form: the overloaded operator() holds the
        // work to be executed in parallel
        virtual void operator()(const Range& range) const
        {
            Mat& srcMat = *src;
            for (int colIdx = range.start; colIdx < range.end; ++colIdx)
            {
                float* pData = (float*)srcMat.col(colIdx).data;
                for (int i = 0; i < srcMat.rows; ++i)
                    pData[i * srcMat.cols] = std::pow(pData[i * srcMat.cols], 3.f); // cube
            }
        }
    private:
        Mat* src;
    };

    void parallelTestWithParallel_for_(InputArray _src) // 'parallel_for_' loop
    {
        CV_Assert(_src.kind() == _InputArray::MAT);
        Mat src = _src.getMat();
        // ---- [2] Launch the loop ----
        // Range(0, src.cols) gives the first and last positions that the
        // myLoopBody loop body will cover
        parallel_for_(Range(0, src.cols), myLoopBody(src));
    }
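A minimal driver for the function above (our own usage example, relying on the same includes as the snippet):

    int main()
    {
        cv::setNumThreads(4);                     // optional; a default is used otherwise
        cv::Mat m = cv::Mat::ones(4, 6, CV_32F) * 2.f;
        parallelTestWithParallel_for_(m);         // every element becomes 2^3 = 8
        std::cout << m << std::endl;
        return 0;
    }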
Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks

David A. Bader and Kamesh Madduri
College of Computing, Georgia Institute of Technology, Atlanta GA 30332
{bader,kamesh}@

Abstract

This paper discusses fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis. These algorithms have been optimized to exploit properties typically observed in real-world large-scale networks, such as the low average distance, high local density, and heavy-tailed power law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, movie-actor and citation networks, and report impressive parallel performance for the evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely-used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, utilizing 20 GB RAM of the IBM p5 570. Current SNA packages, on the other hand, cannot handle graphs with more than a hundred thousand edges.

1 Introduction

Large-scale network analysis is an exciting field of research with applications in a variety of domains such as social networks (friendship circles, organizational networks), the internet (network topologies, web graphs, peer-to-peer networks), transportation networks, electrical circuits, genealogical research and bio-informatics (protein-interaction networks, food webs). These networks seem to be entirely unrelated and indeed represent quite diverse relations, but experimental studies [5, 23, 11, 36, 35] have shown that they share some common traits, such as a low average distance between the vertices, heavy-tailed degree distributions modeled by power laws, and high local densities. Modeling these networks based on experiments and measurements, and the study of interesting phenomena and observations [13, 18, 42, 52], continue to be active areas of research. Several models [26, 37, 40, 51, 15] have been proposed to generate synthetic graph instances with scale-free properties.

Network analysis is currently a very active area of research in the social sciences [44, 50, 47, 25], and seminal contributions to this field date back more than sixty years.
There are several analytical tools [32, 6, 45] for visualizing social networks, determining empirical quantitative indices, and clustering. In most applications, graph abstractions and algorithms are frequently used to help capture the salient features. Thus, network analysis from a graph theoretic perspective is about extracting interesting information, given a large graph constructed from a real-world dataset.

Network analysis and modeling have received considerable attention in recent times, but algorithms are relatively less studied. Real-world networks are often very large in size, ranging from several hundreds of thousands to billions of vertices and edges. A space-efficient memory representation of such graphs is itself a big challenge, and dedicated algorithms have to be designed exploiting the unique characteristics of these networks. On single-processor workstations, it is not possible to do exact in-core computations on large graphs due to the limited physical memory. Current high-end parallel computers have sufficient physical memory to handle large graphs, and a naive in-core implementation of a graph theory problem is typically two orders of magnitude faster than the best external memory implementation [1]. Algorithm design is further simplified on parallel shared memory systems: due to the globally addressable memory space, there is no need to partition the graph, and we can avoid the overhead of message passing. However, attaining good performance is still a challenge, as a large class of graph algorithms are combinatorial in nature, and involve a significant number of non-contiguous, concurrent accesses to global data structures with low degrees of locality.

The main contributions of our work are the following. We present new parallel algorithms for evaluating network centrality metrics, optimized for scale-free sparse graphs. To our knowledge, this is the first work on parallelizing SNA metrics. We have implemented the algorithms on high-end shared memory systems such as the IBM p5 570 and the Cray MTA-2, and report results on several large-scale real datasets that are three orders of magnitude larger than the graph sizes solved using existing SNA packages.
Further, our implementations are designed to handle graph instances on the order of billions of vertices and edges.

This paper is organized as follows. Section 2 gives an overview of the various centrality metrics and the commonly used sequential algorithms to compute them. We present parallel algorithms for evaluating these metrics in Section 3. Section 4 details the performance of these algorithms on parallel shared memory and multithreaded architectures, on a variety of large-scale real-world datasets and synthetic scale-free graphs.

2 Centrality Metrics

One of the fundamental problems in network analysis is to determine the importance of a particular vertex or an edge in a network. Quantifying centrality and connectivity helps us identify portions of the network that may play interesting roles. Researchers have been proposing metrics for centrality for the past 50 years, and there is no single accepted definition; the metric of choice depends on the application and the network topology. Almost all metrics are empirical, and can be applied to element-level [10], group-level [21], or network-level [49] analyses. We present a few commonly used indices in this section.

Preliminaries

Consider a graph $G = (V, E)$, where $V$ is the set of vertices representing actors or nodes in the social network, and $E$ the set of edges representing the relationships between the actors. The number of vertices and edges are denoted by $n$ and $m$, respectively. The graphs can be directed or undirected. Let us assume that each edge $e \in E$ has a positive integer weight $w(e)$; for unweighted graphs, we use $w(e) = 1$. A path from vertex $s$ to $t$ is defined as a sequence of edges $\langle u_i, u_{i+1} \rangle$, $0 \le i < l$, where $u_0 = s$ and $u_l = t$. The length of a path is the sum of the weights of its edges. We use $d(s,t)$ to denote the distance between vertices $s$ and $t$ (the minimum length of any path connecting $s$ and $t$ in $G$). Let us denote the total number of shortest paths between vertices $s$ and $t$ by $\sigma_{st}$, and the number of those passing through vertex $v$ by $\sigma_{st}(v)$.

Degree Centrality

The degree centrality $DC$ of a vertex $v$ is simply the degree $\deg(v)$ for undirected graphs. For directed graphs, we can define two variants: in-degree centrality and out-degree centrality. This is a simple local measure, based on the notion of neighborhood. This index is useful in the case of static graphs, for situations when we are interested in finding vertices that have the most direct connections to other vertices.

Closeness Centrality

This index measures the closeness, in terms of distance, of an actor to all other actors in the network. Vertices with a smaller total distance are considered more important. Several closeness-based metrics [7, 46, 38] have been developed by the SNA community. A commonly used definition is the reciprocal of the total distance from a particular vertex to all other vertices:

$$CC(v) = \frac{1}{\sum_{t \in V} d(v,t)}$$

Betweenness Centrality

Define the pair-dependency of a pair $(s,t)$ on a vertex $v$ as the fraction of shortest $s$-$t$ paths passing through $v$, $\delta_{st}(v) = \sigma_{st}(v)/\sigma_{st}$. The betweenness centrality of a vertex $v$ is then defined as

$$BC(v) = \sum_{s \neq v \neq t \in V} \delta_{st}(v)$$

This metric can be thought of as normalized stress centrality (stress centrality sums the raw shortest-path counts $\sigma_{st}(v)$ rather than the fractions [48]). Betweenness centrality of a vertex measures the control a vertex has over communication in the network, and can be used to identify key actors in the network. High centrality indices indicate that a vertex can reach other vertices on relatively short paths, or that a vertex lies on a considerable fraction of the shortest paths connecting pairs of other vertices.
We discuss algorithms to compute this metric in detail in the next section. This index has been extensively used in recent years for the analysis of social as well as other large-scale complex networks. Some applications include biological networks [30, 43, 20], the study of sexual networks and AIDS [33], identifying key actors in terrorist networks [31, 17], organizational behavior [12], supply chain management [16], and transportation networks [27].

There are a number of commercial and research software packages for SNA (e.g., Pajek [6], InFlow [32], UCINET [2]) which can also be used to determine these centrality metrics. However, they can only be used to study comparatively small networks (in most cases, sparse graphs with less than 40,000 vertices). Our goal is to develop fast, high-performance implementations of these metrics so that we can analyze large-scale real-world graphs of millions to billions of vertices.

2.1 Algorithms for computing Betweenness Centrality

A straightforward way of computing betweenness centrality for each vertex would be as follows:

1. Compute the length and number of shortest paths between all pairs (s, t).
2. For each vertex v, calculate every possible pair-dependency $\delta_{st}(v)$ and sum them up.

The complexity is dominated by step 2, which requires $\Theta(n^3)$ time for the summation and $\Theta(n^2)$ storage of pair-dependencies. Popular SNA tools like UCINET use an adjacency matrix to store and update the pair-dependencies. This yields a $\Theta(n^3)$ algorithm for betweenness by augmenting the Floyd-Warshall algorithm for the all-pairs shortest-paths problem with path counting [8].

Alternatively, we can modify Dijkstra's single-source shortest paths algorithm to compute the pair-wise dependencies. Observe that a vertex $v \in V$ is on the shortest path between two vertices $s, t \in V$ iff $d(s,t) = d(s,v) + d(v,t)$. Define a set of predecessors of a vertex $v$ on shortest paths from $s$ as $pred(s,v)$. Now, each time an edge $\langle u, v \rangle$ is scanned for which $d(s,v) = d(s,u) + d(u,v)$, the vertex $u$ is added to the predecessor set $pred(s,v)$. Then, the following relation holds:

$$\sigma_{sv} = \sum_{u \in pred(s,v)} \sigma_{su}$$

Setting the initial condition of $pred(s,v) = \{s\}$ for all neighbors $v$ of $s$, we can proceed to compute the number of shortest paths between $s$ and all other vertices. The computation of $pred(s,v)$ can be easily integrated into Dijkstra's SSSP algorithm for weighted graphs, or into BFS for unweighted graphs. But even in this case, determining the fraction of shortest paths using $v$, i.e., the pair-wise dependencies $\delta_{st}(v)$, proves to be the dominant cost. The number of shortest $s$-$t$ paths using $v$ is given by $\sigma_{st}(v) = \sigma_{sv} \cdot \sigma_{vt}$. Thus computing $BC(v)$ requires $O(n^2)$ time per vertex $v$, and $O(n^3)$ time in all. This algorithm is the most commonly used one for evaluating betweenness centrality.

To exploit the sparse nature of typical real-world graphs, Brandes [8] came up with an algorithm that computes the betweenness centrality score for all vertices in the graph in $O(mn + n^2 \log n)$ time for weighted graphs, and $O(mn)$ time for unweighted graphs. The main idea is as follows.
We define the dependency of a source vertex $s \in V$ on a vertex $v \in V$ as

$$\delta_s(v) = \sum_{t \in V} \delta_{st}(v)$$

The betweenness centrality of a vertex $v$ can then be expressed as $BC(v) = \sum_{s \neq v \in V} \delta_s(v)$. The dependency $\delta_s(v)$ satisfies the recursive relation

$$\delta_s(v) = \sum_{w : v \in pred(s,w)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right)$$

which allows all dependencies from a source $s$ to be accumulated in a single backward pass over the shortest-path tree.

3 Parallel Algorithms

Our implementations target shared memory symmetric multiprocessors and multithreaded architectures such as the Cray MTA-2 [19, 14]. The high memory bandwidth and uniform memory access offered by these systems aid the design of high-performance graph theory implementations [3]. We use a cache-friendly adjacency array representation [41] for storing the graph. For algorithm analysis, we use a complexity model similar to the one proposed by Helman and JaJa [28] that has been shown to work well in practice. This model takes into account both computational complexity and memory contention: we measure the overall complexity of an algorithm on SMPs using $T_M(n,m,p)$, the maximum number of non-contiguous accesses made by any processor to memory, and $T_C(n,m,p)$, the upper bound on the local computational complexity. The MTA-2 is a novel architectural design and has no data cache; rather than using a memory hierarchy to hide latency, the MTA-2 processors use hardware multithreading to tolerate it. Thus, if there is enough parallelism in the problem and a sufficient number of threads per processor are kept busy, the system is saturated and the memory access complexity becomes zero; we report only $T_C$ for the MTA-2 algorithms.

Degree Centrality

We store both the in- and out-degree of each vertex in contiguous arrays during construction of the graph from the test datasets. This makes the computation of degree centrality straightforward, with a constant-time lookup on the MTA-2 and the p5 570. As noted earlier, this is a useful metric for determining the graph structure, and for a first pass at identifying important vertices.

Closeness Centrality

Recall the definition of closeness centrality: $CC(v) = 1 / \sum_{t \in V} d(v,t)$. Given a precomputed all-pairs distance matrix, a parallel prefix sum ($O(n/p + \log p)$ using $p$ processors) would suffice to sum up an entire row of distance values. However, since real-world graphs are typically sparse, we have $m \ll n^2$, and this approach would be very inefficient in terms of actual running time and memory utilization. Instead, we can just compute $n$ shortest path trees, one for each vertex $v \in V$, with $v$ as the source vertex for BFS or Dijkstra's algorithm. On $p$ processors, this would yield $T_C = O((nm + n^2)/p)$ for unweighted graphs. For weighted graphs, using a naive queue-based representation for the expanded frontier, we can compute all the centrality metrics in $T_C = O((nm + n^3)/p)$. The bounds can be further improved with the use of efficient priority queue representations.

Since the evaluation of closeness centrality is computationally intensive, it is valuable to investigate approximate algorithms. Using a random sampling technique, Eppstein and Wang [22] show that the closeness centrality of all vertices in a weighted, undirected graph can be approximated with high probability in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time, with small additive error: sample $k = \Theta(\frac{\log n}{\epsilon^2})$ source vertices $v_1, \dots, v_k$, and estimate the inverse closeness centrality of each vertex $u$ as $\frac{n}{k(n-1)} \sum_{i=1}^{k} d(v_i, u)$. The error bounds follow from a result by Hoeffding [29] on probability bounds for sums of independent random variables. We parallelize this algorithm as follows. Each processor runs the SSSP computations from $k/p$ of the sampled vertices, with $T_M = O(km/p)$; choosing $k = \Theta(\frac{\log n}{\epsilon^2})$ preserves the error bounds given above. The approximate closeness centrality value of each vertex can then be calculated in $O(k) = O(\frac{\log n}{\epsilon^2})$ additional work, and constant $T_M$.
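To ground the sampling scheme, here is a self-contained sequential C++ sketch of the Eppstein-Wang estimator for unweighted, undirected, connected graphs (our own illustration, with hypothetical names; the parallel version described above simply splits the k BFS computations across the p processors):

    #include <queue>
    #include <random>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    static std::vector<int> bfsDistances(const Graph& g, int src) {
        std::vector<int> dist(g.size(), -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : g[u])
                if (dist[v] < 0) { dist[v] = dist[u] + 1; q.push(v); }
        }
        return dist;
    }

    // Estimate 1/CC(u) as n/(k(n-1)) * sum_i d(v_i, u) over k sampled
    // sources, then invert; assumes the graph is connected so that all
    // computed distances are finite.
    std::vector<double> approxCloseness(const Graph& g, int k, unsigned seed) {
        const int n = (int)g.size();
        std::vector<double> sum(n, 0.0);
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> pick(0, n - 1);
        for (int i = 0; i < k; ++i) {
            std::vector<int> d = bfsDistances(g, pick(rng));
            for (int u = 0; u < n; ++u) sum[u] += d[u];
        }
        std::vector<double> cc(n);
        for (int u = 0; u < n; ++u) {
            double invCC = (double)n / (k * (double)(n - 1)) * sum[u];
            cc[u] = invCC > 0 ? 1.0 / invCC : 0.0;   // guard: every sample hit u itself
        }
        return cc;
    }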
Stress and Betweenness Centrality

These two metrics require shortest path enumeration, and we design our parallel algorithm based on Brandes' [8] sequential algorithm for sparse graphs. Alg. 1 outlines the general approach for the case of unweighted graphs.

Algorithm 1. Betweenness centrality for unweighted graphs
Input: G(V, E)
Output: Array BC[1..n], where BC[v] gives the centrality metric for vertex v
 1  for all v in V in parallel do
 2      BC[v] <- 0;
    for all s in V in parallel do
 3      S <- empty stack;
 4      P[w] <- empty list, w in V;
 5      sigma[t] <- 0, t in V; sigma[s] <- 1;
 6      d[t] <- -1, t in V; d[s] <- 0;
 7      Q <- empty queue;
 8      enqueue s -> Q;
 9      while Q not empty do
10          dequeue v <- Q;
11          push v -> S;
12          for each neighbor w of v in parallel do
13              if d[w] < 0 then
14                  enqueue w -> Q;
15                  d[w] <- d[v] + 1;
16              if d[w] = d[v] + 1 then
17                  sigma[w] <- sigma[w] + sigma[v];
18                  append v -> P[w];
19      delta[v] <- 0, v in V;
20      while S not empty do
21          pop w <- S;
22          for v in P[w] do
23              delta[v] <- delta[v] + (sigma[v]/sigma[w]) * (1 + delta[w]);
24          if w != s then
25              BC[w] <- BC[w] + delta[w];

On each BFS computation from s, the queue Q stores the current set of vertices to be visited, S contains all the vertices reachable from s, and P(v) is the predecessor set associated with each vertex v in V. The arrays d and sigma store the distance from s and the shortest path counts, respectively. The centrality values are computed in steps 22-25, by summing the dependencies delta(v), v in V. The final scores need to be divided by two if the graph is undirected, as all shortest paths are counted twice.

We observe that parallelism can be exploited at two levels: the BFS computations from the n source vertices are independent of one another and can proceed concurrently, and within each BFS the expansion of the current frontier (step 12) can itself be done in parallel. We can easily adapt our SMP implementation to consider weighted graphs also. We intend to work on a parallel multithreaded implementation of SSSP on the MTA-2 for scale-free graphs in the near future.

An approximate algorithm for betweenness centrality is detailed in [9], which is again derived from the Eppstein-Wang algorithm [22]. As in the case of closeness centrality, the sequential running time can be reduced to $O(\frac{\log n}{\epsilon^2}(n + m))$ for unweighted graphs by running the computation from a random sample of source vertices.
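For readers who prefer compilable code to pseudocode, the following is a compact sequential C++ rendering of Alg. 1 (our own sketch; the parallel version distributes the outer source loop and the frontier expansion of step 12):

    #include <queue>
    #include <stack>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    std::vector<double> betweenness(const Graph& g) {
        const int n = (int)g.size();
        std::vector<double> bc(n, 0.0);
        for (int s = 0; s < n; ++s) {
            std::stack<int> S;                     // vertices by non-decreasing distance
            std::vector<std::vector<int>> P(n);    // shortest-path predecessors
            std::vector<long long> sigma(n, 0);    // shortest-path counts
            std::vector<int> d(n, -1);             // BFS distances
            sigma[s] = 1; d[s] = 0;
            std::queue<int> Q;
            Q.push(s);
            while (!Q.empty()) {                   // forward BFS phase (steps 9-18)
                int v = Q.front(); Q.pop();
                S.push(v);
                for (int w : g[v]) {
                    if (d[w] < 0) { d[w] = d[v] + 1; Q.push(w); }
                    if (d[w] == d[v] + 1) { sigma[w] += sigma[v]; P[w].push_back(v); }
                }
            }
            std::vector<double> delta(n, 0.0);     // dependency accumulation (steps 19-25)
            while (!S.empty()) {
                int w = S.top(); S.pop();
                for (int v : P[w])
                    delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
                if (w != s) bc[w] += delta[w];
            }
        }
        return bc;    // for an undirected graph, halve every score afterwards
    }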
It is possible to evaluate the centrality metric for the entire ND-actor network in42minutes,on16processors of the p570.We observe similar performance for the patents cita-tion data.However,note that the execution time is highly dependent on the size of the largest non-trivial connected component in these real graphs.The patents network,for instance,is composed of several disconnected sub-graphs, representing patents and citations in unrelated areas.How-ever,it did not take significantly more time to compute the centrality indices for this graph compared to the ND-actor graph,even though this is a much larger graph instance.Figs.3(c)and3(d)plot the execution time on the MTA-2and the p5570for ND-web,and a synthetic graph in-stance of the same size generated using the R-MAT algo-rithm.Again,note that the actual execution time is depen-dent on the graph structure;for the same problem size,the synthetic graph instance takes much longer than the ND-web pared to a four processor run,the execution time reduces significantly for forty processors of the MTA-2,but we do not attain performance comparable to our prior graph algorithm implementations[3,4].We need to opti-mize our implementation to attain better system utilization, and the current bottleneck is due to automatic handling of nested parallelism.For better load balancing,we need to further pre-process the graph and semi-automatically assign teams of threads to the independent BFS computations.Our memory management routines for the MTA-2implementa-tion are not as optimized p570routines,and this is another reason for drop in performance and scaling.5ConclusionsWe present parallel algorithms for evaluating several net-work indices,including betweenness centrality,optimized for Symmetric Multiprocessors and multithreaded architec-tures.To our knowledge,this is thefirst effort at paralleliz-ing these widely-used social network analysis tools.Our implementations are designed to handle problem sizes in the order of billions of edges in the network,and this is three orders of magnitude larger than instances that can be han-dled by current social network analysis tools.We are cur-rently working on improving the betweenness centrality im-plementation on the MTA-2,and also extend it to efficiently compute scores for weighted graphs.In future,we plan to implement and analyze performance of approximate algo-rithms for closeness and betweenness centrality detailed the paper,and also apply betweenness centrality values to solve harder problems like graph clustering.AcknowledgementsThis work was supported in part by NSF Grants CA-REER CCF-0611589,ACI-00-93039,NSF DBI-0420513, ITR ACI-00-81404,ITR EIA-01-21377,Biocomplexity DEB-01-20709,ITR EF/BIO03-31654,and DARPA Con-tract NBCH30390004.We are grateful to Jonathan Berry and Bruce Hendrickson for discussions on large-scale net-work analysis and betweenness centrality.References[1] D.Ajwani,R.Dementiev,and U.Meyer.A computationalstudy of external-memory bfs algorithms.In Proc.17th Ann.Symp.Discrete Algorithms(SODA-06),pages601–610.ACM-SIAM,2006.[2]Analytic Technologies.UCINET6social network anal-ysis software./ ucinet.htm.[3] D.Bader,G.Cong,and J.Feo.On the architectural require-ments for efficient execution of graph algorithms.In Proc.34th Int’l Conf.on Parallel Processing(ICPP),Oslo,Nor-way,June2005.[4] D.Bader and K.Madduri.Designing multithreaded algo-rithms for Breadth-First Search and st-connectivity on the Cray MTA-2.Technical report,Georgia Institute of Tech-nology,Feb.2006.[5] A.-L.Barabasi and 
R.Albert.Emergence of scaling in ran-dom networks.Science,286(5439):509–512,October1999.[6]V.Batagelj and A.Mrvar.Pajek–Program for large networkanalysis.Connections,21(2):47–57,1998.[7] munication patterns in task orientedgroups.J.Acoustical Soc.of America,22:271–282,1950.[8]U.Brandes.A faster algorithm for betweenness centrality.J.Mathematical Sociology,25(2):163–177,2001.[9]U.Brandes and T.Erlebach,work Analysis:Methodological Foundations,volume3418of Lecture Notes in Computer Science.Springer-Verlag,2005.[10]S.Brin and L.Page.The anatomy of a large-scale hyper-textual web search puter Networks and ISDN Systems,30(1–7):107–117,1998.[11] A.Broder,R.Kumar,F.Maghoul,P.Raghavan,S.Ra-jagopalan,R.Stata,A.Tomkins,and J.Wiener.Graph struc-ture in the w.,33(1-6):309–320,2000. [12]N.Buckley and M.van Alstyne.Does email make whitecollar workers more productive.Technical report,University of Michigan,2004.[13] D.Callaway,M.Newman,S.Strogatz,and D.Watts.Network robustness and fragility:percolation on random graphs.Physics Review Letters,85:5468–5471,2000. [14]L.Carter,J.Feo,and A.Snavely.Performance and program-ming experience on the Tera MTA.In Proc.SIAM Conf.on Parallel Processing for Scientific Computing,San Antonio, TX,Mar.1999.[15] D.Chakrabarti,Y.Zhan,and C.Faloutsos.R-MAT:A recur-sive model for graph mining.In Proc.4th SIAM Intl.Conf.on Data Mining,Florida,USA,Apr.2004.[16] D.Cisic,B.Kesic,and L.Jakomin.Research of the power inthe supply chain.International Trade,Economics Working Paper Archive EconWPA,April2000.[17]T.Coffman,S.Greenblatt,and S.Marcus.Graph-basedtechnologies for intelligence munications of the ACM,47(3):45–47,2004.[18]R.Cohen,K.Erez,D.Ben-Avraham,and S.Havlin.Break-down of the internet under intentional attack.Physics Re-view Letters,86:3682–3685,2001.[19]Cray Inc.The MTA-2multithreaded architecture.http:///products/systems/mta/. [20] A.del Sol,H.Fujihashi,and P.O’Meara.Topology ofsmall-world networks of protein-protein complex structures.Bioinformatics,21(8):1311–1315,2005.[21]P.Doreian and L.Albert.Partitioning political actor net-works:Some quantitative tools for analyzing qualitative net-works.Quantitative Anthropology,161:279–291,1989. [22] D.Eppstein and J.Wang.Fast approximation of centrality.In Proc.12th Ann.Symp.Discrete Algorithms(SODA-01), Washington,DC,2001.[23]M.Faloutsos,P.Faloutsos,and C.Faloutsos.On power-law relationships of the Internet topology.In Proc.ACM SIGCOMM,pages251–262,1999.[24]L.C.Freeman.A set of measures of centrality based onbetweenness.Sociometry,40(1):35–41,1977.[25]L.C.Freeman.The development of social network analysis:a study in the sociology of science.Booksurge Pub.,2004.[26]J.-L.Guillaume and tapy.Bipartite graphs as modelsof complex networks.In Proc.1st Int’l Workshop on Com-binatorial and Algorithmic Aspects of Networking,2004. [27]R.Guimera,S.Mossa,A.Turtschi,and L.Amaral.Theworldwide air transportation network:Anomalous central-ity,community structure,and cities’global roles.Proc.Nat.Academy of Sciences USA,102(22):7794–7799,2005. [28] D.R.Helman and J.J´a J´a.Prefix computations on symmetricmultiprocessors.Journal of Parallel and Distributed Com-puting,61(2):265–278,2001.[29]W.Hoeffding.Probability inequalities for sums of boundedrandom variables.Journal of American Statistical Associa-tion,58:713–721,1963.[30]H.Jeong,S.Mason,A.-L.Barabasi,and Z.Oltvai.Lethalityand centrality in protein networks.Nature,411:41,2001. 
[31]V.Krebs.Mapping networks of terrorist cells.Connections,24(3):43–52,2002.[32]V.Krebs.InFlow3.1-Social network mapping software,2005..[33] F.Liljeros,C.R.Edling,L.A.N.Amaral,H.E.Stanley,and Y.Aberg.The web of human sexual contacts.Nature, 411:907,2001.[34]Notredame CNet resources./˜networks.[35]M.Newman.Scientific collaboration networks:shortestpaths,weighted networks and centrality.Physics Review E, 64,2001.[36]M.Newman.The structure and function of complex net-works.SIAM Review,45(2):167–256,2003.[37]M.Newman,S.Strogatz,and D.Watts.Random graph mod-els of social networks.Proceedings of the National Academy of Sciences USA,99:2566–2572,2002.[38]U.J.Nieminen.On the centrality in a directed graph.SocialScience Research,2:371–378,1973.[39]PAJEK datasets.http://www.vlado.fmf.uni-lj.si/pub/networks/data/.[40] C.Palmer and J.Steffan.Generating network topologies thatobey power laws.In Proc.ACM GLOBECOM,Nov.2000.[41]J.Park,M.Penner,and V.Prasanna.Optimizing graph al-gorithms for improved cache performance.In Proc.Int’l Parallel and Distributed Processing Symp.(IPDPS2002), Fort Lauderdale,FL,Apr.2002.[42]R.Pastor-Satorras and A.Vespignani.Epidemic spreadingin scale-free networks.Physics Review Letters,86:3200–3203,2001.[43]J.Pinney,G.McConkey,and D.Westhead.Decompo-sition of biological networks using betweenness central-ity.In Proc.Poster Session of the9th Ann.Int’l Conf.on Research in Computational Molecular Biology(RECOMB 2004),Cambridge,MA,May2005.[44]W.Richards.International network for social network anal-ysis,2005..[45]W.Richards.Social network analysis software links,2005./INSNA/soft_ inf.html.[46]G.Sabidussi.The centrality index of a graph.Psychome-trika,31:581–603,1966.[47]J.Scott.Social Network Analysis:A Handbook.SAGEPublications,2000.[48] A.Shimbel.Structural parameters of communication net-works.Mathematical Biophysics,15:501–507,1953. [49]University of Virginia.Oracle of Bacon.http://www..[50]S.Wasserman and K.Faust.Social Network Analysis:Meth-ods and Applications.Cambridge Univ.Press,1994. [51] D.Watts and S.Strogatz.Collective dynamics of smallworld networks.Nature,393:440–442,1998.[52] D.Zanette.Critical behavior of propagation on small-worldnetworks.Physics Review E,64,2001.。