

Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands


Xin Yuan, Wickus Nienaber, Zhenhai Duan
Department of Computer Science, Florida State University, Tallahassee, FL 32306
{xyuan,nienaber,duan}@

Rami Melhem
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260
melhem@

ABSTRACT
Fat-tree based system area networks have been widely adopted in high performance computing clusters. In such systems, the routing is often deterministic and the traffic demand is usually uncertain and changing. In this paper, we study routing performance on fat-tree based system area networks with deterministic routing under the assumption that the traffic demand is uncertain. The performance of a routing algorithm under uncertain traffic demands is characterized by the oblivious performance ratio, which bounds the relative performance of the routing algorithm and the optimal routing algorithm for any given traffic demand. We consider both single path routing, where the traffic between each source-destination pair follows one path, and multi-path routing, where multiple paths can be used for the traffic between a source-destination pair. We derive lower bounds on the oblivious performance ratio of any single path routing scheme for fat-tree topologies and develop single path oblivious routing schemes that achieve the optimal oblivious performance ratio for commonly used fat-tree topologies. These oblivious routing schemes provide the best performance guarantees among all single path routing algorithms under uncertain traffic demands. For multi-path routing, we show that it is possible to obtain a scheme that is optimal for any traffic demand (an oblivious performance ratio of 1) on the fat-tree topology. These results quantitatively demonstrate that single path routing cannot guarantee high routing performance, while multi-path routing is very effective in balancing network loads on the fat-tree topology.
Categories and Subject Descriptors
C.2.1 [Computer Systems Organization]: Computer-Communication Networks – Network Architecture and Design

General Terms
Performance

Keywords
Oblivious routing, fat-tree, system area networks

SIGMETRICS'07, June 12–16, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-639-4/07/0006 ... $5.00.

1. INTRODUCTION
The fat-tree topology has many properties that make it attractive for large scale interconnects and system area networks [8, 9]. Most importantly, the bisection bandwidth of the fat-tree topology scales linearly with the network size. The topology is also inherently highly resilient, with a large number of redundant paths between two processing nodes. The fat-tree topology is very popular for building medium and large system area networks [6, 12]. In particular, it has been widely adopted by high performance computing (HPC) clusters that employ off-the-shelf high speed system area networking technology, such as Myrinet [13] and InfiniBand [7]. The fat-tree topology is used in many of the Top 500 fastest supercomputers listed in the June 2006 release [16].
Although the fat-tree topology provides rich connectivity, having a fat-tree topology alone does not guarantee high network performance: the routing mechanism also plays a crucial role. Historically, adaptive routing, which dynamically builds the path for a packet based on the network condition, has been used with the fat-tree topology to achieve load balance in the network [9]. However, the routing in the current major system area networking technologies such as InfiniBand and Myrinet is deterministic [7, 13]. For a fat-tree based system area network with deterministic routing, it is important to employ an efficient load balance routing scheme in order to fully exploit the rich connectivity provided by the fat-tree topology.

Traditional load balance routing schemes usually optimize the network usage for a given traffic demand. Such demand specific schemes may not be effective for system area networks where the traffic demand is usually uncertain and changing. Consider, for example, the traffic demands in a large HPC cluster. Since many users share such a system and can run many different applications, the traffic demand depends both on how the processing nodes are allocated to different applications and on the communication requirement within each application. Hence, an ideal routing scheme should provide load balancing across all possible traffic patterns. This requirement motivates us to study demand-oblivious load balance routing schemes, which have recently been shown to promise excellent performance guarantees with changing and uncertain traffic demands in the Internet environment [1, 2, 18].

In this paper, we investigate routing performance on fat-tree based system area networks with deterministic routing under the assumption that the traffic demand is uncertain and changing. For a given traffic demand that can be represented by a traffic matrix, the performance of a routing scheme is measured by the maximum link load metric. The performance of a routing algorithm under uncertain traffic demands is characterized by the oblivious performance ratio [1]. The formal definition of the oblivious performance ratio will be introduced in the next section. Informally, a routing algorithm r with an oblivious performance ratio of x means that for any traffic demand, the performance (maximum link load) of r on the demand is at most x times that of the optimal routing algorithm for this demand. An oblivious performance ratio of 1 means that the routing algorithm is optimal for all traffic demands.

System area networks, including InfiniBand and Myrinet, typically support single path routing and some form of multi-path routing. It is well known that single path routing is simple, but may not be as effective as multi-path routing in balancing network loads. On the other hand, multi-path routing introduces complications, such as packet reordering, that the network system must handle. However, the performance difference between single path routing and multi-path routing on the fat-tree topology is not well understood.
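The oblivious performance ratio can be illustrated on a toy network that is not from the paper: a single source and destination joined by two parallel unit-capacity links (all names and the scenario here are hypothetical, a minimal sketch of the definition only).

```python
# Toy illustration of the oblivious performance ratio (hypothetical
# two-link example, not the paper's fat-tree): a single path routing
# must commit all traffic to one link, while the optimal routing may
# split traffic evenly across both links.

def mload_single(demand):
    # Maximum link load when all traffic is forced onto one link.
    return demand

def mload_optimal(demand):
    # Maximum link load when traffic is split evenly over two links.
    return demand / 2

def perf_ratio(demand):
    # PERF(r, TM) = MLOAD(r, TM) / OPTU(TM)
    return mload_single(demand) / mload_optimal(demand)

# The ratio is 2 for every demand, so the oblivious performance ratio
# (the maximum of PERF over all traffic demands) of this single path
# routing is 2: multi-path splitting is twice as good here.
for d in (0.5, 1.0, 7.0):
    assert perf_ratio(d) == 2.0
```

The same worst-case-over-all-demands viewpoint is what the paper applies to fat-trees.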
Without a clear understanding of the performance difference, it is difficult to make a wise decision about whether a system should use single path routing for its simplicity or multi-path routing for its performance. This paper resolves this problem: it provides a concrete quantitative comparison between the performance of single path routing and multi-path routing on the fat-tree topology.

This study focuses on fat-tree topologies formed with m-port switches, where m is a parameter that is restricted to be a multiple of 2. Although the results are obtained for this type of fat-tree, the results, as well as our analysis techniques, can be easily extended to other types of fat-tree topologies. The major conclusions in this paper include the following. For a 3-level fat-tree, we prove that the oblivious performance ratio of any single path routing algorithm is at least √(m/2). For a 4-level fat-tree, we prove that the oblivious performance ratio of any single path routing algorithm is at least m/2; that is, for any single path routing algorithm there exists a traffic demand on which it performs at least m/2 times worse than the optimal algorithm for that traffic demand. We show that these lower bounds are tight for 3-level and 4-level fat-trees by developing optimal single path oblivious routing schemes that achieve the bounds. These algorithms provide the best performance guarantees among all single path routing algorithms under uncertain traffic demands. It must be noted that practical fat-tree topologies usually have no more than 4 levels: depending on the number of ports in the switches forming the fat-tree, a 4-level fat-tree can easily support more than ten thousand processing nodes. Hence, the single path routing schemes developed in this paper are sufficient for most practical fat-tree based networks. For multi-path routing, we show that it is possible to obtain a scheme that is optimal for any traffic demand (an oblivious performance ratio of 1) on the fat-tree topology. This suggests that multi-path routing is much more effective than single path routing in providing worst case performance guarantees on the fat-tree topology.

The rest of the paper is organized as follows. In Section 2, we formally define routing and the metrics for evaluating routing schemes and specify the fat-tree topology. In Section 3, we study single path oblivious routing schemes for the fat-tree topology. In Section 4, we present the results for multi-path routing. Section 5 reports the results of our performance study of the proposed algorithms and other routing algorithms designed for the fat-tree topology. Section 6 describes the related work. Finally, Section 7 concludes the paper.

2. BACKGROUND
2.1 Routing and its performance metrics
Let the system have N processing nodes, numbered from 0 to N−1. The traffic demand is described by an N×N traffic matrix TM. Each entry tm_{i,j} in TM, 0 ≤ i ≤ N−1, 0 ≤ j ≤ N−1, denotes the amount of traffic from node i to node j. For a set A, |A| denotes the size of the set. The definitions of routing and the performance metrics in this paper are modeled after [1]. A routing specifies how the traffic of each source-destination (SD) pair is routed across the network. We consider two types of routing schemes: single path routing, where only one path can be used for each SD pair, and multi-path routing, where multiple paths can be used. In multi-path routing, each path for an SD pair routes a fraction of the traffic for the SD pair. A multi-path routing can be characterized by a set of paths MP_{i,j} = {MP^1_{i,j}, MP^2_{i,j}, ..., MP^{|MP_{i,j}|}_{i,j}} for each SD pair (i,j), and the fraction of the traffic routed through each path, f_{i,j} = {f^k_{i,j} | k = 1, 2, ..., |MP_{i,j}|}, with Σ_{k=1}^{|MP_{i,j}|} f^k_{i,j} = 1.
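The link load induced by such a set of paths and fractions can be computed mechanically. The sketch below is our own illustration (the dictionary representation of TM, paths, and fractions is an assumption, not the paper's notation): each path contributes tm_{i,j} × f^k_{i,j} to every link it traverses.

```python
# Sketch of per-link load accumulation for a multi-path routing.
# Representation (our choice): a path is a tuple of nodes, a link is a
# consecutive node pair, tm[(i, j)] is the demand of SD pair (i, j).
from collections import defaultdict

def max_link_load(tm, paths, fracs):
    """tm: {(i, j): demand}; paths: {(i, j): [path, ...]};
    fracs: {(i, j): [fraction, ...]} summing to 1 per SD pair."""
    load = defaultdict(float)
    for sd, demand in tm.items():
        for path, f in zip(paths[sd], fracs[sd]):
            for link in zip(path, path[1:]):   # consecutive node pairs
                load[link] += demand * f       # tm_{i,j} * f^k_{i,j}
    return max(load.values())

# Two units from 0 to 3 split over two paths; one unit from 1 to 3 on
# a single path that shares link (2, 3) with the first SD pair.
tm = {(0, 3): 2.0, (1, 3): 1.0}
paths = {(0, 3): [(0, 2, 3), (0, 4, 3)], (1, 3): [(1, 2, 3)]}
fracs = {(0, 3): [0.5, 0.5], (1, 3): [1.0]}
print(max_link_load(tm, paths, fracs))  # link (2, 3) carries 1.0 + 1.0 = 2.0
```

Setting |MP_{i,j}| = 1 and f^1_{i,j} = 1 recovers single path routing as a special case.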
Hence, a multi-path routing mr is specified by a set of paths MP_{i,j} and a vector representing the fraction of the traffic routed through each path, f_{i,j}, for each SD pair (i,j), 0 ≤ i ≤ N−1 and 0 ≤ j ≤ N−1. Let link l ∈ MP^k_{i,j}; the contribution of the traffic tm_{i,j} to link l through path MP^k_{i,j} is thus tm_{i,j} × f^k_{i,j}. Notice that link l may be in more than one path in MP_{i,j}. In this case, multiple paths for the same SD pair can contribute to the traffic on link l. Single path routing is a special case of multi-path routing where |MP_{i,j}| = 1 and all traffic from node i to node j is routed through MP^1_{i,j} (f^1_{i,j} = 1). Hence, a single path routing can be specified by a path MP^1_{i,j} for each SD pair (i,j).

A common metric for the performance of a routing scheme with respect to a certain traffic matrix TM is the maximum link load. Since all links in a fat-tree network have the same capacity, the maximum link load is equivalent to the maximum link utilization. Let Links denote the set of all links in the network. For a multi-path routing mr, the maximum link load is given by

MLOAD(mr, TM) = max_{l ∈ Links} Σ_{i,j,k such that l ∈ MP^k_{i,j}} tm_{i,j} × f^k_{i,j}.

For a single path routing sr, the maximum link load formula degenerates to

MLOAD(sr, TM) = max_{l ∈ Links} Σ_{i,j such that l ∈ MP^1_{i,j}} tm_{i,j}.

An optimal routing for a given traffic matrix TM is a routing that minimizes the maximum link load. Formally, the optimal load for a traffic matrix TM is given by

OPTU(TM) = min_{r is a routing} MLOAD(r, TM).

The performance ratio of a given routing r on a given traffic matrix TM measures how far r is from being optimal on the traffic matrix TM. It is defined as the maximum link load of r on TM divided by the minimum possible maximum link load on TM [1]:

PERF(r, TM) = MLOAD(r, TM) / OPTU(TM).

The oblivious performance ratio of a routing r, PERF(r), is the maximum of PERF(r, TM) over all traffic matrices TM.

2.2 The fat-tree topology
The basic building block, SUBFT(m,1), is a single m-port switch in which m/2 of the ports connect to processing nodes while the remaining m/2 ports remain open. We will call these open ports up-link ports, since they will be used to connect to the upper level switches. We denote the number of up-link ports in SUBFT(m,l) as nu(m,l); nu(m,1) = m/2 and, in general, nu(m,l) = (m/2)^l.

Figure 3: The 4-port 3-tree (FT(4,3))

The up-link ports in SUBFT(m,l) are numbered from 0 to nu(m,l)−1. SUBFT(m,l) is formed from m/2 SUBFT(m,l−1)'s and nu(m,l−1) = (m/2)^{l−1} top level switches. Each of the top level switches uses m/2 of its ports to connect to the m/2 SUBFT(m,l−1)'s. Let us number the top level switches from 0 to nu(m,l−1)−1. The up-link ports i, 0 ≤ i < nu(m,l−1), in all of the SUBFT(m,l−1)'s are connected to top level switch i. The remaining m/2 ports of top level switch i are up-link ports of SUBFT(m,l), numbered from (m/2)×i to (m/2)×(i+1)−1; hence, SUBFT(m,l) has (m/2)^l up-link ports.

FT(m,n) is formed by connecting m SUBFT(m,n−1)'s to (m/2)^{n−1} root level switches: up-link port i of each SUBFT(m,n−1) is connected to root level switch i, 0 ≤ i ≤ (m/2)^{n−1}−1. FT(m,n) supports 2(m/2)^n processing nodes and has n levels of switches. The root level contains nu(m,n−1) = (m/2)^{n−1} switches, and the total number of switches in FT(m,n) is (2n−1)×(m/2)^{n−1}. For example, FT(4,3), shown in Figure 3, has (2×3−1)×(4/2)² = 20 switches and 4×(4/2)² = 16 processing nodes.

Property 1: FT(m,n) contains sub-fat-trees at every level: m SUBFT(m,n−1)'s, m×(m/2) SUBFT(m,n−2)'s, ..., and m×(m/2)^{n−2} SUBFT(m,1)'s.

Property 2: Each SUBFT(m,l) connects to the rest of FT(m,n) only through its nu(m,l) up-link ports.

Property 3: For two processing nodes a and b, if the smallest sub-fat-tree containing both a and b is a SUBFT(m,i), there are (m/2)^{i−1} different shortest paths from a to b. If such a sub-tree does not exist, there are (m/2)^{n−1} different shortest paths. For example, in FT(4,3), when two processing nodes are not in one sub-fat-tree, there are (4/2)² = 4 shortest paths. In both cases in Property 3, the number of shortest paths between any two nodes can be represented as (m/2)^x for some integer x.

Property 4: Assume that there are (m/2)^x different shortest paths from processing node s to processing node d.
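The node and switch counts of FT(m,n) can be checked numerically. The following is our sketch of the closed-form counts, cross-checked against the example sizes quoted in the paper:

```python
# Check the FT(m, n) counts: 2*(m/2)**n processing nodes and
# (2n - 1)*(m/2)**(n - 1) switches (our transcription of the
# construction's closed forms; m must be even).

def ft_nodes(m, n):
    return 2 * (m // 2) ** n

def ft_switches(m, n):
    return (2 * n - 1) * (m // 2) ** (n - 1)

assert ft_switches(4, 3) == 20    # FT(4,3) has 20 switches ...
assert ft_nodes(4, 3) == 16       # ... and 16 processing nodes
assert ft_nodes(32, 2) == 512     # FT(32,2) with 32-port switches
assert ft_nodes(32, 3) == 8192    # FT(32,3)
assert ft_nodes(48, 3) == 27648   # FT(48,3) = 24*24*48 nodes
```

The linear growth of the root level, (m/2)^{n−1} switches, is what keeps the bisection bandwidth proportional to the node count.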
Each of the level n−1−i up/down links that carry traffic from s to d is used by (m/2)^{x−i} of these shortest paths. For i = 0, the level n−1 link connecting the processing node is used by all (m/2)^x shortest paths. For the next level (i = 1), each level n−2 up/down link carrying traffic from s to d is used by (m/2)^x/(m/2) = (m/2)^{x−1} shortest paths. The cases for links in other levels are similar. Consider the 4 paths from node s to node d in Figure 3: all 4 paths use the level 2 up/down links (the links connecting the processing nodes), 2 paths use each of the level 1 up/down links that carries traffic from s to d, and 1 path uses each of the level 0 up/down links carrying traffic from s to d.

Property 5: In FT(m,n), a level i, 0 ≤ i ≤ n−1, up link carries traffic from at most (m/2)^{n−1−i} source nodes, and a level i down link carries traffic to at most (m/2)^{n−1−i} destination nodes. This property is also intuitive. For example, when i = n−1, a level n−1 link directly connects to a processing node, so such a link carries traffic to/from at most (m/2)^{n−1−(n−1)} = 1 node; a level n−2 link carries traffic to/from at most (m/2)^{n−1−(n−2)} = m/2 nodes.

To derive lower bounds on the oblivious performance ratio, we use an extended 2-layer fat-tree, denoted as EFT2(m,k), which contains two levels of switches: k top level switches and k bottom level switches, where each bottom level switch connects to m/2 processing nodes as its children and each top level switch has a link to each bottom level switch. We also use a simplified 2-layer fat-tree, SEFT2(m,k), in which the top level switches are merged into a single root switch. In the figures, we separate the two directional channels.

Figure 6: FT(m,2) topology
Figure 7: SEFT2(m,k) topology

Lemma 6: Let the processing nodes in EFT2(m,k) be numbered from 0 to N−1. Let A = {(s_1,d_1), ..., (s_{|A|},d_{|A|})} be a set of node disjoint SD pairs (for 1 ≤ i ≤ |A|, s_i ∈ {0,...,N−1} and d_i ∈ {0,...,N−1}). When |A| ≤ k, the SD pairs in A can be routed in EFT2(m,k) with |A| link disjoint paths.

Proof: In EFT2(m,k), each of the top level switches has a link with each of the bottom level switches. Since |A| ≤ k, we can assign a different top level switch to each SD pair (s_i,d_i) ∈ A. For each (s_i,d_i), if s_i and d_i are in the same switch, there is only one path between s_i and d_i (from s_i to the switch connecting both s_i and d_i, and then to d_i). Since A is node disjoint, these links are not used by the paths for other SD pairs. If s_i and d_i are not in the same switch, the path for (s_i,d_i) is: from s_i to the bottom level switch connecting s_i, to the top level switch assigned to (s_i,d_i), to the bottom level switch connecting d_i, and then to d_i. This way, all the SD pairs in A are routed with link disjoint paths. □

Lemma 7: Let sr be
a single path routing on EFT2(m,k). Assume that under routing sr, there exists a link l that carries traffic for a set A of node disjoint SD pairs, |A| ≤ k. Then, PERF(sr) ≥ |A|.

Proof: To show that PERF(sr) ≥ |A|, we must show that there exists a traffic matrix TM such that MLOAD(sr,TM)/OPTU(TM) ≥ |A|. Consider the traffic matrix TM with one unit of traffic for each SD pair in A and no other traffic. Under sr, link l carries all |A| units of traffic, so MLOAD(sr,TM) ≥ |A|. From Lemma 6, the SD pairs in A can be routed with |A| link disjoint paths, so OPTU(TM) = 1. Hence, PERF(sr) ≥ |A|. □

Lemma 8: In SEFT2(m,k), when the largest of the maximum disjoint sizes of all links is at most X, where X ≥ m/k, at most k(k−1)X² SD pairs can be routed through the root switch.

Proof: In SEFT2(m,k), at most k(k−1)(m/2)² SD pairs may need to be routed through the root. Let (s,d) be an SD pair. The pair must be routed through the root only when nodes s and d are connected to different switches. We will call the root switch in SEFT2(m,k) switch R and the k level 1 switches sw(0), sw(1), ..., sw(k−1), as shown in Figure 7. Let S be a largest set of SD pairs that are routed through the root when the largest of the maximum disjoint sizes of all links is at most X. Let S_{i,j}, 0 ≤ i ≠ j ≤ k−1, be the set of SD pairs in S with source nodes in switch sw(i) and destination nodes in switch sw(j); S = ∪_{0≤i≠j≤k−1} S_{i,j}. Let us denote

LX^{src}_{i,j} = ∪_{a such that a ∈ SRC(S_{i,j}) and |S^{S_{i,j}}_a| > X} S^{S_{i,j}}_a.

Let E_{i,j} = |SRC(LX^{src}_{i,j})|. For the SD pairs in S_{i,j}, E_{i,j} is the number of source nodes in switch sw(i), each of which has more than X destination nodes in switch sw(j); LX^{src}_{i,j} contains all such SD pairs. Similarly, we denote

LX^{dst}_{i,j} = ∪_{d such that d ∈ DST(S_{i,j}) and |D^{S_{i,j}}_d| > X} D^{S_{i,j}}_d.

Let F_{i,j} = |DST(LX^{dst}_{i,j})|. For the SD pairs in S_{i,j}, F_{i,j} is the number of destination nodes in switch sw(j), each of which has more than X source nodes in switch sw(i); LX^{dst}_{i,j} contains all such SD pairs.

All SD pairs in S_{i,j} must pass through links sw(i)→R and R→sw(j). First, let us consider link sw(i)→R. Let All_{i→R} = ∪_{j≠i} S_{i,j} be the set of SD pairs with source nodes in sw(i). All SD pairs in All_{i→R} must go through link sw(i)→R. Hence, L(All_{i→R}) ≤ X. From Lemma 4, L(All_{i→R} − ∪_{x≠i} LX^{src}_{i,x}) ≤ X − Σ_{x≠i} E_{i,x}. Since S_{i,j} − LX^{src}_{i,j} ⊆ All_{i→R} − ∪_{x≠i} LX^{src}_{i,x}, we have L(S_{i,j} − LX^{src}_{i,j}) ≤ X − Σ_{x≠i} E_{i,x}. Hence, applying Lemma 4 again,

L(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) ≤ X − Σ_{x≠i} E_{i,x} − F_{i,j}.

Using similar logic, by considering link R→sw(j), we can obtain

L(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) ≤ X − E_{i,j} − Σ_{x≠j} F_{x,j}.

Combining these two inequalities, we obtain

L(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) ≤ X − (Σ_{x≠i} E_{i,x} + F_{i,j} + E_{i,j} + Σ_{x≠j} F_{x,j})/2.

Each source or destination node in S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j} can have no more than X SD pairs in the set (otherwise, these SD pairs would be included in either LX^{src}_{i,j} or LX^{dst}_{i,j}). Hence, LS(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) ≤ X. From Lemma 5,

|S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}| ≤ L(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) × LS(S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}) ≤ (X − (Σ_{x≠i} E_{i,x} + F_{i,j} + E_{i,j} + Σ_{x≠j} F_{x,j})/2) × X.

Hence,

|∪_{i=0}^{k−1} ∪_{j≠i} (S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j})| ≤ Σ_{i=0}^{k−1} Σ_{j≠i} |S_{i,j} − LX^{src}_{i,j} − LX^{dst}_{i,j}| ≤ Σ_{i=0}^{k−1} Σ_{j≠i} X × (X − (Σ_{x≠i} E_{i,x} + F_{i,j} + E_{i,j} + Σ_{x≠j} F_{x,j})/2) ≤ k(k−1)X² − (kX/2) Σ_{i=0}^{k−1} Σ_{j≠i} E_{i,j} − (kX/2) Σ_{i=0}^{k−1} Σ_{j≠i} F_{i,j}.

Since each switch connects to m/2 processing nodes, |LX^{src}_{i,j}| ≤ E_{i,j} × m/2 and |LX^{dst}_{i,j}| ≤ F_{i,j} × m/2. Hence, |S| ≤ k(k−1)X² − (kX/2 − m/2) Σ_{i=0}^{k−1} Σ_{j≠i} (E_{i,j} + F_{i,j}). When X ≥ m/k, kX/2 ≥ m/2. Thus, |S| ≤ k(k−1)X². □

Let us denote by T(X) the maximum number of SD pairs routed through SEFT2(m,k) when the largest of the maximum disjoint sizes of the links in SEFT2(m,k) is X. Obviously, when X > Y, T(X) ≥ T(Y) regardless of the relation among X, m, and k. Lemma 8 states that when X ≥ m/k, T(X) ≤ k(k−1)X².

Lemma 9: Let r be a single path routing algorithm on EFT2(m,k). If k ≥ √(2m), then PERF(r) ≥ √(m/2).

Proof: All k(k−1)(m/2)² SD pairs whose sources and destinations are connected to different bottom level switches must be routed through top level switches. Let X be the largest of the maximum disjoint sizes of the links. Since k ≥ √(2m), we have m/k ≤ √(m/2), so Lemma 8 applies. If X < √(m/2), then T(X) ≤ k(k−1)X² < k(k−1)(m/2) < k(k−1)(m/2)². Since k(k−1)(m/2)² SD pairs pass through the top level switches, X < √(m/2) cannot be true. Thus, X ≥ √(m/2). From Lemma 7, PERF(r) ≥ X ≥ √(m/2). □

Lemma 10: Let r be a single path routing algorithm for FT(m,2). Then PERF(r) ≥ √(m/2).

Proof: FT(m,2) contains m level 1 switches, each connecting m/2 processing nodes, with a link from each level 1 switch to each top level switch, so the lower bound for EFT2(m,m) applies. Since k = m ≥ √(2m) for m ≥ 2, from Lemma 9, PERF(r) ≥ √(m/2). □

Lemma 11: Let r be a single path routing algorithm for FT(m,3). Then PERF(r) ≥ m/2.

Proof: Viewing each SUBFT(m,2), which has (m/2)² processing nodes and (m/2)² up-link ports, as a 2(m/2)²-port switch, FT(m,3) is approximated as EFT2(2(m/2)², 2(m/2)). Applying Lemma 9 with m′ = 2(m/2)² and k = 2(m/2) = √(2m′), PERF(r) ≥ √(m′/2) = m/2. □

Lemma 12: Let r be a single path routing algorithm for FT(m,n). Then PERF(r) ≥ (m/2)^⌊(n−1)/2⌋.

Proof: Let us consider
the maximum disjoint sizes on links connecting to the up-link ports of the SUBFT(m,i)'s, 1 ≤ i ≤ n−1, in FT(m,n). From Property 1 and Property 2 of FT(m,n), the connectivity in FT(m,n) can be partitioned into two levels (with respect to such links): the lower level connectivity provided by the SUBFT(m,i)'s and the upper level connectivity provided by the upper level switches for the SUBFT(m,i)'s. The connectivity in SUBFT(m,i) can be approximated as a 2(m/2)^i-port switch. Consider the case when i = 2⌊(n−1)/2⌋ ≤ n−1. Following the proof of Lemma 9, the largest of the maximum disjoint sizes on such links is at least

√((m/2)^{2⌊(n−1)/2⌋}) = (m/2)^⌊(n−1)/2⌋.

From Lemma 7, PERF(r) ≥ (m/2)^⌊(n−1)/2⌋. □

Note that the tree height of FT(m,n) is n+1: FT(m,2) is a 3-level fat-tree, FT(m,3) is a 4-level fat-tree, and FT(m,n) is an (n+1)-level fat-tree. This sub-section establishes that the lower bound of the oblivious performance ratio for any single path routing is √(m/2) for a 3-level fat-tree, m/2 for a 4-level fat-tree, and (m/2)^⌊(H−2)/2⌋ for an H-level fat-tree. Using 32-port switches, FT(32,2) supports up to 512 processing nodes while FT(32,3) supports up to 8192 processing nodes. With 48-port switches, FT(48,3) can support 24×24×48 = 27648 processing nodes. Hence, optimal oblivious routing schemes for FT(m,2) and FT(m,3) bear the most practical significance. Moreover, the development of these algorithms also bears theoretical significance by making the lower bounds on the oblivious performance ratio for FT(m,2) and FT(m,3) (Lemma 10 and Lemma 11) tight bounds.

Let N be the number of processing nodes. Let the traffic matrix be TM with entries tm_{i,j}, 0 ≤ i ≤ N−1 and 0 ≤ j ≤ N−1, specifying the amount of traffic from node i to node j. The total traffic sent from node i is Σ_j tm_{i,j} and the total traffic received by node i is Σ_j tm_{j,i}. Since there is only one link connecting each processing node to the network, such traffic must be carried on that link regardless of the routing scheme. Hence, for any routing scheme (single path or multi-path), the load on the link (which has two directions) connecting to node i is max{Σ_j tm_{i,j}, Σ_j tm_{j,i}}. We define the base load of a traffic matrix TM as

baseload(TM) = max_{0≤i≤N−1} max{Σ_j tm_{i,j}, Σ_j tm_{j,i}}.

The minimum maximum link load on the fat-tree topology using any routing scheme, single path or multi-path, is at least baseload(TM) for any traffic matrix TM. In other words, OPTU(TM) ≥ baseload(TM). Our optimal single path oblivious routing schemes are based on the following lemma.

Lemma 13: If a single path routing scheme r routes SD pairs such that the SD pairs in each of the links in FT(m,n) are either from at most X sources or towards at most X destinations, then PERF(r) ≤ X.

Proof: As discussed earlier, for any traffic demand TM on FT(m,n), OPTU(TM) ≥ baseload(TM). Since each link carries traffic either from at most X sources or towards at most X destinations, the load of the link is no more than X × baseload(TM). Hence, PERF(r,TM) ≤ X × baseload(TM)/OPTU(TM) ≤ X for any TM, and PERF(r) ≤ X. □

FT(m,2) contains 3m/2 switches and supports m²/2 processing nodes: m/2 switches are in level 0 and m switches are in level 1. There is one link from each switch in level 1 to each of the switches in level 0. In order to describe the oblivious routing algorithm, we will give a non-recursive description of FT(m,2). The m/2 level 0 switches are labeled switches (0,0), (1,0), ..., (m/2−1, 0).
The m level 1 switches are labeled switches (0,1), (1,1), ..., (m−1,1). Each level 1 switch (i,1), 0 ≤ i ≤ m−1, is connected with m/2 processing nodes, labeled (i,0), (i,1), ..., (i,m/2−1), and there is a link between each level 0 switch (j,0), 0 ≤ j ≤ m/2−1, and each level 1 switch (i,1). Figure 6 depicts the FT(m,2) topology as well as the switch and processing node labeling.

Lemma 10 states that for any single path routing r, PERF(r) ≥ √(m/2) on FT(m,2). To ease exposition, let us assume that √(m/2) is an integer. The cases when √(m/2) is not an integer can be handled with some minor modifications; the algorithm that deals with these cases is given in Figure 8. Following Lemma 13, the optimal oblivious routing algorithm schedules the SD pairs such that the traffic in each up link from a bottom level switch to a top level switch has exactly √(m/2) sources, while the traffic in each down link from a top level switch to a bottom level switch has exactly √(m/2) destinations. Note that in FT(m,2), each of the level 1 links carries traffic either to 1 node or from 1 node (the node that the link is connected to), so we do not have to consider such links in the design of the optimal oblivious routing algorithm.

Let Z = √(m/2). The routing algorithm partitions the m/2 processing nodes connected to switch (i,1), 0 ≤ i ≤ m−1, into Z groups: group j, 0 ≤ j ≤ Z−1, includes nodes (i, j×Z), (i, j×Z+1), ..., (i, j×Z+Z−1). The SD pairs are scheduled as follows. The up-link (i,1)→(j,0), 0 ≤ i ≤ m−1 and 0 ≤ j ≤ m/2−1, carries traffic from source nodes in group ⌊j/Z⌋ of switch (i,1); the down link (j,0)→(i,1), which is fixed once the traffic in the up-link is decided, carries traffic with source nodes in group ⌊j/Z⌋ in all switches other than switch (i,1) to destination nodes in group j mod Z in switch (i,1). Hence, each of the up-links carries traffic from exactly Z source nodes and each of the down links carries traffic to exactly Z destination nodes.

The detailed routing algorithm, called OSRM2, is shown in Figure 8. When √(m/2) is an integer, the algorithm works exactly as just described. When √(m/2) is not an integer, the algorithm partitions the m/2 sources into Z_s groups and the m/2 destinations into Z_d groups, with Z_s and Z_d determined as in Figure 8, and then uses the same logic as the case when √(m/2) is an integer to schedule the SD pairs.

Theorem 1: When √(m/2) is an integer, PERF(OSRM2) = √(m/2).

Proof: As discussed earlier, using OSRM2, each link carries traffic either from √(m/2) sources or to √(m/2) destinations. From Lemma 13, PERF(OSRM2) ≤ √(m/2). From Lemma 10, PERF(OSRM2) ≥ √(m/2). Hence, PERF(OSRM2) = √(m/2), and OSRM2 is an optimal oblivious routing algorithm for FT(m,2). □

Figure 8: Optimal oblivious single path routing for FT(m,2) (algorithm OSRM2, including the cases when √(m/2) is not an integer)

FT(m,3) contains 5(m/2)² switches and supports 2(m/2)³ processing nodes. It consists of m SUBFT(m,2)'s, each having (m/2)² processing nodes, and (m/2)² level 0 (root) switches. We again give a non-recursive labeling. The level 1 switches are labeled ((i_0,i_1),1), 0 ≤ i_0 ≤ m−1 and 0 ≤ i_1 ≤ m/2−1; the level 2 switches are labeled ((i_0,i_1),2), 0 ≤ i_0 ≤ m−1 and 0 ≤ i_1 ≤ m/2−1; the level 0 switches are labeled ((i_0,i_1),0), 0 ≤ i_0 ≤ m/2−1 and 0 ≤ i_1 ≤ m/2−1. A processing node is labeled (p_0,p_1,p_2), 0 ≤ p_0 ≤ m−1, 0 ≤ p_1 ≤ m/2−1, and 0 ≤ p_2 ≤ m/2−1, and connects to level 2 switch ((p_0,p_1),2). Each level 2 switch ((i_0,i_1),2) has a link to each of the level 1 switches ((i_0,X),1), 0 ≤ X ≤ m/2−1; each level 1 switch ((i_0,i_1),1) has a link to each of the level 0 switches ((i_1,X),0), 0 ≤ X ≤ m/2−1.

Like in the FT(m,2) case, our optimal routing algorithm ensures that the SD pairs in each link are either from at most m/2 sources or to at most m/2 destinations. From Property 5 in Section 2.2, each level 1 or level 2 link in FT(m,3) carries traffic either from no more than m/2 sources or to no more than m/2 destinations. Hence, routing within each SUBFT(m,2) does not affect the oblivious performance ratio, and we only need to focus on the level 0 links (the links connecting level 0 and level 1 switches). The idea is similar to that in OSRM2: the routing algorithm ensures that each up link out of a sub-fat-tree SUBFT(m,2) carries traffic from m/2 sources and each down link into a SUBFT(m,2) carries traffic to m/2 destinations. Basically, we can treat each SUBFT(m,2) as if it were a 2(m/2)²-port switch. The algorithm partitions the (m/2)² = Z² processing nodes in a SUBFT(m,2) into Z = m/2 groups of m/2 nodes each: node (p_0,p_1,p_2) is in group p_2 of the p_0-th SUBFT(m,2).

Algorithm OSRM3: Route from node (s_0,s_1,s_2) to node (d_0,d_1,d_2):
if (s_0 == d_0 and s_1 == d_1) /* within one SUBFT(m,2); routing won't affect the oblivious ratio */
    Use route: node (s_0,s_1,s_2) → switch ((s_0,s_1),2) → node (d_0,d_1,d_2)
else if (s_0 == d_0) /* within one SUBFT(m,2); routing won't affect the oblivious ratio */
    Use route: node (s_0,s_1,s_2) → switch ((s_0,s_1),2) → switch ((s_0,s_2),1) → switch ((s_0,d_1),2) → node (d_0,d_1,d_2)
else /* must be careful about links to/from level 0 switches */
    Use route: node (s_0,s_1,s_2) → switch ((s_0,s_1),2) → switch ((s_0,s_2),1) → switch ((s_2,d_2),0) → switch ((d_0,s_2),1) → switch ((d_0,d_1),2) → node (d_0,d_1,d_2)

Figure 9: Optimal oblivious single path routing for FT(m,3)

The routing for links between a SUBFT(m,2) and the top level switches is similar to that for links between level 1 switches and level 0 switches in FT(m,2): the up-link ((i_0,0),1)→((0,0),0) carries traffic from group 0 processing nodes (in the i_0-th SUBFT(m,2)) to group 0 processing nodes in other SUBFT(m,2)'s; ((i_0,0),1)→((0,1),0) carries traffic from group 0 processing nodes to group 1 processing nodes in other SUBFT(m,2)'s; ...; ((i_0,0),1)→((0,Z−1),0) carries traffic from group 0 processing nodes to group Z−1 processing nodes in other SUBFT(m,2)'s; ((i_0,1),1)→((1,0),0) carries traffic from group 1 processing nodes to group 0 processing nodes in other SUBFT(m,2)'s; ...; ((i_0,1),1)→((1,Z−1),0) carries traffic from group 1 processing nodes to group Z−1 processing nodes in other SUBFT(m,2)'s; ...; ((i_0,Z−1),1)→((Z−1,0),0) carries traffic from group Z−1 processing nodes to group 0 processing nodes in other SUBFT(m,2)'s; ...; ((i_0,Z−1),1)→((Z−1,Z−1),0) carries traffic from group Z−1 processing nodes to group Z−1 processing nodes in other SUBFT(m,2)'s. This way, each up-link only carries SD pairs with exactly Z = m/2 source nodes, and each down link only carries SD pairs with exactly Z = m/2 destination nodes.

Theorem 2: PERF(OSRM3) = m/2, and OSRM3 is an optimal oblivious routing algorithm for FT(m,3).

Proof: From the above discussion, using OSRM3, the SD pairs in each link have either at most m/2 source nodes or at most m/2 destination nodes. From Lemma 13, PERF(OSRM3) ≤ m/2. From Lemma 11, m/2 is the lower bound for any single path routing scheme on FT(m,3). Hence, OSRM3 is an optimal oblivious routing algorithm for FT(m,3). □

4. MULTI-PATH OBLIVIOUS ROUTING
In the previous section, it was shown that any single path routing has an oblivious performance ratio of at least (m/2)^⌊(n−1)/2⌋ on FT(m,n). This indicates that single path
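The three routing cases of OSRM3 (Figure 9, in the previous section) can be transcribed as a small runnable sketch. This is our own code; the switch-labeling convention follows the non-recursive description of FT(m,3) given above.

```python
# Sketch of the three OSRM3 cases from Figure 9 (our transcription).
# Nodes are (p0, p1, p2) triples; switches are (label, level) pairs,
# e.g. ((i0, i1), 2) is a level 2 switch.

def osrm3_route(s, d):
    s0, s1, s2 = s
    d0, d1, d2 = d
    if s0 == d0 and s1 == d1:
        # Source and destination share a level 2 switch.
        mid = [((s0, s1), 2)]
    elif s0 == d0:
        # Same SUBFT(m,2): up to a level 1 switch and back down.
        mid = [((s0, s1), 2), ((s0, s2), 1), ((s0, d1), 2)]
    else:
        # Different SUBFT(m,2)'s: climb to a level 0 (root) switch,
        # chosen by (s2, d2) so each root link sees one source group
        # and one destination group.
        mid = [((s0, s1), 2), ((s0, s2), 1), ((s2, d2), 0),
               ((d0, s2), 1), ((d0, d1), 2)]
    return [s] + mid + [d]

# Cross-subtree route traverses five switches; same-switch route one.
assert len(osrm3_route((0, 1, 2), (3, 0, 1))) == 7
assert osrm3_route((0, 1, 2), (0, 1, 3)) == [(0, 1, 2), ((0, 1), 2), (0, 1, 3)]
```

Note how the root switch is a pure function of (s_2, d_2), which is exactly what makes the scheme deterministic yet load balanced across all roots.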



arXiv:cs/0102001v2 [cs.MS] 5 Mar 2004

Mathematical Programming manuscript No. (will be inserted by the editor)

Elizabeth D. Dolan · Jorge J. Moré

Benchmarking Optimization Software with Performance Profiles

the date of receipt and acceptance should be inserted later

Abstract. We propose performance profiles, distribution functions for a performance metric, as a tool for benchmarking and comparing optimization software. We show that performance profiles combine the best features of other tools for performance evaluation.

Key words. benchmarking – guidelines – performance – software – testing – metric – timing

Elizabeth D. Dolan: Northwestern University and Mathematics and Computer Science Division, Argonne National Laboratory. e-mail: dolan@
Jorge J. Moré: Mathematics and Computer Science Division, Argonne National Laboratory. e-mail: more@

The performance ratio of solver s on problem p is

r_{p,s} = t_{p,s} / min{t_{p,s} : s ∈ S}.

We assume that a parameter r_M ≥ r_{p,s} for all p, s is chosen, and that r_{p,s} = r_M if and only if solver s does not solve problem p. We will show that the choice of r_M does not affect the performance evaluation. The performance of solver s on any given problem may be of interest, but we would like to obtain an overall assessment of the performance of the solver. If we define

ρ_s(τ) = (1/n_p) size{p ∈ P : r_{p,s} ≤ τ}, τ ∈ R,

for s ∈ S, then ρ_s(τ) is the fraction of problems on which the performance ratio of solver s is within a factor τ of the best ratio. Moreover, if r̂_{q,s} denotes the ratio obtained when the result on a particular problem q changes, the resulting profile ρ̂_s satisfies ρ̂_s(τ) = ρ_s(τ) for τ < min{r_{q,s}, r̂_{q,s}} or τ ≥ max{r_{q,s}, r̂_{q,s}}.
Thus,if n p is reasonably large,then the result on a particular problem q does not greatly affect the performance profilesρs.Not only are performance profiles relatively insensitive to changes in results on a small number of problems,they are also largely unaffected by small changes in results over many problems.We demonstrate this property by showing that small changes from r p,s toˆr p,s result in a correspondingly small L1error betweenρs andˆρs.Theorem2.1.Let r i andˆr i for1≤i≤n p be performance ratios for some solver,and letρandˆρbe,respectively,the performance profiles defined by these ratios.If|r i−ˆr i|≤ǫ,1≤i≤n p(2.1) for someǫ>0,then ∞|ρ(t)−ˆρ(t)|dt≤ǫ1Proof.Since performance profiles do not depend on the ordering of the data,we can assume that{r i}is monotonically increasing.We can reorder the sequence{ˆr i}so that it is also monotonically increasing,and(2.1)still holds.These reorderings guarantee thatρ(t)=i/n p for t∈[r i,r i+1),with a similar result forˆρ.We now show that for any integer k with1≤k≤n p,s k|ρ(t)−ˆρ(t)|dt≤k ǫ1n p ,k <n p .(2.4)Together,(2.2)and (2.4)show that (2.2)holds for k +1.We present the proof for the case when ˆr k ≤r k .A similar argument can be made for r k ≤ˆr k .If ˆr k ≤r k then s k =r k and ˆr k ≤r k ≤r k +1.The argument depends on the position of ˆr k +1and makes repeated use of the fact that ρ(t )=k/n p for t ∈[r k ,r k +1),with a similar result for ˆρ.If r k +1≤ˆr k +1then ρ(t )=ˆρ(t )in [r k ,r k +1).Also note that |ρ(t )−ˆρ(t )|=1/n p in [r k +1,ˆr k +1).Hence,(2.4)holds with s k +1=ˆr k +1.The case where ˆr k +1≤r k +1makes use of the observation that ˆr i =r i ≥r k +1for i >k +1.If r k ≤ˆr k +1≤r k +1,then ρ(t )=ˆρ(t )in [r k ,ˆr k +1),and |ρ(t )−ˆρ(t )|=1/n p in [ˆr k +1,r k +1).Hence,(2.4)holds.On the other hand,if ˆr k ≤ˆr k +1≤r k ,then we only need to note that |ρ(t )−ˆρ(t )|=1/n p in [r k ,r k +1)in order to conclude that (2.4)holds.We have shown that (2.2)holds for all integers k with 1≤k ≤n p .In particular,the 
case k =n p yields our result since ρ(t )=ˆρ(t )for t ∈[s n p ,∞).⊓⊔3.Benchmarking DataThe timing data used to compute the performance profiles in Sections 4and 5are gener-ated with the COPS test set,which currently consists of seventeen different applications,all models in the AMPL [10]modeling language.The choice of the test problem set P is always a source of disagreement because there is no consensus on how to choose prob-lems.The COPS problems are selected to be interesting and difficult,but these criteria are subjective.For each of the applications in the COPS set we use four instances of the application obtained by varying a parameter in the application,for example,the number of grid points in a discretization.Application descriptions and complete absolute timing results for the full test set are given in [9].Section 4focuses on only the subset of the eleven optimal control and parameter estimation applications in the COPS set,while the discussion in Section 5covers the complete performance results.Accordingly,we provide here information specific to this subset of the COPS problems as well as an analysis of the test set as a whole.Table 3.1gives the quartiles for three problem parameters:the number of variables n ,the number of constraints,and the ratio (n −n e )/n ,where n e is the number of equality constraints.In the optimization literature,n −n e is often called the degrees of freedom of the problem,since it is an upper bound on the number of variables that are free at the solution.6COPS subsetmin q1q2q3max10044989920004815 Num.constraints0150498159850480599******* Deg.freedom(%)0.0 1.033.2100.0100.0The data in Table3.1is fairly representative of the distribution of these parameters throughout the test set and shows that at least three-fourths of the problems have the number of variables n in the interval[400,5000].Our aim was to avoid problems where n was in the range[1,50]because other benchmarking problem sets tend to have a pre-ponderance of problems 
with n in this range. The main difference between the full COPS set and the COPS subset is that the COPS subset is more constrained, with ne ≥ n/2 for all the problems. Another feature of the COPS subset is that the equality constraints are the result of either difference or collocation approximations to differential equations.

We ran our final complete runs with the same options for all models. The options involve setting the output level for each solver so that we can gather the data we need, increasing the iteration limits as much as allowed, and increasing the superbasics limits for MINOS and SNOPT to 5000. None of the failures we record in the final trials include any solver error messages about having violated these limits. While we relieved restrictions unnecessary for our testing, all other parameters were left at the solvers' default settings.

The script for generating the timing data sends a problem to each solver successively, so as to minimize the effect of fluctuation in the machine load. The script tracks the wall-clock time from the start of the solve, killing any process that runs 3,600 seconds, which we declare unsuccessful, and beginning the next solve. We cycle through all the problems, recording the wall-clock time as well as the combination of AMPL system time (to interpret the model and compute varying amounts of derivative information required by each solver) and AMPL solver time for each model variation. We repeat the cycle for any model for which one of the solvers' AMPL time and the wall-clock time differ by more than 10 percent. To further ensure consistency, we have verified that the AMPL time results we present could be reproduced to within 10 percent accuracy. All computations were done on a SparcULTRA2 running Solaris 7.

We have ignored the effects of the stopping criteria and the memory requirements of the solvers. Ideally we would have used the same stopping criteria, but this is not possible in the AMPL environment. In any case, differences in computing time due to the
stopping criteria are not likely to change computing times by more than a factor of two. Memory requirements can also play an important role. In particular, solvers that use direct linear equation solvers are often more efficient in terms of computing time provided there is enough memory. The solvers that we benchmark have different requirements. MINOS and SNOPT use only first-order information, while LANCELOT and LOQO need second-order information. The use of second-order information can reduce the number of iterations, but the cost per iteration usually increases. In addition, obtaining second-order information is more costly and may not even be possible. MINOS and SNOPT are specifically designed …

    ρs(τ) = size{p ∈ P : log2(rp,s) ≤ τ} / np

in Figure 4.3. This graph reveals all the features of the previous two graphs and thus captures the performance of all the solvers. The disadvantage is that the interpretation of the graph is not as intuitive, since we are using a log scale. Figures 4.1 and 4.2 are mapped into a new scale to reflect all data, requiring at least the interval [0, log2(1043)] in Figure 4.3 to include the largest rp,s < rM. We extend the range slightly to show the flatlining of all solvers. The new figure contains all the information of the other two figures and, in addition, shows that each of the solvers fails on at least 8% of the problems. This is not an unreasonable performance for the COPS test set because these problems were generally chosen to be difficult.

5. Case Study: The Full COPS

We now expand our analysis of the data to include all the problems in version 2.0 of the COPS [7] test set. We present in Figure 5.1 a log2-scaled view of the performance profiles for the solvers on that test set. Figure 5.1 gives a clear indication of the relative performance of each solver. As in the performance profiles in Section 4, this figure shows that performance profiles eliminate the undue influence of a small number of problems on the benchmarking process and the sensitivity of the results associated with the ranking of
solvers. In addition, performance profiles provide an estimate of the expected performance difference between solvers.

The most significant aspect of Figure 5.1, as compared with Figure 4.3, is that on this test set LOQO dominates all other solvers: the performance profile for LOQO lies above all others for all performance ratios. The interpretation of the results in Figure 5.1 is important. In particular, these results do not imply that LOQO is faster on every problem. They indicate only that, for any τ ≥ 1, LOQO solves more problems within …

… LP(1.0), PCx(1.1), SOPLEX(1.1), LPABO(5.6), MOSEK(1.0b), BPMPD(2.11), and BPMPD(2.14). In keeping with our graphing practices with the COPS set, we designate as failures those solves that are marked in the original table as stopping close to the final solution without convergence under the solver's stopping criteria. One feature we see in the graph of Mittelmann's results that does not appear in the COPS graphs is the visual display of solvers that never flatline. In other words, the solvers that climb off the graph are those that solve all of the test problems successfully. As with Figure 4.3, all of the events in the data fit into this log-scaled representation. While this data set cannot be universally representative of benchmarking results by any means, it does show that our reporting technique is applicable beyond our own results. As in the case studies in Sections 4 and 5, the results in Figure 6.1 give an indication of the performance of LP solvers only on the data set used to generate these results. In particular, the test set used to generate Figure 6.1 includes only problems selected by …

…, Numerical experiments with the LANCELOT package (Release A) for large-scale nonlinear optimization, Math. Program., 73 (1996), pp. 73–110.
7. COPS. See /˜more/cops/.
8. H. Crowder, R. S. Dembo, and J. M. Mulvey, On reporting computational experiments with mathematical software, ACM Trans. Math. Softw., 5 (1979), pp. 193–203.
9.
E. D. Dolan and J. J. Moré, Benchmarking optimization software with COPS, Technical Memorandum ANL/MCS-TM-246, Argonne National Laboratory, Argonne, Illinois, 2000.
10. R. Fourer, D. M. Gay, and B. W. Kernighan, AMPL: A Modeling Language for Mathematical Programming, The Scientific Press, 1993.
11. P. E. Gill, W. Murray, and M. A. Saunders, SNOPT: An algorithm for large-scale constrained optimization, Report NA 97-2, University of California, San Diego, 1997.
12. N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, Pennsylvania, 1996.
13. R. H. F. Jackson, P. T. Boggs, S. G. Nash, and S. Powell, Guidelines for reporting results of computational experiments. Report of the ad hoc committee, Math. Program., 49 (1991), pp. 413–426.
14. H. Mittelmann, Benchmarks for optimization software. See /bench.html.
15.
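The performance-profile construction used in the case studies above, ρs(τ) = |{p ∈ P : log2(rp,s) ≤ τ}| / np, can be restated as a short script. This is an illustrative sketch, not code from the paper; the timing matrix below is invented (it is not COPS data), with np.inf marking a solve that failed or hit the time limit:

```python
import numpy as np

def performance_profile(times, taus):
    """Compute rho_s(tau) for each solver s at each tau (log2 scale).

    times: (n_p, n_s) array of solve times; np.inf marks a failure.
    Returns an (n_s, len(taus)) array where entry [s, j] is the fraction
    of problems solver s solves within a factor 2**taus[j] of the best.
    """
    times = np.asarray(times, dtype=float)
    n_p, n_s = times.shape
    best = times.min(axis=1, keepdims=True)   # best time per problem
    log_r = np.log2(times / best)             # log2 of performance ratios r[p, s]
    return np.array([[np.mean(log_r[:, s] <= t) for t in taus]
                     for s in range(n_s)])

# Three hypothetical solvers on four problems; inf = failure.
t = [[1.0, 2.0, np.inf],
     [4.0, 1.0, 8.0],
     [3.0, 3.0, 3.0],
     [2.0, np.inf, 1.0]]
rho = performance_profile(t, taus=[0.0, 1.0, 10.0])
```

At τ = 0 the profile gives the fraction of problems on which a solver is (tied for) fastest; for large τ it approaches the fraction of problems the solver eventually solves, which is how the flatlining of failing solvers shows up in the figures.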

Optimal Aggregation Algorithms for Middleware

Ronald Fagin∗    Amnon Lotem†    Moni Naor‡

Abstract: Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). There is some monotone aggregation function, or combining rule, such as min or average, that combines the individual grades to obtain an overall grade.

To determine the top k objects (that have the best overall grades), the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm ("Fagin's Algorithm", or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm ("the threshold algorithm", or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers.

We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.

Extended abstract appeared
in Proc. Twentieth ACM Symposium on Principles of Database Systems, 2001 (PODS 2001), pp. 102–113.

∗ IBM Almaden Research Center, 650 Harry Road, San Jose, California 95120. Email: fagin@
† University of Maryland–College Park, Dept. of Computer Science, College Park, Maryland 20742. Email: lotem@
‡ Dept. of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel. Email: naor@wisdom.weizmann.ac.il. The work of this author was performed while a Visiting Scientist at the IBM Almaden Research Center.

1 Introduction

Early database systems were required to store only small character strings, such as the entries in a tuple in a traditional relational database. Thus, the data was quite homogeneous. Today, we wish for our database systems to be able to deal not only with character strings (both small and large), but also with a heterogeneous variety of multimedia data (such as images, video, and audio). Furthermore, the data that we wish to access and combine may reside in a variety of data repositories, and we may want our database system to serve as middleware that can access such data.

One fundamental difference between small character strings and multimedia data is that multimedia data may have attributes that are inherently fuzzy. For example, we do not say that a given image is simply either "red" or "not red". Instead, there is a degree of redness, which ranges between 0 (not at all red) and 1 (totally red).

One approach [Fag99] to deal with such fuzzy data is to make use of an aggregation function t.
If x1, ..., xm (each in the interval [0, 1]) are the grades of object R under the m attributes, then t(x1, ..., xm) is the (overall) grade of object R.¹ As we shall discuss, such aggregation functions are useful in other contexts as well. There is a large literature on choices for the aggregation function (see Zimmermann's textbook [Zim96] and the discussion in [Fag99]).

One popular choice for the aggregation function is min. In fact, under the standard rules of fuzzy logic [Zad69], if object R has grade x1 under attribute A1 and x2 under attribute A2, then the grade under the fuzzy conjunction A1 ∧ A2 is min(x1, x2). Another popular aggregation function is the average (or the sum, in contexts where we do not care if the resulting overall grade no longer lies in the interval [0, 1]).

We say that an aggregation function t is monotone if t(x1, ..., xm) ≤ t(x′1, ..., x′m) whenever xi ≤ x′i for every i. Certainly monotonicity is a reasonable property to demand of an aggregation function: if for every attribute, the grade of object R′ is at least as high as that of object R, then we would expect the overall grade of R′ to be at least as high as that of R.

The notion of a query is different in a multimedia database system than in a traditional database system. Given a query in a traditional database system (such as a relational database system), there is an unordered set of answers.² By contrast, in a multimedia database system, the answer to a query can be thought of as a sorted list, with the answers sorted by grade. As in [Fag99], we shall identify a query with a choice of the aggregation function t. The user is typically interested in finding the top k answers, where k is a given parameter (such as k = 1, k = 10, or k = 100). This means that we want to obtain k objects (which we may refer to as the "top k objects") with the highest grades on this query, each along with its grade (ties are broken arbitrarily). For convenience, throughout this paper we will think of k as a constant value, and we will consider algorithms for obtaining the top k
answers in databases that contain at least k objects.

Other applications: There are other applications besides multimedia databases where we make use of an aggregation function to combine grades, and where we want to find the top k answers. One important example is information retrieval [Sal89], where the objects R of interest are documents, the m attributes are search terms s1, ..., sm, and the grade xi measures the relevance of document R for search term si, for 1 ≤ i ≤ m. It is common to take the aggregation function t to be the sum. That is, the total relevance score of document R when the query consists of the search terms s1, ..., sm is taken to be t(x1, ..., xm) = x1 + ··· + xm.

¹We shall often abuse notation and write t(R) for the grade t(x1, ..., xm) of R.
²Of course, in a relational database, the result to a query may be sorted in some way for convenience in presentation, such as sorting department members by salary, but logically speaking, the result is still simply a set, with a crisply-defined collection of members.

Another application arises in a paper by Aksoy and Franklin [AF99] on scheduling large-scale on-demand data broadcast. In this case each object is a page, and there are two fields. The first field represents the amount of time waited by the earliest user requesting a page, and the second field represents the number of users requesting a page. They make use of the product function t with t(x1, x2) = x1·x2, and they wish to broadcast next the page with the top score.

The model: We assume that each database consists of a finite set of objects. We shall typically take N to represent the number of objects. Associated with each object R are m fields x1, ..., xm, where xi ∈ [0, 1] for each i. We may refer to xi as the ith field of R. The database is thought of as consisting of m sorted lists L1, ..., Lm, each of length N (there is one entry in each list for each of the N objects). We may refer to Li as list i. Each entry of Li is of the form (R, xi), where xi is the ith field of R.
Each list Li is sorted in descending order by the xi value. We take this simple view of a database, since this view is all that is relevant, as far as our algorithms are concerned. We are taking into account only access costs, and ignoring internal computation costs. Thus, in practice it might well be expensive to compute the field values, but we ignore this issue here, and take the field values as being given.

We consider two modes of access to data. The first mode of access is sorted (or sequential) access. Here the middleware system obtains the grade of an object in one of the sorted lists by proceeding through the list sequentially from the top. Thus, if object R has the ℓth highest grade in the ith list, then ℓ sorted accesses to the ith list are required to see this grade under sorted access. The second mode of access is random access. Here, the middleware system requests the grade of object R in the ith list, and obtains it in one random access. If there are s sorted accesses and r random accesses, then the sorted access cost is s·cS, the random access cost is r·cR, and the middleware cost is s·cS + r·cR (the sum of the sorted access cost and the random access cost), for some positive constants cS and cR.

Algorithms: There is an obvious naive algorithm for obtaining the top k answers. Under sorted access, it looks at every entry in each of the m sorted lists, computes (using t) the overall grade of every object, and returns the top k answers. The naive algorithm has linear middleware cost (linear in the database size), and thus is not efficient for a large database.

Fagin [Fag99] introduced an algorithm ("Fagin's Algorithm", or FA), which often does much better than the naive algorithm. In the case where the orderings in the sorted lists are probabilistically independent, FA finds the top k answers, over a database with N objects, with middleware cost O(N^((m−1)/m) k^(1/m)), with arbitrarily high probability.³ Fagin also proved that under this independence assumption, along with an assumption on the aggregation
function, every correct algorithm must, with high probability, incur a similar middleware cost in the worst case.

We shall present the "threshold algorithm", or TA. This algorithm was discovered independently by (at least) three groups, including Nepal and Ramakrishna [NR99] (who were the first to publish), Güntzer, Balke, and Kiessling [GBK00], and ourselves.⁴ For more information and comparison, see Section 10 on related work.

³We shall not discuss the probability model here, including the notion of "independence", since it is off track. For details, see [Fag99].
⁴Our second author first defined TA, and did extensive simulations comparing it to FA, as a project in a database course taught by Michael Franklin at the University of Maryland–College Park, in the Fall of 1997.

We shall show that TA is optimal in a much stronger sense than FA. We now define this notion of optimality, which we consider to be interesting in its own right.

Instance optimality: Let A be a class of algorithms, let D be a class of databases, and let cost(A, D) be the middleware cost incurred by running algorithm A over database D. We say that an algorithm B is instance optimal over A and D if B ∈ A and if for every A ∈ A and every D ∈ D we have

    cost(B, D) = O(cost(A, D)).    (1)

Equation (1) means that there are constants c and c′ such that cost(B, D) ≤ c · cost(A, D) + c′ for every choice of A ∈ A and D ∈ D. We refer to c as the optimality ratio. Intuitively, instance optimality corresponds to optimality in every instance, as opposed to just the worst case or the average case.

FA is optimal in a high-probability worst-case sense under certain assumptions. TA is optimal in a much stronger sense: it is instance optimal, for several natural choices of A and D. In particular, instance optimality holds when A is taken to be the class of algorithms that would normally be implemented in practice (since the only algorithms that are excluded are those that make very lucky guesses), and when D is taken to be the class of all databases. Instance optimality of TA holds in this case
for all monotone aggregation functions. By contrast, high-probability worst-case optimality of FA holds only under the assumption of "strictness" (we shall define strictness later; intuitively, it means that the aggregation function is representing some notion of conjunction).

Approximation and early stopping: There are times when the user may be satisfied with an approximate top k list. Assume θ > 1. Define a θ-approximation to the top k answers for the aggregation function t to be a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among these k objects, θ·t(y) ≥ t(z). Note that the same definition with θ = 1 gives the top k answers. We show how to modify TA to give such a θ-approximation (and prove the instance optimality of this modified algorithm under certain assumptions). In fact, we can easily modify TA into an interactive process where at all times the system can show the user its current view of the top k list along with a guarantee about the degree θ of approximation to the correct answer. At any time, the user can decide, based on this guarantee, whether he would like to stop the process.

Restricting random access: As we shall discuss in Section 2, there are some systems where random access is impossible. To deal with such situations, we show in Section 8.1 how to modify TA to obtain an algorithm NRA ("no random accesses") that does no random accesses. We prove that NRA is instance optimal over all algorithms that do not make random accesses and over all databases.

What about situations where random access is not impossible, but simply expensive? Wimmers et al.
[WHRB99] discuss a number of systems issues that can cause random access to be expensive. Although TA is instance optimal, the optimality ratio depends on the ratio cR/cS of the cost of a single random access to the cost of a single sorted access. We define another algorithm that is a combination of TA and NRA, and call it CA ("combined algorithm"). The definition of the algorithm depends on cR/cS. The motivation is to obtain an algorithm that is not only instance optimal, but whose optimality ratio is independent of cR/cS. Our original hope was that CA would be instance optimal (with optimality ratio independent of cR/cS) in those scenarios where TA is instance optimal. Not only does this hope fail, but interestingly enough, we prove that there does not exist any deterministic algorithm, or even probabilistic algorithm that does not make a mistake, with optimality ratio independent of cR/cS in these scenarios! However, we find a new natural scenario where CA is instance optimal, with optimality ratio independent of cR/cS.

Outline of paper: In Section 2, we discuss modes of access (sorted and random) to data. In Section 3, we present FA (Fagin's Algorithm) and its properties. In Section 4, we present TA (the Threshold Algorithm). In Section 5, we define instance optimality, and compare it with related notions, such as competitiveness. In Section 6, we show that TA is instance optimal in several natural scenarios. In the most important scenario, we show that the optimality ratio of TA is best possible. In Section 6.1, we discuss the dependence of the optimality ratio on various parameters. In Section 6.2, we show how to turn TA into an approximation algorithm, and prove instance optimality among approximation algorithms. We also show how the user can prematurely halt TA and, in a precise sense, treat its current view of the top k answers as an approximate answer. In Section 7, we consider situations (suggested by Bruno, Gravano, and Marian [BGM02]) where sorted access is impossible for certain of the sorted
lists. In Section 8, we focus on situations where random accesses are either impossible or expensive. In Section 8.1 we present NRA (No Random Access algorithm), and show its instance optimality among algorithms that make no random accesses. Further, we show that the optimality ratio of NRA is best possible. In Section 8.2 we present CA (Combined Algorithm), which is a result of combining TA and NRA in order to obtain an algorithm that, intuitively, minimizes random accesses. In Section 8.3, we show instance optimality of CA, with an optimality ratio independent of cR/cS, in a natural scenario. In Section 8.4, we show that the careful choice made by CA of which random accesses to make is necessary for instance optimality with an optimality ratio independent of cR/cS. We also compare and contrast CA versus TA. In Section 9, we prove various lower bounds on the optimality ratio, both for deterministic algorithms and for probabilistic algorithms that never make a mistake. We summarize our upper and lower bounds in Section 9.1. In Section 10 we discuss related work. In Section 11, we give our conclusions, and state some open problems.

2 Modes of Access to Data

Issues of efficient query evaluation in a middleware system are very different from those in a traditional database system. This is because the middleware system receives answers to queries from various subsystems, which can be accessed only in limited ways. What do we assume about the interface between a middleware system and a subsystem? Let us consider QBIC⁵ [NBE+93] ("Query By Image Content") as a subsystem. QBIC can search for images by various visual characteristics such as color and texture (and an experimental version can search also by shape). In response to a query, such as Color='red', the subsystem will output the graded set consisting of all objects, one by one, each along with its grade under the query, in sorted order based on grade, until the middleware system tells the subsystem to halt. Then the middleware system could later tell the subsystem to
resume outputting the graded set where it left off. Alternatively, the middleware system could ask the subsystem for, say, the top 10 objects in sorted order, each along with its grade, then request the next 10, etc. In both cases, this corresponds to what we have referred to as "sorted access".

There is another way that we might expect the middleware system to interact with the subsystem. The middleware system might ask the subsystem for the grade (with respect to a query) of any given object. This corresponds to what we have referred to as "random access". In fact, QBIC allows both sorted and random access.

There are some situations where the middleware system is not allowed random access to some subsystem. An example might occur when the middleware system is a text retrieval system, and the subsystems are search engines. Thus, there does not seem to be a way to ask a major search engine on the web for its internal score on some document of our choice under a query.

⁵QBIC is a trademark of IBM Corporation.

Our measure of cost corresponds intuitively to the cost incurred by the middleware system in processing information passed to it from a subsystem such as QBIC. As before, if there are s sorted accesses and r random accesses, then the middleware cost is taken to be s·cS + r·cR, for some positive constants cS and cR. The fact that cS and cR may be different reflects the fact that the cost to a middleware system of a sorted access and of a random access may be different.

3 Fagin's Algorithm

In this section, we discuss FA (Fagin's Algorithm) [Fag99]. This algorithm is implemented in Garlic [CHS+95], an experimental IBM middleware system; see [WHRB99] for interesting details about the implementation and performance in practice. Chaudhuri and Gravano [CG96] consider ways to simulate FA by using "filter conditions", which might say, for example, that the color score is at least 0.2. FA works as follows.

1. Do sorted access in parallel to each of the m sorted lists Li. (By "in parallel", we mean that we access the top
member of each of the lists under sorted access, then we access the second member of each of the lists, and so on.)⁶ Wait until there are at least k "matches", that is, wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.

2. For each object R that has been seen, do random access to each of the lists Li to find the ith field xi of R.

3. Compute the grade t(R) = t(x1, ..., xm) for each object R that has been seen. Let Y be a set containing the k objects that have been seen with the highest grades (ties are broken arbitrarily). The output is then the graded set {(R, t(R)) | R ∈ Y}.⁷

It is fairly easy to show [Fag99] that this algorithm is correct for monotone aggregation functions t (that is, that the algorithm successfully finds the top k answers). If there are N objects in the database, and if the orderings in the sorted lists are probabilistically independent, then the middleware cost of FA is O(N^((m−1)/m) k^(1/m)), with arbitrarily high probability [Fag99].

An aggregation function t is strict [Fag99] if t(x1, ..., xm) = 1 holds precisely when xi = 1 for every i. Thus, an aggregation function is strict if it takes on the maximal value of 1 precisely when each argument takes on this maximal value. We would certainly expect an aggregation function representing the conjunction to be strict (see the discussion in [Fag99]). In fact, it is reasonable to think of strictness as being a key characterizing feature of the conjunction.

Fagin shows that his algorithm is optimal with high probability in the worst case if the aggregation function is strict (so that, intuitively, we are dealing with a notion of conjunction), and if the orderings

⁶It is not actually important that the lists be accessed "in lockstep". In practice, it may be convenient to allow the sorted lists to be accessed at different rates, in batches, etc. Each of the algorithms in this paper where there is "sorted access in parallel" remains correct even when sorted access is not in lockstep. Furthermore, all of our
instance optimality results continue to hold even when sorted access is not in lockstep, as long as the rates of sorted access of the lists are within constant multiples of each other.
⁷Graded sets are often presented in sorted order, sorted by grade.

in the sorted lists are probabilistically independent. In fact, the access pattern of FA is oblivious to the choice of aggregation function, and so for each fixed database, the middleware cost of FA is exactly the same no matter what the aggregation function is. This is true even for a constant aggregation function; in this case, of course, there is a trivial algorithm that gives us the top k answers (any k objects will do) with O(1) middleware cost. So FA is not optimal in any sense for some monotone aggregation functions t. As a more interesting example, when the aggregation function is max (which is not strict), it is shown in [Fag99] that there is a simple algorithm that makes at most mk sorted accesses and no random accesses that finds the top k answers. By contrast, as we shall see, the algorithm TA is instance optimal for every monotone aggregation function, under very weak assumptions.

Even in the cases where FA is optimal, this optimality holds only in the worst case, with high probability. This leaves open the possibility that there are some algorithms that have much better middleware cost than FA over certain databases. The algorithm TA, which we now discuss, is such an algorithm.

4 The Threshold Algorithm

We now present the threshold algorithm (TA).

1. Do sorted access in parallel to each of the m sorted lists Li. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in every list Li.⁸ Then compute the grade t(R) = t(x1, ..., xm) of object R. If this grade is one of the k highest we have seen, then remember object R and its grade t(R) (ties are broken arbitrarily, so that only k objects and their grades need to be remembered at any time).

2. For each list Li, let x̄i be the grade of
the last object seen under sorted access. Define the threshold value τ to be t(x̄1, ..., x̄m). As soon as at least k objects have been seen whose grade is at least equal to τ, then halt.

3. Let Y be a set containing the k objects that have been seen with the highest grades. The output is then the graded set {(R, t(R)) | R ∈ Y}.

⁸It may seem wasteful to do random access to find a grade that was already determined earlier. As we discuss later, this is done in order to avoid unbounded buffers.

We now show that TA is correct for each monotone aggregation function t.

Theorem 4.1: If the aggregation function t is monotone, then TA correctly finds the top k answers.

Proof: Let Y be as in Part 3 of TA. We need only show that every member of Y has at least as high a grade as every object z not in Y. By definition of Y, this is the case for each object z that has been seen in running TA. So assume that z was not seen. Assume that the fields of z are x1, ..., xm. Therefore, xi ≤ x̄i for every i. Hence, t(z) = t(x1, ..., xm) ≤ t(x̄1, ..., x̄m) = τ, where the inequality follows by monotonicity of t. But by definition of Y, for every y in Y we have t(y) ≥ τ. Therefore, for every y in Y we have t(y) ≥ τ ≥ t(z), as desired.

We now show that the stopping rule for TA always occurs at least as early as the stopping rule for FA (that is, with no more sorted accesses than FA). In FA, if R is an object that has appeared under sorted access in every list, then by monotonicity, the grade of R is at least equal to the threshold value.
Therefore, when there are at least k objects, each of which has appeared under sorted access in every list (the stopping rule for FA), there are at least k objects whose grade is at least equal to the threshold value (the stopping rule for TA). This implies that for every database, the sorted access cost for TA is at most that of FA. This does not imply that the middleware cost for TA is always at most that of FA, since TA may do more random accesses than FA. However, since the middleware cost of TA is at most the sorted access cost times a constant (independent of the database size), it does follow that the middleware cost of TA is at most a constant times that of FA. In fact, we shall show that TA is instance optimal, under natural assumptions.

We now consider the intuition behind TA. For simplicity, we discuss first the case where k = 1, that is, where the user is trying to determine the top answer. Assume that we are at a stage in the algorithm where we have not yet seen any object whose (overall) grade is at least as big as the threshold value τ. The intuition is that at this point, we do not know the top answer, since the next object we see under sorted access could have overall grade τ, and hence bigger than the grade of any object seen so far. Furthermore, once we do see an object whose grade is at least τ, then it is safe to halt, as we see from the proof of Theorem 4.1. Thus, intuitively, the stopping rule of TA says: "Halt as soon as you know you have seen the top answer." Similarly, for general k, the stopping rule of TA says, intuitively, "Halt as soon as you know you have seen the top k answers." So we could consider TA as being an implementation of the following "program":

Do sorted access (and the corresponding random access) until you know you have seen the top k answers.

This very high-level "program" is a knowledge-based program [FHMV97]. In fact, TA was designed by thinking in terms of this knowledge-based program. The fact that TA corresponds to this knowledge-based program is what is behind instance
optimality of TA. Later, we shall give other scenarios (situations where random accesses are either impossible or expensive) where we implement the following more general knowledge-based program:

Gather what information you need to allow you to know the top k answers, and then halt.

In each of our scenarios, the implementation of this second knowledge-based program is different. When we consider the scenario where random accesses are expensive relative to sorted accesses, but are not impossible, we need an additional design principle to decide how to gather the information, in order to design an instance optimal algorithm.

The next theorem, which follows immediately from the definition of TA, gives a simple but important property of TA that further distinguishes TA from FA.

Theorem 4.2: TA requires only bounded buffers, whose size is independent of the size of the database.

Proof: Other than a little bit of bookkeeping, all that TA must remember is the current top k objects and their grades, and (pointers to) the last objects seen in sorted order in each list. By contrast, FA requires buffers that grow arbitrarily large as the database grows, since FA must remember every object it has seen in sorted order in every list, in order to check for matching objects in the various lists.
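The steps of TA described above can be sketched in code. This is a minimal illustration, not the paper's formulation: each list is a Python list of (object_id, grade) pairs pre-sorted by descending grade, random access is a dict lookup, the aggregation function t is a parameter (e.g. min), and the function returns the top-k as (grade, object_id) pairs in descending order. Every object is assumed to appear in every list.

```python
import heapq

def threshold_algorithm(lists, t, k):
    """Top-k answers via a Threshold Algorithm (TA) sketch.

    lists: m lists of (object_id, grade) pairs, each sorted by grade descending.
    t:     monotone aggregation function taking a tuple of m grades.
    k:     number of answers to return.
    """
    m = len(lists)
    # Random-access structure: grade of every object in every list.
    by_id = [dict(lst) for lst in lists]
    seen = {}   # object_id -> overall grade t(x1,...,xm)
    top = []    # min-heap of (grade, object_id), bounded at size k
    for depth in range(max(len(lst) for lst in lists)):
        last = []   # grades x̄i at the current sorted-access depth
        for i in range(m):
            if depth >= len(lists[i]):
                # List exhausted: its lowest grade keeps the threshold valid.
                last.append(lists[i][-1][1])
                continue
            obj, grade = lists[i][depth]
            last.append(grade)
            if obj not in seen:
                # Random access to the other lists for the overall grade.
                seen[obj] = t(tuple(by_id[j][obj] for j in range(m)))
                heapq.heappush(top, (seen[obj], obj))
                if len(top) > k:
                    heapq.heappop(top)   # keep only the k best
        tau = t(tuple(last))             # threshold value
        # Stopping rule: k seen objects with grade at least tau.
        if len(top) == k and top[0][0] >= tau:
            break
    return sorted(top, reverse=True)
```

Note that, as in Theorem 4.2, the only state kept besides the random-access structure is the bounded heap of k candidates and the grades at the current sorted-access depth.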

Inexact Newton Power Flow Computation Based on Hermitian and Skew-Hermitian Splitting Preconditioned GMRES(m)

Power System Technology, Vol. 33, No. 19, Nov. 2009. Article ID: 1000-3673(2009)19-0123-04; CLC number: TM712; Document code: A; Subject code: 470·40

Inexact Newton Power Flow Computation Based on Hermitian and Skew-Hermitian Splitting Preconditioned GMRES(m)

LIU Kai1, CHEN Hong-kun1, XIANG Tie-yuan1, GAO Zhi-xin2 (1. School of Electrical Engineering, Wuhan University, Wuhan 430072, Hubei Province, China; 2. Central Southern Electric Power Design Institute, Wuhan 430072, Hubei Province, China)

ABSTRACT: The correction equations arising in large-scale power grids are highly sparse. Exploiting this feature, a method for inexact Newton power flow computation based on Hermitian and skew-Hermitian preconditioners is researched. By use of the Hermitian and skew-Hermitian splitting of the matrix, a new type of preconditioner is proposed. Combining the new preconditioner with GMRES(m), both the convergence and the convergence rate of the power flow computation can be improved. Power flow computation results for the IEEE 300-bus system show that the proposed algorithm is effective.

KEY WORDS: power flow calculation; Hermitian and skew-Hermitian splitting; generalized minimal residual algorithm (GMRES(m)); preconditioning
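The idea in the abstract can be sketched numerically. This is a minimal illustration under stated assumptions, not the paper's implementation: the Jacobian is real (so the Hermitian part is just the symmetric part), the splitting is A = H + S with H = (A + Aᵀ)/2 and S = (A − Aᵀ)/2, the HSS preconditioner is M(α) = (1/2α)(αI + H)(αI + S), and the dense `np.linalg.solve` calls stand in for the sparse factorizations one would use on a real grid; the function names are mine.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def hss_preconditioner(A, alpha=1.0):
    """Return M^{-1} as a LinearOperator for the HSS preconditioner
    M = (1/2a)(aI + H)(aI + S), H = (A + A^T)/2, S = (A - A^T)/2."""
    n = A.shape[0]
    I = np.eye(n)
    H = 0.5 * (A + A.T)   # Hermitian (symmetric) part
    S = 0.5 * (A - A.T)   # skew-Hermitian part
    aH = alpha * I + H
    aS = alpha * I + S
    def apply(v):
        # M^{-1} v = 2a * (aI + S)^{-1} (aI + H)^{-1} v
        return 2.0 * alpha * np.linalg.solve(aS, np.linalg.solve(aH, v))
    return LinearOperator((n, n), matvec=apply)

def newton_correction(J, F, alpha=1.0, m=30):
    """One inexact-Newton step: solve J dx = -F with HSS-preconditioned
    restarted GMRES(m)."""
    M = hss_preconditioner(J, alpha)
    dx, info = gmres(J, -F, M=M, restart=m)
    assert info == 0, "GMRES did not converge"
    return dx
```

In an inexact Newton power flow loop, `newton_correction` would be called once per iteration with the current Jacobian J(x) and mismatch vector F(x), and α would be tuned to the spectrum of H.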

"optimal" vs. "optimum": an English usage comparison


Optimal vs. Optimum: A Comparative Analysis

In the English language, there are often multiple words that have similar meanings but subtle differences in usage. Two such words are “optimal” and “optimum”. While they both refer to the best or most favorable option, there are some key distinctions between the two.

The word “optimal” is an adjective that means the most desirable or best possible. It implies that there is a range of options, and the one being described as optimal is the one that offers the greatest benefit or advantage. For example, “The optimal solution to this problem would be to implement a new system.” Here, “optimal” suggests that there are other potential solutions, but the one being proposed is the best among them.

On the other hand, “optimum” is also an adjective, but it is often used in a more specific or technical context. It refers to the point or condition at which something is at its best or most efficient. For instance, “The optimum temperature for this chemical reaction is 50 degrees Celsius.” In this case, “optimum” indicates a specific value or range that is considered the most favorable for a particular process or outcome.

One way to think about the difference between “optimal” and “optimum” is that “optimal” is more subjective and depends on the specific circumstances or goals, while “optimum” is more objective and based on specific criteria or measurements. Another difference is that “optimal” can be used to describe a wide range of situations, while “optimum” is often used in more specialized fields such as science, engineering, or economics.

In some cases, the two words can be used interchangeably, but it is important to be aware of the subtle differences in meaning and usage. Using the wrong word can lead to confusion or a less precise expression of ideas.
For example, saying “The optimum solution to this problem is to do nothing” might sound odd, as “optimum” typically implies a specific action or condition that is considered the best. In this case, “optimal” would be a more appropriate choice.

To further illustrate the differences between “optimal” and “optimum”, let’s consider a few examples:

• In a business context, finding the optimal marketing strategy would involve considering various factors such as target audience, budget, and competition. The goal is to identify the approach that is most likely to lead to success. On the other hand, determining the optimum inventory level would involve analyzing data such as sales trends, lead times, and carrying costs to find the level that minimizes costs while meeting customer demand.

• In a medical setting, choosing the optimal treatment plan for a patient would depend on their specific condition, medical history, and personal preferences. The doctor would aim to select the treatment that offers the best chance of recovery with the least side effects. However, when it comes to setting the optimum dosage of a medication, it would involve precise calculations based on the patient’s weight, age, and other factors to ensure the most effective and safe treatment.

• In a sports context, an athlete might strive for optimal performance by training hard, eating well, and getting enough rest. This would involve finding the right balance between different aspects of their training and lifestyle. On the other hand, a coach might look for the optimum lineup or strategy for a particular game based on the strengths and weaknesses of the team and the opponent.

While “optimal” and “optimum” are similar in meaning, they have distinct nuances that can affect their usage. Understanding these differences can help us communicate more precisely and effectively in various contexts.
Whether we are discussing business, science, or any other field, choosing the right word can make a significant difference in how our ideas are understood. So, the next time you are faced with a choice between “optimal” and “optimum”, take a moment to consider the specific context and intended meaning to ensure you are using the most appropriate word.

Optimal modulation order (English essay)


Optimal Modulation Order

The concept of modulation order is a crucial aspect in the field of digital communications, as it directly impacts the performance and efficiency of data transmission. The modulation order refers to the number of distinct signal levels or symbols used to represent the transmitted information. The choice of an optimal modulation order is a trade-off between various factors, including data rate, spectral efficiency, and error probability.

One of the primary considerations in selecting the optimal modulation order is the desired data rate. Higher modulation orders, such as 64-QAM or 256-QAM, can transmit more bits per symbol, allowing for a higher data rate. This is particularly important in applications where bandwidth is limited and the need for high-speed data transfer is paramount. However, as the modulation order increases, the signal constellation becomes more densely packed, making the system more susceptible to noise and interference and leading to a higher probability of errors.

Spectral efficiency is another crucial factor in the selection of the optimal modulation order. Spectral efficiency refers to the amount of information that can be transmitted within a given bandwidth. Higher modulation orders generally provide better spectral efficiency, as they can transmit more bits per second per hertz of bandwidth. This is particularly advantageous in scenarios where the available bandwidth is limited, such as in wireless communications or satellite links.

The error probability is a crucial consideration when choosing the optimal modulation order. As the modulation order increases, the signal constellation becomes more densely packed, and the distance between adjacent signal points decreases. This makes the system more vulnerable to noise, interference, and other impairments, leading to a higher probability of errors in the received data.
In applications where reliability and integrity of the transmitted data are paramount, such as in mission-critical systems or financial transactions, a lower modulation order with a lower error probability may be preferred.

Another factor to consider is the available transmit power. Higher modulation orders typically require higher signal-to-noise ratios (SNRs) to achieve a desired level of performance. In scenarios where the available transmit power is limited, such as in battery-powered devices or satellite communications, a lower modulation order may be more appropriate to maintain an acceptable error rate.

The choice of the optimal modulation order also depends on the characteristics of the communication channel. Certain channel conditions, such as multipath fading, can introduce more distortion and interference, making higher modulation orders more susceptible to errors. In such scenarios, a lower modulation order may be more suitable to maintain reliable communication.

In practice, the selection of the optimal modulation order is often a dynamic process that takes into account the specific requirements of the application, the prevailing channel conditions, and the available resources. Adaptive modulation schemes, where the modulation order is adjusted in real time based on changing channel conditions, have become increasingly popular in modern communication systems to optimize performance and efficiency.

In conclusion, the selection of the optimal modulation order is a complex and multifaceted decision that involves various trade-offs and considerations. By carefully balancing factors such as data rate, spectral efficiency, error probability, and available resources, communication system designers can optimize the performance and efficiency of their systems to meet the specific needs of their applications.
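The adaptive-modulation idea above can be sketched numerically. This is a minimal illustration, not a standard's algorithm: it uses the common textbook approximation for the symbol error rate (SER) of square M-QAM over an AWGN channel, SER ≈ 4(1 − 1/√M)·Q(√(3·SNR/(M − 1))), and picks the highest order whose predicted SER stays under a target; the SER target and candidate orders are illustrative choices.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mqam_ser(M, snr_linear):
    """Approximate symbol error rate of square M-QAM on an AWGN channel."""
    return 4.0 * (1.0 - 1.0 / math.sqrt(M)) * \
        q_func(math.sqrt(3.0 * snr_linear / (M - 1)))

def pick_modulation(snr_db, target_ser=1e-3, orders=(4, 16, 64, 256)):
    """Return the highest modulation order meeting the SER target,
    or None if even the lowest order fails at this SNR."""
    snr = 10.0 ** (snr_db / 10.0)   # dB -> linear
    feasible = [M for M in orders if mqam_ser(M, snr) <= target_ser]
    return max(feasible) if feasible else None
```

Each step from 4-QAM to 16-QAM to 64-QAM to 256-QAM adds two bits per symbol (log2 M = 2, 4, 6, 8) but, by the formula above, needs roughly 6 dB more SNR to hold the same error rate, which is the data-rate vs. error-probability trade-off described in the text.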

Silver Peak Unity EdgeConnect SD-WAN Product Datasheet

Key Features

> Single Screen Administration: Enables rapid and consistent implementation of network-wide business intent policies, eliminating many of the repetitive and mundane manual steps required to configure and connect remote offices and branch locations

> Centralized Orchestration and Policy Automation: Empowers network administrators to centrally define and orchestrate granular security policies and create secure end-to-end zones across any combination of users, application groups and virtual overlays, pushing configurations to sites in accordance with business intent. In addition, it offers seamless drag-and-drop service chaining to next-generation security services

Unity Orchestrator offers customers the unique ability to centrally assign business intent policies to secure and control all Silver Peak Unity EdgeConnect software-defined Wide Area Network (SD-WAN) traffic. An intuitive user interface provides unprecedented levels of visibility into both data center and cloud-based applications.

> Live View: Monitors real-time throughput, loss, latency and jitter across business intent overlays and the underlying transport services to proactively identify potential performance impacts

> Granular Real-Time Monitoring and Historical Reporting: Provides specific details into application, location, and network statistics, including continuous performance monitoring of loss, latency, and packet ordering for all network paths; identifies all web and native application traffic by name and location, and alarms and alerts allow for faster resolution of network issues

> Bandwidth Cost Savings Reports: Documents the cost savings for moving to broadband connectivity

SD-WAN Deployments Done Faster

Unity Orchestrator™ enables secure zero-touch provisioning of Unity EdgeConnect™ appliances in the branch.
Orchestrator automates the assignment of business intent policies to ensure faster and easier connectivity across multiple branches, eliminating the configuration drift that can come from manually updating rules and access control lists (ACLs) on a site-by-site basis. With Orchestrator, customers can:

> Avoid WAN reconfigurations by delivering applications to users in customized virtual overlays
> Align application delivery to business goals through business intent policies
> Simplify branch deployments with EdgeConnect Profiles that describe the virtual and physical configuration of the location

Real-Time Health Monitoring and Historical Reporting

Orchestrator provides specific details into SD-WAN health and performance:

> Appliance dashboard displays a centralized summary of appliances connected on the network, top talkers, applications, topology map and more
> Health map provides a high-level view of EdgeConnect appliance status and network health based on configured thresholds for packet loss, latency and jitter
> Monitoring and reporting tools generate and schedule multiple customized reports to track a variety of performance metrics; reports may be scheduled on a regular basis and automatically sent to specific individuals or departments

Figure 2: Unity Orchestrator enables centralized definition and automated distribution of network-wide business intent policies to multiple branch offices.

Figure 1: A matrix view from Orchestrator provides an easy-to-read, intuitive visualization of configured zones and defined whitelist exceptions.

Gain Control over the Cloud

Gain an accurate picture of how Software-as-a-Service (SaaS) and Infrastructure-as-a-Service (IaaS) are being used within the organization.

> Name-based identification and reporting of all cloud and data center-hosted applications
> Tracking of SaaS provider network traffic
> Cloud Intelligence provides internet mapping of optimal egress to SaaS services

Flexible Deployment

> On-premise: Deploy Orchestrator as a
virtual machine in an existing environment
> Private cloud: Deploy Orchestrator as a virtual instance within Amazon Web Services (AWS)
> Cloud-hosted Orchestrator: A Silver Peak cloud-hosted Orchestrator provides a highly reliable, zero-CAPEX alternative deployment model. With an optional license, organizations can subscribe to Orchestrator as a software service that supports all Orchestrator features without the complexity of managing on-premise virtual compute and storage resources. A unique Orchestrator instance for each enterprise customer ensures secure SD-WAN management, monitoring and reporting.

Orchestrator Licensing

> Unity Orchestrator, hosted on premise or in a private cloud, is included with the purchase of Unity EdgeConnect (see the Unity EdgeConnect data sheet)
> Optional cloud-hosted Orchestrator requires a separate subscription

Figure 4: Unity Orchestrator Dashboard summarizes overall SD-WAN health, appliance status, topology and top applications.

Figure 3: Unity Orchestrator monitoring report on application consumption.

Delivering Real Business Value

EdgeConnect is the most agile SD-WAN unified platform, and it also powers industry-leading performance improvements to any form of connectivity.
Silver Peak customers benefit from significant:

> Performance: End-user satisfaction and productivity are significantly improved due to consistent and enhanced performance and availability for both legacy and cloud applications.

> Visibility and Control: Customers benefit from unprecedented levels of visibility into both legacy and cloud applications.

> Security: Centralized segmentation of users, applications and WAN services into secure zones, and automated application traffic steering across the LAN and WAN in compliance with predefined security policies, regulatory mandates and business intent.

> Extensibility: Fully compatible with existing WAN infrastructure hardware and transport services, customers can rapidly and non-disruptively augment or replace their MPLS networks with any form of broadband connectivity. Furthermore, customers can replace conventional routers with EdgeConnect SD-WAN, which consolidates network functions like SD-WAN, WAN optimization, routing and security into a single software instance, all managed centrally from the Orchestrator. Easy integration with orchestration systems is provided via RESTful APIs.

> Savings: With EdgeConnect, customers can dramatically lower connectivity, equipment and network administration costs; these savings are achieved through:
> Reduction in bandwidth costs by actively using broadband connectivity
> OPEX: Reducing the time and expertise needed to connect branch offices
> CAPEX: Reducing appliance sprawl and moving to a "thin branch" architecture

SP-DS-ENT-UNITY-ORCHESTRATOR-091918

Effect of nano-silica on the hydration and microstructure development of Ultra-High Performance Concrete (UHPC) with a low binder amount

Effect of nano-silica on the hydration and microstructure development of Ultra-High Performance Concrete (UHPC) with a low binder amount

R. Yu*, P. Spiesz, H.J.H. Brouwers
Department of the Built Environment, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Highlights
- A dense skeleton of UHPC can be obtained with a relatively low binder amount.
- To get the highest mechanical properties, an optimal amount of nano-silica is calculated.
- The combined effect of SP and nano-silica additions on the hydration of UHPC is analyzed.
- The mechanism of the microstructure development of UHPC is analyzed.

Article info
Article history: Received 17 January 2014; Received in revised form 21 March 2014; Accepted 4 April 2014
Keywords: Ultra-High Performance Concrete (UHPC); Low binder amount; Nano-silica; Hydration; Microstructure development

Abstract
This paper presents the effect of nano-silica on the hydration and microstructure development of Ultra-High Performance Concrete (UHPC) with a low binder amount. The design of UHPC is based on the modified Andreasen and Andersen particle packing model. The results reveal that by utilizing this packing model, a dense and homogeneous skeleton of UHPC can be obtained with a relatively low binder amount (about 440 kg/m³). Moreover, due to the high amount of superplasticizer utilized to produce UHPC in this study, the dormant period of the cement hydration is extended. However, due to the nucleation effect of nano-silica, the retardation effect from superplasticizer can be significantly compensated. Additionally, with the addition of nano-silica, the viscosity of UHPC significantly increases, which causes that more air is entrapped in the fresh mixtures and the porosity of the hardened concrete correspondingly increases. On the contrary, due to the nucleation effect of nano-silica, the hydration of cement can be promoted and more C–S–H gel can be generated. Hence, it can be concluded that there is an optimal nano-silica amount for the
production of UHPC with the lowest porosity, at which the positive effect of the nucleation and the negative influence of the entrapped air can be well balanced. © 2014 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, with the development of new plasticizing concrete admixtures and fine pozzolanic materials, it has become possible to produce high performance concrete (HPC) or even Ultra-High Performance Concrete (UHPC). For the production of UHPC, the pozzolanic materials (silica fume, ground granulated blast-furnace slag, fly ash) are widely utilized [1–3]. Due to its high purity and high specific surface area, the pozzolanic reaction of silica fume is fast and could more effectively promote the strength development of concrete, compared to the other pozzolanic materials [4]. Nevertheless, the new developments of nano-technology guarantee that various forms of nano-sized amorphous silica can be produced, which have higher specific surface areas and activities compared to conventional silica fume [5,6]. Hence, a considerable investigation effort has been paid to clarifying their effect on the properties of concrete.

In the available literature it can be found that with the addition of nano-silica to cement or concrete, even at small dosages, nano-silica can significantly improve the mechanical properties of cementitious materials [7]. For instance, Nazari and Riahi [8] showed that a 70% compressive strength improvement of concrete can be achieved with an addition of 4% (by mass of cement) of nano-silica. Li et al. [9] found that when 3% and 5% nano-silica were added to plain cement mortar, the compressive strength increased by 13.8% and 17.5% at 28 days, respectively. However, some contradictory experimental results can also be found in the literature. For example, Senff et al. [10] found that the contribution of nano-SiO2, nano-TiO2, and nano-SiO2 plus nano-TiO2, defined by factorial design, did not lead to any significant effect on the compressive strength. Moreover, they also found that the
values of torque, yield stress and plastic viscosity of mortars with nano-additives increased significantly. According to the experimental results of Ltifi [11], even a lower compressive strength of samples with 3% nano-silica can be observed, compared to the plain specimens. The difference of these experimental results should be attributed to the basic characteristics of the nano-silica (e.g. pozzolanic activity, specific surface area).

[doi: 10.1016/j.conbuildmat.2014.04.063; 0950-0618/© 2014 Elsevier Ltd. All rights reserved. * Corresponding author. Tel.: +31 (0)40 247 5469; fax: +31 (0)40 243 8595. E-mail address: r.yu@tue.nl (R. Yu).]

To interpret the influence of nano-silica on the cement hydration, some theoretical mechanisms can be found in the available literature. For example, Land and Stephan [5] observed that the hydration heat of Ordinary Portland Cement blended with nano-silica in the main period increases significantly with an increasing surface area of silica. Thomas et al. [12] showed that the hydration of tri-calcium silicate (C3S) can be accelerated by the addition of nano-scaled silica or C–S–H particles. Björnström et al. [6] monitored the hydration process of C3S pastes and the accelerating effects of a 5 nm colloidal silica additive on the rate of C3S phase dissolution, C–S–H gel formation and removal of non-hydrogen bound OH groups. However, it can be noticed that the investigation of the effect of nano-silica on cement hydration and microstructure development of UHPC is insufficient [13]. This should be owed to the fact that nano-silica is still a relatively new material for the application in high performance concrete, and its price is much higher compared to that of silica fume or other pozzolanic materials [14]. Additionally, for the production of UHPC, the water/binder ratio is relatively low and a large amount of superplasticizer is commonly used, which means it is not suitable to predict the pozzolanic activity of nano-silica at a low water/binder ratio based on the experimental results
that are obtained for high water/binder ratios. Therefore, how the additional nano-silica influences the cement hydration and microstructure development of UHPC remains an open question.

As commonly known, the cement production is said to represent about 7% of the total anthropogenic CO2 emissions [15–17]. Hence, one of the key sustainability challenges for the next decades is to design and produce concrete with less clinker, inducing lower CO2 emissions than traditional concrete, while providing the same reliability and a much better durability. Owing to the superior properties of UHPC, the cross-section of a structural element can be more slender and the total cement consumption may be reduced. Moreover, to further minimize the negative influence of UHPC on sustainable development, one of the most effective methods is to reduce the cement amount without any significant sacrifice of the mechanical properties. From the available literature, it can be found that an optimal packing of granular ingredients is the key for a strong and durable concrete [18–20]. Hence, utilization of optimal packing models to design and produce UHPC should be a possible way to reduce the cement amount and increase the cement efficiency.

Consequently, based on these premises, the objective of this study is to investigate the effect of nano-silica on the hydration and microstructure development of UHPC with a low binder amount.
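The packing-based mix design route used in this study (fitting the composed grading to the modified Andreasen and Andersen target curve of Eq. (2) by least squares, Eq. (3)) can be sketched numerically. This is a minimal illustration, not the authors' optimization tool: the ingredient gradings are made up, the function names are mine, and the non-negative least-squares solver with a heavily weighted sum-to-one row is one simple way to impose the constraints.

```python
import numpy as np
from scipy.optimize import nnls

def aa_target(D, d_min, d_max, q=0.23):
    """Modified Andreasen & Andersen target curve (Eq. (2)):
    P(D) = (D^q - D_min^q) / (D_max^q - D_min^q)."""
    return (D**q - d_min**q) / (d_max**q - d_min**q)

def fit_proportions(gradings, D, d_min, d_max, q=0.23):
    """Least-squares mix optimization in the spirit of Eq. (3): choose
    non-negative mass fractions w, summing to ~1, so that the composed
    grading sum_k w_k * grading_k best matches the target curve.

    gradings: (n_materials, n_sizes) cumulative passing fractions."""
    target = aa_target(D, d_min, d_max, q)
    n_mat = gradings.shape[0]
    # Heavily weighted extra row softly enforces sum(w) = 1.
    A = np.vstack([gradings.T, 100.0 * np.ones((1, n_mat))])
    b = np.concatenate([target, [100.0]])
    w, _ = nnls(A, b)
    return w

# Illustrative example with two synthetic ingredient gradings:
D = np.logspace(np.log10(0.1), np.log10(2000.0), 25)  # 0.1 um .. 2 mm
fine = aa_target(D, 0.1, 2000.0)     # fictitious material matching the target
coarse = D / 2000.0                  # fictitious, poorly graded material
w = fit_proportions(np.vstack([fine, coarse]), D, 0.1, 2000.0)
```

With these synthetic inputs the solver should put nearly all the mass on the first material, since its grading already matches the q = 0.23 target; in the study itself the same fit is performed over the real gradings of cement, fillers and sands.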
The design of the concrete mixtures is based on the aim to achieve a densely packed material, by employing the modified Andreasen and Andersen particle packing model. The properties of the designed concrete, including the fresh and hardened behavior, are evaluated in this study. Techniques such as isothermal calorimetry, thermal analysis and scanning electron microscopy are employed to evaluate the cement hydration and microstructure development of the UHPC.

2. Materials and experimental methodology

2.1. Materials

The cement used in this study is Ordinary Portland Cement (OPC) CEM I 52.5R, provided by ENCI (the Netherlands). A polycarboxylic ether based superplasticizer is used to adjust the workability of UHPC. The limestone and quartz powder are used as fine fillers. Two types of sand are used: one is normal sand with the fractions of 0–2 mm and the other one is microsand with the fraction 0–1 mm (Graniet-Import Benelux, the Netherlands). A nano-silica slurry is selected as pozzolanic material to be used in this study. Short straight steel fibers (length of 13 mm and diameter of 0.2 mm) are employed to further improve the mechanical properties of the designed concrete. The detailed information about the used materials is summarized in Table 1 and Fig. 1. Additionally, the characterization and chemical analysis of the used nano-silica are shown in Tables 2 and 3, respectively.

2.2. Experimental methodology

2.2.1. Mix design of UHPC

For the design of mortars and concretes, several mix design tools are in use. Based on the properties of multimodal, discretely sized particles, De Larrard and Sedran [21,22] postulated different approaches to design concrete: the Linear Packing Density Model (LPDM), Solid Suspension Model (SSM) and Compressive Packing Model (CPM). Fennis et al. [23] have developed a concrete mix design method based on the concepts of De Larrard and Sedran [21,22]. However, all these design methods are based on the packing fraction of individual components (cement, sand, etc.)
and their combinations, and therefore it is complicated to include very fine particles in these mix design tools, as it is difficult to determine the packing fraction of such very fine materials or their combinations.

[R. Yu et al. / Construction and Building Materials 65 (2014) 140–150]

Nomenclature
D_max: maximum particle size (μm)
D_min: minimum particle size (μm)
f_c: compressive strength of UHPC at 28 days (N/mm²)
K_t: strength improvement (%)
m_binder: total mass of the used binder (kg)
m_d: mass of oven-dried sample (g)
m_s: mass of surface-dried and water-saturated sample in air (g)
m_w: hydrostatic mass of water-saturated sample (g)
M_CaCO3: mass change of UHPC paste caused by the decomposition of CaCO3 (g)
M_i: mass of the fraction i in solid materials (g)
M_j: mass of the fraction j in liquid materials (g)
M′_Water: mass of non-evaporable water (g)
M_Water-Full: water requirement for full hydration of cement (g)
M_105: mass of UHPC paste after heat treatment at 105 °C for 2 h (g)
M_1000: mass of UHPC paste after heat treatment at 1000 °C for 2 h (g)
P_mix: composed mix curve
P_tar: target curve
q: distribution modulus
RSS: sum of the squares of the residuals
S_i: strength of UHPC with nano-silica (i corresponds to nano-silica amount) (MPa)
S_0: strength of UHPC without nano-silica (reference sample) (MPa)
V_container: volume of the container (cm³)
V_liquid: volume of liquid in the container (cm³)
V_solid: volume of solid particles in the container (cm³)
X_binder: binder efficiency (N/mm²)/(kg/m³)
β_t: degree of cement hydration at hydration time t (days) (%)
φ_v,water: water-permeable porosity (%)
φ_air: air content in UHPC (%)
ρ_i: density of the fraction i in solid materials (g/cm³)
ρ_j: density of the fraction j in liquid materials (g/cm³)

Another possibility for mix design is offered by an integral particle size distribution approach of continuously graded mixes, in which the extremely fine particles can be integrated with a relatively lower effort, as detailed in the following. Based on the investigation of Fuller and Thompson [24] and Andreasen and Andersen [25], a
minimal porosity can be theoretically achieved by an optimal particle size distribution (PSD) of all the applied particle materials in the mix, as shown in the following equation:

P(D) = (D / D_max)^q    (1)

where P(D) is a fraction of the total solids being smaller than size D, D the particle size (μm), D_max the maximum particle size (μm), and q is the distribution modulus. However, in Eq. (1) the minimum particle size is not incorporated, while in reality there must be a finite lower size limit. Hence, Funk and Dinger [26] proposed a modified model based on the Andreasen and Andersen equation. In this study, all the concrete mixtures are designed based on this so-called modified Andreasen and Andersen model, which is shown as follows [26]:

P(D) = (D^q − D_min^q) / (D_max^q − D_min^q)    (2)

where D_min is the minimum particle size (μm).

The modified Andreasen and Andersen packing model has already been successfully employed in optimization algorithms for the design of normal density concrete [20] and lightweight concrete [27,28]. Different types of concrete can be designed using Eq. (2) by applying different values of the distribution modulus q, as it determines the proportion between the fine and coarse particles in the mixture. Higher values of the distribution modulus (q > 0.5) lead to coarse mixtures, while lower values (q < 0.25) result in concrete mixes which are rich in fine particles [29]. Brouwers [30,18] demonstrated that theoretically a q value in the range of 0–0.28 would result in an optimal packing. Hunger [20] recommended using q in the range of 0.22–0.25 in the design of SCC. Hence, in this study, considering that a large amount of fine particles is utilized to produce UHPC, the value of q is fixed at 0.23, as shown in [51,52].

In this research, the modified Andreasen and Andersen model (Eq. (2)) acts as a target function for the optimization of the composition of the mixture of granular materials. The proportions of each individual material in the mix are adjusted until an optimum fit between the composed mix and
the target curve is reached, using an optimization algorithm based on the Least Squares Method (LSM), as presented in Eq. (3). When the deviation between the target curve and the composed mix, expressed by the sum of the squares of the residuals (RSS) at defined particle sizes, is minimized, the composition of the concrete is treated as the best one [19]:

RSS = Σ_{i=1}^{n} (P_mix(D_i^{i+1}) − P_tar(D_i^{i+1}))²    (3)

where P_mix is the composed mix and P_tar is the target grading calculated from Eq. (2).

According to the optimized particle packing model, the developed UHPC mixtures are listed in Table 4. It can be noticed that the cement amount of the UHPC is relatively low, around 440 kg/m³. Moreover, a large amount of powder materials (limestone and quartz powder) is utilized to replace cement in this study. Based on the results presented in [53], the used powder materials are not considered as binder in the concrete. Furthermore, the nano-silica is added in the amount of 1%, 2%, 3%, 4% and 5% as a cement replacement.

Table 1. Properties of the used materials.
Material | Specific density (g/cm³) | Solid content (% w/w) | pH
CEM I 52.5R | 3.15 | – | –
Limestone powder | 2.71 | – | –
Quartz powder | 2.66 | – | –
Microsand (sandstone) | 2.72 | – | –
Sand 0–2 | 2.64 | – | –
Superplasticizer | 1.05 | 35 | 7.0
Steel fiber | 7.80 | – | –

Table 2. Characterization of the used nano-silica.ᵃ
Type | Slurry
Stabilizing agent | Ammonia
Specific density (g/cm³) | 2.2
pH (at 20 °C) | 9.0–10.0
Solid content (% w/w) | 20
Viscosity (mPa·s) | ≤ 100
BET (m²/g) | 22.7
PSD by LLS (μm) | 0.05–0.3
Mean particle size (μm) | 0.12
ᵃ Data obtained from the supplier.

Table 3. Chemical analysis of the used cement and nano-silica.
Substance | Cement (mass %) | Nano-silica slurry (mass %)
Al2O3 | 4.80 | 0.367
SiO2 | 20.27 | 98.680
Na2O | 0.26 | 0.315
P2O5 | – | 0.220
K2O | 0.53 | 0.354
CaO | 63.71 | 0.089
TiO2 | – | 0.007
Fe2O3 | 3.43 | 0.004
CuO | – | 0.001
ZrO2 | – | 0.003
Ag2O | – | 0.022
MgO | 1.58 | –
SO3 | 2.91 | –
Cl− | – | 0.037
L.O.I. | 2.51 | –

Hence, based on the results of the fresh and hardened properties of the designed UHPC, it is possible to evaluate the influence
of nano-silica on the properties of UHPC.An example of the target curve and the resulting integral grading curve of the UHPC is shown in Fig.2.Additionally,Table 5shows characteristics of each recipe listed in Table 4,which indicates the water/cement ratio,water/binder ratio,water/powder ratio,SP content by the weight of cement and SP content by the weight of powder.Due to the fact that a large amount of cement has already been replaced by limestone and quartz powder,the water to cement ratio is relatively high (about 0.4).How-ever,the water/powder ratios of all the batches are still relatively low (around 0.18).Considering that the used water can be significantly absorbed by the powder materials,the SP amount by the weight of powder is fixed at about 4.5%.Finally,to further improve the mechanical properties of the UHPC,around 2.5%(vol.)steel fibers are added into the designed concrete matrix.2.2.2.Mixing proceduresIn this study,all powder ingredients (particle size <125l m)and sand fractions are slowly blended for 30s in a dry state.Then,nano-silica slurry and around 75%of the total amount of water are slowly added and mixed for another 90s.After that,the mixing is stopped for 30s,in which the first 20s are used to scratch the mate-rials from the wall of the mixing bowl and the paddle.Afterwards,the remaining mixing water and superplasticizer are added into the mixer,during the consecutive mixing for 180s at slow speed.For the samples with steel fibers,about 2.5%(vol.)of short steel fibers are added when a flowable UHPC matrix is obtained.The last step is mixing for 120s at high speed.The mixing is always executed under laboratory conditions with dried and tempered aggregates and powder materials.The room temperature while mixing,testing and casting is constant at around 20±1°C.2.2.3.WorkabilityAfter mixing,the suspension is filled into a conical mold in the form of a frus-tum (as described in EN 1015-3[29]).During the test,the cone is lifted straight upwards in 
order to allow free flow of the paste without any jolting. In the test, two diameters perpendicular to each other are determined, and their mean is recorded as the slump flow value of the designed UHPC.

2.2.4. Air content
The air content of UHPC is experimentally determined with the following procedure. The fresh mixes are filled into cylindrical containers of known volume and vibrated for 30 s. The exact volume of the containers is determined beforehand using demineralized water at 20 °C. In order to avoid the generation of menisci at the water surface, the completely filled container is covered with a glass plate, whose mass is determined beforehand. Hence, based on the assumption that the fresh concrete is a homogeneous system, the air content of the concrete can be derived from the following equation:

φ_air = (V_container − V_solid − V_liquid) / V_container    (4)

where φ_air is the air content (%, V/V) of UHPC, V_container the volume of the cylindrical container mentioned before, and V_solid and V_liquid are the volumes of solid particles and liquid in the container (cm³).

As the composition of each mixture is known, the mass percentage of each ingredient can be computed. Because it is easy to measure the total mass of concrete in the container, the individual masses of all materials in the container can be obtained. Applying the density of the respective ingredients, the volume percentages of each mix constituent can be computed. Hence,

V_solid = Σ_i (M_i / ρ_i)    (5)

and

V_liquid = Σ_j (M_j / ρ_j)    (6)

where M_i and ρ_i are the mass (g) and density (g/cm³) of fraction i of the solid materials, and M_j and ρ_j are the mass (g) and density (g/cm³) of fraction j of the liquid materials, respectively. The schematic diagram for calculating the air content in concrete is shown in Fig. 3.

Table 4 — Mix recipe of the UHPC with nano-silica (all values in kg/m³).
Mix       Cement   Limestone   Quartz   Microsand   Sand 0–2   Nano-silica   Water   SP
Ref.      439.5    263.7       175.9    218.7       1054.7     0             175.8   43.9
UHPC-1%   435.1    263.7       175.9    218.7       1054.7     4.4           175.8   43.9
UHPC-2%   430.7    263.7       175.9    218.7       1054.7     8.8           175.8   43.9
UHPC-3%   426.3    263.7       175.9    218.7       1054.7     13.2          175.8   43.9
UHPC-4%   421.9    263.7       175.9    218.7       1054.7     17.6          175.8   43.9
UHPC-5%   417.5    263.7       175.9    218.7       1054.7     22.0          175.8   43.9

Table 5 — Characteristics of the designed concrete recipes.
Mix       w/c     w/b     w/p     SP content (%bwoc)   SP content (%bwop)
Ref.      0.400   0.400   0.180   9.987                4.486
UHPC-1%   0.404   0.400   0.180   10.090               4.486
UHPC-2%   0.408   0.400   0.180   10.193               4.486
UHPC-3%   0.412   0.400   0.180   10.298               4.486
UHPC-4%   0.417   0.400   0.180   10.405               4.486
UHPC-5%   0.421   0.400   0.180   10.515               4.486
w: water, c: cement, b: binder, p: powder (particle size < 125 µm), bwoc: by the weight of cement, bwop: by the weight of powder.

2.2.5. Mechanical properties
The concrete samples (with and without steel fibers) are cast in molds with the size of 40 mm × 40 mm × 160 mm. The prisms are demolded approximately 24 h after casting and then cured in water at about 20 ± 1 °C. After curing for 3, 7 and 28 days, the flexural and compressive strengths of the specimens are tested according to EN 196-1 [31]. At least three specimens are tested at each age to compute the average strength.

2.2.6. Porosity
The porosity of the hardened UHPC is measured applying the vacuum-saturation technique, which is referred to as the most efficient saturation method [32]. The saturation is carried out on at least 3 samples (100 mm × 100 mm × 20 mm) for each mix, following the descriptions given in NT Build 492 [33] and ASTM C1202 [34]. The water-permeable porosity is calculated from the following equation:

φ_v,water = (m_s − m_d) / (m_s − m_w) · 100    (7)

where φ_v,water is the water-permeable porosity (%), m_s the mass of the saturated sample in surface-dry condition measured in air (g), m_w the hydrostatic mass of the water-saturated sample (g), and m_d the mass of the oven-dried sample (g).

2.2.7. Calorimetry analysis
Based on the recipes shown in Table 4, cement, limestone and quartz powder are mixed with the nano-silica slurry, superplasticizer and
water. Nano-silica/binder mass ratios from 1% to 5% are investigated. Moreover, another reference sample (reference-2) with only cement and water (water/cement = 0.4) is simultaneously designed and prepared. All the pastes are mixed for 2 min and then injected into a sealed glass ampoule, which is then placed into the isothermal calorimeter (TAM Air, Thermometric). The instrument is set to a temperature of 20 °C. After 7 days, the measurement is stopped and the obtained data are analyzed. All results are ensured by double measurements (two-fold samples).

2.2.8. Thermal analysis
A Netzsch simultaneous analyzer, model STA 449 C, is used to obtain the thermogravimetric (TG) and differential scanning calorimetry (DSC) curves of concrete pastes. According to the recipes shown in Table 4, the pastes are produced without any aggregates. After curing in water for 28 days, the hardened samples are ground to powder. Analyses are conducted at a heating rate of 10 °C/min from 20 °C to 1000 °C under flowing nitrogen.

Based on the TG test results, the hydration degree of the cement is calculated. Here, loss-on-ignition (LOI) measurements of the non-evaporable water content of hydrated UHPC paste are employed to estimate the hydration degree of cement [35]. Assuming that the UHPC paste is a homogeneous system, the non-evaporable water content is determined according to the following equation:

M′_water = M_105 − M_1000 − M_CaCO3    (8)

where M′_water is the mass of non-evaporable water (g), M_105 the mass of UHPC paste after heat treatment at 105 °C for 2 h (g), M_1000 the mass of UHPC paste after heat treatment at 1000 °C for 2 h (g), and M_CaCO3 is the mass change of UHPC paste caused by the decomposition of CaCO3 during heating (g). Then, the hydration degree of the cement in UHPC paste is calculated as:

β_t = M′_water / M_Water-Full    (9)

where β_t is the cement hydration degree at hydration time t (%) and M_Water-Full is the water required for the full hydration of cement (g). Based on the investigation results shown in [36], the value of M_Water-Full employed in this study is 0.256.

2.2.9. Scanning electron microscopy analysis
Scanning electron microscopy (SEM) is employed to study the microstructure of UHPC. After curing for 28 days, the specimens are cut into small fragments and soaked in ethanol for over 7 days, in order to stop the hydration of the cement. Subsequently, the samples are dried and stored in a sealed container before the SEM imaging.

3. Experimental results and discussion

3.1. Slump flow
The slump flow of the fresh UHPC mixes versus the amount of nano-silica is depicted in Fig. 4. The data illustrate the direct relation between the nano-silica amount and the workability of fresh UHPC. It is important to notice that with the addition of nano-silica, the slump flow of fresh UHPC decreases linearly. For the UHPC developed here, the slump flow value of the reference sample is 33.75 cm, which sharply drops to about 22.5 cm when about 5% of nano-silica is added. This behavior is in accordance with the results shown in [10], which indicate that the addition of nano-silica greatly increases the water demand of cementitious mixes. One hypothesis explaining this is that the presence of nano-silica decreases the amount of lubricating water available within the interparticle voids, which causes an increase of the yield stress and plastic viscosity of the concrete [10]. Hence, in this study, the plastic viscosity of the UHPC significantly increases with an increasing nano-silica amount, which in turn causes its workability to decrease markedly.

3.2. Air content and porosity
The air content of UHPC in the fresh state and the porosity of UHPC in the hardened state are presented in Fig. 5. As can be observed, the air content of the reference sample (without nano-silica) is relatively low (2.01%), which should be attributed to the dense particle packing of the designed UHPC. However, with an increasing amount of nano-silica, the air content of UHPC clearly increases. When the additional nano-silica amount is about 5%, the air content in fresh
UHPC sharply increases to about 3.5%. To interpret this phenomenon, the effect of nano-silica on the viscosity of UHPC should be considered. With an increasing amount of nano-silica, the viscosity of UHPC significantly increases [10], which means that entrapped air cannot easily escape from the fresh concrete. Hence, the air content of fresh concrete with a high content of nano-silica is relatively large.

Compared to the air content results, the porosity of hardened UHPC shows a very different development tendency. With the addition of nano-silica, the porosity of UHPC first decreases, and then slightly increases after reaching a critical value. In this study, the porosity of the reference sample is about 10.5%, and it sharply decreases to about 9.5% when 4% of nano-silica is added. However, the porosity of the sample with 5% of nano-silica increases to about 9.8% again. This phenomenon should be attributed to the positive effect of nano-silica on the cement hydration. As commonly known, due to the nucleation effect of nano-silica, the formation of the C–S–H phase is no longer restricted to the grain surface alone, which causes the hydration degree of the cement to be higher, so that more pores can be filled by the newly generated C–S–H [12]. For this reason, the porosity of UHPC first decreases with the addition of nano-silica. However, because the air content of UHPC increases with the addition of nano-silica, the porosity of UHPC will increase again once the newly generated C–S–H is insufficient to compensate for the pores generated by the entrapped air in the fresh concrete. Consequently, considering these two opposite processes, there is an optimal value of the nano-silica amount at which the lowest porosity of UHPC can be obtained.

3.3. Mechanical properties
The flexural and compressive strengths of UHPC at 3, 7 and 28 days versus the nano-silica amounts are shown in Fig. 6. With the addition of nano-silica, a parabolic growth tendency of the flexural and compressive strength of UHPC can be observed. For example, the flexural strength of the reference sample at 28 days is about 10.4 MPa, which gradually increases to about 14.0 MPa when 4% of nano-silica is added. Afterwards, this value slightly decreases to about 13.2 MPa when 5% of nano-silica is included. Hence, there should be an optimal amount of nano-silica at which the flexural or compressive strength of the designed UHPC can theoretically be the largest.

According to the regression equation of each parabola shown in Fig. 6, when the differential coefficient (y′) of each function equals zero, the value of x represents the optimal nano-silica amount for the best mechanical properties. Hence, the optimal nano-silica amounts and the computed maximum flexural and compressive strengths at 3, 7 and 28 days are shown in Table 6. As can be seen, the optimal nano-silica amounts for the flexural strength at 3, 7 and 28 days are 3.72%, 3.46% and 3.70%, respectively, which increase to 3.92%, 3.90% and 4.29% for the compressive strength. Based on these optimal nano-silica amounts, the computed maximum strengths are 9.99 MPa, 12.46 MPa and 13.79 MPa for the flexural strength and 56.99 MPa, 69.63 MPa and 88.93 MPa for the compressive strength. Moreover, it is important to notice that the computed maximum compressive strength at 28 days is 88.93 MPa, which is even smaller than the one measured with 4% of nano-silica (91.29 MPa). This should be attributed to the deviation of the regression equation: the coefficient of determination (R²) of the equation representing the compressive strength at 28 days is only 0.862. Hence, to accurately obtain the optimal nano-silica amount in this study, only 3.72%, 3.46%, 3.70%, 3.92% and 3.90% are selected to calculate the average value (3.74%), which represents the optimal amount of this nano-silica in UHPC to theoretically obtain the best mechanical properties.
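The optimum-dosage reasoning above is simply the vertex of a fitted parabola: setting y′ = 2ax + b = 0 gives x* = −b/(2a). A minimal sketch follows; the coefficients are hypothetical, chosen only to land near the reported 28-day flexural optimum (about 3.70% and 13.79 MPa), and are not the paper's actual regression:

```python
def quadratic_optimum(a, b, c):
    """Vertex of y = a*x^2 + b*x + c (a < 0 for a maximum).

    Setting the derivative y' = 2*a*x + b to zero gives x* = -b / (2*a).
    """
    x_opt = -b / (2.0 * a)
    y_opt = a * x_opt ** 2 + b * x_opt + c
    return x_opt, y_opt

# Hypothetical regression of 28-day flexural strength (MPa) vs. nano-silica dosage (%):
x_opt, f_max = quadratic_optimum(a=-0.25, b=1.85, c=10.37)
print(x_opt, f_max)  # optimal dosage (%) and the strength at that dosage
```

The same vertex formula applies to each of the six regression curves in Fig. 6; only the fitted coefficients differ.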
These results are also in agreement with the results presented in Fig. 5. In this study, to clearly show the advantages of the modified Andreasen and Andersen particle packing model in designing UHPC, the concept of binder efficiency is utilized, defined as follows:

χ_binder = f_c / m_binder    (10)

where χ_binder is the binder efficiency, f_c the compressive strength of UHPC at 28 days, and m_binder is the total mass of the binders.
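Eq. (10) is a direct ratio. As a quick sketch, using the UHPC-4% proportions from Table 4 (421.9 kg/m³ cement plus 17.6 kg/m³ nano-silica counted as binder, since the limestone and quartz powders are not considered binder here) and the 91.29 MPa strength quoted above:

```python
def binder_efficiency(f_c, m_binder):
    """Eq. (10): 28-day compressive strength (MPa) per unit binder mass (kg/m^3)."""
    return f_c / m_binder

# UHPC-4% mix: cement + nano-silica as binder; strength taken from the text above.
chi = binder_efficiency(f_c=91.29, m_binder=421.9 + 17.6)
print(round(chi, 3))  # -> 0.208
```

Note that every mix in Table 4 keeps cement plus nano-silica at 439.5 kg/m³, which is why the w/b ratio stays at 0.400 throughout Table 5.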

ADS Single-Channel Touch Chip TS01S: Application Reference Design (1)

With the continuous advance of science and technology, the electronic information industry has also developed rapidly.

Today is the era of electronic information. Electronic components make up electronic information devices and equipment; they are the foundation and source of the electronic information era.

Shenzhen Aoweisi Technology Co., Ltd. (深圳市奥伟斯科技有限公司) is a high-tech enterprise focused on the development, distribution, and promotion of touch chips, microcontrollers, power management chips, voice chips, MOSFETs, display driver chips, network receiver chips, operational amplifiers, infrared receivers, and other semiconductor products.

Since its founding, the company has been dedicated to promoting and selling new semiconductor products in the domestic market, with annual sales exceeding 100 million RMB, making it a professional electronic component distributor with comprehensive competitive advantages.

Main brand product lines:
1. OWEIS-TECH: OWEIS touch chips, OWEIS interface chips, OWEIS power chips, OWEIS voice chips, OWEIS MOSFETs.
2. Capacitive touch chips: ADSEMI touch chips (distribution), 芯邦科技 (Chipsbank) touch-control chips, 万代科技 touch-key chips, 博晶微 touch controller chips, 海栎创 (Hynitron) touch-sensing chips, 启攀微 touch ICs, 融和微 touch-sensing ICs, 合泰 (Holtek) touch-key ICs, 通泰 (Tontek) touch chips.
3. Automotive electronics / power management / interface / logic chips: IKSEMICON first-tier distribution: ILN2003ADT, IK62783DT, IL2596, IL2576, ILX485, ILX3485, ILX232, ILX3232.
4. Power devices / IR receivers / photoelectric switches: KODENSHI, AUK, SMK-series MOSFETs: SMK0260F, SMK0460F, SMK0760F, SMK1260F, SMK1820F, SMK18T50F.
5. LED display driver chips: 中微爱芯 AIP series: AIP1668, AIP1628, AIP1629, AIP1616; 天微电子 TM series: TM1628, TM1668, TM1621.
6. Power management chips: Power Integrations LNK364PN, LNK564PN; 芯朋微 PN8012, PN8015, AP5054, AP5056; 力生美, 晶源微, 友达, 天钰电子 FR9886, FR9888.
7. Voice chips: APLUS 巨华电子 AP23085, AP23170, AP23341, AP23682, AP89085, AP89170, AP89341, AP89341K, AP89682.
8. Operational amplifiers: 3PEAK, 聚洵, and 圣邦微 (SGMICRO) operational amplifiers.
9. LEDs: OSRAM, Lite-On, Everlight, and Kingbright light-emitting diodes.
10. CAN transceivers: NXP and Microchip CAN transceivers.
11. Distribution lines: onsemi, TI (Texas Instruments), ADI, TOSHIBA, AVAGO.
12. MCUs: ABOV MC96F series; Microchip PIC12F/PIC16F/PIC18F series; FUJITSU MB95F series; ST STM32F/STM32L series; CKS (中科芯) CKS32F series; TI MSP430 and TMS320F series; NXP LPC series.

Below, Aoweisi introduces the TS01S in detail.

TS01S (1-CH Differential Sensitivity Calibration Capacitive Touch Sensor), SPECIFICATION VER. 2.5

1 Pin Configuration (SOT-26)

2 Pin Description (SOT-26)

3 Absolute Maximum Ratings
Maximum supply voltage           5.5 V
Maximum voltage on any pin       VDD + 0.3 V
Maximum current on any PAD       100 mA
Continuous power dissipation     200 mW
Storage temperature              −50 to 150 °C
Operating temperature            −20 to 75 °C
Junction temperature             150 °C
Note 2: Unless otherwise noted, all of the above are specified at normal temperature.

4 ESD & Latch-up Characteristics

5 Electrical Characteristics
VDD = 3.3 V, typical system frequency (unless otherwise noted), TA = 25 °C.
Note 3: The sensitivity can be increased with a lower CS value. The recommended CS value is 10 pF when using a 3T PC (polycarbonate) cover, a 10 mm × 7 mm touch pattern, and the middle sensitivity selection.
Note 4: The CR value is recommended to be as close to that of CS_TOT as possible for effective differential sensitivity calibration.
CS_TOT = CS + CPARA (CPARA is the parasitic capacitance of the CS pin). If a proper CR capacitor value is used, the CR pin oscillates at almost the same frequency as the CS pin.

6 TS01S Implementation

6.1 Current consumption
TS01S uses an internal bias circuit, so the internal clock frequency and current consumption are not adjustable. The typical current consumption of TS01S as a function of VDD voltage is shown below; a higher VDD requires more current. The internal bias circuit keeps the circuit design simple and reduces the number of external components.

[Figure: typical current consumption curve of TS01S]

6.2 CS and CR implementation
Parallel capacitors CS and CR are added to the CS and CR pins to fine-tune the sensitivity. The major factor for the sensitivity is CS: the sensitivity increases when a smaller CS value is used (see the sensitivity example figure below). The CR value should be almost the same as the total CS capacitance (CS_TOT) for effective differential sensitivity calibration. The total CS capacitance is composed of the CS value set for optimal sensitivity and the parasitic capacitance of the CS pattern (CPARA). The parasitic capacitance of the CS pattern is about 2 pF for a normal touch-pattern size, but for a larger touch pattern CPARA is bigger than this normal value.

RS is a series resistor to avoid malfunction from external surge and ESD (it is optional); values from 200 Ω to 1 kΩ are recommended. The size and shape of the touch PAD influence the sensitivity: the sensitivity is optimal when the size of the PAD is approximately half of the first knuckle (about 10 mm × 7 mm).
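The CR-matching rule above can be captured in a small helper. This is an illustrative sketch, not vendor code; the 2 pF default is the datasheet's "about 2 pF" parasitic figure for a normal-size touch pattern:

```python
def recommended_cr_pf(cs_pf, c_para_pf=2.0):
    """Return a CR value (pF) matching CS_TOT = CS + CPARA.

    The datasheet recommends CR ~= CS_TOT for effective differential
    sensitivity calibration; CPARA is about 2 pF for a normal touch
    pattern and larger for oversized pads.
    """
    return cs_pf + c_para_pf

# With the recommended 10 pF CS (3T PC cover, 10 mm x 7 mm pad):
print(recommended_cr_pf(10.0))  # -> 12.0
```

For an oversized pad, measure or estimate the larger CPARA and pass it explicitly, e.g. `recommended_cr_pf(10.0, 3.5)`.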
The connection line from CS to the touch PAD should be routed as short as possible to prevent abnormal touch detection caused by the connection line.

[Figure: sensitivity example of TS01S with the normal sensitivity selection]

6.3 SYNC implementation
From two up to ten TS01S devices (or other TS-series touch sensors) can work in one application at the same time thanks to the SYNC function on this pin. The SYNC pulse prevents two or more sensing signals from interfering with each other. During the sense-disable period, when the SYNC input becomes high, the internal clock is suspended. RSYNC is the pull-down resistor of the SYNC pin. Too large an RSYNC value delays the falling edge of the SYNC pulse, and too small a value delays the rising edge. The typical RSYNC value is 2 MΩ. TS01S has high sensitivity when SYNC is implemented as in the figure above (RSYNC connected between SYNC and GND).

6.4 SYNC implementation for sensitivity selection
Another function of the SYNC pin of TS01S is the selection of the sensitivity without any additional external component. The SYNC implementation for sensitivity selection is given in the chart below.

6.5 OUTPUT implementation
OUT has an open-drain structure. For this reason, a pull-up resistor ROUT is required between OUT and VDD, or another lower-voltage node. If ROUT is connected to a node at a higher voltage than VDD, the output current passes through the protection diode to VDD and abnormal operation may occur. The maximum output sink current is 4 mA, so ROUT must be over a few kΩ; normally 10 kΩ is used. OUT is high in the normal state and goes low when a touch is detected on CS.

6.6 Internal reset operation
The TS01S has a stable internal reset circuit that provides a reset pulse to the digital block. For a system start or restart, the supply voltage should first fall below 0.3·VDD of the normal operating VDD. No external components are required for the TS01S power-on reset, which keeps the circuit design simple and minimizes the cost of the application.
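The open-drain sizing rule from Section 6.5 can be sanity-checked numerically: when OUT pulls low, the pull-up sources roughly VDD/ROUT into the pin, which must stay below the 4 mA maximum sink current. A small illustrative calculation (my own sketch, not vendor material):

```python
def min_pullup_ohms(vdd_v, i_sink_max_a=0.004):
    """Smallest pull-up that keeps the open-drain sink current under the limit."""
    return vdd_v / i_sink_max_a

r_min = min_pullup_ohms(3.3)   # 825 ohm absolute floor at VDD = 3.3 V
i_typ = 3.3 / 10e3             # sink current with the typical 10 kOhm pull-up (A)
print(r_min, i_typ)
```

The datasheet's "over a few kΩ, normally 10 kΩ" recommendation leaves comfortable margin above this 825 Ω floor, so the pin can still reach a valid logic-low level.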
CAUTION: The VDD rising time should be less than 100 ms for proper power-on reset.

7 Recommended Circuit Diagram

7.1 Application Example
1. A capacitor and a resistor may be connected to CS (pin 4) to obtain a stable sensitivity.
2. The capacitor value connected to the CR pin should be almost the same as the total CS capacitance (including parasitic capacitance) for effective differential sensitivity calibration.
3. TS01S is reset by its internal reset circuit. The VDD rising time should be shorter than 100 ms for proper operation.
4. The sensitivity can be adjusted through the connection of the SYNC pin (refer to chapter 6.4).
5. From two up to ten TS01S devices (or other TS-series touch sensors) can work in one application at the same time thanks to the SYNC function (refer to chapter 6.3).
6. The TS01S OUT port has an open-drain structure, so a pull-up resistor is needed as in the figure above.
7. Periodic VDD voltage ripple over 50 mV, or ripple at a frequency below 10 kHz, can cause wrong sensitivity calibration. To prevent this, the power (VDD, GND) lines of the touch circuit should be separated from the other circuits; in particular, LED driver power lines and digital switching circuit power lines must be kept separate from the touch circuit.
8. The CS pattern should be routed as short as possible, and the line width should be around 0.25 mm.
9. The CS pattern routing should be formed on the bottom metal layer (opposite the touch PAD).
10. The capacitor between VDD and GND is obligatory and should be placed as close as possible to the TS01S.
11. The empty space of the PCB must be filled with a GND pattern to strengthen the ground and to shield external noise that would interfere with the sensing frequency.

7.2 Example – Power Line Split Strategy PCB Layout
A. Power line not split (bad power line design)

8 PACKAGE Description

8.1 Mechanical Drawing
NOTES:
1. Dimensions and tolerances are as per ANSI Y14.5, 1982.
2. Package surface to be matte finish VDI 11–13.
3. Die is facing up for mold; die is facing down for trim/form, i.e., reverse trim/form.
4. The foot-length measurement is based on the gauge plane method.
5. Dimension is exclusive of mold flash and gate burr.
6. Dimension is exclusive of solder plating.
7. All dimensions are in mm.

8.2 Marking Description

LIFE SUPPORT POLICY
AD SEMICONDUCTOR'S PRODUCTS ARE NOT AUTHORIZED FOR USE AS CRITICAL COMPONENTS IN LIFE SUPPORT DEVICES OR SYSTEMS WITHOUT THE EXPRESS WRITTEN APPROVAL OF THE PRESIDENT AND GENERAL COUNSEL OF AD SEMICONDUCTOR CORPORATION.
The ADS logo is a registered trademark of ADSemiconductor. © 2006 ADSemiconductor – All Rights Reserved.

Aoweisi Technology provides professional touch solutions for smart electronic locks and offers the complete chip set for electronic locks: low-power touch chips, low-power MCUs, motor driver chips, display driver chips, card-reader chips, clock chips, memory chips, voice chips, low-voltage MOSFETs, and TVS diodes.

On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications

On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications

Karl Fuerlinger (1), Michael Gerndt (2), and Jack Dongarra (1)

(1) Innovative Computing Laboratory, Department of Computer Science, University of Tennessee — {karl,dongarra}@
(2) Lehrstuhl für Rechnertechnik und Rechnerorganisation, Institut für Informatik, Technische Universität München — gerndt@in.tum.de

Abstract. Profiling is often the method of choice for performance analysis of parallel applications due to its low overhead and easily comprehensible results. However, a disadvantage of profiling is the loss of temporal information that makes it impossible to causally relate performance phenomena to events that happened earlier or later during execution. We investigate techniques to add a temporal dimension to profiling data by incrementally capturing profiles during the runtime of the application, and discuss the insights that can be gained from this type of performance data. The context in which we explore these ideas is an existing profiling tool for OpenMP applications.

1 Introduction

Performance is an important concern for many developers of parallel applications, and a large number of tools are available that can be used to analyze performance and identify potential bottlenecks.

Most tools for parallel performance analysis capture data during the execution of an application to later analyze it offline. Depending on the kind of data recorded, tracing and profiling are commonly distinguished. Profiling denotes the reporting of performance data in a summarized style (such as the fraction of total execution time spent in a certain subroutine call), while tracing refers to the recording of individual time-stamped program execution events.

Profiling is often preferred over tracing since it generates less performance data and greatly facilitates a manual interpretation. However, a detailed analysis of the interaction of processes or threads can sometimes require the level of insight that only time-stamped event
recording offers. More generally, a significant drawback of profiling is the fact that the temporal dimension of performance data is completely lost. That is, with "one-shot" profiles it is not possible to relate performance phenomena in profiling reports to the time when they occurred in the application, or to explain their causes and consequences in terms of causal relations with other phenomena or events happening earlier or later.

In this paper we investigate the utility of incremental profiling (or "profiling over time") for the performance analysis of parallel applications. We discuss general approaches to adding a temporal component to profiling data and the kinds of insights that can be derived from incremental profiling. The context in which we explore these ideas is our own profiling tool for OpenMP applications, called ompP [3]. Examples that show the utility of our proposed approach come from several applications of the SPEC OpenMP benchmark suite (SPEC OMPM2001) [1].

The rest of this paper is organized as follows: Sect. 2 briefly introduces our profiling tool and describes its existing capabilities. Sect. 3 then lists general options for adding a temporal dimension to performance data and describes the approach we have taken with ompP. Sect. 4 serves as an evaluation of our ideas: we describe the new types of performance data of incremental profiling (often graphical views) that are available to the user and show examples of their utility on application examples that come from the SPEC OpenMP benchmark suite.
We describe related work in Sect. 5 and conclude and discuss further directions for our work in Sect. 6.

2 The OpenMP Profiler ompP

ompP is a profiling tool for OpenMP applications that relies on the Opari instrumenter [8] for source code instrumentation. ompP is a static library linked to the target application; at program termination it delivers a text-based profiling report that is meant to be easily comprehensible by the user. As opposed to subroutine-based profiling tools like gprof, ompP is able to report timings and counts for various OpenMP constructs in terms of the user execution model [6]. As ompP is based on instrumentation-added measurement calls (and not on PC sampling, like gprof), it is exact in the sense that the reported values are not subject to sampling inaccuracy.

An example of the profiling data delivered by ompP is shown in Fig. 1.

R00002  main.c (34-37)  (default)  CRITICAL
TID    execT  execC  bodyT  enterT  exitT  PAPI_TOT_INS
  0     3.00      1   1.00    2.00   0.00          1595
  1     1.00      1   1.00    0.00   0.00          6347
  2     2.00      1   1.00    1.00   0.00          1595
  3     4.00      1   1.00    3.00   0.00          1595
SUM    10.01      4   4.00    6.00   0.00         11132

Fig. 1: An example of ompP's profiling data for an OpenMP critical section. The body of this critical section contained a call to sleep(1.0) only.

The first line gives the source code location and the type of construct. Different count and timing categories (depending on the particular type of OpenMP construct) are listed as column headers, and each line lists the accumulated values for a particular thread, while the last line sums over all threads.

ompP also supports the measurement of hardware performance counter data for individual threads and OpenMP constructs, as well as for user-defined regions, using PAPI [2]. The user selects a counter to measure via environment variables, and the summed counter values then appear as additional columns in the profiles, as shown in Fig. 1.

In addition to the flat region profile shown in Fig. 1, ompP performs an overhead analysis in which four well-defined overhead classes (synchronization, load imbalance, thread
management, limited parallelism) are quantitatively evaluated. Furthermore, ompP tries to detect common inefficiency situations, such as load imbalance in parallel loops, contention for locks and critical sections, etc. The profiling report contains a list of the discovered instances of these so-called performance properties [5], sorted by their severity (negative impact on performance).

3 Adding a Temporal Dimension to Profiling Data

A straightforward way to add a temporal component to profiling-type performance data is to capture profiles at several points during the execution of the target application (and not just at the end) and to analyze how the profiles change between those capture points. Alternatively (and equivalently), the changes between capture points can be recorded incrementally and the overall state at capture time can later be recovered. Several trigger events for the collection of profiling reports are conceivable:

– Timer based, fixed: Profiles are captured at regular, fixed intervals during the execution of the target application. No prior knowledge or modification of the application is required. Since data samples are captured at uniform intervals, they are easy to visualize and comprehend. The overhead with respect to storage space for profiling reports is predictable, as it depends only on the duration of the program run and the capture interval.

– Timer based, adaptive: This method dynamically adapts the duration between two capture points based on the amount of change in the profiling data that has been observed. Possible measures for this change are the number of different constructs executed and their invocation counts. This option has the potential advantage of offering a finer-grained insight into phases of execution where a lot of change occurs, while avoiding profiling reports for phases of largely uniform behavior.

– Overflow based: This is another method to correlate the profiling frequency with the activity of the program. With this technique, profiling reports are generated
when a hardware counter overflows a pre-set threshold. For floating-point intensive applications, it may, for example, be beneficial to trigger profiling reports after every n floating-point operations. Other potential triggers for profiling reports are the number of cache misses or the occurrence of page faults.

– User added: A method to trigger the generation of profiling reports can be exposed to the user. This technique is especially useful for phase-based programs, where program phases are iteratively executed.

In this paper we investigate the simplest form of incremental profiling described above: capturing profiles at regular, fixed-length intervals during the entire execution time of the application.

4 Evaluation of Incremental Profiling

For both profiling and tracing, the following dimensions of performance data can generally be distinguished:

– Kind of data: Describes which type of data is measured or reported to the user. Examples include timing data, execution counts, performance counter values, and so on.

– Source code location: Data can be collected globally for the whole program or for particular source code entities such as subroutines, OpenMP constructs, basic blocks, individual statements, etc.

– Threads or processes dimension: Measured data can either be reported for individual threads or processes, or accumulated over the whole set (by summing or averaging, for example).

– Time dimension: Describes when a particular measurement was made (timestamp) or for which time duration values have been measured.

An appealing property of profiling data is its low dimensionality, i.e., it can often be comprehended textually (like gprof output) or visualized as 1D or 2D graphs in a straightforward way. Adding a new (temporal) dimension jeopardizes this advantage and requires more sophisticated performance data displays. We came up with the following types of useful performance views that can be extracted from the incremental profiling reports delivered by ompP:

Performance properties over
time: Performance properties [5] offer a very compact way to represent performance analysis knowledge, and their change over time can thus be easily visualized. A property like "Imbalance in parallel region foo.f (23-42) with severity of 4.5%" carries all relevant context information with it. The severity value denotes the percentage of total execution time improvement that can be expected if the cause for the inefficiency is completely removed. The threads dimension is collapsed in the specification of the property, and the source code dimension is encoded as the property's context (foo.f (23-42) in the above example). Properties over time can be visualized as a 2D line plot, where the x-axis is the time, the y-axis denotes severity values, and a line is drawn for each property from the first time it was detected until program termination. Depending on the behavior of the property graphs, valuable information can be deduced for the particular test application. For example, in the graph shown in Fig. 2 it is evident that the severity of the properties appears to be continuously increasing as time proceeds, indicating that the imbalance situations in this code will become increasingly significant with longer runtime (e.g., larger data sets or more iterations). Other applications from the SPEC OpenMP benchmark suite showed other interesting features, such as initialization routines that generated high initial overheads which amortized over time (i.e., the severity decreased).

Fig. 2: An example of the "performance properties over time" display for the 310.wupwise application. Shown are the five most severe performance
properties.

– Region invocations over time: Depending on the size of the test application and the analyst's familiarity with the source code, it can be valuable to know when and how often a particular OpenMP construct, such as a parallel loop, was executed. The region-invocations over time display offers this functionality. As shown in Fig. 3, the graph gives the number of invocations of a particular region; this particular case shows the two most time-consuming parallel regions of the 328.fma3d application. This view is most useful when aggregating (e.g., summing) over all threads, but in certain cases it can be valuable (for critical sections and locks, for example) to see which particular thread executed a construct at which time.

Fig. 3: This graph shows the number of region invocations over time for the 328.fma3d application.

A surface plot can be used for visualization in this case, or a heatmap display similar to the one used for visualizing performance counters (see below).
– Region execution time over time: This display is similar to the region invocations over time display but shows the accumulated execution time between dump intervals instead of the invocation count. Again, this display allows the user to see when particular portions of the code actually get executed.
– Overheads over time: ompP evaluates four overhead classes based on the flat profiling data for individual parallel regions and for the program as a whole. For example, the time required to enter a critical section is attributed to the containing parallel region as synchronization time. A detailed discussion and motivation of this classification scheme can be found in [4]. The overheads over time display plots the incurred overheads (either for a particular parallel region or for the entire application) over the execution timeline of the application, as shown in
Fig. 4. This graph gives the overheads as a percentage of total aggregated CPU time. Hence, for an execution with 32 CPUs, an overhead percentage of 50 means that 16 CPUs are not doing useful work. The total overhead incurred over the entire program run is thus the integral of the overhead function (the area under the graph), and the graph shows when a particular overhead was incurred. In the example in Fig. 4, the most noticeable overhead is synchronization overhead starting at about 30 seconds of execution and lasting for several seconds. A closer examination of the OpenMP profiling reports reveals that this overhead is caused by critical section contention. One thread after the other enters the critical section and performs a time-consuming initialization operation. This effectively serializes the execution for more than 10 seconds and shows up as an overhead of 31/32 ≈ 97% in the overheads graph.

Fig. 4: This graph shows overheads over time (synchronization, imbalance, limited parallelism, thread management) for the 328.fma3d application.

– Performance counter heatmaps: This display is used to visualize hardware performance counter values over time and for several threads. Fig. 5 shows examples of the performance counter heatmap display. The x-axis corresponds to the time (in seconds) while the y-axis corresponds to the thread ID. A color gradient (or gray-scale) coding indicates high or low counter values. A tile is not filled if no data samples are available for that time period. This type of display is supported for both the entire program as well as for individual OpenMP regions. The example in Fig. 5a shows the DATA CACHE...

Fig. 5: Example performance counter heatmaps: (a) cache misses that took longer than 1024 cycles to be satisfied, for the 318.galgel application; (b) retired load instructions, for the 316.applu application; (c) retired floating point operations, for the 324.apsi application. Time is displayed on the
horizontal axis (in seconds), the vertical axis lists the threads (32 in this case). The middle part of the display in 5a has been cut out (mapping of threads to processors and their arrangement in the machine and its interconnect). As another example, Fig. 5c gives the number of retired floating point operations for the 324.apsi application; this graph shows a marked difference for processors 0 to 14 vs. 15 to 31. We have not been able to identify the exact cause for this behavior until now.
– Identification of phase-based behavior: as in Figs. 5a and 5b, some applications show a marked phase-based behavior. It is also evident in many cases that the characteristics of each phase change from iteration to iteration.
– Identification of temporary performance bottlenecks, such as short-term bus contention.

5 Related Work

There are a number of performance analysis tools for OpenMP. Vendor-specific tools such as the Intel Thread Profiler and Sun Studio are usually limited to their respective platform, but have the advantage of being able to make use of internal details of the compiler's OpenMP implementation and runtime system.
Both the Intel and the Sun tool are based on sampling and can provide the user with some timeline profile displays. Neither of those tools, however, has a concept similar to ompP's high-level abstraction of performance properties. TAU [9, 7] is also able to profile and trace OpenMP applications by utilizing the Opari instrumenter. Its performance data visualizer Paraprof supports a number of different profile displays and also supports interactive 3D exploration of performance data, but to the best of our knowledge does not currently have support for a display similar to our performance counter heatmaps. The TAU toolset also contains a utility to convert TAU trace files to profiles, which can generate profile series and interval profiles. OProfile and its predecessor, the Digital Continuous Profiling Infrastructure (DCPI), are system-wide statistical profilers based on hardware counter overflows. Both approaches rely on a profiling daemon running in the background and support the dumping of profiling reports at any time. Data acquisition in a style similar to our incremental profiling approach would thus be easy to implement. We are, however, not aware of any study using OProfile or DCPI that investigated continuous profiling for parallel applications. In practice, the necessity of root privileges and the difficulty of relating profiling data back to the user's OpenMP threading model are major stumbling blocks when using those tools. Both issues are not a concern with ompP, since it is based on source code instrumentation.

6 Conclusion and Future Work

We have presented a study on the utility of incremental profiling for performance analysis of shared memory parallel applications. Our results indicate that valuable information about the temporal behavior of applications can be discovered by incremental profiling and that this technique strikes a good balance between the level of detail offered by tracing and the simplicity and efficiency of profiling. Using incremental profiling we were able to
acquire new insights into the behavior of applications which, due to the lack of temporal data, cannot be gained from pure profiling. The most interesting features are the revelation of iterative behavior, the identification of short-term contention for resources, and the temporal localization of overheads and execution patterns. Future work is planned in several areas. Firstly, we plan to support other triggers for capturing profiles, most importantly user-added and overflow based. Secondly, we plan to integrate our profiling data with TAU's Paraprof viewer in order to interactively explore the incremental profiling data delivered by ompP. Thirdly, we plan to test our ideas in the context of MPI as well; a planned integrated MPI/OpenMP profiling tool based on mpiP [10] and ompP is the first step in this direction.

References

1. Vishal Aslot and Rudolf Eigenmann. Performance characteristics of the SPEC OMP2001 benchmarks. SIGARCH Comput. Archit. News, 29(5):31–40, 2001.
2. Shirley Browne, Jack Dongarra, N. Garner, G. Ho, and Philip J. Mucci. A portable programming interface for performance evaluation on modern processors. Int. J. High Perform. Comput. Appl., 14(3):189–204, 2000.
3. Karl Fuerlinger and Michael Gerndt. ompP: A profiling tool for OpenMP. In Proceedings of the First International Workshop on OpenMP (IWOMP 2005), Eugene, Oregon, USA, May 2005. Accepted for publication.
4. Karl Fuerlinger and Michael Gerndt. Analyzing overheads and scalability characteristics of OpenMP applications. In Proceedings of the Seventh International Meeting on High Performance Computing for Computational Science (VECPAR'06), Rio de Janeiro, Brazil, 2006. To appear.
5. Michael Gerndt and Karl Fürlinger. Specification and detection of performance problems with ASL. Concurrency and Computation: Practice and Experience, 2006. To appear.
6. Marty Itzkowitz, Oleg Mazurov, Nawal Copty, and Yuan Lin. An OpenMP runtime API for profiling. Accepted by the OpenMP ARB as an official ARB White Paper, available online at /futures/omp-api.html.
7. Allen D. Malony, Sameer
S. Shende, Robert Bell, K. Li, L. Li, and N. Trebon. Advances in the TAU performance analysis system. In V. Getov, M. Gerndt, A. Hoisie, A. Malony, and B. Miller, editors, Performance Analysis and Grid Computing, pages 129–144. Kluwer, 2003.
8. Bernd Mohr, Allen D. Malony, Sameer S. Shende, and Felix Wolf. Towards a performance tool interface for OpenMP: An approach based on directive rewriting. In Proceedings of the Third Workshop on OpenMP (EWOMP'01), September 2001.
9. Sameer S. Shende and Allen D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications, ACTS Collection Special Issue, 2005.
10. Jeffrey S. Vetter and Frank Mueller. Communication characteristics of large-scale scientific applications for contemporary cluster architectures. J. Parallel Distrib. Comput., 63(9):853–865, 2003.

Performance Comparison of Optimal Routing and Dynamic Routing On Low-Earth Orbit Satellite Networks
Abstract
We compare the performance of two different routing schemes for LEO satellite networks through simulation. The first one is a dynamic routing scheme based on the shortest-path algorithm. Its routing table is updated according to cost estimation based on link status information that is broadcast periodically. The second one is an optimal routing scheme proposed by the authors of this paper, where a LEO satellite network is modeled by a Finite State Automaton (FSA). The states of the FSA are derived from the relative visibility on the basis of the constellation data. In each state, the LEO satellite network has a fixed topology. Since LEO satellite networks exhibit periodic orbit movements, they can have only a finite number of states. In the proposed scheme, the link assignment and routing problems are solved off-line for each state in the FSA using simulated annealing and the optimal routing algorithm. The resulting link assignment table and the routing table are stored at each satellite. As the system enters a new state, a new set of pre-computed tables is retrieved and used throughout the state. We also propose a scheme by which we can reduce the probability of ongoing call blocking, and analyze the proposed scheme through simulation.

Keywords: link assignment, routing, ongoing call blocking
1 Introduction
With the rapid progress in satellite communication technologies, it is now feasible to build mobile communication networks based on Low-Earth Orbit (LEO) satellites [1, 2]. In many LEO satellite networks such as Iridium [3] and Calling [4], direct inter-satellite links (ISL's) between mutually visible satellites are used to provide communication paths among satellites. Routing long-distance traffic via ISL's enhances the autonomy of the system, decreases communication delay, and reduces atmospheric transmission loss [5, 6]. However, the use of ISL's poses a set of new problems. For example, as one satellite travels along its orbit, its relative position to others and to the earth changes continuously, and so does the set of satellites visible from it. Therefore, it is necessary that links be dynamically established and/or deleted as the
set of visible satellites (i.e., topology) changes. When this topological change takes place, the following points should be considered. The link assignment should be done in such a way as to maximize the overall network performance. Despite topological changes, the ongoing connections should be maintained. The problem of link assignment and routing in LEO satellite networks has not been treated well until present. Many LEO satellite networks assume simple regular topologies. For example, links are set up between adjacent satellites on the same orbit, and the two nearest satellites on two neighboring orbits are connected. Although simple, this method does not consider the traffic requirement, and thus the network resources are poorly utilized. In the dynamic link assignment given in [7], each satellite collects link state information from its neighbors and reconfigures the link connectivity. However, the main focus of that scheme is on network connectivity rather than on overall performance. For traffic routing, simple routing rules (as found for regular topologies such as ShuffleNet and Torus) can be applied in the case of a regular link assignment. The optimal routing methods developed for packet-switched data networks [8, 9] can be applied in the case of a dynamic link assignment. In [10], we modeled the LEO satellite network as an FSA (Finite State Automaton) using satellite constellation information and solved a topological design and routing problem on the configuration corresponding to each state in the FSA. In this paper, we compare the performance of LEO satellite systems with two different routing schemes through simulation. One operates with pre-calculated link assignment tables and routing tables, and the other runs a dynamic routing scheme instead of looking up pre-calculated routing tables. We also propose a scheme by which we can reduce the probability of ongoing call blocking, and analyze the proposed scheme.
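To make the contrast concrete, the following sketch shows how per-state routing tables can be pre-computed off-line with a shortest-path algorithm and then used by pure table lookup at run time; the dynamic scheme would instead re-run the shortest-path computation whenever freshly broadcast link costs arrive. The three-satellite topologies and link costs are hypothetical illustrations, not the constellation or simulator used in this paper:

```python
import heapq

def dijkstra_next_hops(graph, src):
    """Shortest-path tree rooted at src; returns {dest: first hop from src}."""
    dist = {src: 0.0}
    next_hop = {}
    pq = [(0.0, src, None)]  # (distance, node, first hop on the path)
    while pq:
        d, node, first = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue  # stale entry
        if first is not None:
            next_hop[node] = first
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr, nbr if first is None else first))
    return next_hop

# Two FSA states with fixed (made-up) ISL topologies; costs model link delays.
states = {
    0: {"A": {"B": 1, "C": 4}, "B": {"A": 1, "C": 1}, "C": {"A": 4, "B": 1}},
    1: {"A": {"C": 1}, "B": {"C": 1}, "C": {"A": 1, "B": 1}},  # A-B ISL deleted
}

# Optimal scheme: routing tables are computed off-line, once per FSA state.
tables = {s: {sat: dijkstra_next_hops(g, sat) for sat in g}
          for s, g in states.items()}

def route(state, src, dst):
    """At run time, follow the pre-computed next-hop tables; no recomputation."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[state][path[-1]][dst])
    return path

print(route(0, "A", "C"))  # ['A', 'B', 'C'] via the cheap two-hop path
print(route(1, "A", "B"))  # ['A', 'C', 'B'] after the A-B link disappears
```

When the FSA enters a new state, only the active `tables[state]` pointer changes; the dynamic scheme would call `dijkstra_next_hops` on-line at each cost update instead.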
If a state transition takes place in the FSA, ISL's may be re-assigned, and the calls affected by the topological change must be assigned a path in the new topology. During this process, ongoing calls may be blocked due to an unsuccessful re-routing. We define this blocking as ongoing call blocking. It can be reduced by reserving a certain number of channels for the purpose of re-routing during state transitions. In this paper, we compare the two routing schemes in various aspects of ongoing call blocking.
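The reservation idea can be sketched as a simple per-link admission rule (a hypothetical illustration of trunk reservation, not the authors' simulator): a new call may only use `capacity - reserved` channels of an ISL, while a call re-routed at a state transition may also dip into the reserved pool:

```python
class Link:
    """An ISL with `capacity` channels, of which `reserved` are held back
    for calls that must be re-routed at an FSA state transition."""

    def __init__(self, capacity, reserved):
        self.capacity = capacity
        self.reserved = reserved
        self.used = 0

    def admit(self, rerouted=False):
        # New calls may not enter the reserved pool; re-routed calls may.
        limit = self.capacity if rerouted else self.capacity - self.reserved
        if self.used < limit:
            self.used += 1
            return True
        return False  # call blocked on this link

link = Link(capacity=10, reserved=2)
admitted_new = sum(link.admit() for _ in range(10))              # only 8 fit
admitted_rerouted = sum(link.admit(rerouted=True) for _ in range(4))  # 2 more fit
print(admitted_new, admitted_rerouted)  # 8 2
```

Raising `reserved` lowers the probability of ongoing call blocking at state transitions at the cost of a higher blocking probability for newly arriving calls, which is the trade-off examined in the simulation.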