Parallel Implementation of the Polyhedral Homotopy Method


Gene Overview

The Steps:
1. A portion of the double helix is unwound by a helicase.
2. A molecule of a DNA polymerase binds to one strand of the DNA and begins moving along it in the 3' to 5' direction, using it as a template for assembling a leading strand of nucleotides and reforming a double helix. In eukaryotes, this molecule is called DNA polymerase delta (δ).
3. Because DNA synthesis can only occur 5' to 3', a molecule of a second type of DNA polymerase (epsilon, ε, in eukaryotes) binds to the other template strand as the double helix opens. This molecule must synthesize discontinuous segments of polynucleotides (Okazaki fragments). Another enzyme, DNA ligase I, then stitches these together into the lagging strand.
• There are an estimated 19,000-20,000 human protein-coding genes.

A PARALLEL MULTIGRID SOLVER FOR SEMICONDUCTOR DEVICE EQUATIONS

III. MULTIGRID METHOD
PDEs must be solved on a sufficiently fine grid to obtain accurate solutions. In semiconductor simulation a large truncation error may cause convergence problems. In some circumstances, the larger the grid spacing, the smaller the time step required to maintain solution stability [10]. Classical iterative methods slow down as the number of grid points increases. Multigrid iterative methods are highly efficient solvers for PDEs, in which the combination of fine grid relaxation and coarse grid correction produces a fast convergence rate. The multigrid full approximation scheme (MG-FAS) [11] is used in this work.

To solve the system $F(u) = f$, the MG method uses a sequence of grids $G_k$ ($1 \le k \le K$), where $G_1$ is the finest grid and $G_K$ is the coarsest grid. There exists a system $F_k(u_k) = f_k$ on each grid $G_k$. $P_{k+1}^{k}$ is a prolongation operator from a coarse grid $G_{k+1}$ to a fine grid $G_k$, and $R_{k-1}^{k}$ is a restriction operator from a fine grid $G_{k-1}$ to a coarse grid $G_k$. MG-FAS cycles may be defined as follows.

(a) Interpolate $u_1$ and $f_1$ to each of the grids and solve
    $F_k(u_k) = F_k(R_{k-1}^{k} u_{k-1}) + R_{k-1}^{k}\,(f_{k-1} - F_{k-1}(u_{k-1}))$
(b) IF $G_k$ is the coarsest grid ($k = K$)
        solve $F_K(u_K) = f_K$ exactly
        Prolongate the correction $V_K = u_K^{new} - u_K^{old}$ to the next fine grid $G_{K-1}$
    ELSE
        Solve $F_k(u_k) = f_k + P_{k+1}^{k} V_{k+1}$
        Prolongate the correction $V_k = u_k^{new} - u_k^{old}$ to the next fine grid $G_{k-1}$
(c) Do (b) until the finest grid $G_1$ is reached

The Gauss-Seidel method is used as the smoothing operator on all grids except the coarsest grid, where the GS-SOR/SUR method is employed to speed up the solution. The interpolation operators are bi-linear prolongation and half-weighting restriction.
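The interplay of fine grid relaxation and coarse grid correction can be sketched on a much simpler problem. The following is a minimal illustration and not the MG-FAS solver of this work: it assumes the linear 1D model problem $-u'' = f$ with zero boundary values, uses only two grids, replaces the nonlinear full approximation scheme with the linear correction scheme, and uses 1D linear interpolation and full weighting in place of bi-linear prolongation and half-weighting restriction. All function names are hypothetical.

```python
def gauss_seidel(u, f, h, sweeps):
    """In-place Gauss-Seidel sweeps for -u'' = f with fixed end values."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


def residual(u, f, h):
    """r = f - A u for the standard three-point discretisation."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full-weighting restriction from the fine grid to the coarse grid."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for j in range(1, nc - 1):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def prolong(ec, nf):
    """Linear interpolation from the coarse grid to a fine grid of nf points."""
    e = [0.0] * nf
    for j in range(len(ec)):
        e[2 * j] = ec[j]
    for j in range(len(ec) - 1):
        e[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    return e


def two_grid_cycle(u, f, h):
    """One cycle: pre-smooth, coarse grid correction, post-smooth."""
    gauss_seidel(u, f, h, 2)                       # fine grid relaxation
    rc = restrict(residual(u, f, h))               # move residual to coarse grid
    # Many sweeps stand in for an exact coarse solve in this sketch.
    ec = gauss_seidel([0.0] * len(rc), rc, 2.0 * h, 400)
    e = prolong(ec, len(u))                        # bring correction back
    for i in range(1, len(u) - 1):
        u[i] += e[i]
    gauss_seidel(u, f, h, 2)
    return u
```

Calling `two_grid_cycle` repeatedly on a grid with an odd number of points (for example 33) reduces the residual much faster than the same number of plain Gauss-Seidel sweeps, which is the behaviour the section attributes to multigrid.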

Schur flows and orthogonal polynomials on the unit circle

$$b_n' = 2\,(a_n^2 - a_{n-1}^2), \qquad a_n' = a_n\,(b_{n+1} - b_n), \qquad n \in \mathbb{Z}_+ = \{0, 1, \ldots\}, \qquad a_{-1} = 0,$$
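where $'$ means differentiation with respect to $t$, in positive $a$'s with the initial data $\{a_n(0)\}$, $\{b_n(0)\}$. The coefficients compose a semi-infinite Jacobi matrix

$$J = J(\{a_n\}, \{b_n\}) = \begin{pmatrix} b_0(t) & a_0(t) & & \\ a_0(t) & b_1(t) & a_1(t) & \\ & a_1(t) & \ddots & \ddots \end{pmatrix} \tag{1.2}$$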
arXiv:math/0511269v2 [math.CA] 14 Jan 2006
LEONID GOLINSKII

Abstract. The relation between the Toda lattices and similar nonlinear chains and orthogonal polynomials on the real line has been elaborated immensely for the last decades. We examine another system of differential-difference equations known as the Schur flow, within the framework of the theory of orthogonal polynomials on the unit circle. This system can be displayed in equivalent form as the Lax equation, and the corresponding spectral measure undergoes a simple transformation. The general result is illustrated on the modified Bessel measures on the unit circle and the long time behavior of their Verblunsky coefficients

Parallel Saturating Fractional Arithmetic Units

Navindra Yadav and Michael Schulte
EECS Department, Lehigh University, Bethlehem, PA 18015, USA

John Glossner
Advanced DSP Compilers and Architectures, Lucent Technologies, Inc., Allentown, PA 18103, USA

Abstract

This paper describes the designs of a saturating adder, multiplier, single MAC unit, and dual MAC unit with one cycle latencies. The dual MAC unit can perform two saturating MAC operations in parallel and accumulate the results with saturation. Specialized saturation logic ensures that the output of the dual MAC unit is identical to the result of the operations performed serially with saturation after each multiplication and each addition.

1. Introduction

A characteristic of many DSP applications is that they are computationally intensive and execute millions of saturating arithmetic operations. For example, the Global System for Mobile communication (GSM) enhanced full rate speech coder and decoder each require over 128 million saturating arithmetic operations [2]. To be compliant with the GSM standard, the results produced must be identical to results produced when the operations are performed serially.

To achieve high performance, DSPs have multipliers, adders, and multiply and accumulate (MAC) units with one or two cycle latencies [4]. Typically, saturating arithmetic operations require two cycles, with saturation performed in the second cycle. More complicated operations, such as a MAC operation with saturation after both the multiplication and the addition, often require a greater number of cycles.
The challenge is to design circuits that perform parallel saturating arithmetic operations in a single cycle and still produce the same results as when the operations are performed serially.

In this paper, designs of a fractional saturating adder, multiplier, single MAC unit, and dual MAC unit with single cycle latencies are presented. The dual MAC unit can perform two MAC operations in parallel and accumulate their results. The output of the dual MAC unit is identical to the result of the operations performed serially with saturation after each multiplication and each addition. The designs are coded in Verilog and synthesized using the Synopsys Module Compiler [8] and a 0.25 micron CMOS standard cell library. Estimates for the critical path delay and area are presented for 16-bit arithmetic units.

2. Saturating Multiplier

Saturation occurs for two's complement fractional multiplication only when $(-1) \times (-1)$ is computed. Since the true result of $+1$ cannot be represented as a two's complement fractional number, the product is saturated to the largest positive fractional number. To detect when saturation needs to be performed, saturation detection logic (SDL) is used. The SDL sets the saturation bit $S$ to 1 only when both input operands equal $-1$, that is, when each operand has its sign bit set and all of its other bits clear.

Figure 1 shows the multiplication matrix for a two's complement $n$-bit multiplier [1]. Except in the case of $(-1) \times (-1)$, the two most significant product bits, $p_{2n-1}$ and $p_{2n-2}$, are identical, so the product carries a redundant sign bit. To put the result back in fractional form, the product is shifted left one position and a zero is inserted into the least significant bit. When the multiplication $(-1) \times (-1)$ occurs, the product that is generated is $+1$. Since this is not a valid two's complement fractional number, the result should be saturated to the largest positive fractional number.

Figure 2 shows the modifications that need to be made so that multiplication yields a saturated result. In this figure, $S$ is the saturation bit, which is output by the SDL.
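The behaviour described above can be checked with a small software model. This is a behavioural sketch only, not the array multiplier of the paper: the function name `sat_mul_frac` and the choice to represent an $n$-bit fraction as a Python integer in the range $[-2^{n-1}, 2^{n-1} - 1]$ are assumptions made for illustration.

```python
def sat_mul_frac(x, y, n=16):
    """Saturating two's complement fractional multiply (behavioural model).

    x and y are n-bit two's complement integers that stand for fractions
    x / 2**(n-1) and y / 2**(n-1). The result is the 2n-bit fractional
    product, i.e. an integer standing for result / 2**(2n-1).
    """
    lo = -(1 << (n - 1))            # bit pattern 100...0, the value -1
    hi = (1 << (n - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("operand does not fit in n bits")

    # Saturation detection: only (-1) x (-1) cannot be represented.
    s = (x == lo) and (y == lo)
    if s:
        return (1 << (2 * n - 1)) - 1   # largest positive 2n-bit fraction

    # Otherwise drop the redundant sign bit: shift left by one position,
    # which inserts a zero into the least significant bit.
    return (x * y) << 1
```

For 16-bit operands, `sat_mul_frac(16384, 16384)` models $0.5 \times 0.5$ and returns `1 << 29`, which stands for $0.25$ in the 32-bit fractional format.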
To produce a result in fractional form, the multiplication matrix is shifted left one position and bits to the left of the most significant product column are omitted. To handle saturation, $S$ is added to the least significant columns of the multiplication matrix, and one partial product is replaced, as shown in Figure 2. When $(-1) \times (-1)$ is performed, $S = 1$ and the multiplier produces the largest positive fractional number. In all other cases, $S = 0$ and the product does not saturate.

[Figure 1. Non-Saturating Multiplication.]

[Figure 2. Saturating Multiplication.]

Figure 3 shows a block diagram of an 8-bit two's complement array multiplier [5] that has been modified to produce a saturated product. In this diagram, a modified half adder cell (MHA) consists of an AND gate and a half adder. Similarly, a modified full adder (MFA) consists of an AND gate and a full adder. In the leftmost column and bottom row of the array, NAND gates are used to produce the partial products that involve one of the sign bits, $x_{n-1}$ or $y_{n-1}$. Modified full adders that use NAND gates to generate the partial product are labeled NMFA. To perform saturation, the $S$ bit is ORed with outputs from the rightmost column of the array to produce the least significant product bits. The SMFA cell in the bottom-right corner of the array produces a partial product and adds it to sum and carry bits from the previous row to produce a sum bit and a carry bit. A carry look-ahead adder (CLA) [6] adds the sums and carries from the bottom row of the array to produce the most significant product bits.

Using the technique described previously, the saturation bit is computed early in the multiplication process. Since $S$ is ready before the sum and carry bits that come into the SMFA, and the computation of the least significant product bits is not on the
critical path, adding saturation to the multiplier does not increase its critical path delay. The only increase in area is due to the extra OR gates for saturation and changing one of the NMFAs to an SMFA. The technique presented here can easily be extended to other parallel multiplier implementations, such as tree multipliers and Booth-encoded multipliers [9], [7].

[Figure 3. 8-Bit Saturating Array Multiplier.]

3. Saturating Adder

For two's complement addition, saturation only occurs if the two operands have the same sign and the sign of the result is different. A fast method for detecting this is to take the exclusive-or (XOR) of the carry-in and carry-out of the most significant bit position. If the XOR of these two bits is one, the result saturates; otherwise, it does not.

It is also necessary to determine the value of the saturated result. If both operands are positive and the sum saturates, the saturated result is the largest positive number. If both operands are negative and the sum saturates, the saturated result is the most negative number. When performing $S = A + B$, the value $V$ used to saturate the result can be computed from the sign bit of $A$: the sign bit of $V$ equals the sign bit of $A$, and every remaining bit of $V$ is its complement.

The design of a saturating CLA is shown in Figure 4. The CLA [6], [3] is easily modified to produce a saturated result. If the output of the XOR is one, the sum needs to be saturated and $V$ is selected as the result; otherwise, the sum from the CLA is selected. Compared to a non-saturating CLA, additional logic is needed to compute $V$, perform the XOR, and select between $S$ and $V$. Since $V$ is available at the same time as $S$, the only delay added to the critical path is the delay of the multiplexor. This technique for saturating addition can also be applied to other adder implementations.

[Figure 4. Saturating Adder, <S> = <A + B>.]

4. Saturating MAC Unit

Figure 5 shows the design of the saturating MAC unit, which computes the saturated sum of the accumulator value $A$ and the saturated product of the two multiplier operands. One approach for computing this is to have the two operands be inputs to a saturating multiplier and then to add the saturated product to $A$ using a saturating
adder. The main disadvantage of this approach is that there are two CLAs on the critical path.

To decrease the critical path delay, the saturating multiplier is modified so that it does not add the sum and carry bits generated in the last row of the array. This is accomplished by simply replacing the CLA shown at the bottom of Figure 3 with a carry save adder (CSA). The modified saturating multiplier produces a sum vector $S$ and a carry vector $C$ whose sum is the saturated product. The values $A$, $S$, and $C$ are then used as inputs to a CSA, which combines them to produce two vectors that are summed using a CLA to yield the result.

MAC-saturation detection logic (MAC-SDL) is used to detect whether the sum of the three vectors $A$, $S$ and $C$ saturates. The MAC-SDL works from the sign bit of $A$, the sign bit of the product, and the sign bit of the sum. Saturation occurs only when the signs of $A$ and the product are equal to each other, but not equal to the sign of the sum. If the MAC-SDL output is one, the output of the VGen unit is selected as the final result; otherwise, the CLA sum is selected. The VGen unit generates the correct saturated value.

[Figure 5. Saturating MAC Unit.]

5. Saturating Dual MAC Unit

Figure 6 shows the block diagram of a saturating dual MAC unit. This unit performs two MAC operations in parallel, with saturation after each addition and each multiplication. The dual MAC uses two saturating MAC units plus additional hardware, so that it can combine the results of the two MAC operations and produce a result that is equivalent to the result that would be obtained if the operations had been performed serially. In this serial computation, the first saturated product is added to the accumulator value $A$ with saturation, and the second saturated product is then added to that intermediate result with saturation. Each saturating multiplier supplies its saturated product as a pair of sum and carry vectors.

[Figure 6. Saturating Dual MAC Unit.]

To obtain the same result as when two MAC operations are executed serially, three
summations are performed in parallel.

1. The accumulator value $A$ plus the first product.
2. The accumulator value $A$ plus both products.
3. The saturation value $V$ plus the second product,

where $V$ is the largest positive number when $A$ is positive, and the most negative number when $A$ is negative. If the first summation saturates, the third summation supplies the result; otherwise, the second summation does. By performing the three summations in parallel and then selecting the proper result, the only increase in delay compared to a single saturating MAC is one CSA and one 2-to-1 multiplexor.

The dual MAC unit can also perform two independent saturating MAC operations in parallel, where the results produced correspond to the two separate MAC operations. This is accomplished by setting a control bit to 1. With minor modifications, the dual MAC unit can also perform parallel saturating multiply and subtract operations, as well as non-saturating arithmetic operations.

6. Results and Conclusions

16-bit designs for saturating and non-saturating arithmetic units were implemented using Verilog. The Synopsys Module Compiler [8] and a 0.25 micron CMOS standard cell library were used to synthesize each design. Estimates of the area and critical path delay are shown in Table 1. Areas are reported in grid units and the critical path delay is given in nanoseconds for a supply voltage of 2.5 Volts.

Table 1. Synthesis Results for 16-Bit Units

                   Saturation         No Saturation
    Unit         Delay    Area       Delay    Area
    Adder         1.1     2373        1.0     2323
    Multiplier    6.6    16695        6.6    16608
    MAC           7.5    18567        7.2    18853
    Dual MAC      8.1    42982        7.2    37706

One unexpected result in Table 1 is that the non-saturating MAC unit requires more area than the saturating MAC unit. This may have resulted from the synthesis tool's ability to exchange area for delay. The area and delay estimates shown for the non-saturating dual MAC unit correspond to two non-saturating MAC units that can operate in parallel, but cannot combine their results in the same cycle.

On many DSPs, it takes two or more cycles to perform saturating arithmetic operations. In comparison, the arithmetic units presented in this paper perform saturating arithmetic operations in a single cycle, yet require only a small increase in area and delay. For many DSP applications, which require millions of saturating arithmetic operations, these units can
provide significant performance improvements.

Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. MIP-9703421, and by a grant from Lucent Technologies and the Pennsylvania Infrastructure Technology Alliance under Project No. AMD-003.

References

[1] K. Bickerstaff, M. J. Schulte, and E. E. Swartzlander, Jr. Parallel Reduced Area Multipliers. Journal of VLSI Signal Processing, 9:181–192, April 1995.
[2] European Telecommunication Standards Institute. Digital Cellular Telecommunications System: ANSI-C Code for the GSM Enhanced Full Rate (EFR) Speech Code, 1997.
[3] I. Koren. Computer Arithmetic and Algorithms. Prentice Hall, 1993.
[4] P. Lapsley. DSP Processor Fundamentals: Architectures and Features. IEEE Press, 1997.
[5] J. C. Majithia and R. Kita. An Iterative Array for Multiplication of Signed Binary Numbers. IEEE Transactions on Computer, C-20:28–33, February 1971.
[6] T. F. Ngai, M. J. Irwin, and S. Rawat. Regular, Area-Time Efficient Carry-Lookahead Adders. Journal of Parallel and Distributed Computing, 3:92–105, 1986.
[7] H. Sam and A. Gupta. A Generalized Multibit Recoding of Two's Complement Binary Numbers and Its Proof with Application in Multiplier Implementations. IEEE Transactions on Computers, 39(8):1006–1015, 1990.
[8] Synopsys. Synopsys Module Compiler Features. Synopsys, Inc., 1998.
[9] C. S. Wallace. Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, EC-13:14–17, 1964.
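As a behavioural reference for Sections 3 to 5 of the paper above, the serial semantics that the parallel hardware must reproduce can be modelled in a few lines. This is a sketch under stated assumptions, not the authors' circuit: the function names are hypothetical, operands are assumed to be 16 bits wide, and the accumulator is assumed to be 32 bits wide. The adder uses the overflow test from Section 3 (XOR of the carry into and out of the most significant bit).

```python
def sat_add(a, b, n=32):
    """Saturating n-bit two's complement add (behavioural model)."""
    mask = (1 << n) - 1
    low = (1 << (n - 1)) - 1                 # mask of the non-sign bits
    ua, ub = a & mask, b & mask              # n-bit bit patterns
    carry_in = ((ua & low) + (ub & low)) >> (n - 1)   # carry into the MSB
    carry_out = (ua + ub) >> n                        # carry out of the MSB
    if carry_in ^ carry_out:                 # overflow: saturate
        # Saturation value follows the sign of the first operand.
        return -(1 << (n - 1)) if ua >> (n - 1) else low
    s = (ua + ub) & mask
    return s - (1 << n) if s >> (n - 1) else s       # back to a signed int


def sat_mul(x, y, n=16):
    """Saturating fractional multiply; only (-1) x (-1) saturates."""
    lo = -(1 << (n - 1))
    if x == lo and y == lo:
        return (1 << (2 * n - 1)) - 1
    return (x * y) << 1


def mac(acc, x, y):
    """Single MAC: saturate after the multiply and after the add."""
    return sat_add(acc, sat_mul(x, y))


def dual_mac_serial(acc, x1, y1, x2, y2):
    """Serial reference result that a dual MAC unit must match."""
    return mac(mac(acc, x1, y1), x2, y2)
```

With a full accumulator, `dual_mac_serial(2**31 - 1, 16384, 16384, -32768, 32767)` returns `65535`, whereas adding both products without intermediate saturation would give a different value. This is why the order of saturation matters when two MAC operations are combined.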

Parallel lumigraph reconstruction

Peter-Pike Sloan
Microsoft Research, One Microsoft Way 31/1056, Redmond WA 98052
ppsloan@

Charles Hansen
Dept of Computer Science, University of Utah, Salt Lake City, UT 84112
hansen@

April 7, 1999

Abstract

This paper presents three techniques for reconstructing Lumigraph/Lightfields on commercial parallel distributed shared memory computers. The first method is a parallel extension of the software based method proposed in the Lightfield paper. This expands the ray/2-plane intersection test along the film plane, which effectively becomes scan conversion. The second method extends this idea by using a shear/warp factorization which accelerates rendering. The third technique runs on an SGI "Reality Monster" using up to 8 IR pipes and texture mapping hardware to reconstruct images. We characterize the memory access patterns exhibited using the hardware based method and show how to use this information to reconstruct images from a tiled plane. We also show how to use quad-cubic reconstruction kernels. We also analyze the memory access patterns that occur when viewing Lumigraphs. This allows us to ascertain the cost/benefit ratio of various tilings of the texture plane.

1 Introduction

Lumigraphs [2] and Lightfields [4] are ways of representing the plenoptic function [5] using 4 DOF under the following two conditions: that the viewer is outside the convex hull of the object being viewed and that both the geometry and illumination of the scene are static.
The fundamental concept behind these representations is that if you have a point in space, say on a plane, and you know what light leaves that plane from any incident viewing angle, you have all you need to reconstruct that point from any viewpoint. By extending this to all points on the convex hull of a surface, say a box for example, you have all you need to reconstruct that surface from any viewpoint.

There are many ways to parameterize this 4D function. The two-plane parameterization, (s, t) and (u, v), is currently the most common parameterization and is used by both the Lumigraph and the Lightfield. Unfortunately, the Lumigraph and Lightfield papers used similar symbols in different contexts; we will use the convention in the Lumigraph paper and consider the first plane to be (s, t) and the second plane to be (u, v). We will also just refer to this 4D function as a Lumigraph. In all of the examples discussed in this paper, we choose planes that are parallel to each other, which is a reasonable assumption in terms of sampling distribution [4].

The Lumigraph and Lightfield used different methods to reconstruct from this 4D space.
The Lightfield reconstructed using strictly software and the Lumigraph method leveraged texturing hardware. The Lumigraph also presented the notion of geometric correction [2], but that won't be dealt with further here.

It is difficult to interactively reconstruct views using Lumigraphs and Lightfields due to the enormous size of the 4D function. An obvious method for handling the large amounts of data is through compression [4]. Another method for addressing this is to sparsely sample the 4D function, thereby trading off image fidelity for interactivity [8]. To render at full fidelity, one needs parallel techniques for two fundamental reasons:

1. reconstruction of the 4D plenoptic function is computationally intensive, and
2. the storage requirements for a densely sampled Lumigraph are enormous.

This paper will present parallel methods to accelerate purely software based reconstruction and demonstrate a parallel implementation using a parallel ray tracer [7]. We also describe an architecture for distributing hardware based reconstruction using texture mapping by leveraging multiple graphics accelerators, in this case an SGI Reality Monster with 8 InfiniteReality pipes. We extend the reconstruction to use a tiled plane and show how to reconstruct with higher order basis functions.

In the next section, we present the software based and hardware based reconstruction methods. We then present results of the parallel implementation on a 64 CPU/8-IR SGI Origin 2000. We then conclude with possible future directions for this research.

2 Overview of Parallel Reconstruction Methods

Parallel techniques are necessary for reconstruction of the full Lumigraph due to the enormous size of the 4D representation and the reconstruction costs involving ray tracing or hardware based reconstruction leveraging texture mapping hardware. We have chosen to present both software and hardware implementations since multipipe graphics systems are not common whereas multiple CPU systems are much more prevalent.

2.1 Software Reconstruction

The software reconstruction method is similar to the reconstruction used for the Lightfield [4]. In the Lightfield technique, the st and uv planes are scan converted into the reconstruction image plane and then texture filtering and lookups are performed. As pointed out in [4], one can extend ray tracing to reconstruct from the 2-plane parameterization of the Lumigraph.

[Figure 1: Ray tracing through the film plane into the Lumigraph. The figure labels a point on the lower left of the film plane, the center of projection of the camera (eye point), and the "x" and "y" axes of the film plane.]

Let us consider the generalized ray tracing solution (see figure 1). To reconstruct the image corresponding to the new viewpoint, it suffices to intersect the ray passing through the film plane with the st and uv planes. The intersection point in the st plane defines 4 views. The uv intersection point is bilinearly interpolated for each of these using the same weights. Finally, these four samples, one for each node on the st plane, are bilinearly interpolated with weights based on the st coordinates.

However, this can be optimized by employing a canonical slab space coordinate system.
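The generalized intersection step can be sketched as follows. This is an illustrative model only: it assumes the two planes are the axis-aligned planes z = 0 (st) and z = 1 (uv), that sample nodes sit at integer coordinates, and the function names `intersect_two_planes` and `bilinear_weights` are hypothetical.

```python
import math


def intersect_two_planes(eye, direction, z_st=0.0, z_uv=1.0):
    """Intersect the ray eye + k * direction with two parallel planes.

    Returns (s, t, u, v): the hit point on the st plane and on the uv plane.
    """
    ex, ey, ez = eye
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the planes")
    k_st = (z_st - ez) / dz          # ray parameter at the st plane
    k_uv = (z_uv - ez) / dz          # ray parameter at the uv plane
    return (ex + k_st * dx, ey + k_st * dy,
            ex + k_uv * dx, ey + k_uv * dy)


def bilinear_weights(x, y):
    """Weights of the four grid nodes that surround the point (x, y)."""
    i, j = math.floor(x), math.floor(y)
    fx, fy = x - i, y - j
    return {
        (i, j): (1 - fx) * (1 - fy),
        (i + 1, j): fx * (1 - fy),
        (i, j + 1): (1 - fx) * fy,
        (i + 1, j + 1): fx * fy,
    }
```

The four weights always sum to one. The same weight computation would be applied once on the st plane (to blend the four views) and once on the uv plane (to blend texels within each view).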
This system places the origin in one of the Lumigraph planes while aligning the x axis with s and u and the y axis with t and v. The z axis is normal to the st and uv planes. In this configuration, it suffices to employ only the individual coordinates: for example, the s coordinate of an intersection depends only on x, since the X and Y coordinates are aligned appropriately with (s, u) and (t, v). Starting from a point on the lower left of the film plane, the ray equation is written one coordinate at a time. Since the st and uv planes have the same normal and are separated by a fixed distance, the intersection with each plane is obtained from the z coordinate alone. Due to the canonical slab space, all that is needed is to map the intersection points into indices; for s, for example, a scale and an offset map from s space to indices. This can be further optimized by noting that part of the expression is constant during a frame and is the same for both the st and uv intersections. This reduces the inner loop computation to a multiply and an add per coordinate. Thus, mapping to the canonical slab space allows computation with a single coordinate for the different variables: z for ray intersection, x for s and u, and y for t and v.

To efficiently compute the reconstructed image, we partition the film plane into blocks which effectively utilize the data cache and render each partition independently. Since each partition is independent, we can parallelize them. For this, we employ the parallel ray tracing framework described in [6], which provides dynamic load balancing of frames by assigning image subblocks to a work queue for each frame to be rendered. Other processes obtain groups of these blocks from the queue in a monotonically decreasing amount. This heuristic provides quite reasonable dynamic load balancing without the overhead of determining work loads needed for task stealing.

The parallel algorithm is based on a master/slave configuration. The master process is responsible for populating the work queue, managing the display and performing updates, such as view transformation, from the user. While the implementation is straightforward, careful attention to performance details has allowed this parallel ray tracer to achieve interactive
rates. This is particularly interesting for Lumigraph reconstruction where one has multiple CPUs but limited graphics hardware. As will be seen in the results section, we obtain both interactive frame rates for the reconstruction as well as excellent scaling.

This method can be further improved with the observation that if the reconstructed image is scan converted directly on the uv plane, there is no need to perform any ray plane intersection tests. This can be accomplished through the use of a sheared frustum as shown in figure 2a. We again utilize the canonical slab space, which ensures that x and y line up with (s, u) and (t, v). If we reconstruct onto the portion of the uv plane which is covered by the projection of the film plane, we can warp this resultant image back to the view plane using standard 2D texture mapping hardware with a single texture mapped polygon. This is similar to the idea used in shear/warp volume rendering [3].

[Figure 2: Reconstructing on the plane, panels (a) and (b).]

We can incrementalize the scan conversion on both the st and the uv planes, indicated by the dashed lines in figure 2a.
Note that in 3D the camera may be rotated, and we can scan convert the bounding box of the projected viewing frustum.

This can be further accelerated if the projection of the st plane onto the uv plane is not fully contained in the frustum, as shown in figure 2b. In this case, we need only reconstruct the portion of the projected frustum which contains samples for both the st and uv planes. We can project the end-points of the st plane onto the uv plane, project the bounding box of the view frustum onto the uv plane, and we know the extents of the uv plane itself. We can limit the scan conversion to the region defined by the greatest minimum and least maximum of all these points. The scan conversion is always done so that only two indices are incremented in the inner loop, effectively having two adds to adjust the indices per pixel.

Figure 3a shows an image of the sheared frustum reconstructed on the uv plane. Notice how the bowl is warped. Figure 3b shows the final image, which warps the sheared image onto the film plane.

[Figure 3: Images of the sheared image (a) and the final image (b).]

By parallelizing the scan conversion process, we can speed up the rendering. We use the same parallel infrastructure as before, except that the film plane is not partitioned; instead, the portion of the uv plane covered by the view frustum is partitioned. The same dynamic load balancing technique can be employed.

2.2 Hardware Based Reconstruction

It is possible to exploit graphics hardware to reconstruct Lumigraphs [2, 8]. First, basis functions must be defined over the st plane. A triangulation of the nodes on the st plane is commonly used. Then for each node on the st plane all of the triangles connected to it are drawn, using an alpha value of 1 at that particular node and 0 at all of the other nodes. The texture coordinates are determined by intersecting rays from the eye through the corresponding st node with the uv plane. The basis functions for each image are summed, so that each triangle is rendered three times and the results added, summing up to one everywhere in the triangle. For more general
basis functions this property must be preserved: the sum of the basis functions at any point on the st plane is one. Otherwise leveraging the hardware to perform the blending would be difficult.

We have implemented bi-cubic reconstruction on the st plane using multi-pass rendering. The current implementation uses a 1D texture map and three passes: two passes for defining the support of the tensor product basis function, one for drawing the final image. If the application became fill rate limited, a 2D image of the basis function could be used instead. The 1D texture map is initialized with the values for the basis function (cubic B-spline in this case). The pseudo-code is as follows:

    // only draw into alpha planes
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_TRUE);
    glEnable(GL_TEXTURE_1D);  // turn on texturing
    // load 1D texture
    DrawQuad(0, 0, 1, 1);  // LL, UL, UR, LR -> draws horizontal pass
    glEnable(GL_BLEND);
    glBlendFunc(GL_ZERO, GL_SRC_ALPHA);  // take the product at every pixel
    DrawQuad(0, 1, 1, 0);  // LL, UL, UR, LR -> draws vertical pass
    glDisable(GL_TEXTURE_1D);  // turn off 1D texturing
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);  // enable full color writing
    glBlendFunc(GL_DST_ALPHA, GL_ONE);
    // draw normal textured basis function - same quad as above,
    // except use 2D texture coordinates

Figure 4 shows the steps in drawing the support of the basis function. This allows reconstruction using any basis functions that are strictly positive and form a partition of unity. That is, at any point on the st plane the sum of the values of all of the basis functions that it overlaps equals one.

[Figure 4: Two 1D basis functions and the tensor product (bi-linear).]

Figure 5a shows an image from one graphics pipe reconstructed with a constant basis function. Figure 5b shows the image from one graphics pipe reconstructed with a piecewise linear basis function. Figure 5c shows the image from one graphics pipe reconstructed with a bi-cubic basis function.

The main bottleneck with these methods is the limited amount of texture memory, and the bandwidth between
the host and the accelerator. The images associated with each st node are distributed by interleaving the pipes along the one-dimensional index of a Hilbert curve through the nodes. This distribution is statically created when the program starts and no attempt is made at load balancing. Each pipe simply renders the basis functions for each node that it contains, reads back the results and creates the final image. The compositing step currently used is very crude: pipe 0 composites the images of all the other pipes. For high frame rates, this becomes a limiting factor in the current system and is addressed in the future work section.

2.2.1 Tiled Hardware Reconstruction

Loading full texture maps for each basis function being drawn is more work than is necessary. If every basis function on the st plane were not clipped by the uv plane, the number of texels required for a frame would be a constant factor times the number of texels on the uv plane. The factor is 1 for constant basis functions, 3 for piecewise linear, 4 for bilinear and 16 for quad cubic. Loading whole textures effectively touches the number of texels on the uv plane times the number of nodes on the st plane. Intuitively, this can be thought of as projecting the triangulation of the st plane onto the uv plane.

Our solution to this problem is to tile the uv plane. Instead of representing it as one texture, the uv plane is uniformly tiled into several smaller textures. Figure 6 shows a nontiled uv plane while Figure 7 shows an example of tiling the uv plane. Only the tiles for a given node that contain the basis function for the corresponding node need to be loaded into texture memory.

[Figure 5: Images of different basis functions: constant (a), piecewise linear (b), and bi-cubic (c).]

[Figure 6: Triangulation on plane (UV Plane).]

[Figure 7: Tiling the plane (UV Plane).]

When implementing the tiled architecture we found that OpenGL would crash and/or behave very erratically when the number of texture objects bound was greater than 4096.
The solution was to manage the "texture cache" ourselves. With small tile sizes (less than 64x64) it is impossible to have enough texture objects to fully utilize the amount of texture memory in the system (64MB). A workaround could represent tiles as subregions inside a larger texture, but that would complicate the implementation. Currently, enough texture objects are allocated to fill twice the texture memory for a given tile size, but never more than 4096. This technique is more effective at utilizing OpenGL's texture caching scheme and lessens the amount of time spent rebinding texture-IDs to tiles.
Each tile in each basis function has a simple structure which contains a "valid" word, representing the last frame in which the tile was used, and an index into the list of texture-IDs, or -1 if the tile isn't currently bound to a texture. There is also a word associated with each texture-ID, which references the tile to which it is currently pointing. Thus, when a texture-ID is reused, the tile which was using it can be updated to reflect the fact that it is no longer bound to a texture. The pseudo code for the system is as follows:
    ComputeBasisFunctions(); // detail below
    QueryTextureResidence();
    DrawBoundTilesThatAreResident();
    DrawBoundTilesThatAreNotResident();
    DrawTilesSwapOld();
    DrawTheRestOfTheTiles();
ComputeBasisFunctions creates the texture coordinates for all of the basis functions; it also computes the tile indices that overlap the bounding box of the basis functions.
QueryTextureResidence just executes an OpenGL command that, given a list of texture-IDs, tells which ones are in texture memory. Strictly speaking, this should not be necessary if the texture-IDs generated can all fit in texture memory.
DrawBoundTilesThatAreResident loops through all of the tiles that are in texture memory and need to be drawn this frame. The pseudo code is as follows:
    Foreach (basis function)
        Foreach (tile that needs to be drawn)
            If (tile is bound and in texture memory)
                SetStencilFuncForTile();
                BindTextureForTile();
                DrawTile();
DrawBoundTilesThatAreNotResident is almost the same as the above, but is only applied to tiles that are bound and not in texture memory.
SetStencilFuncForTile sets the stencil function to test based on the tile's position on the UV plane. The bits are set in the stencil planes at the beginning of every frame; this call just changes the test. The stencil bits are used for clipping the basis functions to the tiles.
After these tiles have been drawn it is necessary to begin reassigning texture-IDs to the tiles, since the only remaining tiles that need to be drawn have no associated texture object. Two lists are built: one for texture-IDs whose tiles were not drawn this frame (found by just checking the valid word for the associated tile) and one for all of the rest of the IDs. The first list (IDs for tiles that aren't being used this frame) is exhausted first, followed by the second list. If there are still more tiles to draw, texture-IDs can be processed in order.
When dealing with tiles there are 4 counters that are computed and optionally written to a file for every frame:
1. Total number of tiles needed for this frame
2. Number of tiles that were in texture memory
3. Number of tiles that are in this frame but were not in the last frame
4. Number of tiles that were in the last frame but not in this frame
This information will be presented in the results section and can be used in the future when trying to ascertain how much CPU time can be spent decompressing Lumigraphs.
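The bookkeeping described above (a "valid" word and a texture-ID index per tile, plus a back-reference per texture-ID) can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the names Tile, TextureCache, bind and assign are hypothetical, no OpenGL calls are made, and treating never-used texture-IDs as part of the first list is an assumption.

```python
from itertools import cycle


class Tile:
    def __init__(self, name):
        self.name = name
        self.valid = -1    # last frame in which the tile was used
        self.tex_id = -1   # index into the texture-ID list, or -1 if unbound


class TextureCache:
    def __init__(self, num_ids):
        # For each texture-ID, the tile it currently points to (None if unused).
        self.owner = [None] * num_ids

    def bind(self, tile, tex_id):
        old = self.owner[tex_id]
        if old is not None:
            old.tex_id = -1            # the evicted tile is no longer bound
        self.owner[tex_id] = tile
        tile.tex_id = tex_id

    def assign(self, frame, needed):
        """Give every unbound tile needed this frame a texture-ID."""
        for tile in needed:
            tile.valid = frame
        unbound = [t for t in needed if t.tex_id == -1]
        ids = range(len(self.owner))
        # First list: IDs whose tile was not drawn this frame (assumption:
        # never-used IDs count as well). Second list: all remaining IDs.
        stale = [i for i in ids
                 if self.owner[i] is None or self.owner[i].valid != frame]
        busy = [i for i in ids if i not in stale]
        uploads = []
        # Exhaust the first list, then the second, then keep cycling in order.
        for tile, tex_id in zip(unbound, cycle(stale + busy)):
            self.bind(tile, tex_id)
            uploads.append((tile.name, tex_id))
        return uploads
```

With two texture-IDs and tiles a, b, c, drawing a and b in one frame and then b and c in the next reuses the ID of a for c and leaves b bound, which is the behaviour the two-list scheme is meant to give.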
2.2.2 Software Architecture
For each graphics pipe there are two threads: one that just processes input from the GUI and one that just renders using OpenGL. The only GUI thread that is active for display purposes is the one running on the pipe the user is logged into; the others can be used for debugging. The pseudo code is as follows:
    While (1) {
        Bool camChanged = GetEvent(); // blocks on events...
        If (camChanged) {
            CamLock.lock();
            NextCam = dispCam;
            ++bufferedEvents;
            if (bufferedEvents == 1) // currently rendering - make sure goes again
                RenderWake.up();
            CamLock.unlock();
        }
    }
The rendering threads are a little more complex:
    Camera MyCam; // local copy of camera for this thread...
    While (1) {
        If (id == 0) { // pipe 0 is different than the others...
            RenderWake.down();
            CamLock.lock();
            MyCam = WorkCam = NextCam;
            bufferedEvents = 0; // next event will up RenderWake
            CamLock.unlock();
            For (int I = 1; I < NumPipes; I++)
                SlaveWake.up(); // wake up all of the slaves
        } else { // it's a slave
            SlaveWake.down();
            MyCam = WorkCam; // synchronized cameras...
        }
        DrawBasisFunctions(); // draw all of your basis functions
        ReadFrame(); // get the data back
        Composite(); // somehow composite images between pipes
        If (id == 0) {
            For (int I = 1; I < NumPipes; I++)
                MainWake.down(); // wait for other threads to finish
            DrawFrame(); // displays, swaps buffer...
        } else { // wake up main display thread...
            MainWake.up();
        }
    }
The rendering thread for the pipe the user is on is essentially a master process over the other pipes' rendering threads. The composite function simply gathers the results.
DrawFrame currently composites the other threads' buffers into the back buffer. The bufferedEvents variable allows the rendering threads to not lag behind the display threads. When you stop moving the mouse, things will stop: the camera used for rendering is always the latest. If the frame rates are extremely low, this could be objectionable. A fixed-size window of cameras could be used and simply cycled through, where the display thread updates different slots depending on the timings of events. This would cause a fixed delay, but possibly smoother camera motion.
Structuring the code this way made the implementation cleaner: both threads share the same "drawable" (in X11 parlance) and do not have to worry about mutual exclusion because they are using unique Display pointers. The GUI thread creates the window; the OpenGL thread creates the context, using a unique Display pointer but the same drawable. If the compositing is done entirely in software, the hardware could be utilized to start drawing the next frame immediately. This would make the most sense if the Display thread was separate from the rendering pipes
    512x512        2 cpu     8 cpu     32 cpu
    Ray Tracing    1.835     6.993     27.924
    Shear/Warp     3.268     12.873    45.010

    32 cpus
    Ray Tracing    3.569     15.439    12.303

    2 tiles
    1 pipe         23.785
    2 pipe         68.891
    4 pipe         115.305

    Number of Tiles    1 tile     4 tiles
    1 pipe             4.954      48.866
    8 pipe             118.552    29.408
Table 3: Average FPS for the 1GB Lumigraph dataset.
... selections. The first frame of the sequence always took the longest to render, since no data from previous frames was already in texture memory; the initial texture load generally took about 1 second for either data set. In Tables 2 and 3 we present the average frames per second (FPS) over the last 653 frames of this sequence for the 256MB and the 1GB dataset. This is just the time spent rendering; it does not include any of the compositing. Slowing down when moving to smaller tile sizes was expected. At some point the inefficiency of transferring very small textures and the overhead from rendering the basis functions multiple times, most likely the former, will start to degrade performance.
The information gathered about the memory access patterns is shown in Tables 4 and 5. The numbers reported are in MegaTexels (texture elements). The four data series are the number used per frame, the number that were already bound when the frame started, the number that were new (i.e., had not been rendered in the previous frame) and the number that were old (i.e., were rendered in the previous frame but not in the current frame). These numbers are all averages over the sequence. The number of pipes only changed the average number of bound textures.
    2 tiles
    avg used     10.6081
    avg bound    9.4957
    avg new      0.3959
    avg out      0.3899

    Number of Tiles    1 tile    4 tiles
    avg used     42.4324    5.7438    88.8679    13.5166
    avg new      1.5835     0.6254    1.8974     0.9783
...tion that support compressed textures (like the S3TC technique, which is loosely based on color cell compression [1]) should be leveraged to explore much more aggressive compression ratios. Also, a more general tiling, not strictly based on "whole" planes, in which locality with respect to both planes is considered, would be beneficial, since the tiles are clearly correlated in more than a single plane.
The current compositing solution is very crude and limits the multipipe performance. By slightly modifying the architecture we can have the compositing done fully in software. The rendering threads would hand over their images to a compositing queue and then synchronize at a barrier. When the compositing threads finish, the "driving" window (which is on pipe 0, but not used to render any basis functions) would just display the final image.
Pipe 0 would have two separate contexts, one for displaying images that only does a single glDrawPixels and a glXSwapBuffers. This would cause a very slight load imbalance between the pipes. As long as the compositing threads were faster than the fastest drawing threads, there would effectively be no performance penalty for compositing.
Addressing the load imbalance in the current system should be examined, perhaps by using "work stealing", where there is a pool of nodes that aren't in any of the pipes' texture memories. These would be parceled out via a work queue.
Batching texture tiles together into metatiles could help with the performance impact incurred when using smaller tile sizes. The tiles to be uploaded would be batched and sent in a single glTexSubImage command.
The software version currently also runs on parallel Intel machines running Windows NT. We would like to investigate using the SIMD floating point instructions, and potentially the streaming memory extension as well, that exist in the Pentium III to accelerate the shear/warp and incremental rendering algorithms. Moving the "hardware"-based code, even for multiple pipes, should be possible as well with the current support for multi-monitor configurations.
6 Acknowledgments
Thanks to Chris Johnson for providing the open collaborative research environment that allowed this work to happen. Thanks to Yarden Livnat and the SCI group for allowing the first author to spend his weekends there working on the big machine. Discussions of this work with Harry Shum and Michael Cohen of Microsoft Research and Steven Gortler of Harvard University were helpful. Thanks to Steven Parker for having such a great infrastructure for plugging in the software side.
References
[1] Graham Campbell, Tom A. DeFanti, Jeff Frederiksen, Stephen A. Joyce, Lawrence A. Leske, John A. Lindberg, and Daniel J. Sandin. Two bit/pixel full color encoding. In David C. Evans and Russell J. Athay, editors, Computer Graphics (SIGGRAPH '86 Proceedings), volume 20, pages 215–223, August 1986.
[2] Steven Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael Cohen. The lumigraph. In Holly Rushmeier, editor, Computer Graphics (SIGGRAPH '96 Proceedings), pages 43–55, August 1996.
[3] Philippe Lacroute and Marc Levoy. Fast volume rendering using shear-warp factorization of the viewing transformation. In Andrew Glassner, editor, Computer Graphics (SIGGRAPH '94 Proceedings), pages 451–458, 1994.
[4] Marc Levoy and Pat Hanrahan. Light field rendering. In Holly Rushmeier, editor, Computer Graphics (SIGGRAPH '96 Proceedings), pages 31–42, August 1996.
[5] Leonard McMillan and Gary Bishop. Plenoptic modeling: An image-based rendering system. In Robert Cook, editor, Computer Graphics (SIGGRAPH '95 Proceedings), pages 39–46, August 1995.
[6] Steven Parker, William Martin, Peter-Pike Sloan, Peter Shirley, Brian Smits, and Charles Hansen. Interactive ray tracing. In Symposium on Interactive 3D Graphics, April 1999.
[7] Steven Parker, Michael Parker, Yarden Livnat, Peter-Pike Sloan, Charles Hansen, and Peter Shirley. Interactive Ray Tracing for Volume Visualization. IEEE Transactions on Visualization and Computer Graphics, To Appear 1999.
[8] Peter-Pike Sloan, Steven Gortler, and Michael Cohen. Time critical lumigraph rendering. In Michael Cohen and David Zeltzer, editors, 1997 Symposium on Interactive 3D Graphics, pages 17–24, 1997.

Parallel Linear General Relativity and CMB

In the past, varying degrees of approximation have been made in order to carry out the evolution (for example Peebles & Yu 1970, Bond & Efstathiou 1987, Holtzman 1989, Sugiyama and Gouda 1992, among others). The code we discuss here has a highly accurate treatment of both the physics and the numerical integration; we believe it is the most accurate to date. The tradeoff for this accuracy is increased computational cost, making the use of supercomputers necessary.
Parallel Linear General Relativity and CMB Anisotropies
A Technical Paper to appear in Supercomputing '95 in HTML format
Paul W. Bode
bode@
Edmund Bertschinger
The equations are most easily solved in k-space. In Fourier space, all the k modes in the linearized Einstein, Boltzmann, and fluid equations evolve independently. In addition to the Fourier transform, there is also an angular expansion of the phase space distributions in terms of Legendre polynomials; this turns the Boltzmann equations into moment hierarchies determined at each time step. At a given time it is also necessary to integrate over the 3-momentum, q, of the massive neutrinos. We carry out a full integration down to the final time without use of any free-streaming approximation. The time integration, ending at the present, is carried out using the standard Runge-Kutta integrator DVERK, obtained from netlib@.

Implementing the Viola-Jones Face Detection Algorithm

Implementing the Viola-Jones Face Detection Algorithm
Ole Helvig Jensen
Kongens Lyngby 2008 IMM-M.Sc.-2008-93
Implementing the Viola-Jones Face Detection Algorithm
Technical University of Denmark Informatics and Mathematical Modelling Building 321, DK-2800 Kongens Lyngby, Denmark Phone +45 45253351, Fax +45 45882673 reception@imm.dtu.dk www.imm.dtu.dk
18-09-2008
Summary
The emergence of high-resolution digital cameras for recording both still images and video has significantly changed the development of communication and entertainment over the last few years. At the same time, Moore's law has made available an enormous amount of computing power that only 20 years ago was reserved for high-profile research institutions and intelligence agencies. These two trends have, respectively, demanded and permitted the development of image-processing algorithms of previously unseen complexity. These algorithms have in turn enabled new ways of processing existing image material. In parallel with this development, protection against attacks from opponents of modernity requires ever greater surveillance of public space. As a consequence of this regrettable circumstance, more surveillance cameras are being installed in airports, at train stations and even on open streets in larger cities. Whether the purpose is entertainment or deadly serious surveillance, the same solutions are required for tasks such as the detection and recognition of faces. Because of the varied conditions under which these solutions must work, there is a great need for robust algorithms. In 2004 Paul Viola and Michael J. Jones published the article "Robust Real-Time Face Detection" in the International Journal of Computer Vision. The algorithm presented in that article has been so successful that today it can be regarded as the de facto standard for finding faces in images. The algorithm's success is mainly due to its relatively simple design, its fast execution and its remarkable results. This report documents all relevant aspects of the implementation of the Viola-Jones face detection algorithm. The intention is that the algorithm should be able to process any conceivable image containing faces and, as a result, produce a list of the detected faces.

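Research Progress and Development Prospects of the Silkworm Bioreactor (1)

Research Progress and Development Prospects of the Silkworm Bioreactor (1). Abstract: At present, research and development of the silkworm bioreactor mainly uses BmNPV as the vector to express a variety of useful proteins in the silkworm's body fluid, with expression levels many times higher than in other biochemical microbial systems; however, expressing foreign proteins with a transgenic silkworm bioreactor has greater advantages than the silkworm BmNPV expression system.

Silkworm bioreactor research and development has a history of nearly 20 years and has expressed hundreds of foreign genes. Because of low expression levels and the difficulty and cost of separating and purifying the products, it has not yet reached industrial production. Transgenic silkworm bioreactors have seen some fairly good attempts, and improving transgenic techniques to raise the integration rate of foreign genes is the main direction for future work.

This paper reviews the current state of research on the silkworm BmNPV expression system and the research progress and development prospects of the transgenic silkworm bioreactor.

Keywords: Bombyx mori bioreactor, BmNPV expression system, transgenic silkworm, development prospects. Introduction: Sericulture originated in China and is one of China's traditional industries. Over more than 5,000 years of production practice it has made great contributions to China's economic development and to cultural exchange between China and other countries, and it occupies an important position in the national economy.

The silkworm (Bombyx mori) belongs to the family Bombycidae of the order Lepidoptera. It has an open circulatory system: a thin but tough cuticle encloses a cavity filled with hemolymph and the various organs.

For thousands of years, people have exploited the silkworm's biological ability to spin silk and form cocoons in order to produce raw silk in large quantities.

Because silkworm silk is soft and comfortable, breathable and warm, absorbs and releases moisture well, has a pearl-like luster and takes dye readily, it is known as the "queen of fibers", and the gorgeous silk fabrics woven from it are widely loved.

Moreover, with the development of science and technology, many new techniques and experimental methods have been applied and popularized in new uses of the silkworm and in basic research on it.

Besides its traditional use of producing cocoons from which silk is reeled, new uses of the silkworm, such as serving as a bioreactor to produce high-value substances, are continually being researched and developed.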


Polynomial systems occur in a wide variety of application areas in science and engineering; see the case studies in [34, Chapter 9]. Homotopy continuation methods to solve polynomial systems (recently surveyed in [27] and [34]) are well suited for parallel implementation, as described in [1], [6, 7], [18], [31], and [32].
Parallel Implementation of the Polyhedral Homotopy Method∗
Jan Verschelde† Yan Zhuang‡
Abstract
Homotopy methods to solve polynomial systems are well suited for parallel computing because the solution paths defined by the homotopy can be tracked independently. For sparse polynomial systems, polyhedral methods give efficient homotopy algorithms. The polyhedral homotopy methods run in three stages: (1) compute the mixed volume; (2) solve a random coefficient start system; (3) track solution paths to solve the target system. This paper is about how to parallelize the second stage in PHCpack. We use a static workload distribution algorithm and achieve a good speedup on the cyclic n-roots benchmark systems. Dynamic workload balancing leads to reduced wall times on large polynomial systems which arise in mechanism design.
2000 Mathematics Subject Classification. Primary 65H10. Secondary 14Q99, 68W30.
Algorithm 1.1 Polyhedral homotopies to solve a generic system.
Input: ω, G(x) = 0.                              mixed-cell configuration and generic system
Output: G−1(0).                                  all solutions to G(x) = 0
for all C ∈ ω do                                 enumerate all C with inner normals v
    create a polyhedral homotopy G(y, s) = 0;    apply coordinate transformation using v
    solve the start system G|C(y, 0) = 0;        G|C(y, 0) = 0 has Vol(C) solutions
    track y(s): G(y(s), s) = 0, for s from 0 to 1;    track as many solution paths as Vol(C)
end for.
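The last step of the algorithm, tracking y(s) for s from 0 to 1, can be illustrated on a toy problem. The sketch below is not a polyhedral homotopy and is not PHCpack code: it assumes a plain linear homotopy between two univariate quadratics, a fixed step size, the previous point as predictor and a few Newton iterations as corrector. The polynomials, the constant GAMMA and all function names are hypothetical choices made for illustration.

```python
GAMMA = 0.6 + 0.8j  # arbitrary complex constant, chosen only for illustration


def start(x):
    return x * x - 1              # start system with known roots +1 and -1


def target(x):
    return x * x - 5 * x + 6      # target system with roots 2 and 3


def h(x, t):
    # Linear homotopy: equals GAMMA * start at t = 0 and target at t = 1.
    return GAMMA * (1 - t) * start(x) + t * target(x)


def dh_dx(x, t):
    return GAMMA * (1 - t) * 2 * x + t * (2 * x - 5)


def track(x, steps=200, newton_iters=5):
    """Follow one solution path from t = 0 to t = 1."""
    for k in range(1, steps + 1):
        t = k / steps
        # Predictor: reuse the previous point. Corrector: Newton's method in x.
        for _ in range(newton_iters):
            x = x - h(x, t) / dh_dx(x, t)
    return x


# Each start root is tracked on its own, so the two calls are independent.
ends = sorted((track(complex(x0)) for x0 in (1.0, -1.0)), key=lambda z: z.real)
```

Because each call to track touches only its own path, the calls could be handed to different processors without any communication, which is the property the paper exploits.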
Almost all polynomial systems occurring in applications are sparse, i.e., only relatively few monomials appear with nonzero coefficients. For sparse systems, homotopies based on the degrees typically track many paths diverging to infinity. For example, in one benchmark system (from mechanism design [36], [39]), a degree-based homotopy used in [38] by the POLSYS_GLP extension [37] of HOMPACK [44], [45], [46] leads to 9,216 paths, whereas polyhedral homotopies track the optimal number of 1,024 paths. The 1,024 for this system is the mixed volume (which we will define precisely in the next section). We view the mixed volume as the number of isolated solutions of a system with randomly chosen complex coefficients.
To solve a polynomial system using polyhedral homotopies, we distinguish three stages. First we compute the mixed volume, ignoring the particular coefficients of the polynomial system. In the second stage, we apply polyhedral homotopies to solve a system with the same sparse structure as the given system, but with random coefficients, using the results of the first stage. The third and final stage applies a plain linear homotopy to solve the given system using coefficient-parameter polynomial continuation [30] (related to the cheater's homotopy [29]).
In this paper, we primarily focus on the second stage, and consider a mixed-cell configuration as given. These mixed-cell configurations may be computed using PHCpack [40], or by MixedVol [14], or PHoM [17]. A parallel implementation of PHoM is described in [16]. While the third stage involves as many paths as the second stage, the computational cost can increase significantly because the polynomial system at the end of the homotopy is no longer generic. Conditions on genericity are given in [33].
This paper continues the development of parallel implementations of homotopy algorithms in PHCpack [40], started in [43] with Pieri homotopies and followed by parallel decomposition methods in [24, 25].
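The gap between a degree-based path count and the mixed volume can be seen on a tiny example. The sketch below is an illustration only, not the mixed-cell algorithm used in PHCpack, MixedVol or PHoM: for two polynomials in two variables it uses the planar identity MV(P, Q) = Area(P + Q) − Area(P) − Area(Q), where P and Q are the Newton polygons and P + Q is their Minkowski sum. The function names and the example supports are hypothetical.

```python
from itertools import product


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def area(points):
    """Area of the convex hull of the points (shoelace formula)."""
    hull = convex_hull(points)
    twice = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice) / 2


def mixed_volume_2d(support1, support2):
    """Mixed volume of two planar supports via the Minkowski sum of their hulls."""
    minkowski = [(a[0] + b[0], a[1] + b[1])
                 for a, b in product(support1, support2)]
    return area(minkowski) - area(support1) - area(support2)


# Supports of two polynomials of the form c0 + c1*x + c2*y + c3*x*y.
bilinear = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Supports of two dense quadrics (all monomials up to degree 2 lie in this hull).
dense_quadric = [(0, 0), (2, 0), (0, 2)]
```

Both systems have total degree 2 × 2 = 4, so a degree-based homotopy would track 4 paths in either case. The mixed volume equals 4 for the dense quadrics but only 2 for the sparse bilinear pair, so a polyhedral homotopy tracks 2 paths there and none of them diverges.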
Our first parallel implementation of polyhedral homotopies uses a static workload distribution. This static distribution (as we experienced in [43]) is favorable when all solution paths require the same computational cost, as could be expected from polyhedral homotopies solving generic polynomial systems. However, dynamic load balancing significantly improves the wall time for large mechanism design problems.
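Why the two strategies differ can be seen with a small simulation. The sketch below is a simplified model, not the scheduling code of PHCpack: it assumes that the cost of each path is known, that the static scheme assigns path i to worker i mod p up front, and that the dynamic scheme hands the next path to whichever worker becomes idle first. The function names and the cost lists are hypothetical.

```python
import heapq


def static_makespan(costs, workers):
    """Wall time when path i is assigned in advance to worker i % workers."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def dynamic_makespan(costs, workers):
    """Wall time when each path goes to the worker that becomes idle first."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


uniform = [1, 1, 1, 1, 1, 1, 1, 1]   # every path costs the same
uneven = [8, 1, 1, 1, 8, 1, 1, 1]    # two paths are much harder than the rest
```

With four workers, both schemes finish the uniform list in 2 time units. On the uneven list the static assignment gives both expensive paths to the same worker and needs 16 units, while the dynamic assignment finishes in 9.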
Key words and phrases. Continuation methods, load balancing, parallel computation, path following, polynomial systems, polyhedral homotopies.
1 Introduction
∗ Date: 26 May 2006.
† Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 South Morgan (M/C 249), Chicago, IL 60607-7045, USA. Email: jan@ or jan.verschelde@. URL: /˜jan. This material is based upon work supported by the National Science Foundation under Grant No. 0134611 and Grant No. 0410036.
‡ Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 851 South Morgan (M/C 249), Chicago, IL 60607-7045, USA. Email: yzhuan1@. URL: /˜yzhuan1.