A Cache-Based Data-Intensive Distributed Computing Architecture for "Grid" Applications
FPGA (foreign-language literature)

High Level Programming for Real Time FPGA Based Image Processing

D Crookes, K Benkrid, A Bouridane, K Alotaibi and A Benkrid
School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, UK

ABSTRACT

Reconfigurable hardware in the form of Field Programmable Gate Arrays (FPGAs) has been proposed as a way of obtaining high performance for computationally intensive DSP applications such as Image Processing (IP), even under real time requirements. The inherent reprogrammability of FPGAs gives them some of the flexibility of software while keeping the performance advantages of an application specific solution. However, a major disadvantage of FPGAs is their low level programming model. To bridge the gap between these two levels, we present a high level software environment for FPGA-based image processing, which aims to hide hardware details as much as possible from the user. Our approach is to provide a very high level Image Processing Coprocessor (IPC) with a core instruction set based on the operations of Image Algebra. The environment includes a generator which generates optimised architectures for specific user-defined operations.

1. INTRODUCTION

Image Processing application developers require high performance systems for computationally intensive Image Processing (IP) applications, often under real time requirements. In addition, developing an IP application tends to be experimental and interactive. This means the developer must be able to modify, tune or replace algorithms rapidly and conveniently.

Because of the local nature of many low level IP operations (e.g. neighbourhood operations), one way of obtaining high performance in image processing has been to use parallel computing [1]. However, multiprocessor IP systems have, generally speaking, not yet fulfilled their promise. This is partly a matter of cost, lack of stability and software support for parallel machines; it is also a matter of communications overheads, particularly if sequences of images are being captured and distributed across the processors in real time.

A second way of obtaining high performance in IP applications is to use Digital Signal Processing (DSP) processors [2,3]. DSP processors provide a performance improvement over standard microprocessors while still maintaining a high level programming model. However, because of their software based control, DSP processors still have difficulty in coping with real time video processing.

At the opposite end of the spectrum lie the dedicated hardware solutions. Application Specific Integrated Circuits (ASICs) offer a fully customised solution to a particular algorithm [4]. However, this solution suffers from a lack of flexibility, plus a high manufacturing cost and a relatively lengthy development cycle.

Reconfigurable hardware solutions in the form of FPGAs [5] offer high performance, with the ability to be electrically reprogrammed dynamically to perform other algorithms. Though the first FPGAs were only capable of modest integration levels and were thus used mainly for glue logic and system control, the latest devices [6] have crossed the million gate barrier, making it possible to implement an entire System On a Chip. Moreover, the introduction of the latest IC fabrication techniques has increased the maximum speed at which FPGAs can run.
Designs running at over 150 MHz are no longer outside the realm of possibility on the new FPGA parts, allowing FPGAs to address high bandwidth applications such as video processing.

A range of commercial FPGA based custom computing systems includes: the Splash-2 system [7], the G-800 system [8], and VCC's HOTWorks HOTI & HOTII development systems [9]. Though this approach seems to enjoy the advantages of both the dedicated solution and the software based one, many people are still reluctant to move toward this new technology because of the low level programming model offered by FPGAs. Although behavioural synthesis tools have made enormous progress [10, 11], structural design techniques (including careful floorplanning) often still result in circuits that are substantially smaller and faster than those developed using only behavioural synthesis tools [12].

In order to bridge the gap between these two levels, this paper presents a high level software environment for an FPGA-based Image Processing machine, which aims to hide the hardware details from the user. The environment generates optimised architectures for specific user-defined operations, in the form of a low level netlist. Our system uses Prolog as the basic notation for describing and composing the basic building blocks. Our current implementation of the IPC is based on the Xilinx 4000 FPGA series [13].

The paper first outlines the programming environment at the user level (the programming model). This includes facilities for defining low level Image Processing algorithms based on the operators of Image Algebra [14], without any reference to hardware details. Next, the design of the basic building blocks necessary for implementing the IPC instruction set is presented. Then, we describe the runtime execution environment.

2. THE USER'S PROGRAMMING MODEL

At its most basic level, the programming model for our image processing machine is a host processor (typically a PC programmed in C++) and an FPGA-based Image Processing Coprocessor (IPC) which carries out complete image operations (such as convolution, erosion etc.) as a single coprocessor instruction. The instruction set of the IPC provides a core of instructions based on the operators of Image Algebra. The instruction set is also extensible, in the sense that new compound instructions can be defined by the user in terms of the primitive operations in the core instruction set. (Adding a new primitive instruction is a task for an architecture designer.)

The coprocessor core instruction set

Many IP neighbourhood operations can be described by a template (a static window with user defined weights) and one of a set of Image Algebra operators. Indeed, simple neighbourhood operations can be split into two stages:

• A 'local' operator applied between an image pixel and the corresponding window coefficient.
• A 'global' operator applied to the set of intermediate results of the local operation, to reduce this set to a single result pixel.

The set of local operators contains 'Add' ('+') and 'Multiply' ('*'), whereas the set of global operators contains 'Accumulate' ('∑'), 'Maximum' ('Max') and 'Minimum' ('Min'). With these local and global operators, neighbourhood operations such as convolution (local '*', global '∑'), dilation (local '+', global 'Max') and erosion (local '+', global 'Min') can be built.

For instance, a simple Laplace operation would be performed by doing a convolution (i.e. local operation = '*' and global operation = '∑') with a 3 × 3 Laplacian template. The programmer interface to this instruction set is via a C++ class.
First, the programmer creates the required instruction object (and its FPGA configuration), and subsequently applies it to an actual image. Creating an instruction object is generally done in two phases: first build an object describing the operation, and then generate the configuration, in a file. For neighbourhood operations, these are carried out by two C++ object constructors:

image_operator (template and operator details)
image_instruction (operator object, filename)

For instructions with a single template operator, these can be conveniently combined in a single constructor:

Neighbourhood_instruction (template, operators, filename)

The details required when building a new image operator object include:

• The dimensions of the image (e.g. 256 × 256).
• The pixel size (e.g. 16 bits).
• The size of the window (e.g. 3 × 3).
• The weights of the neighbourhood window.
• The target position within the window, for aligning it with the image pixels (e.g. 1,1).
• The 'local' and 'global' operations.

Later, to apply an instruction to an actual image, the apply method of the instruction object is used:

Result = instruction_object.apply (input image)

This will reconfigure the FPGA (if necessary), download the input pixel data, and store the result pixels in the RAM of the IPC as they are generated. As an example of how a programmer would create and perform a 3 by 3 Laplace operation on a 256 by 256 image with 16 bit pixels, see the sketch following this section.

2.1 Extending the Model for Compound Operations

In practical image processing applications, many algorithms comprise more than a single operation. Such compound operations can be broken into a number of primitive core instructions.

Instruction pipelining: A number of basic image operations can be put together in series. A typical example of two neighbourhood operations in series is the 'Open' operation. To do an 'Open' operation, an 'Erode' neighbourhood operation is first performed, and the resulting image is fed into a 'Dilate' neighbourhood operation, as shown in Figure 1. This operation is expressed in our high level environment as in the sketch below.

Figure 1. The 'Open' compound operation

Task parallelism: A number of basic image operations can be put together in parallel. For example, the Sobel edge detection algorithm can be performed (approximately) by adding the absolute results of two separate convolutions. Assuming that the FPGA has enough computing resources available, the best solution is to implement the operations in parallel using separate regions of the FPGA chip.

Figure 2. The Sobel compound operation

To define and use a Sobel edge detection instruction with our high level instruction set, the user defines two neighbourhood operators (horizontal and vertical Sobel), and builds the image instruction by summing the absolute results of the two neighbourhood operations (again, see the sketch below). The generation phase will automatically insert the appropriate delays to synchronise the two parallel operations.
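The paper's original code listings for the Laplace, 'Open' and Sobel examples did not survive extraction. The following C++ sketch reconstructs them from the constructor descriptions above; the header name, enumeration constants, helper functions, template weights and exact signatures are all assumptions, not the paper's actual API.

```cpp
// Hypothetical reconstruction of the lost listings. Constructor names follow
// the text; "ipc.h", the enum values, load_image, and the Sobel composition
// syntax are assumed for illustration.
#include "ipc.h"   // assumed: declares Image, image_operator,
                   // image_instruction, Neighbourhood_instruction, load_image

int main() {
    Image input = load_image("frame.raw", 256, 256, 16);   // assumed helper

    // 3x3 Laplace: a convolution, i.e. local '*' and global 'sum'.
    int laplace[3][3] = { {  0, -1,  0 },
                          { -1,  4, -1 },
                          {  0, -1,  0 } };                // assumed weights
    image_operator lap_op(256, 256,     // image dimensions
                          16,           // pixel size in bits
                          3, 3,         // window size
                          1, 1,         // target position within the window
                          laplace, LOCAL_MUL, GLOBAL_SUM);
    image_instruction laplace_ins(lap_op, "laplace.cfg");
    Image edges = laplace_ins.apply(input);

    // 'Open' as two pipelined instructions: Erode, then Dilate.
    int se[3][3] = { {0,0,0}, {0,0,0}, {0,0,0} };          // flat structuring element
    Neighbourhood_instruction erode (se, LOCAL_ADD, GLOBAL_MIN, "erode.cfg");
    Neighbourhood_instruction dilate(se, LOCAL_ADD, GLOBAL_MAX, "dilate.cfg");
    Image opened = dilate.apply(erode.apply(input));

    // Sobel: two task-parallel convolutions whose absolute results are summed.
    int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    int gy[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };
    image_operator hor(256, 256, 16, 3, 3, 1, 1, gx, LOCAL_MUL, GLOBAL_SUM);
    image_operator ver(256, 256, 16, 3, 3, 1, 1, gy, LOCAL_MUL, GLOBAL_SUM);
    image_instruction sobel(abs(hor) + abs(ver), "sobel.cfg");  // assumed composition syntax
    Image contours = sobel.apply(input);
    return 0;
}
```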
3. ARCHITECTURES FROM OPERATIONS

When a new Image_instruction object (e.g. Neighbourhood_instruction) is created (by new), the corresponding FPGA configuration is generated dynamically. In this section, we present the structure of the FPGA configurations necessary to implement the high level instruction set for the neighbourhood operations described above. As a key example, the structure of a general 2-D convolver is presented; other neighbourhood operations are essentially variations of this, with different local and global operator sub-blocks.

A general 2-D convolver

As mentioned earlier, any neighbourhood image operation involves passing a 2-D window over an image, and carrying out a calculation at each window position. To allow each pixel to be supplied only once to the FPGA, internal line delays are required. These synchronise the supply of input values to the processing elements, ensuring that all the pixel values involved in a particular neighbourhood operation are processed at the same instant [15, 16]. Assuming a vertical scan of the image, Figure 3 shows the architecture of a generic 2-D convolver with a P by Q template. Each Processing Element (PE) performs the necessary Multiply/Accumulate operation.

Figure 3. Architecture of a generic 2-D, P by Q convolution operation

Architecture of a Processing Element

Before deriving the architecture of a Processing Element, we first have to decide which type of arithmetic to use: bit parallel or bit serial. While parallel designs process all data bits simultaneously, bit serial ones process input data one bit at a time. The hardware required for a parallel implementation is typically 'n' times that of the equivalent serial implementation (for an n-bit word). On the other hand, the bit serial approach requires 'n' clock cycles to process an n-bit word, while the equivalent parallel one needs only one clock cycle. However, bit serial architectures operate at a higher clock frequency due to their smaller combinatorial delays. Also, the resulting layout of a serial implementation is more regular than that of a parallel one, because of the reduced number of interconnections needed between PEs (i.e. less routing stress). This regularity means that FPGA architectures generated from a high level specification can have more predictable layout and performance. Moreover, a serial architecture is not tied to a particular processing word length: it is relatively straightforward to move from one word length to another with very little extra hardware (if any). For these reasons, we decided to implement the IPC hardware architectures using serial arithmetic.

Note, secondly, that the need to pipeline the bit serial Maximum and Minimum operations common in Image Algebra suggests we should process data Most Significant Bit First (MSBF). Following on from this choice, and because of the problems of doing addition MSBF in 2's complement, there are certain advantages in using an alternative number representation to 2's complement. For the purposes of the work described in this paper, we have chosen to use a redundant number representation in the form of a radix-2 Signed Digit Number Representation (SDNR) [17]. Because of the inherent carry-free property of SDNR add/subtract operations, the corresponding architectures can be clocked at high speed. There are of course several alternative representations which could have been chosen, each with its own advantages. However, the work presented in this paper is based on the following design choices:

• Bit serial arithmetic.
• Most Significant Bit First processing.
• Radix-2 Signed Digit Number Representation (SDNR) rather than 2's complement.

Because image data may occasionally have to be processed on the host processor, the basic storage format for image data is still, however, 2's complement. Therefore, processing elements first convert their incoming image data to SDNR.
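To make the representation concrete, here is a small, illustrative C++ routine that recodes a two's complement integer into radix-2 signed digits using one standard recoding (the non-adjacent form). This is only a software illustration of SDNR; the paper does not specify the conversion circuit the PEs actually use.

```cpp
// Illustrative only: radix-2 signed-digit (NAF) recoding of a two's
// complement integer. Digits are in {-1, 0, +1}, produced least significant
// digit first; reverse the vector for an MSBF serial stream.
#include <cstdio>
#include <vector>

std::vector<int> to_sdnr(long long n) {
    std::vector<int> digits;
    while (n != 0) {
        int d = 0;
        if (n & 1) {                  // odd: emit +1 or -1 so that n - d is even
            d = 2 - (int)(n & 3);     // n mod 4 == 1 -> +1, n mod 4 == 3 -> -1
            n -= d;
        }
        digits.push_back(d);
        n /= 2;                       // exact: n is even at this point
    }
    return digits;
}

int main() {
    // 7 recodes as 8 - 1, i.e. digits (1, 0, 0, -1) most significant first.
    for (int d : to_sdnr(7)) std::printf("%d ", d);   // prints -1 0 0 1 (LSD first)
    std::printf("\n");
    return 0;
}
```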
Converting at the processing elements, rather than storing SDNR data directly, also reduces the chip area required for the line buffers (in which data is held in 2's complement). A final unit to convert an SDNR result into 2's complement is needed before any results can be returned to the host system. With these considerations, a more detailed design of a general Processing Element (in terms of a local and a global operation) is given in Figure 4.

Figure 4. Architecture of a standard Processing Element

Design of the basic building blocks

In what follows, we present the physical implementation of the five basic building blocks stated in section 2 (the adder, multiplier, accumulator, and maximum and minimum units). These basic components were carefully designed in order to fit together with as little wastage as possible.

The 'multiplier' unit

The multiplier unit is based on a hybrid serial-parallel multiplier outlined in [18]. It multiplies a serial SDNR input by a two's complement parallel coefficient B = b_N b_{N-1} ... b_1, as shown in Figure 5. The multiplier has a modular, scaleable design, and comprises four distinct basic building components [19]: Type A, Type B, Type C and Type D. An N-bit coefficient multiplier is constructed as:

Type A → Type B → (N-3) × Type C → Type D

The coefficient word length may be varied by varying the number of Type C units. On the Xilinx 4000 FPGA, Type A, B and C units each occupy one CLB, and a Type D unit occupies 2 CLBs. Thus an N-bit coefficient multiplier is 1 CLB wide and N+1 CLBs high. The online delay of the multiplier is 3.

Figure 5. Design of an N-bit hybrid serial-parallel multiplier

The 'accumulation' global operation unit

The accumulation unit is the global operation used in the case of a convolution. It adds two SDNR operands serially and outputs the result in SDNR format, as shown in Figure 6. The accumulation unit is based on a serial online adder presented in [20]. It occupies 3 CLBs laid out vertically, in order to fit with the multiplier unit in a convolver design.

Figure 6. Block diagram and floorplan of an accumulation unit

The 'addition' local operation unit

This unit is used in additive/maximum and additive/minimum operations. It takes a single SDNR input value and adds it to the corresponding window template coefficient. The coefficient is stored in 2's complement format in a RAM addressed by a counter whose period is the pixel word length. To keep the design compact, we have implemented the counter using Linear Feedback Shift Registers (LFSRs). The coefficient bits are preloaded into the appropriate RAM cells according to the counter output sequence. The input SDNR operand is added to the coefficient bit serially, MSBF. The adder itself occupies 3 CLBs; the whole addition unit occupies 9 CLBs laid out in a 3 × 3 array. The online delay of this unit is 3 clock cycles.

Figure 7. Block diagram and floorplan of an 'addition' local operation unit

The Maximum/Minimum unit

The Maximum unit selects the maximum of two SDNR inputs presented to it serially, most significant bit first. Figure 8 shows the transition diagram of the finite state machine performing the maximum 'O' of two SDNRs 'X' and 'Y'. The physical implementation of this machine occupies an area of 13 CLBs, laid out 3 CLBs wide by 5 high; this allows the unit to fit with the addition local operation unit in an additive/maximum neighbourhood operation. The online delay of this unit is 3, compatible with the online delay of the accumulation global operation.
Figure 8. State diagram and floorplan of a Maximum unit

The minimum of two SDNRs can be determined in a similar manner, using the identity Min(X, Y) = -Max(-X, -Y).

5. THE COMPLETE ENVIRONMENT

The complete system is shown in Figure 9. For internal working purposes, we have developed our own intermediate high level hardware description notation called HIDE4k [21]. This is Prolog-based [22], and enables highly scaleable and parameterised component descriptions to be written.

At the front end, the user programs in a high level software environment (typically C++) or can interact with a dialog-based graphical interface, specifying the IP operation to be carried out on the FPGA in terms of local and global operators, window template coefficients etc. The user can also specify:

• The desired operating speed of the circuit.
• The input pixel bit-length.
• Whether he or she wants to use our floorplanner to place the circuit, or leave this task to the FPGA vendor's placement and routing tools.

The system provides the option of two output circuit description formats: an EDIF netlist (the default), and VHDL at RTL level.

Behind the scenes, once the user has given all the parameters needed for the specific IP operation, the intermediate HIDE code is generated. Depending on the choice of output netlist format, the HIDE code then goes through either the EDIF generator tool, to generate an EDIF netlist, or the VHDL generator tool, to generate a VHDL netlist. In the latter case, the resulting VHDL netlist needs to be synthesised into an EDIF netlist by a VHDL synthesiser. Finally, the resulting EDIF netlist goes through the FPGA vendor's specific tools to generate the configuration bitstream file. The whole process is invisible to the user, thus making the FPGA completely hidden from the user's point of view. Note that each resulting configuration is stored in a library, so it will not be regenerated if exactly the same operation happens to be defined again.

Complete and efficient configurations have been produced from our high level instruction set for all the Image Algebra operations and for a variety of compound operations including 'Sobel', 'Open' and 'Close'. They have been successfully simulated using the Xilinx Foundation Project Manager CAD tools.

Figure 10 presents the resulting layout for a Sobel edge detection operation on an XC4036EX-2, for a 256 × 256 input image with 8-bit pixels. An EDIF configuration file, with all the placement information, was generated automatically by our tools from the high level description in section 2.1. Note that the generator optimises the design, and uses just a single shared line buffer area for the two (task parallel) neighbourhood operations. The resulting EDIF file is fed to the Xilinx PAR tools to generate the FPGA configuration bitstream. The circuit occupies 475 CLBs. Timing simulation shows that the circuit can run at a speed of 75 MHz, which leads to a theoretical frame rate of 143 frames per second.

Figure 10. Physical configuration of the Sobel operation on XC4036EX-2

Figure 11 presents the resulting layout for an 'Open' operation on an XC4036EX-2, for a 256 × 256 input image with 8-bit pixels. As before, an EDIF configuration file with all the placement information was generated automatically by our tools from the corresponding high level description in section 2.1. The resulting EDIF file is then fed to the Xilinx PAR tools to generate the FPGA configuration bitstream. The circuit occupies 962 CLBs.
Timing simulation shows that the circuit can run at a speed of 75 MHz, which leads to a theoretical frame rate of 133 frames per second.

Figure 11. Physical configuration of the Open operation on XC4036EX-2

6. CONCLUSIONS

In this paper, we have presented the design of an FPGA-based Image Processing Coprocessor (IPC) along with its high level programming environment. The coprocessor instruction set is based on a core level containing the operations of Image Algebra. Architectures for user-defined compound operations can be added to the system. Possibly the most significant aspect of this work is that it opens the way for image processing application developers to exploit the high performance capability of a direct hardware solution, while programming in an application-oriented model. The figures presented for actual architectures show that real time video processing rates can be achieved when starting from a high level design.

The work presented in this paper is based specifically on radix-2 SDNR, bit serial, MSBF processing. In other situations, alternative number representations may be more appropriate. Sets of alternative representations are being added to the environment, including a full bit parallel implementation of the IPC [23]. This will give the user a choice when trying to satisfy competing constraints.

Although our basic approach is not tied to a particular FPGA, we have implemented our system on the XC4000 FPGA series. However, the special facilities provided by the new Xilinx VIRTEX family (e.g. large on-chip synchronous memory, built-in Delay Locked Loops etc.) make it a very suitable target architecture for this type of application. Upgrading our system to operate on this new series of FPGA chips is underway.

REFERENCES

[1] Webber, H C (ed.), 'Image processing and transputers', IOS Press, 1992.
[2] Rajan, K, Sangunni, K S and Ramakrishna, J, 'Dual-DSP systems for signal and image processing', Microprocessors & Microsystems, Vol 17, No 9, pp 556-560, 1993.
[3] Akiyama, T, Aono, H, Aoki, K, et al, 'MPEG2 video codec using image compression DSP', IEEE Transactions on Consumer Electronics, Vol 40, No 3, pp 466-472, 1994.
[4] Christopher, L A, Mayweather, W T and Perlman, S S, 'VLSI median filter for impulse noise elimination in composite or component TV signals', IEEE Transactions on Consumer Electronics, Vol 34, No 1, pp 263-267, 1988.
[5] Rose, J and Sangiovanni-Vincentelli, A, 'Architecture of Field Programmable Gate Arrays', Proceedings of the IEEE, Vol 81, No 7, pp 1013-1029, 1993.
[6] /products/virtex/ss_vir.htm
[7] Arnold, J M, Buell, D A and Davis, E G, 'Splash-2', Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM Press, pp 316-324, June 1992.
[8] Gigaops Ltd., The G-800 System, 2374 Eunice St.,
Berkeley, CA 94708; Chan, S C, Ngai, H O and Ho, K L, 'A programmable image processing system using FPGAs', International Journal of Electronics, Vol 75, No 4, pp 725-730, 1993.
[9] /
[10] /news/pubs/snug/snug99_papers/Jaffer_Final.pdf
[11] FPL99.
[12] Hutchings.
[13] Xilinx 4000.
[14] Ritter, G X, Wilson, J N and Davidson, J L, 'Image Algebra: an overview', Computer Vision, Graphics and Image Processing, No 49, pp 297-331, 1990.
[15] Shoup, R G, 'Parameterised Convolution Filtering in an FPGA', More FPGAs, W Moore and W Luk (editors), Abingdon, EE&CS Books, pp 274, 1994.
[16] Kamp, W, Kunemund, H, Soldner and Hofer, H, 'Programmable 2D linear filter for video applications', IEEE Journal of Solid-State Circuits, pp 735-740, 1990.
[17] Avizienis, A, 'Signed-Digit Number Representations for Fast Parallel Arithmetic', IRE Transactions on Electronic Computers, Vol 10, pp 389-400, 1961.
[18] Moran, J, Rios, I and Meneses, J, 'Signed Digit Arithmetic on FPGAs', More FPGAs, W Moore and W Luk (editors), Abingdon, EE&CS Books, pp 250, 1994.
[19] Donachy, P, 'Design and implementation of a high level image processing machine using reconfigurable hardware', PhD Thesis, Department of Computer Science, The Queen's University of Belfast, 1996.
[20] Duprat, J, Herreros, Y and Muller, J, 'Some results about on-line computation of functions', 9th Symposium on Computer Arithmetic, Santa Monica, September 1989.
[21] Crookes, D, Alotaibi, K, Bouridane, A, Donachy, P and Benkrid, A, 'An Environment for Generating FPGA Architectures for Image Algebra-based Algorithms', ICIP98, Vol 3, pp 990-994, 1998.
[22] Clocksin, W F and Mellish, C S, 'Programming in Prolog', Springer-Verlag, 1994.
Using Oracle Coherence to Handle Large-Scale Insurance Rule Management

An Oracle White Paper
July 2011

Oracle Insurance Policy Administration for Life and Annuity: Leveraging Oracle Coherence for Distributed Execution of High Volume Transactions

Disclaimer

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Contents: Introduction; Oracle Coherence; Clustered Caching; Oracle Insurance Policy Administration for Life and Annuity; Rule-based Configuration; OIPA Cycle Processing; Challenges of Cycle Processing; Powering Cycle with Oracle Coherence; Distributed Computing with the Coherence Processing Pattern; Scalability and High Availability in the Processing Pattern; OIPA Cycle Execution using the Coherence Processing Pattern; Extending the Processing Pattern to Support Cycle; Using the Processing Pattern to Submit Tasks; Executing a Cycle Level using ResumableTask; Benchmark Results for Cycle Processing; Conclusion; About Oracle Insurance.

Introduction

This technical white paper discusses how Oracle Insurance Policy Administration for Life and Annuity (OIPA) leverages the powerful capabilities of Oracle Coherence in-memory data grid middleware to provide predictable scalability, performance for high-volume batch cycle processing, and seamless fail-over support. It includes a technical discussion of how the Coherence Processing Pattern provides the grid computing infrastructure used by the OIPA cycle subsystem. The first two sections of this paper provide an overview of the Oracle Coherence and Oracle Insurance Policy Administration products, respectively. The remaining sections describe how OIPA implements the Oracle Incubator Processing Pattern to enable distributed batch processing of work across a grid of interconnected, Oracle Coherence powered compute nodes.

Oracle Coherence

Oracle Coherence is an in-memory data grid solution that enables organizations to predictably scale mission-critical applications by providing fast access to frequently used data. Data grid software is middleware that reliably manages data objects in memory across multiple servers. By automatically and dynamically partitioning data, Oracle Coherence ensures continuous data availability and transactional integrity even in the event of a server failure. It provides organizations with a robust, scaled-out data abstraction layer that brokers the supply and demand of data between applications and data sources. Oracle Coherence enables the following benefits within today's enterprise applications:

• Performance – Oracle Coherence solves latency problems and drives dramatic increases in performance by moving data closer to applications for efficient access. In-memory performance alleviates bottlenecks and reduces data contention, improving application responsiveness.

• Reliability – Oracle Coherence is built on a fault-tolerant mesh that provides data reliability and accuracy.
Organizations can meet data availability demands in mission-critical environments with Oracle Coherence support for data tolerance and continuous operation. The reliability of the data grid minimizes the need for applications to compensate for server and network failures, streamlining the development and deployment process.

• Scalability – Oracle Coherence enables applications to scale linearly and dynamically for predictable cost and improved resource utilization. For many applications, it offers a straightforward approach to increasing the effective capacity of shared data sources. Oracle Coherence handles continually growing application loads without risking data loss or interruption of service.

Figure 1. Oracle Coherence Overview

The information stored within the nodes of a data grid is evenly distributed among the cluster members. When applications access the data grid, Oracle Coherence ensures that the data is never more than one network hop away, and thus allows near in-memory latency for all data stored within the grid. This approach enables very large performance improvements in data access. Furthermore, since the data is distributed among all nodes of the data grid and each node does not have to hold all cached data, a data grid can store very large amounts of data (on the order of hundreds of gigabytes) without a degradation in performance. In fact, the addition of nodes to a data grid leads to a linear increase in the data capacity available to the clients of the grid without any degradation in performance. This latter property is what is known as linear scalability. Finally, a data grid has the ability to store backup copies of its data objects in separate nodes, and thus also provides for high availability.

Clustered Caching

One of the main capabilities of Oracle Coherence, and one that is used heavily by the features discussed in the rest of this document, is clustered caching. Clustered caching refers to the ability to maintain data in the application tier in such a way that the application can fulfill some portion of its data access requirements from the cache. This mitigates the application's load on the database without violating the application's requirements for data correctness if that data is being changed. Oracle Coherence provides two main types of clustered caching that are used by the Oracle Insurance Policy Administration solution: replicated caching and partitioned caching.

Replicated Caching

The best-known form of coherent clustered caching is the fully replicated cache. Replication is the ability to achieve guaranteed coherency of data across multiple nodes of a cluster by maintaining copies of the data within each node and synchronizing changes. If each server maintains a local copy of cached data, then the application logic running on each server can access local data without the need to communicate with any other servers. As a result, data access has no measurable latency. There are a few limitations to replicated caching. First, a change to the cache implies the need to communicate with the entire cluster. Such communication—often accomplished by use of group network protocols—cannot by its nature scale linearly.
Second, the cache is severely limited in its in-memory size, because each cluster node maintains the entire cache within its process space.

Partitioned Caching

To solve the limitations of the replicated cache model without sacrificing either the high-availability (HA) benefits of redundancy or the coherency guarantees provided by the clustered cache, Oracle invented the concept of a partitioned cache. With the data partitioned, the cache capacity grows linearly with the size of the cluster, as does the processing capacity available for managing the cache. Further, in a shared-nothing architecture, each piece of data in the cache has exactly one owner within the cluster who is responsible for managing that data. All network communication can be point-to-point in a partitioned model, allowing the cache throughput to scale linearly on a switched network. In order to recover from a possible node failure, a partitioned cache needs to store, at a minimum, one redundant copy of the cache data on a different physical node. The partitioned caching capability of Oracle Coherence provides the foundation for distributed processing in OIPA.
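As a concrete illustration of the clustered caching API that the rest of this paper builds on, here is a minimal Java sketch. The cache name and key/value strings are hypothetical; the CacheFactory and NamedCache calls are standard Coherence API, and the partitioning behaviour depends on the cache scheme configured for that name.

```java
// Minimal sketch: storing and reading an entry through a Coherence named
// cache. With a distributed (partitioned) scheme configured for this cache
// name, entries are spread across the cluster with one backup copy each.
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class PolicyCacheDemo {
    public static void main(String[] args) {
        NamedCache policies = CacheFactory.getCache("demo-policies"); // hypothetical name
        policies.put("policy-42", "pending");          // stored on the owning node, backed up on another
        System.out.println(policies.get("policy-42")); // at most one network hop away
        CacheFactory.shutdown();                       // leave the cluster cleanly
    }
}
```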
Oracle Insurance Policy Administration for Life and Annuity

Oracle Insurance Policy Administration is a highly flexible, rules-based policy administration solution that provides full record keeping and support for all policy lifecycle transactions (e.g. policy issue, billing, collections, policy processing, and claims). With OIPA, insurers rapidly adapt to changing business needs and regulatory requirements while supporting straight-through processing throughout the policy lifecycle.

OIPA is used by leading insurers to accelerate product development and speed time to market for differentiated life, health, and annuity products. The system enables insurers to provide real-time policy servicing of customers and sales channels throughout the policy lifecycle for increased retention and loyalty. It also helps insurers reduce risk and support compliance while better managing the business to optimize performance through the use of a single system for life and annuities.

Insurers require the ability to quickly bring to market innovative products that stand out from the competition, capture more market share, and ultimately maximize profitability. OIPA is an industry-leading, fully configurable system that allows insurance companies to outpace competitors by getting to market faster. In addition, it enables insurers to drive transactions using business rules. It also provides real-time access to policy data, enhancing self-service capabilities for customers and distribution channels.

Rule-based Configuration

Figure 2. OIPA enables insurers to accelerate the development and launch of life, health, and annuity products and update existing products with new features or riders through flexible, rules-based configuration.

The rules-based architecture of OIPA enables rapid new product development, freeing insurers from the limitations of legacy policy administration systems. The rules-driven capabilities of OIPA are unmatched in the industry; almost all changes to the system—including products, fields, screens, languages, and currencies—can be made without ever touching the core code or recompiling the application. The system does the heavy lifting through pre-configured templates, rules reusability and product cloning, and a visual Rules Palette with easy-to-use, drag-and-drop functionality.

One of the most important elements configured by business analysts within an OIPA system is the concept of a transaction. Transactions define how the system updates, modifies, or deletes data in OIPA. As an example, a premium transaction takes a payment submitted by an insured and applies that money to a policy. A premium transaction performs a number of actions, including validating inputs, removing value from suspense, applying fees and charges, issuing commission, applying the money to one or more funds on the policy, and adjusting the cash value of the policy. OIPA allows insurers to configure the sequence of steps, business rules, and calculations that are performed, as well as what data is updated as a result of the transaction processing.

An instance of a transaction is known as an activity. The execution of an activity involves the execution of all processing associated with its underlying transaction within a single atomic unit of work. The execution of an activity also involves the auditing of all of the changes so that they can easily be reversed or undone.

In most cases, activities that are entered into the system are scheduled for execution at a later time. In a typical OIPA system, Customer Service Representatives (CSRs) may create hundreds or even thousands of pending activities during the day. These pending activities are subsequently executed during a scheduled batch process.

OIPA Cycle Processing

Cycle is the component of an OIPA system that is responsible for the scheduled execution of pending activities. The activities executed by Cycle are classified into different levels, where each Cycle Level determines the type of work item to be processed and the sequence of execution:

• Pre-Company – process all pending company level activities that are configured to be executed before any policy activities are processed.
• Pre-Plan – process all pending plan level activities that are configured to be executed before any policy activities are processed.
• Pre-Client – process all pending client level activities that are configured to be executed before any policy activities are processed.
• Policy – process all pending policy activities in the insurance database.
• Post-Client – process all pending client level activities that are configured to be executed after all policy activities are processed.
• Post-Plan – process all pending plan level activities that are configured to be executed after all policy activities are processed.
• Post-Company – process all company level activities that are configured to be executed after all policy activities are processed.

The association of levels with activities allows business analysts to define the sequence within which policy level activities execute. An example is the importing of the latest fund unit values into the system so that policy level activities that affect cash value can be executed properly. Fund unit values would have to be imported before the policy level cycle can execute, because many policy level activities require the latest fund unit values in order to process successfully. Other examples are the execution of calculations for downstream reporting or the exporting of data into a data warehouse.
This should happen after policies are processed, as policy level processing results in the modification of a significant amount of operational data, which should be current before it is imported into a data warehouse.

Challenges of Cycle Processing

The requirements of the OIPA Cycle system lead to inherent time and volume challenges, as described in this section. The next section describes how Oracle Coherence's capabilities enable Cycle to meet these challenges.

Time Constraints

OIPA Cycle processing places a large demand on database and system resources and can slow down the online OIPA application. Cycle is typically scheduled to execute outside business hours in order to minimize the impact on OIPA application users. The increasing globalization of insurance companies expands the OIPA application's availability window and therefore shortens the time interval within which Cycle has to complete processing.

High Volume Processing

Cycle must be able to meet varying processing volume demands within specific time constraints. There are peak periods when the volume of pending activities can be significantly greater than normal. Example peak periods include end-of-month, end-of-quarter, and end-of-year processing, when many, if not all, active policies in the system have pending activities to process. During these peak periods, when the pending activities may number in the hundreds of thousands or even millions, Cycle must still be able to meet its time constraints.

Powering Cycle with Oracle Coherence

A consideration of the challenges presented by Cycle requirements makes it clear that a distributed computing infrastructure allowing the parallel execution of process-intensive tasks across a number of processing nodes is needed. Furthermore, the infrastructure needs to adhere to the following additional requirements:

• Fault Tolerance – The pending tasks from a failed Cycle processing node should fail over to another node and be scheduled for processing.
• Scalability – The sizing requirements for the grid differ by customer. While some customers may be relatively small in terms of processing requirements, others need the ability to execute many millions of pending insurance activities in parallel.
• Simplicity – The need to achieve distributed computing capability without the system being overly cumbersome and difficult to understand and set up.
• Extensibility – The ability to execute custom tasks.
• Proven – The underlying architecture must be proven at many existing production installations.

After careful consideration of these requirements, the OIPA team concluded that the best approach to addressing Cycle's demanding needs would be through the use of Oracle Coherence.

Figure 3. The use of Oracle Coherence allows OIPA to achieve the challenging performance and scalability needs driven by insurers' demand for high-volume, nightly batch processing.

Oracle Coherence provides a repository of projects, known as the Coherence Incubator, that represent best practices for common solutions. The Processing Pattern is a Coherence Incubator project that provides a generalized approach for the submission and asynchronous processing of work within a Coherence cluster. This makes the Processing Pattern a good fit as a foundation for a distributed computing platform.
This fit, combined with OIPA's existing and successful use of Oracle Coherence as a distributed cache, provided the basis for choosing the Processing Pattern as the foundation of the implementation of OIPA's distributed computing needs.

Distributed Computing with the Coherence Processing Pattern

The Processing Pattern leverages Oracle Coherence as the connection, communication, and management mechanism to enable the distribution and execution of work across a network of computers, also known as a cluster. It allows clients to submit work to the cluster, ensures that the work gets processed, and reports the results of the execution back to the client. The Processing Pattern provides a framework that allows implementers to define their own types of work to be processed and to customize how work is dispatched and processed.

The Processing Pattern provides a simple interface, so that clients do not have to be concerned with how work is distributed and processed. Clients of the Processing Pattern see the cluster as a single process, and are not aware of the number and types of computers that are being used.

Figure 4. The Processing Pattern distributes work using a submissions cache that is partitioned across the cluster.

The Processing Pattern is built on top of the clustered caching capability of Oracle Coherence. When work is submitted using the Processing Pattern, it is stored in a partitioned cache, where Oracle Coherence assigns the storage for the submission to one of the nodes connected to the cluster. When the submission is stored on the node, Oracle Coherence also makes a backup copy of the submission and stores the copy on a different node in the cluster, to provide failover capability in the event that one of the nodes fails. When a node fails, the submissions in its backup storage are automatically distributed among the remaining operational nodes in the cluster. This automatic failover feature of Oracle Coherence is used by the Processing Pattern to support re-dispatching of queued tasks in the event of node failure.

Distributing Tasks in the Processing Pattern

There are different methods of distributing work using the Processing Pattern. The method used by OIPA Cycle includes two phases: dispatching and processing. The sequence of steps for this two-phase submission process is as follows:

Figure 5. The Processing Pattern distributes work in two phases: dispatching and processing.

• When a task is submitted to the Processing Pattern, it is bundled into a submission and stored in the submission partitioned cache. When the submission arrives at a node, an event is fired notifying registered listeners that a new submission has arrived. The Processing Pattern has a listener (the DispatchController) that is registered to receive new submission events.

• The DispatchController dispatches the new submission through a set of registered Dispatchers. The registered dispatchers form a chain, each one capable of performing some processing on part of the incoming submission.

• The DefaultTaskDispatcher is the most important type of Dispatcher used by OIPA Cycle. It consults one or more TaskDispatchPolicy instances to determine which TaskProcessors are available to process the new submission. A TaskProcessor maps to a node in the cluster that is capable of processing the submission. One type of TaskDispatchPolicy is the attribute-matching policy, which maps a submission to a TaskProcessor based on attributes of the submission.
Another type is the round-robin policy, which guarantees uniform dispatching of submissions across all nodes in the cluster. OIPA Cycle uses both of these policies to route submissions to the appropriate TaskProcessors and perform round-robin load balancing of work.

• Once the DefaultTaskDispatcher has selected a TaskProcessor, it assigns the submission to the TaskProcessor's work queue using a Coherence partitioned cache.

• The new submission entry is received by a TaskProcessor using another registered listener on the TaskProcessor partitioned cache. The TaskProcessor can be a different node than the node that dispatched the submission. The TaskProcessor accepts the submission and schedules it for processing by unbundling the task associated with the submission and placing the task in a thread pool.

• At some point, the task is picked up by a thread from the thread pool and processed. The code that is executed is application specific, so the task can do anything that the application requires. As an example, OIPA Cycle has a type of task called the CycleTask, which is responsible for processing all pending activities on a single unit of work, such as a policy.

• Once the task is completed, the TaskProcessor stores the result of the task in a SubmissionResult partitioned cache. The result of processing is transmitted to the originating client through additional event listeners.

Handling Long Running Tasks with Resumable Tasks

A Resumable Task is a special type of task provided by the Processing Pattern that adds the abilities to report progress, yield, resume, and return a result. These capabilities make Resumable Tasks well suited for executing long running work, such as asynchronously executing a Web service or governing a process that takes a long time to complete, like an OIPA Cycle Level. Such a task can go through a sleep/wake cycle, checking on the status of processing until it is complete.

When a Resumable Task is executed, it can store the progress made by saving information in a checkpoint. Thus, if the task needs to be restarted, it can start from the latest checkpoint rather than from the beginning. For clients that want to be informed of the progress of a task, the progress is reported back by listening for events on that task. A Resumable Task can also return a special object called a Yield, which indicates that the task is not yet finished and needs to be re-scheduled for execution. The Yield object is used to store the intermediate state of the task, as well as to specify a delay before running the task again.

Figure 6. The states of a Resumable Task.

Scalability and High Availability in the Processing Pattern

The Processing Pattern relies on the scalability and high availability capabilities of Oracle Coherence to scale and to provide resilience against failure for the tasks it manages. If a node crashes during execution of a task, that task is automatically restarted on another node in the cluster. Similarly, capacity can be added to the cluster without the need to stop or reconfigure existing nodes. As soon as a node joins the cluster, the Processing Pattern takes advantage of the additional capacity.

OIPA Cycle Execution using the Coherence Processing Pattern
Figure 7. OIPA Cycle executes batch processing using Oracle Coherence and the Processing Pattern.

OIPA Cycle is built on top of the Oracle Coherence Processing Pattern to enable the distributed and parallel processing of activities during the batch processing of pending activities in the OIPA system. OIPA Cycle implements the Processing Pattern framework and provides custom tasks that execute the Cycle process. There are two components to Cycle: the Cycle Client and the Cycle Agent. The Cycle Client is a simple console application that requests that a Cycle Level be completed, by submitting a task to the grid for processing. The Cycle Agent is a processing node that is connected to a cluster of Cycle Agents and provides the distributed computing capability for OIPA and Cycle.

Cycle acts as a workload manager; the actual work of processing activities is done by the OIPA run-time. Each Cycle Agent is bundled with the same OIPA libraries that the OIPA Web Application uses, and simply delegates to these libraries to process the pending activities in the OIPA system. In this sense, Cycle provides the plumbing that connects the distributed computing capability of the Coherence Processing Pattern with the activity processing capability of OIPA.

Cycle uses its own queue, which lives in the underlying OIPA database, to direct the execution of a Cycle Level. The Cycle table in the database holds a record for every task that is to be processed during the execution of a single Cycle Level. For example, when running the Policy Cycle Level, each record in the table contains an identifier for a policy that has pending activities to be processed. Cycle maintains its own queue instead of using the Processing Pattern's internal queue for two reasons. First, the number of tasks that need to be executed may number in the millions; submitting millions of tasks at one time would overwhelm the grid. Second, Cycle needs to audit each task individually to support downstream reporting on the nightly batch process. Keeping the queue in the database allows Cycle to distribute manageable chunks of work to the grid for processing and to persist the results of each task.

Extending the Processing Pattern to Support Cycle

The Processing Pattern is a purpose-built platform that enables grid computing. It leverages Oracle Coherence as the clustering technology and builds on top of it the dispatching, assignment, scheduling, execution, and management of tasks in the cluster. One of the benefits of using the Processing Pattern is the simplicity of the framework that must be extended in order to leverage the grid computing infrastructure it provides. Figure 8 illustrates two custom task implementations.

Figure 8. Cycle classes that extend the Processing Pattern.

The only requirement for providing custom task processing for Cycle is the implementation of the ResumableTask interface. The ResumableTask interface has a single run method that is passed a TaskEnvironment and returns an Object. The TaskEnvironment object provides access to intermediate state and gives the custom task the ability to checkpoint information and report progress. The return value gives the custom task the ability to return a Yield object, which moves the task into a sleep state and schedules it for future execution.

The CycleProcessTask class is the Cycle task that governs the execution of an entire Cycle Level, such as the Policy Level Cycle or the Pre-Company Level Cycle.
The CycleProcessTask class is responsible for loading the Cycle queue in the database and replenishing the grid with work until the Cycle Level is complete.

The CycleTask class is responsible for processing all of the pending activities on a unit of work (for example, a single policy). When a CycleTask object completes processing, it updates a record in the Cycle table with the results.

Using the Processing Pattern to Submit Tasks

The Processing Pattern provides a simple service facade called the ProcessingSession that is used to submit tasks to the grid for processing. The ProcessingSession supports different ways of waiting for the results of a task, including blocking until the task is finished (synchronous) or receiving event notifications (asynchronous).
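To make the extension point concrete, here is a minimal Java sketch of a custom ResumableTask, following the interface behaviour described above (run receives a TaskEnvironment and may return a Yield). The package names, the TaskEnvironment method names, the Yield constructor and the getType method are assumptions based on the Coherence Incubator, not verbatim OIPA code.

```java
// Hypothetical custom task: processes work in chunks, checkpoints between
// chunks, and yields so the Processing Pattern re-schedules it later.
import java.io.Serializable;
import com.oracle.coherence.patterns.processing.task.ResumableTask;
import com.oracle.coherence.patterns.processing.task.TaskEnvironment;
import com.oracle.coherence.patterns.processing.task.Yield;

public class DemoCycleTask implements ResumableTask, Serializable {
    private static final int TOTAL_UNITS = 100;

    public Object run(TaskEnvironment env) {
        // Recover the checkpoint left by a previous run (null on first run).
        Integer saved = (Integer) env.loadCheckpoint();
        int processed = (saved == null) ? 0 : saved;

        processed += processOneChunk(processed);  // application-specific work
        env.reportProgress(processed);            // surfaces as client events

        if (processed < TOTAL_UNITS) {
            env.saveCheckpoint(processed);        // survives node failover
            return new Yield(processed, 5000);    // sleep ~5 s, then resume
        }
        return "complete";                        // final result for the client
    }

    public String getType() { return "DemoCycleTask"; }  // assumed dispatch key

    private int processOneChunk(int offset) {
        return 10;  // stand-in for real activity processing
    }
}
```

A client would then submit an instance of such a task through a ProcessingSession and either block on, or register a listener for, the submission's result.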
Grid Computing Theory and Its Applications

Grid Computing Theory and Its Applications

Hu Ke, School of Applied Mathematics, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054

Abstract: From a theoretical perspective, this paper describes the concept of the grid, the trend toward grid standardization, the OGSA architecture, and grid computing and its applications, and introduces the main grid application projects in China.
Keywords: grid; grid standards; grid computing

1. Overview

The grid emerged in Europe and the United States in the 1990s. It is a new generation of high-performance computing environment and information-service infrastructure that adopts open standards and enables dynamic, cross-regional resource sharing and collaborative work. As a new generation of technology for solving distributed, complex, heterogeneous problems, its core is the sharing and aggregation of all kinds of resources: large-scale, geographically dispersed high-performance computing resources; massive data and information resources; data acquisition, analysis and processing systems; application systems; service and decision-support systems; and organizations and personnel. The grid has been hailed as the "third wave of information technology" after the traditional Internet and the Web, and as the third milestone in the development of the Internet. The essence of this technological revolution is the upgrade from the WWW (World Wide Web) to the GGG (Great Global Grid). If the traditional Internet achieved the connectivity of computer hardware, and the Web achieved the connectivity of web pages, the grid attempts to achieve the full connectivity of all resources on the Internet. The grid has broad prospects in scientific research, business applications, and other fields.
2. The Concept of the Grid

2.1 The narrow "grid view"

Ian Foster, a senior scientist at the Argonne National Laboratory in the United States, the leader of the Globus project, and widely regarded as the "father of the grid", described the grid in the 1998 book "The Grid: Blueprint for a New Computing Infrastructure" as follows: "The grid is a set of emerging technologies built on the Internet, which integrates high-speed networks, high-performance computers, large databases, sensors and remote devices, providing scientists and ordinary people alike with more resources, functionality and interactivity. The Internet mainly provides communication functions such as e-mail and web browsing, whereas the grid's functionality is far richer, letting people use computing, storage and other resources transparently."

In 2000, in the paper "The Anatomy of the Grid", Ian Foster further described the grid as "coordinated resource sharing and problem solving among multiple, dynamically changing virtual organizations."

In July 2002, in the article "What is the Grid? A Three Point Checklist", Ian Foster required that a grid simultaneously satisfy three conditions: (1) coordinated use of resources in an environment without centralized control; (2) use of standard, open, general-purpose protocols and interfaces; and (3) delivery of non-trivial qualities of service.
Chapter 1 Introduction
This dissertation documents the creation of a test harness for OGSI services. The grid is currently an active area of both corporate and academic research. Many of the larger computing corporations are investing heavily in grid technology in the hope that it will usher in a new age of utility computing, in which processing power is made available by processing providers in a way analogous to the manner in which electricity is provided today (this is one of the reasons it has been called grid computing).
Original Paper on Google's Open-Source Laser SLAM Algorithm

As-built floor plans are useful for a variety of applications. Manual surveys to collect this data for building management tasks typically combine computer-aided design (CAD) with laser tape measures. These methods are slow and, by employing human preconceptions of buildings as collections of straight lines, do not always accurately describe the true nature of the space. Using SLAM, it is possible to swiftly and accurately survey buildings of sizes and complexities that would take orders of magnitude longer to survey manually.
1 All authors are at Google.
loop closure detection. Some methods focus on reducing the computational cost by matching on features extracted from the laser scans [4]. Other approaches for loop closure detection include histogram-based matching [6], feature detection in scan data, and machine learning [7].
FortiSwitch Data Center Switch Datasheet

FortiSwitch Data Center switches deliver a secure, simple, scalable Ethernet solution with outstanding throughput, resiliency and scalability. Virtualization and cloud computing have created dense, high-bandwidth Ethernet networking requirements. FortiSwitch Data Center switches meet these challenges by providing a high performance 10 GE, 40 GE or 100 GE capable switching platform, with a low Total Cost of Ownership. Ideal for Top of Rack server or firewall aggregation applications, as well as SD-Branch network core deployments, these switches are purpose-built to meet the needs of today's bandwidth intensive environments.

FortiSwitch Data Center Series

Standalone Mode
The FortiSwitch has a native GUI and CLI interface. All configuration and switch administration can be accomplished through either of these interfaces. Available ReSTful APIs offer additional configuration and management options.

FortiLink Mode
FortiLink is an innovative proprietary management protocol that allows our FortiGate Security Appliance to seamlessly manage any FortiSwitch. FortiLink enables the FortiSwitch to become a logical extension of the FortiGate, integrating it directly into the Fortinet Security Fabric. This management option reduces complexity and decreases management cost, as network security and access layer functions are enabled and managed through a single console.

(Product photos: front and back views of the FortiSwitch 1024D, 1048D, 1048E, 3032D and 3032E.)

FortiLink features (all models):
- LAG support for FortiLink connection: Yes
- Active-Active Split LAG from FortiGate to FortiSwitches for advanced redundancy: Yes

Layer 2 features (supported on all five models — FortiSwitch 1024D, 1048D, 1048E, 3032D and 3032E):
- Jumbo Frames
- Auto-negotiation for port speed and duplex
- IEEE 802.1D MAC Bridging/STP
- IEEE 802.1w Rapid Spanning Tree Protocol (RSTP)
- IEEE 802.1s Multiple Spanning Tree Protocol (MSTP)
- STP Root Guard
- Edge Port / Port Fast
- IEEE 802.1Q VLAN Tagging

* Fortinet Warranty Policy: /doc/legal/EULA.pdf
ORDER INFORMATION
FS-SW-LIC-3000 — SW License for FS-3000 Series Switches to activate Advanced Features. When managing a FortiSwitch with a FortiGate via FortiGate Cloud, no additional license is necessary. For details of transceiver modules, see the Fortinet Transceivers datasheet.
Analysis of GPU-Based Parallel Cracking Methods for Block Ciphers
3. GPUs offer good programmability, support programming in high-level languages such as C, and have a clear advantage in parallel processing of large-scale data streams [13]. NVIDIA introduced its CUDA parallel computing technology in 2007, freeing general-purpose GPU computing (GPGPU) from the graphics hardware pipeline and high-level shading languages, so that developers with no graphics programming experience can write high-performance computations in a single-task, multiple-data style. This greatly lowered the barrier to entry for GPGPU development [14][15].
…a significant improvement in dealing with data-intensive problems. In this article we mainly discuss the GPU architecture, with a detailed analysis of the CUDA programming model and its implementation mechanisms for parallel computing. We then propose a new general data-parallel framework based on existing brute-force attack algorithms for block ciphers, which offloads the bulk of the cracking computation from the CPU to the GPU in order to improve computing throughput.
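The thesis's actual framework is not reproduced here, but a minimal CUDA sketch of the kind of offload it describes could look as follows. The toy XOR "cipher", the kernel name and the launch dimensions are all invented for illustration; each GPU thread tests one candidate key from a disjoint slice of the key space while the CPU merely launches batches and checks for a hit:

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real block cipher's decryption routine.
__device__ uint64_t toy_decrypt(uint64_t ciphertext, uint64_t key) {
    return ciphertext ^ key;
}

// Each thread tests one candidate key from a disjoint slice of the key space.
__global__ void crack_kernel(uint64_t ciphertext, uint64_t known_plaintext,
                             uint64_t base_key, unsigned long long* found_key) {
    uint64_t key = base_key + (uint64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (toy_decrypt(ciphertext, key) == known_plaintext) {
        atomicExch(found_key, (unsigned long long)key);  // report the hit
    }
}

int main() {
    unsigned long long* d_found;
    cudaMalloc(&d_found, sizeof(*d_found));
    cudaMemset(d_found, 0xFF, sizeof(*d_found));  // all-ones sentinel
    // The CPU only orchestrates; the heavy loop runs on the GPU.
    crack_kernel<<<1 << 16, 256>>>(0xDEADBEEFULL ^ 42ULL, 0xDEADBEEFULL,
                                   0ULL, d_found);
    unsigned long long h_found = 0;
    cudaMemcpy(&h_found, d_found, sizeof(h_found), cudaMemcpyDeviceToHost);
    std::printf("candidate key: %llu\n", h_found);
    cudaFree(d_found);
    return 0;
}

A host loop would advance base_key by the batch size (here 2^16 blocks of 256 threads, about 16.8 million keys per launch) until the sentinel value changes, mirroring the CPU-orchestration / GPU-computation split described above.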
Parallel Computing Toolbox
Parallel computing with MATLAB. You can use Parallel Computing Toolbox to run applications on a multicore desktop with local workers available in the toolbox, take advantage of GPUs, and scale up to a cluster (with MATLAB Distributed Computing Server).

Programming Parallel Applications
Parallel Computing Toolbox provides several high-level programming constructs that let you convert your applications to take advantage of computers equipped with multicore processors and GPUs. Constructs such as parallel for-loops (parfor) and special array types for distributed processing and for GPU computing simplify parallel code development by abstracting away the complexity of managing computations and data between your MATLAB session and the computing resource you are using.
You can run the same application on a variety of computing resources without reprogramming it. The parallel constructs function in the same way regardless of the resource on which your application runs, whether a multicore desktop (using the toolbox) or a larger resource such as a computer cluster (using the toolbox with MATLAB Distributed Computing Server).

Using Built-In Parallel Algorithms in Other MathWorks Products
Key functions in several MathWorks products have built-in parallel algorithms. In the presence of Parallel Computing Toolbox, these functions can distribute computations across available parallel computing resources, allowing you to speed up not just your MATLAB and Simulink based analysis or simulation tasks but also code generation for large Simulink models. You do not have to write any parallel code to take advantage of these functions.

Speeding Up Task-Parallel Applications
You can speed up some applications by organizing them into independent tasks (units of work) and executing multiple tasks concurrently. This class of task-parallel applications includes simulations for design optimization, BER testing, Monte Carlo simulations, and repetitive analysis on a large number of data files.
The toolbox offers parfor, a parallel for-loop construct that can automatically distribute independent tasks to multiple MATLAB workers (MATLAB computational engines running independently of your desktop MATLAB session). This construct automatically detects the presence of workers and reverts to serial behavior if none are present. You can also set up task execution using other methods, such as manipulating task objects in the toolbox. You can use parallel for-loops in MATLAB scripts and functions and execute them both interactively and offline.

Speeding Up MATLAB Computations with GPUs
Parallel Computing Toolbox provides GPUArray, a special array type with several associated functions that lets you perform computations on CUDA-enabled NVIDIA GPUs directly from MATLAB. Functions include fft, element-wise operations, and several linear algebra operations such as lu and mldivide, also known as the backslash operator (\). The toolbox also provides a mechanism that lets you use your existing CUDA-based GPU kernels directly from MATLAB.
Using GPUArrays and GPU-enabled MATLAB functions helps speed up MATLAB operations without low-level CUDA programming.

Scaling Up to Clusters, Grids, and Clouds Using MATLAB Distributed Computing Server
Parallel Computing Toolbox provides the ability to run MATLAB workers locally on your multicore desktop to execute your parallel applications, allowing you to fully use the computational power of your desktop. Using the toolbox in conjunction with MATLAB Distributed Computing Server, you can run your applications on large-scale computing resources such as computer clusters or grid and cloud computing resources. The server enables applications developed using Parallel Computing Toolbox to harness computer clusters for large problems, for example running a gene regulation model on a cluster.

Implementing Data-Parallel Applications Using the Toolbox and MATLAB Distributed Computing Server
Distributed arrays in Parallel Computing Toolbox are special arrays that can hold several times the amount of data that your desktop computer's memory (RAM) can hold. Distributed arrays apportion the data across several MATLAB worker processes running on a computer cluster (using MATLAB Distributed Computing Server). As a result, with distributed arrays you can overcome the memory limits of your desktop computer and solve problems that require manipulating very large matrices.
With over 150 functions available for working with distributed arrays, you can interact with these arrays as you would with MATLAB arrays and manipulate data available remotely on workers without low-level MPI programming. Available functions include linear algebra routines based on ScaLAPACK, such as mldivide (the backslash operator, \), lu and chol, and functions for moving distributed data to and from MAT-files.
For fine-grained control over your parallelization scheme, the toolbox provides the single program multiple data (spmd) construct and several message-passing routines based on an MPI standard library (MPICH2). The spmd construct lets you designate sections of your code to run concurrently across workers participating in a parallel computation. During program execution, spmd automatically transfers data and code used within its body to the workers and, once the execution is complete, brings results back to the MATLAB client session. Message-passing functions for send, receive, broadcast, barrier, and probe operations are available. Distributed arrays and parallel algorithms let you create data-parallel MATLAB programs with minimal changes to your code and without programming in MPI.

Running Parallel Applications Interactively and as Batch Jobs
You can execute parallel applications interactively and in batch using Parallel Computing Toolbox.
Using the parpool command, you can connect your MATLAB session to a pool of MATLAB workers that run either locally on your desktop (using the toolbox) or on a computer cluster (using MATLAB Distributed Computing Server) to set up a dedicated interactive parallel execution environment. You can execute parallel applications from the MATLAB prompt on these workers and retrieve results immediately as computations finish, just as you would in any MATLAB session.
Running applications interactively is suitable when execution time is relatively short. When your applications need to run for a long time, you can use the toolbox to set them up to run as batch jobs, freeing your MATLAB session for other activities while large MATLAB and Simulink applications execute. While your application executes in batch, you can shut down your MATLAB session and retrieve results later. The toolbox provides several mechanisms to manage offline execution of parallel programs, such as the batch function and job and task objects; both can be used to offload the execution of serial MATLAB and Simulink applications from a desktop MATLAB session.
You can run applications on your workstation using twelve workers available with the toolbox, or on a computer cluster using more workers available with MATLAB Distributed Computing Server.
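The "existing CUDA-based GPU kernels" mechanism mentioned earlier is the toolbox's parallel.gpu.CUDAKernel interface. As a minimal sketch (the saxpy kernel below is my own example, not from this text), you would compile the kernel to PTX with nvcc -ptx saxpy.cu, load it with k = parallel.gpu.CUDAKernel('saxpy.ptx','saxpy.cu'), set k.ThreadBlockSize and k.GridSize, and invoke it on gpuArray data with feval(k, a, x, y, n):

// saxpy.cu — single-precision a*x + y, callable from MATLAB once
// compiled to PTX. MATLAB maps gpuArray inputs onto the pointer
// arguments and returns the modified output array.
extern "C" __global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

This lets existing CUDA code be reused from MATLAB without writing host-side CUDA boilerplate.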
Electronic and Information Engineering: Technical English Translation Exercise (3,000 words)
Unit 3: Computer Architecture and Microprocessors

3-1 Computer Architecture
Computer architecture, in computer science, is a general term referring to the structure of all or part of a computer system. The term also covers the design of system software, such as the operating system (the program that controls the computer), as well as referring to the combination of hardware and basic software that links the machines on a computer network. Computer architecture refers to an entire structure and to the details needed to make it functional. Thus, computer architecture covers computer systems, microprocessors, circuits, and system programs. Typically the term does not refer to application programs, such as spreadsheets or word processing, which are required to perform a task but not to make the system run.

1. Design Elements
In designing a computer system, architects consider five major elements that make up the system's hardware: the arithmetic/logic unit, control unit, memory, input, and output. The arithmetic/logic unit performs arithmetic and compares numerical values. The control unit directs the operation of the computer by taking the user instructions and transforming them into electrical signals that the computer's circuitry can understand. The combination of the arithmetic/logic unit and the control unit is called the central processing unit (CPU). The memory stores instructions and data. The input and output sections allow the computer to receive and send data, respectively.
Different hardware architectures are required because of the specialized needs of systems and users. One user may need a system to display graphics extremely fast, while another system may have to be optimized for searching a database or conserving battery power in a laptop computer.
In addition to the hardware design, the architects must consider what software programs will operate the system. Software, such as programming languages and operating systems, makes the details of the hardware architecture invisible to the user. For example, computers that use the C programming language or a UNIX operating system may appear the same from the user's viewpoint, although they use different hardware architectures.

2. Processing Architecture
When a computer carries out an instruction, it proceeds through five steps. First, the control unit retrieves the instruction from memory, for example, an instruction to add two numbers. Second, the control unit decodes the instruction into electronic signals that control the computer. Third, the control unit fetches the data (the two numbers). Fourth, the arithmetic/logic unit performs the specific operation (the addition of the two numbers). Fifth, the control unit saves the result (the sum of the two numbers).
Early computers used only simple instructions because the cost of electronic circuitry capable of carrying out complex instructions was high. As this cost decreased in the 1960s, more complicated instructions became possible. Complex instructions (single instructions that specify multiple operations) can save time because they make it unnecessary for the computer to retrieve additional instructions. For example, if seven operations are combined in one instruction, then six of the steps that fetch instructions are eliminated and the computer spends less time processing that operation.
Computers that combine several instructions into a single operation are called complex instruction set computers (CISC).
However, most programs do not often use complex instructions, but consist mostly of simple instructions. When these simple instructions are run on CISC architectures, they slow down processing because each instruction, whether simple or complex, takes longer to decode in a CISC design. An alternative strategy is to return to designs that use only simple, single-operation instruction sets and make the most frequently used operations faster in order to increase overall performance. Computers that follow this design are called reduced instruction set computers (RISC).
RISC designs are especially fast at the numerical computations required in science, graphics, and engineering applications. CISC designs are commonly used for non-numerical computations because they provide special instruction sets for handling character data, such as text in a word processing program. Specialized CISC architectures, called digital signal processors, exist to accelerate processing of digitized audio and video signals.

3. Open and Closed Architectures
The CPU of a computer is connected to memory and to the outside world by means of either an open or a closed architecture. An open architecture can be expanded after the system has been built, usually by adding extra circuitry, such as a new microprocessor computer chip connected to the main system. The specifications of the circuitry are made public, allowing other companies to manufacture these expansion products.
Closed architectures are usually employed in specialized computers that will not require expansion, for example, computers that control microwave ovens. Some computer manufacturers have used closed architectures so that their customers can purchase expansion circuitry only from them. This allows the manufacturer to charge more and reduces the options for the consumer.

4. Network Architecture
Computers communicate with other computers via networks. The simplest network is a direct connection between two computers. However, computers can also be connected over large networks, allowing users to exchange data, communicate via electronic mail, and share resources such as printers.
Computers can be connected in several ways. In a ring configuration, data are transmitted along the ring and each computer in the ring examines this data to determine if it is the intended recipient. If the data are not intended for a particular computer, the computer passes the data to the next computer in the ring. This process is repeated until the data arrive at their intended destination. A ring network allows multiple messages to be carried simultaneously, but since each message is checked by each computer, data transmission is slowed.
In a bus configuration, computers are connected through a single set of wires, called a bus. One computer sends data to another by broadcasting the address of the receiver and the data over the bus. All the computers in the network look at the address simultaneously, and the intended recipient accepts the data. A bus network, unlike a ring network, allows data to be sent directly from one computer to another. However, only one computer at a time can transmit data; the others must wait to send their messages.
In a star configuration, computers are linked to a central computer called a hub.
A computer sends the address of the receiver and the data to the hub, which then links the sending and receiving computers directly. A star network allows multiple messages to be sent simultaneously, but it is more costly because it uses an additional computer, the hub, to direct the data.

5. Recent Advances
One problem in computer architecture is caused by the difference between the speed of the CPU and the speed at which memory supplies instructions and data. Modern CPUs can process instructions in 3 nanoseconds (3 billionths of a second). A typical memory access, however, takes 100 nanoseconds, and each instruction may require multiple accesses. To compensate for this disparity, new computer chips have been designed that contain small memories, called caches, located near the CPU. Because of their proximity to the CPU and their small size, caches can supply instructions and data faster than normal memory. Cache memory stores the most frequently used instructions and data and can greatly increase efficiency.
Although a large cache memory can hold more data, it also becomes slower. To compensate, computer architects employ designs with multiple caches. The design places the smallest and fastest cache nearest the CPU and locates a second, larger and slower cache farther away. This arrangement allows the CPU to operate on the most frequently accessed instructions and data at top speed and to slow down only slightly when accessing the secondary cache. Using separate caches for instructions and data also allows the CPU to retrieve an instruction and data simultaneously.
Another strategy to increase speed and efficiency is the use of multiple arithmetic/logic units for simultaneous operations, called superscalar execution. In this design, instructions are acquired in groups. The control unit examines each group to see if it contains instructions that can be performed together. Some designs execute as many as six operations simultaneously. It is rare, however, to have this many instructions run together, so on average the CPU does not achieve a six-fold increase in performance.
Multiple computers are sometimes combined into single systems called parallel processors. When a machine has more than one thousand arithmetic/logic units, it is said to be massively parallel. Such machines are used primarily for numerically intensive scientific and engineering computation. Parallel machines containing as many as sixteen thousand computers have been constructed.

3-3 VLIW Microprocessors
When Transmeta Corp. revealed its new Crusoe family of processors last month, experts weren't surprised to learn that the chips are based on Very Long Instruction Word (VLIW) technology. VLIW has become the prevailing philosophy of microprocessor design, eclipsing older approaches such as RISC and complex instruction set computing (CISC).
All microprocessor designs seek better performance within the limitations of their contemporary technology. In the 1970s, for example, memory was measured in kilobytes and very expensive. CISC was the dominant approach because it conserved memory.
In the CISC architecture, there can be hundreds of program instructions: simple commands that tell the system to add numbers, store values and display results.
If all instructions were the same length, the simple ones would waste memory. Simple instructions require as little as 8 bits of storage space, while the most complex consume 120 bits.
Variable-length instructions are more difficult for a chip to process, though, and the longer CISC instructions are especially complex. Nonetheless, to maintain software compatibility, modern chips such as Intel's Pentium III and Advanced Micro Devices Inc.'s Athlon must still work with all the troublesome CISC instructions that were designed in the 1980s, even though their original advantage, memory conservation, isn't as important.
In the 1980s, RAM chips got bigger and bigger in capacity while their prices dropped. The emphasis in CPU design shifted to relatively simple, fixed-length instructions, always 32 bits long. Although this wastes some memory by making programs bigger, the instructions are easier and faster to execute.
The simplicity of RISC also makes it easier to design superscalar processors: chips that can execute more than one instruction at a time. This is called instruction-level parallelism. Almost all modern RISC and CISC processors are superscalar. But achieving this capability introduced significant new levels of design complexity.
VLIW chips can cost less, burn less power and achieve significantly higher performance than comparable RISC and CISC chips. But there are always trade-offs. One is code expansion: programs grow larger, requiring more memory. Far more important, though, is that compilers must get smarter. A poor VLIW compiler will have a much greater negative impact on performance than would a poor RISC or CISC compiler.
VLIW isn't a magic bullet, but it's the new wave in microprocessor design. Within a few years, it's certain that at least some of your software will be running on VLIW chips.
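To make the cache discussion under "Recent Advances" concrete, here is a minimal direct-mapped cache model (the sizes and the two-pass access pattern are invented for the example). The first sweep over a working set that exactly fits the cache misses once per 64-byte line; the second sweep hits on every access, which is why frequently reused instructions and data are served at cache speed rather than memory speed:

#include <array>
#include <cstdint>
#include <cstdio>

struct DirectMappedCache {
    static const int kLines = 64;          // 64 lines of 64 bytes: a 4 KB cache
    std::array<int64_t, kLines> tags;
    DirectMappedCache() { tags.fill(-1); } // start with every line empty
    bool access(uint64_t addr) {           // returns true on a hit
        uint64_t line = (addr / 64) % kLines;
        int64_t tag = (int64_t)(addr / 64 / kLines);
        if (tags[line] == tag) return true;
        tags[line] = tag;                  // miss: fill the line from memory
        return false;
    }
};

int main() {
    DirectMappedCache c;
    int hits = 0;
    for (int pass = 0; pass < 2; ++pass)
        for (uint64_t a = 0; a < 4096; a += 8)
            hits += c.access(a);
    // First pass: one miss per 64-byte line (960 hits of 1024 total accesses);
    // second pass: every access hits.
    std::printf("hits: %d of %d accesses\n", hits, 2 * 512);
    return 0;
}

With the paragraph's own numbers (3 ns per instruction versus 100 ns per memory access), even a modest hit rate converts most of those 100 ns accesses into cache-speed ones, which is the whole argument for the multi-level cache designs described above.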
Synopsys Digital Implementation Library Datasheet
DATASHEET

Overview
Accurate library characterization is the foundation of successful digital implementation. Synthesis, place-and-route, verification and signoff tools rely on precise model libraries to accurately represent the timing, noise and power performance of digital and memory designs. Cell library characterization complexity has dramatically increased as libraries migrate to more advanced process nodes. Low-power design further complicates the characterization process by introducing complex cells such as multi-bit flip-flops, level shifters and retention logic, which must be accurately characterized to ensure effective digital implementation across multiple power domains. In addition, process variability, aging, reliability and electro-migration on these nodes require fast and accurate characterization to model and validate the effects. This increased requirement to generate, model and validate data is also responsible for an increased demand on compute for characterization.

Introduction
The PrimeLib solution includes a comprehensive array of library characterization and QA capabilities that are tuned to produce PrimeTime® signoff-quality libraries with maximum throughput on available compute resources. PrimeLib's innovative technologies utilize embedded gold-reference SPICE engines to speed up characterization of the advanced Liberty™ models used by PrimeTime static timing analysis (STA) to accurately account for effects seen in ultra-low-voltage FinFET processes that impact timing. This includes PrimeTime parametric on-chip variation (POCV), advanced waveform propagation (AWP) and electromigration (EM) analysis. PrimeLib is cloud-ready, and with its optimized scaling technology delivers accelerated throughput on a cloud or an on-premise cluster.

Figure 1: Platform-level integration of PrimeLib with HSPICE and PrimeTime ensures signoff-quality libraries.

PrimeLib: Unified Library Characterization and Validation. Accurate and comprehensive library characterization for successful digital implementation.

Key Features and Benefits
• SmartScaling-based multi-PVT characterization to instantly generate libraries and significantly reduce the overall characterization required for multiple PVT corners
• Single captive license bundles everything required for cell library characterization, QA and simulation
• Simple multi-core licensing enables easy adaptation to constantly changing characterization workload requirements
• Embedded gold-reference SPICE engines for best accuracy, and integrated signoff library validation tuned to produce PrimeTime signoff-quality libraries
• Innovative technologies provide high characterization throughput
• ML-based high-sigma characterization with HSPICE® AVA
• Faster LVF runtime using new ML models and key technologies
• Cell reliability characterization to capture impacts of device model degradation over time (aging) and electro-migration (EM)
• ML-based augmented sensitivity database to enable faster time-to-market for an updated PDK
• Comprehensive QA features for library validation and SPICE correlation
• Unified GUI for library database management, job processing and monitoring, and for comparing and validating libraries: one GUI to visualize it all
• Library characterization environment encryption support enables IP providers to deliver re-characterization kits

Figure 2: PrimeLib inputs and outputs.

PrimeLib Statistical Characterization
PrimeLib provides a comprehensive solution for fast and accurate process variation characterization and generation of PrimeTime-compliant Liberty variation
format (LVF), advanced on-chip variation (AOCV) and parametric on-chip variation (POCV) models. PrimeLib offers a range of solutions to reduce the overall time for LVF library characterization, and flexible characterization flows are supported to produce accurate libraries. The traditional sensitivity-based approach (SBA) generates accurate LVF data for regular voltage corners, where delay, slew and constraint values follow a Gaussian distribution. At ultra-low voltages, however, the relationship between parameter perturbation and results becomes non-linear, and variation responses display skewed behavior. Machine-learning-based algorithms enable accurate modeling of these non-Gaussian distributions at ultra-low-voltage corners. For a golden accuracy reference, PrimeLib provides a Monte Carlo capability which uses the built-in MC feature of the simulator.

Figure 3: PrimeLib machine learning and Monte Carlo response.

Key Features
• Dynamic selection of algorithms (ML/SBA) based on the variation trend. The effect of process variation becomes smaller at higher voltages and faster corners, so compute-intensive ML methods are not necessary there; this technology automatically determines whether the sensitivity-based approach (SBA) can be used without compromising accuracy.
• Advanced ML algorithms to produce accurate LVF models for near-threshold or sub-Vt corners.
• Advanced algorithms to improve LVF characterization turn-around time for cells with fingered devices. (The physical implementation of a single large transistor can be extracted as multiple transistors, and this increase in transistor count slows down the computation of the cell's sensitivity to process variation.)
• Pre-analysis flow to filter out insignificant statistical parameters based on transistor-level IRV analysis results.
• Enhanced algorithm for robust and accurate arc binning.
• Delay-measurement-based approach, instead of a bisection approach, to speed up LVF constraint characterization runtimes.

Figure 4: Characterization methods using Monte Carlo and sensitivity-based analysis.

High Sigma Characterization
High-sigma characterization can be used to ensure robust standard cell components at lower process nodes and for automotive applications. PrimeLib supports fast and accurate high-sigma Monte Carlo simulation in the 3–5.75 sigma range with HSPICE AVA. The simulators use advanced ML techniques to reduce the number of simulations by several orders of magnitude compared to traditional MC for high-sigma analysis.

SmartScaling-Based Multi-PVT Characterization
SmartScaling for multi-PVT characterization reduces the overall requirement to characterize full libraries across different PVTs and significantly improves the overall turnaround time. The SmartScaling solution produces instant, zero-cost intermediate libraries from a SmartScaling database at selected corners based on anchor PVTs. The multi-dimensional scaling feature (across voltage, process and temperature) uses the SmartScaling engine to generate accurate signoff-quality libraries with timing/CCST/CCSN/power/LVF data.

PrimeTime Signoff Quality Libraries
Advanced process node standard cell libraries require accurate timing and noise models to ensure confident static timing analysis signoff, especially for mobile IC and IoT applications operating at ultra-low voltages.
To meet the accuracy needs for advanced node characterization, PrimeLib model generation has been tightly calibrated with PrimeTime and HSPICE® models to provide the best correlation and accuracy results.

Integrated Signoff Library Validation
Successful IC design requires high-quality libraries. PrimeLib provides a comprehensive set of capabilities for quality assurance to verify the consistency, accuracy and completeness of libraries. These capabilities include consistency checks across views within a library with easy-to-visualize HTML reports and heat maps, GUI-based library-to-library comparison, and advanced SPICE-based correlation capabilities for timing, noise, constraints and power.

Simple Multi-Core Licensing
PrimeLib's unique licensing approach easily adjusts to varying workload profiles, eliminating the burden on characterization teams to predict future workload requirements and to operate within the constraints of traditionally cumbersome licensing methods. Dedicated SPICE availability for characterization teams is another added benefit.

High Characterization Throughput: Optimized for Cloud or Cluster
PrimeLib provides high throughput on a wide range of computing environments with its many performance-focused features, including netlist optimization, automatic function recognition with vector generation, vector optimization and efficient utilization of compute resources in the cloud or in a cluster. Library characterization is a disk-intensive and highly distributed process. PrimeLib is optimized for NFS traffic and disk usage and scales linearly to provide the fastest and most efficient throughput based on available resources. PrimeLib's enhanced license checkout mechanism reduces the overhead on license servers that can be caused by highly distributed processes.

Figure 5: Parallel characterization.

Cell Reliability Characterization
Advances in process technology are increasing the impact of electro-migration (EM) on the performance and reliability of designs. Similarly, the stress on and degradation in the performance of transistors with continued usage over time is another growing reliability concern. Cell reliability characterization is therefore an important capability of PrimeLib, given the pressing needs of long-running applications such as those in the automotive industry. EM characterization in PrimeLib is supported for avg, rms and peak current types. Aging characterization is supported for MOSRA, TMI and OMI aging models, for both BTI and HCI effects. Both of these characterization flows are based on the basic characterization flow, making them easy to set up and use.

Library Characterization Environment Encryption Support
IP providers have to deliver re-characterization kits to their customers without opening up their characterization methodology IP. This is where the library characterization environment encryption support of PrimeLib is useful.

Simulator Support
PrimeLib offers support for the existing FineSim and HSPICE simulators as well as for the next-generation PrimeSim simulator products. The embedded as well as the standalone simulator invocations are captive in nature, so you don't have to worry about checking out any additional simulator license keys. The PrimeLib-Core license tokens are all you need to invoke any of these simulators.
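As a closing note on the statistical characterization described above: the datasheet does not state PrimeLib's internal formulation, but the classical first-order model behind a sensitivity-based approach combines per-parameter sensitivities into a single variance:

\sigma_D^2 \;\approx\; \sum_i \left( \frac{\partial D}{\partial p_i} \right)^2 \sigma_{p_i}^2

where D is the characterized quantity (delay, slew or constraint) and each p_i is a statistical process parameter with standard deviation \sigma_{p_i}. This model assumes a linear, Gaussian response, which is exactly the assumption that breaks down at ultra-low voltages and motivates the ML-based path.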
A CACHE-BASED DATA INTENSIVE DISTRIBUTED COMPUTING ARCHITECTURE FOR “GRID” APPLICATIONS
Brian Tierney, William Johnston, Jason Lee
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Abstract
Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data from around the world, as well as employing large-scale computation. The distributed systems that solve large-scale problems will always involve aggregating and scheduling many resources. Data must be located and staged, cache and network capacity must be available at the same time as computing capacity, etc. Every aspect of such a system is dynamic: locating and scheduling resources, adapting running application systems to availability and congestion in the middleware and infrastructure, responding to human interaction, etc. The technologies, the middleware services, and the architectures that are used to build useful high-speed, wide area distributed systems constitute the field of data intensive computing, and are sometimes referred to as the "Data Grid". This paper explores the use of a network data cache in a Data Grid environment.
1. INTRODUCTION
High-speed data streams resulting from the operation of on-line instruments and imaging systems are a staple of modern scientific, health care, and intelligence environments. The advent of high-speed networks is providing the potential for new approaches to the collection, organization, storage, analysis, visualization, and distribution of the large-data-objects that result from such data streams. The result will be to make both the data and its analysis much more readily available.
For example, high energy physics experiments generate high rates and massive volumes of data that must be processed and archived in real time. This data must also be accessible to large scientific collaborations — typically hundreds of investigators at dozens of institutions around the world. In this paper we will describe how "Computational Grid" environments can be used to help with these types of applications, and give a specific example of a high energy physics application in this environment. We describe how a high-speed application-level network data cache is a particularly important component in a data intensive grid architecture, and describe our implementation of such a cache.
2. DATA INTENSIVE GRIDS
The integration of the various technological approaches being used to address the problem of integrated use of dispersed resources is frequently called a "grid," or a computational grid — a name arising by analogy with the grid that supplies ubiquitous access to electric power. See, e.g., [9]. Basic grid services are those that locate, allocate, coordinate, utilize, and provide for human interaction with the various resources that actually perform useful functions.
Grids are built from collections of primarily independent services. The essential aspect of grid services is that they are uniformly available throughout the distributed environment of the grid. Services may be grouped into integrated sets of services, sometimes called "middleware." Current grid tools include Globus [8], Legion [15], SRB [2], and workbench systems like Habanero [10] and WebFlow [1]. Recently the term "Data Grid" has come into use to describe middleware and services for data intensive Grid applications [3], and several data grid research projects have been started [5][17]. From the application's point of view, the Grid is a collection of middleware services that provide applications with a uniform view of distributed resource components and the mechanisms for assembling them into systems. From the middleware systems' point of view, the Grid is a standardized set of basic services providing scheduling, resource discovery, global data directories, security, communication services, etc. However, from the Grid implementor's point of view, these services result from and must interact with a heterogeneous set of capabilities, and frequently involve "drilling" down through the various layers of the computing and communications infrastructure.
2.1 Architecture for Data Intensive Environments
Our model is to use a high-speed data storage cache as a common element for all of the sources and sinks of data involved in high-performance data systems. We use the term "cache" to mean storage that is faster than typical local disk, and temporary in nature. This cache-based approach provides standard interfaces to a large, application-oriented, distributed, on-line, transient storage system. In a wide-area Grid environment, the caches must be specifically designed to achieve maximum throughput over high-speed networks.
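Purely as an illustration of the interface style such a cache implies (the class, method names and block size below are invented, not the paper's actual API), an application might stage a data set into the cache and then read it through a uniform block interface, with the implementation free to stripe requests across parallel network streams:

#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical client for an application-level network data cache.
class CacheClient {
public:
    explicit CacheClient(std::string host) : host_(std::move(host)) {}

    // Ask the cache to pull a named data set from slower storage
    // (tape silo, remote archive) into its fast transient storage.
    void stage(const std::string& dataset) {
        std::printf("staging %s on %s\n", dataset.c_str(), host_.c_str());
    }

    // Read one fixed-size block; a real client would stripe large
    // reads across several parallel TCP streams for throughput.
    std::vector<uint8_t> read_block(const std::string& dataset, uint64_t block) {
        (void)dataset; (void)block;
        return std::vector<uint8_t>(64 * 1024, 0);  // stubbed 64 KB block
    }

private:
    std::string host_;
};

int main() {
    CacheClient cache("cache.example.org");
    cache.stage("physics-run-17");
    std::vector<uint8_t> block = cache.read_block("physics-run-17", 0);
    std::printf("read %zu bytes\n", block.size());
    return 0;
}

The point of the sketch is the separation of concerns: the application sees simple stage/read semantics, while placement, striping and eviction decisions stay inside the cache.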