Interprocessor Communication with Limited Memory
ITERATIVELY WEIGHTED MMSE APPROACH TO DISTRIBUTED SUM-UTILITY MAXIMIZATION FOR INTERFERING CHANNEL

Consider the MIMO interfering broadcast channel whereby multiple base stations in a cellular network simultaneously transmit signals to a group of users in their own cells while causing interference to the users in other cells. The basic problem is to design linear beamformers that can maximize the system throughput. In this paper we propose a linear transceiver design algorithm for weighted sum-rate maximization that is based on iterative minimization of weighted mean squared error (MSE). The proposed algorithm only needs local channel knowledge and converges to a stationary point of the weighted sum-rate maximization problem. Furthermore, we extend the algorithm to a general class of utility functions and establish its convergence. The resulting algorithm can be implemented in a distributed asynchronous manner. The effectiveness of the proposed algorithm is validated by numerical experiments.

Index Terms: MIMO Interfering Broadcast Channel, Power Allocation, Beamforming, Coordinate Descent Algorithm

1. INTRODUCTION

Consider a MIMO Interfering Broadcast Channel (IBC) in which a number of transmitters, each equipped with multiple antennas, wish to simultaneously send independent data streams to their intended receivers. As a generic model for multi-user downlink communication, MIMO-IBC can be used in the study of many practical systems such as Digital Subscriber Lines (DSL), Cognitive Radio systems, ad-hoc wireless networks, and wireless cellular communication, to name just a few. Unfortunately, despite its importance and years of intensive research, the search for optimal transmit/receive strategies that can maximize the weighted sum-rate of all users in a MIMO-IBC remains rather elusive. This lack of understanding of the capacity region has motivated a pragmatic approach whereby we simply treat interference as noise and maximize the weighted sum-rate by searching within the class of linear transmit/receive strategies.
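The connection between rate and MSE that makes a weighted-MSE formulation plausible can be illustrated in the simplest scalar case. The sketch below is not the algorithm proposed in the paper; it only checks numerically the standard identity that, with interference treated as noise and an MMSE receiver, the minimum MSE equals 1/(1 + SINR), so the rate log(1 + SINR) equals -log(MSE). The function name and the channel values are assumptions chosen purely for illustration.

```python
import math


def scalar_mmse_and_rate(h, p, sigma2):
    """Return (minimum MSE, rate in nats) for a scalar link y = h*sqrt(p)*s + n.

    s has unit power; n lumps interference and noise, with total power sigma2.
    All names and values here are illustrative assumptions.
    """
    g = h * math.sqrt(p)                      # effective gain after transmit scaling
    sinr = g * g / sigma2                     # interference is treated as noise
    u = g / (g * g + sigma2)                  # MMSE receive coefficient
    mse = (1 - u * g) ** 2 + u * u * sigma2   # E|u*y - s|^2 at the MMSE receiver
    rate = math.log(1 + sinr)
    return mse, rate


if __name__ == "__main__":
    mse, rate = scalar_mmse_and_rate(h=0.8, p=2.0, sigma2=0.5)
    # The two printed quantities should agree up to rounding.
    print(f"rate = {rate:.6f}, -log(MSE) = {-math.log(mse):.6f}")
```

In the MIMO setting the same relation is usually stated with matrices (log-determinant of the inverse MSE matrix), which is beyond this scalar sketch.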
Weighted sum-rate maximization for an Interference Channel (IFC), which is a special case of IBC, has been
NEC UNIVERGE SV9500 Communications Server Manual

Achieve the Smart Enterprise

The Smart Enterprise innovates by leveraging the best and most current information technologies, tools, and products. With NEC's UNIVERGE® SV9500 Communications Server, smart enterprises are empowered by technologies which optimize business practices, drive workforce engagement, and create a competitive advantage.

Power for Large Businesses

The UNIVERGE SV9500 is a powerful communications solution that is designed to provide competitive businesses with the high-efficiency, easy-to-deploy technology that they require. Reliable, scalable, adaptable, and easy-to-manage, the SV9500 is built on cutting-edge technology that supports Voice, Unified Communications (UC) and Collaboration, Unified Messaging, and Mobility out-of-the-box, all the while remaining easy to manage. This robust, feature-rich system is ideal for geographically distributed businesses and enterprises. It is designed to help solve today's communications challenges and offers easy integration with NEC's unique vertical solutions.

[Figure: UNIVERGE® SV9500 system overview. Labels: Communication Server, MG (SIP), UG50, IP Phone, Soft Phone, DECT Phone, UC Desktop, Smart Device]

The UNIVERGE SV9500 offers:
▪ Premier IP unified communications
▪ Voice/UC/UM delivered as an integrated solution
▪ Simplified user licensing
▪ Comprehensive contact center suite
▪ Broad range of mobility applications and devices
▪ Vertical market-specific solutions
▪ Wide range of end-points
▪ Single point configuration and management
▪ Multi-Line SIP client, multi-carrier support
▪ Virtualization support
▪ Delivery on NEC's green initiatives

Thailand Unified Communications Vendor of The Year 2015

UNIVERGE® SV9500: Empowering the Smart Workforce

Innovation that Fits your IT Architecture

Maintain IT more efficiently

The user-friendly management interface streamlines system administration, giving your IT department one personalized portal to administer the entire communications system (Voice, Unified Communications, and Voicemail), all from one central location. The SV9300 meets all the needs of today's IT manager for operational efficiency, security and IT governance.

No one wants a communications system that's difficult to use and even harder to maintain and protect. That's why NEC's SV9500 is one of the easiest to configure Unified-Communications-capable systems on the market. The SV9500 easily integrates with existing IT technology as a fully interoperable digital or IP system.

Data Center ready

Working seamlessly in data centers and cloud environments, SV9500 aligns with IT strategies to virtualize communication and collaboration services, whether deployed in a data center, spread across an organization's different sites or hosted in the cloud.

Virtualize your environment

The SV9500 gives you the option of a fully virtualized communications solution. By doing so you can deploy applications faster, increase performance and availability, and automate operations, resulting in IT that's easier to implement and less costly to own and maintain.

Make collaborating easier with Unified Communications

NEC's SV9500 UC suite of applications gives you the communication tools you need to streamline communications and information delivery. With this powerful, manageable solution, your information is centralized and messages unified, so your employees can efficiently manage day-to-day business and communications easily.

Users are able to dictate and manage how, when, and where they want to be reached via the desktop and mobile clients. And with the help and inclusion of single number reach, an integrated softphone, call forwarding, and voice/video conferencing and collaboration you can ensure that your customers are able to reach whomever they need to, when they need to. SV9500 UC provides you with the option of using the desktop client as a standalone application or integrated with your Microsoft® Office Outlook® client.

Your employees retain ownership of their communications. They set their schedule, and their phone rings accordingly. They launch a meeting or customer service session, and then manage it directly from their desktop.

Making Calling Exciting

Freedom of choice and personalization ensure a smart work environment

Call from your desk phone

For those interested in keeping handsets stationary: NEC's innovative desktop endpoint design is intended to deliver maximum deployment flexibility, while a wide range of choices allow for multiple combinations that fit any and all business niches or personalization requirements.

UNIVERGE® Desktop Telephones make office life better

Enabling communication and access to information in real time

SP350 SoftPhone embeds voice communication into established business processes to bring employees the instant communication and information they require. This versatile communications tool offers an extensive array of high-quality video, audio, voice and text features.

The SP350 SoftPhone is a multimedia IP phone installed on a personal computer or laptop. It delivers high-quality voice communication using a USB-connected headset/handset. Employees can use it as a primary desktop telephone, as a supplemental desktop telephone or as a remote/telecommuting device.

UC for Enterprise Attendant & NEC's UT880 takes it to the next level

Businesses need a cost-effective attendant console that makes their workers more efficient while improving their customer service. NEC's UCE Attendant was designed specifically to optimize business performance and boost a business's standard of service.

▪ Optimal call management through a customizable, intuitive user interface
▪ Presence-enabled directory that seamlessly integrates with corporate directory data
▪ Screen-pops provide valuable customer information even before a call is answered
▪ Skills-based directory search to quickly find the person most suitable to assist the caller
▪ A cost-effective way to increase attendant productivity
▪ Intuitive on screen call control with flexible routing
▪ Seamless integration of presence-enabled directory with click to call, e-mail, SMS and IM
▪ Optional threat recording, 911 alerts, on-call schedules, message taking and procedure management
▪ Integrates with popular contact and CRM applications as well as Microsoft® Outlook®

▪ Wide range of choices: choose from IP or digital, 2-line keys to 32+ or DESI-less, grayscale, color or touch-screen display, custom keypads, plus more
▪ Customizable function keys: can be adapted to the exact individual requirements of your business
▪ User-friendly interface: little or no staff training required
▪ Bluetooth connection adapter: enables users to receive and place calls through either their smart device or desktop telephone

▪ A full seven-inch color display with four-finger multi-touch capabilities
▪ UNIVERGE Multi-Line client that emulates any NEC telephone
▪ Open interface for application development
▪ Supports SV9500 platform voice functionality and hands-free speakerphone
▪ Integrated Bluetooth capability
▪ Built-in camera for video conferencing
▪ Multiple login support
▪ USB port

Advanced Features

UNIVERGE® Softphone: SP350

The SV9500 meets all your communications needs

Business boosting applications: Extend your communication

DECT Phone
DT400 / DT800: Digital Phone & IP Phone

NEC Corporation (Thailand) Ltd. (Head Office)
3 Rajnakarn Building, 22nd fl. and 29th fl., South Sathorn Road, Yannawa, Sathorn, Bangkok 10120
https://
Email: *************.th
ACM Paper Writing Format Standard

ACM Word Template for SIG Site

ABSTRACT
As network speeds continue to grow, new challenges in network processing are emerging. In this paper we first study the network processing flow from a hardware perspective and show that the I/O and memory systems have become the main bottlenecks to performance improvement. Based on this analysis, we conclude that conventional solutions for reducing I/O and memory access latencies are insufficient to address the problem.

Motivated by these studies, we propose an improved DCA combined with INIC solution that contributes optimized architectures, new I/O data transfer schemes and improved cache policies. Experimental results show that our solution reduces cycles by 52.3% and 14.3% on average for receiving and transmitting, respectively. I/O and memory traffic are also significantly decreased. Moreover, we investigate the behavior of the I/O and cache systems during network processing and present some conclusions about the DCA method.

1. INTRODUCTION
Recently, many researchers have found that the I/O system has become the bottleneck to network performance improvement in modern computer systems [1][2][3]. Designed to support compute-intensive applications, the conventional I/O system has obvious disadvantages for fast network processing, in which bulk data transfer is performed. The lack of locality support and high latency are the two main problems of the conventional I/O system, and both have been widely discussed [2][4].

To overcome these limitations, an effective solution called Direct Cache Access (DCA) was suggested by Intel [1]. It delivers network packets from the Network Interface Card (NIC) into the cache instead of memory in order to reduce data access latency. Although the solution is promising, it has been shown that DCA is insufficient to reduce access latency and memory traffic because of many limitations [3][5]. Another solution is the Integrated Network Interface Card (INIC), which is used in many academic and industrial processor designs [6][7]. INIC was introduced to reduce the heavy burden of I/O register accesses in network drivers and of interrupt handling. However, a recent report [8] shows that the benefit of INIC is insignificant for state-of-the-art 10GbE network systems.

In this paper, we focus on highly efficient I/O system design for network processing on general-purpose processors (GPPs). Based on an analysis of existing methods, we propose an improved DCA combined with INIC solution to reduce I/O-related data transfer latency.

The key contributions of this paper are as follows:
▪ Review the network processing flow from a hardware perspective and point out that the I/O system and the related last-level memory system have become the obstacle to performance improvement.
▪ Propose an improved DCA combined with INIC solution for I/O subsystem design to address the inefficiency of a conventional I/O system.
▪ Give a framework for the improved I/O system architecture and evaluate the proposed solution with micro-benchmarks.
▪ Investigate I/O and cache behavior during network processing, based on the proposed I/O system.

The paper is organized as follows. In Section 2, we present the background and motivation. In Section 3, we describe the improved DCA combined with INIC solution and give a framework for the proposed I/O system implementation. In Section 4, we first describe the experimental environment and methods, and then analyze the experimental results. In Section 5, we review related work. Finally, in Section 6, we compare our solution with many existing technologies and draw some conclusions.

2. Background and Motivation
In this section, we first review the network processing flow and the main bottlenecks to network performance improvement today. Then, from the perspective of computer architecture, we give a deeper analysis of the network system and present the motivation of this paper.

2.1 Network processing review
Figure 1 illustrates the network processing flow. Packets from the physical line are sampled by the Network Interface Card (NIC). The NIC performs address filtering and flow control operations, then sends the frames to the socket buffer and notifies the OS through interrupts to invoke network stack processing. When the OS receives the interrupts, the network stack accesses the data in the socket buffer and calculates the checksum. Protocol-specific operations are performed layer by layer during stack processing. Finally, data is transferred from the socket buffer to the user buffer, depending on the application. Commonly this operation is done by memcpy, a system function in the OS.

Figure 1. Network Processing Flow

The time cost of network processing can be broken down mainly into the following parts: interrupt handling, NIC driver, stack processing, kernel routines, data copy, checksum calculation and other overheads. The first four parts are considered packet cost, which means the cost scales with the number of network packets. The rest are considered bit cost (also called data touch cost), which means the cost is proportional to the total I/O data size. The proportion of these costs depends heavily on the hardware platform and the nature of the applications.
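The split between packet cost and bit cost described above can be written as a simple linear model. The following is a minimal sketch, not a model taken from the paper: the class name, the helper, and the cycle coefficients are hypothetical values chosen only to show how the two cost classes scale.

```python
from dataclasses import dataclass


@dataclass
class NetworkCostModel:
    """Linear cost model: cycles = per-packet part + per-byte part.

    per_packet_cycles stands for interrupt handling, driver, stack and
    kernel-routine work; per_byte_cycles stands for data copy and checksum.
    Both coefficients are hypothetical, not measurements.
    """
    per_packet_cycles: float
    per_byte_cycles: float

    def total_cycles(self, num_packets: int, total_bytes: int) -> float:
        packet_cost = self.per_packet_cycles * num_packets   # scales with packet count
        bit_cost = self.per_byte_cycles * total_bytes        # scales with data size
        return packet_cost + bit_cost


def packets_needed(total_bytes: int, payload_bytes: int) -> int:
    """Number of packets required to carry total_bytes (ceiling division)."""
    return -(-total_bytes // payload_bytes)


if __name__ == "__main__":
    model = NetworkCostModel(per_packet_cycles=1000.0, per_byte_cycles=1.0)
    data = 150_000
    for payload in (500, 1500):
        n = packets_needed(data, payload)
        print(payload, n, model.total_cycles(n, data))
```

For a fixed amount of data, the bit cost is the same regardless of payload size, while the packet cost shrinks as packets get larger; this is why the relative share of each part depends on the workload.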
There are many measurements and analyses of network processing costs [9][10]. Generally, the kernel routine cost ranges from 10% to 30% of the total cycles; the driver and interrupt handling costs range from 15% to 35%; the stack processing cost ranges from 7% to 15%; and the data touch cost takes up 20% to 35%. With the development of high-speed networks (e.g. 10/40 Gbps Ethernet), an increasing trend in kernel routine, driver and interrupt handling costs has been observed [3].

2.2 Motivation
To reveal the relationship among the parts of network processing, we investigate the corresponding hardware operations. From the perspective of computer hardware architecture, network system performance is determined by three domains: CPU speed, memory speed and I/O speed. Figure 2 depicts the relationship.

Figure 2. Network xxxx

The network subsystem can achieve its maximal performance only when the three domains above are in balance, which means that the throughput or bandwidth of each hardware domain should match the others. In practice this is hard for hardware designers, because the characteristics and physical implementation technologies differ for CPU, memory and I/O system (chipset) fabrication. The speed gap between memory and CPU, also known as "the memory wall", has received special attention for more than ten years, but it is still not well addressed. The disparity between the data throughput of the I/O system and the computing capacity provided by the CPU has also been reported in recent years [1][2].

Meanwhile, the major time costs of network processing mentioned above are associated with I/O and memory speeds, e.g. driver processing, interrupt handling, and memory copy costs. The most important property of network processing is the "producer-consumer locality" between every two consecutive steps of the processing flow. That is, the data produced in one hardware unit will be immediately accessed by another unit; for example, the data in memory that was transported from the NIC will soon be accessed by the CPU. For conventional I/O and memory systems, however, the data transfer latency is high and the locality is not exploited.

Based on the analysis above, we observe that the I/O and memory systems are the limiting factors for network processing. Conventional DCA or INIC alone cannot successfully address this problem, because each is inefficient in either I/O transfer latency or I/O data locality utilization (discussed in Section 5). To reduce these limitations, we present a combined DCA with INIC solution. The solution not only takes advantage of both methods but also makes many improvements in memory system policies and software strategies.

3. Design Methodologies
In this section, we describe the proposed DCA combined with INIC solution and give a framework for the implementation. First, we present the improved DCA technology and discuss the key points of incorporating it into I/O and memory system design. Then, the important software data structures and the details of the DCA scheme are given. Finally, we introduce the system interconnection architecture and the integration of the NIC.

3.1 Improved DCA
To reduce data transfer latency and memory traffic in the system, we present an improved Direct Cache Access solution. Unlike the conventional DCA scheme, our solution carefully considers the following three points.

The first one is cache coherence. Conventionally, data sent from a device by DMA is stored in memory only, and for the same address a different copy of the data may be stored in the cache, which usually requires an additional coherence unit to perform snoop operations [11]. When DCA is used, I/O data and CPU data are both stored in the cache, with one copy per memory address, as shown in figure 4. Our solution therefore modifies the cache policy to eliminate the snooping operations. Coherence operations can be performed by software when needed. This greatly reduces memory traffic compared with systems that use coherence hardware [12].

[Figure: (a) cache coherence with conventional I/O: I/O write *(addr) = b, CPU write *(addr) = a; (b) cache coherence with DCA I/O: CPU write *(addr) = a, I/O write with DCA *(addr) = b, both going to the cache]
Figure 3. xxxx

The second one is cache pollution. DCA is a mixed blessing for the CPU: on one hand, it accelerates data transfer; on the other hand, it harms the locality of other programs executing on the CPU and causes cache pollution. Cache pollution depends heavily on the I/O data size, which is usually quite large. For example, one Ethernet packet contains at most 1492 bytes of normal payload, or at most 65536 bytes of payload with Large Segment Offload (LSO). This means that for a common network buffer (usually 50 to 400 packets in size), up to between 400KB and 16MB of data is sent to the cache. Such a large amount of data causes cache performance to drop dramatically. In this paper, we carefully investigate the relationship between the size of the I/O data sent by DCA and the size of the cache system. To achieve the best cache performance, a DCA scheme is also suggested in Section 4. Scheduling the data sent with DCA is an effective way to improve performance, but it is beyond the scope of this paper.

The third one is DCA policy. DCA policy refers to deciding when, and which part of, the data is transferred with DCA. The scheme is application specific and varies with different user targets. In this paper, we reserve a specific memory address space in the system to receive the data transferred with DCA. The addresses of the data should be remapped to that area by the user or the compiler.

3.2 DCA Scheme and details
To accelerate network processing, many important software structures used in the NIC driver and the stack are coupled with DCA. NIC descriptors and the associated data buffers receive special attention in our solution. The former are the data transfer interface between DMA and the CPU, and the latter contain the packets.
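Returning to the cache-pollution estimate in Section 3.1, the footprint arithmetic is easy to reproduce. The sketch below is illustrative only: the function names and the 4 MB cache size are assumptions, while the packet counts and payload sizes come from the text. As a plain arithmetic observation, 50 to 400 packets at 1492 bytes give roughly 75 KB to 597 KB, and the quoted 400KB to 16MB range is what a buffer of about 256 packets would produce at the two payload sizes.

```python
def buffer_footprint_bytes(num_packets: int, payload_bytes: int) -> int:
    """Bytes pushed into the cache if every buffered payload is sent with DCA."""
    return num_packets * payload_bytes


def fits_in_cache(footprint_bytes: int, cache_bytes: int, dca_share: float = 1.0) -> bool:
    """True if the footprint fits in the share of the cache granted to DCA data.

    dca_share is a hypothetical tuning knob, not a parameter from the paper.
    """
    return footprint_bytes <= cache_bytes * dca_share


if __name__ == "__main__":
    NORMAL_PAYLOAD = 1492        # bytes, from the text
    LSO_PAYLOAD = 65536          # bytes, from the text
    CACHE_BYTES = 4 * 1024 * 1024  # assumed 4 MB last-level cache

    for packets in (50, 256, 400):
        for payload in (NORMAL_PAYLOAD, LSO_PAYLOAD):
            size = buffer_footprint_bytes(packets, payload)
            print(packets, payload, size, fits_in_cache(size, CACHE_BYTES))
```

A check like this suggests why transferring only descriptors and headers with DCA can be attractive when the cache is small relative to the buffer.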
For further study, each packet stored in the buffer is divided into the header and the payload. Normally the headers are accessed frequently by the protocols, but the payload is accessed only once or twice (usually as a memcpy) in a modern network stack and OS. The details of the related software data structures and the network processing flow can be found in previous work [13].

The process of transferring one packet from the NIC to the stack with the proposed solution is illustrated in Table 1. All the access latency parameters in Table 1 are based on a state-of-the-art multi-core processor system [3]. Note that the cache access latency from I/O is nearly the same as that from the CPU, but the memory access latency from I/O is about 2/3 of that from the CPU because of the complex hardware hierarchy above the main memory.

[Table 1: content not recoverable]

We can see that the DCA with INIC solution saves more than 95% of CPU cycles in theory and avoids all traffic to the memory controller. In this paper, we transfer the NIC descriptors and the data buffers, including the headers and payload, with DCA to achieve the best performance. When the cache size is small, however, transferring only the descriptors and the headers with DCA is an alternative.

DCA performance depends heavily on the system cache policy. For the cache system, a write-back with write-allocate policy helps DCA achieve better performance than a write-through with no-write-allocate policy. Based on the analysis in Section 3.1, we do not use snooping cache technology to maintain coherence with memory. Cache coherence for other non-DCA I/O data transfers is guaranteed by software.

3.3 On-chip network and integrated NIC
7. REFERENCES
[1] R. Huggahalli, R. Iyer, S. Tetrick, "Direct Cache Access for High Bandwidth Network I/O", ISCA, 2005.
[2] D. Tang, Y. Bao, W. Hu et al., "DMA Cache: Using On-chip Storage to Architecturally Separate I/O Data from CPU Data for Improving I/O Performance", HPCA, 2010.
[3] Guangdeng Liao, Xia Zhu, Laxmi Bhuyan, "A New Server I/O Architecture for High Speed Networks", HPCA, 2011.
[4] E. A. León, K. B. Ferreira, A. B. Maccabe, "Reducing the Impact of the Memory Wall for I/O Using Cache Injection", 15th IEEE Symposium on High-Performance Interconnects (HOTI'07), Aug. 2007.
[5] A. Kumar, R. Huggahalli, S. Makineni, "Characterization of Direct Cache Access on Multi-core Systems and 10GbE", HPCA, 2009.
[6] Sun Niagara 2, /processors/niagara/index.jsp
[7] PowerPC
[8] Guangdeng Liao, L. Bhuyan, "Performance Measurement of an Integrated NIC Architecture with 10GbE", 17th IEEE Symposium on High Performance Interconnects, 2009.
[9] A. Foong et al., "TCP Performance Re-visited", IEEE Int'l Symp. on Performance Analysis of Software and Systems, Mar. 2003.
[10] D. Clark, V. Jacobson, J. Romkey, H. Saalwen, "An Analysis of TCP Processing Overhead", IEEE Communications, June 1989.
[11] J. Doweck, "Inside Intel Core microarchitecture and smart memory access", Intel White Paper, 2006.
[12] Amit Kumar, Ram Huggahalli, "Impact of Cache Coherence Protocols on the Processing of Network Traffic".
[13] Wenji Wu, Matt Crawford, "Potential performance bottleneck in Linux TCP", International Journal of Communication Systems, Vol. 20, Issue 11, pp. 1263-1283, November 2007.
[14] Weiwu Hu, Jian Wang, Xiang Gao, et al., "Godson-3: a scalable multicore RISC processor with x86 emulation", IEEE Micro, 29(2), pp. 17-29, 2009.
[15] Cadence Incisive Xtreme Series. /products/sd/ xtreme_series.
[16] Synopsys GMAC IP. /dw/dwtb.php?a=ethernet_mac
[17] ler, P. M. Watts, A. W. Moore, "Motivating Future Interconnects: A Differential Measurement Analysis of PCI Latency", ANCS, 2009.
[18] Nathan L. Binkert, Ali G. Saidi, Steven K. Reinhardt, "Integrated Network Interfaces for High-Bandwidth TCP/IP", Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.
[19] G. Liao, L. Bhuyan, "Performance Measurement of an Integrated NIC Architecture with 10GbE", HotI, 2009.
[20] Intel Server Network I/O Acceleration. /technology/comms/perfnet/download/ServerNetworkIOAccel.pdf
Communicated by (Name of Editor)

Parallel Processing Letters, © World Scientific Publishing Company

AN EFFICIENT IMPLEMENTATION OF THE BSP PROGRAMMING LIBRARY FOR VIA

YANGSUK KEE and SOONHOI HA∗
School of Electrical Engineering and Computer Science, Seoul National University
Seoul, 151-742, Korea

∗ Correspondence Address: School of Electrical Engineering and Computer Science, Seoul National University, Shinlim-Dong, Kwanak-Gu, Seoul, 151-742, Korea. Tel. 82-2-880-7292. Fax. 82-2-879-1532. Email: {enigma,sha}@iris.snu.ac.kr.

ABSTRACT
Virtual Interface Architecture (VIA) is a light-weight protocol for protected user-level zero-copy communication. In spite of the promised high performance of VIA, previous MPI implementations for GigaNet's cLAN revealed low communication performance. Two main sources of such low performance are the discrepancy in the communication model between MPI and VIA and the multi-threading overhead. In this paper, we propose a new implementation of the Bulk Synchronous Parallel (BSP) programming library for VIA called xBSP to overcome such problems. To the best of our knowledge, xBSP is the first implementation of the BSP library for VIA. xBSP demonstrates that the selection of a proper library is important to exploit the features of light-weight protocols. Intensive use of Remote Direct Memory Access (RDMA) operations leads to high performance close to the native VIA performance with respect to round trip delay and bandwidth. Considering the effects of multi-threading, memory registration, and completion policy on performance, we could obtain an efficient BSP implementation for cLAN, which was confirmed by experimental results.

Keywords: Bulk Synchronous Parallel, Virtual Interface Architecture, parallel programming library, light-weight protocol, cluster

1. Introduction
Even though the peak bandwidth of networks has increased rapidly over the years, the latency experienced by applications using these networks has decreased only modestly. The main reason for this disappointing performance is the high software overhead [1,2,3], which mainly results from context switches and data copies between the user and the kernel spaces. To overcome these problems, many light-weight protocols have been proposed to move the protocol stacks from the kernel to the user space [4,5,6,7,8,9,10].

One of these protocols is Virtual Interface Architecture (VIA) [6], which was jointly proposed by Intel, Compaq, and Microsoft. The VIA specifications describe a network architecture for protected user-level zero-copy communication. For application developers, VIA provides an interface called the Virtual Interface Provider Layer (VIPL).

Even though the VIPL can be directly used to develop applications, it is desirable to build various popular programming libraries such as PVM [11], MPI [12], and BSPlib [13] for portability of the programs. Two previous works, for example, are the MPI implementations for cLAN by MPI Software Technology (MPI/Pro) [14] and by Rice University [15]. Parallel programming libraries based on other communication protocols can be found in [16,17,18]. The authors of [14] described many implementation issues such as threading, long messages, asynchronous incoming messages, etc. In particular, they paid attention to the pre-posting constraint of VIA in implementing asynchronous operations of MPI. The zero-copy strategy of VIA enforces that the receiver is ready before the sender initiates its operation, which defines the pre-posting constraint. The results of these studies, however, are somewhat disappointing. Even though the half round trip time (RTT) of cLAN using VIPL is 8.21 µs in our system, that of MPI/Pro is more than five times longer. Furthermore, MPI/Pro achieved only 81.7 percent of the peak bandwidth of VIPL.
This means that the MPI library could not be efficiently integrated with VIA.

There are two main causes for such low performance. The primary one is the discrepancy in the communication model between MPI and VIA. VIA does not assume any intermediate buffers due to the zero-copy policy, while various asynchronous operations of MPI require receiving queues. Therefore, the authors suggested the use of "unexpected queues" on the receiver side to handle asynchronous incoming messages. As a result, the implementation incurs more than one copy on the receiver side and requires flow control for the queue. Moreover, they did not use the Remote Direct Memory Access (RDMA) operation for small messages, because only large messages can amortize the overhead of exchanging the addresses of RDMA buffers. The second cause is the overhead due to multi-threading. Although delegating the message handling task to a thread separate from the computation thread seems a good way to structure the implementation, it suffers significant overhead due to thread switching. The overhead due to multi-threading in our system is over ten microseconds: this is indeed comparable to the round trip delay at the application level. This means that the multi-threading overhead negates the gain obtained by reducing the latency at the hardware level.

These two problems motivate us to implement another VIA-based parallel library. In this paper, we implement the BSPlib standard of the Bulk Synchronous Parallel (BSP) programming library. The BSP model [19] was first proposed as a computing model to bridge the gap between software and hardware for parallel processing. Afterwards, it became a viable programming model with BSPlib. The performance of the BSPlib library was shown to be better than MPICH with respect to throughput and predictability [20], which means that BSPlib is not only theoretically but also practically useful. Moreover, the study on BSP clusters [21] has demonstrated that the BSPlib library can be accelerated by
rewriting the Fast Ethernet device driver to be optimized for the BSPlib operations. One of the main lessons of the study was that optimization with global knowledge about the transport layer and the parallel library promises higher performance. This perspective is also applicable to implementing parallel libraries using light-weight protocols. Indeed, BSPlib has a strong operational resemblance with VIA in memory registration, message passing communication, and direct remote memory access.

Our new implementation of BSPlib for cLAN is called express BSP (xBSP). To the best of our knowledge, xBSP is the first implementation of BSPlib for VIA. xBSP demonstrates that selecting a proper library is important in exploiting the features of light-weight protocols. Furthermore, we achieved performance close to the native VIPL by significant efforts to reduce the overheads due to multi-threading, memory registration, and flow-control. xBSP also supports reliable communication by using the reliable delivery mode of VIA.

In the following two sections, we address key features of VIA required to implement the BSPlib library and discuss how well the library is matched with VIA. After that, we present experimental implementation alternatives to achieve the full performance of VIA. In sections 4 and 5, several benchmarks demonstrate the efficiency of xBSP, and we conclude our discussion in section 6.

2. VIA Features

In this section, we discuss VIA features that should be carefully considered for efficient implementation of BSPlib. They concern memory registration, communication mode, and descriptor processing.

2.1. Memory Registration

Table 1. Costs of memory registration and copying (µs)

message length (byte)    1     1K    2K    4K    8K    16K
registration             3     3     3     4     4     5
copying                  1     2     2     10    18    35

Communication buffers of the user space should be registered in order to eliminate data copying between the user space and the kernel space and to provide memory protection. The memory registration
cost, however, is not negligible. For example, the Windows NT system experienced over 15 µs latency for messages smaller than 16 Kbytes [15], while the overhead in our Linux system ranged from 3 to 5 µs as shown in Table 1. Considering communication delay and copying overhead, it is important to reduce the registration overhead, especially for small messages.

2.2. Communication Mode

After communication buffers are registered, processes can transfer data between the registered buffers. VIA supports two communication modes. One is the traditional message passing mode in which both the sender and the receiver participate in communication, satisfying the pre-posting constraint. The other is the one-sided communication called RDMA, which is an extension of the local DMA operations that allows a user process to transparently access buffers of a user process on another machine connected to the VIA network.

Fig. 1. Procedure of RDMA write operation

The procedure of the RDMA write operation is illustrated in Fig.
1. First, both processes register their buffers with their VIA device drivers, and process B informs process A of the address of its buffer by explicit message passing to avoid a protection violation. After that, process A initiates its operation by posting descriptors, and the device driver moves data from the user buffer to the network through DMA. When packets arrive at the target machine, the device driver of the target machine moves the data in the reverse direction, from the network to the user buffer.

This RDMA operation has several advantages. First, the RDMA operation can avoid the descriptor processing overhead in the target process since it does not require any descriptor in the target process except when the initiator uses the immediate data field of the descriptor. Second, since only the VI-NIC of the target machine is involved in communication, the target process can continue without interruption. Finally, the initiator does not have to worry about flow control for the resources of the target machine. Therefore, we prefer the RDMA mode to the message passing mode.

2.3. Descriptor Processing Mode

When there are multiple VI connections to a process, mechanisms like select() in the socket interface are needed. We can implement such mechanisms using the Completion Queue. Notifications of descriptor completion from multiple Receive Queues are directed to a single Completion Queue. The Completion Queue can be managed by a dedicated communication thread or by the user thread itself. When a thread is dedicated to managing the Completion Queue, it prevents the interruption of user threads in a clustered SMP environment. However, this introduces extra latency of thread switching. On the other hand, when the user thread receives messages directly, it avoids this multi-threading overhead at the expense of CPU time. Since we aim at low latency communication, the user thread itself manages the Completion Queue.

3. BSPlib Implementation

Based on the previous
discussion, we explain in this section how well the BSPlib library is matched with VIA and how the library is realized.

3.1. BSP-Registration

In a BSP program, a user can access data in a remote memory after registering a memory block with bsp_push_reg(void *ident, int nbytes). The registrations within a superstep take effect after the subsequent barrier synchronization identified by bsp_sync().

In the Oxford implementation [13], each node keeps track of the sequence of registrations and maintains a mapping table between the unique block number and the associated local address: it does not require any explicit message exchange. When a process initiates a one-sided operation with this block number, the target process translates the number into its local address for the block. The main objective of this mechanism is to reduce unnecessary network traffic in the registration step. This low-cost dynamic registration is beneficial to implementing user-level libraries and applications with recursion.

Since the registration typically appears at the beginning of a program and rarely afterwards, it may be preferable to speed up ordinary communication operations at the expense of the registration. As discussed in section 2.2, the initiator of RDMA operations should know the address of the remote buffer. In xBSP, each node registers its local buffer with the VI-NIC in bsp_push_reg() and exchanges the address in the barrier synchronization step. At the end of the synchronization, each node builds a mapping table between the local address and the corresponding remote addresses. Since each node knows the actual address of the global memory block, it can transfer data to the remote buffer directly using the RDMA operation, unlike the Oxford implementation.

3.2. One-Sided Operation

A process can initiate a one-sided operation on the registered memory block.
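As an illustration only, the address exchange of section 3.1 and the address lookup that a one-sided operation relies on can be sketched in Python as follows. This is a toy simulation under stated assumptions, not xBSP code: the names Node, push_reg, barrier_sync and rdma_target are hypothetical, addresses are plain integers, and the exchange at the barrier is modeled by a loop over in-process objects. It also assumes that all processes register their blocks in the same order.

```python
class Node:
    """One simulated BSP process (hypothetical helper, not part of xBSP)."""

    def __init__(self, pid):
        self.pid = pid
        self.pending = []   # local addresses registered in the current superstep
        self.table = {}     # local address -> remote addresses, indexed by pid

    def push_reg(self, local_addr):
        # Stand-in for bsp_push_reg(): the registration only takes
        # effect at the next barrier synchronization.
        self.pending.append(local_addr)


def barrier_sync(nodes):
    """Stand-in for bsp_sync(): exchange the addresses registered in this superstep."""
    counts = {len(n.pending) for n in nodes}
    if len(counts) != 1:
        raise ValueError("all processes must register the same number of blocks")
    for k in range(counts.pop()):
        addrs = [n.pending[k] for n in nodes]   # models the address exchange
        for n in nodes:
            n.table[n.pending[k]] = list(addrs)
    for n in nodes:
        n.pending.clear()


def rdma_target(node, pid, dst, offset):
    """Remote address an initiator would write to for (pid, dst, offset)."""
    return node.table[dst][pid] + offset


nodes = [Node(0), Node(1)]
nodes[0].push_reg(0x1000)
nodes[1].push_reg(0x8000)
barrier_sync(nodes)
print(hex(rdma_target(nodes[0], 1, 0x1000, 16)))  # 0x8010
```

After the barrier, the initiator resolves the remote address locally, which is why no translation step is needed on the target side in this sketch.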
For example, bsp_hpput(int pid, void *src, void *dst, int offset, int nbytes) writes nbytes of data from the src buffer to the dst+offset address on node pid; the written data is valid in the next superstep. The bsp_hpput() function maps exactly onto the RDMA write operation. As the initiator has the address information of the dst buffer after the registration step, it can transfer data to the dst buffer directly. The target process does not have to consider flow control, descriptor posting, or incoming message handling. Furthermore, it is free from multi-threading overhead. Consequently, the bsp_hpput() function achieves delay and bandwidth close to those of VIPL.

One problem related to the RDMA write operation is how the target process learns of the arrival of a message. There are two possible solutions to this problem. One is to enforce the use of a descriptor notifying the end of a message (EOM). An RDMA write operation consumes a descriptor in the Receive Queue only when there is immediate data in the source descriptor. Hence, we can use this feature to mark the end of an RDMA message. When a message consists of n packets, the sender transfers n-1 packets and finishes the nth packet transfer with the EOM tag, while the receiver checks whether a descriptor is consumed and the returned value is EOM. This approach requires one descriptor per message, while the traditional message passing requires n descriptors.

The other approach is to send an additional control message to mark the end of a message. Even though this approach has more overhead than the first, it is preferable in the case of BSPlib. As the messages transferred in a superstep become available in the next superstep, there is no need to handle incoming messages immediately. Since cLAN supports reliable in-order delivery, the arrival of a packet implies the successful arrival of the preceding packets. Therefore, a series of EOM control messages in a superstep can be replaced by the last
EOM control message and the EOM message can be piggybacked on the barrier synchronization packet. In effect, barrier synchronization can be used implicitly, in place of EOM control messages, to mark the end of transfers.

3.3. Other Issues

The accumulated start-up costs of communication are significant if small messages are outstanding on the network. This problem has already been discussed in other studies [20,22] and can be overcome with the combining scheme. xBSP also combines small messages into a temporary buffer since the copying overhead of small messages is smaller than the memory registration cost of VIA. This combining method increases the communication bandwidth at the cost of a slightly longer round trip time.

Besides, reordering messages helps to avoid serialization of message delivery [22], and we use a latin square indexing order to schedule the destination of messages. A latin square is a p x p square in which all rows and columns are permutations of integers 1 to p. In comparison, naive ordering distributes messages by the fixed index order as implied in the code, like for(j=0;j<p;j++). As presented in Table 2, the reordering affects the performance for large messages; the speed-up factor increases with the message size. This result indicates that poor destination scheduling can decrease performance of total exchange significantly.

Table 2. Total exchange time with eight nodes for cLAN (µs)

message length (byte)    latin square    naive ordering    factor
8K                       1572            2358              1.5
16K                      2719            4871              1.8
32K                      4930            10308             2.1
64K                      9340            21535             2.3
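The difference between the two destination orderings can be sketched in Python as below. This is a minimal illustration, not the xBSP scheduler: the paper does not say which latin square it uses, so the cyclic square (i + k) mod p is an assumption, processes are numbered 0 to p-1 rather than 1 to p, and the helper names are hypothetical. The sketch also assumes that all processes advance through their destination lists in lock-step rounds.

```python
def latin_square_schedule(p):
    """Row i lists the destinations of process i, one per communication round.

    Entry [i][k] = (i + k) % p, a cyclic latin square over 0..p-1.
    """
    return [[(i + k) % p for k in range(p)] for i in range(p)]


def naive_schedule(p):
    """Every process walks destinations in the fixed order 0, 1, ..., p-1."""
    return [list(range(p)) for _ in range(p)]


def worst_fan_in(schedule):
    """Largest number of senders that target one destination in the same round."""
    p = len(schedule)
    worst = 0
    for k in range(p):
        hits = [0] * p
        for i in range(p):
            hits[schedule[i][k]] += 1
        worst = max(worst, max(hits))
    return worst


p = 8
print(worst_fan_in(latin_square_schedule(p)))  # 1
print(worst_fan_in(naive_schedule(p)))         # 8
```

With the latin square, each destination receives from exactly one sender per round, whereas the naive order points all p senders at the same destination, which is one way to picture the serialization mentioned above.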
4. Micro-benchmark Experiments

In this section, we demonstrate that BSPlib can be efficiently implemented on VIA through experimental results with two micro-benchmarks: half round trip time and bandwidth. Our Linux cluster consists of eight nodes connected by an 8-port cLAN switch. Each node has dual Pentium III 550 MHz processors with 256 Mbytes of SDRAM and runs the SMP version of Redhat Linux 6.2.

4.1. Preliminary Experiments

We tested a few implementation alternatives to achieve the full performance of VIA and observed the effects of completion policies and threading on the round trip delay.

Fig. 2. Effects of threading and completion policy

With polling, each process repeatedly checks whether the transaction is completed, while with blocking it waits for the completion of the transaction. Meanwhile, in the threaded version, a communication thread is dedicated to receiving incoming messages while a user thread continues its computation. Fig. 2 shows that the single threaded version using polling achieves a significant reduction of delay. However, it is wasteful to dedicate all of the CPU resources to polling, especially in the case of long message transfers. A tradeoff can be made by mixing both schemes: xBSP polls for a certain number of iterations, anticipating the completion of short message transfers, and eventually blocks. Based on these experiments, we chose the single threaded version using the mixed policy.

4.2. Half Round Trip Time and Bandwidth

With micro-benchmarks, we measured the half RTT and the bandwidth. To measure the half RTT, two processes send equal amounts of data back and forth repeatedly. We vary the message size from 4 bytes to 64 Kbytes and take the average value over 1000 executions. The bandwidth is computed by measuring the latency of transferring 1 Mbyte of data while varying the message size. The baseline is the performance of xBSP using the traditional message passing with a single thread. We change the communication mode of VIA from message passing to
RDMA and compare xBSP to VIPL and MPI/Pro. These benchmarks use the following configurations:

• VIPL-MP: VIPL using message passing (polling)
• VIPL-RDMA: VIPL using RDMA (polling)
• xBSP-MP: xBSP using message passing (mixed)
• xBSP-RDMA: xBSP using RDMA (mixed)
• MPI/Pro: MPI of MPI Software Technology

Fig. 3. Half round trip time (with combining overhead)

Fig. 4. Bandwidth (without combining advantage)

Fig. 3 and Fig. 4 show the experimental results of the round trip delay and bandwidth for various configurations. For comparison, the results of MPI/Pro [14] are also presented. The VIPL versions reveal minimum application level latency since they do not include any supplementary jobs for communication like registration and use the polling mechanism for the completion policy.

Comparing the two VIPL versions, we can estimate the overhead due to the pre-posting constraint, which includes descriptor posting and flow control. Even though the performance gap is not significant, the RDMA version consistently outperforms the MP version, and the experiments with xBSP also show similar results.

According to Fig. 3, xBSP-RDMA is two times slower than VIPL-RDMA with 4-byte packets. The extra latency of xBSP-RDMA mainly results from the copying overhead of the message combining and the blocking overhead of the mixed completion policy. In contrast, MPI/Pro is 8.8 times slower than VIPL-RDMA: on average, xBSP shows at least a twofold lower latency than MPI/Pro in the case of small messages. In terms of the peak bandwidth, xBSP-RDMA achieves about 94% of the VIPL bandwidth while MPI/Pro achieved only 82%. Consequently, these results demonstrate that xBSP exploits VIA features more effectively than MPI/Pro.

5. Benchmark Experiments

Even though micro-benchmarks can be used for measuring the basic link properties, high performance in micro-benchmarks does not ensure the same performance benefit in real applications. To rigorously evaluate the performance, we measure the BSP cost
parameters and then the execution times of several real applications.

5.1. BSP Cost Model

The BSP model simplifies a parallel machine by three components, a set of processors, an interconnection network, and a barrier synchronizer, which are parameterized as {p, g, l}. Parameter p represents the number of processors in the cluster, parameter g, the gap between continuous message sending operations, and parameter l, the barrier synchronization latency. A BSP program consists of a sequence of supersteps separated by barrier synchronizations. In every superstep, each process performs local computation or exchanges messages which are available in the next superstep. Hence, the execution time for superstep i is modeled by w_i + g·h_i + l, where w_i is the longest duration of local computation in the ith superstep and h_i is the largest amount of packets exchanged by a process during this superstep.

Table 3. BSP cost parameters, s (Mflop/s) = 121

       xBSP-RDMA                        xBSP-MP                          BSPlib-UDP/IP
       L (µs)        g (µs/word)        L (µs)        g (µs/word)        L (µs)        g (µs/word)
P      min    max    shift    total     min    max    shift    total     min    max    shift    total
2      17     23     0.077    0.103     19     52     0.086    0.110     136    320    0.40     0.42
4      37     50     0.077    0.086     42     71     0.092    0.102     271    441    0.37     0.40
6      73     89     0.079    0.083     80     112    0.109    0.115     406    687    0.46     0.49
8      109    123    0.079    0.084     108    145    0.105    0.110     433    764    0.48     0.53

In Table 3, the cost parameters of xBSP and the Oxford BSPlib implementation using UDP/IP for Fast Ethernet are compared. These parameters serve as a measure of the entire system under some non-trivial workload. The s parameter represents the instruction execution rate of each processor taken from the average execution time of matrix multiplication and dot products. The minimum L value is taken as the average latency of a long sequence of bsp_sync(), while the maximum value is taken as the average latency of a long sequence of the pair of bsp_hpput() and bsp_sync() with one word message. The g parameter is a measure of the global network bandwidth, not the point-to-point bandwidth: a smaller g value means higher global
bandwidth. With the shift communication pattern, each process sends data to its neighbor, and with total exchange it broadcasts.

xBSP-RDMA experiences much lower synchronization latency and higher bandwidth (shorter time interval) than the others. xBSP-RDMA achieves a constant global bandwidth of about 381 Mbps and xBSP-MP achieves about 291 Mbps, while the performance of BSPlib-UDP/IP decreases beyond four nodes: xBSP shows good scalability characteristics, and the RDMA operations are well matched with the BSPlib interfaces.

5.2. Applications

In this section, we compare the BSPlib libraries with the following two applications.

• ES: application to solve a grid problem with a 300x300 matrix [23]
• LU: application to solve a linear equation using LU decomposition [24]

Fig. 5. Execution time of ES with a 300 by 300 matrix

The execution time of the grid solver is presented in Fig. 5. The values above the bars represent the ratio of the sum of communication and synchronization times compared with xBSP-RDMA. In the grid solver program, each process exchanges data with its neighbors: the communication pattern is similar to the shift communication. Since ES spends most of its time (about 5.9 sec) in computation in the case of two nodes, the performance gap between xBSP-RDMA and xBSP-MP is not so great. In contrast, since the packet size transferred in a superstep is 2400 bytes, the latency reduction of xBSP-RDMA over BSPlib/UDP is about 6.2. Fig. 5 coincides with the expected result, where the sum of the global communication time and the synchronization time is reduced to about 18%. As the number of nodes increases, the portion of computation decreases and the communication and synchronization costs become significant. xBSP-RDMA always outperforms both xBSP-MP and BSPlib/UDP.

Fig. 6. LU decomposition on two by two nodes

Fig. 6 shows the execution time of LU decomposition measured on four processors while varying the size of the input matrix. In the LU
decomposition program, broadcast operations with small h-relations and barrier synchronizations are repeated. Therefore, it provides a good measure of the communication latency of a system. xBSP-RDMA shows about 1.4 times better communication performance than xBSP-MP, as expected from the cost model.

6. Conclusions

In this paper, we presented an efficient implementation of BSPlib for VIA called xBSP. xBSP demonstrates that BSPlib is more appropriate than MPI to exploit the features of VIA. Furthermore, we achieved application performance similar to the native performance of VIPL by reducing the overheads associated with multi-threading, memory registration, and flow control.

Even though we paid attention only to implementing BSPlib, there are many possibilities for improving performance by relaxing the BSPlib semantics. In particular, we should reduce barrier synchronization costs by adopting such mechanisms as relaxed barrier synchronization [25] and zero-cost synchronization [26].

Currently, we are building a programming environment based on xBSP-RDMA for heterogeneous cluster systems adopting a dynamic load balancing scheme.
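The cost model of section 5.1 can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the authors' measurement code: the function names are hypothetical, a word is assumed to be 32 bits (an assumption that reproduces the 381 Mbps and 291 Mbps figures quoted in section 5.1), and the g and L values are the eight-node entries as read from Table 3.

```python
def superstep_time(w, h, g, l):
    """BSP cost of one superstep, w + g*h + l, with all times in microseconds."""
    return w + g * h + l


def global_bandwidth_mbps(g, word_bits=32):
    """Convert g (microseconds per word) to Mbit/s.

    One bit per microsecond equals one Mbit/s; word_bits=32 is an assumption.
    """
    return word_bits / g


# g for total exchange on eight nodes, as read from Table 3.
print(round(global_bandwidth_mbps(0.084)))   # 381 (xBSP-RDMA)
print(round(global_bandwidth_mbps(0.110)))   # 291 (xBSP-MP)

# Hypothetical superstep: no computation, 600 words (2400 bytes) exchanged,
# using the eight-node total-exchange g and maximum L of each library.
print(superstep_time(0.0, 600, 0.084, 123.0))   # xBSP-RDMA parameters
print(superstep_time(0.0, 600, 0.53, 764.0))    # BSPlib-UDP/IP parameters
```

The last two lines only show how g and l enter the model; they are not a prediction of the measured application times reported above.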
Acknowledgements

This work was supported by National Research Laboratory program (number M1-0104-00-0015). The RIACT at Seoul National University provides research facilities for this study.

References

[1] R. Caceres, P.B. Danzig, S. Jamin, and D.J. Mitzel, Characteristics of Wide-Area TCP/IP Conversations, ACM SIGCOMM Computer Communication Review 21(4) (1991), 101–112.
[2] J. Kay and J. Pasquale, The Importance of Non-Data Touching Processing Overheads in TCP/IP, ACM SIGCOMM Computer Communication Review 23(4) (1993), 259–268.
[3] J. Kay and J. Pasquale, Profiling and Reducing Processing Overheads in TCP/IP, IEEE/ACM Trans. Networking 4(6) (1996), 817–828.
[4] R.A.F. Bhoedjang, T. Rühl, and H.E. Bal, User-Level Network Interface Protocols, IEEE Computer 31(11) (1998), 53–60.
[5] G. Chiola and G. Ciaccio, GAMMA: A Low Cost Network of Workstations Based on Active Message, In Proc. PDP'97, 1997.
[6] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. Marie Merritt, E. Gronke, and C. Dodd, The Virtual Interface Architecture, IEEE Micro 18(2) (1998), 66–76.
[7] S. Pakin, V. Karamcheti, and A.A. Chien, Fast Messages: Efficient, Portable Communication for Workstation Clusters and MPPs, IEEE Concurrency 5(2) (1997), 60–72.
[8] L. Prylli and B. Tourancheau, BIP: A New Protocol Designed for High Performance Networking on Myrinet, PC-NOW'98, Vol. 1388 of Lect. Notes in Comp. Science, April 1998, 472–485.
[9] T. von Eicken, A. Basu, V. Buch, and W. Vogels, U-Net: A User-Level Network Interface for Parallel and Distributed Computing, Operating Systems Review 29(5) (1995), 40–53.
[10] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser, Active Messages: A Mechanism for Integrated Communications and Computation, Proc. 19th Symp. Computer Architecture, May 1992, 256–266.
[11] V.S. Sunderam, PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice and Experience 2(4) (1990), 315–339.
[12] Message Passing Interface Forum, MPI: A Message Passing Interface Standard, Tech. Report Version 1.1, Univ. of Tennessee, Knoxville, Tenn., 1995.
[13] J.M.D. Hill, B. McColl, D.C. Stefanescu, M.W. Goudreau, K. Lang, S.B. Rao, T. Suel, T. Tsantilas, and R. Bisseling, BSPlib: The BSP Programming Library, Parallel Computing 24(14) (1998), 1947–1980.
[14] R. Dimitrov and A. Skjellum, An Efficient MPI Implementation for Virtual Interface (VI) Architecture-Enabled Cluster Computing, MPI Software Technology, Inc.
[15] E. Speight, H. Abdel-Shafi, and J.K. Bennett, Realizing the Performance Potential of the Virtual Interface Architecture, ICS'99, June 1999, 184–192.
[16] M. Lauria and A. Chien, MPI-FM: High Performance MPI on Workstation Clusters, Journal of Parallel and Distributed Computing 40(1) (1997), 4–18.
[17] L. Prylli, B. Tourancheau, and R. Westrelin, The Design for a High Performance MPI Implementation on the Myrinet Network, EuroPVM/MPI'99, Vol. 1697 of Lect. Notes in Comp. Science, September 1999, 223–230.
[18] J. Worringen and T. Bemmerl, MPICH for SCI-connected Clusters, SCI Europe'99, September 1999, 3–11.
[19] Leslie G. Valiant, A Bridging Model for Parallel Computation, Comm. ACM 33(8) (1990), 103–111.
[20] S.R. Donaldson, J.M.D. Hill, and D.B. Skillicorn, Predictable Communication on Unpredictable Networks: Implementing BSP over TCP/IP, Concurrency: Practice and Experience 11(11) (1999), 687–700.
[21] S.R. Donaldson, J.M.D. Hill, and D.B. Skillicorn, BSP Clusters: High Performance, Reliable, and Very Low Cost, Parallel Computing 26(2-3) (2000), 199–242.
[22] J.M.D. Hill and D.B. Skillicorn, Lessons Learned from Implementing BSP, Journal of Future Generation Computer Systems 13(4-5) (1998), 327–335.
[23] D.E. Culler and J. Pal Singh, Parallel Computer Architecture, Morgan Kaufmann Publishers, Inc., 1999, 92–116.
[24] R. Bisseling, BSPEDUpack, /implmnts/oxtool.htm.
[25] J.S. Kim, S. Ha, and C.S. Jhon, Relaxed Barrier Synchronization for the BSP Model of Computation on Message-Passing Architectures, Information Processing Letters 66(5) (1998), 247–253.
[26] O. Bonorden, B. Juurlink, I. von Otte, and I. Rieping, The Paderborn University BSP (PUB) Library - Design, Implementation and Performance, IPPS/SPDP'99, April 1999, 99–104.
AppH

Introduction
Interprocessor Communication: The Critical Performance Issue
Characteristics of Scientific Applications
Synchronization: Scaling Up
Performance of Scientific Applications on Shared-Memory Multiprocessors
Performance Measurement of Parallel Processors with Scientific Applications
Implementing Cache Coherence
The Custom Cluster Approach: Blue Gene/L
Concluding Remarks
H.2
Interprocessor Communication: The Critical Performance Issue
cessor node. By using a custom node design, Blue Gene achieves a significant reduction in the cost, physical size, and power consumption of a node. Blue Gene/L, a 64K-node version, is the world’s fastest computer in 2006, as measured by the linear algebra benchmark, Linpack.
University_Life_The Impact of the Cellphone on Interpersonal Communication_559991_27816202

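University
559991_27816202
Topic requirements:
The Impact of the Cellphone on Interpersonal Communication
In recent years, the popularity of mobile phones has given rise to a large number of "phubbers", people who keep their heads bowed over their phones. Whether in a crowded carriage, at a gathering with friends, or even on the road or on the stairs, there are always people looking down at their phones. What is your view of this phenomenon?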
To get to the bottom of the reason for this tendency, we can't ignore the network's contribution to this phenomenon, because a cellphone is smart and convenient when it is connected. What's more, a multitude of people rely heavily on the network and losing … Last but not least, the cellphone enables individuals to communicate with friends who are a thousand miles away when they use applications such as Wechat, QQ, Facebook and so on.
Intel Stratix 10 Devices L-Tile and H-Tile Transceiver Update Notice

Revision 1.0.0
ADV Issue Date: 08/30/2019
CUSTOMER ADVISORY
ADV1913

Description:

Intel® Network & Custom Logic Group (formerly Intel Programmable Solutions Group, Altera) is notifying customers of an important update to the Intel Stratix® 10 devices L-Tile and H-Tile transceivers.

It was recently determined that the Intel Quartus® Prime Settings File (QSF) assignment to preserve performance of unused transceiver channels does not work as intended in versions of Intel Quartus Prime Software prior to 18.1.1.

Customers implementing the QSF assignment to preserve performance of unused simplex transmit, simplex receive or duplex channels that will be used in the future need to migrate to Intel Quartus Prime Software version 18.1.1 or later.

Note: See Intel Stratix 10 L- and H- Tile Transceiver PHY User Guide for details on preserving performance of unused transceiver channels QSF:
https:///content/www/us/en/programmable/documentation/wry1479165 198810.html

Recommended Actions:

Table 1

Customer Design Status: Designs not in production
Recommended Actions:
• For unused simplex transmit or duplex channels driven by ATX PLL or fPLL, or unused simplex receive channels, upgrade to the Intel Quartus Prime Software version 18.1.1 or later and apply the preserve unused channel performance QSF assignment.
• For unused simplex transmit or duplex channels driven by CMU PLL, upgrade to the Intel Quartus Prime Software version 19.2 or later and apply the preserve unused channel performance QSF assignment.

Customer Design Status: Designs in production
Recommended Actions:
• If the unused channels are intended to run at data rates greater than 12.8 Gbps, or the design is in Intel Quartus Prime Software version 18.0.1 and has been in production for more than 2 years, contact Intel for support.

Implementing the QSF assignment increases power consumption per unused channel. The power increase is about 160 mW per unused channel for the L-Tile transceiver and about 180 mW per unused channel for the H-Tile transceiver.

• Use Intel Quartus Prime Software Power Analyzer to estimate the power consumption on the transceiver power supplies due to the implementation of the QSF assignment.
• If your FPGA design is partially complete, you can import the Early Power Estimator (EPE) file (<revision name>_early_pwr.csv) generated by the Intel Quartus Prime software into the EPE spreadsheet.

For questions or support, please contact your local Field Applications Engineer (FAE) or submit a service request at the My Intel support page.

Products Affected:

• Intel Stratix 10 GX and SX L-Tile devices
• Intel Stratix 10 GX and SX H-Tile devices
• Intel Stratix 10 MX Devices
• Intel Stratix 10 TX 2800 and TX 2500 devices

The list of affected part numbers (OPNs) can be downloaded in Excel form:
https:///content/dam/www/programmable/us/en/pdfs/literature/pcn/adv 1913-opn-list.xlsx

Reason for Change:

In the Intel Quartus Prime Software versions prior to 18.1.1, the QSF assignment does not work as intended, and the performance of the unused transceiver channels, when activated in the future, will degrade even with the QSF assignment implemented in the customer's design.

Change Implementation

Table 2

Milestone                                       Availability
Intel Quartus Prime software version 18.1.1     Now
Intel Quartus Prime software version 19.2       Now

Contact

For more information, please contact your local Field Applications Engineer (FAE) or submit a Service Request at the My Intel support page.

Customer Notifications Subscription

Customers that subscribe to Intel PSG's customer notification mailing list can receive the Customer Advisory automatically via email. If you would like to receive customer notifications by email, please subscribe to our customer notification mailing list at:
https:///content/www/us/en/programmable/my-intel/mal-emailsub/technical-updates.html

Revision History

Date          Rev      Description
08/30/2019    1.0.0    Initial Release

©2019 Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, Max, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other marks and brands may be claimed as the property of others. Intel reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.
Latest SCI Journal Catalog (6,606 SCI journals, journal titles

Abbreviated title | Full title
REV MED VIROL | REVIEWS IN MEDICAL VIROLOGY
AIDS | AIDS
J VIROL | JOURNAL OF VIROLOGY
ANTIVIR THER | ANTIVIRAL THERAPY
RETROVIROLOGY | Retrovirology
ADV VIRUS RES | ADVANCES IN VIRUS RESEARCH
VIROLOGY | VIROLOGY
ANTIVIR RES | ANTIVIRAL RESEARCH
J VIRAL HEPATITIS | JOURNAL OF VIRAL HEPATITIS
J CLIN VIROL | JOURNAL OF CLINICAL VIROLOGY
J GEN VIROL | JOURNAL OF GENERAL VIROLOGY
CURR HIV RES | CURRENT HIV RESEARCH
INT J MED MICROBIOL | INTERNATIONAL JOURNAL OF MEDICAL MICROBIOLOGY
MICROBES INFECT | MICROBES AND INFECTION
J MED VIROL | JOURNAL OF MEDICAL VIROLOGY
VIRUS RES | VIRUS RESEARCH
AIDS RES HUM RETROV | AIDS RESEARCH AND HUMAN RETROVIRUSES
J VIROL METHODS | JOURNAL OF VIROLOGICAL METHODS
J NEUROVIROL | JOURNAL OF NEUROVIROLOGY
VIRAL IMMUNOL | VIRAL IMMUNOLOGY
ARCH VIROL | ARCHIVES OF VIROLOGY
VIROL J | Virology Journal
INTERVIROLOGY | INTERVIROLOGY
VIRUS GENES | VIRUS GENES
ACTA VIROL | ACTA VIROLOGICA
FUTURE VIROL | Future Virology
S AFR J HIV MED | SOUTHERN AFRICAN JOURNAL OF HIV MEDICINE
AM J PATHOL | AMERICAN JOURNAL OF PATHOLOGY
ANNU REV PATHOL-MECH | Annual Review of Pathology-Mechanisms of Disease
J PATHOL | JOURNAL OF PATHOLOGY
BRAIN PATHOL | BRAIN PATHOLOGY
J NEUROPATH EXP NEUR | JOURNAL OF NEUROPATHOLOGY AND EXPERIMENTAL NEUROLOGY
LAB INVEST | LABORATORY INVESTIGATION
SPRINGER SEMIN IMMUN | SPRINGER SEMINARS IN IMMUNOPATHOLOGY
MODERN PATHOL | MODERN PATHOLOGY
AM J SURG PATHOL | AMERICAN JOURNAL OF SURGICAL PATHOLOGY
ACTA NEUROPATHOL | ACTA NEUROPATHOLOGICA
HISTOPATHOLOGY | HISTOPATHOLOGY
INT J IMMUNOPATH PH | INTERNATIONAL JOURNAL OF IMMUNOPATHOLOGY AND PHARMACOLO
CELL ONCOL | CELLULAR ONCOLOGY
J MOL DIAGN | JOURNAL OF MOLECULAR DIAGNOSTICS
NEUROPATH APPL NEURO | NEUROPATHOLOGY AND APPLIED NEUROBIOLOGY
HUM PATHOL | HUMAN PATHOLOGY
ADV ANAT PATHOL | ADVANCES IN ANATOMIC PATHOLOGY
EXPERT REV MOL DIAGN | EXPERT REVIEW OF MOLECULAR DIAGNOSTICS
SEMIN IMMUNOPATHOL | Seminars in Immunopathology
AM J CLIN PATHOL | AMERICAN JOURNAL OF CLINICAL PATHOLOGY
ALZ DIS ASSOC DIS | ALZHEIMER DISEASE & ASSOCIATED DISORDERS
INT J EXP PATHOL | INTERNATIONAL JOURNAL OF EXPERIMENTAL PATHOLOGY
J CLIN PATHOL | JOURNAL OF CLINICAL PATHOLOGY
CYTOM PART B-CLIN CY | CYTOMETRY PART B-CLINICAL CYTOMETRY
TISSUE ANTIGENS | TISSUE ANTIGENS
DIS MARKERS | DISEASE MARKERS
HISTOL HISTOPATHOL | HISTOLOGY AND HISTOPATHOLOGY
VIRCHOWS ARCH | VIRCHOWS ARCHIV-AN INTERNATIONAL JOURNAL OF PATHOLOGY
ARCH PATHOL LAB MED | ARCHIVES OF PATHOLOGY & LABORATORY MEDICINE
PATHOLOGY | PATHOLOGY
INT J GYNECOL PATHOL | INTERNATIONAL JOURNAL OF GYNECOLOGICAL PATHOLOGY
CARDIOVASC PATHOL | CARDIOVASCULAR PATHOLOGY
SEMIN DIAGN PATHOL | SEMINARS IN DIAGNOSTIC PATHOLOGY
TOXICOL PATHOL | TOXICOLOGIC PATHOLOGY
DIAGN MOL PATHOL | DIAGNOSTIC MOLECULAR PATHOLOGY
EXP MOL PATHOL | EXPERIMENTAL AND MOLECULAR PATHOLOGY
ENDOCR PATHOL | ENDOCRINE PATHOLOGY
J ORAL PATHOL MED | JOURNAL OF ORAL PATHOLOGY & MEDICINE
PATHOBIOLOGY | PATHOBIOLOGY
APMIS | APMIS
APPL IMMUNOHISTO M M | APPLIED IMMUNOHISTOCHEMISTRY & MOLECULAR MORPHOLOGY
J CUTAN PATHOL | JOURNAL OF CUTANEOUS PATHOLOGY
NEUROPATHOLOGY | NEUROPATHOLOGY
VET PATHOL | VETERINARY PATHOLOGY
J COMP PATHOL | JOURNAL OF COMPARATIVE PATHOLOGY
PATHOL INT | PATHOLOGY INTERNATIONAL
PATHOL ONCOL RES | PATHOLOGY & ONCOLOGY RESEARCH
CYTOPATHOLOGY | CYTOPATHOLOGY
PEDIATR DEVEL PATHOL | PEDIATRIC AND DEVELOPMENTAL PATHOLOGY
MED MOL MORPHOL | Medical Molecular Morphology
EXP TOXICOL PATHOL | EXPERIMENTAL AND TOXICOLOGIC PATHOLOGY
FOLIA NEUROPATHOL | FOLIA NEUROPATHOLOGICA
LEPROSY REV | LEPROSY REVIEW
INT J SURG PATHOL | INTERNATIONAL JOURNAL OF SURGICAL PATHOLOGY
PATHOL RES PRACT | PATHOLOGY RESEARCH AND PRACTICE
CLIN NEUROPATHOL | CLINICAL NEUROPATHOLOGY
DIAGN CYTOPATHOL | DIAGNOSTIC CYTOPATHOLOGY
PATHOL BIOL | PATHOLOGIE BIOLOGIE
ULTRASTRUCT PATHOL | ULTRASTRUCTURAL PATHOLOGY
ACTA CYTOL | ACTA CYTOLOGICA
AM J FOREN MED PATH | AMERICAN JOURNAL OF FORENSIC MEDICINE AND PATHOLOGY
SCI JUSTICE | SCIENCE & JUSTICE
PATHOLOGE | PATHOLOGE
BRAIN TUMOR PATHOL | Brain Tumor Pathology
ANN PATHOL | ANNALES DE PATHOLOGIE
FETAL PEDIATR PATHOL | Fetal and Pediatric Pathology
MOL MED REP | Molecular Medicine Reports
PROG CRYST GROWTH CH | PROGRESS IN CRYSTAL GROWTH AND CHARACTERIZATION OF MATE
POLYM TEST | POLYMER TESTING
EXP MECH | EXPERIMENTAL MECHANICS
NDT&E INT | NDT & E INTERNATIONAL
MATER CHARACT | MATERIALS CHARACTERIZATION
MECH ADV MATER STRUC | MECHANICS OF ADVANCED MATERIALS AND STRUCTURES
J NONDESTRUCT EVAL | JOURNAL OF NONDESTRUCTIVE EVALUATION
STRAIN | STRAIN
NANOSC MICROSC THERM | Nanoscale and Microscale Thermophysical Engineering
POWDER DIFFR | POWDER DIFFRACTION
MECH TIME-DEPEND MAT | MECHANICS OF TIME-DEPENDENT MATERIALS
PROG STRUCT ENG MAT | Progress in Structural Engineering and Materials
J STRAIN ANAL ENG | JOURNAL OF STRAIN ANALYSIS FOR ENGINEERING DESIGN
PART PART SYST CHAR | PARTICLE & PARTICLE SYSTEMS CHARACTERIZATION
RES NONDESTRUCT EVAL | RESEARCH IN NONDESTRUCTIVE EVALUATION
ENG FAIL ANAL | ENGINEERING FAILURE ANALYSIS
J SANDW STRUCT MATER | JOURNAL OF SANDWICH STRUCTURES & MATERIALS
INSIGHT | INSIGHT
ARCH MECH | ARCHIVES OF MECHANICS
POLYM POLYM COMPOS | POLYMERS & POLYMER COMPOSITES
MATER EVAL | MATERIALS EVALUATION
COMPUT CONCRETE | Computers and Concrete
EXP TECHNIQUES | EXPERIMENTAL TECHNIQUES
J TEST EVAL | JOURNAL OF TESTING AND EVALUATION
NONDESTRUCT TEST EVA | Nondestructive Testing and Evaluation
RUSS J NONDESTRUCT+ | RUSSIAN JOURNAL OF NONDESTRUCTIVE TESTING
MATERIALPRUFUNG | MATERIALPRUFUNG
MATER PERFORMANCE | MATERIALS PERFORMANCE
DYES PIGMENTS | DYES AND PIGMENTS
CELLULOSE | CELLULOSE
TEXT RES J | TEXTILE RESEARCH JOURNAL
COLOR TECHNOL | COLORATION TECHNOLOGY
FIBER POLYM | FIBERS AND POLYMERS
J VINYL ADDIT TECHN | JOURNAL OF VINYL & ADDITIVE TECHNOLOGY
WOOD FIBER SCI | WOOD AND FIBER SCIENCE
J AM LEATHER CHEM AS | JOURNAL OF THE AMERICAN LEATHER CHEMISTS ASSOCIATION
INT J CLOTH SCI TECH | International Journal of Clothing Science and Technolog
FIBRES TEXT EAST EUR | FIBRES & TEXTILES IN EASTERN EUROPE
J TEXT I | JOURNAL OF THE TEXTILE INSTITUTE
AATCC REV | AATCC REVIEW
J SOC LEATH TECH CH | JOURNAL OF THE SOCIETY OF LEATHER TECHNOLOGISTS AND CHE
FIBRE CHEM+ | FIBRE CHEMISTRY
SEN-I GAKKAISHI | SEN-I GAKKAISHI
TEKSTIL | TEKSTIL
COMPOS SCI TECHNOL | COMPOSITES SCIENCE AND TECHNOLOGY
COMPOS PART A-APPL S | COMPOSITES PART A-APPLIED SCIENCE AND MANUFACTURING
COMPOS PART B-ENG
COMPOSITES PART B-ENGINEERINGCOMPOS STRUCT COMPOSITE STRUCTURESPOLYM COMPOSITE POLYMER COMPOSITESCEMENT CONCRETE COMP CEMENT & CONCRETE COMPOSITESJ COMPOS MATER JOURNAL OF COMPOSITE MATERIALSMECH ADV MATER STRUC MECHANICS OF ADVANCED MATERIALS AND STRUCTURES COMPOS INTERFACE COMPOSITE INTERFACESJ COMPOS CONSTR JOURNAL OF COMPOSITES FOR CONSTRUCTIONAPPL COMPOS MATER APPLIED COMPOSITE MATERIALSJ THERMOPLAST COMPOS JOURNAL OF THERMOPLASTIC COMPOSITE MATERIALSJ REINF PLAST COMP JOURNAL OF REINFORCED PLASTICS AND COMPOSITESJ SANDW STRUCT MATER JOURNAL OF SANDWICH STRUCTURES & MATERIALSSTEEL COMPOS STRUCT STEEL AND COMPOSITE STRUCTURESPLAST RUBBER COMPOS PLASTICS RUBBER AND COMPOSITESPOLYM POLYM COMPOS POLYMERS & POLYMER COMPOSITESMECH COMPOS MATER MECHANICS OF COMPOSITE MATERIALSADV COMPOS MATER ADVANCED COMPOSITE MATERIALSADV COMPOS LETT ADVANCED COMPOSITES LETTERSSCI ENG COMPOS MATER SCIENCE AND ENGINEERING OF COMPOSITE MATERIALSJ AM CERAM SOC JOURNAL OF THE AMERICAN CERAMIC SOCIETYJ EUR CERAM SOC JOURNAL OF THE EUROPEAN CERAMIC SOCIETYINT J APPL CERAM TEC International Journal of Applied Ceramic TechnologyJ NON-CRYST SOLIDS JOURNAL OF NON-CRYSTALLINE SOLIDSCERAM INT CERAMICS INTERNATIONALJ SOL-GEL SCI TECHN JOURNAL OF SOL-GEL SCIENCE AND TECHNOLOGYJ CERAM SOC JPN JOURNAL OF THE CERAMIC SOCIETY OF JAPANJ ELECTROCERAM JOURNAL OF ELECTROCERAMICSADV APPL CERAM Advances in Applied CeramicsCERAM-SILIKATY CERAMICS-SILIKATYBOL SOC ESP CERAM V BOLETIN DE LA SOCIEDAD ESPANOLA DE CERAMICA Y VIDRIO PHYS CHEM GLASSES-B Physics and Chemistry of Glasses-European Journal of Gl GLASS TECHNOL-PART A GLASS TECHNOLOGYJ INORG MATER JOURNAL OF INORGANIC MATERIALSGLASS PHYS CHEM+GLASS PHYSICS AND CHEMISTRYSCI SINTER SCIENCE OF SINTERINGJ CERAM PROCESS RES JOURNAL OF CERAMIC PROCESSING RESEARCHAM CERAM SOC BULL AMERICAN CERAMIC SOCIETY BULLETINPOWDER METALL MET C+POWDER METALLURGY AND METAL CERAMICSGLASS CERAM+GLASS AND CERAMICSIND CERAM INDUSTRIAL CERAMICSMATER WORLD MATERIALS 
WORLDCFI-CERAM FORUM INT CFI-CERAMIC FORUM INTERNATIONALREFRACT IND CERAM+REFRACTORIES AND INDUSTRIAL CERAMICSJ ELECTROCHEM SOC JOURNAL OF THE ELECTROCHEMICAL SOCIETYTHIN SOLID FILMS THIN SOLID FILMSCHEM VAPOR DEPOS CHEMICAL VAPOR DEPOSITIONSURF COAT TECH SURFACE & COATINGS TECHNOLOGYPROG ORG COAT PROGRESS IN ORGANIC COATINGSAPPL SURF SCI APPLIED SURFACE SCIENCEJ VAC SCI TECHNOL A JOURNAL OF VACUUM SCIENCE & TECHNOLOGY A-VACUUM SURFACE J THERM SPRAY TECHN JOURNAL OF THERMAL SPRAY TECHNOLOGYJ COAT TECHNOL RES Journal of Coatings Technology and ResearchPIGM RESIN TECHNOL Pigment & Resin TechnologyNEW DIAM FRONT C TEC NEW DIAMOND AND FRONTIER CARBON TECHNOLOGYSURF ENG SURFACE ENGINEERINGCORROS REV CORROSION REVIEWSJ PLAST FILM SHEET JOURNAL OF PLASTIC FILM & SHEETINGT I MET FINISH TRANSACTIONS OF THE INSTITUTE OF METAL FINISHINGJCT COATINGSTECH JCT COATINGSTECHBIOMATERIALS BIOMATERIALSEUR CELLS MATER EUROPEAN CELLS & MATERIALSACTA BIOMATER Acta BiomaterialiaMACROMOL BIOSCI MACROMOLECULAR BIOSCIENCEDENT MATER DENTAL MATERIALSJ BIOMED MATER RES A JOURNAL OF BIOMEDICAL MATERIALS RESEARCH PART A COLLOID SURFACE B COLLOIDS AND SURFACES B-BIOINTERFACESJ BIOMED MATER RES B JOURNAL OF BIOMEDICAL MATERIALS RESEARCH PART B-APPLIED J BIOMAT SCI-POLYM E JOURNAL OF BIOMATERIALS SCIENCE-POLYMER EDITIONJ BIOACT COMPAT POL JOURNAL OF BIOACTIVE AND COMPATIBLE POLYMERSJ MATER SCI-MATER M JOURNAL OF MATERIALS SCIENCE-MATERIALS IN MEDICINEJ BIOMATER APPL JOURNAL OF BIOMATERIALS APPLICATIONSBIOMED MATER Biomedical MaterialsARTIF CELL BLOOD SUB ARTIFICIAL CELLS BLOOD SUBSTITUTES AND IMMOBILIZATION B DENT MATER J DENTAL MATERIALS JOURNALBIO-MED MATER ENG BIO-MEDICAL MATERIALS AND ENGINEERINGCELL POLYM CELLULAR POLYMERSJ BIOBASED MATER BIO Journal of Biobased Materials and BioenergyJ MECH BEHAV BIOMED Journal of the Mechanical Behavior of Biomedical Materi CELLULOSE CELLULOSEHOLZFORSCHUNG HOLZFORSCHUNGWOOD SCI TECHNOL WOOD SCIENCE AND TECHNOLOGYJ WOOD CHEM TECHNOL JOURNAL OF WOOD CHEMISTRY 
AND TECHNOLOGYJ PULP PAP SCI JOURNAL OF PULP AND PAPER SCIENCEJ WOOD SCI JOURNAL OF WOOD SCIENCEHOLZ ROH WERKST HOLZ ALS ROH-UND WERKSTOFFNORD PULP PAP RES J NORDIC PULP & PAPER RESEARCH JOURNALTAPPI J TAPPI JOURNALWOOD FIBER SCI WOOD AND FIBER SCIENCEFOREST PROD J FOREST PRODUCTS JOURNALAPPITA J APPITA JOURNALPULP PAP-CANADA PULP & PAPER-CANADAMOKUZAI GAKKAISHI MOKUZAI GAKKAISHIWOOD RES-SLOVAKIA WOOD RESEARCHCELL CHEM TECHNOL CELLULOSE CHEMISTRY AND TECHNOLOGYPAP PUU-PAP TIM PAPERI JA PUU-PAPER AND TIMBERWOCHENBL PAPIERFABR WOCHENBLATT FUR PAPIERFABRIKATIONNAT MATER NATURE MATERIALSNAT NANOTECHNOL Nature NanotechnologyPROG MATER SCI PROGRESS IN MATERIALS SCIENCEMAT SCI ENG R MATERIALS SCIENCE & ENGINEERING R-REPORTSMATER TODAY Materials TodayNANO LETT NANO LETTERSADV MATER ADVANCED MATERIALSANNU REV MATER RES ANNUAL REVIEW OF MATERIALS RESEARCHADV FUNCT MATER ADVANCED FUNCTIONAL MATERIALSSMALL SMALLACS NANO ACS NanoMRS BULL MRS BULLETINCHEM MATER CHEMISTRY OF MATERIALSCRIT REV SOLID STATE CRITICAL REVIEWS IN SOLID STATE AND MATERIALS SCIENCES SOFT MATTER Soft MatterJ MATER CHEM JOURNAL OF MATERIALS CHEMISTRYCRYST GROWTH DES CRYSTAL GROWTH & DESIGNCARBON CARBONINT J PLASTICITY INTERNATIONAL JOURNAL OF PLASTICITYINT MATER REV INTERNATIONAL MATERIALS REVIEWSACTA MATER ACTA MATERIALIAORG ELECTRON ORGANIC ELECTRONICSJ MECH PHYS SOLIDS JOURNAL OF THE MECHANICS AND PHYSICS OF SOLIDSJ PHYS CHEM C Journal of Physical Chemistry CNANOTECHNOLOGY NANOTECHNOLOGYGOLD BULL GOLD BULLETINPLASMONICS PlasmonicsMICROPOR MESOPOR MAT MICROPOROUS AND MESOPOROUS MATERIALSSCRIPTA MATER SCRIPTA MATERIALIACURR OPIN SOLID ST M CURRENT OPINION IN SOLID STATE & MATERIALS SCIENCE CURR NANOSCI Current NanoscienceSOL ENERG MAT SOL C SOLAR ENERGY MATERIALS AND SOLAR CELLSMICROSC MICROANAL MICROSCOPY AND MICROANALYSISJ NANOPART RES JOURNAL OF NANOPARTICLE RESEARCHMECH MATER MECHANICS OF MATERIALSJ MICROMECH MICROENG JOURNAL OF MICROMECHANICS AND MICROENGINEERINGPHYS STATUS SOLIDI-R Physica Status 
Solidi-Rapid Research LettersEUR PHYS J E EUROPEAN PHYSICAL JOURNAL EINTERMETALLICS INTERMETALLICSIEEE T NANOTECHNOL IEEE TRANSACTIONS ON NANOTECHNOLOGYELECTROCHEM SOLID ST ELECTROCHEMICAL AND SOLID STATE LETTERSJ NANOSCI NANOTECHNO JOURNAL OF NANOSCIENCE AND NANOTECHNOLOGYCORROS SCI CORROSION SCIENCEJ MATER RES JOURNAL OF MATERIALS RESEARCHPHOTONIC NANOSTRUCT Photonics and Nanostructures-Fundamentals and ApplicatiNANOSCALE RES LETT Nanoscale Research LettersDIAM RELAT MATER DIAMOND AND RELATED MATERIALSAPPL PHYS A-MATER APPLIED PHYSICS A-MATERIALS SCIENCE & PROCESSINGSOFT MATER SOFT MATERIALSSYNTHETIC MET SYNTHETIC METALSMATER CHEM PHYS MATERIALS CHEMISTRY AND PHYSICSTHIN SOLID FILMS THIN SOLID FILMSOPT MATER OPTICAL MATERIALSSEMICOND SCI TECH SEMICONDUCTOR SCIENCE AND TECHNOLOGYMACROMOL MATER ENG MACROMOLECULAR MATERIALS AND ENGINEERINGSMART MATER STRUCT SMART MATERIALS & STRUCTURESMAT SCI ENG A-STRUCT MATERIALS SCIENCE AND ENGINEERING A-STRUCTURAL MATERIAL MATER LETT MATERIALS LETTERSMATER RES BULL MATERIALS RESEARCH BULLETINMAT SCI ENG C-BIO S MATERIALS SCIENCE & ENGINEERING C-BIOMIMETIC AND SUPRAM INT J DAMAGE MECH INTERNATIONAL JOURNAL OF DAMAGE MECHANICSJ NUCL MATER JOURNAL OF NUCLEAR MATERIALSADV ENG MATER ADVANCED ENGINEERING MATERIALSCMC-COMPUT MATER CON CMC-Computers Materials & ContinuaPHYS CHEM MINER PHYSICS AND CHEMISTRY OF MINERALSMAT SCI ENG B-SOLID MATERIALS SCIENCE AND ENGINEERING B-SOLID STATE MATERIA PHILOS MAG PHILOSOPHICAL MAGAZINEJ ALLOY COMPD JOURNAL OF ALLOYS AND COMPOUNDSJ MAGN MAGN MATER JOURNAL OF MAGNETISM AND MAGNETIC MATERIALSJ INTEL MAT SYST STR JOURNAL OF INTELLIGENT MATERIAL SYSTEMS AND STRUCTURES J NON-CRYST SOLIDS JOURNAL OF NON-CRYSTALLINE SOLIDSJ ELECTRON MATER JOURNAL OF ELECTRONIC MATERIALSWEAR WEARINT J ADHES ADHES INTERNATIONAL JOURNAL OF ADHESION AND ADHESIVES METALL MATER TRANS A METALLURGICAL AND MATERIALS TRANSACTIONS A-PHYSICAL MET MODEL SIMUL MATER SC MODELLING AND SIMULATION IN MATERIALS SCIENCE AND ENGIN CURR APPL 
PHYS CURRENT APPLIED PHYSICSCOMP MATER SCI COMPUTATIONAL MATERIALS SCIENCECEMENT CONCRETE RES CEMENT AND CONCRETE RESEARCHIEEE T ADV PACKAGING IEEE TRANSACTIONS ON ADVANCED PACKAGINGINT J FATIGUE INTERNATIONAL JOURNAL OF FATIGUESCI TECHNOL ADV MAT SCIENCE AND TECHNOLOGY OF ADVANCED MATERIALSPHYS STATUS SOLIDI A PHYSICA STATUS SOLIDI A-APPLICATIONS AND MATERIALS SCIE MET MATER-INT METALS AND MATERIALS INTERNATIONALEXP MECH EXPERIMENTAL MECHANICSGRANUL MATTER GRANULAR MATTERJOM-US JOM-JOURNAL OF THE MINERALS METALS & MATERIALS SOCIETY J MICRO-NANOLITH MEM Journal of Micro-Nanolithography MEMS and MOEMSJ COMPUT THEOR NANOS Journal of Computational and Theoretical Nanoscience SCI TECHNOL WELD JOI SCIENCE AND TECHNOLOGY OF WELDING AND JOININGJ MATER SCI JOURNAL OF MATERIALS SCIENCEJ ENG MATER-T ASME JOURNAL OF ENGINEERING MATERIALS AND TECHNOLOGY-TRANSACINT J REFRACT MET H INTERNATIONAL JOURNAL OF REFRACTORY METALS & HARD MATER MATER DESIGN MATERIALS & DESIGNREV ADV MATER SCI REVIEWS ON ADVANCED MATERIALS SCIENCEJ MATER SCI-MATER EL JOURNAL OF MATERIALS SCIENCE-MATERIALS IN ELECTRONICS J EXP NANOSCI Journal of Experimental NanoscienceINT J NANOTECHNOL International Journal of NanotechnologyMAT SCI SEMICON PROC MATERIALS SCIENCE IN SEMICONDUCTOR PROCESSINGVACUUM VACUUMMICROSYST TECHNOL MICROSYSTEM TECHNOLOGIESJ ADHESION JOURNAL OF ADHESIONJ POROUS MAT JOURNAL OF POROUS MATERIALSRAPID PROTOTYPING J RAPID PROTOTYPING JOURNALMATER TRANS MATERIALS TRANSACTIONSIEEE T COMPON PACK T IEEE TRANSACTIONS ON COMPONENTS AND PACKAGING TECHNOLOG MECH ADV MATER STRUC MECHANICS OF ADVANCED MATERIALS AND STRUCTURESINT J NUMER ANAL MET INTERNATIONAL JOURNAL FOR NUMERICAL AND ANALYTICAL METH J ELASTICITY JOURNAL OF ELASTICITYJ ADHES SCI TECHNOL JOURNAL OF ADHESION SCIENCE AND TECHNOLOGYSOLDER SURF MT TECH SOLDERING & SURFACE MOUNT TECHNOLOGYJ MATER PROCESS TECH JOURNAL OF MATERIALS PROCESSING TECHNOLOGYJ OPTOELECTRON ADV M JOURNAL OF OPTOELECTRONICS AND ADVANCED MATERIALS FATIGUE FRACT ENG 
M FATIGUE & FRACTURE OF ENGINEERING MATERIALS & STRUCTURE J NEW MAT ELECTR SYS JOURNAL OF NEW MATERIALS FOR ELECTROCHEMICAL SYSTEMS METALL MATER TRANS B METALLURGICAL AND MATERIALS TRANSACTIONS B-PROCESS META FIRE MATER FIRE AND MATERIALSCONSTR BUILD MATER CONSTRUCTION AND BUILDING MATERIALSMATER SCI TECH-LOND MATERIALS SCIENCE AND TECHNOLOGYJ SOC INF DISPLAY Journal of the Society for Information DisplayMICRO NANO LETT Micro & Nano LettersACI STRUCT J ACI STRUCTURAL JOURNALCORROSION CORROSIONJ COMPUT-AIDED MATER JOURNAL OF COMPUTER-AIDED MATERIALS DESIGNJ NANOMATER Journal of NanomaterialsJ ENERG MATER Journal of Energetic MaterialsACI MATER J ACI MATERIALS JOURNALB MATER SCI BULLETIN OF MATERIALS SCIENCEMATER MANUF PROCESS MATERIALS AND MANUFACTURING PROCESSESJ FIRE SCI JOURNAL OF FIRE SCIENCESJ ELASTOM PLAST JOURNAL OF ELASTOMERS AND PLASTICSMATER STRUCT MATERIALS AND STRUCTURESMATER CORROS MATERIALS AND CORROSION-WERKSTOFFE UND KORROSIONFIRE SAFETY J FIRE SAFETY JOURNALJ MATER SCI TECHNOL JOURNAL OF MATERIALS SCIENCE & TECHNOLOGYMACH SCI TECHNOL MACHINING SCIENCE AND TECHNOLOGYMATH MECH SOLIDS MATHEMATICS AND MECHANICS OF SOLIDSFULLER NANOTUB CAR N FULLERENES NANOTUBES AND CARBON NANOSTRUCTURESMATER PLAST MATERIALE PLASTICEMICROELECTRON INT MICROELECTRONICS INTERNATIONALMATER CONSTRUCC MATERIALES DE CONSTRUCCIONNEW DIAM FRONT C TEC NEW DIAMOND AND FRONTIER CARBON TECHNOLOGYSENSOR MATER SENSORS AND MATERIALSMATER RES INNOV MATERIALS RESEARCH INNOVATIONSADV CEM RES ADVANCES IN CEMENT RESEARCHCOMBUST EXPLO SHOCK+COMBUSTION EXPLOSION AND SHOCK WAVESATOMIZATION SPRAY ATOMIZATION AND SPRAYSFERROELECTRICS FERROELECTRICSJ MATER CIVIL ENG JOURNAL OF MATERIALS IN CIVIL ENGINEERINGCORROS ENG SCI TECHN CORROSION ENGINEERING SCIENCE AND TECHNOLOGYJ PHASE EQUILIB DIFF JOURNAL OF PHASE EQUILIBRIA AND DIFFUSIONACTA MECH SOLIDA SIN ACTA MECHANICA SOLIDA SINICAINORG MATER+INORGANIC MATERIALSP I MECH ENG L-J MAT PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS J UNIV SCI 
TECHNOL B JOURNAL OF UNIVERSITY OF SCIENCE AND TECHNOLOGY BEIJING CIRCUIT WORLD CIRCUIT WORLDJ MATER ENG PERFORM JOURNAL OF MATERIALS ENGINEERING AND PERFORMANCESCI CHINA SER E SCIENCE IN CHINA SERIES E-TECHNOLOGICAL SCIENCES MATER SCI-POLAND MATERIALS SCIENCE-POLANDANN CHIM-SCI MAT ANNALES DE CHIMIE-SCIENCE DES MATERIAUXJ WUHAN UNIV TECHNOL JOURNAL OF WUHAN UNIVERSITY OF TECHNOLOGY-MATERIALS SCI MAG CONCRETE RES MAGAZINE OF CONCRETE RESEARCHHIGH TEMP MATER P-US HIGH TEMPERATURE MATERIAL PROCESSESMATERIALWISS WERKST MATERIALWISSENSCHAFT UND WERKSTOFFTECHNIKRARE METALS RARE METALSJ FIRE PROT ENG Journal of Fire Protection EngineeringMATER HIGH TEMP MATERIALS AT HIGH TEMPERATURESMICRO MICROJSME INT J A-SOLID M JSME INTERNATIONAL JOURNAL SERIES A-SOLID MECHANICS AND MATER TECHNOL MATERIALS TECHNOLOGYHIGH TEMP MAT PR-ISR HIGH TEMPERATURE MATERIALS AND PROCESSESINDIAN J ENG MATER S INDIAN JOURNAL OF ENGINEERING AND MATERIALS SCIENCES RARE METAL MAT ENG RARE METAL MATERIALS AND ENGINEERINGSAMPE J SAMPE JOURNALOPTOELECTRON ADV MAT Optoelectronics and Advanced Materials-Rapid Communicat FIRE TECHNOL FIRE TECHNOLOGYINT J MATER PROD TEC INTERNATIONAL JOURNAL OF MATERIALS & PRODUCT TECHNOLOGY FIBRE CHEM+FIBRE CHEMISTRYADV MATER PROCESS ADVANCED MATERIALS & PROCESSESLASER ENG LASERS IN ENGINEERINGMATER SCI+MATERIALS SCIENCEROAD MATER PAVEMENT Road Materials and Pavement DesignJ ADV MATER-COVINA JOURNAL OF ADVANCED MATERIALSMATER WORLD MATERIALS WORLDINFORM MIDEM INFORMACIJE MIDEM-JOURNAL OF MICROELECTRONICS ELECTRONI ZKG INT ZKG INTERNATIONALPLAST ENG PLASTICS ENGINEERINGMETALLOFIZ NOV TEKH+METALLOFIZIKA I NOVEISHIE TEKHNOLOGIIPARTICUOLOGY ParticuologyIEEE T MED IMAGING IEEE TRANSACTIONS ON MEDICAL IMAGINGREMOTE SENS ENVIRON REMOTE SENSING OF ENVIRONMENTISPRS J PHOTOGRAMM ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING PHOTOGRAMM ENG REM S PHOTOGRAMMETRIC ENGINEERING AND REMOTE SENSINGINT J REMOTE SENS INTERNATIONAL JOURNAL OF REMOTE SENSING PHOTOGRAMM REC PHOTOGRAMMETRIC 
RECORDJ VISUAL-JAPAN JOURNAL OF VISUALIZATIONJ IMAGING SCI TECHN JOURNAL OF IMAGING SCIENCE AND TECHNOLOGYJ ELECTRON IMAGING JOURNAL OF ELECTRONIC IMAGINGIMAGING SCI J IMAGING SCIENCE JOURNALLANCET INFECT DIS LANCET INFECTIOUS DISEASESCLIN INFECT DIS CLINICAL INFECTIOUS DISEASESEMERG INFECT DIS EMERGING INFECTIOUS DISEASESJ INFECT DIS JOURNAL OF INFECTIOUS DISEASESAIDS AIDSCURR OPIN INFECT DIS CURRENT OPINION IN INFECTIOUS DISEASESANTIVIR THER ANTIVIRAL THERAPYJAIDS-J ACQ IMM DEF JAIDS-JOURNAL OF ACQUIRED IMMUNE DEFICIENCY SYNDROMES J ANTIMICROB CHEMOTH JOURNAL OF ANTIMICROBIAL CHEMOTHERAPYINFECT IMMUN INFECTION AND IMMUNITYAIDS REV AIDS REVIEWSCLIN MICROBIOL INFEC CLINICAL MICROBIOLOGY AND INFECTIONJ VIRAL HEPATITIS JOURNAL OF VIRAL HEPATITISPEDIATR INFECT DIS J PEDIATRIC INFECTIOUS DISEASE JOURNALHIV MED HIV MEDICINECURR HIV RES CURRENT HIV RESEARCHSEX TRANSM INFECT SEXUALLY TRANSMITTED INFECTIONSSEX TRANSM DIS SEXUALLY TRANSMITTED DISEASESINFECT CONT HOSP EP INFECTION CONTROL AND HOSPITAL EPIDEMIOLOGYJ INFECTION JOURNAL OF INFECTIONJ HOSP INFECT JOURNAL OF HOSPITAL INFECTIONINFECT GENET EVOL INFECTION GENETICS AND EVOLUTIONINT J ANTIMICROB AG INTERNATIONAL JOURNAL OF ANTIMICROBIAL AGENTSEUR J CLIN MICROBIOL EUROPEAN JOURNAL OF CLINICAL MICROBIOLOGY & INFECTIOUS DIAGN MICR INFEC DIS DIAGNOSTIC MICROBIOLOGY AND INFECTIOUS DISEASEAM J INFECT CONTROL AMERICAN JOURNAL OF INFECTION CONTROLTRANSPL INFECT DIS Transplant Infectious DiseaseAIDS PATIENT CARE ST AIDS PATIENT CARE AND STDSINT J TUBERC LUNG D INTERNATIONAL JOURNAL OF TUBERCULOSIS AND LUNG DISEASE AIDS RES HUM RETROV AIDS RESEARCH AND HUMAN RETROVIRUSESINT J INFECT DIS INTERNATIONAL JOURNAL OF INFECTIOUS DISEASESBMC INFECT DIS BMC INFECTIOUS DISEASESVECTOR-BORNE ZOONOT VECTOR-BORNE AND ZOONOTIC DISEASESCLIN VACCINE IMMUNOL Clinical and Vaccine ImmunologyFEMS IMMUNOL MED MIC FEMS IMMUNOLOGY AND MEDICAL MICROBIOLOGYEPIDEMIOL INFECT EPIDEMIOLOGY AND INFECTIONINFECTION INFECTIONINFECT DIS CLIN N AM INFECTIOUS 
DISEASE CLINICS OF NORTH AMERICAINT J HYG ENVIR HEAL INTERNATIONAL JOURNAL OF HYGIENE AND ENVIRONMENTAL HEAL MICROB DRUG RESIST MICROBIAL DRUG RESISTANCE-MECHANISMS EPIDEMIOLOGY AND D HIV CLIN TRIALS HIV CLINICAL TRIALSSCAND J INFECT DIS SCANDINAVIAN JOURNAL OF INFECTIOUS DISEASES ZOONOSES PUBLIC HLTH Zoonoses and Public HealthENFERM INFEC MICR CL ENFERMEDADES INFECCIOSAS Y MICROBIOLOGIA CLINICAINT J STD AIDS INTERNATIONAL JOURNAL OF STD & AIDSLEPROSY REV LEPROSY REVIEWJPN J INFECT DIS JAPANESE JOURNAL OF INFECTIOUS DISEASESS AFR J HIV MED SOUTHERN AFRICAN JOURNAL OF HIV MEDICINEMED MALADIES INFECT MEDECINE ET MALADIES INFECTIEUSESINFECT MED INFECTIONS IN MEDICINETRANSBOUND EMERG DIS Transboundary and Emerging DiseasesREV GEOPHYS REVIEWS OF GEOPHYSICSEARTH PLANET SC LETT EARTH AND PLANETARY SCIENCE LETTERSGEOCHIM COSMOCHIM AC GEOCHIMICA ET COSMOCHIMICA ACTAJ PETROL JOURNAL OF PETROLOGYCONTRIB MINERAL PETR CONTRIBUTIONS TO MINERALOGY AND PETROLOGYCHEM GEOL CHEMICAL GEOLOGYLITHOS LITHOSTECTONICS TECTONICSREV MINERAL GEOCHEM REVIEWS IN MINERALOGY & GEOCHEMISTRYGEOCHEM GEOPHY GEOSY GEOCHEMISTRY GEOPHYSICS GEOSYSTEMSIEEE T GEOSCI REMOTE IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING METEORIT PLANET SCI METEORITICS & PLANETARY SCIENCEELEMENTS ElementsPHYS EARTH PLANET IN PHYSICS OF THE EARTH AND PLANETARY INTERIORSORG GEOCHEM ORGANIC GEOCHEMISTRYGEOPHYS J INT GEOPHYSICAL JOURNAL INTERNATIONALAM MINERAL AMERICAN MINERALOGISTSEISMOL RES LETT SEISMOLOGICAL RESEARCH LETTERSGEOCHEM T GEOCHEMICAL TRANSACTIONSB SEISMOL SOC AM BULLETIN OF THE SEISMOLOGICAL SOCIETY OF AMERICAQUAT GEOCHRONOL Quaternary GeochronologyAPPL GEOCHEM APPLIED GEOCHEMISTRYSURV GEOPHYS SURVEYS IN GEOPHYSICSTECTONOPHYSICS TECTONOPHYSICSMINER DEPOSITA MINERALIUM DEPOSITAAQUAT GEOCHEM AQUATIC GEOCHEMISTRYJ ATMOS SOL-TERR PHY JOURNAL OF ATMOSPHERIC AND SOLAR-TERRESTRIAL PHYSICSJ GEODYN JOURNAL OF GEODYNAMICSJ GEODESY JOURNAL OF GEODESYRADIOCARBON RADIOCARBONIEEE GEOSCI REMOTE S IEEE Geoscience and Remote 
Sensing LettersECON GEOL ECONOMIC GEOLOGYDYNAM ATMOS OCEANS DYNAMICS OF ATMOSPHERES AND OCEANSCHEM ERDE-GEOCHEM CHEMIE DER ERDE-GEOCHEMISTRYSPACE WEATHER SPACE WEATHER-THE INTERNATIONAL JOURNAL OF RESEARCH AND GEOFLUIDS GEOFLUIDSGEOPHYSICS GEOPHYSICSNONLINEAR PROC GEOPH NONLINEAR PROCESSES IN GEOPHYSICSMINER PETROL MINERALOGY AND PETROLOGYGEOPHYS ASTRO FLUID GEOPHYSICAL AND ASTROPHYSICAL FLUID DYNAMICSRADIO SCI RADIO SCIENCEPURE APPL GEOPHYS PURE AND APPLIED GEOPHYSICSNEAR SURF GEOPHYS Near Surface GeophysicsGEOPHYS PROSPECT GEOPHYSICAL PROSPECTINGJ SEISMOL JOURNAL OF SEISMOLOGYGEOCHEM J GEOCHEMICAL JOURNALJ GEOCHEM EXPLOR JOURNAL OF GEOCHEMICAL EXPLORATIONMAR GEOPHYS RES MARINE GEOPHYSICAL RESEARCHESJ GEOPHYS ENG Journal of Geophysics and EngineeringSTUD GEOPHYS GEOD STUDIA GEOPHYSICA ET GEODAETICAJ ENVIRON ENG GEOPH JOURNAL OF ENVIRONMENTAL AND ENGINEERING GEOPHYSICS CHINESE J GEOPHYS-CH CHINESE JOURNAL OF GEOPHYSICS-CHINESE EDITION GEOTECTONICS+GEOTECTONICSASTRON GEOPHYS ASTRONOMY & GEOPHYSICSANN GEOPHYS-ITALY ANNALS OF GEOPHYSICSLITHOL MINER RESOUR+LITHOLOGY AND MINERAL RESOURCESGEOCHEM INT+GEOCHEMISTRY INTERNATIONALIZV-PHYS SOLID EART+IZVESTIYA-PHYSICS OF THE SOLID EARTHACTA GEOPHYS Acta GeophysicaNUOVO CIMENTO C NUOVO CIMENTO DELLA SOCIETA ITALIANA DI FISICA C-GEOPHY PETROPHYSICS PetrophysicsGEOCHEM-EXPLOR ENV A GEOCHEMISTRY-EXPLORATION ENVIRONMENT ANALYSISJ SEISM EXPLOR JOURNAL OF SEISMIC EXPLORATIONJ EARTHQ TSUNAMI Journal of Earthquake and TsunamiANNU REV EARTH PL SC ANNUAL REVIEW OF EARTH AND PLANETARY SCIENCESEARTH-SCI REV EARTH-SCIENCE REVIEWSGLOBAL BIOGEOCHEM CY GLOBAL BIOGEOCHEMICAL CYCLESQUATERNARY SCI REV QUATERNARY SCIENCE REVIEWSGEOBIOLOGY GEOBIOLOGYPALEOCEANOGRAPHY PALEOCEANOGRAPHYPRECAMBRIAN RES PRECAMBRIAN RESEARCHGEOL SOC AM BULL GEOLOGICAL SOCIETY OF AMERICA BULLETINJ GEOPHYS RES JOURNAL OF GEOPHYSICAL RESEARCHASTROBIOLOGY ASTROBIOLOGYBIOGEOSCIENCES BiogeosciencesGEOPHYS RES LETT GEOPHYSICAL RESEARCH LETTERSBIOGEOCHEMISTRY BIOGEOCHEMISTRY。
Interprocessor Communication with Limited Memory

Ali Pinar, Member, IEEE Computer Society, and Bruce Hendrickson, Member, IEEE Computer Society

Abstract—Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances which we call the minimum phase remapping problem. We first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multicommodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. We also devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases. We also present an empirical study of variations of our algorithms which indicates that our approaches are quite practical.

Index Terms—Interprocessor communication, dynamic load balancing, data migration, scheduling, NP-completeness, approximation algorithms.

1 INTRODUCTION

In many parallel computations, the workload needs to be periodically redistributed among the processors. When computational work varies over time, the tasks must be reassigned to keep the workload balanced. On a distributed memory computer, this generally requires that data structures associated with the computations be transferred between processors. Many examples of this phenomenon occur in scientific computing, including adaptive mesh refinement, particle simulations with short or long-range forces, state-dependent physics models, and multiphysics or multiphase simulations.

For many such simulations, the limiting resource
is not computation but memory. Important examples include differential equation solvers using adaptive meshes, and large-scale particle simulations. Scientists commonly choose to use the minimum number of processors upon which a particular simulation can fit, or they choose a problem size which fills the memory of a particular number of processors. However, the dynamic memory requirements of such applications make such targeting difficult.

A number of algorithms and software tools have been developed to repartition the work among processors (see, for example, [1], [2] and references therein). However, the mechanics of actually moving large amounts of data has received much less attention. When the processors have sufficient memory, the simplest way to transmit the data is quite effective. Each processor can execute the following steps:

1. Allocate space for my incoming data.
2. Post asynchronous receives for my incoming data.
3. Barrier.
4. Send all my outgoing data.
5. Free up space consumed by my outgoing data.
6. Wait for all my incoming data to arrive.

The barrier in Step 3 ensures that no messages arrive until the processor is ready to receive them, so no buffering is needed.

Unfortunately, this protocol can fail when memory is limited. It requires a processor to have sufficient memory to hold both the outgoing and the incoming data simultaneously, since incoming messages can arrive before space for outgoing data is freed. An alternative way to view this issue is that, for a period of time, the data being transferred consumes space on both the sending and receiving processors.

A protocol that alleviates this problem is desirable for three reasons. First, since many scientific calculations are memory limited, reserving space for this communication operation limits the size of the calculations that can be performed. Second, the amount of memory required by this protocol is unpredictable because the data remapping requirements depend upon the computation.
Thus, one is forced to set aside a conservative amount of space, which is likely to be wasteful. Third, a general purpose tool for dynamic load balancing should be robust in the presence of limited memory. On the hopefully infrequent occasions when memory limitations prohibit direct transfers, a good tool should apply an alternative strategy instead of exiting. The construction of such a tool (Zoltan [3]) inspired our interest in this problem. The desire to impact Zoltan has a number of implications for this research. We are interested in algorithms with attractive practical performance, not just asymptotic bounds on worst-case behavior. We also need algorithms that are straightforward and efficient to implement.

• A. Pinar is with the Computational Research Division of Lawrence Berkeley National Laboratory, One Cyclotron Road, MS 50F, Berkeley, CA 94720. E-mail: apinar@.
• B. Hendrickson is with the Discrete Algorithms and Math Department, Sandia National Laboratories, Albuquerque, NM 87185-1111. E-mail: bah@.
Manuscript received 12 May 2003; accepted 29 Oct. 2003.
For information on obtaining reprints of this article, please send e-mail to: tpds@, and reference IEEECS Log Number TPDS-0074-0503.
1045-9219/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society

To address the deficiencies of the standard protocol, we propose a simple modification. Instead of sending all of the data at once, we will send it in phases. After each phase, processors can free up the memory of the data they have sent. That memory is now available for the next communication phase. Each phase adds an overhead due to the latency costs, thus it is important to limit the total number of phases, which serves as a proxy for minimizing the total remapping time. The results in Section 7 will confirm that the overall communication time is strongly affected by the number of phases.

More formally, consider a set of $P$ processors. The amount of data that needs to be communicated between processors is a transfer request. We will assume that the request is
feasible, that is, the end result of satisfying the transfer request does not violate any processor's memory constraints. We will let $T_{ij}$ denote the total volume of data to be transferred from processor $i$ to processor $j$. We now wish to perform the requested transfer in a sequence of phases. Let $t^{l}_{ij}$ denote the volume of data transferred from processor $i$ to processor $j$ in phase $l$, and let $A^{l}_{i}$ be the memory available to processor $i$ at the beginning of phase $l$. We will also use $R^{l}_{i}$ and $S^{l}_{i}$ to denote the total volume of data received and sent by processor $i$ in phase $l$ (i.e., $R^{l}_{i} = \sum_{j=1}^{P} t^{l}_{ji}$ and $S^{l}_{i} = \sum_{j=1}^{P} t^{l}_{ij}$). At each step, the finite-memory constraint requires that $R^{l}_{i} \le A^{l}_{i}$ for $i = 1, 2, \ldots, P$. The available memory after each phase can be computed as $A^{l+1}_{i} = A^{l}_{i} + S^{l}_{i} - R^{l}_{i}$. Our objective is to find a schedule of transfers that obeys the memory constraint and satisfies the transfer request in a minimal number of phases. We will call this the minimum phase remapping problem.

In Section 2, we show that the problem of determining whether a given transfer can be completed in a specified number of phases is NP-Complete. The remainder of this paper focuses on formulations and approximation algorithms that could be used in practice. In Section 3, we present a reduction of our problem to multicommodity flow. We present a continuous relaxation of the problem in Section 4 and a practical approximation algorithm in Section 5.
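
As a small illustration of the model defined above, the following Python sketch checks whether a proposed sequence of phases is a valid schedule: every phase must satisfy $R^{l}_{i} \le A^{l}_{i}$, the available memory is updated as $A^{l+1}_{i} = A^{l}_{i} + S^{l}_{i} - R^{l}_{i}$, and the phases together must deliver exactly $T_{ij}$. The function name `check_schedule` and the list-of-matrices representation are assumptions made here for illustration only; they do not come from the paper or from Zoltan.

```python
def check_schedule(T, A0, phases):
    """Return True if `phases` is a feasible schedule for the request T.

    T[i][j]         : total volume to move from processor i to processor j
    A0[i]           : memory available on processor i before the first phase
    phases[l][i][j] : volume t^l_ij sent from i to j in phase l
    (All names and the data layout are illustrative assumptions.)
    """
    P = len(T)
    avail = list(A0)                      # A^l_i for the current phase
    moved = [[0] * P for _ in range(P)]   # volume delivered so far
    for t in phases:
        if any(t[i][j] < 0 for i in range(P) for j in range(P)):
            return False                  # volumes must be non-negative
        recv = [sum(t[j][i] for j in range(P)) for i in range(P)]  # R^l_i
        sent = [sum(t[i][j] for j in range(P)) for i in range(P)]  # S^l_i
        if any(recv[i] > avail[i] for i in range(P)):
            return False                  # finite-memory constraint violated
        # A^{l+1}_i = A^l_i + S^l_i - R^l_i
        avail = [avail[i] + sent[i] - recv[i] for i in range(P)]
        for i in range(P):
            for j in range(P):
                moved[i][j] += t[i][j]
    return moved == T                     # the whole request must be satisfied


# Two processors swap 2 units each, but each has only 1 unit of free memory.
T = [[0, 2], [2, 0]]
A0 = [1, 1]
half = [[0, 1], [1, 0]]
one_phase_ok = check_schedule(T, A0, [T])           # everything at once
two_phase_ok = check_schedule(T, A0, [half, half])  # one unit per phase
```

In this toy instance, sending everything in a single phase violates the memory constraint, whereas splitting the exchange into two phases of one unit each is feasible.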
Although the emphasis of this paper is a theoretical analysis, we discuss some practical issues associated with the application of our techniques in Section 6 and results of empirical studies in Section 7. Earlier versions of some portions of this paper have appeared in conference proceedings [4], [5].

Despite its practical importance, the problem of efficient data transfers has not been well-studied. Some standard collective communication operations can be implemented in ways that limit memory usage, but the general problem we consider seems to be new. Cypher and Konstantinidou designed memory efficient message passing protocols [6]. However, their work addressed exchange of tokens as opposed to variable sized messages, and they did not explicitly consider the effect of finite memory in the processors. Their work conceptually divides a process into communication and application processes. Communication processes receive unit-size messages and copy them to application processes. It is assumed that application processes have enough memory, and the goal is to limit the memory requirement of the communication processes. Very recently, Hall et al. studied the problem for large storage systems [7]. They assume the total time for subdividing a message and sending subdivided messages is about the same as sending the entire message. So, they study the problem on unit-size messages. This assumption might hold for huge volumes of data to be transferred, as in the case of reorganizing a database; however, it is far from being true for dynamic load balancing applications due to very high message setup times. Their model also restricts each processor to send and receive only one message in a phase.

2 COMPLEXITY

In this section, we show that determining whether a given transfer can be completed in a specified number of phases is NP-Complete. Our proof uses a reduction from the Hamiltonian Circuit problem, which is known to be NP-Complete [8].
Recall that a Hamiltonian Circuit is a cycle in a graph that visits each vertex once. Given an instance of the Hamiltonian Circuit problem, the basic idea of our reduction is to construct an instance of the data transfer problem in which there is but a single unit of usable memory. This unit is a token that is passed between processors, and possession of the token allows a processor to receive data in the next phase. A solution to the data remapping problem occurs if and only if the token can be passed in a cycle among all the processors, which implies the existence of a Hamiltonian Circuit.

To see how this can be done, consider the Hamiltonian Circuit problem posed in the left portion of Fig. 1. From this instance, we construct the data remapping problem in the right portion of the figure. The data remapping problem contains the original graph as its core (represented in the figure with dark lines) after replacing vertices with processors and replacing edges with unit-volume data transfers. It also contains a chain of processors to the left. The bottom processor in this chain has free memory that will percolate upwards with each phase, finally allowing all the data transfers to be completed.

Fig. 1. Construction for NP-Completeness proof.

Given a graph $G = (V, E)$, we construct a data remapping problem as follows:

- For each vertex $v_i$ in $V$ there is a processor $p_i$. We refer to these processors as core processors. Each edge of $E$ is a unit-volume transfer.
- Add a chain of $|V|$ processors $c_1, \ldots, c_{|V|}$ along with transfer requests from $c_{i+1}$ to $c_i$ for $i = 1, 2, \ldots, |V| - 1$, each with volume $|E| - |V|$.
- Add a transfer request from each core processor $p_i$ to the top of the chain $c_{|V|}$. This transfer has volume equal to one less than the in-degree of $v_i$ in $G$.
- Add a dummy processor $d$ and a unit-weight transfer connecting $d$ to an arbitrary processor $p^*$ in the core.
- Give $|E| - |V|$ units of free memory to processor $c_1$ and 1 unit of free memory to $p^*$. No other processor has free memory.

Consider what happens as
the data remapping occurs. In the first phase, $c_2$ will send its data to $c_1$, moving the free memory one step up in the chain. After $|V| - 1$ phases, this free memory will have arrived at $c_{|V|}$, the top of the chain. Meanwhile, the token, which started at $p^*$, will have meandered about, enabling some data to be transferred. In phase $|V|$, processor $c_{|V|}$ has enough memory to receive all it needs from the core processors. During this phase, the token can take one more step. The messages sent to $c_{|V|}$ free up memory in the core processors. Specifically, at the end of phase $|V|$, each core processor $p_i$ has $\mathrm{indegree}(p_i) - 1$ units of free memory (one processor has an additional unit due to the token). In phase $|V| + 1$, core processor $p_i$ can now receive all it needs, minus 1. The transfers to $p_i$ can be completed in this phase if and only if one of the data transfers to $p_i$ has previously been handled by the token. But, the only way for the token to visit all the core processors in $|V|$ phases is to complete a Hamiltonian Circuit of the core graph. Note that the token must end up where it started, at processor $p^*$, to enable the transfer from $d$ to occur at phase $|V| + 1$. This argument leads to the following result.

Theorem 1. Determining whether an instance of the data remapping problem can complete in a specified number of phases is NP-Complete.

Proof. Given an instance of the Hamiltonian Circuit problem $G = (V, E)$, construct a data remapping problem as described above. As sketched above, the data remapping problem finishes in $|V| + 1$ phases if the core graph has a Hamiltonian Circuit. If the core graph does not have a Hamiltonian Circuit, then one of its processors will not have been visited by the token by the end of phase $|V|$. That unvisited processor, $p_i$, still needs to receive $\mathrm{indegree}(v_i)$ data, but has only $\mathrm{indegree}(v_i) - 1$ units of available memory, so the data transfers cannot complete in $|V| + 1$ phases. The construction of the data remapping problem is polynomial, so we can conclude that the data
remapping problem is NP-Hard. A given solution can be verified in polynomial time, thus the problem is in NP. □

3 MULTICOMMODITY FLOW FORMULATION

In this section, we present a multicommodity flow (MCF) formulation to determine whether a given transfer can complete in a specified number of phases [9]. Once we can solve the decision problem, the number of phases in an optimal solution can be determined using parametric search. This formulation enables use of MCF technology to solve the minimum phase data remapping problem optimally. This might be helpful for three reasons. First, some MCF problems can be solved relatively quickly, despite their intractability in the general case. Second, the continuous version of the MCF problem can be solved in polynomial time and the solution can be used as a heuristic for the integer problem. Finally, MCF solvers will find an optimal solution if runtime is not an issue.

In our MCF formulation, each processor corresponds to a commodity. Let $P$ be the number of processors and $L$ be the number of phases. We must decide if a remapping can complete in $L$ phases. As depicted in Fig. 2, our MCF graph contains a sequence of components, one for each phase. Each component allows for the communication that occurs in the corresponding phase. The MCF graph $G = (V, E)$ has $2PL$ vertices. Each processor is represented by $2L$ vertices: two vertices (one sender and one receiver) at each phase. We will use $r^l_i$ and $s^l_i$ to denote receiver and sender, respectively, for processor $i$ in phase $l$. A sender vertex of the first phase is the source of a commodity with volume equal to the total volume of the data originally stored by this processor. A receiver vertex in the last phase is a destination for a set of commodities that corresponds to data to be stored by this processor after remapping is complete.

In the MCF graph, there is an edge from $r^l_i$ to $s^l_i$ for $l = 1, \ldots, L$ and $i = 1, \ldots, P$. The capacity of an edge is equal to the total memory on the respective processor.
There are also edges from each sender vertex $s^l_i$ to all other receiver vertices $r^l_j$ in the same phase to enable data exchange between any pair of processors in a phase. These edges have infinite capacities. With this construction, all processors first receive the data in a phase and then send their messages. This corresponds to first allocating space for the data to be received and then sending the outgoing data. The edges from receivers to senders within a phase guarantee that there is sufficient space to allocate memory for the incoming data before releasing the space for the data being shipped out, so that the memory constraints are guaranteed to be satisfied. Finally, there is an edge (with infinite capacity) from each sender $s^l_i$ to the receiver in the next phase $r^{l+1}_i$ for $l = 1, \ldots, L - 1$. The flow on these edges corresponds to data that is already in the memory of a processor at the beginning of a phase. The graph for $P = 5$ and $L = 2$ is depicted in Fig. 2.

Fig. 2. MCF graph for five processors and two phases.

Theorem 2. There exists a solution to the remapping problem if and only if there exists a solution to the MCF formulation.
Proof. We can replace a data transfer from processor $i$ to processor $j$ in phase $l$ with flow on edge $(s^l_i, r^l_j)$ of equal volume. As argued above, memory constraints on the processors are satisfied if and only if the capacity constraints on the edges are satisfied in $G$. Thus, the feasibility of one solution implies the feasibility of the other. □

In this formulation, the number of commodities is equal to the number of processors, and the graph has $2PL$ vertices and $P^2 L$ edges. The number of vertices and edges can be reduced for a more efficient formulation. First, we can replace the crossbar between senders and receivers in a phase $l$ with a vertex $v^l$ and edges from all senders of phase $l$ to $v^l$ and edges from $v^l$ to all receivers of phase $l$. Second, we can merge the senders of phase $l$ with receivers of phase $l + 1$. The graph after these reductions is depicted in Fig. 3. This improved formulation has $PL + L + P$ vertices and $(3L + 1)P$ edges.

4 CONTINUOUS RELAXATION

Although the multicommodity flow formulation from Section 3 provides a methodology for solving instances of the minimum phase remapping problem, runtime can still be exponential in the problem size. In this section, we describe an efficient solution for an approximation to the remapping problem. In the approximation, integral constraints on the volume of data transfers are relaxed to allow continuous values. Naturally, the volume of transfer between two processors in a phase must be an integer. But, integer solutions near the continuous ones can be used as heuristics. Note that the unit of data transfer is only a byte, whereas the volume of data being transferred is often on the order of megabytes. So, conversion from a continuous solution to an integer solution will often be a small perturbation, and so heuristics based upon this idea may be generally effective.
However, bad cases for this heuristic exist, as discussed at the end of this section.

As defined in the introduction, $T_{ij}$ denotes the total volume of data to be communicated from processor $i$ to processor $j$, and $t^l_{ij}$ denotes the volume of data transferred from processor $i$ to processor $j$ in phase $l$. The memory available to processor $i$ at the beginning of phase $l$ is denoted by $A^l_i$. We also use $R_i$ and $S_i$ to denote the total volume of data received and sent by processor $i$ during remapping.

Let $T$ be the total volume of data to be transferred, and $M$ be the total volume of available memory in the system; then $L = \lceil T/M \rceil$ is a lower bound on the number of phases. We will divide each message into $L$ equal pieces, i.e., $t^0_{ij} = t^1_{ij} = \cdots = t^{L-1}_{ij} = T_{ij}/L$, and send a piece at each phase. If the memory constraints are satisfied, then the data transfers will complete in precisely $L$ phases. However, there is no guarantee that memory constraints will not be violated. As a solution to this problem, we will use preprocessing and postprocessing phases to ensure feasibility of the phases in between.

Lemma 1. If the following conditions are satisfied, the continuous version of the remapping problem can be completed in $L = \lceil T/M \rceil$ phases.

1. $S_i = R_i$ for all processors.
2. $A^0_i \ge R_i/L$.

Proof. At each phase, processor $i$ will receive $R_i/L$ units of data. By the second condition, each processor has sufficient memory for the first phase. By the first condition, each processor ships out $S_i/L = R_i/L$ units of data at each phase, which frees up sufficient memory for the next phase. □

Lemma 2. A solution for a continuous version of the data remapping problem for transfer request $\mathcal{R}$ can be performed via the following three steps:

1. One preprocessing phase.
2. A new transfer request $\mathcal{R}'$, where $S_i = R_i$ and $A^0_i \ge R_i/L$.
3. One postprocessing phase.

Proof. In the preprocessing phase we will reorganize the data to satisfy conditions 1 and 2 from Lemma 1, and define a new mapping of the data. After the new mapping is complete, a single postprocessing phase will be sufficient to
get all of the data to the correct processor.

In the preprocessing step, all processors $i$ with $R_i < S_i$ will transfer some of their outgoing data to processors $j$ in which $R_j > S_j$, so that in subsequent phases $R_i = S_i$. Note that, if the transfer request is feasible, then $R_j - S_j \le A^0_j$. Thus, this rearrangement can be completed in a single phase. Next, as second part of the preprocessing step, all processors $i$ with $A_i < R_i/L$ will transfer some of their outgoing data to processors $j$ with $A_j > R_j/L$. To avoid disturbing the first property, the sending processors will also pass equal amounts of receiving assignment. Once again, this step can be completed in one phase, since by construction, the receiving processors have sufficient space. Notice that the actual data being transferred is irrelevant (we are just trying to balance the numbers), so a send and receive operation can cancel each other. This enables merging of the two steps above into one phase.

Fig. 3. MCF graph after reduction.

After the new transfer request $\mathcal{R}'$ is realized, we must correct for the transfer of receiving assignments. This correction is the purpose of the postprocessing phase. Under the transfer of receiving assignments, each processor is either a sender or a receiver of such assignments. So, during postprocessing, each processor will either receive or send data, but not both. Since the initial remapping is feasible, each processor has enough memory for the data to be received, thus the postprocessing can be completed in one phase. □

The complexity of constructing the solution for the preprocessing phase is linear in the number of processors.
To see this, divide the processors into two lists: those with $R_i < S_i$ and those with $R_j > S_j$. Now, step through the lists together, transferring sending responsibility from a processor in the $i$ list to one in the $j$ list. Each transfer balances $R_i$ and $S_i$ for a processor in one of the lists. The same can be applied to balance initial available memories. Notice that the preprocessing step uniquely describes the postprocessing phase, and remapping for $\mathcal{R}'$ is straightforward.

Theorem 3. Given a transfer request $\mathcal{R}$, the continuous version of the data remapping problem can be completed in $\lceil T/M \rceil + 2$ phases.

Proof. By Lemma 2, $\mathcal{R}$ can be completed by pre- and postprocessing steps, along with a transfer request $\mathcal{R}'$ satisfying conditions of Lemma 1. Notice that the total volume of data to be transferred, $T'$ in $\mathcal{R}'$, is no greater than $T$ in $\mathcal{R}$, and the total available memory in the system does not change: $M = M'$. Hence, by Lemma 1, $\mathcal{R}'$ can be completed in $\lceil T'/M \rceil \le \lceil T/M \rceil$ phases. Together with one preprocessing and one postprocessing phase, remapping can be completed in $\lceil T/M \rceil + 2$ phases. □

It is worth noting that a good solution for this continuous approximation may not yield a good solution for the true discrete problem. For instance, consider the example depicted in Fig. 4. This example consists of two groups of processors, with no communication between the groups, and there is only one unit of available memory. Available memory must be possessed by each component in turn, and this requires temporarily moving some data from one component to the other to transfer the free memory, as will be discussed in more detail in the next section. In the preprocessing step described in the proof of Lemma 2, this available memory will be divided between the two groups of processors, but the fractional transfers that follow give no insight into the correct way to orchestrate the data transfers for this instance. Specifically, in the continuous solution, all processors are identical, so no information is gleaned about the necessity of working on components in
turn.

5 EFFICIENT APPROXIMATION ALGORITHMS

In this section, we describe the basics of a family of efficient algorithms that provide solutions in which the number of phases is at most 1.5 times that of an optimal solution. The algorithm is motivated by some simple observations. First, the maximum amount of data that can be transferred in a phase is equal to the total amount of free memory in the parallel machine. Let $M$ be the total available memory in the parallel machine, and let $T$ be the total volume of data to be moved. Note that $M$ does not change between phases.

Lemma 3. The minimum number of phases in a solution is at least $\lceil T/M \rceil$.

This bound can be achieved only if available memory is used to receive messages at each phase. Thus, free memory is wasted if it resides on a processor that has no data to receive. Our algorithm works by redistributing free memory to processors that can use it. Equivalently, data is parked on a processor with free memory, which it cannot use, to free up memory on processors that can use it.
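As a small illustration of the bound in Lemma 3, the sketch below computes the ceiling of the total transfer volume divided by the total free memory. The function name `phase_lower_bound` and the input encoding are assumptions made for this example, not part of the paper; volumes are taken to be integers.

```python
from typing import Dict, List, Tuple


def phase_lower_bound(transfers: Dict[Tuple[int, int], int],
                      free_memory: List[int]) -> int:
    """Lower bound on the number of phases: ceil(T / M).

    transfers maps (sender, receiver) to the total volume to move;
    free_memory lists the initially available memory of each processor.
    """
    total_volume = sum(transfers.values())   # T
    total_free = sum(free_memory)            # M, unchanged between phases
    if total_volume == 0:
        return 0
    if total_free <= 0:
        raise ValueError("no free memory: no data can be received")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_volume // total_free)


if __name__ == "__main__":
    # Two processors exchange 100 units each; free memory is 1, 1, and 100.
    print(phase_lower_bound({(0, 1): 100, (1, 0): 100}, [1, 1, 100]))
```

The bound only counts volume against memory. It says nothing about whether the free memory sits on processors that can use it, which is the issue parking addresses.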
We will park only data that must be transferred eventually.

5.1 Parking

Parking aims to utilize memory that would otherwise be wasted. Consider a processor that has received all its data and still has available memory. This memory cannot be utilized in subsequent phases, which decreases the total memory usable for communication, thus potentially increasing the number of phases. Instead, another processor can temporarily move some of its data to this processor to free up space for messages. An example is illustrated in Fig. 5. In this simple example, the top two processors want to exchange 100 units of data, but each has only one unit of available memory. A simplistic approach will require 100 phases. However, the third processor has 100 units of free memory. By parking data on this third processor (i.e., transferring free memory to another processor), the number of phases can be reduced to three.

Fig. 4. Catastrophic instance for continuous relaxation.

Fig. 5. Example of the utility of parking.

More formally, if a processor has $k$ units of data left to receive and $m$ units of free memory, then it has parking space of $\max(0, m - k)$ units. A processor has data to park if the incoming data exceeds available memory, and the quantity of this parkable data is $\max(0, k - m)$ units. The parkable data consists of data that eventually must be sent to another processor. Note that, if the transfer request is feasible, then a processor must send out $\max(0, k - m)$ units. Any processor that has parking space can store parkable data from another processor, thereby maximizing the amount of usable free memory. This parked data merely takes an extra step on the way to its final destination. Exploiting this observation will allow us to construct an approximation algorithm. In our algorithm, we merely store data in a parking space, and then forward it to its correct destination when the destination processor has available memory. Note that it is inconsequential which processor owns the parked data.
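The two quantities defined above are easy to tabulate. The following Python sketch is illustrative only: the name `parking_summary` and the (k, m) tuple encoding are assumptions made here. It computes the parking space and the parkable data for each processor, and reports the smaller of the two totals, since no more data can be parked at once than there is parkable data or parking space.

```python
from typing import List, Tuple


def parking_summary(state: List[Tuple[int, int]]) -> Tuple[List[int], List[int], int]:
    """state[i] = (k, m): k units left to receive, m units of free memory.

    Returns (parking space per processor, parkable data per processor,
    largest volume that could be parked at once).
    """
    space = [max(0, m - k) for k, m in state]      # free memory it cannot use
    parkable = [max(0, k - m) for k, m in state]   # incoming data that does not fit
    return space, parkable, min(sum(space), sum(parkable))


if __name__ == "__main__":
    # The three-processor example: two processors must each receive 100 units
    # with 1 unit free; the third receives nothing and has 100 units free.
    print(parking_summary([(100, 1), (100, 1), (0, 100)]))
```

For the three-processor example this gives 100 units of parking space on the third processor and 99 units of parkable data on each of the other two, so up to 100 units can be parked.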
In other words, parking spaces are indistinguishable. What potentially affects performance is which processors shunt their data to a parking space.

Lemma 4. It is sufficient to park data at most once to obtain an optimal solution.

Proof. Assume there is a solution that parks some data $D$ twice. Let $p_1$ and $p_2$ be the first and second processors on which $D$ is parked. After data is moved from $p_1$ to $p_2$, if no other processor uses available memory at $p_1$, then there was never a need to move data to $p_2$. If another processor $p_i$ parks data on $p_1$, then we can rearrange the data movement with $D$ staying in $p_1$, and $p_i$ parking on $p_2$, due to indistinguishability of parking spaces. □

It is worth noting that parking is not just a heuristic, but a requirement in some cases. Consider the example in Fig. 5, modified so that there is no available memory in the top two processors. In this case, the transfer request is still feasible, but realizing the remapping requires parking.

5.2 An Approximation Algorithm

In this section, we describe an algorithm that obtains a solution with at most 1.5 times the optimal number of phases. The algorithm is quite generic and allows for a number of possible enhancements.

Algorithm 1:

- A processor receives as much data as it can in each phase (i.e., if a processor has available memory at the end of a phase, then this processor does not have any more data to receive).
- If the transfer request cannot be completed in the next phase, then park as much data as possible (i.e., park the minimum of the total parkable data and the total parking space).

Note that many details about the algorithm are unspecified: If I have more incoming data than free memory, which messages should I receive in the current phase? If several processors want to park data, but limited parking spaces are available, which should succeed? We will show below that, with any answers to these questions, the resulting algorithm generates a solution with no more than 1.5 times the optimal number of phases. Intelligent answers to these
questions could be used to devise algorithms with better practical (or perhaps theoretical) performance.

Lemma 5. The total volume of data transferred by Algorithm 1 is at most $\lceil 3T/2 \rceil$.

Proof. Let $T_p$ be the volume of data transferred through parking, and let $T_d$ be the data transferred directly. Data is transferred either directly or through parking, thus $T = T_p + T_d$. It is enough to park data once due to Lemma 4, thus parked data is moved twice, and the total volume of data moved is $2T_p + T_d = T + T_p$. Recall that, by definition of parkable data, data is parked only if it will help receiving more data in the next phase. Since each parked unit of data enables at least one direct transfer, the algorithm guarantees that $T_p \le T_d$. Thus, at most half of $T$ can be transferred through parking, i.e., $T_p \le T/2$, and the total volume of data moved is $T + T_p \le T + T/2 = 3T/2$. □

Theorem 4. Algorithm 1 constructs a solution with at most $\lceil 3T/(2M) \rceil + 1$ phases.

Proof. The algorithm makes use of all $M$ units of available memory until the amount of parkable data is less than the amount of parking space. It then completes in at most two additional phases, one in which some data is parked, and a final phase in which each processor has enough memory to receive all its messages. By Lemma 5, we know that the total volume of data transferred in the algorithm is at most $\lceil 3T/2 \rceil$. With $M$ units of transfer in all but the last two phases, the process can be completed in at most $\lceil 3T/(2M) \rceil + 2$ phases.

We will now decrease the bound to $\lceil 3T/(2M) \rceil + 1$. Let $l$ be the number of phases for the algorithm to complete the data remapping process. The total volume of data transferred is $(l - 2)M + x$, where $x$ is the volume of data transferred in the last two phases: $1 < x \le 2M$. From Lemma 5, we know that

$$(l - 2)M + x \le \frac{3T}{2}.$$

But, simple algebra reveals that, because $x > 0$,

$$l - 1 \le \left\lceil \frac{(l - 2)M + x}{M} \right\rceil.$$

Combining these inequalities,

$$l - 1 \le \left\lceil \frac{3T}{2M} \right\rceil,$$

and the result follows. □

Combined with Lemma 3, Theorem 4 shows that Algorithm 1 is a 3/2 approximation algorithm for the minimum phase remapping problem. Without a better lower bound, this value of 3/2 is tight as illustrated by
the example in Fig. 6.

Fig. 6. Example to show the tightness of the 1.5 bound.

This example consists of an odd number of processors $P$. All but one of them are organized in pairs which exchange a single unit of data. Only the unpaired processor has a single