Supporting Software Distributed Shared Memory with an Optimizing Compiler


State Variables: Definition


In the realm of computer science and programming, state variables serve as fundamental building blocks for modeling systems and processes that evolve over time. They embody the essence of dynamic behavior in software applications, enabling developers to capture and manipulate various aspects of an object or system's condition at any given moment. This essay examines the concept of state variables from multiple perspectives: it provides a detailed definition, discusses their roles and significance, examines their implementation across programming paradigms, explores their impact on program design, and addresses the challenges they introduce.

**Definition of State Variables**

At its core, a state variable is a named data item within a program or computational system that maintains a value that may change over the course of program execution. It represents a specific aspect of the system's state, which is the overall configuration or condition that determines its behavior and response to external stimuli. The following key characteristics define state variables:

1. **Persistence:** State variables retain their values throughout the lifetime of an object or a program's execution unless explicitly modified. They hold information that persists beyond a single function call or statement execution.

2. **Mutability:** State variables are inherently mutable, meaning their values can be altered by program instructions. This property allows programs to model evolving conditions or track changes in a system over time.

3. **Contextual Dependency:** The value of a state variable depends on the context in which it is accessed, typically determined by the object or scope to which it belongs. This context sensitivity supports encapsulation and prevents unintended interference with other parts of the program.

4. **Time-variant Nature:** State variables reflect the temporal dynamics of a system, capturing how its properties or attributes change in response to internal operations or external inputs. They allow programs to model systems with non-static behaviors and enable the simulation of real-world scenarios with varying conditions.

**Roles and Significance of State Variables**

State variables play several critical roles in software development, contributing to the expressiveness, versatility, and realism of programs:

1. **Modeling Dynamic Systems:** State variables are instrumental in simulating real-world systems with changing states, such as financial transactions, game characters, network connections, or user interfaces. By representing the relevant attributes of these systems as state variables, programmers can accurately model complex behaviors and interactions over time.

2. **Enabling Data Persistence:** In many applications, maintaining user preferences, application settings, or transaction histories is crucial. State variables facilitate this persistence by storing and updating relevant data as the program runs, ensuring that users' interactions and system events leave a lasting impact.

3. **Supporting Object-Oriented Programming:** In object-oriented languages, state variables (often called instance variables) form an integral part of an object's encapsulated data. They provide the internal representation of an object's characteristics, allowing objects to maintain their unique identity and behavior while interacting with other objects or the environment.

4. **Facilitating Concurrency and Parallelism:** State variables underpin the synchronization and coordination mechanisms in concurrent and parallel systems. They help manage shared resources, enforce mutual exclusion, and ensure data consistency among concurrently executing threads or processes.

**Implementation Across Programming Paradigms**

State variables find expression in various programming paradigms, each with its own idiomatic approach to managing and manipulating them:

1. **Object-Oriented Programming (OOP):** In OOP languages like Java, C++, or Python, state variables are typically declared as instance variables within a class. They are accessed through methods (getters and setters), ensuring encapsulation and promoting a clear separation of concerns between an object's internal state and its external interface.

2. **Functional Programming (FP):** Although FP emphasizes immutability and statelessness, state management is still necessary in practical applications. FP languages like Haskell, Scala, or Clojure often employ monads (e.g., the State monad) or algebraic effects to model stateful computations in a pure, referentially transparent manner. These constructs encapsulate state changes within higher-order functions, preserving the purity of the underlying functional model.

3. **Imperative Programming:** In imperative languages like C or JavaScript, state variables are directly manipulated through assignment statements. Control structures (e.g., loops and conditionals) often rely on modifying state variables to drive program flow and decision-making.

4. **Reactive Programming:** Reactive frameworks like React or Vue.js use state variables (e.g., component state) to manage UI updates in response to user interactions or data changes. These frameworks provide mechanisms (e.g., setState() in React) to handle state transitions and trigger efficient UI re-rendering.

**Impact on Program Design**

The use of state variables significantly influences program design, both positively and negatively:

1. **Modularity and Encapsulation:** Well-designed state variables promote modularity by encapsulating relevant information within components, objects, or modules. This encapsulation enhances code organization, simplifies maintenance, and facilitates reuse.

2. **Complexity Management:** While state variables enable rich behavioral modeling, excessive or poorly managed state can lead to complexity spirals. Convoluted state dependencies, hidden side effects, and inconsistent state updates can make programs difficult to understand, test, and debug.

3. **Testing and Debugging:** State variables introduce a temporal dimension to program behavior, necessitating thorough testing across different states and input scenarios. Techniques like unit testing, property-based testing, and state-machine testing help validate state-related logic. Debugging tools often provide features to inspect and modify state variables at runtime, aiding in diagnosing issues.

4. **Concurrency and Scalability:** Properly managing shared state is crucial for concurrent and distributed systems. Techniques like lock-based synchronization, atomic operations, or software transactional memory help ensure data consistency and prevent race conditions. Alternatively, architectures like event-driven or actor-based systems minimize shared state and promote message passing for improved scalability.

**Challenges and Considerations**

Despite their utility, state variables pose several challenges that programmers must address:

1. **State Explosion:** As programs grow in size and complexity, the number of possible state combinations can increase exponentially, a phenomenon known as state explosion. Techniques like state-space reduction, model checking, or static analysis can help manage this complexity.

2. **Temporal Coupling:** State variables can introduce temporal coupling, where the correct behavior of a piece of code depends on the order or timing of state changes elsewhere in the program. Minimizing temporal coupling through decoupled designs, immutable data structures, or functional reactive programming can improve code maintainability and resilience.

3. **Caching and Performance Optimization:** Managing state efficiently is crucial for performance-critical applications. Techniques like memoization, lazy evaluation, or cache-invalidation strategies can optimize state access and updates without compromising correctness.

4. **Debugging and Reproducibility:** Stateful programs can be challenging to debug due to their non-deterministic nature. Logging, deterministic replay systems, or snapshot-based debugging techniques can help reproduce and diagnose issues related to state management.

In conclusion, state variables are an indispensable concept in software engineering, enabling programmers to model dynamic systems, maintain data persistence, and implement complex behaviors. Their proper use and management are vital for creating robust, scalable, and maintainable software systems. While they introduce challenges such as state explosion, temporal coupling, and debugging complexity, a deep understanding of state variables and their implications for program design can help developers harness their power effectively, ultimately driving innovation and progress in the field of computer science.
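To make the characteristics discussed above (persistence, mutability, contextual dependency) concrete, here is a minimal Python sketch; the class name and fields are illustrative, not taken from the text:

```python
class BankAccount:
    """Models a simple system whose state evolves over time."""

    def __init__(self, balance=0):
        # 'balance' is a state variable: it persists for the object's
        # lifetime and is mutable only through the methods below.
        self._balance = balance

    def deposit(self, amount):
        # Mutation: program instructions alter the state variable.
        self._balance += amount

    @property
    def balance(self):
        # Contextual dependency: each instance carries its own state.
        return self._balance


a, b = BankAccount(), BankAccount(100)
a.deposit(50)
assert a.balance == 50    # state persisted across calls
assert b.balance == 100   # independent per-object context
```

The getter/setter-style access mirrors the encapsulation convention described for OOP languages: external code sees the interface, not the raw state.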
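The discussion of concurrency above mentions lock-based synchronization and mutual exclusion; the following hedged sketch (names are illustrative) shows a shared state variable guarded by a lock:

```python
import threading


class SharedCounter:
    """A state variable shared across threads, guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Mutual exclusion: without the lock, the read-modify-write
        # below could interleave across threads and lose updates.
        with self._lock:
            self.value += 1


counter = SharedCounter()
threads = [
    threading.Thread(target=lambda: [counter.increment() for _ in range(10_000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter.value == 40_000  # consistent despite concurrent updates
```

Removing the lock turns the increment into a classic race condition, which is exactly the failure mode the essay's concurrency section warns about.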
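The caching techniques mentioned above (memoization in particular) can be sketched with Python's standard library; the cache itself is hidden state maintained across calls:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    # Results are stored in the decorator's internal state, so each
    # distinct n is computed once; repeated calls are cache lookups.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


assert fib(30) == 832040
hits_before = fib.cache_info().hits
fib(30)  # served from the cache this time
assert fib.cache_info().hits > hits_before
```

This also illustrates the essay's point about correctness: memoization is only safe because `fib` depends on nothing but its argument, so the cached state can never go stale.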

University of Wisconsin-Madison


University of Wisconsin-Madison (UMW) — Zhou Yulong, 1101213442, Computer Applications

About UW-Madison: The University of Wisconsin is located in Madison, the capital of the state of Wisconsin, west of Lake Michigan, on a picturesque campus. Founded in 1848, it is a venerable university with more than 150 years of history.

The University of Wisconsin is one of the top three public universities in the United States and one of the top ten research universities in the country.

In the United States it is often regarded as a "public Ivy".

Like the University of California, the University of Texas, and other well-known American public universities, the University of Wisconsin is a system made up of multiple state universities, the University of Wisconsin System.

In undergraduate education it ranks third among public universities, behind the University of California, Berkeley and the University of Michigan.

Beyond that, it ranks eighth among American universities for the quality of its undergraduate education.

According to the National Research Council, the University of Wisconsin has 70 subjects ranked in the top ten nationwide.

In Shanghai Jiao Tong University's rankings it places 16th among the world's universities.

The University of Wisconsin is one of the 60 members of the Association of American Universities.

Featured programs: UW-Madison offers more than 100 undergraduate majors, more than half of which can award master's and doctoral degrees. Journalism, biochemistry, botany, chemical engineering, chemistry, civil engineering, computer science, earth sciences, English, geography, physics, economics, German, history, linguistics, mathematics, business administration (MBA), microbiology, molecular biology, mechanical engineering, philosophy, Spanish, psychology, political science, statistics, sociology, zoology, and many other disciplines have considerable research and teaching strength, and most rank in the top 10 of their fields among American universities.

Academic distinction: In terms of academic honors, UW-Madison's faculty and alumni have together won seventeen Nobel Prizes and twenty-four Pulitzer Prizes; fifty-three faculty members belong to the National Academy of Sciences, seventeen to the National Academy of Engineering, and five to the National Academy of Education; in addition, nine faculty members have won the National Medal of Science, six are national-level Searle Scholars, and four have received MacArthur Fellowships.

Although UW-Madison is best known for agriculture and the life sciences, the most striking attraction, and the biggest draw for many communication students who study there, is Jack McLeod, who teaches in the school's journalism and mass communication program and is known in the field as a "master of modern American communication studies".

Donating Items to Local Charities (English Essay)


Donating Items to Local Charitable Organizations

Charitable organizations play a vital role in supporting individuals and communities in need. These organizations rely on the generosity of donors to provide essential services and resources to those who may be struggling with poverty, homelessness, or other challenges. As members of our local community, we all have the opportunity to make a positive impact by donating items to these organizations.

One of the primary benefits of donating to local charitable organizations is the direct impact it can have on the lives of those in need. When we donate items such as clothing, household goods, or non-perishable food, we are directly contributing to the well-being of our neighbors. These donations can provide warmth, comfort, and nourishment to those who may not have access to these basic necessities.

Moreover, donating to local charitable organizations can be a highly efficient way to support our community. These organizations often have well-established distribution networks and partnerships with other local organizations, ensuring that donations reach the individuals and families who need them most. By donating through these channels, we can be confident that our contributions are making a meaningful difference.

One of the most common ways to donate to local charitable organizations is by contributing gently used clothing and household items. Many organizations, such as Goodwill, the Salvation Army, and local shelters, accept donations of clothing, furniture, toys, and other household goods. These items can then be distributed to individuals and families in need, or sold in thrift stores to generate funds for the organization's programs and services.

In addition to clothing and household items, local charitable organizations often accept donations of non-perishable food items. Food banks and pantries play a crucial role in addressing food insecurity within our communities. By donating canned goods, dry foods, and other non-perishable items, we can help ensure that families and individuals have access to the nourishment they need.

Another way to support local charitable organizations is by donating personal care items and hygiene products. These items, such as toothpaste, soap, shampoo, and feminine products, are often in high demand but can be difficult for individuals and families to afford. By donating these essential items, we can help ease the burden on those who may be struggling to meet their basic needs.

When it comes to donating to local charitable organizations, it is important to consider the specific needs of the organizations and the individuals they serve. Many organizations maintain wish lists or have specific guidelines for the types of items they accept. It is crucial to research the organizations in our local area and to align our donations with their current needs.

In addition to physical donations, monetary contributions can also be a valuable way to support local charitable organizations. Many organizations rely on financial donations to fund their programs, pay staff, and cover operational expenses. By making a monetary donation, we can help ensure that these organizations have the resources they need to continue their important work.

In conclusion, donating items to local charitable organizations is a powerful way to make a positive impact on our community. Whether it is contributing clothing, household goods, non-perishable food, or personal care items, our donations can provide essential support to those in need. By aligning our donations with the specific needs of local organizations, we can ensure that our contributions make a meaningful difference in the lives of our neighbors. By embracing the spirit of generosity and community, we can all play a role in creating a more compassionate and equitable society.

A Shared Future for All Humanity (English Essay)


In an era where globalization has woven the world into a tightly interconnected web, the concept of sharing a common future for all humanity has become more relevant than ever. The idea of a shared future transcends geographical boundaries, cultural differences, and political ideologies, emphasizing the need for collective responsibility and mutual cooperation.

The Importance of a Shared Future

1. Economic Interdependence: The global economy is a prime example of how nations are interconnected. The rise of multinational corporations and international trade has made it clear that the prosperity of one nation can significantly impact others. A shared future in economic terms means working toward policies that promote fair trade, reduce poverty, and ensure that the benefits of economic growth are distributed equitably.
2. Environmental Sustainability: Climate change is a global challenge that requires a global response. A shared future in this context involves committing to sustainable practices, reducing carbon emissions, and investing in renewable energy sources. It is about ensuring that the planet remains habitable for future generations, regardless of their nationality.
3. Cultural Exchange: The exchange of cultural practices, ideas, and values enriches societies and fosters understanding among different peoples. A shared future in cultural terms means embracing diversity and promoting dialogue that respects and learns from different traditions and perspectives.
4. Technological Advancement: Technology has the power to transform lives and societies. A shared future in this regard is about ensuring that technological advances are accessible to all, reducing the digital divide and using technology as a tool for education, healthcare, and social development.
5. Peace and Security: The pursuit of a peaceful and secure world is fundamental to a shared future. This involves addressing the root causes of conflicts, promoting diplomacy over violence, and ensuring that international laws and norms are respected.

Challenges to a Shared Future

1. Inequality: Economic, social, and political inequalities pose a significant challenge to the idea of a shared future. These disparities can lead to social unrest and hinder cooperation among nations.
2. Nationalism and Protectionism: The rise of nationalistic sentiment and protectionist policies can create barriers to international cooperation and hinder efforts toward a shared future.
3. Lack of Access to Education and Healthcare: In many parts of the world, access to basic services like education and healthcare is limited, which can perpetuate cycles of poverty and hinder social mobility.
4. Environmental Degradation: The overexploitation of natural resources and disregard for environmental conservation threaten the sustainability of our planet, posing a significant challenge to a shared future.

The Role of Individuals and Governments

1. Individual Responsibility: Each person has a role to play in shaping a shared future. This can be through making conscious choices about consumption, supporting social causes, and advocating for policies that promote a fair and sustainable world.
2. Governmental Initiatives: Governments must take the lead in formulating and implementing policies that address global challenges. This includes investing in education, healthcare, and infrastructure, and working with international partners to tackle issues like climate change and poverty.
3. International Cooperation: International organizations play a crucial role in facilitating dialogue and cooperation among nations. They can help coordinate efforts and provide a platform for nations to work together toward common goals.

In conclusion, the concept of a shared future for all humanity is not merely an idealistic vision but a practical necessity in our interconnected world. It requires a commitment to collaboration, understanding, and the recognition that the well-being of one is intrinsically linked to the well-being of all. By working together, we can overcome the challenges that face us and build a future that is sustainable, equitable, and prosperous for all.

Elementary English Unit 2 Test (Volume 1, Paper A)


Elementary English, Volume 1, Unit 2 Test

Part I: Comprehensive questions (100 questions, 1 point each, 100 points in total; unanswered or incorrect items receive no credit).

1. I think friendship is one of the greatest gifts. Friends support each other through thick and thin. I'm grateful for my friend __________, who always knows how to cheer me up.
2. My dog loves to fetch the ______ (球).
3. A ______ is a large area of elevated land with a flat top.
4. I enjoy _______ (运动) with my friends.
5. The ______ is always smiling.
6. The chemical formula for calcium hydroxide is ______.
7. What is the name of the famous landmark in Sydney?
   A. Opera House  B. Harbour Bridge  C. Bondi Beach  D. Uluru  (Answer: A)
8. The chemical formula for sodium acetate is _______.
9. The chemical symbol for argon is _______.
10. I love to listen to ______ (音乐) while I study.
11. My mom bought me a new ________ (滑梯) for the backyard. I can slide down ________ (很快).
12. Acids taste ______.
13. What is the main ingredient in a salad?
    A. Meat  B. Vegetables  C. Grains  D. Fruit  (Answer: B)
14. Which instrument has keys?
    A. Guitar  B. Drums  C. Piano  D. Flute  (Answer: C)
15. A ______ (温暖的气候) benefits many flowers.
16. Which shape has three sides?
    A. Square  B. Rectangle  C. Triangle  D. Circle  (Answer: C)
17. What animal is known as "man's best friend"?
    A. Cat  B. Bird  C. Dog  D. Fish  (Answer: C)
18. The process of breaking down food involves __________.
19. I have _____ (two) pets.
20. In ______, America declared its _______ from Britain.
21. What is the main ingredient in bread?
    A. Sugar  B. Flour  C. Yeast  D. Water  (Answer: B)
22. What is the name of the famous structure in Egypt that was built as a tomb?
    A. Great Wall  B. Colosseum  C. Pyramids  D. Parthenon  (Answer: C)
23. The ______ (自然) has many wonders.
24. The _______ of sound can be perceived in different ways by different people.
25. The ______ shows the relationship between animals and plants.
26. My family has a ______ pet. (我的家里有一只______宠物。

The Future of Work: The Gig Economy and Remote Work


The future of work is rapidly evolving with the rise of the gig economy and remote work. This trend has been accelerated by the global pandemic, which forced many companies to adapt quickly to remote work arrangements. As we look ahead, it is clear that the traditional 9-to-5 office setup is no longer the only viable option for many workers. Instead, the gig economy and remote work are reshaping the way we approach employment and professional opportunities. This shift presents both exciting possibilities and significant challenges for workers, companies, and society as a whole.

From the perspective of workers, the gig economy and remote work offer unprecedented flexibility and autonomy. Freelancers and independent contractors have the freedom to choose their own projects, set their own schedules, and work from anywhere with an internet connection. This level of control over one's work life can lead to greater job satisfaction and work-life balance. Additionally, remote work eliminates the daily commute, reducing stress and allowing workers to reclaim valuable time that would otherwise have been spent in traffic or on public transportation. This newfound flexibility has the potential to reshape the traditional notion of work and give individuals the opportunity to craft a career that aligns with their personal values and priorities.

However, the gig economy and remote work also present challenges for workers, particularly in terms of job security and benefits. Unlike traditional full-time employment, gig workers often lack access to employer-sponsored healthcare, retirement plans, and other essential benefits. Furthermore, the fluctuating nature of gig work means that income can be unpredictable, making financial planning and stability more challenging. Remote work can also blur the boundaries between professional and personal life, leading to potential burnout and isolation if not managed effectively. As the workforce becomes increasingly decentralized, it is crucial to address these issues and ensure that all workers have access to the support and resources they need to thrive in this new landscape.

From the perspective of companies, the gig economy and remote work present opportunities to access a broader talent pool and reduce overhead costs. By hiring freelancers and remote workers, companies can tap into a global network of diverse skills and expertise without the constraints of geographical location. This not only fosters innovation and creativity but also allows businesses to scale more efficiently. Additionally, remote work arrangements can lead to increased productivity and employee retention, as workers appreciate the flexibility and freedom to tailor their work environment to their individual needs. Embracing the gig economy and remote work can position companies to stay competitive in a rapidly changing business environment.

However, companies also face challenges in managing and supporting a distributed workforce. Communication and collaboration can become more complex in a remote setting, requiring intentional effort to maintain a strong company culture and cohesive team dynamics. Additionally, ensuring data security and compliance with labor laws across different regions can be a daunting task for companies operating in the gig economy. As the nature of work continues to evolve, businesses must invest in the infrastructure and technology necessary to support and empower remote workers while upholding the values and integrity of the organization.

Societally, the gig economy and remote work have the potential to reshape the dynamics of urbanization and economic opportunity. Remote work allows individuals to live outside major metropolitan areas, easing the strain on infrastructure and potentially mitigating issues related to urban overcrowding. This could lead to more balanced regional development and reduced pressure on housing markets in large cities. Additionally, the gig economy creates opportunities for people who may have faced barriers to traditional employment, such as stay-at-home parents, individuals with disabilities, or those living in underserved communities. However, it is crucial to address the potential downsides of remote work, such as exacerbating inequalities in access to technology and widening the divide between those who can work remotely and those who cannot.

In conclusion, the future of work in the gig economy and remote work opens up a world of possibilities for workers, companies, and society at large. However, it also brings a unique set of challenges that must be carefully navigated. As we embrace this new era, it is essential to prioritize the well-being and inclusivity of all members of the workforce, while also harnessing the potential for innovation and progress. By approaching this shift with empathy and foresight, we can create a future of work that is truly fulfilling, sustainable, and equitable.

Software Terminology


New vocabulary:

Swimming lane design: swimlane diagram (a data-flow interaction diagram).

JIRA: a project and issue tracking tool from Atlassian, widely used for defect tracking, customer service, requirements gathering, workflow approval, task tracking, project tracking, and agile management.

JIRA is flexible to configure, full-featured, simple to deploy, and richly extensible.

Heartbeat: a component of the Linux-HA project that implements a high-availability cluster system.

The heartbeat service and cluster communication are the two key components of a high-availability cluster; in the Heartbeat project, both are implemented by the heartbeat module.

The following describes the heartbeat module's reliable messaging mechanism and introduces its implementation.
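As a toy illustration of the heartbeat idea described above (this is not the actual Linux-HA implementation; the class name, method names, and timeout values are invented for the sketch), a node is presumed dead when its periodic messages stop arriving within a timeout:

```python
import time


class HeartbeatMonitor:
    """Declares a peer failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> timestamp of last heartbeat

    def receive(self, peer, now=None):
        # Called whenever a heartbeat message from `peer` arrives.
        self.last_seen[peer] = time.monotonic() if now is None else now

    def alive(self, peer, now=None):
        # A peer is alive only if we have heard from it recently enough.
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(peer)
        return seen is not None and (now - seen) <= self.timeout


mon = HeartbeatMonitor(timeout=3.0)
mon.receive("node-a", now=100.0)
assert mon.alive("node-a", now=102.0)      # within the timeout window
assert not mon.alive("node-a", now=104.5)  # missed heartbeats: presumed dead
```

The real Heartbeat module adds reliable retransmission and cluster-wide messaging on top of this basic liveness idea.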

A

- Active-matrix 主动矩阵
- Adapter cards 适配卡
- Advanced application 高级应用
- Analytical graph 分析图表
- Analyze 分析
- Animations 动画
- Application software 应用软件
- Arithmetic operations 算术运算
- Audio-output device 音频输出设备
- Access time 存取时间
- Access 存取
- Accuracy 准确性
- Ad network cookies 广告网络信息记录软件
- Administrator 管理员
- Add-ons 插件
- Address 地址
- Agents 代理
- Analog signals 模拟信号
- Applets 程序
- Asynchronous communications port 异步通信端口
- Attachment 附件
- AGP (accelerated graphics port) 加速图形接口
- ALU (arithmetic-logic unit) 算术逻辑单元
- AAT (Average Access Time) 平均存取时间
- ACL (Access Control Lists) 访问控制表
- ACK (acknowledgement character) 确认字符
- ACPI (Advanced Configuration and Power Interface) 高级配置和电源接口
- ADC (Analog to Digital Converter) 模数转换器
- ADSL (Asymmetric Digital Subscriber Line) 非对称用户数字线路
- ADT (Abstract Data Type) 抽象数据类型
- AGP (Accelerated Graphics Port) 图形加速端口
- AI (Artificial Intelligence) 人工智能
- AIFF (Audio Image File Format) 声音图像文件格式
- ALU (Arithmetic Logical Unit) 算术逻辑单元
- AM (Amplitude Modulation) 调幅
- ANN (Artificial Neural Network) 人工神经网络
- ANSI (American National Standards Institute) 美国国家标准协会
- API (Application Programming Interface) 应用程序设计接口
- APPN (Advanced Peer-to-Peer Network) 高级对等网络
- ARP (Address Resolution Protocol) 地址分辨/转换协议
- ARPG (Action Role Playing Game) 动作角色扮演游戏
- ASCII (American Standard Code for Information Interchange) 美国信息交换标准代码
- ASP (Active Server Page) 活动服务器网页
- ASP (Application Service Provider) 应用服务提供商
- AST (Average Seek Time) 平均寻道时间
- ATM (asynchronous transfer mode) 异步传输模式
- ATR (Automatic Target Recognition) 自动目标识别
- AVI (Audio Video Interleaved) 声音视频接口
- Algorithm 算法

B

- Bar code 条形码
- Bar code reader 条形码读卡器
- Basic application 基础程序
- Beta testing: a form of acceptance testing (验收测试)

M20 Internet Backbone Router Datasheet


DATA SHEET

M20 Internet Backbone Router

The M20 router's compact design offers tremendous performance and port density. The M20 router has a rich feature set that includes numerous advantages:

- Route lookup rates in excess of 40 Mpps for wire-rate forwarding performance
- Aggregate throughput capacity exceeding 20 Gbps
- Performance-based packet filtering, rate limiting, and sampling with the Internet Processor II ASIC
- Redundant System and Switch Board and redundant Routing Engine
- Market-leading port density and flexibility
- Production-proven routing software with Internet-scale implementations of BGP4, IS-IS, OSPF, MPLS traffic engineering, class of service, and multicasting applications

The M20 Internet backbone router is a high-performance routing platform built for a variety of Internet applications, including high-speed access, public and private peering, hosting sites, and backbone core networks. The M20 router leverages proven M-series ASIC technology to deliver wire-rate performance and rich packet processing, such as filtering, sampling, and rate limiting. It runs the same JUNOS Internet software and shares the same interfaces that are supported by the M40 Internet backbone router, providing a seamless upgrade path that protects your investment.
Moreover, its compact design (14 in / 35.56 cm high) delivers market-leading performance and port density while consuming minimal rack space. The M20 router offers wire-rate performance, advanced features, internal redundancy, and scalability in a space-efficient package.

Advantages

"It [JUNOS software] dramatically increases our confidence that we will have access to technology to keep scaling along with what the demands on the network are. We can keep running." (Michael O'Dell, Chief Scientist, UUNET Technologies, Inc.)

Architecture

The two key components of the M20 architecture are the Packet Forwarding Engine (PFE) and the Routing Engine, which are connected via a 100-Mbps link. Control traffic passing through the 100-Mbps link is prioritized and rate limited to help protect against denial-of-service attacks.

- The PFE is responsible for packet forwarding performance. It consists of the Flexible PIC Concentrators (FPCs), physical interface cards (PICs), the System and Switch Board (SSB), and state-of-the-art ASICs.
- The Routing Engine maintains the routing tables and controls the routing protocols. It consists of an Intel-based PCI platform running JUNOS software.

The architecture ensures industry-leading service delivery by cleanly separating forwarding performance from routing performance. This separation ensures that stress experienced by one component does not adversely affect the performance of the other, since there is no overlap of required resources.

Leading-edge ASICs

The feature-rich M20 ASICs deliver a comprehensive hardware-based system for packet processing, including route lookups, filtering, sampling, rate limiting, load balancing, buffer management, switching, encapsulation, and de-encapsulation functions.
To ensure a non-blocking forwarding path, all channels between the ASICs are oversized, dedicated paths.

Internet Processor and Internet Processor II ASICs

The Internet Processor ASIC, which was originally deployed with M20 routers, supports an aggregated lookup rate of over 40 Mpps.

An enhanced version, the Internet Processor II ASIC, supports the same 40 Mpps lookup rate. With over one million gates, this ASIC delivers predictable, high-speed forwarding performance with service flexibility, including filtering and sampling. The Internet Processor II ASIC is the largest, fastest, and most advanced ASIC ever implemented on a router platform and deployed in the Internet.

Distributed Buffer Manager ASICs

The Distributed Buffer Manager ASICs allocate incoming data packets throughout shared memory on the FPCs. This single-stage buffering improves performance by requiring only one write to and one read from shared memory. There are no extraneous steps of copying packets from input buffers to output buffers. The shared memory is completely nonblocking, which in turn prevents head-of-line blocking.

I/O Manager ASICs

Each FPC is equipped with an I/O Manager ASIC that supports wire-rate packet parsing, packet prioritizing, and queuing. Each I/O Manager ASIC divides the packets, stores them in shared memory (managed by the Distributed Buffer Manager ASICs), and re-assembles the packets for transmission.

Media-specific ASICs

The media-specific ASICs perform physical layer functions, such as framing. Each PIC is equipped with an ASIC or FPGA that performs control functions tailored to the PIC's media type.

Packet Forwarding Engine

The PFE provides Layer 2 and Layer 3 packet switching, route lookups, and packet forwarding. The Internet Processor II ASIC forwards an aggregate of up to 40 Mpps for all packet sizes. The aggregate throughput is 20.6 Gbps half-duplex. The PFE supports the same ASIC-based features supported by all other M-series routers.
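For context on what a "route lookup" at the rates quoted above involves, here is a hedged, purely illustrative longest-prefix-match sketch in Python; the real lookup runs in the Internet Processor II ASIC, not in software, and the prefixes and next-hop names below are made up:

```python
import ipaddress

# Toy routing table: prefix -> next hop (example values only).
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
    ipaddress.ip_network("10.0.0.0/8"): "core-1",
    ipaddress.ip_network("10.1.0.0/16"): "edge-7",
}


def lookup(dst):
    """Return the next hop for the longest prefix matching `dst`."""
    addr = ipaddress.ip_address(dst)
    # Among all prefixes containing the address, pick the most specific.
    best = max((net for net in routes if addr in net),
               key=lambda net: net.prefixlen)
    return routes[best]


assert lookup("10.1.2.3") == "edge-7"     # /16 beats /8 and /0
assert lookup("10.9.9.9") == "core-1"
assert lookup("8.8.8.8") == "default-gw"
```

Hardware implementations replace this linear scan with specialized lookup structures, which is how aggregate rates like 40 Mpps become feasible.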
For example, class-of-service features include rate limiting, classification, priority queuing, Random Early Detection, and Weighted Round Robin to increase bandwidth efficiency. Filtering and sampling are also available for restricting access, increasing security, and analyzing network traffic.

(Figure: logical view of the M20 architecture, showing the Packet Forwarding Engine.)

Finally, the PFE delivers maximum stability during exceptional conditions, while also providing a significantly lower part count. This stability reduces power consumption and increases mean time between failure.

Flexible PIC Concentrators

The FPCs house PICs and connect them to the rest of the PFE. There is a dedicated, full-duplex, 3.2-Gbps channel between each FPC and the core of the PFE. You can insert up to four FPCs in an M20 chassis. Each FPC slot supports one FPC or one OC-48c/STM-16 PIC. Each FPC supports up to four of the other PICs in any combination, providing unparalleled interface density and configuration flexibility.

Each FPC contains shared memory for storing received data packets; the Distributed Buffer Manager ASICs on the SSB manage this memory. In addition, the FPC houses the I/O Manager ASIC, which performs a variety of queue management and class-of-service functions.

Physical Interface Cards

PICs provide a complete range of fiber-optic and electrical transmission interfaces to the network. The M20 router offers flexibility and conserves rack space by supporting a wide variety of PICs and port densities.
All PICs occupy one of four PIC spaces per FPC except for the OC-48c/STM-16 PIC, which occupies an entire FPC slot.An additional Tunnel Services PIC enables the M20 router to function as the ingress or egress point of an IP-IP unicasttunnel, a Cisco generic routing encapsulation (GRE) tunnel, or a Protocol Independent Multicast - Sparse Mode (PIM-SM) tunnel.For a list of available PICs, see the M-series Internet Backbone Routers Physical Interface Cards datasheet.System and Switch BoardThe SSB performs route lookup, filtering, and sampling, as well as provides switching to the destination FPC. Hosting both the Internet Processor II ASIC and the Distributed Buffer Manager ASICs, the SSB makes forwarding decisions,distributes data cells throughout memory , processes exception and control packets, monitors system components, and controls FPC resets. You can have one or two SSBs, ensuring automatic failover to a redundant SSB in case of failure.Routing EngineThe Routing Engine maintains the routing tables and controls the routing protocols, as well as the JUNOS software processes that control the router’s interfaces, the chassis components, system management, and user access to the router. These routing and software processes run on top of a kernel that interacts with the PFE.sThe Routing Engine processes all routing protocol updates from the network, so PFE performance is not affected.sThe Routing Engine implements each routing protocol with a complete set of Internet features and provides full flexibility for advertising, filtering, and modifying routes.Routing policies are set according to route parameters,such as prefixes, prefix lengths, and BGP attributes.You can install a redundant Routing Engine to ensuremaximum system availability and to minimize MTTR in case of failure.JUNOS Internet SoftwareJUNOS software is optimized to scale to large numbers of network interfaces and routes. 
The software consists of a series of system processes running in protected memory on top of an independent operating system. The modular design improves reliability by protecting against system-wide failure, since the failure of one software process does not affect other processes.

[Figure: M20 router front and back views]

Copyright © 2000, Juniper Networks, Inc. All rights reserved. Juniper Networks is a registered trademark of Juniper Networks, Inc. Internet Processor, Internet Processor II, JUNOS, M5, M10, M20, M40, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks may be the property of their respective owners. All specifications are subject to change without notice. Printed in USA. Part Number 100009-003 09/00.

Juniper Networks, Inc., 1194 North Mathilda Avenue, Sunnyvale, CA 94089 USA. Phone 408 745 2000 or 888 JUNIPER, Fax 408 745 2100. Juniper Networks, Inc. has sales offices worldwide; for contact information, refer to www.juniper.net/contactus.html.


Supporting Software Distributed Shared Memory with an Optimizing Compiler

Tatsushi Inagaki†  Junpei Niwa  Takashi Matsumoto  Kei Hiraki
Department of Information Science, Faculty of Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan
{inagaki, niwa, tm, hiraki}@is.s.u-tokyo.ac.jp

† Presently with Tokyo Research Laboratory, IBM Japan, Ltd.

Abstract

To execute a shared memory program efficiently, we have to manage memory consistency with low overheads, and have to utilize the communication bandwidth of the platform as much as possible. A software distributed shared memory (DSM) can solve these problems via proper support by an optimizing compiler. The optimizing compiler can detect shared write operations using interprocedural points-to analysis. It also coalesces shared write commitments onto contiguous regions and removes redundant write commitments using interprocedural redundancy elimination. A page-based target software DSM system can utilize communication bandwidth owing to the coalescing optimization. We have implemented the above optimizing compiler and a runtime software DSM on the AP1000+. We have obtained a high speed-up ratio with the SPLASH-2 benchmark suite. The result shows that using an optimizing compiler to assist a software DSM is a promising approach to obtaining good performance. It also shows that appropriate protocol selection at a write commitment is an effective optimization.

1. Introduction

Applications using software distributed shared memory (DSM) can run without the troubles of unnecessary memory copies and address translation which happen with the inspector/executor mechanism [22]. Most existing software DSM systems are designed on the assumption of using sequential compilers [23, 20, 19]. An executable object made by a sequential compiler only issues a shared memory access as an ordinary memory access (load/store). To utilize bandwidth, a runtime system has to buffer the remote memory accesses. There is another approach where a programmer can specify optimal granularity, protocol, and association between synchronization and shared data [3, 30]. However, with this approach, existing shared memory applications require rewriting.

Our idea is that an optimizing compiler directly analyzes shared memory source programs, and optimizes communication and consistency management for software DSM execution [28]. Our target is a page-based software DSM, asymmetric distributed shared memory (ADSM) [26, 25]. ADSM uses a virtual memory mechanism for shared reads, and uses explicit user-level consistency management code sequences for shared writes. This enables static optimization of shared write operations. Static optimizing information about them can reduce the overhead of the runtime system. Shasta [29] is another software DSM system assuming optimizing compiler support. Since the Shasta compiler analyzes objects generated by sequential compilers, it only performs limited local optimizations. Our compiler analyzes a source program directly. Therefore, it performs array data-flow analysis interprocedurally.

Here we have to solve the following three problems in order to show that our approach is effective. First, the compiler must perform sufficient optimization in reasonable compilation time. We have applied interprocedural points-to analysis [14, 31], and implemented interprocedural write set calculation, to detect and optimize shared write operations. We have found that the above powerful analysis is done in reasonable time. Second, the runtime system also must work efficiently. We had been using a history-based runtime system of lazy release consistency [28], but when the compiler cannot optimize, that system introduces a large runtime overhead and causes the growth of synchronization costs. Therefore, we have implemented a new page-based runtime system with the delayed invalidate release consistency (DIRC) model [12] to overcome these problems. We have made sure that the new system is more efficient than the history-based runtime system. Third, we have to provide an interface such that
users can give information which the compiler cannot extract statically. Memory access patterns of irregular applications depend on input parameters. It is difficult for a compiler to optimize copy management protocols statically. We have examined the effect of manual protocol selection on the bottleneck shared write operations of the program.

We have evaluated the performance with the SPLASH-2 benchmark suite [32]. SPLASH-2 is not only the most frequently used benchmark to evaluate shared memory systems, but also a benchmark suite with detailed algorithmic information about each program. We have manually optimized shared write protocols using these descriptions. We do not consider SPLASH-2 as "dusty deck". Our target is to investigate what information from a user or a compiler is required for the efficient execution of shared memory programs on software DSM.

Section 2 describes the process of compilation and optimization. Section 3 describes the implementation of the runtime software DSM. Section 4 describes performance evaluation with SPLASH-2. Section 5 describes related work about combinations of optimizing compilers and software DSM. Section 6 gives a summary.

2. Compilation Process

Figure 1 describes the overall compilation process. The input is a shared memory program written in C extended with PARMACS [4]. PARMACS provides the primitives for task creation, shared memory allocation, and synchronization (barrier, lock, and pause). The consistency of shared memory follows the lazy release consistency (LRC) model [20]. Our compiler inserts consistency management code sequences for software DSM into a given shared memory program. The backend sequential compiler compiles the instrumented source program and links it with a runtime library.

To inform the runtime system that a write happened onto a contiguous shared block, we use a pair formed by the initial address and the size of the block. We call this pair a (shared) write commitment. Besides the start address and the size, a write commitment also requires the written contents of the block. Therefore, we place a write commitment after the corresponding shared write operations. A single write commitment can represent a lot of shared writes onto a large contiguous region. When there are succeeding write commitments with the same parameters, we can eliminate all but the last one.

2.1. Shared Write Detection

The goal of our optimizing compiler is to insert valid write commitments and to decrease the number of write commitments as much as possible. First we have to enumerate all shared memory accesses in a given shared memory program. Since the input program is written in C, a shared address may be contained in a pointer variable and may be passed across procedure calls.

We have applied interprocedural points-to analysis [14, 31] to shared write detection. Interprocedural points-to analysis calculates symbolic locations where variables may point to. Variables and heap locations are represented with a location set, a tuple of a symbolic base address, an offset, and a stride. The compiler interprocedurally calculates points-to relations among location sets using a depth-first traversal of the call graph. We track the return values of the shared memory allocation primitive (G_MALLOC). We insert a write commitment after a write operation using shared address values. We adopted interprocedural points-to analysis because of the following merits:

- succeeding optimization passes can perform code motion using pointer information, and
- precise shared pointer information can decrease the costs of the redundancy elimination pass.

Points-to analysis represents all variables as memory locations. This is a conservative assumption in C. When an input program contains unions or type-castings, they may generate false alias information, which takes many iterations to converge. We assume that an input program is type-safe about pointer values, that is, pointer values are not conveyed through non-pointer locations. In points-to analysis, we only record pointer assignments into
pointer type locations. This assumption prevents generating false alias relations in a program with complex structures.

[Figure 1. Overall compilation process]

2.2. Redundancy Elimination

In the release consistency model, a shared write is not transmitted to other nodes until the node which issued the shared write reaches a synchronization. Therefore, it is valid to place a write commitment anywhere from the corresponding shared write to the first synchronization thereafter. We use this flexibility to remove redundant write commitments. For example, consider the following code sequence from LU:

    a[ii][jj] = ((double)lrand48())/MAXRAND;
    if (i == j) a[ii][jj] *= 10;

Suppose that a[ii][jj] is shared. It is valid to insert write commitments after both assignments. However, if we delay the first write commitment until after the conditional, the write commitment within the conditional is redundant. When we denote a write commitment as WC:

    a[ii][jj] = ((double)lrand48())/MAXRAND;
    if (i == j) a[ii][jj] *= 10;
    WC(&a[ii][jj], 1);

Note that this holds even if the order between the assignment and the conditional is the opposite.

This optimization can be formalized as redundancy elimination [8, 27] of write commitments. Here we represent a statement in a procedure as i. We can consider that i is a node of a control flow graph (CFG) of the procedure.
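The delayed placement above can be mimicked with a small stub. WC below is a hypothetical logging stand-in for the runtime entry point (the real runtime ships the written block to the page-home node); it only shows that a single commitment after the conditional covers both shared writes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's write-commitment call:
 * it only logs the (address, size-in-elements) pairs that would be
 * sent to the page-home node. */
#define MAX_WC 16
static struct { void *addr; size_t nelems; } wc_log[MAX_WC];
static int wc_count = 0;

static void WC(void *addr, size_t nelems) {
    wc_log[wc_count].addr = addr;
    wc_log[wc_count].nelems = nelems;
    wc_count++;
}

/* Instrumented LU fragment: the commitment for a[ii][jj] is delayed
 * past the conditional, so one WC covers both shared writes. */
static void lu_fragment(double a[4][4], int i, int j, int ii, int jj,
                        double val) {
    a[ii][jj] = val;        /* shared write #1 */
    if (i == j)
        a[ii][jj] *= 10;    /* shared write #2, same location */
    WC(&a[ii][jj], 1);      /* one commitment instead of two */
}
```

Even when i == j and the second write executes, only one commitment is logged; the placement is valid because no synchronization occurs between the writes and the WC.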
For simplicity, we fix a write commitment with the same address and the same size. From the result of points-to analysis, we obtain the following logical constants about each statement i:

- COMP_i: the statement i issues the shared write
- TRANS_i: the statement i propagates information about the shared write

TRANS_i is false when the statement i is a synchronization primitive or when the statement i modifies the parameters of the write commitment. We can calculate the following logical dataflow variables from these constants:

- Availability: in all paths which precede the statement i, the shared write is issued
- Anticipatability: in all paths which succeed the statement i, the shared write is issued

To minimize the number of write commitments, we place write commitments only where

- the shared write is available,
- the shared write is not available in one of the succeeding paths, and
- the shared write is not anticipatable.

We represent availability before and after execution of the statement i as AVIN_i and AVOUT_i. Similarly, we represent anticipatability as ANTIN_i and ANTOUT_i.
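As an illustrative sketch (not the paper's implementation), the two properties and the insertion predicate can be computed as a boolean fixpoint over the CFG of the LU example. The adjacency encoding and the convention that empty predecessor/successor sets yield 0 (nothing is committed outside the region) are assumptions of this sketch:

```c
#include <assert.h>

#define NSTMT 4
/* Toy CFG for the LU example, for one fixed write commitment:
 *   0: a[ii][jj] = ...      (COMP)
 *   1: if (i == j)
 *   2:     a[ii][jj] *= 10  (COMP)
 *   3: join / exit
 * Edges: 0->1, 1->2, 1->3, 2->3. */
static const int comp[NSTMT]  = {1, 0, 1, 0};
static const int trans[NSTMT] = {1, 1, 1, 1};
static const int edge[NSTMT][NSTMT] = {
    {0,1,0,0}, {0,0,1,1}, {0,0,0,1}, {0,0,0,0}
};

static int avin[NSTMT], avout[NSTMT], antin[NSTMT], antout[NSTMT];
static int insert_wc[NSTMT];

/* AND of v[] over predecessors (fwd=0) or successors (fwd=1);
 * an empty set yields 0: nothing is committed outside the region. */
static int and_over(int i, const int *v, int fwd) {
    int r = 1, any = 0;
    for (int k = 0; k < NSTMT; k++) {
        int connected = fwd ? edge[i][k] : edge[k][i];
        if (connected) { any = 1; r = r && v[k]; }
    }
    return any ? r : 0;
}

static void solve(void) {
    int changed = 1;
    while (changed) {                     /* iterate to a fixpoint */
        changed = 0;
        for (int i = 0; i < NSTMT; i++) {
            int ai = and_over(i, avout, 0);            /* AVIN   */
            int ao = comp[i] || (trans[i] && ai);      /* AVOUT  */
            int bo = and_over(i, antin, 1);            /* ANTOUT */
            int bi = comp[i] || (trans[i] && bo);      /* ANTIN  */
            if (ai != avin[i] || ao != avout[i] ||
                bo != antout[i] || bi != antin[i])
                changed = 1;
            avin[i] = ai; avout[i] = ao;
            antout[i] = bo; antin[i] = bi;
        }
    }
    /* Place a WC where the write is available, not available on
     * some succeeding path, and not anticipatable. */
    for (int i = 0; i < NSTMT; i++)
        insert_wc[i] = avout[i] && !and_over(i, avout, 1)
                                && !antout[i];
}
```

Running solve() marks only the join node (statement 3), matching the hand-derived placement of WC(&a[ii][jj], 1) after the conditional.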
INSERT_i is a variable which means we actually place the write commitment after the statement i. The variables are calculated with the dataflow equations in Figure 2. The primitives pred(i) and succ(i) represent the sets of statements preceding and succeeding the statement i.

    AVIN_i   = AND_{p in pred(i)} AVOUT_p
    AVOUT_i  = COMP_i  OR  (TRANS_i AND AVIN_i)
    ANTOUT_i = AND_{s in succ(i)} ANTIN_s
    ANTIN_i  = COMP_i  OR  (TRANS_i AND ANTOUT_i)
    INSERT_i = AVOUT_i  AND  NOT(AND_{s in succ(i)} AVOUT_s)  AND  NOT ANTOUT_i

Figure 2. Dataflow equations to remove redundant write commitments

To compute interprocedurally, we reflect AVOUT at the exit of the callee procedure in the COMP at the call site of the caller procedure. When the availability of the callee cannot be propagated to the caller, we insert write commitments at the exit of the callee. We call a procedure which is called recursively or called through function pointers an open procedure [7]. An open procedure does not inform availability to its call sites. Therefore, we can consider the call graph acyclic. The compiler simply calculates interprocedural availability with a bottom-up traversal of the call graph. If we want more precise elimination, the compiler can also traverse the call graph in a depth-first manner, which is not implemented yet.

2.3. Merging Multiple Write Commitments

A write commitment can handle shared write operations onto a contiguous region. For example, consider the following code sequence in LU:

    for (i = 0; i < n; i++) a[i] += alpha*b[i];

Suppose that a is a shared pointer. Instead of inserting a write commitment into the innermost loop, we can generate:

    for (i = 0; i < n; i++) a[i] += alpha*b[i];
    WC(a, n);

This code generation has two merits. First, the consistency management overhead is reduced because the write commitment is hoisted out of the loop. Second, the runtime system can utilize the size information for message vectorization.

To combine multiple write commitments, it is convenient to represent a sequence of write commitments as a (shared) write set. A write set W(f, s, C) is a tuple such that f is a start address of a write
commitment, s is a size, and C is a set of inequalities which generate write commitments. The inequalities C represent the induction variables of the loops enclosing the write commitment. A dataflow variable takes a set of write sets. The logical operations in the above dataflow equations are considered as set operations. Just after points-to analysis, each write set includes only one write commitment, i.e., s = 1, C = ∅. We use interval analysis [9, 5] to calculate the dataflow equations. In interval analysis, the CFG is represented hierarchically with interval (i.e., loop) structures. When the summary of an interval is propagated outward, inequalities which represent induction variables are added to C.

We describe optimizing methods to combine multiple write commitments using write sets.

Coalescing: This is applicable when write commitments onto contiguous locations are issued in a loop. Suppose a write set W(f(i), s, C_i), where the induction variable i has an increment value c. If f(i + c) = f(i) + s, we can replace i with its initial value, multiply s by the number of iterations, and remove the inequalities about i from C. For the above example,

    W(&a[i], 1, {0 <= i < n})  =>  W(a, n, ∅)

Coalescing is applicable only when the index variable is continuous. For example, consider the following code sequence in Radix:

    for (i = key_start; i < key_stop; i++) {
        this_key = key_from[i] & bb;
        this_key = this_key >> shiftnum;
        tmp = rank_ff_mynum[this_key];
        key_to[tmp] = key_from[i];
        rank_ff_mynum[this_key]++;
    } /* i */

Suppose key_to points to shared addresses. The variables rank_ff_mynum[this_key] are incremented by one whenever key_to[tmp] is written. Therefore, we can coalesce the write commitments using the initial and final values of rank_ff_mynum[this_key].

Fusion: We can also merge write commitments originating in different statements of the program. We represent this operation as a binary operator "⊕". For example, consider the following code sequence in FFT:

    for (i = 0; i < n1; i++) {
        x[2*i] /= N;
        x[2*i+1] /= N;
    }

Suppose x points to a shared address. With

    W = W(&x[2*i], 1, ∅),  W' = W(&x[2*i+1], 1, ∅),
    W ⊕ W' = W(&x[2*i], 2, ∅)

Redundant index elimination:
When the start address of a write commitment is a constant, we can delegate to the write commitment with the maximal size. If we can detect the maximum, the index variable is redundant. We can eliminate redundant indexes using Fourier-Motzkin elimination [11]. Fourier-Motzkin elimination is also applicable to nonlinear but monotonous expressions. For example, in the following write set in FFT,

    W = W(&x[2*2^q], N/2^q, {1 <= q <= M})

we can eliminate q, using the monotonicity of 2^q, and obtain

    W' = W(&x[4], N/2, ∅)

The names coalescing and fusion come from the similarity to loop transformations. When the dimension of the inequalities in C is decreased, the dimension of the generated loop of write commitments is decreased. When a write set is propagated outward from a loop without coalescing or index elimination, we add inequalities about the loop indexes into C. This corresponds to fission (or distribution) in loop transformations. Fission does not reduce the number of issued write commitments, but it improves memory access locality. Along the dataflow computation in interval analysis, the compiler repeatedly applies Fourier-Motzkin elimination to the expressions in innermost loops. We use the memoization technique [1], which stores and reuses results computed before.

3. Target Software DSM

We implemented a runtime library of ADSM on a Fujitsu AP1000+. The AP1000+ has dedicated hardware which executes remote block transfer operations (the put/get interface [18]). We assume that point-to-point message order is preserved.

Formerly, we had been using a history-based runtime system of lazy release consistency [28]. That implementation stores write commitments as a write history. When a synchronization primitive is issued, the page contents are written back to the page-home. This corresponds to a software emulation of automatic update release consistency (AURC) [19]. Diff-based
implementations compare whole page contents [20]. The history-based implementation can avoid this when the compiler successfully eliminates and coalesces the write commitments. However, the following two problems exist:

- When the compiler cannot optimize, history management introduces a large runtime overhead.
- We handle logical timestamps between each synchronization, as in LRC and AURC. Frequent synchronization causes long synchronization messages and the growth of synchronization costs.

This time, we have implemented a new page-based runtime system. The basic design is similar to that of SoftFLASH [15] with the delayed invalidate release consistency (DIRC) model [12]. We use write commitments for message vectorization.

3.1. Basic Design

Shared memory is managed in pages. Each page has a page-home node, and the user can specify which node it is. Each node manages the following bit tables, sized by the number of shared pages:

- Valid bit table: indicates that the page contents are valid.
- Dirty bit table: indicates that the node has written into the page within the current synchronization interval [20].

Each node also manages the following bit table, sized by the number of nodes:

- Acknowledge table: indicates that the node has written into a page of the corresponding page-home node.
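A minimal sketch of how these tables interact is shown below. The names and the single-address-space simulation are assumptions of this example; the actual messages to the page-home and the master node are elided:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 8

typedef struct {
    uint8_t valid[NPAGES];  /* local copy of the page is usable */
    uint8_t dirty[NPAGES];  /* written in the current interval  */
} node_state;

/* Write commitment: the written block has been put to the
 * page-home; locally we only mark the page dirty. */
static void on_write_commitment(node_state *n, int page) {
    n->dirty[page] = 1;
}

/* Acquire: apply a dirty table received from the lock-home,
 * invalidating every page written since our copy was fetched. */
static void apply_dirty(node_state *n, const uint8_t *dirty) {
    for (int p = 0; p < NPAGES; p++)
        if (dirty[p]) n->valid[p] = 0;
}

/* Barrier: the master merges all dirty tables and broadcasts the
 * result; every node invalidates with it and clears its own table. */
static void barrier(node_state *nodes, int nnodes) {
    uint8_t merged[NPAGES] = {0};
    for (int i = 0; i < nnodes; i++)
        for (int p = 0; p < NPAGES; p++)
            if (nodes[i].dirty[p]) merged[p] = 1;
    for (int i = 0; i < nnodes; i++) {
        apply_dirty(&nodes[i], merged);
        memset(nodes[i].dirty, 0, NPAGES);
    }
}
```

After node 0 commits a write to page 3 and a barrier runs, every node's copy of page 3 is invalid and all dirty tables are clear; a later access to page 3 fetches fresh contents from the page-home.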
Synchronization tags of locks and pauses are handled by specified synchronization-home (i.e., lock-home or pause-home) nodes. Each lock and pause has its own dirty bit table. We describe the behavior of the runtime system for each primitive.

When a write commitment is issued, the written memory contents are sent to the page-home node with a put operation. The size parameter of the write commitment corresponds to the length of the block transfer. The page-home node is recorded in the acknowledge table.

At an acquire operation, the node receives the dirty bit table from the lock-home processor. The obtained dirty bit table is applied to the valid bit table. The size of synchronization messages is limited by the dirty bit table size, because time information is not utilized at synchronization. However, if a node acquires the same lock again, a page may be invalidated even when the page has not been written between the lock acquisitions.

In a release operation, the node signals the nodes recorded in the acknowledge table and confirms that all sent messages have arrived at their destinations. Then the node sends the dirty bit table to the lock-home node.

When a page fault occurs, the page contents are copied from the page-home by a get operation.

At a barrier operation, the following steps are executed:

1. Each node confirms that all the preceding page-home updates are completed.
2. All nodes send their own dirty bit tables to the master node.
3. The master merges the sent dirty bit tables and broadcasts the merged one.
4. All nodes invalidate their copies using the sent dirty bit table.
5. Each node clears its dirty bit table and the dirty bit tables of the synchronization tags which it manages.

Communications at page faults and write commitments are handled asynchronously. Acquire and release operations are serialized by sending explicit messages to the synchronization-home nodes. Currently we use CellOS on the AP1000+. CellOS does not provide a signal mechanism to users. Therefore, shared memory accesses are not
handled by the virtual memory mechanism; instead, they are executed by code sequences which check the valid bit tables. The optimizing compiler inserts this code sequence before each shared memory access. The compiler also inserts message polling [29].

3.2. Protocol Selection at Write Commitment

The above runtime system provides a write-invalidate protocol. By modifying the behavior at a write commitment, we can select two other protocols [26, 25] at each write commitment.

Broadcast: At a write commitment, the writing node sends the written contents to all nodes. The node does not set the dirty bit table entry.

Home Only: The writer updates the page-home without making a copy. This is achieved by omitting the valid bit table checking of the corresponding shared write.

The broadcast protocol can reduce communication latency and alleviate false sharing. Broadcast is also useful to efficiently execute a program which is not properly labeled [16]. At the release operation after broadcasting, the sender node must wait for acknowledgments from all nodes. The home only protocol can reduce page fault traffic at fetch-on-write. The contents of the page and the state of the valid bit table entry are temporarily inconsistent until the succeeding synchronization. When a home only write and ordinary page accesses occur in the same page, this may cause incorrect page contents. We introduce the home only acknowledge table, which records the page-home node for home only write commitments. When a page fault occurs, the node checks this table and waits for an acknowledgment from the page-home node.

To perform protocol optimization, we have manually specified the type of the write commitments in the bottleneck parts of the generated source programs. When we implement the home only protocol using a virtual memory mechanism, we have to explicitly check the valid bit table at conflicting writes to avoid frequent page faults.

4. Performance Study with SPLASH-2

We used three
kernels (LU-Contig, Radix, FFT) and five applications (Barnes, Raytrace, Water-Nsq, Water-Sp, Ocean) from SPLASH-2.

4.1. Compilation Time

For redundancy elimination, we calculated availability with a bottom-up traversal of the call graph, and calculated anticipatability intraprocedurally. We show the compilation time of each program in Table 1. The compiler is run on a Sun SPARCstation 20 (with a 50 MHz SuperSPARC) + SunOS 4.1.3. "Scalar dataflow" represents the time to detect induction variables. Without the type-safe assumption, points-to analysis takes from 1.4 to 4.2 times longer for programs with structures containing pointers (Barnes, Raytrace, and Water-Sp) and for a program with pointer casting (Ocean).

Table 2. Input problem size and sequential execution time (in seconds)

    program      problem size             sequential
    LU-Contig    1024x1024 doubles        115.67
    Radix        1M integer keys          4.32
    FFT          64K complex doubles      2.10
    Barnes       16K bodies               54.68
    Raytrace     balls4, 128x128 pixels   349.38
    Water-Nsq    4096 molecules           800.08
    Water-Sp     4096 molecules           88.37
    Ocean        130x130 ocean            7.09

4.2. Runtime System

We show the problem size of each program and the sequential execution time on one node in Table 2. Each node of the AP1000+ consists of a 50 MHz SuperSPARC (20 KB I-cache and 16 KB D-cache) and 16 MB of memory. The nodes are linked by a 2D torus network whose bandwidth is 25 MB/s per link. The small problem size of Ocean is caused by the limit of the physical memory size.

The page table checking is implemented in software. If we used a virtual memory mechanism, there would be no checking overhead when the page is valid. Coalescing and redundancy elimination are also applicable to the software page table checking. We manually applied redundancy elimination to the checking codes, using an interprocedural algorithm similar to that for write commitments. We selected a 4 KB page size for the kernels and 1 KB for the applications. We used gcc 2.7.2 (optimization level -O2) as the backend compiler.

We modified the source codes of FFT and Raytrace. The transpose operation of the original FFT is written so that a receiver reads parts of the array. But
their page-home nodes are not the receivers but the senders. This causes severe false sharing. We rewrote the procedure Transpose so that a sender writes to the page-home of the receivers. In the original Raytrace, lock acquisition for the ray ID is a bottleneck for the execution. This ID is not used for any actual computation. We removed this lock operation.

For each program, we specified a page-home and a synchronization-home according to the optimization hints of SPLASH-2. We applied protocol optimization to Radix, FFT, Barnes, and Raytrace.

[Table 1. Compilation time of SPLASH-2 (in seconds)]

[Figure 4. Effects of protocol optimization on Radix and FFT]

In Figure 3, we show the effects of compiler optimization on 32-node execution. The left bar of each program is the [...] respectively mean executions without the broadcast protocol, coalescing, and the home only protocol. Though the write commitments in the innermost loop cause a large overhead, this part can be parallelized. Without the home only protocol, the performance saturates over 16 nodes because of heavy traffic. The broadcast protocol is effective also over 16 nodes. The right figure shows the speedup ratio of FFT. "Orig" means the execution of the original SPLASH-2 code. In FFT, the code restructuring of Transpose and protocol selection raise the maximal speedup ratio from 1.49 to 18.1.

In Figure 5, we show the speedup ratios of the programs with compiler optimization and protocol selection. Because of the low overheads of our runtime system and the utilization of the communication bandwidth, Raytrace, LU-Contig, Water-Nsq, and Water-Sp show high speedup ratios and good scalability. Both in Radix and FFT, appropriate protocol selection is crucial for scalability. The performance of Barnes saturates over 32 nodes. In Radix and Barnes, the principal overhead is synchronization, because of the problem decomposition. Only Ocean slows down, owing to the page fault handling, which is an overhead of the runtime system. This is mainly because of the small size of
the problem. As a whole, both compiler optimization and appropriate protocol specification are essential for the scalability of the input problem.

5. Related Work

The computation power of recent machines enables the application of interprocedural analysis to practical problems (e.g., interprocedural points-to analysis [14, 31], interprocedural array dataflow analysis [17], and interprocedural partial redundancy elimination [2]). So far, these advanced analyses have not been used for explicitly parallel shared memory programs.

Existing research about cooperation between optimizing compilers and software DSM can be divided into three kinds. The first is that a parallelizing compiler targets software DSM [21, 13, 24]. For parallelizable programs, the compiler can use precise communication information. Message vectorization is applicable to regular communication. The compiler can use code generation techniques for the inspector/executor mechanism. Software DSM does not require complex code generation for multi-level indirection. The runtime library has the benefit of message vectorization, synchronization messages, and support for sender-initiated communication. However, this policy is only applicable to automatically parallelizable programs.

The second is that a programmer declares shared data and the association between data and synchronization [3, 10, 30, 6]. The programmer can select appropriate protocols for each datum. The runtime system can utilize application-specific information. Since this model hides the memory model from users, the system does not suffer from false sharing. However, the message packing/unpacking mechanism must be implemented efficiently. Users also have to adjust parallel programs to the provided programming model.

[Figure 5. Speedup ratio up to 16 nodes (left) and up to 128 nodes (right)]

The third is that a compiler directly analyzes a shared memory program. Our system and Shasta [29] are classified in this kind. The Shasta compiler uses two optimizing techniques to reduce software
overheads. One is a special flag value which indicates that the content is invalid. If the loaded value is not equal to the flag value, we know that the content is valid without using the page table checking. The other is batching, to combine multiple checks for the same entry of the directory. These optimizations are intraprocedural. Since they do not perform loop-level optimization, their system requires both high network bandwidth and low latency.

6. Summary

We have shown that compiler support enables an efficient software DSM which can utilize communication bandwidth as much as possible. We designed an interface between a shared memory program and a runtime library, and established a coalescing and redundancy elimination problem for write commitments. Our framework enables applying interprocedural optimizations to a shared memory program. We have described the interprocedural optimization scheme and an efficient implementation of the runtime system. We have shown that appropriate write protocol selection is one important piece of application-specific information for efficient software DSM.

The redundancy elimination scheme in this paper decreases the number of write commitments as much as possible and makes the size of the write commitments as large as possible. Therefore, it issues write commitments as late as possible. This policy is suitable for the runtime system on the AP1000+, since the AP1000+ has a fast communication network. However, this is not always optimal, especially on machines with slower communication facilities. Our future work is to reflect this tradeoff of the platform in the dataflow equations.

Acknowledgments

We would like to thank the referees for their valuable comments and advice. This work is partly supported by the Advanced Information Technology Program (AITP) of the Information-technology Promotion Agency (IPA), Japan.
References

[1] H. Abelson, G. J. Sussman, and J. Sussman. Structure and Interpretation of Computer Programs. The MIT Press, Cambridge, MA, 1985.
[2] G. Agrawal, J. Saltz, and R. Das. Interprocedural Partial Redundancy Elimination and its Application to Distributed Memory Compilation.
