Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment


Using shared memory with Python multiprocessing

Contents: 1. The concept and purpose of shared memory; 2. Implementing shared memory between Python processes; 3. Examples and caveats when using shared_memory.

1. The concept and purpose of shared memory

Shared memory is a technique that lets multiple processes or threads share a region of data space.

In a multi-process or multi-threaded program, shared memory enables data exchange and synchronization between processes or threads, which improves the program's runtime efficiency.

Python provides the multiprocessing.shared_memory module (Python 3.8 and later) to implement shared memory between processes.

2. Implementing shared memory between Python processes

In Python, you can create processes with the multiprocessing module and use multiprocessing.shared_memory to share memory between them.

Here is a simple example:

```python
import multiprocessing as mp
from multiprocessing import shared_memory

def func(name):
    # Attach to the existing block by name and write a byte string into it.
    shm = shared_memory.SharedMemory(name=name)
    data = b"Hello, shared memory!"
    shm.buf[:len(data)] = data
    shm.close()                        # detach, but do not unlink here

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(name="my_shared_memory", create=True, size=1024)
    p = mp.Process(target=func, args=(shm.name,))
    p.start()
    p.join()
    print(bytes(shm.buf[:21]))         # b'Hello, shared memory!'
    shm.close()
    shm.unlink()                       # release the block once no process needs it
```

In this example, we first import multiprocessing and the multiprocessing.shared_memory module.

Then we define a function func, which receives the name of the shared memory block, attaches to it, and writes the byte string "Hello, shared memory!" into the shared memory.

In the main process, we create a shared memory block shm and start a child process p, passing func as the target and the block's name as its argument. After the child finishes, the parent reads the bytes back, then closes and unlinks the block.
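One caveat the example above glosses over is synchronization: shared_memory only hands you raw bytes, so concurrent writers must coordinate themselves. The following is a minimal sketch of protecting a shared 8-byte counter with a multiprocessing.Lock; the single-integer layout and the names used here are illustrative choices, not part of the module's API.

```python
import struct
import multiprocessing as mp
from multiprocessing import shared_memory

def add_one(name, lock, times):
    shm = shared_memory.SharedMemory(name=name)
    for _ in range(times):
        with lock:                                # serialize the read-modify-write
            (value,) = struct.unpack_from("q", shm.buf, 0)
            struct.pack_into("q", shm.buf, 0, value + 1)
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8)
    struct.pack_into("q", shm.buf, 0, 0)          # initialize the counter to 0
    lock = mp.Lock()
    workers = [mp.Process(target=add_one, args=(shm.name, lock, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    (value,) = struct.unpack_from("q", shm.buf, 0)
    print(value)                                  # 4000 only because writes are locked
    shm.close()
    shm.unlink()
```

Without the lock, the four workers' increments would race and the final count would usually come up short.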

Inter-process communication in the operating system

Process communication: how message-passing communication is implemented

Communication links:

The first approach (used mainly in computer networks): before communicating, the sending process issues an explicit "establish connection" request asking the system to set up a communication link, and the link is torn down after use.

The second approach (used mainly in single-machine systems): the sending process does not need to request a link explicitly; it simply uses the send primitive provided by the system, and the system automatically establishes a link for it.

Mailbox characteristics: (1) every mailbox has a unique identifier; (2) messages are kept safely in the mailbox and may be read at any time, but only by authorized users; (3) mailboxes can be used for both real-time and non-real-time communication.

Process communication: mailbox communication

Mailbox structure:

A mailbox is defined as a data structure that logically consists of:

1. a mailbox header, which stores descriptive information about the mailbox, such as its identifier, owner, password, and number of free message slots;

Communication based on a shared memory region: a shared region is set aside in memory, and processes communicate by reading and writing data in that region. This approach suits the transfer of large amounts of data.

Process communication: message-passing systems

Message-passing mechanism: data is exchanged between processes in units of messages, and the programmer implements communication with the system's communication primitives (send is executed to transmit a message; receive is executed when the receiver wants to accept one). This style of communication is classified as high-level communication.
Primitive description

Example: the message-buffer-queue communication mechanism

1. Description of the communication

[Figure: process A fills its send area a with a message (sender: A, size: 5, text: "Hello") and calls send(B, a); process B calls receive(b) to copy the message into its receive area b. PCB(B) holds the message-queue head pointer mq, the mutex semaphore mutex, and the resource semaphore sm; Emphead heads the list of empty message buffers, and each queued buffer records sender, size, text, and a next pointer.]
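The figure's send/receive primitives can be sketched in Python with multiprocessing objects standing in for the kernel structures; the queue, mutex, and counting semaphore below mirror the mq, mutex, and sm fields above, but this is only an illustration, not OS-level code.

```python
import multiprocessing as mp

def process_a(mq, mutex, sm):
    msg = {"sender": "A", "size": 5, "text": "Hello"}   # contents of A's send area
    with mutex:               # wait(mutex): exclusive access to B's message queue
        mq.append(msg)        # link the message buffer onto the queue
    sm.release()              # signal(sm): one more message is available

def process_b(mq, mutex, sm):
    sm.acquire()              # wait(sm): block until a message has arrived
    with mutex:
        msg = mq.pop(0)       # unlink the first buffer into B's receive area
    print("B received:", msg)

if __name__ == "__main__":
    manager = mp.Manager()
    mq = manager.list()       # the message queue that would hang off PCB(B)
    mutex = mp.Lock()         # mutual-exclusion semaphore protecting the queue
    sm = mp.Semaphore(0)      # resource semaphore counting queued messages
    b = mp.Process(target=process_b, args=(mq, mutex, sm))
    a = mp.Process(target=process_a, args=(mq, mutex, sm))
    b.start()
    a.start()
    a.join()
    b.join()
```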

Introduction to Advanced Computer Architecture and Parallel Processing

Most computer scientists agree that there have been four distinct paradigms, or eras, of computing:

1. Batch Era

The typical batch-processing machine had punched-card readers, tapes, and disk drives, but no connection beyond the computer room. The IBM System/360 had an operating system, multiple programming languages, and 10 megabytes of disk storage. The System/360 filled a room with metal boxes and people to run them. Its transistor circuits were reasonably fast. The machine was large enough to support many programs in memory at the same time, even though the central processing unit had to switch from one program to another.
By 1965 the IBM System/360 mainframe dominated the corporate computer centers.
The mainframes of the batch era were firmly established by the late 1960s, when advances in semiconductor technology made solid-state memory and the integrated circuit feasible. These advances in hardware technology spawned the minicomputer era. Minicomputers were small, fast, and inexpensive enough to be spread throughout a company at the divisional level. However, they were still too expensive and difficult to use.

Chongqing University Operating Systems final exam

Chongqing University, Operating Systems course examination (Paper A / Paper B), first semester of the 2016-2017 academic year. School: School of Computer Science. Course No.: 18012035. Format: open-book / closed-book / other. Time allowed: 120 minutes.

Part I: True / False Questions (12 points)
1. ( ) The OS is a kind of application program; it manages all hardware resources so that they work together.
2. ( ) A relocation register is used to check for invalid memory addresses generated by a CPU.
3. ( ) Monitors are a theoretical concept and are not practiced in modern programming languages.
4. ( ) When a user-level thread is created, it cannot be scheduled directly by the kernel because the kernel is not aware of it.
5. ( ) Most SMP systems try to avoid migration of processes from one processor to another and attempt to keep a process running on the same processor. This is known as processor affinity.
6. ( ) A record (counting) semaphore may cause the problem of busy waiting.
7. ( ) A deadlocked state is an unsafe state, and all unsafe states are deadlocks.
8. ( ) In segmentation memory management, accessing an operand requires two memory accesses.
9. ( ) System thrashing occurs when there are lots of page faults. It can result in severe performance problems.
10. ( ) All files in a single-level directory must have unique names.
11. ( ) When continuously reading data on the same cylinder but a different disk surface, it is not necessary to move the heads.
12. ( ) Users can use the computer hardware features without going through the operating system.

Part II: Single Choice (22 points)
1. Which one of the following descriptions of the command interpreter is correct? ( )
   A. the interface between the user and the OS   B. allows users to directly enter commands   C. in the kernel or as a special program   D. the program that interprets commands
2. Which of the following is not correct concerning memory sharing and message passing? ( )
   A. Shared memory is faster than the message-passing scheme because data sharing does not need to switch between kernel and user space
   B. The shared-memory scheme does not need kernel support; users can do it by themselves
   C. Message passing relies heavily on the support of the kernel
   D. The message-passing scheme is easy for users to use since most of its functionality is provided by the kernel
3. Among the five states of a process, the ( ) state can be entered from the other three states.
   A. NEW   B. RUN   C. READY   D. WAIT
4. A thread is a basic unit of CPU utilization. It shares with the other threads belonging to the same process its ( )
   A. code section   B. program counter   C. register set   D. stack
5. Assume 3 processes want to enter a critical section and S is the mutual-exclusion semaphore. Its minimum and maximum values are ( )
   A. -3, 3   B. -3, 0   C. -3, 1   D. -2, 1
6. In virtual memory, what kind of address is used by the CPU? ( )
   A. physical address   B. linear address   C. logical address   D. relative address
7. In a paging system the page size is 1K bytes. If a process has the page table below and the logical address of an instruction is 463, its corresponding physical address is ( )
   Page ID: 0, 1, 2, 3  ->  Frame ID: 4, 6, 7, 9
   A. 2660   B. 7583   C. 7168   D. 4559
8. In order to make better use of memory space, which of the following methods can be used? ( )
   A. caching   B. swapping   C. SPOOLing   D. absolute loading
9. Belady's anomaly may occur in the ( ) page-replacement algorithm.
   A. OPT   B. FIFO   C. LRU   D. none
10. Which of the following CPU scheduling algorithms is non-preemptive? ( )
   A. FCFS   B. SJF   C. Priority   D. round robin
11. Magnetic tape was used as an early secondary-storage medium. Files stored on it can be accessed by ( )
   A. direct access   B. sequential access   C. indexed access   D. none of the above

Part III: Fill in the blanks (10 points)
1. Operating systems can be designed with different structures, including simple structure, layered, ______ and ______.
2. ______ is the important structure for a process. It includes much information about a specific process.
3. There are three types of operations that can be performed on a semaphore, including ______, ______ and ______.
4. We can classify page-replacement algorithms into two broad categories: ______ allows a process to select a replacement frame from the set of all frames; ______ requires that each process select only from its own set of allocated frames.
5. The time to move the disk arm to the desired cylinder is called ______.
6. A ______ controller can control the device to directly access main memory.

Part IV: Short Answer Questions (32 points)
1. Please list all types of processor scheduling in a computer and explain the main tasks of each type.
2. Why are two modes (user and kernel) needed?
3. Please list as many deadlock recovery schemes as possible (at least 2) and explain their advantages and disadvantages.
4. Please explain the difference between internal and external fragmentation.
5. Please explain why we need to use a TLB for memory accesses. What is the principle of the TLB?
6. Please explain the role of the file directory and the organizational structures of file directories.
7. Briefly describe the steps taken to read a block of data from the disk into memory using DMA-controlled I/O.
8. Please explain what cache and buffer are. What is their difference?

Part V: Integrated Exercises (24 points)
1. The OS allocates 4 page frames to each active process. Initially, no page is in main memory. If a process demands pages in the order 3, 4, 5, 6, 1, 0, 2, 3, 6, 3, 2, 1, please use the OPT, LRU, and CLOCK policies separately to replace pages in memory, and calculate the total number of page faults for each.
2. Consider the following snapshot of a system with five processes (p1 ... p5) and four resources (r1 ... r4). There are no currently outstanding queued unsatisfied requests.
   Process   Allocation (R1 R2 R3 R4)   Max (R1 R2 R3 R4)   Available (R1 R2 R3 R4)
   P1        0 0 1 2                    0 0 1 2             2 1 0 0
   P2        2 0 0 0                    2 7 5 0
   P3        0 0 3 4                    6 6 5 6
   P4        2 3 5 4                    4 3 5 6
   P5        0 3 3 2                    0 6 5 2
   a) What is the content of the matrix Need?
   b) Is this system currently deadlocked, or will any process become deadlocked? Why or why not? If not, give an execution order.
   c) If a request from p3 arrives for (0, 1, 0, 0), can that request be safely granted immediately? Why?
3. Assume there are 5000 cylinders (numbered 0-4999) on a disk. The read-write head is at cylinder 143 right now, and the previous position was cylinder 125. The pending request queue is 86, 1470, 913, 1774, 948, 1509, 1022, 1750, 130.
   a) Starting from the current head position, list the access sequences for the FCFS, SSTF and SCAN disk-scheduling algorithms.
   b) Considering state-of-the-art storage media, some media have no arm and hence no seek latency. In that case, which of the above schedulers would be best? Describe your reason.

Parallel computing architectures
Multicomputers (multiple address spaces, non-shared memory) - NORMA: No-Remote Memory Access

The latest TOP500 machines

Cray's "Jaguar" tops the list with a peak of 1.75 PFlop/s (1,750 trillion floating-point operations per second); it uses 224,162 processor cores.

Structural models

Shared memory / symmetric multiprocessor systems (SMP)

PVP: parallel vector processors

Single address space, shared memory access. SMP (Shared Memory Processors): multiple processors interconnected with the shared memory through a crossbar switch or a bus.

China's Dawning "Nebulae" system ranks second, with a peak speed of 1,271 trillion operations per second.
- It uses an independently designed HPP architecture and efficient heterogeneous cooperative computing techniques.
- Its processors are 32 nm six-core Xeon X5650 chips, with Nvidia Tesla C2050 GPUs acting as coprocessors in the user programming environment.

Heterogeneous architecture: special-purpose and general-purpose parts.

85% of the systems in the TOP500 use quad-core processors, and 5% of the systems already use ...

Cluster systems

Cluster (NOW, COW): individual nodes are connected by commodity networks (Ethernet, Myrinet, Quadrics, InfiniBand, switches) to form a cluster.
- Each node is a complete computer (SMP or DSM) with its own disk and operating system.

The system is physically distributed but logically shared; each node has its own independent address space.
- Single address space, distributed shared memory
- NUMA (Non-Uniform Memory Access)

Inter-process communication in QNX

Inter-process communication (IPC) in QNX: in QNX Neutrino, message passing is the primary form of IPC, and the other forms are all implemented on top of message passing.

QNX provides the following forms of IPC (Service: Implemented in):
- Message passing: kernel
- Signals: kernel
- POSIX message queues: external process
- Shared memory: process manager
- Pipes: external process
- FIFOs: external process

I. Synchronous message passing

If a thread calls MsgSend() to send a message to another thread (which may belong to a different process), it is blocked until the target thread has called MsgReceive(), processed the message, and then called MsgReply().

If a thread calls MsgReceive() before any other thread has called MsgSend(), it is blocked until another thread calls MsgSend().

Message passing is implemented as a direct memory copy.

When very large messages must be passed, it is recommended to use shared memory or another mechanism instead.

1. State transitions during message passing

Client-side state transitions:
- SEND blocked: the client has called MsgSend(), but the server has not yet called MsgReceive().
- REPLY blocked: the client has called MsgSend() and the server has called MsgReceive(), but has not yet called MsgReply()/MsgError().
- If the server has already called MsgReceive(), the client moves directly into the REPLY-blocked state as soon as it calls MsgSend().
- READY: the client has called MsgSend(), and the server has called both MsgReceive() and MsgReply().

Server-side state transitions:
- RECEIVE blocked: the server has called MsgReceive(), but no client has yet called MsgSend().

C#/.NET multi-process synchronized communication via shared memory: memory-mapped files (MemoryMappedFile) [reposted]

There are two models for communication between nodes: shared memory and message passing.

Memory-mapped files may seem unfamiliar to developers in the managed world, but they are in fact a very old technique and hold a correspondingly important place in the operating system.

In practice, any communication model that wants to share data uses them behind the scenes.

So what exactly is a memory-mapped file? A memory-mapped file lets you reserve a region of address space and then map physical storage into that region, after which you operate on the memory directly.

The physical storage is managed as a file, while the memory-mapped file itself is handled by operating-system-level memory management.

Advantages: 1. accessing the data in a file on disk does not require explicit I/O and buffering operations (the effect is especially noticeable when accessing file data); 2. multiple processes running on the same machine can share data (it is the most efficient way to exchange data between processes on a single machine). Using the mapping between the file and the memory space, applications (including multiple processes) can modify the file by reading and writing directly in memory.

.NET Framework 4 lets managed code access memory-mapped files in the same way that native Windows functions access them.

There are two kinds of memory-mapped files.

Persisted memory-mapped files: a persisted file is a memory-mapped file that is associated with a source file on disk.

When the last process has finished working with the file, the data is saved to the source file on disk.

These memory-mapped files are suitable for working with extremely large source files.

Non-persisted memory-mapped files: a non-persisted file is a memory-mapped file that is not associated with a source file on disk.

When the last process has finished using the file, the data is lost and the file is reclaimed by garbage collection.

These files are suitable for creating shared memory for inter-process communication (IPC).

1) A memory-mapped file can be shared among multiple processes (processes map to the same file by using a common name assigned by the process that created it).

2) To work with a memory-mapped file, you must create a complete view or a partial view of it.

You can also create multiple views onto the same portion of the memory-mapped file, thereby creating concurrent memory.

For two views to be concurrent, they must be created from the same memory-mapped file.

3) Multiple views are also needed if the file is larger than the logical address space available to the application for memory mapping (2 GB on a 32-bit machine).
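The section above describes the .NET MemoryMappedFile API. To keep this document's examples in one language, here is a minimal sketch of the same operating-system facility using Python's mmap module; the file name and sizes are illustrative, and in a real IPC setup the writer and reader would be separate processes mapping the same backing file.

```python
import mmap
import os

PATH = "shared_region.bin"    # hypothetical backing file shared by the processes
SIZE = 1024

# Create the backing file and reserve SIZE bytes of physical storage for the mapping.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

# "Writer" side: map a view of the file and modify it with plain memory writes.
with open(PATH, "r+b") as f:
    view = mmap.mmap(f.fileno(), SIZE)   # a full view of the mapping
    view[:5] = b"hello"
    view.flush()                         # push the change back to the source file
    view.close()

# "Reader" side: a second process would open the same file and map its own view.
with open(PATH, "r+b") as f:
    view = mmap.mmap(f.fileno(), SIZE)
    print(bytes(view[:5]))               # b'hello'
    view.close()

os.remove(PATH)
```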

English vocabulary for the Operating Systems course

English vocabulary for the Operating Systems course (compiled by Guo Jisheng):

1. Operating System; 2. Computer; 3. Core Image; 4. Super-user; 5. Process; 6. Threads; 7. I/O (Input/Output); 8. Multiprocessor Operating Systems; 9. Personal Computer Operating Systems; 10. Real-Time Operating Systems; 11. Processor; 12. Memory; 13. Interprocess Communication; 14. I/O Devices; 15. Buses; 16. Deadlock; 17. Memory Management; 18. Input/Output; 19. Files; 20. File System;
21. File Extension; 22. Sequential Access; 23. Random Access File; 24. Attribute; 25. Absolute Path; 26. Relative Path; 27. Security; 28. System Calls; 29. Operating System Structure; 30. Layered Systems; 31. Virtual Machines; 32. Client/Server Mode; 33. Threads; 34. Scheduler Activations; 35. Semaphores; 36. Binary Semaphore; 37. Mutexes; 38. Mutual Exclusion; 39. Priority; 40. Monitors;
41. Monitor; 42. Pipe; 43. Critical Section; 44. Busy Waiting; 45. Atomic Action; 46. Synchronization; 47. Scheduling Algorithm; 48. Preemptive Scheduling; 49. Nonpreemptive Scheduling; 50. Hard Real Time; 51. Soft Real Time; 52. Scheduling Mechanism; 53. Scheduling Policy; 54. Task; 55. Device Driver; 56. Memory Manager; 57. Bootstrap; 58. Quantum; 59. Process Switch; 60. Context Switch;
61. Relocation; 62. Bitmaps; 63. Linked List; 64. Virtual Memory; 65. Page; 66. Page Frame; 67. Page Frame; 68. Modify; 69. Reference; 70. Associative Memory; 71. Hit Ratio; 72. Message Passing; 73. Directory; 74. Special File; 75. Block Special File; 76. Character Special File; 77. Character Device; 78. Block Device; 79. Error-Correcting Code; 80. Direct Memory Access;
81. Uniform Naming; 82. Preemptable Resource; 83. Nonpreemptable Resource; 84. First-Come First-Served; 85. Shortest Seek First; 86. Elevator Algorithm; 87. Boot Parameter; 88. Clock Tick; 89. Kernel Call; 90. Client Process; 91. Server Process; 92. Condition Variable; 93. Mailbox; 94. Acknowledgement; 95. Starvation; 96. Null Pointer; 97. Canonical Mode; 98. Noncanonical Mode; 99. Code Page; 100. Virtual Console;
101. Cache; 102. Base; 103. Limit; 104. Swapping; 105. Memory Compaction; 106. Best Fit; 107. Worst Fit; 108. Virtual Address; 109. Page Table; 110. Page Fault; 111. Not Recently Used; 112. Least Recently Used; 113. Working Set; 114. Demand Paging; 115. Prepaging; 116. Locality of Reference; 117. Thrashing; 118. Internal Fragmentation; 119. External Fragmentation; 120. Shared Text; 121. Incremental Dump; 122. Capability List; 123. Access Control List.


Shared Memory vs. Message Passing: the COMOPS Benchmark Experiment

Yong Luo
CIC-19, Mail Stop B256
Email: yongl@
Los Alamos National Laboratory
April 15, 1997

Abstract

This paper presents the comparison of the COMOPS benchmark performance in MPI and shared memory on three different shared memory platforms: the DEC AlphaServer 8400/300 (Turbolaser), the SGI Power Challenge, and the HP-Convex Exemplar SPP1600. The paper also qualitatively analyzes the obtained performance data based on an understanding of the corresponding architectures and the MPI implementations. Some conclusions are made for the inter-processor communication performance on these three shared memory platforms.

Introduction

Parallel computing on shared memory multi-processors has become an effective method to solve large scale scientific and engineering computational problems. Both MPI and shared memory are available for data communication between processors on shared memory platforms. Normally, performing inter-processor data communication by copying data into and out of an intermediate shared buffer seems natural on a shared memory platform. However, some vendors have recently claimed that their customized MPI implementations performed better than the corresponding shared memory protocol on their shared memory platforms, even though the MPI protocol was originally designed for distributed memory multi-processor systems. This situation makes it hard for users to choose the best tool for inter-processor communication on those shared memory platforms on which both MPI and shared memory protocols are available. In order to clarify this confusion, a comparison experiment was conducted to illustrate the communication performance for the COMOPS operations on major shared memory platforms. This report presents the experimental results and some qualitative analyses to interpret them.

This report has four sections. In the first section, the architectures of the three shared memory platforms are briefly described. The implementation details of the experiment are described in the second section, which also discusses the shared memory simulation of the communication patterns defined in the COMOPS benchmark set. The third section presents the data and analyses: it graphically exhibits the collected communication performance data and qualitatively interprets the performance behavior based on an understanding of the underlying architectures. In the final section, some conclusions and recommendations are made regarding the inter-processor communication performance on the three shared memory platforms.

Architectures

Currently there are two types of shared memory connections for multi-processor systems. One is the bus-connected shared memory system illustrated in Figure 1. The DEC AlphaServer 8400/300 and the SGI Power Challenge have this type of architecture. In this type of system every processor has equal access to the entire memory system through the same bus. Another type of shared memory multi-processor connection architecture is the crossbar switch. This crossbar connection is a typical connection mechanism within one hypernode of many distributed shared memory (DSM) systems such as the HP-Convex Exemplar and the NEC SX-4.

Figure 1. Bus-connected shared memory multiprocessors

Figure 2. Convex Exemplar Hypernode Structure

The Exemplar SPP architecture is shown in Figure 2. The Convex machine we have access to (courtesy of Convex) is a one-hypernode 8-processor machine. The inter-hypernode connection is irrelevant to this experiment, and this report focuses on the intra-hypernode structure only.

The memory access pattern and the physical distance between two processors are different in bus-connected and distributed shared memory systems. In a bus-connected shared memory structure, the memory access for each processor is uniform. But in a distributed shared memory structure, the memory access is non-uniform. This structure is called a NUMA (Non-Uniform Memory Access) architecture. Also, the inter-processor communication in bus-connected shared memory systems is homogeneous, and every processor is equi-distant to any other processor in the same system. On the other hand, in a NUMA system such as the Convex SPP, a processor always has some neighbors electrically closer than the others in the system. As illustrated in Figure 2, even though the memory access is still uniform within one hypernode of the SPP1600, each processor is electrically closer to the one sharing the same agent, because it does not need to go through the crossbar switch for the inter-processor communication.

In this experiment, none of the three shared memory machines has a physical implementation for CPU-private or thread-private memory. In a bus-connected multi-processor system, such as the SGI Power Challenge and the DEC AlphaServer 8400/300 (nickname Turbolaser), the memory system is purely homogeneous. Therefore, there is no physical distinction between a logically-private memory space and a logically-shared memory space. For the NUMA system SPP1600, although it is a DSM system, its CPU-private or thread-private memory is not physically implemented (HP-Convex, 1994). Instead, the operating system partitions hypernode-private memory (memory modules within one hypernode) to be used as CPU-private memory for each of the processors in the hypernode. The reason for this is that implementation of a physical CPU-private memory would not result in substantially lower CPU-to-memory latency, and the latency from a processor to hypernode-private memory would be increased (HP-Convex, 1994).

The Experimental Method

The direct objective of this experiment is to clarify the difference in the performance of inter-processor communication between the shared memory protocol and the message passing protocol on a shared memory platform. To achieve this goal, the common inter-processor communication operations specified in the LANL COMOPS benchmark set are used to perform the comparison. The point-to-point communication operation actually used in this experiment is ping-pong. The tested collective operations include broadcast and reduction.

The COMOPS benchmark set is designed to measure the performance of inter-processor point-to-point and collective communication in MPI. It measures the communication bandwidth and message transfer time for different message sizes. The set includes ping-pong, broadcast, reduction, and gather/scatter operations. The MPI performance measurement can be directly performed on the three platforms with the corresponding best available MPI implementation. Both SGI and HP-Convex have their own customized MPI implementations on their shared memory platforms. Although the current version of the MPI implementation on our DEC AlphaServer 8400/300 Turbolaser is a public-domain MPICH version, according to information from DEC this MPICH implementation performs no worse than the DEC customized version of MPI within one shared memory multi-processor box. The main effort of this experiment is to write a shared memory version of the COMOPS benchmark set. The shared memory version of these communication operations is illustrated in the following pseudo-code.

Broadcast:
    call timer
    do ntimes
        if (my_thread .eq. 0) then
            shared_tmp = private_send    !! Thread 0 sends out the message
        endif
        barrier                          !! synchronization
        if (my_thread .ne. 0) then       !! The other threads receive the
            private_recv = shared_tmp    !! message simultaneously
        endif
        barrier                          !! synchronization
    enddo
    call timer

Reduction (global max):
    call timer
    do ntimes
        critical section
            shared_tmp = max(shared_tmp, private_send)
        end critical section
        barrier                          !! synchronization
        if (my_thread .eq. 0) then       !! Thread 0 collects the final
            private_recv = shared_tmp    !! result
        endif
    enddo
    call timer

This experiment actually involves two versions of shared memory code because of the different shared memory programming environments. The shared memory programming environment on both the DEC AlphaServer and the SGI Power Challenge systems is compatible with the PCF (Parallel Computing Forum) standard. Therefore, only one version of code is needed for these two machines. The Convex shared memory programming feature in Fortran is slightly different. In particular, in the ping-pong operation, a lock-and-wait mechanism, instead of the general synchronization barrier, can be used for the synchronization between Processor 0 and Processor 1.

As shown in the pseudocode list, only one pair of processors participates in the ping-pong operation, regardless of the total number of processors involved. The collective communication operations involve all the processors in the run. The shared memory version accomplishes the same operations performed in the original MPI version of the COMOPS benchmark.

Performance Data and Analysis

The original MPI COMOPS benchmark set and the equivalent multi-thread shared memory version have been run on the three platforms outlined in Table 1 (SGI, 1995; DEC, 1995; Reed, 1996). On both the SGI and Convex machines, the vendor's customized version of MPI is used in this experiment. On the DEC Alpha machine, a public-domain MPI implementation (MPICH) is used.

TABLE 1. Three Tested Shared Memory System Configurations

The collected performance data are illustrated in Figures 3 through 15. Figures 3 through 5 exhibit the cross-platform bandwidth comparison and the comparison between the shared memory communication protocol and the message passing communication protocol. These performance data are all obtained using four processors with different message sizes. It is clear that the performance of the SGI MPI is superior to the other ones and also better than its corresponding shared memory performance on all three communication operations (ping-pong, broadcast, and reduction).

More specifically, on the SGI Power Challenge, MPI is more than twice as fast as shared memory for the ping-pong performance. The broadcast performance on this SGI shared memory machine is about the same for MPI and shared memory.

The DEC AlphaServer 8400/300 has comparable MPI and shared memory performance for the ping-pong operation. But for all the tested collective operations (broadcast and reduction), its shared memory bandwidth is considerably higher than the MPI bandwidth.

On the Convex Exemplar SPP1600, the Convex-customized MPI performs twice as fast as its shared memory does for the ping-pong operation. For the collective operations, the performance of MPI is just slightly better than that of the shared memory method.

Figure 3. Ping-pong Rate (bandwidth in MB/sec vs. message size in bytes)
Figure 4. Broadcast Bandwidth (MB/sec vs. message size in bytes)
Figure 5. Reduction Bandwidth (MB/sec vs. message size in bytes)

Figure 6 demonstrates the ping-pong round-trip transfer time for small message sizes (8 bytes to 80 bytes). This performance typically reflects the communication latency. It is clear that the shared memory method on the DEC AlphaServer 8400 has the lowest ping-pong latency. In Figures 7 through 15, the performance behaviors for ping-pong, broadcast, and reduction are respectively shown on each platform for a fixed message size (800KB) with different numbers of processors. It should be noted that the bandwidth calculation for ping-pong in COMOPS is what some people call the "ping-pong rate" (message_size / round_trip_time), so it is only half of the "one-way" ping-pong bandwidth that other benchmarks report.

Figure 6. Small Message ping-pong Time (microseconds vs. message size in bytes)

Now, based on an understanding of the architectures and the underlying MPI implementations, the qualitative performance analysis of the ping-pong, broadcast, and reduction operations on each platform is presented here.

Figure 7 shows the ping-pong time on the DEC AlphaServer for a fixed message size (800KB) with different numbers of processors involved. On this DEC machine, MPI is built on top of its shared memory communication protocol. Therefore, MPI performance is always slightly worse than shared memory because of the overhead involved in the MPI implementation. Also, MPI processes seem to be "heavy". Although only two processors participate in the ping-pong operation, the time grows slightly when the number of MPI processes increases. This is probably due to interruption from the operating system and the other MPI processes, which are supposed to be idle. On the other hand, the time for the shared memory ping-pong operation remains constant, regardless of the number of processors in the run. This is because the cache coherence caused by invalidating the shared cache line on each processor is performed by broadcasting the message on the bus, instead of sending it to each processor separately.

Figure 7. Turbolaser ping-pong Time

The broadcast performance on the DEC AlphaServer (Figure 9) is easy to understand. The increase of the shared memory broadcast time with more processors is caused by the increasing queue length of the slave processors. In MPI, the synchronization cost causes the broadcast time to increase more significantly with more processors. The same situation holds for reduction (Figure 10). However, because the shared memory reduction involves a critical section (as listed in the pseudocode), the reduction time increases more as more processors are waiting to enter the critical section.

Figure 9. Turbolaser Broadcast Time
Figure 10. Turbolaser Reduction Time

Similarly, the ping-pong operation has a flat performance on the SGI Power Challenge (Figure 8). The difference from the situation on the DEC AlphaServer is that the MPI ping-pong time does not grow with more processors. It looks like the MPI processes are "light" on the SGI Power Challenge, because the OS interruption does not steal the effective bandwidth even if all processors are in the run. The SGI implementation of MPI is based on the global memory copy function Bcopy() (Salo, 1996). Thus, the ping-pong operation is accomplished by directly copying data from the space owned by the source processor to the destination processor, without going through an intermediate shared space (Gropp, Lusk, Doss & Skjellum, 1996). Therefore, the shared memory scheme, which uses an intermediate shared space as an interim, takes about twice as long as MPI does.

Figure 8. SGI ping-pong Time

The performance of shared memory broadcast and reduction on the SGI machine (Figures 11 and 12) is similar to what is observed on the DEC AlphaServer because of the identical architecture and the same version of the shared memory code. The time for broadcast grows with more processors because of the increasing queue length for reading the shared space. For reduction, the cost from the critical section increases with more processors involved. The MPI performance behaviors for broadcast and reduction on the SGI Power Challenge are interesting. In fact, the MPI performance illustrated in Figures 11 and 12 reflects the underlying implementation of the SGI MPI. The MPI broadcast operation is implemented as a fan-out tree on top of the Bcopy() point-to-point mechanism (Salo, 1996). For reduction operations, it is in the reversed order, as a fan-in tree. Both of them have some parallelism, as each pair of processors can perform the fan-in or fan-out independently. Since the fan-in/fan-out tree algorithm requires a synchronization at each tree fork/join stage, the cost of broadcast/reduction grows with more fork/join synchronizations as more processors participate in the operation. Therefore, the time for reduction on eight processors is nearly the same as that for six processors, because they both involve the same number of join synchronization stages. The big growth in the time for broadcast on eight processors (Figure 11) is in fact caused by the synchronization at the completion of the broadcast. With all the processors in the system being synchronized at a certain point, the OS overhead can be significant. On the other hand, there is no need for such a synchronization in reduction.

Figure 11. SGI Broadcast Time
Figure 12. SGI Reduction Time

The ping-pong performance on the Convex SPP1600 (Figure 13) is very similar to that on the SGI Power Challenge. From the phenomenon that MPI takes nearly half the time that the shared memory scheme takes to perform the ping-pong operation, it is reasonable to anticipate that the MPI implementation on the SPP1600 may also be based on direct memory copy, instead of going through an intermediate shared space (Gropp, et al., 1996).

Figure 13. SPP1600 ping-pong Time

The performance of shared memory broadcast and reduction on this SPP1600 (Figures 14 and 15) is similar to the other two machines. The queue length for reading the shared block and the cost from the critical section are the major effects in broadcast and reduction respectively. Since the details of the broadcast and reduction implementation in the Convex version of MPI are unclear at this moment, it is anticipated that the MPI broadcast involves regular synchronizations, just like the situation on the DEC AlphaServer. As for reduction operations, the slightly higher cost on six processors is probably because two of the six processors may not be on the same agent (Figure 2). Therefore, the interaction between these two processors has to go through the crossbar switch.

Figure 14. SPP1600 Broadcast Time
Figure 15. SPP1600 Reduction Time

Conclusions

From the COMOPS benchmark results measured on the three shared memory machines, the following conclusions can be made.

1. The MPI implementation on the SGI Power Challenge is generally superior to the others, at least for COMOPS operations.
2. In general, the communication performance for COMOPS operations is better in the two customized versions of MPI, the Convex MPI and the SGI MPI, than in their corresponding shared memory schemes.
3. On the DEC Turbolaser, the communication performance in the shared memory scheme is slightly better than that in MPI because of the MPI overhead.

It is clear that customizing the MPI implementation based on the specific hardware architecture is a good way to achieve high performance for message passing operations on a shared memory platform. Also, using direct memory copy, instead of going through an intermediate shared space, is critical to the improvement of the communication performance.

Bibliography

CONVEX Computer Corporation. (1994). Exemplar Architecture. 2nd Edition, Doc. No. 081-023430-001. Richardson, TX: CONVEX Press. Nov. 1994.
Digital Equipment Corp. (1995). AlphaServer 8000 Series Configuring Guide. DEC WWW home page /info/alphaserver/alphsrv8400/8000.html#spec.
Fenwick, D.M., Foley, D.J., Gist, W.B., VanDoren, S.R., & Wissell, D. (1995). The AlphaServer 8000 Series: High-End Server Platform Development. Digital Technical Journal, Vol. 7 No. 1, 43-65.
Gropp, W., Doss, N., & Skjellum, A. (1996). A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard (on-line technical report). Available at /mpi/mpicharticle/paper.html. Argonne, IL: Mathematics and Computer Science Division, Argonne National Laboratory.
Reed, J. (1996). Personal Correspondence About SPP1600. Nov. 1996.
Salo, E. (1996). Personal Correspondence About SGI MPI. Nov. 1996.
Silicon Graphics Inc. (1995). Power Challenge Technical Report. SGI WWW home page /Products/hardware/servers.
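For readers who want to reproduce the flavor of the COMOPS ping-pong measurement on a current system, the following is a minimal sketch using mpi4py; it is not the original COMOPS code (which targeted the C/Fortran MPI bindings), and the message size, repetition count, and file name are illustrative.

```python
# pingpong.py -- run with, e.g.:  mpiexec -n 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 800 * 1024                        # 800 KB message, as in the fixed-size figures
ntimes = 100
buf = np.zeros(nbytes, dtype=np.uint8)

comm.Barrier()
start = MPI.Wtime()
for _ in range(ntimes):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)      # ping
        comm.Recv(buf, source=1, tag=0)    # pong
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
comm.Barrier()
elapsed = MPI.Wtime() - start

if rank == 0:
    round_trip = elapsed / ntimes
    # COMOPS-style "ping-pong rate": message_size / round_trip_time
    print("round trip: %.6f s, rate: %.2f MB/s"
          % (round_trip, nbytes / round_trip / 1e6))
```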
