Distributed Parallel Algorithms


Fundamentals of Parallel and Distributed Computing

In computer science, as data volumes and computational demands keep growing, more and more tasks must be processed concurrently. To achieve efficient computation and data processing, two important concepts have emerged: parallel computing and distributed computing.

Parallel computing decomposes a task into multiple subtasks and processes them simultaneously on multiple processing units to improve speed and efficiency. It is typically applied on a single machine, using multiple cores or threads to execute several tasks at once.

Distributed computing instead dispatches a task to multiple computers or servers; each machine runs part of the task independently, and the partial results are finally combined into the overall result. It is typically applied in a networked environment, where the resources of many machines can be pooled to handle large-scale data and computation.

The fundamentals of parallel and distributed computing cover the following areas.

1. Parallel computing models

Parallel computing models fall into two categories: the shared-memory model and the message-passing model.

In the shared-memory model, multiple processing units share the same memory space and communicate and synchronize by reading and writing that shared memory. Each unit can access memory independently and interact with the others by modifying shared data.

In the message-passing model, processing units communicate by sending and receiving messages. Each unit has its own private memory, so data sharing and synchronization between units must be done through messages.
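A minimal sketch of the message-passing model, using Python threads and queues to stand in for processing units with private memory (the names `worker`, `inbox`, and `outbox` are illustrative, not part of any specific framework):

```python
import threading
import queue

def worker(inbox, outbox):
    """A processing unit: private state, communication only via messages."""
    total = 0              # private memory, never accessed by the other unit
    while True:
        msg = inbox.get()  # receive a message
        if msg == "done":
            outbox.put(total)  # send the result back as a message
            break
        total += msg

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
for n in [1, 2, 3, 4]:
    inbox.put(n)           # all data sharing happens through messages
inbox.put("done")
t.join()
result = outbox.get()      # -> 10
```

The sender never touches the worker's `total` directly; every interaction is an explicit message, which is exactly what distinguishes this model from shared memory.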

2. Parallel algorithm design

In parallel computing, algorithm design is crucial: a good parallel algorithm fully exploits the computing power of the processing units and improves efficiency. Its design involves two concerns, task partitioning and communication. Task partitioning breaks a large task into smaller subtasks and assigns them sensibly to different processing units; the communication part designs the data-transfer and synchronization mechanisms between units.

3. Distributed computing systems

A distributed computing system is a set of interconnected computers used to process large-scale data and computation tasks. The machines may be spread across different geographic locations and communicate and cooperate over a network. Such a system typically includes components for task scheduling, data distribution, and result merging. The task scheduler splits a job into subtasks and dispatches them to different machines for execution, while the data-distribution and result-merging components move input data to the compute nodes and collect the processed results from them.
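The schedule / distribute / merge pipeline described above can be sketched as follows. The chunking rule and the worker function are assumptions made for illustration, and threads stand in for remote machines so the example stays self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    """Work done by one 'node': sum of squares over its share of the data."""
    return sum(x * x for x in chunk)

data = list(range(100))
# Task scheduling: split the job into 4 subtasks by striding.
chunks = [data[i::4] for i in range(4)]

# Data distribution: dispatch each chunk to a worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(subtask, chunks))

# Result merging: combine the partial results into the final answer.
result = sum(partials)   # -> 328350
```

A real system would add failure handling and data locality, but the three roles (scheduler, distributor, merger) map directly onto the three code sections.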

Parallel Computing vs. Distributed Computing: Differences and Connections

Parallel computing, distributed computing, grid computing and cloud computing all fall under high-performance computing (HPC); their main purpose is the analysis and processing of large data sets, yet they differ in many ways. Understanding the principles, characteristics and typical uses of the first two is of great benefit to understanding cloud computing. The two techniques are discussed together because they share a common trait: both use parallelism to obtain higher performance, splitting a large task into N smaller ones. They nevertheless differ, as described below.

Parallel computing

1. Concept. Parallel computing is a computation mode in which multiple instructions are carried out simultaneously; it divides into temporal parallelism and spatial parallelism. Temporal parallelism uses multiple pipelines working at the same time; spatial parallelism uses multiple processors executing concurrently, reducing the time needed to solve complex problems. Parallel computing is the process of using several computing resources at once to solve a computational problem. To carry it out, the computing resources may be a single computer with multiple processors, a number of computers connected by a network, or a combination of both. Parallel computing has two main goals: (1) to speed up the solution of a problem; (2) to increase the size of the problems that can be solved.

2. Principle. Parallel computing can quickly solve large and complex computational problems. It can also exploit non-local resources and save cost, replacing a large computer with several "cheap" computing resources while overcoming the memory limits of a single machine. To raise efficiency, parallel processing of a problem generally proceeds in three steps: (1) separate the work into discrete, independent parts that can be solved simultaneously; (2) execute multiple program instructions concurrently; (3) return the finished results to the host, which processes and outputs them. Serial computing must proceed step by step to produce the final result, whereas parallel computing splits the problem into many subtasks that are computed in parallel. These subtasks are not independent of one another: the result of each subtask contributes to the final result. This is a key difference from distributed computing.

3. Basic requirements for parallel computing. (1) A parallel computer: it must contain at least two processors, connected through an interconnection network and able to communicate with one another. (2) The application problem must exhibit parallelism.

How to Perform Efficient Parallel and Distributed Computing

Efficient parallel and distributed computing has become one of the key technologies in many fields and industries. Computing in parallel and in distributed fashion can significantly improve efficiency and speed up data processing, while also making better use of hardware resources and improving system reliability and stability. This article introduces the concepts and principles of efficient parallel and distributed computing, along with methods and techniques for implementing them.

1. Parallel computing

Parallel computing uses two or more processors or computers to execute multiple computational tasks at the same time. Its basic principle is to split a large task into several small ones, have multiple processors or computers execute these small tasks in parallel, and finally merge the results into the final answer.

The main advantages of parallel computing are:
1. Higher computing speed: several tasks run at once, greatly shortening computation time.
2. Better reliability and stability: because multiple processors or computers cooperate, if one fails the others can keep working.
3. Better hardware utilization: parallel computing makes full use of multiple processors or computers.

In practice, parallel computing can be implemented with the following methods and techniques:
1. Multithreaded programming: a common way to achieve parallelism is to create several threads in a program, each responsible for one independent computational task. Multithreading can be implemented with a thread library or with the thread APIs provided by the programming language.
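A minimal sketch using Python's standard `threading` API, where each thread runs one independent subtask and writes into its own result slot (the function and variable names are illustrative):

```python
import threading

def count_range(lo, hi, out, slot):
    """One subtask: sum a half-open range and store it in its own slot."""
    out[slot] = sum(range(lo, hi))

results = [0, 0]
t1 = threading.Thread(target=count_range, args=(0, 500, results, 0))
t2 = threading.Thread(target=count_range, args=(500, 1000, results, 1))
t1.start(); t2.start()      # both subtasks run concurrently
t1.join(); t2.join()        # wait for both to finish
total = results[0] + results[1]   # -> 499500
```

Because each thread writes only to its own slot, no locking is needed here; coordination becomes necessary only once threads share mutable state.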

2. Parallel algorithms: a parallel algorithm decomposes a problem into subproblems and assigns them to multiple processors or computers to be computed in parallel. Common examples include parallel sorting and parallel search.
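As a sketch of a parallel search, the input can be split into chunks that workers scan independently before the hits are combined. Threads stand in for processors here, and the chunking scheme and names are arbitrary choices for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def search_chunk(args):
    """Scan one chunk for the target, returning global indices of hits."""
    chunk, offset, target = args
    return [offset + i for i, v in enumerate(chunk) if v == target]

data = [5, 3, 7, 3, 9, 3, 1, 3]
target = 3
n_workers = 2
size = len(data) // n_workers
jobs = [(data[i * size:(i + 1) * size], i * size, target)
        for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    hits = sorted(i for part in pool.map(search_chunk, jobs) for i in part)
# hits == [1, 3, 5, 7]
```

The subproblems are fully independent, so the only coordination needed is the final merge, which is what makes search a natural candidate for parallelization.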

2. Distributed computing

Distributed computing decomposes a large computational task into subtasks, assigns them to multiple compute nodes, and finally merges the results into the final answer. Its advantages include:
1. Good scalability: capacity can be extended by adding compute nodes, matching the demands of large-scale tasks.
2. Better reliability and stability: redundant compute nodes and distributed storage improve the reliability and stability of the system.

The Three Forms of Parallel Computing

With the development of computer technology, the complexity of computational tasks and the scale of data keep increasing, and a single computer can no longer meet the demand for high-performance computing, so parallel computing has become an active field of study. Parallel computing is a mode in which multiple computational tasks proceed at the same time, and it can greatly improve efficiency and speed. Broadly, it takes three forms:

1. Distributed computing

Distributed computing assigns the work of one large computation to many machines, each node processing a share of the data. The machines communicate and cooperate over a network, and the individual results are finally merged into the overall result. This form of parallel computing is mainly applied in distributed systems, cloud computing, and big-data processing.

2. Multi-core parallel computing

Multi-core parallel computing runs the same program on the several cores of one machine at the same time; each core processes different data according to some assignment rule, and together the cores produce the full result. This form of parallel computing is mainly applied to compute-intensive tasks such as image processing, simulation, and physics computations.
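The assignment rule mentioned above can be made concrete: a fixed striding rule gives each core a disjoint share of the data whose union covers the whole input. The core count is fixed at 4 here so the example is deterministic; a real program would typically query `os.cpu_count()`:

```python
n_cores = 4                      # illustrative; see os.cpu_count() in practice
data = list(range(10))

# Core i handles elements i, i + n_cores, i + 2*n_cores, ...
shares = [data[i::n_cores] for i in range(n_cores)]

# The shares are disjoint and together cover every element exactly once.
covered = sorted(x for share in shares for x in share)   # == data
```

Any rule with this disjoint-cover property works; striding is popular because it also balances the load when the cost per element varies smoothly across the input.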

3. GPU parallel computing

GPU parallel computing uses a graphics processing unit (GPU) to process computational tasks in parallel, employing GPU accelerators for high-performance computing. It is mainly applied to compute-intensive tasks such as animation rendering, computational fluid dynamics (CFD), and encryption and decryption.

In short, parallel computing is now widely used across fields and industries; it has improved computational efficiency, lowered computing costs, and accelerated scientific and technological progress. As technology continues to develop, parallel computing will play an even larger role in more fields, and realizing its potential will require further in-depth research and exploration.

Applications of Distributed and Parallel Computing

- The Internet of Things: IoT is characterized by comprehensive sensing, reliable transmission, and intelligent processing; it enables real-time data collection, transmission, and processing, supplying distributed computing with abundant data resources.
- Relation to distributed computing: distributed computing can exploit IoT data resources for large-scale data processing and analysis, improving computational efficiency and accuracy and in turn advancing IoT applications.
- Big data processing and analysis: distributed computing plays an important role here, raising processing speed and efficiency. It spreads large data sets across many nodes, lowering cost and improving scalability; it supports a range of processing and analysis tools, such as Hadoop and Spark, to meet different business needs; and it is widely applied in finance, healthcare, e-commerce, and other fields.
- Convergence with artificial intelligence and machine learning: distributed and parallel computing will integrate further with AI and machine-learning techniques, promoting the spread and development of AI applications.
- Data security and privacy protection: as the applications of distributed and parallel computing keep expanding, data security and privacy protection will become important research directions.
- Interdisciplinary collaboration: distributed and parallel computing will cross-fertilize with fields such as biology, physics, and finance, driving interdisciplinary research and applications.
- Edge computing: applying distributed and parallel computing at the edge improves data-processing efficiency and reduces network latency.
- Fusion of AI and distributed computing: AI techniques will combine further with distributed computing to improve computational efficiency and data-handling capacity, and this fusion is expected to bring innovative applications to many industries. Research directions include optimizing distributed systems for the needs of AI algorithms; key enabling technologies include distributed machine learning and the integration of deep-learning frameworks with distributed systems.
- Physical simulation: in fields such as materials science and aerospace, simulating physical experiments with parallel computing lowers experimental cost and risk.

A Distributed Parallel Algorithm for Solving Triangular Linear Systems by Blocks
Journal of the Academy of Equipment Command & Technology, February 2001.

Abstract (only partially recoverable from the scan): under a distributed-memory multicomputer model, a parallel algorithm for solving triangular linear systems is presented; overlapping of computing and communication is used to reduce the overall cost. Keywords: triangular linear systems; distributed-memory parallel algorithm; data compressed storage; overlapping computing and communication; load balance.
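Since only the title and keywords of the paper survive, the following is a generic pure-Python sketch of the underlying idea rather than the paper's own algorithm: solving a lower-triangular system L x = b block by block, where the per-block update is the unit of work a distributed version would hand to different nodes. The block size and all names are illustrative:

```python
def solve_lower_blocked(L, b, bs=2):
    """Solve L x = b for lower-triangular L by block forward substitution."""
    n = len(b)
    x = [0.0] * n
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # Update step: subtract the contribution of already-solved blocks.
        rhs = [b[i] - sum(L[i][j] * x[j] for j in range(k0))
               for i in range(k0, k1)]
        # Solve the diagonal block by ordinary forward substitution.
        for i in range(k0, k1):
            s = rhs[i - k0] - sum(L[i][j] * x[j] for j in range(k0, i))
            x[i] = s / L[i][i]
    return x

L = [[2, 0, 0, 0],
     [1, 3, 0, 0],
     [0, 2, 1, 0],
     [4, 0, 1, 2]]
b = [2, 5, 3, 7]
x = solve_lower_blocked(L, b)
```

In a distributed setting each block row lives on a different node; the update step is where the already-computed part of x must be communicated, which is why overlapping computation and communication (one of the paper's keywords) pays off.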

Parallel and Distributed Computing in Computer Science

The rapid development of computer technology has driven the rise of parallel and distributed computing. With the arrival of the information age, the demand on computer performance keeps growing, and traditional serial computing can no longer meet practical needs. Parallel and distributed computing have become effective means of solving large-scale computational problems. This article focuses on their basic concepts, development history, and application prospects.

1. Basic concepts and techniques of parallel computing

Parallel computing raises the overall computing power of a system by executing multiple tasks or subtasks at the same time. Compared with traditional serial computing, it fully exploits the computing and storage resources of multiple processors or compute nodes, improving efficiency and speed. Parallel computing comes in two modes: shared-memory parallelism and distributed parallelism.

Shared-memory parallelism is achieved by having several processors share the same physical memory, coordinating access to shared resources through locking mechanisms. This mode offers good programmability and ease of use, but in practice it faces problems such as thread synchronization and data consistency.
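The locking mechanism just mentioned can be sketched with Python threads: several workers increment one shared counter, and the lock serializes access so no update is lost. The names and counts are illustrative:

```python
import threading

counter = 0                 # shared state in the common memory space
lock = threading.Lock()

def add_many(n):
    """One worker: repeatedly update the shared counter under the lock."""
    global counter
    for _ in range(n):
        with lock:          # coordinate access to the shared resource
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40000
```

Dropping the lock would make the read-modify-write of `counter` racy; the example shows exactly the synchronization and consistency concern the text raises.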

Distributed parallelism divides the computation into subtasks and distributes them to different compute nodes for parallel execution. The nodes communicate over a network, sharing data and cooperating to complete the task. This mode scales well and tolerates faults, but it must overcome network latency and the overhead of inter-node communication.

2. Basic concepts and techniques of distributed computing

Distributed computing decomposes a large computational task into subtasks and distributes them to different compute nodes for cooperative computation and data processing. The nodes communicate over a network, share data and resources, and work together to complete the whole computation. Distributed computing rests on the development of computer networks and communication technology. With the spread of the Internet and growing computing power, it has found wide application, for example in cloud computing and big-data processing. It offers high reliability, high performance, and strong computing capacity, meeting the needs of massive data processing and complex computational tasks.

3. Development history

The history of parallel and distributed computing can be traced back to the 1960s, when computer scientists first tried splitting computations into subtasks to be executed in parallel, improving speed and efficiency. In the following decades, as hardware and software advanced, research deepened and a series of parallel computing models and distributed computing frameworks were proposed.

Parallel Computing and Distributed Algorithms

Parallel computing and distributed algorithms are important research directions in modern computing, with wide application in high-performance computing, large-scale data processing, and artificial intelligence. This article introduces their basic concepts, principles, and applications, and discusses their impact on computational efficiency and performance.

1. Parallel computing

1.1 Concept and background. Parallel computing is the technique of using multiple computing resources (processors, memory, and so on) simultaneously to complete a computational task. It decomposes the task into subtasks and executes them at the same time on multiple resources, improving efficiency and processing capacity.

1.2 Principles and models. The basic principle of parallel computing is task decomposition followed by result merging. In the decomposition phase, the task is divided into independent subtasks that can run in parallel on different resources; in the merging phase, the subtask results are combined into the final answer. Parallel computing has several models, including the shared-memory model, the distributed-memory model, and hybrids of the two. In the shared-memory model, multiple processors share one memory space, so each can directly access and modify shared data. In the distributed-memory model, compute nodes with independent memories are connected by a network and communicate and exchange data via message passing.

1.3 Applications and challenges. Parallel computing is widely used in scientific computing, image processing, and simulation. It accelerates task execution and improves computational performance and data-handling capacity. It also faces challenges such as task partitioning, data synchronization, and communication overhead, so algorithms must be designed and tuned carefully to realize its advantages.

2. Distributed algorithms

2.1 Concept and characteristics. A distributed algorithm is designed for a distributed computing environment: it spreads the computation over multiple nodes that coordinate and communicate via message passing, in order to solve large-scale data processing and complex computational problems. Distributed algorithms are characterized by concurrency, fault tolerance, and scalability. Concurrency means several nodes can execute different tasks at the same time; fault tolerance means the system keeps running when a single node fails; scalability means nodes can be added or removed as the workload changes without hurting overall performance or reliability.

2.2 Basic principles. The basic principles of distributed algorithms are divide and conquer and coordinated computation.
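Divide and conquer plus coordination in miniature: each "node" counts words in its own shard of the input, and a coordinator merges the partial counts. Plain function calls stand in for remote execution, and the sharding is purely illustrative:

```python
from collections import Counter

def map_shard(lines):
    """Runs on one node: count words in this node's shard of the input."""
    return Counter(w for line in lines for w in line.split())

def reduce_counts(partials):
    """Runs on the coordinator: merge the per-node partial counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

shards = [["a b a"], ["b c", "a"]]           # each shard lives on one node
partials = [map_shard(s) for s in shards]    # divide: independent local work
counts = reduce_counts(partials)             # coordinate: merge the results
# counts["a"] == 3, counts["b"] == 2, counts["c"] == 1
```

The same shape underlies MapReduce-style frameworks: the divide step needs no communication at all, and coordination is confined to the final merge.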


LLB: A Fast and Effective Scheduling Algorithm for Distributed-Memory Systems

Andrei Rădulescu, Arjan J.C. van Gemund, Hai-Xiang Lin
Faculty of Information Technology and Systems, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

This paper presents a new algorithm called List-based Load Balancing (LLB) for compile-time task scheduling on distributed-memory machines. LLB is intended as a cluster-mapping and task-ordering step in the multi-step class of scheduling algorithms. Unlike current multi-step approaches, LLB integrates cluster-mapping and task-ordering in a single step. The benefits of this integration are twofold. First, it allows dynamic load balancing in time, because only the ready tasks are considered in the mapping process. Second, communication is also considered, as opposed to algorithms like WCM and GLB. The algorithm has a low time complexity of […], where […] is the number of dependences, […] is the number of tasks and […] is the number of processors. Experimental results show that LLB outperforms known cluster-mapping algorithms of comparable complexity, improving the schedule lengths up to […]. Furthermore, compared with LCA, a much higher-complexity algorithm, LLB obtains comparable results for fine-grain graphs and yields improvements up to […] for coarse-grain graphs.

1 Introduction

The problem of efficiently scheduling programs is one of the most important and difficult issues in a parallel processing environment. The goal of the scheduling problem is to minimize the parallel execution time of a program. Except for very restricted cases, the scheduling problem has been shown to be NP-complete [3]. For shared-memory systems, it has been proven that even a low-cost scheduling heuristic is guaranteed to produce acceptable performance [4]. However, for distributed-memory architectures, there is no such guarantee and the task scheduling problem remains a challenge, especially for algorithms where low cost is of key interest. The heuristic algorithms used for compile-time task
scheduling on distributed-memory systems can be divided into (a) scheduling algorithms for an unbounded number of processors (e.g., DSC [12], EZ [9], STDS [2] and DFRN [6]) and (b) scheduling algorithms for a bounded number of processors (e.g., MCP [10], ETF [5] and CPFD [1]). The problem in the unbounded case is easier, because the constraint on the number of processors need not be considered. Therefore, scheduling in the unbounded case can be performed with good results at a lower cost compared to the bounded case. However, in practical situations, the necessary number of processors requested by an algorithm for the unbounded case is rarely available.

Recently, a multi-step approach has been proposed, which allows task scheduling at the low complexity of the scheduling algorithms for the unbounded case [9, 11]. In the first step, called clustering, scheduling without duplication is performed for an unbounded number of processors, tasks being grouped in clusters which are placed on virtual processors. In the second step, called cluster-mapping, the clusters are mapped on the available processors. Finally, in the third step, called task-ordering, the mapped tasks are ordered on processors. Although the multi-step approach has a very low complexity, previous results have shown that its schedule lengths can be up to twice the length of a list scheduling algorithm such as MCP [7].

This paper presents a new algorithm, called List-based Load Balancing (LLB), for cluster-mapping and task-ordering. It significantly improves the schedule lengths of the multi-step approach to a level almost comparable to the list scheduling algorithms, yet retains the low complexity of the scheduling algorithms for an unbounded number of processors. The LLB algorithm combines the cluster-mapping and task-ordering steps into a single one, which allows a much more precise tuning of the cluster-mapping process.

This paper is organized as follows: the next section describes the scheduling problem and introduces some definitions used in the
paper. Section 3 briefly reviews existing cluster-mapping algorithms. Section 4 presents the LLB algorithm, while Section 5 shows its performance. Section 6 concludes the paper.

2 Preliminaries

A parallel program can be modeled by a directed acyclic graph […], where […] is a set of nodes and […] is a set of edges. A node in the DAG represents a task, containing instructions that execute sequentially without preemption. The computation cost of a task, […], is the cost of executing the task on any processor. The edges correspond to task dependencies (communication messages or precedence constraints). The communication cost of an edge, […], is the cost to satisfy the dependence. The communication to computation ratio (CCR) of a parallel program is defined as the ratio between its average communication cost and its average computation cost.

A task with no input edges is called an entry task, while a task with no output edges is called an exit task. A task is said to be ready if all its parents have finished their execution. A ready task can start its execution only after all its dependencies have been satisfied. If two tasks are mapped to the same processor or cluster, the communication cost between them is assumed to be zero. The number of clusters obtained in the clustering step is […]. As a distributed system, we assume a processor-homogeneous clique topology with no contention on communication operations and non-preemptive task execution.

Once scheduled, a task is associated with a processor, a start time and a finish time. If the task is not scheduled, these three values are not defined. The objective of the scheduling problem is to find a scheduling of the tasks on the target system such that the parallel completion time of the problem (schedule length) is minimized. The parallel completion time is defined as […].

For performance comparison, the normalized schedule length is used. In order to define the normalized schedule length, first the ideal completion time is defined as the ratio between the sequential time (aggregate computation
cost) and the number of processors: […]. The normalized schedule length is defined as the ratio between the actual parallel time (schedule length) and the ideal completion time: […]. It should be noted that the ideal time may not always be achieved, and that the optimal schedule length may be greater than the ideal time.

3 Related Work

In this section, three existing cluster-mapping algorithms and their characteristics are described: List Cluster Assignment (LCA) [9], Wrap Cluster Merging (WCM) [11] and Guided Load Balancing (GLB) [7].

3.1 List Cluster Assignment (LCA)

LCA is an incremental algorithm that performs both cluster-mapping and task-ordering in a single step. At each step, LCA considers the unmapped task with the highest priority. The task, along with the cluster it belongs to, is mapped to the processor that yields the minimum increase of parallel completion time. The parallel completion time is calculated as if each unmapped cluster would be executed on a separate processor. The time complexity of LCA is […].

One can note that if each task were mapped to a separate cluster, LCA would degenerate to a traditional list scheduling algorithm. If the clustering step produces large clusters, the time spent to map the clusters to processors is decreased, since all tasks in a cluster are mapped in a single step. However, the complexity of LCA is still high, because at each cluster-mapping step it is necessary to compute the total parallel completion time […] times.

3.2 Wrap Cluster Merging (WCM)

In WCM, cluster-mapping is based only on the cluster workloads. First, each of the clusters with a workload higher than the average is mapped to a separate processor.
Second, the remaining clusters are sorted in increasing order by workload. Assuming the remaining processors are renumbered as […] and the remaining clusters as […], the processor for cluster […] is defined as […]. The time complexity of WCM is […].

Assuming that in the previous clustering step the effect of the largest communication delays is eliminated, the communication delays are not considered in the cluster-mapping step. The time complexity of WCM is very low. However, it can lead to load imbalances among processors at different stages during the execution. That is, the cluster-mapping is performed considering only the sum of the execution times of the tasks in clusters, irrespective of the tasks' starting times. As a result, processors may be temporarily overloaded throughout the time.

3.3 Guided Load Balancing (GLB)

GLB improves on WCM by exploiting knowledge about the task start times computed in the clustering step. A cluster start time is defined as the earliest start time of the tasks in the given cluster. The clusters are mapped in the order of their start times to the least loaded processor at that time.
The time complexity of GLB is […].

The start time of the first task determines the cluster priority, thus providing information from the dependence analysis in the clustering step. Topologically ordering tasks yields a natural cluster ordering. Scheduling clusters in the order they become available for execution generally yields a better schedule compared to scheduling when only the workload in a cluster is considered, because the work in the clusters is spread over time. Using this approach, the concurrent tasks (without dependencies between them) are more likely to be placed on different processors and therefore run in parallel.

However, there are cases in which GLB still does not produce the right load balance. For example, if a larger cluster is mapped on a processor, it will significantly increase the aggregate load of that processor and cause that processor not to be considered in the next mappings, even if it is already idle. In [8] a graphical illustration is given of how the above algorithms perform on an example task graph.

4 The LLB Algorithm

4.1 Motivation

Comparing the existing low-cost multi-step scheduling algorithms with higher-cost scheduling algorithms, it can be noticed that low-cost multi-step algorithms produce up to […] longer schedules compared to higher-cost scheduling algorithms like MCP [7]. The existing low-cost cluster-mapping algorithms (e.g., WCM, GLB) aim to balance the workload of the clusters on processors. Neither the communication costs, nor the order imposed on tasks by their dependencies, are considered when the clusters are mapped.
On the other hand, the higher-cost cluster-mapping algorithms (e.g., LCA), despite their better performance, are less attractive in many practical cases because of their high cost.

LLB is a low-cost algorithm intended to improve load balance throughout the time by performing cluster-mapping and task-ordering in a single step. Integrating the two steps in a single one allows a better tracking of the running tasks throughout the time when a new mapping decision is made. This approach helps the scheduling process in two ways. First, it allows a dynamic load balancing throughout the time, because only the ready tasks are considered in the mapping process. Second, communication costs are also considered when selecting tasks, as opposed to algorithms such as WCM and GLB.

Similar to the other algorithms presented earlier, LLB maps all tasks from a cluster in a single step. However, instead of selecting the best processor for a given cluster, the algorithm selects the best cluster for a given processor. The selected processor is the first one becoming idle. This new approach is taken because it significantly decreases the complexity of the algorithm. Instead of scanning all processors to find the best mapping for the cluster ([…]), the selected processor is simply determined by identifying the minimal processor ready time ([…]). The tasks are scheduled one at a time, which implies task-ordering in the same step as cluster-mapping. If the selected task has not been mapped before, all other tasks in its cluster are mapped along with the selected task. Scheduling tasks one at a time allows better control of the scheduling throughout the time, comparable to list scheduling algorithms, therefore leading to better schedules.

4.2 The LLB Algorithm

Before describing the LLB algorithm, we introduce a few concepts on which the algorithm is based. In this paper there is a clear distinction between the concept of a mapped and a scheduled task. A task is mapped if its destination processor has been assigned to it. A task is
scheduled if it is mapped and its start time has been computed. A ready task is an unscheduled task that has all its direct predecessors scheduled. The urgency of a task is defined as a static priority, in the same way as in list scheduling algorithms. We use as the task urgency the sum of the computation and the inter-cluster communication on the longest path from the given task to any exit task. This urgency definition provides a good measure of the task importance, because the larger the urgency is, the more work is still to be completed until the program finishes.

The LLB algorithm maps all tasks in the same cluster on the same processor. When the first task in a cluster is scheduled on a processor, all the other tasks in the same cluster are mapped on the same processor. The reason is to keep the communication at the same low level as obtained in the clustering step.

At each step one task is scheduled. First, the destination processor is selected as the processor becoming idle the earliest. Each processor has a list of ready tasks already mapped on it. The ready unmapped tasks are kept in a global list. Initially, the ready mapped task lists are empty and the ready unmapped task list contains the entry tasks.
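The urgency priority defined above (computation plus inter-cluster communication along the longest path to an exit task) can be sketched as follows; the task names, costs, and the `succ` structure are made up for illustration and are not the paper's benchmark graphs:

```python
# Illustrative DAG: cost[t] is t's computation cost, succ[t] lists
# (child, inter-cluster communication cost) pairs; "d" is the exit task.
cost = {"a": 2, "b": 3, "c": 1, "d": 4}
succ = {"a": [("b", 1), ("c", 5)],
        "b": [("d", 2)],
        "c": [("d", 1)],
        "d": []}

def urgency(t):
    """Computation plus communication along the longest path from t to an exit."""
    if not succ[t]:
        return cost[t]
    return cost[t] + max(comm + urgency(child) for child, comm in succ[t])

# urgency("d") == 4; urgency("a") == 2 + max(1 + 9, 5 + 6) == 13
```

The recursion mirrors the bottom-up definition; on larger graphs one would memoize it (or compute it in reverse topological order) to keep the cost linear in the edges.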
All ready task lists are sorted in descending order using the task urgencies.

After selecting the destination processor, a task to be scheduled on that processor is selected. The candidates are the most urgent ready task mapped on the selected processor and the most urgent ready unmapped task. Between these two tasks, the one starting the earliest is selected and scheduled on the selected processor. If the two tasks start at the same time, the mapped task is selected. If neither of the two candidates exists, the most urgent task mapped on another processor is selected and scheduled on its own processor.

Finally, the ready task lists are updated. Scheduling a task may lead to other tasks becoming ready. These ready tasks can be mapped or unmapped. The mapped tasks are added to the ready task lists corresponding to their processors. The unmapped tasks are added to the global ready unmapped task list.

LLB()
BEGIN
  For each task compute urgencies.
  WHILE NOT all tasks scheduled DO
    p   = the processor becoming idle the earliest
    t_m = the most urgent ready task mapped on p
    t_u = the most urgent ready unmapped task
    SWITCH
      CASE t_m and t_u exist:
        IF ST(t_m, p) <= ST(t_u, p) THEN
          t = t_m
        ELSE
          t = t_u
        END IF
      CASE only t_m exists: t = t_m
      CASE only t_u exists: t = t_u
      CASE neither t_m nor t_u exists:
        t = the most urgent ready task
    END SWITCH
    Schedule t on p.
    IF t has not been mapped THEN
      Map all tasks in t's cluster on p
    END IF
    Add the new ready tasks to the ready task lists.
  END WHILE
END

The complexity of the LLB algorithm is as follows.
Computing task priorities takes […]. At each task scheduling step, processor selection takes […] time. Task selection requires at most two dequeue operations from the ready task lists, which takes […] time. The cumulative complexity of computing the task start times throughout the execution of the while loop is […], since all the edges of the task graph must be considered. Also, each new ready task in the task graph must be added to one of the ready task lists. Adding one task to a list takes […] time and, as there are […] tasks, […] time is required to maintain the ready task lists throughout the execution of the while loop. The resulting complexity of the LLB algorithm is therefore […].

In [8] an elaborate description of the algorithm is presented, including a sample execution trace and a comparison to related work.

5 Performance Results

The LLB algorithm is compared with the three algorithms described in Section 3: LCA, WCM and GLB. The algorithms are compared within a full multi-step scheduling method to obtain a realistic test environment. The algorithm used for the clustering step is DSC [12], because of both its low cost and good performance. In case of using WCM and GLB for cluster-mapping, we use RCP [11] for task-ordering. In our comparisons we also included the well-known list scheduling algorithm MCP [10] to obtain a reference to other types of scheduling algorithms.

[Figure 1: Algorithms' running times]

We consider task graphs representing various types of parallel algorithms. The selected problems are LU decomposition, Laplace equation solver, a stencil algorithm and fast Fourier transform (sample examples are shown in [8]).
For each of these problems, we adjusted the problem size to obtain task graphs of about 2000 nodes. We varied the task graph granularities by varying the communication to computation ratio (CCR). The values used for CCR were 0.2, 0.5, 1.0, 2.0 and 5.0. For each problem and each CCR value, we generated 5 graphs with random execution times and communication delays (i.i.d., uniform distribution with coefficient of variation 1).

One of our objectives is to observe the trade-offs between the performance (i.e., the schedule length) and the cost to obtain these results (i.e., the running time required to generate the schedule). In Fig. 1 the average running time of the algorithms is shown. For the multi-step scheduling methods, the total scheduling time is displayed (i.e., the sum of clustering, cluster-mapping and task-ordering times). The running time of MCP grows linearly with the number of processors. For a small number of processors, the running time of MCP is comparable with the running times of the three low-cost multi-step scheduling methods (DSC-WCM, DSC-GLB and DSC-LLB). However, for a larger number of processors, the running time of MCP is much higher. All three low-cost multi-step scheduling methods (DSC-WCM, DSC-GLB and DSC-LLB) have comparable small running times, which do not vary significantly with the number of processors. DSC-LCA is not displayed, because its running times are much higher compared to the other running times, varying from 82 seconds for […] to 13 minutes for […].

[Figure 2: Normalized schedule lengths for (a) LU, (b) Laplace, (c) stencil, (d) FFT; panels shown for CCR = 0.2, 1.0 and 5.0.]

The average normalized schedule lengths (defined in Section 2) for the selected problems are shown in Figure 2 for CCR values 0.2, 1.0 and 5.0 only (more measurements are presented in [8]). For each of the considered CCR values a set of processor counts is presented. Note that the normalized schedule length generally increases with the number of processors as a result of the limited parallelism in the task graphs.

Within the class of high-cost algorithms, the schedule lengths of DSC-LCA are still longer compared to
MCP. In a multi-step approach the degree of freedom in mapping tasks to processors is decreased by clustering tasks together. Mapping a single task from a cluster to a processor forces all the other tasks within the same cluster to be mapped to the same processor. In this case, using a higher-cost algorithm (DSC-LCA vs. MCP) does not imply an increase in scheduling quality.

While for a small number of processors the performance of DSC-WCM is comparable with the high-cost algorithms, for a large number of processors the performance drops. If the computation time dominates, the difference in the quality of the schedules increases in favor of high-cost algorithms, going up to a factor of […] in some cases (LU decomposition, […], […]). In the case of fine-grain task graphs, DSC-WCM obtains results comparable with high-cost algorithms. Minimizing the communication delays in the clustering step is the key factor in obtaining these results.

DSC-GLB obtains better schedules compared with DSC-WCM, since more information is used from the clustering step. However, the quality increase varies both with CCR and with the type of problem. For high values of CCR (coarse-grain task graphs), the improvements in schedule lengths over DSC-WCM are […] for stencil problems and […] for LU decomposition, Laplace equation solver and FFT. For small values of CCR, the improvements over DSC-WCM are higher ([…] for Laplace equation solver and FFT, and […] for LU decomposition and stencil problems).

DSC-LLB outperforms both DSC-WCM and DSC-GLB, while maintaining the cost at a low level. It consistently outperforms DSC-WCM for all types of problems, with […] for fine-grain graphs and […] for coarse-grain graphs ([…] for FFT, […] for Laplace equation solver and stencil problems, and […] for LU decomposition). Compared with DSC-GLB for fine-grain graphs, DSC-LLB generally performs comparably or better if there is still speedup to be obtained (small number of processors). However, in some cases (Laplace equation solver), the performance is somewhat lower. For coarse-grain
graphs, DSC-LLB consistently outperforms DSC-GLB with […]. Moreover, in many cases DSC-LLB even outperforms DSC-LCA, with up to […] (LU decomposition, […], […]).

Compared to the more expensive MCP algorithm, DSC-LLB generally obtains longer schedules. As LLB is similar to list scheduling algorithms, its behavior is similar to MCP. The increase in schedule length depends on the CCR values and the type of problem, but does not exceed […] (stencil problem, […], […]).

6 Conclusion

In this paper, a new algorithm, called List-based Load Balancing (LLB), is presented. LLB is intended as a cluster-mapping and task-ordering step in the multi-step class of scheduling algorithms. Unlike the current approaches, LLB integrates cluster-mapping and task-ordering in a single step, which improves the scheduling process in two ways. First, it allows a dynamic load balancing in time, because only the ready tasks are considered in the mapping process. Second, the communication is also considered when selecting tasks, as opposed to algorithms like WCM and GLB. The complexity of LLB does not exceed the complexity of a low-cost clustering algorithm like DSC. Thus, the low complexity of the multi-step approach remains unaffected, despite our improvements in performance.

Experimental results show that, compared with known cluster-mapping algorithms of low complexity, the LLB algorithm improves the schedule lengths up to […]. Compared with LCA, a much higher-complexity algorithm, LLB obtains comparable results for fine-grain task graphs and even better results for coarse-grain task graphs, yielding improvements up to […]. In conclusion, LLB outperforms other algorithms in the same low-cost class and even matches the better performing, higher-cost algorithms in the list scheduling class.

Acknowledgements

This research is part of the Automap project granted by the Netherlands Computer Science Foundation (SION) with financial support from the Netherlands Organization for Scientific Research (NWO) under grant number SION-2519/612-33-005.

References

[1] I. Ahmad and Y.-K. Kwok. A new
approach to scheduling parallel programs using task duplication. In Proc. of the ICPP, pages 47–51, Aug. 1994.
[2] S. Darbha and D. P. Agrawal. Scalable scheduling algorithm for distributed memory machines. In Proc. of the SPDP, Oct. 1996.
[3] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., 1979.
[4] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM J. on Applied Mathematics, 17(2):416–429, Mar. 1969.
[5] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. on Computing, 18:244–257, Apr. 1989.
[6] G.-L. Park, B. Shirazi, and J. Marquis. DFRN: A new approach for duplication based scheduling for distributed memory multiprocessor systems. In Proc. of the IPPS, pages 157–166, Apr. 1997.
[7] A. Rădulescu and A. J. C. van Gemund. GLB: A low-cost scheduling algorithm for distributed-memory architectures. In Proc. of the HiPC, pages 294–301, Dec. 1998.
[8] A. Rădulescu, A. J. C. van Gemund, and H.-X. Lin. LLB: A fast and effective scheduling algorithm for distributed-memory systems. Technical Report 1-68340-44(1998)11, Delft Univ. of Technology, Dec. 1998.
[9] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. PhD thesis, MIT, 1989.
[10] M.-Y. Wu and D. D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Trans. on Parallel and Distributed Systems, 1(7):330–343, July 1990.
[11] T. Yang. Scheduling and Code Generation for Parallel Architectures. PhD thesis, Dept. of CS, Rutgers Univ., May 1993.
[12] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans. on Parallel and Distributed Systems, 5(9):951–967, Dec. 1994.
