Review of Parallel Computing Techniques for Computed Tomography Image Reconstruction

Jun Ni 1,3, Xiang Li 2, Tao He 3, Ge Wang 1,2
1 Medical Imaging High Performance Computing Lab, Department of Radiology, 2 Department of Biomedical Engineering, 3 Department of Computer Science, The University of Iowa, Iowa City, IA 52242
{jun-ni, xiang-li, tao-he, ge-wang}

Abstract

After we briefly review representative analytic and iterative reconstruction algorithms for X-ray computed tomography (CT), we address the need for faster reconstruction through parallel computing techniques. For a decent image volume, a cone-beam reconstruction usually takes hours on a regular PC, since most iterative algorithms take more than 60 iterations, sometimes even longer. To speed up the computation, various acceleration methodologies have been introduced, including algorithm improvements, dedicated chips, and parallel computing. This paper focuses on speeding up the computation using parallel computing. The first generation of parallel computing systems was based on a centralized parallel configuration. The second generation employed a cluster of general-purpose computers connected by a fast local area network (LAN). Here, we highlight distributed parallel computing techniques: from a locally distributed client-server topology to a peer-to-peer (P2P) enhanced network model. With the P2P technology, the client is directly connected to all other computing peers seamlessly, forming a virtual parallel computer, with multiple Internet connections between the client and the other computing peers. In this way, a single node failure does not cause the entire computation to fail. Finally, we argue that by integrating large-scale, geographically distributed systems such as Grid computing, future CT reconstruction will be highly parallel, efficient, and scalable over the Internet, as will other biomedical imaging tasks.

1. Introduction

X-ray computed tomography (CT) is one of the most important non-invasive medical imaging techniques [1]. X-ray CT reconstructs a cross-sectional image by computing the X-ray absorption coefficient distribution of an object from projection data, which record the relative number of photons passing through the object. Hence, X-ray CT is regarded as transmission CT. Another imaging modality is emission CT, such as positron emission tomography (PET) [2-4] and single photon emission computed tomography (SPECT) [5-7], where the distribution of injected radioactive chemicals is estimated. Whether it is emission or transmission CT, the principles of image reconstruction remain the same.

The development of X-ray CT technology is closely related to the evolution of the detector design and the scanning mode. The first generation of X-ray CT scanners used the parallel-beam geometry (Fig. 1). The next-generation systems were in the fan-beam geometry, which may be further divided into sub-categories [8, 9]. For 3D image reconstruction, an image volume was traditionally reconstructed by stacking 2D cross-sectional images. This method resulted in poor resolution in the axial direction. The modern scanning mode is to let the gantry rotate continuously while the patient table is simultaneously translated [10-12]. From the patient's point of view, the X-ray source moves along a spiral or helical locus. The spiral scan enables continuous data acquisition and improves the image quality significantly. Almost all modern CT devices allow spiral scanning.

Fig. 1. Parallel-beam (a) and fan-beam (b) scan geometries.

The X-ray photons emitted from the radiation source naturally form a cone traveling away from the source focal spot. A collimator is used in the parallel-beam and fan-beam scanners to restrict the X-ray beam to a single line or a set of lines on a plane, respectively. The first multi-slice spiral CT (MSCT) came onto the market in 1998, employing a four-row detector. MSCT was a breakthrough in CT technology that made the sub-minute whole-body CT scan a routine clinical exam [13]. To reduce the scan time and improve the X-ray energy efficiency even further, a larger area detector is desirable. In MSCT, the cone angle is small, usually a few degrees, so algorithms for fan-beam reconstruction can still be adapted. When the cone angle extended by an area detector is as large as tens of degrees, a new challenge must be met in algorithm development. Cone-beam CT reconstruction has been an active research area for the past decade, and many algorithms have been proposed [14-19]. They can be generally grouped into either analytic or iterative algorithms; for a full review of these algorithms, see [20]. Recently, Katsevich developed an efficient exact cone-beam reconstruction algorithm [21, 22]. However, a truly cone-beam medical X-ray CT scanner has not yet become popular in the market.

1.1. Analytic reconstruction

The filtered back-projection (FBP) method is predominantly used with most commercial X-ray CT or PET/CT scanners. With FBP, the projection data are first filtered, and the filtered data are then linearly smeared back along the ray paths to form image pixels. For example, in the parallel-beam geometry the relationship between the projection data and the object can be described as [23]

P(\theta, t) = \int_{l(\theta, t)} f(x, y) \, dl ,   (1.1)

where P(\theta, t) is the projection datum measured at projection angle \theta, and t is the detector position in the beam. Using the \delta-function to define the line integral, we have

P(\theta, t) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y) \, \delta(x \cos\theta + y \sin\theta - t) \, dx \, dy .   (1.2)

The CT image reconstruction problem is to compute f(x, y) given P(\theta, t). The filtered backprojection (FBP) method can be formulated as

f_p(x, y) = \int_0^{2\pi} D_\theta(t) \, d\theta , \quad t = x \cos\theta + y \sin\theta ,   (1.3)

D_\theta(t) = P(\theta, t) * \left( \int_{-\infty}^{+\infty} |r| \, e^{j 2 \pi r t} \, dr \right)   (1.4)

D_\theta(t) = F^{-1}\{ F\{ P(\theta, t) \} \cdot |r| \} ,   (1.5)

where F\{\cdot\} and F^{-1}\{\cdot\} denote the forward and inverse Fourier transforms, respectively. Equation (1.4) is the filtering step in the form of a convolution, which can be carried out in the frequency domain using equation (1.5). The FBP formula for fan-beam CT can be obtained similarly.
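To make the filtering and backprojection steps of equations (1.3)–(1.5) concrete, here is a minimal numerical sketch in Python (our illustration, not code from the paper). The sinogram layout proj[view, detector], the simple ramp filter, the nearest-neighbour interpolation, and the final scaling are all simplifying assumptions.

```python
import numpy as np

def fbp_parallel(proj, thetas, image_size):
    """Toy parallel-beam FBP: ramp-filter each view, then backproject it."""
    n_views, n_det = proj.shape

    # Filtering step of eq. (1.5): multiply by |r| in the Fourier domain.
    freqs = np.fft.fftfreq(n_det)
    filtered = np.real(np.fft.ifft(np.fft.fft(proj, axis=1) * np.abs(freqs), axis=1))

    # Backprojection step of eq. (1.3): smear each filtered view back
    # along t = x*cos(theta) + y*sin(theta).
    coords = np.arange(image_size) - image_size / 2.0
    x, y = np.meshgrid(coords, coords)
    recon = np.zeros((image_size, image_size))
    det_center = n_det / 2.0
    for D_theta, theta in zip(filtered, thetas):
        t = x * np.cos(theta) + y * np.sin(theta) + det_center
        idx = np.clip(np.round(t).astype(int), 0, n_det - 1)  # nearest-neighbour lookup
        recon += D_theta[idx]
    return recon * (np.pi / len(thetas))  # crude normalization by the angular step

# Example with a synthetic sinogram of the right shape:
# recon = fbp_parallel(np.random.rand(180, 256),
#                      np.linspace(0, np.pi, 180, endpoint=False), 256)
```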
In the cone-beam geometry, a 2D area detector is used. In this case, there are two types of analytic reconstruction methods: exact and approximate algorithms. An exact algorithm is able to reconstruct the original object accurately from cone-beam data, given sufficiently fine detector resolution and a large number of projections. An approximate algorithm often has a simpler formulation and a faster speed. Not all scanning trajectories permit exact reconstruction. As stated by the Smith-Tuy completeness condition: "If on every plane that intersects the object there lies a vertex (the source point), then one has complete information about the object" [24]. The completeness condition can be heuristically interpreted as requiring that a cone-beam trajectory fill out the 3D Radon space completely. In other words, on any plane that intersects the object being scanned there must exist at least one radiation source position. For example, a spiral scan curve is complete (even when data are longitudinally truncated) for exact cone-beam reconstruction, but a circular source trajectory is not.

Among the many approximate algorithms, the Feldkamp-type algorithms are the most popular. The original Feldkamp formula extends circular fan-beam reconstruction to the circular cone-beam geometry by compensating appropriately for the cone-angle effect. Suppose a planar detector is used, with the detector position determined by the Cartesian coordinates (u, v). The Feldkamp cone-beam reconstruction formula can be written as [20]

f_{cone}(x, y, z) = \int_0^{2\pi} \frac{1}{L^2(x, y, \beta)} \, D_{cone}(u', v', \beta) \, d\beta ,   (1.12)

where the interpolated detector position is determined by

u'(x, y, \beta) = \frac{d_{so} (-x \sin\beta + y \cos\beta)}{d_{so} + x \cos\beta + y \sin\beta} ,   (1.13)

v'(x, y, z, \beta) = \frac{d_{so} \, z}{d_{so} + x \cos\beta + y \sin\beta} ,   (1.14)

and

L(x, y, \beta) = \frac{d_{so} + x \cos\beta + y \sin\beta}{d_{so}} .   (1.15)

The filtering step is expressed as

D_{cone}(u, v, \beta) = \left( \frac{d_{so}}{\sqrt{d_{so}^2 + u^2 + v^2}} \, P_{cone}(u, v, \beta) \right) * \int_{-\infty}^{+\infty} |r| \, e^{j 2 \pi r u} \, dr ,   (1.16)

where P_{cone}(u, v, \beta) are the cone-beam projections.
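The voxel-driven backprojection implied by equations (1.12)–(1.15) can be sketched as follows (again our own illustration, not the paper's code), assuming the data have already been cosine-weighted and ramp-filtered according to equation (1.16), unit detector spacing, a centered planar detector, and nearest-neighbour interpolation:

```python
import numpy as np

def feldkamp_backproject(filt, betas, d_so, vol_shape):
    """Accumulate filtered cone-beam views filt[view, v, u] into a volume."""
    nz, ny, nx = vol_shape
    n_views, nv, nu = filt.shape
    zs = np.arange(nz) - nz / 2.0
    ys = np.arange(ny) - ny / 2.0
    xs = np.arange(nx) - nx / 2.0
    z, y, x = np.meshgrid(zs, ys, xs, indexing="ij")
    vol = np.zeros(vol_shape, dtype=np.float32)
    for view, beta in enumerate(betas):
        denom = d_so + x * np.cos(beta) + y * np.sin(beta)         # d_so * L(x, y, beta)
        u = d_so * (-x * np.sin(beta) + y * np.cos(beta)) / denom  # eq. (1.13)
        v = d_so * z / denom                                       # eq. (1.14)
        L2 = (denom / d_so) ** 2                                   # L^2 of eq. (1.12)
        iu = np.clip(np.round(u + nu / 2.0).astype(int), 0, nu - 1)
        iv = np.clip(np.round(v + nv / 2.0).astype(int), 0, nv - 1)
        vol += filt[view][iv, iu] / L2
    return vol * (2 * np.pi / len(betas))  # angular step of the circular scan
```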
1.2. Iterative reconstruction

The iterative reconstruction (IR) methods include statistical reconstruction (SR) algorithms and algebraic reconstruction techniques (ART), but they all compute the final image iteratively through the same top-level loop (Fig. 2). Many IR algorithms are available; representative ones are the maximum likelihood (ML) expectation maximization (EM) formula [25-27], the simultaneous algebraic reconstruction technique (SART) [28-30], and the Convex algorithm [31, 32]. With ML-EM, the image is obtained iteratively as an optimal estimate that maximizes the likelihood of detecting the actually measured photons, based on a statistical model of the imaging system. The EM method can also be deterministically interpreted as the process of minimizing the I-divergence between the estimated and measured projection data in the nonnegative space [33]. The SART algorithm iteratively minimizes the mean square error between the estimated and measured projections in the real space. The Convex algorithm is a statistical reconstruction algorithm for transmission CT, which also aims at maximizing the Poisson likelihood. The IR algorithms are superior to the analytic methods in terms of image quality (contrast and resolution) for noisy and/or incomplete projections.

Fig. 2. Iterative reconstruction process.
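The top-level loop of Fig. 2 (forward-project the current estimate, compare it with the measured data, backproject the correction, and update the image) can be illustrated with a small matrix-based ML-EM sketch. This is our own toy example, not the authors' implementation: the explicit system matrix A and the random test problem are assumptions made only to keep the code self-contained and runnable.

```python
import numpy as np

def ml_em(A, p, n_iter=30):
    """Toy ML-EM: A[i, j] is the contribution of voxel j to ray i, p the measured data."""
    n_rays, n_voxels = A.shape
    x = np.ones(n_voxels)              # start from a uniform image
    sens = A.sum(axis=0)               # sensitivity image (column sums)
    sens[sens == 0] = 1.0              # guard voxels seen by no ray
    for _ in range(n_iter):
        forward = A @ x                # forward-projection of the current estimate
        forward[forward == 0] = 1e-12  # avoid division by zero
        ratio = p / forward            # compare with the measured projections
        x *= (A.T @ ratio) / sens      # backproject the correction and update the image
    return x

# Toy usage: a random 200-ray by 64-voxel system with a known "true" image.
rng = np.random.default_rng(0)
A = rng.random((200, 64))
true_image = rng.random(64)
recon = ml_em(A, A @ true_image)
```

In a realistic implementation the matrix-vector products are replaced by matrix-free forward- and backprojection operators, which is exactly the part that the following sections parallelize.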
2. Why a single PC is insufficient

The combination of the cone-beam geometry and the spiral scan has made data acquisition for 3D reconstruction ever faster and easier. However, along with the ease of obtaining large 3D projection data sets comes the intense computation needed to reconstruct the volume. A micro-CT scanner with 4000 by 2000 detector cells will need, depending on the resolution settings, more than three hours to reconstruct a decent image volume using the FBP method. The reconstruction time for a whole-body CT scan on a medical scanner can be long as well.

The situation is even more challenging when the IR technique is used, because the major disadvantage of IR is its high demand on computation. Besides updating an intermediate image, a single iteration of IR needs one forward-projection and one backprojection, whereas the FBP method needs only one backprojection. The forward-projection step is of the same computational complexity as the backprojection step. As an example, the EM method often requires at least 30 iterations. Therefore, it takes approximately 60 times longer than the FBP method to obtain a reconstruction of comparable quality.

Fig. 3. Memory allocations needed from a scan (a) for FBP (b) and IR (c).

Another problem with IR is its large memory requirement. For example, to reconstruct a volume of 512^3 voxels it may use data from 384 projection views with 256 by 512 detector cells. The memory buffers for one volume and for the projection data will be 512 MB and 192 MB, respectively (assuming that each data point is stored as a 4-byte floating-point value). Even without caching other intermediate parameters, we need another two memory allocations of the size of the image volume, for the per-voxel error updates and the associated weighting factors, respectively (Fig. 3). In total, at least 1,728 MB (512×3 + 192), or about 1.7 GB, of free memory is needed for IR. This already exceeds the capacity of most current 32-bit PCs. As a result, even though IR is highly desirable for 3D image reconstruction, it is impossible to implement such an IR algorithm on a single PC.
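The arithmetic behind these numbers can be reproduced in a few lines (purely illustrative, using the sizes quoted above and 4-byte floats):

```python
# Back-of-the-envelope memory estimate for the IR example above.
bytes_per_value = 4
MB = 1024 ** 2

volume_mb = 512 ** 3 * bytes_per_value / MB        # image volume: 512.0 MB
proj_mb = 384 * 256 * 512 * bytes_per_value / MB   # projection data: 192.0 MB
total_mb = 3 * volume_mb + proj_mb                 # volume + error + weights + projections
print(volume_mb, proj_mb, total_mb)                # 512.0 192.0 1728.0
```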
3. How to accelerate?

To speed up the reconstruction, three approaches are generally viable (Fig. 4). These approaches are not mutually exclusive and can actually be combined. The first approach is to improve the algorithm itself, both for a faster convergence rate and for a more efficient backprojection implementation. In the FBP case, approximate algorithms are often preferred over exact algorithms for the sake of efficiency. In the IR case, the ordered-subset (OS) scheme is the most popular trick, which is able to accelerate IR by an order of magnitude [34-36].

Fig. 4. Development of acceleration methods for CT image reconstruction: improved algorithms (Katsevich 2002 [21], fast exact FBP; Wang 1993 [16], Feldkamp-type for various scan loci; Hudson & Larkin 1994 [34], ordered subsets), hardware acceleration (proprietary chips on scanners; commodity graphics cards, Cabral 1994 [38], Feldkamp-type), and parallel computing, from mainframe parallel computers (Guerrini 1989 [42], vector computer; Chen 1990 [46], hypercube; McCarty 1991 [44], mesh-parallel; Atkins 1991 [45], transputers) to distributed computing (Shattuck 2002 [48], fast Internet-connected cluster for EM 3D PET; Li 2004 [56], OS-SART on a cluster for X-ray CT; Li & Ni 2004 [57, 58], P2P-enhanced network).

The second approach uses the graphics processing unit (GPU) to conduct the expensive processing steps such as forward-projection and backprojection. The modern GPU is able to handle 32-bit floating-point data and may have as many as four channels for color and eight channels for texture rendering that work in parallel [37]. It incorporates common arithmetic operations on the chip. Unlike the specially designed chips installed on most commercial CT scanners, the GPU also features a fully programmable interface, and some GPUs support high-level programming languages such as C. Commodity texture-mapping hardware has been reported to support the Feldkamp, SART, EM and OS-EM algorithms [38-40].

The third approach is parallelization of the computation. The most time-consuming part of the reconstruction process is backprojection in FBP, and both forward-projection and backprojection in IR. Various parallelization schemes have been proposed to distribute the workload of backprojection and forward-projection among parallel computing units. They can be categorized into two types according to the hardware used.

The first type of parallel computing system utilizes a centralized parallel system such as a VLSI-architecture computer, a vector computer, a large-scale parallel computer, or a shared-memory multi-processor computer based on the single instruction multiple data (SIMD) structure [41-47]. Early efforts on parallel CT image reconstruction were often based on these mainframe computers. They were expensive, less scalable, and inefficient in terms of hardware usage. The architecture of these parallel machines was usually fixed upon manufacture, and users have little control over the design. This results in incompatible parallel implementations unique to the specific parallel computer; algorithms would have to be re-developed should the hardware change.

Fig. 5. Data partition scheme in the projection data domain. Blue arrow lines and straps illustrate the portion of the projection data and the corresponding voxels being updated on individual worker nodes, respectively.

The second type of parallel computing system employs general-purpose computers connected by a fast local area network (LAN). Such systems are built on the multiple instruction multiple data (MIMD) architecture. As computing technology has advanced dramatically, most recent parallel implementations are of this type [48-52]. The PC cluster comprises several workstation nodes: one of them is the master node and the rest are worker nodes. The master node is usually a control and coordination unit; it distributes data to the worker nodes and receives results from them for integration (Fig. 5). PC clusters have been built on either WinNT or Unix/Linux platforms and can be a hybrid of SIMD computers and MIMD systems. In fact, dual-processor nodes are not uncommon for many clusters, and some even installed four processors on each node [50, 52]. These systems normally call a set of cross-platform message-passing libraries as the communication interface among individual nodes. Earlier implementations used the parallel virtual machine (PVM) protocol [53]; later ones often take advantage of the message passing interface (MPI) protocol [54]. Multi-threading and message passing can be combined to fully exploit the computation power of a multi-processor cluster system. In [52], the inter-node communication is coordinated by MPI, while the intra-node data partition is controlled by the OpenMP multi-threading protocol [55] (Fig. 6). Most of these studies have targeted EM and OS-EM algorithms for PET and SPECT. In [56], parallel OS-SART was implemented on a Linux cluster for X-ray CT.

Fig. 6. Hybrid architecture of SIMD and MIMD. Each node has more than one processor. They achieve multithreading via the OpenMP interface. The inter-node interface is MPI. Note that there can be MPI links among worker nodes; here only links between master and worker nodes are shown.
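As a concrete, deliberately simplified illustration of the data-partition scheme in Fig. 5, the following mpi4py sketch assigns projection views to MPI ranks in a round-robin fashion and sums the partial backprojections onto the master node. It is not the implementation used in [56]; the volume size, the number of views, and the backproject_view kernel are placeholders.

```python
import numpy as np
from mpi4py import MPI

VOL_SHAPE = (128, 128, 128)   # assumed toy volume size
N_VIEWS = 384                 # assumed number of projection views

def backproject_view(view_index):
    """Placeholder for the real per-view backprojection kernel."""
    return np.zeros(VOL_SHAPE, dtype=np.float32)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Static partition of the sinogram: each rank takes every size-th view.
partial = np.zeros(VOL_SHAPE, dtype=np.float32)
for k in range(rank, N_VIEWS, size):
    partial += backproject_view(k)

# Sum the partial volumes onto the master node (rank 0).
volume = np.zeros(VOL_SHAPE, dtype=np.float32) if rank == 0 else None
comm.Reduce(partial, volume, op=MPI.SUM, root=0)

if rank == 0:
    print("reconstructed volume ready:", volume.shape)
```

This could be launched with, for example, mpirun -np 8 python sketch.py; in the hybrid MPI/OpenMP setting described above, each rank would additionally split its own views across local threads.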
The parallel performance measured on a small Linux cluster is illustrated in Fig. 7 and Fig. 8. The maximum speedup is achieved when the number of processors is about 9.

Fig. 7. Total iteration time and total computation time vs. the number of computational nodes (servers).

Fig. 8. Performance in terms of speedup vs. the number of computing nodes used.
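For reference, the speedup and parallel efficiency behind plots such as Fig. 8 follow the usual definitions (standard formulas, not stated in the paper itself), with T(1) the single-node run time and T(p) the run time on p nodes:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}
```

A peak near nine nodes, as reported above, typically indicates that beyond that point the communication and coordination overhead begins to outweigh the gain from adding nodes.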
The data used for this case study is a 3D Shepp-Logan phantom; reconstructed results are shown in Fig. 9 (a) and (b).

Fig. 9. Reconstructed image results: (a) selected sections of reconstructed images; (b) the reconstructed profile image of the 3D Shepp-Logan phantom (the blue blur represents the reconstructed image cube).

4. Distributed parallel reconstruction with Internet

Decentralized parallel computing has many desirable features, among them flexibility, reliability, and cost-effectiveness. Recent studies favor highly distributed parallel reconstruction based on modern Internet technologies. For example, in [48], [50], and [52] a Java-applet-enabled web interface has been built to submit projection data and start the reconstruction. The remote cluster reconstructs the image and sends it back for analysis. Upon receiving the data, the master node may either dispatch them directly to worker nodes or maintain a job queue that worker nodes check regularly. From the point of view of the distributed network model, these tasks are based on the client-server (C/S) topology. The trend is that the client side becomes thinner and simpler, leaving only the data submission and job request functionalities. The server completes the heavy-duty tasks such as searching a database, calculation, information integration, and, in the CT case, image reconstruction. If a PC cluster is used instead of a mainframe computer, the master node of the cluster is also the server and is connected with the clients; the other worker nodes connect only to the master node via a high-speed local area network (LAN) (Fig. 10) [57].

Fig. 10. Centralization of the C/S model. (a) Clients are connected to a mainframe supercomputer; (b) clients are connected to a remote gateway first through the Internet; (c) clients are connected to a remote cluster server.

5. Peer-to-peer Distributed Network

The peer-to-peer (P2P) technology is able to further improve the robustness and performance of a parallel reconstruction system by facilitating distributed computation in a fundamentally different fashion. The P2P model is a fully decentralized and distributed topology. With P2P, the client is directly connected with all other computing peers seamlessly, forming a virtual computer [57, 58]. There are multiple Internet connections between the client peer and the other computing peers (Fig. 11). Because there is no longer such a thing as the server, one node's failure will not cause the entire system to fail: each peer is the server and the client at the same time. P2P is not a new concept; the idea of a drastically distributed and decentralized application was tried with USENET as early as the 1980s. It has become popular recently, however, largely because of its application in popular software products such as Napster (MP3 music file sharing) and Gnutella (file sharing) [59]. As far as Napster is concerned, some may argue that it is not P2P software in the strict sense, because there is still a central server that maintains a database of all peers: users first install a Napster peer program and then share music files by logging into the server, which searches the database and matches peers with common file-sharing interests. Without this server, peers would have no idea where other peers are. This raises the question of what a P2P network really is. Generally speaking, in a P2P network a peer should be able to: 1) find other peers and form or join a small network of peers who have common interests; 2) broadcast the resources it owns and discover new resources from other peers; 3) control its own resources with privileges and share them with other peers in collaboration; and 4) perform all the above tasks regardless of the operating system and network topology. Of course, there are issues of security, policy, and copyright that are inherent to the P2P network model; any P2P system that is meant to be a real application must address these issues properly (a minimal illustrative sketch of such a peer follows Fig. 11).

Fig. 11. Client/Server (a) and peer-to-peer (b) decentralized topology models.
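To make the idea that every peer is both server and client more tangible, here is a heavily simplified sketch of what such a reconstruction peer could look like, using only Python's standard XML-RPC machinery. It is emphatically not the authors' P2P middleware: the class and function names, the toy volume size, and the failure-handling policy are all hypothetical.

```python
import threading
import numpy as np
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

VOL_SHAPE = (64, 64, 64)  # assumed toy volume size

def backproject_single_view(view_index):
    """Stand-in for a real single-view backprojection kernel."""
    return np.zeros(VOL_SHAPE, dtype=np.float32)

class ReconstructionPeer:
    def __init__(self, host, port, neighbours):
        # 'neighbours' holds URLs of known peers, e.g. "http://node2:9000";
        # at least one neighbour is assumed for the sketch.
        self.neighbours = list(neighbours)
        self.server = SimpleXMLRPCServer((host, port), allow_none=True, logRequests=False)
        self.server.register_function(self.backproject_subset, "backproject_subset")

    def serve_forever(self):
        # Server role: answer backprojection requests from other peers.
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def backproject_subset(self, view_indices):
        partial = np.zeros(VOL_SHAPE, dtype=np.float32)
        for k in view_indices:
            partial += backproject_single_view(k)
        return partial.ravel().tolist()  # XML-RPC friendly payload

    def distribute(self, all_views):
        # Client role: split the views among neighbours; if one fails, do its share locally.
        volume = np.zeros(VOL_SHAPE, dtype=np.float32)
        chunks = np.array_split(np.asarray(all_views), len(self.neighbours))
        for url, chunk in zip(self.neighbours, chunks):
            try:
                result = ServerProxy(url).backproject_subset(chunk.tolist())
            except OSError:
                result = self.backproject_subset(chunk.tolist())  # tolerate the failed peer
            volume += np.asarray(result, dtype=np.float32).reshape(VOL_SHAPE)
        return volume
```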
What P2P brings to medical imaging is great flexibility, robustness, and scalability. Imagine that a patient at a clinic has just received a CT scan and the doctor wants to see the reconstructed images for diagnosis. The clinic is connected to the hospital's imaging center via a broadband LAN, and the center utilizes a high-performance cluster for fast reconstruction. Now suppose the master node has crashed. The whole cluster then becomes unavailable to the outside world, even if the remaining computing nodes are still functioning. In fact, we have experienced similar situations when the remote cluster server for parallel computing failed, disrupting the entire research effort. Recently we have built a P2P-enhanced network for iterative image reconstruction. The performance result (shown in Fig. 12) is comparable to that of a cluster with a similar setup [57, 58]. This indicates that in a clinic one can integrate the PCs in each room to virtually construct a parallel system for high-performance medical image reconstruction. P2P technology is therefore very promising, since personal desktop machines keep becoming cheaper.

Fig. 12. Performance in terms of speedup vs. the number of computing nodes and iterations used, in a P2P distributed computing environment.

6. Future Work

Decentralized parallel computing has many merits, among them reliability, low cost, and economic efficiency. From supercomputers to PC clusters, and from a local cluster to a remote cluster server, the computer and Internet technologies have evolved along this trend over the decades. However, the computation itself is still centralized regardless of the parallelism. Now the peer-to-peer technology further improves the system reliability and robustness. The concept applies to all the representative iterative and non-iterative algorithms, including EM-type, ART-type, Feldkamp-type and Katsevich-type algorithms for cone-beam image reconstruction, as long as forward/backward projection is involved. Moreover, parallel reconstruction techniques are not restricted to transmission and emission CT; many medical imaging and analysis processes can be accelerated with parallel computing, including optical tomography, image visualization, and so on. Recently, the Grid has emerged as an infrastructure that will change the Internet and the way people think of science and computing [60-62]. In general, the Grid technology views computing resources, storage, CPU cycles, memory, and so forth as a network similar to an electrical network: any resource, once grid-enabled, will be ready for "plug-in-and-use". Thus, many hospitals may be connected via the Grid, forming a virtual organization (VO). In such a VO, one medical image reconstruction site's need for parallel computing can be shared across every facility in the Grid. To take advantage of such a highly distributed computing model, however, two things are necessary. One is a high-speed connection such as the TerraByte network. The other is a dedicated reconstruction algorithm carefully tailored to this diversified, distributed parallelism.

References

[1] Seeram E. Computed tomography: physical principles, clinical applications, and quality control. 2nd edition, WB Saunders Co. 2000; 1-8.
[2] Burnham CA, Brownell GL. A multi-crystal positron camera. IEEE Transactions on Nuclear Science. 1972; 201-205.
[3] Chesler DA. Three-dimensional activity distribution from multiple positron scintigraphs. Journal of Nuclear Medicine. 1971; 12: 347-8.
[4] Brownell GL. A history of positron imaging. /~glb/alb.html.
[5] Kuhl DE, Edwards RQ. Image separation radioisotope scanning. Radiol. 1963; 80: 653-662.
[6] Labbe J. SPECT/CT emerges from the shadow of PET/CT. Biophotonics International. 2003; 50-57.
[7] Scarfone C. Single photon emission computed tomography (SPECT). /bae/courses/bae590f/1995/scarfone/index.html.
[8] Seeram E. Computed tomography: physical principles, clinical applications, and quality control. 2nd edition, WB Saunders Co. 2000; 23.
[9] Goldman LW. Principles of CT and the evolution of CT technology. RSNA Categorical Course in Diagnostic Radiology Physics: CT and US Cross-sectional Imaging. 2000; 33-52.
[10] Kalender WA, Seissler W, Vock P. Single-breath-hold spiral volumetric CT by continuous patient translation and scanner rotation. Radiology. 1989; 173: 14.
[11] Crawford CR, King KF. Computed tomography scanning with simultaneous patient translation. Med. Phys. 1990; 17: 967-982.
[12] Wang G, Lin TH, Cheng PC, Shinozaki DM, Kim HG. Scanning cone-beam reconstruction algorithms for x-ray microtomography. Proceedings of SPIE. 1992; 1556: 99-112.
[13] Taguchi K, Aradate H. Algorithm for image reconstruction in multi-slice helical CT. Med. Phys. 1998; 25: 550-561.
[14] Feldkamp L, Davis L, Kress J. Practical cone-beam algorithm. Journal of the Optical Society of America. 1984; 612-619.
[15] Kudo H, Saito T. Helical-scan computed tomography using cone-beam projections. IEEE Medical Imaging. 1991; 1958-1962.
[16] Wang G, Lin TH, Cheng PC, Shinozaki DM. A general cone-beam reconstruction algorithm. IEEE Trans. Med. Imaging. 1993; 12: 486-496.
of Applied Mathematics. 1983; 43: 546-551. Rockmore AJ, Macovski A. A maximum likelihood approach to image reconstruction. IEEE Trans. Nucl. Sci. 1976; NS-23: 1428-1432. Shepp LA, Valdi Y. Maximum likelihood reconstruction for emission tomography. IEEE, Trans. Med. Imag. 1982; MI-1: 113-122. Lange K, Carson R. EM reconstruction algorithms for emission and transmission tomography. J. Comput. Assist. Tomog. April 1984; 8(2): 302-316. Andersen AH. Algebraic reconstruction in CT from limited views. IEEE Trans. Med. Imag. 1989; 8: 50-55. Andersen AH, Kak AC. Simultaneous algebraic reconstruction technique (SART): A superior implementation of the ART algorithm. Ultrasonic Imaging. 1984; 6: 81-94. Jiang M, Wang G. Convergence of the simultaneous algebraic reconstruction technique (SART). IEEE Trans. Image Processing. 2003; 12: 957-961. Lange K. Convergence of EM image reconstruction algorithms with Gibbs smoothing. IEEE Trans. Med. Imaging 1990; 9 (4): 439-446. Lange K, Fessler JA. Globally convergent algorithms for maximum a posteriori transmission tomography. IEEE Trans. Image Processing. 1995; 4 (10): 1430-1438. Snyder DL, Schulz TJ, O'Sullivan JA. Deblurring subject to nonnegativity constraints. IEEE Trans. Signal Processing. 1992; 40: 1143-1150. Hudson HM, Larkin RS. Accelerated image reconstruction using ordered subsets of projection data. IEEE Trans. Med. Imag. 1994; 13: 601-609. Kamphuis C, Beekman FJ. Accelerated iterative transmission CT reconstruction using an ordered subsets convex algorithm. IEEE Trans. Med. Imaging. December 1998; 17 (6). Erdogan H, Fessler JA. Ordered subsets algorithms for transmission tomography. Phys. Med. Biol. 1999; 44(11). Xu F. Tomographic Reconstruction using graphics hardware. Nov. 2003. Cabral B, Cam N, Foran J. Accelerated volume rendering and tomographic reconstruction using。