GPU Acceleration of a Fast Clustering Algorithm


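What Is GPU Acceleration and How to Use It to Improve Computer Performance

In recent years, steady progress in computing technology has driven development across many industries. Within that progress, GPU (Graphics Processing Unit) acceleration has played an important role in raising computer performance. This article introduces what GPU acceleration is and how to use it to improve a computer's performance.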

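1. What Is GPU Acceleration

GPU acceleration is the technique of using the GPU to carry out computing tasks. In a traditional computer, the central processing unit (CPU) handles most computing tasks, including graphics rendering and audio/video encoding and decoding. However, the CPU was not designed with graphics computation in mind, so it is relatively inefficient on large-scale graphics workloads.

The GPU, by contrast, was designed specifically for image and graphics computation. It has a large number of processing cores that can work on large-scale graphics workloads in parallel, so it performs well in graphics rendering, video editing, 3D modeling, and similar areas. Building on this property, researchers began applying GPUs to general computing tasks to speed them up.

2. How GPU Acceleration Works

GPU acceleration works by breaking a computing task into many small tasks and having the GPU's processing cores execute those small tasks in parallel. Compared with a single CPU core executing tasks serially, the GPU's parallelism can raise computing speed substantially.

To use the GPU for acceleration, the relevant steps of the computation must first be recast in a form that can be processed in parallel. Then a parallel computing framework such as CUDA (Compute Unified Device Architecture) or OpenCL (Open Computing Language) distributes these tasks to the GPU's processing cores. Finally, the results computed on the GPU are sent back to the CPU for further processing and display.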
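
The decompose, dispatch, and gather pattern described above can be illustrated with a minimal Python sketch. It is only an analogy: a CPU thread pool stands in for the GPU's cores, and the names `square_chunk`, `split`, and `parallel_square` are hypothetical helpers introduced for illustration, not part of CUDA or OpenCL.

```python
from concurrent.futures import ThreadPoolExecutor


def square_chunk(chunk):
    """The 'small task': every worker runs the same operation on its own slice."""
    return [x * x for x in chunk]


def split(data, n_chunks):
    """Cut the input into at most n_chunks contiguous slices of nearly equal size."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_square(data, n_workers=4):
    chunks = split(data, n_workers)                      # 1. decompose the task
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(square_chunk, chunks))   # 2. dispatch; map keeps input order
    return [y for part in partial for y in part]         # 3. gather the partial results


result = parallel_square(list(range(10)))
```

On a real GPU the same three stages appear as a host-to-device copy, a kernel launch over many threads, and a device-to-host copy.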

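3. How to Use GPU Acceleration to Improve Computer Performance

1. Check the system configuration: first determine whether the computer has a discrete GPU and, if so, its model and performance. On Windows, press Win + R to open the Run dialog and enter "dxdiag" to view the system configuration. On a Mac, click the Apple logo in the top-left corner and choose "About This Mac" to view system information.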
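
The same check can be scripted. The sketch below is one possible approach under a stated assumption: it only detects NVIDIA GPUs, and only when the NVIDIA driver's `nvidia-smi` command-line tool is installed and on the PATH. The function name `find_nvidia_gpus` is hypothetical. GPUs from other vendors are not detected.

```python
import shutil
import subprocess


def find_nvidia_gpus():
    """Return a list of NVIDIA GPU names, or [] if none can be detected."""
    exe = shutil.which("nvidia-smi")  # assumption: the NVIDIA driver tools are on PATH
    if exe is None:
        return []
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # tool present but failed (for example, no driver loaded)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]


gpus = find_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected via nvidia-smi")
```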

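Research on GPU-Accelerated Visual SLAM Algorithms

As artificial intelligence keeps developing, machine vision is being applied ever more widely. Within machine vision, SLAM (simultaneous localization and mapping) is a very important research direction. SLAM builds a map of an unknown environment and supports autonomous navigation, and it is one of the key technologies in robotics, autonomous driving, and related fields. However, SLAM is computationally very expensive and hard to run in real time, which limits its use in practice. To address this, researchers in recent years have applied GPU acceleration to visual SLAM algorithms, combining the computing power of the CPU and the GPU to accelerate these algorithms and run them in real time. This article introduces recent progress in, and applications of, GPU-accelerated visual SLAM.

1. GPUs in SLAM

As hardware has improved, GPUs have become increasingly powerful, and today's GPUs can carry out parallel computation faster than CPUs. Many of the computations in a SLAM algorithm can be divided into small units of work, which a GPU completes more efficiently in parallel. GPU acceleration has therefore been applied in SLAM research to process and track large amounts of data efficiently.

2. Real-Time, Fast SLAM Algorithms with GPU Acceleration

Real-time, fast visual SLAM based on GPU acceleration has become an active research topic. These algorithms can handle several kinds of visual input, such as RGB-D, monocular, or multi-camera data, and use GPU parallel computing to generate 3D maps and pose estimates quickly. The following are some of the more mature GPU-accelerated SLAM algorithms:

1. ORB-SLAM2: ORB-SLAM2 is a popular visual SLAM algorithm that supports monocular, stereo, and RGB-D cameras. Compared with its predecessor ORB-SLAM, it runs faster and performs better. By tracking and matching ORB feature points, and when combined with GPU parallel computing, it can produce a high-accuracy 3D map, the camera trajectory, and related information.

2. MVSLAM: MVSLAM is a visual SLAM algorithm based on multiple cameras that can build a 3D scene model quickly and accurately. It relies mainly on a sub-graph algorithm run in parallel on the GPU to accelerate SLAM, giving it faster execution and better computational performance.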

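VTK GPU Acceleration Principles

VTK (Visualization Toolkit) is an open-source, cross-platform computer graphics library for 3D visualization and graphics processing. It provides a set of powerful tools and algorithms for creating, manipulating, and visualizing large datasets. To speed up computation, VTK can use the GPU (Graphics Processing Unit).

A GPU is a hardware device designed specifically for processing graphics and images, with a highly parallel architecture and strong computing power. Compared with a traditional central processing unit (CPU), a GPU can deliver higher computing speed on large-scale data. VTK uses the GPU's parallel computing capability to accelerate dataset processing and visualization.

GPU acceleration in VTK rests mainly on the following aspects:

1. GPU parallel computing: a GPU has a large number of compute cores and can execute many computing tasks at once. VTK divides a computing task into sub-tasks and assigns them to different GPU cores for parallel execution, which speeds up the computation.

2. GPU memory management: a GPU has its own video memory, which can hold large datasets. VTK copies the dataset to be processed from main memory into GPU memory, computes and operates on it on the GPU, and then copies the result back to main memory. Keeping the data on the GPU between operations avoids repeated transfers and improves efficiency.

3. GPU hardware acceleration: a GPU has dedicated hardware functions, such as texture mapping and rasterization, that accelerate graphics rendering and visualization. By using them, VTK can achieve higher-quality rendering and smoother interaction.

4. GPU-optimized algorithms: VTK includes algorithms designed around the characteristics of GPU hardware so as to make full use of its computing power and memory capacity. For example, VTK uses a GPU-based parallel sorting algorithm that can sort large-scale data in a short time.

5. GPU and CPU cooperation: some computing tasks in VTK cannot be completed by the GPU alone and require cooperation with the CPU. VTK assigns each task to the processor best suited to it, using GPU and CPU resources together to obtain the best acceleration.

In summary, VTK combines the GPU's parallel computing capability, memory management, hardware acceleration, and optimized algorithms to process and visualize large datasets quickly.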

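GPU Acceleration Principles

GPU acceleration uses the parallel computing capability of the graphics processing unit (GPU) to speed up tasks that require a large amount of computation.

Traditionally, the central processing unit (CPU) handles the various computations of a computer system. However, the CPU is less efficient at parallel computation because it has relatively few cores.

Because graphics processing demands large-scale parallel computation, the GPU was designed for graphics rendering and has a large number of cores. The same parallel capability makes it highly effective at accelerating other data-intensive applications.

When a task requires a large amount of parallel computation, the GPU can execute it in place of the CPU. Scientific computing, machine learning, and data mining, for example, can all use the GPU for high-speed computation.

The principle is to divide a computing task into many small tasks and distribute them in parallel across the GPU's cores. Because the GPU has so many cores, each handling part of the work at the same time, computation can be sped up greatly.

In addition, the GPU uses a dedicated memory structure, video memory (VRAM), which allows faster access to and processing of image and graphics data. For tasks that read and write such data frequently, the GPU's memory structure provides higher data transfer rates and speeds up computation further.

Overall, GPU acceleration exploits the GPU's many cores and dedicated memory structure to deliver higher computing efficiency on large-scale parallel workloads. It is now widely used in many fields and has brought large performance gains to large computing tasks.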

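Tips for GPU-Accelerated Computing in MATLAB

As computer performance rises and scientific research deepens, traditional central processing units (CPUs) alone increasingly struggle to meet the demands of high-performance computing. To improve computing speed and efficiency, graphics processing units (GPUs) have come into wide use in scientific computing. MATLAB, a powerful numerical computing environment, also provides GPU-accelerated computing. This article introduces some tips for using it, to help readers make better use of the GPU for efficient computation.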

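1. Understand the Basic Principles of GPU Computing

A GPU is a parallel processor that, compared with a CPU, has more cores and higher memory bandwidth. This gives it a clear advantage on certain parallel computing tasks. Before using GPU acceleration, we need to understand the basic principles of GPU computing.

The core idea of GPU computing is to execute instructions in parallel across a large number of threads. Every thread can execute the same instruction while operating on different data. Threads on the GPU are organized into thread blocks, each containing multiple threads. By dividing the work sensibly across thread blocks and threads, complex computing tasks can be processed in parallel.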
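
The block and thread organization described above can be emulated in a few lines of Python. This is a serial sketch for illustration only: `launch` and `add_kernel` are hypothetical names, and the loops run one after another, whereas a real GPU would run the threads concurrently. The index arithmetic mirrors the usual one-dimensional scheme in which a thread's global index is the block index times the block size plus the thread index.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch by visiting every (block, thread) pair serially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Same instruction for every thread, different data element per thread."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # guard: the last block may be partly idle
        out[i] = a[i] + b[i]


n = 10
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover all n elements
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n
launch(add_kernel, grid_dim, block_dim, a, b, out)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing, which is why the bounds check is needed.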

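2. Choose Tasks Suited to GPU Acceleration

Before using GPU acceleration, we need to be clear about which tasks suit it. In general, tasks involving large-scale array computation and iterative operations are better candidates. For example, the multiplication of large matrices can be split into multiplications of many small matrices, which different threads then execute in parallel. This can raise computing speed significantly. For simple numerical operations and logical tests, by contrast, the amount of computation is small and GPU acceleration brings little benefit.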
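
The idea of splitting a large matrix product into products of small sub-matrices can be sketched as follows. This is a plain-Python illustration, not MATLAB's or any GPU library's implementation; `matmul_tiled` and the `tile` parameter are names introduced here. Each output tile depends only on its own rows of A and columns of B, which is what lets a GPU hand different tiles to different thread blocks.

```python
def matmul_tiled(A, B, tile=2):
    """Multiply A (n x m) by B (m x p), working tile by tile."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):          # each (i0, j0) output tile is independent,
        for j0 in range(0, p, tile):      # so it could be assigned to its own thread block
            for k0 in range(0, m, tile):  # accumulate products of small sub-matrices
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = matmul_tiled(A, B)  # [[58.0, 64.0], [139.0, 154.0]]
```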

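3. Compute with GPU Arrays

MATLAB provides a special array type called a GPU array. Unlike an ordinary array, a GPU array is stored in GPU memory and can be computed on directly on the GPU, which avoids frequent data transfers between the CPU and the GPU. We can use the `gpuArray` function to convert an ordinary array into a GPU array and then compute on the GPU. For example, given two data vectors `A` and `B`, the following code converts them into GPU arrays and adds them:

```matlab
A = rand(1e6, 1);        % ordinary arrays in host (CPU) memory
B = rand(1e6, 1);
A_gpu = gpuArray(A);     % copy each array to GPU memory
B_gpu = gpuArray(B);
C_gpu = A_gpu + B_gpu;   % element-wise addition runs on the GPU
```

In this example, `A` and `B` are ordinary arrays that are converted into GPU arrays by the `gpuArray` function.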

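High-Performance Computing and GPU-Accelerated Applications in C++

C++ is a powerful programming language that is widely used to develop high-performance computing and GPU-accelerated applications. In C++, developers can apply a range of optimization techniques and libraries to high-performance computing tasks while making full use of the graphics card's computing power, achieving faster computation and greater parallelism.

In high-performance computing, the performance advantages of C++ show up mainly in the following ways.

First, C++ executes efficiently. It is a statically typed language: types are checked at compile time and the compiler generates highly optimized machine code. As a result, C++ programs can often run faster than programs in other languages, especially when processing large amounts of data or performing complex computation.

Second, C++ has a rich set of optimization tools and libraries. Developers can use a range of techniques and tools to improve program performance. Compiler optimization, for example, tunes a program automatically, reducing running time and memory use. In addition, libraries and interfaces aimed at high-performance computing, such as OpenMP and MPI, are available from C++ and help developers exploit multi-core processors and distributed systems for more efficient parallel computation.

C++ also has advantages for GPU-accelerated applications. A GPU is a powerful parallel computing device that achieves very high performance by distributing work across hundreds or thousands of processing units. Platforms such as CUDA and OpenCL offer C/C++ programming interfaces for writing code that runs on the GPU and uses its high parallelism to accelerate computing tasks. With them, developers can write efficient GPU-accelerated applications and raise computing performance.

Beyond these advantages, C++ is flexible and portable. It runs on many platforms, including Windows, Linux, and macOS. Developers can write high-performance computing and GPU-accelerated applications in C++ and deploy them on different platforms. This has made C++ a widely used language in scientific computing, financial analysis, graphics rendering, and other fields.

In short, C++ is well suited to developing high-performance computing and GPU-accelerated applications. By using optimization techniques and libraries, developers can draw on the performance advantages of C++ and the computing power of the graphics card to achieve faster computation and greater parallelism.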

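Research on Neural-Network-Based Clustering Algorithms

In recent years, as artificial intelligence has developed, clustering algorithms based on neural networks have attracted growing attention from researchers. These algorithms divide data into clusters according to its features, which makes subsequent data analysis easier. This article discusses the current state of research on neural-network-based clustering, its application prospects, and its open problems.

1. Current State of Research

As data volumes keep growing, traditional clustering algorithms (for example, k-means) can no longer meet the needs of modern data. Neural-network-based clustering algorithms emerged in response. They combine the nonlinear mapping ability of neural networks with the grouping ability of clustering algorithms, so they can handle large-scale, high-dimensional data and also cluster heterogeneous data.

At present, neural-network-based clustering algorithms fall into two main categories: supervised and unsupervised. Supervised algorithms require the data to be labeled first and then classify it with a neural network; their advantage is more accurate clustering results. Unsupervised algorithms need no labels and typically use a self-organizing map (SOM) or a Gaussian mixture model (GMM); their advantage is that no additional label information is required.

2. Application Prospects

Neural-network-based clustering has broad application prospects in many fields. The most common are image segmentation and pattern recognition. In image segmentation, these algorithms divide an image into several parts, each representing an object or a texture. In pattern recognition, they help detect regular patterns in text and language, which makes classification and labeling easier.

These algorithms can also be applied to network security. For example, clustering users' network behavior data can reveal anomalous behavior and so support more effective protection.

3. Open Problems

Although neural-network-based clustering has many advantages, it also faces problems and challenges. First, these algorithms need large amounts of computing resources to work effectively. Second, because neural network models are complex, the algorithms may overfit. Third, because neural networks are black boxes, their results can be hard to interpret.

Researchers are currently looking for effective solutions to these problems. For example, some have proposed GPU-accelerated algorithms that can reduce computing time significantly.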

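GPU Acceleration

What Is GPU Acceleration

GPU acceleration is a method of using the graphics processing unit (GPU) to speed up computing tasks. The traditional computer processor (the central processing unit, CPU) is mainly used for general-purpose computing, while the GPU was originally used for graphics and image tasks. However, because the GPU offers strong parallel computing capability, fast large-scale matrix computation, and the ability to process large amounts of data at once, it has gradually become an important tool for accelerating general-purpose computing.

Advantages of GPU Acceleration

High parallel computing capability

A GPU can work on hundreds or even thousands of tasks at the same time, whereas a traditional CPU, with far fewer cores, handles far fewer tasks at once. On parallel workloads the GPU's computing capability is much higher than the CPU's and can shorten computing time significantly.

Fast large-scale matrix computation

In scientific computing and data processing, large-scale matrix computation is very common. With its parallel computing capability and specially optimized matrix operations, the GPU can execute these tasks very quickly.

Handling large amounts of data

In the era of big data, processing massive datasets is an important task. The GPU's large-scale parallel processing can work through large amounts of data quickly and raise data processing efficiency.

Application Areas of GPU Acceleration

Scientific computing and numerical simulation

In scientific research, many problems require large-scale numerical simulation and computation. GPU acceleration can raise the speed and efficiency of scientific computing significantly, making more complex and more precise simulations possible.

Deep learning and artificial intelligence

Training deep learning and artificial intelligence models requires a large number of matrix operations, and the GPU's parallel computing capability greatly speeds up training. Well-known deep learning frameworks such as TensorFlow and PyTorch provide GPU acceleration support.

Digital image processing and graphics rendering

The GPU was first used for graphics rendering and image processing, as in 3D games and animation production. Its highly parallel computing capability allows complex images and graphics to be processed and rendered quickly.

How to Use GPU Acceleration

Using GPU acceleration libraries

Many programming languages and frameworks provide GPU acceleration libraries and tools, so users can run computations on the GPU from their code with little effort. For example, CUDA is a GPU computing platform and API for parallel computation developed by NVIDIA. In CUDA-accelerated code, developers write GPU kernel functions to use the GPU's parallel computing capability.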


Chapter 43
Fast Clustering of Radar Reflectivity Data on GPUs

Wei Zhou, Hong An, Hongping Yang, Gu Liu, Mu Xu and Xiaoqiang Li

Abstract In short-term weather analysis, we use a clustering algorithm as a fundamental tool to analyze and display radar reflectivity data. Unlike ordinary parallel k-means clustering algorithms built on the compute unified device architecture (CUDA), our clustering of radar reflectivity data faces a large-scale dataset and a high-dimensional texture feature vector as the clustering space. Memory access latency therefore becomes a new bottleneck in our short-term weather analysis application, which must run in real time. We propose a novel parallel k-means method on graphics processing units (GPUs) that uses on-chip registers and shared memory to cut the dependency on off-chip global memory. The experimental results show that our optimization approach achieves a 40× performance improvement over the serial code. It sharply reduces the algorithm's running time and satisfies the real-time requirement of short-term weather analysis applications.

Keywords Clustering algorithm · Real time · Short-term weather forecast · GPU · CUDA

W. Zhou (✉) · H. An · G. Liu · M. Xu · X. Li
Department of Computer Science and Technology, University of Science and Technology of China, 230027 Hefei, China
e-mail: greatzv@

H. Yang
China Meteorological Administration, Meteorological Observation Centre, 100081 Beijing, China

W. Zhou
Hefei New Star Research Institute of Applied Technology, 230031 Hefei, China

J.J. Park et al. (eds.), Proceedings of the International Conference on Human-centric Computing 2011 and Embedded and Multimedia Computing 2011, Lecture Notes in Electrical Engineering 102, DOI: 10.1007/978-94-007-2105-0_43, © Springer Science+Business Media B.V. 2011

43.1 Introduction

Many research efforts have addressed radar data analysis for numerical weather prediction [1]. Some of them use clustering of radar reflectivity data as a fundamental tool to depict storm structure. The large volume and high update frequency (~every 5 min) of radar reflectivity data impose a strict real-time requirement on data processing. Lakshmanan [2] compared several segmentation algorithms used to analyze weather images and also proposed a nested segmentation method using k-means clustering [3]. However, the execution time of the k-means algorithm is long and grows with the size and dimensionality of the dataset. This hinders its use in short-term weather analysis, which places great emphasis on real-time operation.

On the other hand, as specialized single-chip massively parallel architectures, graphics processing units (GPUs) are becoming increasingly attractive for general-purpose parallel computation beyond their traditional use in graphics rendering. Parallel codes that once required costly computational clusters to achieve reasonable speedup can be ported to GPUs to gain equivalent performance. The NVIDIA compute unified device architecture (CUDA) makes the programming easier. GPUs also bring convenience and lower cost: they have shown throughput and performance per dollar higher than traditional CPUs by orders of magnitude. Consequently, parallel clustering algorithms written for traditional computational clusters can now be ported to desktop machines or laptops. GPUs thus give weather researchers great convenience for fast weather analysis and forecasting.

The intrinsic features of clustering algorithms make them suitable for implementation as multithreaded SIMD programs in CUDA. Many studies have aimed at porting clustering algorithms to CUDA-enabled GPUs [4–7]; however, most of them port only part of the clustering work to the GPU, not the whole process including the recalculation of the k centroids. As a result, the rich memory hierarchy of the GPU has not yet been fully utilized. Meanwhile, the bottleneck in our real-time system actually derives from long memory access latency.

In this paper, we propose a novel method that uses on-chip registers and shared memory to reduce global memory latency and, in turn, fully exploits the many-core architecture and the rich memory hierarchy to optimize the whole process of clustering radar reflectivity data. The results show that our approach reduces the algorithm's execution time to within 4 s for an input of 1 million points, which satisfies the real-time requirement of short-term weather analysis. Moreover, our approach could be applied in multimedia computing where clustering is used in image processing.

The paper is organized as follows. Section 43.2 presents related work on clustering algorithms using CUDA. Section 43.3 presents the CPU-based clustering of radar reflectivity data. Section 43.4 gives the details of our fast clustering of radar reflectivity data, including the parallelization strategy and the complexity of our algorithm. The experiments are reported in Sect. 43.5. Finally, conclusions are drawn in Sect. 43.6.

43.2 Related Work

As clustering is widely used in many areas, much research has gone into porting clustering algorithms onto GPUs [4–7].

Farivar et al. [4] parallelized the k-means algorithm and achieved a satisfying speedup. However, the algorithm dealt only with a 1D dataset and parallelized only the first phase of the algorithm.

Chen et al. [7] proposed a mean shift clustering algorithm and accelerated k-means clustering using CUDA. After each iteration, however, they use a single thread for the means update and thread synchronization, which reduces performance. The GPU's rich memory hierarchy was not well utilized in this step.

Walters et al. [6] used clustering in liver segmentation. They omitted thread synchronization between blocks while keeping the resulting difference ratio under control, achieving a large speedup at the cost of some precision.

Most of this work parallelized only some steps of clustering [4, 5, 7]. The rich memory hierarchy of the GPU was seldom fully combined with algorithmic improvement. To meet the real-time requirement, our fast clustering of radar reflectivity data adopts a new method and achieves a higher speedup.

43.3 Clustering of Radar Reflectivity Data

Clustering algorithms are widely used in automated weather analysis.
Lakshmanan [3] proposed a method to segment radar reflectivity data using texture. In the implementation of this method, the k-means clustering algorithm was adopted to cluster the radar reflectivity data.

The k-means algorithm is one of the most popular clustering methods. In our implementation, a set of n data points R = {r_1, r_2, …, r_n} in a d-dimensional space is given, where each data point represents 1 km × 1 km and the d-dimensional space is composed of the texture feature vectors extracted from radar composite reflectivity. The task of k-means is to partition R into k clusters S = {S_1, S_2, …, S_k} such that each cluster has maximal similarity as defined by a cost function related to weather analysis [8].

In our clustering of radar data, we preprocess the input data by labeling the areas requiring computation. Then we apply the k-means algorithm iteratively to partition the dataset into k clusters. It first divides the measurement space into k equal intervals and initially assigns each data point to the interval in which its reflectivity value lies, from which the initial k centroids are calculated. Then the algorithm iterates as follows:

(1) Compute the cost function between each data point and each centroid, and assign each data point to the centroid with the minimum cost. The cost function incorporates two measures, the Euclidean distance and a contiguity measure [8].
(2) Recalculate the new centroids by taking the mean of all the data points in each cluster.

The iteration terminates when the changes in the centroids are less than some threshold or the number of iterations reaches a predefined limit. The whole process is shown in Algorithm 1.

We point out that radar reflectivity has an interesting distribution: points with the same value are always in the same area. We only need to compute over the areas we care about. Areas where the reflectivity value is below a threshold can be ignored in our meteorology application. We therefore preprocess the input data and label the areas needing computation before performing k-means. The areas needing computation are only the rectangular areas in Fig. 43.1b. This feature of the reflectivity distribution sharply reduces the computational requirement.

Our clustering of radar reflectivity data includes the following steps:

1. Preprocess the input data.
2. Partition the input data using k-means with the cost function.

Fig. 43.1 a The distribution of radar reflectivity data. b The value of the area outside the rectangle is below 5 dBZ; thus, we can ignore it in our weather analysis and sharply reduce the computation. c The result after clustering

Algorithm 1 Clustering of radar data based on CPU

43.4 Design of Fast Clustering of Radar Reflectivity Data on GPUs

The computational complexity of Algorithm 1 is O[n + m(nk + n + k)]. Lines 1–2 preprocess the data once, with complexity O(n). Lines 3–28 contain a series of two-phase steps. Lines 3–12 compute the cost function and assign each point to the cluster whose centroid has the minimum cost with respect to it; this phase has a complexity of O(nk). Lines 13–23 recalculate the centroids, with complexity O[(n + k)d], where d is the dimension of a data point.

Much work has been done to parallelize the first phase, the assignment of data points. After parallelizing the first phase, however, the problem we face is that the second phase, recalculating the centroids, becomes the new bottleneck for our short-term weather forecast application, which has strict real-time requirements. To see this, observe that with p cores we can accelerate the first phase to O(nk/p), so the ratio between the two phases is O[nk/(p(n + k))] ≈ O(k/p) (for n ≫ k) when d = 1. That means that when k is a few orders of magnitude larger than the number of cores, the algorithm is bound by the first phase and not by the second. For example, in Farivar's implementation [4], there were about k = 4,000 clusters and 32 streaming multiprocessors, and the performance is bound by the first phase only.
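
As a rough numeric check of the ratio above, the following Python sketch evaluates (nk/p)/(n + k) for d = 1. It is only arithmetic on the asymptotic cost expressions, not a timing model; the function name `phase_ratio` is introduced here, n = 1 million is taken from the input size mentioned in the introduction, and the second case uses k = 16 and p = 30, the upper end of the range given for the radar application.

```python
def phase_ratio(n, k, p):
    """Ratio of the parallel assignment cost n*k/p to the serial recalculation cost n + k (d = 1)."""
    return (n * k / p) / (n + k)


n = 1_000_000
large_k = phase_ratio(n, k=4000, p=32)  # many clusters: about 124.5, close to k/p = 125
small_k = phase_ratio(n, k=16, p=30)    # few clusters: about 0.53, close to k/p

print(f"k=4000, p=32: ratio = {large_k:.1f}")
print(f"k=16,   p=30: ratio = {small_k:.2f}")
```

A ratio well above 1 means the assignment phase dominates even after parallelization; a ratio below 1 means the serial centroid recalculation is the larger term.
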
In our clustering of radar reflectivity data, however, k is 3–16 [1] and there are 16 or 30 multiprocessors (p = 16 or 30), so the second phase becomes the new bottleneck for us. We show this with the experimental results in Sect. 43.5.1. The high dimensionality of the dataset causes more memory accesses and makes things worse. We therefore adopt a new method that also parallelizes the second phase on the GPU, using shared memory and registers to reduce memory access latency.

The problems in the first phase are the large-scale dataset and the high dimension of the data points, which cause long memory latency. The on-chip register resource must be used skillfully to decrease read latency. The strategy is simply as follows: (1) keep the multiprocessors busy; (2) keep register usage high and reduce the use of local memory. In addition, coalesced access to global memory also decreases read latency.

We discuss specific design decisions for accelerating k-means on the CUDA architecture in the following two subsections. Section 43.4.1 introduces the parallel algorithm for the assignment of data points. Section 43.4.2 illustrates our novel parallel algorithm for the recalculation of the k centroids.

43.4.1 Data Points Assignment

The CPU-based algorithm for the assignment of data points is shown in Algorithm 1, lines 3–12. There are two strategies for parallelizing it. The first is centroid-oriented: the cost function values from each centroid to all data points are calculated, and then each point obtains its own centroid with the minimum cost function value. The other is data-point-oriented: one data point is dispatched to one thread, and each thread calculates the cost function from its data point to all the centroids, maintaining the minimum cost function value and the corresponding centroid. The former strategy has the disadvantage that each point, which is stored in off-chip global memory, has to be read several times, causing long latency. Another disadvantage in our clustering of radar data is that our k is small (k = 3–16), which makes the number of threads too small for the GPU scheduler to hide the latency well under this strategy. Therefore, the latter strategy is adopted in our application. The parallel algorithm for data point assignment is shown in Algorithm 2. Lines 1–2 show the design of the block size and grid; lines 3–5 calculate the position in global memory of the data point corresponding to each thread; line 6 loads the data point into a register; lines 7–13 compute the cost function and maintain the minimum.

Algorithm 2 Data points assignment based on GPU

Algorithm 2 has only one loop instead of the two loops in Algorithm 1. The loop over the n data points has been dispatched to n threads. If the number of processing elements were equal to the number of data points, this pass could finish in one step. However, the number of processing elements in our GPU is limited, and with p cores we can accelerate the first phase to O(nk/p).

The key to achieving high efficiency is reducing global memory access latency by using the on-chip registers and hiding the latency with the GPU scheduler. We accomplish this as follows.

To fully utilize the on-chip registers, we first load the data points into registers and ensure that each data point is read from global memory only once while the thread calculates the cost function. Reading from a register is much faster than reading from global memory, which largely reduces the latency. Second, we adjust the block size and optimize the kernel program to make full use of the registers and reduce the use of local memory, which is stored in global memory. Because the total number of registers in a streaming multiprocessor (SM) is limited, the kernel program and the block size have to be adjusted to fit the SM's limited register resources. Our experiments in Sect. 43.5 show that a block size of 128 gives better performance than block sizes of 32, 64, and 256. In addition, coalesced access to global memory also decreases read latency. In our design of the thread block and grid, accesses to global memory can be coalesced well.

Hiding the latency is mainly done automatically by the GPU scheduler. The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps, so the block size should be a multiple of 32. The number of blocks in our application is large enough to be scheduled.

43.4.2 K Centroids Recalculation

To obtain the new centroids, the points have to be added to the variable centros_temp in Algorithm 1, line 16. The second phase, the recalculation of the k centroids, has a relatively low computational complexity of O(nd) and is difficult to parallelize fully because of write conflicts.

Although its computational complexity is low, it is a time-consuming process in our clustering of radar reflectivity data because of long memory access latency. To compute the new centroids on the GPU, we have to accumulate all data points into the k variables of the centros_tmp array. This must be done with atomic operations, so the process in effect becomes serial. Worse, because the centros_tmp variables must be shared by all threads in all blocks, they have to be stored in global memory and suffer its long access latency.

We give a new method to solve the problems of the serial accumulation process and the long global memory latency. Our method includes two steps as follows.
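
As a preview of that two-step idea, the following Python sketch accumulates per-block partial sums and then combines them with a pairwise tree reduction. It is a serial, one-dimensional (d = 1) illustration under stated assumptions, not the authors' Algorithms 3 and 4, whose listings are not reproduced in this text. The names `block_partial_sums` and `tree_reduce` are hypothetical, and plain Python lists stand in for per-block shared memory.

```python
def block_partial_sums(points, labels, k, block_size):
    """Step 1: each block accumulates per-cluster sums and counts for its own slice."""
    partials = []
    for start in range(0, len(points), block_size):
        sums, counts = [0.0] * k, [0] * k  # stands in for per-block shared memory
        block_pts = points[start:start + block_size]
        block_lbl = labels[start:start + block_size]
        for x, c in zip(block_pts, block_lbl):
            sums[c] += x
            counts[c] += 1
        partials.append((sums, counts))
    return partials


def tree_reduce(partials):
    """Step 2: pairwise reduction, O(log B) rounds for B non-empty partial results."""
    items = list(partials)
    while len(items) > 1:
        nxt = []
        for i in range(0, len(items) - 1, 2):
            (s1, c1), (s2, c2) = items[i], items[i + 1]
            nxt.append(([a + b for a, b in zip(s1, s2)],
                        [a + b for a, b in zip(c1, c2)]))
        if len(items) % 2:  # carry an unpaired item into the next round
            nxt.append(items[-1])
        items = nxt
    return items[0]


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
labels = [0, 0, 0, 1, 1, 1, 1]  # cluster assignment from the first phase
sums, counts = tree_reduce(block_partial_sums(points, labels, k=2, block_size=3))
centroids = [s / c if c else 0.0 for s, c in zip(sums, counts)]  # [2.0, 11.5]
```

Because each block writes only to its own accumulator, no two blocks contend for the same memory location in step 1, which is the point of the divide-and-conquer arrangement.
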
Figure 43.2 shows the two steps.

First, we use a "divide and conquer" strategy to turn the serial process into a parallel one. We divide the dataset and accumulate the partial sum of each sub-dataset in different multiprocessors simultaneously. Each partial sum is dispatched to one block. The algorithm is shown in Algorithm 3. In line 1, we use shared memory instead of global memory to store the centroid_temp variables, because shared memory can be shared within one block. This reduces the latency caused by atomic operations on global memory. With p cores, the complexity of this step becomes O(n/p) when d = 1.

Algorithm 3 Data points assignment based on GPU

Second, we accumulate all the partial sums using a parallel reduction algorithm, which has a complexity of O(log n). We ensure that the partial sums are read from global memory only once and with coalesced accesses (Algorithm 4, line 4). The computation is carried out in shared memory (Algorithm 4, lines 6–13). The algorithm is shown in Algorithm 4. The count variable can be calculated in the same way. After that, we obtain the centroids by dividing the total sum variables by the count variables.

Thus, we can accelerate the whole second phase to O(n/p + k log n). However, the number of processing elements is limited. In fact, because shared memory is used in both steps of the centroid calculation, the latency is sharply reduced. Meanwhile, the serial accumulation is parallelized across several multiprocessors, and the accumulation of partial sums is computed by parallel reduction. The whole process of centroid recalculation is therefore largely parallelized. The performance is shown in our experiments in Sect. 43.5.

Algorithm 4 Parallel reduction on GPU

43.5 Experiments

We have implemented our fast clustering of radar data using CUDA version 2.3.
Our experiments were conducted on a PC with an NVIDIA GeForce GTX 275 and an Intel(R) Core(TM) Q8200 CPU. The GTX 275 has 30 multiprocessors and runs at 1.40 GHz. The GPU memory is 1 GB with a peak bandwidth of 127 GB/s. The CPU has four cores running at 2.33 GHz. The main memory is 4 GB with a peak bandwidth of 5.6 GB/s. Our dataset consists of d-dimensional texture feature vectors extracted from radar reflectivity data. There are several million data points to be clustered, and the number of clusters k is smaller than 32 according to our application demands.

43.5.1 Time Consuming Analysis

The radar reflectivity data generated by multiple radars at different times in one day was used as our input data: CREF_20100717_010, CREF_20100717_020, and CREF_20100717_030. Table 43.1 shows the proportion of time consumed by the second phase in the CPU-based algorithm. Table 43.2 shows the proportion of the second phase after parallelizing only the first phase. In the CPU-based algorithm, the first phase is the bottleneck to accelerate. After parallelizing only the first phase, however, the second phase, the recalculation of the k centroids, becomes the new bottleneck in our real-time system, taking more than 57% of the total time. This proportion does not change with the scale of the input data.

Table 43.1 The proportion in CPU based algorithm

Input radar data     The first phase (s)   The second phase (s)   Proportion of the second phase (%)
CREF_20100717_010    152.203               3.201                  2.06
CREF_20100717_020    152.882               3.093                  1.98
CREF_20100717_030    153.293               3.162                  2.02

Table 43.2 The proportion after parallelizing the first phase only

Input radar data     The first phase (s)   The second phase (s)   Proportion of the second phase (%)
CREF_20100717_010    2.204                 3.201                  59.2
CREF_20100717_020    2.187                 3.093                  58.5
CREF_20100717_030    2.377                 3.162                  57.1

43.5.2 Speed up of Fast Clustering of Radar Reflectivity Data

Figure 43.3 presents the speedup gained by using our method on the GPU for the second phase. It gives a 2× speed improvement over the serial version. Datasets with 1, 2, and 3 million points were created. Figure 43.4 shows the speedup of our fast clustering method for the whole process. We observed a 36~40× speed improvement over a baseline application running on the host machine. The speedup hardly changes with the scale of the input data.

Fig. 43.3 The speed up of the second phase using our method (y-axis: Time (s); series: CPU based, GPU based)

Fig. 43.4 The speed up of our parallel clustering method of whole process (x-axis: 1 million, 2 million, 3 million points; y-axes: Time (s) and speedup; series: CPU based, our GPU method, speedup)

43.6 Conclusion and Future Work

In this paper, we proposed a fast algorithm for clustering radar reflectivity data. It accelerates both phases of the k-means clustering algorithm. The first phase mainly uses the on-chip registers and adjusts the execution configuration to reduce global memory latency and hide the latency with the scheduler. The second phase adopts a novel algorithm: it first accumulates the partial sums of the centroids in the shared memory of different blocks in parallel, and then uses parallel reduction to obtain the total of the partial sums and finally the centroids. In this way, our clustering algorithm shows over a 40× performance improvement. It meets the real-time requirement of short-term weather analysis and forecasting.

Acknowledgment This work is supported financially by the National Basic Research Program of China under contract 2011CB302501, the National Natural Science Foundation of China grants 60633040 and 60970023, the National Hi-tech Research and Development Program of China under contract 2009AA01Z106, and the National Science & Technology Major Projects 2009ZX01036-001-002 and 2011ZX01028-001-002-3.

References

1. Yang HP, Zhang J, Langston C (2009) Synchronization of radar observations with multi-scale storm tracking. J Adv Atmos Sci 26
2. Lakshmanan V, Rabin R, DeBrunner V (2003) Multiscale storm identification and forecast. J Atmos Res 67–68:367–380
3. Lakshmanan V, Rabin R, DeBrunner V (2001) Segmenting radar reflectivity data using texture. 30th international conference on radar meteorology, Zurich, Switzerland
4. Farivar R, Rebolledo D, Chan E, Campbell R (2008) A parallel implementation of K-means clustering on GPUs. International conference on parallel and distributed processing techniques and applications (PDPTA 2008), pp 340–345
5. Miao Q, Fu ZL, Zhao XH (2009) A new approach for color character extraction based on parallel clustering. International conference on computational intelligence and software engineering, IEEE Computer Society
6. Walters JP, Balu V, Kompalli S (2009) Evaluating the use of GPUs in liver image segmentation and HMMER database searches. IEEE international parallel and distributed processing symposium, IEEE Computer Society
7. Chen J, Wu XJ, Cai R (2010) Parallel processing for accelerated mean shift algorithm with GPU. J Comput-Aided Des Comput Gr 3
8. Wang GL (2007) The development on a multiscale identifying algorithm for heavy rainfall and methods of extracting evolvement information. Ph.D. thesis, Nanjing University of Information Science and Technology