RK-Means Clustering: K-Means with Reliability


Please briefly describe the k-means algorithm workflow

The K-means clustering algorithm (k-means clustering algorithm) is one of the clustering algorithms most commonly used in data mining.

It is an unsupervised learning algorithm that partitions the sample data into K distinct clusters.

This article outlines the workflow of the K-means algorithm, covering the choice of initial centers, cluster assignment, and center updates, organized into the following parts.

1. Choosing the initial centers. The first step of the K-means algorithm is to choose the initial cluster centers.

The choice of initial centers influences the clustering result, so choosing suitable ones matters.

The most common approach is to pick K samples at random as the initial centers; other selection methods can also be used.

2. Cluster assignment. Once the initial centers are determined, the next step is to assign each sample to the cluster of its nearest center.

Compute the distance from the sample to each center, then assign the sample to the cluster of the closest center.

3. Center update. After all samples have been assigned to clusters, the next step is to update each cluster's center.

Averaging the coordinates of all samples belonging to a cluster gives that cluster's new center.

The new centers are then used for the next iteration's assignment step.

The assignment and update steps repeat until the algorithm converges.

4. Convergence criterion. The usual convergence criterion for K-means is that the centers no longer change appreciably: either no sample switches to a different cluster, or every center moves less than a given threshold.
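As a minimal sketch of the four steps above (assuming NumPy and Euclidean distance; the function and its interface are illustrative, not part of the original text):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """Plain k-means: random init, assign, update, stop at convergence."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K samples at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the samples assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop once the centers move less than the given threshold.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```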

5. Complexity analysis. The running time of K-means is dominated by the cluster-assignment and center-update steps.

In each assignment step, the distance from every sample to each of the K centers must be computed, so the cost is O(N*K*d), where N is the number of samples, K the number of clusters, and d the dimensionality of a sample.

Each center-update step averages the samples within every cluster, costing O(N*d).

The total time complexity is O(T*N*K*d), where T is the number of iterations.

When the number of samples is large, the amount of computation grows substantially.

6. Optimizations. Several refinements can improve the running efficiency and accuracy of K-means.

These include better schemes for choosing the initial centers, data structures such as k-d trees to speed up the assignment step, and weighted distances; a sketch of one seeding scheme follows.
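One widely used refinement of the initial-center selection is k-means++-style seeding, which spreads the initial centers apart. Below is a minimal sketch under the same assumptions as above (NumPy; the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++-style seeding: pick centers far from those already chosen."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform random pick
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```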

In summary, the K-means workflow consists of choosing the initial centers, assigning samples to clusters, and updating the centers.

The kmeans color clustering algorithm

K-means color clustering is a common unsupervised learning algorithm used to group the pixels of an image into clusters of similar color.

Based on the principle of minimizing within-cluster variance, it performs clustering by iteratively searching for the optimal cluster centers.

First, the algorithm randomly initializes K cluster centers (K is a preset parameter) and assigns each pixel to its nearest center.

Next, each center is updated to the mean of all pixels in its cluster, and the pixels are reassigned; this repeats until a convergence criterion is met.

The result is K clusters, each representing one color; every pixel in the image is assigned to a cluster according to its distance from the cluster centers.

The strengths of K-means color clustering are its simplicity, ease of implementation, and good efficiency even on large datasets.

It also has drawbacks: it is sensitive to the choice of initial cluster centers, may converge to a local optimum, and is sensitive to noise and outliers.

In practice, K-means color clustering is commonly used in image compression, image segmentation, and image retrieval, among other areas.

To improve its robustness and quality, it is usually combined with other techniques and methods, such as color histograms and feature extraction.

In addition, improved variants of K-means, such as weighted K-means and spectral clustering, address the limitations of the plain algorithm.

In short, K-means color clustering is a widely used image-processing algorithm: by clustering pixels it groups and compresses an image's colors, and it has broad application prospects and research value.
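As an illustration of the compression use just mentioned (a sketch, not from the original text; it assumes Pillow and scikit-learn, and "photo.jpg" is a placeholder file name):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Flatten the image into an (N, 3) array of RGB pixels.
img = np.asarray(Image.open("photo.jpg").convert("RGB"))
pixels = img.reshape(-1, 3).astype(np.float64)

# Cluster the pixels into 8 representative colors.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by its cluster center: an 8-color version of the image.
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(quantized).save("photo_8colors.png")
```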

K-Means Explained

Clustering means grouping similar things together. It differs from classification: a classifier must be given labeled examples ("this item belongs to such-and-such class") and, ideally, learns from that training set how to classify unseen data; providing training data this way is called supervised learning. In clustering we do not care what each group "is"; the goal is only to gather similar items together, so a clustering algorithm typically just needs to know how to compute similarity before it can start working. Because it usually requires no labeled training data, clustering is known in machine learning as unsupervised learning.

A simple example: given a group of primary-school students, split them into groups so that members within a group are as similar as possible while the groups differ from one another.

The final grouping depends on how you define "similar". Suppose you decide that boys are similar to boys and girls to girls, while boys and girls differ greatly; you have then represented each student by a single discrete variable taking the two values "male" and "female". Such a variable is usually called a feature.

In that case every student is mapped onto one of just two points, which already forms two natural groups, and no further clustering is needed.

Another option is to use "height" as the feature.

When I was in primary school, at the Friday assembly on the playground we would line up according to where we lived and how far away it was, so that afterwards we could walk home together in groups.

Beyond mapping items to a single feature, a common practice is to extract N features at once and assemble them into an N-dimensional vector, giving a mapping from the original data set into an N-dimensional vector space. You always need to perform this step, explicitly or implicitly, because many machine-learning algorithms work in a vector space.

Returning to clustering: setting aside what form the original data takes, suppose we have already mapped it into a Euclidean space; for ease of display, let it be two-dimensional. From the rough shape of the data points (the original post shows a scatter plot here, omitted in this excerpt), they fall into three clusters, two of them compact and the third looser.

The kmeans method in R

The k-means clustering method in R is a widely used unsupervised learning method for partitioning a dataset into K disjoint groups.

This article describes the principle of the k-means algorithm in R, how to use it, and points to watch out for.

Principle: the goal of k-means is to partition the dataset into K clusters so that distances between sample points within a cluster are as small as possible while distances between clusters are as large as possible.

The basic idea is: first choose K initial centroids (cluster centers) at random, then assign each sample point to the cluster of its nearest centroid.

Next, compute a new centroid for each cluster and reassign every sample point to the cluster of its nearest new centroid.

This process repeats until the centroids stop changing or a maximum number of iterations is reached.

The resulting clusters are the clustering we want.

Implementation: in R, the kmeans() function performs k-means clustering.

The function's basic usage is kmeans(x, centers, iter.max = 10, nstart = 1), with:

- x: the data to cluster; a matrix, data frame, or vector.

- centers: the number of clusters K, i.e., how many groups to partition into.

- iter.max: the maximum number of iterations (default 10).

- nstart: how many random starts to run (default 1); the best result is kept.

Clustering results: the returned object contains the following:

- cluster: the cluster number assigned to each sample.

- centers: the final centroid coordinates of each cluster.

- tot.withinss: the total within-cluster sum of squares, i.e., the summed squared distances from each sample point to its cluster centroid.

Example: to better understand how k-means clustering is used, here is a concrete demonstration:

```R
# Generate example data
set.seed(123)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

# Run k-means clustering
kmeans_res <- kmeans(x, centers = 2)

# Print the clustering result
print(kmeans_res)
```

The code above first generates an example dataset x containing two clusters (each matrix holds 100 random values arranged as 50 two-dimensional points), then calls kmeans() with the number of clusters set to 2, and finally prints the clustering result with print().

The basis of kmeans clustering

Abstract: 1. The basic principle of kmeans clustering; 2. Application scenarios; 3. Strengths and weaknesses.

1. The basic principle. Kmeans (K-means) clustering is a distance-based clustering method. Its main idea is to divide the points of a dataset into K clusters so that distances between points inside each cluster are as small as possible while distances between points in different clusters are as large as possible.

The concrete steps of kmeans clustering are: 1. Randomly choose K data points as the initial cluster centers.

2. Assign each point in the dataset to the cluster of the center nearest to it.

3. Based on the result of the previous step, update the center of each cluster.

The new center is the mean of all points in the cluster.

4. Repeat steps 2 and 3 until the change in the centers falls below some threshold or an iteration limit is reached.

2. Application scenarios. Kmeans clustering is widely used in all kinds of data-mining and analysis tasks, for example: 1. Analysis of typical photovoltaic and wind power generation scenarios.

Kmeans clustering can identify typical scenarios of solar and wind power generation, helping to optimize energy allocation and improve generation efficiency.

2. Customer segmentation.

In marketing, clustering customers by their consumption behavior, preferences, and similar data lets a company pinpoint target customers and design effective campaigns.

3. Text clustering.

In natural language processing, kmeans can be used to group texts, for example for news categorization or sentiment analysis.

3. Strengths and weaknesses. Kmeans clustering has the following strengths: 1. The algorithm is simple to understand and fairly easy to implement.

2. It can handle large-scale datasets.

3. For certain data distributions it yields good results.

However, kmeans clustering also has weaknesses: 1. There is no guarantee about the quality of the solution it converges to.

Although the cluster centers keep updating during the iterations, there is no guarantee that the final result is optimal.

2. For some data distributions, such as ring-shaped (circular) ones, kmeans may fail to produce a reasonable grouping (see the sketch after this list).

3. The number of clusters K must be specified in advance, which can be difficult in real applications.
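To make the ring-shaped weakness concrete, here is a small demonstration (a sketch assuming NumPy and scikit-learn; the data are synthetic): two concentric rings that k-means cannot separate, because it can only carve space into convex regions around the centers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])]      # ring of radius 1
outer = 4 * np.c_[np.cos(theta[100:]), np.sin(theta[100:])]  # ring of radius 4
X = np.vstack([inner, outer])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# k-means splits the plane in half instead of separating the rings,
# so each cluster mixes points from both rings.
print(np.bincount(labels[:100], minlength=2),
      np.bincount(labels[100:], minlength=2))
```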

k-means parameters

K-means parameters in detail. K-Means is a common clustering algorithm that partitions a dataset into K distinct groups (clusters), with each data point belonging to the cluster whose center is nearest to it.

The parameters of the K-Means algorithm include the number of clusters K, the initialization method, the number of iterations, and others.

Below are some common K-Means parameters (the names follow the scikit-learn implementation) with detailed explanations: 1. Number of clusters K (n_clusters). Description: K-Means requires the number of clusters K, i.e., how many groups the data should be split into, to be specified in advance.

How to choose: usually determined from domain knowledge or the problem's requirements, or by trying different values of K and comparing them with an evaluation metric such as the silhouette coefficient.

2. Initialization method (init). Description: K-Means needs initial cluster centers, and this parameter determines how those initial centers are placed.

Options: common choices are "k-means++" (the default, which places the initial centers cleverly to speed up convergence) and "random" (initial centers drawn at random from the data).

3. Maximum number of iterations (max_iter). Description: K-Means updates the cluster centers through iterative optimization.

The max_iter parameter sets the maximum number of iterations the algorithm will run.

Tuning: if the algorithm does not converge, try increasing the maximum number of iterations.

4. Convergence tolerance (tol). Description: the algorithm is considered converged when the change in the cluster centers between two consecutive iterations is smaller than tol.

Tuning: if the algorithm converges within very few iterations, tol can be increased moderately to improve efficiency.

5. Random seed (random_state). Description: the seed of the pseudo-random number generator used to initialize the algorithm.

Specifying the same seed makes repeated runs produce identical results.

Tuning: use a fixed seed when debugging or reproducing experiments.

These are usually the main parameters to pay attention to when applying K-Means.

In practice, suitable parameter values should also be chosen according to the characteristics of the data and the needs of the problem.

Typically, one tries different parameter combinations and judges the quality of the clustering with an evaluation metric such as the silhouette coefficient.
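As an illustration (a sketch assuming scikit-learn, whose sklearn.cluster.KMeans exposes exactly the parameters named above; the toy data are made up), trying several values of K and scoring each clustering with the silhouette coefficient:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, init="k-means++", max_iter=300,
                tol=1e-4, random_state=42, n_init=10).fit(X)
    # Higher silhouette means tighter, better-separated clusters.
    print(k, round(silhouette_score(X, km.labels_), 3))
```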

K-means classification

Contents: 1. Introduction; 2. Principle; 3. Steps; 4. Application examples; 5. Strengths and weaknesses.

1. Introduction. K-means classification (K-means clustering) is a partition-based clustering method: it divides a dataset into K distinct clusters so that each data point is as close as possible to the center (mean) of the cluster it belongs to.

The method is widely used in data mining, pattern recognition, image processing, and related fields.

2. Principle. The principle of K-means can be summarized in two steps: (1) Initialization: first choose K data points as the initial cluster centers.

These centers may be chosen at random or by some other method.

(2) Iteration: compute the distance from each data point to every cluster center and assign the point to the nearest cluster.

Then recompute the center of each cluster, repeating the process until the centers no longer change.

3. Steps. In detail: (1) Choose K: determine the number of clusters from the characteristics of the data and the needs of the task.

(2) Initialize the centers: choose K data points, at random or by some other method, as the initial cluster centers.

(3) Compute distances: compute the distance from each data point to every cluster center.

(4) Assign clusters: place each data point in the nearest cluster.

(5) Update centers: recompute the center of each cluster.

(6) Iterate: repeat steps (3) through (5) until the centers no longer change.

4. Application examples. K-means is applied in many fields: in image processing it can divide an image into regions for subsequent processing; in data mining it can segment customers into groups for targeted marketing.

5. Strengths and weaknesses. Strengths: (1) simple and easy to understand; (2) applicable to large-scale datasets; (3) requires no prior knowledge.

The principle of kmeans text clustering

K-means is a commonly used text-clustering algorithm; its principle is to divide samples into different clusters based on their mutual similarity.

For text clustering, the K-means algorithm first needs the texts represented as feature vectors; common encodings include the bag-of-words model and TF-IDF weights.

The algorithm then randomly initializes K cluster centers, assigns each sample to its nearest center, and updates each center to the mean of all samples in its cluster.

This process repeats until the centers stop changing or a preset number of iterations is reached.

The core idea of K-means is to minimize the variance of samples within clusters and maximize the variance between clusters, so that similarity is high within a cluster and low across clusters.

The aim is to gather similar texts into the same cluster while keeping texts in different clusters as dissimilar as possible.

Note that K-means is rather sensitive to the choice of initial cluster centers and may converge to a local optimum.

It is therefore common to run the algorithm several times and keep the best clustering result.

In addition, K-means requires the number of clusters K to be fixed in advance, which usually takes domain knowledge or some heuristic method to determine well.

Overall, K-means clusters text by iteratively updating the cluster centers; its principle is simple and intuitive, and it is easy to implement.

However, choosing the initial cluster centers and determining the number of clusters take some experience and skill.
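A sketch of the pipeline just described (assuming scikit-learn; the four example sentences are placeholders): TF-IDF vectors feed directly into k-means, and n_init reruns the algorithm several times to soften the sensitivity to initial centers noted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the market rallied as stocks rose",
    "investors bought shares after earnings",
    "the team won the championship game",
    "the striker scored twice in the final",
]

# Represent each document as a TF-IDF weighted feature vector.
X = TfidfVectorizer().fit_transform(docs)

# Cluster the sparse vectors into two groups, keeping the best of 10 runs.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```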


IEICE TRANS. INF. & SYST., VOL. E91-D, NO. 1, JANUARY 2008

PAPER

RK-Means Clustering: K-Means with Reliability

Chunsheng HUA, Qian CHEN, Nonmembers, Haiyuan WU, and Toshikazu WADA, Members

(Manuscript received May 2, 2006; revised December 5, 2006. The authors are with the Faculty of Systems Engineering, Wakayama University, Wakayama-shi, 640-8510 Japan. C. Hua is presently with the Institute of Scientific and Industrial Research, Osaka University. E-mail: wuhy@sys.wakayama-u.ac.jp. DOI: 10.1093/ietisy/e91-d.1.96)

SUMMARY: This paper presents an RK-means clustering algorithm, developed for reliable data grouping by introducing a new reliability evaluation into the K-means clustering algorithm. The conventional K-means clustering algorithm has two shortfalls: 1) the clustering result becomes unreliable if the assumed number of clusters is incorrect; 2) during the update of a cluster center, all the data points belonging to that cluster are used equally, without considering how distant they are from the cluster center. In this paper, we introduce a new reliability evaluation into the K-means clustering algorithm by considering the triangular relationship among each data point and its two nearest cluster centers. We applied the proposed algorithm to tracking objects in video sequences and confirmed its effectiveness and advantages.

key words: robust clustering, reliability evaluation, K-means clustering, data classification

1. Introduction

Clustering algorithms can partition a data set into c groups to reveal its natural structure. K-means (KM) [25] is a "hard" algorithm that unequivocally assigns each vector in the data set to one of c subsets, where c is a natural number. Given a data set $X = \{x_1, x_2, \ldots, x_n\}$, the K-means algorithm computes c prototypes $w = (w_c)$ that minimize the average distance between each vector and its closest prototype:

$$E(w) = \sum_{i=1}^{n} \left\| x_i - w_{s_i(w)} \right\|^2, \tag{1}$$

where $s_i(w)$ denotes the subscript of the closest prototype to vector $x_i$. The prototypes can be computed iteratively with the following equation until $w$ reaches a fixed point:

$$w_k = \frac{1}{N_k} \sum_{i:\, k = s_i(w)} x_i. \tag{2}$$

K-means clustering has been widely applied to image segmentation [22]-[24]. Recently, it has also been used for object tracking [17], [18], [20].

Fuzzy and possibilistic clustering generate a membership matrix U whose element $u_{ik}$ indicates the membership of $x_k$ in cluster i. The Fuzzy C-Means algorithm (FCM) [1], [2], [16] is the most popular fuzzy clustering algorithm. It assumes that the number of clusters c is known a priori and minimizes

$$E_{fcm} = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left\| x_k - w_i \right\|^2, \tag{3}$$

subject to the n probabilistic constraints

$$\sum_{i=1}^{c} u_{ik} = 1; \quad k = 1, \ldots, n, \tag{4}$$

where $m > 1$ is the fuzzifier. FCM provides the matrices U and W ($W = (w_1, w_2, \ldots, w_c)$), which indicate the membership of each vector in each cluster and the center of each cluster, respectively. The conditions for a local extremum of Eqs. (3) and (4) are

$$u_{ik} = \left( \sum_{j=1}^{c} \left( \frac{d_{ik}^2}{d_{jk}^2} \right)^{\frac{1}{m-1}} \right)^{-1} \tag{5}$$

and

$$w_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}. \tag{6}$$

In many applications of K-means or FCM clustering, the data set contains noisy data. Using the above clustering methods, every vector in the data set is assigned to one or more clusters, even if it is a noisy one. These clustering methods cannot distinguish the noisy vectors from the remaining vectors in the data set, and those noisy vectors will attract the cluster centers towards them. Therefore, these clustering methods are sensitive to noise. A good K-means (or C-means) clustering method should be robust, so that it can determine good clusters for noisy data sets. Several robust clustering methods have been proposed [3]-[15], and a comprehensive review can be found in [3].

The Noise Cluster Approach [4] introduces an additional cluster, called the noise cluster, to collect the outliers (an outlier is a vector distant from all the clusters). The distance between any vector and the noise cluster is assumed to be the same. This distance can be considered a threshold: if the distance between a vector and all the clusters is longer than this threshold, the vector is attracted to the noise cluster and thus classified as an outlier. When all c clusters have about the same size and the size is known, the noise cluster approach is very effective. When the clusters have different sizes or the size is unknown, this approach will not perform well, because it has only one threshold and there is no method for finding a good threshold value for a given data set.

The Possibilistic C-Means algorithm (PCM) [5], [6] computes a matrix of possibilistic memberships T. The element $t_{ij}$ of T indicates the possibilistic membership of vector $x_j$ in the i-th cluster. Although PCM seems better than FCM and the hard K-means clustering approach, it often finds identical clusters.

Jolion [13] proposed an approach called the Generalized Minimum Volume Ellipsoid method (GMVE). In the GMVE approach, one finds a minimum-volume ellipsoid that covers (at least) h vectors of the data set X. After that, it finds the best cluster by reducing the value of h gradually. This approach is computationally expensive and needs several threshold parameters to be specified. Moreover, when the cluster shapes to be detected cannot be described by ellipsoids, this approach would not work.

Chintalapudi [10] proposed an approach called the Credibility Fuzzy C-Means method (CFCM). It assigns a credibility value to each vector $x_k$ according to the ratio of the distance between it and its closest cluster to the distance between the farthest vector and its closest cluster. If the farthest outlier is much farther than the remaining outliers, this approach will assign high credibility values to most of the outliers. In this case, CFCM would not work well.

2. RK-Means Clustering

2.1 Reliability

Both the K-means and fuzzy C-means (FCM) clustering algorithms (including their extensions) assume that the number c of clusters of a data set is known. In the real world, a data set may contain many noisy vectors, and its number of clusters is often unknown. The noisy vectors and the data vectors that do not belong to any of the assumed clusters are often very distant from all of the c prototypes (and are thus called outliers), so it would not be meaningful to assign them a cluster number in the K-means algorithm, or a high membership value to any of the c clusters in FCM.

We attempt to decrease the outlier sensitivity of K-means clustering by introducing a new variable, reliability, to distinguish an outlier from a non-outlier. Since outliers are distant from all of the c prototypes, the distance between a data vector and its closest prototype should be taken into account in order to tell whether the vector is an outlier. Existing methods such as the noise cluster approach [4] or the credibility fuzzy C-means algorithm use that distance to tell an outlier from a non-outlier. However, the distance between a data vector and its nearest cluster center cannot be a measure of outlierness by itself. To tell whether a data vector is an outlier, one should consider both the distance between the data vector and its nearest cluster center and the structure of the cluster centers, that is, the distance between cluster centers.

[Fig. 1: Introducing the concept of reliability.]

Consider Fig. 1, where $w_1$ and $w_2$ are two cluster centers and $x_1$ and $x_2$ are two data vectors. In the figure, if we only look at the distance from a vector ($x_1$ or $x_2$) to its closest cluster center ($w_1$), we can only say that $x_2$ is closer to $w_1$ than $x_1$ is. However, no conclusion can be drawn from the values $d_{11}$ and $d_{21}$ as to which of the vectors is more like an outlier. By considering the shapes of the triangles $x_1 w_1 w_2$ and $x_2 w_1 w_2$, that is, the relation among the three distances $(d_{11}, d_{12}, d_{w12})$ and the one among $(d_{21}, d_{22}, d_{w12})$, we can say that $x_1$ should be considered an outlier while $x_2$ should not.

In this paper we denote the data set as X, its n vectors as $\{x_k\}_{k=1}^{n}$, and the cluster centers as $w_i$, $i = 1, \ldots, c$. We define the reliability of a data vector $x_k$ as

$$R_k = \frac{\left\| w_{f(x_k)} - w_{s(x_k)} \right\|}{d_{kf} + d_{ks}}, \tag{7}$$

where

$$d_{kf} = \left\| x_k - w_{f(x_k)} \right\|, \quad d_{ks} = \left\| x_k - w_{s(x_k)} \right\|, \tag{8}$$

and $f(x_k)$ and $s(x_k)$ are the subscripts of the closest and the second-closest cluster centers to vector $x_k$:

$$f(x_k) = \operatorname*{argmin}_{i=1,\ldots,c} \left\| x_k - w_i \right\|, \quad s(x_k) = \operatorname*{argmin}_{i=1,\ldots,c,\; i \neq f(x_k)} \left\| x_k - w_i \right\|. \tag{9}$$

The farther $x_k$ is from its nearest cluster center, the lower its reliability. The value of this reliability is not determined by the distance between $x_k$ and its nearest cluster center alone. Instead, the distance between the two nearest cluster centers is used to measure how distant $x_k$ is from both of them. This makes the reliability evaluation more reasonable than using only the distance from a data vector to its nearest cluster center.

2.2 Degree of Reversion

For each data vector, the fuzzy C-means algorithm computes c memberships that indicate the degrees to which the vector belongs to the c clusters. This is computationally expensive, especially for real-time video image processing. In this research, we use a simple method to evaluate the degree to which a vector belongs to its closest cluster, achieving more realistic clustering than "hard" techniques while keeping the computation cost low. The degree $\mu_k$ to which a vector $x_k$ belongs to its closest cluster is computed from its distances to the closest ($d_{kf}$) and second-closest ($d_{ks}$) cluster centers:

$$\mu_k = \frac{d_{ks}}{d_{kf} + d_{ks}}. \tag{10}$$

Since $R_k$ (the reliability of $x_k$) indicates how reliably $x_k$ can be classified, the possibility that $x_k$ belongs to its closest cluster can be computed as the product of $R_k$ and $\mu_k$:

$$t_k = R_k \cdot \mu_k. \tag{11}$$

2.3 RK-Means Clustering Algorithm

The RK-means clustering algorithm partitions X by minimizing the following objective function:

$$J_{rkm}(w) = \sum_{k=1}^{n} t_k \left\| x_k - w_{f(x_k)} \right\|^2. \tag{12}$$

The cluster centers w can be obtained by solving the equation

$$\frac{\partial J_{rkm}(w)}{\partial w} = 0. \tag{13}$$

The existence of a solution to Eq. (13) can be proved easily if the Euclidean distance is assumed. To solve this equation, we first compute an approximate w with the following equation:

$$w_j = \frac{\sum_{k=1}^{n} \delta_j(x_k)\, t_k\, x_k}{\sum_{k=1}^{n} \delta_j(x_k)\, t_k}, \tag{14}$$

where

$$\delta_j(x_k) = \begin{cases} 1 & \text{if } j = f(x_k) \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

Then w can be obtained by applying Newton's algorithm using the result of Eq. (14) as the initial values.

The RK-means clustering algorithm is summarized as follows:

1) Initialization:
   i) Give the number of clusters c.
   ii) Give an initial value to each cluster center $w_i$, $i = 1, \ldots, c$. The initialization can be done by applying the K-means or fuzzy C-means approaches, or performed manually.

2) Iteration: while the $w_i$, $i = 1, \ldots, c$ have not reached fixed points, do:
   i) Calculate $f(x_k)$ and $s(x_k)$ for each $x_k$.
   ii) Update $w_i$, $i = 1, \ldots, c$ by solving Eq. (13).
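The following is a minimal sketch of one iteration of this algorithm (an illustration, not the authors' code; it assumes NumPy, Euclidean distance, and at least two clusters). It implements only the approximate update of Eqs. (14)-(15) and omits the Newton refinement described above:

```python
import numpy as np

def rk_means_step(X, W):
    """One approximate RK-means update (Eqs. (7)-(11), (14)-(15)).

    X: (n, d) data vectors; W: (c, d) current cluster centers, c >= 2.
    Returns the updated centers and the weights t_k.
    """
    # Distances from every vector to every center.
    D = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # shape (n, c)
    order = np.argsort(D, axis=1)
    f, s = order[:, 0], order[:, 1]      # nearest / second-nearest, Eq. (9)
    d_f = D[np.arange(len(X)), f]        # d_kf, Eq. (8)
    d_s = D[np.arange(len(X)), s]        # d_ks, Eq. (8)
    # Reliability, Eq. (7): the distance between the two nearest centers,
    # measured against how far x_k lies from both of them.
    R = np.linalg.norm(W[f] - W[s], axis=1) / (d_f + d_s + 1e-12)
    mu = d_s / (d_f + d_s + 1e-12)       # degree of reversion, Eq. (10)
    t = R * mu                           # weight t_k, Eq. (11)
    # Weighted centroid update, Eqs. (14)-(15): outliers get tiny weights.
    W_new = W.copy()
    for j in range(len(W)):
        mask = f == j
        if t[mask].sum() > 0:
            W_new[j] = (t[mask, None] * X[mask]).sum(axis=0) / t[mask].sum()
    return W_new, t
```

Iterating this step until the centers stop moving reproduces the overall loop summarized above.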
3. Object Tracking Using the RK-Means Algorithm

RK-means can be used to realize robust object tracking in video sequences (see Fig. 2). Because the most important task in object tracking is to classify unknown pixels into the target or background cluster, object tracking can be considered a binary classification. Therefore, the first and second closest clusters in the RK-means clustering algorithm are replaced by the target and background clusters. Since it is more important to tell whether an unknown pixel belongs to the target cluster, the first closest cluster is taken to be the target cluster, and the second closest cluster is naturally the background cluster.

Each pixel within the search area is classified into the target or background clusters with the RK-means clustering algorithm. Noise pixels that belong to neither the target nor the background clusters are given low reliability and ignored. The pixels belonging to the target group are further divided into N target clusters, each describing target pixels of similar color. The pixels belonging to the background clusters are likewise divided into m background clusters.

We use a 5-D uniform feature space to describe image features. In this space, both the color and the position of a pixel are described uniformly by a vector $f = [c\; p]^T$, where $c = [Y\; U\; V]^T$ describes the color and $p = [x\; y]^T$ describes the position on the image plane. In the 5-D feature space, we describe the target centers as $f_T(i) = [c_T(i)\; p_T(i)]^T$ $(i = 1 \sim N)$, where N is the number of target clusters and each target cluster describes target pixels of similar color. The background pixels are represented as $f_B(j) = [c_B(j)\; p_B(j)]^T$ $(j = 1 \sim m)$, where m is the number of background clusters. An unknown pixel is described by $f_u = [c_u\; p_u]^T$. The subscripts of the target and background cluster centers closest to $f_u$ are then

$$s_T(f_u) = \operatorname*{argmin}_{i=1 \sim N} \left\{ \left\| f_T(i) - f_u \right\| \right\}, \quad s_B(f_u) = \operatorname*{argmin}_{j=1 \sim m} \left\{ \left\| f_B(j) - f_u \right\| \right\}. \tag{16}$$

[Fig. 2: Explanation of target clustering with multiple colors.]
[Fig. 3: A comparative result of object tracking with the K-means and RK-means algorithms: (a) input image; (b) K-means; (c) RK-means, where the red area has low reliability.]

Since $f_u$ is to be classified as target or background, only two candidate clusters (the target and background clusters) can be considered to exist during object tracking. One of them is therefore the first-nearest cluster and the other the second-nearest. Because it is more important to check whether $f_u$ is a target pixel, $s_T$ is treated as the first-nearest cluster and $s_B$ as the second-nearest. By applying $s_T$ and $s_B$ to Eqs. (8) and (9), the RK-means algorithm can be used to remove the noise data (as shown in Fig. 3) and to update the target centers $f_T(i)$, $i = 1, \ldots, N$. Because the background clusters are defined and selected from the ellipse contour, and updating them with Eq. (14) would move them off the contour, we update the background clusters (and hence the ellipse contour) by the method described in [17].
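As a small illustration of Eq. (16) (a sketch assuming NumPy; the arrays are placeholders, not from the paper), classifying one unknown pixel in the 5-D feature space $f = [Y\; U\; V\; x\; y]^T$:

```python
import numpy as np

def nearest_target_and_background(f_u, f_T, f_B):
    """Eq. (16): indices of the closest target and background centers.

    f_u: (5,) feature of the unknown pixel; f_T: (N, 5) target centers;
    f_B: (m, 5) background centers.
    """
    s_T = np.argmin(np.linalg.norm(f_T - f_u, axis=1))
    s_B = np.argmin(np.linalg.norm(f_B - f_u, axis=1))
    return s_T, s_B
```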
4. Experiments and Discussion

4.1 Evaluating the Efficiency of RKM

In the remainder of this paper, we abbreviate the hard (conventional) K-means clustering as CKM, fuzzy C-means as FKM, and the reliability K-means as RKM. To evaluate the proposed RKM algorithm, we compare it with CKM and FKM clustering under different conditions; in the following illustrations the large "•" represents the target center.

[Fig. 4: Clustering with separated clusters: (a) CKM; (b) FKM; (c) RKM.]
[Fig. 5: Clustering with completely overlapped clusters: (a) CKM; (b) FKM; (c) RKM.]
[Fig. 6: Clustering with the missing cluster problem: (a) CKM; (b) FKM; (c) RKM.]

In Figs. 4 and 5, the initial value of K is two (one cluster is indicated by red, the other by green), the positions of the initial points are the same, and all algorithms run for the same number of iterations.

In Fig. 4, since the distributions of the two clusters are well separated, all methods give similarly good results.

In Fig. 5, the distributions of the two clusters merge into one. In such a case it is reasonable to consider them as one cluster. The CKM and FKM algorithms still brutally divide it into two separate clusters. With Eq. (7), the proposed RKM gradually moves the initial centers together and gives the result shown in (c). Although our RKM algorithm works better than the CKM and FKM methods here, the whole clustering procedure becomes unreliable: the cluster centers tend to move together, so Eqs. (7) and (11) produce low reliability for each data item. In future work, we consider a merging function necessary for our RKM algorithm.

Figure 6 shows a case where four clusters exist but the initial value of K is three (hereafter we call a phenomenon like this the missing cluster problem). Here, the yellow marks denote the manually defined initial positions of the cluster centers, and the red "•" marks the final clustering result. CKM places only one cluster center correctly and gets the other two completely wrong. FKM is heavily affected by the missing cluster, because it brutally allocates a membership to each input data item. With Eqs. (7) and (11), our RKM correctly places the three initial centers by giving an extremely low reliability to the missing cluster.

In Fig. 7, we show the comparative experimental results of CKM, FKM, and RKM in the case where the given initial number K (K = 4) is larger than that of the real input data (3 clusters). In this image, the pink hollow mark is an initial given point, while the sky-blue solid mark, the red hollow "◦", and the black "•" denote the final results of the CKM, FKM, and RKM clustering algorithms, respectively.

[Fig. 7: Experiment when the given number K is larger than the real number of clusters. Cluster 1: red points; Cluster 2: green points; Cluster 3: blue points. All three clustering algorithms give the same results on Clusters 1 and 3; they reach different conclusions on Cluster 2.]

In this experiment, three of the initial points are correctly located within the corresponding clusters, but one additional initial point is placed to the upper right of cluster 2. In such a case, because both CKM and FKM allocate every data item to one or several clusters without considering whether the clustering of that item is reliable, either of them will assign some data of cluster 2 to that additional initial point. Thus, their clustering results are heavily affected. The RKM algorithm, by contrast, becomes insensitive to the initial given points by evaluating the reliability of clustering data to the additional initial point; finally, the additional cluster center is moved towards the real center of cluster 2. Although RKM's clustering of cluster 2 is not perfect, it gives the best result compared with CKM and FKM.

Table 1: Clustering results of CKM, FKM, and RKM for cluster 2 when the given number K is larger than the real number of clusters. The real center of cluster 2 is located at (230, 80).

|                        | Point 1    | Point 2   |
|------------------------|------------|-----------|
| Initial given position | (285, 140) | (245, 60) |
| Result of CKM          | (240, 75)  | (218, 86) |
| Result of FKM          | (238, 75)  | (220, 86) |
| Result of RKM          | (235, 84)  | (229, 81) |

Table 1 shows that there is only a small difference between the results of CKM and FKM. Both are heavily affected by the additional given initial point (285, 140), which lies outside cluster 2. Although CKM and FKM try to merge the two given centers into one cluster, their performance is not satisfying. In contrast, although the proposed RKM does not fully convert the two initial points into one cluster, it merges them most successfully.

4.2 Convergence of the RKM

One important characteristic of the RKM clustering algorithm is whether it converges. The objective function of RKM is a good index for checking this: if its value remains unchanged (or the changes are small enough to ignore) after a finite number of iterations, the RKM algorithm converges; otherwise it diverges. For this experiment we apply RKM to the IRIS dataset. The IRIS dataset has 150 data points, divided into three groups, two of which overlap; each group contains 50 data points, and each point has four attributes.

[Fig. 8: Convergence of the CKM and RKM algorithms.]

In Fig. 8, because the objective functions of CKM and RKM differ from each other, we cannot say that RKM converges better than CKM, but we can say that RKM converges as well as CKM does. Since RKM adds the reliability estimation to the clustering process, it can converge faster than the CKM algorithm.

4.3 Object Tracking Experiment

4.3.1 Manual Initialization

It is necessary to assign the number of clusters when using the K-means algorithm to classify a data set. In our tracking algorithm, in the first frame we manually select N points on the object to be tracked and use them as the N initial target cluster centers. We let the initial ellipse (search area) be a circle, centered at the centroid of the N initial target cluster centers, and we manually select one point outside the object and let the circle pass through it. In the following frames, the ellipse center is updated according to the result of target detection.

We use m representative background samples selected from the ellipse contour. Theoretically, it is best to use all pixels on the boundary of the search area (the ellipse contour) as representative background samples. However, since there are many pixels on the ellipse contour (several hundred in common cases), the number of clusters would become huge, dramatically reducing the processing speed of pixel classification during tracking. In order to achieve real-time processing speed, we let m be a small number. Through extensive experiments tracking objects of a wide range of classes, we found that m = 9 is a good choice for fast tracking while keeping the tracking performance stable. Of the 9 points, 8 are obtained by dividing the ellipse contour into 8 equal parts, and the 9th is the intersection of the ellipse contour with the line connecting the pixel to be classified and the center of the ellipse.

The target in Fig. 9 is a hand. During tracking, the hand changes freely between the palm and the back. Although the colors of the palm and the back of a hand look similar, they are not exactly the same. At the beginning of this comparative experiment the hand shows the palm, so the initial target colors are those of the palm. Such an object is difficult to track because: 1) many background areas have colors quite similar to the target; 2) the arbitrary deformation of the non-rigid target produces different shapes in the image. To evaluate the performance of CKM and RKM fairly, the two tracking algorithms are tested on the same sequence, given the same initial points, and run for the same number of iterations.

[Fig. 9: Comparative experiment with CKM and RKM under a complex background, showing frames 010, 145, 285, and 305. (a) CKM-based tracking: the missing target center and background colors similar to the target make CKM fail in frame 305. (b) RKM-based tracking.]

As for the CKM-based tracking algorithm [17] in the left column, tracking fails from frame 285 on. In frame 285, the hand turns to show the back, and the intruding background contains colors quite similar to the initial target color (the palm color selected from the first frame). As noted for Fig. 6 (a), the CKM-based tracking algorithm suffers greatly from the missing cluster problem. Since the background interfusion and the missing cluster happen simultaneously, the CKM-based tracking algorithm mistakenly takes the intruding background area with target-like colors as the target area. The search area is therefore wrongly updated, which causes the CKM-based tracking algorithm to fail in frame 305. As for the RKM-based tracking algorithm in the right column, as noted for Fig. 6 (c), it is insensitive to the missing cluster problem because it assigns the missing cluster a low reliability. Therefore, in frame 285, the RKM-based tracking algorithm correctly detects the target area and updates the search area, which ensures robust object tracking.

We also applied the proposed RKM-based tracking algorithm to many other sequences. Figure 10 shows some of the results, which indicate that our tracking algorithm can deal with deformation of non-rigid objects, color variance caused by illumination changes, and occlusion. Moreover, it can also deal with rotation and size changes of the target.

[Fig. 10: Experimental results with the RKM tracking algorithm. The left column shows a case with a non-rigid object; the middle column shows a result with scaling and occlusion; the right column shows an experiment with illumination variance. In Frame 053 the color changed suddenly because of the camera flash; in Frame 358 the target was partially occluded by another pedestrian; in Frame 673 the target was blurred by its high-speed movement.]

All experiments were performed on a desktop PC with a 3.06 GHz Intel XEON CPU, with an image size of 640 x 480 pixels. When the target size varied from 140 x 140 to 200 x 200 pixels, the processing speed of our algorithm was about 9-15 ms/frame.

5. Conclusion

In this paper, we presented a reliability-based K-means clustering algorithm that is insensitive to the initial value of K. By applying the proposed algorithm to object tracking, we realized a robust tracking algorithm. Through comparative experiments with the CKM-based tracking algorithm [17], we confirmed that the proposed RKM-based tracking algorithm works more robustly when the background contains colors similar to the target. Besides object tracking, we also confirmed that the proposed RKM clustering algorithm can be applied to image segmentation.

Acknowledgement

This research is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (A)(2) 16200014, (C) 18500131, and (C) 19500150.

References

[1] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, pp. 32-57, 1973.
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[3] R. N. Dave and R. Krishnapuram, "Robust clustering methods: A unified review," IEEE Trans. Fuzzy Syst., vol. 5, no. 2, pp. 270-293, May 1997.
[4] R. N. Dave, "Characterization and detection of noise," Pattern Recognit. Lett., vol. 12, no. 11, pp. 657-664, 1991.
[5] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Syst., vol. 1, no. 2, pp. 98-110, 1993.
[6] R. Krishnapuram and J. M. Keller, "The possibilistic c-means algorithm: Insights and recommendations," IEEE Trans. Fuzzy Syst., vol. 4, no. 3, pp. 98-110, 1996.
[7] A. Schneider, "Weighted possibilistic c-means clustering algorithms," Ninth IEEE International Conference on Fuzzy Systems, vol. 1, pp. 176-180, May 2000.
[8] N. R. Pal, K. Pal, J. M. Keller, and J. C. Bezdek, "A new hybrid c-means clustering model," IEEE International Conference on Fuzzy Systems, vol. 1, pp. 179-184, July 2004.
[9] N. R. Pal, K. Pal, J. M. Keller, and J. C. Bezdek, "A possibilistic fuzzy c-means clustering algorithm," IEEE Trans. Fuzzy Syst., vol. 13, no. 4, pp. 517-530, Aug. 2005.
[10] K. K. Chintalapudi and M. Kam, "A noise-resistant fuzzy C means algorithm for clustering," 1998 IEEE International Conference on Fuzzy Systems, vol. 2, pp. 1458-1463, May 1998.
[11] J.-L. Chen and J.-H. Wang, "A new robust clustering algorithm: density-weighted fuzzy c-means," IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, pp. 90-94, 1999.
[12] J. M. Leski, "Generalized weighted conditional fuzzy clustering," IEEE Trans. Fuzzy Syst., vol. 11, no. 6, pp. 709-715, 2003.
[13] J. M. Jolion, P. Meer, and S. Bataouche, "Robust clustering with applications in computer vision," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 8, pp. 791-802, Aug. 1991.
[14] M. A. Egan, "Locating clusters in noisy data: A genetic fuzzy c-means clustering algorithm," 1998 Conference of the North American Fuzzy Information Processing Society, pp. 178-182, Aug. 1998.
[15] J.-S. Zhan and Y.-W. Leung, "Robust clustering by pruning outliers," IEEE Trans. Syst. Man Cybern. B, Cybern., vol. 33, no. 6, pp. 983-998, 2003.
[16] J. J. DeGruijter and A. B. McBratney, "A modified fuzzy K-means for predictive classification," in Classification and Related Methods of Data Analysis, ed. H. H. Bock, pp. 97-104, Elsevier Science, Amsterdam, 1988.
[17] C. Hua, H. Wu, T. Wada, and Q. Chen, "K-means tracking with variable ellipse model," IPSJ Trans. CVIM, vol. 46, no. SIG15 (CVIM12), pp. 59-68, 2005.
[18] T. Wada, T. Hamatsuka, and T. Kato, "K-means tracking: A robust target tracking against background involution," MIRU 2004, vol. 2, pp. 7-12, 2004.
[19] H. T. Nguyen and A. Smeulders, "Tracking aspects of the foreground against the background," ECCV, vol. 2, pp. 446-456, 2004.
[20] B. Heisele, U. Kreßel, and W. Ritter, "Tracking non-rigid moving objects based on color cluster flow," CVPR, pp. 253-257, 1997.
[21] J. Hartigan and M. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, vol. 28, pp. 100-108, 1979.
[22] F. Gibou and R. Fedkiw, "A fast hybrid k-means level set algorithm for segmentation," 4th Annual Hawaii International Conference on Statistics and Mathematics, pp. 281-291, 2005.
[23] C. Baba et al., "Multiresolution adaptive K-means algorithm for segmentation of brain MRI," ICSC 1995, pp. 347-354, 1995.
[24] P. K. Singh, "Unsupervised segmentation of medical images using DCT coefficients," VIP 2003, pp. 75-81, 2003.
[25] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, Berkeley, 1967.
