An Improved K-means Clustering Algorithm and Research on Outlier Detection


Authors: Yin Minjie (尹敏杰), Dong Chunzhao (东春昭)

Source: Computer Knowledge and Technology (《电脑知识与技术》), 2010, No. 21

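Abstract: The traditional K-means algorithm is very sensitive to outliers; even a small number of such points can strongly affect the clustering result. This paper proposes an improved K-means algorithm to weaken this sensitivity. Building on the K-distance computation used in the LOF outlier detection algorithm, the algorithm takes data points that lie beyond the K-distance as pseudo cluster centers that participate in the partitioning, and then evaluates the clustering result to decide whether each such point is an outlier. If it is, the point is removed, which improves clustering quality.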

Keywords: K-means; K-distance; outlier; pseudo cluster center

CLC number: TP311    Document code: A    Article ID: 1009-3044(2010)21-6085-02

A Modified K-means Clustering Algorithm and Research on Outlier Detection

YING Min-jie, DONG Chun-zhao

(Southwest Jiaotong University, Chengdu 610031, China)

Abstract: The classical K-means algorithm is very sensitive to outliers; a small amount of such data can have a great impact on the clustering results. In this paper, a modified K-means algorithm is put forward to weaken this sensitivity. The algorithm is based on the K-distance idea from the LOF outlier detection algorithm: data points that lie beyond the K-distance are treated as pseudo-centers that take part in the partitioning, and the clustering result is then evaluated to determine whether each such point is an outlier. If so, the point is removed in order to improve the quality of clustering.

Key words: k-means; k-distance; outlier data point; pseudo-center

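Clustering groups a set of individuals into several classes according to their similarity, so that the distance between individuals in the same class is as small as possible and the distance between individuals in different classes is as large as possible. As an important data mining technique, clustering plays an important role in many fields, including pattern recognition, data analysis, and market research. The main clustering algorithms in current use [1-2] include the partition-based K-means and K-medoids algorithms, the density-based DBSCAN and OPTICS methods, and the grid-based CLIQUE and STING methods. This paper focuses on the K-means algorithm and proposes an improved algorithm that addresses its sensitivity to outliers. The improved K-means algorithm effectively weakens the influence of outliers and markedly improves clustering quality.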
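The outlier sensitivity noted above can be illustrated with a minimal sketch. It is not the paper's algorithm; it only shows why a mean-based cluster center is fragile. The sample points and the helper name `centroid` are hypothetical, chosen for illustration.

```python
from statistics import fmean


def centroid(points):
    """Return the coordinate-wise mean of a list of 2-D points."""
    xs, ys = zip(*points)
    return (fmean(xs), fmean(ys))


# A tight, hypothetical cluster centered on (1, 1).
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

# The same cluster plus one distant point acting as an outlier.
with_outlier = cluster + [(101.0, 101.0)]

clean_center = centroid(cluster)
shifted_center = centroid(with_outlier)

print(clean_center)    # (1.0, 1.0)
print(shifted_center)  # (21.0, 21.0)
```

A single distant point moves the center from (1, 1) to (21, 21), far from every genuine member of the cluster. Because K-means recomputes each center as the mean of its assigned points, one such point can distort the final partition.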

1 Study of the K-means Algorithm

1.1 The K-means Algorithm [1,6]