Subspace Clustering: Sparse Subspace Clustering (SSC)

Deep Subspace Clustering Method Based on the Multi-level Structure

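Author: ***. Source: Modern Information Technology (《现代信息科技》), 2022, No. 06. Abstract: A new deep subspace clustering method is proposed that uses a convolutional autoencoder to transform input images into representations lying on linear subspaces.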

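The feature learning process is facilitated by combining the low-order and high-order information extracted by the autoencoder, and multiple sets of self-representations and information representations are generated at different levels of the encoder.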

The resulting multi-level information is fused into a unified coefficient matrix, which is then used for clustering.

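Several groups of experiments verify the effectiveness of these innovations: on three classic datasets, Coil20, ORL, and Extended Yale B, the clustering accuracy reaches 95.38%, 87.25%, and 97.58%, respectively.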

Compared with other mainstream methods, the proposed method effectively improves clustering accuracy and is highly robust.
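Clustering accuracy figures like those quoted above are usually computed after matching predicted cluster ids to ground-truth classes, because cluster ids are arbitrary. The following is a minimal sketch of that metric, assuming the common best-permutation definition; the abstract does not say how its accuracy was computed, and the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

import numpy as np


def clustering_accuracy(true_labels, pred_labels):
    """Best-match accuracy: try every relabeling of the predicted clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids = np.unique(true_labels)
    pred_ids = np.unique(pred_labels)
    # counts[i, j] = number of points with predicted id i and true id j.
    counts = np.array([[np.sum((pred_labels == p) & (true_labels == t))
                        for t in true_ids] for p in pred_ids])
    # Pad to a square table so unequal numbers of clusters are handled.
    k = max(len(true_ids), len(pred_ids))
    padded = np.zeros((k, k), dtype=int)
    padded[:counts.shape[0], :counts.shape[1]] = counts
    # Brute force over all k! matchings (fine for a handful of clusters).
    best = max(sum(padded[i, perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(true_labels)


acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
print(f"accuracy = {acc:.4f}")
```

The brute-force search grows factorially with the number of clusters; for datasets with dozens of classes an assignment solver such as the Hungarian algorithm would be the practical choice.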

Keywords: subspace clustering; multi-level structure; autoencoder. CLC number: TP181. Document code: A. Article ID: 2096-4706(2022)06-0100-04. YU Wanrong (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China).

0 Introduction

High-dimensional data processing has become one of the representative tasks in machine learning and pattern recognition.

Sparse Subspace Clustering

Methods such as Random Sample Consensus (RANSAC) [11] fit a subspace of dimension d to randomly chosen subsets of d points until the number of inliers is large enough. The inliers are then removed, and the process is repeated to find a second subspace, and so on. RANSAC can deal with noise and outliers, and does not need to know the number of subspaces. However, the dimensions of the subspaces must be known and equal, and the number of trials needed to find d points in the same subspace grows exponentially with the number and dimension of the subspaces.

Factorization-based methods [6, 12, 16] find an initial segmentation by thresholding the entries of a similarity matrix built from the factorization of the matrix of data points. Such methods are provably correct when the subspaces are independent, but fail when this assumption is violated. These methods are also sensitive to noise.

Spectral-clustering methods [30, 10, 28] deal with these issues by using local information around each point to build a similarity between pairs of points. The segmentation of the data is then obtained by applying spectral clustering to this similarity matrix. These methods have difficulties dealing with points near the intersection of two subspaces, because the neighborhood of a point can contain points from different subspaces. This issue can be resolved by looking at multi-way similarities that capture the curvature of a collection of points within an affine subspace [5]. However, the complexity of building a multi-way similarity grows exponentially with the number of subspaces and their dimensions.

Algebraic methods, such as Generalized Principal Component Analysis (GPCA) [25, 18], fit the data with a polynomial whose gradient at a point gives a vector normal to the subspace containing that point. Subspace clustering is then equivalent to fitting and differentiating polynomials. GPCA can deal with subspaces of different dimensions, and does not impose any restriction on the relative orientation of the subspaces. However, GPCA is sensitive to noise and outliers, and its complexity increases exponentially with the number of subspaces and their dimensions.

Information-theoretic approaches, such as Agglomerative Lossy Compression (ALC) [17], model each subspace with a degenerate Gaussian, and look for the segmentation of the data that minimizes the coding length needed to fit these points with a mixture of Gaussians. As this minimization problem …
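The RANSAC procedure described in this passage can be illustrated for a single subspace. The sketch below is a simplified variant under stated assumptions: it runs a fixed number of trials and keeps the candidate with the most inliers, instead of stopping once the inlier count is "large enough", and the function name, tolerance, and toy data are all hypothetical.

```python
import numpy as np


def ransac_subspace(X, d, n_trials=200, tol=1e-6, seed=0):
    """Fit one d-dimensional linear subspace to the columns of X by RANSAC.

    Returns (basis, inlier_mask) for the trial with the most inliers.
    """
    rng = np.random.default_rng(seed)
    n_points = X.shape[1]
    best_basis, best_mask = None, np.zeros(n_points, dtype=bool)
    for _ in range(n_trials):
        idx = rng.choice(n_points, size=d, replace=False)
        # Orthonormal basis for the span of the d sampled points.
        Q, _ = np.linalg.qr(X[:, idx])
        # Distance of every point to the candidate subspace.
        residual = np.linalg.norm(X - Q @ (Q.T @ X), axis=0)
        mask = residual < tol
        if mask.sum() > best_mask.sum():
            best_basis, best_mask = Q, mask
    return best_basis, best_mask


# Toy data: 60 noise-free points on a plane in R^3 plus 20 random outliers.
rng = np.random.default_rng(1)
plane = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.hstack([plane @ rng.standard_normal((2, 60)),
               rng.standard_normal((3, 20))])
basis, mask = ransac_subspace(X, d=2)
print("inliers found:", int(mask.sum()))
```

To recover several subspaces, one would remove the detected inliers and call the function again on the remaining points, as the text describes.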

Low-Rank Sparse Subspace Clustering Algorithm for Spatial Sequences

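Authors: You Congzhe (由从哲), Shu Zhenqiu (舒振球), Fan Honghui (范洪辉). Source: Journal of Jiangsu University of Technology (《江苏理工学院学报》), 2020, No. 04. Abstract: This paper studies subspace clustering of sequential data. Specifically, given data drawn from a set of sequential subspaces, the task is to partition the data into distinct, disjoint groups.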

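Representation-based subspace clustering algorithms, such as SSC and LRR, solve the problem of clustering high-dimensional data well. However, these algorithms were developed for general datasets and do not take into account a characteristic of sequential data, namely that samples from adjacent frames are similar to one another.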

To address this problem, a new low-rank sparse spatial subspace clustering method is proposed (Low Rank and Sparse Spatial Subspace Clustering for Sequential Data, LRS3C).

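The algorithm seeks a sparse, low-rank representation of the sequential data matrix and, based on the characteristics of sequential data, introduces a penalty term into the objective function to strengthen the similarity of neighboring data samples.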
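The abstract does not give the exact form of this penalty. One common way to encourage neighboring samples to have similar representations, used here purely as an assumption, is to penalize the differences between consecutive columns of the coefficient matrix Z through a first-difference matrix R. The names `difference_matrix` and `neighbor_penalty` are hypothetical.

```python
import numpy as np


def difference_matrix(n):
    """n x (n-1) matrix R such that column k of Z @ R equals z_{k+1} - z_k."""
    R = np.zeros((n, n - 1))
    for k in range(n - 1):
        R[k, k] = -1.0
        R[k + 1, k] = 1.0
    return R


def neighbor_penalty(Z):
    """Sum of squared differences between coefficient vectors of adjacent samples."""
    R = difference_matrix(Z.shape[1])
    return np.linalg.norm(Z @ R, "fro") ** 2


# Identical neighboring columns cost nothing; abrupt changes are penalized.
Z_smooth = np.ones((3, 4))
Z_jumpy = np.array([[0.0, 1.0, 3.0]])
print(neighbor_penalty(Z_smooth), neighbor_penalty(Z_jumpy))
```

Adding a weighted term of this kind to a sparse or low-rank self-representation objective pushes consecutive frames toward the same cluster; whether LRS3C uses a squared Frobenius norm or a different norm is not stated in the source.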

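The proposed LRS3C algorithm makes full use of the spatio-temporal information in spatial sequence data and improves clustering accuracy.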

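Experiments on synthetic datasets, video sequence datasets, and face image datasets show that the proposed LRS3C method performs better than traditional subspace clustering algorithms.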

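Keywords: low-rank representation; sparse representation; subspace clustering; sequential data. CLC number: TP391.4. Document code: A. Article ID: 2095-7394(2020)04-0078-08.

Sequential data, especially video data, are often high-dimensional, and analyzing them with traditional clustering algorithms often runs into the "curse of dimensionality." Researchers have therefore proposed a series of representation-based subspace clustering algorithms, such as sparse subspace clustering (SSC) and low-rank representation (LRR). These algorithms handle high-dimensional data clustering well, have attracted wide attention, and have been applied successfully in many fields.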

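However, these algorithms were designed for general datasets, whereas in many practical scenarios the data are sequential or ordered, for example video, animation, or other kinds of time-series data.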

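Traditional methods assume that the data points are drawn independently from multiple subspaces and ignore the sequential relationships in time-series data.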

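How to make full use of this property of spatial sequence data to improve clustering performance is an important yet challenging problem in computer vision.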

A Survey on Sparse Subspace Clustering (Wang Weiwei et al.)

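… $W_{1j}, \cdots, W_{Nj}\}$
Trace Lasso: Trace least absolute shrinkage and selection operator
$k_\sigma(x) = \exp(-x^2 / 2\sigma^2)$: Gaussian kernel function with parameter $\sigma$
Wang Weiwei et al.: A Survey on Sparse Subspace Clustering, No. 8, p. 1375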
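First, Section 1 analyzes the basic principles of sparse subspace clustering; next, Section 2 reviews its state of the art in detail; then, Section 3 examines open problems and outlines directions worth further study; finally, Section 4 concludes the paper.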
Recommended by Associate Editor FENG Ju-Fu. 1. School of Mathematics and Statistics, Xidian University, Xi'an 710126
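… a union of low-dimensional subspaces, which gives rise to the subspace segmentation problem [2]. As shown in Figure 1 [3], the given three-dimensional data come from one plane and two lines, that is, the data are intrinsically two-dimensional and one-dimensional, respectively. The properties of the data are revealed more clearly in the low-dimensional subspaces (the plane or the lines) to which they belong, which matters for data clustering, data analysis, data mining, and pattern recognition. The goal of subspace segmentation is to partition high-dimensional data drawn from different subspaces into the low-dimensional subspaces to which they intrinsically belong. Subspace segmentation, also called subspace clustering, is a new approach to clustering high-dimensional data and is widely used in machine learning [4], computer vision [5], image processing [6−7], and system identification [8].

Definition 1 (Subspace clustering, SC). Given a set of data $X = [x_1, x_2, \cdots, x_N] \in \mathbb{R}^{D \times N}$ that belongs to the union of $k$ linear subspaces $\{S_i\}_{i=1}^{k}$ ($k$ known or unknown), subspace clustering means partitioning the data into different classes such that, ideally, each class corresponds to one subspace.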
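The setting of Definition 1 and the Figure 1 example (three-dimensional data drawn from one plane and two lines) can be reproduced with synthetic data. The sketch below is illustrative only: the particular plane, lines, sample sizes, and helper name `sample_subspace` are assumptions, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_subspace(basis, n_points, rng):
    """Draw n_points random points from the span of the columns of basis."""
    coeffs = rng.standard_normal((basis.shape[1], n_points))
    return basis @ coeffs


plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # the x-y plane (2-D)
line1 = np.array([[0.0], [0.0], [1.0]])                  # the z axis (1-D)
line2 = np.array([[1.0], [1.0], [1.0]])                  # a diagonal line (1-D)

groups = [sample_subspace(plane, 50, rng),
          sample_subspace(line1, 20, rng),
          sample_subspace(line2, 20, rng)]
X = np.hstack(groups)                        # D x N data matrix with D = 3, N = 90
labels = np.repeat([0, 1, 2], [50, 20, 20])  # ground-truth subspace of each column

print("ambient rank:", np.linalg.matrix_rank(X))
print("intrinsic ranks:", [int(np.linalg.matrix_rank(g)) for g in groups])
```

The full data matrix spans all of the ambient space, while each group has the rank of its own subspace; a subspace clustering method is asked to recover `labels` from `X` alone.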

Low-Rank Models in Signal and Data Processing: Theory, Algorithms, and Applications

$$\min_{A}\ \operatorname{rank}(A), \quad \text{s.t.} \quad \|\pi_\Omega(D) - \pi_\Omega(A)\|_F^2 \le \varepsilon, \tag{2}$$
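so as to handle the case where the measured data are noisy. When the data contain strong noise, recovering the low-rank structure might seem to be a problem that traditional PCA can solve, but in fact traditional PCA recovers the underlying low-rank structure accurately only when the noise is Gaussian. For non-Gaussian noise, if the noise is strong, even a very small number of corrupted entries can make traditional principal component analysis fail. Because PCA is so important in applications, many researchers have worked hard on improving its robustness and have proposed many allegedly "robust" PCA methods, but none of them was rigorously proven to recover the low-rank structure exactly under well-defined conditions. In 2009, Chandrasekaran et al. [CSPW2009] and Wright et al. [WGRM2009] simultaneously proposed Robust PCA (RPCA). They considered how to recover the low-rank structure of data corrupted by sparse, large-magnitude noise.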
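The claim that classical PCA copes with small Gaussian noise but breaks down under a few large errors can be checked numerically. The following is a small sketch under assumed settings (matrix size, rank, noise levels, and the 5% corruption rate are all illustrative choices); it implements only classical PCA via truncated SVD, not RPCA itself.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 60, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-2 ground truth


def pca_estimate(D, rank):
    """Best rank-`rank` approximation of D in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def relative_error(A_hat, A_true):
    return np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true)


# Case 1: small dense Gaussian noise on every entry.
D_gauss = A + 0.01 * rng.standard_normal((m, n))

# Case 2: large errors on roughly 5% of the entries, all others untouched.
E = np.zeros((m, n))
corrupt = rng.random((m, n)) < 0.05
E[corrupt] = 20.0 * rng.choice([-1.0, 1.0], size=int(corrupt.sum()))
D_sparse = A + E

err_gauss = relative_error(pca_estimate(D_gauss, r), A)
err_sparse = relative_error(pca_estimate(D_sparse, r), A)
print(f"Gaussian noise: {err_gauss:.4f}, sparse gross errors: {err_sparse:.4f}")
```

Under these assumptions the truncated SVD stays close to the true low-rank matrix in the Gaussian case, while the sparse gross errors pull the estimated subspace far away, which is the failure mode RPCA was designed to address.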
b) Multi-subspace model
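RPCA can extract only a single subspace from the data and cannot characterize the fine structure of the data within that subspace. The simplest case of fine structure is the multi-subspace model, in which the data are distributed near several subspaces that we need to find. Ma Yi et al. call this the Generalized PCA (GPCA) problem [VMS2015]. Many algorithms already existed for it, such as algebraic methods and RANSAC, but none came with theoretical guarantees. The emergence of sparse representation offered a new way to approach the problem. In 2009, E. Elhamifar and R. Vidal proposed the Sparse Subspace Clustering (SSC) model [EV2009], which expresses each sample in terms of the other samples and seeks a sparse representation coefficient matrix ((6) with $\operatorname{rank}(Z)$ replaced by $Z$ …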
* This work was supported by the National Natural Science Foundation of China (61272341, 61231002).

CLUSTERING DISJOINT SUBSPACES VIA SPARSE REPRESENTATION

Let $\{S_i\}_{i=1}^{n}$ be an arrangement of $n$ linear subspaces of $\mathbb{R}^D$ of dimensions $\{d_i\}_{i=1}^{n}$. We will distinguish between the following two types of arrangements.
978-1-4244-4296-6/10/$25.00 ©2010 IEEE, ICASSP 2010, p. 1926
$Y = [Y_1, \ldots, Y_n]\,\Gamma$, where $\Gamma \in \mathbb{R}^{N \times N}$ is an unknown permutation matrix. We assume that we do not know a priori the bases for each one of the subspaces nor do we know which data points belong to which subspace. The subspace clustering problem refers to the problem of finding the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data from $Y$.

The sparse subspace clustering (SSC) algorithm (see [7]) addresses the subspace clustering problem using techniques from sparse representation theory. This algorithm is based on the observation that each data point $y \in S_i$ can always be written as a linear combination of all the other data points in $\{S_i\}_{i=1}^{n}$. However, generically, the sparsest representation is obtained when the point $y$ is written as a linear combination of points in its own subspace. In this case, the number of nonzero coefficients corresponds to the dimension of the subspace. It is shown in [7] that when the subspaces are independent and low-dimensional, i.e., $d_i \ll D$, this sparse representation can be obtained by using $\ell_1$ minimization. The segmentation of the data is found by applying spectral clustering to a similarity graph formed using the sparse coefficients. More specifically, the SSC algorithm proceeds as follows.

Algorithm 1: Sparse Subspace Clustering (SSC)
Input: A set of points $\{y_i\}_{i=1}^{N}$ lying in $n$ subspaces $\{S_i\}_{i=1}^{n}$.
1: For every data point $y_i$, solve the following optimization problem: $\min \|c_i\|_1$ …
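The two stages described here (sparse self-representation, then spectral clustering on the resulting similarity graph) can be sketched end to end. This is a minimal illustration under several assumptions, not the algorithm of [7] verbatim: the $\ell_1$ program is replaced by a Lasso-penalized least-squares version solved with proximal gradient descent (ISTA), the spectral step uses a normalized Laplacian with a simple deterministic k-means, and all function names, the value of `lam`, and the toy data (two orthogonal planes in $\mathbb{R}^4$, chosen so the result is easy to verify) are hypothetical.

```python
import numpy as np


def sparse_self_representation(Y, lam=0.01, n_iter=2000):
    """Approximately solve min_C 0.5*||Y - Y C||_F^2 + lam*||C||_1 s.t. diag(C) = 0.

    Columns of Y are data points; solved by proximal gradient descent (ISTA).
    """
    n_points = Y.shape[1]
    G = Y.T @ Y
    step = 1.0 / np.linalg.norm(G, 2)         # 1 / Lipschitz constant of the gradient
    C = np.zeros((n_points, n_points))
    for _ in range(n_iter):
        V = C - step * (G @ C - G)            # gradient step on the quadratic term
        C = np.sign(V) * np.maximum(np.abs(V) - step * lam, 0.0)  # soft threshold
        np.fill_diagonal(C, 0.0)              # a point may not represent itself
    return C


def spectral_clustering(W, k, n_iter=50):
    """Cluster the nodes of a graph with symmetric affinity W into k groups."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :k]                         # eigenvectors of the k smallest eigenvalues
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # k-means with deterministic farthest-point initialisation.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((emb[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = emb[labels == j].mean(axis=0)
    return labels


def ssc(Y, k, lam=0.01):
    """Sparse coefficients -> symmetric similarity graph -> spectral clustering."""
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)   # unit-norm columns
    C = sparse_self_representation(Yn, lam)
    W = np.abs(C) + np.abs(C).T
    return spectral_clustering(W, k)


# Toy data: 8 points in span(e1, e2) and 8 points in span(e3, e4) of R^4.
angles = np.pi * np.arange(8) / 8
circle = np.vstack([np.cos(angles), np.sin(angles)])
Y = np.zeros((4, 16))
Y[:2, :8] = circle
Y[2:, 8:] = 2.0 * circle
labels = ssc(Y, k=2)
print(labels)
```

On this toy input the coefficient matrix is block diagonal, so the similarity graph has one connected component per plane and the spectral step separates them. Real data with noise or non-orthogonal subspaces would need a tuned `lam` and, in practice, a dedicated solver.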

Sparse Subspace Clustering Method Based on Fractional Function Constraints

Abstract: To overcome the shortcoming that the coefficient matrix obtained by existing sparse subspace clustering algorithms cannot accurately reflect the sparsity of the data distribution in high-dimensional space, this paper proposes a sparse subspace clustering model based on a fractional function constraint and solves it with the alternating direction iteration method. In the noise-free case, the coefficient matrix obtained by this method is proved to have a block diagonal structure, which provides a theoretical guarantee that the data structure is captured accurately. In the noisy case, the fractional function constraint is also used as the regularization term for outlier noise, which improves the robustness of the model. Experimental results on artificial data sets, the Extended Yale B database, and the Hopkins155 data set show that the proposed method not only improves clustering accuracy but is also more robust to outlier noise.
Keywords: fractional function; sparse representation; block diagonal structure; subspace clustering; spectral clustering
Document code: A. CLC number: TP391. doi: 10.3778/j.issn.1002-8331.1909-0147

An Improved Sparse Subspace Clustering Algorithm

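… clustering methods. If the data are distributed in a union of low-dimensional linear or affine subspaces, subspace clustering achieves better clustering results than other clustering methods.

Subspace clustering aims to partition high-dimensional data into their underlying subspaces and to be applied in as many fields as possible. It mainly involves obtaining the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data. Existing subspace clustering algorithms fall into four main types: iterative methods, …

Suppose there are $N$ data points of dimension $D$ lying in $n$ linear subspaces $\{S_i\}$ of $\mathbb{R}^D$, with subspace dimensions $\{d_i\}$, …

Journal of Qingdao University (Natural Science Edition), Vol. 27, No. 3, Aug. 2014
Article ID: 1006-1037(2014)03-0044-05
doi: 10.3969/j.issn.1006-1037.2014.08.10
Corresponding author: Zhao Zhigang, male, Ph.D., professor, graduate supervisor; main research interests: machine learning, etc. Email: zhaolhx@263.net
Ouyang Peipei et al.: An Improved Sparse Subspace Clustering Algorithm, No. 3, p. 45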

Research on Clustering Models for Multi-Manifold and Subspace Data

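[Figure 1: panels (a), (b), (c), (d)]

3. Solve the following three subspace clustering problems from practical applications; the data are given in Attachment 3.

(a) Constrained by practical conditions, industrial measurement often requires non-contact methods, and visual reconstruction is an important class of non-contact measurement. Feature extraction is a key step in visual reconstruction, as shown in Figure …

[Figure 2: panels (a), (b)]

(c) The data in 3c.mat are 20 face images of two people under different illumination (each column of the variable X is one face image stretched into a vector). Divide these 20 images into two classes.

4. Answer the following two multi-manifold clustering problems from practical applications.

Figure 3(a) shows the point cloud of a frustum of a cone; separate the points according to the surface on which they lie (that is, divide the frustum into three classes: top, bottom, and lateral surface). Figure 3(b) is an image of the outer edge contour of a machine workpiece; classify the different straight lines and circular arcs in the contour, with the number of classes chosen freely.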
Notation: $\mathrm{assoc}(A, A)$; $q_{ij}$; $w_{ij}$; $p_{ij}$; $\mathrm{Knn}(x)$; $\{M_l\}_{l=1}^{n}$
IV. Problem Analysis
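A preliminary analysis shows that the problem asks us to analyze and process data using geometric-structure methods. High-dimensional data can often be represented in a low-dimensional subspace, and such a low-dimensional representation is very helpful for data processing. Classical subspace clustering methods can represent data accurately in a low-dimensional space. There are many approaches to subspace clustering, including algebraic, iterative, statistical, and spectral-clustering-based methods. They rest on different theoretical foundations and differ considerably in how they are solved. This paper mainly adopts several spectral-clustering-based methods that have become popular in recent years and combines them to obtain satisfactory classification results.

Problem 1: We are asked to divide the data in Attachment 1 into 2 classes. Because the data are sampled from two independent subspaces, the subspace clustering problem is relatively easy. We tried several methods, including K-means clustering, SC, and SSC, and the results show that these methods are reasonable and effective.

Problem 2: For the four subspace clustering and multi-manifold clustering problems in low-dimensional spaces, the structure of the data changes, so the simple classical K-means and SC (spectral clustering) methods can no longer be used. We therefore built SCC, SMCE, and SMMC models for these problems and obtained satisfactory classification results.

Problem 3: This problem analyzes three subspace clustering problems from practical applications. In (a), to locate the center of the cross, the points of the cross can be divided into a horizontal class and a vertical class, which is similar to (a) in Problem 2. In (b), since reference [5] presents the ADMM-based SCC model as an important motion segmentation method, we can …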

Overview of Subspace Clustering

Subspace clustering is a clustering method that discovers hidden low-dimensional subspace structure in high-dimensional data.

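Unlike traditional clustering algorithms, subspace clustering takes into account that data may have different cluster structures in different attribute subspaces.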

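It projects the data onto different subspaces for cluster analysis in order to discover the clustering characteristics of the data in each subspace.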

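Subspace clustering algorithms typically involve the following steps:
1. Subspace selection: choose the attribute subspaces in which to cluster. Suitable subspaces can be chosen by feature selection, principal component analysis, and similar methods.

2. Subspace projection: project the data onto the chosen subspaces to obtain the projection in each subspace.

3. Cluster analysis: apply a traditional clustering algorithm (such as k-means or DBSCAN) in each subspace to obtain the clustering result for that subspace.

4. Fusion of clustering results: merge the clustering results from all subspaces to obtain the final clustering.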
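The four steps above can be sketched as a small pipeline. The version below makes several simplifying assumptions that the text does not prescribe: the subspaces are axis-aligned feature subsets supplied by hand, the per-subspace clusterer is a plain k-means with deterministic initialisation, and the fusion step clusters the rows of a co-association matrix (the fraction of subspaces in which two points share a cluster). All function names are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain k-means on the rows of X with farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def subspace_ensemble(X, feature_subsets, k):
    """Steps 1-4: pick attribute subsets, project, cluster each, fuse the results."""
    n = X.shape[0]
    co_assoc = np.zeros((n, n))
    for cols in feature_subsets:                  # step 1: chosen attribute subspaces
        projected = X[:, cols]                    # step 2: axis-aligned projection
        labels = kmeans(projected, k)             # step 3: cluster in that subspace
        co_assoc += labels[:, None] == labels[None, :]
    co_assoc /= len(feature_subsets)
    # Step 4: points that share a cluster in most subspaces end up together.
    return kmeans(co_assoc, k)


# Toy data: two well-separated groups of 10 points with 4 attributes each.
jitter = 0.1 * np.sin(np.arange(80)).reshape(20, 4)
X = np.vstack([np.zeros((10, 4)), 10.0 * np.ones((10, 4))]) + jitter
labels = subspace_ensemble(X, feature_subsets=[[0, 1], [2, 3]], k=2)
print(labels)
```

The co-association matrix avoids having to align cluster ids across subspaces, which is why it is used for the fusion step here; other consensus schemes would serve equally well.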

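The advantage of subspace clustering is that it can handle the low-dimensional subspace structure present in high-dimensional data and can better uncover the latent patterns and associations in the data.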

It is applicable to many fields, such as image processing, text mining, and bioinformatics.

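However, subspace clustering also faces challenges, such as choosing suitable subspaces and handling noise and outliers, so the algorithm and its parameters must be chosen and tuned for the specific application.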