Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213, USA; Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, UK.

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.

1. Introduction

In many traditional approaches to machine learning, a target function is estimated using labeled data, which can be thought of as examples given by a "teacher" to a "student." Labeled examples are often, however, very time consuming and expensive to obtain, as they require the efforts of human annotators, who must often be quite skilled. For instance, obtaining a single labeled example for protein shape classification, which is one of the grand challenges of biological and computational science, requires months of expensive analysis by expert crystallographers. The problem of effectively combining unlabeled data with labeled data is therefore of central importance in machine learning.

The semi-supervised learning problem has attracted an increasing amount of interest recently, and several novel approaches have been proposed; we refer to (Seeger, 2001) for an overview. Among these methods is a promising family of techniques that exploit the "manifold structure" of the data; such methods are generally based upon the assumption that similar unlabeled examples should be given the same classification. In this paper we introduce a new approach to semi-supervised learning that is based on a random field model defined on a weighted graph over the unlabeled and labeled data, where the weights are given in terms of a similarity function between instances.

Unlike other recent work based on energy minimization and random fields in machine learning (Blum & Chawla, 2001) and image processing (Boykov et al., 2001), we adopt Gaussian fields over a continuous state space rather than random fields over the discrete label set. This "relaxation" to a continuous rather than discrete sample space results in many attractive properties. In particular, the most probable configuration of the field is unique, is characterized in terms of harmonic functions, and has a closed form solution that can be computed using matrix methods or loopy belief propagation (Weiss et al., 2001). In contrast, for multi-label discrete random fields, computing the lowest energy configuration is typically NP-hard, and approximation algorithms or other heuristics must be used (Boykov et al., 2001). The resulting classification algorithms for Gaussian fields can be viewed as a form of nearest neighbor approach, where the nearest labeled examples are computed in terms of a random walk on the graph.
The learning methods introduced here have intimate connections with random walks, electric networks, and spectral graph theory, in particular heat kernels and normalized cuts.

In our basic approach the solution is solely based on the structure of the data manifold, which is derived from data features. In practice, however, this derived manifold structure may be insufficient for accurate classification. We show how the extra evidence of class priors can help classification in Section 4. Alternatively, we may combine external classifiers using vertex weights or "assignment costs," as described in Section 5. Encouraging experimental results for synthetic data, digit classification, and text classification tasks are presented in Section 7. One difficulty with the random field approach is that the right choice of graph is often not entirely clear, and it may be desirable to learn it from data. In Section 6 we propose a method for learning these weights by entropy minimization, and show the algorithm's ability to perform feature selection to better characterize the data manifold.

[Figure 1. The random fields used in this work are constructed on labeled and unlabeled examples. We form a graph with weighted edges between instances (in this case scanned digits), with labeled data items appearing as special "boundary" points, and unlabeled points as "interior" points. We consider Gaussian random fields on this graph.]

2. Basic Framework

We suppose there are l labeled points (x_1, y_1), ..., (x_l, y_l), and u unlabeled points x_{l+1}, ..., x_{l+u}; typically l << u. Let n = l + u be the total number of data points. To begin, we assume the labels are binary: y in {0, 1}. Consider a connected graph G = (V, E) with nodes corresponding to the n data points: nodes L = {1, ..., l} correspond to the labeled points with labels y_1, ..., y_l, and nodes U = {l+1, ..., l+u} correspond to the unlabeled points. Our task is to assign labels to the nodes U. We assume an n x n symmetric weight matrix W on the edges of the graph is given. For example, when x is in R^m, the weight matrix can be

    w_ij = exp( - Σ_{d=1}^{m} (x_id - x_jd)^2 / σ_d^2 )    (1)

where x_id is the d-th component of instance x_i and σ_1, ..., σ_m are length scale hyperparameters for each dimension.

We constrain a real-valued function f : V -> R to take the values f(i) = y_i on the labeled data, and we want nearby points in the graph to have similar values, which motivates the quadratic energy function

    E(f) = (1/2) Σ_{i,j} w_ij (f(i) - f(j))^2    (2)

To assign a probability distribution on functions f, we form the Gaussian field p(f) proportional to exp(-βE(f)), where β is an inverse temperature parameter. The minimum energy function, subject to the constraint on the labeled data, is harmonic: it satisfies Δf = 0 on the unlabeled points, where Δ = D - W is the combinatorial Laplacian and D = diag(d_i) with d_i = Σ_j w_ij. The harmonic property means that the value of f at each unlabeled data point is the average of f at the neighboring points:

    f(j) = (1/d_j) Σ_{i~j} w_ij f(i),  for j = l+1, ..., l+u    (3)

which is consistent with our prior notion of smoothness of f with respect to the graph. Expressed slightly differently, f = Pf, where P = D^{-1}W. Because of the maximum principle of harmonic functions (Doyle & Snell, 1984), f is unique and is either a constant or it satisfies 0 < f(j) < 1 for j in U.

To compute the harmonic solution explicitly in terms of matrix operations, we split the weight matrix W (and similarly D and P) into 4 blocks after the l-th row and column:

    W = [ W_ll  W_lu ; W_ul  W_uu ]    (4)

Letting f = (f_l; f_u), where f_u denotes the values on the unlabeled data points, the harmonic solution subject to f = f_l on the labeled data is given by

    f_u = (D_uu - W_uu)^{-1} W_ul f_l = (I - P_uu)^{-1} P_ul f_l    (5)

[Figure 2. Demonstration of harmonic energy minimization on two synthetic datasets. Large symbols indicate labeled data, other points are unlabeled.]

In this paper we focus on the above harmonic function as a basis for semi-supervised classification. However, we emphasize that the Gaussian random field model from which this function is derived provides the learning framework with a consistent probabilistic semantics. In the following, we refer to the procedure described above as harmonic energy minimization, to underscore the harmonic property (3) as well as the objective function being minimized. Figure 2 demonstrates the use of harmonic energy minimization on two synthetic datasets: the left figure shows data in three bands, the right figure shows two spirals. Here we see that harmonic energy minimization clearly follows the structure of the data, while methods such as kNN would obviously fail to do so.
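To make the matrix form of the solution concrete, the following is a minimal Python/NumPy sketch of harmonic energy minimization. This is our illustration rather than the authors' code; the fully connected Gaussian graph, the single length scale sigma, and the toy data are all assumptions made for the example.

```python
import numpy as np

def harmonic_solution(X, y_l, l, sigma=1.0):
    """Harmonic energy minimization: f_u = (D_uu - W_uu)^{-1} W_ul f_l.

    X     : (n, m) array with the l labeled points in the first rows.
    y_l   : (l,) array of 0/1 labels for the labeled points.
    sigma : length scale of the Gaussian edge weights, eq. (1).
    """
    # Fully connected graph with Gaussian weights, eq. (1).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)

    D = np.diag(W.sum(1))
    L = D - W                        # combinatorial Laplacian, Delta = D - W

    # Block split after the l-th row/column, eq. (4).
    L_uu = L[l:, l:]
    W_ul = W[l:, :l]

    # Harmonic solution on the unlabeled nodes, eq. (5).
    return np.linalg.solve(L_uu, W_ul @ y_l)

# Toy example: two labeled points and three unlabeled ones.
X = np.array([[0.0, 0.0], [3.0, 3.0], [0.2, 0.1], [2.8, 3.1], [1.5, 1.5]])
y_l = np.array([0.0, 1.0])
print(harmonic_solution(X, y_l, l=2, sigma=1.0))
```

On this toy data the two unlabeled points near the labeled examples receive values close to 0 and 1, and the midpoint receives an intermediate value, as the maximum principle predicts.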
3. Interpretation and Connections

As outlined briefly in this section, the basic framework presented in the previous section can be viewed in several fundamentally different ways, and these different viewpoints provide a rich and complementary set of techniques for reasoning about this approach to the semi-supervised learning problem.

3.1. Random Walks and Electric Networks

Imagine a particle walking along the graph G. Starting from an unlabeled node i, it moves to a node j with probability P_ij after one step. The walk continues until the particle hits a labeled node. Then f(i) is the probability that the particle, starting from node i, hits a labeled node with label 1. Here the labeled data is viewed as an "absorbing boundary" for the random walk. This view of the harmonic solution indicates that it is closely related to the random walk approach of Szummer and Jaakkola (2001); however, there are two major differences. First, we fix the value of f on the labeled points, and second, our solution is an equilibrium state, expressed in terms of a hitting time, while in (Szummer & Jaakkola, 2001) the walk crucially depends on the time parameter. We will return to this point when discussing heat kernels.

An electrical network interpretation is given in (Doyle & Snell, 1984). Imagine the edges of G to be resistors with conductances W. We connect nodes labeled 1 to a positive voltage source, and points labeled 0 to ground. Then f is the voltage in the resulting electric network on each of the unlabeled nodes. Furthermore, f minimizes the energy dissipation of the electric network for the given labeled values. The harmonic property here follows from Kirchhoff's and Ohm's laws, and the maximum principle then shows that this is precisely the same solution obtained in (5).

3.2. Graph Kernels

The solution f can be viewed from the viewpoint of spectral graph theory. The heat kernel with time parameter t on the graph G is defined as K_t = exp(-tΔ). Here K_t(i, j) is the solution to the heat equation on the graph with initial conditions being a point source at i at time t = 0. Kondor and Lafferty (2002) propose this as an appropriate kernel for machine learning with categorical data. When used in a kernel method such as a support vector machine, the kernel classifier can be viewed as a solution to the heat equation with initial heat sources on the labeled data. The time parameter t must, however, be chosen using an auxiliary technique, for example cross-validation.

Our algorithm uses a different approach which is independent of t, the diffusion time. Let Δ_uu be the lower right u x u submatrix of Δ. Since Δ_uu = D_uu - W_uu, it is the Laplacian restricted to the unlabeled nodes in G. The heat kernel exp(-tΔ_uu) on this submatrix describes heat diffusion on the unlabeled subgraph with Dirichlet boundary conditions on the labeled nodes. The Green's function G is the inverse operator of the restricted Laplacian, which can be expressed in terms of the integral over time of the heat kernel:

    G = Δ_uu^{-1} = ∫_0^∞ exp(-tΔ_uu) dt    (6)

The harmonic solution (5) can then be written as

    f_u = G W_ul f_l    (7)

Expression (7) shows that this approach can be viewed as a kernel classifier with the kernel G and a specific form of kernel machine. (See also (Chung & Yau, 2000), where a normalized Laplacian is used instead of the combinatorial Laplacian.) From (6) we also see that the spectrum of G is {1/λ_i}, where {λ_i} is the spectrum of Δ_uu. This indicates a connection to the work of Chapelle et al. (2002), who manipulate the eigenvalues of the Laplacian to create various kernels. A related approach is given by Belkin and Niyogi (2002), who propose to regularize functions on G by selecting the top normalized eigenvectors of Δ corresponding to the smallest eigenvalues, thus obtaining the best fit to the labels in the least squares sense. We remark that our f fits the labeled data exactly, while the low-order approximation may not.
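The kernel view can be checked numerically against (5)-(7). The small sketch below reuses the toy data X and labels y_l from the previous block and is again our own illustration, not part of the paper.

```python
import numpy as np

# Rebuild the graph quantities for the toy data of the previous sketch.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 1.0**2)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(1)) - W
L_uu, W_ul = L[2:, 2:], W[2:, :2]      # l = 2 labeled points

G = np.linalg.inv(L_uu)                # Green's function G = Delta_uu^{-1}, eq. (6)
f_u = G @ W_ul @ y_l                   # kernel form of the harmonic solution, eq. (7)
print(np.allclose(f_u, np.linalg.solve(L_uu, W_ul @ y_l)))  # matches eq. (5)

# The spectrum of G is {1/lambda_i}, where {lambda_i} is the spectrum of L_uu.
lam = np.linalg.eigvalsh(L_uu)
print(np.allclose(np.linalg.eigvalsh(G), np.sort(1.0 / lam)))
```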
3.3. Spectral Clustering and Graph Mincuts

The normalized cut approach of Shi and Malik (2000) has as its objective function the minimization of the Rayleigh quotient

    R(f) = (f' Δ f) / (f' D f)    (8)

subject to the constraint f' D 1 = 0. The solution is the second smallest eigenvector of the generalized eigenvalue problem Δf = λDf. Yu and Shi (2001) add a grouping bias to the normalized cut to specify which points should be in the same group. Since labeled data can be encoded into such pairwise grouping constraints, this technique can be applied to semi-supervised learning as well. In general, when W is close to block diagonal, it can be shown that data points are tightly clustered in the eigenspace spanned by the first few eigenvectors of Δ (Ng et al., 2001a; Meila & Shi, 2001), leading to various spectral clustering algorithms.

Perhaps the most interesting and substantial connection to the methods we propose here is the graph mincut approach proposed by Blum and Chawla (2001). The starting point for this work is also a weighted graph, but the semi-supervised learning problem is cast as one of finding a minimum cut, where negative labeled data is connected (with large weight) to a special source node, and positive labeled data is connected to a special sink node. A minimum cut, which is not necessarily unique, minimizes (up to a constant factor) the objective function Σ_{i,j} w_ij |f(i) - f(j)| over binary-valued f, the L1 analogue of the quadratic energy (2) minimized by the harmonic solution.

4. Incorporating Class Prior Knowledge

To go from the harmonic function f to labels, the obvious decision rule is to assign label 1 to node i if f(i) > 1/2, and label 0 otherwise. We call this rule the harmonic threshold (abbreviated "thresh" below). In terms of the random walk interpretation, if f(i) > 1/2 then a random walk starting at node i is more likely to reach a labeled node with label 1 before one with label 0, so thresholding at 1/2 makes sense when the classes are well separated. On real datasets, however, f can produce severely unbalanced classifications, and prior knowledge of the class proportions helps: with class mass normalization (CMN), the mass of each class is scaled so that the predicted class proportions match the prior, and each point is classified according to the larger adjusted score.

5. Incorporating External Classifiers

The predictions of an external classifier on the unlabeled data can be combined with the harmonic solution by a simple modification of the graph. For each unlabeled node i we attach a "dongle" node, a labeled node whose value is the external classifier's output for i, and let the transition probability from i to its dongle be η, discounting all other transitions from i by 1 - η. Harmonic energy minimization on this augmented graph gives the combined solution (10). This assumes the labels on the original labeled nodes are correct; if there is reason to doubt this assumption, it would be reasonable to attach dongles to labeled nodes as well, and to move the labels to these new nodes.

6. Learning the Weight Matrix

Previously we assumed that the weight matrix W is given and fixed. In this section, we investigate learning weight functions of the form given by equation (1). We will learn the σ_d's from both labeled and unlabeled data; this will be shown to be useful as a feature selection mechanism which better aligns the graph structure with the data. The usual parameter learning criterion is to maximize the likelihood of labeled data. However, the likelihood criterion is not appropriate in this case because the values of f on the labeled data are fixed during training, and moreover likelihood doesn't make sense for the unlabeled data because we do not have a generative model. We propose instead to use average label entropy as a heuristic criterion for parameter learning. The average label entropy of the field f is defined as

    H(f) = (1/u) Σ_{i=l+1}^{l+u} H_i(f(i))    (13)

where H_i(f(i)) = -f(i) log f(i) - (1 - f(i)) log(1 - f(i)) is the entropy of the field at the individual unlabeled point i. The gradient with respect to each σ_d follows from the chain rule, using the fact that f_u = (D_uu - W_uu)^{-1} W_ul f_l; both ∂W_uu/∂σ_d and ∂W_ul/∂σ_d are submatrices of ∂W/∂σ_d.

In the above derivation we use f as label probabilities directly; that is, P(class of i = 1) = f(i). If we incorporate class prior information, or combine harmonic energy minimization with other classifiers, it makes sense to minimize entropy on the combined probabilities. For instance, if we incorporate a class prior using CMN, the probability is given by (15).

[Figure 3. Harmonic energy minimization on digits "1" vs. "2" (left), on all 10 digits (middle), and combining the voted-perceptron with harmonic energy minimization on odd vs. even digits (right); each panel plots accuracy against labeled set size.]

[Figure 4. Harmonic energy minimization on PC vs. MAC (left), baseball vs. hockey (middle), and MS-Windows vs. MAC (right).]

7. Experimental Results

Each reported accuracy is the average of 10 trials. In each trial we randomly sample labeled data from the entire dataset, and use the rest of the images as unlabeled data. If any class is absent from the sampled labeled set, we redo the sampling. For methods that incorporate class priors, we estimate the priors from the labeled set with Laplace ("add one") smoothing.
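The two decision rules compared in the experiments below, the harmonic threshold and class mass normalization, amount to a few lines of code. This is a minimal sketch of our own, with an arbitrary prior q:

```python
import numpy as np

def thresh(f_u):
    """Harmonic threshold: label 1 iff f(i) > 1/2."""
    return (f_u > 0.5).astype(int)

def cmn(f_u, q):
    """Class mass normalization: rescale the two class masses so that
    the predicted proportions match the prior q = P(class 1), then
    compare the adjusted scores."""
    mass1 = f_u / f_u.sum()
    mass0 = (1 - f_u) / (1 - f_u).sum()
    return (q * mass1 > (1 - q) * mass0).astype(int)

f_u = np.array([0.9, 0.8, 0.75, 0.7, 0.6])   # skewed toward class 1
print(thresh(f_u))        # all ones: an unbalanced classification
print(cmn(f_u, q=0.4))    # the prior pulls some points back to class 0
```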
We consider the binary problem of classifying digits "1" vs. "2", with 1100 images in each class. We report average accuracy of the following methods on unlabeled data: thresh, CMN, 1NN, and a radial basis function classifier (RBF); RBF and 1NN are used simply as baselines. The results are shown in Figure 3. Clearly thresh performs poorly, because the values of f_u are generally close to 1, so the majority of examples are classified as digit "1". This shows the inadequacy of the weight function (1) based on pixel-wise Euclidean distance. However, the relative rankings of f_u are useful, and when coupled with class prior information significantly improved accuracy is obtained. The greatest improvement is achieved by the simple method CMN. We could also have adjusted the decision threshold on thresh's solution, so that the class proportion fits the prior. This method is inferior to CMN due to the error in estimating the proportions, and it is not shown in the plot. These same observations are also true for the experiments we performed on several other binary digit classification problems.

We also consider the 10-way problem of classifying digits "0" through "9". We report the results on a dataset with intentionally unbalanced class sizes, with 455, 213, 129, 100, 754, 970, 275, 585, 166, 353 examples per class, respectively (noting that the results on a balanced dataset are similar). We report the average accuracy of thresh, CMN, RBF, and 1NN. These methods can handle multi-way classification directly, or with slight modification in a one-against-all fashion. As the results in Figure 3 show, CMN again improves performance by incorporating class priors.

Next we report the results of document categorization experiments using the 20 newsgroups dataset. We pick three binary problems: PC (number of documents: 982) vs. MAC (961), MS-Windows (958) vs. MAC, and baseball (994) vs. hockey (999). Each document is minimally processed into a "tf.idf" vector, without applying header removal, frequency cutoff, stemming, or a stopword list. Two documents u and v are connected by an edge if u is among v's 10 nearest neighbors or if v is among u's 10 nearest neighbors, as measured by cosine similarity; the weight function (16) on the edges is a function of this cosine similarity.

We use one-nearest neighbor and the voted perceptron algorithm (Freund & Schapire, 1999) (10 epochs with a linear kernel) as baselines; our results with support vector machines are comparable. The results are shown in Figure 4. As before, each point is the average of 10 random trials. For this data, harmonic energy minimization performs much better than the baselines. The improvement from the class prior, however, is less significant. An explanation for why this approach to semi-supervised learning is so effective on the newsgroups data may lie in the common use of quotations within a topic thread: one document quotes part of another, which quotes part of a third, and so on. Thus, although documents far apart in the thread may be quite different, they are linked by edges in the graphical representation of the data, and these links are exploited by the learning algorithm.

7.1. Incorporating External Classifiers

We use the voted-perceptron as our external classifier. For each random trial, we train a voted-perceptron on the labeled set, and apply it to the unlabeled set. We then use the 0/1 hard labels for dongle values, and perform harmonic energy minimization with (10).

We evaluate on the artificial but difficult binary problem of classifying odd digits vs. even digits; that is, we group "1, 3, 5, 7, 9" and "2, 4, 6, 8, 0" into two classes. There are 400 images per digit.
We use a second order polynomial kernel in the voted-perceptron, and train for 10 epochs. Figure 3 shows the results. The accuracy of the voted-perceptron on unlabeled data, averaged over trials, is marked VP in the plot. Independently, we run thresh and CMN. Next we combine thresh with the voted-perceptron, and the result is marked thresh+VP. Finally, we perform class mass normalization on the combined result and get CMN+VP. The combination results in higher accuracy than either method alone, suggesting there is complementary information used by each.

7.2. Learning the Weight Matrix

To demonstrate the effects of estimating the σ_d's, results on a toy dataset are shown in Figure 5. The upper grid is slightly tighter than the lower grid, and they are connected by a few data points. There are two labeled examples, marked with large symbols. We learn the optimal length scales for this dataset by minimizing entropy on unlabeled data. To simplify the problem, we first tie the length scales in the two dimensions, so there is only a single parameter σ to learn. As noted earlier, without smoothing, the entropy approaches its minimum at 0 as σ approaches 0. Under such conditions, the results of harmonic energy minimization are usually undesirable, and for this dataset the tighter grid "invades" the sparser one, as shown in Figure 5(a). With smoothing, the "nuisance minimum" at 0 gradually disappears as the smoothing factor grows, as shown in Figure 5(c). With smoothing, the minimum entropy is 0.898 bits. Harmonic energy minimization under this length scale is shown in Figure 5(b), which is able to distinguish the structure of the two grids.

If we allow a separate σ_d for each dimension, parameter learning is more dramatic. With the same smoothing, one length scale keeps growing towards infinity (a large fixed value is used for computation) while the other stabilizes at 0.65, and we reach a minimum entropy of 0.619 bits. In this case the diverging length scale is legitimate; it means that the learning algorithm has identified that direction as irrelevant, based on both the labeled and unlabeled data. Harmonic energy minimization under these parameters gives the same classification as shown in Figure 5(b).

[Figure 5. The effect of the parameter σ on harmonic energy minimization. (a) If unsmoothed, the entropy is minimized as σ approaches 0, and the algorithm performs poorly. (b) Result at the optimal σ, with smoothing. (c) Smoothing helps to remove the entropy minimum at 0.]

Next we learn σ_d's for all 256 dimensions on the "1" vs. "2" digits dataset. For this problem we minimize the entropy with CMN probabilities (15). We randomly pick a split of 92 labeled and 2108 unlabeled examples, and start with all dimensions sharing the same σ as in the previous experiments. Then we compute the derivatives of H for each dimension separately, and perform gradient descent to minimize the entropy. The result is shown in Table 1. As entropy decreases, the accuracies of CMN and thresh both increase. The learned σ_d's, shown in the rightmost plot of Figure 6, range from 181 (black) to 465 (white). A small σ_d (black) indicates that the weight is more sensitive to variations in that dimension, while the opposite is true for large σ_d (white). We can discern the shapes of a black "1" and a white "2" in this figure; that is, the learned parameters exaggerate variations within class "1" while suppressing variations within class "2". We have observed that with the default parameters, class "1" has much less variation than class "2"; thus, the learned parameters are, in effect, compensating for the relative tightness of the two classes in feature space.

[Table 1. Entropy of CMN and accuracies before and after learning the σ_d's on the "1" vs. "2" dataset.]

[Figure 6. Learned σ_d's for the "1" vs. "2" dataset. From left to right: average "1", average "2", initial σ_d's, learned σ_d's.]
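The entropy criterion of Section 6 is straightforward to prototype. The sketch below, our own construction reusing X and y_l from the first code block, smooths the transition matrix as eps*U + (1-eps)*P and minimizes the average label entropy over a single shared sigma with a numerical gradient, standing in for the analytical gradient used in the paper.

```python
import numpy as np

def smoothed_f_u(X, y_l, l, sigma, eps=0.01):
    """Harmonic solution with the smoothed transition matrix
    eps*U + (1-eps)*P, where U is the uniform matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(1, keepdims=True)
    n = len(X)
    P = eps / n + (1 - eps) * P            # rows still sum to one
    P_uu, P_ul = P[l:, l:], P[l:, :l]
    return np.linalg.solve(np.eye(n - l) - P_uu, P_ul @ y_l)

def avg_entropy(f_u):
    f = np.clip(f_u, 1e-12, 1 - 1e-12)
    return np.mean(-f * np.log(f) - (1 - f) * np.log(1 - f))

# Minimize H(sigma) by a crude centered-difference gradient descent.
sigma, lr, h = 1.0, 0.1, 1e-4
for _ in range(100):
    g = (avg_entropy(smoothed_f_u(X, y_l, 2, sigma + h))
         - avg_entropy(smoothed_f_u(X, y_l, 2, sigma - h))) / (2 * h)
    sigma -= lr * g
print("learned sigma:", round(sigma, 3))
```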
8. Conclusion

We have introduced an approach to semi-supervised learning based on a Gaussian random field model defined with respect to a weighted graph representing labeled and unlabeled data. Promising experimental results have been presented for text and digit classification, demonstrating that the framework has the potential to effectively exploit the structure of unlabeled data to improve classification accuracy. The underlying random field gives a coherent probabilistic semantics to our approach, but this paper has concentrated on the use of only the mean of the field, which is characterized in terms of harmonic functions and spectral graph theory. The fully probabilistic framework is closely related to Gaussian process classification, and this connection suggests principled ways of incorporating class priors and learning hyperparameters; in particular, it is natural to apply evidence maximization or the generalization error bounds that have been studied for Gaussian processes (Seeger, 2002). Our work in this direction will be reported in a future publication.

References

Belkin, M., & Niyogi, P. (2002). Using manifold structure for partially labelled classification. Advances in Neural Information Processing Systems, 15.

Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. Proc. 18th International Conf. on Machine Learning.

Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23.

Chapelle, O., Weston, J., & Schölkopf, B. (2002). Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 15.

Chung, F., & Yau, S. (2000). Discrete Green's functions. Journal of Combinatorial Theory (A), 191-214.

Doyle, P., & Snell, J. (1984). Random walks and electric networks. Mathematical Assoc. of America.

Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277-296.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16.

Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete input spaces. Proc. 19th International Conf. on Machine Learning.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 2.

Meila, M., & Shi, J. (2001). A random walks view of spectral segmentation. AISTATS.

Ng, A., Jordan, M., & Weiss, Y. (2001a). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.

Ng, A. Y., Zheng, A. X., & Jordan, M. I. (2001b). Link analysis, eigenvectors and stability. International Joint Conference on Artificial Intelligence (IJCAI).

Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). University of Edinburgh.
Seeger, M. (2002). PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233-269.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.

Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems, 14.

Weiss, Y., & Freeman, W. T. (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13, 2173-2200.

Yu, S. X., & Shi, J. (2001). Grouping with bias. Advances in Neural Information Processing Systems, 14.
Semi-Supervised Learning Methods in Deep Learning

In deep learning, semi-supervised learning is an approach that learns from datasets containing both labeled and unlabeled samples. Compared with fully supervised learning, it exploits the information carried by the unlabeled samples, effectively providing more data and thereby improving model performance. This article takes a closer look at semi-supervised learning methods in deep learning, including their advantages, the main techniques, and their application areas.

Background. Traditional supervised learning usually needs large numbers of labeled samples to train a model, but in many practical applications labeled samples are hard to obtain or expensive to produce. Unlabeled samples, by contrast, are relatively easy to collect, yet cannot be used directly for supervised training. The goal of semi-supervised learning is to make full use of the information in the unlabeled samples to improve the model. It can be seen as a combination of unsupervised and supervised learning, using unlabeled samples for model training while using labeled samples for model optimization.

Semi-supervised learning methods. 1. Self-training. Self-training is one of the most basic semi-supervised methods. A model trained on the labeled samples produces predictions that are taken as pseudo-labels for the unlabeled samples, and the model is then trained on the pseudo-labeled data together with the original labels. Self-training usually proceeds iteratively: after each round, the updated model predicts labels for the remaining unlabeled samples and generates new pseudo-labels, as the sketch below illustrates.
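Here is a minimal self-training loop using scikit-learn. The base classifier, the 0.9 confidence threshold, and the synthetic two-moons data are arbitrary choices made for the example, not part of any particular published method.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[np.random.RandomState(0).choice(len(y), 10, replace=False)] = True
y_work = np.where(labeled, y, -1)          # -1 marks "unlabeled"

clf = LogisticRegression()
for _ in range(10):                        # each round adds confident pseudo-labels
    mask = y_work != -1
    clf.fit(X[mask], y_work[mask])
    proba = clf.predict_proba(X[~mask])
    confident = proba.max(1) > 0.9
    if not confident.any():
        break
    idx = np.where(~mask)[0][confident]
    y_work[idx] = proba[confident].argmax(1)

print("accuracy on all points:", (clf.predict(X) == y).mean())
```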
2. Semi-supervised generative models. These methods use a generative model to learn the data distribution and model it jointly with the conditional probability of the labeled samples. Typical examples include generative adversarial networks (GANs) and variational autoencoders (VAEs). Through the generative model, such approaches can generate additional samples, enlarging the effective sample space and improving model performance.

3. Semi-supervised denoising. Semi-supervised denoising methods introduce noise during training and exploit the relationship between the noisy and the unlabeled samples to improve the model. The core idea is to mix unlabeled samples with noise-perturbed samples and to constrain the model during training, improving its generalization ability.

Advantages of semi-supervised learning. Compared with fully supervised learning, semi-supervised methods have several advantages. 1. High data utilization: by exploiting unlabeled samples, semi-supervised learning makes full use of the available data and improves model performance.
On the Relationship between Semi-Supervised Dimensionality Reduction and Semi-Supervised Clustering (Part 6)

Semi-supervised learning refers to learning from a small amount of labeled data together with a large amount of unlabeled data. In practice, many machine learning tasks cannot obtain enough labeled data, which makes semi-supervised learning an important learning paradigm. Within semi-supervised learning, dimensionality reduction and clustering are two important tasks; this article discusses the relationship between semi-supervised dimensionality reduction and semi-supervised clustering.

Dimensionality reduction maps high-dimensional data into a lower-dimensional space. In supervised learning, common reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA). Given labeled data, such methods can effectively reduce the dimensionality and extract the most important features. In semi-supervised learning, however, only a small portion of the data is labeled, so traditional supervised dimensionality reduction cannot be applied directly, and semi-supervised dimensionality reduction becomes essential.

Semi-supervised dimensionality reduction methods come in two main families: graph-based methods and generative-model-based methods. Graph-based methods treat the data points as nodes of a graph, use pairwise similarities as edge weights, and reduce dimensionality through spectral properties of this graph. Typical examples are Laplacian Eigenmaps (LE) and Locally Linear Embedding (LLE). When applied to semi-supervised reduction, such methods can make full use of the information in the unlabeled data and thus produce better embeddings; a short example follows.
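Both embeddings mentioned above are available in scikit-learn, where SpectralEmbedding implements Laplacian Eigenmaps. The snippet shows the plain unsupervised calls, on top of which semi-supervised variants add label information; the dataset and parameters are placeholders.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding, LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Laplacian Eigenmaps (LE): spectral embedding of the k-NN graph.
le = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

# Locally Linear Embedding (LLE): reconstruct each point from its neighbors.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

print(le.shape, lle.shape)  # (1000, 2) (1000, 2)
```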
Generative-model-based methods, by contrast, model the data distribution and use the model to reduce dimensionality. The best-known example in this family is the autoencoder, which learns a feature representation of the data and maps it into a low-dimensional space. These methods also perform well on semi-supervised learning problems.

Like dimensionality reduction, clustering is an important unsupervised learning method; it partitions the data into a number of disjoint clusters. In traditional unsupervised learning, clustering methods such as K-means and hierarchical clustering are widely used. In semi-supervised learning, however, we often want to use the information in the labeled data to guide the clustering process, which makes semi-supervised clustering methods particularly important. Like semi-supervised dimensionality reduction, semi-supervised clustering methods can be divided into graph-based and generative-model-based approaches.
Semi-Supervised Clustering Algorithms in Detail

Semi-supervised learning uses both labeled and unlabeled data during training. Compared with supervised and unsupervised learning, it is closer to real application settings: real datasets usually contain large amounts of unlabeled data, while acquiring labels is often time-consuming and labor-intensive. Semi-supervised learning can train models on unlabeled data as well, improving performance and generalization. Within semi-supervised learning, semi-supervised clustering is an important research direction: it uses labeled and unlabeled data together to obtain better clustering results. This article describes semi-supervised clustering algorithms in detail.

The core idea of semi-supervised clustering is to let the labeled data guide the clustering of the unlabeled data. Broadly, semi-supervised clustering algorithms fall into two families: constraint-based methods and graph-based methods. Constraint-based methods guide the clustering with given constraints, for example must-link constraints (samples known to belong to the same class must be placed in the same cluster) and cannot-link constraints (samples known to belong to different classes must not be placed in the same cluster). Graph-based methods cluster by building a graph over the samples; spectral clustering is a common example in graph-based semi-supervised learning.

Among the graph-based methods, spectral clustering is widely used for semi-supervised clustering. Spectral clustering first encodes pairwise sample similarities in a similarity matrix, then computes an eigendecomposition (of a Laplacian of this matrix) to obtain a spectral representation of the samples, and finally clusters these representations. In the semi-supervised setting, spectral clustering can incorporate the information of the labeled data to guide the clustering and improve its accuracy. For example, one can build a weighted graph in which nodes represent samples and edge weights represent similarities, and give the labeled samples fixed label-derived weights, so that similar samples sharing a label are more likely to end up in the same cluster; a small example follows.
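This idea can be sketched with scikit-learn's spectral clustering on a precomputed affinity matrix. The way supervision is injected here (hard must-link/cannot-link edits of the affinities between the samples whose labels are known) is one simple illustration among many possible schemes, not a canonical algorithm.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_blobs(n_samples=200, centers=3, random_state=0)
A = rbf_kernel(X, gamma=0.5)               # similarity (affinity) matrix

# Inject supervision: strengthen edges between samples known to share a
# label and cut edges between samples known to differ (symmetric edits).
known = np.arange(15)                      # pretend the first 15 labels are known
for i in known:
    for j in known:
        A[i, j] = 1.0 if y[i] == y[j] else 0.0

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels[:15])
```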
Besides spectral clustering, graph-based semi-supervised learning includes many other algorithms, such as label propagation and semi-supervised support vector machines (S3VM). These algorithms all build a graph over the samples and use the graph topology and pairwise similarity information to perform semi-supervised learning.
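Label propagation itself ships with scikit-learn, where unlabeled points are marked with -1. A minimal example follows; the dataset and kernel settings are placeholders.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.semi_supervised import LabelSpreading

X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)
y_semi = np.full_like(y, -1)
y_semi[:5], y_semi[-5:] = y[:5], y[-5:]     # only 10 labels revealed

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
print("transductive accuracy:", (model.transduction_ == y).mean())
```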
An Introduction to Semi-Supervised Learning (II)

Semi-supervised learning is an important branch of machine learning that tries to use both labeled and unlabeled data for model training and prediction. Compared with supervised and unsupervised learning, it applies to a wider range of real-world problems. This article introduces the basic principles, common methods, and practical applications of semi-supervised learning.

1. Basic principles. In supervised learning we usually have labeled data for model training and testing, while in unsupervised learning models are trained on unlabeled data. Semi-supervised learning combines the two settings: a model is trained on a small amount of labeled data together with a large amount of unlabeled data. The basic principle is to exploit the distributional information of the unlabeled data to improve the model's generalization ability, and thereby its predictive performance.

2. Common methods. Several semi-supervised learning methods are widely used in practice. Among the most representative are graph-based methods, which build a graph from pairwise similarities, connect the labeled and the unlabeled data, and learn the distributional information of the data through a graph model. There are also semi-supervised methods based on generative adversarial networks (GANs), which use the game between a generative model and a discriminative model to improve generalization. In addition, there are kernel-based methods, semi-supervised clustering, and semi-supervised dimensionality reduction, each effective in particular application settings.

3. Applications. Semi-supervised learning has a wide range of practical uses. In computer vision it can be applied to image classification, object detection, and image segmentation; in natural language processing to text classification, sentiment analysis, and machine translation; in recommender systems to personalized recommendation and information filtering. It also has important applications in bioinformatics, financial risk control, and industrial manufacturing.

Summary. As an important branch of machine learning, semi-supervised learning matters for its principles, its methods, and its applications. As data volumes keep growing and labeling costs keep rising, it will only become more important. It is therefore worth studying its theory and methods in depth, in order to tackle complex real-world problems.
Data Mining and Machine Learning (I)

Part I: Data Mining and Machine Learning

1. The difference between data mining, machine learning, and deep learning

1) Data mining. Data mining is a very broad concept and an emerging discipline, concerned with how to extract useful information from massive amounts of data. Data mining work can be done with BI (business intelligence), with statistical analysis, with big data technology, or by marketing operations; even analyzing data in Excel and discovering useful information that then guides your business counts as data mining. Currently, the most common approach is to implement data mining with machine learning models.

2) Machine learning. Machine learning is an interdisciplinary field between computer science and statistics; its basic goal is to learn a function (mapping) x -> y for classification, clustering, or regression. It is often discussed together with data mining because much of today's data mining work is carried out with algorithmic tools provided by machine learning. For example, in advertising click-through-rate (CTR) estimation, petabyte-scale click logs are fed through a typical machine learning pipeline to obtain a prediction model that improves the click-through rate and return of online advertising; in personalized recommendation, machine learning algorithms analyze the purchase, browsing, and favoriting logs on a platform to obtain a recommendation model that predicts which products a user will like.

3) Deep learning. Deep learning is currently a hot topic within machine learning. It derives from neural network algorithms and has achieved very good results in the classification and recognition of rich media such as images and speech, so major research institutions and companies have invested heavily in related research and development.

In short: data mining is a broad concept whose common methods mostly come from machine learning; deep learning likewise originates from machine learning models and is essentially a descendant of neural networks.

2. The data mining landscape. Data mining draws on statistics, database systems, data warehousing, information retrieval, machine learning, applications, pattern recognition, visualization, algorithms, and high-performance computing (distributed and GPU computing).

3. The data mining process. Increasingly, data mining is seen as a knowledge discovery process (KDD: Knowledge Discovery in Databases). The KDD process is an iterative sequence:

1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine multiple data sources.
3. Data selection: extract from the database the data relevant to the analysis task.
4. Data transformation: transform and consolidate the data into a form suitable for mining, e.g. through summarization or aggregation.
5. Data mining: extract data patterns using suitable model algorithms.
6. Pattern evaluation: identify the truly interesting patterns that represent knowledge, according to some interestingness measure.
7. Knowledge presentation: present the mined knowledge to the user with visualization and knowledge representation techniques.

In summary, data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
Semi-Supervised Clustering Methods in Machine Learning
Semi-supervised clustering is an important technique in machine learning that combines supervised and unsupervised approaches. By exploiting a small amount of labeled data together with a large amount of unlabeled data, it can deliver more accurate and reliable clustering results.

Semi-supervised clustering addresses the setting in which unlabeled data is plentiful but labeled data is scarce. Traditional unsupervised clustering uses only the unlabeled data and cannot take advantage of the information in whatever labeled data exists. Supervised learning, conversely, can use labeled data for classification or regression, but the limited amount of labels makes it hard to meet the needs of large-scale data. The core idea of semi-supervised clustering is therefore to combine the information of the unlabeled data with that of the small labeled set, and to cluster via semi-supervised learning.

One of the classical methods is the S3C (Semi-Supervised Spectral Clustering) algorithm, which computes a low-dimensional representation of the labeled and unlabeled data and clusters by optimizing an objective function; it is efficient and scalable on large datasets. Another common method is co-training, which trains two mutually independent classifiers at the same time, one using the labeled data and the other the unlabeled data. By alternately training the classifiers and updating them using their agreement on the unlabeled data, co-training can fully exploit both kinds of data and improve clustering accuracy; a sketch follows.
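A compact co-training loop under the usual assumption of two feature views is shown below. Here the two "views" are simply the two halves of the feature vector, the classifiers add their confident predictions to a shared labeled pool, and the classifier choice and confidence threshold are arbitrary simplifications.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
views = [X[:, :10], X[:, 10:]]             # two feature "views"
y_work = np.full(len(y), -1)
y_work[:20] = y[:20]                       # small labeled seed

clfs = [GaussianNB(), GaussianNB()]
for _ in range(10):
    changed = False
    for k in range(2):
        mask = y_work != -1
        clfs[k].fit(views[k][mask], y_work[mask])
        proba = clfs[k].predict_proba(views[k][~mask])
        conf = proba.max(1) > 0.95
        idx = np.where(~mask)[0][conf]
        if len(idx):
            y_work[idx] = proba[conf].argmax(1)  # confident labels join the pool
            changed = True
    if not changed:
        break

print("pseudo-labeled:", (y_work != -1).sum(), "of", len(y))
```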
Beyond these two, there are many other semi-supervised clustering methods, such as graph-based semi-supervised clustering and prototype-based semi-supervised clustering. Depending on the characteristics of the data and the requirements of the problem, these methods adopt different model designs and optimization strategies; choosing an appropriate one requires weighing the data scale, the data characteristics, and the availability of labeled data.

Semi-supervised clustering is widely applied. In social network analysis, it can cluster users to reveal latent social or interest groups; in image segmentation, it can yield more accurate boundaries and object extraction; in recommender systems, it can cluster users and items for personalized recommendation and targeted advertising.
SSNMF: Sparse Semi-Nonnegative Matrix Factorization
SSNMF (Sparse Semi-Nonnegative Matrix Factorization) is a matrix factorization method for high-dimensional data analysis. It can effectively extract sparse and semi-nonnegative features from data, and is widely used in image processing, text mining, and social network analysis. SSNMF factorizes the original data matrix into the product of two low-rank matrices to achieve feature extraction and dimensionality reduction: one factor captures the sparse features of the data, the other its semi-nonnegative features.

The sparse feature matrix represents the sparsity information of the original data. Sparsity means that only a few entries of the data take large values while most entries are close to zero; by extracting sparse features we can discover the important patterns and structures in the data. The counterpart of the sparse features is the semi-nonnegative feature matrix. Semi-nonnegative features have entries that are all nonnegative but not necessarily sparse; by extracting them we capture the overall trends and distribution of the data.

The key to SSNMF lies in choosing suitable constraints and optimization algorithms for the factorization. A common constraint is a sparsity constraint, which minimizes the number of nonzero entries in the sparse feature matrix; another is a nonnegativity constraint, which requires all entries of a factor to be nonnegative. The optimization is usually carried out iteratively, with methods such as alternating least squares (ALS) or gradient descent (GD). These algorithms repeatedly update the sparse and semi-nonnegative feature matrices so that their product gradually approximates the original data matrix, yielding a better factorization.
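For reference, the classical multiplicative-update scheme of Lee and Seung for plain NMF minimizing the Frobenius error ||X - WH||_F^2 is shown below; SSNMF layers its sparsity and semi-nonnegativity constraints on top of this kind of iteration, so this is a generic sketch rather than the SSNMF algorithm itself.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9):
    """Plain NMF via Lee-Seung multiplicative updates for ||X - WH||_F^2."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    W = rng.uniform(size=(n, r))
    H = rng.uniform(size=(r, p))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # nonnegativity is preserved
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # because updates are multiplicative
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(30, 20)))
W, H = nmf(X, r=5)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative error
```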
SSNMF has a very wide range of applications. In image processing, it can extract texture and edge features for image classification and retrieval. In text mining, it can extract topic and keyword features for document clustering and classification. In social network analysis, it can extract user interest and community features for user recommendation and community detection.
Graph Regularized Semi-Supervised Nonnegative Matrix Factorization
Du Shiqiang, Shi Yuqing, Wang Weilan, Ma Ming. Computer Engineering and Applications, 2012, 48(36), pp. 194-200. Affiliations: School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030, China; School of Electrical Engineering, Northwest Minzu University, Lanzhou 730030, China. Classification: TP391.4.

Abstract: This paper presents a novel algorithm called Graph regularized-based Semi-supervised NMF (GSNMF). It overcomes the shortcomings of Non-negative Matrix Factorization (NMF), Constrained NMF (CNMF) and Graph regularized NMF (GNMF), which either ignore the local geometric structure of the sample data or make insufficient use of the label information; moreover, NMF, CNMF and GNMF are all special cases of GSNMF. The convergence of the GSNMF algorithm is proved. When computing a low-dimensional nonnegative factorization of the sample data, the algorithm preserves the intrinsic geometry of the data within a graph framework while using the label information of the known samples for semi-supervised learning, making nearby samples with the same class label more compact and keeping nearby classes separated. Compared with NMF, LNMF, PNMF, GNMF and CNMF, experimental results on the ORL face database, the FERET face database and the USPS handwriting database show that the proposed method achieves better clustering results.
Graph Regularized Nonnegative Matrix Factorization for Data Representation
1 INTRODUCTION
THE techniques for matrix factorization have become popular in recent years for data representation. In many problems in information retrieval, computer vision, and pattern recognition, the input data matrix is of very high dimension. This makes learning from example infeasible [15]. One then hopes to find two or more lower dimensional matrices whose product provides a good approximation to the original one. The canonical matrix factorization techniques include LU decomposition, QR decomposition, vector quantization, and Singular Value Decomposition (SVD). SVD is one of the most frequently used matrix factorization techniques. A singular value decomposition of an M x N matrix X has the following form:

    X = U Σ V^T,

where U is an M x M orthogonal matrix, V is an N x N orthogonal matrix, and Σ is an M x N diagonal matrix with Σ_ij = 0 if i ≠ j and Σ_ii ≥ 0. The quantities Σ_ii are called the singular values of X, and the columns of U and V are called the left and right singular vectors, respectively.
Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: A case study

Renaud Gaujoux (Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa), Cathal Seoighe (School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Ireland)

Article history: Available online 10 September 2011. Keywords: NMF; Microarray; Gene expression; Deconvolution; Sample heterogeneity.

Abstract

Heterogeneity in sample composition is an inherent issue in many gene expression studies and, in many cases, should be taken into account in the downstream analysis to enable correct interpretation of the underlying biological processes. Typical examples are infectious diseases or immunology-related studies using blood samples, where, for example, the proportions of lymphocyte sub-populations are expected to vary between cases and controls.

Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, notably in bioinformatics, where its ability to extract meaningful information from high-dimensional data such as gene expression microarrays has been demonstrated. Very recently, it has been applied to biomarker discovery and gene expression deconvolution in heterogeneous tissue samples.

Being essentially unsupervised, standard NMF methods are not guaranteed to find components corresponding to the cell types of interest in the sample, which may jeopardize the correct estimation of cell proportions. We have investigated the use of prior knowledge, in the form of a set of marker genes, to improve gene expression deconvolution with NMF algorithms. We found that this improves the consistency with which both cell type proportions and cell type gene expression signatures are estimated. The proposed method was tested on a microarray dataset consisting of pure cell types mixed in known proportions. Pearson correlation coefficients between true and estimated cell type proportions improved substantially (typically from about 0.5 to approximately 0.8) with the semi-supervised (marker-guided) versions of commonly used NMF algorithms. Furthermore, known marker genes associated with each cell type were assigned to the correct cell type more frequently for the guided versions. We conclude that the use of marker genes improves the accuracy of gene expression deconvolution using NMF and suggest modifications to how the marker gene information is used that may lead to further improvements.

1. Introduction

A typical objective in gene expression studies using microarrays or deep sequencing is the identification of genes that are differentially expressed between groups of samples, such as case vs. control or normal vs. tumor tissue. Although sample heterogeneity is widely acknowledged as potentially a substantial confounding factor in such analysis (Cleator et al., 2006; Whitney et al., 2003), it is often discarded, due to the unavailability of data on the composition of the samples. Laboratory techniques such as laser capture micro-dissection, fluorescence-activated cell sorting or flow cytometry exist to separate or quantify constituents from each sample. However, these require extra resources that may not be available in all situations, such as sample quantity, time, technology or funds, beside the fact that manipulating the samples may alter the original gene expression profiles. The capacity to accurately deconvolve gene expression data
computationally is, therefore, an attractive alternative. Starting with Venet et al. (2001), many authors proposed different methods and provided insights on how to estimate the cell type/tissue specific signatures or their relative proportions (Zhao and Simon, 2010). Some methods perform partial gene expression deconvolution, in the sense that they require that either the signatures (Lu et al., 2003; Wang et al., 2006; Abbas et al., 2009; Clarke et al., 2010) or estimates of the mixture proportions (Lähdesmäki et al., 2005; Erkkilä et al., 2010; Shen-Orr et al., 2010) are available. Others perform complete deconvolution, where both the cell/tissue signatures and the proportions are estimated directly from the global gene expression data of the heterogeneous samples (Roy et al., 2006; Repsilber et al., 2010).
different cell types. Our proposition is to enforce the recovery of such signature pat-terns within the estimation process,instead of looking for them a posteriori,with the risk of obtaining biologically inconsistent com-ponents.This paper explores the benefit of guiding the estimation in such a way and shows how,on a real dataset,the proposed ap-proach is able to dramatically improve the capacity of standard NMF algorithms to both recover meaningful cell/tissue signatures and accurately estimate their relative mixture proportions.2.Material and methods2.1.DataWe evaluated the performance of several NMF methods on the microarray dataset GSE11058accessible at NCBI GEO database (Barrett et al.,2010).It contains data from a controlled mixture experiment performed by Abbas et al.(2009)to develop their partial deconvolution method.The dataset comprises the gene expression profiles from four pure cell lines of immune origin(Jurkat,IM-9,Raji, THP-1)as well as four different mixtures for which the relative pro-portions of each cell type are known.Mixtures of cells were per-formed in triplicate making up a total of24arrays(triplicates of each pure cell type and mixture).Because both the pure gene expres-sion profiles and the mixture proportions are available,these data provide a ground truth reference against which the proportions and cell-type expression signatures obtained from complete decon-volution can be assessed.For the latter,the mean expression across pure cell type samples was used as the reference.We used the normalized gene expression data stored as Series Matrixfiles available from GEO.This data is normalized by global scaling with Microarray Suite version5.0(MAS5.0)using Affyme-trix default analysis settings,with the trimmed mean target inten-sity of each array arbitrarily set to500(e.g.see description page for sample GSM279589).The complete dataset(54675probesets)was further processed to produce a curated and a full dataset that were subsequently used in the analysis.The curated dataset is limited to the359probesets that compose thefinal basis matrix used in Abbas et al.(2009)to deconvolve white blood cell samples from Systemic Lupus Erythematosus patients,and consists of a set of marker probesets for common immune cell subsets in different states(e.g.resting or activated).The purpose of this dataset is to assess the performance of the deconvolution methods in a setting that is a priori favorable,due the high discriminative power of the probesets(see Supplementary Fig.1).Moreover,it provides insight about how deconvolution works when only considering a limited number of genes.The full dataset is composed of the40791 probesets that could be mapped to an Entrez Gene identifier, using the annotation package hgu133plus2.db from bioconductor (Gentleman,2004),filtering out any built-in Affymetrix control probesets,whose probe IDs start with the prefix AFFX-.2.2.Methods2.2.1.NMF algorithmsWe considered seven NMF algorithms,among which three are guided algorithms that incorporate prior knowledge from marker probesets within thefitting process,using the method described in Section3.A brief description of each method as well as some de-tails about their implementation follows.Canonical method names are underlined to distinguish them from labeling names.The method deconf was proposed by Repsilber et al.(2010)spe-cifically for performing gene expression deconvolution.It applies an alternating least-square schema to minimize the euclidean dis-tance between the target matrix and the NMF 
estimate.After each least-squarefit,both non-negativity and scaling or sum-to-one constraints are enforced onto the basis and the mixture coefficient matrices.Algorithm lee,which was proposed by Lee and Seung(1999)ini-tially for image recognition,inspired several other NMF algorithms. In this work,we considered the version that minimizes the euclid-ean distance via iterative multiplicative updates,which are derived from a gradient descent approach.In its original definition,the algorithm only ensures that the non-negativity constraints are sat-isfied at each iteration.We enforced the sum-to-one constraint on thefinalfit only,by scaling the columns of the mixture coefficients.Algorithm brunet was developed by Brunet et al.(2004)to per-form class discovery in cancer studies,and is an enhancement of Lee’s algorithm for minimizing the Kullback-Leibler divergence (Lee and Seung,2001).One of its particular features is the intro-duction of a stopping criterion based on the stationarity of the clus-tering consensus matrix,which makes sense in the context of class discovery,as it indicates that the the model achieved a stationary point for the clusters.However,this criterion is too lax in the case of deconvolution,as it might stop the algorithm too early and pre-vent further improvements of the estimation accuracy.For this reason,we instead used a stopping criterion based on the stationa-rity of the objective function,which indicates that the approxima-tion of the target matrix would not improve significantly with further iterations.As per method lee,scaling of thefinal mixture coefficient matrix ensures the result satisfies the sum-to-one con-straint on the proportions.914R.Gaujoux,C.Seoighe/Infection,Genetics and Evolution12(2012)913–921The method nsNMF was designed(Pascual-Montano et al., 2006)for performing bi-clustering of microarray data,and intro-duces a constant smoothing matrix into the model in order to obtain sparser results.For this method too,we used the stopping criterion based on the stationarity of the objective function instead of the clustering consensus matrix.The mixture proportions are obtained as the product of the smoothing matrix by thefinal mix-ture coefficient matrix,with afinal column scaling to satisfy the sum-to-one constraint.Methods G-lee,G-brunet and G-nsNMF are modifications we made from the methods lee,brunet and nsNMF,respectively,that take into account prior knowledge of markers for each cell type using the strategy described in Section3.In the remainder of the paper,we refer to these last three methods as the guided methods, and sometimes substitute the prefix‘‘G’’for a numerical suffix that specifies the number of markers used in thefitting process(e.g. lee-5refers to the method G-lee that uses5markers per cell type). Different NMF models were estimated for each of these methods using an increasing number of markers,precisely{1–10,15,20, 25}and{5,10,20,30,40,50,70,90,120,200}for the curated and the full datasets,respectively.An important point to bear in mind is that thefirst four methods actually do require and use the same prior knowledge as the guided methods,that is a set of marker genes for each cell-type.Indeed such methods return estimated basis components in an unpredictable order and unlabeled.Marker genes are used to map each basis com-ponent to one of the real cell type signatures(see Section2.2.3). 
Hence, strictly speaking, all methods are supervised, some at a final mapping stage, others during the estimation step.

2.2.2. Marker selection

We selected a set of marker probesets for each cell type based on the differences in gene expression observed between the pure samples. For each probeset, we computed a standard t-test statistic between the samples from the cell type in which the gene was most highly expressed and the second most highly expressing cell type (Abbas et al., 2009), and defined as markers the probesets with a p-value less than 0.05 and a log2 fold change greater than 1.5. The p-values were computed on the log2 transformed data, using a two-sided t-test with equal variance. Thus, markers were selected for which the expression level in the most highly expressing cell type was significantly greater than the expression level in the next most highly expressing cell type. Table 1 shows the total number of markers per cell type obtained for each dataset. In the case of the full dataset, only the top 300 markers of each cell type were used in the subsequent analysis. This is to limit the number of false positives (cf. the q-values in Table 1), in addition to the fact that, in practice, it is unrealistic to require a very large number of markers for each cell type of interest.

These markers were used by the guided methods to enforce the expected expression block pattern on the estimated signatures (cf. Section 3), and by the non-guided methods to a posteriori map the estimated signatures to the real cell types (cf. Section 2.2.3).

2.2.3. Cell type mapping

As already stated, all methods require and use a set of markers at some stage of the deconvolution process. The guided methods G-brunet, G-lee and G-nsNMF use the markers to enforce cell type specific expression patterns on each basis component. This means that each component is de facto associated with a given cell type and no final mapping stage is necessary. For the non-guided methods deconf, brunet, lee and nsNMF, however, the order of the components is not known a priori and these need to be mapped heuristically and a posteriori to one of the cell types. Hence the mapping process is critical in this case, as it provides all their meaningfulness to the results: trying to estimate proportions is meaningless if these cannot be reliably associated with the correct cell types.

Repsilber et al. (2010) applied a majority count decision rule to assign the estimated components from two cell types. We extended the principle to make it work robustly for any number of cell types. The mapping strategy consists in iteratively associating each component with the cell type with the maximum percentage of consistent markers. More precisely, we first build a predicted map that assigns each marker to the estimated component that expresses it the most, and compute the contingency table of this map with the theoretical map built from the marker list. Each entry in the contingency table is the number of markers that are consistent between a given component and a given cell type. The columns of the contingency table are scaled to sum to one in order to obtain the percentages of markers from each cell type that are consistent with each component. The component and the cell type that achieve the maximum percentage of consistent markers are mapped together and removed from the contingency table. The mapping is repeated until all components have been assigned to a cell type. The components obtained from the non-guided methods were assigned using this strategy with the complete set of markers, as this is expected to
give more robust mappings.

2.2.4. Implementation details

All computations were done within R (R Development Core Team, 2011), using the package NMF (Gaujoux and Seoighe, 2010), which provides a general framework for running, developing and testing NMF algorithms. We used the built-in optimized version of the methods brunet, lee and nsNMF, only changing the stopping criterion as described in Section 2.2.1. A maximum of 2000 iterations was allowed. The guided methods G-brunet, G-lee and G-nsNMF were implemented upon their respective non-guided versions, by enforcing inclusion of the marker patterns on the basis matrix after each iteration. The method deconf was implemented within the same framework by wrapping the function provided by the R package deconf, available in the Supplementary data of Repsilber et al. (2010).

All methods need to be initialized with a starting point, i.e. an initial NMF model. This is randomly chosen by drawing the entries of the basis and mixture coefficient matrices from a uniform distribution U[0, max(X)], where X is the global gene expression matrix. The mixture coefficients are then scaled to satisfy the sum-to-one constraint. Given that none of the methods have established global convergence properties, all NMF estimates were obtained as the fit that achieved the least residual error from 200 runs, each one using a different random initialization. To avoid biasing the comparisons by the choice of different starting points, and out of concern for reproducible research (Hothorn and Leisch, 2011), we fixed the random seed to a common value before each set of runs (seed = 123456). This allows the package NMF to guarantee that each set of runs uses a common sequence of random initializations, generated by independent random streams (L'Ecuyer et al., 2002; L'Ecuyer and Leydold, 2005). See Appendix A for detailed information on the R installation used to generate the results.

Table 1. Total number of markers per cell type for each dataset. For the full dataset, the q-values estimate for each cell type the proportion of false positives expected in the top 300 markers.

    Dataset         Jurkat   IM-9    Raji    THP-1
    Full            733      562     437     1294
    q-value (300)   0.028    0.058   0.062   0.004
    Curated         21       17      20      26
cell types(e.g.T-cells,Monocytes),and each column of the matrix H provides the proportions of each cell type in the corre-sponding sample.Classical NMF algorithms use iterative optimization methods to minimize an objective function that measures the distance be-tween the target global gene expression matrix and its NMF mon objective functions are based on the Frobenius norm or the Kullback-Leibler divergence(Lee and Seung,2001; Cichocki et al.,2008).Variations on the Eq.(1)or the optimization problem exist in order to take into account some a priori knowl-edge about the data or the solution(Hoyer,2004;Pascual-Montano et al.,2006).One example of such variations is the sum-to-one con-straint we imposed on the mixture coefficient matrix H in order to represent relative proportions,instead of absolute counts.Our proposition is to impose another set of constraints on the signature matrix W,with the objective of estimating more stable and meaningful cell type signatures.This should,in turn,improve the estimation of the mixture proportions.In order to achieve this we use a set of marker genes,each one of which is known to be–almost–exclusively expressed by just one of the cell types.We therefore want to constrain the rows of the signature matrix W that correspond to each marker gene so that all entries are zero ex-cept one.In this way we associate a priori each gene expression signature(i.e.each column of W)to a given cell type.Many NMF algorithms such as lee,brunet,and nsNMF,imple-ment gradient-descent methods using iterative multiplicative up-dates.These define the next iterate value of each factor(W and H)as its element-wise product by another matrix,chosen to ensure –at least–that the objective function is non-increasing(Berry et al.,2007).Therefore,theoretically,for this kind of algorithm, enforcing block patterns on the initial signatures guarantees their persistence alonog all iterations.In practice however,due to adjustments commonly made to avoid numerical difficulties,one may require to enforce the block patterns after each iteration. 
On the other hand, the method deconf is based on an alternating least-squares strategy. Rigorously constraining block expression patterns for this kind of algorithm requires more sophisticated approaches, such as projected-gradient methods (Lin, 2007). In fact, the method deconf already imposes the non-negativity and sum-to-one constraints using a heuristic whose implications for the objective value are not clear, all the more so if marker constraints are added. Therefore, for the purpose of this paper, we incorporated such constraints only into the multiplicative NMF algorithms considered, viz. lee, brunet and nsNMF. However, the performance achieved by deconf on the full dataset suggests that adapting this algorithm to make use of markers could potentially be fruitful.

4. Results and discussion

All analyses were performed on both the curated and the full datasets. Since we are interested in complete deconvolution, the pure samples were excluded from the expression matrix, which mimics the realistic situation in which expression data for the pure samples are not available. Fig. 1 shows the mean absolute differences (mAD) between the true and estimated proportions achieved by each method for a varying number of markers, on both datasets. The values achieved by the guided methods are plotted with bullets, those achieved by the non-guided methods with single solid diamonds at abscissa 0. Note that this abscissa choice is for plotting purposes only, and does not reflect the actual usage of markers by these methods (cf. Section 2.2.1), since all of these methods do make use of markers to identify components with cell types.

The benefit of using markers to guide the fitting process is clear on both datasets. Indeed, the guided methods achieve significantly lower mAD values than their respective non-guided versions, for any number of markers, meaning that enforcing marker patterns on the basis components improves the accuracy of the mixing proportion estimates. The improvement in accuracy is particularly striking in the case of the curated dataset, where using a single marker per cell type already improves dramatically the estimation of the mixture proportions, especially for the methods brunet and lee. The best accuracy was achieved by the method brunet-7 (mAD = 0.05). In Figs. 2 and 3(a–b) we highlight the differences between the estimates obtained from the methods brunet, brunet-7 and deconf on the curated dataset. For completeness, we show in Supplementary Figs. 2–4 the plots obtained for each method and each number of markers. These are animated plots which highlight the effect of increasing the number of enforced markers.
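For reference, the two accuracy criteria used throughout this section, the mAD and the global Pearson correlation between true and estimated proportions, can each be sketched as a one-line R function. We assume here that mAD is the plain mean of the element-wise absolute differences; variable names are ours:

    # accuracy criteria between true and estimated proportion matrices
    mAD <- function(H.true, H.est) mean(abs(H.true - H.est))
    r.global <- function(H.true, H.est) cor(as.vector(H.true), as.vector(H.est))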
Fig. 2(a and c) shows scatter plots of the estimated versus the true mixture proportions for brunet and brunet-7, respectively. On each plot, the colors distinguish between the four cell types, the global Pearson correlation coefficient r is indicated at the bottom right, and the values in parentheses within the legend indicate the Pearson correlation coefficients computed for each cell type separately. The proportion estimates from the guided method brunet-7 are much more accurate (mAD = 0.05) and highly correlated with the true proportions (r = 0.91) than the estimates obtained by its non-guided version brunet (mAD = 0.208 and r = 0.52).

The heatmaps in Fig. 2(b and d) support the same conclusions. These show the expression of all the marker probesets across the estimated signatures. The assigned cell type is indicated at the bottom of each signature. To emphasize the differences between cell types, the expression levels were scaled in each row separately into relative percentages of expression. The color palette ranges from light yellow to dark red for, respectively, 0% and 100% of expression. The markers are ordered by increasing p-value within their respective reference cell type. The colored annotation columns on the left-hand side of the heatmap indicate which estimated cell type expresses each marker the most highly. Markers are colored according to their respective true cell type, using the same color code as the scatter plots. Hence, a method correctly recovers the markers of a given cell type if the associated annotation column consists of a single monochromatic block. As a result of the strategy used to guide the algorithm, the enforced markers of each cell type show a 100% value in their corresponding components, and 0 elsewhere.

Fig. 2(d) shows that using markers successfully resolved the inconsistent signatures obtained for IM-9 and THP-1 by the standard method brunet (Fig. 2(b)). In fact, independently of the number of markers being enforced, all guided methods estimated signatures which exhibit the expected block pattern (Supplementary Figs. 2–4). The recovered signatures are not only better defined than those estimated by the standard methods, but they are also biologically more meaningful, as most of the markers are highly expressed by the correct cell type. Although a single marker per cell type was sufficient to guide the algorithm towards relevant cell-type signatures, more markers were needed to gradually improve the accuracy, up to a certain point after which the proportion estimates seem to become biased, although more precise. For example, the proportion of IM-9 seems to be systematically under-estimated by brunet-25, resulting in a compensatory over-estimation of the other proportions that, however, appears evenly divided amongst the different cell types (Supplementary Fig. 2).

Fig. 3(b) indicates that a similar inconsistency issue affects the signature estimated by the method deconf (mAD = 0.216). In this case, the estimated signature assigned to THP-1 also expresses most of the markers for Jurkat at a high level, somehow even more clearly than the markers for THP-1 itself.
We noticed that swapping the signatures assigned to these two cell types improved the accuracy (mAD = 0.172), which suggests that they were mis-assigned (Supplementary Fig. 5). However, this does not change the signatures themselves, which remain inconsistent with the real underlying cell types, limiting the method's ability to properly estimate the mixture proportions (Fig. 3(a)).

As far as the method nsNMF is concerned, the mAD plots in Fig. 1 show that, when not guided, it achieved consistently lower mAD values than brunet and lee, but benefited relatively moderately from the usage of the markers compared to the two latter methods. Probably due to its extra sparsity constraint, however, its guided version estimates some proportions very accurately (IM-9 and Raji), but others with a systematic bias (Supplementary Fig. 4). On the other hand, the method lee seems to be somewhat more sensitive to the variation in the number of markers, compared to brunet and nsNMF. This could be explained by fundamental differences in the objective functions these methods optimize. Indeed, brunet and nsNMF minimize the Kullback–Leibler divergence, which is based on log-differences, whereas lee minimizes the Euclidean distance, which is based on squared differences, making it more sensitive to deviations, in particular those that arise from the enforcement of the marker expression patterns.

Fig. 1 shows that the method deconf achieved a remarkable accuracy when applied to the full dataset (mAD = 0.079), and is only outperformed by the guided methods G-lee and G-brunet when these use an appropriate number of markers. Fig. 3(c) indicates that the global Pearson correlation is high (r = 0.82) and that all cell types except THP-1 were recovered with a correlation greater than 0.9. Along the same lines, the heatmap in Fig. 3(d) reveals that the estimated signature for THP-1 is particularly inconsistent with its associated markers, all of them being mostly expressed by the component assigned to Jurkat. Because the other three cell types are relatively well recovered, the estimation of the mixture proportions would not be completely hampered, especially given the presence of the sum-to-one constraint. On the other hand, Fig. 4(d) shows that, when guided by 30 markers per cell type, the method lee recovers cell type signatures that are extremely consistent with the real cell types, while estimating the mixture proportions with a similar accuracy (mAD = 0.075). In particular, enforcing these markers improves the correlation of the estimated proportions of THP-1 from 0.56 to 0.79 (cf. Fig. 4(a and c)). The plots obtained for each method and each number of markers are all shown in Supplementary Figs. 6–8.

On a more general level, these results raise the main potential caveat of using the non-guided methods: their performance heavily relies on their ability to recover meaningful cell signatures without being supervised. Given the noise inherent in gene expression data and other possible confounding factors, this is not guaranteed to succeed, especially when estimating more than two cell types. Moreover, the interpretation of their results also depends on the mapping heuristic that assigns the estimated components to real cell types. As an example, Table 2 shows the percentages of consistent markers obtained by the method deconf on the full dataset, using all the markers from each cell type. The columns correspond to the real cell types, the rows to the estimated signatures.
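Both the mapping heuristic and the consistency percentages of Table 2 lend themselves to a compact implementation. The following R sketch assigns each marker to the estimated component that expresses it the most highly and tabulates the result as column percentages; all function and variable names are ours, and the exact definition of the entries of Table 2 is our assumption:

    # W.est: estimated signatures (marker rows x components)
    # marker.type: factor giving the true cell type of each marker row
    consistency.table <- function(W.est, marker.type) {
      best <- apply(W.est, 1, which.max)        # top component for each marker
      tab <- table(component = factor(best, levels = seq_len(ncol(W.est))),
                   cell.type = marker.type)
      round(100 * sweep(tab, 2, colSums(tab), "/"), 1)  # percentages per cell type
    }

A component would then be mapped to the cell type whose markers it captures most often, i.e. the column in which its row of the table peaks.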