Missing Value Estimation for DNA Microarray
Expression Data: Least Squares Imputation
Hyunsoo Kim, Gene H. Golub, and Haesun Park
Department of Computer Science and Engineering,
University of Minnesota,
200 Union Street S.E., 4-192 EE/CS Building,
Minneapolis, MN 55455, U.S.A.
and
Computer Science Department,
Stanford University,
Gates Building 2B #280, Stanford CA 94305-9025, U.S.A.
February 2, 2004
Abstract
Motivation: Gene expression microarray data sets often contain missing expression values. Robust missing value estimation methods are needed since many algorithms for gene expression analysis require a complete matrix of gene array values. In this paper, imputation methods based on least squares and cluster structure are proposed to estimate missing values in gene expression data; they exploit local and cluster structures in the data.
Methods: The proposed least squares based method (LSimpute) represents a target gene that has missing values as a linear combination of similar genes. The similar genes are chosen by k-nearest neighbors or by the concept of coherent genes, which have large absolute values of Pearson correlation coefficients. In addition, several cluster based imputation methods are proposed, including a method based on dimension reduction (DRimpute), which takes advantage of cluster structure in the reduced dimensional space to acquire a more accurate cluster structure.
Results: LSimpute outperforms the Bayesian principal component analysis (BPCA) based method as well as the conventional weighted-nearest neighbor method (KNNimpute) for all four different data sets tested.
Availability: LSimpute is available from the authors upon request.
Corresponding author: Haesun Park, E-mail: hpark@, Phone: 612-625-0041, Fax: 612-625-0572. This material is based upon work supported by the National Science Foundation Grants CCR-0204109 and ACI-0305543. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).
Introduction
Gene expression data sets often contain missing values for various reasons. For example, the background and the signal may have similar intensities, the surface of the chip may not be planar, there may be dust on the slides, the probe may not be properly fixed on the chip or washed properly, or the hybridization step may not work properly. There are several approaches for estimating the missing values. Recently, for missing value estimation, the singular value decomposition based method (SVDimpute) and weighted-nearest neighbors imputation (KNNimpute) have been introduced (Troyanskaya et al., 2001). It has been shown that KNNimpute performs better on non-time series data or noisy time series data, while SVDimpute works well on time series data with low noise levels. Overall, the weighted-nearest neighbors based imputation provides a more robust method for missing value estimation than the SVD based method.
Throughout the paper, we will use $G \in \mathbb{R}^{m \times n}$ to denote a gene expression data matrix with $m$ genes and $n$ experiments, and assume $m \gg n$. In the matrix $G$, a row $g_i^T \in \mathbb{R}^{1 \times n}$ represents expressions of the $i$th gene in $n$ experiments:
$$G = \begin{pmatrix} g_1^T \\ \vdots \\ g_m^T \end{pmatrix} \in \mathbb{R}^{m \times n}.$$
A missing value in the $l$th location of the $i$th gene is denoted as $\alpha$, i.e. $G(i,l) = g_i(l) = \alpha$. For simplicity of algorithm description, all missing value estimation algorithms mentioned in this paper are described first assuming there is a missing value in the first position of the first gene, i.e. $G(1,1) = g_1(1) = \alpha$; then general algorithms for our proposed missing value estimation methods for DNA microarray expression data are introduced.
The KNNimpute method (Troyanskaya et al., 2001) finds $k$ ($k < m$) other genes with expressions most similar to that of the target gene $g_1$ and with the values in their first positions not missing. The missing value of $g_1$ is estimated by the weighted average of values in the first positions of the $k$ closest genes. For the weighted average, the contribution of each gene is weighted by the similarity of its expression to that of $g_1$.
In the SVDimpute method (Troyanskaya et al., 2001), the SVD of the matrix $G'$, which is obtained after all missing values of $G$ are substituted by zero or row averages, is computed. Then, using the $t$ most significant eigengenes of $G'$, where the specific value of $t$ is either predetermined or determined based on data sets, a missing value $\alpha$ in the first array of $g_1$ is estimated by regressing this gene against the $t$ most significant eigengenes. Using the coefficients of the regression, the missing value is estimated as a linear combination of the values in the first position of the $t$ eigengenes. When determining these regression coefficients, the first missing value of $g_1$ and the first values of the $t$ eigengenes are not used. The above procedure is repeated until the total change of the matrix becomes insignificant. The computational complexity of SVDimpute is $O(n^2 m i)$, where $i$ is the number of iterations performed before the threshold value is reached. SVDimpute is useful for time series data with low noise level. However, the SVD based imputation has several weaknesses since the solution relies on all genes and experiments in the data set and does not consider local structure or cluster structure. For non-time series data, a clear expression pattern may not exist. For noisy data, expression patterns of the genes with smaller expression values may not be represented well by the dominant eigengenes.
Recently, Bayesian principal component analysis (BPCA), which simultaneously estimates a probabilistic model and latent variables within the framework of Bayesian inference, has been successfully applied to missing value estimation problems (Oba et al., 2002; Oba et al., 2003). A fixed rank approximation algorithm (FRAA) (Friedland et al., 2003) using the singular value decomposition has been proposed. However, FRAA could not outperform KNNimpute even though it is more accurate than replacing missing values with 0's or with row means.
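The KNNimpute procedure summarized above can be illustrated with a minimal sketch. It assumes the missing value sits in the first position of the target gene, that all candidate genes are complete, and that similarity is turned into a weight as inverse Euclidean distance; that weighting, and the names `knn_impute_first` and `eps`, are illustrative assumptions rather than details taken from Troyanskaya et al.

```python
import math


def knn_impute_first(target, candidates, k, eps=1e-12):
    """Estimate target[0] as a weighted average of the first values of the
    k candidate genes closest to the target (hypothetical helper)."""

    def distance(gene):
        # Euclidean distance over positions 1..n-1; position 0 is missing.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(gene[1:], target[1:])))

    nearest = sorted(candidates, key=distance)[:k]
    # Assumed weighting: inverse distance, so closer genes contribute more.
    weights = [1.0 / (distance(gene) + eps) for gene in nearest]
    weighted_sum = sum(wt * gene[0] for wt, gene in zip(weights, nearest))
    return weighted_sum / sum(weights)


target = [None, 1.0, 2.0]
candidates = [[10.0, 1.0, 3.0], [20.0, 1.0, 4.0], [99.0, 9.0, 9.0]]
estimate = knn_impute_first(target, candidates, k=2)
```

With `k=2` the distant third gene is ignored, and the closer of the two remaining genes receives twice the weight of the other.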
In this paper, we introduce least squares based imputation methods and methods that exploit cluster structures and the local relationships among genes in the gene expression data for estimating missing values. We investigate all proposed imputation methods on four different data sets and compare them with KNNimpute and an estimation method based on BPCA.
Materials and Methods
Least Squares Based Imputation Algorithms
In this section, several least squares based imputation methods are proposed, where a target gene that has missing values is represented as a linear combination of similar genes. A target experiment that has a missing value can be represented as a linear combination of the other experiments as well. One of two approaches is used to estimate missing values, depending on the relative sizes of the number of selected similar genes and the number of available experiments. Because only the genes with high similarity to the target gene are used, rather than all available genes in the data, our methods are described as local least squares algorithms. Similarity measures based on the $L_2$-norm as well as the Pearson correlation coefficient are investigated.
Local Least Squares Imputation (LLSimpute)
In our Local Least Squares imputation (LLSimpute) method, to recover a missing value $\alpha$ in the first location $g_1(1)$ of the target gene $g_1$, the $k$ nearest neighbor gene vectors for $g_1$, $g_{s_i}^T \in \mathbb{R}^{1 \times n}$, $1 \le i \le k$, are first found, each of which does not contain any missing value in its first location or in the rest of the vector. The value of $k$ varies for each specific case of the missing value estimation since it depends on the missing values in the rest of the gene vectors. Then, based on these $k$ neighboring gene vectors, the matrix $A \in \mathbb{R}^{k \times (n-1)}$ and the two vectors $b \in \mathbb{R}^{k \times 1}$ and $w \in \mathbb{R}^{(n-1) \times 1}$ are formed. The rows of the matrix $A$ consist of the $k$ nearest neighbor genes $g_{s_i}^T$, $1 \le i \le k$, with their first values deleted; the elements of the vector $b$ consist of the first values of the vectors $g_{s_i}^T$; and the elements of the vector $w$ are the $n-1$ elements of the gene vector $g_1$ whose missing first item is deleted. After the matrix $A$ and the vectors $b$ and $w$ are formed, one of the following two least squares problems is solved depending on the relative sizes of $k$ and $n-1$.
When $k < n-1$, we can solve a least squares problem:
$$\min_x \| A^T x - w \|_2. \tag{1}$$
Then, the missing value $\alpha$ is estimated as a linear combination of the first values of the genes, i.e. $\alpha = b^T x$.
In this case, as $k$ approaches $n-1$, the genes less similar to the target gene become involved in the solution and the performance may become worse. A more detailed discussion is presented in the Results and Discussion section.
When $k \ge n-1$, the least squares problem to solve is
$$\min_y \| A y - b \|_2. \tag{2}$$
Then, the missing value $\alpha$ is estimated as a linear combination of the values of the experiments, except the first experiment, in the target gene, i.e. $\alpha = w^T y$.
In this case, using more data points, i.e. genes, is generally helpful for better estimation of the missing value. To incorporate weights of similarity for the $k$-nearest neighbors into Eqn. (2), a weighted least squares problem
$$\min_y \| W (A y - b) \|_2 \tag{3}$$
can be solved, where $W$ is a diagonal matrix of weights and the $i$th diagonal element, $W_{ii}$ ($1 \le i \le k$), represents the similarity of the $i$th gene to the target gene $g_1$.
In actual DNA microarray data, missing values occur in various locations. For estimating each missing element in the $l$th position of the $i$th gene, we need to build the matrix $A$ and the vectors $b$ and $w$ and solve the corresponding least squares problem according to its conditions. When building the matrix for estimating each missing value, all of the previously estimated entries are used as well in order to estimate the current missing value.
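The two local least squares cases can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the missing value is in the first position, candidate genes are complete, neighbors are ranked by Euclidean distance, the first case is taken to apply when the number of neighbors is smaller than the number of remaining experiments, and the least squares problems are solved through the normal equations (which presumes full column rank and is not necessarily how the authors solved them). The helper names are hypothetical.

```python
def solve_normal_equations(M, v):
    """Least squares solution of M z ~ v via (M^T M) z = M^T v.
    Assumes M has full column rank (a simplification for this sketch)."""
    rows, cols = len(M), len(M[0])
    # Augmented system [M^T M | M^T v].
    aug = [
        [sum(M[r][i] * M[r][j] for r in range(rows)) for j in range(cols)]
        + [sum(M[r][i] * v[r] for r in range(rows))]
        for i in range(cols)
    ]
    # Gaussian elimination with partial pivoting.
    for c in range(cols):
        pivot = max(range(c, cols), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, cols):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, cols + 1):
                aug[r][j] -= factor * aug[c][j]
    # Back substitution.
    z = [0.0] * cols
    for i in reversed(range(cols)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, cols))
        z[i] = (aug[i][cols] - tail) / aug[i][i]
    return z


def lls_impute_first(target, candidates, k):
    """Estimate target[0] from the k complete candidate genes nearest to it."""
    w = target[1:]  # target gene without its missing first entry
    ranked = sorted(
        candidates, key=lambda g: sum((a - c) ** 2 for a, c in zip(g[1:], w))
    )[:k]
    A = [g[1:] for g in ranked]  # k x (n-1): neighbors without first values
    b = [g[0] for g in ranked]   # first values of the neighbors
    if k < len(w):
        # Genes as regressors: min ||A^T x - w||, estimate = b^T x.
        A_t = [list(col) for col in zip(*A)]
        x = solve_normal_equations(A_t, w)
        return sum(bi * xi for bi, xi in zip(b, x))
    # Experiments as regressors: min ||A y - b||, estimate = w^T y.
    y = solve_normal_equations(A, b)
    return sum(wi * yi for wi, yi in zip(w, y))


few_genes = lls_impute_first(
    [None, 4.0, 1.0, 5.0],
    [[1.0, 2.0, 0.0, 1.0], [2.0, 0.0, 1.0, 3.0], [9.0, 50.0, 50.0, 50.0]],
    k=2,
)
many_genes = lls_impute_first(
    [None, 2.0, 3.0],
    [[1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [3.0, 1.0, 1.0]],
    k=3,
)
```

Both toy inputs are constructed so that the linear relationship holds exactly, which makes the recovered values easy to check by hand; real expression data would leave a nonzero residual.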
Constrained Least Squares Imputation (CLSimpute)
In KNNimpute, first, the $k$ nearest neighbors to the gene $g_1$ with a missing value are found as described earlier. Then a missing value is computed using the weighted average, where the weights are based on the similarity values between the neighbors and $g_1$, assigning higher weights to the genes closer to $g_1$. Therefore, the choice of weights plays a critical role.
Our constrained least squares imputation (CLSimpute) algorithm is formulated to obtain the optimal weights of the $k$-nearest neighbors by least squares methods, where the values of the weights are computed as a part of the solution. As in the LLSimpute method, the problem is divided into two cases depending on the relative sizes of the number of neighboring genes and the number of experiments. Using the matrix $A$ and the vectors $b$ and $w$ defined as in the LLSimpute method, when, the following constrained least squares problem is solved
for
Then, the missing value is calculated by
where each component of the vector $x$ is the optimized weight of the corresponding gene by the least squares approach. When, as in LLSimpute, the following problem is solved
Then, the missing value is calculated by
The constrained least squares problem was solved by quadratic programming, which uses an active set method (Gill et al., 1981).
As in LLSimpute, for each missing element in the DNA microarray expression data, a matrix $A$ and vectors $b$ and $w$ are formed and the corresponding constrained least squares problem needs to be solved.
Least Squares Imputation with the Pearson Correlation Coefficient (LSPimpute)
In this subsection, another method based on the least squares imputation is introduced that takes advantage of the similar genes based on the Pearson correlation coefficient (Pearson, 1894). When there is a missing value in the first location of the target gene $g_1$, the Pearson correlation coefficient between two vectors $g_1' = (g_{12}, \ldots, g_{1n})^T$ and $g_j' = (g_{j2}, \ldots, g_{jn})^T$ is defined as
$$r_{1j} = \frac{1}{n-1} \sum_{p=2}^{n} \left( \frac{g_{1p} - \bar{g}_1}{\sigma_1} \right) \left( \frac{g_{jp} - \bar{g}_j}{\sigma_j} \right) \tag{4}$$
where $\bar{g}_j$ is the average of values in $g_j'$ and $\sigma_j$ is the standard deviation of these values. The components of $g_j'$ that contain missing values are not considered in computing the coefficients. We used the absolute values of the Pearson correlation coefficients since the highly correlated but opposite signed components of the genes, i.e. $r_{1j} \approx -1$, are also helpful in estimating missing values.
In the least squares imputation with the Pearson correlation coefficient, missing values in the target genes are estimated by local least squares, where highly correlated genes in the microarray data are selected based on the Pearson correlation coefficients. First, all Pearson correlation coefficients between the target gene $g_1$ and the other genes are computed. After sorting the absolute values of the correlation coefficients, the genes whose absolute values of correlation coefficients with $g_1$ are greater than a threshold value are selected. Let denote the number of genes satisfying the criterion. For example, is the number of genes for which the absolute values of correlation coefficients are greater than 0.95. Let define the number of genes that do not have any missing values. If there are sufficient numbers of very highly correlated genes (e.g.), it is reasonable to estimate the missing values by these genes. Therefore, the two cases are divided according to as well as for building the corresponding least squares problem.
When or is greater than or equal to a threshold number of genes, i.e. (), we used the following least squares problem:
$$\min_x \| A^T x - w \|_2 \tag{5}$$
where the rows of the matrix $A$ are the genes highly correlated to the target gene, which do not contain a missing value. After building $b$ of the first column vector, the missing value can be obtained by $\alpha = b^T x$. The missing value is estimated by a linear combination of the first values of the coherent genes. The threshold number of genes is determined based on our numerical experiments.
When and, we used the following least squares problem:
$$\min_y \| A y - b \|_2 \tag{6}$$
where $A$ does not contain a missing value and $b$ is taken from the first column vector. The number of similar genes (), i.e. the number of rows of the matrix $A$, is determined by numerical experiments. We describe the selection of the threshold values (,) in the Results and Discussion section. The missing value can be calculated by $\alpha = w^T y$.
This approach estimates a missing value by a linear combination of values of experiments excluding the first experiment in the target gene. The coefficients of the linear combination are determined by the above least squares problem using the genes that exhibit similar behavior to the target gene with respect to the experimental conditions. For solving a least squares problem, the truncated SVD can be used to achieve noise reduction in estimating missing values. LSPimpute/SVD is a variation of LSPimpute using this noise reduction scheme, where a specific reduced rank has to be determined. More discussion on this follows in the Results and Discussion section.
For estimating each missing element of the $l$th position of the $i$th gene in the data matrix $G$, we need to build a matrix $A$ after computing gene correlation coefficients between $g_i$ and the other genes, and solve the corresponding least squares problem according to its conditions.
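The gene selection step of LSPimpute, ranking candidate genes by the absolute value of their Pearson correlation with the target and keeping those above a threshold, might be sketched as follows. The sketch assumes the missing value is in the first position, that candidate genes are complete and non-constant, and it uses the 0.95 threshold that the text mentions only as an example. The function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant (otherwise the denominator is zero)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def coherent_genes(target, candidates, threshold):
    """Return (|r|, index) pairs, largest |r| first, for the candidates whose
    absolute correlation with the target exceeds the threshold."""
    w = target[1:]  # skip the missing first position
    scored = [(abs(pearson(w, gene[1:])), idx) for idx, gene in enumerate(candidates)]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)


target = [None, 1.0, 2.0, 3.0, 4.0]
candidates = [
    [5.0, 2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the target
    [7.0, 4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated: kept via |r|
    [0.0, 1.0, 3.0, 2.0, 1.0],  # weakly correlated: rejected
]
selected = coherent_genes(target, candidates, threshold=0.95)
```

Taking the absolute value is what lets the anti-correlated second gene pass the filter; the selected rows would then feed the least squares step.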
Cluster Based Imputation Algorithms
In KNNimpute, the estimated value by weighted average can be perturbed by genes in other clusters when the number of nearest neighbors, $k$, is larger than the number of genes in the cluster that contains a target gene with a missing value. To avoid this problem, we developed imputation algorithms that estimate missing values by weighted average based on cluster structure as well as similarity. For simplicity of notation, we will assume that the matrix $G$ is partitioned into two submatrices, where one submatrix has at least one missing value in each row and the other has no missing value.
CLUSTERimpute
When the gene expression data has a cluster structure, CLUSTERimpute missing value estimation can be used. In CLUSTERimpute, the following steps are taken:
1. Cluster the matrix $G$ using k-means (Lloyd, 1982), EM or any other clustering algorithm.
2. For all genes that have missing values, ignoring the components with missing values,
(a) Determine the cluster to which the gene belongs.
(b) Select the genes in that cluster most similar to the gene.
(c) Estimate the missing values by the weighted average of the chosen genes. Each selected gene is weighted by its similarity to the target gene. For genes from, apply additional weights.
To measure gene similarity, we used the Euclidean distance since it was found that the log transformation seems to sufficiently reduce the effect of outliers in gene expression data (Troyanskaya et al., 2001). For our experiments, we tested both the k-means and EM algorithms to obtain cluster structure in the original input space. When we use the EM algorithm, a matrix in which all missing values are substituted by row averages is used as the input matrix in order not to lose rows that contain missing values in the clustering step.
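Steps (a) to (c) of CLUSTERimpute can be sketched as below, assuming a clustering has already been computed elsewhere and is supplied as per-gene labels plus centroids. Inverse Euclidean distance is used as the similarity weight, which is an assumption, and the additional weighting mentioned in step (c) is left out. `cluster_impute` and its parameters are hypothetical names.

```python
import math


def _dist(u, v):
    """Euclidean distance over positions where neither value is missing."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))


def cluster_impute(target, genes, labels, centroids, k, eps=1e-12):
    """Fill the None entries of target using complete genes from its own cluster."""
    # (a) Assign the target to the nearest centroid, ignoring missing components.
    cluster = min(range(len(centroids)), key=lambda j: _dist(target, centroids[j]))
    # (b) Keep the k most similar genes among the members of that cluster.
    members = [gene for gene, label in zip(genes, labels) if label == cluster]
    nearest = sorted(members, key=lambda gene: _dist(target, gene))[:k]
    # (c) Weighted average; inverse-distance weights are an assumption.
    weights = [1.0 / (_dist(target, gene) + eps) for gene in nearest]
    total = sum(weights)
    return [
        value if value is not None
        else sum(wt * gene[pos] for wt, gene in zip(weights, nearest)) / total
        for pos, value in enumerate(target)
    ]


genes = [[1.0, 1.0, 1.0], [3.0, 1.0, 3.0], [50.0, 50.0, 50.0], [52.0, 50.0, 52.0]]
labels = [0, 0, 1, 1]
centroids = [[2.0, 1.0, 2.0], [51.0, 50.0, 51.0]]
filled = cluster_impute([None, 1.0, 2.0], genes, labels, centroids, k=3)
```

Although `k=3` is requested, only the two genes in the target's cluster are eligible, so genes from the other cluster cannot perturb the estimate. The sketch would fail on an empty cluster, which a fuller implementation would need to guard against.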
DRimpute
It is well known that conventional clustering algorithms such as expectation maximization (EM) and k-means are often trapped in a local minimum when handling high dimensional data. It is possible to avoid this problem by applying clustering algorithms in a reduced dimensional space using cluster structure preserving dimension reduction methods (Ding et al., 2002). DRimpute takes advantage of this approach to estimate missing values for DNA microarrays. Given the number of clusters for the gene expression data, the missing value estimation with dimension reduction, DRimpute, works as follows.
1. Reduce the data dimension of $G$ by applying PCA or any other method.
2. Apply a clustering algorithm such as k-means or EM to obtain clusters in the reduced dimensional space. Use the cluster membership to construct cluster centroids in the original space. If the positions of the centroids in the original space have converged, go to Step 4.
3. Compute the dimension reducing transformation using a cluster structure preserving dimension reduction method such as Orthogonal Centroid (Park et al., 2003) or LDA/GSVD (Howland et al., 2003). Transform the data to the reduced dimensional space. Go to Step 2.
4. For all genes that have missing values,
(a) Determine the cluster to which the gene belongs.
(b) Select the genes in that cluster most similar to the gene.
(c) Estimate the missing values by a weighted average of the chosen genes. Each selected gene is weighted by its similarity to the target gene. For genes from, apply additional weights.
The cluster membership obtained by the clustering algorithm in the reduced dimensional space was used in order to construct cluster centroids in the original space. In Step 2, the cluster centroids in the original space have to be constructed from the cluster membership in the reduced dimensional space. The centroid vector for the $j$th cluster in the original space can be calculated by
$$c_j = \frac{1}{n_j} \sum_{i \in N_j} p(j \mid y_i) \, g_i \tag{7}$$
where $n_j$ is the number of genes in the $j$th cluster, $N_j$ is the set of indexes of the genes that belong to the $j$th cluster in the reduced dimensional space, and $g_i$ is an expression vector for the $i$th gene. $p(j \mid y_i)$ is the probability that the point $y_i$ belongs to cluster $j$, where $y_i$ is the $i$th point in the reduced space. If we use the k-means algorithm, the probability is always 1. For our experiments, we used both k-means and EM algorithms to obtain the cluster structure in the reduced space and the Orthogonal Centroid dimension reduction method for DRimpute.
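For the k-means case, where every membership probability is 1, the centroid of a cluster in the original space reduces to the plain mean of its member genes. The sketch below shows that special case only; it assumes the labels come from a clustering carried out in the reduced space, and `centroids_from_membership` is a hypothetical helper name.

```python
def centroids_from_membership(genes, labels, num_clusters):
    """Mean expression vector per cluster in the original space, using
    hard (k-means style) memberships computed elsewhere."""
    dims = len(genes[0])
    sums = [[0.0] * dims for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for gene, label in zip(genes, labels):
        counts[label] += 1
        for d, value in enumerate(gene):
            sums[label][d] += value
    if 0 in counts:
        raise ValueError("every cluster needs at least one member")
    return [[s / counts[j] for s in sums[j]] for j in range(num_clusters)]


genes = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]  # hypothetical memberships from the reduced space
centroids = centroids_from_membership(genes, labels, num_clusters=2)
```

Comparing these centroids between successive iterations is one way to implement the convergence check in Step 2; an EM variant would replace the unit weights with soft membership probabilities.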
SVDimpute Algorithms
In the SVDimpute method (Troyanskaya et al., 2001), all missing values are substituted by row averages or zeros and then the SVD of the complete matrix is computed in order to obtain the eigengenes. There are alternative methods. After removing all rows that contain a missing value, we can apply the SVD to a complete submatrix with no missing data (Alter et al., 2003). We also tried a hybrid method between SVDimpute and KNNimpute. After all missing values are initially estimated by KNNimpute, eigengenes can be obtained by the SVD. We refer to these methods as SVDimpute/RowAverage, SVDimpute/Remove, and SVDimpute/kNN, respectively.
Results and Discussion
Four microarray data sets obtained from the Stanford Microarray Database (SMD) have been used in our experiments. The first data set was obtained from the α-factor block release that was studied for identification of cell-cycle regulated genes in the yeast Saccharomyces cerevisiae (Spellman et al., 1998). We built a complete data matrix of 4304 genes and 18 experiments (SP.ALPHA) that does not have any missing value to assess missing value estimation methods. The second data set came from an elutriation data set (Spellman et al., 1998). We built a complete matrix of 4304 genes and 14 experiments (SP.ELU). The 4304 genes originally had no missing values in the α-factor block release set and the elutriation data set. The third data set was from 784 cell cycle regulated genes, which were classified by Spellman et al. (1998) into five classes, for the same 14 experiments as the second data set. After removing all gene rows that have missing values, we built the third data set of 474 genes and 14 experiments (SP.CYCLE). We also built a fourth data set that has a large number of experiments. This data set was from a study of response to environmental changes in yeast (Gasch et al., 2001). It contains 6361 genes and 156 experiments that have time series of specific treatments. We built a complete matrix of 2641 genes and 44 experiments after removing experimental columns that have more than 8% missing values and then selecting gene rows that do not have any missing value (GA.ENV).
SP.ALPHA and SP.ELU are the same data sets that were used in the study of BPCA (Oba et al., 2003). The SP.CYCLE data set was designed to test how much an imputation method can take advantage of strongly correlated genes in estimating missing values. The GA.ENV data set was prepared to test how efficiently an imputation method can handle a large number of experiments. The normalized root mean squared (RMS) difference between the imputed matrix and the original matrix was used to assess the accuracy of the methods.
Given an original expression data matrix from the SMD, we prepared an initial full matrix containing only the genes that have no missing values. If there is no missing value in the original matrix, the initial matrix is set to the original matrix. Given the initial expression data matrix, 5% of its data elements are randomly selected and regarded as missing values. To measure the performance of an algorithm, we computed the normalized root mean squared difference between the imputed matrix and the initial matrix.
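The evaluation protocol above (hide a random 5% of the entries of a complete matrix, impute, then score) can be sketched as follows. The text does not spell out how the RMS difference is normalized, so the sketch assumes normalization by the standard deviation of the true hidden values; that choice, like the helper names `mask_entries` and `normalized_rms_error`, is an assumption made for illustration.

```python
import math
import random


def mask_entries(matrix, fraction, seed=0):
    """Return a copy of matrix with a random fraction of its entries set to
    None, plus the list of hidden (row, column) positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def normalized_rms_error(original, imputed, hidden):
    """RMS difference over the hidden entries, divided by the standard
    deviation of the true hidden values (assumed normalization)."""
    truth = [original[i][j] for i, j in hidden]
    guess = [imputed[i][j] for i, j in hidden]
    mse = sum((g - t) ** 2 for g, t in zip(guess, truth)) / len(truth)
    mean = sum(truth) / len(truth)
    variance = sum((t - mean) ** 2 for t in truth) / len(truth)
    return math.sqrt(mse / variance)


original = [[float(10 * i + j) for j in range(10)] for i in range(10)]
masked, hidden = mask_entries(original, fraction=0.05)

# Baseline: fill every hidden entry with the mean of the true hidden values.
hidden_mean = sum(original[i][j] for i, j in hidden) / len(hidden)
baseline = [[hidden_mean if v is None else v for v in row] for row in masked]
baseline_error = normalized_rms_error(original, baseline, hidden)
```

Under this normalization a perfect imputation scores 0 and the constant-mean baseline scores 1, which gives a rough scale for reading reported error values.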
For LSPimpute, we set. If then is set to a lower number between and. When the number of similar genes is too small, it may not carry sufficient information about the relationships between experiments. When it is too large, genes that have small correlation with the target gene that has a missing value influence the result. Through several numerical experiments, we chose the value of. The optimal value (and) may vary with the corresponding data set, but the values we chose were good for the four different data sets. In LSPimpute/SVD, the reduced dimension in the truncated SVD is set to be.
CLUSTERimpute and DRimpute take advantage of the clusters for missing value estimation. KNNimpute also takes advantage of the local cluster structure by finding nearest neighbors and their weighted average based on the similarities. The effect of neighboring data points that are further away from the target gene, or that belong to clusters other than the one containing the target gene, is weakened by applying lower weights. Therefore, the choice of weights is important for KNNimpute. We have observed that the normalized RMS error without weighted averaging is much larger than that of KNNimpute in our experiments. While KNNimpute is relatively insensitive to the exact value of $k$ within the range of 10-20 neighbors (Figure 1), CLUSTERimpute also has a minimum normalized RMS error in a similar range.
In Figure 1, we compared normalized RMS errors of the missing value estimation methods discussed in this paper. For readability of the figures, we only show SVDimpute/kNN as a representative of SVD based imputation methods, since there was no significant performance difference among methods in the same category: SVDimpute/RowAverage and SVDimpute/kNN showed slightly better results than SVDimpute/Remove. The missing value estimation based on Bayesian principal component analysis (BPCA) showed good performance on the SP.ELU data set (see Figure 2). However, LLSimpute outperformed BPCA as well as KNNimpute for all four data sets when a large number of genes was involved. In Figure 3, overall, LLSimpute shows better performance as the number of genes increases for estimating missing values on the SP.CYCLE data set. The normalized RMS error values of LSPimpute and BPCA were 0.594 and 0.733, respectively. For the constrained least squares imputation (CLSimpute), we observed that its performance is better than LLSimpute when. However, it could not give us significantly better results since the least squares problem Eqn. (1) highly depends on the number of very similar genes and the size of the over-determined system is usually much smaller than that of Eqn. (2). Even though LSPimpute/SVD was designed to deal with noisy data sets by using the truncated SVD, we could not obtain any better result than LSPimpute for all four data sets. In Figure 4, the normalized RMS error values of LSPimpute and BPCA on the GA.ENV data set were 0.534 and 0.603, respectively. It shows that LLSimpute works well even if the number of experiments is large. LSPimpute also significantly outperformed LLSimpute on the GA.ENV data set even when LLSimpute took into account more than 400 similar genes. This is because LSPimpute takes advantage of coherent genes while LLSimpute uses Euclidean distance based on k-nearest neighbor genes.
We introduced several missing value estimation methods and their variations. We have successfully developed a least squares based method using the concept of coherent genes for the missing value estimation of DNA microarray expression data. Once the coherent genes are identified, missing values can be estimated by representing the target gene with missing values as a linear combination of the similar genes, or the target experiment that has missing values as a linear combination of related experiments. Our results show that the most successful general missing value estimation method is based on representing a target experiment that has a missing value as a linear combination of the other experiments.
Acknowledgements
The authors would like to thank the University of Minnesota Supercomputing Institute (MSI) for providing the computing facilities. We also thank Dr. Shigeyuki Oba for providing data sets and helpful discussions.
REFERENCES
Alter, O., Brown, P.O. and Botstein, D. (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression datasets of two different organisms. Proc. Natl Acad. Sci. USA, 100(6), 3351–3356.
Ding, C., He, X., Zha, H. and Simon, H.D. (2002) Adaptive dimension reduction for clustering high dimensional data. In Proc. of the 2nd IEEE Int’l Conf. Data Mining, Maebashi, Japan.
Friedland, S., Niknejad, A. and Chihara, L. (2003) A simultaneous reconstruction of missing data in DNA microarrays. Institute for Mathematics and its Applications Preprint Series, No. 1948.
Gasch, A.P., Huang, M., Metzner, S., Botstein, D., Elledge, S.J. and Brown, P.O. (2001) Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell, 12(10), 2987–3003.
Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization. Academic Press, London, UK.
Howland, P., Jeon, M. and Park, H. (2003) Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl., 25(1), 165–179.
Lloyd, S.P. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K. and Ishii, S. (2002) Missing value estimation using mixture of PCAs. In Proceedings of International Conference on Artificial Neural Networks (ICANN 2002), pp. 492–497. Springer, LNCS 2415.
Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K. and Ishii, S. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16), 2088–2096.
Park, H., Jeon, M. and Rosen, J.B. (2003) Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 42(2), 1–22.
Pearson, K. (1894) Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185, 71–110.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.