Kernel-Based Learning Algorithms (PPT courseware)


Kernel Functions in Machine Learning

Kernel Functions for Machine Learning Applications

1. Overview of kernel functions

In recent years, kernel methods have received major attention, particularly due to the increased popularity of Support Vector Machines. Kernel functions can be used in many applications because they provide a simple bridge from linearity to non-linearity for algorithms that can be expressed in terms of dot products. In this article, we list a few kernel functions and some of their properties.

Many of these functions have been incorporated in , an extension framework for the popular Framework which also includes many other statistics and machine learning tools.

2. Kernel functions in machine learning

Kernel Methods

Kernel methods are a class of algorithms for pattern analysis or recognition, whose best known element is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (such as clusters, rankings, principal components, correlations, classifications) in general types of data (such as sequences, text documents, sets of points, vectors, images, graphs, etc.) (Wikipedia, 2010a).

The main characteristic of kernel methods, however, is their distinct approach to this problem. Kernel methods map the data into higher-dimensional spaces in the hope that in this higher-dimensional space the data becomes more easily separated or better structured. There are no constraints on the form of this mapping, which could even lead to infinite-dimensional spaces. The mapping function, however, rarely needs to be computed explicitly because of a tool called the kernel trick.

The Kernel Trick

The kernel trick is a mathematical tool that can be applied to any algorithm that depends solely on the dot product between two vectors. Wherever a dot product is used, it is replaced by a kernel function. When properly applied, those candidate linear algorithms are transformed into non-linear algorithms (sometimes with little effort or reformulation). Those non-linear algorithms are equivalent to their linear originals operating in the range space of a feature map φ. However, because kernels are used, the φ function never needs to be computed explicitly. This is highly desirable, as noted previously, because the higher-dimensional feature space could even be infinite-dimensional and thus infeasible to compute. There are also no constraints on the nature of the input vectors. Dot products could be defined between any kind of structure, such as trees or strings.

Kernel Properties

Kernel functions must be continuous, symmetric, and most preferably should have a positive (semi-)definite Gram matrix. Kernels that satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem is convex and that the solution is unique.

However, many kernel functions that are not strictly positive definite have also been shown to perform very well in practice. An example is the Sigmoid kernel, which, despite its wide use, is not positive semi-definite for certain values of its parameters. Boughorbel (2005) also demonstrated experimentally that kernels which are only conditionally positive definite can outperform most classical kernels in some applications.

Kernels can also be classified as anisotropic stationary, isotropic stationary, compactly supported, locally stationary, nonstationary or separable nonstationary. Moreover, kernels can be labeled scale-invariant or scale-dependent, which is an interesting property because scale-invariant kernels make the training process invariant to a scaling of the data.

Choosing the Right Kernel

Choosing the most appropriate kernel depends heavily on the problem at hand, and fine-tuning its parameters can easily become a tedious and cumbersome task. Automatic kernel selection is possible and is discussed in the works by Tom Howley and Michael Madden.

The choice of a kernel depends on the problem at hand because it depends on what we are trying to model. A polynomial kernel, for example, allows us to model feature conjunctions up to the order of the polynomial. Radial basis functions allow us to pick out circles (or hyperspheres), in contrast with the Linear kernel, which allows us to pick out only lines (or hyperplanes).

The motivation behind the choice of a particular kernel can be very intuitive and straightforward, depending on what kind of information we expect to extract about the data. Please see the final notes on this topic from Introduction to Information Retrieval, by Manning, Raghavan and Schütze, for a better explanation of the subject.

Kernel Functions

Below is a list of some kernel functions available from the existing literature. As was the case with previous articles, the LaTeX notation for each formula below is available from its alternate-text HTML tag. I cannot guarantee that all of them are perfectly correct, so use them at your own risk. Most of them have links to the articles where they were originally used or proposed.

1. Linear Kernel
The Linear kernel is the simplest kernel function. It is given by the inner product ⟨x, y⟩ plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, e.g. KPCA with a linear kernel is the same as standard PCA.

2. Polynomial Kernel
The Polynomial kernel is a non-stationary kernel. Polynomial kernels are well suited for problems where all the training data is normalized. Adjustable parameters are the slope alpha, the constant term c and the polynomial degree d.

3. Gaussian Kernel
The Gaussian kernel is an example of a radial basis function kernel. The adjustable parameter sigma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand. If overestimated, the exponential will behave almost linearly and the higher-dimensional projection will start to lose its non-linear power. On the other hand, if underestimated, the function will lack regularization and the decision boundary will be highly sensitive to noise in the training data.

4. Exponential Kernel
The Exponential kernel is closely related to the Gaussian kernel, with only the square of the norm left out. It is also a radial basis function kernel.

5. Laplacian Kernel
The Laplacian kernel is completely equivalent to the Exponential kernel, except for being less sensitive to changes in the sigma parameter. Being equivalent, it is also a radial basis function kernel.
It is important to note that the observations made about the sigma parameter for the Gaussian kernel also apply to the Exponential and Laplacian kernels.

6. ANOVA Kernel
The ANOVA kernel is also a radial basis function kernel, just like the Gaussian and Laplacian kernels. It is said to perform well in multidimensional regression problems (Hofmann, 2008).

7. Hyperbolic Tangent (Sigmoid) Kernel
The Hyperbolic Tangent kernel is also known as the Sigmoid kernel and as the Multilayer Perceptron (MLP) kernel. The Sigmoid kernel comes from the neural networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons.
It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel was quite popular for support vector machines due to its origin in neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice.
There are two adjustable parameters in the sigmoid kernel, the slope alpha and the intercept constant c. A common value for alpha is 1/N, where N is the data dimension. A more detailed study of sigmoid kernels can be found in the works by Hsuan-Tien Lin and Chih-Jen Lin.

8. Rational Quadratic Kernel
The Rational Quadratic kernel is less computationally intensive than the Gaussian kernel and can be used as an alternative when using the Gaussian becomes too expensive.

9. Multiquadric Kernel
The Multiquadric kernel can be used in the same situations as the Rational Quadratic kernel. As is the case with the Sigmoid kernel, it is also an example of a non-positive definite kernel.

10. Inverse Multiquadric Kernel
As with the Gaussian kernel, the Inverse Multiquadric kernel results in a kernel matrix with full rank (Micchelli, 1986) and thus forms an infinite-dimensional feature space.

11. Circular Kernel
The Circular kernel is used in geostatistical applications. It is an example of an isotropic stationary kernel and is positive definite in R2.

12. Spherical Kernel
The Spherical kernel is similar to the Circular kernel, but is positive definite in R3.

13. Wave Kernel
The Wave kernel is also symmetric positive semi-definite (Huang, 2008).

14. Power Kernel
The Power kernel is also known as the (unrectified) triangular kernel. It is an example of a scale-invariant kernel (Sahbi and Fleuret, 2004) and is also only conditionally positive definite.

15. Log Kernel
The Log kernel seems to be particularly interesting for images, but is only conditionally positive definite.

16. Spline Kernel
The Spline kernel is given as a piecewise cubic polynomial, as derived in the works by Gunn (1998).

17. B-Spline (Radial Basis Function) Kernel
The B-Spline kernel is defined on the interval [−1, 1]. It is given by the recursive formula:
In the work by Bart Hamers it is given by:
Alternatively, Bn can be computed using the explicit expression (Fomel, 2000):
Where x+ is defined as the truncated power function:

18. Bessel Kernel
The Bessel kernel is well known in the theory of function spaces of fractional smoothness. It is given by:
where J is the Bessel function of the first kind. However, in the Kernlab for R documentation, the Bessel kernel is said to be:

19. Cauchy Kernel
The Cauchy kernel comes from the Cauchy distribution (Basak, 2008). It is a long-tailed kernel and can be used to give long-range influence and sensitivity over the high-dimensional space.

20. Chi-Square Kernel
The Chi-Square kernel comes from the Chi-Square distribution.

21. Histogram Intersection Kernel
The Histogram Intersection kernel is also known as the Min kernel and has proven useful in image classification.

22. Generalized Histogram Intersection
The Generalized Histogram Intersection kernel is built on the Histogram Intersection kernel for image classification but applies in a much larger variety of contexts (Boughorbel, 2005). It is given by:

23. Generalized T-Student Kernel
The Generalized T-Student kernel has been proven to be a Mercer kernel, thus having a positive semi-definite kernel matrix (Boughorbel, 2004). It is given by:

24. Bayesian Kernel
The Bayesian kernel could be given as:
However, it really depends on the problem being modeled. For more information, please see the work by Alashwal, Deris and Othman, in which they used an SVM with Bayesian kernels in the prediction of protein-protein interactions.

25. Wavelet Kernel
The Wavelet kernel (Zhang et al., 2004) comes from wavelet theory and is given as:
Where a and c are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification; please see the original paper for details). A translation-invariant version of this kernel can be given as:
Where in both h(x) denotes a mother wavelet function. In the paper by Li Zhang, Weida Zhou, and Licheng Jiao, the authors suggest a possible h(x) as:
Which they also prove to be an admissible kernel function.

See also
Kernel Support Vector Machines (kSVMs)
Principal Component Analysis (PCA)

3. References

On-Line Prediction Wiki Contributors. "Kernel Methods." On-Line Prediction Wiki. /?n=Main.KernelMethods (accessed March 3, 2010).
Genton, Marc G. "Classes of Kernels for Machine Learning: A Statistics Perspective." Journal of Machine Learning Research 2 (2001) 299-312.
Hofmann, T., B. Schölkopf, and A. J. Smola. "Kernel methods in machine learning." Ann. Statist. Volume 36, Number 3 (2008), 1171-1220.
Gunn, S. R. (1998, May). "Support vector machines for classification and regression." Technical report, Faculty of Engineering, Science and Mathematics School of Electronics and Computer Science.
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. "Kernlab – an R package for kernel Learning." (2004).
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. "Kernlab – an S4 package for kernel methods in R." J. Statistical Software, 11, 9 (2004).
Karatzoglou, A., Smola, A., Hornik, K. and Zeileis, A. "R: Kernel Functions." Documentation for package 'kernlab' version 0.9-5. /Rdoc/library/kernlab/html/dots.html (accessed March 3, 2010).
Howley, T. and Madden, M.G. "The genetic kernel support vector machine: Description and evaluation." Artificial Intelligence Review. Volume 24, Number 3 (2005), 379-395.
Shawkat Ali and Kate A. Smith. "Kernel Width Selection for SVM Classification: A Meta-Learning Approach." International Journal of Data Warehousing & Mining, 1(4), 78-97, October-December 2005.
Hsuan-Tien Lin and Chih-Jen Lin. "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods." Technical report, Department of Computer Science, National Taiwan University, 2003.
Boughorbel, S., Jean-Philippe Tarel, and Nozha Boujemaa. "Project-Imedia: Object Recognition." INRIA - INRIA Activity Reports - RalyX. http://ralyx.inria.fr/2004/Raweb/imedia/uid84.html (accessed March 3, 2010).
Huang, Lingkang. "Variable Selection in Multi-class Support Vector Machine and Applications in Genomic Data Analysis." PhD Thesis, 2008.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Nonlinear SVMs." The Stanford NLP (Natural Language Processing) Group. /IR-book/html/htmledition/nonlinear-svms-1.html (accessed March 3, 2010).
Fomel, Sergey. "Inverse B-spline interpolation." Stanford Exploration Project, 2000. /public/docs/sep105/sergey2/paper_html/node5.html (accessed March 3, 2010).
Basak, Jayanta. "A least square kernel machine with box constraints." International Conference on Pattern Recognition 2008 1 (2008): 1-4.
Alashwal, H., Safaai Deris, and Razib M. Othman. "A Bayesian Kernel for the Prediction of Protein - Protein Interactions." International Journal of Computational Intelligence 5, no. 2 (2009): 119-124.
Hichem Sahbi and François Fleuret. "Kernel methods and scale invariance using the triangular kernel." INRIA Research Report, N-5143, March 2004.
Sabri Boughorbel, Jean-Philippe Tarel, and Nozha Boujemaa. "Generalized histogram intersection kernel for image recognition." Proceedings of the 2005 Conference on Image Processing, volume 3, pages 161-164, 2005.
Micchelli, Charles. "Interpolation of scattered data: Distance matrices and conditionally positive definite functions." Constructive Approximation 2, no. 1 (1986): 11-22.
Wikipedia contributors, "Kernel methods," Wikipedia, The Free Encyclopedia, /w/index.php?title=Kernel_methods&oldid=340911970 (accessed March 3, 2010).
Wikipedia contributors, "Kernel trick," Wikipedia, The Free Encyclopedia, /w/index.php?title=Kernel_trick&oldid=269422477 (accessed March 3, 2010).
Weisstein, Eric W. "Positive Semidefinite Matrix." From MathWorld--A Wolfram Web Resource. /PositiveSemidefiniteMatrix.html
Hamers B. "Kernel Models for Large Scale Applications", Ph.D. , Katholieke Universiteit Leuven, Belgium, 2004.
Li Zhang, Weida Zhou, Licheng Jiao. "Wavelet Support Vector Machine." IEEE Transactions on System, Man, and Cybernetics, Part B, 2004, 34(1): 34-39.

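Top Ten Classic Big Data Algorithms: SVM (explanatory PPT)

Contents
• Introduction
• Basic principles of SVM
• Building and optimizing SVM models
• SVM in big data processing
• SVM implementation and programming practice
• SVM performance evaluation and improvement
• Summary and outlook
01 Introduction
Algorithm overview
SVM (Support Vector Machine) is a supervised learning model used for data classification and regression analysis.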
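Performance evaluation methods
01
Accuracy
The SVM is evaluated by computing its accuracy on the test set; the higher the accuracy, the better the model classifies.
02
Confusion matrix
Building a confusion matrix yields precision, recall, F1 score and related metrics, which give a more complete picture of SVM performance.
03
ROC curve and AUC
Plotting the ROC curve and computing the AUC evaluates the SVM across different decision thresholds.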
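The accuracy and confusion-matrix metrics listed above can be sketched in a few lines of plain Python. This is only an illustration for a binary task with labels +1 and -1; the function names and the toy label vectors are assumptions made for this example and do not come from the slides.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test-set labels and classifier outputs.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can look good on imbalanced data, which is why the slide pairs it with precision and recall.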
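The kernel function is a key component of SVM: it maps the data into a higher-dimensional space so that data that is not linearly separable in the original space can become linearly separable. Common kernels include the linear, polynomial and Gaussian kernels.
SVM performance depends strongly on its parameters, such as the penalty factor C and the kernel parameters. Cross-validation, grid search and similar methods can tune these parameters automatically and improve model performance.
SVM is widely used in text classification, image recognition, bioinformatics and other fields. Concrete case studies give a closer look at how SVM performs in practice.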
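SVM implementation steps
Data preparation
Prepare the training dataset, including feature extraction and label assignment.
Model selection
Choose a suitable SVM model, such as C-SVM, ν-SVM or One-class SVM.
Parameter setting
Set the SVM parameters, such as the penalty coefficient C, the kernel type and the kernel parameters.
Model training
Train the SVM on the prepared dataset to obtain the support vectors and the decision boundary.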

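Kernel Functions

Kernel Functions (2012-05-28 15:04:07)

Kernel function theory did not originate with support vector machines; it is simply a means of applying the support vector method to data that is not linearly separable. It is an old topic in mathematics: Mercer's theorem dates back to 1909, and the study of reproducing kernel Hilbert spaces (RKHS) began in the 1940s.

As early as 1964, Aizerman et al. introduced the technique into machine learning in their work on the potential function method, but its potential was fully exploited only in 1992, when Vapnik et al. used it to extend linear SVMs to nonlinear SVMs.

The kernel method uses a feature map to send data that is not linearly separable in the (low-dimensional) input space to data that is linearly separable in a high-dimensional feature space (a reproducing kernel Hilbert space), so that the SVM method can be applied in the feature space. Because the learning machine produced by SVM involves only inner products in the feature space, and those inner products can be expressed by a kernel function (a so-called Mercer kernel), the final learning machine can be written in terms of the kernel function. This is the kernel method.

A kernel function is, in essence, an inner product in a high-dimensional space, and is therefore tied to the feature map that generates that space.

The kernel method exploits this correspondence to use a nonlinear feature map implicitly (the map may of course also be linear).

The approach lets us use a high-dimensional space to make the data easier to handle, turning inseparable data into separable data, while avoiding the curse of dimensionality because the feature map never has to be written out explicitly. Let x, z ∈ X with X ⊆ R^n, and let the nonlinear function Φ map the input space X to the feature space F, where F ⊆ R^m and n ≪ m.

By the kernel technique:
K(x, z) = ⟨Φ(x), Φ(z)⟩ (1)
where ⟨·,·⟩ is the inner product and K(x, z) is the kernel function.

Equation (1) shows that the kernel function replaces the inner product in the m-dimensional feature space with a kernel evaluation in the n-dimensional input space. This sidesteps the "curse of dimensionality" that computing in the high-dimensional feature space would incur, and it lays the theoretical foundation for solving complex classification or regression problems in that space.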
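Equation (1) can be checked numerically for one concrete kernel. The sketch below assumes the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)^2, whose explicit feature map lists all ordered products x_i x_j; the kernel choice, function names and sample vectors are illustrative assumptions, not taken from the original post.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for (x.z)^2: all ordered products x_i * x_j.

    An input in R^n is sent to R^(n*n), so m = n^2 here.
    """
    return [xi * xj for xi in x for xj in x]


def poly2_kernel(x, z):
    """Kernel evaluation carried out entirely in the n-dimensional input space."""
    return dot(x, z) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

explicit = dot(phi(x), phi(z))   # inner product in the 9-dimensional feature space
implicit = poly2_kernel(x, z)    # same value from a 3-dimensional dot product
print(explicit, implicit)
```

The explicit route costs m = n^2 multiplications, while the kernel route costs n, which is the saving described above.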

Manifold Learning Algorithms Based on Kernels and Their Optimization

[Figure: error rate versus trainset size for PCA, KPCA and KOPCA (my method), two panels]
Dataset methods PCA KPCA KOPCA Label rate 0.005 0.01 0.05 0.1 0.2841 0.2727 0.2522 0.2472 0.2427 0.2336 0.04 0.0533 0.044 0.0405 0.0351 0.0343 0.4247 0.2603 0.2589 0.2536 0.2470 0.2445 0.4712 0.3942 0.3487 0.3443 0.3353 0.3284 0.2174 0.2609 0.2697 0.2432 0.2334 0.2051 wine Iris glass sonar soybean
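Replacing the inner product in the feature space with a kernel function k corresponds to mapping the data into some high-dimensional feature space. That feature space is defined by the kernel: choosing a kernel function also fixes the corresponding high-dimensional feature space. Every inner product in the feature space is computed implicitly through the kernel function in the original space. This idea lets us run ordinary linear algorithms in the feature space and thereby obtain algorithms that are nonlinear with respect to the original space. Doing so can make learning algorithms more efficient, improve existing algorithms, and raise recognition rates on pattern recognition tasks. Commonly used kernel functions that satisfy the Mercer condition:
A linear classifier can only handle linearly separable samples. If the samples provided are not linearly separable, a linear classifier cannot separate them, and this is where kernel functions come in. So what is a kernel function?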

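Kernel Functions

SVM summary

Theoretical foundations: machine learning has three basic types of problem, namely pattern recognition, function approximation and probability density estimation. SVM has a rigorous theoretical basis and provides a sound theoretical framework and general method for machine learning from a finite training sample.

It is closely tied to machine learning, and much of its theory has even resolved other problems in the field, so studying SVM and studying machine learning reinforce each other and help clarify the essence of machine learning theory.

VC dimension theory: for a set of indicator functions, if there exist h samples that the functions in the set can separate in all 2^h possible ways, the function set is said to shatter those h samples. The VC dimension of the function set is the largest number of samples it can shatter.

The VC dimension reflects the learning capacity of the function set: the larger the VC dimension, the more complex the learning machine (the greater its capacity).

Expected risk: it is defined as

$$R[f] = \int_{\mathcal{X} \times \mathcal{Y}} c(x, y, f(x)) \, dP(x, y)$$

where $c(x, y, f(x))$ is the loss function and $P(x, y)$ is the probability distribution. The expected risk can be understood intuitively as the "average" loss, or the "average" degree of error, incurred when $f(x)$ is used for prediction.

Empirical risk minimization (ERM) induction principle: the expected risk cannot be computed from samples alone, so traditional learning methods define the empirical risk $R_{emp}[f]$ on the samples as an estimate of the expected risk and design learning algorithms to minimize it.

This is the empirical risk minimization (ERM) induction principle.

The empirical risk is computed with the loss function.

For the loss function of a pattern recognition problem, the empirical risk is the training error rate; for the loss function of a function approximation problem, it is the squared training error; and for the loss function of a probability density estimation problem, the ERM principle is equivalent to maximum likelihood.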
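To make the empirical risk concrete, the sketch below averages a loss over a finite sample, once with the 0-1 loss (giving the training error rate) and once with the squared loss. The predictors, sample values and function names are hypothetical and chosen only for illustration.

```python
def zero_one_loss(y, y_hat):
    """Pattern recognition loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y == y_hat else 1.0


def squared_loss(y, y_hat):
    """Function approximation loss: squared prediction error."""
    return (y - y_hat) ** 2


def empirical_risk(f, samples, loss):
    """Average loss of predictor f over a finite list of (x, y) samples."""
    return sum(loss(y, f(x)) for x, y in samples) / len(samples)


# Hypothetical threshold classifier and labelled sample.
def classify(x):
    return 1 if x >= 0 else -1

labelled = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.5, -1)]
error_rate = empirical_risk(classify, labelled, zero_one_loss)

# Hypothetical regression function and sample.
def regress(x):
    return 2.0 * x

measured = [(1.0, 2.0), (2.0, 5.0), (3.0, 6.0)]
mean_squared_error = empirical_risk(regress, measured, squared_loss)

print(error_rate, mean_squared_error)
```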

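However, a minimal empirical risk does not necessarily mean a minimal expected risk.

In fact, the empirical risk can approach the expected risk only as the number of samples tends to infinity.

In many problems the number of samples is far from infinite, so with a finite sample the ERM principle does not necessarily keep the true risk small.

One example where the ERM principle fails is overfitting in neural networks and decision trees: in some cases an overly small training error reduces generalization ability, that is, it raises the prediction error rate and hence the true risk.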

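Kernel-Based Learning Algorithms

Kernel-based learning algorithms are machine learning algorithms for nonlinear classification and regression problems.

Traditional machine learning algorithms usually assume that the sample data is linearly separable or can be fit by a linear regression, but many real-world problems are nonlinear.

To handle such nonlinear problems, we can use a kernel function to map the original data into a high-dimensional feature space and then perform linear classification or regression in that space.

A kernel function computes the similarity between two vectors.

It measures their similarity by computing the inner product of the two vectors in the feature space.

Common kernel functions include the linear kernel, the polynomial kernel and the Gaussian kernel.

The support vector machine is a powerful classification algorithm.

It uses the kernel trick to map the input data into a high-dimensional feature space and then finds an optimal separating hyperplane in that space, one that maximizes the distance from the nearest sample points to the hyperplane.

By maximizing this margin, the support vector machine handles nonlinear classification problems better and generalizes well.

The kernel function of a support vector machine maps the sample data into a high-dimensional feature space so that a linear classifier can be applied to a nonlinear problem.

The linear kernel gives the same result as a traditional linear classification algorithm.

The polynomial kernel maps the data into a polynomial feature space and achieves nonlinear classification through combinations of polynomial features.

The Gaussian kernel maps the data into an infinite-dimensional feature space and achieves nonlinear classification through the similarity that the Gaussian kernel computes.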
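The three kernels named above can be written directly as functions of two input vectors. The forms below are the common textbook definitions; the parameter names (`degree`, `coef0`, `sigma`) and default values are assumptions made for this sketch rather than anything specified in the text.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    """k(x, z) = x . z"""
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """k(x, z) = (x . z + coef0) ** degree"""
    return (dot(x, z) + coef0) ** degree


def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


a = [1.0, 2.0]
b = [3.0, 4.0]
print(linear_kernel(a, b))
print(polynomial_kernel(a, b))
print(gaussian_kernel(a, b))
```

Each function takes only the original input vectors, so the high-dimensional (or infinite-dimensional) feature space is never constructed.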

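Kernel ridge regression is a nonlinear regression algorithm.

Like the support vector machine, kernel ridge regression uses a kernel function to map the input data into a high-dimensional feature space and then performs linear regression in that space.

By solving the ridge regression problem with least squares, kernel ridge regression handles nonlinear regression problems better.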
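A minimal sketch of kernel ridge regression follows, using the standard dual form: solve (K + λI)α = y and predict with f(x) = Σ α_i k(x_i, x). The Gaussian kernel, the value of λ, the small Gaussian-elimination solver and the sine-curve toy data are all assumptions chosen to keep the example self-contained; a practical implementation would normally rely on a numerical library.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_ridge(X, y, kernel, lam=1e-3):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_linear_system(K, y)


def predict(X_train, alpha, kernel, x):
    """f(x) = sum_i alpha_i * k(x_i, x)"""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))


# Toy data: 13 points of a sine curve on [-3, 3].
X = [[i / 2.0] for i in range(-6, 7)]
y = [math.sin(p[0]) for p in X]

alpha = fit_kernel_ridge(X, y, gaussian_kernel, lam=1e-3)
pred_mid = predict(X, alpha, gaussian_kernel, [0.25])
pred_zero = predict(X, alpha, gaussian_kernel, [0.0])
print(pred_mid, math.sin(0.25))
```

The regression is linear in the coefficients α even though the fitted curve is nonlinear in x, which mirrors the description above.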

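1. They can handle nonlinear problems: the kernel function maps the data into a high-dimensional feature space, where a nonlinear problem can be treated with linear classification or regression.

2. They generalize well: kernel-based learning algorithms such as the support vector machine classify by maximizing the margin, which tends to give good generalization and lowers the risk of overfitting.

3. They are simple and efficient: kernel-based learning algorithms usually have a simple model structure and well-understood solvers, although the n × n kernel matrix can become costly on very large datasets.

4. They make no explicit assumption about the data distribution and so apply to many types of data.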

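Basic Concepts of Kernel Functions in Machine Learning

(1.26)
Here $v_t = (v_{t1}, v_{t2}, \ldots, v_{tl})^T$ is the $t$-th eigenvector of the matrix $G$, and $\lambda_t$ is the corresponding eigenvalue. Because $G$ is positive semi-definite, all of its eigenvalues are non-negative. It then follows from (1.26) that

$$K(x_i, x_j) = \sum_{t=1}^{l} v_{ti} \lambda_t v_{tj} = \sum_{t=1}^{l} \lambda_t v_{ti} v_{tj}$$

$$G_{ij} = K(x_i, x_j), \quad i, j = 1, 2, \ldots, l \qquad (1.21)$$

then $G = (G_{ij})$ is called the Gram matrix of $K(x, z)$ with respect to $X$. The first question to study is: what condition must the Gram matrix $G$ satisfy for the function $K(\cdot, \cdot)$ to be a kernel function? Definition 1.2 (matrix operator). The matrix operator $G$ is defined on $R^l$: for $u = (u_1, u_2, \ldots, u_l)^T \in R^l$, the components of $Gu$ are determined by

The number of such ordered monomials $[x]_{j_1} [x]_{j_2} \cdots [x]_{j_d}$ is $n^d$, so the dimension of the polynomial space $H$ is $n_H = n^d$. If the inner product $C_d(x) \cdot C_d(z)$ is computed in $H$, then when $n$ and $d$ are not too small the dimension $n_H = n^d$ becomes very large. For example, with $n = 200$ and $d = 5$ the dimension is $200^5 = 3.2 \times 10^{11}$. Computing inner products directly in the polynomial space $H$ clearly runs into the "curse of dimensionality". How can this be handled? Consider first the case $n = d = 2$ and compute the inner product of two vectors in the polynomial space $H$

$$K(x_i, x_j) = \sum_{t=1}^{l} \lambda_t v_{ti} v_{tj}, \quad i, j = 1, 2, \ldots, l \qquad (1.25)$$

Proof: because $G$ is symmetric, there exist an orthogonal matrix $V = (v_1, v_2, \ldots, v_l)$ and a diagonal matrix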

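Kernel-Based Methods (Part 2)

Frontier Technology Forum
Kernel-based methods (B)
LUO Gong-liang
(Automation Research and Design Institute of Metallurgical Industry, Beijing 100071, China)
[Received] 2001-12-19
[About the author] Luo Gongliang (1941-), male, born in Guiyang, Guizhou; professor-level senior engineer, Ph.D.; works mainly on research and development in intelligent control and industrial automation.
[CLC number] TP183 [Document code] A [Article ID] 1000-7059(2002)04-0001-03

2 Principal component analysis
2.1 Classical principal component analysis [5]
Principal component analysis (PCA) is a classical statistical method. It analyzes the covariance structure of multivariate observations in order to find principal components that express the dependencies in the data concisely.

Specifically, a linear transformation turns the original n-dimensional observation vector into an equal number of new features, each of which is a linear combination of the original features. If these new features are mutually uncorrelated, the few most important ones, m of them with m ≪ n, that carry the main information in the original data are the principal components.

Principal component analysis is therefore a feature extraction method, and can also be regarded as a data compression (dimensionality reduction) method.

Let m orthonormal vectors {u_i ∈ R^n; i = 1, 2, …, m} be the columns of the matrix U ∈ R^(n×m) (m < n):
U = [u_1, u_2, …, u_m] (43)
U is an orthonormal basis, that is, u_i^T u_j = δ_ij (i, j = 1, 2, …, m). Denote the subspace it spans by Ω ≡ span(U); then P = U U^T is an orthogonal projection operator.

The projection of x onto the subspace Ω is
x̂ = P x = U U^T x (44)
The linear transformation U^T turns the n-dimensional observation vector x into a new m-dimensional feature vector:
y = U^T x (45)
where y = [y_1, y_2, …, y_m]^T.

Substituting (45) into (44) gives:
x̂ = U y (46)
Equation (46) shows that x̂ is the result of recovering (reconstructing) the original features from y.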
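Equations (44) to (46) can be traced with a small numeric example. The sketch below assumes the orthonormal basis U is already given (it is not computed from a covariance matrix here), with n = 3 and m = 2; the basis vectors, the observation x and the function names are illustrative assumptions.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(U_cols, x):
    """Equation (45): y = U^T x, one coordinate per basis vector."""
    return [dot(u, x) for u in U_cols]


def reconstruct(U_cols, y):
    """Equation (46): x_hat = U y, a combination of the basis vectors."""
    n = len(U_cols[0])
    return [sum(y[k] * U_cols[k][i] for k in range(len(U_cols)))
            for i in range(n)]


s = 1.0 / math.sqrt(2.0)
U_cols = [[1.0, 0.0, 0.0],   # u_1
          [0.0, s, s]]       # u_2, orthogonal to u_1 and of unit length

x = [3.0, 1.0, 3.0]
y = project(U_cols, x)          # 2-dimensional feature vector
x_hat = reconstruct(U_cols, y)  # projection of x onto span(U), equation (44)
print(y, x_hat)
```

Projecting `x_hat` again returns the same `y`, which reflects the fact that P = U U^T is a projection.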


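Structural risk minimization principle
Vapnik and Chervonenkis (1974) proposed SRM. The empirical risk minimization principle widely used in traditional machine learning methods is unreasonable when the number of samples is limited, so the empirical risk and the confidence interval must be minimized together. Statistical learning theory proposes a new strategy: construct the function set as a sequence of function subsets ordered by VC dimension, find the minimum empirical risk within each subset, and trade off empirical risk against the confidence interval across subsets to reach the minimum actual risk. This idea is called the Structural Risk Minimization Principle.
Outline: theoretical foundations; supervised learning: SVM, KFD; unsupervised learning: KPCA; model selection
Theoretical foundations
Machine learning, VC dimension, structural risk minimization principle
SLT (Statistical Learning Theory)
Statistical learning theory, which matured only in the mid-1990s, grew out of research on empirical risk and is a statistical theory aimed specifically at small samples.
Statistical learning theory provides a theoretical framework for three types of machine learning problem with finite samples (pattern recognition, function fitting and probability density estimation), and it also gave pattern recognition a new classification method: the support vector machine.
Machine learning
Machine learning is an important part of modern intelligent technology. It studies how to analyze an object and predict the future starting from observed samples.
The basic model of machine learning:
There is a fixed joint probability distribution function F(y, x), of unknown form, between the output y and the input x.
VC dimension
Vapnik and Chervonenkis (1968) introduced the concept of the VC dimension. VC dimension: for a set of indicator functions (functions that take only the values 0 and 1), if there exist h samples that the functions in the set can separate in all 2^h possible ways, the function set is said to shatter those h samples. The VC dimension of the function set is the largest number of samples it can shatter. The VC dimension is an important measure of the complexity, or learning capacity, of a function set or learning machine, and a series of important results on the consistency, convergence rate and generalization performance of statistical learning have been built on it.
Kernel functions
In linear classification problems the data appears in the form of dot products (xi · xj). In nonlinear classification problems, a nonlinear mapping is needed to map the input space into a high-dimensional feature space H, written Φ: x → Φ(x). When the optimal hyperplane is constructed in the feature space H, the training algorithm uses only dot products in that space, namely Φ(xi) · Φ(xj).
There exists a kernel function K such that: K(xi, xj) = Φ(xi) · Φ(xj)
The kernel function replaces the inner product in the m-dimensional feature space with a kernel evaluation in the n-dimensional input space, sidestepping the "curse of dimensionality" that computing in the high-dimensional feature space would incur.
The VC dimension of this linear classification function is therefore 3.
In general, the larger the VC dimension, the stronger the learning capacity, but also the more complex the learning machine.
There is not yet a general theory for computing the VC dimension of an arbitrary function set; the VC dimension is known exactly only for some special function sets.
Typical examples are SVM (support vector machine) and KFD (kernel-based Fisher discriminant analysis).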
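SVM (Support vector machines)
SVM is a machine learning method based on SLT. Put simply, it is an algorithm that represents the data items in a multidimensional space and then partitions that space.
SVM is built on the VC dimension theory and the structural risk minimization principle of statistical learning theory. Given limited sample information, it seeks the best trade-off between model complexity and learning ability in order to obtain the best generalization.
Risk minimization: formulating the machine learning problem
The variable y and the input x are related by some unknown dependency, namely the joint probability distribution F(x, y). Machine learning uses n independent and identically distributed observations (x1, y1), (x2, y2), ···, (xn, yn)
The learning machine has a function set {f(x, w)} that can estimate the dependency between input and output, where w is a generalized parameter.
to find, within the function set {f(x, w)}, an optimal function f(x, w0) that minimizes the expected risk of prediction R(w) = ∫ L(y, f(x, w)) dF(x, y).
L(y, f(x, w)) is the loss function, the loss incurred by predicting y with f(x, w); w is the generalized parameter of the function, so {f(x, w)} can represent any function set; F(x, y) is the joint distribution function.
The support vector machine method is built on statistical learning theory and targets machine learning with small samples. For classification, the method computes the decision surface of a region from the samples in that region, and that surface determines the class of the samples in the region.
Let the sample x be an m-dimensional vector, and suppose a region contains n samples:
(x1, y1), (x2, y2), …, (xn, yn)
where xi is a training tuple, xi ∈ R^m, and yi is the class label, yi ∈ {1, -1}.
If there exists a hyperplane:
ω·x + b = 0
(1)
where · denotes the dot product of vectors, as shown in Figure 1, and the hyperplane separates the n samples into two classes, then there exists an optimal hyperplane that not only separates the two classes correctly but also maximizes the distance from the two classes to the hyperplane. In equation (1), ω and b still satisfy the equation after being multiplied by a coefficient. After normalization, so that the minimum of |ω·xi + b| over all samples xi is 1, the minimum distance between the samples and the optimal hyperplane is |ω·xi + b| / ‖ω‖ = 1/‖ω‖, and the optimal hyperplane must satisfy the condition: yi(ω·xi + b) ≥ 1, i = 1, 2, …, n.
The kernel method consists of two parts, kernel function design and algorithm design, as shown in Figure 1. Its steps are: (1) collect and organize the samples and standardize them; (2) choose or construct a kernel function; (3) use the kernel function to transform the samples into a kernel matrix; (4) apply a linear algorithm to the kernel matrix in the feature space; (5) obtain a nonlinear model in the input space.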
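Steps (1) to (3) of the procedure above can be sketched as follows: standardize the samples, pick a kernel, and build the kernel (Gram) matrix that a linear algorithm would then operate on. The Gaussian kernel, the toy samples and the function names are assumptions for illustration; steps (4) and (5) depend on the chosen algorithm and are not shown.

```python
import math


def standardize(X):
    """Step 1: scale every feature to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]


def gaussian_kernel(x, z, sigma=1.0):
    """Step 2: the kernel chosen for this sketch."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(X, kernel):
    """Step 3: kernel matrix with entries G[i][j] = k(x_i, x_j)."""
    return [[kernel(a, b) for b in X] for a in X]


# Hypothetical samples whose two features live on very different scales.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
Z = standardize(samples)
G = gram_matrix(Z, gaussian_kernel)
for row in G:
    print(row)
```

Standardizing first keeps the second feature from dominating the distances inside the Gaussian kernel.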
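Kernel functions
There are three main classes of kernel function:
Polynomial kernel function
Radial basis function
Sigmoid function
Kernel-Based Learning Algorithms
Introduction
In recent years a number of kernel-based machine learning methods have appeared, for example SVM (support vector machine), KFD (kernel-based Fisher discriminant analysis) and KPCA (kernel principal component analysis). These methods are of practical value for classification, regression and unsupervised learning, and they have been applied successfully across pattern recognition, for example in object recognition, text classification and time series prediction.
Supervised learning
Supervised learning, most commonly encountered as classification, trains an optimal model from existing training samples (known data together with their corresponding outputs). The model belongs to some set of functions. It is then used to map every input to an output, and a simple decision on that output performs the classification, which gives the model the ability to classify unseen data.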