Bayes Optimal Hyperplanes → Maximal Margin Hyperplanes


Bayesian Algorithms in MATLAB


I. Introduction. With the progress of science and technology, research in fields such as artificial intelligence and data mining keeps deepening, and Bayesian algorithms, as methods based on probabilistic inference, are widely used in these fields.

MATLAB, as a powerful mathematical software package, makes it convenient to implement and apply Bayesian algorithms.

This article introduces the principles behind Bayesian algorithms and how to implement and apply them in MATLAB.

II. Principles of Bayesian Algorithms. 1. Bayes' theorem. Bayes' theorem is the foundation of Bayesian algorithms: it describes how, given one conditional probability, to compute the associated inverse conditional probability.

The mathematical form of Bayes' theorem is P(A|B) = P(B|A) * P(A) / P(B). 2. Probability fundamentals. The probability theory underlying Bayesian algorithms includes concepts such as probability distributions, conditional probability, and independence.

In practical problems, we compute probability distributions from the known conditions and from them derive the associated probability values.
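As a quick worked example (the numbers are invented for illustration): suppose P(A) = 0.01, P(B|A) = 0.9, and P(B|not A) = 0.1. Then

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A) = 0.9 * 0.01 + 0.1 * 0.99 = 0.108

P(A|B) = P(B|A) * P(A) / P(B) = 0.009 / 0.108 ≈ 0.083

so even a fairly reliable test only raises the probability of A from 1% to about 8.3% when A is rare.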

III. Implementing Bayesian Algorithms in MATLAB. 1. Bayesian networks. A Bayesian network is a graphical representation built on Bayes' theorem that helps us model complex problems.

In MATLAB, the Bayes Net Toolbox can be used to build Bayesian networks and perform inference in them.

2. Maximum likelihood estimation. Maximum likelihood estimation is a method for estimating the parameters of a probabilistic model.

In Bayesian algorithms, maximum likelihood estimation can be used to fit model parameters and thereby improve prediction accuracy.

In MATLAB, the maximum likelihood estimation functions in the Statistics Toolbox (such as mle) can be used for this computation.
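For readers outside MATLAB, here is a minimal sketch of the same idea in Python: fit the parameters of a Gaussian by numerically maximizing the log-likelihood (the data are synthetic; MATLAB's mle function plays a similar role):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Illustrative data: samples from a Gaussian whose parameters we pretend not to know.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=500)

# Maximum likelihood: choose the parameters that maximize the log-likelihood of the data.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the closed-form estimates data.mean(), data.std()
```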

3. Naive Bayes classifier. The naive Bayes classifier is a classification method based on Bayes' theorem; it assumes that the features are conditionally independent given the class.

In MATLAB, a naive Bayes classifier can be used for tasks such as text classification and fault diagnosis.
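To make the "naive" independence assumption concrete, here is a minimal from-scratch sketch of a Gaussian naive Bayes classifier in Python; the data are invented for illustration (MATLAB's Statistics Toolbox offers fitcnb for the same purpose):

```python
import numpy as np

# Toy training data: two classes, two continuous features (values invented for illustration).
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],    # class 0
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}                 # P(C_i)
means  = {c: X[y == c].mean(axis=0) for c in classes}          # per-feature means
stds   = {c: X[y == c].std(axis=0, ddof=1) for c in classes}   # per-feature stds

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def predict(x):
    # "Naive" assumption: features are conditionally independent given the class,
    # so the joint log-likelihood is the sum of per-feature log-densities.
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], stds[c]).sum()
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1.1, 2.0])))  # expected: 0
print(predict(np.array([3.1, 0.6])))  # expected: 1
```

The same model underlies the fault-diagnosis and text-classification examples below, with vibration features or word counts in place of these toy measurements.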

IV. Case Studies. 1. Fault diagnosis. Bayesian algorithms are widely used in fault diagnosis.

By building a fault-diagnosis model, equipment faults can be predicted and diagnosed.

For example, in MATLAB a naive Bayes classifier can be used to classify bearing-fault data.

2. Text classification. Bayesian algorithms also achieve high accuracy in text classification.

By building a Bayesian network model, documents can be classified automatically.

For example, in MATLAB a naive Bayes classifier can be used to classify a news dataset.

Bayesian Regularization Algorithms


The Bayesian regularization algorithm is a machine learning algorithm based on the Bayesian probabilistic framework: a statistical learning method built on top of a Bayesian probability model.

It combines traditional machine learning methods (such as linear regression and support vector machines) with Bayesian theory, applying a Bayesian probability model to machine learning in order to improve accuracy and efficiency.

This article reviews the basic principles and advantages of Bayesian regularization and how it is used in machine learning.

I. Basic principles. Bayesian regularization is a machine learning algorithm based on a Bayesian probability model.

The Bayesian probability model assumes that the data-generating process can be described by a probability distribution and uses Bayes' rule to infer the latent structure of the data.

In Bayesian regularization, the parameter estimate is obtained by maximum a posteriori (MAP) estimation; that is, the objective is to maximize the posterior probability of the parameters.

The core idea of Bayesian regularization is that the estimate of the unknown parameters should be the maximizer of their posterior distribution.

In the Bayesian regularization described here, the prior over the parameters is a Laplace distribution, a relatively simple distribution that captures uncertainty about the parameters and thereby reduces overfitting of the machine learning model.
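As a minimal sketch of this MAP view (assuming a linear-Gaussian likelihood, under which a Laplace prior makes the MAP estimate equivalent to L1-regularized least squares; the data and regularization strength below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso  # solves the L1-regularized least-squares problem

rng = np.random.default_rng(0)

# Synthetic data: only 3 of 20 true coefficients are non-zero, plus Gaussian noise.
n, d = 100, 20
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Under a Gaussian likelihood with noise variance sigma^2 and an independent
# Laplace(0, b) prior on each coefficient, the MAP estimate minimizes
#   ||y - X w||^2 / (2 sigma^2) + ||w||_1 / b,
# i.e. lasso regression with a regularization strength proportional to sigma^2 / b.
map_estimate = Lasso(alpha=0.05).fit(X, y)

print(np.round(map_estimate.coef_, 2))  # most coefficients shrink to exactly zero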

II. Advantages. Bayesian regularization has several advantages, the most important being that it can significantly improve the accuracy and efficiency of machine learning models.

In addition, it can increase the stability and interpretability of a model.

First, Bayesian regularization can noticeably improve the accuracy of a machine learning model.

By combining traditional machine learning methods (such as linear regression and support vector machines) with Bayesian theory, it can fit the data better and thereby improve accuracy.

Second, Bayesian regularization can improve the efficiency of a machine learning model.

By accounting for parameter uncertainty through the Laplace prior, it reduces the amount of data required and thus improves efficiency.

Third, Bayesian regularization can improve a model's stability.

Traditional machine learning models are often strongly affected by noise; Bayesian regularization can effectively reduce the influence of noise on the model and thereby improve its stability.

Finally, Bayesian regularization can make a model more interpretable.

It expresses the uncertainty in the model parameters, which makes the model easier to interpret.

III. Applications. Bayesian regularization can be used in many machine learning settings, such as linear regression, support vector machines, and neural networks.

Example code for using BayesianOptimization


BayesianOptimization implements Bayesian optimization, an algorithm that aims to find the global optimum of a function using relatively few function evaluations.

It uses a Gaussian process model to maintain a posterior distribution over the function and chooses the next sample point according to that posterior.

Below is example Python code that runs Bayesian optimization with the bayes_opt package:
```python
from bayes_opt import BayesianOptimization


def f(x, y):
    # Objective to maximize; its global maximum is 0, attained at (0, 0).
    return -(x**2 + y**2)


# Search ranges for each parameter.
pbounds = {'x': (-2, 2), 'y': (-2, 2)}

optimizer = BayesianOptimization(
    f=f,
    pbounds=pbounds,
    random_state=1,
)

optimizer.maximize(
    init_points=2,   # initial random evaluations
    n_iter=10,       # Bayesian optimization steps
)

print(optimizer.max)  # best parameters found and the corresponding target value
```
In the code above, `f` is the function to be optimized and `pbounds` gives the allowed range of each parameter.

The `BayesianOptimization` class takes these arguments and creates an optimizer object.

The `maximize` method searches for the best parameter values using the given number of initial random points (`init_points`) and optimization iterations (`n_iter`); the best result found is then available as `optimizer.max`.

The above is a simple example of Bayesian optimization; readers can modify and extend it to fit their own needs.
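To show what this kind of loop does under the hood, here is a hedged sketch of a single Bayesian-optimization step written by hand: fit a Gaussian-process surrogate to the points evaluated so far, then pick the next point by maximizing expected improvement. It uses scikit-learn and SciPy; the kernel choice and the random candidate-sampling scheme are illustrative assumptions, not necessarily what bayes_opt uses internally:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# The same toy objective as above; in practice this would be an expensive black box.
def f(x, y):
    return -(x**2 + y**2)

rng = np.random.default_rng(1)
X_obs = rng.uniform(-2, 2, size=(5, 2))             # a few initial random evaluations
y_obs = np.array([f(x, y) for x, y in X_obs])

# Fit a GP surrogate to the observations (a posterior over the objective).
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_obs, y_obs)

def expected_improvement(X_cand, best, xi=0.01):
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (mu - best - xi) / np.maximum(sigma, 1e-9)
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the next point to evaluate by maximizing EI over random candidates.
candidates = rng.uniform(-2, 2, size=(2000, 2))
next_point = candidates[np.argmax(expected_improvement(candidates, y_obs.max()))]
print(next_point)
```

The `maximize` call in the package wraps this loop (surrogate fit, acquisition maximization, evaluation) and repeats it `n_iter` times.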

Maximum-margin matrix factorization


Maximum-Margin Matrix FactorizationNathan Srebro Dept.of Computer Science University of Toronto Toronto,ON,CANADA nati@ Jason D.M.Rennie Tommi S.Jaakkola Computer Science and Artificial Intelligence Lab Massachusetts Institute of TechnologyCambridge,MA,USAjrennie,tommi@ AbstractWe present a novel approach to collaborative prediction,using low-norminstead of low-rank factorizations.The approach is inspired by,and hasstrong connections to,large-margin linear discrimination.We show howto learn low-norm factorizations by solving a semi-definite program,anddiscuss generalization error bounds for them.1IntroductionFitting a target matrix Y with a low-rank matrix X by minimizing the sum-squared error is a common approach to modeling tabulated data,and can be done explicitly in terms of the singular value decomposition of Y.It is often desirable,though,to minimize a different loss function:loss corresponding to a specific probabilistic model(where X are the mean parameters,as in pLSA[1],or the natural parameters[2]);or loss functions such as hinge loss appropriate for binary or discrete ordinal data.Loss functions other than squared-error yield non-convex optimization problems with multiple local minima.Even with a squared-error loss,when only some of the entries in Y are observed,as is the case for collaborative filtering,local minima arise and SVD techniques are no longer applicable[3].Low-rank approximations constrain the dimensionality of the factorization X=UV . Other constraints,such as sparsity and non-negativity[4],have also been suggested for better capturing the structure in Y,and also lead to non-convex optimization problems.In this paper we suggest regularizing the factorization by constraining the norm of U and V—constraints that arise naturally when matrix factorizations are viewed as feature learn-ing for large-margin linear prediction(Section2).Unlike low-rank factorizations,such constraints lead to convex optimization problems that can be formulated as semi-definite programs(Section4).Throughout the paper,we focus on using low-norm factorizations for“collaborative prediction”:predicting unobserved entries of a target matrix Y,based on a subset S of observed entries Y S.In Section5,we present generalization error bounds for collaborative prediction using low-norm factorizations.2Matrix Factorization as Feature LearningUsing a low-rank model for collaborative prediction[5,6,3]is straightforward:A low-rank matrix X is sought that minimizes a loss versus the observed entries Y S.Unobservedentries in Y are predicted according to X.Matrices of rank at most k are those that can be factored into X=UV ,U∈R n×k,V∈R m×k,and so seeking a low-rank matrix is equivalent to seeking a low-dimensional factorization.If one of the matrices,say U,isfixed,and only the other matrix V needs to be learned,then fitting each column of the target matrix Y is a separate linear prediction problem.Each row of U functions as a“feature vector”,and each column of V is a linear predictor,predicting the entries in the corresponding column of Y based on the“features”in U.In collaborative prediction,both U and V are unknown and need to be estimated.This can be thought of as learning feature vectors(rows in U)for each of the rows of Y,enabling good linear prediction across all of the prediction problems(columns of Y)concurrently, each with a different linear predictor(columns of V ).The features are learned without any external information or constraints which is impossible for a single prediction task(we would use the 
labels as features).The underlying assumption that enables us to do this in a collaborativefiltering situation is that the prediction tasks(columns of Y)are related,in that the same features can be used for all of them,though possibly in different ways.Low-rank collaborative prediction corresponds to regularizing by limiting the dimensional-ity of the feature space—each column is a linear prediction problem in a low-dimensional space.Instead,we suggest allowing an unbounded dimensionality for the feature space,and regularizing by requiring a low-norm factorization,while predicting with large-margin. Consider adding to the loss a penalty term which is the sum of squares of entries in U andV,i.e. U 2Fro + V 2Fro(Frodenotes the Frobenius norm).Each“conditional”problem(fitting U given V and vice versa)again decomposes into a collection of standard,this time regularized,linear prediction problems.With an appropriate loss function,or constraints on the observed entries,these correspond to large-margin linear discrimination problems. For example,if we learn a binary observation matrix by minimizing a hinge loss plus such a regularization term,each conditional problem decomposes into a collection of SVMs.3Maximum-Margin Matrix FactorizationsMatrices with a factorization X=UV ,where U and V have low Frobenius norm(recall that the dimensionality of U and V is no longer bounded!),can be characterized in several equivalent ways,and are known as low trace norm matrices:Definition1.The trace norm1 XΣis the sum of the singular values of X.Lemma1. XΣ=min X=UV UFroVFro=min X=UV 12( U 2Fro+ V 2Fro)The characterization in terms of the singular value decomposition allows us to characterize low trace norm matrices as the convex hull of bounded-norm rank-one matrices:Lemma2.{X| XΣ≤B}=convuv |u∈R n,v∈R m,|u|2=|v|2=BIn particular,the trace norm is a convex function,and the set of bounded trace norm ma-trices is a convex set.For convex loss functions,seeking a bounded trace norm matrix minimizing the loss versus some target matrix is a convex optimization problem.This contrasts sharply with minimizing loss over low-rank matrices—a non-convex prob-lem.Although the sum-squared error versus a fully observed target matrix can be min-imized efficiently using the SVD(despite the optimization problem being non-convex!), minimizing other loss functions,or even minimizing a squared loss versus a partially ob-served matrix,is a difficult optimization problem with multiple local minima[3].1Also known as the nuclear norm and the Ky-Fan n-norm.In fact,the trace norm has been suggested as a convex surrogate to the rank for various rank-minimization problems [7].Here,we justify the trace norm directly,both as a natural extension of large-margin methods and by providing generalization error bounds.To simplify presentation,we focus on binary labels,Y ∈{±1}n ×m .We consider hard-margin matrix factorization ,where we seek a minimum trace norm matrix X that matches the observed labels with a margin of one:Y ia X ia ≥1for all ia ∈S .We also consider soft-margin learning,where we minimize a trade-off between the trace norm of X and its hinge-loss relative to Y S :minimize X Σ+c ia ∈Smax(0,1−Y ia X ia ).(1)As in maximum-margin linear discrimination,there is an inverse dependence between the norm and the margin.Fixing the margin and minimizing the trace norm is equivalent to fixing the trace norm and maximizing the margin.As in large-margin discrimination with certain infinite dimensional (e.g.radial)kernels,the data is always separable 
with sufficiently high trace norm (a trace norm of n |S |is sufficient to attain a margin of one).The max-norm variant Instead of constraining the norms of rows in U and V on aver-age,we can constrain all rows of U and V to have small L 2norm,replacing the trace norm with X max =min X =UV (max i |U i |)(max a |V a |)where U i ,V a are rows of U,V .Low-max-norm discrimination has a clean geometric interpretation.First,note that predicting the target matrix with the signs of a rank-k matrix corresponds to mapping the “items”(columns)to points in R k ,and the “users”(rows)to homogeneous hyperplanes,such that each user’s hyperplane separates his positive items from his negative items.Hard-margin low-max-norm prediction corresponds to mapping the users and items to points and hy-perplanes in a high-dimensional unit sphere such that each user’s hyperplane separates his positive and negative items with a large-margin (the margin being the inverse of the max-norm).4Learning Maximum-Margin Matrix FactorizationsIn this section we investigate the optimization problem of learning a MMMF,i.e.a low norm factorization UV ,given a binary target matrix.Bounding the trace norm of UV by 12( U 2Fro + V 2Fro ),we can characterize the trace norm in terms of the trace of a positive semi-definite matrix:Lemma 3([7,Lemma 1]).For any X ∈R n ×m and t ∈R : X Σ≤t iff there existsA ∈R n ×n andB ∈R m ×m such that 2 A X X B0and tr A +tr B ≤2t .Proof.Note that for any matrix W , W Fro =tr W W .If A X X B 0,we can write it as a product [U V ][U V ].We have X =UV and 12( U 2Fro + V 2Fro )=12(tr A +tr B )≤t ,establishing X Σ≤t .Conversely,if X Σ≤t we can write it as X =UV with tr UU +tr V V ≤2t and consider the p.s.d.matrix UU XX V V .Lemma 3can be used in order to formulate minimizing the trace norm as a semi-definite optimization problem (SDP).Soft-margin matrix factorization (1),can be written as:min 12(tr A +tr B )+c ia ∈Sξia s.t. A X X B 0,y ia X ia ≥1−ξia ξia ≥0∀ia ∈S (2)2A 0denotes A is positive semi-definiteAssociating a dual variable Q ia with each constraint on X ia,the dual of(2)is[8,Section 5.4.2]:maxia∈S Q ia s.t.I(−Q⊗Y)(−Q⊗Y) I0,0≤Q ia≤c(3)where Q⊗Y denotes the sparse matrix(Q⊗Y)ia=Q ia Y ia for ia∈S and zeros elsewhere.The problem is strictly feasible,and there is no duality gap.The p.s.d.constraint in the dual(3)is equivalent to bounding the spectral norm of Q⊗Y,and the dual can also be written as an optimization problem subject to a bound on the spectral norm,i.e.a bound on the singular values of Q⊗Y:maxia∈S Q ia s.t.Q⊗Y2≤10≤Q ia≤c∀ia∈S(4)In typical collaborative prediction problems,we observe only a small fraction of the entries in a large target matrix.Such a situation translates to a sparse dual semi-definite program, with the number of variables equal to the number of observed rge-scale SDP solvers can take advantage of such sparsity.The prediction matrix X∗minimizing(1)is part of the primal optimal solution of(2),and can be extracted from it directly.Nevertheless,it is interesting to study how the optimal prediction matrix X∗can be directly recovered from a dual optimal solution Q∗alone. 
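As an illustrative sketch (not from the paper), the soft-margin objective (1), the trace norm of X plus a hinge loss over the observed entries, can also be handed directly to a modern convex modeling tool; the matrix sizes, observation mask S, and cost c below are invented for the example:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)

# Toy +/-1 target matrix and a random set S of observed entries.
n, m = 15, 12
Y = np.sign(rng.normal(size=(n, 2)) @ rng.normal(size=(2, m)))   # a rank-2 sign pattern
mask = rng.random((n, m)) < 0.5                                   # observed entries S
S = mask.astype(float)
c = 1.0

X = cp.Variable((n, m))
hinge = cp.pos(1 - cp.multiply(Y, X))                 # max(0, 1 - Y_ia * X_ia)
objective = cp.normNuc(X) + c * cp.sum(cp.multiply(S, hinge))   # trace norm + hinge loss on S
problem = cp.Problem(cp.Minimize(objective))
problem.solve()

# Predict unobserved entries with the sign of the learned low-norm matrix.
accuracy = np.mean(np.sign(X.value[~mask]) == Y[~mask])
print(f"held-out sign agreement: {accuracy:.2f}")
```

This relies on a generic conic solver rather than the sparse dual formulation the paper advocates, so it only scales to small matrices.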
Although unnecessary when relying on interior point methods used by most SDP solvers(as these return a primal/dual optimal pair),this can enable us to use specialized optimization methods,taking advantage of the simple structure of the dual.Recovering X∗from Q∗As for linear programming,recovering a primal optimal solu-tion directly from a dual optimal solution is not always possible for SDPs.However,at least for the hard-margin problem(no slack)this is possible,and we describe below how an optimal prediction matrix X∗can be recovered from a dual optimal solution Q∗by calculating a singular value decomposition and solving linear equations.Given a dual optimal Q∗,consider its singular value decomposition Q∗⊗Y=UΛV . Recall that all singular values of Q∗⊗Y are bounded by one,and consider only the columns ˜U∈R n×p of U and˜V∈R m×p of V with singular value one.It is possible to show[8,Section5.4.3],using complimentary slackness,that for some matrix R∈R p×p,X∗=˜URR ˜V is an optimal solution to the maximum margin matrix factorization problem(1).Furthermore,p(p+1)2is bounded above by the number of non-zero Q∗ia.When Q∗ia>0,and assuming hard-margin constraints,i.e.no box constraints in the dual,complimentary slackness dictates that X∗ia=˜U i RR ˜V a=Y ia,providing us with a linear equation onthe p(p+1)2entries in the symmetric RR .For hard-margin matrix factorization,we cantherefore recover the entries of RR by solving a system of linear equations,with a number of variables bounded by the number of observed entries.Recovering specific entries The approach described above requires solving a large sys-tem of linear equations(with as many variables as observations).Furthermore,especially when the observations are very sparse(only a small fraction of the entries in the target matrix are observed),the dual solution is much more compact then the prediction matrix: the dual involves a single number for each observed entry.It might be desirable to avoidstoring the prediction matrix X∗explicitly,and calculate a desired entry X∗i0a0,or at leastits sign,directly from the dual optimal solution Q∗.Consider adding the constraint X i0a0>0to the primal SDP(2).If there exists an optimalsolution X∗to the original SDP with X∗i0a0>0,then this is also an optimal solution tothe modified SDP,with the same objective value.Otherwise,the optimal solution of the modified SDP is not optimal for the original SDP,and the optimal value of the modified SDP is higher(worse)than the optimal value of the original SDP.Introducing the constraint X i0a0>0to the primal SDP(2)corresponds to introducing anew variable Q i0a0to the dual SDP(3),appearing in Q⊗Y(with Y ia0=1)but not in theobjective.In this modified dual,the optimal solution Q∗of the original dual would alwaysbe feasible.But,if X∗i0a0<0in all primal optimal solutions,then the modified primalSDP has a higher value,and so does the dual,and Q∗is no longer optimal for the new dual. 
By checking the optimality of Q∗for the modified dual,e.g.by attempting to re-optimizeit,we can recover the sign of X∗i0a0 .We can repeat this test once with Y i0a0=1and once with Y ia0=−1,correspondingto X i0a0<0.If Y ia0X∗ia0<0(in all optimal solutions),then the dual solution can beimproved by introducing Q i0a0with a sign of Y ia0.Predictions for new users So far,we assumed that learning is done on the known entries in all rows.It is commonly desirable to predict entries in a new partially observed row of Y(a new user),not included in the original training set.This essentially requires solving a“conditional”problem,where V is already known,and a new row of U is learned(the predictor for the new user)based on a new partially observed row of ing maximum-margin matrix factorization,this is a standard SVM problem.Max-norm MMMF as a SDP The max-norm variant can also be written as a SDP,with the primal and dual taking the forms:min t+cia∈S ξia s.t.A XX BA ii,B aa≤t∀i,ay ia X ia≥1−ξiaξia≥0∀ia∈S(5)maxia∈S Q ia s.t.Γ(−Q⊗Y)(−Q⊗Y) ∆Γ,∆are diagonaltrΓ+tr∆=10≤Q ia≤c∀ia∈S(6)5Generalization Error Bounds for Low Norm Matrix Factorizations Similarly to standard feature-based prediction approaches,collaborative prediction meth-ods can also be analyzed in terms of their generalization ability:How confidently can we predict entries of Y based on our error on the observed entries Y S?We present here gen-eralization error bounds that holds for any target matrix Y,and for a random subset of observations S,and bound the average error across all entries in terms of the observed margin error3.The central assumption,paralleling the i.i.d.source assumption for standard feature-based prediction,is that the observed subset S is picked uniformly at random. Theorem4.For all target matrices Y∈{±1}n×m and sample sizes|S|>n log n,and for a uniformly selected sample S of|S|entries in Y,with probability at least1−δover 3The bounds presented here are special cases of bounds for general loss functions that we present and prove elsewhere[8,Section6.2].To prove the bounds we bound the Rademacher complexity of bounded trace norm and bounded max-norm matrices(i.e.balls w.r.t.these norms).The unit trace norm ball is the convex hull of outer products of unit norm vectors.It is therefore enough to bound the Rademacher complexity of such outer products,which boils down to analyzing the spectral norm of random matrices.As a consequence of Grothendiek’s inequality,the unit max-norm ball is within a factor of two of the convex hull of outer products of sign vectors.The Rademacher complexity of such outer products can be bounded by considering their cardinality.the sample selection,the following holds for all matrices X ∈R n ×m and all γ>0:1nm |{ia |X ia Y ia ≤0}|<1|S ||{ia ∈S |X ia Y ia ≤γ}|+K X Σγ√nm4√ln m (n +m )ln n |S |+ ln(1+|log X Σ/γ|)|S |+ ln(4/δ)2|S |(7)and1nm |{ia |X ia Y ia ≤0}|<1|S ||{ia ∈S |X ia Y ia ≤γ}|+12 X max γ n +m |S |+ ln(1+|log X Σ/γ|)|S |+ ln(4/δ)2|S |(8)Where K is a universal constant that does not depend on Y ,n ,m ,γor any other quantity.To understand the scaling of these bounds,consider n ×m matrices X =UV where the norms of rows of U and V are bounded by r ,i.e.matrices with X max ≤r 2.The trace norm of such matrices is bounded by r 2/√nm ,and so the two bounds agree up to log-factors—the cost of allowing the norm to be low on-average but not uniformly.Recall that the conditional problem,where V is fixed and only U is learned,is a collection of low-norm (large-margin)linear prediction problems.When the norms 
of rows in U and V are bounded by r ,a similar generalization error bound on the conditional problem would include the term r2γ n |S |,matching the bounds of Theorem 4up to log-factors—learningboth U and V does not introduce significantly more error than learning just one of them.Also of interest is the comparison with bounds for low-rank matrices,for which X Σ≤√rank X X Fro .In particular,for n ×m rank-k X with entries bounded by B , X Σ≤√knmB ,and the second term in the right-hand side of (7)becomes:K B γ4√ln m k (n +m )ln n |S |(9)Although this is the best (up to log factors)that can be expected from scale-sensitive bounds 4,taking a combinatorial approach,the dependence on the magnitude of the entries in X (and the margin)can be avoided [9].6Implementation and ExperimentsRatings In many collaborative prediction tasks,the labels are not binary,but rather are discrete “ratings”in several ordered levels (e.g.one star through five stars).Separating R levels by thresholds −∞=θ0<θ1<···<θR =∞,and generalizing hard-margin constraints for binary labels,one can require θY ia +1≤X ia ≤θY ia +1−1.A soft-margin version of these constraints,with slack variables for the two constraints on each observed rating,corresponds to a generalization of the hinge loss which is a convex bound on the zero/one level-agreement error (ZOE)[10].To obtain a loss which is a convex bound on the mean-absolute-error (MAE—the difference,in levels,between the predicted level and the true level),we introduce R −1slack variables for each observed rating—one for each4For general loss functions,bounds as in Theorem 4depend only on the Lipschitz constant of the loss,and (9)is the best (up to log factors)that can be achieved without explicitly bounding the magnitude of the loss function.of the R−1constraints X ia≥θr for r<Y ia and X ia≤θr for r≥Y ia.Both of these soft-margin problems(“immediate-threshold”and“all-threshold”)can be formulated as SDPs similar to(2)-(3).Furthermore,it is straightforward to learn also the thresholds (they appear as variables in the primal,and correspond to constraints in the dual)—either a single set of thresholds for the entire matrix,or a separate threshold vector for each row of the matrix(each“user”).Doing the latter allows users to“use ratings differently”and alleviates the need to normalize the data.Experiments We conducted preliminary experiments on a subset of the100K MovieLens Dataset5,consisting of the100users and100movies with the most ratings.We used CSDP [11]to solve the resulting SDPs6.The ratings are on a discrete scale of one throughfive, and we experimented with both generalizations of the hinge loss above,allowing per-user thresholds.We compared against WLRA and K-Medians(described in[12])as“Baseline”learners.We randomly split the data into four sets.For each of the four possible test sets, we used the remaining sets to calculate a3-fold cross-validation(CV)error for each method (WLRA,K-medians,trace norm and max-norm MMMF with immediate-threshold and all-threshold hinge loss)using a range of parameters(rank for WLRA,number of centers for K-medians,slack cost for MMMF).For each of the four splits,we selected the two MMMF learners with lowest CV ZOE and MAE and the two Baseline learners with lowest CV ZOE and MAE,and measured their error on the held-out test data.Table1lists these CV and test errors,and the average test error across all four test sets.On average and on three of the four test sets,MMMF achieves lower MAE than the Baseline learners;on all four of the test sets,MMMF 
achieves lower ZOE than the Baseline learners.Test ZOE MAESet Method CV Test Method CV Test 1WLRA rank20.5470.575K-Medians K=20.6780.691 2WLRA rank20.5500.562K-Medians K=20.6860.681 3WLRA rank10.5620.543K-Medians K=20.7000.681 4WLRA rank20.5570.553K-Medians K=20.6850.696 Avg.0.5580.687 1max-norm C=0.00120.5430.562max-norm C=0.00120.6690.677 2trace norm C=0.240.5500.552max-norm C=0.00110.6750.683 3max-norm C=0.00120.5510.527max-norm C=0.00120.6680.646 4max-norm C=0.00120.5440.550max-norm C=0.00120.6670.686 Avg.0.5480.673 Table1:Baseline(top)and MMMF(bottom)methods and parameters that achieved the lowest cross validation error(on the training data)for each train/test split,and the error for this predictor on the test data.All listed MMMF learners use the“all-threshold”objective. 7DiscussionLearning maximum-margin matrix factorizations requires solving a sparse semi-definite program.We experimented with generic SDP solvers,and were able to learn with up to tens of thousands of labels.We propose that just as generic QP solvers do not perform well on SVM problems,special purpose techniques,taking advantage of the very simple structure of the dual(3),are necessary in order to solve large-scale MMMF problems. SDPs were recently suggested for a related,but different,problem:learning the features 5/Research/GroupLens/6Solving with immediate-threshold loss took about30minutes on a3.06GHz Intel Xeon. Solving with all-threshold loss took eight to nine hours.The MATLAB code is available at /˜nati/mmmf(or equivalently,kernel)that are best for a single prediction task[13].This task is hopeless if the features are completely unconstrained,as they are in our nckriet et al suggest constraining the allowed features,e.g.to a linear combination of a few“base fea-ture spaces”(or base kernels),which represent the external information necessary to solve a single prediction problem.It is possible to combine the two approaches,seeking con-strained features for multiple related prediction problems,as a way of combining external information(e.g.details of users and of items)and collaborative information.An alternate method for introducing external information into our formulation is by adding to U and/or V additionalfixed(non-learned)columns representing the external features. This method degenerates to standard SVM learning when Y is a vector rather than a matrix. An important limitation of the approach we have described,is that observed entries are assumed to be uniformly sampled.This is made explicit in the generalization error bounds. Such an assumption is typically unrealistic,as,e.g.,users tend to rate items they like.At an extreme,it is often desirable to make predictions based only on positive samples.Even in such situations,it is still possible to learn a low-norm factorization,by using appropriate loss functions,e.g.derived from probabilistic models incorporating the observation pro-cess.However,obtaining generalization error bounds in this case is much harder.Simply allowing an arbitrary sampling distribution and calculating the expected loss based on this distribution(which is not possible with the trace norm,but is possible with the max-norm [8])is not satisfying,as this would guarantee low error on items the user is likely to want anyway,but not on items we predict he would like.Acknowledgments We would like to thank Sam Roweis for pointing out[7]. 
References

[1] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177–196, 2001.
[2] M. Collins, S. Dasgupta, and R. Schapire. A generalization of principal component analysis to the exponential family. In Advances in Neural Information Processing Systems 14, 2002.
[3] Nathan Srebro and Tommi Jaakkola. Weighted low rank approximation. In 20th International Conference on Machine Learning, 2003.
[4] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
[5] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004.
[6] Benjamin Marlin. Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems, volume 16, 2004.
[7] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, 2001.
[8] Nathan Srebro. Learning with Matrix Factorization. PhD thesis, Massachusetts Institute of Technology, 2004.
[9] N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems 17, 2005.
[10] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, volume 15, 2003.
[11] B. Borchers. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1):613–623, 1999.
[12] B. Marlin. Collaborative filtering: A machine learning perspective. Master's thesis, University of Toronto, 2004.
[13] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

Support-Vector Networks


by adjusting the weights ai from the ith hidden unit to the output unit so as to minimize some error measure over the training data. As a result of Rosenblatt's approach, construction of decision rules was again associated with the construction of linear hyperplanes in some space. An algorithm that allows for all weights of the neural network to adapt in order locally to minimize the error on a set of vectors belonging to a pattern recognition problem was found in 1986 (Rumelhart, Hinton & Williams, 1986,1987; Parker, 1985; LeCun, 1985) when the back-propagation algorithm was discovered. The solution involves a slight modification of the mathematical model of neurons. Therefore, neural networks implement "piece-wise linear-type" decision functions. In this article we construct a new type of learning machine, the so-called support-vector network. The support-vector network implements the following idea: it maps the input vectors into some high dimensional feature space Z through some non-linear mapping chosen a priori. In this space a linear decision surface is constructed with special properties that ensure high generalization ability of the network.
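A minimal sketch of that idea (not from the paper): map concentric-circle data through an explicitly chosen non-linear feature map and fit a linear separator in the feature space. The data and the particular map z = (x1, x2, x1^2 + x2^2) are assumptions made for the illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Two classes on concentric circles: not linearly separable in the input space.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0) + 0.1 * rng.normal(size=200)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.where(np.arange(200) < 100, -1, 1)

# A non-linear mapping chosen a priori: z = (x1, x2, x1^2 + x2^2).
# In this feature space the two circles become separable by a linear decision surface.
Z = np.c_[X, (X ** 2).sum(axis=1)]

clf = LinearSVC(C=10.0).fit(Z, y)
print("training accuracy in feature space:", clf.score(Z, y))  # close to 1.0
```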

Maximum Margin Classifier_1


Support Vector Machines: Maximum Margin Classifier. By pluskid, 2010-09-08, in Machine Learning. This is the first article in the "Support Vector Machine" series; see the other articles in the series.

Support Vector Machine is abbreviated SVM.

When I first heard the name of this "machine", it felt mysterious: take a concrete action like "Support", glue it to an abstract concept like "Vector", and then turn the result into a "Machine", and it sounds profound! Only later did I learn that SVM is not actually a machine but an algorithm, or, more precisely, a family of algorithms. Of course, splitting hairs over words never ends; for instance, I call SVM a classifier, yet SVM is also used for regression.

So let us set the terminology aside and start with classifiers.

SVM has long been regarded as one of the best off-the-shelf classification algorithms (in fact, many people believe the "one of" can be dropped).

Here "off-the-shelf" really matters, because there has always been a gap between academia and industry, and even between theoreticians and practitioners within academia: some very fancy or very complex algorithms look perfect in their abstract models yet turn out to be fragile on real problems, performing poorly or failing completely.

SVM happens to be an exception that does well on both sides.

Since the story of SVM is a long one, that is enough preamble; let us get to the topic.

Of course, we cannot jump straight into SVM; we must start with linear classifiers.

Here we consider a two-class classification problem. A data point is denoted by $x$, an $n$-dimensional vector, and its class label by $y$, which takes the value +1 or -1, representing the two classes (some texts use 0 and 1; for a binary problem any two distinct numbers will do, but choosing +1 and -1 makes the SVM derivation convenient, as will become clear later).
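As a small illustration of this setup (data and parameters invented for the example), here is a linear SVM fit on two separable clouds, with classification given by the sign of the linear decision function f(x) = w^T x + b:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two linearly separable point clouds with labels +1 / -1.
X = np.r_[rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2)),
          rng.normal(loc=[+2, +2], scale=0.5, size=(50, 2))]
y = np.r_[-np.ones(50), np.ones(50)]

# A hard-margin-like linear SVM (large C): the learned hyperplane w^T x + b = 0
# is the maximal margin hyperplane for this data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Classification is just the sign of the linear decision function f(x) = w^T x + b.
x_new = np.array([1.5, 2.5])
print(int(np.sign(w @ x_new + b)))          # expected: +1
print("margin width:", 2 / np.linalg.norm(w))
```

For linearly separable data like this, the learned hyperplane is exactly the object this series goes on to derive: the separator that maximizes the margin 2/||w||.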

Convex Optimization Theory (Lecture Slides)


Solving convex optimization problems
• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n³, n²m, F}, where F is the cost of evaluating the fᵢ and their first and second derivatives
• almost a technology
New applications since 1990
• linear matrix inequality techniques in control
• circuit design via geometric programming
• support vector machine learning via quadratic programming (see the sketch below)
• semidefinite programming relaxations in combinatorial optimization
• applications in structural optimization, statistics, signal processing, communications, image processing, quantum information theory, finance, . . .
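To make the "SVM via quadratic programming" bullet concrete, here is a hedged sketch of the primal soft-margin SVM written directly as a QP in CVXPY; the dataset and the cost C are chosen arbitrarily for the example:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)

# Small two-class dataset.
X = np.r_[rng.normal(-1.5, 0.7, size=(30, 2)), rng.normal(1.5, 0.7, size=(30, 2))]
y = np.r_[-np.ones(30), np.ones(30)]
C = 1.0

# Primal soft-margin SVM as a quadratic program:
#   minimize (1/2)||w||^2 + C * sum(xi)
#   subject to y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0
w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(60, nonneg=True)
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
prob.solve()

print(w.value, b.value)
```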
Convex Optimization
Stephen Boyd (Stanford University)
Short Course, Harbin Institute of Technology July 13-18, 2012

Support-vector networks

1. Daytime phone: (973) 360-8670. E-mail: corinna@
2. Daytime phone: (732) 345-3342. E-mail: vlad@
[Figure: a two-layer perceptron. The input x is passed through dot-products with the weights of the 5 hidden units, producing the hidden-unit outputs z1, ..., z5; these are combined with the weights of the output unit, α1, ..., α5, to give the perceptron output]

I(x) = sign( Σ_i α_i z_i(x) )    (4)
by adjusting the weights α_i from the i-th hidden unit to the output unit so as to minimize some error measure over the training data. As a result of Rosenblatt's approach,
(Footnote: The optimal coefficient for … was found in the sixties [2].)
construction of decision rules was again associated with the construction of linear hyperplanes in some space. An algorithm that allows for all weights of the neural network to adapt in order locally to minimize the error on a set of vectors belonging to a pattern recognition problem was found in 1986 [12, 13, 10, 8] when the back-propagation algorithm was discovered. The solution involves a slight modification of the mathematical model of neurons. Therefore, neural networks implement "piece-wise linear-type" decision functions. In this article we construct a new type of learning machine, the so-called support-vector network. The support-vector network implements the following idea: it maps the input vectors into some high dimensional feature space Z through some non-linear mapping chosen a priori. In this space a linear decision surface is constructed with special properties that ensure high generalization ability of the network.
Bayes Optimal Hyperplanes → Maximal Margin Hyperplanes

Simon Tong (simon.tong@), Computer Science Department, Stanford University
Daphne Koller (koller@), Computer Science Department, Stanford University

Abstract

Maximal margin classifiers are a core technology in modern machine learning. They have strong theoretical justifications and have shown empirical successes. We provide an alternative justification for maximal margin hyperplane classifiers by relating them to Bayes optimal classifiers that use Parzen windows estimations with Gaussian kernels. For any value of the smoothing parameter (the width of the Gaussian kernels), the Bayes optimal classifier defines a density over the space of instances. We define the Bayes optimal hyperplane to be the hyperplane decision boundary that gives lowest probability of classification error relative to this density. We show that, for linearly separable data, as we reduce the smoothing parameter to zero, a hyperplane is the Bayes optimal hyperplane if and only if it is the maximal margin hyperplane. We also analyze the behavior of the Bayes optimal hyperplane for non-linearly-separable data, showing that it has a very natural form. We explore the idea of using the hyperplane that is optimal relative to a density with some small non-zero kernel width, and present some promising preliminary results.

1 Introduction

Maximal margin classifiers are a core technology in modern machine learning. They have strong theoretical justifications and have shown empirical successes. We provide an alternative justification for maximal margin hyperplane classifiers by relating them to Bayes optimal classifiers.¹

Our main result in this paper is to show that, for linearly separable data, as σ tends to zero, a hyperplane is the Bayes optimal hyperplane if and only if it is the maximal margin hyperplane.

Bayes optimal classifiers use density estimation to perform classification by estimating the class priors and class conditional densities and then classifying a sample as belonging to the most likely class according to the estimated densities. The Bayes optimal classifier is known to minimize the probability of misclassification relative to the estimated density. Most density representations tend to have a large number of parameters to be estimated. Thus, the learned density is typically quite sensitive to the training data, as is the associated decision boundary. In other words, the Bayes optimal classifier typically has high variance. As a consequence, when doing Bayes optimal classification in high-dimensional domains, one rarely uses anything but the simplest density estimator (e.g., the common use of the Naive Bayes classifier in text classification [11]).

We propose an alternative approach to dealing with the problem of variance in Bayes optimal classification in a spirit similar to that mentioned in [5; 9]. Rather than simplifying the density, we restrict the nature of the decision boundary used by our classifier. In other words, rather than using the classification hypothesis induced by the Bayes optimal classifier, we select a hypothesis from a restricted class; the hypothesis selected is the one that minimizes the probability of error relative to our learned density. We call this error the estimated Bayes error of the hypothesis. As we mentioned, the Bayes optimal classifier minimizes this error among all possible hypotheses; we choose the hypothesis that minimizes it within the restricted class. For example, we can restrict to hypotheses defined by hyperplane decision boundaries. We call the hyperplane that minimizes the estimated Bayes error with respect to a given density a Bayes optimal hyperplane.

In this paper, we investigate one particular instantiation of this approach, and show that it is equivalent to choosing a maximal margin hyperplane. Consider the problem of classifying vectors in ℝ^D into two classes C0 and C1. We estimate the class conditional densities using Parzen windows estimation with Gaussian kernels. For a given value σ, the density for each class Ci is defined as a mixture of Gaussian kernels of width σ, centered on the data points in class Ci. Different values for σ correspond to different choices along the bias-variance spectrum: smaller values (sharper peaks for the kernels) correspond to higher variance but lower bias estimates of the density. For a finite number of training instances, the choice of σ is often crucial for the accuracy of the Bayes optimal classifier. We can eliminate the bias induced by the smoothing effect of σ by making it arbitrarily close to zero. We prevent the variance of the classifier from growing unboundedly by restricting our hypotheses to the very limited class of hyperplanes. Thus, we choose as our hypothesis the Bayes optimal hyperplane relative to the estimated density induced by the data and σ.

¹ Cristianini et al. [4] also provide links between Bayesian classifiers and large margin hyperplanes. Their analysis is based on viewing the resulting posterior distribution as a hyperplane in a Hilbert space, and is quite different from ours.
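An illustrative sketch (not from the paper) of the classifier the authors start from: Parzen-window class-conditional density estimates with Gaussian kernels of width σ, with a sample assigned to the class whose estimated density (times its prior) is larger. The data, the value of σ, and the equal class priors are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^2.
X0 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(40, 2))   # class C0
X1 = rng.normal(loc=[+1.0, 0.0], scale=0.8, size=(40, 2))   # class C1
sigma = 0.5                                                  # smoothing parameter (kernel width)

def parzen_log_density(x, data, sigma):
    # Mixture of Gaussian kernels of width sigma centered on the data points.
    d = data.shape[1]
    sq = ((data - x) ** 2).sum(axis=1)
    log_kernels = -sq / (2 * sigma**2) - d / 2 * np.log(2 * np.pi * sigma**2)
    return np.logaddexp.reduce(log_kernels) - np.log(len(data))

def bayes_classify(x):
    # Equal class priors here; classify by the larger estimated class-conditional density.
    return 0 if parzen_log_density(x, X0, sigma) > parzen_log_density(x, X1, sigma) else 1

print(bayes_classify(np.array([-0.8, 0.1])))   # expected: 0
print(bayes_classify(np.array([+0.9, -0.2])))  # expected: 1
```

Shrinking sigma toward zero makes each class density collapse onto its training points, which is the regime in which the paper relates the Bayes optimal hyperplane to the maximal margin hyperplane.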