
© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

A Tutorial on Support Vector Machines for Pattern Recognition

CHRISTOPHER J.C. BURGES

Bell Laboratories,Lucent Technologies

Abstract. The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

Keywords: Support Vector Machines, Statistical Learning Theory, VC Dimension, Pattern Recognition

Appeared in: Data Mining and Knowledge Discovery 2, 121-167, 1998

1.Introduction

The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995; Vapnik, 1998) contain excellent descriptions of SVMs, but they leave room for an account whose purpose from the start is to teach. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time appears suitable for an introductory review. The tutorial dwells entirely on the pattern recognition problem. Many of the ideas there carry directly over to the cases of regression estimation and linear operator inversion, but space constraints precluded the exploration of these topics here.

The tutorial contains some new material. All of the proofs are my own versions, where I have placed a strong emphasis on their being both clear and self-contained, to make the material as accessible as possible. This was done at the expense of some elegance and generality: however generality is usually easily added once the basic ideas are clear. The longer proofs are collected in the Appendix.

By way of motivation, and to alert the reader to some of the literature, we summarize some recent applications and extensions of support vector machines. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition (Cortes and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1996; Burges and Schölkopf, 1997), object recognition (Blanz et al., 1996), speaker identification (Schmidt, 1996), charmed quark detection¹, face detection in images (Osuna, Freund and Girosi, 1997a), and text categorization (Joachims, 1997). For the regression estimation case, SVMs have been compared on benchmark time series prediction tests (Müller et al., 1997; Mukherjee, Osuna and Girosi, 1997), the Boston housing problem (Drucker et al., 1997), and (on artificial data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied. Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped) and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges and Vapnik, 1996; Schölkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al., 1998c). The reader may also find the thesis of (Schölkopf, 1997) helpful.

The problem which drove the initial development of SVMs occurs in several guises: the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992); but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).

In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices. Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed².

2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine

There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance³. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly, as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.

The notation here will largely follow that of (Vapnik, 1995). Suppose we are given l observations. Each observation consists of a pair: a vector x_i ∈ R^n, i = 1,...,l and the associated “truth” y_i, given to us by a trusted source. In the tree recognition problem, x_i might be a vector of pixel values (e.g. n = 256 for a 16x16 image), and y_i would be 1 if the image contains a tree, and -1 otherwise (we use -1 here rather than 0 to simplify subsequent formulae). Now it is assumed that there exists some unknown probability distribution P(x,y) from which these data are drawn, i.e., the data are assumed “iid” (independently drawn and identically distributed). (We will use P for cumulative probability distributions, and p for their densities). Note that this assumption is more general than associating a fixed y with every x: it allows there to be a distribution of y for a given x. In that case, the trusted source would assign labels y_i according to a fixed distribution, conditional on x_i. However, after this Section, we will be assuming fixed y for given x.

Now suppose we have a machine whose task it is to learn the mapping x_i → y_i. The machine is actually defined by a set of possible mappings x → f(x,α), where the functions f(x,α) themselves are labeled by the adjustable parameters α. The machine is assumed to be deterministic: for a given input x, and choice of α, it will always give the same output f(x,α). A particular choice of α generates what we will call a “trained machine.” Thus, for example, a neural network with fixed architecture, with α corresponding to the weights and biases, is a learning machine in this sense.

The expectation of the test error for a trained machine is therefore:

$R(\alpha) = \int \tfrac{1}{2}\,|y - f(x,\alpha)|\; dP(x,y)$    (1)

Note that, when a density p(x,y) exists, dP(x,y) may be written p(x,y) dx dy. This is a nice way of writing the true mean error, but unless we have an estimate of what P(x,y) is, it is not very useful.

The quantity R(α) is called the expected risk, or just the risk. Here we will call it the actual risk, to emphasize that it is the quantity that we are ultimately interested in. The “empirical risk” R_emp(α) is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations)⁴:

$R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i,\alpha)|.$    (2)

Note that no probability distribution appears here. R_emp(α) is a fixed number for a particular choice of α and for a particular training set {x_i, y_i}.

The quantity (1/2)|y_i − f(x_i,α)| is called the loss. For the case described here, it can only take the values 0 and 1. Now choose some η such that 0 ≤ η ≤ 1. Then for losses taking these values, with probability 1 − η, the following bound holds (Vapnik, 1995):

$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h)+1) - \log(\eta/4)}{l}}$    (3)

where h is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above. In the following we will call the right hand side of Eq. (3) the “risk bound.” We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the “guaranteed risk”, but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the “VC confidence.”

We note three key points about this bound. First, remarkably, it is independent of P(x,y). It assumes only that both the training data and the test data are drawn independently according to some P(x,y). Second, it is usually not possible to compute the left hand side. Third, if we know h, we can easily compute the right hand side. Thus given several different learning machines (recall that “learning machine” is just another name for a family of functions f(x,α)), and choosing a fixed, sufficiently small η, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6). Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.
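Since the right hand side of Eq. (3) is elementary to evaluate once h, l and η are chosen, it is easy to experiment with it directly. The following short Python sketch (my own illustration, not part of the original tutorial; the function names are arbitrary) computes the VC confidence and the full risk bound, and reproduces the behaviour discussed in Section 2.4 (Figure 3), where the VC confidence passes unity near h/l ≈ 0.37 for η = 0.05 and l = 10,000.

```python
import math

def vc_confidence(h, l, eta=0.05):
    # Second term on the right hand side of Eq. (3).
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def risk_bound(emp_risk, h, l, eta=0.05):
    # Right hand side of Eq. (3): empirical risk plus VC confidence.
    return emp_risk + vc_confidence(h, l, eta)

l = 10000  # training sample size used for Figure 3
for h_over_l in (0.05, 0.1, 0.2, 0.37, 0.5):
    h = int(h_over_l * l)
    print(f"h/l = {h_over_l:.2f}   VC confidence = {vc_confidence(h, l):.3f}")
```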

2.1. The VC Dimension

The VC dimension is a property of a set of functions {f(α)} (again, we use α as a generic set of parameters: a choice of α specifies a particular function), and can be defined for various classes of function f. Here we will only consider functions that correspond to the two-class pattern recognition case, so that f(x,α) ∈ {−1, 1} ∀x, α. Now if a given set of l points can be labeled in all possible 2^l ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. Note that, if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered.

2.2. Shattering Points with Oriented Hyperplanes in R^n

Suppose that the space in which the data live is R², and the set {f(α)} consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1, and all points on the other side, the class −1. The orientation is shown in Figure 1 by an arrow, specifying on which side of the line points are to be assigned the label 1. While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in R² is three.

Let’s now consider hyperplanes in R^n. The following theorem will prove useful (the proof is in the Appendix):

Theorem 1 Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes⁵ if and only if the position vectors of the remaining points are linearly independent⁶.

Corollary: The VC dimension of the set of oriented hyperplanes in R^n is n+1, since we can always choose n+1 points, and then choose one of the points as origin, such that the position vectors of the remaining n points are linearly independent, but can never choose n+2 such points (since no n+1 vectors in R^n can be linearly independent).

An alternative proof of the corollary can be found in (Anthony and Biggs, 1995), and references therein.

Figure 1. Three points in R², shattered by oriented lines.

2.3. The VC Dimension and the Number of Parameters

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ R: {θ(x) = 1 ∀x > 0; θ(x) = −1 ∀x ≤ 0}. Consider the one-parameter family of functions, defined by

$f(x,\alpha) \equiv \theta(\sin(\alpha x)), \quad x, \alpha \in \mathbb{R}.$    (4)

You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be:

$x_i = 10^{-i}, \quad i = 1,\cdots,l.$    (5)

You specify any labels you like:

$y_1, y_2, \cdots, y_l, \quad y_i \in \{-1, 1\}.$    (6)

Then f(α) gives this labeling if I choose α to be

$\alpha = \pi\left(1 + \sum_{i=1}^{l} \frac{(1 - y_i)\,10^i}{2}\right).$    (7)

Thus the VC dimension of this machine is infinite.

Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2. This can be seen as follows: write the phase at x_1 as φ_1 = 2nπ + δ. Then the choice of label y_1 = 1 requires 0 < δ < π. The phase at x_2, mod 2π, is 2δ; then y_2 = 1 ⇒ 0 < δ < π/2. Similarly, point x_3 forces δ > π/3. Then at x_4, π/3 < δ < π/2 implies that f(x_4,α) = −1, contrary to the assigned label. These four points are the analogy, for the set of functions in Eq. (4), of the set of three points lying along a line, for oriented hyperplanes in R^n. Neither set can be shattered by the chosen family of functions.
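The shattering construction of Eqs. (5)-(7) is easy to check numerically. The sketch below (my own, not from the tutorial) picks an arbitrary labelling, computes α from Eq. (7), and verifies that θ(sin(αx_i)) reproduces every label.

```python
import math

def theta(x):
    # The step function defined above: +1 for x > 0, -1 for x <= 0.
    return 1 if x > 0 else -1

def alpha_for_labels(labels):
    # Eq. (7): alpha = pi * (1 + sum_i (1 - y_i) * 10^i / 2).
    return math.pi * (1.0 + sum((1 - y) * 10**i / 2.0 for i, y in enumerate(labels, start=1)))

labels = [1, -1, -1, 1, -1, 1, 1, -1]                 # any labelling, as in Eq. (6)
xs = [10.0**(-i) for i in range(1, len(labels) + 1)]  # the points of Eq. (5)
a = alpha_for_labels(labels)
print([theta(math.sin(a * x)) for x in xs] == labels)  # True: these l points are shattered
```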

Figure 2. Four points that cannot be shattered by θ(sin(αx)), despite infinite VC dimension.

2.4. Minimizing The Bound by Minimizing h

Figure 3. VC confidence is monotonic in h (horizontal axis: h/l = VC dimension / sample size; vertical axis: VC confidence).

Figure 3 shows how the second term on the right hand side of Eq. (3) varies with h, given a choice of 95% confidence level (η = 0.05) and assuming a training sample of size 10,000. The VC confidence is a monotonic increasing function of h. This will be true for any value of l.

Thus, given some selection of learning machines whose empirical risk is zero, one wants to choose that learning machine whose associated set of functions has minimal VC dimension. This will lead to a better upper bound on the actual error. In general, for non zero empirical risk, one wants to choose that learning machine which minimizes the right hand side of Eq. (3).

Note that in adopting this strategy, we are only using Eq. (3) as a guide. Eq. (3) gives (with some chosen probability) an upper bound on the actual risk. This does not prevent a particular machine with the same value for empirical risk, and whose function set has higher VC dimension, from having better performance. In fact an example of a system that gives good performance despite having infinite VC dimension is given in the next Section. Note also that the graph shows that for h/l > 0.37 (and for η = 0.05 and l = 10,000), the VC confidence exceeds unity, and so for higher values the bound is guaranteed not tight.

2.5. Two Examples

Consider the k’th nearest neighbour classifier, with k = 1. This set of functions has infinite VC dimension and zero empirical risk, since any number of points, labeled arbitrarily, will be successfully learned by the algorithm (provided no two points of opposite class lie right on top of each other). Thus the bound provides no information. In fact, for any classifier with infinite VC dimension, the bound is not even valid⁷. However, even though the bound is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite “capacity” does not guarantee poor performance.

Let’s follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than l (so that the bound is non trivial). An example is the following, which I call the “notebook classifier.” This classifier consists of a notebook with enough room to write down the classes of m training observations, where m ≤ l. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive (y = +1) as negative (y = −1) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to m observations; 0.5 training error for all subsequent observations; 0.5 actual error, and VC dimension h = m. Substituting these values in Eq. (3), the bound becomes:

$\frac{m}{4l} \le \ln(2l/m) + 1 - (1/m)\ln(\eta/4)$    (8)

which is certainly met for all η if

$f(z) = \frac{z}{2}\exp(z/4 - 1) \le 1, \quad z \equiv (m/l), \quad 0 \le z \le 1$    (9)

which is true, since f(z) is monotonic increasing, and f(z = 1) = 0.236.
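The two numerical claims in the last sentence are easy to confirm; the following few lines (my own quick check, not part of the paper) evaluate f(z) on a grid of z values.

```python
import math

f = lambda z: (z / 2.0) * math.exp(z / 4.0 - 1.0)   # the function of Eq. (9)
zs = [k / 1000.0 for k in range(1001)]              # z = m/l on [0, 1]
vals = [f(z) for z in zs]
print(all(a < b for a, b in zip(vals, vals[1:])))   # True: f is monotonically increasing
print(round(f(1.0), 3))                             # 0.236, as quoted above
```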

2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions, such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension h varies smoothly, since it is an integer. Instead, introduce a “structure” by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute h, or to get a bound on h itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.

Figure 4. Nested subsets of functions, ordered by VC dimension (h1 < h2 < h3 ...).

We have now laid the groundwork necessary to begin our exploration of support vector machines.

3. Linear Support Vector Machines

3.1. The Separable Case

We will start with the simplest case: linear machines trained on separable data (as we shall see, the analysis for the general case of nonlinear machines trained on non-separable data results in a very similar quadratic programming problem). Again label the training data {x_i, y_i}, i = 1,...,l, y_i ∈ {−1, 1}, x_i ∈ R^d. Suppose we have some hyperplane which separates the positive from the negative examples (a “separating hyperplane”). The points x which lie on the hyperplane satisfy w·x + b = 0, where w is normal to the hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin, and ‖w‖ is the Euclidean norm of w. Let d+ (d−) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the “margin” of a separating hyperplane to be d+ + d−. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

$x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1$    (10)

$x_i \cdot w + b \le -1 \quad \text{for } y_i = -1$    (11)

These can be combined into one set of inequalities:

$y_i(x_i \cdot w + b) - 1 \ge 0 \quad \forall i$    (12)

Now consider the points for which the equality in Eq. (10) holds (requiring that there exists such a point is equivalent to choosing a scale for w and b). These points lie on the hyperplane H1: x_i·w + b = 1 with normal w and perpendicular distance from the origin |1 − b|/‖w‖. Similarly, the points for which the equality in Eq. (11) holds lie on the hyperplane H2: x_i·w + b = −1, with normal again w, and perpendicular distance from the origin |−1 − b|/‖w‖. Hence d+ = d− = 1/‖w‖ and the margin is simply 2/‖w‖. Note that H1 and H2 are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing ‖w‖², subject to constraints (12).

Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 5. Those training points for which the equality in Eq. (12) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 5 by the extra circles.

We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (12) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case (Section 4).

Figure 5. Linear separating hyperplanes for the separable case. The support vectors are circled.

Thus, we introduce positive Lagrange multipliers α_i, i = 1,...,l, one for each of the inequality constraints (12). Recall that the rule is that for constraints of the form c_i ≥ 0, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives Lagrangian:

$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$    (13)

We must now minimize L_P with respect to w, b, and simultaneously require that the derivatives of L_P with respect to all the α_i vanish, all subject to the constraints α_i ≥ 0 (let’s call this particular set of constraints C1). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following “dual” problem: maximize L_P, subject to the constraints that the gradient of L_P with respect to w and b vanish, and subject also to the constraints that the α_i ≥ 0 (let’s call that particular set of constraints C2). This particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987). It has the property that the maximum of L_P, subject to constraints C2, occurs at the same values of the w, b and α, as the minimum of L_P, subject to constraints C1⁸.

Requiring that the gradient of L_P with respect to w and b vanish gives the conditions:

$w = \sum_i \alpha_i y_i x_i$    (14)

$\sum_i \alpha_i y_i = 0.$    (15)

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (13) to give

$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$    (16)


Note that we have now given the Lagrangian different labels (P for primal, D for dual) to emphasize that the two formulations are different: L_P and L_D arise from the same objective function but with different constraints; and the solution is found by minimizing L_P or by maximizing L_D. Note also that if we formulate the problem with b = 0, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.

Support vector training (for the separable, linear case) therefore amounts to maximizing L_D with respect to the α_i, subject to constraints (15) and positivity of the α_i, with solution given by (14). Notice that there is a Lagrange multiplier α_i for every training point. In the solution, those points for which α_i > 0 are called “support vectors”, and lie on one of the hyperplanes H1, H2. All other training points have α_i = 0 and lie either on H1 or H2 (such that the equality in Eq. (12) holds), or on that side of H1 or H2 such that the strict inequality in Eq. (12) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross H1 or H2), and training was repeated, the same separating hyperplane would be found.
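For small problems the dual of Eqs. (14)-(16) can be handed directly to any off-the-shelf quadratic programming solver. The sketch below is my own illustration; the tutorial does not prescribe a particular solver, and cvxopt is simply one convenient choice. It maximizes L_D subject to Eq. (15) and α_i ≥ 0 on a toy separable data set, then recovers w from Eq. (14) and b from the complementarity condition discussed in Section 3.2.

```python
import numpy as np
from cvxopt import matrix, solvers  # any QP solver would do; cvxopt is used here for brevity

def train_linear_svm(X, y):
    # Maximize L_D of Eq. (16) subject to Eq. (15) and alpha_i >= 0,
    # written as the equivalent minimization of (1/2) a^T H a - 1^T a.
    l = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T            # H_ij = y_i y_j x_i . x_j
    P = matrix(H + 1e-8 * np.eye(l))                     # small ridge for numerical stability
    q = matrix(-np.ones(l))
    G, h = matrix(-np.eye(l)), matrix(np.zeros(l))       # -alpha_i <= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)   # sum_i alpha_i y_i = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = alpha > 1e-6                                    # the support vectors (alpha_i > 0)
    w = ((alpha * y)[:, None] * X).sum(axis=0)           # Eq. (14)
    b0 = np.mean(y[sv] - X[sv] @ w)                      # b averaged over the support vectors
    return w, b0, alpha

X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])   # toy separable data
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b0, alpha = train_linear_svm(X, y)
print(np.all(np.sign(X @ w + b0) == y))                  # True: training data separated
```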

3.2. The Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker (KKT) conditions play a central role in both the theory and practice of constrained optimization. For the primal problem above, the KKT conditions may be stated (Fletcher, 1987):

$\frac{\partial}{\partial w_\nu} L_P = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0, \quad \nu = 1,\cdots,d$    (17)

$\frac{\partial}{\partial b} L_P = -\sum_i \alpha_i y_i = 0$    (18)

$y_i(x_i \cdot w + b) - 1 \ge 0, \quad i = 1,\cdots,l$    (19)

$\alpha_i \ge 0 \quad \forall i$    (20)

$\alpha_i\,(y_i(w \cdot x_i + b) - 1) = 0 \quad \forall i$    (21)

The KKT conditions are satisfied at the solution of any constrained optimization problem (convex or not), with any kind of constraints, provided that the intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for linearized constraints with the set of descent directions (see Fletcher, 1987; McCormick, 1983). This rather technical regularity assumption holds for all support vector machines, since the constraints are always linear. Furthermore, the problem for SVMs is convex (a convex objective function, with constraints which give a convex feasible region), and for convex problems (if the regularity condition holds), the KKT conditions are necessary and sufficient for w, b, α to be a solution (Fletcher, 1987). Thus solving the SVM problem is equivalent to finding a solution to the KKT conditions. This fact results in several approaches to finding the solution (for example, the primal-dual path following method mentioned in Section 5).

As an immediate application, note that, while w is explicitly determined by the training procedure, the threshold b is not, although it is implicitly determined. However b is easily found by using the KKT “complementarity” condition, Eq. (21), by choosing any i for which α_i ≠ 0 and computing b (note that it is numerically safer to take the mean value of b resulting from all such equations).

Notice that all we’ve done so far is to cast the problem into an optimization problem where the constraints are rather more manageable than those in Eqs. (10), (11). Finding the solution for real world problems will usually require numerical methods. We will have more to say on this later. However, let’s first work out a rare case where the problem is nontrivial (the number of dimensions is arbitrary, and the solution certainly not obvious), but where the solution can be found analytically.

3.3. Optimal Hyperplanes: An Example

While the main aim of this Section is to explore a non-trivial pattern recognition problem where the support vector solution can be found analytically, the results derived here will also be useful in a later proof. For the problem considered, every training point will turn out to be a support vector, which is one reason we can find the solution analytically.

Consider n+1 symmetrically placed points lying on a sphere S^{n−1} of radius R: more precisely, the points form the vertices of an n-dimensional symmetric simplex. It is convenient to embed the points in R^{n+1} in such a way that they all lie in the hyperplane which passes through the origin and which is perpendicular to the (n+1)-vector (1,1,...,1) (in this formulation, the points lie on S^{n−1}, they span R^n, and are embedded in R^{n+1}). Explicitly, recalling that vectors themselves are labeled by Roman indices and their coordinates by Greek, the coordinates are given by:

$x_{i\mu} = -(1-\delta_{i,\mu})\sqrt{\frac{R^2}{n(n+1)}} + \delta_{i,\mu}\sqrt{\frac{R^2\,n}{n+1}}$    (22)

where the Kronecker delta, δ_{i,μ}, is defined by δ_{i,μ} = 1 if μ = i, 0 otherwise. Thus, for example, the vectors for three equidistant points on the unit circle (see Figure 12) are:

$x_1 = \Big(\sqrt{\tfrac{2}{3}},\, -\tfrac{1}{\sqrt{6}},\, -\tfrac{1}{\sqrt{6}}\Big), \quad x_2 = \Big(-\tfrac{1}{\sqrt{6}},\, \sqrt{\tfrac{2}{3}},\, -\tfrac{1}{\sqrt{6}}\Big), \quad x_3 = \Big(-\tfrac{1}{\sqrt{6}},\, -\tfrac{1}{\sqrt{6}},\, \sqrt{\tfrac{2}{3}}\Big)$    (23)

One consequence of the symmetry is that the angle between any pair of vectors is the same (and is equal to arccos(−1/n)):

$\|x_i\|^2 = R^2$    (24)

$x_i \cdot x_j = -R^2/n$    (25)

or, more succinctly,

$\frac{x_i \cdot x_j}{R^2} = \delta_{i,j} - (1-\delta_{i,j})\frac{1}{n}.$    (26)

Assigning a class label C ∈ {+1, −1} arbitrarily to each point, we wish to find that hyperplane which separates the two classes with widest margin. Thus we must maximize

L_D in Eq. (16), subject to α_i ≥ 0 and also subject to the equality constraint, Eq. (15). Our strategy is to simply solve the problem as though there were no inequality constraints. If the resulting solution does in fact satisfy α_i ≥ 0 ∀i, then we will have found the general solution, since the actual maximum of L_D will then lie in the feasible region, provided the equality constraint, Eq. (15), is also met. In order to impose the equality constraint we introduce an additional Lagrange multiplier λ. Thus we seek to maximize

$L_D \equiv \sum_{i=1}^{n+1} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n+1} \alpha_i H_{ij} \alpha_j - \lambda \sum_{i=1}^{n+1} \alpha_i y_i,$    (27)

where we have introduced the Hessian

$H_{ij} \equiv y_i y_j \, x_i \cdot x_j.$    (28)

Setting ∂L_D/∂α_i = 0 gives

$(H\alpha)_i + \lambda y_i = 1 \quad \forall i$    (29)

Now H has a very simple structure: the off-diagonal elements are −y_i y_j R²/n, and the diagonal elements are R². The fact that all the off-diagonal elements differ only by factors of y_i suggests looking for a solution which has the form:

$\alpha_i = \left(\frac{1+y_i}{2}\right) a + \left(\frac{1-y_i}{2}\right) b$    (30)

where a and b are unknowns. Plugging this form in Eq. (29) gives:

$\left(\frac{n+1}{n}\right)\frac{a+b}{2} - y_i\,\frac{p}{n}\,\frac{a+b}{2} = \frac{1 - \lambda y_i}{R^2}$    (31)

where p is defined by

$p \equiv \sum_{i=1}^{n+1} y_i.$    (32)

Thus

$a + b = \frac{2n}{R^2(n+1)}$    (33)

and substituting this into the equality constraint Eq. (15) to find a, b gives

$a = \frac{n}{R^2(n+1)}\left(1 - \frac{p}{n+1}\right), \qquad b = \frac{n}{R^2(n+1)}\left(1 + \frac{p}{n+1}\right)$    (34)

which gives for the solution

$\alpha_i = \frac{n}{R^2(n+1)}\left(1 - \frac{y_i\,p}{n+1}\right)$    (35)

Also,

$(H\alpha)_i = 1 - \frac{y_i\,p}{n+1}.$    (36)

Hence

$\|w\|^2 = \sum_{i,j=1}^{n+1} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = \alpha^T H \alpha = \sum_{i=1}^{n+1} \alpha_i\left(1 - \frac{y_i\,p}{n+1}\right) = \sum_{i=1}^{n+1} \alpha_i = \left(\frac{n}{R^2}\right)\left(1 - \left(\frac{p}{n+1}\right)^2\right)$    (37)

Note that this is one of those cases where the Lagrange multiplier λ can remain undetermined (although determining it is trivial). We have now solved the problem, since all the α_i are clearly positive or zero (in fact the α_i will only be zero if all training points have the same class). Note that ‖w‖ depends only on the number of positive (negative) polarity points, and not on how the class labels are assigned to the points in Eq. (22). This is clearly not true of w itself, which is given by

$w = \frac{n}{R^2(n+1)} \sum_{i=1}^{n+1} \left(y_i - \frac{p}{n+1}\right) x_i$    (38)

The margin, M = 2/‖w‖, is thus given by

$M = \frac{2R}{\sqrt{n\,\big(1 - (p/(n+1))^2\big)}}.$    (39)

Thus when the number of points n+1 is even, the minimum margin occurs when p = 0 (equal numbers of positive and negative examples), in which case the margin is M_min = 2R/√n. If n+1 is odd, the minimum margin occurs when p = ±1, in which case M_min = 2R(n+1)/(n√(n+2)). In both cases, the maximum margin is given by M_max = R(n+1)/n. Thus, for example, for the two dimensional simplex consisting of three points lying on S¹ (and spanning R²), and with labeling such that not all three points have the same polarity, the maximum and minimum margin are both 3R/2 (see Figure 12).

Note that the results of this Section amount to an alternative, constructive proof that the VC dimension of oriented separating hyperplanes in R^n is at least n+1.
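Because every quantity in this example is available in closed form, the solution is easy to verify numerically. The following numpy sketch (my own check, not part of the paper) builds the simplex of Eq. (22), assigns an arbitrary labelling, and confirms Eqs. (36), (15) and the margin formula (39).

```python
import numpy as np

n, R = 5, 1.5                                      # n+1 points on a sphere of radius R
i, mu = np.indices((n + 1, n + 1))
X = np.where(i == mu, np.sqrt(R**2 * n / (n + 1)),
             -np.sqrt(R**2 / (n * (n + 1))))       # the simplex vertices of Eq. (22)
y = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0])    # any labelling of the n+1 points
p = y.sum()

H = np.outer(y, y) * (X @ X.T)                     # the Hessian of Eq. (28)
alpha = n / (R**2 * (n + 1)) * (1 - y * p / (n + 1))      # the solution, Eq. (35)
print(np.allclose(H @ alpha, 1 - y * p / (n + 1)))        # Eq. (36) holds
print(np.isclose(alpha @ y, 0.0))                         # equality constraint, Eq. (15)

w = ((alpha * y)[:, None] * X).sum(axis=0)                # Eq. (14)
print(np.isclose(2 / np.linalg.norm(w),
                 2 * R / np.sqrt(n * (1 - (p / (n + 1))**2))))   # margin, Eq. (39)
```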

3.4. Test Phase

Once we have trained a Support Vector Machine, how can we use it? We simply determine on which side of the decision boundary (that hyperplane lying half way between H1 and H2 and parallel to them) a given test pattern x lies and assign the corresponding class label, i.e. we take the class of x to be sgn(w·x + b).

3.5. The Non-Separable Case

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints (10) and (11), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables ξ_i, i = 1,...,l in the constraints (Cortes and Vapnik, 1995), which then become:

$x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1$    (40)

$x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1$    (41)

$\xi_i \ge 0 \quad \forall i.$    (42)

Thus, for an error to occur, the corresponding ξ_i must exceed unity, so Σ_i ξ_i is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from ‖w‖²/2 to ‖w‖²/2 + C(Σ_i ξ_i)^k, where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors. As it stands, this is a convex programming problem for any positive integer k; for k = 2 and k = 1 it is also a quadratic programming problem, and the choice k = 1 has the further advantage that neither the ξ_i, nor their Lagrange multipliers, appear in the Wolfe dual problem, which becomes:

Maximize:

$L_D \equiv \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$    (43)

subject to:

$0 \le \alpha_i \le C,$    (44)

$\sum_i \alpha_i y_i = 0.$    (45)

The solution is again given by

$w = \sum_{i=1}^{N_S} \alpha_i y_i x_i,$    (46)

where N_S is the number of support vectors. Thus the only difference from the optimal hyperplane case is that the α_i now have an upper bound of C. The situation is summarized schematically in Figure 6.

We will need the Karush-Kuhn-Tucker conditions for the primal problem. The primal Lagrangian is

$L_P = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\{y_i(x_i \cdot w + b) - 1 + \xi_i\} - \sum_i \mu_i \xi_i$    (47)

where the μ_i are the Lagrange multipliers introduced to enforce positivity of the ξ_i. The KKT conditions for the primal problem are therefore (note i runs from 1 to the number of training points, and ν from 1 to the dimension of the data)

$\frac{\partial L_P}{\partial w_\nu} = w_\nu - \sum_i \alpha_i y_i x_{i\nu} = 0$    (48)

$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$    (49)

$\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$    (50)

$y_i(x_i \cdot w + b) - 1 + \xi_i \ge 0$    (51)

$\xi_i \ge 0$    (52)

$\alpha_i \ge 0$    (53)

$\mu_i \ge 0$    (54)

$\alpha_i\{y_i(x_i \cdot w + b) - 1 + \xi_i\} = 0$    (55)

$\mu_i \xi_i = 0$    (56)

As before, we can use the KKT complementarity conditions, Eqs. (55) and (56), to determine the threshold b. Note that Eq. (50) combined with Eq. (56) shows that ξ_i = 0 if α_i < C. Thus we can simply take any training point for which 0 < α_i < C and use Eq. (55) (with ξ_i = 0) to compute b. (As before, it is numerically safer to take the mean value of b over all such training points.)

Figure 6. Linear separating hyperplanes for the non-separable case.
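Only the box constraint of Eq. (44) distinguishes the non-separable dual from the separable one, so the earlier QP sketch needs just one extra block of inequality rows (again my own illustration with cvxopt, not a prescribed implementation); b is taken from the margin support vectors, i.e. points with 0 < α_i < C, for which ξ_i = 0 as noted above.

```python
import numpy as np
from cvxopt import matrix, solvers

def train_soft_margin_svm(X, y, C=1.0):
    # Dual of Eqs. (43)-(45): same objective as before, but 0 <= alpha_i <= C.
    l = X.shape[0]
    H = (y[:, None] * X) @ (y[:, None] * X).T
    P, q = matrix(H + 1e-8 * np.eye(l)), matrix(-np.ones(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))            # -alpha_i <= 0 and alpha_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = ((alpha * y)[:, None] * X).sum(axis=0)                # Eq. (46)
    margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)           # 0 < alpha_i < C  =>  xi_i = 0
    b0 = np.mean(y[margin_sv] - X[margin_sv] @ w)             # b from Eq. (55), averaged
    return w, b0, alpha                                        # assumes at least one margin SV exists
```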

3.6. A Mechanical Analogy

Consider the case in which the data are in R². Suppose that the i’th support vector exerts a force F_i = α_i y_i ŵ on a stiff sheet lying along the decision surface (the “decision sheet”) (here ŵ denotes the unit vector in the direction w). Then the solution (46) satisfies the conditions of mechanical equilibrium:

$\sum \text{Forces} = \sum_i \alpha_i y_i \hat{w} = 0$    (57)

$\sum \text{Torques} = \sum_i s_i \wedge (\alpha_i y_i \hat{w}) = \hat{w} \wedge w = 0.$    (58)

(Here the s_i are the support vectors, and ∧ denotes the vector product.) For data in R^n, clearly the condition that the sum of forces vanish is still met. One can easily show that the torque also vanishes.⁹

This mechanical analogy depends only on the form of the solution (46), and therefore holds for both the separable and the non-separable cases. In fact this analogy holds in general (i.e., also for the nonlinear case described below). The analogy emphasizes the interesting point that the “most important” data points are the support vectors with highest values of α, since they exert the highest forces on the decision sheet. For the non-separable case, the upper bound α_i ≤ C corresponds to an upper bound on the force any given point is allowed to exert on the sheet. This analogy also provides a reason (as good as any other) to call these particular vectors “support vectors”¹⁰.

3.7. Examples by Pictures

Figure 7 shows two examples of a two-class pattern recognition problem, one separable and one not. The two classes are denoted by circles and disks respectively. Support vectors are identified with an extra circle. The error in the non-separable case is identified with a cross. The reader is invited to use Lucent’s SVM Applet (Burges, Knirsch and Haratsch, 1996) to experiment and create pictures like these (if possible, try using 16 or 24 bit color).

Figure 7. The linear case, separable (left) and not (right). The background colour shows the shape of the decision surface.

4. Nonlinear Support Vector Machines

How can the above methods be generalized to the case where the decision function¹¹ is not a linear function of the data? (Boser, Guyon and Vapnik, 1992) showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appears in the training problem, Eqs. (43)-(45), is in the form of dot products, x_i·x_j. Now suppose we first mapped the data to some other (possibly infinite dimensional) Euclidean space H, using a mapping which we will call Φ:

$\Phi: R^d \rightarrow H.$    (59)

Then of course the training algorithm would only depend on the data through dot products in H, i.e. on functions of the form Φ(x_i)·Φ(x_j). Now if there were a “kernel function” K such that K(x_i, x_j) = Φ(x_i)·Φ(x_j), we would only need to use K in the training algorithm, and would never need to explicitly even know what Φ is. One example is

$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}.$    (60)

In this particular example, H is infinite dimensional, so it would not be very easy to work with Φ explicitly. However, if one replaces x_i·x_j by K(x_i, x_j) everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.

But how can we use this machine? After all, we need w, and that will live in H also (see Eq. (46)). But in test phase an SVM is used by computing dot products of a given test point x with w, or more specifically by computing the sign of

$f(x) = \sum_{i=1}^{N_S} \alpha_i y_i\, \Phi(s_i)\cdot\Phi(x) + b = \sum_{i=1}^{N_S} \alpha_i y_i\, K(s_i, x) + b$    (61)

where the s_i are the support vectors. So again we can avoid computing Φ(x) explicitly and use the K(s_i, x) = Φ(s_i)·Φ(x) instead.
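In code, the test phase of Eq. (61) needs nothing more than the kernel, the support vectors and their multipliers. A minimal sketch (mine, assuming the α_i, s_i and b have already been found, e.g. by a QP solver as in Section 3):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # The Gaussian radial basis function kernel of Eq. (60).
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, support_vectors, sv_alphas, sv_labels, b, kernel=rbf_kernel):
    # Eq. (61): f(x) = sum_i alpha_i y_i K(s_i, x) + b; the predicted class is sign(f(x)).
    return sum(a * y * kernel(s, x)
               for a, y, s in zip(sv_alphas, sv_labels, support_vectors)) + b
```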

Let us call the space in which the data live, L. (Here and below we use L as a mnemonic for “low dimensional”, and H for “high dimensional”: it is usually the case that the range of Φ is of much higher dimension than its domain). Note that, in addition to the fact that w lives in H, there will in general be no vector in L which maps, via the map Φ, to w. If there were, f(x) in Eq. (61) could be computed in one step, avoiding the sum (and making the corresponding SVM N_S times faster, where N_S is the number of support vectors). Despite this, ideas along these lines can be used to significantly speed up the test phase of SVMs (Burges, 1996). Note also that it is easy to find kernels (for example, kernels which are functions of the dot products of the x_i in L) such that the training algorithm and solution found are independent of the dimension of both L and H.

In the next Section we will discuss which functions K are allowable and which are not. Let us end this Section with a very simple example of an allowed kernel, for which we can construct the mapping Φ.

Suppose that your data are vectors in R², and you choose K(x_i, x_j) = (x_i·x_j)². Then it’s easy to find a space H, and mapping Φ from R² to H, such that (x·y)² = Φ(x)·Φ(y): we choose H = R³ and

$\Phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{pmatrix}$    (62)

(note that here the subscripts refer to vector components). For data in L defined on the square [−1,1]×[−1,1] ⊂ R² (a typical situation, for grey level image data), the (entire) image of Φ is shown in Figure 8. This Figure also illustrates how to think of this mapping: the image of Φ may live in a space of very high dimension, but it is just a (possibly very contorted) surface whose intrinsic dimension¹² is just that of L.
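A two-line check (my own) that the explicit map of Eq. (62) really reproduces the homogeneous quadratic kernel:

```python
import numpy as np

def phi(x):
    # The explicit map of Eq. (62) from R^2 to H = R^3.
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 2), rng.uniform(-1, 1, 2)
print(np.isclose((x @ y) ** 2, phi(x) @ phi(y)))   # True: (x . y)^2 = Phi(x) . Phi(y)
```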

Note that neither the mapping Φ nor the space H are unique for a given kernel. We could equally well have chosen H to again be R³ and

$\Phi(x) = \frac{1}{\sqrt{2}} \begin{pmatrix} x_1^2 - x_2^2 \\ 2 x_1 x_2 \\ x_1^2 + x_2^2 \end{pmatrix}$    (63)

or H to be R⁴ and

$\Phi(x) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_1 x_2 \\ x_2^2 \end{pmatrix}.$    (64)

Figure 8. Image, in H, of the square [−1,1]×[−1,1] ⊂ R² under the mapping Φ.

The literature on SVMs usually refers to the space H as a Hilbert space, so let’s end this Section with a few notes on this point. You can think of a Hilbert space as a generalization of Euclidean space that behaves in a gentlemanly fashion. Specifically, it is any linear space, with an inner product defined, which is also complete with respect to the corresponding norm (that is, any Cauchy sequence of points converges to a point in the space). Some authors (e.g. (Kolmogorov, 1970)) also require that it be separable (that is, it must have a countable subset whose closure is the space itself), and some (e.g. Halmos, 1967) don’t. It’s a generalization mainly because its inner product can be any inner product, not just the scalar (“dot”) product used here (and in Euclidean spaces in general). It’s interesting that the older mathematical literature (e.g. Kolmogorov, 1970) also required that Hilbert spaces be infinite dimensional, and that mathematicians are quite happy defining infinite dimensional Euclidean spaces. Research on Hilbert spaces centers on operators in those spaces, since the basic properties have long since been worked out. Since some people understandably blanch at the mention of Hilbert spaces, I decided to use the term Euclidean throughout this tutorial.

4.1. Mercer’s Condition

For which kernels does there exist a pair {H, Φ}, with the properties described above, and for which does there not? The answer is given by Mercer’s condition (Vapnik, 1995; Courant and Hilbert, 1953): there exists a mapping Φ and an expansion

$K(x, y) = \sum_i \Phi(x)_i\, \Phi(y)_i$    (65)

if and only if, for any g(x) such that

$\int g(x)^2\, dx \ \text{is finite}$    (66)

then

$\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0.$    (67)

Note that for specific cases, it may not be easy to check whether Mercer’s condition is satisfied. Eq. (67) must hold for every g with finite L₂ norm (i.e. which satisfies Eq. (66)). However, we can easily prove that the condition is satisfied for positive integral powers of the dot product: K(x, y) = (x·y)^p. We must show that

$\int \Big(\sum_{i=1}^{d} x_i y_i\Big)^p g(x)\, g(y)\, dx\, dy \ge 0.$    (68)

The typical term in the multinomial expansion of (Σ_{i=1}^{d} x_i y_i)^p contributes a term of the form

$\frac{p!}{r_1!\, r_2! \cdots (p - r_1 - r_2 - \cdots)!} \int x_1^{r_1} x_2^{r_2} \cdots\, y_1^{r_1} y_2^{r_2} \cdots\, g(x)\, g(y)\, dx\, dy$    (69)

to the left hand side of Eq. (67), which factorizes:

$= \frac{p!}{r_1!\, r_2! \cdots (p - r_1 - r_2 - \cdots)!} \Big(\int x_1^{r_1} x_2^{r_2} \cdots\, g(x)\, dx\Big)^2 \ge 0.$    (70)

One simple consequence is that any kernel which can be expressed as K(x, y) = Σ_{p=0}^{∞} c_p (x·y)^p, where the c_p are positive real coefficients and the series is uniformly convergent, satisfies Mercer’s condition, a fact also noted in (Smola, Schölkopf and Müller, 1998b).

Finally, what happens if one uses a kernel which does not satisfy Mercer’s condition? In general, there may exist data such that the Hessian is indefinite, and for which the quadratic programming problem will have no solution (the dual objective function can become arbitrarily large). However, even for kernels that do not satisfy Mercer’s condition, one might still find that a given training set results in a positive semidefinite Hessian, in which case the training will converge perfectly well. In this case, however, the geometrical interpretation described above is lacking.
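In practice, what matters for a given training set is exactly the property just described: whether the kernel Gram matrix (and hence the dual Hessian) on that sample is positive semidefinite. The snippet below (my own diagnostic, not a test of Mercer’s condition itself, which concerns all possible g) checks this by examining the eigenvalues of the Gram matrix.

```python
import numpy as np

def gram_is_psd(X, kernel, tol=1e-10):
    # Build the Gram matrix K_ij = K(x_i, x_j) and test whether it is positive semidefinite.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.random.uniform(-1, 1, size=(30, 2))
print(gram_is_psd(X, lambda a, b: (a @ b) ** 2))              # homogeneous quadratic: always PSD
print(gram_is_psd(X, lambda a, b: np.tanh(2 * (a @ b) + 1)))  # sigmoid kernel: may or may not be PSD
```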

4.2. Some Notes on Φ and H

Mercer’s condition tells us whether or not a prospective kernel is actually a dot product in some space, but it does not tell us how to construct Φ or even what H is. However, as with the homogeneous (that is, homogeneous in the dot product in L) quadratic polynomial kernel discussed above, we can explicitly construct the mapping for some kernels. In Section 6.1 we show how Eq. (62) can be extended to arbitrary homogeneous polynomial kernels, and that the corresponding space H is a Euclidean space of dimension $\binom{d+p-1}{p}$. Thus for example, for a degree p = 4 polynomial, and for data consisting of 16 by 16 images (d = 256), dim(H) is 183,181,376.

Usually, mapping your data to a “feature space” with an enormous number of dimensions would bode ill for the generalization performance of the resulting machine. After all, the set of all hyperplanes {w, b} are parameterized by dim(H) + 1 numbers. Most pattern recognition systems with billions, or even an infinite, number of parameters would not make it past the start gate. How come SVMs do so well? One might argue that, given the form of solution, there are at most l + 1 adjustable parameters (where l is the number of training samples), but this seems to be begging the question¹³. It must be something to do with our requirement of maximum margin hyperplanes that is saving the day. As we shall see below, a strong case can be made for this claim.

Since the mapped surface is of intrinsic dimension dim(L), unless dim(L) = dim(H), it is obvious that the mapping cannot be onto (surjective). It also need not be one to one (bijective): consider x_1 → −x_1, x_2 → −x_2 in Eq. (62). The image of Φ need not itself be a vector space: again, considering the above simple quadratic example, the vector −Φ(x) is not in the image of Φ unless x = 0. Further, a little playing with the inhomogeneous kernel

$K(x_i, x_j) = (x_i \cdot x_j + 1)^2$    (71)

will convince you that the corresponding Φ can map two vectors that are linearly dependent in L onto two vectors that are linearly independent in H.

So far we have considered cases where Φ is done implicitly. One can equally well turn things around and start with Φ, and then construct the corresponding kernel. For example (Vapnik, 1996), if L = R¹, then a Fourier expansion in the data x, cut off after N terms, has the form

$f(x) = \frac{a_0}{2} + \sum_{r=1}^{N} \big(a_{1r}\cos(rx) + a_{2r}\sin(rx)\big)$    (72)

and this can be viewed as a dot product between two vectors in R^{2N+1}: a = (a_0/√2, a_{11}, ..., a_{21}, ...), and the mapped Φ(x) = (1/√2, cos(x), cos(2x), ..., sin(x), sin(2x), ...). Then the corresponding (Dirichlet) kernel can be computed in closed form:

$\Phi(x_i)\cdot\Phi(x_j) = K(x_i, x_j) = \frac{\sin((N + 1/2)(x_i - x_j))}{2\sin((x_i - x_j)/2)}.$    (73)

This is easily seen as follows: letting δ ≡ x_i − x_j,

$\Phi(x_i)\cdot\Phi(x_j) = \frac{1}{2} + \sum_{r=1}^{N} \big(\cos(r x_i)\cos(r x_j) + \sin(r x_i)\sin(r x_j)\big)$

$= -\frac{1}{2} + \sum_{r=0}^{N} \cos(r\delta) = -\frac{1}{2} + \mathrm{Re}\Big\{\sum_{r=0}^{N} e^{ir\delta}\Big\}$

$= -\frac{1}{2} + \mathrm{Re}\big\{(1 - e^{i(N+1)\delta})/(1 - e^{i\delta})\big\}$

$= \sin((N + 1/2)\delta)\,/\,2\sin(\delta/2).$
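The closed form of Eq. (73) can be checked against the explicit feature map in a few lines (my own verification):

```python
import numpy as np

def phi(x, N):
    # The Fourier feature map described above: (1/sqrt(2), cos(x), ..., cos(Nx), sin(x), ..., sin(Nx)).
    r = np.arange(1, N + 1)
    return np.concatenate(([1.0 / np.sqrt(2.0)], np.cos(r * x), np.sin(r * x)))

def dirichlet_kernel(xi, xj, N):
    # The closed form of Eq. (73).
    d = xi - xj
    return np.sin((N + 0.5) * d) / (2.0 * np.sin(d / 2.0))

N, xi, xj = 7, 0.9, 2.3
print(np.isclose(phi(xi, N) @ phi(xj, N), dirichlet_kernel(xi, xj, N)))   # True
```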

Finally, it is clear that the above implicit mapping trick will work for any algorithm in which the data only appear as dot products (for example, the nearest neighbor algorithm). This fact has been used to derive a nonlinear version of principal component analysis by (Schölkopf, Smola and Müller, 1998b); it seems likely that this trick will continue to find uses elsewhere.
