Incremental Support Vector Machine Learning: A Local Approach


A Detailed Look at Inverse Dynamics Methods in Reinforcement Learning Algorithms (I)

Reinforcement learning (RL) is a machine learning approach in which an agent learns, through interaction with its environment, how to make decisions that maximize cumulative reward.

The core problems in RL are the trade-off between exploration and exploitation, and how to make optimal decisions in an uncertain environment.

In recent years, reinforcement learning has made great progress in many fields and has become one of the most closely watched research directions in artificial intelligence.

Among reinforcement learning algorithms, the inverse dynamics method is an important learning strategy.

Unlike traditional approaches built around a value function or a policy function, the inverse dynamics method learns the action-value function or the action policy function directly.

This article describes in detail how the inverse dynamics method is applied in reinforcement learning and the principles behind it.

1. Basic principles of the inverse dynamics method

In reinforcement learning, the agent interacts continuously with the environment and adjusts its decision policy according to the environment's feedback.

The core idea of the inverse dynamics method is to work backwards from the output, computing the value function or policy function associated with the input, and to use this to update the parameters.

Compared with forward computation straight from input to output, the inverse dynamics method is better suited to high-dimensional, complex problems, and it handles issues such as vanishing and exploding gradients more gracefully during parameter updates.

2. The inverse dynamics method in deep reinforcement learning

Deep reinforcement learning applies deep learning techniques to reinforcement learning in order to solve decision problems in high-dimensional, complex environments.

In deep reinforcement learning, the inverse dynamics method is widely used for estimating value functions and optimizing policy functions.

Approximating the action-value function or the action policy function with a neural network makes it possible to handle high-dimensional state and action spaces effectively and to model complex nonlinear relationships.

3. Algorithmic implementation of the inverse dynamics method

In practice, the inverse dynamics method usually relies on gradient-based optimization algorithms for parameter updates.

Commonly used algorithms include stochastic gradient descent (SGD), Adam, and RMSProp.

By iteratively updating the parameters, these algorithms drive the neural network towards the target function and can cope with high-dimensional, non-convex optimization problems.
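To make the parameter-update step concrete, here is a minimal sketch of one gradient-based update for a neural network that approximates an action-value function. It assumes PyTorch, and the network size, data, and hyperparameters are made up purely for illustration; any of the optimizers named above (SGD, Adam, RMSProp) could be substituted.

```python
import torch
import torch.nn as nn

# Hypothetical action-value network Q(s): one output per action.
state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))

# Any of the optimizers mentioned above (SGD, Adam, RMSProp) can be used here.
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One illustrative update on a fake batch of transitions.
states = torch.randn(8, state_dim)            # batch of states
actions = torch.randint(0, n_actions, (8,))   # actions that were taken
targets = torch.randn(8)                      # e.g. r + gamma * max_a' Q_target(s', a')

q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = loss_fn(q_taken, targets)

optimizer.zero_grad()
loss.backward()    # gradients are computed by backpropagation
optimizer.step()   # one iterative parameter update
```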

4. Improvements and applications of the inverse dynamics method

In recent years, researchers have proposed many improved inverse dynamics methods to address the challenges of deep reinforcement learning.

For example, intrinsically motivated reinforcement learning (IMRL) can effectively balance exploration and exploitation, and meta-learning-based methods can converge quickly in few-shot learning scenarios.

Support vector machine reference manual

sv        - the main SVM program
paragen   - program for generating parameter sets for the SVM
loadsv    - load a saved SVM and classify a new data set
rm sv     - special SVM program for image recognition, that implements virtual support vectors [BS97]
snsv      - program to convert SN format to our format
ascii2bin - program to convert our ASCII format to our binary format
bin2ascii - program to convert our binary format to our ASCII format

The rest of this document will describe these programs. To find out more about SVMs, see the bibliography. We will not describe how SVMs work here. The first program we will describe is the paragen program, as it specifies all parameters needed for the SVM.

SV Machine Parameters
=====================
1. Enter parameters
2. Load parameters
3. Save parameters (pattern_test)
4. Save parameters as...
5. Show parameters
0. Exit

Controlling the sensitivity of support vector machines

...number of real world problems such as handwritten character and digit recognition [Scholkopf, 1997; Cortes, 1995; LeCun et al., 1995; Vapnik, 1995], face detection [Osuna et al., 1997] and speaker identification. [...]

The decision function mapping points x_i to targets y_i (i = 1, ..., p) is formulated in terms of these kernels:

    f(x) = sign( Σ_{i=1}^{p} α_i y_i K(x, x_i) + b )

where b is the bias and the coefficients α_i are found by maximising the Lagrangian:

    L = Σ_{i=1}^{p} α_i - (1/2) Σ_{i,j=1}^{p} α_i α_j y_i y_j K(x_i, x_j)        (1)

subject to the constraints:

    α_i ≥ 0,        Σ_{i=1}^{p} α_i y_i = 0        (2)

Only those points which lie closest to the hyperplane have α_i > 0 (the support vectors). In the presence of noise, two techniques can be used to allow for, and control, a trade-off between training [...]
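As a concrete check of the decision function above, the short sketch below recovers f(x) = sign(Σ_i α_i y_i K(x, x_i) + b) from a fitted kernel SVM. It uses scikit-learn, which is not mentioned in this excerpt; its dual_coef_ attribute stores the products α_i y_i for the support vectors, and intercept_ stores b. The data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy two-class data (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Rebuild f(x) = sign( sum_i alpha_i y_i K(x, x_i) + b ) from the fitted model:
# clf.dual_coef_[0] holds alpha_i * y_i for the support vectors x_i.
x_new = np.array([[0.5, 1.0]])
K = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)   # K(x, x_i)
f = np.sign(K @ clf.dual_coef_[0] + clf.intercept_[0])

print(f, clf.predict(x_new))   # the hand-built f(x) agrees with the library prediction
```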

Support Vector Machines and Kernel Methods

Slack variables

If the data are not linearly separable, add slack variables s_i ≥ 0:

    y_i (x_i · w + c) + s_i ≥ 1

Then Σ_i s_i is the total amount by which the constraints are violated, so try to make Σ_i s_i as small as possible.

Perceptron as convex program

The final convex program for the perceptron is:

    min Σ_i s_i
    subject to  (y_i x_i) · w + y_i c + s_i ≥ 1,    s_i ≥ 0

We will try to understand this program using convex duality.
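A small sketch of how this convex program can be solved directly, assuming scipy's linear-programming routine; the variable names (w, c, s) follow the slide, while the data and everything else are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: X is (n, d), y in {-1, +1} (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2) + 1.5, rng.randn(10, 2) - 1.5])
y = np.array([1] * 10 + [-1] * 10)
n, d = X.shape

# Decision variables z = [w (d entries), c, s (n entries)]; minimize sum_i s_i.
obj = np.concatenate([np.zeros(d + 1), np.ones(n)])

# y_i (x_i . w + c) + s_i >= 1  rewritten as  -y_i x_i . w - y_i c - s_i <= -1
# to match linprog's A_ub z <= b_ub form.
A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
b_ub = -np.ones(n)

bounds = [(None, None)] * (d + 1) + [(0, None)] * n   # w, c free; s_i >= 0

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, c, s = res.x[:d], res.x[d], res.x[d + 1:]
print("total slack:", s.sum())   # 0 when the data are linearly separable
```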
Classification problem

[Figure: example classification data; vertical axis "% Middle & Upper Class", horizontal axis X]

Building support vector machines with reduced classifier complexity

Journal of Machine Learning Research7(2006)1493–1515Submitted10/05;Revised3/06;Published7/06 Building Support Vector Machines withReduced Classifier ComplexityS.Sathiya Keerthi SELVARAK@ Yahoo!Research3333Empire Avenue,Building4Burbank,CA91504,USAOlivier Chapelle CHAPELLE@TUEBINGEN.MPG.DE MPI for Biological Cybernetics72076T¨u bingen,GermanyDennis DeCoste DECOSTED@ Yahoo!Research3333Empire Avenue,Building4Burbank,CA91504,USAEditors:Kristin P.Bennett and Emilio Parrado-Hern´a ndezAbstractSupport vector machines(SVMs),though accurate,are not preferred in applications requiring great classification speed,due to the number of support vectors being large.To overcome this problem we devise a primal method with the following properties:(1)it decouples the idea of basis functions from the concept of support vectors;(2)it greedilyfinds a set of kernel basis functions of a specified maximum size(d max)to approximate the SVM primal cost function well;(3)it is efficient and roughly scales as O(nd2max)where n is the number of training examples;and,(4)the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.Keywords:SVMs,classification,sparse design1.IntroductionSupport Vector Machines(SVMs)are modern learning systems that deliver state-of-the-art perfor-mance in real world pattern recognition and data mining applications such as text categorization, hand-written character recognition,image classification and bioinformatics.Even though they yield very accurate solutions,they are not preferred in online applications where classification has to be done in great speed.This is due to the fact that a large set of basis functions is usually needed to form the SVM classifier,making it complex and expensive.In this paper we devise a method to overcome this problem.Our method incrementallyfinds basis functions to maximize accuracy.The process of adding new basis functions can be stopped when the classifier has reached some limiting level of complexity.In many cases,our method efficiently forms classifiers which have an order of magnitude smaller number of basis functions compared to the full SVM,while achieving nearly the same level of accuracy.SVM solution and post-processing simplification Given a training set{(x i,y i)}n i=1,y i∈{1,−1}, the Support Vector Machine(SVM)algorithm with an L2penalization of the training errors consistsK EERTHI,C HAPELLE AND D E C OSTE of solving the following primal problemmin λ2w 2+12n∑i=1max(0,1−y i w·φ(x i))2.(1)Computations involvingφare handled using the kernel function,k(x i,x j)=φ(x i)·φ(x j).For conve-nience the bias term has not been included,but the analysis presented in this paper can be extended in a straightforward way to include it.The quadratic penalization of the errors makes the primal objective function continuously differentiable.This is a great advantage and becomes necessary for developing a primal algorithm,as we will see below.The standard way to train an SVM is to introduce Lagrange multipliersαi and optimize them by solving a dual problem.The classifier function for a new input x is then given by the sign of ∑iαi y i k(x,x i).Because there is aflat part in the loss function,the vectorαis usually sparse.The x i for whichαi=0are called support vectors(SVs).Let n SV denote the number of SVs for a given problem.A recent theoretical result by Steinwart(Steinwart,2004)shows that n SV grows as a linear function of n.Thus,for large problems,this number can be large and the 
training and testing complexities might become prohibitive since they are respectively,O(n n SV+n SV3)and O(n SV).Several methods have been proposed for reducing the number of support vectors.Burges and Sch¨o lkopf(1997)apply nonlinear optimization methods to seek sparse representations after building the SVM classifier.Along similar lines,Sch¨o lkopf et al.(1999)use L1regularization onβto obtain sparse approximations.These methods are expensive since they involve the solution of hard non-convex optimization problems.They also become impractical for large problems.Downs et al. (2001)give an exact algorithm to prune the support vector set after the SVM classifier is built. Thies and Weber(2004)give special ideas for the quadratic kernel.Since these methods operate as a post-processing step,an expensive standard SVM training is still required.Direct simplification via basis functions and primal Instead offinding the SVM solution by maximizing the dual problem,one approach is to directly minimize the primal form after invoking the representer theorem to represent w asw=n∑i=1βiφ(x i).(2)If we allowβi=0for all i,substitute(2)in(1)and solve for theβi’s then(assuming uniqueness of solution)we will getβi=y iαi and thus we will precisely retrieve the SVM solution(Chapelle, 2005).But our aim is to obtain approximate solutions that have as few non-zeroβi’s as possible. For many classification problems there exists a small subset of the basis functions1suited to the complexity of the problem being solved,irrespective of the training size growth,that will yield pretty much the same accuracy as the SVM classifier.The evidence for this comes from the empir-ical performance of other sparse kernel classifiers:the Relevance Vector Machine(Tipping,2001), Informative Vector Machine(Lawrence et al.,2003)are probabilistic models in a Bayesian setting; and Kernel Matching Pursuit(Vincent and Bengio,2002)is a discriminative method that is mainly developed for the least squares loss function.These recent non-SVM works have laid the claim that they can match the accuracy of SVMs,while also bringing down considerably,the number of basis functions as well as the training cost.Work on simplifying SVM solution has not caught up well 1.Each k(x,x i)will be referred to as a basis function.B UILDING SVM S WITH R EDUCEDC OMPLEXITYwith those works in related kernelfields.The method outlined in this paper makes a contribution to fill this gap.We deliberately use the variable name,βi in(2)so as to interpret it as a basis weight as opposed to viewing it as y iαi whereαi is the Lagrange multiplier associated with the i-th primal slack con-straint.While the two are(usually)one and the same at exact optimality,they can be very different when we talk of sub-optimal primal solutions.There is a lot of freedom when we simply think of theβi’s as basis weights that yield a good suboptimal w for(1).First,we do not have to put any bounds on theβi.Second,we do not have to think of aβi corresponding to a particular location relative to the margin planes to have a certain value.Going even one more step further,we do not even have to restrict the basis functions to be a subset of the training set examples.Osuna and Girosi(1998)consider such an approach.They achieve sparsity by including the L1 regularizer,λ1 β 1in the primal objective.But they do not develop an algorithm(for solving the modified primal formulation and for choosing the rightλ1)that scales efficiently to large problems.Wu et al.(2005)write w asw=l∑i=1βiφ(˜x i)where l is a 
chosen small number and optimize the primal objective with theβi as well as the ˜x i as variables.But the optimization can become unwieldy if l is not small,especially since the optimization of the˜x i is a hard non-convex problem.In the RSVM algorithm(Lee and Mangasarian,2001;Lin and Lin,2003)a random subset of the training set is chosen to be the˜x i and then only theβi are optimized.2Because basis functions are chosen randomly,this method requires many more basis functions than needed in order to achieve a level of accuracy close to the full SVM solution;see Section3.A principled alternative to RSVM is to use a greedy approach for the selection of the subset of the training set for forming the representation.Such an approach has been popular in Gaussian processes(Smola and Bartlett,2001;Seeger et al.,2003;Keerthi and Chu,2006).Greedy meth-ods of basis selection also exist in the boosting literature(Friedman,2001;R¨a tsch,2001).These methods entail selection from a continuum of basis functions using either gradient descent or linear programming column generation.Bennett et al.(2002)and Bi et al.(2004)give modified ideas for kernel methods that employ a set of basis functionsfixed at the training points.Particularly relevant to the work in this paper are the kernel matching pursuit(KMP)algo-rithm of Vincent and Bengio(2002)and the growing support vector classifier(GSVC)algorithm of Parrado-Hern´a ndez et al.(2003).KMP is an effective greedy discriminative approach that is mainly developed for least squares problems.GSVC is an efficient method that is developed for SVMs and uses a heuristic criterion for greedy selection of basis functions.Our approach The main aim of this paper is to give an effective greedy method SVMs which uses a basis selection criterion that is directly related to the training cost function and is also very efficient.The basic theme of the method is forward selection.It starts with an empty set of basis functions and greedily chooses new basis functions(from the training set)to improve the primal objective function.We develop efficient schemes for both,the greedy selection of a new basis function,as well as the optimization of theβi for a given selection of basis functions.For choosing upto d max basis functions,the overall compuational cost of our method is O(nd2max).The different 2.For convenience,in the RSVM method,the SVM regularizer is replaced by a simple L2regularizer onβ.K EERTHI,C HAPELLE AND D E C OSTESpSVM-2SVMData Set TestErate#Basis TestErate n SVBanana10.87(1.74)17.3(7.3)10.54(0.68)221.7(66.98)Breast29.22(2.11)12.1(5.6)28.18(3.00)185.8(16.44)Diabetis23.47(1.36)13.8(5.6)23.73(1.24)426.3(26.91)Flare33.90(1.10)8.4(1.2)33.98(1.26)629.4(29.43)German24.90(1.50)14.0(7.3)24.47(1.97)630.4(22.48)Heart15.50(1.10) 4.3(2.6)15.80(2.20)166.6(8.75)Ringnorm 1.97(0.57)12.9(2.0) 1.68(0.24)334.9(108.54)Thyroid 5.47(0.78)10.6(2.3) 4.93(2.18)57.80(39.61)Titanic22.68(1.88) 3.3(0.9)22.35(0.67)150.0(0.0)Twonorm 2.96(0.82)8.7(3.7) 2.42(0.24)330.30(137.02)Waveform10.66(0.99)14.4(3.3)10.04(0.67)246.9(57.80)Table1:Comparison of SpSVM-2and SVM on benchmark data sets from(R¨a tsch).For TestErate, #Basis and n SV,the values are means over ten different training/test splits and the values in parantheses are the standard deviations.components of the method that we develop in this paper are not new in themselves and are inspired from the above mentioned papers.However,from a practical point of view,it is not obvious how to combine and tune them in order to get a very efficient SVM 
training algorithm.That is what we achieved in this paper through numerous and careful experiments that validated the techniques employed.Table1gives a preview of the performance of our method(called SpSVM-2in the table)in comparison with SVM on several UCI data sets.As can be seen there,our method gives a competing generalization performance while reducing the number of basis functions very significantly.(More specifics concerning Table1will be discussed in Section4.)The paper is organized as follows.We discuss the details of the efficient optimization of the primal objective function in Section2.The key issue of selecting basis functions is taken up in Section3.Sections4-7discuss other important practical issues and give computational results that demonstrate the value of our method.Section8gives some concluding remarks.The appendix gives details of all the data sets used for the experiments in this paper.2.The Basic OptimizationLet J⊂{1,...,n}be a given index set of basis functions that form a subset of the training set.We consider the problem of minimizing the objective function in(1)over the set of vectors w of the form3w=∑βjφ(x j).(3)j∈J3.More generally,one can consider expansion on points which do not belong to the training set.B UILDING SVM S WITH R EDUCEDC OMPLEXITY2.1Newton OptimizationLet K i j =k (x i ,x j )=φ(x i )·φ(x j )denote the generic element of the n ×n kernel matrix K .The notation K IJ refers to the submatrix of K made of the rows indexed by I and the columns indexed by J .Also,for a n -dimensional vector p ,let p J denote the |J |dimensional vector containing {p j :j ∈J }.Let d =|J |.With w restricted to (3),the primal problem (1)becomes the d dimensional mini-mization problem of finding βJ that solvesmin βJf (βJ )=λ2β⊤J K JJ βJ +12n ∑i =1max (0,1−y i o i )2(4)where o i =K i ,J βJ .Except for the regularizer being more general,i.e.,β⊤J K JJ βJ (as opposed to thesimple regularizer, βJ 2),the problem in (4)is very much the same as in a linear SVM design.Thus,the Newton method and its modification that are developed for linear SVMs (Mangasarian,2002;Keerthi and DeCoste,2005)can be used to solve (4)and obtain the solution βJ .Newton Method1.Choose a suitable starting vector,β0J .Set k =0.2.If βk J is the optimal solution of (4),stop.3.Let I ={i :1−y i o i ≥0}where o i =K i ,J βk J is the output of the i -th example.Obtain ¯βJ as the result of a Newton step or equivalently as the solution of the regularized least squares problem,min βJ λ2β⊤J K JJ βJ+12∑i ∈I (1−y i K i ,J βJ )2.(5)4.Take βk +1J to be the minimizer of f on L ,the line joining βk J and ¯βJ .Set k :=k +1and goback to step 2for another iteration.The solution of (5)is given by¯βJ =βk J −P −1g ,where P =λK JJ +K JI K ⊤JIand g =λK JJ βJ −K JI (y I −o I ).(6)P and g are also the (generalized)Hessian and gradient of the objective function (4).Because the loss function is piecewise quadratic,Newton method converges in a finite number of iterations.The number of iterations required to converge to the exact solution of (4)is usually very small (less than 5).Some Matlab code is available online at http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/primal .2.2Updating the HessianAs already pointed out in Section 1,we will mainly need to solve (4)in an incremental mode:4with the solution βJ of (4)already available,solve (4)again,but with one more basis function added,i.e.,J incremented by one.Keerthi and DeCoste (2005)show that the Newton method is very efficient4.In our method basis functions are added one at a time.K 
EERTHI,C HAPELLE AND D E C OSTEfor such seeding situations.Since the kernel matrix is dense,we maintain and update a Cholesky factorization of P,the Hessian defined in(6).Even with Jfixed,during the course of solving(4) via the Newton method,P will undergo changes due to changes in I.Efficient rank one schemes can be used to do the updating of the Cholesky factorization(Seeger,2004).The updatings of the factorization of P that need to be done because of changes in I are not going to be expensive because such changes mostly occur when J is small;when J is large,I usually undergoes very small changes since the set of training errors is rather well identified by that stage.Of course P and its factorization will also undergo changes(their dimensions increase by one)each time an element is added to J. This is a routine updating operation that is present in most forward selection methods.2.3Computational ComplexityIt is useful to ask:what is the complexity of the incremental computations needed to solve(4) when its solution is available for some J,at which point one more basis element is included in it and we want to re-solve(4)?In the best case,when the support vector set I does not change,the cost is mainly the following:computing the new row and column of K JJ(d+1kernel evaluations); computing the new row of K JI(n kernel computations);5computing the new elements of P(O(nd) cost);and the updating of the factorization of P(O(d2)cost).Thus the cost can be summarized as: (n+d+1)kernel evaluations and O(nd)cost.Even when I does change and so the cost is more, it is reasonable to take the above mentioned cost summary as a good estimate of the cost of the incremental work.Adding up these costs till d max basis functions are selected,we get a complexity of O(nd2max).Note that this is the basic cost given that we already know the sequence of d max basis functions that are to be used.Thus,O(nd2max)is also the complexity of the method in which basis functions are chosen randomly.In the next section we discuss the problem of selecting the basis functions systematically and efficiently.3.Selection of New Basis ElementSuppose we have solved(4)and obtained the minimizerβJ.Obviously,the minimum value of the objective function in(4)(call it f J)is greater than or equal to f⋆,the optimal value of(1).If the difference between them is large we would like to continue on and include another basis function. Take one j∈J.How do we judge its value of inclusion?The best scoring mechanism is the following one.3.1Basis Selection Method1Include j in J,optimize(4)fully using(βJ,βj),andfind the improved value of the objective func-tion;call it˜f j.Choose the j that gives the least value of˜f j.We already analyzed in the earlier section that the cost of doing one basis element inclusion is O(nd).So,if we want to try all elements out-side J,the cost is O(n2d);the overall cost of such a method of selecting d max basis functions is O(n2d2max),which is much higher than the basic cost,O(nd2max)mentioned in the previous section. 
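Both selection methods rely on the Newton optimization of equations (4)-(6) sketched above. The fragment below is an illustrative numpy re-implementation of one Newton step over a fixed basis set J, not the authors' code; the RBF kernel, data, and λ are made up, and the line search of step 4 is omitted for brevity.

```python
import numpy as np

def newton_step(K, y, J, beta_J, lam):
    """One Newton step for the primal L2-SVM restricted to basis J (eqs. (4)-(6))."""
    o = K[:, J] @ beta_J                       # outputs o_i = K_{i,J} beta_J
    I = np.where(1.0 - y * o >= 0.0)[0]        # active set of margin violators
    K_JJ = K[np.ix_(J, J)]
    K_JI = K[np.ix_(J, I)]
    P = lam * K_JJ + K_JI @ K_JI.T             # (generalized) Hessian
    g = lam * K_JJ @ beta_J - K_JI @ (y[I] - o[I])   # gradient
    return beta_J - np.linalg.solve(P, g)      # full step; the paper adds a line search

# Tiny illustrative run on made-up data with an RBF kernel.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(15, 2) + 1, rng.randn(15, 2) - 1])
y = np.array([1.0] * 15 + [-1.0] * 15)
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

J = [0, 15, 7, 22]                             # an arbitrarily chosen basis set
beta_J = np.zeros(len(J))
for _ in range(5):                             # Newton typically converges in a few steps
    beta_J = newton_step(K, y, J, beta_J, lam=0.1)
```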
Instead,if we work only with a random subset of sizeκchosen from outside J,then the cost in one basis selection step comes down to O(κnd),and the overall cost is limited to O(κnd2max).Smola and Bartlett(2001)have successfully tried such random subset choices for Gaussian process regression, usingκ=59.However,note that,even with this scheme,the cost of new basis selection(O(κnd)) 5.In fact this is not n but the size of I.Since we do not know this size,we upper bound it by n.B UILDING SVM S WITH R EDUCEDC OMPLEXITYis still disproportionately higher(byκtimes)than the cost of actually including the newly selected basis function(O(nd)).Thus we would like to go for cheaper methods.3.2Basis Selection Method2This method computes a score for a new element j in O(n)time.The idea has a parallel in Vincent and Bengio’s work on Kernel Matching Pursuit(Vincent and Bengio,2002)for least squares loss functions.They have two methods called prefitting and backfitting;see equations(7),(3)and(6) of Vincent and Bengio(2002).6Their prefitting is parallel to Basis Selection Method1that we described earlier.The cheaper method that we suggest below is parallel to their backfitting idea. SupposeβJ is the solution of(4).Including a new element j and its corresponding variable,βj yields the problem of minimizingλ2(β⊤Jβj) K JJ K J jK jJ K j j βJβj+12n∑i=1max(0,1−y i(K iJβJ+K i jβj)2,(7)WefixβJ and optimize(7)using only the new variableβj and see how much improvement in the objective function is possible in order to define the score for the new element j.This one dimensional function is piecewise quadratic and can be minimized exactly in O(n log n) time by a dichotomy search on the different breakpoints.But,a very precise calculation of the scoring function is usually unnecessary.So,for practical solution we can simply do a few Newton-Raphson-type iterations on the derivative of the function and get a near optimal solution in O(n) time.Note that we also need to compute the vector K J j,which requires d kernel evaluations.Though this cost is subsumed in O(n),it is a factor to remember if kernel evaluations are expensive.If all j∈J are tried,then the complexity of selecting a new basis function is O(n2),which is disproportionately large compared to the cost of including the chosen basis function,which is O(nd).Like in Basis Selection Method1,we can simply chooseκrandom basis functions to try. 
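The cheap scoring step of Basis Selection Method 2 can be sketched as follows: with β_J held fixed, the candidate coefficient β_j is optimized by a few Newton-Raphson iterations on the one-dimensional piecewise-quadratic objective, and the estimated decrease in (7) is the candidate's score. This is a reconstruction from the description above, with illustrative names and no kernel cache, not the authors' implementation.

```python
import numpy as np

def score_candidate(k_j, K_Jj, K_jj, y, o, beta_J, lam, iters=3):
    """Score basis candidate j: optimize beta_j alone (beta_J fixed) by a few
    Newton-Raphson steps on the 1-D piecewise-quadratic objective (7)."""
    cross = float(K_Jj @ beta_J)               # beta_J^T K_{J,j}

    def obj(b):                                # objective (7) up to a constant in beta_J
        margins = 1.0 - y * (o + b * k_j)
        return lam * (b * cross + 0.5 * b * b * K_jj) \
               + 0.5 * np.sum(np.maximum(0.0, margins) ** 2)

    b = 0.0
    for _ in range(iters):
        margins = 1.0 - y * (o + b * k_j)
        active = margins > 0
        grad = lam * (cross + b * K_jj) - np.sum(y[active] * k_j[active] * margins[active])
        hess = lam * K_jj + np.sum(k_j[active] ** 2)
        b -= grad / hess
    return obj(0.0) - obj(b)                   # larger score = larger estimated improvement

# Usage (shapes only): k_j = K[:, j], K_Jj = K[J, j], K_jj = K[j, j],
# o = K[:, J] @ beta_J; evaluate kappa random candidates and keep the best score.
```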
If d max is specified,one can chooseκ=O(d max)without increasing the overall complexity beyond O(nd2max).More complex schemes incorporating a kernel cache can also be tried.3.3Kernel CachingFor upto medium size problems,say n<15,000,it is a good idea to have cache for the entire kernel matrix.If additional memory space is available and,say a Gaussian kernel is employed,then the values of x i−x j 2can also be cached;this will help significantly reduce the time associated with the tuning of hyperparameters.For larger problems,depending on memory space available,it is a good idea to cache as many as possible,full kernel rows corresponding to j that get tried,but do not get chosen for inclusion.It is possible that they get called in a later stage of the algorithm,at which time,this cache can be useful.It is also possible to think of variations of the method in which full kernel rows corresponding to a large set(as much that canfit into memory)of randomly chosen training basis is pre-computed and only these basis functions are considered for selection.3.4ShrinkingAs basis functions get added,the SVM solution w and the margin planes start stabilizing.If the number of support vectors form a small fraction of the training set,then,for a large fraction of 6.For least squares problems,Adler et al.(1996)had given the same ideas as Vincent and Bengio in earlier work.K EERTHI,C HAPELLE AND D E C OSTE(well-classified)training examples,we can easily conclude that they will probably never come into the active set I.Such training examples can be left out of the calculations without causing any undue harm.This idea of shrinking has been effectively used to speed-up SVM training(Joachims,1999; Platt,1998).3.5Experimental EvaluationWe now evaluate the performance of basis selection methods1and2(we will call them as SpSVM-1, SpSVM-2)on some sizable benchmark data sets.A full description of these data sets and the kernel functions used is given in the appendix.The value ofκ=59is used.To have a baseline,we also consider the method,Random in which the basis functions are chosen randomly.This is almost the same as the RSVM method(Lee and Mangasarian,2001;Lin and Lin,2003),the only difference being the regularizer(β⊤J K J,JβJ in(4)versus βJ 2in RSVM).For another baseline we consider the(more systematic)unsupervised learning method in which an incomplete Cholesky factorization with pivoting(Meijerink and van der V orst,1977;Bach and Jordan,2005)is used to choose basis functions.7For comparison we also include the GSVC method of Parrado-Hern´a ndez et al.(2003). 
This method,originally given for SVM hinge loss,uses the following heuristic criterion to select the next basis function j∗∈J:j∗=arg minj∈I,j∈J maxl∈J|K jl|(8)with the aim of encouraging new basis functions that are far from the basis functions that are already chosen;also,j is restricted only to the support vector indices(I in(5)).For a clean comparison with our methods,we implemented GSVC for SVMs using quadratic penalization,max(0,1−y i o i)2.We also tried another criterion,suggested to us by Alex Smola,that is more complex than(8):j∗=arg maxj∈I,j∈J(1−y j o j)2d2j(9)where d j is the distance(in feature space)of the j-th training point from the subspace spanned by the elements of J.This criterion is based on an upper bound on the improvement to the training cost function obtained by including the j-th basis function.It also makes sense intuitively as it selects basis functions that are both not well approximated by the others(large d j)and for which the error incurred is large.8Below,we will refer to this criterion as BH.It is worth noting that both(8)and (9)can be computed very efficiently.Figures1and2compare the six methods on six data sets.9Overall,SpSVM-1and SpSVM-2 give the best performance in terms of achieving good reduction of test error rate with respect to the number of basis functions.Although SpSVM-2slightly lags SpSVM-1in terms of performance in the early stages,it does equally well as more basis functions are added.Since SpSVM-2is significantly less expensive,it is the best method to use.Since SpSVM-1is quite cheap in the early stages,it is also appropriate to think of a hybrid method in which SpSVM-1is used in the early stages and,when it becomes expensive,switch to SpSVM-2.The other methods sometimes do well,but,overall,they are inferior in comparison to SpSVM-1and SpSVM-2.Interestingly,on the IJCNN and Vehicle data7.We also tried the method of Bach and Jordan(2005)which uses the training labels,but we noticed little improvement.8.Note that when the set of basis functions is not restricted,the optimalβsatisfiesλβi y i=max(0,1−y i o i).9.Mostfigures given in this paper appear in pairs of two plots.One plot gives test error rate as a function of the numberof basis functions,to see how effective the compression is.The other plot gives the test error rate as a function of CPU time,and is used to indicate the efficiency of the method.B UILDING SVM S WITH R EDUCEDC OMPLEXITYFigure1:Comparison of basis selection methods on Adult,IJCNN&Shuttle.On Shuttle some methods were terminated because of ill-conditioning in the matrix P in(6).K EERTHI,C HAPELLE AND D E C OSTEFigure2:Comparison of basis selection methods on M3V8,M3VOthers&Vehicle.sets,Cholesky,GSVC and BH are even inferior to Random.A possible explanation is as follows: these methods give preference to points that are furthest away in feature space from the points already selected.Thus,they are likely to select points which are outliers(far from the rest of the training points);but outliers are probably unsuitable points for expanding the decision function.As we mentioned in Section1,there also exist other greedy methods of kernel basis selection that are motivated by ideas from boosting.These methods are usually given in a setting different from that we consider:a set of(kernel)basis functions is given and a regularizer(such as β 1)is directly specified on the multiplier vectorβ.The method of Bennett et al.(2002)called MARK is given for least squares problems.It is close to the kernel matching pursuit method.We compare 
SpSVM-2with kernel matching pursuit and discuss MARK in Section5.The method of Bi et al. (2004)uses column generation ideas from linear and quadratic programming to select new basis functions and so it requires the solution of,both,the primal and dual problems.10Thus,the basis selection process is based on the sensitivity of the primal objective function to an incoming basis function.On the other hand,our SpSVM methods are based on computing an estimate of the de-crease in the primal objective function due to an incoming basis function;also,the dual solution is not needed.4.Hyperparameter TuningIn the actual design process,the values of hyperparameters need to be determined.This can be done using k-fold cross validation.Cross validation(CV)can also be used to choose d,the number of basis functions.Since the solution given by our method approaches the SVM solution as d becomes large,there is really no need to choose d at all.One can simply choose d to be as big a value as possible.But,to achieve good reduction in the classifier complexity(as well as computing time) it is a good idea to track the validation performance as a function of d and stop when this function becomes nearlyflat.We proceed as follows.First an appropriate value for d max is chosen.For a given choice of hyperparameters,the basis selection method(say,SpSVM-2)is then applied on each training set formed from the k-fold partitions till d max basis functions are chosen.This gives an estimate of the k-fold CV error for each value of d from1to d max.We choose d to be the number of basis functions that gives the lowest k-fold CV error.This computation can be repeated for each set of hyperparameter values and the best choice can be decided.Recall that,at stage d,our basis selection methods choose the(d+1)-th basis function from a set ofκrandom basis functions.To avoid the effects of this randomness on hyperparameter tuning, it is better to make thisκ-set to be dependent only on d.Thus,at stage d,the basis selection methods will choose the same set ofκrandom basis functions for all hyperparameter values.We applied the above ideas on11benchmark data sets from(R¨a tsch)using SpSVM-2as the basis selection method.The Gaussian kernel,k(x i,x j)=1+exp(−γ x i−x j 2)was used.The hyperparameters,λandγwere tuned using3-fold cross validation.The values,2i,i=−7,···,7 were used for each of these parameters.Ten different train-test partitions were tried to get an idea of the variability in generalization performance.We usedκ=25and d max=25.(The Titanic data set has three input variables,which are all binary;hence we set d max=8for this data set.) Table1(already introduced in Section1)gives the results.For comparison we also give the results for the SVM(solution of(1));in the case of SVM,the number of support vectors(n SV)is the 10.The CPLEX LP/QP solver is used to obtain these solutions.。

An Introduction to Support Vector Machines

(Slides from CSE 802, prepared by Martin Law, 9/13/2013.)

Example Transformation
- Consider the following transformation ...
- Define the kernel function K(x, y) as ...
  (input space -> feature space)

The Optimization Problem
- ... the problem to its dual
- This is a quadratic programming (QP) problem
- A global maximum of the α_i can always be found
- w can be recovered by ...

Two Class Problem: Linear Separable Case
- Many decision boundaries can separate these two classes
- Which one should we choose?

Characteristics of the Solution
- Many of the α_i are zero

Keywords: kernel methods, large margin classifiers, reproducing kernel Hilbert space, Gaussian process
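A quick numerical illustration of the last point, that many of the α_i are zero, using scikit-learn's SVC as a stand-in QP solver for the dual; the two-blob data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: linearly separable with high probability.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2) + 3, rng.randn(100, 2) - 3])
y = np.array([1] * 100 + [-1] * 100)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# alpha_i > 0 only for the support vectors; for all other points alpha_i = 0.
print("training points:          ", len(X))
print("support vectors (alpha>0):", clf.n_support_.sum())
```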

VectorVMS User Guide (for Suppliers)

VectorVMS User Guide for SuppliersBelow is a step by step guide on how to use VectorVMS, Sevenstep’s portal, to support the MSP program at the Commonwealth. VectorVMS also has a very comprehensive help section and user guides within their platform, so please feel free to reference their materials in addition to this guide. To access their reference materials, click the drop-down next your name and select Help.If you have any questions or concerns, please reach out to us at . Logging In / Dashboard Overview1.Login here.2.Type in your username, password, and i4625 for the organization key and click the Login button.a.If this is the first time logging in, you will need to reset your password.The below screenshot is what your dashboard could look like upon logging in (may vary slightly based on your configurations). NOTE: You can change your dashboard view.•If you have any tasks to complete, such as interviews to accept or engagements to accept, you will see a number to the right of the task under My Tasks.•Under Current Activity, you can see active requisitions, active candidates, interviews accepted, and engagements. If you click the green + icon, it will expand to show you the actual requisitions, candidates, interviews and engagements.•The Alerts (or calendar) will show you any of your candidates start/end dates, as well as interviews.•The black toolbar has additional options to select, as well as dropdowns with further options. Creating Users1.Click Create from the black toolbar on your dashboard and select User.2.Fill in the fields (required field(s) will have a red circle next to them). Click Save.3.The supplier will then need to notify the new user of their username and password.How to Change Your Password1.Click My Account under your name dropdown on your dashboard.2.Click Change User Password.3.The below will pop up.4.Type in your current password, your new password, update your password hint Q&A, i f you’d likeand Click Save.Submitting Candidates1.There are two ways to view active requisitions from your dashboard (by clicking ActiveRequisitions under Current Activity or by clicking View > Contingent Requisitions from the blacktoolbar on your dashboard). It will bring you to the requisition summary page.2.Find the requisition you’d like to submit a candidate(s) to and either click the requisition name orthe clipboard with green arrow icon and select Submit Candidate.a.If you click the requisition name instead of selecting Submit Candidate from the actiondropdown – you will need to select Submit Candidate from the clipboard with green arrowicon once you are in the requisition.3.Fill in the candidate submission fields and click Next.plete the candidate information fields (anything with a red circle is a required field) and ClickNext.a.If there are skills built into the requisition, you will be able to score the candidate you aresubmitting based on the requirements in the requisition. Click Next.5.Review Compliance Requirements and Click Next.plete Candidate Employment Status Fields and ensure the Candidate Vector ContactInformation fields are correct and Click Next.7.Update Candidate Rate Settings (Payment Basis and Pay Rate) and Click Next.8.Add the candidate resume (references, if applicable, and any other attachments) by clicking AddNew Attachment.9.After you click Add New Attachment, the below pops up. You can select what document you areattaching, a brief description, and select the file to attach. 
Click Save afterwards.10.You will then see the attachments in the candidate’s profile.11.Click Submit to submit the candidate for the open position.12.You will now see the candidate(s) you submitted to the requisition.Accepting or Rejecting Interview(s)1.When you have interviews to accept (or reject), you will see it on your dashboard (potentially intwo different ways depending on your dashboard view).2.Click either option to accept (or reject the interview).3.Click the candidate’s name and it will bring you to their profile.4.Click the interview tab, click the clipboard with the green arrow icon and either Accept or Rejectthe proposed interview.a.If you accept the interview, the below box will pop-up and you can add comments, such asconfirmed interview time with candidate. Click Submit after adding your comments(required field).b.After Clicking Submit, you will see this screen with the interview details.c.You can add the interview to your calendar by clicking the clipboard with green arrow iconand selecting Add to Calendar.d.If you reject the proposed interview time(s), the below box will pop-up and you can addcomments, such as reason why you are rejecting the interview and/or propose newinterview time(s). Click Submit after adding your comments (required field).Accepting (or Rejecting) Engagements1.When you have an engagement to accept (or reject), you will see it on your dashboard (potentiallyin two different ways depending on your dashboard view).2.Click either option to accept (or reject the engagement).3.Click the clipboard with the green arrow icon under Action and Select View Engagement.4.Select either Accept or Do Not Accept.a.If you Accept the Engagement, it will bring you to below screen and you can see thecandidate is Engaged.b.If you do not accept the Engagement, you will be able to add a comment has to why youdo not accept the Engagement.Proxy Timesheet1.To enter time on behalf of the contractor, Select Proxy Timesheet under the Create dropdown viayour black toolbar on your dashboard.2.Click the clipboard with the green arrow icon dropdown and select Enter Time.3.Input time and Click Submit.4.Timesheet has been successfully submitted.。

Support vector machine: A tool for mapping mineral prospectivity

Support vector machine:A tool for mapping mineral prospectivityRenguang Zuo a,n,Emmanuel John M.Carranza ba State Key Laboratory of Geological Processes and Mineral Resources,China University of Geosciences,Wuhan430074;Beijing100083,Chinab Department of Earth Systems Analysis,Faculty of Geo-Information Science and Earth Observation(ITC),University of Twente,Enschede,The Netherlandsa r t i c l e i n f oArticle history:Received17May2010Received in revised form3September2010Accepted25September2010Keywords:Supervised learning algorithmsKernel functionsWeights-of-evidenceTurbidite-hosted AuMeguma Terraina b s t r a c tIn this contribution,we describe an application of support vector machine(SVM),a supervised learningalgorithm,to mineral prospectivity mapping.The free R package e1071is used to construct a SVM withsigmoid kernel function to map prospectivity for Au deposits in western Meguma Terrain of Nova Scotia(Canada).The SVM classification accuracies of‘deposit’are100%,and the SVM classification accuracies ofthe‘non-deposit’are greater than85%.The SVM classifications of mineral prospectivity have5–9%lowertotal errors,13–14%higher false-positive errors and25–30%lower false-negative errors compared tothose of the WofE prediction.The prospective target areas predicted by both SVM and WofE reflect,nonetheless,controls of Au deposit occurrence in the study area by NE–SW trending anticlines andcontact zones between Goldenville and Halifax Formations.The results of the study indicate theusefulness of SVM as a tool for predictive mapping of mineral prospectivity.&2010Elsevier Ltd.All rights reserved.1.IntroductionMapping of mineral prospectivity is crucial in mineral resourcesexploration and mining.It involves integration of information fromdiverse geoscience datasets including geological data(e.g.,geologicalmap),geochemical data(e.g.,stream sediment geochemical data),geophysical data(e.g.,magnetic data)and remote sensing data(e.g.,multispectral satellite data).These sorts of data can be visualized,processed and analyzed with the support of computer and GIStechniques.Geocomputational techniques for mapping mineral pro-spectivity include weights of evidence(WofE)(Bonham-Carter et al.,1989),fuzzy WofE(Cheng and Agterberg,1999),logistic regression(Agterberg and Bonham-Carter,1999),fuzzy logic(FL)(Ping et al.,1991),evidential belief functions(EBF)(An et al.,1992;Carranza andHale,2003;Carranza et al.,2005),neural networks(NN)(Singer andKouda,1996;Porwal et al.,2003,2004),a‘wildcat’method(Carranza,2008,2010;Carranza and Hale,2002)and a hybrid method(e.g.,Porwalet al.,2006;Zuo et al.,2009).These techniques have been developed toquantify indices of occurrence of mineral deposit occurrence byintegrating multiple evidence layers.Some geocomputational techni-ques can be performed using popular software packages,such asArcWofE(a free ArcView extension)(Kemp et al.,1999),ArcSDM9.3(afree ArcGIS9.3extension)(Sawatzky et al.,2009),MI-SDM2.50(aMapInfo extension)(Avantra Geosystems,2006),GeoDAS(developedbased on MapObjects,which is an Environmental Research InstituteDevelopment Kit)(Cheng,2000).Other geocomputational techniques(e.g.,FL and NN)can be performed by using R and Matlab.Geocomputational techniques for mineral prospectivity map-ping can be categorized generally into two types–knowledge-driven and data-driven–according to the type of inferencemechanism considered(Bonham-Carter1994;Pan and Harris2000;Carranza2008).Knowledge-driven techniques,such as thosethat apply FL and EBF,are based on expert knowledge 
andexperience about spatial associations between mineral prospec-tivity criteria and mineral deposits of the type sought.On the otherhand,data-driven techniques,such as WofE and NN,are based onthe quantification of spatial associations between mineral pro-spectivity criteria and known occurrences of mineral deposits ofthe type sought.Additional,the mixing of knowledge-driven anddata-driven methods also is used for mapping of mineral prospec-tivity(e.g.,Porwal et al.,2006;Zuo et al.,2009).Every geocomputa-tional technique has advantages and disadvantages,and one or theother may be more appropriate for a given geologic environmentand exploration scenario(Harris et al.,2001).For example,one ofthe advantages of WofE is its simplicity,and straightforwardinterpretation of the weights(Pan and Harris,2000),but thismodel ignores the effects of possible correlations amongst inputpredictor patterns,which generally leads to biased prospectivitymaps by assuming conditional independence(Porwal et al.,2010).Comparisons between WofE and NN,NN and LR,WofE,NN and LRfor mineral prospectivity mapping can be found in Singer andKouda(1999),Harris and Pan(1999)and Harris et al.(2003),respectively.Mapping of mineral prospectivity is a classification process,because its product(i.e.,index of mineral deposit occurrence)forevery location is classified as either prospective or non-prospectiveaccording to certain combinations of weighted mineral prospec-tivity criteria.There are two types of classification techniques.Contents lists available at ScienceDirectjournal homepage:/locate/cageoComputers&Geosciences0098-3004/$-see front matter&2010Elsevier Ltd.All rights reserved.doi:10.1016/j.cageo.2010.09.014n Corresponding author.E-mail addresses:zrguang@,zrguang1981@(R.Zuo).Computers&Geosciences](]]]])]]]–]]]One type is known as supervised classification,which classifies mineral prospectivity of every location based on a training set of locations of known deposits and non-deposits and a set of evidential data layers.The other type is known as unsupervised classification, which classifies mineral prospectivity of every location based solely on feature statistics of individual evidential data layers.A support vector machine(SVM)is a model of algorithms for supervised classification(Vapnik,1995).Certain types of SVMs have been developed and applied successfully to text categorization, handwriting recognition,gene-function prediction,remote sensing classification and other studies(e.g.,Joachims1998;Huang et al.,2002;Cristianini and Scholkopf,2002;Guo et al.,2005; Kavzoglu and Colkesen,2009).An SVM performs classification by constructing an n-dimensional hyperplane in feature space that optimally separates evidential data of a predictor variable into two categories.In the parlance of SVM literature,a predictor variable is called an attribute whereas a transformed attribute that is used to define the hyperplane is called a feature.The task of choosing the most suitable representation of the target variable(e.g.,mineral prospectivity)is known as feature selection.A set of features that describes one case(i.e.,a row of predictor values)is called a feature vector.The feature vectors near the hyperplane are the support feature vectors.The goal of SVM modeling is tofind the optimal hyperplane that separates clusters of feature vectors in such a way that feature vectors representing one category of the target variable (e.g.,prospective)are on one side of the plane and feature vectors representing the other category of the target 
variable(e.g.,non-prospective)are on the other size of the plane.A good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both categories,since in general the larger the margin the better the generalization error of the classifier.In this paper,SVM is demonstrated as an alternative tool for integrating multiple evidential variables to map mineral prospectivity.2.Support vector machine algorithmsSupport vector machines are supervised learning algorithms, which are considered as heuristic algorithms,based on statistical learning theory(Vapnik,1995).The classical task of a SVM is binary (two-class)classification.Suppose we have a training set composed of l feature vectors x i A R n,where i(¼1,2,y,n)is the number of feature vectors in training samples.The class in which each sample is identified to belong is labeled y i,which is equal to1for one class or is equal toÀ1for the other class(i.e.y i A{À1,1})(Huang et al., 2002).If the two classes are linearly separable,then there exists a family of linear separators,also called separating hyperplanes, which satisfy the following set of equations(KavzogluandFig.1.Support vectors and optimum hyperplane for the binary case of linearly separable data sets.Table1Experimental data.yer A Layer B Layer C Layer D Target yer A Layer B Layer C Layer D Target1111112100000 2111112200000 3111112300000 4111112401000 5111112510000 6111112600000 7111112711100 8111112800000 9111012900000 10111013000000 11101113111100 12111013200000 13111013300000 14111013400000 15011013510000 16101013600000 17011013700000 18010113811100 19010112900000 20101014010000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]2Colkesen,2009)(Fig.1):wx iþb Zþ1for y i¼þ1wx iþb rÀ1for y i¼À1ð1Þwhich is equivalent toy iðwx iþbÞZ1,i¼1,2,...,nð2ÞThe separating hyperplane can then be formalized as a decision functionfðxÞ¼sgnðwxþbÞð3Þwhere,sgn is a sign function,which is defined as follows:sgnðxÞ¼1,if x400,if x¼0À1,if x o08><>:ð4ÞThe two parameters of the separating hyperplane decision func-tion,w and b,can be obtained by solving the following optimization function:Minimize tðwÞ¼12J w J2ð5Þsubject toy Iððwx iÞþbÞZ1,i¼1,...,lð6ÞThe solution to this optimization problem is the saddle point of the Lagrange functionLðw,b,aÞ¼1J w J2ÀX li¼1a iðy iððx i wÞþbÞÀ1Þð7Þ@ @b Lðw,b,aÞ¼0@@wLðw,b,aÞ¼0ð8Þwhere a i is a Lagrange multiplier.The Lagrange function is minimized with respect to w and b and is maximized with respect to a grange multipliers a i are determined by the following optimization function:MaximizeX li¼1a iÀ12X li,j¼1a i a j y i y jðx i x jÞð9Þsubject toa i Z0,i¼1,...,l,andX li¼1a i y i¼0ð10ÞThe separating rule,based on the optimal hyperplane,is the following decision function:fðxÞ¼sgnX li¼1y i a iðxx iÞþb!ð11ÞMore details about SVM algorithms can be found in Vapnik(1995) and Tax and Duin(1999).3.Experiments with kernel functionsFor spatial geocomputational analysis of mineral exploration targets,the decision function in Eq.(3)is a kernel function.The choice of a kernel function(K)and its parameters for an SVM are crucial for obtaining good results.The kernel function can be usedTable2Errors of SVM classification using linear kernel functions.l Number ofsupportvectors Testingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.2580.00.00.0180.00.00.0 1080.00.00.0 10080.00.00.0 100080.00.00.0Table3Errors of SVM classification using polynomial kernel functions when d¼3and r¼0. 
l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25120.00.00.0160.00.00.01060.00.00.010060.00.00.0 100060.00.00.0Table4Errors of SVM classification using polynomial kernel functions when l¼0.25,r¼0.d Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)11110.00.0 5.010290.00.00.0100230.045.022.5 1000200.090.045.0Table5Errors of SVM classification using polynomial kernel functions when l¼0.25and d¼3.r Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0120.00.00.01100.00.00.01080.00.00.010080.00.00.0 100080.00.00.0Table6Errors of SVM classification using radial kernel functions.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25140.00.00.01130.00.00.010130.00.00.0100130.00.00.0 1000130.00.00.0Table7Errors of SVM classification using sigmoid kernel functions when r¼0.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25400.00.00.01400.035.017.510400.0 6.0 3.0100400.0 6.0 3.0 1000400.0 6.0 3.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]3to construct a non-linear decision boundary and to avoid expensive calculation of dot products in high-dimensional feature space.The four popular kernel functions are as follows:Linear:Kðx i,x jÞ¼l x i x j Polynomial of degree d:Kðx i,x jÞ¼ðl x i x jþrÞd,l40Radial basis functionðRBFÞ:Kðx i,x jÞ¼exp fÀl99x iÀx j992g,l40 Sigmoid:Kðx i,x jÞ¼tanhðl x i x jþrÞ,l40ð12ÞThe parameters l,r and d are referred to as kernel parameters. The parameter l serves as an inner product coefficient in the polynomial function.In the case of the RBF kernel(Eq.(12)),l determines the RBF width.In the sigmoid kernel,l serves as an inner product coefficient in the hyperbolic tangent function.The parameter r is used for kernels of polynomial and sigmoid types. 
The parameter d is the degree of a polynomial function.We performed some experiments to explore the performance of the parameters used in a kernel function.The dataset used in the experiments(Table1),which are derived from the study area(see below),were compiled according to the requirementfor Fig.2.Simplified geological map in western Meguma Terrain of Nova Scotia,Canada(after,Chatterjee1983;Cheng,2008).Table8Errors of SVM classification using sigmoid kernel functions when l¼0.25.r Number ofSupportVectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0400.00.00.01400.00.00.010400.00.00.0100400.00.00.01000400.00.00.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]4classification analysis.The e1071(Dimitriadou et al.,2010),a freeware R package,was used to construct a SVM.In e1071,the default values of l,r and d are1/(number of variables),0and3,respectively.From the study area,we used40geological feature vectors of four geoscience variables and a target variable for classification of mineral prospec-tivity(Table1).The target feature vector is either the‘non-deposit’class(or0)or the‘deposit’class(or1)representing whether mineral exploration target is absent or present,respectively.For‘deposit’locations,we used the20known Au deposits.For‘non-deposit’locations,we randomly selected them according to the following four criteria(Carranza et al.,2008):(i)non-deposit locations,in contrast to deposit locations,which tend to cluster and are thus non-random, must be random so that multivariate spatial data signatures are highly non-coherent;(ii)random non-deposit locations should be distal to any deposit location,because non-deposit locations proximal to deposit locations are likely to have similar multivariate spatial data signatures as the deposit locations and thus preclude achievement of desired results;(iii)distal and random non-deposit locations must have values for all the univariate geoscience spatial data;(iv)the number of distal and random non-deposit locations must be equaltoFig.3.Evidence layers used in mapping prospectivity for Au deposits(from Cheng,2008):(a)and(b)represent optimum proximity to anticline axes(2.5km)and contacts between Goldenville and Halifax formations(4km),respectively;(c)and(d)represent,respectively,background and anomaly maps obtained via S-Afiltering of thefirst principal component of As,Cu,Pb and Zn data.R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]5the number of deposit locations.We used point pattern analysis (Diggle,1983;2003;Boots and Getis,1988)to evaluate degrees of spatial randomness of sets of non-deposit locations and tofind distance from any deposit location and corresponding probability that one deposit location is situated next to another deposit location.In the study area,we found that the farthest distance between pairs of Au deposits is71km,indicating that within that distance from any deposit location in there is100%probability of another deposit location. 
However,few non-deposit locations can be selected beyond71km of the individual Au deposits in the study area.Instead,we selected random non-deposit locations beyond11km from any deposit location because within this distance from any deposit location there is90% probability of another deposit location.When using a linear kernel function and varying l from0.25to 1000,the number of support vectors and the testing errors for both ‘deposit’and‘non-deposit’do not vary(Table2).In this experiment the total error of classification is0.0%,indicating that the accuracy of classification is not sensitive to the choice of l.With a polynomial kernel function,we tested different values of l, d and r as follows.If d¼3,r¼0and l is increased from0.25to1000,the number of support vectors decreases from12to6,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table3).If l¼0.25, r¼0and d is increased from1to1000,the number of support vectors firstly increases from11to29,then decreases from23to20,the testing error for‘non-deposit’decreases from10.0%to0.0%,whereas the testing error for‘deposit’increases from0.0%to90%(Table4). In this experiment,the total error of classification is minimum(0.0%) when d¼10(Table4).If l¼0.25,d¼3and r is increased from 0to1000,the number of support vectors decreases from12to8,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table5).When using a radial kernel function and varying l from0.25to 1000,the number of support vectors decreases from14to13,but the testing errors of‘deposit’and‘non-deposit’remain nil(Table6).With a sigmoid kernel function,we experimented with different values of l and r as follows.If r¼0and l is increased from0.25to1000, the number of support vectors is40,the testing errors for‘non-deposit’do not change,but the testing error of‘deposit’increases from 0.0%to35.0%,then decreases to6.0%(Table7).In this experiment,the total error of classification is minimum at0.0%when l¼0.25 (Table7).If l¼0.25and r is increased from0to1000,the numbers of support vectors and the testing errors of‘deposit’and‘non-deposit’do not change and the total error remains nil(Table8).The results of the experiments demonstrate that,for the datasets in the study area,a linear kernel function,a polynomial kernel function with d¼3and r¼0,or l¼0.25,r¼0and d¼10,or l¼0.25and d¼3,a radial kernel function,and a sigmoid kernel function with r¼0and l¼0.25are optimal kernel functions.That is because the testing errors for‘deposit’and‘non-deposit’are0%in the SVM classifications(Tables2–8).Nevertheless,a sigmoid kernel with l¼0.25and r¼0,compared to all the other kernel functions,is the most optimal kernel function because it uses all the input support vectors for either‘deposit’or‘non-deposit’(Table1)and the training and testing errors for‘deposit’and‘non-deposit’are0% in the SVM classification(Tables7and8).4.Prospectivity mapping in the study areaThe study area is located in western Meguma Terrain of Nova Scotia,Canada.It measures about7780km2.The host rock of Au deposits in this area consists of Cambro-Ordovician low-middle grade metamorphosed sedimentary rocks and a suite of Devonian aluminous granitoid intrusions(Sangster,1990;Ryan and Ramsay, 1997).The metamorphosed sedimentary strata of the Meguma Group are the lower sand-dominatedflysch Goldenville Formation and the upper shalyflysch Halifax Formation occurring in the central part of the study area.The igneous rocks occur mostly in the northern part of the study area(Fig.2).In this area,20turbidite-hosted Au deposits and 
occurrences (Ryan and Ramsay, 1997) are found in the Meguma Group, especially near the contact zones between the Goldenville and Halifax Formations (Chatterjee, 1983). The major Au mineralization-related geological features are the contact zones between the Goldenville and Halifax Formations, NE–SW trending anticline axes and NE–SW trending shear zones (Sangster, 1990; Ryan and Ramsay, 1997). This dataset has been used to test many mineral prospectivity mapping algorithms (e.g., Agterberg, 1989; Cheng, 2008). More details about the geological settings and datasets in this area can be found in Xu and Cheng (2001).

We used four evidence layers (Fig. 3) derived and used by Cheng (2008) for mapping prospectivity for Au deposits in the study area. Layers A and B represent optimum proximity to anticline axes (2.5 km) and optimum proximity to contacts between the Goldenville and Halifax Formations (4 km), respectively. Layers C and D represent variations in geochemical background and anomaly, respectively, as modeled by multifractal filter mapping of the first principal component of As, Cu, Pb, and Zn data. Details of how the four evidence layers were obtained can be found in Cheng (2008).

4.1. Training dataset

The application of SVM requires two subsets of training locations: one training subset of 'deposit' locations representing presence of mineral deposits, and a training subset of 'non-deposit' locations representing absence of mineral deposits. The value of y_i is 1 for 'deposits' and -1 for 'non-deposits'. For 'deposit' locations, we used the 20 known Au deposits (the sixth column of Table 1). For 'non-deposit' locations (last column of Table 1), we obtained two 'non-deposit' datasets (Tables 9 and 10) according to the above-described selection criteria (Carranza et al., 2008). We combined the 'deposits' dataset with each of the two 'non-deposit' datasets to obtain two training datasets. Each training dataset contains the same 20 known Au deposits but a different set of 20 randomly selected non-deposits (Fig. 4).

4.2. Application of SVM

Using the software e1071, separate SVMs, both with a sigmoid kernel with l = 0.25 and r = 0, were constructed using the two training datasets. With training dataset 1, the classification accuracies for 'non-deposits' and 'deposits' are 95% and 100%, respectively; with training dataset 2, the classification accuracies for 'non-deposits' and 'deposits' are 85% and 100%, respectively. The total classification accuracies using the two training datasets are 97.5% and 92.5%, respectively. The patterns of the predicted prospective target areas for Au deposits (Fig. 5) are defined mainly by proximity to NE–SW trending anticlines and proximity to contact zones between the Goldenville and Halifax Formations. This indicates that 'geology' is better than 'geochemistry' as evidence of prospectivity for Au deposits in this area. With training dataset 1, the predicted prospective target areas occupy 32.6% of the study area and contain 100% of the known Au deposits (Fig. 5a). With training dataset 2, the predicted prospective target areas occupy 33.3% of the study area and contain 95.0% of the known Au deposits (Fig. 5b). In contrast, using the same datasets, the prospective target areas predicted via WofE occupy 19.3% of the study area and contain 70.0% of the known Au deposits (Cheng, 2008).

Table 9. The values of the four evidence layers (Layers A–D) at the 20 locations of 'non-deposit' dataset 1.
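A minimal sketch of the classification step reported above, assuming scikit-learn's libsvm wrapper in place of e1071: fit a sigmoid-kernel SVM with the same settings (l = 0.25 as gamma, r = 0 as coef0) on 20 'deposit' and 20 'non-deposit' locations, then read the per-class accuracies off the confusion matrix. The random feature matrix is only a placeholder for the Table 1 data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))              # four evidence-layer values per location
y = np.array([1] * 20 + [0] * 20)         # 1 = 'deposit', 0 = 'non-deposit'

clf = SVC(kernel="sigmoid", gamma=0.25, coef0=0.0).fit(X, y)
cm = confusion_matrix(y, clf.predict(X), labels=[0, 1])

acc_non_deposit = cm[0, 0] / cm[0].sum()  # accuracy on 'non-deposit' locations
acc_deposit = cm[1, 1] / cm[1].sum()      # accuracy on 'deposit' locations
total_accuracy = np.trace(cm) / cm.sum()
print(acc_non_deposit, acc_deposit, total_accuracy)
```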
The error matrices for the two SVM classifications show that the type 1 (false-positive) and type 2 (false-negative) errors based on training dataset 1 (Table 11) and training dataset 2 (Table 12) are 32.6% and 0%, and 33.3% and 5%, respectively. The total errors for the two SVM classifications are 16.3% and 19.15% based on training datasets 1 and 2, respectively. In contrast, the type 1 and type 2 errors for the WofE prediction are 19.3% and 30% (Table 13), respectively, and the total error for the WofE prediction is 24.65%.

The results show that the total errors of the SVM classifications are 5–9% lower than the total error of the WofE prediction. The 13–14% higher false-positive errors of the SVM classifications compared to that of the WofE prediction suggest that the SVM classifications result in larger prospective areas that may not contain undiscovered deposits. However, the 25–30% higher false-negative error of the WofE prediction compared to those of the SVM classifications suggests that the WofE analysis results in larger non-prospective areas that may contain undiscovered deposits. Certainly, in mineral exploration the intentions are not to miss undiscovered deposits (i.e., avoid false-negative error) and to minimize exploration cost in areas that may not really contain undiscovered deposits (i.e., keep false-positive error as low as possible). Thus, the results suggest the superiority of the SVM classifications over the WofE prediction.

5. Conclusions

Nowadays, SVMs have become a popular geocomputational tool for spatial analysis. In this paper, we used an SVM algorithm to integrate multiple variables for mineral prospectivity mapping. The results obtained by two SVM applications demonstrate that prospective target areas for Au deposits are defined mainly by proximity to NE–SW trending anticlines and to contact zones between the Goldenville and Halifax Formations. In the study area, the SVM classifications of mineral prospectivity have 5–9% lower total errors, 13–14% higher false-positive errors and 25–30% lower false-negative errors compared to those of the WofE prediction. These results indicate that SVM is a potentially useful tool for integrating multiple evidence layers in mineral prospectivity mapping.

Fig. 4. The locations of 'deposit' and 'non-deposit'.

Table 10. The values of the four evidence layers (Layers A–D) at the 20 locations of 'non-deposit' dataset 2.

Table 11. Error matrix for SVM classification using training dataset 1: the 'deposit' prediction covers 100% of all 'deposits' and 32.6% of all 'non-deposits'; the 'non-deposit' prediction covers 0% of 'deposits' and 67.4% of 'non-deposits'. Type 1 (false-positive) error = 32.6; type 2 (false-negative) error = 0; total error = 16.3. Note: values in the matrix are percentages of 'deposit' and 'non-deposit' locations.

Table 12. Error matrix for SVM classification using training dataset 2: the 'deposit' prediction covers 95% of all 'deposits' and 33.3% of all 'non-deposits'; the 'non-deposit' prediction covers 5% of 'deposits' and 66.7% of 'non-deposits'. Type 1 (false-positive) error = 33.3; type 2 (false-negative) error = 5; total error = 19.15.

Table 13. Error matrix for WofE prediction: the 'deposit' prediction covers 70% of all 'deposits' and 19.3% of all 'non-deposits'; the 'non-deposit' prediction covers 30% of 'deposits' and 80.7% of 'non-deposits'. Type 1 (false-positive) error = 19.3; type 2 (false-negative) error = 30; total error = 24.65.
Fig. 5. Prospective target areas for Au deposits delineated by SVM: (a) and (b) are obtained using training datasets 1 and 2, respectively.

Notes on libsvm parameters

Contents: 1. Introduction to LIBSVM; 2. LIBSVM parameters; 3. How to use LIBSVM; 4. Application scenarios; 5. Summary.

1. Introduction to LIBSVM

LIBSVM (Library for Support Vector Machines) is an open-source support vector machine (SVM) library that provides a collection of efficient algorithms for solving classification and regression problems. LIBSVM was developed by Chih-Chung Chang and Chih-Jen Lin and is one of the best-known and most widely used tools in the SVM field.

2. LIBSVM parameters

LIBSVM exposes many parameters, and they affect the performance of the model. Some of the most important ones are:

- "kernel": the kernel type, e.g. "linear" (linear kernel) or "rbf" (Gaussian radial basis function kernel).
- "C": the penalty parameter C, which controls how closely the model fits the training data. A smaller C gives a looser margin that may allow some misclassifications but can improve generalization; a larger C forces the model to minimize the training error as much as possible and may lead to overfitting.
- "degree": the degree of the polynomial kernel.
- "gamma": the coefficient of the Gaussian kernel, which controls the shape of the kernel.
- "coef0": the independent term used by the polynomial and sigmoid kernels.
- "cache_size": the amount of memory, in MB, that LIBSVM uses for its kernel cache.

3. How to use LIBSVM

Using LIBSVM involves the following main steps:

(1) Data preparation: split the dataset into a feature matrix X and a target vector y.
(2) Model training: call LIBSVM's train function with the prepared data and the chosen parameter settings to train an SVM model.
(3) Prediction: with the trained model, call LIBSVM's predict function to classify new data.
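A minimal sketch of this workflow, using scikit-learn's SVC, which wraps libsvm and exposes the same parameters described above (kernel, C, degree, gamma, coef0, cache_size); the toy arrays stand in for a real feature matrix X and target vector y.

```python
import numpy as np
from sklearn.svm import SVC

# (1) Data preparation: feature matrix X and target vector y.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 1, 0, 1])

clf = SVC(
    kernel="rbf",      # 'linear', 'poly', 'rbf' (Gaussian) or 'sigmoid'
    C=1.0,             # penalty parameter: larger C fits the training data harder
    degree=3,          # only used by the polynomial kernel
    gamma="scale",     # rbf/poly/sigmoid kernel coefficient
    coef0=0.0,         # independent term of the poly/sigmoid kernels
    cache_size=200,    # kernel cache size in MB
)

clf.fit(X, y)                      # (2) Model training
print(clf.predict([[0.9, 0.9]]))   # (3) Prediction on new data
```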

Notes on "Text Categorization with Support Vector Machines"

Text Categorization with Support Vector Machines: Learning with Many Relevant
Features
Thorsten Joachims
Universität Dortmund, Informatik LS8, Baroper Str. 301
This representation scheme leads to very high-dimensional feature spaces containing 10,000 dimensions and more. Many have noted the need for feature selection to make the use of conventional learning methods possible, to improve generalization accuracy, and to avoid "overfitting". Following the recommendation of [11], the information gain criterion will be used in this paper to select a subset of features.
1 Introduction
With the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. Text categorization techniques are used to classify news stories, to find interesting information on the WWW, and to guide a user's search through hypertext. Since building text classifiers by hand is difficult and time-consuming, it is advantageous to learn classifiers from examples.
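As a hedged sketch of the preprocessing described in this excerpt (bag-of-words features followed by information-gain feature selection), the snippet below scores each term by the mutual information between its presence and the class label, which is one common way to compute information gain, and keeps the top-ranked terms. The tiny corpus, labels and the value of k are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["cheap stocks rise", "stocks fall on earnings",
        "new striker signs", "team wins the cup"]
labels = np.array([0, 0, 1, 1])                 # 0 = finance, 1 = sport

vec = CountVectorizer()
X = (vec.fit_transform(docs) > 0).astype(int)   # binary term-presence features
gain = mutual_info_classif(X, labels, discrete_features=True, random_state=0)

terms = np.array(vec.get_feature_names_out())
selected = terms[np.argsort(gain)[::-1][:5]]    # keep the 5 highest-scoring terms
print(selected)
```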

Incremental Support Vector Machine Learning: a Local Approach

Liva Ralaivola and Florence d'Alché-Buc
Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, 8, rue du Capitaine Scott, F-75015 Paris, France
liva.ralaivola, florence.dalche@lip6.fr

Abstract. In this paper, we propose and study a new on-line algorithm for learning a SVM based on a Radial Basis Function kernel: Local Incremental Learning of SVM, or LISVM. Our method exploits the "locality" of RBF kernels to update the current machine by only considering a subset of support candidates in the neighbourhood of the input. The determination of this subset is conditioned by the computation of the variation of the error estimate. The implementation is based on the SMO algorithm, introduced and developed by Platt [13]. We study the behaviour of the algorithm during learning when using different generalization error estimates. Experiments on three data sets (batch problems transformed into on-line ones) have been conducted and analyzed.

1 Introduction

The emergence of smart portable systems and the daily growth of databases on the Web have revived the old problem of incremental and on-line learning. Meanwhile, advances in statistical learning have placed Support Vector Machines (SVM) among the most powerful families of learners (see [6,16]). Their specificity lies in three characteristics: SVM maximizes a soft margin criterion, the major parameters of SVM (support vectors) are taken from the training sample, and non linear SVM are based on the use of kernels to deal with high dimensional feature spaces without directly working in them.

However, few works tackle the issue of incremental learning of SVM. One of the main reasons lies in the nature of the optimization problem posed by SVM learning. Although there exist some very recent works that propose ways to update an SVM each time new data are available [4,11,15], they generally imply re-learning the whole machine. The work presented here starts from another motivation: since the principal parameters of SVM are the training points themselves, and as far as a local kernel such as a gaussian kernel is used, it is possible to focus learning only on a neighbourhood of the new data and update the weights of the concerned training data.

In this paper, we briefly present the key idea of SVM and then introduce the incremental learning problem. The state of the art is shortly presented and discussed. Then, we present the local incremental algorithm, or LISVM, and discuss the model selection method used to determine the size of the neighbourhood at each step. Numerical simulations on IDA benchmark datasets [14] are presented and analyzed.

2 Support Vector Machines

Given a training set, support vector learning [6] tries to find a hyperplane with minimal norm that separates the data mapped into a feature space via a nonlinear map φ, where d denotes the dimension of the input vectors. To construct such a hyperplane, one must solve the following quadratic problem [3]:
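The program itself is not reproduced in this excerpt; as a hedged sketch rather than a quotation of [3], the standard soft-margin formulation one would expect here, and its kernelized dual, can be written as:

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\frac{1}{2}\lVert\mathbf{w}\rVert^{2}+C\sum_{i=1}^{\ell}\xi_{i}
\quad\text{s.t.}\quad
y_{i}\bigl(\langle\mathbf{w},\phi(\mathbf{x}_{i})\rangle+b\bigr)\ \ge\ 1-\xi_{i},\qquad \xi_{i}\ \ge\ 0,
```

and, in the dual variables alpha_i with kernel K(x_i, x_j) = <phi(x_i), phi(x_j)>,

```latex
\max_{\boldsymbol{\alpha}}\;\sum_{i=1}^{\ell}\alpha_{i}-\frac{1}{2}\sum_{i,j=1}^{\ell}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,K(\mathbf{x}_{i},\mathbf{x}_{j})
\quad\text{s.t.}\quad
0\ \le\ \alpha_{i}\ \le\ C,\qquad \sum_{i=1}^{\ell}\alpha_{i}y_{i}=0 .
```

The resulting decision function has the kernel expansion form f(x) = sum_i alpha_i y_i K(x_i, x) + b, which is the form the local update described in Section 4 builds on.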
… property that support vectors summarize well the data, and has been tested against some standard learning machine datasets to evaluate goodness criteria such as stability, improvement and recoverability. Finally, a very recent work [4] proposes a way to incrementally solve the global optimization problem in order to find the exact solution. Its reversible aspect allows "decremental" unlearning and the efficient computation of leave-one-out estimations.

4 Local Incremental Learning of a Support Vector Machine

We first consider SVM as a voting machine that combines the outputs of experts, each of which is associated with a support vector in the input space. When using an RBF kernel, or any kernel that is based upon the notion of neighbourhood, the influence of a support vector concerns only a limited area, with a high degree. Then, in the framework of on-line learning, when a new example is available it should not be necessary to re-consider all the current experts, but only those which are concerned by the localization of the input. To some extent, the proposed algorithm is linked with the work of Bottou and Vapnik [1] on local learning algorithms.

4.1 Algorithm

We sketch the updating procedure used to build the new hypothesis from the current one when an incoming example is to be learned, given the set of instances learned so far:

1. Initialize the lagrangian multiplier of the new point to zero.
2. If the point is well classified, then terminate (take the current machine as the new hypothesis).
3. Build a working subset of size 2 with the new point and its nearest example in input space.
4. Learn a candidate hypothesis by optimizing the quadratic problem on the examples in the working subset.
5. If the generalization error estimation of the candidate hypothesis is above a given threshold, increase the working subset by adding the next closest point not yet in the current subset and return to step 4.
6. Take the candidate hypothesis as the new hypothesis.

We stop constructing growing neighbourhoods (see Fig. 1) around the new data as soon as the generalization error estimation falls under a given threshold. A compromise is thus performed between complexity in time and the value of the generalization error estimate, as we consider that this estimate is minimal when all data are re-learned. The key point of our algorithm thus lies in finding the size of the neighbourhood (e.g. the number of neighbours to be considered), and thus in finding a well suited generalization error estimate, which we focus on in the next section. To increase computational speed, and to implement our idea of locality, we only consider, as shown in Fig. 1, a small band around the decision surface in which points may be interesting to re-consider. This band is defined upon a real ε: points of class +1 for which … and points of class -1 for which … are in the band.

Fig. 1. Left: three neighbourhoods around the new data and the interesting small band of points around the decision surface f(x) = 0, parameterized by ε. Right: new decision surface and an example of a point whose status changed.

4.2 Model Selection and the Neighbourhood Determination

The local algorithm requires choosing the value of the neighbourhood size. We have several choices to do that. A first simple solution is to fix it a priori before the beginning of the learning process. However, the best value at one time step is obviously not the best one at another, so it may be difficult to choose a single value suitable for all points. A second, more interesting solution is therefore to determine it automatically through the minimization of a cost criterion. The idea is to apply some process of model selection to the different hypotheses that can be built. The way we choose to select models consists in comparing them according to an estimate of their generalization error. One way to do that is to evaluate the estimation error on a test set and thus keep the hypothesis for which it is the least. In real problems, it is however not realistic to be provided with a test set during the incremental learning process, so this solution cannot be considered a good answer to our problem. Elsewhere, there exist some analytical expressions of Leave-One-Out estimates of SVM generalization error, such as those recalled in [5]. However, in order to use these estimates, one has to ensure that the margin optimization problem has been solved exactly. The same holds for Joachims' estimators [10,11]. This restriction prevents us from using these estimates, as we only perform a partial local optimization.
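Below is a simplified, self-contained sketch of the update loop of Section 4.1. It is not the authors' implementation: instead of freezing the multipliers of distant points and re-solving the restricted QP with SMO, it refits a libsvm-backed classifier on the current support vectors plus a growing working subset drawn from the band around the decision surface, and it stops growing the neighbourhood once a slack-based error estimate (in the spirit of the ||ξ||² criterion of Section 4.2) falls under a fixed threshold. All parameter values (gamma, C, band width, threshold) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC


class LocalIncrementalSVM:
    def __init__(self, gamma=0.5, C=10.0, band_eps=1.0, threshold=5.0):
        self.gamma, self.C = gamma, C
        self.band_eps = band_eps        # width of the band around f(x) = 0
        self.threshold = threshold      # target value of the squared slack norm
        self.X, self.y = [], []
        self.clf, self.sv_idx = None, None

    def _fit(self, X, y, subset):
        self.clf = SVC(kernel="rbf", gamma=self.gamma, C=self.C)
        self.clf.fit(X[subset], y[subset])
        self.sv_idx = subset[self.clf.support_]        # support vectors, global indices

    def _slack_sq(self, X, y):                         # ||xi||^2 over all stored points
        margins = y * self.clf.decision_function(X)
        return float(np.sum(np.maximum(0.0, 1.0 - margins) ** 2))

    def partial_fit(self, x, label):
        self.X.append(np.asarray(x, dtype=float)); self.y.append(label)
        X, y = np.array(self.X), np.array(self.y)
        new_i = len(y) - 1
        if len(np.unique(y)) < 2:
            return self                                # need both classes first
        if self.clf is None:
            self._fit(X, y, np.arange(len(y)))         # initial batch fit
            return self
        if label * self.clf.decision_function([x])[0] >= 1.0:
            return self                                # step 2: keep current hypothesis
        # candidates: non-SV points lying in the band, ordered by distance to x
        in_band = np.abs(self.clf.decision_function(X)) <= 1.0 + self.band_eps
        cand = np.setdiff1d(np.where(in_band)[0], np.append(self.sv_idx, new_i))
        cand = cand[np.argsort(np.linalg.norm(X[cand] - X[new_i], axis=1))]
        for k in range(len(cand) + 1):                 # steps 3 and 5: grow the subset
            subset = np.union1d(self.sv_idx, np.append(cand[:k], new_i)).astype(int)
            self._fit(X, y, subset)                    # step 4: local re-learning
            if self._slack_sq(X, y) <= self.threshold: # stopping test
                break
        return self


# Feeding examples one at a time (toy data standing in for a real stream):
rng = np.random.default_rng(0)
model = LocalIncrementalSVM()
for xi, yi in zip(rng.normal(size=(80, 2)), rng.choice([-1, 1], size=80)):
    model.partial_fit(xi, yi)
```

The design choice in this sketch is that locality enters only through which points are allowed to enter or leave the support set; points far from the new example keep their influence through the retained support vectors.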
To circumvent the problem, we propose to use the bound on generalization provided by a result of Cristianini and Shawe-Taylor [7] for thresholded linear real-valued functions. While the bound it gives is large, it allows to "qualitatively" compare the behaviours of functions of the same family. The theorem states as follows:

Theorem 1. Consider thresholding real-valued linear functions with unit weight vectors on an inner product space and fix a margin γ > 0. There is a constant c, such that for any probability distribution on the data with support in a ball of radius R around the origin, with probability 1 - δ over ℓ random examples, any hypothesis has error no more than …

We notice that once the kernel parameter is fixed, this theorem, directly applied in the feature space defined by the kernel, provides an estimate of generalization error for the machines we work on. This estimate is expressed in terms of a margin value, the norm of the margin slack vector and the radius of the ball containing the data.

In order to use this theorem, we consider the feature space defined by the Gaussian kernel with a fixed value of its parameter. In this space, we consider the functions with unit weight vectors. At step t, different functions can be learnt, one for each neighbourhood size k. For each k, we get a function by normalizing the weight vector of the learnt hypothesis; it belongs to the considered family and, when thresholded, provides the same outputs as the original hypothesis does. The theorem can then be applied to these functions and to the data learnt so far. It ensures that the bound (4) holds. Hence, for each k, we can use this bound as a test error estimate. However, as R is the radius of the ball containing the examples in the feature space [17], it only depends on the chosen kernel and not on k. On the contrary, the squared norm of the margin slack vector, ||ξ||², is the unique quantity which differs among the functions. Slack vectors are thus sufficient to compare the functions, justifying our choice to use them as a model selection criterion. Looking at the bound, we can see that a value of the margin γ must be chosen: in order to do that, we take a time-varying value defined as …
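The text above only names the ingredients of the bound; as an assumption, and only up to constants and logarithmic factors, the 2-norm margin bound of Cristianini and Shawe-Taylor has the following shape, with ξ_i measuring by how much example i misses the target margin γ:

```latex
\operatorname{err}_{\mathcal{D}}(f)\ \le\ \frac{c}{\ell}\left(\frac{R^{2}+\lVert\boldsymbol{\xi}\rVert_{2}^{2}}{\gamma^{2}}\,\log^{2}\ell+\log\frac{1}{\delta}\right),
\qquad
\xi_{i}=\max\bigl(0,\ \gamma-y_{i}f(\mathbf{x}_{i})\bigr).
```

Whatever the exact constants, R and γ are shared by all the candidate hypotheses built at a given step once the kernel is fixed, so ||ξ||² is the only term that varies with the neighbourhood size, which is why it is retained above as the model selection criterion.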
5 Experimental Results

Experiments were conducted on three different binary classification problems: Banana [14], Ringnorm [2] and Diabetes. Datasets are available at www.first.gmd.de/˜raetsch/. For each problem, we tested LISVM for different values of the threshold. The main points we want to assess are the classification accuracy our algorithm is able to achieve, the appropriateness of the proposed criterion to select the "best" neighbourhood, and the relevance of the local approach.

We simulated on-line incremental learning by providing the classifier with one example at a time, taken from a given training set. After each presentation, the current hypothesis is updated and evaluated on a validation set of size two thirds the size of the corresponding testing set. Only the "best test" algorithm uses the remaining third of this latter set to perform neighbourhood selection. Another incremental learning process, called "rebatch", is also evaluated: it consists in realizing a classical SVM learning procedure over the whole dataset when a new instance is available. Experiments are run on several samples in order to compute means and standard deviations. Quantities of interest are plotted in Fig. 2 and Fig. 3. Table 1 reports the results at the end of the training process.

Fig. 2. Evolution of the machine parameters for the Banana problem during on-line learning with a potential band (see Fig. 1): (a) validation error, (b) number of support vectors, (c) ||ξ||² per training instance, (d) ||w||², each plotted against the number of training instances observed.

Table 1. Results on the three datasets (number of SVs and validation error for the batch/rebatch algorithm, the "best test" algorithm and LISVM; the potential band used is the one of Fig. 1; classical SVM parameters are in the top-left cell of each table).

Fig. 3. Evolution of the validation error for Ringnorm (left) and Diabetes (right), for the rebatch algorithm and for LISVM with thresholds 0.1, 0.01 and 0.001.

We led experiments in the same range of threshold values in order to show that it is not difficult to fix a threshold that implies correct generalization performance. However, we must be aware that the threshold reflects our expectation of the error performed by the current hypothesis. Hence, a smaller threshold should be preferred when the data are assumed to be easily separable (e.g. Ringnorm), while bigger values should be fixed when the data are supposed to be harder to discriminate. This remark can be confirmed by the observation of the Banana and Diabetes results. While the chosen values lead to equivalent (or bigger) complexity than the batch algorithm for equivalent validation error, other experiments with … show that LISVM obtains a lower complexity (… SVs and … for Banana, … SVs and … for Diabetes) but with a degraded performance on the validation set (rates of 0.16 and 0.24, respectively). For Ringnorm, this observation can also be made in the chosen range of values. Relaxing the value of the threshold leads to a lower complexity at the cost of a higher validation error. These experiments confirm that the relevant range of thresholds corresponds to a balance between a low validation error and a small number of neighbours needed to reach the threshold at each step. CPU time measures provide means to directly evaluate for which settings the local approach sounds attractive. In the Ringnorm task for instance, the CPU time is of … s for a large neighbourhood, while it is reduced to … s and … s for respectively smaller values of the threshold and lower complexity.
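The simulation protocol just described is easy to reproduce for the "rebatch" baseline. The sketch below, under assumed synthetic data and kernel settings, presents one example at a time, refits a classical SVM on everything seen so far, and records the validation error after each presentation; the resulting curve is the kind plotted in Fig. 2(a) and Fig. 3.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 2)), rng.choice([-1, 1], size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]            # stands in for the validation set

validation_error = []
for t in range(1, len(X_train) + 1):
    if len(np.unique(y_train[:t])) < 2:
        continue                           # wait until both classes were observed
    clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X_train[:t], y_train[:t])
    validation_error.append(1.0 - clf.score(X_val, y_val))
```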
Several curves reflecting the behaviour of the algorithm over time were drawn for the Banana problem. The same curves were measured for the other problems but are omitted for sake of space. The validation error curves show the convergence of the algorithm. This behaviour is confirmed in Fig. 2(c), where all the incremental algorithms exhibit a stabilizing value of ||ξ||² per training instance. For LISVM and the "rebatch" algorithm, the number of support vectors increases linearly with the number of newly observed instances. This is not an issue, considering that the squared norm of the weight vector increases very much slower, suggesting that if the number of training instances had been bigger, a stabilization would have been observed. At last, let us compare the behaviour of the "best test" algorithm to LISVM on the Banana problem. This algorithm performs the SVM selection by choosing the size of the neighbourhood that minimizes the test error, and is thus very demanding in terms of CPU time. Nevertheless, it is remarkable to notice that it reaches the same validation performance with half as many support vectors and a restricted norm of the weight vector, illustrating the relevance of the local approach.

6 Conclusion

In this paper, we propose a new incremental learning algorithm designed for RBF kernel-based SVM. It exploits the locality of RBF by re-learning only the weights of the training data that lie in the neighbourhood of the new data. Our scheme of model selection is based on a criterion derived from a bound on generalization error from [7] and allows to determine a relevant neighbourhood size at each learning step. Experimental results on three data sets are very promising and open the door to real applications. The reduction in terms of CPU time provided by the local approach should be especially important when numerous training instances are available. Further work concerns tests on large scale incremental learning tasks like text categorization. The possibility for the threshold parameter to be adaptive will also be studied. Moreover, LISVM will be extended to the context of drifting concepts by the use of a temporal window.

References

1. L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
2. L. Breiman. Bias, variance and arcing classifiers. Technical Report 460, University of California, Berkeley, CA, USA, 1996.
3. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955–974, 1998.
4. G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Adv. Neural Information Processing, volume 13. MIT Press, 2001.
5. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Technical report, AT&T Labs, March 2000.
6. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
7. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, chapter 4, Generalisation Theory, page 68. Cambridge University Press, 2000.
8. T. Friess, F. Cristianini, and N. Campbell. The kernel-adatron algorithm: a fast and simple learning procedure for support vector machines. In J. Shavlik, editor, Machine Learning: Proc. of the Int. Conf. Morgan Kaufmann Publishers, 1998.
9. T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, 1998.
10. T. Joachims. Estimating the generalization performance of an SVM efficiently. In Proc. of the Int. Conf. on Machine Learning. Morgan Kaufmann, 2000.
11. R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proc. of the Int. Conf. on Machine Learning. Morgan Kaufmann, 2000.
12. E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Workshop on Neural Networks for Signal Processing, pages 276–285, 1997.
13. J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, April 1998.
14. G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.
15. N. Syed, H. Liu, and K. Sung. Incremental learning with support vector machines. In Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI), 1999.
16. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
17. V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.
