Stanford University Machine Learning Notes
Personal Machine Learning Notes, Complete Edition v5 (original draft)

Personal Notes for the Stanford University 2014 Machine Learning Course (V5.01)

Abstract: These are personal notes taken while following the videos of the Stanford University 2014 machine learning course. Author: Huang Haiguang, haiguang2000@ (QQ group: 554839127). Last modified: 2017-12-3.

Chinese Notes for the Stanford University 2014 Machine Learning Course — Course Overview

Machine learning studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their own performance. It is the core of artificial intelligence and the fundamental route to making computers intelligent; its applications span every area of AI, and it relies mainly on induction and synthesis rather than deduction.

In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress toward human-level AI.

In this class, you will learn about the most effective machine learning techniques and gain practice implementing them and getting them to work for yourself. More importantly, you will learn not only the theoretical underpinnings of learning, but also the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you will learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.

The course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks); (ii) unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning); (iii) best practices in machine learning (bias/variance theory; the innovation process in machine learning and AI). The course also draws on numerous case studies and applications: you will learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

The course runs 10 weeks with 18 lessons in total. Compared with earlier machine learning videos, these recordings are much clearer, and every lesson comes with PPT slides, so they are recommended.

I am a PhD student (class of 2014) at Ocean University of China and first came into contact with machine learning in 2014; I downloaded all of the course videos and slides to share with everyone. The Chinese and English subtitles come from https:///course/ml, translated mainly by the "Education Without Borders" subtitle group. I merged the Chinese and English subtitles, translated the remaining ones, packaged and organized the videos, translated the course outline, and built an index file for the course. I hope this is helpful.
Stanford University Machine Learning: Collected Problems and Answers

CS 229 Machine Learning (Problems and Answers), Stanford University

Contents
(1) Assignment 1 (Supervised Learning)
(2) Assignment 1 Solutions (Supervised Learning)
(3) Assignment 2 (Kernels, SVMs, and Theory)
(4) Assignment 2 Solutions (Kernels, SVMs, and Theory)
(5) Assignment 3 (Learning Theory and Unsupervised Learning)
(6) Assignment 3 Solutions (Learning Theory and Unsupervised Learning)
(7) Assignment 4 (Unsupervised Learning and Reinforcement Learning)
(8) Assignment 4 Solutions (Unsupervised Learning and Reinforcement Learning)
(9) Problem Set #1: Supervised Learning
(10) Problem Set #1 Answer
(11) Problem Set #2: Naive Bayes, SVMs, and Theory
(12) Problem Set #2 Answer

CS229, Public Course — Problem Set #1: Supervised Learning

1. Newton's method for computing least squares
In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to $\theta^*$.
(a) Find the Hessian of the cost function $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(\theta^T x^{(i)} - y^{(i)})^2$.
(b) Show that the first iteration of Newton's method gives us $\theta^{\star} = (X^T X)^{-1}X^T \vec{y}$, the solution to our least squares problem.

2. Locally-weighted logistic regression
In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize
$$\ell(\theta) = -\frac{\lambda}{2}\theta^T\theta + \sum_{i=1}^{m} w^{(i)}\left[ y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right].$$
The $-\frac{\lambda}{2}\theta^T\theta$ here is what is known as a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value $\lambda = 0.0001$. Using this definition, the gradient of $\ell(\theta)$ is given by
$$\nabla_\theta \ell(\theta) = X^T z - \lambda\theta,$$
where $z \in \mathbb{R}^m$ is defined by $z_i = w^{(i)}\left(y^{(i)} - h_\theta(x^{(i)})\right)$, and the Hessian is given by
$$H = X^T D X - \lambda I,$$
where $D \in \mathbb{R}^{m\times m}$ is a diagonal matrix with $D_{ii} = -w^{(i)} h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)$.
For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well. Given a query point $x$, we choose to compute the weights
$$w^{(i)} = \exp\left(-\frac{\lVert x - x^{(i)}\rVert^2}{2\tau^2}\right).$$
(a) Implement the Newton-Raphson algorithm for optimizing $\ell(\theta)$ for a new query point $x$, and use this to predict the class of $x$. The q2/ directory contains data and code for this problem. You should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X_train and y_train matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights $w^{(i)}$ for each training example, using the formula above, 2) maximize $\ell(\theta)$ using Newton's method, and finally 3) output $y = 1\{h_\theta(x) > 0.5\}$ as the prediction.
We provide two additional functions that might help. The [X_train, y_train] = load_data; function will load the matrices from files in the data/ folder. The function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.
(b) Evaluate the system with a variety of different bandwidth parameters $\tau$. In particular, try $\tau$ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

3. Multivariate least squares
So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:
$$\{(x^{(i)}, y^{(i)}),\ i = 1,\dots,m\},\qquad x^{(i)} \in \mathbb{R}^n,\ y^{(i)} \in \mathbb{R}^p.$$
Thus for each training example, $y^{(i)}$ is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix $\Theta$ in $y = \Theta^T x$, where $\Theta \in \mathbb{R}^{n\times p}$.
(a) The cost function for this case is
$$J(\Theta) = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{p}\left((\Theta^T x^{(i)})_j - y_j^{(i)}\right)^2.$$
Write $J(\Theta)$ in matrix-vector notation (i.e., without using any summations). [Hint: start with the $m\times n$ design matrix $X$ whose rows are $(x^{(i)})^T$ and the $m\times p$ target matrix $Y$ whose rows are $(y^{(i)})^T$, and then work out how to express $J(\Theta)$ in terms of these matrices.]
(b) Find the closed form solution for $\Theta$ which minimizes $J(\Theta)$. This is the equivalent of the normal equations for the multivariate case.
(c) Suppose that instead of considering the multivariate vectors $y^{(i)}$ all at once, we instead compute each variable $y_j^{(i)}$ separately for each $j = 1,\dots,p$. In this case, we have p individual linear models, of the form $y_j^{(i)} = \theta_j^T x^{(i)}$, $j = 1,\dots,p$ (so here, each $\theta_j \in \mathbb{R}^n$). How do the parameters from these p independent least squares problems compare to the multivariate solution?

4. Naive Bayes
In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features $x_j$, $j = 1,\dots,n$, to our model are discrete, binary-valued variables, so $x_j \in \{0,1\}$. We call $x = [x_1\ x_2\ \cdots\ x_n]^T$ the input vector. For each training example, our output target is a single binary value $y \in \{0,1\}$. Our model is then parameterized by $\phi_{j|y=0} = p(x_j = 1\mid y = 0)$, $\phi_{j|y=1} = p(x_j = 1\mid y = 1)$, and $\phi_y = p(y = 1)$. We model the joint distribution of (x, y) according to
$$p(y) = (\phi_y)^y (1-\phi_y)^{1-y},$$
$$p(x\mid y = 0) = \prod_{j=1}^{n} p(x_j\mid y = 0) = \prod_{j=1}^{n} (\phi_{j|y=0})^{x_j}(1-\phi_{j|y=0})^{1-x_j},$$
$$p(x\mid y = 1) = \prod_{j=1}^{n} p(x_j\mid y = 1) = \prod_{j=1}^{n} (\phi_{j|y=1})^{x_j}(1-\phi_{j|y=1})^{1-x_j}.$$
(a) Find the joint log-likelihood function $\ell(\varphi) = \log\prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \varphi)$ in terms of the model parameters given above. Here, $\varphi$ represents the entire set of parameters $\{\phi_y, \phi_{j|y=0}, \phi_{j|y=1},\ j = 1,\dots,n\}$.
(b) Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that
$$\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}},\qquad \phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}},\qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}.$$
(c) Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier — i.e., if $p(y=0\mid x)$ and $p(y=1\mid x)$ are the class probabilities returned by naive Bayes, show that there exists some $\theta \in \mathbb{R}^{n+1}$ such that
$$p(y=1\mid x) \ge p(y=0\mid x)\quad\text{if and only if}\quad \theta^T\begin{bmatrix}1\\ x\end{bmatrix} \ge 0.$$
(Assume $\theta_0$ is an intercept term.)

5. Exponential family and the geometric distribution
(a) Consider the geometric distribution parameterized by $\phi$: $p(y;\phi) = (1-\phi)^{y-1}\phi$, $y = 1,2,3,\dots$. Show that the geometric distribution is in the exponential family, and give $b(y)$, $\eta$, $T(y)$, and $a(\eta)$.
(b) Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by $1/\phi$.
(c) For a training set $\{(x^{(i)}, y^{(i)});\ i = 1,\dots,m\}$, let the log-likelihood of an example be $\log p(y^{(i)}\mid x^{(i)};\theta)$. By taking the derivative of the log-likelihood with respect to $\theta_j$, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.

CS229, Public Course — Problem Set #1 Solutions: Supervised Learning

1. Newton's method for computing least squares
(a) Answer: As shown in the class notes,
$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}(\theta^T x^{(i)} - y^{(i)})x_j^{(i)},$$
so
$$\frac{\partial^2 J(\theta)}{\partial\theta_j\,\partial\theta_k} = \sum_{i=1}^{m} x_j^{(i)} x_k^{(i)} = (X^T X)_{jk}.$$
Therefore, the Hessian of $J(\theta)$ is $H = X^T X$. This can also be derived by simply applying rules from the lecture notes on linear algebra.
(b) Answer: Given any $\theta^{(0)}$, Newton's method finds $\theta^{(1)}$ according to
$$\theta^{(1)} = \theta^{(0)} - H^{-1}\nabla_\theta J(\theta^{(0)}) = \theta^{(0)} - (X^T X)^{-1}(X^T X\theta^{(0)} - X^T\vec{y}) = \theta^{(0)} - \theta^{(0)} + (X^T X)^{-1}X^T\vec{y} = (X^T X)^{-1}X^T\vec{y}.$$
Therefore, no matter what $\theta^{(0)}$ we pick, Newton's method always finds $\theta^\star$ after one iteration.

2. Locally-weighted logistic regression
(a) Answer: Our implementation of lwlr.m:

    theta = zeros(n,1);
    % compute weights
    w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2*tau^2));
    % perform Newton's method
    g = ones(n,1);
    while (norm(g) > 1e-6)
        h = 1 ./ (1 + exp(-X_train*theta));
        g = X_train' * (w .* (y_train - h)) - 1e-4*theta;
        H = -X_train' * diag(w .* h .* (1-h)) * X_train - 1e-4*eye(n);
        theta = theta - H \ g;
    end
    % return predicted y
    y = double(x'*theta > 0);

(b) Answer: [Plots of the resulting decision boundaries for τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0 omitted.] For smaller τ, the classifier appears to overfit the data set, obtaining zero training error but outputting a sporadic-looking decision boundary. As τ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the unweighted linear regression solution.

3. Multivariate least squares
(a) Answer: The objective function can be expressed as
$$J(\Theta) = \frac{1}{2}\operatorname{tr}\left[(X\Theta - Y)^T(X\Theta - Y)\right].$$
To see this, note that
$$\frac{1}{2}\operatorname{tr}\left[(X\Theta - Y)^T(X\Theta - Y)\right] = \frac{1}{2}\sum_i\left[(X\Theta - Y)^T(X\Theta - Y)\right]_{ii} = \frac{1}{2}\sum_i\sum_j(X\Theta - Y)_{ij}^2 = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{p}\left((\Theta^T x^{(i)})_j - y_j^{(i)}\right)^2.$$
(b) Answer: First we take the gradient of $J(\Theta)$ with respect to $\Theta$:
$$\nabla_\Theta J(\Theta) = \frac{1}{2}\nabla_\Theta\left[\operatorname{tr}(\Theta^T X^T X\Theta) - 2\operatorname{tr}(Y^T X\Theta) + \operatorname{tr}(Y^T Y)\right] = \frac{1}{2}\left[2X^T X\Theta - 2X^T Y\right] = X^T X\Theta - X^T Y.$$
Setting this expression to zero, we obtain
$$\Theta = (X^T X)^{-1}X^T Y.$$
This looks very similar to the closed form solution in the univariate case, except that now Y is an $m\times p$ matrix, so $\Theta$ is also a matrix, of size $n\times p$.
(c) Answer: This time, we construct a set of vectors $\vec{y}_j = [y_j^{(1)}\ y_j^{(2)}\ \cdots\ y_j^{(m)}]^T$, $j = 1,\dots,p$. Then our j-th linear model can be solved by the least squares solution $\theta_j = (X^T X)^{-1}X^T\vec{y}_j$. If we line up our $\theta_j$, we see that
$$[\theta_1\ \theta_2\ \cdots\ \theta_p] = (X^T X)^{-1}X^T[\vec{y}_1\ \vec{y}_2\ \cdots\ \vec{y}_p] = (X^T X)^{-1}X^T Y = \Theta.$$
Thus, our p individual least squares problems give the exact same solution as the multivariate least squares problem.
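The equivalence shown in part (c) is easy to check numerically. The NumPy sketch below is mine, not part of the problem set; the random data and variable names are illustrative only. It compares the multivariate solution $\Theta = (X^TX)^{-1}X^TY$ with the p per-output least-squares fits.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 50, 4, 3                      # examples, input features, outputs

    X = rng.normal(size=(m, n))             # m x n design matrix
    Y = rng.normal(size=(m, p))             # m x p target matrix

    # Multivariate normal equations: Theta = (X^T X)^{-1} X^T Y  (n x p)
    Theta = np.linalg.solve(X.T @ X, X.T @ Y)

    # p independent least-squares problems, one column of Y at a time
    Theta_cols = np.column_stack(
        [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(p)]
    )

    print(np.allclose(Theta, Theta_cols))   # True: the two solutions coincide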
4. Naive Bayes
(a) Answer:
$$\ell(\varphi) = \log\prod_{i=1}^{m} p(x^{(i)}, y^{(i)};\varphi) = \log\prod_{i=1}^{m} p(x^{(i)}\mid y^{(i)};\varphi)\,p(y^{(i)};\varphi) = \sum_{i=1}^{m}\left[\log p(y^{(i)};\varphi) + \sum_{j=1}^{n}\log p(x_j^{(i)}\mid y^{(i)};\varphi)\right]$$
$$= \sum_{i=1}^{m}\left[y^{(i)}\log\phi_y + (1-y^{(i)})\log(1-\phi_y) + \sum_{j=1}^{n}\left(x_j^{(i)}\log\phi_{j|y^{(i)}} + (1-x_j^{(i)})\log(1-\phi_{j|y^{(i)}})\right)\right].$$
(b) Answer: The only terms in $\ell(\varphi)$ which have a non-zero gradient with respect to $\phi_{j|y=0}$ are those with $y^{(i)} = 0$. Therefore,
$$\nabla_{\phi_{j|y=0}}\ell(\varphi) = \sum_{i=1}^{m} 1\{y^{(i)} = 0\}\left[\frac{x_j^{(i)}}{\phi_{j|y=0}} - \frac{1-x_j^{(i)}}{1-\phi_{j|y=0}}\right].$$
Setting this to zero and solving gives
$$\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}},$$
and the expression for $\phi_{j|y=1}$ follows in exactly the same way. To solve for $\phi_y$,
$$\nabla_{\phi_y}\ell(\varphi) = \sum_{i=1}^{m}\left[\frac{y^{(i)}}{\phi_y} - \frac{1-y^{(i)}}{1-\phi_y}\right] = 0 \quad\Longrightarrow\quad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}.$$
(c) Answer:
$$p(y=1\mid x) \ge p(y=0\mid x) \iff \frac{p(y=1\mid x)}{p(y=0\mid x)} \ge 1 \iff \frac{\prod_{j=1}^{n} p(x_j\mid y=1)\,p(y=1)}{\prod_{j=1}^{n} p(x_j\mid y=0)\,p(y=0)} \ge 1$$
$$\iff \frac{\prod_{j=1}^{n}(\phi_{j|y=1})^{x_j}(1-\phi_{j|y=1})^{1-x_j}\,\phi_y}{\prod_{j=1}^{n}(\phi_{j|y=0})^{x_j}(1-\phi_{j|y=0})^{1-x_j}\,(1-\phi_y)} \ge 1$$
$$\iff \sum_{j=1}^{n}\left[x_j\log\frac{\phi_{j|y=1}}{\phi_{j|y=0}} + (1-x_j)\log\frac{1-\phi_{j|y=1}}{1-\phi_{j|y=0}}\right] + \log\frac{\phi_y}{1-\phi_y} \ge 0 \iff \theta^T\begin{bmatrix}1\\ x\end{bmatrix} \ge 0,$$
where
$$\theta_0 = \sum_{j=1}^{n}\log\frac{1-\phi_{j|y=1}}{1-\phi_{j|y=0}} + \log\frac{\phi_y}{1-\phi_y},\qquad \theta_j = \log\frac{\phi_{j|y=1}(1-\phi_{j|y=0})}{\phi_{j|y=0}(1-\phi_{j|y=1})},\ j = 1,\dots,n.$$

5. Exponential family and the geometric distribution
(a) Answer:
$$p(y;\phi) = (1-\phi)^{y-1}\phi = \exp\left(y\log(1-\phi) + \log\frac{\phi}{1-\phi}\right),$$
so
$$b(y) = 1,\qquad \eta = \log(1-\phi),\qquad T(y) = y,\qquad a(\eta) = \log\frac{1-\phi}{\phi} = \log\frac{e^\eta}{1-e^\eta}.$$
(b) Answer: Since the mean of the geometric distribution is $1/\phi$, the canonical response function is
$$g(\eta) = E[y;\phi] = \frac{1}{\phi} = \frac{1}{1-e^\eta}.$$
(c) Answer: With $\eta = \theta^T x^{(i)}$, the log-likelihood of an example is
$$\ell_i(\theta) = \log p(y^{(i)}\mid x^{(i)};\theta) = \theta^T x^{(i)}\,y^{(i)} - \log\frac{e^{\theta^T x^{(i)}}}{1 - e^{\theta^T x^{(i)}}}.$$
Differentiating with respect to $\theta_j$ gives
$$\frac{\partial}{\partial\theta_j}\ell_i(\theta) = \left(y^{(i)} - \frac{1}{1 - e^{\theta^T x^{(i)}}}\right)x_j^{(i)},$$
so the stochastic gradient ascent rule is
$$\theta_j := \theta_j + \alpha\left(y^{(i)} - \frac{1}{1 - e^{\theta^T x^{(i)}}}\right)x_j^{(i)}.$$

CS229, Public Course — Problem Set #2: Kernels, SVMs, and Theory

1. Kernel ridge regression
In contrast to ordinary least squares, which has a cost function
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(\theta^T x^{(i)} - y^{(i)})^2,$$
we can also add a term that penalizes large weights in $\theta$. In ridge regression, our least squares cost is regularized by adding a term $\frac{\lambda}{2}\lVert\theta\rVert^2$, where $\lambda > 0$ is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(\theta^T x^{(i)} - y^{(i)})^2 + \frac{\lambda}{2}\lVert\theta\rVert^2.$$
(a) Use the vector notation described in class to find a closed-form expression for the value of $\theta$ which minimizes the ridge regression cost function.
(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite-dimensional) space. Using a feature mapping $\phi$, the ridge regression cost function becomes
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(\theta^T\phi(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2}\lVert\theta\rVert^2.$$
Making a prediction on a new input $x_{\text{new}}$ would now be done by computing $\theta^T\phi(x_{\text{new}})$. Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing $\phi(x_{\text{new}})$. You may assume that the parameter vector $\theta$ can be expressed as a linear combination of the input feature vectors; i.e., $\theta = \sum_{i=1}^{m}\alpha_i\phi(x^{(i)})$ for some set of parameters $\alpha_i$.
[Hint: You may find the following identity useful: $(\lambda I + BA)^{-1}B = B(\lambda I + AB)^{-1}$. If you want, you can try to prove this as well, though this is not required for the problem.]

2. ℓ2 norm soft margin SVMs
In class, we saw that if our data is not linearly separable, then we need to modify our support vector machine algorithm by introducing an error margin that must be minimized. Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. In this problem we will consider an alternative method, known as the ℓ2 norm soft margin SVM. This new algorithm is given by the following optimization problem (notice that the slack penalties are now squared):
$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^2 + \frac{C}{2}\sum_{i=1}^{m}\xi_i^2 \qquad\text{s.t.}\quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i,\ i = 1,\dots,m.$$
(a) Notice that we have dropped the $\xi_i \ge 0$ constraint in the ℓ2 problem. Show that these non-negativity constraints can be removed; that is, show that the optimal value of the objective will be the same whether or not these constraints are present.
(b) What is the Lagrangian of the ℓ2 soft margin SVM optimization problem?
(c) Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: $\nabla_w L$, $\partial L/\partial b$, and $\nabla_\xi L$, and then setting them equal to 0. Here, $\xi = [\xi_1,\xi_2,\dots,\xi_m]^T$.
(d) What is the dual of the ℓ2 soft margin SVM optimization problem?

3. SVM with Gaussian kernel
Consider the task of training a support vector machine using the Gaussian kernel $K(x,z) = \exp(-\lVert x-z\rVert^2/\tau^2)$. We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.
(a) Recall from class that the decision function learned by the support vector machine can be written as
$$f(x) = \sum_{i=1}^{m}\alpha_i y^{(i)}K(x^{(i)}, x) + b.$$
Assume that the training data $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$ consists of points which are separated by at least a distance of ε; that is, $\lVert x^{(j)} - x^{(i)}\rVert \ge \varepsilon$ for any $i \ne j$. Find values for the set of parameters $\{\alpha_1,\dots,\alpha_m, b\}$ and Gaussian kernel width τ such that $x^{(i)}$ is correctly classified, for all $i = 1,\dots,m$. [Hint: Let $\alpha_i = 1$ for all i and $b = 0$. Now notice that for $y \in \{-1,+1\}$ the prediction on $x^{(i)}$ will be correct if $|f(x^{(i)}) - y^{(i)}| < 1$, so find a value of τ that satisfies this inequality for all i.]
(b) Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.
(c) Suppose we run the SMO algorithm to train an SVM with slack variables, under the conditions stated above, using the value of τ you picked in the previous part, and using some arbitrary value of C (which you do not know beforehand). Will this necessarily result in a classifier that achieves zero training error? Why or why not? Again, a short explanation is sufficient.

4. Naive Bayes and SVMs for Spam Classification
In this question you'll look into the Naive Bayes and Support Vector Machine algorithms for a spam classification problem. However, instead of implementing the algorithms yourself, you'll use a freely available machine learning library. There are many such libraries available, with different strengths and weaknesses, but for this problem you'll use the WEKA machine learning package, available at /ml/weka/. WEKA implements many standard machine learning algorithms, is written in Java, and has both a GUI and a command line interface. It is not the best library for very large-scale data sets, but it is very nice for playing around with many different algorithms on medium size problems. You can download and install WEKA by following the instructions given on the website above. To use it from the command line, you first need to install a Java runtime environment, then add the weka.jar file to your CLASSPATH environment variable. Finally, you can call WEKA using the command:
java <classifier> -t <training file> -T <test file>
For example, to run the Naive Bayes classifier (using the multinomial event model) on our provided spam data set, run the command:
java weka.classifiers.bayes.NaiveBayesMultinomial -t spam_train1000.arff -T spam_test.arff
The spam classification dataset in the q4/ directory was provided courtesy of Christian Shelton (cshelton@). Each example corresponds to a particular email, and each feature corresponds to a particular word. For privacy reasons we have removed the actual words themselves from the data set, and instead label the features generically as f1, f2, etc. However, the data set is from a real spam classification task, so the results demonstrate the performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spam_train50.arff, spam_train100.arff, etc. (the ".arff" format is the default format used by WEKA), each containing the corresponding number of training examples. There is also a single test set spam_test.arff, which is a hold-out set used for evaluating the classifier's performance.
(a) Run the weka.classifiers.bayes.NaiveBayesMultinomial classifier on the dataset and report the resulting error rates. Evaluate the performance of the classifier using each of the different training files (but each time using the same test file, spam_test.arff). Plot the error rate of the classifier versus the number of training examples.
(b) Repeat the previous part, but using the weka.classifiers.functions.SMO classifier, which implements the SMO algorithm to train an SVM. How does the performance of the SVM compare to that of Naive Bayes?

5. Uniform convergence
In class we proved that for any finite set of hypotheses $H = \{h_1,\dots,h_k\}$, if we pick the hypothesis $\hat{h}$ that minimizes the training error on a set of m examples, then with probability at least $(1-\delta)$,
$$\varepsilon(\hat{h}) \le \left(\min_i \varepsilon(h_i)\right) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}},$$
where $\varepsilon(h_i)$ is the generalization error of hypothesis $h_i$. Now consider a special case (often called the realizable case) where we know, a priori, that there is some hypothesis in our class H that achieves zero error on the distribution from which the data is drawn. Then we could obviously just use the above bound with $\min_i\varepsilon(h_i) = 0$; however, we can prove a better bound than this.
(a) Consider a learning algorithm which, after looking at m training examples, chooses some hypothesis $\hat{h} \in H$ that makes zero mistakes on this training data. (By our assumption, there is at least one such hypothesis, possibly more.) Show that with probability $1-\delta$,
$$\varepsilon(\hat{h}) \le \frac{1}{m}\log\frac{k}{\delta}.$$
Notice that since we do not have a square root here, this bound is much tighter. [Hint: Consider the probability that a hypothesis with generalization error greater than γ makes no mistakes on the training data. Instead of the Hoeffding bound, you might also find the following inequality useful: $(1-\gamma)^m \le e^{-\gamma m}$.]
(b) Rewrite the above bound as a sample complexity bound, i.e., in the form: for fixed δ and γ, for $\varepsilon(\hat{h}) \le \gamma$ to hold with probability at least $(1-\delta)$, it suffices that $m \ge f(k,\gamma,\delta)$ (i.e., $f(\cdot)$ is some function of k, γ, and δ).

CS229, Public Course — Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression
(a) Answer: Using the design matrix notation, we can rewrite $J(\theta)$ as
$$J(\theta) = \frac{1}{2}(X\theta - \vec{y})^T(X\theta - \vec{y}) + \frac{\lambda}{2}\theta^T\theta.$$
Then the gradient is
$$\nabla_\theta J(\theta) = X^T X\theta - X^T\vec{y} + \lambda\theta.$$
Setting the gradient to 0 gives
$$0 = X^T X\theta - X^T\vec{y} + \lambda\theta \quad\Longrightarrow\quad \theta = (X^T X + \lambda I)^{-1}X^T\vec{y}.$$
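The official answer to part (b) of the kernel ridge regression problem is not included in the text above. As a hedged sketch (my own reconstruction, not the original answer): assuming $\theta = \sum_i\alpha_i\phi(x^{(i)})$ and using the identity from the hint, the prediction $\theta^T\phi(x_{\text{new}})$ can be written as $k(x_{\text{new}})^T(K+\lambda I)^{-1}\vec{y}$, where $K_{ij} = \phi(x^{(i)})^T\phi(x^{(j)})$ and $k(x_{\text{new}})_i = \phi(x^{(i)})^T\phi(x_{\text{new}})$, so no feature vectors are ever formed explicitly. The NumPy check below uses a plain linear kernel (φ(x) = x) so that the primal and kernelized predictions can both be computed and compared; all data and names are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, lam = 30, 5, 0.1

    X = rng.normal(size=(m, n))     # rows are phi(x^(i)); linear kernel means phi(x) = x
    y = rng.normal(size=m)
    x_new = rng.normal(size=n)

    # Primal ridge solution: theta = (X^T X + lam I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    pred_primal = theta @ x_new

    # Kernelized prediction: k(x_new)^T (K + lam I)^{-1} y, with K = X X^T
    K = X @ X.T
    k_new = X @ x_new
    pred_kernel = k_new @ np.linalg.solve(K + lam * np.eye(m), y)

    print(np.allclose(pred_primal, pred_kernel))   # True: both routes give the same prediction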
Gaussian Tutorial

Partial translation of Exploring Chemistry with Electronic Structure Methods, Second Edition, by James B. Foresman and Æleen Frisch, Gaussian, Inc., USA, 1996.

Preface
Gaussian can do many things, including: molecular energies and structures; energies and structures of transition states; chemical bonds and reaction energies; molecular orbitals; dipole and multipole moments; atomic charges and electrostatic potentials; vibrational frequencies; IR and Raman spectra; NMR; polarizabilities and hyperpolarizabilities; thermochemical properties; and reaction pathways. Systems can be modeled in the gas phase or in solution, in the ground state or in excited states. Gaussian is a powerful tool for studying substituent effects, reaction mechanisms, potential energy surfaces, and excitation energies.

Structure of the book: Preface; Running Gaussian; Part 1, Basic concepts and techniques (Chapter 1 Computational models, Chapter 2 Single point energy calculations, Chapter 3 Geometry optimization, Chapter 4 Frequency analysis); Part 2, Computational chemistry methods (Chapter 5 Basis set effects, Chapter 6 Choosing a theoretical method, Chapter 7 High-accuracy calculations); Part 3, Applications (Chapter 8 Studying reactions and reactivity, Chapter 9 Excited states, Chapter 10 Reactions in solution); Appendix A Theoretical background; Appendix B Overview of Gaussian input.

Running Gaussian
Unix/Linux: before running Gaussian the environment must be set up; for example, in the C shell add
setenv g94root directory      (directory is the parent directory of the program)
source $g94root/g94/bsd/g94.login
Then jobs can be run; for example, with an input file the C-shell run command has the form: g94 h2o.log
Windows: the graphical interface needs no further explanation.

Input and output files
On Unix systems the input file has the extension .com and the output file .log; on Windows the input extension is .gjf and the output is .out. An example input file:

#T RHF/6-31G(d) Test

My first Gaussian job: water single point energy

0 1
O -0.464 0.177 0.0
H -0.464 1.137 0.0
H 0.441 -0.143 0.0

The first line begins with # and is the route line describing the job: #T means print only the important parts of the output, #P prints more information. RHF selects restricted Hartree-Fock; this is where the theoretical method is specified. 6-31G(d) is the basis set, i.e., which combination of functions describes the orbitals. Test means the job is not recorded in the Gaussian job archive, which is irrelevant for a standalone installation. The third line describes the job; write whatever you like, as long as you understand it. The second line is blank; this blank line and the blank fourth line are both required. The fifth line gives two numbers, the molecular charge and the spin multiplicity. From the sixth line on, the molecular geometry is given; this example uses Cartesian coordinates. A blank line must follow the molecular structure. In the Windows version the graphical interface separates these sections clearly, so the blank lines are not typed by hand.

The output file is usually long. For the input above it begins with the copyright notice and the list of authors (Pople's name is last), then a record of Gaussian reading the input file, then the conversion of the input coordinates to a standard orientation; none of this needs attention, although this is where to verify that the molecular structure is correct. The key line contains "SCF Done"; the energy that follows is the important number, in atomic units (hartree):
1 hartree = 4.3597482E-18 J = 2625.500 kJ/mol = 27.2116 eV.
Next comes the population analysis: the molecular orbitals, the orbital eigenvalues (energies), the atomic charges, and the dipole moment. Then comes a summary of the whole calculation, with fields separated by \; essentially everything you need is in there. Then comes a quotation picked at random by Gaussian from its quotation library (in l9999.exe; open that file as text and look for yourself — a good chance to practice English). Then the CPU time is printed; note that this is CPU time, not wall-clock time, which is somewhat longer, and if several jobs run at once (probably not possible under Windows, but Unix/Linux can run several jobs simultaneously) the elapsed time is much longer still. The final line, "Normal termination of Gaussian 94", is essential: if it is missing, the job failed, something went wrong somewhere, and an error message should appear at that point. Depending on the route options, the output may contain more than this basic content.

Chapter 1: Computational models
1.1 The main approaches of computational chemistry are molecular mechanics and electronic structure theory. They have in common that they: (1) compute the energy of a molecule, from which other properties are derived by appropriate methods; (2) perform geometry optimizations, searching near the starting structure for the structure of lowest energy, using the first derivatives of the energy; and (3) compute the frequencies of internal molecular motions, based on the second derivatives of the energy.

1.2 Molecular mechanics
Molecular mechanics treats molecules with classical physics and is found in programs such as MM3, HyperChem, Quanta, Sybyl, and Alchemy. Depending on the force field, there are many variants. Molecular mechanics is cheap (quantum chemists habitually describe calculations as cheap or expensive, which really just means computing time: if you pay for machine time, time is money, and if you own the machine, faster calculations still cost more hardware) and can treat systems of several thousand atoms. Its drawbacks: (1) each parameter set is derived for a specific class of atoms, and there is no unified set of parameters covering all states of an atom; (2) electrons are ignored — only bonds and atoms are considered — so systems with strong electronic effects cannot be handled; for example, bond breaking cannot be described.

1.3 Electronic structure theory
This theory is based on the Schrödinger equation and treats the molecule quantum-mechanically. There are two main families:
1. Semi-empirical methods, such as AM1, MINDO/3, and PM3, found in MOPAC, AMPAC, HyperChem, and Gaussian. Semi-empirical methods use parameters obtained from experiment to simplify the solution of the Schrödinger equation.
2. Ab initio methods. Ab initio calculations use only a few physical constants — the speed of light, the masses of the electron and the nuclei, and Planck's constant — and solve the Schrödinger equation through a series of mathematical approximations; different approximations give different methods. The classic method is Hartree-Fock (HF). Ab initio methods give fairly accurate information over a very wide range of problems, at a much higher (more expensive) cost than the methods above.

1.4 Density functional methods
Density functional theory is the third class of electronic structure methods and has risen to prominence in recent years. It solves the Schrödinger equation using functionals (functions whose arguments are themselves functions). Because DFT includes electron correlation, its results are better than HF, and the calculations are fast.

1.5 Model chemistries
Gaussian's position is that a theoretical model must be applicable to systems of any kind and any size, its only limitations coming from the computation itself. This implies two things: (1) a theoretical model must be uniquely defined for any given set of nuclei and electrons — the molecular structure alone provides sufficient information for solving the Schrödinger equation; (2) a theoretical model is unbiased, i.e., it does not depend on any particular chemical structure or chemical process. Such a theory is called a theoretical-model chemistry, or model chemistry for short.

1.6 Defining model chemistries
Gaussian contains many model chemistries; some examples (Gaussian keyword — method):
HF — Hartree-Fock self-consistent field model
B3LYP — Becke three-parameter density functional model with the Lee-Yang-Parr functional
MP2 — second-order Møller-Plesset perturbation theory
MP4 — fourth-order Møller-Plesset perturbation theory
QCISD(T) — quadratic CI
These are discussed in Chapter 6.
Basis sets: a basis set is the mathematical representation of the molecular orbitals; see Chapter 5.
Open shell versus closed shell refers to the electron spin state. For closed-shell systems the restricted methods are used, prefixed with R; for open-shell systems the unrestricted methods are used, prefixed with U — open-shell HF is UHF. Without a prefix, the program assumes a closed shell. Open-shell treatment is generally needed for: (1) systems with an odd number of electrons, such as radicals and some ions; (2) excited states; (3) systems with several unpaired electrons; and (4) descriptions of bond-dissociation processes.

Combining models: high-accuracy work often combines several models, for example optimizing the structure with a moderate method and then computing the energy at that structure with a high-accuracy method.

Chapter 2: Single point energy calculations
2.1 A single point energy calculation computes the energy and properties of a molecule at a fixed geometry; because the geometry does not change — it is "a single point" — it is called a single point calculation. Single point calculations are used to obtain basic information about a molecule, to check a structure before a geometry optimization, to perform a high-accuracy calculation on a geometry optimized at a lower level, and for cases in which the available resources permit only a single point. Single point energies can be computed at different levels of theory with different basis sets; the examples in this chapter all use HF.

2.2 Setting up the calculation
The input must contain the level of theory and type of calculation, a job title, and the molecular structure.
Route section. This is where the theoretical method, the basis set, and the type of calculation are specified. The line starts with #. The default job type is a single point energy calculation, keyword SP, which may be omitted. Keywords appearing here include the method, e.g. HF (the default, which may be omitted) or B3PW91; the basis set, e.g. 6-31G or LANL2DZ; the population analysis, e.g. Pop=Reg; and the SCF control, e.g. SCF=Tight. Pop=Reg prints only the five highest occupied and five lowest unoccupied molecular orbitals, whereas Pop=Full prints all molecular orbitals. The SCF keyword controls the convergence of the wavefunction and normally need not be given; SCF=Tight requests tighter-than-default convergence.
Title section. Usually one line (multiple lines are allowed but must contain no blank lines) describing the calculation.
Molecular structure. First the charge and spin multiplicity: the charge is the total charge of the system (0 if neutral), and the spin multiplicity is 2S+1, where S is the total spin quantum number — in practice, the number of unpaired electrons plus one; with no unpaired electrons the multiplicity is 1. Then comes the geometry, in Cartesian coordinates or as a Z-matrix.
Multi-step jobs. Gaussian supports multi-step jobs, i.e., several calculation steps in one input file.

2.3 Information in the output file
Example 2.1 (file e2_01): single point energy of formaldehyde.
Standard orientation: find the line "Standard orientation" in the output; the coordinates below it are the standard-orientation geometry of the input molecule.
Energy: find the line
SCF Done: E(RHF) = -113.863697598 A.U. after 6 cycles
The number is the energy in hartree. Higher-level calculations often print more than one energy; for example, in the line
E2 = -0.3029540001D+00 EUMP2 = -0.11416665769315D+03
the number after EUMP2 is the MP2 energy. MP4 energy output is more complicated still.
Molecular orbitals and orbital energies: for the orbitals printed according to the route options, the output lists the orbital symmetry and occupation (O = occupied, V = virtual); the orbital eigenvalues, i.e., the orbital energies, with orbitals ordered by increasing energy; and the contribution of each atomic orbital to each molecular orbital. Pay attention to the orbital coefficients: their relative magnitudes (ignoring sign) indicate how strongly each atomic orbital contributes to a given molecular orbital. The HOMO and LUMO are found at the boundary between occupied and virtual orbitals.
Charge distribution: Gaussian's default population analysis is the Mulliken method; search for "Total atomic charges" to find the charges on all atoms.
Dipole and multipole moments: search for "Dipole moment (Debye)"; the dipole moment follows, and two lines further down the quadrupole moment. Dipole moments are given in debye.
CPU time and more: "Job cpu time: 0 days 0 hours 0 minutes 9.1 seconds." gives the timing; again, this is CPU time.

2.4 NMR calculations
Example 2.2 (file e2_02): NMR of methane.
NMR shieldings are another property available from a single point calculation; simply add the NMR keyword to the route line, e.g.
#T RHF/6-31G(d) NMR Test
In the output, look for
GIAO Magnetic shielding tensor (ppm)
1 C Isotropic = 199.0522 Anisotropy = 0.0000
This is the methane result computed with the settings above, on a geometry optimized with the B3LYP density functional method. NMR data are normally referenced to TMS; the same method gives for TMS (tetramethylsilane)
1 C Isotropic = 195.1196 Anisotropy = 17.5214
so the computed chemical shift of methane is -3.9 ppm, reasonably close to the experimental value of -7.0 ppm.

2.5 Exercises
Exercise 2.1 (file 2_01): single point energy of propane. Find the standard orientation, the single point energy, the magnitude and direction of the dipole moment, and the charge distribution.
Exercise 2.2 (files 2_02a (RR), 2_02b (SS), 2_02c (RS)): energies of 1,2-dichloro-1,2-difluoroethane. Compare the energies and dipole moments of the three stereoisomers.
Exercise 2.3 (file 2_03): comparison of acetone and formaldehyde. Examine the effect of replacing a hydrogen atom by a methyl group; note that energies may only be compared between systems with the same number and kind of atoms.
Exercise 2.4 (file 2_04): molecular orbitals of ethylene and formaldehyde. Find the HOMO and LUMO levels and analyze their composition.
Exercise 2.5 (files 2_05a, 2_05b, 2_05c): NMR comparison of an alkane, an alkene, and an alkyne.
Exercise 2.6 (file 2_06): single point energy of C60. Analyze the highest occupied orbitals of C60; note that SCF=Tight is required, otherwise there are convergence problems.
Exercise 2.7 (file 2_07): CPU resources versus problem size. This exercise compares how the number of basis functions and the SCF procedure affect CPU time and resources, contrasting conventional SCF (SCF=Conventional) with direct SCF (the Gaussian default):

basis functions | int file size (MB) | conventional SCF CPU time | direct SCF CPU time
23  |   2 |    8.6 |  12.8
42  |   4 |   11.9 |  19.8
61  |  16 |   23.2 |  38.8
80  |  42 |   48.7 |  72.1
99  |  92 |   95.4 | 122.5
118 | 174 |  163.4 | 186.8
137 | 290 |  354.5 | 268.0
156 | 437 |  526.5 | 375.0
175 | 620 |  740.2 | 488.0
194 | 832 | 1028.4 | 622.1

Clearly the number of basis functions strongly affects both resource use and CPU time: more functions mean larger files and longer run times. In theory CPU time scales with the fourth power of the number of basis functions, but in practice the exponent is lower — roughly 2.5 in this example. In general, direct SCF is more efficient than conventional SCF, which can be seen here once the number of basis functions becomes large.
Exercise 2.8 (files 2_08a (O2), 2_08b (O3)): SCF stability. This exercise uses SCF stability analysis to examine the stability of the wavefunction. For an unknown system, an SCF stability check is essential: if the wavefunction is unstable, the SCF results and the wavefunction have no chemical meaning. The stability analysis looks for a state lower in energy than the current one. Keywords: Stable tests the stability of the wavefunction, relaxing constraints such as allowing a closed-shell wavefunction to become open-shell; Stable=Opt additionally reoptimizes the wavefunction to the new state when an instability is found — generally not recommended, because the geometry of the new state stays too close to the original one. In this example we first compute closed-shell singlet O2. Obviously a closed-shell singlet should not be stable for O2, and indeed the output contains
The wavefunction has an RHF --> UHF instability.
meaning there is a UHF state lower in energy than the current one. This can indicate that the lowest state is a singlet but not closed shell, that a lower triplet exists, or that the state computed is not a minimum but, say, a transition state. Repeating the calculation for the triplet, again with a stability test, gives
The wavefunction is stable under the perturbations considered.
Ozone is a singlet, but with an unusual electronic structure. RHF Stable=Opt finds an RHF --> UHF instability; testing the resulting UHF state with UHF Stable=Opt shows that the system is still unstable:
The wavefunction has an internal instability.
Reoptimizing from there brings the system back to the RHF state. In this case the initial guess must be modified with Guess=Mix, which mixes the HOMO and LUMO in the initial guess and thereby breaks the spatial symmetry; a subsequent UHF Guess=Mix Stable run then reports that a stable wavefunction has been obtained. The electronic state can also be controlled with Guess=Alter; see the Gaussian User's Reference.

Chapter 3: Geometry optimization
The previous chapter dealt with energies at fixed geometries. Changes in geometry clearly have a large effect on the energy; the energy as a function of the molecular geometry is called the potential energy surface (PES), the mathematical relationship connecting structure and energy. For a diatomic molecule the energy depends on the interatomic distance, giving a potential energy curve; for larger systems the surface is multidimensional, its dimension determined by the molecular degrees of freedom.

3.1 The potential energy surface
A PES contains several kinds of important points: the global maximum, local maxima, the global minimum, local minima, and saddle points. A maximum is the highest-energy point of a region: every direction of geometric change lowers the energy; the largest local maximum is the global maximum. Similarly for minima: the lowest of all local minima is the global minimum, the most stable structure. A saddle point is a maximum along one direction and a minimum along the others; in general, a saddle point corresponds to a transition state connecting two minima.
Finding minima. Geometry optimization searches for minima, and a minimum is a stable geometry of the molecule. At every minimum and saddle point the first derivative of the energy — the gradient — is zero; such points are called stationary points. Every successful optimization finds a stationary point, though not necessarily the one expected. An optimization starts from an initial geometry, computes the energy and the gradient, and then determines the direction and step size of the next move; the direction is always the one in which the energy decreases fastest. Most optimizers also estimate or compute second derivatives of the energy to update the force-constant matrix, which describes the curvature of the surface at the current point.

3.2 Convergence criteria
The optimization ends when the first derivatives are zero; in practice it ends when the changes fall below thresholds. Gaussian's defaults are: the maximum force must be below 0.00045 and its RMS below 0.0003; the predicted displacement for the next step must be below 0.0018 and its RMS below 0.0012. All four criteria must be satisfied simultaneously. For a very floppy system with a flat PES, for example, the forces may already be below threshold while the optimization still has a long way to go. For very large, very floppy systems there is an additional rule: if the forces fall two orders of magnitude below the threshold, the structure is considered optimized even though the displacement criteria are not yet met.

3.3 Input for geometry optimizations
The Opt keyword requests a geometry optimization.
Example 3.1 (file e3_01): optimization of ethylene. The route line is
#T RHF/6-31G(d) Opt Test
i.e., an RHF/6-31G(d) optimization.

3.4 The optimization output
Each optimization step is enclosed between two identical lines of "GradGradGradGrad..."; there you find the step number, the changes in the variables, and the convergence tests. Note that lengths here are in bohr. After each new geometry is produced, the single point energy is computed and the optimization continues until all four convergence criteria are met; the final geometry is taken as the optimized structure. Note that the energy of the final structure was obtained just before the last optimization step. Once the optimized structure has been found, look for
-- Stationary point found.
The table that follows lists the final optimized parameters and the molecular coordinates, and the properties requested in the route section are then printed.
Example 3.2 (file e3_02): optimization of fluoroethylene.

3.5 Locating transition states
Gaussian uses the STQN method to locate reaction transition states; the keyword is Opt=QST2.
Example 3.3 (file e3_03): transition state optimization for the rearrangement H3CO --> H2COH. The input has the form

#T UHF/6-31G(d) Opt=QST2 Test

H3CO --> H2COH Reactants

0,2
structure for H3CO

0,2
structure for H2COH

Gaussian also provides QST3, which optimizes from the reactant, the product, and a user-supplied guess for the transition state.

3.6 Difficult optimizations
Some systems are hard to optimize: the default approach fails, usually because the estimated force-constant matrix is too far from the true one. When the defaults fail, other options are needed; Gaussian provides many (see the User's Reference). Some of them:
Opt=ReadFC reads the initial force constants from the checkpoint file of a frequency calculation (often done at a lower level); this requires a line %Chk=filename before the route section naming the checkpoint file.
Opt=CalcFC computes the initial force constants with the same method and basis set as the optimization.
Opt=CalcAll recomputes the force constants at every optimization step; this is very expensive and used only in extreme cases.
Sometimes an optimization simply needs more steps, which can be requested with MaxCycle. If the checkpoint file was saved, Opt=Restart continues an interrupted optimization. When an optimization fails, do not blindly increase the number of steps: examine the individual steps, look for the reason convergence was not reached, and judge whether the system is converging. If the energy is decreasing steadily, more steps may succeed; if the energy changes erratically or moves away from the minimum, change the optimization method. You can also restart from an intermediate structure in the output: Geom=(Check,Step=n) takes the geometry of optimization step n from the checkpoint file.

3.7 Exercises
Exercise 3.1 (files 3_01a (180), 3_01b (0)): optimization of propene. Optimize the two conformers of propene, one with a methyl hydrogen at a dihedral angle of 180 degrees to the CCH unit and one at 0 degrees. The results differ by 0.003 hartree, with the 0-degree conformer lower.
Exercise 3.2 (files 3_02a (0), 3_02b (180), 3_02c (acetaldehyde)): optimization of vinyl alcohol. The dihedral angle between the hydroxyl hydrogen and the OCC plane can be 0 or 180 degrees; the 0-degree form is lower by 0.003 hartree. An optimization of acetaldehyde, done alongside, shows acetaldehyde lower still, by 0.027 hartree relative to the 0-degree enol.
Exercise 3.3 (file 3_03): optimization of vinylamine with all atoms constrained to one plane. Comparing the examples and exercises of this chapter shows how different substituents affect the C=C double bond.
Exercise 3.4 (file 3_04): optimization of chromium hexacarbonyl with the STO-3G and 3-21G basis sets. Adding SCF=NoVarAcc to the route line helps convergence; the 3-21G result is better than STO-3G.
Exercise 3.5 (files 3_05a (C6H6), 3_05b (TMS)): NMR of benzene. Optimize the geometry with B3LYP/6-31G(d), then compute the carbon chemical shifts at the optimized geometry with HF/6-311+G(2d,p). Reliable NMR predictions require an accurate geometry and a large basis set. The input file is

%Chk=NMR
#T B3LYP/6-31G(d) Opt Test

Opt

molecule specification
--Link1--
%Chk=NMR
%NoSave
#T RHF/6-311+G(2d,p) NMR Geom=Check Guess=Read Test

NMR

charge & spin

TMS must be computed in the same way. Results: absolute shielding 188.7879 ppm for TMS and 57.6198 ppm for benzene, giving a relative shift of 131.2 ppm versus the experimental value of 130.9 ppm.
Exercise 3.6 (files 3_06a (PM3), 3_06b (STO-3G)): optimization of C60 oxide. C60 has two kinds of C-C bonds: the 6-6 bonds shared by two six-membered rings and the 5-6 bonds shared by a six-membered and a five-membered ring. C60 oxide therefore has two isomers. This exercise uses PM3 and HF/STO-3G to decide which isomer is more stable and how the C-C bond changes on oxidation. The keyword Opt=AddRedundant prints the requested bond lengths and angles in the output; it requires extra input after the molecular geometry giving the bonds (two atom numbers) and angles (three atom numbers) of interest. The calculations show that on oxidation of the 6-6 bond the C-C bond survives, close to an epoxide, while the 5-6 bond opens. The different methods give similar geometries but very different energetics: MNDO, PM3, and HF/3-21G place the 5-6 oxidized isomer lower, while HF/STO-3G places the 6-6 isomer lower. Raghavachari's study of this system argues that kinetic factors are equally important; experimentally it is not yet known which isomer is lowest in energy, and more accurate calculations are needed.
Exercise 3.7 (file 3_07): transition state of a 1,1-elimination. For the reaction SiH4 --> SiH2 + H2, the transition state can be optimized with Opt=(QST2,AddRedundant), while following a particular bond length in the transition-state structure.
Exercise 3.8 (file 3_08): comparison of optimization procedures. Optimize the bicyclo[2.2.2] system in three ways: with the default redundant internal coordinates (Opt); in Cartesian coordinates (Opt=Cartesian); and in internal coordinates (Opt=Z-Matrix). The redundant internal coordinates need the fewest optimization steps and the Z-matrix the most.

Chapter 4: Frequency analysis
Frequency calculations serve several purposes: predicting the IR and Raman spectra of molecules (frequencies and intensities); providing force constants for geometry optimizations; characterizing the nature of a point on the potential energy surface; and computing zero-point energies and thermodynamic quantities such as the entropy and enthalpy of the system.

4.1 IR and Raman spectra
Geometry optimizations and single point energies treat the nuclei as fixed, but in reality atoms vibrate continuously. At equilibrium these vibrations are regular and predictable. Frequency analysis requires the second derivatives of the energy with respect to the nuclear positions. HF, density functional methods (such as B3LYP), second-order Møller-Plesset theory (MP2), and CASSCF provide analytic second derivatives; for other methods numerical second derivatives are available.

4.2 Input for frequency calculations
The Freq keyword requests a frequency analysis. Frequencies may only be computed at stationary points of the PES, so a frequency analysis must be run on an optimized structure; the most direct way is to request the optimization and the frequency analysis in the same route section. Note in particular that the frequency calculation must use exactly the same basis set and theoretical method that produced the geometry!
Example 4.1 (file e4_01): frequency analysis of formaldehyde, on a previously optimized geometry, with the route
# RHF/6-31G(d) Freq Test

4.3 Frequencies and intensities
The frequency analysis first recomputes the energy of the input structure and then the frequencies. Gaussian lists the frequency, intensity, and Raman polarizability of each vibrational mode. For Example 4.1 the first four modes are numbered 1-4, with symmetries B1, B2, A1, and A1.
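Since the text above repeatedly says to locate the "SCF Done" line in the output and read off the energy in hartrees, here is a small Python sketch (mine, not from the book) that extracts that number from an output file and converts it with the factors quoted earlier; the file name h2o.log is only an example.

    import re

    HARTREE_TO_KJ_PER_MOL = 2625.500   # conversion factor quoted in the text
    HARTREE_TO_EV = 27.2116

    def read_scf_energy(logfile):
        """Return the last 'SCF Done' energy (in hartree) found in a Gaussian output file."""
        energy = None
        with open(logfile) as fh:
            for line in fh:
                m = re.search(r"SCF Done:\s+E\(\S+\)\s*=\s*(-?\d+\.\d+)", line)
                if m:
                    energy = float(m.group(1))   # keep the last match (final geometry)
        return energy

    e_h = read_scf_energy("h2o.log")             # hypothetical file name
    if e_h is not None:
        print(f"{e_h:.6f} hartree = {e_h * HARTREE_TO_KJ_PER_MOL:.1f} kJ/mol "
              f"= {e_h * HARTREE_TO_EV:.2f} eV")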
(13) Factor Analysis

Factor Analysis
JerryLead, csxulijie@
May 11, 2011

1 The problem
In the training data we have considered so far, the number of examples $x^{(i)}$, m, was much larger than the number of features n, so regression, clustering, and so on all worked without much trouble. But when the number of training examples m is small, even m ≪ n, gradient-descent regression gives parameter estimates that vary greatly with the initial values (because there are fewer equations than parameters). Likewise, fitting the data with a multivariate Gaussian distribution runs into trouble. Let us work through it and see what goes wrong. The parameter estimates of a multivariate Gaussian are
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)},\qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)} - \mu)(x^{(i)} - \mu)^T,$$
i.e., the formulas for the mean and the covariance. Here $x^{(i)}$ denotes an example; there are m examples, each with n features, so μ is an n-dimensional vector and Σ is an n × n covariance matrix. When m ≪ n, Σ turns out to be singular (|Σ| = 0), which means $\Sigma^{-1}$ does not exist and the multivariate Gaussian cannot be fitted — more precisely, we cannot estimate Σ.
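A quick NumPy illustration of the singularity just described (synthetic data; everything in this snippet, including the sizes, is mine and not from the notes): with m < n the estimated Σ has rank far below n, so its determinant is zero and it cannot be inverted.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 20                              # far fewer samples than features

    X = rng.normal(size=(m, n))               # rows are the samples x^(i)
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / m         # MLE covariance, n x n

    print(np.linalg.matrix_rank(Sigma))       # at most m-1 after centering (4 here), far below n
    print(np.linalg.det(Sigma))               # essentially zero: Sigma is singular, Sigma^{-1} does not exist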
If we still want to model the data with a multivariate Gaussian, what can we do?

2 Restricting the covariance matrix
When there is not enough data to estimate Σ, we have to make assumptions about the model parameters. Before, we tried to estimate the full Σ (every entry of the matrix); now suppose instead that Σ is diagonal (the features are mutually independent). Then we only need the variance of each feature, and only the diagonal entries of Σ are nonzero:
$$\Sigma_{jj} = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)} - \mu_j)^2.$$
Recall the geometry of the two-dimensional multivariate Gaussian discussed earlier: its projection onto the plane is an ellipse, centered at μ, with the shape of the ellipse determined by Σ. If Σ becomes diagonal, both axes of the ellipse are parallel to the coordinate axes.

If we want to restrict Σ further, we can assume that the diagonal entries are all equal:
$$\Sigma = \sigma^2 I,\qquad \sigma^2 = \frac{1}{mn}\sum_{j=1}^{n}\sum_{i=1}^{m}(x_j^{(i)} - \mu_j)^2,$$
i.e., σ² is the average of the diagonal entries from the previous step; in the two-dimensional Gaussian picture the ellipse becomes a circle.

To estimate the full Σ, we need m ≥ n + 1 to guarantee that the maximum-likelihood estimate of Σ is non-singular.
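Continuing the same synthetic example, the two restricted estimators introduced in this section are one line each in NumPy (again, this sketch and its variable names are mine, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 5, 20
    X = rng.normal(size=(m, n))
    mu = X.mean(axis=0)

    # Diagonal restriction: Sigma_jj = (1/m) sum_i (x_j^(i) - mu_j)^2, off-diagonals forced to 0
    Sigma_diag = np.diag(((X - mu) ** 2).mean(axis=0))

    # Spherical restriction: Sigma = sigma^2 I, sigma^2 = (1/(mn)) sum_j sum_i (x_j^(i) - mu_j)^2
    sigma2 = ((X - mu) ** 2).mean()
    Sigma_sph = sigma2 * np.eye(n)

    # Both restricted estimates are invertible as long as every feature has nonzero variance
    print(np.linalg.cond(Sigma_diag), np.linalg.cond(Sigma_sph))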
Stanford University Machine Learning Course: Personal Notes, Complete Edition

CS 229 Machine Learning (Personal Notes)

Contents
(1) Linear regression, logistic regression, and general regression
(2) Discriminative models, generative models, and naive Bayes
(3) Support vector machines (SVM), part 1
(4) Support vector machines (SVM), part 2
(5) Regularization and model selection
(6) The k-means clustering algorithm
(7) Mixtures of Gaussians and the EM algorithm
(8) The EM algorithm
(9) Online learning
(10) Principal component analysis
(11) Independent component analysis
(12) Linear discriminant analysis
(13) Factor analysis
(14) Reinforcement learning
(15) Canonical correlation analysis
(16) Partial least squares regression

These are the personal study notes I took in the first half of 2011 while following the Stanford University Machine Learning course; the content comes mainly from Professor Andrew Ng's lecture notes and course videos. Some material from other papers and from other universities' lecture notes is also included. Each chapter is organized along the lines of my own thinking while studying.

Because these are personal notes, they contain errors of exposition, errors in formulas, misunderstandings, and simple typos. More importantly, I was a beginner, so by no means assume that every line of reasoning here is correct. If anything is in doubt, go first to Professor Andrew Ng's original notes and videos, and then ask someone more experienced. Many of the questions readers raise on my blog I cannot answer, because my own level is genuinely limited; for deeper material it is best to consult experts and study the relevant papers. If any reader would like to add their own notes on top of this version, send me an email and I will provide the original Word (docx) file.

By the way, I am currently a graduate student at the Institute of Software, Chinese Academy of Sciences, about to finish my third year; my research area is distributed computing, mainly large-scale distributed data processing. Day to day I work with Hadoop, Pig, Hive, Mahout, NoSQL and the like, and follow the systems and database conferences. I hope we can keep in touch; in the future the blog will carry more of that and less machine learning. Anyway, I wish everyone progress in their studies and success in their careers!

1. On Regression Methods
JerryLead
February 27, 2011

1 Abstract
This report is a summary of my understanding after studying the first four lectures of the Stanford machine learning course together with the accompanying lecture notes.
(3) Support Vector Machines (SVM), Part 1

Support Vector Machines (Part 1)
JerryLead, csxulijie@
Saturday, March 12, 2011

1 Introduction
The support vector machine is essentially the best supervised learning algorithm there is. I first encountered SVMs last summer, when my advisor asked for a report on Statistical Learning Theory; I downloaded an introductory tutorial from the web, written very accessibly, and at the time only picked up a rough idea of the concepts. The Stanford material gave me a chance to learn SVMs again. Most of the orthodox treatments I have seen start from VC-dimension theory and the principle of structural risk minimization and then introduce the SVM, and some texts jump straight to separating hyperplanes. This material instead starts from the logistic regression covered in the earlier lectures and leads into the SVM, which both reveals the connection between the models and makes the transition feel more natural.

2 Revisiting logistic regression
Logistic regression learns a 0/1 classification model from features; the model takes a linear combination of the features as its argument, and since that linear combination ranges from minus infinity to plus infinity, the logistic function (also called the sigmoid function) maps it into (0, 1), and the mapped value is interpreted as the probability of belonging to class y = 1. Formally, the hypothesis function is
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$
where x is an n-dimensional feature vector and g is the logistic function
$$g(z) = \frac{1}{1 + e^{-z}},$$
whose graph maps the entire real line into (0, 1). The hypothesis $h_\theta(x)$ is the probability that the feature vector belongs to class y = 1. To classify a new feature vector, we simply compute $h_\theta(x)$: if it is greater than 0.5 the example belongs to class y = 1, otherwise to class y = 0.

Look at $h_\theta(x)$ again: it depends only on $\theta^T x$. If $\theta^T x > 0$ then $h_\theta(x) > 0.5$; g(z) is merely a mapping, and the real decision is made by $\theta^T x$. Moreover, when $\theta^T x \gg 0$, $h_\theta(x) \approx 1$, and conversely $h_\theta(x) \approx 0$. If we work with $\theta^T x$ directly, the goal of the model is nothing more than to make $\theta^T x \gg 0$ for training examples with y = 1 and $\theta^T x \ll 0$ for those with y = 0. Logistic regression thus learns θ so that the positive examples have $\theta^T x$ far greater than 0 and the negative examples far less than 0, and it emphasizes achieving this on all of the training examples. Shown graphically, the middle line is $\theta^T x = 0$, and logistic regression pushes every point as far away from that middle line as possible.
Gaussian03 Notes

1 Running Gaussian03
1.1 Scratch files
While running, Gaussian uses several scratch files: the checkpoint file (name.chk), the read-write file (name.rwf), the two-electron integral file (name.int), and the two-electron integral derivative file (name.d2e). By default these files are named after the process ID of the Gaussian run, stored in the scratch directory, and deleted when the calculation finishes. Usually we want to keep the checkpoint file, which only requires naming it (or giving a path) in the input file. Common forms: %Chk=name or %Chk=chem/scratch/name.
%RWF=path   read-write file
%Int=path   integral file
%D2E=path   integral derivative file
With these three commands the three files can be placed on different disks.
The %NoSave command: files named before %NoSave are deleted at the end of the job, and files named after it are kept. For example:
%RWF=/chem/scratch2/water     files named up to this point are deleted
%NoSave
%Chk=water                    files named from here on are kept
1.2 Controlling memory use
%mem=N KB/MB/GB/GW

2 Gaussian03 input
2.1 Overview of the input file
The basic structure of a Gaussian03 input file (as presented in the Gaussian03 input dialog) divides the input into the job type, the computational method, and the basis set.
Input syntax. Gaussian syntax rules: input is free-format and case-insensitive; spaces, tabs, forward slashes (/), and commas can all be used to separate items within a line. Keyword formats:
keyword = option
keyword(option)
keyword=(option1, option2, ...)
keyword(option1, option2, ...)
Options may take numerical values. In Gaussian, all keywords and options must be given as recognizable abbreviations. An external file is included with @filename. Comment lines begin with !.
Gaussian03 job types. A route of the form method2/basis2 // method1/basis1 means: optimize the geometry at method1/basis1, and then compute a single point energy at the optimized structure with method2/basis2.
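As a small illustration of the input layout described in section 2.1 (route line, blank line, title, blank line, charge and multiplicity, geometry, trailing blank line), the following Python sketch writes a single-point input file using the water example that appears earlier in this document; the script itself is mine and is not part of the original notes.

    route = "#T RHF/6-31G(d) Test"
    title = "My first Gaussian job: water single point energy"
    charge, multiplicity = 0, 1
    geometry = [
        "O   -0.464   0.177   0.0",
        "H   -0.464   1.137   0.0",
        "H    0.441  -0.143   0.0",
    ]

    # Link 0 command (%Chk) first, then route, blank line, title, blank line,
    # charge/multiplicity, geometry, and a final blank line.
    lines = ["%Chk=water", route, "", title, "", f"{charge} {multiplicity}", *geometry, ""]
    with open("water.gjf", "w") as fh:       # .gjf on Windows, .com on Unix (see above)
        fh.write("\n".join(lines) + "\n")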
Gaussian03 Study Notes (Part 2)

(1) The input file (fragment; the geometry ends with the Z-matrix variable HCC=121.5)
Line 1: the route section. It begins with "#"; "T" means only the important parts of the output are printed; "RHF" selects the restricted Hartree-Fock method; "6-31G(d)" is the basis set; "Opt" requests a geometry optimization; "Test" means the job is not recorded in the Gaussian job archive.
Line 2: a blank line, required by the Gaussian input format.

(2) Analysis of the output file
The output begins with the copyright and citation section:

 Entering Link 1 = d:\G03W\l1.exe PID= 1016.
 Copyright (c) 2003, Gaussian, Inc.
 All Rights Reserved.
 ...
 The following legend is applicable only to US Government contracts under FAR:
 RESTRICTED RIGHTS LEGEND
 Use, reproduction and disclosure by the US Government is subject ...
 ... computational chemistry and represents and warrants to the licensee that it is not
 a competitor of Gaussian, Inc. and that it will not use this program in any manner
 prohibited above.
 ...
 Gaussian 03: x86-Win32-G03RevB.01 3-Mar-2003
 31-Jan-2007
 *********************************************

This is followed by the sections reporting the maximum disk usage, the route, the title, and the molecule specification.
The Multivariate Gaussian Distribution
Chuong B. Do
October 10, 2008

A vector-valued random variable $X = [X_1\ \cdots\ X_n]^T$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{S}_{++}^n$ if its probability density function is given by
$$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right).$$
We write this as $X \sim \mathcal{N}(\mu,\Sigma)$. (Recall from the section notes on linear algebra that $\mathbb{S}_{++}^n$ is the space of symmetric positive definite $n\times n$ matrices.) In these notes, we describe multivariate Gaussians and some of their basic properties.

1 Relationship to univariate Gaussians
Recall that the density function of a univariate normal (or Gaussian) distribution is given by
$$p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right).$$
Here, the argument of the exponential function, $-\frac{1}{2\sigma^2}(x-\mu)^2$, is a quadratic function of the variable $x$. Furthermore, the parabola points downwards, as the coefficient of the quadratic term is negative. The coefficient in front, $\frac{1}{\sqrt{2\pi}\,\sigma}$, is a constant that does not depend on $x$; hence, we can think of it as simply a "normalization factor" used to ensure that
$$\frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)dx = 1.$$

[Figure 1: The figure on the left shows a univariate Gaussian density for a single variable X. The figure on the right shows a multivariate Gaussian density over two variables X1 and X2.]

In the case of the multivariate Gaussian density, the argument of the exponential function, $-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$, is a quadratic form in the vector variable $x$. Since $\Sigma$ is positive definite, and since the inverse of any positive definite matrix is also positive definite, then for any non-zero vector $z$, $z^T\Sigma^{-1}z > 0$. This implies that for any vector $x \ne \mu$,
$$(x-\mu)^T\Sigma^{-1}(x-\mu) > 0,$$
$$-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) < 0.$$
Like in the univariate case, you can think of the argument of the exponential function as being a downward opening quadratic bowl. The coefficient in front (i.e., $\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}$) has an even more complicated form than in the univariate case. However, it still does not depend on $x$, and hence it is again simply a normalization factor used to ensure that
$$\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)dx_1\,dx_2\cdots dx_n = 1.$$

2 The covariance matrix
The following proposition (whose proof is provided in the Appendix A.1) gives an alternative way to characterize the covariance matrix of a random vector $X$:

Proposition 1. For any random vector $X$ with mean $\mu$ and covariance matrix $\Sigma$,
$$\Sigma = E[(X-\mu)(X-\mu)^T] = E[XX^T] - \mu\mu^T.\qquad (1)$$

In the definition of multivariate Gaussians, we required that the covariance matrix $\Sigma$ be symmetric positive definite (i.e., $\Sigma \in \mathbb{S}_{++}^n$). Why does this restriction exist? As seen in the following proposition, the covariance matrix of any random vector must always be symmetric positive semidefinite:

Proposition 2. Suppose that $\Sigma$ is the covariance matrix corresponding to some random vector $X$. Then $\Sigma$ is symmetric positive semidefinite.

Proof. The symmetry of $\Sigma$ follows immediately from its definition. Next, for any vector $z \in \mathbb{R}^n$, observe that
$$z^T\Sigma z = \sum_{i=1}^{n}\sum_{j=1}^{n}\Sigma_{ij}z_i z_j\qquad (2)$$
$$= \sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}[X_i,X_j]\cdot z_i z_j = \sum_{i=1}^{n}\sum_{j=1}^{n}E\left[(X_i - E[X_i])(X_j - E[X_j])\right]\cdot z_i z_j$$
$$= E\left[\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - E[X_i])(X_j - E[X_j])\cdot z_i z_j\right].\qquad (3)$$
Here, (2) follows from the formula for expanding a quadratic form (see section notes on linear algebra), and (3) follows by linearity of expectations (see probability notes). To complete the proof, observe that the quantity inside the brackets is of the form $\sum_i\sum_j x_i x_j z_i z_j = (x^T z)^2 \ge 0$ (see problem set #1). Therefore, the quantity inside the expectation is always nonnegative, and hence the expectation itself must be nonnegative. We conclude that $z^T\Sigma z \ge 0$.

From the above proposition it follows that $\Sigma$ must be symmetric positive semidefinite in order for it to be a valid covariance matrix. However, in order for $\Sigma^{-1}$ to exist (as required in the definition of the multivariate Gaussian density), then $\Sigma$ must be invertible and hence full rank. Since any full rank symmetric positive semidefinite matrix is necessarily symmetric positive definite, it follows that $\Sigma$ must be symmetric positive definite.

3 The diagonal covariance matrix case
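As a quick numerical companion to the definition above (not part of the original notes; the example numbers are arbitrary), the following NumPy sketch evaluates $p(x;\mu,\Sigma)$ directly from the formula and uses a Cholesky factorization to confirm that the chosen Σ is symmetric positive definite.

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        """Multivariate normal density p(x; mu, Sigma) for Sigma symmetric positive definite."""
        n = mu.shape[0]
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
        norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm_const

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.3],
                      [0.3, 1.0]])

    np.linalg.cholesky(Sigma)      # raises LinAlgError if Sigma is not positive definite
    print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))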