李宏毅-B站机器学习视频课件Bias and Variance (v2)

合集下载

李宏毅机器学习课程——Lifelonglearning学习笔记

李宏毅机器学习课程——Lifelonglearning学习笔记概述lifelong learning⾮常直观，意思是机器不能前边学后边忘。

常见的⽅法是对前边的task中学习出来的参数加⼀个保护系数，在后⾯的任务中，训练参数时，对保护系数⼤的参数很难训练，⽽保护系数⼩的参数则容易⼀些。

下⾯的图⾮常直观，颜⾊的深浅代表loss的⼤⼩，颜⾊越深loss越⼩。

在task1中θ2的变化对loss的变化⾮常敏感，⽽θ1则不敏感，所以在task2中尽量只通过改变θ1来减⼩loss，⽽不要改变θ2。

在lifelong learning中，loss的计算公式如下：L′(θ)=L(θ)+λΣi b i(θi−θb i)2其中b i就是对θ的保护系数，θi表⽰本次task中需要学习的参数，θb i是从之前的task中学习到的参数。

不同的⽅法差异就在于b i的计算。

这⾥将会结合Coding整理⼀下遇到的三个⽅法。

Coding这部分针对HW14，介绍了EWC，MAS，SCP三种⽅法，这⾥讲解⼀下具体的代码实现，并定性地分析⼀下这些⽅法是如何把哪些重要的参数保护起来。

EWCEWC中不同的保护系数f i使⽤如下的⽅法计算得到：F=[∇log(p(y n|x n,θ∗A))∇log(p(y n|x n,θ∗A))T]F的对⾓线的各个数就是各个θ的保护系数。

p(y n|x n,θ∗A)指的就是模型在给点之前 task 的 data x n以及给定训练完 task A (原来)存下来的模型参数θ∗A得到y n(x n对应的 label ) 的后验概率。

其实对参数θi，它的保护系数就是向量log(p(y n|x n,θ∗A))对θ1的偏导数∂log(p(y n|x n,θ∗A))∂θ1与⾃⾝的内积。

当对这个参数敏感时，这个偏导数会变⼤，当预测结果正确率⾼时，p(y n|x n)也会⾼，最终都会使的保护系数变⼤。

某⼀个参数⽐较敏感，这个参数下正确率⾼时，这个参数就会被很好地保护起来。

李宏毅2021春机器学习课程笔记——生成对抗模型模型

李宏毅2021春机器学习课程笔记——⽣成对抗模型模型本⽂作为⾃⼰学习李宏毅⽼师2021春机器学习课程所做笔记，记录⾃⼰⾝为⼊门阶段⼩⽩的学习理解，如果错漏、建议，还请各位博友不吝指教，感谢！！概率⽣成模型概率⽣成模型（Probabilistic Generative Model）简称⽣成模型，指⼀系列⽤于随机⽣成可观测数据的模型。

假设在⼀个连续或离散的⾼维空间\(\mathcal{X}\)中，存在⼀个随机向量\(X\)服从⼀个未知的数据分布\(p_r(x), x \in\mathcal{X}\)。

⽣成模型根据⼀些可观测的样本\(x^{(1)},x^{(2)}, \cdots ,x^{(N)}\)来学习⼀个参数化的模型\(p_\theta(x)\)来近似未知分布\(p_r(x)\)，并可以⽤这个模型来⽣成⼀些样本，使得⽣成的样本和真实的样本尽可能地相似。

⽣成模型的两个基本功能：概率密度估计和⽣成样本（即采样）。

隐式密度模型在⽣成模型的⽣成样本功能中，如果只是希望⼀个模型能⽣成符合数据分布\(p_r(x)\)的样本，可以不显⽰的估计出数据分布的密度函数。

假设在低维空间\(\mathcal{Z}\)中有⼀个简单容易采样的分布\(p(z)\)，\(p(z)\)通常为标准多元正态分布\(\mathcal{N}(0,I)\)，我们⽤神经⽹络构建⼀个映射函数\(G : \mathcal{Z} \rightarrow \mathcal{X}\)，称为⽣成⽹络。

利⽤神经⽹络强⼤的拟合能⼒，使得\(G(z)\)服从数据分布\(p_r(x)\)。

这种模型就称为隐式密度模型（Implicit Density Model）。

隐式密度模型⽣成样本的过程如下图所⽰：⽣成对抗⽹络⽣成对抗⽹络（Generative Adversarial Networks，GAN）是⼀种隐式密度模型，包括判别⽹络（Discriminator Network）和⽣成⽹络（Generator Network）两个部分，通过对抗训练的⽅式来使得⽣成⽹络产⽣的样本服从真实数据分布。

COM336COM648COM682 Contents 1 THE BACK-PROPAGATION ALGORITHM 3

LEARNING AND GENERALIZATION IN NEURAL NETWORKSSteve Renals27Febuary1998Contents1THE BACK-PROPAGATION ALGORITHM31.1Introduction (3)1.1.1Units and weights in ANNs (3)1.1.2Neural network architectures (4)1.1.3Learning (5)1.1.4Applications of neural networks (6)1.2Supervised learning in feed-forward networks (6)1.2.1Single layer perceptrons (6)1.2.2Linear Separability (9)1.2.3Multi-layer perceptrons—The back-propagation algorithm (9)1.3Issues in training MLPs by backprop (13)1.3.1Batch and online learning (13)1.3.2When is training complete? (13)1.3.3Will backprop alwaysﬁnd the solution? (13)2GENERALIZATION152.1Introduction (15)2.2Evaluating Generalization:Training,Test and Validation Sets (15)2.3Training and Generalization (16)2.4Bias and Variance (19)2.5References (21)3OVERTRAINING AND HOW TO A VOID IT233.1Introduction (23)3.2Cross-Validation and Early Stopping (23)3.3Regularization (25)4CLASSIFICATION USING NEURAL NETWORKS274.1Introduction (27)4.2Estimation of Posterior Probabilities (27)4.3Practical Implications of Posterior Probability Estimation (28)4.3.1Minimum Error-Rate Decisions (29)4.3.2Compensating for Different Priors (29)4.3.3Outputs Sum to One (29)4.3.4Combining Network Outputs (29)4.3.5Conﬁdence and Rejection (30)4.4References (30)1Chapter1THE BACK-PROPAGATION ALGORITHM1.1IntroductionIn this course we are interested in artiﬁcial neural networks—biology does not concern us!In the only analogy of the kind that I’ll use,the table below gives the correspondences between the terminology used for biological neural networks and that used for artiﬁcial neural networks(ANNs):Neuron Unit/Node/CellSynapse ConnectionSynaptic Weight WeightNeuron Firing Rate Node Output1expv(1.5)Note that in both these cases the bias of the unit can be used to shift the function along the x axis.Units in neural networks may be described as:Input units,which receive some input from the environment (e.g.a pixel map for a net-work to recognize handwritten characters);Output units may be observed from the environments (e.g.the class of character input tothe network);Hidden units which are internal to the network and do not directly interact with the envi-ronment.1.1.2Neural network architecturesThe architecture of a neural network tells us what connections between units are allowed.Many architectures have been studied and used;here we’ll characterize them in terms of some deﬁning characteristics.Recurrent or feed-forward?In a feed-forward network there is an ordering imposed onthe nodes in a network:if there is a connection from unit a to unit b then there can-not be a connection from b to a .There is no such constraint in recurrent networks (ﬁgure 1.3(a)),in which any unit can be connected to any other,thus allowing loops or recurrences .The absence of recurrences makes feed-forward networks much eas-ier to train,but also less powerful.A feed-forward structure makes things simpler because there is a simple ﬂow of computation through the network.In a layered feed-forward network (ﬁgure 1.3(b))the weights are organized in layers connecting groups of units together.Fully or partially connected?In a fully connected network all allowed connections areimplemented;in a partially connected network there is a structuring of connections.For example in the AT&T handwriting recognition system (part 5of these notes)the network connectivity is structured to take account of prior knowledge about the problem....1y 2y ny w nw 2w 1v =y = f(v)yΣw y + w y + ... + w y nn 2112Figure 1.1:ANN unit with weighted input summation41expv(1.8)where we have assumed that the output unit transfer function f O is a sigmoid.The activation value of output unit k is represented by v O k .Input 0always has value 1,so theweight w OI k 0corresponds to the bias of output unit k .Exercise:Rewrite the above equations in matrix-vector notation (i.e.representing theweights as a single matrix W ,the inputs as vector y I ,and so on).The key problem in supervised training is that of credit assignment :given that there is an output error (i.e.the network outputs differ from the desired outputs supplied by the training data),how do we adjust the weights of the network to minimize that output error over all the training data?The basic idea is as follows:If an output unit has the desired output value then do not adjust the weights leading into that unit;If an output unit has an output less than the desired output,then increment the weights leading into that unit by a small amount (proportional to the difference between de-sired and actual outputs);If an output unit has an output greater than the desired output,then decrement the weights leading into that unit by a small amount (proportional to the difference be-tween desired and actual outputs);This assumes that all the inputs are positive.Exercise:How do the rules change if some inputs are negative?This is rather heuristic.A more principled way of proceeding involves deﬁning an error function for the network.For a single pattern p we deﬁne the error E p :E p12∑k d O k y O k2(1.10)e kd O ky O k(1.11)where d O k is the desired value for output unit k and e kis the local error for output unit k .We describe E p as the sum squared error.We can sum E p over all patterns to give the overall training set error:E∑pE p(1.12)E p tells us how well the network performs on pattern p .E p 0means that it is perfect for this pattern and no weights need adjusting.The larger E p the worse the network is doing.We can write e k in terms of the weights:e kd kf O ∑iw OI ki y Ii (1.13)72∑kd k f O∑iw OI ki y I i2(1.14)E p is a function of the weights w OIki ,the input values y I i and desired outputs d Ok.The inputsand desired outputs are speciﬁed by the training set;the task is to optimize the weights given that training set.Thus the learning task becomes the following:adjust the weightsw OI kiso that E is minimized.One way to minimize E(or any function)is by a process called gradient descent.Theidea of gradient descent is to look at the error surface(i.e.E in terms of all the w OI ki)and to adjust the weights in the direction of steepest descent.We can express this as:∆w OI kiη∂E p∂w OI ki ∂E∂w OI ki(1.17)∂E∂v O k∂v O k∂y O kd k y O ke k(1.19)∂y O k∂w OI kiy I i(1.21) (Exercise:Verify that each of these derivatives is correct.)Substituting(1.19),(1.20)and(1.21)into(1.18)we have the derivative we require:∂E1In one-dimensional input space it would deﬁne a point;in two-dimensional input space it would deﬁne a line; in three-dimensional input space a plane.Things get harder to visualize in higher dimensions.9NOTATIONAs for the single layer perceptron(equations(1.6–1.8))we can write down the set of equations that describe the behaviour of the network:w HI ji y I i(1.24)v H j∑iy H j f O v H j(1.25)v O k∑w OH k j y H i(1.26)jy O k f O v O k(1.27) And we’ll assume both transfer functions are sigmoids:1f O v f H v2∑ke2k1∂w OH k j∂E∂v O k∂v O k∂w HI ji∂E∂w HI ji(1.33)The second term on the right hand side can be written down simply by differentiating(1.24) and(1.25):∂y H j∂v H j∂v H j∂y H j ∑k∂E∂y H j(1.36)We must sum over all output units,since each hidden unit is connected to all output units and so affects the error of all output units.The required derivatives are now straightforward:∂E∂y H jy O k1y O k w OH k j(1.38)∂E∂w HI ji ∑kd k y O k y0k1y O k w OH k j y H j1y H j y I i(1.40)y H j1y H j∑kd k y O k y0k1y O k w OH k j y I i(1.41)∆w HI jiηy H j1y H j∑kd k y O k y0k1y O k w OH k j y I i(1.42)We can now write down the algorithm for training an MLP by back-propagation oferror:1.Initialize weight matrix to small random values2.While E is unsatisfactory(a)For each pattern p:pute hidden node activations(v H j)using equation(1.24)pute hidden node outputs(y H j)using equations(1.25)and(1.28)pute output node activations(v Ok)using equation(1.26)pute network outputs(y Ok)using equations(1.27)and(1.28)pute the local output errors(e k)and the network error for this pattern(E p)using equations(1.11)and(1.9)pute the gradient of the error E p with respect to each hidden-to-outputweight w OHk jand update the weights using equation(1.32)pute the gradient of the error E p with respect to each input-to-hidden weight w HI ji and update the weights using equation(1.42)12∂w ji ∑p ∂E pChapter2GENERALIZATION2.1IntroductionWe can regard feed-forward networks as implementing a function mapping an input vector (y I y I1y I i y I n)to an output vector(y O y O1y O i y O n).The parameters of this function are the weights.We can divide the functions implemented by an MLP into classi-ﬁers(which map an input vector to a discrete class)and regressors(which map a continuous input vector to a continuous output vector).In the case of classiﬁcation the output vector usually has the same dimensionality as there are classes,with each element corresponding to one of the classes.In regression,the output vectors corresponds to the continuous valued quantity being estimated.The classiﬁcation and regression functions that are estimated using an MLP(or other neural network)are unknown—all we have is a training set,that gives input-output ex-amples of the function.The role of neural network training is to identify this“mystery function”,given only the training data.What the training process(e.g.backprop)does is to estimate the parameters of the function(i.e.the weights of the network)so that it replicates the data as well as possible—and generalizes to new data well.It turns out that obtaining the best generalization performance on new data is not usually the same solution as replicating the training data as well as possible.Consider the case when we have very few training patterns but a large network with many patterns.It will be relatively easy to manipulate the weights to reproduce the training set,but it does not seem likely that the resulting network will have learned the charac-teristics of new data.On the other hand,if we have very many training patterns,and we manage to train the network to replicate them,then it would seem that it is more likely to respond correctly to new,unseen patterns.Our aim is to make such intuitions precise,and to develop approaches which can maximize the generalization performance of a network.2.2Evaluating Generalization:Training,Test and Valida-tion SetsThe obvious way to obtain an estimate of how well a network generalizes is to test its performance on new data.However we have to be careful which data we use—if we keep on using the same test set,then although this data isn’t being used by the training algorithm, the fact we are trying to get the best performance on this particular data set may cause other factors(e.g.choice of learning rate)to be tuned to this“test set”.To make the distinction explicit it is common to work with three data sets:Training Set This is the data that used by the training algorithm to adjust the weights of the network.15Chapter3OVERTRAINING AND HOW TO A VOID IT3.1IntroductionThe lowest generalization error is achieved by an optimal tradeoff between bias and vari-ance.A function that is too closelyﬁtted to the training set will tend to have a large variance and hence give a large expected error—this is called overtraining.We can decrease the variance by smoothing the function(hence increasing the bias)but if we go to far the large bias will cause the expected generalization error to become large again.One way to reduce both the bias and variance is to use moreﬂexible networks while simultaneously increasing the size of the training set.Increasing theﬂexibility of the net-work will reduce the bias,while adding more training data will decrease the variance since each extra training data point adds a new constraint in the space of functions available to the network that implement the function described by the training data.However in many situations we do not have the option of increasing the amount of training data.If we have some prior information about the unknown function we are trying to model,then using this information to constrain the network function will not necessarily increase the bias.For example,if the true function is linear,then constraining the network to linear functions will not increase the bias since the constrained functions are consistent with the true network function.The bias-variance model tells us that in some situations (with limited training data)performance will be better with a constrained model(e.g.a simple linear network)than a less constrained model(e.g.a multi-layer perceptron)even though the lessﬂexible model is a special case of the moreﬂexible one.However,if we do not have this task-speciﬁc information we can still make an attack on the bias-variance problem(or,equivalently,the problem of overtraining).We do this by methods that reduce the effective complexity of the network.Two important methods for doing this are cross validation and regularization.3.2Cross-Validation and Early StoppingThe basic idea of cross-validation is to hold out a validation set from the training data, which is used to control when to stop training.The idea of early stopping comes about from some empirical results that indicate that while the training error will monotonically decrease as training progresses,the validation set error will only decrease up to a certain point,after which it will increase again.This is illustrated inﬁgure3.1.This result implies that generalization can be improved if the network is tested with the validation set every epoch or every few epochs(an epoch is a complete pass through the training set),and that23∂y I2iThen minimizing E W will result in a low curvature(smoother)function.The derivatives of this type of regularizer can be computed for an MLP,but curvature based regularizers haven’t been used much in practice in neural computing(although they are widely used in computer vision and other areas).Weight decay is a more commonly used regularizer.In weight decay,we deﬁne:1E Ww i∂w iRemembering that backprop training uses the negative of this derivative,we have:∂E∆w iηβ∂E W∂w iηβw i(3.5)∂w i25Chapter4CLASSIFICATION USING NEURAL NETWORKS4.1IntroductionThe pattern classiﬁcation task is to classify an input vector x(y I using the notation of part1of these notes)into one of M classes(C1C2C m).This is a problem that has been studied for many years in theﬁeld of statistical pattern recognition.A solution to this problem is to compute the Bayesian Posterior Probability P C i x for each class and to assign the input vector to the class with the largest posterior probability.The posterior probability,P C i x is the conditional probability of the class C i given the input data x.A method which directly estimates such probabilities may be thought of as a recognition model.Recognition models are discriminative—training consists of moving class boundaries to maximize the correct classiﬁcation of the training data by the model.Standard statistical pattern recognition techniques do not usually estimate P C i x di-rectly as this can be difﬁcult.Instead they use a generative model.A generative model may be thought of as a machine C i that generates pattern vectors x with a likelihood P x C i. Training consists of building a machine for each class,using the training data;recognition involves computing the likelihood of each machine generating a test example,and labelling the pattern with the class whose machine was most likely to have generated the data.Intuitively it seems that neural networks are more like recognition models than genera-tive models when used for classiﬁcation.Indeed,it turns out that(given some conditions) we can show that feed-forward networks trained as a classiﬁer directly estimate the poste-rior probability of each class given the data P C i x.4.2Estimation of Posterior ProbabilitiesWhen a feed-forward network(e.g.an MLP)is trained as a classiﬁer a“1-from-M”output coding is usually employed.That is,if there are M classes we use M output units,1for each class.The target vector d used in training for a pattern of class i,will consist of zeros except that d i 1.If:The network is trained using a sum-squared error cost function(i.e.,the usual error function),There is enough training dataThere network isﬂexible enough to represent the desired function,andThe learning algorithm(e.g.backprop)is powerful toﬁnd the global minimum,27(4.1)p xwhere P C i is the prior probability of class C i and p x is a normalizing constant.Gen-erative pattern recognition models estimate both the likelihood p x C i and the prior P C i to arrive at a posterior probability estimate.Neural networks,on the other hand,are able to estimate the posterior probability directly.The prior probability is simply the probability of each class occurring,without having seen any data.The training data estimate of the priors is simply the relative frequency of each class in the training set.This decomposition may be used if the prior probabilities are known to be different in the training and test sets.To compensate for a different prior,we simply divide each output by the relative frequency of the corresponding class,and multiply by a new estimate of the prior probability.Bishop gives the following example for when this might be useful: Consider the problem of classifying medical images into“normal”and“tu-mour”.When used for screening purposes in the general population we havea very low prior probability for tumour.To obtain a good variety of tumourimages in the training set would therefore require huge numbers of examples.An alternative is to increase artiﬁcially the proportion of tumour images in thetraining set,and then to compensate for different priors in the test data.Theprior probabilities for tumours in the general population can be obtained frommedical statistics without having to collect the corresponding images.Cor-rection of the network outputs is then a simple matter of multiplication anddivision.4.3.3Outputs Sum to OneSince the outputs of a network trained as a classiﬁer are probability estimates they will sum to one.This fact can be used to as a measure of how well a network has trained.Further,it is possible to use this fact as a constraint in the network architecture—the outputs may be normalized in such a way so that they are forced to sum to1.It can be shown that the average posterior probability of each class in the training set will correspond to the training set estimate of the class prior probability(the relative fre-quency of that class).We can use this information as a measure of how well the network is trained by checking whether the training set posterior probability averages for each output unit do indeed tend to the prior probabilities.If they do not,this is a good indication that the network is not modelling the posterior probabilities well.4.3.4Combining Network OutputsIt has been shown by several researchers that rather than using a single network to solve a problem,there can be beneﬁts in breaking a problem down into subunits each solved by29。

李宏毅深度学习笔记-半监督学习

李宏毅深度学习笔记-半监督学习半监督学习什么是半监督学习？⼤家知道在监督学习⾥，有⼀⼤堆的训练数据（由input和output对组成）。

例如上图所⽰x r是⼀张图⽚，y r是类别的label。

半监督学习是说，在label数据上⾯，有另外⼀组unlabeled的数据，写成x u (只有input没有output)，有U笔ublabeled的数据。

通常做半监督学习的时候，我们常见的情景是ublabeled的数量远⼤于labeled的数量（U>>R)。

半监督学习可以分成两种：⼀种叫做转换学习，ublabeled 数据就是testing set，使⽤的是testing set的特征。

另⼀种是归纳学习，不考虑testing set，学习model的时候不使⽤testing set。

unlabeled数据作为testing set，不是相当于⽤到了未来数据吗？⽤了label 才算是⽤了未来数据，⽤了testing set的特征就不算是使⽤了未来数据。

例如图⽚，testing set的图⽚特征是可以⽤的，但是不能⽤label。

什么时候使⽤转换学习或者归纳学习？看testing set是不是给你了，在⼀些⽐赛⾥，testing set给你了，那么就可以使⽤转换学习。

但在真正的应⽤中，⼀般是没有testing set的，这时候就只能做归纳学习。

为什么使⽤半监督学习？缺有lable的数据，⽐如图⽚，收集图⽚很容易，但是标注label很困难。

半监督学习利⽤未标注数据做⼀些事。

对⼈类来说，可能也是⼀直在做半监督学习，⽐如⼩孩⼦会从⽗母那边做⼀些监督学习，看到⼀条狗，问⽗亲是什么，⽗亲说是狗。

之后⼩孩⼦会看到其他东西，有狗有猫，没有⼈会告诉他这些动物是什么，需要⾃⼰学出来。

为什么半监督学习有⽤？假设现在做分类任务，建⼀个猫和狗的分类器。

有⼀⼤堆猫和狗的图⽚，这些图⽚没有label。

Processing math: 100%假设只考虑有label的猫和狗图⽚，要画⼀个边界，把猫和狗训练数据集分开，可能会画⼀条如上图所⽰的红⾊竖线。

李宏毅深度学习(一)：深度学习模型的基本结构

李宏毅深度学习（⼀）：深度学习模型的基本结构李宏毅深度学习(⼀):深度学习模型的基本结构转⾃简书的⼀位⼤神博主：下⾯开始正题吧！1、全连接神经⽹络(Fully Connected Structure)最基本的神经⽹络⾮全连接神经⽹络莫属了，在图中，a是神经元的输出，l代表层数，i代表第i个神经元。

两层神经元之间两两连接，注意这⾥的w代表每条线上的权重，如果是第l-1层连接到l层，w的上标是l，下表ij代表了第l-1层的第j个神经元连接到第l层的第i个神经元，这⾥与我们的尝试似乎不太⼀样，不过并⽆⼤碍。

所以两层之间的连接矩阵可以写为如下的形式：每⼀个神经元都有⼀个偏置项：这个值记为z，即该神经元的输⼊。

如果写成矩阵形式如下图：针对输⼊z，我们经过⼀个激活函数得到输出a：常见的激活函数有：这⾥介绍三个：sigmoidSigmoid 是常⽤的⾮线性的激活函数，它的数学形式如下：特别的，如果是⾮常⼤的负数，那么输出就是0；如果是⾮常⼤的正数，输出就是1，如下图所⽰：.sigmoid 函数曾经被使⽤的很多，不过近年来，⽤它的⼈越来越少了。

主要是因为它的⼀些缺点：**Sigmoids saturate and kill gradients. **（saturate 这个词怎么翻译？饱和？）sigmoid 有⼀个⾮常致命的缺点，当输⼊⾮常⼤或者⾮常⼩的时候（saturation），这些神经元的梯度是接近于0的，从图中可以看出梯度的趋势。

所以，你需要尤其注意参数的初始值来尽量避免saturation的情况。

如果你的初始值很⼤的话，⼤部分神经元可能都会处在saturation的状态⽽把gradient kill掉，这会导致⽹络变的很难学习。

Sigmoid 的 output 不是0均值. 这是不可取的，因为这会导致后⼀层的神经元将得到上⼀层输出的⾮0均值的信号作为输⼊。

产⽣的⼀个结果就是：如果数据进⼊神经元的时候是正的(e.g. x>0 elementwise in f=wTx+b)，那么 w 计算出的梯度也会始终都是正的。

李宏毅-B站机器学习视频课件BP全

Backpropagation
Gradient Descent
Network parameters
Starting
0

Parameters
L
L w1
L w
2

L b1

L b2

w1 , w2 ,, b1 , b2 ,
b
4

2

=

′
’’
′ ′′
(Chain rule)
=
+
′ ′′
Assumed
?
?

3
4
it’s known
Backpropagation – Backward pass
Compute Τ for all activation function inputs z
Chain Rule
y g x
Case 1
z h y
x y z
Case 2
x g s
y hs
x
s
z
y
dz dz dy

dx dy dx
z k x, y
dz z dx z dy

ds x ds y ds
Backpropagation
2
Compute Τ for all parameters
Backward pass:
Compute Τ for all activation
function inputs z
Backpropagation – Forward pass

2024《机器学习》ppt课件完整版

《机器学习》ppt课件完整版•引言•机器学习基础知识•监督学习算法目录•无监督学习算法•深度学习基础•强化学习与迁移学习•机器学习实践与应用引言机器学习的定义与目标定义目标机器学习的目标是让计算机系统能够自动地学习和改进，而无需进行明确的编程。

这包括识别模式、预测趋势以及做出决策等任务。

早期符号学习01统计学习阶段02深度学习崛起0301020304计算机视觉自然语言处理推荐系统金融风控机器学习基础知识包括结构化数据（如表格数据）和非结构化数据（如文本、图像、音频等）。

数据类型特征工程特征选择方法特征提取技术包括特征选择、特征提取和特征构造等，旨在从原始数据中提取出有意义的信息，提高模型的性能。

包括过滤式、包装式和嵌入式等，用于选择对模型训练最有帮助的特征。

如主成分分析（PCA ）、线性判别分析（LDA ）等，用于降低数据维度，减少计算复杂度。

数据类型与特征工程损失函数与优化算法损失函数优化算法梯度下降变种学习率调整策略模型评估与选择评估指标评估方法模型选择超参数调优过拟合模型在训练集上表现很好，但在测试集上表现较差，泛化能力不足。

欠拟合模型在训练集和测试集上表现都不佳，未能充分学习数据特征。

防止过拟合的方法包括增加数据量、使用正则化项、降低模型复杂度等。

解决欠拟合的方法包括增加特征数量、使用更复杂的模型、调整超参数等。

机器学习中的过拟合与欠拟合监督学习算法线性回归与逻辑回归线性回归逻辑回归正则化二分类问题核技巧软间隔与正则化030201支持向量机（SVM ）决策树与随机森林剪枝决策树特征重要性随机森林一种集成学习方法，通过构建多棵决策树并结合它们的输出来提高模型的泛化性能。

Bagging通过自助采样法（bootstrap sampling）生成多个数据集，然后对每个数据集训练一个基学习器，最后将所有基学习器的输出结合起来。

Boosting一种迭代式的集成学习方法，每一轮训练都更加关注前一轮被错误分类的样本，通过加权调整样本权重来训练新的基学习器。

机器学习入门ppt课件

朴素贝叶斯算法
朴素贝叶斯分类器：假定模型的的各个特征变量都是概率独立的，根据训练数据和分类标记的的联合分布概率来判定新数据的分类和回归值。优点：对于在小数据集上有显著特征的相关对象，朴素贝叶斯方法可对其进行快速分类场景举例：情感分析、消费者分类
机器学习应用的场景
1. 风控征信系统2. 客户关系与精准营销3. 推荐系统4. 自动驾驶5. 辅助医疗6. 人脸识别7. 语音识别8. 图像识别9. 机器翻译量化交易智能客服商业智能BI
机器学习的通用步骤
选择数据：将你的数据分成三组：训练数据、验证数据和测试数据 (训练效果，验证效果，泛化效果)
数据建模：使用训练数据来构建使用相关特征的模型 (特征：对分类或者回归结果有影响的数据属性，例如，表的字段) 特征工程。
训练模型：使用你的特征数据接入你的算法模型，来确定算法模型的类型，参数等。
测试模型：使用你的测试数据检查被训练并验证的模型的表现 (模型的评价标准准确率，精确率，召回率等)
使用模型：使用完全训练好的模型在新数据上做预测
调优模型：使用更多数据、不同的特征或调整过的参数来提升算法的性能表现
机器学习的位置
传统编程：软件工程师编写程序来解决问题。首先存在一些数据→为了解决一个问题，软件工程师编写一个流程来告诉机器应该怎样做→计算机遵照这一流程执行，然后得出结果统计学：分析并比较变量之间的关系
机器学习：数据科学家使用训练数据集来教计算机应该怎么做，然后系统执行该任务。该计算可学习识别数据中的关系、趋势和模式
智能应用：智能应用使用人工智能所得到的结果，如图是一个精准农业的应用案例示意，该应用基于无人机所收集到的数据
机器学习的分类
1、监督式学习工作机制：用有正确答案的数据来训练算法进行机器学习。代表算法：回归、决策树、随机森林、K – 近邻算法、逻辑回归，支持向量机等。2、非监督式学习工作机制：训练数据没有标签或者答案，目的是找出数据内部的关联和模式，趋势。代表算法：关联算法和 K – 均值算法。3、强化学习工作机制：给予算法一个不断试错，并具有奖励机制的场景，最终使算法找到最佳路径或者策略。代表算法：马尔可夫决策过程，AlphaGo+Zero, 蒙特卡洛算法4. 半监督学习工作机制：训练数据一部分数据为生成数据，一部分数据为监督数据，算法分为生成器和判定器两部分，生成器的目标是使判定器接受自己的数据，判别器是为了最大可能的区分生成数据和监督数据。通过不断的训练使两者都达到最佳性能。代表算法： GANs(生成式对抗网络算法)

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。