An MIT Ace Explains the System of Mathematics (with partial English translations and notes added)


MIT Newey Course Lecture Notes


GENERALIZED METHOD OF MOMENTS
Whitney K. Newey, MIT, October 2007

THE GMM ESTIMATOR: The idea is to choose estimates of the parameters by setting sample moments to be close to population counterparts. To describe the underlying moment model and the GMM estimator, let β denote a p × 1 parameter vector and w_i a data observation with i = 1, ..., n, where n is the sample size. Let g_i(β) = g(w_i, β) be an m × 1 vector of functions of the data and parameters. The GMM estimator is based on a model where, for the true parameter value β0, the moment conditions

E[g_i(β0)] = 0

are satisfied. The estimator is formed by choosing β so that the sample average of g_i(β) is close to its zero population value. Let

ĝ(β) ≡ (1/n) Σ_{i=1}^n g_i(β)

denote the sample average of g_i(β). Let Â denote an m × m positive semi-definite matrix. The GMM estimator is given by

β̂ = argmin_β ĝ(β)'Â ĝ(β).

That is, β̂ is the parameter vector that minimizes the quadratic form ĝ(β)'Â ĝ(β). The GMM estimator chooses β̂ so that the sample average ĝ(β̂) is close to zero. To see this, let ||g||_Â = sqrt(g'Âg), which is a well-defined norm as long as Â is positive definite. Then, since taking the square root is a strictly monotonic transformation, and since the minimand of a function does not change after it is transformed, we also have

β̂ = argmin_β ||ĝ(β) − 0||_Â.

Thus, in the norm corresponding to Â, the estimator β̂ is chosen so that the distance between ĝ(β) and 0 is as small as possible. As we discuss further below, when m = p, so there are the same number of parameters as moment functions, β̂ will be invariant to Â asymptotically. When m > p the choice of Â will affect β̂.

The acronym GMM is an abbreviation for "generalized method of moments," referring to GMM being a generalization of the classical method of moments. The method of moments is based on knowing the form of up to p moments of a variable y as functions of the parameters, i.e. on

E[y^j] = h_j(β0), (1 ≤ j ≤ p).

The method of moments estimator β̂ of β0 is obtained by replacing the population moments by sample moments and solving for β̂, i.e. by solving

(1/n) Σ_{i=1}^n (y_i)^j = h_j(β̂), (1 ≤ j ≤ p).

Alternatively, for g_i(β) = (y_i − h_1(β), ..., y_i^p − h_p(β))', the method of moments solves ĝ(β̂) = 0. This also means that β̂ minimizes ĝ(β)'Â ĝ(β) for any Â, so that it is a GMM estimator. GMM is more general in allowing moment functions of a different form than y_i^j − h_j(β) and in allowing for more moment functions than parameters.

One important setting where GMM applies is instrumental variables (IV) estimation.
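The following sketch is not part of Newey's notes; it assumes Python with NumPy and a simulated normal sample chosen purely for illustration. It carries out the classical method-of-moments recipe just described, matching the first two sample moments to E[y] = μ and E[y²] = σ² + μ² (so m = p = 2 and the weighting matrix is irrelevant). The IV setting announced in the last sentence is taken up next in the notes.

import numpy as np

# Illustrative method of moments for beta = (mu, sigma2),
# using the population moments E[y] = mu and E[y^2] = sigma2 + mu^2.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=5_000)   # simulated sample (hypothetical DGP)

m1 = y.mean()          # sample counterpart of E[y]
m2 = (y ** 2).mean()   # sample counterpart of E[y^2]

# Solve the two moment equations m1 = mu, m2 = sigma2 + mu^2 for (mu, sigma2).
mu_hat = m1
sigma2_hat = m2 - m1 ** 2

# Equivalently, g_hat(beta) = (m1 - mu, m2 - sigma2 - mu^2) is set to zero,
# so this is a GMM estimator with as many moments as parameters.
print(mu_hat, sigma2_hat)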
Here the model isy i=X i0β0+εi,E[Z iεi]=0,where Z i is an m×1vector of instrumental variables and X i a p×1vector of right-hand side variables.The condition E[Z iεi]=0is often called a population”orthogonality condition”or”moment condition.”Orthogonality”refers to the elements of Z i andεi being orthogonal in the expectation sense.The moment condition refers to the fact that the product of Z i and y i−X i0βhas expectation zero at the true parameter.This moment condition motivates a GMM estimator where the moment functions are the vector ofproducts of instrumental variables and residuals,as ing i(β)=Z i(y i−X i0β).The GMM estimator can then be obtained by minimizingˆg(β)0Aˆgˆ(β).Because the moment function is linear in parameters there is an explicit,closed form for the estimator.To describe it let Z=[Z1,...,Z n]0,X=[X1,...,X n]0,and y=(y1,...,y n)0.In this example the sample moments are given bynXgˆ(β)=Z i(y i−X i0β)/n=Z0(y−Xβ)/n.i=1Thefirst-order conditions for minimization ofˆg(β)0Aˆgˆ(β)can b e w ritten as0=X0AZ0β)=X0Zˆ0y−X0AZ0Zˆ(y−XˆAZ ZˆXβ.ˆThese assuming that X0ZˆAZ0X is nonsingular,this equation can be solved to obtainˆAZ X)−1X0Zˆβ=(X0Zˆ0AZ0y.This is sometimes referred to as a generalized IV estimator.It generalizes the usual two stage least squares estimator,where Aˆ=(Z0Z)−1.Another example is provided by the intertemporal CAPM.Let c i be consumption at time i,R i is asset return between i and i+1,α0is time discount factor,u(c,γ0) utility function,Z i observations on variables available at time i.First-order conditions for utility maximization imply that moment restrictions satisfied for·g i(β)=Z i{R i·αu c(c i+1,γ)/u c(c i,γ)−1}.Here GMM is nonlinear IV;residual is term in brackets.No autocorrelation because of one-step ahead decisions(c i+1and R i known at time i+1).Empirical Example:Hansen and Singleton(1982,Econometrica),u(c,γ)=cγ/γ(constant relative risk aversion), c i monthly,seasonally adjusted nondurables(or plus services),R i from stock returns. Instrumental variables are1,2,4,6lags of c i+1and R i.Findγnot significantly differentthan one, marginal rejection from overidenti fication test. Stock and Wright (2001) find weak identi fication.Another example is dynamic panel data. It is a simple model that is important starting point for microeconomic (e.g. firm investment) and macroeconomic (e.g. cross-country growth) applications isE ∗(y it |y i,t −1,y i,t −2,...,y i 0,αi )= β0y i,t −1 + αi ,where αi is unobserved individual e ffect and E ∗() denotes a population regression. Let ·ηit = y it − E ∗(y it |y i,t −1,...,y i 0,αi ). By orthogonality of residuals and regressors,E [y i,t −j ηit ]=0, (1 ≤ j ≤ t,t =1,...,T ),E [αi ηit ]=0, (t =1,...,T ).Let ∆ denote the first di fference, i.e. ∆y it = y it −y i,t −1.Note that ∆y it = β0∆y i,t −1+∆ηit . Then, by orthogonality of lagged y with current η we haveE [y i,t −j (∆y it − β0∆y i,t −1)] = 0, (2 ≤ j ≤ t,t =1,...,T ).These are instrumental variable type moment conditions. Levels of y it lagged at least two period can be used as instruments for the di fferences. Note that there are di fferent instruments for di fferent residuals. There are also additional moment conditions that come from orthogonality of αi and ηit .They areE [(y iT − β0y i,T −1)(∆y it − β0∆y i,t −1)] = 0, (t =2,...,T − 1).These are nonlinear. Both sets of moment conditions can be combined. To form big moment vector by ”stacking”. Let⎞⎛ y i 0 ⎜⎜⎝ ⎟⎟⎠ i t (β)= . . . y i,t −2(∆y it − β∆y i,t −1), (t =2,...,T ),g ⎛ ⎞g i α(β)= ⎜⎜⎝ ∆y i 2 − β∆y i 1 . . . 
∆y i,T −1 − β∆y i,T −2 ⎟⎟⎠(y iT − βy i,T −1).These moment functions can be combined asg i (β)=(g i 2(β)0,...,g i T (β)0,g i α(β)0)0.Here there are T (T −1)/2+(T −2) moment restrictions. Ahn and Schmidt (1995, Journalof Econometrics) show that the addition of the nonlinear moment condition g iα(β)to the IV ones often gives substantial asymptotic e fficiency improvements.Hahn, Hausman, Kuersteiner approach: Long di fferences⎞⎛ g i (β)= ⎜⎜⎜⎜⎝ y i 0 y i 2 − βy i 1 . . .y i,T −1 − βy i,T −2⎟⎟⎟⎟⎠[y iT − y i 1 − β(y i,T −1 − y i 0)] Has better small sample properties by getting most of the information with fewer moment conditions.IDENTIFICATION: Identi fication is essential for understanding any estimator. Unless parameters are identi fied, no consistent estimator will exist. Here, since GMM estimators are based on moment conditions, we focus on identi fication based on the moment functions. The parameter value β0 will be identi fied if there is a unique solution tog ¯(β)=0,g ¯(β)= E [g i (β)].If there is more than one solution to these moment conditions then the parameter is not identi fied from the moment conditions.One important necessary order condition for identi fication is that m ≥ p .Whenm < p ,i .e. there are fewer equations to solve than parameters. there will typically be multiple solutions to the moment conditions, so that β0 is not identi fied from the moment conditions. In the instrumental variables case, this is the well known order condition that there be more instrumental variables than right hand side variables.When the moments are linear in the parameters then there is a simple rank condition that is necessary and su fficient for identi fication. Suppose that g i (β)is linear in β and let G i = ∂g i (β)/∂β (which does not depend on β by linearity in β). Note that by linearityg i(β)=g i(β0)+G i(β−β0).The moment condition is0=¯g(β)=G(β−β0),G=E[G i]The solution to this moment condtion occurs only atβ0if and only ifrank(G)=p.If rank(G)=p then the only solution to this equation isβ−β0=0,i.e.β=β0.If rank(G)<p t hen there is c=0s uch t hat G c=0,so that forβ=β0+c=β0,g¯(β)=G c=0.For IV G=−E[Z i X i0]so t hat r ank(G)=p is one form of the usual rank condition for identification in the linear IV seeting,that the expected cross-product matrix of instrumental variables and right-hand side variables have rank equal to the number of right-hand side variables.In the general nonlinear case it is difficult to specify conditions for uniqueness of the solution to¯g(β)=0.Global conditions for unique solutions to nonlinear equations are not well developed,although there has been some progress recently.Conditions for local identification are more straightforward.In general let G=E[∂g i(β0)/∂β]. Then,assuming¯g(β)is continuously differentiable in a neighborhood ofβ0the condition rank(G)=p will be sufficient for local identification.That is,rank(G)=p implies that there exists a neighborhood ofβ0such thatβ0is the unique solution to¯g(β)for a llβin that neighborhood.Exact identification refers the case where there are exactly as many moment conditions as parameters,i.e.m=p.For IV there would be exactly as many instruments as right-hand side variables.Here the GMM estimator will satisfyˆg(βˆ)=0asymptotically. 
When there is the same number of equations as unknowns,one can generally solve the equations,so a solution toˆg(β)=0will exist asymptotically.The proof of this statement (due to McFadden)makes use of thefirst-order conditions for GMM,which areh i0=∂gˆ(βˆ)/∂β0Aˆgˆ(βˆ).The regularity conditions will require that both∂gˆ(βˆ)/∂βand Aˆare nonsingular with probability approaching one(w.p.a.1),so thefirst-order conditions implyˆg(βˆ)=0 w.p.a.1.This will be true whatever the weight matrix,so thatβˆwill be invariant to the form of A.ˆOveridentification refers to the case where there are more moment conditions than parameters,i.e.m>p.For IV this will mean more instruments than right-hand side variables.Here a solution toˆg(β)=0generally will not exist,because this would solve more equations than parameters.Also,it can be shown that√ng(βˆ)has a nondegenerateasymptotically normal distribution,so that the probabability ofˆg(βˆ)=0g oes t o z ero. When m>p all that can be done is set sample moments close to zero.Here the choice of Aˆmatters for the estimator,affecting its limiting distribution.TWO STEP OPTIMAL GMM ESTIMATOR:When m>p the GMM esti­mator will depend on the choice of weighting matrix Aˆ.An important question is how to choose Aˆoptimally,to minimize the asymptotic variance of the GMM estimator.It turnsˆˆpout that an optimal choice of A is any such that A−→Ω−1,whereΩis the asymptoticnvariance of√ngˆ(β0)=P i=1g i(β0)/√n Choosing Aˆ=Ωˆ−1to be the inverse of a con­sistent estimatorΩˆofΩwill minimize the asymptotic variance of the GMM estimator. This leads to a two-step optimal GMM estimator,where thefirst step is construction of Ωˆand the second step is GMM with Aˆ=Ωˆ−1.The optimal Aˆdepends on the form ofΩ.In general a central limit theorem will lead toΩ=lim E[ngˆ(β0)ˆg(β0)0],n−→∞when the limit exists.Throughout these notes we will focus on the stationary case where E[g i(β0)g i+ (β0)0]does not depend on i.We begin by assuming that E[g i(β0)g i+ (β0)0]=0 for all positive integers .ThenΩ=E[g i(β0)g i(β0)0].In this caseΩcan be estimated by replacing the expectation by a sample average andβ0by an estimator β˜, leading to n Ωˆ=1 X g i (β˜)g i (β˜)0. n i =1The β˜could be obtained by GMM estimator by using a choice of A ˆthat does not depend on parameter estimates. For example, for IV β˜could be the 2SLS estimator where Aˆ=(Z 0Z )−1 . In the IV setting this Ωˆhas a heteroskedasticity consistent form. Note that for ε˜i = y i − X i 0β˜, n 1 XΩˆ= Z i Z i 0ε˜2 i . n i =1 The optimal two step GMM (or generalized IV) estimator is thenβˆ=(X 0Z Ωˆ−1Z 0X )−1X 0Z Ωˆ−1Z 0y. Because the 2SLS corresponds to a non optimal weighting matrix this estimator will generally have smaller asymptotic variance than 2SLS (when m >p ). However, whenhomoskedasticity prevails, Ωˆ=ˆσε 2Z 0Z/n is a consistent estimator of Ω, and the 2SLSestimator will be optimal. The 2SLS estimator appears to have better small sample properties also, as shown by a number of Monte Carlo studies, which may occur becauseusing a heteroskedasticity consistent Ωˆadds noise to the estimator. When moment conditions are correlated across observations, an autocorrelation con­sistent variance estimator estmator can be used, as inX X Ωˆ= Λˆ0 + L w L (Λˆ + Λˆ0 ), Λˆ = n − g i (β˜)g i + (β˜)0/n. =1 i =1 where L is the number of lags that are included and the weights w L are used to ensure Ωˆis positive semi-de finite. A common example is Bartlett weights w L =1 − /(L +1), as in Newey and West (1987). 
It is beyond the scope of these notes to suggest choices of L .ˆA consistent estimator Vof the asymptotic variance of √n (βˆ− β0) is needed for asymptotic inference. For the optimal Aˆ= Ωˆ−1 a consistent estimator is given by Vˆ=(G ˆ0Ωˆ−1G ˆ)−1 ,G ˆ= ∂g ˆ(βˆ)/∂β.One could also update the Ωˆby using the two step optimal GMM estimator in place of β˜in its computation. The value of this updating is not clear. One could also update the Aˆin the GMM estimator and calculate a new GMM estimator based on the update. Thisiteration on Ωˆappears to not improve the properties of the GMM estimator very much. A related idea that is important is to simultaneously minimize over β in Ωˆand in the moment functions. This is called the continuously updated GMM estimator (CUE). Forn example, when there is no autocorrelation, for Ωˆ(β)= P i =1 g i (β)g i (β)0/n the CUE isβˆ=arg m in g ˆ(β)0Ωˆ(β)−1g ˆ(β). βThe asymptotic distribution of this estimator is the same as the two step optimal GMM estimator but it tends to have smaller bias in the IV setting, as will be discussed below. It is generally harder to compute than the two-step optimal GMM.ADDING MOMENT CONDITIONS: The optimality of the two step GMM estimator has interesting implications. One simple but useful implication is that adding moment conditions will also decrease (or at least not decrease) the asymptotic variance of the optimal GMM estimator. This occurs because the optimal weighting matrix for fewer moment conditions is not optimal for all the moment conditions. To explain further,suppose that g i (β)=(g i 1(β)0,g i 2(β)0)0. Then the optimal GMM estimator for just the firstset of moment conditions g i 1(β)is usesÃ!A ˆ= (Ωˆ1)−1 0 ,00 n 1where Ωˆ1 is a consistent estimator of the asymptotic variance of P i =1 g i (β0)/√n. This A ˆis not generally optimal for the entire moment function vector g i (β).For example, consider the linear regression modelE [y i |X i ]= X i 0β0. The least squares estimator is a GMM estimator with moment functions g i 1(β)= Xi (y i − X i 0β). The conditional moment restriction implies that E [εi |X i ]= 0 f or εi = y i − X i 0β0.We can add to these moment conditions by using nonlinear functions of X i as additional”instrumental variables.” Let g 2(β)= a (X i )(y i − X 0β)for s ome (m − p ) × 1vector ofi i functions of X i . Then the optimal two-step estimator based onÃ! g i (β)= a (X X ii )(y i − X i 0β)will be more e fficient than least squares when there is heteroskedasticity. This estimator has the form of the generalized IV estimator described above where Z i =(X i 0,a (X i )0)0. It will provide no e fficiency gain when homoskedasticity prevails. Also, the asymptoticvariance estimator Vˆ=(G ˆ0Ωˆ−1G ˆ)−1 tends to provide a poor approximation to the vari­ance of βˆ. See Cragg (1982, Econometrica). Interesting questions here are what and how many functions to include in a (X ) and how to improve the variance estimator. Some of these issues will be further discussed below.Another example is provided by missing data. Consider again the linear regression model, but now just assume that E [X i εi ] = 0, i.e. X i 0β0 may not be the conditional mean. Suppose that some of the variables are sometimes missing and W i denote the variables that are always observed. Let ∆i denote a complete data indicator, equal to 1 if (y i ,X i ) are observed and equal to 0 if only W i is observed. Suppose that the data is missingcompletely at random, so that ∆i is independent of W i .Then thereare two types of moment conditions available. 
One is E [∆i X i εi ] = 0, leading to a moment function of the formg i 1(β)= ∆i X i (y i − X i 0β). GMM for this moment condition is just least squares on the complete data. The other type of moment condition is based on Cov (∆i ,a (W i )) = 0 for any vector of functions a (W ), leading to a moment function of the formg i 2(η)=(∆i − η)a (W i ).One can form a GMM estimator by combining these two moment conditions. This will generally be asymptotically more e fficient than least squares on the complete data when Y i is included in W i . Also, it turns out to be an approximately e fficient estimator in thepresence of missing data. As in the previous example, the choice of a (W )is an interesting question.Although adding moment conditions often lowers the asymptotic variance it may not improve the small sample properties of estimators. When endogeneity is present adding moment conditions generally increases bias. Also, it can raise the small sample variance. Below we discuss criteria that can be used to evaluate these tradeo ffs.One setting where adding moment conditions does not lower asymptotic e fficiency i is when those the same number of additional parameters are also added. That is, ifthe second vector of moment functions takes the form g 2(β,γ)where γ has the same2dimension as g situation is analogous to that in the linear simultaneous equations model where adding exactly identi fied equations does not improve e fficiency of IV estimates. Here addingexactly identi fiedm oment f unctions does not i mprove e fficiency of GMM. Another thing GMM can be used for is derive the variance of two step estimators.Consider a two step estimator βˆthat is formed by solving i (β,γ) then there will be no e fficiency gain for the estimator of β. This1 n X g n i =12 i (β,γˆ)=0, P 1 i n i i =1 g then ( β,ˆγˆ) is a (joint) GMM estimator for the triangular moment conditionsÃ! 1g where ˆγ is some first step estimator. If ˆγ is a GMM estimator solving(γ)/n =0 (γ)g i (β,γ)= 2 .(β,γ) i g The asymptotic variance of √n (βˆ− β0) can be calculated by applying the general GMMformula to this triangular moment condition. = 0 the asymptotic variance of βˆwill not depend on esti­i When E [∂g 2 mation of γ, i.e. (β0,γ0)/∂γ] i 2will the same as for GMM based on g i (β)= g condition for this is that2 (β,γ0). A su fficientE [g (β0,γ)] = 0i i for all γ in some neighborhood of γ0. Di fferentiating this identity with respect to γ,andassuming that di fferentiation inside the expectation is allowed, gives E [∂g 2(β0,γ0)/∂γ]=0. The interpretation of this is that if consistency of the first step estimator does not a ffect consistency of the second step estimator, the second step asymptotic variance does not need to account for the first step.ASYMPTOTIC THEORY FOR GMM: We mention precise results for the i.i.d. case and give intuition for the general case. We begin with a consistency result: If the data are i.i.d. and i) E [g i (β)] = 0 if and only if β = β0 (identi fication); ii) the GMM minimization takes place over a compact set B containing β0; iii) g i (β) iscontinuous at each β with probability one and E [sup β∈B k g i (β)k ] is finite; iv) Aˆp A → positive de finite; then βˆp β0.→ See Newey and McFadden (1994) for the proof. The idea is that, for g (β)= E [g i (β)], by the identi fication hypothesis and the continuity conditions g (β)0Ag (β) will be bounded away from zero outside any neighborhood N of β0. Then by the law of large numbersˆˆp and iv), so will ˆg (β)0A ˆg ˆ(β). 
But, ˆg (βˆ)0Ag ˆ(βˆ) ≤ g ˆ(β0)0Ag(β0) → 0from the d e finition of βˆand the law of large numbers, so βˆmust be inside N with probability approaching one. The compact parameter set is not needed if g i (β) is linear, like for IV.Next we give an asymptotic normality result:If the data are i.i.d., βˆp β0 and i) β0 is in the interior of the parameter set over→ which minimization occurs; ii) g i (β) is continuously di fferentiable on a neighborhood Np of β0 iii) E [sup β∈N k ∂g i (β)/∂βk ] is finite; iv) A ˆ→ A and G 0AG is nonsingular, forG = E [∂g i (β0)/∂β];v) Ω = E [g i (β0)g i (β0)0] exists, thend √ n (βˆ− β0) −→ N (0,V ),V =(G 0AG )−1G 0A ΩAG (G 0AG )−1 .See Newey and McFadden (1994) for the proof. Here we give a derivation of theasymptotic variance that is correct even if the data are not i.i.d.. By consistency of βˆand β0 in the interior of the parameter set, with probability approaching (w.p.a.1) the first order condition0= G ˆ0A ˆg ˆ(βˆ), is satis fied, where G ˆ= ∂g ˆ(βˆ)/∂β. Expand ˆg (βˆ)around β0 to obtain0= G ˆ0A ˆg ˆ(β0)+ Gˆ0A ˆG ¯(βˆ− β0),where G ¯= ∂g ˆ(β¯)/∂β and β¯lies on the line joining βˆand β0, and actually di ffers from row to row of G¯. Under regularity conditions like those above G ˆ0A ˆG ¯will be nonsingular w.p.a.1. Then multiplying through by √ n and solving gives³´ √ n (βˆ− β0)= − G ˆ0A ˆG ¯−1 G ˆ0A ˆ√ ng ˆ(β0).d ˆp By an appropriate central limit theorem√ ng ˆ(β0) −→ N (0, Ω). Also we have A −→³´ ˆp ¯p ˆ−1 ˆp A, G −→ G, G −→ G, so by the continuous mapping theorem, G 0A ˆG ¯G0A ˆ−→ (G 0AG )−1 G 0A. Then by the Slutzky lemma, d √ n (βˆ− β0) −→ − (G 0AG )−1 G 0AN (0, Ω)= N (0,V ).The fact that A = Ω−1 minimizes the asymptotic varince follows from the Gauss Markov Theorem. Consider a linear model.E [Y ]= G δ,V ar (Y )= Ω.The asymptotic variance of the G MM estimator w ith A = Ω−1 is (G 0Ω−1G )−1.This is also the variance of generalized least squares (GLS) in this model. Consider an estmator δˆ=(G 0AG )−1G 0AY . It is linear and unbiased and has variance V . Then by the Gauss-Markov Theorem,V − (G 0Ω−1G )−1 is p.s.d..We can also derive a condition for A to be e fficient. The Gauss-Markov theorem says that GLS is the the unique minimum variance estimator, so that A is e fficient if and only if(G 0AG )−1G 0A =(G 0Ω−1G )−1G 0Ω−1 .Transposing and multiplying givesΩAG = GB,where B is a nonsingular matrix. This is the condition for A to be optimal.CONDITIONAL MOMENT RESTRICTIONS: Often times the moment re­strictions on which GMM is based arise from conditional moment restrictions. 
Letρi(β)=ρ(w i,β)be a r×1residual vector.Suppose that there are some instruments z i such that the conditional moment restrictionsE[ρi(β0)|z i]=0are satisfied.Let F(z i)be an m×r matrix of instrumental variables that are functions of z i.Let g i(β)=F(z i)ρi(β).Then by iterated expectations,E[g i(β0)]=E[F(z i)E[ρi(β0)|zβi]]=0.Thus g i(β)satisfies the GMM moment restrictions,so that one can form a GMM esti­mator as described above.For moment functions of the form g i(β)=F(z i)ρi(β)we can think of GMM as a nonlinear instrumental variables estimator.The optimal choice of F(z)can b e d escribed as follows.Let D(z)=E[∂ρi(β0)/∂β|z i= z]andΣ(z)=E[ρi(β0)ρi(β0)0|z i=z].The optimal choice of instrumental variables F(z)isF∗(z)=D(z)0Σ(z)−1.This F∗(z)is optimal in the sense that it minimizes the asymptotic variance of a GMM estimator with moment functions g i(β)=F(z i)ρi(β)and a weighting matrix A.To show this optimality let F i=F(z i),F i∗=F∗(z i),andρi=ρi(β0).Then by iterated expectations,for a GMM estimator with moment conditions g i(β)=F(z i)ρi(β), G=E[F i∂ρi(β0)/∂β]=E[F i D(z i)]=E[F iΣ(z i)F i∗0]=E[F iρiρ0i F i∗0].Let h i=G0AF iρi and h∗i=F i∗ρi,so thatG0AG=G0AE[F iρi h∗i0]=E[h i h i∗0],G0AΩAG=E[h i h i0].Note that for F i=F i∗we have G=Ω=E[h∗i h∗i0].Then the difference of the asymptotic variance for g i(β)=F iρi(β)and s ome A and the asymptotic variance for g i(β)=F i∗ρi(β) is(G0AG)−1G0AΩAG(G0AG)−1−(E[h∗i h∗i0])−1=(E[h i h∗i0])−1{E[h i h0i]−E[h i h i∗0](E[h i∗h i∗0])−1E[h i∗h i0]}(E[h i∗h i0])−1.The matrix in brackets is the second moment matrix of the population least squares projection of h i on h∗i and is thus positive semidefinite,so the whole matrix is positive semi-definite.Some examples help explain the form of the optimal instruments.Consider the linear regression model E[y i|X i]=X i0β0and letρi(β)=y i−X i0β,εi=ρi(β0),andσi2=E[ε2i|X i]=Σ(z i).Here the instruments z i=X i.A GMM e stimator with mo­ment conditions F(z i)ρi(β)=F(X i)(y i−X0β)is the estimator described above thatiwill be asymptotically more efficient than least squares when F(X i)includes X i.Here ∂ρi(β)/∂β=−X i0,so that the optimal instruments areF i∗=−X2i.iHere the GMM estimator with the optimal instruments in the heteroskedasticity corrected generalized least squares.Another example is a homoskedastic linear structural equation.Here againρi(β)= y i−X i0βbut now z i is not X i and E[εi2|z i]=σ2is constant.Here D(z i)=−E[X i|z i]is the reduced form for the right-hand side variables.The optimal instruments in this example areF i∗=−D(z i).Here the reduced form may be linear in z i or nonlinear.For a given F(z)the GMM estimator with optimal A=Ω−1corresponds to an approximation to the optimal estimator.For simplicity we describe this interpretation for r=p=1.Note that for g i=F iρi it follows similarly to above that G=E[g i h i∗0],so thatG0Ω−1=E[h i∗g i0](E[g i g i0])−1.That is G0Ω−1are the coefficients of the population projection of h∗i on g i.Thus we can interpret thefirst order conditions for GMMnX0=Gˆ0Ωˆ−1gˆ(βˆ)=Gˆ0Ωˆ−1F iρi(β)/n,i=1can be interpreted as an estimated mean square approximation to thefirst order condi­tions for the optimal estmatornX0=F i∗ρi(β)/n.i=1(This holds for GMM in other models too).One implication of this interpretation is that if the number and variety of the elements of F increases in such a way that linear combinations of F can approximate any function arbitrarily well then the asymptotic variance for GMM with optimal A will approach the optimal asymptotic variance.To show this,recall that m is the 
dimension of F_i, and let the notation F_i^m indicate dependence on m. Suppose that for any a(z) with E[Σ(z_i)a(z_i)^2] finite there exist m × 1 vectors π_m such that as m → ∞,

E[Σ(z_i){a(z_i) − π_m'F_i^m}^2] → 0.

For example, when z_i is a scalar the nonnegative integer powers of a bounded monotonic transformation of z_i will have this property. Then it follows that for h_i^m = ρ_i F_i^{m'}Ω^{−1}G,

E[{h_i^* − h_i^m}^2] ≤ E[{h_i^* − ρ_i F_i^{m'}π_m}^2] = E[ρ_i^2{F_i^* − F_i^{m'}π_m}^2] = E[Σ(z_i){F_i^* − F_i^{m'}π_m}^2] → 0.

Since h_i^m converges in mean square to h_i^*, E[h_i^m h_i^{m'}] → E[h_i^* h_i^{*'}], and hence

(G'Ω^{−1}G)^{−1} = (G'Ω^{−1}E[g_i g_i']Ω^{−1}G)^{−1} = (E[h_i^m h_i^{m'}])^{−1} → (E[h_i^* h_i^{*'}])^{−1}.

Because the asymptotic variance is minimized at h_i^*, the asymptotic variance will approach the lower bound more rapidly as m grows than h_i^m approaches h_i^*. In practice this may mean that it is possible to obtain quite low asymptotic variance with relatively few approximating functions in F_i^m. An important issue for practice is the choice of m. There has been some progress on this topic in the last few years, but it is beyond the scope of these notes.

BIAS IN GMM: The basic idea of this discussion is to consider the expectation of the GMM objective function. This analysis is similar to that in Han and Phillips (2005).
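To make the two-step optimal GMM estimator for the linear IV model described in these notes concrete, here is a small sketch that is not part of the original notes. It assumes Python with NumPy; the data-generating process and the helper giv() are invented for illustration. Step one is 2SLS, i.e. generalized IV with Â = (Z'Z)⁻¹; step two uses the heteroskedasticity-consistent Ω̂ = (1/n) Σ Z_i Z_i' ε̃_i² and Â = Ω̂⁻¹.

import numpy as np

rng = np.random.default_rng(1)
n = 2_000
z = rng.normal(size=(n, 3))                   # m = 3 instruments
u = rng.normal(size=n)                        # unobserved confounder
x = z @ np.array([1.0, 0.5, -0.5]) + 0.8 * u + rng.normal(size=n)
X = x.reshape(-1, 1)                          # p = 1 right-hand-side variable
eps = (1 + 0.5 * np.abs(z[:, 0])) * rng.normal(size=n) + 0.8 * u  # heteroskedastic, endogenous
y = X[:, 0] * 2.0 + eps                       # true beta0 = 2

def giv(X, Z, y, A):
    """Generalized IV estimator: beta = (X'Z A Z'X)^{-1} X'Z A Z'y."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ A @ XZ.T, XZ @ A @ Z.T @ y)

# Step 1: 2SLS, i.e. A = (Z'Z)^{-1}.
beta_2sls = giv(X, z, y, np.linalg.inv(z.T @ z))

# Step 2: heteroskedasticity-consistent Omega_hat, then A = Omega_hat^{-1}.
e = y - X @ beta_2sls
Omega = (z * e[:, None] ** 2).T @ z / n
beta_gmm = giv(X, z, y, np.linalg.inv(Omega))

print(beta_2sls, beta_gmm)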

MIT Discrete Mathematics Notes


Discrete mathematics is an important branch of mathematics that studies discrete objects and discrete structures, such as sets, graphs, and logic.

MIT is a world-renowned institution, and its discrete mathematics course has given many students a memorable learning experience.

This article summarizes the content of the MIT discrete mathematics course in note form.

1. Set Theory. Set theory is the foundation of discrete mathematics. In MIT's course it comes first and mainly covers the definition of sets and set operations, cardinality, infinite sets, and basic logic. Set theory is widely used not only within mathematics but also plays an important role in fields such as computer science and artificial intelligence.

2. Graph Theory. Graph theory is one of the most important branches of discrete mathematics. In MIT's course, the graph theory portion covers basic concepts of graphs, graph representations, connectivity, shortest-path algorithms, and minimum spanning tree algorithms. Graph theory has wide applications in computer science, social network analysis, circuit design, and other areas.

3. Logic and Proof. Logic is one of the core topics of discrete mathematics. In MIT's course, the logic and proof portion covers propositional logic, predicate logic, propositional and predicate equivalences, and proof techniques. By studying logic and proof, students not only sharpen the rigor of their thinking but also develop their problem-solving ability.

4. Number Theory. Number theory is an important branch of discrete mathematics that studies the properties and structure of the integers. In MIT's course, the number theory portion mainly covers divisibility, prime numbers, and modular arithmetic. Number theory has wide applications in cryptography, coding theory, and related fields.

5. Relations and Functions. Relations and functions are important concepts in discrete mathematics. In MIT's course, this portion mainly covers properties of relations, properties of functions, inverse relations, and function composition. Relations and functions matter not only in mathematics but also in database design, computer networking, and other areas.

6. Permutations and Combinations. Permutations and combinations are classic topics in discrete mathematics. In MIT's course, this portion mainly covers permutations, combinations, and the binomial theorem, which have important applications in probability theory and statistics.

Summary: By studying MIT's discrete mathematics course, we not only master the basic concepts and key results of discrete mathematics but also develop rigorous logical thinking and problem-solving skills. Discrete mathematics plays an important role in computer science, artificial intelligence, cryptography, and many other fields.

MIT Foundational Mathematics Lecture Notes (Computer Science), Lecture 13


Suppose for contradiction that p_1, p_2, ..., p_n are all of the primes, and let m = p_1 p_2 ⋯ p_n + 1. Some prime p_i must divide m, which gives

p_i | m  ⟹  p_i | p_1 p_2 ⋯ p_n + 1  ⟹  p_i | 1.

The first implication follows by substituting the definition of m. The second follows because p_i divides the product p_1 p_2 ⋯ p_n, and so must divide 1 in order to divide the sum. But no prime divides 1, so we have a contradiction. Therefore, there are an infinite number of primes. Proving that the set of primes is infinite is relatively easy, but the next example shows that determining whether a set is finite or infinite can be tricky.
Definition. An infinite set S is said to be countably infinite iff there exists a bijection f : N → S.

A set is countable if it is finite or countably infinite. Many familiar sets are countable: N, the even numbers, the primes, the integers modulo a constant k, etc. The formal definition of a countable set is equivalent to the notion of a listable set, since listing the elements of a set gives a bijection with N and vice versa. If we can list the elements of S as s_0, s_1, s_2, ..., then we can construct a bijection f : N → S defined by f(i) = s_i. In the reverse direction, if we have such a bijection, then f(0), f(1), f(2), ... is a list of all the elements of S. Sometimes we can list the elements of a set S, but the corresponding bijection f : N → S is very hard to compute. For example, in principle we could list the elements in any subset of N. However, most of these subsets are bizarre and difficult to describe; writing a program to compute f(i) might be difficult or even impossible. The good news is that to prove that S is countable, we must only prove that a bijection f : N → S exists; we do not have to write f explicitly or to give an algorithm to compute f(i).
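As a small added illustration (not from the original notes; it assumes Python), the integers are countably infinite because they can be listed as 0, 1, −1, 2, −2, ..., and the listing implicitly defines a bijection f : N → Z.

from itertools import count, islice

def f(i: int) -> int:
    """Bijection from N = {0, 1, 2, ...} to the integers: 0, 1, -1, 2, -2, ..."""
    return (i // 2 + 1) if i % 2 == 1 else -(i // 2)

# Listing the first few values shows each integer appearing exactly once.
print(list(islice((f(i) for i in count()), 9)))  # [0, 1, -1, 2, -2, 3, -3, 4, -4]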

MIT Foundational Mathematics Lecture Notes (Computer Science), Lecture 24


Massachusetts Institute of Technology
6.042J/18.062J: Mathematics for Computer Science — Professor Tom Leighton
Lecture 24, 2 Dec 97 — Lecture Notes

1 Deviation from the Mean

These two examples show that Markov's Theorem gives weak results for well-behaved random variables; however, the theorem is actually tight for some nasty examples. Suppose we flip 100 fair coins and use Markov's Theorem to bound the probability of getting all heads:

Pr(#heads ≥ 100) ≤ Ex(#heads)/100 = 50/100 = 1/2.

If the coins are mutually independent, then the actual probability of getting all heads is a minuscule 1 in 2^100. In this case, Markov's Theorem looks very weak. However, in applying Markov's Theorem, we made no independence assumptions. In fact, if all the coins are glued together, then the probability of throwing all heads is exactly 1/2. In this nasty case, Markov's Theorem is actually tight!
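A quick numerical check of the two cases above (an added sketch, not from the original notes, assuming Python): Markov's bound of 1/2 versus the exact probability of 100 heads with independent flips, and versus the glued-coins case where all flips agree.

from fractions import Fraction

n = 100
markov_bound = Fraction(n, 2) / n          # Ex(#heads)/100 = 50/100 = 1/2

p_independent = Fraction(1, 2) ** n        # all heads with independent fair coins: 1/2^100
p_glued = Fraction(1, 2)                   # coins glued together: all heads or all tails

print(markov_bound, float(p_independent), p_glued)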

MIT Foundational Mathematics Lecture Notes (Computer Science), Lecture 5


Definition. A tree is a connected n-node graph with exactly n − 1 edges.

The vertices in a tree can be classified into two categories. Vertices of degree at most one are called leaves, and vertices of degree greater than one are called internal nodes. Trees are usually drawn as in Figure 1, with the leaves on the bottom. Keep this convention in mind; otherwise, phrases like "all the vertices below..." will be confusing. (The English mathematician Littlewood once remarked that he found such directional terms particularly bothersome, since he habitually read mathematics reclined on his back!) Trees arise in many problems. For example, the file structure in a computer system can be naturally represented by a tree. In this case, each internal node corresponds to a directory, and each leaf corresponds to a file. If one directory contains another, then there is an edge between the associated internal nodes. If a directory contains a file, then there is an edge between the internal node and a leaf. There are several ways to describe trees that are equivalent to the preceding definition.
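A tiny sketch of the file-system analogy (added here, not part of the original notes, assuming Python and a made-up directory layout): directories and files as nodes, containment as edges, and a check that this connected tree on n nodes has exactly n − 1 edges.

# Hypothetical directory tree: each entry maps a directory to its contents.
tree = {
    "/": ["/home", "/etc"],
    "/home": ["/home/notes.txt", "/home/code"],
    "/home/code": ["/home/code/main.py"],
    "/etc": ["/etc/hosts"],
}

nodes = set(tree) | {child for children in tree.values() for child in children}
edges = [(parent, child) for parent, children in tree.items() for child in children]

# A tree on n nodes has exactly n - 1 edges.
assert len(edges) == len(nodes) - 1
print(len(nodes), len(edges))  # 7 nodes, 6 edges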

MIT Foundational Mathematics Lecture Notes (Computer Science), Lecture 15

1 The Division Rule

We will state the Division Rule twice, once informally and then again with more precise notation.

Theorem 1.1 (Division Rule) If B is a finite set and f : A → B maps precisely k items of A to every item of B, then A has k times as many items as B.

For example, suppose A is a set of students, B is a set of recitations, and f defines the assignment of students to recitations. By definition, however, f^{-1}(b) can be a set, not just a single value. For example, if f maps no element of A to some element b ∈ B, then f^{-1}(b) is actually the empty set. In the special case of this example,

f^{-1}([x_1 x_2 ... x_n]) = { (x_1, x_2, x_3, ..., x_n),
                              (x_n, x_1, x_2, ..., x_{n-1}),
                              (x_{n-1}, x_n, x_1, ..., x_{n-2}),
                              ...,
                              (x_2, x_3, x_4, ..., x_1) }        (x_1 appears in n different places)

By the Division Rule, |A| = n|B|. This gives:

A Stroll Through Mathematics: English Narration


Embarking on a journey through the realm of mathematics is like stepping into a world of infinite possibilities. It's a landscape where numbers dance and equations whisper secrets of the universe.

Imagine a path lined with ancient Pythagorean theorems, each step revealing the harmony in the lengths of triangles. As we wander further, we encounter the Fibonacci sequence, nature's own pattern, unfolding in the spirals of sunflowers and the spirals of galaxies.

The air is filled with the fragrance of algebra, where variables are the keys to unlocking the doors of countless equations. Each solution is a treasure, a piece of the puzzle that makes up the grand design of the cosmos.

Further along, we come upon the majestic geometry, where shapes and solids stand as monuments to the beauty of symmetry and proportion. Here, the Platonic solids teach us about balance and perfection in form.

As we delve deeper, calculus awaits, a realm of motion and change, where the slopes of lines tell stories of rates and the areas under curves reveal the hidden depths of integration.

The journey is not without its challenges, for the path is often steep and winding, filled with complex numbers and abstract concepts. But with perseverance, each summit reached offers a breathtaking view of the mathematical vista.

In the end, the journey through mathematics is not just about reaching the destination, but about the discoveries made along the way. It's a voyage of the mind, a quest for understanding, and a celebration of the elegance inherent in every equation and theorem.

MIT Foundational Mathematics Lecture Notes (Computer Science), Lecture 16

For example, let S be the set {A, B, C, D, E}, with elements ordered alphabetically. Let R be the 7-combination with repetition {A, B, B, B, D, E, E}. The stars-and-bars string corresponding to R is shown below.
$\binom{n+r-1}{r}$.

In the example above, we found six ways to choose two elements from the set S = {A, B, C} with repetition allowed. Sure enough, the theorem says that the number of 2-combinations of a 3-element set is $\binom{3+2-1}{2} = 6$. Each such selection corresponds to a string of stars and bars; the number of these strings is the number of ordinary r-combinations of a set with n + r − 1 elements, which is $\binom{n+r-1}{r}$.

1.2 Triple-Scoop Ice Cream Cones

Baskin-Robbins is an ice cream store that has 31 different flavors. How many different triple-scoop ice cream cones are possible at Baskin-Robbins? Two ice cream cones are considered the same if one can be obtained from the other by reordering the scoops. Of course, we are permitted to have two or even three scoops of the same flavor.
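A quick check of the triple-scoop count using the formula above (an added sketch, not from the original notes, assuming Python): with n = 31 flavors and r = 3 scoops, the answer is C(31 + 3 − 1, 3) = C(33, 3), and brute-force enumeration of multisets agrees.

from itertools import combinations_with_replacement
from math import comb

n_flavors, scoops = 31, 3

by_formula = comb(n_flavors + scoops - 1, scoops)   # C(33, 3) = 5456
by_enumeration = sum(1 for _ in combinations_with_replacement(range(n_flavors), scoops))

assert by_formula == by_enumeration == 5456
print(by_formula)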

Over the past year I have been wandering in the ocean of mathematics. My research has not advanced much, but my experience of the mathematical world has grown somewhat.

Why go deep into the world of mathematics? As a computer science student, I have no ambition of becoming a mathematician. My purpose in studying mathematics is to climb onto the shoulders of giants, hoping that from a greater height I can see my own research more deeply and broadly. To be honest, when I first arrived at this school I did not expect that I would take such a deep journey into mathematics.

The topic my advisor originally wanted me to work on was building a unified model of appearance and motion. In today's flourishing world of Computer Vision there is nothing special about such a topic. In fact, frameworks that use various Graphical Models to join all sorts of things together are not rare in recent papers. I do not deny that the now widely popular Graphical Models are a powerful tool for modeling complex phenomena, but I believe they are not a panacea, and they cannot replace deep study of the problem being investigated. If statistical learning cured everything, many "downstream" disciplines would have no reason to exist. Indeed, at the beginning I, like many people in Vision, thought about building yet another Graphical Model; my advisor pointed out that doing so would only repeat some standard procedures and would not have much value.

After a long period of back and forth, another path slowly took shape. We believe an image is formed by some spatial distribution of a large number of "atoms," and the motion of these atoms produces the dynamic visual process. There is a deep connection between the motion of individual atoms in the microscopic sense and the transformation of the overall distribution in the macroscopic sense, and this is what we need to uncover. While exploring this topic in depth I ran into many, many problems: how to describe a general motion process, how to build a stable and broadly applicable representation of the atoms, how to characterize the link between microscopic motion and the transformation of the macroscopic distribution, and more. In this process I discovered two things: my existing mathematical background was far from adequate for studying these problems in depth; and mathematics contains many ideas and tools that are very well suited to these problems but have received little attention from researchers in the applied sciences. So I resolved to plunge into the vast sea of mathematics, hoping that by the time I come out again I will have more powerful weapons with which to face the challenges of these problems.

My journey is not over, and compared with this vast and profound world my horizon still looks very narrow. Here I only want to say how, in my eyes, mathematics develops step by step from the elementary to the advanced, and what good more advanced mathematics actually does for concrete applications.

Set theory: the common foundation of modern mathematics. Modern mathematics has countless branches, but they all share one common foundation, set theory, and because of it the enormous family of mathematics has a common language. Set theory contains some of the most basic concepts: set, relation, function, and equivalence, which are almost inevitably present in the language of every other branch of mathematics. Understanding these simple concepts is the basis for further study of other mathematics, and I believe students in science and engineering are familiar with all of them.

There is, however, one important thing that is not nearly as well known: the Axiom of Choice. This axiom says that "given any collection of non-empty sets, one can pick one element from each of them," which seems like a proposition that could not possibly be more obvious. Yet this seemingly ordinary axiom can produce some rather strange conclusions, such as the Banach-Tarski paradox: "a ball can be divided into five pieces which, after a series of rigid transformations (translations and rotations), can be reassembled into two balls of the same size as the original." Precisely because of such conclusions that run completely against common sense, the mathematical community argued fiercely for quite a long time about whether to accept it. Today mainstream mathematicians essentially accept it, because important theorems in many branches of mathematics depend on it. Among the subjects we will return to later, the following theorems depend on the Axiom of Choice:

1. Topology: the Baire Category Theorem.
2. Real analysis (measure theory): the existence of Lebesgue non-measurable sets.
3. The four major theorems of functional analysis: the Hahn-Banach Extension Theorem, the Banach-Steinhaus Theorem (Uniform Boundedness Principle), the Open Mapping Theorem, and the Closed Graph Theorem.

On the foundation of set theory, modern mathematics has two great families: Analysis and Algebra. As for the others, such as geometry and probability theory, in the classical era they stood alongside algebra, but their modern versions are essentially built on top of analysis or algebra, so in the modern sense they are not parallel to analysis and algebra.

Analysis: a grand edifice built on limits.

Calculus: the classical era of analysis, from Newton to Cauchy. Let us start with Analysis, which grew out of Calculus; this is also why some calculus textbooks are titled "mathematical analysis." The scope of analysis, however, goes far beyond this; the calculus we learn in the first year of college is only an introduction to classical analysis. Analysis studies many objects, including derivatives, integrals, differential equations, and infinite series; all of these basic concepts are introduced in elementary calculus. If there is one idea running through it all, it is the limit: this is the soul of all of analysis, not merely of calculus.

A story many people have heard is the dispute between Newton and Leibniz over credit for inventing calculus. In fact, in their era many of the tools of calculus were already being applied in science and engineering, but the foundations of calculus had not truly been established. The ghost of the "infinitesimal," which for a long time could not be explained clearly, troubled the mathematical world for more than a hundred years; this is the "second mathematical crisis." Only when Cauchy re-established the fundamental concepts of calculus from the viewpoint of limits of sequences did the subject begin to have a fairly solid foundation. To this day, the entire edifice of analysis still rests on the cornerstone of the limit.
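To make the central role of the limit concrete, here is the standard epsilon-N definition of the limit of a sequence that Cauchy's program leads to, written out in LaTeX (an added annotation in textbook notation, not part of the original essay):

% Limit of a sequence: the epsilon-N definition underlying Cauchy's rebuilding of calculus.
\[
  \lim_{n \to \infty} x_n = L
  \quad\Longleftrightarrow\quad
  \forall \varepsilon > 0 \;\; \exists N \in \mathbb{N} \;\;
  \forall n \ge N : \; |x_n - L| < \varepsilon .
\]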

Cauchy provided a rigorous language for the development of analysis, but he did not solve all of the problems of calculus. In the 19th century the world of analysis still had some clouds that would not disperse, and the most important unresolved one was "the problem of whether a function is integrable." The integral we learn in today's calculus textbooks, obtained by "partitioning the interval ever more finely and taking the limit of sums of rectangle areas," was proposed by Riemann around 1850 and is called the Riemann integral. But which functions have a Riemann integral (are Riemann integrable)? Mathematicians proved early on that a continuous function defined on a closed interval is Riemann integrable. Such results were not satisfying, however; engineers needed to integrate piecewise continuous functions.

Real analysis: modern analysis built on real number theory and measure theory. In the middle and late 19th century, the integrability of discontinuous functions remained an important topic in analysis. The study of the Riemann integral on closed intervals revealed that the key to integrability is that "the points of discontinuity are few enough." Functions with only finitely many discontinuities are integrable, yet mathematicians constructed many integrable functions with infinitely many discontinuities. Clearly, finite versus infinite is not a suitable standard for measuring the size of a set of points.

In the course of investigating the question of "the size of a point set," mathematicians discovered that the real line, something they had believed they understood thoroughly, has many properties they had not anticipated. Supported by the idea of the limit, the theory of the real numbers was established at this time, its hallmark being several equivalent theorems characterizing the completeness of the real numbers (the least upper bound theorem, the nested interval theorem, the Cauchy convergence criterion, the Bolzano-Weierstrass Theorem, the Heine-Borel Theorem, and so on). These theorems make explicit the fundamental difference between the real numbers and the rational numbers: completeness (very loosely speaking, being closed under limit operations). As the understanding of the real numbers deepened, the question of how to measure "the size of a point set" also saw a breakthrough: Lebesgue creatively combined the algebra of sets with the notion of outer content (an early prototype of "outer measure") to establish Measure Theory, and on top of it an integral based on measure, the Lebesgue Integral. With this new notion of integral, the question of integrability became transparent.

The real number theory, measure theory, and Lebesgue integration described above make up the branch of mathematics we now call Real Analysis; some books also call it the theory of functions of a real variable. To the applied sciences, real analysis may not seem as "practical" as classical calculus, since it is hard to derive algorithms directly from it. Moreover, some of the "hard cases" it handles, such as functions that are discontinuous everywhere, or continuous everywhere yet nowhere differentiable, do not look realistic to an engineer's eye. But I believe it is not a game of pure mathematical concepts; its practical significance is that it provides a solid foundation for many branches of modern applied mathematics. Below I list just a few of its uses:

1. The space of Riemann integrable functions is not complete, but the space of Lebesgue integrable functions is complete. Roughly speaking, the function that a sequence of Riemann integrable functions converges to need not be Riemann integrable, whereas a suitably convergent sequence of Lebesgue integrable functions converges to a Lebesgue integrable function. In functional analysis and in approximation theory one constantly needs to discuss "limits of functions" or "series of functions"; with only the Riemann notion of integral, such discussions would be almost unimaginable. When we see papers mention the Lp function spaces, these are based on the Lebesgue integral (the precise definition is written out after this list).

2. The Lebesgue integral is the foundation of the Fourier transform (which is everywhere in engineering). Many elementary textbooks on signal processing may bypass the Lebesgue integral and go straight to practical material without touching its mathematical foundations, but for studying the deeper problems, especially if one hopes to do some work on the theory side, it cannot always be avoided.

3. As we will see below, measure theory is the foundation of modern probability theory.
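For reference, the objects mentioned in items 1 and 2 above can be written out precisely (an added annotation in standard textbook notation, not part of the original essay): the L^p norm and space, and the Fourier transform defined for Lebesgue integrable functions.

% L^p norm and space over a measure space (X, mu), for 1 <= p < infinity;
% L^p(mu) is complete in this norm (the Riesz-Fischer theorem).
\[
  \|f\|_{p} = \Bigl( \int_X |f|^{p} \, d\mu \Bigr)^{1/p},
  \qquad
  L^{p}(\mu) = \bigl\{\, f \text{ measurable} : \|f\|_{p} < \infty \,\bigr\}.
\]

% Fourier transform of f in L^1(R); the Lebesgue integral makes the definition well posed.
\[
  \hat{f}(\xi) = \int_{\mathbb{R}} f(x)\, e^{-2\pi i \xi x} \, dx,
  \qquad f \in L^{1}(\mathbb{R}).
\]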

Topology: extending analysis from the real line to general spaces, the abstract foundation of modern analysis. With the theory of the real numbers in place, people began to generalize limits and continuity to analysis on more general spaces. In fact, many concepts and theorems based on the real numbers are not peculiar to the real numbers; many of their properties can be abstracted and carried over to more general spaces. This generalization of the real line led to the establishment of Point-set Topology, in which many concepts that had existed only for the real numbers were extracted and discussed in full generality. In topology, four C's form its core.

Closed set. In the modern axiomatic system of topology, open sets and closed sets are the most basic concepts; everything else is derived from them. These two concepts are the generalizations of open and closed intervals, and their fundamental status was not recognized at the outset. It took quite a long time for people to realize that the concept of an open set is the foundation of continuity, while closed sets are closed under limit operations, and the limit is precisely the root of analysis.

Continuous function. A continuous function has a definition in calculus given in ε-δ language; in topology it is defined as "a function under which the preimage of every open set is open." The second definition is equivalent to the first, merely rewritten in more abstract language. In my view it is the third (equivalent) definition that most fundamentally reveals the essence of a continuous function: "a continuous function is a function that preserves limit operations." For example, if y is the limit of the sequence x1, x2, x3, ..., then, provided f is continuous, f(y) is the limit of f(x1), f(x2), f(x3), ....
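The three formulations just described can be written side by side (an added annotation in standard notation; for metric spaces the three are equivalent, while in general topological spaces the sequential version needs nets):

% Three equivalent ways to say f is continuous.
\begin{align*}
  &\text{(1) } \varepsilon\text{--}\delta:\quad
    \forall x_0,\ \forall \varepsilon > 0\ \exists \delta > 0:\;
    |x - x_0| < \delta \implies |f(x) - f(x_0)| < \varepsilon, \\
  &\text{(2) topological:}\quad
    f^{-1}(U) \text{ is open for every open set } U, \\
  &\text{(3) limit-preserving:}\quad
    x_n \to y \implies f(x_n) \to f(y).
\end{align*}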

The importance of continuous functions can be seen by analogy with other branches. In group theory, for example, the basic operation is "multiplication," and for groups the most important maps are the "homomorphisms," the maps that preserve "multiplication." In analysis the basic operation is "taking limits," so the position of continuous functions in analysis is comparable to that of homomorphisms in algebra.

Connected set. A slightly narrower concept is path connectedness, meaning that any two points in the set are joined by a continuous path, which is probably the notion most people have in mind.

相关文档
最新文档