Implementing Simple Linear Regression in R

Objectives:
(1) Understand simple linear regression and be able to carry it out in software.
(2) Compute the regression coefficients in software and test the significance of the regression equation.
(3) Master residual analysis.
(4) Master interval estimation of the regression coefficients.
(5) Master prediction and control.
Environment: one PC with R installed.
I. Principles
(1) The simple linear regression model y = β0 + β1x + ε.
(2) Significance tests for the regression equation: 1. the t test; 2. the F test; 3. the significance test of the correlation coefficient.
(3) Residual analysis.
(4) Interval estimation of the regression coefficients.
(5) Prediction: 1. point prediction; 2. interval prediction.
II. Content and Steps
Case 3: An insurance company is concerned about the amount of overtime worked at its head-office operations department and decides to investigate the current situation. Over 10 weeks it collected the weekly overtime hours and the number of new policies issued; x is the number of new policies issued per week and y is the weekly overtime (hours). The data are in Table 2-1.
Table 2-1  Weekly overtime and new policies issued
Week    x     y
1       825   3.5
2       215   1.0
3       1070  4.0
4       550   2.0
5       480   1.0
6       920   3.0
7       1350  4.5
8       325   1.5
9       670   3.0
10      1215  5.0
1. Draw the scatter plot
X<-c(825,215,1070,550,480,1350,325,670,1215)
Y<-c(3.5,1,4,2,1,4.5,1.5,3,5)
plot(X,Y,main="Scatter plot of weekly overtime vs. new policies issued")
abline(lm(Y~X))
Analysis: the plot shows that weekly overtime and the number of new policies issued are roughly linearly related, so a simple linear model can be considered.
2. Fit the regression equation and test it
a<-lm(Y~X)
summary(a)
Call:
lm(formula = Y ~ X)
Residuals:
    Min      1Q  Median      3Q     Max
 -0.xxx  -0.xxx     ...     ...     ...
Coefficients:
            Estimate    Std. Error  t value  Pr(>|t|)
(Intercept) 0.xxxxxxxx  0.xxxxxxxx  0.339    0.745
X           0.xxxxxxxx  0.xxxxxxxx  x.465    6.34e-05 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.485 on 7 degrees of freedom
Multiple R-squared: ... ; F-statistic: 71.65 on 1 and 7 DF, p-value: 6.344e-05
Analysis: the output shows that the regression passes both the coefficient tests and the overall test of the equation, so the fitted equation is
Y = 0.xxxxxxx + 0.xxxxxxx X
3. Parameter estimation
confint(a)
                 2.5 %          97.5 %
(Intercept)  -0.xxxxxxxx0   0.xxxxxxxx4   # confidence interval for the intercept
X             0.xxxxxxxx7   0.xxxxxxxx1   # confidence interval for the slope
4. Prediction
N<-data.frame(X=260)   # input X=260
N
    X
1 260
predict(a, N, interval="prediction")
      fit         lwr    upr
1  1.xxx  -0.xxxxxxx  2.xxx
Analysis: from the computation above, the prediction at X = 260 is ŷ(260) = 1.xxx with prediction interval [-0.xxx, 2.374].
The confidence interval for the mean E(y) is computed as follows:
yconf<-predict(a, N, interval="confidence")
yconf
     fit    lwr    upr
1  1.xxx  0.xxx  1.xxx
Analysis: the confidence interval for the mean E(y) is [1.069, 1.692].
5. Residual analysis
e<-residuals(a)
e
     1           2           3           4           5           6
 0.xxxxxxxx  0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx
     7           8           9
 0.xxxxxxxx  0.xxxxxxxx  0.xxxxxxxx
ZRE<-e/1.319   ## standardised residuals of model a
ZRE
     1           2           3           4           5           6
 0.xxxxxxxx  0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx
     7           8           9
 0.xxxxxxxx  0.xxxxxxxx  0.xxxxxxxx
SRE<-rstandard(a)   ## studentised residuals
SRE
     1           2           3           4           5           6
 0.xxxxxxxx  0.xxxxxxxx -0.xxxxxxxx -0.xxxxxxxx -1.xxxxxxxx -1.xxxxxxxx
     7           8           9
 0.xxxxxxxx  0.xxxxxxxx  1.xxxxxxxx
Analysis: all studentised residuals are below 2 in absolute value, so the model satisfies the basic assumptions.
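Most digits in the transcript above are masked in the source, and the 7 residual degrees of freedom indicate that only 9 of the 10 weeks were entered. For reference, here is a self-contained sketch of the same analysis on all 10 rows of Table 2-1; its exact numbers will therefore differ slightly from the original output:
X <- c(825,215,1070,550,480,920,1350,325,670,1215)
Y <- c(3.5,1,4,2,1,3,4.5,1.5,3,5)
fit <- lm(Y ~ X)
summary(fit)                                   # coefficient estimates and significance tests
confint(fit)                                   # 95% intervals for intercept and slope
nd <- data.frame(X = 260)
predict(fit, nd, interval = "prediction")      # prediction interval at X = 260
predict(fit, nd, interval = "confidence")      # interval for E(Y) at X = 260
rstandard(fit)                                 # studentised residuals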
Regression Analysis with R

Mathematical Statistics Computer Lab Report
Lab topic: regression analysis with R
Objectives:
1. Deepen the understanding of the basic ideas of regression analysis and learn to use regression for statistical inference.
2. Learn how to carry out regression analysis with R.
Basic theory and method of simple linear regression:
1. Estimate the parameters of the econometric model from the sample observations to obtain the regression equation.
2. Test the significance of the regression equation and of the parameter estimates.
3. Use the regression equation for analysis, evaluation, and prediction.
Problem 11, p. 430. Steps:
y<-c(2813,2705,11103,2590,2131,5181)
x<-c(3.25,3.20,5.07,3.14,2.90,4.02)
xbar<-mean(x)
L11<-sum((x-xbar)^2)
ybar<-mean(y)
Lyy<-sum((y-ybar)^2)
L1y<-sum((x-xbar)*(y-ybar))
n<-length(x)
beta_1<-L1y/L11
beta_0<-ybar-xbar*beta_1
sigma2_hat<-(Lyy-beta_1*L1y)/(n-2)
sigma_hat<-sqrt(sigma2_hat)
Results and analysis:
> L11
[1] 3.321333
> Lyy
[1] 59353704
> L1y
[1] 13836.19
> beta_0
[1] -10562.69
> beta_1
[1] 4165.854
> sigma_hat
[1] 654.6287
The fitted line is ŷ = -10562.69 + 4165.854x.
Problem 18, p. 432. Steps:
y<-c(64,60,71,61,54,77,81,93,93,51,76,96,77,93,95,54,168,99)
X<-matrix(0, nrow = 18, ncol = 4)
X[,1]<-rep(1,18)
X[,2]<-c(0.4,0.4,3.1,0.6,4.7,1.7,9.4,10.1,11.6,12.6,10.9,23.1,23.1,21.6,23.1,1.9,26.8,29.9)
X[,3]<-c(53,23,19,34,24,65,44,31,29,58,37,46,50,44,56,36,58,51)
X[,4]<-c(158,163,37,157,59,123,46,117,173,112,111,114,134,73,168,143,202,124)
beta<-solve(t(X)%*%X)%*%t(X)%*%y
yhat<-X%*%beta
ytilde<-y-yhat   # residuals
The fitted equation is ŷ = 43.65 + 1.78x1 - 0.08x2 + 0.16x3.
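As a cross-check (a sketch reusing the vectors defined above; run the first two lines before y is reassigned for problem 18), lm() should reproduce both fits:
fit1 <- lm(y ~ x)                       # problem 11: coef(fit1) should be about -10562.69 and 4165.854
summary(fit1)$sigma                     # about 654.6287, the residual standard error
fit2 <- lm(y ~ X[,2] + X[,3] + X[,4])   # problem 18: coef(fit2) should be about 43.65, 1.78, -0.08, 0.16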
Simple Linear Regression Lab Report

Lab 1: Simple Linear Regression
I. Objective: master the estimation and application of simple linear regression and become familiar with the basic operation of EViews.
II. Requirement: carry out a simple linear regression analysis and prediction using problem 12 on p. 61 of the textbook.
III. Principle: ordinary least squares.
IV. Prerequisites: the principle of least squares, the t test, the goodness-of-fit test, and point and interval prediction.
V. Content: Chapter 2, exercise 12. The table below gives statistics on tax revenue Y and regional GDP for the regions of China in 2007.
Unit: 100 million yuan. (1) Draw the scatter plot, fit a simple linear regression of tax revenue on GDP, and explain the economic meaning of the slope; (2) test the fitted regression equation; (3) if a region's GDP is 8500 (100 million yuan) in 2008, find the predicted tax revenue and its prediction interval.
VI. Steps
1. Create the workfile and enter the data:
(1) Double-click the desktop shortcut to start Microsoft Office Excel, as in Figure 1; enter the problem data into the spreadsheet and save it.
(2) Double-click the desktop shortcut to start EViews 6.
(3) Click File/New/Workfile… to open the Workfile Create dialog.
In the Workfile structure type list on the left of the Workfile Create dialog, choose Unstructured/Undated; enter the sample size, 31, under Data Range on the right; and enter the workfile name P53 at the lower right, as in Figure 2.
Figure 1; Figure 2.
(4) Next, import the data: click File/Import/Read Text-Lotus-Excel..., select the Excel file saved in step (1), and in the Excel Spreadsheet Import dialog enter the starting cell B2 in the Upper-left data cell field, the worksheet sheet1 in the Excel 5+ sheet name field, and the variable names Y GDP in the Names for series or Number if named in file field, as in Figure 3; click OK to reach the screen shown in Figure 4.
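For comparison, the regression and prediction requested in parts (1)-(3) can be sketched in R, the tool used elsewhere in this collection (assuming vectors Y and GDP holding the 31 regional observations imported above):
fit <- lm(Y ~ GDP)
summary(fit)    # the slope estimates the extra tax revenue per extra unit of GDP
predict(fit, data.frame(GDP = 8500), interval = "prediction", level = 0.95)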
Linear Regression Analysis Lab Report

Lab 1: Linear Regression Analysis
Objective: through this lab, master the basic ideas and methods of regression analysis, understand the computational steps of least squares and the t tests of the specified model, be able to judge from the test results whether the model is reasonable, and improve the model accordingly.
Understand the meaning and importance of residual analysis; be able to test the regression residuals for normality and independence, and thereby judge whether the model satisfies the basic assumptions of regression analysis.
Content: build a linear regression model with blood pressure as the response and the other variables as predictors, and analyse the relationship between blood pressure and the other variables.
Steps:
1. Choose File | Open | Data and open gaoxueya.sav. Figure 1-1 shows part of the gaoxueya data set.
2. Choose Analyze | Regression | Linear… to open the Linear Regression dialog, as in Figure 1-2.
Move blood pressure (y) from the list on the left into the Dependent box at the upper right as the response.
Then move age (x1), weight (x2), and smoking index (x3) into the Independent(s) box as predictors.
In the Method drop-down menu, specify how the predictors enter the analysis.
Figure 1-2: the Linear Regression dialog.
3. Click the Statistics button to open the Linear Regression: Statistics dialog, as in Figure 1-3.
Figure 1-3: the statistics dialog.
4. Click Continue to return to the Linear Regression dialog. Click Plots to open the Linear Regression: Plots dialog, as in Figure 1-4, and complete the settings shown there.
Figure 1-4: the plots dialog.
5. Click Continue to return to the Linear Regression dialog, then click Save to open the Linear Regression: Save dialog, as in Figure 1-5, and complete the settings shown there.
Figure 1-5: the save dialog.
6. Click Continue to return to the Linear Regression dialog, then click Options to open the Linear Regression: Options dialog, as in Figure 1-6.
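Although this report works through the SPSS dialogs, the same model can be sketched in R for reference (a sketch only: it assumes the columns in gaoxueya.sav are named y, x1, x2, x3 as described above, and uses the haven package to read SPSS files):
library(haven)
gxy <- read_sav("gaoxueya.sav")            # assumed file and column names
fit <- lm(y ~ x1 + x2 + x3, data = gxy)    # blood pressure on age, weight, smoking index
summary(fit)                               # coefficients, t tests, overall F test
shapiro.test(resid(fit))                   # normality check on the residuals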
Implementing a Linear Regression Model in R

Executive Summary

The purpose of this project is to explore the impact of a set of variables like horsepower, transmission configuration, engine cylinder configuration, etc. on the mileage mpg (miles per gallon), and to come up with a model that can accurately predict the mileage of a car. An ordinary least squares model was fit, and the best model was selected based on RMSE, R-squared and AIC.

Most of the predictors in the dataset are correlated, which makes the ordinary least squares solution unstable and increases its variability. Therefore, we also used partial least squares, which finds components that maximally summarise the variation of the predictors while simultaneously requiring the components to have maximum correlation with the response. One component was chosen, as it gave the minimum RMSE value.

Comparing the two models in terms of RMSE and R-squared, we conclude that PLS is somewhat better than the linear regression, as its RMSE is significantly lower.

Exploratory Analysis

We plot box plots to visualise the distribution of mpg across the groups of each categorical variable.[1] In Plot-1 (see the appendix for all plots) we see that the mpg of cars with automatic transmission is much lower than that of manual cars. We can check whether the difference between their mean values is statistically significant through a two-sample independent t-test.

aggregate(mpg~am, data = mtcars, mean)
##   am      mpg
## 1  0 17.14737
## 2  1 24.39231
t.test(mpg~am, data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
##        17.14737        24.39231

Since the p-value < 0.05, we reject the null hypothesis that the true mean difference is equal to zero. Hence the difference is statistically significant, and cars with automatic transmission have a lower mpg on average. Plot-2 shows that the mpg of cars with a straight (S) engine configuration is generally much higher. We can confirm this using a t-test as shown above.

t.test(mpg~vs, data= mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by vs
## t = -4.6671, df = 22.716, p-value = 0.0001098
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.462508 -4.418445
## sample estimates:
## mean in group 0 mean in group 1
##        16.61667        24.55714

Since the p-value is 0.0001, we reject the null hypothesis that the true mean difference between cars with different engine configurations is zero. The difference is statistically significant: the configuration of the engine significantly affects the mileage. From Plot-3 it is safe to assume that the higher the number of cylinders in the car, the lower the mpg.
Such definite conclusions cannot be drawn from Plot-4 and Plot-5, but we can perform t-tests for them.

head(mtcars,3)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
dim(mtcars)
## [1] 32 11
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"

For building the model we would normally split the data randomly into training and test sets, e.g. with the sample.split function in the caTools package.[2]

#install.packages("caTools")
library(caTools)
set.seed(88) # fix the seed so the random split is reproducible
spl = sample(1:nrow(data), size=0.85*nrow(data))
train = data[spl,]
test = data[-spl,]
# one way of splitting a data set into training and test data when the outcome is continuous
mtcars1<-mtcars[-(28:32), ]  # the training set
test_data<-mtcars[28:32, ]   # the test data set
test_data
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

Regression Analysis

We have seen graphically that manual transmission is better for mpg. Now we will quantify the difference between them.

mtcars1$cyl<-factor(mtcars1$cyl)
mtcars1$vs<-factor(mtcars1$vs)
mtcars1$am<-factor(mtcars1$am)
mtcars1$carb<-factor(mtcars1$carb)
mtcars1$gear<-factor(mtcars1$gear)

There are various methods for choosing a subset of variables for the best regression model, i.e. one that explains the variability in the response well. There are two classes of algorithms for this:
1. the all-possible-regressions approach;
2. sequential selection: i. forward selection; ii. backward elimination.
Models are selected on the basis of adjusted R-squared, MS_Res, Mallows' Cp statistic and/or AIC.
Here I will use the all-possible-regressions approach and choose the best model on the basis of AIC using the stepAIC function.

Selecting the Best Model

#install.packages("MASS")
library(MASS)
stepAIC(lm(mpg~., mtcars1), direction ="both") # direction can be forward, backward or both
## Start: AIC=58.36
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
##        Df Sum of Sq     RSS    AIC
## - cyl   2    0.5244  77.711 54.543
## - qsec  1    1.3899  78.577 56.842
## - am    1    2.0239  79.211 57.059
## - vs    1    3.3084  80.495 57.494
## <none>              77.187 58.361
## - wt    1    6.1828  83.369 58.441
## - hp    1    6.2913  83.478 58.476
## - gear  2   13.0445  90.231 58.577
## - carb  3   21.9188  99.105 59.109
## - disp  1   10.0829  87.270 59.675
## - drat  1   24.4139 101.601 63.781
##
## Step: AIC=54.54
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
##        Df Sum of Sq     RSS    AIC
## - qsec  1     0.975  78.686 52.880
## - am    1     2.272  79.983 53.322
## - vs    1     3.685  81.396 53.794
## <none>              77.711 54.543
## - gear  2    14.013  91.724 55.019
## - wt    1     8.006  85.717 55.191
## - hp    1     9.495  87.206 55.656
## - disp  1    13.237  90.948 56.790
## + cyl   2     0.524  77.187 58.361
## - carb  3    36.381 114.092 58.912
## - drat  1    28.307 106.018 60.930
##
## Step: AIC=52.88
## mpg ~ disp + hp + drat + wt + vs + am + gear + carb
##
##        Df Sum of Sq     RSS    AIC
## - am    1     2.925  81.611 51.865
## - vs    1     3.430  82.116 52.032
## <none>              78.686 52.880
## - wt    1     7.356  86.042 53.293
## - gear  2    15.910  94.595 53.852
## - hp    1    10.714  89.400 54.327
## + qsec  1     0.975  77.711 54.543
## - disp  1    12.295  90.981 54.800
## + cyl   2     0.109  78.577 56.842
## - drat  1    31.300 109.986 59.922
## - carb  3    50.314 128.999 60.227
##
## Step: AIC=51.87
## mpg ~ disp + hp + drat + wt + vs + gear + carb
##
##        Df Sum of Sq     RSS    AIC
## - vs    1     0.580  82.191 50.057
## - wt    1     4.872  86.483 51.431
## <none>              81.611 51.865
## - hp    1     7.895  89.506 52.359
## - disp  1     9.458  91.069 52.826
## + am    1     2.925  78.686 52.880
## - gear  2    16.904  98.515 52.948
## + qsec  1     1.628  79.983 53.322
## + cyl   2     0.181  81.430 55.805
## - drat  1    39.399 121.010 60.501
## - carb  3    60.038 141.649 60.753
##
## Step: AIC=50.06
## mpg ~ disp + hp + drat + wt + gear + carb
##
##        Df Sum of Sq     RSS    AIC
## <none>              82.191 50.057
## - hp    1     7.949  90.140 50.549
## - gear  2    16.535  98.726 51.006
## + qsec  1     0.784  81.406 51.798
## + vs    1     0.580  81.611 51.865
## - wt    1    12.873  95.064 51.985
## + am    1     0.075  82.116 52.032
## - disp  1    17.622  99.812 53.301
## + cyl   2     0.091  82.100 54.027
## - drat  1    40.417 122.608 58.855
## - carb  3    61.085 143.276 59.061
##
## Call:
## lm(formula = mpg ~ disp + hp + drat + wt + gear + carb, data = mtcars1)
##
## Coefficients:
## (Intercept)     disp       hp     drat       wt
##     6.43785  0.03444 -0.03613  5.56520 -2.27712
##       gear4    gear5    carb2    carb3    carb4
##     4.49717  2.95981 -4.03391 -1.42888 -7.24308

m1<-lm(mpg ~disp +hp +drat +wt +vs +am +gear +carb, data=mtcars1)
# checking for multicollinearity using VIF and a correlation chart
#install.packages("car")
library(car)
vif(m1)
##            GVIF Df GVIF^(1/(2*Df))
## disp   58.92572  1        7.676309
## hp     23.34290  1        4.831449
## drat   14.24762  1        3.774602
## wt     22.09888  1        4.700944
## vs     24.57621  1        4.957440
## am     23.80195  1        4.878724
## gear   36.26722  2        2.454023
## carb  144.39082  3        2.290463
#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
chart.Correlation(mtcars1[,c(1,3,4,5,6,7)], histogram =T)

The stepAIC function forms models from all possible combinations of regressors; the model is then selected on the basis of AIC (Akaike's Information Criterion). The variance inflation factor (VIF) assesses how much the variance of an estimated regression coefficient increases when the predictors are correlated.
If no factors are correlated, the VIFs will all be 1. A VIF between 5 and 10 indicates high correlation that may be problematic, and if the VIF goes above 10 you can assume that the regression coefficients are poorly estimated due to multicollinearity. Remove highly correlated predictors from the model: if two or more factors have a high VIF, remove one of them. Because they supply redundant information, removing one of the correlated factors usually doesn't drastically reduce the R-squared. Consider using stepwise regression, best-subsets regression, or specialised knowledge of the data set to remove these variables.

We remove disp, as it is highly correlated with all the other regressors, along with the other regressors that had a high VIF. We fit various other models and select the best of them:

m2<-lm(mpg~hp+drat+wt+am+carb-1, mtcars1)
extractAIC(m2)
## [1] 8.00000 51.25171
summary(m2)$coef
##           Estimate   Std. Error      t value    Pr(>|t|)
## hp    -0.023422257   0.01556769 -1.504543172 0.148884027
## drat   4.979435055   1.74005818  2.861648600 0.009984374
## wt    -0.003874243   1.15347572 -0.003358756 0.997355121
## am0    7.775902513   8.60531624  0.903616124 0.377513564
## am1    9.979994455   8.74368339  1.141394765 0.267887780
## carb2 -2.127666831   1.36934339 -1.553786172 0.136734494
## carb3 -2.531807306   1.92848074 -1.312850705 0.204870209
## carb4 -5.842635235   2.02749628 -2.881699602 0.009555014
# the coefficients are not significant and the VIF is also somewhat high

Variables like hp (horsepower) and wt are expected to be significant but are not. Hence the problem due to multicollinearity still persists in the model, so we remove wt. Looking at the correlation chart, we see that most of the regressor variables are correlated with each other, so we pick a model with fewer regressors.

m9<-lm(mpg~am+wt, mtcars1)
m6<-lm(mpg~am+wt+hp, mtcars1)
anova(m6,m9) # the added term is significant
## Analysis of Variance Table
##
## Model 1: mpg ~ am + wt + hp
## Model 2: mpg ~ am + wt
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)
## 1     23 154.23
## 2     24 212.61 -1    -58.38 8.706 0.007174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m7<-lm(mpg~am+wt+hp+cyl, mtcars1)
anova(m6,m7); # not significant according to the ANOVA
## Analysis of Variance Table
##
## Model 1: mpg ~ am + wt + hp
## Model 2: mpg ~ am + wt + hp + cyl
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     23 154.23
## 2     21 126.14  2     28.09 2.3382 0.1211
summary(m6)
##
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6144 -1.5918 -0.3638  1.1406  5.5011
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.79946    2.88463  11.370 6.43e-11 ***
## am1          2.63214    1.60467   1.640  0.11455
## wt          -2.23388    1.03965  -2.149  0.04242 *
## hp          -0.04513    0.01530  -2.951  0.00717 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 23 degrees of freedom
## Multiple R-squared: 0.8416, Adjusted R-squared: 0.8209
## F-statistic: 40.72 on 3 and 23 DF, p-value: 2.295e-09
vif(m6)
##       am       wt       hp
## 2.161742 4.201016 3.160674
extractAIC(m6) # the variance inflation factors and AIC look good
## [1] 4.00000 55.05088

According to our objectives, the model must contain wt and am. Hence we start from there and perform ANOVA tests to check whether each added regressor contributes new, relevant information to the model. m6 is significant among all of them, and its variance inflation factors and AIC are also low.
Mileage at zero weight and horsepower doesn't make any sense, so we remove the intercept.

# final model
m6<-lm(mpg~am+wt+hp-1, mtcars1)
summary(m6)
##
## Call:
## lm(formula = mpg ~ am + wt + hp - 1, data = mtcars1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6144 -1.5918 -0.3638  1.1406  5.5011
##
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)
## am0 32.79946    2.88463  11.370 6.43e-11 ***
## am1 35.43160    1.90375  18.612 2.31e-15 ***
## wt  -2.23388    1.03965  -2.149  0.04242 *
## hp  -0.04513    0.01530  -2.951  0.00717 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 23 degrees of freedom
## Multiple R-squared: 0.9869, Adjusted R-squared: 0.9847
## F-statistic: 434.1 on 4 and 23 DF, p-value: < 2.2e-16
# all coefficients are significant
confint(m6, level =0.95)
##           2.5 %      97.5 %
## am0 26.83215463 38.76676260
## am1 31.49339554 39.36979756
## wt  -4.38455441 -0.08320047
## hp  -0.07677241 -0.01348965

Let's check how accurately it can predict the mileage of the cars in test_data.

test_data$cyl<-factor(test_data$cyl)
test_data$vs<-factor(test_data$vs)
test_data$am<-factor(test_data$am)
test_data$carb<-factor(test_data$carb)
test_data$gear<-factor(test_data$gear)
predict(m6, test_data) ; test_data
## Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
##     26.95193       16.43561     21.34583      12.33776   24.30214
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
1-(abs(predict(m6, test_data)-test_data[1]))/test_data[1]
##                      mpg
## Lotus Europa   0.8865768
## Ford Pantera L 0.9597713
## Ferrari Dino   0.9164555
## Maserati Bora  0.8225173
## Volvo 142E     0.8643862
# comparing predictions through the mean and standard deviation of accuracy
mean((1-(abs(predict(m6, test_data)-test_data[1]))/test_data[1])[,1]); mean((1-(abs(predict(m7, test_data)-test_data[1]))/test_data[1])[,1])
## [1] 0.8899414
## [1] 0.9041645
sd((1-(abs(predict(m6, test_data)-test_data[1]))/test_data[1])[,1]); sd((1-(abs(predict(m7, test_data)-test_data[1]))/test_data[1])[,1])
## [1] 0.05193654
## [1] 0.07878624

Predictions from m6 and the actual mileage of the cars in the test data are very close, with an average accuracy of about 89%; model m6 is parsimonious and interpretable too.

Checking the basic assumptions of regression:

res<-m6$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.95114, p-value = 0.2285
#install.packages("ggfortify")
library(ggfortify)
autoplot(m6, label.size=4)

Since p > 0.05, we fail to reject the null hypothesis that the residuals are normally distributed. The diagnostic plots suggest that the residuals are homoscedastic, and the points in the Q-Q plot lie close to the line, confirming their normality. The basic assumptions of regression are met.

We can finally say that, compared to cars with automatic transmission (0), we would expect the mileage of cars with manual transmission (1) to be about 2.6 miles per gallon higher on average, holding the other regressors fixed. From the coefficient of the weight variable in the final model, we can conclude that a 1000 lb increase in the weight of a car decreases the mileage by about 2.234 miles per gallon.

Partial Least Squares

The predictors in the dataset are correlated, which makes the ordinary least squares solution unstable and increases its variability.
Therefore, we will use partial least squares, which finds components that maximally summarise the variation of the predictors while simultaneously requiring the components to have maximum correlation with the response.

# remove the columns which contain binary or nominal variables
library(e1071)
seg_data<-mtcars[ ,-c(2,8:11)]
apply(seg_data, 2, skewness)
##       mpg      disp        hp      drat        wt      qsec
## 0.6106550 0.3816570 0.7260237 0.2659039 0.4231465 0.3690453
library(caret)
trans<-preProcess(seg_data, method =c("center", "scale"))
transformed<-predict(trans, seg_data)
complt<-data.frame(transformed, mtcars[,c(2,8:11)])
TrainXtrans<-data.frame(transformed[-1] , mtcars[,c(2,8:11)])
TrainYtrans<-transformed[,1]

The coefficients of skewness are not very large, so Box-Cox transformations are not required.

cntrl<-trainControl(method ="cv", number =8)
plsTune <-train(TrainXtrans, TrainYtrans, method ="pls", trControl = cntrl)
plsTune
## Partial Least Squares
##
## 32 samples
## 10 predictors
##
## No pre-processing
## Resampling: Cross-Validated (8 fold)
## Summary of sample sizes: 28, 28, 29, 27, 27, 29, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE      Rsquared  MAE
## 1     0.4359938 0.9008139 0.3659894
## 2     0.4689337 0.8752059 0.3888295
## 3     0.4524320 0.8711316 0.3784411
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 1.

Comparison between OLS and PLS

We now compare the selected models from these two methods.

dt<-t(data.frame(c(0.8935, 0.9021),c(0.40844, 6.556)))
rownames(dt)<-c("R-squared","RMSE")
colnames(dt)<-c("PLS","OLS")
library(knitr);library(kableExtra)
dt %>%kable("html") %>%kable_styling()

            PLS      OLS
R-squared   0.89350  0.9021
RMSE        0.40844  6.5560

The R-squared values do not differ much, but the RMSE of PLS is much smaller than the RMSE of OLS. Hence the PLS model is superior to OLS.

Appendix

This section includes all the plots mentioned above.

layout(matrix(c(1,2,1,2),2,2, byrow =TRUE))
boxplot(mpg~am, data=mtcars, col=c("red","turquoise"), xlab="transmission type", ylab="miles per gallon", names=c("Automatic","Manual"), main="Plot-1")
boxplot(mpg~vs, data= mtcars, col=c(4,"cyan"), xlab="Engine Cylinder Configuration", ylab="Miles/Gallon", las=TRUE, names=c("V shape","Straight Line Shape"), main="Plot-2")
layout(matrix(c(1,2,1,2),2,2, byrow = T))
boxplot(mpg~cyl, data= mtcars, col=c("cyan",42,23), ylab="Miles per Gallon", las=TRUE, main="Plot-3")
boxplot(mpg~gear, data= mtcars, col=c("cyan",42,23), xlab="No. of Gears", ylab="Miles per Gallon", las=T, main="Plot-4")
boxplot(mpg~carb, data= mtcars, col=c("cyan",42,23,"green",600), xlab="Number of carburetors", ylab="Miles per Gallon", las=T, main="Plot-5")
Research Report on Simple Linear Regression Analysis

I. Introduction
Simple linear regression analysis is a basic statistical method for studying the linear relationship between one response variable and one predictor variable.
This experiment uses a simple linear regression model to explore the relationship between two variables and to analyse and interpret the resulting data statistically.
II. Objectives
The main objectives of this experiment are: 1. to learn and master the basic principles and methods of simple linear regression analysis; 2. to analyse the linear relationship between two variables; 3. to draw statistical inferences from the data as a reference for follow-up research.
III. Principle
Simple linear regression analysis is a statistical method based on least squares that describes the linear relationship between two variables by fitting a straight line.
The line is obtained by minimising the sum of squared residuals between the observed data points and the fitted line.
In the mathematical model, the relationship between the response y and the predictor x is assumed to follow a straight line, y = β0 + β1x + ε.
Here β0 and β1 are the model parameters and ε is the error term.
IV. Steps
1. Data collection: collect a data set containing the two variables, ensuring its accuracy and reliability.
2. Preprocessing: clean, organise, and standardise the data.
3. Scatter plot: use a scatter plot to inspect the trend and relationship between the two variables.
4. Model fitting: fit the simple linear regression model by least squares and compute its parameters.
5. Model evaluation: evaluate the model with statistics such as R² and p-values.
6. Error analysis: analyse the error term ε to understand the model's reliability and predictive power.
7. Interpretation: interpret the results in light of the statistics and the error analysis. (Steps 4-6 are sketched in code below.)
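A minimal R sketch of steps 4-6; the data vectors here are hypothetical and purely for illustration:
x <- c(1, 2, 3, 4, 5, 6)              # hypothetical predictor values
y <- c(2.0, 2.9, 3.5, 4.4, 5.2, 5.8)  # hypothetical responses
fit <- lm(y ~ x)                      # step 4: least-squares fit of y = b0 + b1*x
summary(fit)                          # step 5: R-squared, t tests, p-values
plot(x, resid(fit)); abline(h = 0)    # step 6: residuals against the predictor
var(resid(fit))                       # rough estimate of the error variance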
V. Results
Suppose we have collected a data set on the two variables. After preprocessing and drawing the scatter plot, we find a clear linear relationship between the response y and the predictor x.
The regression model fitted by least squares is: y = 1.2 + 0.8x.
The model's R² is 0.91, meaning it explains 91% of the variation in the response y. Moreover, the p-value is below 0.05, so at the 95% confidence level the model is significant.
The variance of the error term ε is 0.4, i.e. the model's typical prediction error is about 0.4.
This indicates that the model is reasonably reliable and has some predictive power.
VI. Summary
Through this experiment we mastered the basic principles and methods of simple linear regression analysis and explored the relationship between the two variables.
Simple Linear Regression Analysis Lab Report

Application of Simple Linear Regression to a Company's Overtime System
School (department):   Major and class:   Student ID and name:   Supervisor:   Grade:   Date:
I. Objective
Master the basic ideas and operation of simple linear regression analysis; be able to read the output, write down the regression equation, and carry out the various statistical checks on it, such as the analysis of variance and significance tests.
II. Environment
SPSS 21.0, Windows 10.
III. Problem
An insurance company is concerned about the amount of overtime worked at its head-office operations department and decides to investigate the current situation. Over 10 weeks it collected the weekly overtime hours and the number of new policies issued; x is the number of new policies issued per week and y is the weekly overtime (hours), as in the table:
x: 825  215  1070  550  480  920  1350  325  670  1215
y: 3.5  1.0  4.0   2.0  1.0  3.0  4.5   1.5  3.0  5.0
1. Draw the scatter plot.
2. Does the relation between x and y look roughly linear?
3. Estimate the regression equation by least squares.
4. Compute the regression standard error σ̂.
5. Give 95% confidence intervals for β̂0 and β̂1.
6. Compute the coefficient of determination of x and y.
7. Carry out the analysis of variance for the regression equation.
8. Test the significance of the regression coefficient β̂1.
9. Carry out the significance test of the regression coefficients.
10. Draw the residual plot of the regression and analyse it.
11. The company expects to issue x0 = 1000 new policies next week; how much overtime will be needed?
12. Give an exact 95% prediction interval for y0.
13. Give a 95% confidence interval for E(y).
IV. Process and Analysis
1. Scatter plot: the figure plots weekly overtime on the vertical axis against the number of new policies on the horizontal axis. The points are spread evenly about a straight line, indicating a good linear relationship between x and y.
2. Least-squares regression equation: SPSS gives the coefficients β0 and β1 as 0.118 and 0.004 respectively, so the fitted equation is
ŷ = 0.118 + 0.004x
3. Regression standard error: from the analysis-of-variance table, SSE = 1.843, and σ̂² = SSE/(n − 2), so σ̂ = 0.48.
4. 95% confidence intervals for the regression coefficients:
From the coefficient significance table, at the 95% level the interval for β̂0 is [−0.701, 0.937] and the interval for β̂1 is [0.003, 0.005]. The interval for β̂0 contains 0, meaning that the null hypothesis β0 = 0 cannot be rejected.
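Items 11-13 can be cross-checked outside SPSS with a short R sketch (assuming vectors x and y holding the 10 observations above); the point prediction follows directly from the fitted equation, 0.118 + 0.004 × 1000 ≈ 4.118 hours:
fit <- lm(y ~ x)
nd <- data.frame(x = 1000)
predict(fit, nd)                                         # item 11: point prediction, about 4.1 hours
predict(fit, nd, interval = "prediction", level = 0.95)  # item 12: prediction interval for y0
predict(fit, nd, interval = "confidence", level = 0.95)  # item 13: interval for E(y)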
Mathematical Statistics Computer Lab Report
Lab topic: simple linear regression with R
Objectives:
1. Deepen the understanding of the basic ideas of hypothesis testing and learn to use tests to draw statistical inferences.
2. Learn how to carry out hypothesis tests with R.
Basic theory and method of simple linear regression:
Theory: let Y denote the response to be predicted and X a predictor that influences it; the response changes as the predictor increases or decreases.
Simple linear regression uses a sample of observations (Xi, Yi), i = 1, 2, …, n, to find the regression line
Y = a + bX.  (1)
Method: for each Xi, the regression line yields a fitted value Ŷi.
The error between the fitted value Ŷi and the observed value Yi is written ei = Yi − Ŷi.
Clearly, the smaller the n errors are in total, the better the fitted line reflects the average linear relationship between the two variables.
Accordingly, regression analysis chooses a and b so that the sum of squared deviations from the fitted line is minimised; this is the method of least squares. Substituting the resulting a and b into (1) gives the fitted line Ŷi = a + bXi.
Then, for any given Xi, this can be used as the predicted value of the response Yi.
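For reference, minimising the sum of squared errors yields the standard closed-form estimates; a two-line R sketch, for a sample stored in vectors X and Y:
b <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)  # slope estimate
a <- mean(Y) - b * mean(X)                                      # intercept estimate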
(1)
Example and data:
Two lab technicians, A and B, measured the same index in the same experiment; their results are given below (the vectors x and y in the code).
Question: do A's and B's measurements differ significantly? Take significance level α = 0.05.
Steps:
(1) Set up the hypotheses: H0: μ1 − μ2 = 0; H1: μ1 − μ2 < 0.
(2) Degrees of freedom: n1 + n2 − 2 = 14; significance level α = 0.05. (3) Compute the sample means, sample standard deviations, pooled variance, and the observed test statistic:
alpha<-0.05;
n1<-8;
n2<-8;
x<-c(4.3,3.2,3.8,3.5,3.5,4.8,3.3,3.9);
y<-c(3.7,4.1,3.8,3.8,4.6,3.9,2.8,4.4);
var1<-var(x);
xbar<-mean(x);
var2<-var(y);
ybar<-mean(y);
Sw2<-((n1-1)*var1+(n2-1)*var2)/(n1+n2-2)
t<-(xbar-ybar)/(sqrt(Sw2)*sqrt(1/n1+1/n2));
tvalue<-qt(alpha,n1+n2-2);
(4) Compute the critical value: tvalue<-qt(alpha,n1+n2-2)
(5) Compare the critical value with the observed statistic and draw the statistical conclusion.
Results and analysis:
alpha<-0.05;
> n1<-8;
> n2<-8;
> x<-c(4.3,3.2,3.8,3.5,3.5,4.8,3.3,3.9);
> y<-c(3.7,4.1,3.8,3.8,4.6,3.9,2.8,4.4);
> var1<-var(x);
> xbar<-mean(x);
> var2<-var(y);
> ybar<-mean(y);
> Sw2<-((n1-1)*var1+(n2-1)*var2)/(n1+n2-2)
> t<-(xbar-ybar)/(sqrt(Sw2)*sqrt(1/n1+1/n2));
> var1
[1] 0.2926786
> xbar
[1] 3.7875
> var2
[1] 0.2926786
> ybar
[1] 3.8875
> Sw2
[1] 0.2926786
> t
[1] -0.3696873
> tvalue
[1] -1.76131
Analysis: since t = -0.3696873 > tvalue = -1.76131, we do not reject H0; that is, there is no significant difference between A's and B's measurements.
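The same test can be cross-checked with R's built-in function (a sketch; var.equal = TRUE matches the pooled-variance statistic computed above):
t.test(x, y, alternative = "less", var.equal = TRUE)
# the reported t should equal -0.3696873, with p-value > 0.05, agreeing with the conclusion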
(2)
Example and data:
2. A certain grade of glassine paper is required to have a transverse elongation of at least 65%, assumed normally distributed. For one batch, 100 measurements were taken, as follows:
Steps:
(1) Set up the hypotheses: H0: μ = 65; H1: μ < 65.
(2) Degrees of freedom: n − 1 = 100 − 1 = 99; significance level α = 0.05.
(3) Enter the data: x<-
c(35.5,35.5,35.5,35.5,35.5,35.5,35.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,51.5,51.5,51.5,51.5,51.5,53.5,53.5,53.5,55.5,55.5,59.5,59.5,63.5)
(4) Compute the critical value in R.
(5) Compare the critical value with the observed statistic and draw the conclusion.
Results and analysis:
The computation proceeds as follows:
alpha<-0.05;
n<-100;
x<-
c(35.5,35.5,35.5,35.5,35.5,35.5,35.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,37.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,39.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,41.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,43.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,45.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,47.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,49.5,51.5,51.5,51.5,51.5,51.5,53.5,53.5,53.5,55.5,55.5,59.5,59.5,63.5)
sd1<-sd(x);
xbar<-mean(x);
t<-(xbar-65.0)/(sd1/sqrt(n));
tvalue<-qt(alpha,n-1);
sd1
[1] 5.815896
xbar
[1] 45.06
t
[1] -34.28534
tvalue
[1] -1.660391
Analysis and inference: since t = -34.28534 < tvalue = -1.660391, we reject the null hypothesis.
That is, the transverse elongation of this batch of glassine paper does not meet the requirement.
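This one-sample test can be cross-checked directly (a sketch reusing the vector x entered above):
t.test(x, mu = 65, alternative = "less")
# t should be about -34.29 with a p-value near 0, so H0 is rejected as above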
(3)
Example and data:
To test two new treatment plans for a hybrid crop, plots were randomly selected in the same area and the crop on each experimental plot was treated according to one of the two plans. The per-unit-area yields (kg) of the 16 plots under each plan were:
Plan 1 yields: 86 87 56 93 84 93 75 79 81 78 79 90 68 65 87 90;
Plan 2 yields: 80 79 58 91 77 82 74 66 58 59 64 78 76 80 82 55;
Assume the yields under both plans are normally distributed, N(μ1, σ²) and N(μ2, σ²), with σ² unknown; find a confidence interval for the mean difference μ1 − μ2.
Results and analysis:
The R computation is as follows:
>alpha<-0.05;
> x<-c(86,87,56,93,84,93,75,79,81,78,79,90,68,65,87,90);
> y<-c(80,79,58,91,77,82,74,66,58,59,64,78,76,80,82,55);
> n1<-length(x);
> n2<-length(y);
> xbar=mean(x);
> ybar=mean(y);
> sw<-sqrt((n1-1)*sqrt(var(x))+(n2-1)*sqrt(var(y))) /(n1+n2-2);
> q<-qt(1-alpha/2,(n1+n2-2));
> left<-xbar-ybar-q*sw*sqrt(1/n1+1/n2);
> right<-xbar-ybar+q*sw*sqrt(1/n1+1/n2);
> n1
[1] 16
> n2
[1] 16
> left
[1] 7.819162
> right
[1] 8.680838
So the confidence interval is [7.819162, 8.680838].
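Note that the sw line above deviates from the usual pooled estimator (it takes square roots of the sample standard deviations and divides by n1 + n2 − 2 outside the root). A sketch of the standard computation, reusing x, y, n1, n2, and alpha from above; its interval will differ from the one printed above and can be verified with t.test:
sw <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))   # pooled standard deviation
mean(x) - mean(y) + c(-1, 1) * qt(1 - alpha/2, n1 + n2 - 2) * sw * sqrt(1/n1 + 1/n2)
t.test(x, y, var.equal = TRUE, conf.level = 0.95)$conf.int            # should match the line above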