Bootstrapping


A quick view of bootstrap
• It makes minimal assumptions: it relies only on the assumption that the sample is a good representation of the unknown population.
• Bootstrap distributions usually approximate the shape, spread, and bias of the actual sampling distribution.
• Bootstrap distributions are centered at the value of the statistic from the original data plus any bias, while the sampling distribution is centered at the value of the parameter in the population, plus any bias.
Cases where bootstrap does not apply
• Small data sets: the original sample is not a good approximation of the population.
• Dirty data: outliers add variability to our estimates.
• Dependence structures (e.g., time series, spatial problems): bootstrap is based on the assumption of independence.
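The points above can be illustrated with a short simulation; a minimal sketch (NumPy assumed; the population parameters 50 and 10 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=100)  # the one sample we actually have

# Bootstrap: resample the sample with replacement, many times
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(2000)])

# The bootstrap distribution is centered near the SAMPLE mean (not the
# population mean), and its spread approximates that of the true sampling
# distribution, sigma / sqrt(n) = 10 / 10 = 1
print(sample.mean(), boot_means.mean(), boot_means.std())
```

This is exactly the sense in which the bootstrap distribution mimics the spread and shape, but not the center, of the sampling distribution.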

Bootstrapping Analysis Excel


Spreadsheets in Education (eJSiE), Volume 4, Issue 3, Article 4 (6-21-2010)

Bootstrapping Analysis, Inferential Statistics and EXCEL
John A. Rochowicz Jr, Alvernia University, john.rochowicz@

Recommended Citation: Rochowicz, John A. Jr (2011) "Bootstrapping Analysis, Inferential Statistics and EXCEL," Spreadsheets in Education (eJSiE): Vol. 4: Iss. 3, Article 4.

Abstract

Performing a parametric statistical analysis requires the justification of a number of necessary assumptions. If assumptions are not justified, research findings are inaccurate and in question. What happens when assumptions are not or cannot be addressed? When a certain statistic has no known sampling distribution, what can a researcher do for statistical inference? Options are available for answering these questions and conducting valid research. This paper provides various numerical approximation techniques that can be used to analyze data and make inferences about populations from samples. The application of confidence intervals to inferential statistics is addressed. The analysis of data that is parametric as well as nonparametric is discussed. Bootstrapping analysis for inferential statistics is shown with the application of the Index Function and the use of macros and the Data Analysis Toolpak on the EXCEL spreadsheet.
A variety of interesting observations are described.

Keywords: EXCEL, Spreadsheets, Inferential Statistics, Bootstrapping, Resampling Data, Numerical Approximations

1. Introduction

EXCEL is prevalent, easy to learn and can be applied to numerous statistical projects.
EXCEL, part of the Microsoft Office software package, has uses in a variety of educational and research settings, including many for statistics and mathematics. From my classroom experience, students doing quantitative research, from undergraduate degrees in social work, psychology, and science to graduate degrees such as the MBA and PhD programs in leadership and the social sciences, apply most parametric tests of hypotheses without checking assumptions. Many examples of this lack of thoroughness on the part of the researcher have been observed in my experience working with faculty colleagues and mentoring research students. When parametric research is conducted, assumptions must be met. Many students and faculty colleagues doing research do not address the assumptions required to conduct any of the typical parametric inferential statistical methods, including t-tests, Analysis of Variance tests of hypothesis (ANOVAs) and regression analysis. As a result, testing hypotheses for inferential statistics is never completed and the reported results are open to debate.

2. Statistics: Parametric or Nonparametric

Parametric statistics are taught at all levels of post-secondary education. From the undergraduate to the graduate level, and for all types of majors from the social to the physical sciences, statistics are studied and applied in all kinds of research. The assumptions associated with parametric tests of hypothesis [4] include:

1) Data are collected randomly and independently (tests of hypothesis for means, Analysis of Variance (ANOVA) for means, and regression analysis).
2) Data are collected from normally distributed populations (tests of hypothesis for means, ANOVAs for means, and regression analysis).
3) Variances are equal when comparing two or more groups (tests of hypothesis for means, ANOVAs for means, and regression analysis).

Many ways exist for checking assumptions.
To check that sampled data come from a normally distributed population, histograms or stem-and-leaf plots are usually displayed; if they appear mound-shaped, the assumption of normality is accepted as verified. Another way to verify normality is to apply a nonparametric test of hypothesis, the Kolmogorov-Smirnov goodness-of-fit test, for fitting data to a specific distribution. In this test the researcher needs to accept the null hypothesis that the data are normal; in this way the assumption of normality is justified. Levene's Test of the Equality of Variances determines whether the variances of the populations being studied are equal: accepting the null hypothesis of equal variances verifies the assumption that variances are equal. Usually these tests are used in conjunction with conducting parametric statistical inferences. The manner in which the research experiment is defined and how data are collected justify the requirements that data are collected randomly and independently.

Examples of where parametric statistics are used include:

a) The researcher is interested in studying whether there is a difference in the mean scores achieved by students taking a mathematics course compared to students taking a social science course. This is an example of where a parametric independent-samples t-test would be conducted.
b) A researcher is looking for a linear relationship between the time a student puts into study and the final grade for a course. In this situation linear regression and correlation analysis would be performed.

If the data collected are not normal, or the homogeneity (equality) of variance assumption is violated, there are alternative ways to make inferences. When assumptions are not validated, published results become invalid and questionable.
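The two assumption checks named above can also be run outside EXCEL; a sketch using SciPy's `kstest` and `levene` (the two simulated score groups and their parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(75, 8, size=30)   # e.g. mathematics course scores
group_b = rng.normal(72, 8, size=30)   # e.g. social science course scores

# Normality: Kolmogorov-Smirnov goodness-of-fit against a fitted normal
ks_stat, ks_p = stats.kstest(group_a, 'norm',
                             args=(group_a.mean(), group_a.std(ddof=1)))

# Equality of variances: Levene's test on the two groups
lev_stat, lev_p = stats.levene(group_a, group_b)

# Large p-values mean we fail to reject normality / equal variances,
# which is what justifies a parametric t-test
print(ks_p, lev_p)
```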
A researcher who does not justify assumptions or cannot determine a specific sampling distribution for a statistic must apply nonparametric, distribution-free statistical techniques. Nonparametric techniques exist where no assumptions are needed or no sampling distributions are found. For example, the Kruskal-Wallis test of hypothesis is a nonparametric alternative to the one-way ANOVA. This nonparametric technique tests a hypothesis about the nature of three or more populations: whether they have identical distributions and whether they differ with respect to location [7]. There are many others [4], [5], and [7].

Another concern is the inability to determine the sampling distribution for certain statistics. For example, there is no sampling distribution that can be used for population medians [5]. Computing technology and bootstrapping can be used for inferential statistics where the sampling distribution of the statistic is unknown. Bootstrapping is a nonparametric, numerical application that can be applied in EXCEL.

3. Bootstrapping Analysis: Nonparametric Analysis

The analysis of data without checking assumptions can be done with bootstrapping techniques. Bradley Efron [2] described a computer-intensive technique that resamples collected data in order to study the behavior of a distribution of a specific statistic. As a result, inferences can be made on populations that are parametric as well as nonparametric.

Bootstrapping is a numerical sampling technique where the data sampled are resampled with replacement [2]. This means that you acquire a sample, place sample values back, and then select another sample. In this way you get sample data from which you can generate summary statistics for each resample. Various descriptive statistics such as the mean, median, mode, variance and correlation can be bootstrapped. With the use of a computer the student or researcher can create many resamples.
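Resampling with replacement, as described above, is a one-liner in most languages; a minimal sketch (the sample values are illustrative, and the resample count of 1000 is arbitrary):

```python
import random

random.seed(42)
sample = [81, 70, 79, 86, 89, 65, 76, 69, 71, 88]

# One bootstrap resample: draw len(sample) values WITH replacement,
# so some values repeat and others are left out
resample = random.choices(sample, k=len(sample))

# Repeat to build a bootstrap distribution of any statistic, e.g. the mean
boot_means = [sum(random.choices(sample, k=len(sample))) / len(sample)
              for _ in range(1000)]
print(resample)
```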
Bootstrapping statistics allows the student or researcher to analyze any distribution and make inferences. The sampled data become the population and the resampled data are the samples. There are advantages as well as disadvantages when bootstrapping.

3.1 Advantages and Disadvantages

Advantages include:
1) Verifying the assumptions of normality and equality of variances for the population is unnecessary. Inferences are valid even when assumptions are not verified.
2) There is no need to determine the underlying sampling distribution for any population quantity.
3) Interpretations and results are based upon many observations.

Disadvantages include:
1) Powerful computers are necessary.
2) Randomness must be understood.
3) Computers have built-in error.
4) Large sample sizes must be generated.

4. Inferential Statistics: Confidence Intervals

In any parametric or nonparametric statistics, the researcher's goal is to infer something about the population. In order to make inferences about the population, constructing confidence intervals is acceptable. A confidence interval includes a sample statistic, or point estimate, plus or minus an error term. These ranges of values are found by setting the significance level (usually 0.05) for making a decision about the hypothesis concerning the population; determining the correct sampling distribution; and finding whether the theorized population value is in the confidence interval or not. The confidence interval takes the form: point estimate ± a critical value (z α/2) times the standard error (SE) of the statistic. For the mean, the confidence interval looks like xbar ± z α/2 SE(sample mean), where xbar is the sample mean.

Testing hypotheses using confidence intervals is accomplished by checking whether the theorized population quantity is or is not in a certain confidence interval [1] and [4].
If the theorized population value is in the interval, the null hypothesis is accepted; if not, the null hypothesis is rejected. This means that the value obtained from the sample is not significantly different from the theorized population value if it is in the interval, and the sample and population are significantly different when the population value is not in the interval.

Suppose a researcher wants to find a 95% confidence interval for the mean. The error term is comprised of the standard error of the mean (the standard deviation over the square root of the sample size) and a critical z for large samples (where the population standard deviation is known) or t for small samples (where the population standard deviation is unknown). These z or t values are critical values and are based on the significance level set by the researcher. Alpha is the type I error, the probability of rejecting a true null hypothesis [4]. For a 95% confidence interval, alpha is 1 - 0.95, or 0.05. The values for z α/2 or t α/2 are found from z or t tables, or in EXCEL with the function "=TINV(probability, degrees of freedom)".

Consider finding a 95% confidence interval for the mean. Calculating such an interval indicates that upon repeated sampling 95% of the samples will contain the population mean. Suppose a researcher wishes to test the hypothesis that the population mean is 80 from a set of grades for 20 students in a certain statistics course. The data were 81 70 79 86 89 65 76 69 71 88 78 79 81 79 73 84 73 81 89 88. The sample mean was 79.02 and the standard deviation was 7.26. In order to test the hypothesis that the sample mean is not significantly different from the theorized mean of 80, a 95% confidence interval is constructed. The researcher never knows what the population mean is, only an approximation; the researcher is 95% confident it is between the numbers that define the interval upon repeated sampling.
That is, the interval is in agreement with the sample the researcher obtained. If the population mean is outside that interval, then the sample mean is significantly different from the population mean.

The results of the calculations of the confidence interval for the mean in this example are as follows. The sample mean is 79.02. The confidence interval is comprised of the sample mean plus or minus t α/2 times the standard error of the mean. In this case, the t value based on an alpha of 0.05 and 20 - 1 degrees of freedom is 2.09 from a statistical table [4], and the standard error of the mean (the standard deviation divided by the square root of the sample size) is 1.62. Combining these numbers provides the 95% confidence interval for the mean as 79.02 ± (2.09)(1.62). The confidence interval is 75.62 < μ < 82.42, where μ is the population mean.

At the 5% or 0.05 significance level, the theorized mean of 80 is contained in the interval found, and so the null hypothesis is not rejected. There is no significant difference between the sample mean of 79.02 and the population mean of 80. Example 1 shows how to use EXCEL to test this hypothesis about the mean.

If there is no known sampling distribution for a particular theorized population quantity, such as medians or log means [6], an alternative nonparametric confidence interval is found by using the resampled data and applying the 2.5th and 97.5th percentiles of the generated distribution. In this way a 95% confidence interval is obtained from the lower 2.5th percentile and the upper 97.5th percentile. These bootstrapped confidence intervals are used like any other confidence intervals to make inferences about the population [6].

The examples that follow show the application of EXCEL to inferential statistics and bootstrapping analysis. Classical confidence intervals and bootstrapped percentile confidence intervals are presented.
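The worked interval above can be reproduced from the raw grades; a standard-library sketch using the table value t = 2.09 (the mean and standard deviation computed from the listed data come out very close to, though not exactly, the 79.02 and 7.26 quoted):

```python
import math
import statistics

grades = [81, 70, 79, 86, 89, 65, 76, 69, 71, 88,
          78, 79, 81, 79, 73, 84, 73, 81, 89, 88]

xbar = statistics.mean(grades)
s = statistics.stdev(grades)        # sample standard deviation
se = s / math.sqrt(len(grades))     # standard error of the mean
t_crit = 2.09                       # t alpha/2, 19 df, from the table

lower, upper = xbar - t_crit * se, xbar + t_crit * se
# 80 lies inside the interval, so the null hypothesis is not rejected
print(lower, upper)
```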
Similar conclusions and results are reached. The determination of bootstrapped percentile confidence intervals is necessary for distributions of medians, since there is no known sampling distribution.

5. EXCEL Applications

EXCEL is useful for doing parametric as well as nonparametric statistics. The simulation of normal data and the determination of t-values, means and medians for sets of data are described in the following examples. Percentiles for sets of data can also be found in EXCEL.

5.1 EXCEL: Classical Example

Consider the dataset of 20 statistics grades analyzed above. The classical way to make an inference concerning the mean is to: a) identify the null and alternative hypotheses; b) construct a confidence interval; and c) decide to reject or fail to reject the null hypothesis. The assumption that the data are normal can be checked with a histogram. Using the fact that the histogram appears normal and the data were generated using the function "=(NORMSINV(RAND())*standard deviation)+mean", the t-test of hypothesis for means can be applied. The rejection of, or failure to reject, a null hypothesis is accomplished by using confidence intervals [4].

Figure 1 displays the classical method, the traditional textbook method for determining confidence intervals. A class of 20 students took a statistics test and the class mean was 79.02 with a standard deviation of 7.26. A researcher wants to test whether the sample mean is significantly different from a theorized population mean of 80, an average grade necessary for meeting a statewide assessment mandate. In figure 1 the results for a 95% confidence interval for the mean are shown. The confidence interval does contain the population mean of 80, and so the hypothesis of no difference between the population mean and the sample mean is not rejected. Findings indicate there is no significant difference.
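A percentile bootstrap interval for the median of the same 20 grades, the case the text singles out as having no known sampling distribution, might look like this (a sketch; the resample count of 1000 and the seed are arbitrary):

```python
import random
import statistics

random.seed(7)
grades = [81, 70, 79, 86, 89, 65, 76, 69, 71, 88,
          78, 79, 81, 79, 73, 84, 73, 81, 89, 88]

# Bootstrap distribution of the median, sorted for percentile lookup
boot_medians = sorted(
    statistics.median(random.choices(grades, k=len(grades)))
    for _ in range(1000))

# 95% percentile interval: the 2.5th and 97.5th percentiles
lower, upper = boot_medians[25], boot_medians[974]
print(lower, upper)
```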
The t α/2 is found using "=TINV(probability, degrees of freedom)". Notice the value 2.09 agrees with the value from a textbook [4].

Figure 1: Classical t-test of Hypothesis with Confidence Intervals

Figure 2: The Data are Normal. The histogram appears mound-shaped, so the application of the t-distribution is appropriate. For any dataset generated with the function "=(NORMSINV(RAND())*standard deviation)+mean", the distribution would be normal.

5.2 EXCEL Example 2: Large Sample Statistics

The data for this example are based on IQ scores with a mean of 100 and a standard deviation of 15. A distribution of these IQ scores over 40 samples of size 30 is shown in figure 3. The distribution of the means of the 40 samples, with a histogram and frequency distribution, is displayed in figure 4. The distribution of means appears to be mound-shaped; for parametric statistics the shape of the sampling distribution is supposed to be approximately mound-shaped, or normal. The data were simulated using the function "=(NORMSINV(RAND())*$C$2)+$C$1", where cell C2 contains the standard deviation and cell C1 contains the mean. This function is copied and pasted into as many cells as desired. In figure 5, the 95% confidence intervals for each sample and the distribution of means are illustrated. As expected, about 95% of the classical confidence intervals for the mean do contain 100.

Figure 3: Section of 40 Samples
Figure 4: The Distribution of Means Appears Mound-shaped
Figure 5: Classical Confidence Intervals for the Sample Means

Tests of hypothesis using confidence intervals can be conducted using EXCEL. If a theorized value is in a certain confidence interval, accept the null hypothesis of no difference; if the theorized value is outside the interval, reject the null hypothesis.
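The claim that about 95% of the classical intervals contain 100 can be checked with a quick simulation (a sketch; the value t = 2.045 for 29 degrees of freedom is taken from a table, and the seed is arbitrary):

```python
import math
import random
import statistics

random.seed(3)
T_CRIT = 2.045                      # t alpha/2 with 29 degrees of freedom
contains = 0
for _ in range(40):                 # 40 samples of size 30
    s = [random.gauss(100, 15) for _ in range(30)]
    xbar, sd = statistics.mean(s), statistics.stdev(s)
    half = T_CRIT * sd / math.sqrt(30)
    if xbar - half <= 100 <= xbar + half:
        contains += 1
print(contains, "of 40 intervals contain 100")  # about 95% in the long run
```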
In that case the findings suggest that the population quantity is significantly different from the sample quantity acquired in a specific experiment, and that the difference has not occurred by chance. A hypothesis test can be performed on means by applying a 95% confidence interval for the mean using the classical t-distribution or the percentile method. In the case where no sampling distribution of a statistic is known, a 95% confidence interval using percentiles is constructed. See examples 7 and 8 for percentile confidence intervals.

5.3 EXCEL Example 3: Bootstrapping a Small Sample

EXCEL does not have any built-in commands or programs to perform bootstrapping, but there are ways to do bootstrapping in EXCEL without purchasing and learning other software such as SPSS. They include:

1) Applying the INDEX function.
2) Applying the Data Analysis Toolpak and macros. A detailed description of using the Data Analysis Toolpak and macros for bootstrapping data is supplied in the Appendix.

Applying the EXCEL INDEX function is a way to conduct bootstrapping analysis without using an add-in or any macros. If you wish to do a large number of resamples, use the INDEX command and the random number generator. The primary use of the INDEX function is to return a value from a table or range of data. The structure of the INDEX function is: INDEX(table or range or array, row location, column location). The use of RAND()+1 will generate a random row or column location. The syntax for the command that generates random rows and columns of data from a sample is: "=INDEX(range of cells, ROWS(range of cells)*RAND()+1, COLUMNS(range of cells)*RAND()+1)". Next, copy and paste this command for as many resamples as desired.
Pressing the F9 key recalculates the data. With the INDEX function, bootstrapping can be performed on any statistic, including means, medians, modes, analyses of variance and regression quantities such as correlations and beta weights.

Consider the following set of data: 1 5 8 9 12 15 18. The function "=INDEX($C$4:$C$10, ROWS($C$4:$C$10)*RAND()+1, COLUMNS($C$4:$C$10)*RAND()+1)" will generate random resamples. As expected, some of the values are shown and others are not, and some of the numbers occur more than once. The results are 18 9 5 8 1 8 8. Pressing F9 will show many resamples. Figure 6 shows one resample using INDEX.

Figure 6: Resampling One Sample

5.4 EXCEL Example 4: Bootstrapping Analysis

Resample the data for 30 IQ scores from example 2 using the INDEX function. Forty resamples were obtained and are displayed in figure 7. Notice that the distribution of means of the resampled data appears normal (figure 8). Since the data appear normal, the t-test of hypothesis can be applied and the classical confidence intervals can be determined. Suppose you wish to test whether the mean is not equal to 100, the mean IQ score for the population of all persons who take the IQ test. With the use of confidence intervals a student or researcher can test this hypothesis.

Figure 7: Section of Resampled Means
Figure 8: The Distribution of Resampled Means Appears Mound-Shaped
Figure 9: Confidence Intervals for Resampled Means

In figure 9 the classical confidence intervals for resampled means are presented. Using the distribution of means and the percentile functions "=PERCENTILE(range of cells, .025)" and "=PERCENTILE(range of cells, .975)", a 95% percentile confidence interval for the mean is found. For figure 8, apply "=PERCENTILE(C3:C42, .025)" and "=PERCENTILE(C3:C42, .975)" to obtain the percentile confidence interval. Using the distribution of resampled means of figure 8, the percentile confidence interval is 92.403 to 103.096.
In the distribution of means with percentile confidence intervals, notice that the results are very similar: about 95% of the confidence intervals for sample means contain the hypothesized mean of 100, and upon repeated resampling about 95% of the resamples contain 100. When testing whether the population mean is 100, note that since 100 is in the interval, the null hypothesis of no difference between the sample mean and the theorized mean of 100 is not rejected. From figure 4, the confidence interval for means based on percentiles is 94.99 to 104.34.

5.5 EXCEL Example 5: Bootstrapping with the Data Analysis Toolpak

Check the appendix for installing the Data Analysis Toolpak, applying it, and applying a macro within it. The following set of data, 1 5 8 9 12 15 18, was resampled using the Data Analysis Toolpak. The results after one application are 18 18 12 15 9 15 8. Some data are the same and some are missing: bootstrapping has occurred. Figure 10 shows the results of using the Toolpak.

Figure 10: Resampling with Data Analysis Toolpak

5.6 EXCEL Example 6: Bootstrapping Analysis with Data Analysis Toolpak: Many Resamples

Since bootstrapping analysis requires many resamples, the process can be repeated as many times as desired by changing the output range in the Sampling dialog box, but this tends to be tedious. In order to automate this process, a macro helps; the macro is given in the appendix. It automates the calculation of many resamples. The example above for bootstrapped means is analyzed by using the Toolpak. Figure 11 displays the results. The distribution appears mound-shaped and t-distribution confidence intervals can be constructed as in example 4.

Figure 11: Distribution of Resampled Means with Data Analysis Toolpak

5.7 EXCEL Example 7: Inferential Statistics on Medians Using Bootstrapping Techniques

Consider a dataset of size 30.
The data are 20 25 33 42 48 51 60 72 75 74 81 87 102 105 110 123 142 151 159 200 214 234 244 300 500 602 603 604 609 651. Resample this data set 40 times and obtain a distribution of medians. Since there is no known sampling distribution for medians, bootstrapping techniques should be used. Generating a distribution of medians is accomplished by resampling with the INDEX function. Figure 12 displays a sample of size 30 resampled 40 times. The distribution of medians is not usually normal, and since there is no known sampling distribution for medians, bootstrapping analysis is applied. The calculation of a 95% confidence interval for the distribution of medians is shown in figure 13. This confidence interval for the population median is based on percentiles and leads to decisions about significant differences.

Figure 12: Section of Resampled Medians
Figure 13: Distribution of Resampled Medians

The histogram is not normal, and so percentile confidence intervals are used to make an inference about the population median. The percentile confidence interval for the distribution of medians in figure 13 is 76 to 200. Suppose a researcher wants to test the null hypothesis that the population median is 150. Based on the percentile confidence interval, the null hypothesis would not be rejected, since 150 is in the interval 76 to 200. Therefore there is no significant difference between 150 and the sample median of 126, the average of all resampled medians.

5.8 EXCEL Example 8: Inferential Statistics on the Differences in Medians

Suppose the following data represent the prices of homes in two different locations.
The prices of homes at location A are: 99 96 93 92 87 81 82 89 77 74 76 66 71 82 69 71 80 66 42 45 33 46 32 22 25 19 15 17 14 12. The prices of homes at location B are: 300 333 321 345 333 245 324 222 321 119 117 115 111 100 96 65 62 61 69 71 45 42 55 69 78 88 67 42 66 68.

Consider the difference in the median prices of the homes in the two locations. Is there a significant difference in the median prices? The first column holds the sampled data for each dataset. The data are resampled 40 times for each set. Inferences on whether there is a significant difference in the median prices at the two locations are based on percentile confidence intervals; the null hypothesis is that the difference in the median prices at the two locations is 0.

Figure 14: Display of Two Sets of Data
Figure 15: A Distribution of Median Differences with the Percentile Confidence Interval

The sampling distribution for the difference in medians is shown in figure 15. The 95% percentile confidence interval for the median difference is 2.5 to 83. Since 0 is not in the interval, the null hypothesis of no difference in the median prices for the two groups is rejected: there is a difference in the medians for the two sets of data. Note that pressing F9 many times provides about 95% of the intervals in agreement with this conclusion, consistent with the concept of a confidence interval. The medians of all these resamples were obtained by applying the function "=MEDIAN(range of cells)". For example, "=MEDIAN(C9:P9)" calculates the median of the numbers in cells C9 through P9.

6. Bootstrapping Analysis in EXCEL: Observations

Students as well as researchers learn that in applying bootstrapping there is no need to satisfy assumptions when conducting inferential statistics. Also, bootstrapping analysis can be performed not only on sampling distributions that are known but also on sampling distributions that are not known.
Doing research this way enables the research community to recognize valid research findings even where assumptions are not justified.
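Example 8's difference-in-medians test can be sketched without EXCEL (a sketch; the resample count and seed are arbitrary, so the resulting interval will differ somewhat from the 2.5-to-83 interval quoted above):

```python
import random
import statistics

random.seed(5)
a = [99, 96, 93, 92, 87, 81, 82, 89, 77, 74, 76, 66, 71, 82, 69,
     71, 80, 66, 42, 45, 33, 46, 32, 22, 25, 19, 15, 17, 14, 12]
b = [300, 333, 321, 345, 333, 245, 324, 222, 321, 119, 117, 115, 111, 100, 96,
     65, 62, 61, 69, 71, 45, 42, 55, 69, 78, 88, 67, 42, 66, 68]

# Bootstrap distribution of the difference of medians (B minus A)
diffs = sorted(
    statistics.median(random.choices(b, k=len(b)))
    - statistics.median(random.choices(a, k=len(a)))
    for _ in range(1000))

# 95% percentile interval; if 0 lies outside it, the medians differ
lower, upper = diffs[25], diffs[974]
print(lower, upper)
```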

The Bootstrapping Algorithm


1. An Overview of the Bootstrapping Method
The Bootstrapping algorithm, also called the self-expansion technique, is a machine learning technique widely used for knowledge acquisition.

It is an incremental learning method: it needs only a very small number of seeds and, taking these as a basis, expands the seed set effectively through round after round of training until the required scale of data is reached.

2. Main Steps of the Bootstrapping Algorithm
(1) Build an initial seed set.

(2) From the seed set, extract context patterns within a window of a certain size and build a candidate pattern set.

(3) Use pattern matching to recognize instances and form a candidate set of entity names: match each pattern obtained in step (2) against the text, recognize instances, and collect them into the candidate set.

(4) Evaluate and select patterns and instances by a fixed standard: compute the information-entropy gain of each pattern and instance, rank them, add the patterns that meet the requirements to the final usable pattern set, and add the instances that meet the conditions to the seed set.

(5) Repeat steps (2)-(4) until a set number of iterations is reached or no new instances are recognized.
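The five steps above can be sketched as a toy loop (the corpus, the one-word context window, and the accept-all scoring are all simplified illustrations; a real system would rank patterns and instances, e.g. by an entropy-gain score, as in step (4)):

```python
# Toy bootstrapping: expand a seed set of entity examples by alternating
# pattern extraction and pattern matching over a tiny "corpus".
corpus = [
    "the city of Paris is large",
    "the city of Berlin is old",
    "the city of Madrid is sunny",
    "a dog is loud",
]

seeds = {"Paris"}
patterns = set()

for _ in range(5):                          # step (5): iterate a fixed number of rounds
    # step (2): build candidate patterns from a one-word context window
    for text in corpus:
        words = text.split()
        for seed in seeds:
            if seed in words:
                i = words.index(seed)
                if 0 < i < len(words) - 1:
                    patterns.add((words[i - 1], words[i + 1]))
    # step (3): match patterns to extract candidate instances
    new = set()
    for text in corpus:
        words = text.split()
        for left, right in patterns:
            for i in range(1, len(words) - 1):
                if words[i - 1] == left and words[i + 1] == right:
                    new.add(words[i])
    # step (4): here every matched candidate is accepted; a real system
    # would score and filter patterns/instances before expanding the seeds
    if new <= seeds:
        break                               # no new instances: stop early
    seeds |= new

print(seeds)
```

Starting from the single seed "Paris", the pattern ("of", "is") is learned and pulls in "Berlin" and "Madrid", while "dog" never matches the pattern's left context.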

3. Related Concepts
(1) Context patterns
A context pattern is a recurring, specific linguistic expression in text that conveys relation and event information; by pattern matching according to particular rules, it can trigger the extraction of specific information.

A context pattern is an ordered sequence of items, where each item corresponds to a word or a set of phrases.

(2) Pattern matching
Pattern matching means the system matches an input sentence against the effective patterns and, from the patterns that match successfully, obtains the corresponding interpretation.

(3) Instances
An instance is a word or phrase extracted through pattern matching during the Bootstrapping iterations.

A Bootstrapping Example


Base data: the cumulative claims data of Taylor and Ashe (1983), i.e. the example used in the book. [The incremental-claims and cumulative-claims triangles, the fitted values, and the residual tables did not survive extraction; only the procedure and a few recoverable figures are kept here.]

The procedure is:

1. Start from the incremental claims triangle (first entries 357,848; 766,940; 610,542; 482,940; ...) and form the cumulative claims triangle.
2. Development factors: weighted averages across accident years; the recoverable factors are 3.4906, 1.7473, 1.4574, 1.1739, 1.1038, 1.0863.
3. Fitted cumulative claims: divide back along each diagonal by the development factors; fitted incremental claims then follow by differencing.
4. Residuals: the Pearson residuals introduced in the book, r = (C - m)/√m, where C is the actual incremental claim and m the fitted incremental claim (the first residuals are 168.93, 115.01, -111.94, -311.63, ...).
5. Resample the Pearson residuals with replacement: first generate random integers from 1 to the total number of residuals, then locate the corresponding row and column, and take the residual found there.
6. Combine the resampled residuals with the fitted claims to obtain resampled incremental claims and, by accumulation, resampled cumulative claims.
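The residual bootstrap described above can be sketched on the first few values (the actual and fitted figures are the first entries of the example; the rest of the triangle is omitted, so this is an illustration, not the full reserving calculation):

```python
import math
import random

random.seed(9)
# First four actual and fitted incremental claims from the example
actual = [357848, 766940, 610542, 482940]
fitted = [270061, 672617, 704494, 753438]

# Pearson residuals r = (C - m) / sqrt(m); the first comes out near 168.93
resid = [(c - m) / math.sqrt(m) for c, m in zip(actual, fitted)]

# Resample residuals with replacement and rebuild pseudo incremental claims
boot_resid = random.choices(resid, k=len(resid))
pseudo = [m + r * math.sqrt(m) for m, r in zip(fitted, boot_resid)]

# Accumulate the pseudo incremental claims into pseudo cumulative claims
cum = []
total = 0.0
for x in pseudo:
    total += x
    cum.append(total)
print(cum)
```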

Bootstrapping


Why does bootstrapping work?
• Suppose we want to ask a question of a population but cannot. We take a sample and ask the question of the sample instead. How confident we should be that the sample answer is close to the population answer obviously depends on the structure of the population. One way to learn about that structure would be to draw samples from the population again and again, ask each of them the question, and see how variable the sample answers tend to be. Since this isn't possible, we can either make some assumptions about the shape of the population, or use the information in the sample we actually have to learn about it.
• NOTICE: Resampling is not done to provide an estimate of the population distribution; we take our sample itself as a model of the population.

Computing the standard error in bootstrapping
function [boot_e] = boot_error(b_data, d)
% Calculate the standard error for bootstrapped data.
% b_data is a matrix; each resampling result is arranged in a column.
% d is the observed data.
boot_num = size(b_data, 2);
d_copy = repmat(d, [1, boot_num]);
diff2 = (d_copy - b_data).^2;
numerator = sum(diff2, 2);
denominator = boot_num * (boot_num - 1);
boot_e = sqrt(numerator ./ denominator);
end
According to the literature, the bootstrap standard error I need is

se(t) = sqrt( Σ_{i=1}^{B} (d(t) - b_i(t))² / (B(B-1)) )

where d(t) is my original stacked result and b_i(t) is the result of the i-th bootstrap resample. The standard deviation that Matlab's std function provides is the ordinary sample standard deviation,

s = sqrt( Σ_{i=1}^{n} (x_i - x̄)² / (n-1) ),

which is computed around the sample mean rather than around d(t). So std cannot be used directly to compute the bootstrap standard error, and I wrote a script for it:
function [boot_e] = boot_error(b_data, d)
% Calculate the standard error for bootstrapped data.
% b_data is a matrix; each resampling result is arranged in a column.
% NaN entries are ignored, so columns with missing values are handled.
% d is the observed data.
boot_num = size(b_data, 2);
d_copy = repmat(d, [1, boot_num]);
diff2 = (d_copy - b_data).^2;
numerator = sum(diff2, 2, 'omitnan');
boot_num_real = sum(~isnan(b_data), 2);
denominator = boot_num_real .* (boot_num_real - 1);
boot_e = sqrt(numerator ./ denominator);
end
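The same formula can be cross-checked in pure Python with hypothetical numbers (a scalar version of the vectorized Matlab function):

```python
import math

def boot_error(b_list, d):
    """Bootstrap standard error around the observed value d:
    sqrt( sum_i (d - b_i)^2 / (B * (B - 1)) )."""
    B = len(b_list)
    return math.sqrt(sum((d - b) ** 2 for b in b_list) / (B * (B - 1)))

# Hypothetical example: observed value 0, four bootstrap replicates.
se = boot_error([1.0, -1.0, 1.0, -1.0], 0.0)  # sqrt(4/12)
```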

Robustness checks

A robustness check is a statistical procedure for assessing the stability and robustness of a model.

In practice, because data are uncertain and complex, we need to check a model's robustness to ensure that it is reliable and valid.

This article introduces the basic principle of robustness checks, common methods, and practical applications.

I. The basic principle of robustness checks

The basic idea is to perturb the model's parameters and examine how sensitive the model is to changes in the data and to outliers.

In practice we frequently encounter outliers, missing values, and similar problems, and these can affect the model's parameter estimates.

Robustness checks help us assess how well the model withstands these problems, improving its reliability and generalization ability.

II. Common robustness-check methods

1. Bootstrapping

Bootstrapping is a widely used robustness-check method that estimates the distribution of a parameter by repeatedly resampling the original data.

Each resample yields a new parameter estimate; analyzing the distribution of these estimates shows how sensitive the model is to changes in the data and to outliers.
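As a minimal sketch of the resampling loop in pure Python (hypothetical data; the percentile interval shown is one common way to summarize the resampled estimates):

```python
import random
import statistics

def bootstrap_mean_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic each time, and take empirical quantiles of the results."""
    rng = random.Random(seed)
    means = sorted(statistics.mean(rng.choices(data, k=len(data)))
                   for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5]  # hypothetical sample
lo, hi = bootstrap_mean_ci(data)  # width of lo..hi reflects sensitivity
```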

2. Robust regression

Robust regression reduces the influence of outliers on the parameter estimates by down-weighting observations with large residuals.

It effectively limits the impact of outliers on the model and improves the model's robustness.
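A sketch of the idea in pure Python: iteratively reweighted least squares with Huber-style weights. The data, tuning constant, and helper names are all hypothetical; a real analysis would use a library implementation:

```python
import statistics

def robust_line_fit(x, y, k=1.345, iters=30):
    """Fit y = a + b*x by iteratively reweighted least squares.
    Huber weights: points with residual |r| > k*s get weight k*s/|r|,
    where s is a robust scale estimate (rescaled median absolute
    residual), so outliers are progressively down-weighted."""
    w = [1.0] * len(x)
    a = b = 0.0
    for _ in range(iters):
        sw = sum(w)
        sx = sum(wi * xi for wi, xi in zip(w, x))
        sy = sum(wi * yi for wi, yi in zip(w, y))
        sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
        a = (sy - b * sx) / sw
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        s = max(statistics.median(abs(ri) for ri in r) / 0.6745, 1e-8)
        w = [1.0 if abs(ri) <= k * s else k * s / abs(ri) for ri in r]
    return a, b

x = list(range(20))
y = [1.0 + 2.0 * xi for xi in x]   # hypothetical linear data
y[10] += 100.0                     # one gross outlier
a, b = robust_line_fit(x, y)       # slope stays near 2 despite it
```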

3. Sensitivity analysis

Sensitivity analysis assesses a model's robustness by varying its parameters within a certain range.

By adjusting the parameters step by step, we can see how sensitive the model is to parameter changes and thereby judge its robustness.
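The parameter-sweep idea can be sketched as follows (a hypothetical compound-growth model; the swept range is illustrative):

```python
def model_output(rate, periods=10, principal=100.0):
    """Hypothetical model: compound growth of a principal."""
    return principal * (1.0 + rate) ** periods

# Vary the rate parameter over a range and record how the output moves;
# the spread of outputs is a simple sensitivity measure.
base = model_output(0.05)
sweep = {r: model_output(r) for r in (0.03, 0.04, 0.05, 0.06, 0.07)}
spread = max(sweep.values()) - min(sweep.values())
```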

III. Applications of robustness checks

Robustness checks matter greatly in practice.

In finance, because financial data are complex and volatile, models are routinely subjected to robustness checks to ensure they hold up under market swings and extreme events.

In medicine, robustness checks are widely used in clinical trials and epidemiological studies to assess how well models handle abnormal and missing data.

In short, robustness checks are an important means of guaranteeing that a model is reliable and valid.

By assessing a model's robustness, we can better understand how sensitive it is to its data, and thereby improve its predictive power and generalization ability.

Bootstrap methods: theory, parts 1 and 2

With B = 999 bootstrap samples, the example's bootstrap P value works out to (number of bootstrap statistics at least as extreme as τ̂)/999 = 0.0731.

4. If τ̂ > C*_α or p̂*(τ̂) < α, reject the null hypothesis.

When B is finite, the feasible P value p̂*(τ̂) depends on the number of random draws obtained by repeatedly resampling bootstrap samples. As B → ∞, the large-sample criterion shows that the bootstrap P value is
p̂*(τ̂) ≡ Pr_μ̂(τ ≥ τ̂)
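A sketch of the finite-B bootstrap P value in pure Python (hypothetical data and statistic; resampling the centered data is one simple way to impose the null here, not the text's regression DGP):

```python
import random
import statistics

def bootstrap_p_value(sample, tau_hat, n_boot=999, seed=0):
    """Estimate p*(tau_hat) = #{tau*_j >= tau_hat} / B, where each
    tau*_j is the statistic recomputed on a resample drawn under the
    null (here: resampling the centered data, so the null mean is 0)."""
    rng = random.Random(seed)
    mu = statistics.mean(sample)
    centered = [x - mu for x in sample]          # impose the null
    count = 0
    for _ in range(n_boot):
        star = rng.choices(centered, k=len(sample))
        tau_star = abs(statistics.mean(star))    # bootstrap statistic
        if tau_star >= tau_hat:
            count += 1
    return count / n_boot

sample = [0.9, 1.3, 0.7, 1.1, 1.4, 0.8, 1.2, 1.0]  # hypothetical data
tau_hat = abs(statistics.mean(sample))              # test: mean = 0
p = bootstrap_p_value(sample, tau_hat)              # strong rejection
```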
The recursive bootstrap DGP is

y*_t = β1 + β2 y*_{t-1} + u*_t,   u*_t ~ NID(0, s²)

(4) The key lies in the null hypothesis. For example, if the parameter vector is β = [β1 β2] and the null hypothesis is β2 = 0, then the model actually estimated is y = X1 β1 + u, and the bootstrap samples are therefore generated using β = [β1 0].
If one does not need to assume that the errors are normally distributed, but can assume that they are independent and identically distributed, then a semi-parametric approach can be used.

The rejection probability function (RPF) is defined as

R(α, μ) ≡ Pr_μ(p_τ ≤ α)

Clearly, R(α, μ) depends on α and on the DGP μ.

For an exact test, the RPF equals α. For a pivotal test, the RPF is smooth but in general not equal to α. For a non-pivotal test, the RPF is non-smooth.

For this class of pivotal tests, bootstrap samples are easy to generate. Because all of these statistics are functions of M_X ε, we only need to generate ε* ~ N(0, I); there is no need to compute u* or y*. Note: these assumptions exclude lagged dependent variables and other regressors that depend on lagged dependent variables.

III. Parametric bootstrap estimation

For the linear regression model, the parametric bootstrap estimate is as follows:

The bootstrap algorithm in Matlab

An introduction to the Monte Carlo bootstrap algorithm in Matlab

1. What is the bootstrap algorithm?
The bootstrap is a classic resampling method, widely used in machine learning: by repeatedly redrawing samples from the input data in the spirit of Monte Carlo simulation, it extracts useful parameter estimates from the data.

The method is typically applied to machine-learning problems of limited size; because it repeatedly reuses the existing data set and the same computational rule, it can extract a great deal of information at fairly small cost in computation and memory.

2. How the bootstrap algorithm works
The basic steps are simple. First, the system draws random samples with replacement from the input data; note that each resample is as large as the original data set. Then, following the user-specified rule, a model is built from the resampled data. Finally, the fitted models are used to analyze new input data and produce the output.
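These three steps can be sketched in pure Python (hypothetical data; the "model" is just the sample mean so the example stays small):

```python
import random
import statistics

def bootstrap_models(data, fit, n_boot=500, seed=1):
    """Step 1: resample with replacement (same size as the data).
    Step 2: fit a model to each resample.
    Returns the list of fitted models (plain numbers here)."""
    rng = random.Random(seed)
    return [fit(rng.choices(data, k=len(data))) for _ in range(n_boot)]

def predict(models):
    """Step 3: aggregate the fitted models into one output."""
    return statistics.mean(models)

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # hypothetical input
models = bootstrap_models(data, statistics.mean)
estimate = predict(models)
```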

3. Strengths of the bootstrap algorithm
4. Applications of the bootstrap algorithm
The bootstrap can be applied to many machine-learning problems, such as text classification, image classification, speech recognition, and data mining.

Applied well, the algorithm can effectively solve complex machine-learning problems while cutting computational costs substantially.

In summary, the bootstrap is an effective and easy-to-use method: it extracts useful parameter estimates from the input data for model building, yields precise analysis results, and helps solve complex machine-learning problems.

The bootstrap is therefore a powerful tool in machine learning, with broad application prospects.

A Bootstrapping-based method for building a training corpus


algorithm. A small subset is selected randomly, the wrong annotations are corrected, and a seed set is generated. This seed set trains one class-based language model, and then this model annotates the rest of the corpus. Other small subsets are selected from the rest of the corpus. This processing is iterated until the quantity of the training corpus is optimal. This method can minimize the human effort while keeping the quality of the annotation reasonably good for a statistical language model.
计算机研究与发展 (Journal of Computer Research and Development)
ISSN 1000-1239 / CN 11-1777/TP, 44(Suppl.): 394-397, 2007
一种基于Bootstrapping构建训练语料的方法 (A Bootstrapping-based method for building a training corpus)

... a complete training corpus.

1. Class-based language model

Class-based language models can identify named entities fairly well

Bootstrapping

Steven Abney
AT&T Laboratories - Research
180 Park Avenue
Florham Park, NJ, USA, 07932

Abstract

This paper refines the analysis of co-training, defines and evaluates a new co-training algorithm that has theoretical justification, gives a theoretical justification for the Yarowsky algorithm, and shows that co-training and the Yarowsky algorithm are based on different independence assumptions.

1 Overview

The term bootstrapping here refers to a problem setting in which one is given a small set of labeled data and a large set of unlabeled data, and the task is to induce a classifier. The plenitude of unlabeled natural language data, and the paucity of labeled data, have made bootstrapping a topic of interest in computational linguistics. Current work has been spurred by two papers, (Yarowsky, 1995) and (Blum and Mitchell, 1998).

Blum and Mitchell propose a conditional independence assumption to account for the efficacy of their algorithm, called co-training, and they give a proof based on that conditional independence assumption. They also give an intuitive explanation of why co-training works, in terms of maximizing agreement on unlabeled data between classifiers based on different "views" of the data. Finally, they suggest that the Yarowsky algorithm is a special case of the co-training algorithm.

The Blum and Mitchell paper has been very influential, but it has some shortcomings. The proof they give does not actually apply directly to the co-training algorithm, nor does it directly justify the intuitive account in terms of classifier agreement on unlabeled data, nor, for that matter, does the co-training algorithm directly seek to find classifiers that agree on unlabeled data.
Moreover, the suggestion that the Yarowsky algorithm is a special case of co-training is based on an incidental detail of the particular application that Yarowsky considers, not on the properties of the core algorithm.

In recent work, (Dasgupta et al., 2001) prove that a classifier has low generalization error if it agrees on unlabeled data with a second classifier based on a different "view" of the data. This addresses one of the shortcomings of the original co-training paper: it gives a proof that justifies searching for classifiers that agree on unlabeled data.

I extend this work in two ways. First, (Dasgupta et al., 2001) assume the same conditional independence assumption as proposed by Blum and Mitchell. I show that that independence assumption is remarkably powerful, and violated in the data; however, I show that a weaker assumption suffices. Second, I give an algorithm that finds classifiers that agree on unlabeled data, and I report on an implementation and empirical results.

Finally, I consider the question of the relation between the co-training algorithm and the Yarowsky algorithm. I suggest that the Yarowsky algorithm is actually based on a different independence assumption, and I show that, if the independence assumption holds, the Yarowsky algorithm is effective at finding a high-precision classifier.

2 Problem Setting and Notation

A bootstrapping problem consists of a space of instances X, a set of labels L, a function Y : X -> L assigning labels to instances, and a space of rules mapping instances to labels. Rules may be partial functions; we write F(x) = ⊥ if F abstains (that is, makes no prediction) on input x. "Classifier" is synonymous with "rule".

It is often useful to think of rules and labels as sets of instances. A binary rule F can be thought of as the characteristic function of the set of instances {x : F(x) = +}. Multi-class rules also define useful sets when a particular target class ℓ is understood. For any rule F, we write F_ℓ for the set of instances {x : F(x) = ℓ}, or (ambiguously) for that set's characteristic function. We write F̄ for the complement of F_ℓ, either as a set or characteristic function. Note that F̄ contains instances on which F abstains. We write F_ℓ̄ for {x : F(x) ≠ ℓ ∧ F(x) ≠ ⊥}. When F does not abstain, F̄ and F_ℓ̄ are identical. Finally, in expressions like Pr[F = + | Y = +] (with square brackets and "Pr"), the functions F(x) and Y(x) are used as random variables. By contrast, in the expression P(F | Y) (with parentheses and "P"), F is the set of instances for which F(x) = +, and Y is the set of instances for which Y(x) = +.

3 View Independence

Blum and Mitchell assume that each instance x consists of two "views" x1, x2. We can take this as the assumption of functions X1 and X2 such that X1(x) = x1 and X2(x) = x2. They propose that views are conditionally independent given the label.

Definition 1. A pair of views x1, x2 satisfy view independence just in case:

Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]
Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]

A classification problem instance satisfies view independence just in case all pairs x1, x2 satisfy view independence.

There is a related independence assumption that will prove useful. Let us define H1 to consist of rules that are functions of X1 only, and define H2 to consist of rules that are functions of X2 only.

Definition 2. A pair of rules F ∈ H1, G ∈ H2 satisfies rule independence just in case, for all u, v, y:

Pr[F = u | G = v, Y = y] = Pr[F = u | Y = y]

and similarly for F ∈ H2, G ∈ H1. A classification problem instance satisfies rule independence just in case all opposing-view rule pairs satisfy rule independence.

If instead of generating H1 and H2 from X1 and X2, we assume a set of features F (which can be thought of as binary rules), and take H1 = H2 = F, rule independence reduces to the Naive Bayes independence assumption.

The following theorem is not difficult to prove; we omit the proof.

Theorem 1. View independence implies rule independence.

4 Rule Independence and Bootstrapping

Blum and Mitchell's paper suggests that rules that agree on unlabeled instances are useful in bootstrapping.

Definition 3. The agreement rate between rules F and G is

Pr[F = G | F, G ≠ ⊥]

Note that the agreement rate between rules makes no reference to labels; it can be determined from unlabeled data.

The algorithm that Blum and Mitchell describe does not explicitly search for rules with good agreement; nor does agreement rate play any direct role in the learnability proof given in the Blum and Mitchell paper. The second lack is emended in (Dasgupta et al., 2001). They show that, if view independence is satisfied, then the agreement rate between opposing-view rules F and G upper bounds the error of F (or G). The following statement of the theorem is simplified and assumes non-abstaining binary rules.

Theorem 2. For all F ∈ H1, G ∈ H2 that satisfy rule independence and are nontrivial predictors in the sense that min_u Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:

Pr[F ≠ Y] ≤ Pr[F ≠ G]
Pr[F̄ ≠ Y] ≤ Pr[F ≠ G]

If F agrees with G on all but ε unlabeled instances, then either F or F̄ predicts Y with error no greater than ε. A small amount of labeled data suffices to choose between F and F̄.

I give a geometric proof sketch here; the reader is referred to the original paper for a formal proof. Consider figures 1 and 2. In these diagrams, area represents probability. For example, the leftmost box (in either diagram) represents the instances for which Y = +, and the area of its upper left quadrant represents Pr[F = +, G = +, Y = +]. Typically, in such a diagram, either the horizontal or vertical line is broken, as in figure 2. In the special case in which rule independence is satisfied, both horizontal and vertical lines are unbroken, as in figure 1.

Theorem 2 states that disagreement upper bounds error. First let us consider a lemma, to wit: disagreement upper bounds minority probabilities. Define the minority value of F given Y = y to be the value u with least probability Pr[F = u | Y = y]; the minority probability is the probability of the minority value. (Note that minority probabilities are conditional probabilities, and distinct from the marginal probability min_u Pr[F = u] mentioned in the theorem.)

In figure 1a, the areas of disagreement are the upper right and lower left quadrants of each box, as marked. The areas of minority values are marked in figure 1b. It should be obvious that the area of disagreement upper bounds the area of minority values.

The error values of F are the values opposite to the values of Y: the error value is - when Y = + and + when Y = -. When minority values are error values, as in figure 1, disagreement upper bounds error, and theorem 2 follows immediately. However, three other cases are possible. One possibility is that minority values are opposite to error values. In this case, the minority values of F̄ are error values, and disagreement between F and G upper bounds the error of F̄.

[Figure 1: Disagreement upper-bounds minority probabilities. Panels: (a) Disagreement, (b) Minority Values.]

This case is admitted by theorem 2. In the final two cases, minority values are the same regardless of the value of Y. In these cases, however, the predictors do not satisfy the "nontriviality" condition of theorem 2, which requires that min_u Pr[F = u] be greater than the disagreement between F and G.

5 The Unreasonableness of Rule Independence

Rule independence is a very strong assumption; one remarkable consequence will show just how strong it is. The precision of a rule F is defined to be Pr[Y = + | F = +]. (We continue to assume non-abstaining binary rules.) If rule independence holds, knowing the precision of any one rule allows one to exactly compute the precision of every other rule given only unlabeled data and knowledge of the size of the target concept.
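This claim can be checked numerically on a tiny synthetic joint distribution that satisfies rule independence, using the precision formula P(Y|F) = (P(G|F) - P(G|Ȳ)) / (P(G|Y) - P(G|Ȳ)) derived in the text. All the conditional probabilities below are hypothetical:

```python
# A distribution over (Y, F, G) in which F and G are conditionally
# independent given Y (rule independence). All numbers hypothetical.
p_y = 0.3
p_f_given = {True: 0.8, False: 0.1}   # P(F=+ | Y)
p_g_given = {True: 0.7, False: 0.2}   # P(G=+ | Y)

p_f = sum(p_f_given[y] * (p_y if y else 1 - p_y) for y in (True, False))
p_fg = sum(p_f_given[y] * p_g_given[y] * (p_y if y else 1 - p_y)
           for y in (True, False))
p_g_given_f = p_fg / p_f

# Precision of F recovered from G's conditionals alone:
precision_f = ((p_g_given_f - p_g_given[False])
               / (p_g_given[True] - p_g_given[False]))

# Direct computation of P(Y=+ | F=+) for comparison.
direct = p_f_given[True] * p_y / p_f
```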
Let F and G be arbitrary rules based on independent views. We first derive an expression for the precision of F in terms of G. Note that the second line is derived from the first by rule independence.

P(FG) = P(F|GY)P(GY) + P(F|GȲ)P(GȲ)
      = P(F|Y)P(GY) + P(F|Ȳ)P(GȲ)
P(G|F) = P(Y|F)P(G|Y) + [1 - P(Y|F)]P(G|Ȳ)
P(Y|F) = (P(G|F) - P(G|Ȳ)) / (P(G|Y) - P(G|Ȳ))

To compute the expression on the right-hand side of the last line, we require P(Y|G), P(Y), P(G|F), and P(G). The first value, the precision of G, is assumed known. The second value, P(Y), is also assumed known; it can at any rate be estimated from a small amount of labeled data. The last two values, P(G|F) and P(G), can be computed from unlabeled data.

Thus, given the precision of an arbitrary rule G, we can compute the precision of any other-view rule F. Then we can compute the precision of rules based on the same view as G by using the precision of some other-view rule F. Hence we can compute the precision of every rule given the precision of any one.

6 Some Data

The empirical investigations described here and below use the data set of (Collins and Singer, 1999). The task is to classify names in text as person, location, or organization. There is an unlabeled training set containing 89,305 instances, and a labeled test set containing 289 persons, 186 locations, 402 organizations, and 123 "other", for a total of 1,000 instances.

Instances are represented as lists of features.
Intrinsic features are the words making up the name, and contextual features are features of the syntactic context in which the name occurs. For example, consider Bruce Kaplan, president of Metals Inc. This text snippet contains two instances. The first has intrinsic features N:Bruce-Kaplan, C:Bruce, and C:Kaplan ("N" for the complete name, "C" for "contains"), and contextual feature M:president ("M" for "modified by"). The second instance has intrinsic features N:Metals-Inc, C:Metals, C:Inc, and contextual feature X:president-of ("X" for "in the context of").

Let us define Y(x) = + if x is a "location" instance, and Y(x) = - otherwise. We can estimate P(Y) from the test sample; it contains 186/1000 location instances, giving P(Y) = .186.

Let us treat each feature F as a rule predicting + when F is present and - otherwise. The precision of F is P(Y|F). The internal feature N:New-York has precision 1. This permits us to compute the precision of various contextual features, as shown in the "Co-training" column of Table 1. We note that the numbers do not even look like probabilities. The cause is the failure of view independence to hold in the data, combined with the instability of the estimator. (The "Yarowsky" column uses a seed rule to estimate P(Y|F), as is done in the Yarowsky algorithm, and the "Truth" column shows the true value of P(Y|F).)

Table 1: Some data

F              Co-training   Yarowsky   Truth
M:chairman        -12.7        .068      .030
X:Company-of       10.2        .979      .989
X:court-in        -.183        1.00      .981
X:Company-in       75.7        1.00      .986
X:firm-in         -9.94        .952      .949
X:%-in            -15.2        .875      .192
X:meeting-in      -2.25        1.00      .753

[Figure 2: Deviation from conditional independence. Panels: (a) Positive correlation, (b) Negative correlation.]

7 Relaxing the Assumption

Nonetheless, the unreasonableness of view independence does not mean we must abandon theorem 2. In this section, we introduce a weaker assumption, one that is satisfied by the data, and we show that theorem 2 holds under this weaker assumption.

There are two ways in which the data can diverge from conditional independence: the rules may either be positively or negatively correlated, given the class value. Figure 2a illustrates positive correlation, and figure 2b illustrates negative correlation.

If the rules are negatively correlated, then their disagreement (shaded in figure 2) is larger than if they are conditionally independent, and the conclusion of theorem 2 is maintained a fortiori. Unfortunately, in the data, they are positively correlated, so the theorem does not apply.

Let us quantify the amount of deviation from conditional independence. We define the conditional dependence of F and G given Y = y to be

d_y = (1/2) Σ_{u,v} | Pr[G = v | Y = y, F = u] - Pr[G = v | Y = y] |

If F and G are conditionally independent, then d_y = 0. This permits us to state a weaker version of rule independence:

Definition 4. Rules F and G satisfy weak rule dependence just in case, for y ∈ {+, -}:

d_y ≤ p2(q1 - p1) / (2 p1 q1)

where p1 = min_u Pr[F = u | Y = y], p2 = min_u Pr[G = u | Y = y], and q1 = 1 - p1.

By definition, p1 and p2 cannot exceed 0.5. If p1 = 0.5, then weak rule dependence reduces to independence: if p1 = 0.5 and weak rule dependence is satisfied, then d_y must be 0, which is to say, F and G must be conditionally independent. However, as p1 decreases, the permissible amount of conditional dependence increases.

We can now state a generalized version of theorem 2:

Theorem 3. For all F ∈ H1, G ∈ H2 that satisfy weak rule dependence and are nontrivial predictors in the sense that min_u Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:

Pr[F ≠ Y] ≤ Pr[F ≠ G]
Pr[F̄ ≠ Y] ≤ Pr[F ≠ G]

Consider figure 3. This illustrates the most relevant case, in which F and G are positively correlated given Y. (Only the case Y = + is shown; the case Y = - is similar.) We assume that the minority values of F are error values; the other cases are handled as in the discussion of theorem 2. Let u be the minority value of G when Y = +.
In figure 3, a is the probability that G = u when F takes its minority value, and b is the probability that G = u when F takes its majority value. The value r = a - b is the difference. Note that r = 0 if F and G are conditionally independent given Y = +. In fact, we can show that r is exactly our measure d_y of conditional dependence:

2 d_y = |a - p2| + |b - p2| + |(1 - a) - (1 - p2)| + |(1 - b) - (1 - p2)|
      = |a - b| + |a - b|
      = 2r

Hence, in particular, we may write d_y = a - b. Observe further that p2, the minority probability of G when Y = +, is a weighted average of a and b, namely, p2 = p1 a + q1 b. Combining this with the equation d_y = a - b allows us to express a and b in terms of the remaining variables, to wit: a = p2 + q1 d_y and b = p2 - p1 d_y.

[Figure 3: Positive correlation, Y = +.]

In order to prove theorem 3, we need to show that the area of disagreement (B ∪ C) upper bounds the area of the minority value of F (A ∪ B). This is true just in case C is larger than A, which is to say, if b q1 ≥ a p1. Substituting our expressions for a and b into this inequality and solving for d_y yields:

d_y ≤ p2(q1 - p1) / (2 p1 q1)

In short, disagreement upper bounds the minority probability just in case weak rule dependence is satisfied, proving the theorem.

8 The Greedy Agreement Algorithm

Dasgupta, Littman, and McAllester suggest a possible algorithm at the end of their paper, but they give only the briefest suggestion, and do not implement or evaluate it. I give here an algorithm, the Greedy Agreement Algorithm, that constructs paired rules that agree on unlabeled data, and I examine its performance. The algorithm is given in figure 4. It begins with two seed rules, one for each view. At each iteration, each possible extension to one of the rules is considered and scored. The best one is kept, and attention shifts to the other rule.

Figure 4: The Greedy Agreement Algorithm

    Input: seed rules F, G
    loop
        for each atomic rule H
            G' = G + H
            evaluate cost of (F, G')
        keep lowest-cost G'
        if G' is worse than G, quit
        swap F, G'

A complex rule (or classifier) is a list of atomic rules H, each associating a single feature h with a label ℓ. H(x) = ℓ if x has feature h, and H(x) = ⊥ otherwise. A given atomic rule is permitted to appear multiple times in the list. Each atomic rule occurrence gets one vote, and the classifier's prediction is the label that receives the most votes. In case of a tie, there is no prediction.

The cost of a classifier pair (F, G) is based on a more general version of theorem 2, that admits abstaining rules. The following theorem is based on (Dasgupta et al., 2001).

Theorem 4. If view independence is satisfied, and if F and G are rules based on different views, then one of the following holds:

Pr[F ≠ Y | F ≠ ⊥] ≤ δ/(µ - δ)
Pr[F̄ ≠ Y | F̄ ≠ ⊥] ≤ δ/(µ - δ)

where δ = Pr[F ≠ G | F, G ≠ ⊥], and µ = min_u Pr[F = u | F ≠ ⊥].

In other words, for a given binary rule F, a pessimistic estimate of the number of errors made by F is δ/(µ - δ) times the number of instances labeled by F, plus the number of instances left unlabeled by F. Finally, we note that the cost of F is sensitive to the choice of G, but the cost of F with respect to G is not necessarily the same as the cost of G with respect to F. To get an overall cost, we average the cost of F with respect to G and G with respect to F.

[Figure 5: Performance of the greedy agreement algorithm (precision vs. recall; contour lines show levels of F-measure).]

Figure 5 shows the performance of the greedy agreement algorithm after each iteration. Because not all test instances are labeled (some are neither persons nor locations nor organizations), and because classifiers do not label all instances, we show precision and recall rather than a single error rate. The contour lines show levels of the F-measure (the harmonic mean of precision and recall). The algorithm is run to convergence, that is, until no atomic rule can be found that decreases cost. It is interesting to note that there is no significant overtraining with respect to F-measure. The final values are 89.2/80.4/84.5 recall/precision/F-measure, which compare favorably with the performance of the Yarowsky algorithm (83.3/84.6/84.0).
(Collins and Singer, 1999) add a special final round to boost recall, yielding 91.2/80.0/85.2 for the Yarowsky algorithm and 91.3/80.1/85.3 for their version of the original co-training algorithm. All four algorithms essentially perform equally well; the advantage of the greedy agreement algorithm is that we have an explanation for why it performs well.

9 The Yarowsky Algorithm

For Yarowsky's algorithm, a classifier again consists of a list of atomic rules. The prediction of the classifier is the prediction of the first rule in the list that applies. The algorithm constructs a classifier iteratively, beginning with a seed rule. In the variant we consider here, one atomic rule is added at each iteration. An atomic rule F is chosen only if its precision, Pr[G = + | F = +] (as measured using the labels assigned by the current classifier G), exceeds a fixed threshold θ. [1]

[1] (Yarowsky, 1995), citing (Yarowsky, 1994), actually uses a superficially different score that is, however, a monotone transform of precision, hence equivalent to precision, since it is used only for sorting.

Yarowsky does not give an explicit justification for the algorithm. I show here that the algorithm can be justified on the basis of two independence assumptions. In what follows, F represents an atomic rule under consideration, and G represents the current classifier. Recall that Y_ℓ is the set of instances whose true label is ℓ, and G_ℓ is the set of instances {x : G(x) = ℓ}. We write G* for the set of instances labeled by the current classifier, that is, {x : G(x) ≠ ⊥}.

The first assumption is precision independence.

Definition 5. Candidate rule F and classifier G satisfy precision independence just in case

P(Y_ℓ | F_ℓ, G*) = P(Y_ℓ | F_ℓ)

A bootstrapping problem instance satisfies precision independence just in case all rules G and all atomic rules F that nontrivially overlap with G (both F_ℓ ∩ G* and F_ℓ - G* are nonempty) satisfy precision independence.

Precision independence is stated here so that it looks like a conditional independence assumption, to emphasize the similarity to the analysis of co-training. In fact, it is only "half" an independence assumption: for precision independence, it is not necessary that P(Y_ℓ | F̄_ℓ, G*) = P(Y_ℓ | F̄_ℓ).

The second assumption is that classifiers make balanced errors. That is:

P(Y_ℓ, G_ℓ̄ | F_ℓ) = P(Y_ℓ̄, G_ℓ | F_ℓ)

Let us first consider a concrete (but hypothetical) example. Suppose the initial classifier correctly labels 100 out of 1000 instances, and makes no mistakes. Then the initial precision is 1 and recall is 0.1. Suppose further that we add an atomic rule that correctly labels 19 new instances, and incorrectly labels one new instance.
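The numbers in this running example can be checked directly, along with the weighted-average form of the precision update:

```python
# Old classifier: A = 100 correct, B = 0 incorrect of N_t = 100 labeled.
A, B = 100, 0
N_t = A + B
Q_t = A / N_t                       # old precision: 1.0

# New atomic rule: 19 correct, 1 incorrect on n = 20 new instances.
dA, dB = 19, 1
n = dA + dB
q = dA / n                          # rule precision: 0.95

# New classifier precision, and the weighted-average identity
# Q_{t+1} = (N_t/N_{t+1}) Q_t + (n/N_{t+1}) q.
N_t1 = N_t + n
Q_t1 = (A + dA) / N_t1              # 119/120
identity = (N_t / N_t1) * Q_t + (n / N_t1) * q
```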
The rule's precision is 0.95. The precision of the new classifier (the old classifier plus the new atomic rule) is 119/120 = 0.99. Note that the new precision lies between the old precision and the precision of the rule. We will show that this is always the case, given precision independence and balanced errors.

We need to consider several quantities: the precision of the current classifier, P(Y_ℓ | G_ℓ); the precision of the rule under consideration, P(Y_ℓ | F_ℓ); the precision of the rule on the current labeled set, P(Y_ℓ | F_ℓ G*); and the precision of the rule as measured using estimated labels, P(G_ℓ | F_ℓ G*).

The assumption of balanced errors implies that measured precision equals true precision on labeled instances, as follows. (We assume here that all instances have true labels, hence that Ȳ_ℓ = Y_ℓ̄.)

P(F_ℓ G_ℓ Ȳ_ℓ) = P(F_ℓ G_ℓ̄ Y_ℓ)
P(F_ℓ G_ℓ Y_ℓ) + P(F_ℓ G_ℓ Ȳ_ℓ) = P(Y_ℓ F_ℓ G_ℓ) + P(F_ℓ G_ℓ̄ Y_ℓ)
P(F_ℓ G_ℓ) = P(Y_ℓ F_ℓ G*)
P(G_ℓ | F_ℓ G*) = P(Y_ℓ | F_ℓ G*)

This, combined with precision independence, implies that the precision of F as measured on the labeled set is equal to its true precision P(Y_ℓ | F_ℓ).

Now consider the precision of the old and new classifiers at predicting ℓ. Of the instances that the old classifier labels ℓ, let A be the number that are correctly labeled and B be the number that are incorrectly labeled. Defining N_t = A + B, the precision of the old classifier is Q_t = A/N_t. Let ΔA be the number of new instances that the rule under consideration correctly labels, and let ΔB be the number that it incorrectly labels. Defining n = ΔA + ΔB, the precision of the rule is q = ΔA/n. The precision of the new classifier is Q_{t+1} = (A + ΔA)/N_{t+1}, which can be written as:

Q_{t+1} = (N_t / N_{t+1}) Q_t + (n / N_{t+1}) q

That is, the precision of the new classifier is a weighted average of the precision of the old classifier and the precision of the new rule.

[Figure 6: Performance of the Yarowsky algorithm (precision vs. recall).]

An immediate consequence is that, if we only accept rules whose precision exceeds a given threshold θ, then the precision of the new classifier exceeds θ. Since measured precision equals true precision under our previous assumptions, it follows that the true precision of the final classifier exceeds θ if the measured precision of every accepted rule exceeds θ. Moreover, observe that recall can be written as:

A/N_ℓ = (N_t / N_ℓ) Q_t

where N_ℓ is the number of instances whose true label is ℓ. If Q_t > θ, then recall is bounded below by N_t θ / N_ℓ, which grows as N_t grows. Hence we have proven the following theorem.

Theorem 5. If the assumptions of precision independence and balanced errors are satisfied, then the Yarowsky algorithm with threshold θ obtains a final classifier whose precision is at least θ. Moreover, recall is bounded below by N_t θ / N_ℓ, a quantity which increases at each round.

Intuitively, the Yarowsky algorithm increases recall while holding precision above a threshold that represents the desired precision of the final classifier. The empirical behavior of the algorithm, as shown in figure 6, is in accordance with this analysis.

We have seen, then, that the Yarowsky algorithm, like the co-training algorithm, can be justified on the basis of an independence assumption, precision independence. It is important to note, however, that the Yarowsky algorithm is not a special case of co-training. Precision independence and view independence are distinct assumptions; neither implies the other. [2]

[2] To see that view independence does not imply precision independence, consider an example in which G_ℓ = Y_ℓ always. This is compatible with rule independence, but it implies that P(Y_ℓ | F_ℓ G_ℓ) = 1 and P(Y_ℓ | F_ℓ Ḡ_ℓ) = 0, violating precision independence.

10 Conclusion

To sum up, we have refined previous work on the analysis of co-training, and given a new co-training algorithm that is theoretically justified and has good empirical performance. We have also given a theoretical analysis of the Yarowsky algorithm for the first time, and shown that it can be justified by an independence assumption that is quite distinct from the independence assumption that co-training is based on.

References

A. Blum and T. Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In EMNLP.

Sanjoy Dasgupta, Michael Littman, and David McAllester. 2001. PAC generalization bounds for co-training. In Proceedings of NIPS.

David Yarowsky. 1994. Decision lists for lexical ambiguity resolution. In Proceedings ACL 32.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196.
