Analysis of Influencing Factors of Prediction Accuracy in Ensemble Learning

兵工自动化

Ordnance Industry Automation

2019-01

38(1)

doi: 10.7690/bgzdh.2019.01.017

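Guo Fuliang, Zhou Gang

(Department of Computer Engineering, School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)

Abstract: Ensemble learning is regarded as an important method for improving prediction accuracy in data mining and machine learning. Building on an analysis of the basic concepts of ensemble learning, this paper divides ensemble model design into three stages: classifier construction, classifier integration, and integration of classification results. It then examines how to improve prediction accuracy from three angles: controlling classifier error, improving the generalization ability of the ensemble, and tolerating error in the application. Worked examples are used to study the factors that influence prediction accuracy at each of the three stages and the methods for improving it. The results indicate that the study offers useful guidance for controlling the prediction error of ensemble learning, improving prediction accuracy, and building reasonable and efficient ensemble learning models.

Keywords: ensemble learning; prediction accuracy; bias-variance decomposition; Bagging algorithm; AdaBoost algorithm; Waikato Environment for Knowledge Analysis (WEKA)

CLC number: TP311.13  Document code: A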

Analysis of Influencing Factors of Prediction Accuracy in Ensemble Learning

Guo Fuliang, Zhou Gang

(Department of Computer Engineering, School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)

Abstract: Ensemble learning is considered an important method for improving prediction accuracy in data mining and machine learning. Based on an analysis of the basic concepts of ensemble learning, the design of an ensemble learning model is divided into 3 stages: classifier construction, classifier integration, and classification result integration. Methods for increasing prediction accuracy are then discussed from 3 aspects: controlling classifier error, enhancing generalization ability, and tolerating error in the application. The influencing factors and improvement methods of the 3 stages are studied through examples. The results show that the study is of considerable significance for reducing prediction error, improving prediction accuracy, and constructing a reasonable ensemble learning model.

Keywords: ensemble learning; prediction accuracy; bias-variance decomposition; Bagging algorithm; AdaBoost algorithm; Waikato Environment for Knowledge Analysis

0 Introduction

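Ensemble learning solves a problem by combining multiple learners, integrating several weak learners into a strong learner [1]. Elder [2] showed that classifier ensemble techniques outperform both simple averaging and single models. Ensemble learning is regarded as one of the important future research directions in machine learning and as an important means of improving learning accuracy [3].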

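Ensemble learning methods trace back to 1989, when Kearns, working within the probably approximately correct (PAC) learning model, introduced the notions of weak and strong learners and went on to construct a polynomial-time learner [4]. Since then the field has produced the Bagging (Bootstrap Aggregating) algorithm proposed by Breiman [5], the Boosting algorithm proposed by Robert [6], the AdaBoost (Adaptive Boosting) algorithm that Freund and Schapire built on Boosting [7], and the Stacking algorithm proposed by Wolpert for combining the outputs of base classifiers [8]. Building on these classical methods, later work developed neural network ensembles [9], random forests [10], and selective ensembles [11], among others.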

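The prediction accuracy of an ensemble model, quantified here by the mean squared error (MSE), is an important indicator for evaluating ensemble learning methods [12]. After introducing the basic concepts of ensemble learning, the authors address the three stages of ensemble learning and analyze the factors that influence the prediction accuracy of an ensemble model in terms of prediction error in base classifier construction, generalization ability in classifier integration, and tolerated error in the integration of classification results. Basic strategies for improving accuracy are also discussed.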
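For reference, the MSE named above can be written out as follows. This is the standard textbook definition rather than a formula taken from the paper; the symbols $N$, $y_i$ and $\hat{y}_i$ are notation assumed here for the number of evaluated samples, the true value of sample $i$, and the ensemble's prediction for it.

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
```

When both $y_i$ and $\hat{y}_i$ are hard 0/1 class labels, each squared term is 1 for a misclassified sample and 0 otherwise, so the MSE coincides with the misclassification rate.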

1 Concept of ensemble learning

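Ensemble learning is a class of data mining algorithms that, in essence, fuses multiple weak classifiers into one strong classifier in order to improve classification accuracy. Data mining covers a variety of methods, including classification, clustering, and association analysis [13]; ensemble learning mainly takes classification and regression models as its base learners. The two differ in whether the predicted output is a discrete value. This paper focuses on ensemble methods for classifiers.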
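The idea that fusing weak classifiers can yield a stronger one can be illustrated with a minimal sketch. It assumes simple majority voting over an odd number of base classifiers whose errors are independent and which all share the same accuracy; these assumptions, and the function names `majority_vote` and `majority_vote_accuracy`, are illustrative choices and not taken from the paper.

```python
from collections import Counter
from math import comb


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of the base classifiers is correct.

    Assumption (hypothetical, for illustration only): the classifiers err
    independently and each is correct with the same probability p_correct.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["spam", "ham", "spam"]))
    for n in (1, 3, 5, 11):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions, base classifiers that are each correct 60% of the time give a majority vote that is correct about 64.8% of the time with three members and about 68.3% with five. In practice base classifiers are correlated, so the gain is smaller than this idealized calculation suggests.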

A classifier uses known observed data (test data

Received: 2018-11-19; revised: 2018-12-26

About the author: Guo Fuliang (born 1963), male, from Hebei, Ph.D., professor, engaged in data mining research.