Comparing the Performance of Classification Methods on the Wisconsin Breast Cancer Dataset (R)


[Original] Performance Measurement for Multiclass Classification in R: A Data Analysis Report (Code + Data)

For classification problems, classifier performance is typically defined in terms of the confusion matrix associated with the classifier.

From the confusion matrix, sensitivity (recall), specificity, and precision can be computed.

For binary classification problems, all of these performance measures are easy to obtain.

Data for a non-scoring classifier

To demonstrate performance metrics for non-scoring classifiers in a multiclass setting, consider a classification problem with \(N = 100\) observations and five classes \(G = \{1, \ldots, 5\}\):

labels <- c(rep("A", 45), rep("B", 10), rep("C", 15), rep("D", 25), rep("E", 5))
predictions <- c(rep("A", 35), rep("E", 5), rep("D", 5),
                 rep("B", 9), rep("D", 1),
                 rep("C", 7), rep("B", 5), rep("C", 3),
                 rep("D", 23), rep("C", 2),
                 rep("E", 1), rep("A", 2), rep("B", 2))
df <- data.frame("Prediction" = predictions, "Reference" = labels)

Accuracy and weighted accuracy

Multiclass accuracy is usually defined as the mean of correct predictions:

\[ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} I(\hat{y}_i = y_i) \]

where \(I\) is the indicator function, which returns 1 if the classes match and 0 otherwise.
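Given the df above, this definition can be evaluated directly; a one-line R sketch (the 0.78 follows from the counts in the vectors above):

# proportion of predictions that match the reference labels
acc <- mean(as.character(df$Prediction) == as.character(df$Reference))
acc  # 0.78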

Wisconsin Breast Cancer: Predicting Benign vs. Malignant

I. Obtaining the data

wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

The raw data are comma-separated. The columns (digitized measurements from fine-needle aspiration biopsy images of breast masses, describing characteristics of the cell nuclei appearing in the digitized images) are:

1. Sample Code Number: id number
2. Clump Thickness: 1-10
3. Uniformity of Cell Size: 1-10
4. Uniformity of Cell Shape: 1-10
5. Marginal Adhesion: 1-10
6. Single Epithelial Cell Size: 1-10
7. Bare Nuclei: 1-10
8. Bland Chromatin: 1-10
9. Normal Nucleoli: 1-10
10. Mitoses: 1-10
11. Class: 2 is benign, 4 is malignant

The cell-feature scores are integers from 1 (closest to benign) to 10 (closest to malignant).

No single variable can by itself serve as the criterion for benign versus malignant; the goal of the modeling is to find some combination of the nine cell features that accurately predicts malignancy. Early detection starts with examining breast tissue for abnormal lumps.

If a lump is found, a fine-needle aspiration biopsy is performed: a hollow needle extracts a small sample of cells from the lump, and a clinician examines the cells under a microscope to determine whether the mass is likely malignant or benign. If machine learning could identify cancerous cells automatically, it would be a considerable benefit to the health system: doctors could spend less time on diagnosis and more on treating disease, and an automated screening system might improve detection accuracy by removing the inherently subjective human element from the process.

II. Code walkthrough

Setup: create a Breast folder on the desktop, cd Breast, Shift + right-click -> open a command window here -> jupyter notebook -> create a MalignentOrBenign notebook.

1) Read the data and remove the '?' entries

import pandas as pd
import numpy as np
# the file has no header row, so pass header=None
# df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
df = pd.read_csv('breast-cancer-wisconsin.data', header=None)
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
df.columns = column_names
df[df['Sample code number'] == 1057067]

The output shows that the data do contain missing values; they are simply marked with '?'. We replace '?' with np.nan and drop the affected rows:

df = df.replace('?', np.nan)
df[df['Sample code number'] == 1057067]
df = df.dropna(how='any')
df[df['Sample code number'] == 1057067]

The output confirms that '?' now shows as NaN and that the sample has been dropped.

2) Standardize the data and split it into train/test sets

import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# conventionally 1 means malignant and 0 benign (this dataset codes malignant as 4),
# so map 4 -> 1 and 2 -> 0
df.loc[df['Class'] == 4, 'Class'] = 1
df.loc[df['Class'] == 2, 'Class'] = 0
# Sample code number carries no information for classification;
# use 75% of the data for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    df[column_names[1:10]], df[column_names[10]], test_size=0.25, random_state=33)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
print(X_train[0:5, :])

The first five rows of X_train show that the data have been standardized.

3) Classify with logistic regression (LR) and check the accuracy

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y_predict = lr.predict(X_test)
print('The LR Predict Result', accuracy_score(lr_y_predict, y_test))
# LogisticRegression also has its own score method
print('The LR Predict Result Show By lr.score', lr.score(X_test, y_test))

4) Classify with SGDC and check the accuracy

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
sgdc = SGDClassifier()
sgdc.fit(X_train, y_train)
sgdc_y_predict = sgdc.predict(X_test)
print('The SGDC Predict Result', accuracy_score(sgdc_y_predict, y_test))
# SGDClassifier also has its own score method
print('The SGDC Predict Result Show By SGDC.score', sgdc.score(X_test, y_test))

5) Inspect the results with classification_report

print('Label distribution in y_train:')
print(y_train.value_counts())
print('#' * 25)
print('Label distribution in y_test:')
print(y_test.value_counts())
# classification_report gives precision, recall, and their harmonic mean (F1) for LR
from sklearn.metrics import classification_report
print('Classification report for logistic regression:\n')
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))
# the same three metrics for SGDC
print('Classification report for SGDC:\n')
print(classification_report(y_test, sgdc_y_predict, target_names=['Benign', 'Malignant']))

In summary, LR and SGDC achieve comparable classification quality here. (Note that SGDC results are not stable: each run can give a different result, so readers should execute the cell several times and compare.) A curious reader may wonder: isn't SGD a stochastic gradient descent optimization algorithm? What does it have to do with a classifier? The explanation is that the stochastic gradient descent classifier is not a standalone algorithm, but a family of algorithms whose parameters are fitted by stochastic gradient descent.

6) Classify with XGBoost and check the results

import xgboost as xgb
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import GridSearchCV

def modelfit(alg, dtrain_x, dtrain_y, useTrainCV=True, cv_flods=5, early_stopping_rounds=50):
    """
    :param alg: the initial model
    :param dtrain_x: training data X
    :param dtrain_y: training labels y
    :param useTrainCV: whether to use xgb.cv to pick the best n_estimators
    :param cv_flods: number of cross-validation folds
    :param early_stopping_rounds: stop if eval_metric has not improved within this many rounds
    """
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain_x, dtrain_y)
        print(alg.get_params()['n_estimators'])
        cv_result = xgb.cv(xgb_param, xgtrain,
                           num_boost_round=alg.get_params()['n_estimators'],
                           nfold=cv_flods, metrics='auc',
                           early_stopping_rounds=early_stopping_rounds)
        print('useTrainCV\n', cv_result)
        print('Total estimators:', cv_result.shape[0])
        alg.set_params(n_estimators=cv_result.shape[0])
    # train on the training data
    alg.fit(dtrain_x, dtrain_y, eval_metric='auc')
    # predict on the training data
    train_y_pre = alg.predict(dtrain_x)
    print("\nModel Report")
    print("Accuracy : %.4g" % accuracy_score(dtrain_y, train_y_pre))
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importance')
    plt.ylabel('Feature Importance Score')
    plt.show()
    print(alg)

# XGBoost parameter tuning
def xgboost_change_param(train_X, train_y):
    print('###### XGBoost tuning ######')

    print('\n step1: fix the learning rate and pick n_estimators')
    xgb1 = XGBClassifier(learning_rate=0.1, booster='gbtree', n_estimators=300,
                         max_depth=4, min_child_weight=1,
                         gamma=0, subsample=0.8, colsample_bytree=0.8,
                         objective='binary:logistic',
                         nthread=2, scale_pos_weight=1, seed=10)
    # with useTrainCV=True the best n_estimators is 32 at learning_rate=0.1
    modelfit(xgb1, train_X, train_y, early_stopping_rounds=45)

    print('\n step2: tune min_child_weight and max_depth')
    param_test1 = {'max_depth': range(3, 8, 1), 'min_child_weight': range(1, 6, 2)}
    # the estimator argument of GridSearchCV takes the classifier, configured with
    # every parameter except the ones being searched; each grid search also needs
    # a scoring parameter (or a score method)
    gsearch1 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=32, gamma=0,
                                subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=2,
                                scale_pos_weight=1, seed=10),
        param_grid=param_test1, scoring='roc_auc', n_jobs=1, cv=5)
    gsearch1.fit(train_X, train_y)
    # best: max_depth=5, min_child_weight=1
    print(gsearch1.best_params_, gsearch1.best_score_)

    print('\n step3: tune gamma')
    param_test2 = {'gamma': [i / 10.0 for i in range(0, 5)]}
    gsearch2 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=32, max_depth=5,
                                min_child_weight=1, subsample=0.8, colsample_bytree=0.8,
                                objective='binary:logistic', nthread=2,
                                scale_pos_weight=1, seed=10),
        param_grid=param_test2, scoring='roc_auc', cv=5)
    gsearch2.fit(train_X, train_y)
    # best: gamma=0.4
    print(gsearch2.best_params_, gsearch2.best_score_)

    print('\n step4: tune subsample and colsample_bytree')
    param_test3 = {'subsample': [i / 10.0 for i in range(6, 10)],
                   'colsample_bytree': [i / 10.0 for i in range(6, 10)]}
    gsearch3 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=32, max_depth=5,
                                min_child_weight=1, gamma=0.4,
                                objective='binary:logistic', nthread=2,
                                scale_pos_weight=1, seed=10),
        param_grid=param_test3, scoring='roc_auc', cv=5)
    gsearch3.fit(train_X, train_y)
    # best: subsample=0.9, colsample_bytree=0.8
    print(gsearch3.best_params_, gsearch3.best_score_)

    print('\n step5: tune the regularization parameters')
    param_test4 = {'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]}
    gsearch4 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=32, max_depth=5,
                                min_child_weight=1, gamma=0.4, subsample=0.9,
                                colsample_bytree=0.8, objective='binary:logistic',
                                nthread=2, scale_pos_weight=1, seed=10),
        param_grid=param_test4, scoring='roc_auc', cv=5)
    gsearch4.fit(train_X, train_y)
    # best: reg_alpha=0.1
    print(gsearch4.best_params_, gsearch4.best_score_)

    param_test5 = {'reg_lambda': [1e-5, 1e-2, 0.1, 1, 100]}
    gsearch5 = GridSearchCV(
        estimator=XGBClassifier(learning_rate=0.1, n_estimators=32, max_depth=5,
                                min_child_weight=1, gamma=0.4, subsample=0.9,
                                colsample_bytree=0.8, objective='binary:logistic',
                                nthread=2, reg_alpha=0.1, scale_pos_weight=1, seed=10),
        param_grid=param_test5, scoring='roc_auc', cv=5)
    gsearch5.fit(train_X, train_y)
    # best: reg_lambda=1
    print(gsearch5.best_params_, gsearch5.best_score_)

# run the tuning
xgboost_change_param(X_train, y_train)

# finally, fit and predict with the best parameters
print('\nNow we use the best params to fit and predict:')
xgb1 = XGBClassifier(learning_rate=0.1, n_estimators=32, max_depth=5,
                     min_child_weight=1, gamma=0.4, subsample=0.9,
                     colsample_bytree=0.8, objective='binary:logistic',
                     reg_alpha=0.1, reg_lambda=1, nthread=2,
                     scale_pos_weight=1, seed=10)
xgb1.fit(X_train, y_train)
y_test_pre = xgb1.predict(X_test)
y_test_true = y_test
print("The xgboost model Accuracy : %.4g" % accuracy_score(y_pred=y_test_pre, y_true=y_test_true))
print('Classification report for XGBoost:\n')
print(classification_report(y_test_true, y_test_pre, target_names=['Benign', 'Malignant']))

Let's see how XGBoost does.

7) Classify with SVM and check the results

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
models = (SVC(kernel='rbf', gamma=0.1, C=1.0),
          SVC(kernel='rbf', gamma=1, C=1.0),
          SVC(kernel='rbf', gamma=10, C=1.0))
for clf in models:
    clf.fit(X_train, y_train)
    y_test_pre = clf.predict(X_test)
    print("The SVM model Accuracy with gamma=%f : %.4g"
          % (clf.gamma, accuracy_score(y_pred=y_test_pre, y_true=y_test)))
    print('Classification report for SVM with gamma=%f :\n' % clf.gamma)
    print(classification_report(y_test, y_test_pre, target_names=['Benign', 'Malignant']))

Among the gamma values tried here, the best result is still gamma=0.1. The SVM model has two very important parameters, C and gamma: C sets the penalty for misclassified training points, while gamma controls the width of the RBF kernel.

Comparison of Classification Methods on Breast Cancer Data

Early diagnosis and treatment play an important role in breast cancer outcomes, and a variety of classification methods have been applied to this diagnostic task.

This article reviews the principles of the C4.5 decision tree, naive Bayes, support vector machine, and KNN algorithms, builds models with each method on breast cancer data, and compares their performance; the support vector machine performs best.

Keywords: breast cancer; classification methods; C4.5 decision tree; naive Bayes; support vector machine; KNN. Breast cancer is a common cancer in women; statistically it is the second most common cancer worldwide, with a comparatively high mortality rate.

Over the years, many classification algorithms have been applied to computer-aided cancer diagnosis.

The C4.5 decision tree, naive Bayes, and support vector machine algorithms are all classics among classification methods; they rest on different principles, and their classification performance on breast cancer data differs slightly.

1 Methods

1.1 The C4.5 classifier

C4.5 is a classic decision tree algorithm.

It is an extension of Quinlan's earlier ID3 algorithm.

ID3 splits on attributes mainly by information gain; C4.5 differs from ID3 in that its attribute selection measure is the information gain ratio.

That is,

\[ \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}, \qquad \text{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}, \]

where \(\text{SplitInfo}_A(D)\) is the information generated by splitting the training set \(D\) into the \(v\) partitions corresponding to the \(v\) outcomes of a test on attribute \(A\), and \(\text{Gain}(A)\) is the information gain obtained by splitting on \(A\).
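To make the measure concrete, here is a small R sketch of the gain-ratio computation for one categorical attribute; the helper names and the toy vectors are illustrative, not from the paper:

entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# information gain ratio of attribute a for class labels y
gain.ratio <- function(a, y) {
  n <- length(y)
  # expected information after splitting D on attribute A
  cond.entropy <- sum(sapply(split(y, a), function(s) length(s) / n * entropy(s)))
  gain <- entropy(y) - cond.entropy
  split.info <- entropy(a)   # SplitInfo_A(D)
  gain / split.info
}

# toy example: does a discretized clump thickness predict the class?
a <- c("low", "low", "high", "high", "high", "low")
y <- c("benign", "benign", "malignant", "malignant", "benign", "benign")
gain.ratio(a, y)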

1.2 Naive Bayes classification

Naive Bayes makes the strong assumption of class-conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes.

Naive Bayes handles discrete and continuous variables differently.

For discrete variables, probabilities are usually estimated by proportions; continuous variables are usually assumed to follow a Gaussian (or some other continuous) distribution.

The principle is as follows. Let C be the class random variable and X the vector of random variables describing an observation.

Let c denote a particular class label and x a particular value of the random vector.

To classify a test observation x, the posterior probability of each class is obtained from Bayes' rule,

\[ P(C = c \mid X = x) = \frac{P(X = x \mid C = c)\, P(C = c)}{P(X = x)}, \]

and the most probable class is predicted.
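In R, for example, this classifier is available through naiveBayes() in the e1071 package; a minimal sketch (the objects bc, bc.test, and the Class column are illustrative assumptions, not from the paper):

library(e1071)

# bc: a data frame with a factor column Class and the feature columns
nb <- naiveBayes(Class ~ ., data = bc)

posterior <- predict(nb, newdata = bc.test, type = "raw")  # P(C = c | X = x)
pred <- predict(nb, newdata = bc.test)                     # most probable class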

1.3 Support vector machines

The support vector machine can classify both linear and nonlinear data. It is built on VC-dimension theory and structural risk minimization, giving it a solid grounding in statistical learning theory.

For the complex nonlinear boundaries typical of medical data, support vector machines have strong modeling power, and setting the margin parameter appropriately also provides effective protection against overfitting.

[Original] R Data Analysis and Mining Report on the UCI Breast Mass Data

I. Collecting the data

The data come from a UCI Machine Learning Repository dataset named "Breast Cancer Wisconsin (Diagnostic) Data Set". It includes digitized measurements from fine-needle aspiration biopsy images of breast masses; the values describe characteristics of the cell nuclei appearing in the digitized images.

The breast cancer data comprise 569 biopsy cases, each with 32 features.

One feature is an identification number, one is the cancer diagnosis, and the other 30 are numeric laboratory measurements.

The diagnosis is coded "M" for malignant and "B" for benign.

The 30 numeric measurements consist of the mean, standard error, and worst (largest) value of 10 different characteristics of the digitized cell nuclei:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension

II. Exploring and preparing the data

Applying str() to the data frame confirms that the data consist of 569 cases and 32 features. We see the expected 569 observations and 32 variables; the first variable, v1, is each patient's unique identifier (ID), which provides no useful information, so we exclude it from the model.

We split the data into two parts: a training set used to build the decision tree and a test set used to evaluate model performance.
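A minimal R sketch of such a split (the file name, object names, and the split proportion are illustrative assumptions):

wbcd <- read.csv("wdbc.data", header = FALSE)
wbcd <- wbcd[-1]                 # drop the ID variable (v1)

set.seed(123)
idx <- sample(nrow(wbcd), 0.7 * nrow(wbcd))
wbcd_train <- wbcd[idx, ]        # training set for building the model
wbcd_test  <- wbcd[-idx, ]       # held-out test set for evaluating it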

Comparison of Classification Methods on Breast Cancer Data

Breast cancer is one of the most common malignant tumors in women and ranks second among the causes of death in women.

Because its early symptoms are inconspicuous, many patients have already reached an advanced stage by the time they are diagnosed; developing effective classification methods is therefore critical for early screening and treatment of breast cancer.

This article compares the classification methods currently in common use for breast cancer data and discusses their applications, strengths, and weaknesses.

First, classifying breast cancer data with machine learning algorithms is one of the current mainstream research directions.

Machine learning algorithms learn automatically from large volumes of data and use the trained model to predict new data.

These algorithms include naive Bayes, support vector machines, and decision trees.

The naive Bayes algorithm is based on Bayes' theorem and classifies by computing the joint probability of the features.

The support vector machine classifies by finding an optimal separating hyperplane in a high-dimensional feature space.

The decision tree algorithm recursively partitions the data on its attributes, building a tree structure that carries out the classification.

All of these algorithms can be implemented quickly and conveniently with open-source libraries such as Scikit-learn.

Second, classifying breast cancer data with deep learning is becoming a research hotspot.

Deep learning is a family of methods that model and analyze data with multi-layer neural networks.

Compared with traditional machine learning, deep learning copes better with large-scale data and complex nonlinear relationships.

In breast cancer classification, deep learning can learn relevant features automatically and extract more discriminative features for the classification task.

Common deep learning algorithms include convolutional neural networks, recurrent neural networks, and deep belief networks.

Trained on large datasets, with network parameters adjusted continually, these methods reach more accurate classification results.

In addition, classifying breast cancer data with genetic and other optimization algorithms is another important line of work.

A genetic algorithm searches for an optimal solution by simulating natural selection and genetic variation.

In breast cancer classification, the problem can be recast as an optimization: a genetic algorithm searches for the best classifier parameters to achieve highly accurate classification.

Simulated annealing, particle swarm optimization, and similar methods can also be applied to breast cancer classification.

These algorithms optimize classifier performance through repeated iteration and search, gradually approaching the optimum.

Each of these approaches to classifying breast cancer data, however, has its own strengths and weaknesses.

Machine learning algorithms generally offer good interpretability and low computational cost, but are weaker at modeling some nonlinear relationships.

Research on Applying R to Medical Data Mining and Analysis

I. Introduction

With the growth of healthcare informatization and the rapid expansion of medical big data, efficiently mining and analyzing medical data has become an important topic in healthcare.

As a powerful statistical analysis tool, R is widely used in medical data mining and analysis.

This article discusses the current state of R in healthcare and its future development.

II. Advantages of R for medical data processing

As an open-source statistical analysis tool, R has the following advantages:
- Rich data-processing functions: R's extensive set of data-handling functions makes it convenient to clean, transform, and integrate medical data.

- Powerful visualization: through packages such as ggplot2, R provides strong data-visualization features that display the characteristics and patterns of medical data intuitively.

- A wealth of statistical methods: R integrates a wide range of statistical techniques that help healthcare practitioners carry out in-depth analysis and mining.

III. Application cases of R in medical data mining

1. Medical data cleaning: R can clean medical data, handling missing values, detecting outliers, and so on, to ensure the data quality meets the requirements of the analysis.

2. Medical data visualization: with R's powerful plotting features, medical data can be presented as charts, helping clinical staff understand what the data mean more intuitively.

3. Medical data modeling: predictive and classification models built in R support risk assessment and clinical decision making.

4. Medical data mining: clustering, association-rule mining, and related techniques in R can reveal hidden patterns and associations in medical data, providing a reference for clinical practice.

IV. Outlook

With the continuing development of artificial intelligence and big data technology, R has broad prospects in healthcare.

Looking ahead, R can be expected to play a greater role in areas such as medical image recognition and personalized treatment planning, contributing further to the quality and efficiency of medical services.

V. Conclusion

In summary, R holds real significance and broad application prospects in medical data mining and analysis.

With continued research and practice, R should bring more innovation and breakthroughs to healthcare, helping move the whole industry into a digital, intelligent era.

We hope this article helps readers understand R's applications in medicine and inspires more people to take up research and practice in this field.

Breast Cancer Classifier and Data Sample Validation (Python)

By Toby (QQ: 231469242); enthusiasts are welcome to discuss and improve the code. Data download: UCI machine learning / breast cancer.

Vocabulary: malignancy; biopsy; benign; diagnosis; periodic examination; Clump Thickness; Uniformity of Cell Size; Uniformity of Cell Shape; Marginal Adhesion; Single Epithelial Cell Size; Bare Nuclei; Bland Chromatin; Normal Nucleoli; Mitoses.

Background: Wisconsin Breast Cancer Database (WBCD), January 8, 1991; revised November 3, 1994. This is a description of the Wisconsin Breast Cancer Database, collected by Dr. William H. Wolberg, University of Wisconsin Hospitals, Madison. The actual database is contained in another file (datacum). Samples were collected periodically as Dr. Wolberg reported his clinical cases, so the database reflects this chronological grouping of the data. The samples consist of visually assessed nuclear features of fine needle aspirates (FNAs) taken from patients' breasts. Each sample has been assigned a 9-dimensional vector (attributes 3 to 9 below) by Dr. Wolberg. Each component is in the interval 1 to 10, with value 1 corresponding to a normal state and 10 to a most abnormal state. Attribute 1 is the sample number, while attribute 2 designates whether the sample is benign or malignant. Malignancy is determined by taking a sample of tissue from the patient's breast and performing a biopsy on it. A benign diagnosis is confirmed either by biopsy or by periodic examination, depending on the patient's choice. All groups are in the same file; we have separated the groups. Thanks to Dr. William H. Wolberg of the Wisconsin Medical School for providing the breast cancer data samples.

Machine Learning: Breast Cancer Wisconsin (Prognostic) Data Set

Breast Cancer Wisconsin (Prognostic) Data Set

Abstract: Prognostic Wisconsin Breast Cancer Database
Keywords: multivariate, classification, regression, UCI, Wisconsin, breast cancer, prognostic
Data format: TEXT
Data uses: classification, regression

Creators:
1. Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792
2. W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706; 608-262-6619
3. Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
Donor: Nick Street

Data Set Information: Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis. The first 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The separation described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming," Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in [K. P. Bennett and O. L. Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software 1, 1992, 23-34]. The Recurrence Surface Approximation (RSA) method is a linear programming model which predicts time to recur using both recurrent and nonrecurrent cases; see the references below for details. This database is also available through the UW CS ftp server (math-prog/cpo-dataset/machine-learn/WPBC/).

Attribute Information:
1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

Relevant Papers:
W. N. Street, O. L. Mangasarian, and W. H. Wolberg. An inductive learning approach to prognostic prediction. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 522-530, San Francisco, 1995. Morgan Kaufmann.
O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
W. H. Wolberg, W. N. Street, D. M. Heisey, and O. L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.
W. H. Wolberg, W. N. Street, and O. L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
W. H. Wolberg, W. N. Street, D. M. Heisey, and O. L. Mangasarian. Computer-derived nuclear "grade" and breast cancer prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, pages 257-264, 1995.

Data preview (raw comma-separated records; first record shown, remainder omitted):
119513,N,31,18.02,27.6,117.5,1013,0.09489,0.1036,0.1086,0.07055,0.1865,0.06333,0.6249,1.89,3.972,71.55,0.004433,0.01421,0.03233,0.009854,0.01694,0.003495,21.63,37.08,139.7,1436,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5,…


Comparing the Performance of Classification Methods on the Wisconsin Breast Cancer Dataset

1 Introduction
The machine learning field currently offers a great many classification algorithms.

Some are conceptually simple, others abstract, but there is rarely an intuitive side-by-side view of how different methods perform, and explaining classifier performance clearly is complicated and difficult. This article takes the Wisconsin breast cancer dataset as an example for a simple comparison of different classifiers' performance.

2 The breast cancer dataset

The Wisconsin breast cancer dataset contains 699 samples from fine-needle aspiration biopsies, of which 458 are benign and 241 malignant.

For each sample, a further nine variables record the cell features relevant to identifying malignant tumors.

These cell-feature scores are integers from 1 (closest to benign) to 10 (closest to malignant).

The ten variables are:
➢ X1: clump thickness
➢ X2: uniformity of cell size
➢ X3: uniformity of cell shape
➢ X4: marginal adhesion
➢ X5: single epithelial cell size
➢ X6: bare nuclei
➢ X7: bland chromatin
➢ X8: normal nucleoli
➢ X9: mitoses
➢ Class: the class (a categorical variable)
No single variable can by itself serve as the criterion for distinguishing benign from malignant; the aim of the modeling is to find some combination of the nine cell features that accurately predicts malignancy.

The first ten rows of the data after reading them into R are shown in Figure 1.
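A minimal sketch of this step (assuming the raw UCI file breast-cancer-wisconsin.data is in the working directory; the object name breast is ours):

breast <- read.csv("breast-cancer-wisconsin.data", header = FALSE, na.strings = "?")
names(breast) <- c("ID", paste0("X", 1:9), "Class")
breast$Class <- factor(breast$Class, levels = c(2, 4),
                       labels = c("benign", "malignant"))
head(breast, 10)   # the first ten rows, as in Figure 1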

Figure 1. The first ten rows of the dataset
3 Data preprocessing

We first inspect an overview of the data; the summary is shown in Figure 2.

The overview shows that the dataset consists of 699 observations and 10 variables.

It distinguishes 1 class variable from 9 other variables: the 9 others are numeric, the class variable is a factor, and variable X6 is missing 16 values.

For each of the 9 numeric variables it reports the number of missing values, the completeness rate, the mean, the standard deviation, a histogram, and other summaries.

The 97.7% completeness rate of X6 corresponds to its 16 missing observations (683 of 699 complete).

Next, we display the distribution of the missing values directly in another way.

The aggr() function in the VIM package visualizes the number of missing values in the dataset, as shown in Figure 3.

Because the dataset has a fairly large number of observations, for simplicity we delete these 16 observations; the result is shown in Figure 4.
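A minimal sketch of both steps (continuing with the breast data frame from the sketch above):

library(VIM)
aggr(breast, prop = FALSE, numbers = TRUE)   # missing-value counts (Figure 3)

breast <- na.omit(breast)   # drop the 16 observations with a missing X6
nrow(breast)                # 683 observations remain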

Figure 2. Overview of the dataset
Figure 3. The dataset before processing
Figure 4. The dataset after processing
With the missing values handled, we examine the correlations among all variables except the class variable, using corrplot() from the corrplot package to draw the correlation plot of the predictors shown in Figure 5.

The data are then split into a training set (70% of the data) and a test set (30%).

The various models are fitted on the training set, and classification quality is checked on the test set, as sketched below.

Figure 5. Correlation plot of the variables
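A minimal sketch of the correlation plot and the 70/30 split (the seed is arbitrary; the column indices assume ID is column 1):

library(corrplot)
corrplot(cor(breast[, 2:10]))   # correlations among X1..X9 (Figure 5)

set.seed(1234)
idx <- sample(nrow(breast), 0.7 * nrow(breast))
train <- breast[idx, ]
test  <- breast[-idx, ]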
4 The classifiers

4.1 Logistic regression

Logistic regression predicts a binary outcome variable from a set of continuous and categorical predictor variables.

A linear model produces fitted values, which the sigmoid function \( \sigma(z) = 1/(1+e^{-z}) \) maps into [0, 1]; classification then uses a threshold of 0.5.

In R this is done with the glm() function; the first logistic regression run gives the results in Figure 6.

Variables X2, X3, and X5 turn out not to be significant, so we remove them to simplify the model and refit; the results are in Figure 7.

In Figure 7 every variable is significant; the model is then evaluated on the test set, with the results in Figure 8.

From Figure 8, the trained model reaches a classification accuracy of 97.6% on the test set.
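A minimal sketch of the two fits and the test-set evaluation (the reduced formula drops X2, X3, and X5 as described; the 0.5 threshold is the one set above):

fit.logit <- glm(Class ~ ., data = train[, -1], family = binomial())
summary(fit.logit)            # Figure 6: X2, X3, X5 not significant

fit.logit2 <- glm(Class ~ X1 + X4 + X6 + X7 + X8 + X9,
                  data = train, family = binomial())
summary(fit.logit2)           # Figure 7: all terms significant

prob <- predict(fit.logit2, test, type = "response")
pred <- factor(prob > 0.5, levels = c(FALSE, TRUE),
               labels = c("benign", "malignant"))
mean(pred == test$Class)      # test-set accuracy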

Figure 6. Results of the first logistic regression
Figure 7. Results of the revised logistic regression
Figure 8. Test-set results for logistic regression
4.2 Decision trees

A decision tree first selects the best predictor variable and splits all the samples into two groups so as to maximize the purity within each group.

If the predictor is continuous, a split point is chosen that maximizes the purity of the two groups; if it is categorical, its categories are merged and then split. The procedure is repeated on each subgroup until a subgroup contains too few samples, or until no split can reduce the impurity below a threshold.

Each terminal node is finally assigned the majority class of the samples it contains.

This algorithm generally grows a rather large tree and overfits, so the model performs poorly on the test set; to address this, 10-fold cross-validation can be used to select the tree with the smallest prediction error.

In R this can be done with the rpart() function.

The cptable returned by rpart() lists the prediction error for trees of different sizes and helps set the final tree size; it is shown in Figure 9.

Figure 9. The cptable

Here the complexity parameter cp penalizes overly large trees; nsplit is the tree size (the number of splits); the rel error column gives the training-set error of each tree; xerror is the cross-validation error (a 10-fold cross-validation error based on the training sample); and xstd is the standard deviation of the cross-validation error.

The final optimal model is shown in Figure 10.

Figure 10. The final prediction tree
4.3 Random forests

A random forest is an ensemble method for supervised learning.

In a random forest, many prediction models are grown at once and their results aggregated to improve classification, with the class decided by majority vote.

In R, the randomForest() function in the randomForest package grows the forest.

By default the function grows 500 trees and samples 3 variables at each node; the results are in Figure 11, where the out-of-bag (OOB) error, the estimated classification error rate on the training data, is 2.93%.
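A minimal sketch (500 trees and 3 variables per node are the defaults noted above):

library(randomForest)
set.seed(1234)
fit.forest <- randomForest(Class ~ ., data = train[, -1])  # ntree = 500, mtry = floor(sqrt(9)) = 3
fit.forest                       # Figure 11: OOB error estimate
pred <- predict(fit.forest, test)
table(test$Class, pred)          # Figure 12: test-set confusion table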

Figure 11. Random forest results

The trained model is then evaluated on the test set, with the results shown in Figure 12.

Figure 12. Test-set results

4.4 Support vector machines

The support vector machine (SVM) seeks an optimal hyperplane in a multidimensional space that separates all the samples into two classes, chosen so that the distance between the closest points of the two classes is as large as possible; the points on the margin boundary are called support vectors, and the separating hyperplane sits in the middle of the margin.

In R this is implemented with the svm() function from the e1071 package; the results are shown in Figure 13.
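A minimal sketch (the kernel and cost are left at the e1071 defaults):

library(e1071)
set.seed(1234)
fit.svm <- svm(Class ~ ., data = train[, -1])
fit.svm                          # Figure 13: model summary
pred <- predict(fit.svm, test)
table(test$Class, pred)          # Figure 14: test-set confusion table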

Figure 13. SVM results

Evaluating on the test set gives the results in Figure 14.

Figure 14. SVM results on the test set

5 Performance comparison

Beyond a classifier's raw accuracy, four further questions matter for a classification result:
⚫ What proportion of benign cases is identified correctly?
⚫ What proportion of malignant cases is identified correctly?
⚫ If a person is diagnosed as malignant, what is the probability that the call is correct?
⚫ If a person is diagnosed as benign, what is the probability that the call is correct?
These questions correspond to a classifier's sensitivity, specificity, positive predictive value, and negative predictive value; the four concepts are illustrated in Figure 15.

Figure 15. Illustration of the metrics

This article defines a metric function that returns these five measures (accuracy plus the four above) and uses it to evaluate the classifiers; the results are shown in Figure 16.

Figure 16. Classification performance

Figure 16 shows that all classifiers perform well on this dataset, each above 90%. The best classifier is the support vector machine (SVM), followed by the random forest and logistic regression, with the decision tree last; SVM also leads on most of the individual metrics.

The methods differ in complexity: logistic regression and decision trees are relatively simple, while random forests and support vector machines are more complex and black-box. When classification quality is essentially equal, prefer the simpler method, and reach for the more complex ones only when the simple methods fall short.
