Imbalanced Classification Algorithm in Botnet Detection


A Survey of Imbalanced Data Classification Research

ZHAO Nan; ZHANG Xiaofang; ZHANG Lijun
Journal: Computer Science (《计算机科学》)
Year (volume), issue: 2018 (045) 0z1
Pages: 7 (P22-27, 57)
Affiliation: School of Computer Science, Northwestern Polytechnical University, Xi'an 710000 (all three authors)
Language of the full text: Chinese
CLC number: TP311

Abstract: In many application domains the class distribution of the data is imbalanced, and classifying such data correctly is an active research topic in data mining and machine learning. Classical classification algorithms do not account for class imbalance and assume equal misclassification costs across classes, so they perform poorly on imbalanced data. Different methods for handling imbalanced data have been proposed for each phase of the classification process. This survey categorizes and analyzes the research results of recent years, systematically introduces the methods from the perspectives of feature selection, adjustment of the data distribution, classification algorithms, and evaluation of classification results, and discusses the future trends and research issues that still need to be addressed.
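To make the survey's point about classifier evaluation concrete, the following minimal sketch shows why plain accuracy is a poor measure on imbalanced data. It is an illustration added here, not code from the survey: the 95:5 class ratio, the "always predict the majority class" baseline, and the helper name `imbalance_metrics` are all assumptions.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, F1 and G-mean, treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0          # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "g_mean": g_mean}

# Hypothetical data: 95 majority samples (label 0) and 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100          # a classifier that never predicts the minority class
metrics = imbalance_metrics(y_true, always_majority)
print(metrics)
```

The baseline scores high on accuracy while its recall, F1 and G-mean are all zero, which is why imbalance-aware measures such as F1 and G-mean are commonly reported instead.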

Classification Algorithm Based on Undersampling and Cost-Sensitiveness for Unbalanced Data

WANG Junhong 1,2*, YAN Jiarong 1,2
(1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China)
(* Corresponding author, e-mail: wjhwjh@)
Journal of Computer Applications (计算机应用), 2021, 41(1): 48-52, published 2021-01-10. ISSN 1001-9081, CODEN JYIIDU.

Abstract: To address the low prediction accuracy that traditional classifiers achieve on the minority class of unbalanced datasets, an unbalanced data classification algorithm based on undersampling and cost-sensitiveness, called USCBoost (UnderSamples and Cost-sensitive Boosting), is proposed. First, before the base classifier is trained in each iteration of the AdaBoost (Adaptive Boosting) algorithm, the majority class samples are sorted by weight in descending order, and a number of majority class samples equal to the number of minority class samples is selected according to the sample weights. The weights of the selected majority class samples are then normalized, and these samples are combined with the minority class samples into a temporary training set on which the base classifier is trained. Second, in the weight update stage, a higher misclassification cost is assigned to the minority class, so that the weights of minority class samples increase faster and the weights of majority class samples increase more slowly. On ten UCI datasets, USCBoost was compared with AdaBoost, AdaCost (Cost-sensitive AdaBoosting), and RUSBoost (Random Under-Sampling Boosting). USCBoost obtained the highest scores on six datasets under the F1-measure criterion and on nine datasets under the G-mean criterion, which indicates that the proposed algorithm has better classification performance on unbalanced data.

Key words: unbalanced data; classification; cost-sensitiveness; AdaBoost algorithm; undersampling
CLC number: TP18. Document code: A.

Introduction

Classification is an important branch of data mining. Ordinary classification models usually assume that the numbers of samples in the classes differ little and that the misclassification cost is the same for every class. Training a traditional classifier on an unbalanced dataset leads to very low prediction accuracy for the minority class, so learning from unbalanced data has long been a research focus in machine learning [1].
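The two stages described in the abstract (weight-ordered undersampling before each base classifier, then a cost-sensitive weight update) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the cost formula, so the update rule, the `minority_cost` parameter, the use of 1-D decision stumps, the choice of the top-weighted majority samples, and all function names are hypothetical.

```python
import math

def train_stump(xs, ys, ws):
    """Weighted decision stump on 1-D inputs; returns (threshold, polarity)."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if (pol if x >= thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best[1], best[2]

def stump_predict(stump, x):
    thr, pol = stump
    return pol if x >= thr else -pol

def uscboost_sketch(xs, ys, rounds=3, minority_cost=2.0):
    """Labels: +1 = minority class, -1 = majority class (an assumed convention)."""
    n = len(xs)
    w = [1.0 / n] * n
    minority = [i for i in range(n) if ys[i] == 1]
    majority = [i for i in range(n) if ys[i] == -1]
    ensemble = []
    for _ in range(rounds):
        # Undersampling: keep the heaviest majority samples, one per minority sample.
        kept = sorted(majority, key=lambda i: w[i], reverse=True)[:len(minority)]
        subset = kept + minority
        total = sum(w[i] for i in subset)  # normalize weights of the temporary set (assumed)
        stump = train_stump([xs[i] for i in subset], [ys[i] for i in subset],
                            [w[i] / total for i in subset])
        # Weighted error of the stump on the full training set, clamped away from 0 and 1.
        wrong = [stump_predict(stump, xs[i]) != ys[i] for i in range(n)]
        err = min(max(sum(w[i] for i in range(n) if wrong[i]), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # Cost-sensitive update (assumed form): a misclassified minority sample
        # gains weight faster than a misclassified majority sample.
        for i in range(n):
            if wrong[i]:
                cost = minority_cost if ys[i] == 1 else 1.0 / minority_cost
                w[i] *= math.exp(alpha * cost)
            else:
                w[i] *= math.exp(-alpha)
        z = sum(w)
        w = [wi / z for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data: 21 majority samples, 4 minority samples, slight overlap at 2.5/2.6.
xs = [i * 0.1 for i in range(20)] + [2.6] + [2.5, 2.7, 2.9, 3.1]
ys = [-1] * 21 + [1] * 4
model = uscboost_sketch(xs, ys)
print(predict(model, 3.0), predict(model, 0.5))
```

Selecting the heaviest majority samples concentrates each base classifier on the majority points that earlier rounds handled worst, while the asymmetric cost keeps the minority class from being ignored; both choices above are only one plausible reading of the abstract.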

Jin Kangrong (金康荣): A Random Forest Method for Chinese Text Classification

1. The Random Forest algorithm is widely used in Chinese text classification.
2. The algorithm combines multiple decision trees to improve classification accuracy.
3. It can effectively handle high-dimensional, sparse feature spaces.
4. It has been successfully applied to sentiment analysis, topic classification, and news categorization.
5. It can handle unbalanced datasets in text classification tasks.
6. Using feature importance measures, it can identify the most influential features in the classification process.
7. It is computationally efficient and scales to large datasets.

Machine Learning Terminology: English-Chinese Glossary

机器学习专业词汇中英⽂对照activation 激活值activation function 激活函数additive noise 加性噪声autoencoder ⾃编码器Autoencoders ⾃编码算法average firing rate 平均激活率average sum-of-squares error 均⽅差backpropagation 后向传播basis 基basis feature vectors 特征基向量batch gradient ascent 批量梯度上升法Bayesian regularization method 贝叶斯规则化⽅法Bernoulli random variable 伯努利随机变量bias term 偏置项binary classfication ⼆元分类class labels 类型标记concatenation 级联conjugate gradient 共轭梯度contiguous groups 联通区域convex optimization software 凸优化软件convolution 卷积cost function 代价函数covariance matrix 协⽅差矩阵DC component 直流分量decorrelation 去相关degeneracy 退化demensionality reduction 降维derivative 导函数diagonal 对⾓线diffusion of gradients 梯度的弥散eigenvalue 特征值eigenvector 特征向量error term 残差feature matrix 特征矩阵feature standardization 特征标准化feedforward architectures 前馈结构算法feedforward neural network 前馈神经⽹络feedforward pass 前馈传导fine-tuned 微调first-order feature ⼀阶特征forward pass 前向传导forward propagation 前向传播Gaussian prior ⾼斯先验概率generative model ⽣成模型gradient descent 梯度下降Greedy layer-wise training 逐层贪婪训练⽅法grouping matrix 分组矩阵Hadamard product 阿达马乘积Hessian matrix Hessian 矩阵hidden layer 隐含层hidden units 隐藏神经元Hierarchical grouping 层次型分组higher-order features 更⾼阶特征highly non-convex optimization problem ⾼度⾮凸的优化问题histogram 直⽅图hyperbolic tangent 双曲正切函数hypothesis 估值,假设identity activation function 恒等激励函数IID 独⽴同分布illumination 照明inactive 抑制independent component analysis 独⽴成份分析input domains 输⼊域input layer 输⼊层intensity 亮度/灰度intercept term 截距KL divergence 相对熵KL divergence KL分散度k-Means K-均值learning rate 学习速率least squares 最⼩⼆乘法linear correspondence 线性响应linear superposition 线性叠加line-search algorithm 线搜索算法local mean subtraction 局部均值消减local optima 局部最优解logistic regression 逻辑回归loss function 损失函数low-pass filtering 低通滤波magnitude 幅值MAP 极⼤后验估计maximum likelihood estimation 极⼤似然估计mean 平均值MFCC Mel 倒频系数multi-class classification 多元分类neural networks 神经⽹络neuron 神经元Newton’s method ⽜顿法non-convex function ⾮凸函数non-linear feature ⾮线性特征norm 范式norm bounded 有界范数norm constrained 范数约束normalization 归⼀化numerical roundoff errors 
数值舍⼊误差numerically checking 数值检验numerically reliable 数值计算上稳定object detection 物体检测objective function ⽬标函数off-by-one error 缺位错误orthogonalization 正交化output layer 输出层overall cost function 总体代价函数over-complete basis 超完备基over-fitting 过拟合parts of objects ⽬标的部件part-whole decompostion 部分-整体分解PCA 主元分析penalty term 惩罚因⼦per-example mean subtraction 逐样本均值消减pooling 池化pretrain 预训练principal components analysis 主成份分析quadratic constraints ⼆次约束RBMs 受限Boltzman机reconstruction based models 基于重构的模型reconstruction cost 重建代价reconstruction term 重构项redundant 冗余reflection matrix 反射矩阵regularization 正则化regularization term 正则化项rescaling 缩放robust 鲁棒性run ⾏程second-order feature ⼆阶特征sigmoid activation function S型激励函数significant digits 有效数字singular value 奇异值singular vector 奇异向量smoothed L1 penalty 平滑的L1范数惩罚Smoothed topographic L1 sparsity penalty 平滑地形L1稀疏惩罚函数smoothing 平滑Softmax Regresson Softmax回归sorted in decreasing order 降序排列source features 源特征sparse autoencoder 消减归⼀化Sparsity 稀疏性sparsity parameter 稀疏性参数sparsity penalty 稀疏惩罚square function 平⽅函数squared-error ⽅差stationary 平稳性(不变性)stationary stochastic process 平稳随机过程step-size 步长值supervised learning 监督学习symmetric positive semi-definite matrix 对称半正定矩阵symmetry breaking 对称失效tanh function 双曲正切函数the average activation 平均活跃度the derivative checking method 梯度验证⽅法the empirical distribution 经验分布函数the energy function 能量函数the Lagrange dual 拉格朗⽇对偶函数the log likelihood 对数似然函数the pixel intensity value 像素灰度值the rate of convergence 收敛速度topographic cost term 拓扑代价项topographic ordered 拓扑秩序transformation 变换translation invariant 平移不变性trivial answer 平凡解under-complete basis 不完备基unrolling 组合扩展unsupervised learning ⽆监督学习variance ⽅差vecotrized implementation 向量化实现vectorization ⽮量化visual cortex 视觉⽪层weight decay 权重衰减weighted average 加权平均值whitening ⽩化zero-mean 均值为零Letter AAccumulated error backpropagation 累积误差逆传播Activation Function 激活函数Adaptive Resonance Theory/ART ⾃适应谐振理论Addictive model 加性学习Adversarial Networks 对抗⽹络Affine Layer 仿射层Affinity matrix 亲和矩阵Agent 代理 / 智能体Algorithm 算法Alpha-beta 
pruning α-β剪枝Anomaly detection 异常检测Approximation 近似Area Under ROC Curve/AUC Roc 曲线下⾯积Artificial General Intelligence/AGI 通⽤⼈⼯智能Artificial Intelligence/AI ⼈⼯智能Association analysis 关联分析Attention mechanism 注意⼒机制Attribute conditional independence assumption 属性条件独⽴性假设Attribute space 属性空间Attribute value 属性值Autoencoder ⾃编码器Automatic speech recognition ⾃动语⾳识别Automatic summarization ⾃动摘要Average gradient 平均梯度Average-Pooling 平均池化Letter BBackpropagation Through Time 通过时间的反向传播Backpropagation/BP 反向传播Base learner 基学习器Base learning algorithm 基学习算法Batch Normalization/BN 批量归⼀化Bayes decision rule 贝叶斯判定准则Bayes Model Averaging/BMA 贝叶斯模型平均Bayes optimal classifier 贝叶斯最优分类器Bayesian decision theory 贝叶斯决策论Bayesian network 贝叶斯⽹络Between-class scatter matrix 类间散度矩阵Bias 偏置 / 偏差Bias-variance decomposition 偏差-⽅差分解Bias-Variance Dilemma 偏差 – ⽅差困境Bi-directional Long-Short Term Memory/Bi-LSTM 双向长短期记忆Binary classification ⼆分类Binomial test ⼆项检验Bi-partition ⼆分法Boltzmann machine 玻尔兹曼机Bootstrap sampling ⾃助采样法/可重复采样/有放回采样Bootstrapping ⾃助法Break-Event Point/BEP 平衡点Letter CCalibration 校准Cascade-Correlation 级联相关Categorical attribute 离散属性Class-conditional probability 类条件概率Classification and regression tree/CART 分类与回归树Classifier 分类器Class-imbalance 类别不平衡Closed -form 闭式Cluster 簇/类/集群Cluster analysis 聚类分析Clustering 聚类Clustering ensemble 聚类集成Co-adapting 共适应Coding matrix 编码矩阵COLT 国际学习理论会议Committee-based learning 基于委员会的学习Competitive learning 竞争型学习Component learner 组件学习器Comprehensibility 可解释性Computation Cost 计算成本Computational Linguistics 计算语⾔学Computer vision 计算机视觉Concept drift 概念漂移Concept Learning System /CLS 概念学习系统Conditional entropy 条件熵Conditional mutual information 条件互信息Conditional Probability Table/CPT 条件概率表Conditional random field/CRF 条件随机场Conditional risk 条件风险Confidence 置信度Confusion matrix 混淆矩阵Connection weight 连接权Connectionism 连结主义Consistency ⼀致性/相合性Contingency table 列联表Continuous attribute 连续属性Convergence 收敛Conversational agent 会话智能体Convex quadratic programming 凸⼆次规划Convexity 凸性Convolutional neural network/CNN 
卷积神经⽹络Co-occurrence 同现Correlation coefficient 相关系数Cosine similarity 余弦相似度Cost curve 成本曲线Cost Function 成本函数Cost matrix 成本矩阵Cost-sensitive 成本敏感Cross entropy 交叉熵Cross validation 交叉验证Crowdsourcing 众包Curse of dimensionality 维数灾难Cut point 截断点Cutting plane algorithm 割平⾯法Letter DData mining 数据挖掘Data set 数据集Decision Boundary 决策边界Decision stump 决策树桩Decision tree 决策树/判定树Deduction 演绎Deep Belief Network 深度信念⽹络Deep Convolutional Generative Adversarial Network/DCGAN 深度卷积⽣成对抗⽹络Deep learning 深度学习Deep neural network/DNN 深度神经⽹络Deep Q-Learning 深度 Q 学习Deep Q-Network 深度 Q ⽹络Density estimation 密度估计Density-based clustering 密度聚类Differentiable neural computer 可微分神经计算机Dimensionality reduction algorithm 降维算法Directed edge 有向边Disagreement measure 不合度量Discriminative model 判别模型Discriminator 判别器Distance measure 距离度量Distance metric learning 距离度量学习Distribution 分布Divergence 散度Diversity measure 多样性度量/差异性度量Domain adaption 领域⾃适应Downsampling 下采样D-separation (Directed separation)有向分离Dual problem 对偶问题Dummy node 哑结点Dynamic Fusion 动态融合Dynamic programming 动态规划Letter EEigenvalue decomposition 特征值分解Embedding 嵌⼊Emotional analysis 情绪分析Empirical conditional entropy 经验条件熵Empirical entropy 经验熵Empirical error 经验误差Empirical risk 经验风险End-to-End 端到端Energy-based model 基于能量的模型Ensemble learning 集成学习Ensemble pruning 集成修剪Error Correcting Output Codes/ECOC 纠错输出码Error rate 错误率Error-ambiguity decomposition 误差-分歧分解Euclidean distance 欧⽒距离Evolutionary computation 演化计算Expectation-Maximization 期望最⼤化Expected loss 期望损失Exploding Gradient Problem 梯度爆炸问题Exponential loss function 指数损失函数Extreme Learning Machine/ELM 超限学习机Letter FFactorization 因⼦分解False negative 假负类False positive 假正类False Positive Rate/FPR 假正例率Feature engineering 特征⼯程Feature selection 特征选择Feature vector 特征向量Featured Learning 特征学习Feedforward Neural Networks/FNN 前馈神经⽹络Fine-tuning 微调Flipping output 翻转法Fluctuation 震荡Forward stagewise algorithm 前向分步算法Frequentist 频率主义学派Full-rank matrix 满秩矩阵Functional neuron 功能神经元Letter GGain ratio 增益率Game theory 博弈论Gaussian kernel function 
⾼斯核函数Gaussian Mixture Model ⾼斯混合模型General Problem Solving 通⽤问题求解Generalization 泛化Generalization error 泛化误差Generalization error bound 泛化误差上界Generalized Lagrange function ⼴义拉格朗⽇函数Generalized linear model ⼴义线性模型Generalized Rayleigh quotient ⼴义瑞利商Generative Adversarial Networks/GAN ⽣成对抗⽹络Generative Model ⽣成模型Generator ⽣成器Genetic Algorithm/GA 遗传算法Gibbs sampling 吉布斯采样Gini index 基尼指数Global minimum 全局最⼩Global Optimization 全局优化Gradient boosting 梯度提升Gradient Descent 梯度下降Graph theory 图论Ground-truth 真相/真实Letter HHard margin 硬间隔Hard voting 硬投票Harmonic mean 调和平均Hesse matrix 海塞矩阵Hidden dynamic model 隐动态模型Hidden layer 隐藏层Hidden Markov Model/HMM 隐马尔可夫模型Hierarchical clustering 层次聚类Hilbert space 希尔伯特空间Hinge loss function 合页损失函数Hold-out 留出法Homogeneous 同质Hybrid computing 混合计算Hyperparameter 超参数Hypothesis 假设Hypothesis test 假设验证Letter IICML 国际机器学习会议Improved iterative scaling/IIS 改进的迭代尺度法Incremental learning 增量学习Independent and identically distributed/i.i.d. 独⽴同分布Independent Component Analysis/ICA 独⽴成分分析Indicator function 指⽰函数Individual learner 个体学习器Induction 归纳Inductive bias 归纳偏好Inductive learning 归纳学习Inductive Logic Programming/ILP 归纳逻辑程序设计Information entropy 信息熵Information gain 信息增益Input layer 输⼊层Insensitive loss 不敏感损失Inter-cluster similarity 簇间相似度International Conference for Machine Learning/ICML 国际机器学习⼤会Intra-cluster similarity 簇内相似度Intrinsic value 固有值Isometric Mapping/Isomap 等度量映射Isotonic regression 等分回归Iterative Dichotomiser 迭代⼆分器Letter KKernel method 核⽅法Kernel trick 核技巧Kernelized Linear Discriminant Analysis/KLDA 核线性判别分析K-fold cross validation k 折交叉验证/k 倍交叉验证K-Means Clustering K – 均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base 知识库Knowledge Representation 知识表征Letter LLabel space 标记空间Lagrange duality 拉格朗⽇对偶性Lagrange multiplier 拉格朗⽇乘⼦Laplace smoothing 拉普拉斯平滑Laplacian correction 拉普拉斯修正Latent Dirichlet Allocation 隐狄利克雷分布Latent semantic analysis 潜在语义分析Latent variable 隐变量Lazy learning 懒惰学习Learner 学习器Learning by analogy 类⽐学习Learning rate 学习率Learning Vector Quantization/LVQ 
学习向量量化Least squares regression tree 最⼩⼆乘回归树Leave-One-Out/LOO 留⼀法linear chain conditional random field 线性链条件随机场Linear Discriminant Analysis/LDA 线性判别分析Linear model 线性模型Linear Regression 线性回归Link function 联系函数Local Markov property 局部马尔可夫性Local minimum 局部最⼩Log likelihood 对数似然Log odds/logit 对数⼏率Logistic Regression Logistic 回归Log-likelihood 对数似然Log-linear regression 对数线性回归Long-Short Term Memory/LSTM 长短期记忆Loss function 损失函数Letter MMachine translation/MT 机器翻译Macron-P 宏查准率Macron-R 宏查全率Majority voting 绝对多数投票法Manifold assumption 流形假设Manifold learning 流形学习Margin theory 间隔理论Marginal distribution 边际分布Marginal independence 边际独⽴性Marginalization 边际化Markov Chain Monte Carlo/MCMC 马尔可夫链蒙特卡罗⽅法Markov Random Field 马尔可夫随机场Maximal clique 最⼤团Maximum Likelihood Estimation/MLE 极⼤似然估计/极⼤似然法Maximum margin 最⼤间隔Maximum weighted spanning tree 最⼤带权⽣成树Max-Pooling 最⼤池化Mean squared error 均⽅误差Meta-learner 元学习器Metric learning 度量学习Micro-P 微查准率Micro-R 微查全率Minimal Description Length/MDL 最⼩描述长度Minimax game 极⼩极⼤博弈Misclassification cost 误分类成本Mixture of experts 混合专家Momentum 动量Moral graph 道德图/端正图Multi-class classification 多分类Multi-document summarization 多⽂档摘要Multi-layer feedforward neural networks 多层前馈神经⽹络Multilayer Perceptron/MLP 多层感知器Multimodal learning 多模态学习Multiple Dimensional Scaling 多维缩放Multiple linear regression 多元线性回归Multi-response Linear Regression /MLR 多响应线性回归Mutual information 互信息Letter NNaive bayes 朴素贝叶斯Naive Bayes Classifier 朴素贝叶斯分类器Named entity recognition 命名实体识别Nash equilibrium 纳什均衡Natural language generation/NLG ⾃然语⾔⽣成Natural language processing ⾃然语⾔处理Negative class 负类Negative correlation 负相关法Negative Log Likelihood 负对数似然Neighbourhood Component Analysis/NCA 近邻成分分析Neural Machine Translation 神经机器翻译Neural Turing Machine 神经图灵机Newton method ⽜顿法NIPS 国际神经信息处理系统会议No Free Lunch Theorem/NFL 没有免费的午餐定理Noise-contrastive estimation 噪⾳对⽐估计Nominal attribute 列名属性Non-convex optimization ⾮凸优化Nonlinear model ⾮线性模型Non-metric distance ⾮度量距离Non-negative matrix factorization ⾮负矩阵分解Non-ordinal attribute 
⽆序属性Non-Saturating Game ⾮饱和博弈Norm 范数Normalization 归⼀化Nuclear norm 核范数Numerical attribute 数值属性Letter OObjective function ⽬标函数Oblique decision tree 斜决策树Occam’s razor 奥卡姆剃⼑Odds ⼏率Off-Policy 离策略One shot learning ⼀次性学习One-Dependent Estimator/ODE 独依赖估计On-Policy 在策略Ordinal attribute 有序属性Out-of-bag estimate 包外估计Output layer 输出层Output smearing 输出调制法Overfitting 过拟合/过配Oversampling 过采样Letter PPaired t-test 成对 t 检验Pairwise 成对型Pairwise Markov property 成对马尔可夫性Parameter 参数Parameter estimation 参数估计Parameter tuning 调参Parse tree 解析树Particle Swarm Optimization/PSO 粒⼦群优化算法Part-of-speech tagging 词性标注Perceptron 感知机Performance measure 性能度量Plug and Play Generative Network 即插即⽤⽣成⽹络Plurality voting 相对多数投票法Polarity detection 极性检测Polynomial kernel function 多项式核函数Pooling 池化Positive class 正类Positive definite matrix 正定矩阵Post-hoc test 后续检验Post-pruning 后剪枝potential function 势函数Precision 查准率/准确率Prepruning 预剪枝Principal component analysis/PCA 主成分分析Principle of multiple explanations 多释原则Prior 先验Probability Graphical Model 概率图模型Proximal Gradient Descent/PGD 近端梯度下降Pruning 剪枝Pseudo-label 伪标记Letter QQuantized Neural Network 量⼦化神经⽹络Quantum computer 量⼦计算机Quantum Computing 量⼦计算Quasi Newton method 拟⽜顿法Letter RRadial Basis Function/RBF 径向基函数Random Forest Algorithm 随机森林算法Random walk 随机漫步Recall 查全率/召回率Receiver Operating Characteristic/ROC 受试者⼯作特征Rectified Linear Unit/ReLU 线性修正单元Recurrent Neural Network 循环神经⽹络Recursive neural network 递归神经⽹络Reference model 参考模型Regression 回归Regularization 正则化Reinforcement learning/RL 强化学习Representation learning 表征学习Representer theorem 表⽰定理reproducing kernel Hilbert space/RKHS 再⽣核希尔伯特空间Re-sampling 重采样法Rescaling 再缩放Residual Mapping 残差映射Residual Network 残差⽹络Restricted Boltzmann Machine/RBM 受限玻尔兹曼机Restricted Isometry Property/RIP 限定等距性Re-weighting 重赋权法Robustness 稳健性/鲁棒性Root node 根结点Rule Engine 规则引擎Rule learning 规则学习Letter SSaddle point 鞍点Sample space 样本空间Sampling 采样Score function 评分函数Self-Driving ⾃动驾驶Self-Organizing Map/SOM ⾃组织映射Semi-naive Bayes classifiers 半朴素贝叶斯分类器Semi-Supervised 
Learning 半监督学习semi-Supervised Support Vector Machine 半监督⽀持向量机Sentiment analysis 情感分析Separating hyperplane 分离超平⾯Sigmoid function Sigmoid 函数Similarity measure 相似度度量Simulated annealing 模拟退⽕Simultaneous localization and mapping 同步定位与地图构建Singular Value Decomposition 奇异值分解Slack variables 松弛变量Smoothing 平滑Soft margin 软间隔Soft margin maximization 软间隔最⼤化Soft voting 软投票Sparse representation 稀疏表征Sparsity 稀疏性Specialization 特化Spectral Clustering 谱聚类Speech Recognition 语⾳识别Splitting variable 切分变量Squashing function 挤压函数Stability-plasticity dilemma 可塑性-稳定性困境Statistical learning 统计学习Status feature function 状态特征函Stochastic gradient descent 随机梯度下降Stratified sampling 分层采样Structural risk 结构风险Structural risk minimization/SRM 结构风险最⼩化Subspace ⼦空间Supervised learning 监督学习/有导师学习support vector expansion ⽀持向量展式Support Vector Machine/SVM ⽀持向量机Surrogat loss 替代损失Surrogate function 替代函数Symbolic learning 符号学习Symbolism 符号主义Synset 同义词集Letter TT-Distribution Stochastic Neighbour Embedding/t-SNE T – 分布随机近邻嵌⼊Tensor 张量Tensor Processing Units/TPU 张量处理单元The least square method 最⼩⼆乘法Threshold 阈值Threshold logic unit 阈值逻辑单元Threshold-moving 阈值移动Time Step 时间步骤Tokenization 标记化Training error 训练误差Training instance 训练⽰例/训练例Transductive learning 直推学习Transfer learning 迁移学习Treebank 树库Tria-by-error 试错法True negative 真负类True positive 真正类True Positive Rate/TPR 真正例率Turing Machine 图灵机Twice-learning ⼆次学习Letter UUnderfitting ⽋拟合/⽋配Undersampling ⽋采样Understandability 可理解性Unequal cost ⾮均等代价Unit-step function 单位阶跃函数Univariate decision tree 单变量决策树Unsupervised learning ⽆监督学习/⽆导师学习Unsupervised layer-wise training ⽆监督逐层训练Upsampling 上采样Letter VVanishing Gradient Problem 梯度消失问题Variational inference 变分推断VC Theory VC维理论Version space 版本空间Viterbi algorithm 维特⽐算法Von Neumann architecture 冯 · 诺伊曼架构Letter WWasserstein GAN/WGAN Wasserstein⽣成对抗⽹络Weak learner 弱学习器Weight 权重Weight sharing 权共享Weighted voting 加权投票法Within-class scatter matrix 类内散度矩阵Word embedding 词嵌⼊Word sense disambiguation 词义消歧Letter ZZero-data learning 零数据学习Zero-shot learning 
零次学习Aapproximations近似值arbitrary随意的affine仿射的arbitrary任意的amino acid氨基酸amenable经得起检验的axiom公理,原则abstract提取architecture架构,体系结构;建造业absolute绝对的arsenal军⽕库assignment分配algebra线性代数asymptotically⽆症状的appropriate恰当的Bbias偏差brevity简短,简洁;短暂broader⼴泛briefly简短的batch批量Cconvergence 收敛,集中到⼀点convex凸的contours轮廓constraint约束constant常理commercial商务的complementarity补充coordinate ascent同等级上升clipping剪下物;剪报;修剪component分量;部件continuous连续的covariance协⽅差canonical正规的,正则的concave⾮凸的corresponds相符合;相当;通信corollary推论concrete具体的事物,实在的东西cross validation交叉验证correlation相互关系convention约定cluster⼀簇centroids 质⼼,形⼼converge收敛computationally计算(机)的calculus计算Dderive获得,取得dual⼆元的duality⼆元性;⼆象性;对偶性derivation求导;得到;起源denote预⽰,表⽰,是…的标志;意味着,[逻]指称divergence 散度;发散性dimension尺度,规格;维数dot⼩圆点distortion变形density概率密度函数discrete离散的discriminative有识别能⼒的diagonal对⾓dispersion分散,散开determinant决定因素disjoint不相交的Eencounter遇到ellipses椭圆equality等式extra额外的empirical经验;观察ennmerate例举,计数exceed超过,越出expectation期望efficient⽣效的endow赋予explicitly清楚的exponential family指数家族equivalently等价的Ffeasible可⾏的forary初次尝试finite有限的,限定的forgo摒弃,放弃fliter过滤frequentist最常发⽣的forward search前向式搜索formalize使定形Ggeneralized归纳的generalization概括,归纳;普遍化;判断(根据不⾜)guarantee保证;抵押品generate形成,产⽣geometric margins⼏何边界gap裂⼝generative⽣产的;有⽣产⼒的Hheuristic启发式的;启发法;启发程序hone怀恋;磨hyperplane超平⾯Linitial最初的implement执⾏intuitive凭直觉获知的incremental增加的intercept截距intuitious直觉instantiation例⼦indicator指⽰物,指⽰器interative重复的,迭代的integral积分identical相等的;完全相同的indicate表⽰,指出invariance不变性,恒定性impose把…强加于intermediate中间的interpretation解释,翻译Jjoint distribution联合概率Llieu替代logarithmic对数的,⽤对数表⽰的latent潜在的Leave-one-out cross validation留⼀法交叉验证Mmagnitude巨⼤mapping绘图,制图;映射matrix矩阵mutual相互的,共同的monotonically单调的minor较⼩的,次要的multinomial多项的multi-class classification⼆分类问题Nnasty讨厌的notation标志,注释naïve朴素的Oobtain得到oscillate摆动optimization problem最优化问题objective function⽬标函数optimal最理想的orthogonal(⽮量,矩阵等)正交的orientation⽅向ordinary普通的occasionally偶然的Ppartial 
derivative偏导数property性质proportional成⽐例的primal原始的,最初的permit允许pseudocode伪代码permissible可允许的polynomial多项式preliminary预备precision精度perturbation 不安,扰乱poist假定,设想positive semi-definite半正定的parentheses圆括号posterior probability后验概率plementarity补充pictorially图像的parameterize确定…的参数poisson distribution柏松分布pertinent相关的Qquadratic⼆次的quantity量,数量;分量query疑问的Rregularization使系统化;调整reoptimize重新优化restrict限制;限定;约束reminiscent回忆往事的;提醒的;使⼈联想…的(of)remark注意random variable随机变量respect考虑respectively各⾃的;分别的redundant过多的;冗余的Ssusceptible敏感的stochastic可能的;随机的symmetric对称的sophisticated复杂的spurious假的;伪造的subtract减去;减法器simultaneously同时发⽣地;同步地suffice满⾜scarce稀有的,难得的split分解,分离subset⼦集statistic统计量successive iteratious连续的迭代scale标度sort of有⼏分的squares平⽅Ttrajectory轨迹temporarily暂时的terminology专⽤名词tolerance容忍;公差thumb翻阅threshold阈,临界theorem定理tangent正弦Uunit-length vector单位向量Vvalid有效的,正确的variance⽅差variable变量;变元vocabulary词汇valued经估价的;宝贵的Wwrapper包装分类:。

Comparison of PCA and LDA Classification Algorithms in Data Mining

Li Hua, Huang Huamei
(Nanning Normal University, Nanning Guangxi 530001, China)
China Computer & Communication (信息与电脑), 2019, No. 6, Algorithms and Languages section.

Abstract: Classification is an important data mining problem. Its general process is to input data first, then use a classification algorithm to obtain classification rules, and finally assign new data to classes. Two simple dimension reduction algorithms used for classification are introduced in detail: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Comparing the two shows that LDA is a supervised dimension reduction method that chooses the projection direction with the best classification performance, while PCA is an unsupervised dimension reduction method that chooses the direction along which the projected sample points have the largest variance.

Key words: PCA; LDA; classification
CLC number: TP306. Document code: A. Article ID: 1003-9767(2019)06-046-03

1 Background

With the continuing development of database applications, databases keep growing in size, and data mining has become a much-discussed topic.
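The contrast drawn in the abstract can be seen on a small two-dimensional example. The sketch below is an added illustration, not code from the paper: the data points are invented, and the helper names `pca_direction` and `lda_direction` are hypothetical. PCA is computed from the total scatter matrix without labels, LDA as the two-class Fisher direction $S_w^{-1}(m_b - m_a)$.

```python
import math

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, center):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around `center`."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    return sxx, sxy, syy

def pca_direction(points):
    """Unit vector of largest variance; class labels are not used."""
    sxx, sxy, syy = scatter(points, mean(points))
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # principal axis of a symmetric 2x2 matrix
    return [math.cos(theta), math.sin(theta)]

def lda_direction(class_a, class_b):
    """Two-class Fisher direction: inverse within-class scatter times mean difference."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sxx, sxy, syy = sa[0] + sb[0], sa[1] + sb[1], sa[2] + sb[2]
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return [wx / norm, wy / norm]

# Invented data: both classes spread widely along x but are separated along y.
class_a = [(-4, 0.0), (-2, 0.1), (0, -0.1), (2, 0.1), (4, -0.1)]
class_b = [(-4, 1.0), (-2, 1.1), (0, 0.9), (2, 1.1), (4, 0.9)]
print("PCA direction:", pca_direction(class_a + class_b))
print("LDA direction:", lda_direction(class_a, class_b))
```

On this data PCA picks (almost) the x axis, where the variance is largest but the classes overlap completely, while LDA picks (almost) the y axis, where the classes separate.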

Extreme Learning Machine Based on Clustering and Under-Sampling

XU Lili; YAN Deqin; GAO Qing

Abstract: To address the unsatisfactory performance of the Extreme Learning Machine (ELM) on imbalanced datasets, an ELM algorithm based on clustering and under-sampling is proposed. First, the algorithm clusters the negative samples of the training set into different clusters. Second, it under-samples every cluster at a specified sampling rate, and the sampled data make up a new negative dataset, so that the numbers of positive and negative samples in the training set become relatively balanced. Finally, it trains the classifier and predicts on the test set. The experimental results show that the algorithm effectively reduces the influence of imbalanced data on classification accuracy and has better classification performance.
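The first two steps of the procedure (cluster the negative samples, then draw from every cluster at a fixed rate) can be sketched as follows. This is a rough illustration under assumptions: the abstract does not name the clustering method, so plain k-means is used here, and the names `kmeans` and `cluster_undersample`, the rounding rule, and the sample data are hypothetical. The ELM classifier itself is not shown.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of floats; returns a list of clusters (lists of points)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return clusters

def cluster_undersample(negatives, k, rate, seed=0):
    """Keep roughly `rate` of the negative samples, drawn separately from each cluster."""
    rng = random.Random(seed)
    sampled = []
    for members in kmeans(negatives, k, seed=seed):
        if not members:
            continue
        n_keep = max(1, round(rate * len(members)))  # at least one sample per cluster
        sampled.extend(rng.sample(members, n_keep))
    return sampled

# Hypothetical data: 100 negative samples in two groups, to be matched to about 10 positives.
negatives = [(i * 0.01, 0.0) for i in range(50)] + [(5 + i * 0.01, 1.0) for i in range(50)]
reduced = cluster_undersample(negatives, k=2, rate=0.1)
print(len(reduced))
```

Sampling per cluster rather than from the whole negative class keeps every region of the negative distribution represented in the reduced training set.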

A GEPSVM Algorithm Based on SMOTE in the Application of Imbalanced Data Classification

LIN Jian 1, GUO Jian-hui 1,2, SHAO Qing-wei 1, ZHANG Min-yi 1
(1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; 2. The 28th Institute of China Electronics Technology Group Corporation, Nanjing 210007, China)
Information Technology (信息技术), 2017, No. 1. Article ID: 1009-2552(2017)01-0005-04. DOI: 10.13274/j.cnki.hdzj.2017.01.002

Abstract: This paper proposes a GEPSVM algorithm based on the SMOTE over-sampling method to address the skewed classification results caused by imbalanced data. The algorithm uses SMOTE over-sampling to reconstruct the training set, so that the training set becomes relatively balanced and the over-fitting problem caused by repeated sample data is avoided. Finally, it uses GEPSVM for classification learning. Experiments on UCI datasets demonstrate that, on imbalanced datasets, the proposed algorithm achieves better classification results and requires shorter computation time than the traditional SVM algorithm.

Key words: imbalanced data classification; over-sampling; support vector machine; generalized eigenvalues
CLC number: TP181. Document code: A.

0 Introduction

SVM (Support Vector Machine) is a large-margin algorithm proposed by Vapnik et al. in 1995.
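The reason SMOTE avoids repeated samples is that it interpolates between minority samples rather than copying them. The sketch below illustrates that interpolation step only; it is an added example with hypothetical names (`smote`, `n_new`, `k`) and invented data, and it does not include the GEPSVM classifier.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Invented minority class: four points on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so the over-sampled training set gains new minority examples without exact duplicates of the originals (except in the unlikely case that the random gap is exactly zero).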

Research and Application of a Clustering-Based Undersampling Algorithm for Imbalanced Data

Contents

Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Background and significance of the research
1.2 Fault detection problems
1.3 Research status of imbalanced classification methods
1.3.1 Overview of imbalanced classification
1.3.2 Classification algorithms based on resampling
1.3.3 Classification algorithms based on algorithm improvement
1.4 Work and organizational structure of this thesis
1.4.1 Main work
1.4.2 Organizational structure
Chapter 2 Overview of sampling and classification algorithms
2.1 Data sampling algorithms
2.1.1 SMOTE oversampling
2.1.2 Tomek links undersampling
2.1.3 K-Means undersampling
2.1.4 Random undersampling
2.2 Data classification methods
2.3 Evaluation criteria
2.4 Chapter summary
Chapter 3 Undersampling algorithm for imbalanced data based on density clustering
3.1 Undersampling based on data density distribution
3.1.1 Idea of the undersampling algorithm based on density peak clustering
3.1.2 US-DP algorithm description
3.2 Experimental analysis
3.2.1 Experimental datasets
3.2.2 Experimental results and analysis
3.3 Chapter summary
Chapter 4 Application of imbalanced data classification methods in a fault detection system
4.1 Overall architecture of the fault detection system
4.2 Module analysis of the fault detection system
4.2.1 User login module
4.2.2 Data preprocessing module
4.2.3 Data analysis module
4.2.4 Result visualization module
4.3 System display
4.4 Chapter summary
Chapter 5 Summary and prospects
5.1 Summary
5.2 Future work
References
Research achievements during the degree program
Acknowledgments
Personal profile and contact information
Letter of commitment
Authorization statement for use of the thesis

Chinese Abstract

Imbalanced data classification is an important research direction in machine learning and pattern recognition, with wide application value in fields such as fraud detection and medical diagnosis.


Imbalanced Classification Algorithm in Botnet Detection

Yun Yang 1, Guyu Hu 2
Institute of Command Automation, PLAUST, Nanjing 210007, China
e-mail: yyhacker@

Shize Guo 3
The Institute of North Electronic Equipment, Beijing 100083, China
e-mail: ferret.yang@

Jun Luo 4
Naval Command College, Nanjing 211800, China
e-mail: everlastingstory@

Abstract: An imbalanced classification anomaly detection algorithm called "I-SVDD" for detecting botnets is put forward in this paper. The algorithm combines one-class classification with known intrusion behaviors. The algorithm has proven effective in reducing the number of botnet clients. The true positive rate reaches nearly 100% and the false positive rate reaches 0%. Adjusting some parameters can further improve the false positive rate. Using imbalanced classification methods in anomaly detection may therefore be a future direction in the pervasive computing area.

Keywords: Imbalanced Classification; Anomaly Detection; Botnet

I. INTRODUCTION

The continued growth of the Internet has been accompanied by an increasing prevalence of attacks and intrusions, and over the past several years a significant change in the motivation for malicious activity has taken place in the form of the "botnet". Bots are computers infected with a malicious program that makes them carry out operations without their owners' knowledge. Bots communicate with and take orders from their "masters". Distributed networks of bots perform coordinated attacks, and on any given day many millions of bots on the Internet are organized into thousands of botnets. It is clear that botnets have become the most serious security threat on the Internet.

At the same time, pervasive computing is a rapidly developing area of information and communications technology. Pervasive computing has many potential applications, from health and home care to intelligent intrusion detection systems.
In security problems, One-Class classification has proved effective in detecting malware behaviors, but the known abnormal behaviors accumulated in security research are of little help to it. The situation is the same in botnet detection. In this paper we try to solve this by improving the original classification model with Imbalanced Learning methods, in which two kinds of data are studied: normal data and known botnet samples.

The remainder of this paper is organized as follows. Sections II and III present the original anomaly detection model and the improved model, together with their performance evaluation. Section IV gives some brief concluding remarks.

II. ANOMALY DETECTION MODEL

When a host has been infected with malicious programs, its system execution or services will behave anomalously, so an effective way to detect a botnet is an anomaly detection system based on audit information. According to Forrest's research [1], the key behavior of an application can be described by the sequence of system calls used during execution. It has also been shown that the valid behaviors of a simple application can be described by short sequences, which are partial patterns in the execution trace. By comparing against the pool of short sequences collected in normal mode, we can determine whether the current process is running normally or abnormally. If abnormal sequences appear frequently within a fixed monitoring time, the system may be infected.

A. SVDD Classification Algorithm

The Support Vector Data Description (SVDD), which was first put forward by David M. J. Tax, uses the kernel method to map the data to the kernel feature space [2]. This mapping forms a hypersphere in the kernel space that includes almost all the training data.

Figure 1. Block Diagram: SVDD Detection Model.

2010 First International Conference on Pervasive Computing, Signal Processing and Applications

A new sample is recognized as normal only if it falls inside the hypersphere after kernel mapping. Linear programming and novelty detection methods [3] are included in this algorithm. This mapping yields more flexible descriptions. It will be shown how the outlier sensitivity can be controlled in a flexible way. Finally, an extra option is added to include example outliers in the training procedure (when they are available) to find a more efficient description.

Analogous to the support vector classifier, we define the structural error:

$$ F(R, a) = R^2 \tag{1} $$

which has to be minimized with the constraints:

$$ \|x_i - a\|^2 \le R^2, \quad \forall i \tag{2} $$

To allow the possibility of outliers in the training set, and therefore to make the method more robust, the distance from objects $x_i$ to the center $a$ should not be strictly smaller than $R^2$, but larger distances should be penalized. This means that the empirical error does not have to be 0 by definition. We introduce slack variables $\xi_i \ge 0, \forall i$, and the minimization problem changes into:

$$ F(R, a, \xi) = R^2 + C \sum_i \xi_i \tag{3} $$

with constraints that (almost) all objects are within the sphere:

$$ \|x_i - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i \tag{4} $$

The parameter $C$ gives the tradeoff between the volume of the description and the errors. The free parameters $a$, $R$ and $\xi$ have to be optimized, taking the constraints (4) into account. Constraints (4) can be incorporated into formula (3) by introducing Lagrange multipliers and constructing the Lagrangian:

$$ L(R, a, \alpha_i, \gamma_i, \xi_i) = R^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ R^2 + \xi_i - \left( x_i \cdot x_i - 2\, a \cdot x_i + a \cdot a \right) \right\} - \sum_i \gamma_i \xi_i \tag{5} $$

with the Lagrange multipliers $\alpha_i \ge 0$ and $\gamma_i \ge 0$, where $x_i \cdot x_j$ stands for the inner product between $x_i$ and $x_j$. Note that for each object $x_i$ a corresponding $\alpha_i$ and $\gamma_i$ are defined. $L$ has to be minimized with respect to $R$, $a$ and $\xi_i$, and maximized with respect to $\alpha_i$ and $\gamma_i$.
Setting partial derivatives to 0 gives the constraints:

$$ \sum_i \alpha_i = 1, \qquad a = \sum_i \alpha_i x_i, \qquad C - \alpha_i - \gamma_i = 0, \quad \forall i \tag{6} $$

This results in the final error $L$:

$$ L = \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \tag{7} $$

A test object $z$ is accepted when its distance to the center of the sphere is smaller than or equal to the radius:

$$ \|z - a\|^2 = (z \cdot z) - 2 \sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le R^2 \tag{8} $$

By definition, $R^2$ is the squared distance from the center of the sphere $a$ to one of the support vectors on the boundary:

$$ R^2 = (x_k \cdot x_k) - 2 \sum_i \alpha_i (x_i \cdot x_k) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \tag{9} $$

for any $x_k \in SV_{bnd}$, that is, the set of support vectors for which $0 < \alpha_k < C$. We will call this classifier the support vector data description (SVDD). It can now be written as:

$$ f_{SVDD}(z; \alpha, R) = I\left( \|z - a\|^2 \le R^2 \right) = I\left( (z \cdot z) - 2 \sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le R^2 \right) \tag{10} $$

where the indicator function $I$ is defined as:

$$ I(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{otherwise} \end{cases} \tag{11} $$

B. Experimental Design and Results

The execution of a given host can be monitored by the configured audit system [10] in order to gather the required sequence of system calls. The Short Sequences of System Calls (SSC), which represent the pattern of application behavior, are produced with the K-sized Sliding Window technique.

Sliding-window slicing produces a large number of short sequences of system calls, which may be stored in the Security Audit Database for further processing. Generally speaking, many repetitive short sequences are produced because most user applications use the same system calls repeatedly. Data reduction is therefore the first job before the classifier is trained: redundant short sequences are removed to avoid extra computation.

We have used Portland State University's botnet data sets for our current study. They sort the channels by the number of host scanners, producing a sorted list of potential botnets. Table I summarizes the different sets and the programs from which they were collected.

TABLE I. AMOUNT OF DATA IN DATASET
Channel     Botnet Infected                 Normal Data
            No. of trace    No. of SSC      No. of trace    No. of SSC
Ubuntu-1    1282            11598           6494            28152
Ubuntu-2    622             7265            5086            17218

General Settings:
Use the Ubuntu-1 data set for training and Ubuntu-2 for validation.
The threshold used in anomaly detection ranges from 9 to 35; the average result is used.

In this section, the representative normal dataset is used to study how different kernel parameter values change the result. The parameters are set as follows:
(1) Sliding Window size K = 6 (to ensure real-time detection);
(2) Kernel parameter σ = 15 to 20.

Figure 2. Classification Results: Different σ.

Note that the true positive rate grows from 90.15% to 99.70% as σ goes from 15 to 19, but drops to 95.48% when σ = 20, while the false positive rate falls from 15.4% to 1.24%, but rises to 3.60% when σ = 20. It follows that σ is a vital parameter which directly affects the classification result. The bigger it is, the more accurately it represents the pattern of application behavior, but of course the more computation it requires. The parameters are therefore chosen as:
(1) Sliding Window size K = 6, to ensure real-time detection;
(2) Kernel parameter σ = 19.

Figure 3. Classification Results: σ = 19.

As shown in Fig. 2, we use Ubuntu-2 for validation. Within the response time, σ = 19 gives the dataset the maximum true positive rate and the minimum false positive rate, and the Ubuntu-2 detection results are almost the same as the former even though the classifier was trained on Ubuntu-1. Note that the result for Ubuntu-1 is a bit better than for Ubuntu-2, because the Ubuntu-1 data set includes fifteen weeks of activity, while the Ubuntu-2 set includes just four weeks, so the application pattern captured in the Ubuntu-1 dataset is more accurate. The zero false positive rate of Ubuntu-2 may be due to its small data size.

III. IMBALANCED CLASSIFICATION METHOD

In the last section, only normal data was used for training, which is the One-Class classification method in Pervasive Computing.
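The one-class training input described in Section II.B (sliding-window slicing with K = 6, followed by data reduction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes system calls are encoded as integer IDs, and the function names `short_sequences`, `build_pool` and `count_mismatches` are hypothetical. The mismatch count is a simple pool lookup in the spirit of the short-sequence comparison described in Section II, not the SVDD classifier itself.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Assumption: each system call is encoded as an integer ID.
ShortSeq = Tuple[int, ...]


def short_sequences(trace: Sequence[int], k: int = 6) -> List[ShortSeq]:
    """Slide a window of size k over one system-call trace."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A trace shorter than k yields no windows (empty range).
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_pool(traces: Iterable[Sequence[int]], k: int = 6) -> Set[ShortSeq]:
    """Data reduction: keep each distinct short sequence only once."""
    pool: Set[ShortSeq] = set()
    for trace in traces:
        pool.update(short_sequences(trace, k))
    return pool


def count_mismatches(trace: Sequence[int], pool: Set[ShortSeq], k: int = 6) -> int:
    """Count windows of a monitored trace that are absent from the normal pool."""
    return sum(1 for window in short_sequences(trace, k) if window not in pool)
```

Storing the pool as a set makes the data reduction step implicit: repeated short sequences collapse to a single entry, and lookups during monitoring stay cheap.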
As botnets have been studied further, more and more samples of infected system execution have been identified, and these samples can significantly affect the detection result. If the classifier learns from these infected samples as well as from the normal data, which is called Imbalanced Classification, the false positive rate can be reduced compared with the original SVDD.

A. I-SVDD Classification Algorithm

Weiss [11] gave a detailed analysis of the effects that the imbalanced data problem has on a classifier, including inappropriate performance evaluation criteria, generalization bias, and relative scarcity of samples. However, few researchers have gone on to solve the class imbalance problem. Combined with the data pre-processing procedure, the Imbalanced SVDD Detection (I-SVDD) model is established as in the following block diagram.

Figure 4. Block Diagram: I-SVDD Detection Model.

To allow abnormal data to be brought into the algorithm, a dual-SVDD method is used. The normal data and the abnormal data are classified separately, and a new classification plane $I_{new}$ is then generated by taking the negation $I_a'$ of the abnormal classification plane and multiplying it by the normal plane $I_n$:

$$ I_{new} = I_n \times I_a' \tag{12} $$

The new plane can be used for detection, and the experiments below show its performance.

B. Experimental Design and Results

In this section, the representative dataset is used to evaluate the classification performance of the I-SVDD model, because it produced the final result reported for the original SVDD model. Parameters are set as follows:

General Settings:
(1) Sliding Window size K = 6;
(2) Kernel parameter σ = 19.

As shown in Fig. 5, this detection model is robust: the average true positive rate over all the datasets is close to 100% while the average false positive rate is close to 0%, so the normal traces can be almost completely distinguished from the abnormal ones. Note that K = 6 in this experiment, so this model is also suitable for real-time detection.
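The combination rule in (12) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads $I_a'$ as the logical negation $1 - I_a$ of the abnormal indicator, it uses a Gaussian kernel of the form $\exp(-\|x - y\|^2 / \sigma^2)$ in place of the inner products, and it takes the support vectors, multipliers and squared radius of each SVDD as already trained. The names `SvddModel`, `svdd_indicator` and `i_svdd_indicator` are hypothetical.

```python
import math
from typing import NamedTuple, Sequence

Vector = Sequence[float]


class SvddModel(NamedTuple):
    """A trained SVDD (hypothetical container; training is not shown)."""
    support: Sequence[Vector]  # support vectors x_i
    alpha: Sequence[float]     # Lagrange multipliers alpha_i, summing to 1
    r2: float                  # squared radius R^2
    sigma: float               # Gaussian kernel width (assumed kernel form)


def gaussian_kernel(x: Vector, y: Vector, sigma: float) -> float:
    """Assumed kernel: exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)


def svdd_indicator(z: Vector, m: SvddModel) -> int:
    """Return 1 if z lies inside the kernel hypersphere of m, else 0."""
    cross = sum(a * gaussian_kernel(z, x, m.sigma)
                for a, x in zip(m.alpha, m.support))
    const = sum(ai * aj * gaussian_kernel(xi, xj, m.sigma)
                for ai, xi in zip(m.alpha, m.support)
                for aj, xj in zip(m.alpha, m.support))
    dist2 = gaussian_kernel(z, z, m.sigma) - 2.0 * cross + const
    return 1 if dist2 <= m.r2 else 0


def i_svdd_indicator(z: Vector, normal: SvddModel, abnormal: SvddModel) -> int:
    """I_new = I_n * (1 - I_a): normal only if inside the normal sphere
    and outside the abnormal sphere (assumed reading of the negation)."""
    return svdd_indicator(z, normal) * (1 - svdd_indicator(z, abnormal))
```

Under this reading, a sample that the normal description would accept is still rejected when it also falls inside the description of known botnet samples, which is one way the dual-SVDD scheme could lower the false positive rate.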
Compared with the results in Section II, the true positives increased by 12% and 17% while the false positives decreased by 8% and 14%.

Figure 5. Comparison: I-SVDD vs. SVDD.

As shown in Fig. 6, because different samples are labeled with different weights, the result is better than with the original SVDD: the true and false positive levels improve by 0.22%, 1.00% and 0.45%, and in particular the true positive rate reaches 100%.

Figure 6. I-SVDD Classification Results: σ = 19.

The algorithm is the same in these two models, but the results are quite different.

IV. CONCLUSIONS

The botnet detection model based on the SVDD One-Class classification method avoids the complex work of a great number of abstraction and matching operations. The algorithm also enables the security audit system to detect new anomalous behaviors. Based on the Imbalanced Classification method, the I-SVDD method is put forward in this paper. Experiments show that, by paying more attention to the samples that are known to be infected by botnets or malware, the system users and other processes, both the improvements to the algorithm and the further reduction of the input data raise the performance. The effect of some parameters, such as the kernel parameter σ, is also considered in the new models.

Experiments using the Portland botnet datasets show that using the Imbalanced method in botnet anomaly detection may be a future direction in the Pervasive Computing area. Designing more effective methods for Imbalanced Classification in anomaly detection is another idea to be studied further in the future.

REFERENCES

[1] Hofmeyr, S., Forrest, S.: "Principles of a computer immune system". In: Proceedings of the New Security Paradigms Workshop, pp. 75-82, 1997.
[2] Forrest, S., Hofmeyr, S.A.: "Computer Immunology". Communications of the ACM, pp. 88-96, June 1997.
[3] David, M.J.T.: "One-class Classification". Ph.D. Dissertation, pp. 27-35, 1999.
[4] E. Cooke, F. Jahanian, and D. McPherson, "The zombie roundup: Understanding, detecting and disrupting botnets". In Proceedings of the Usenix Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI '05), Cambridge, MA, July 2005.
[5] James R. Binkley, Suresh Singh, "An Algorithm for Anomaly-based Botnet Detection". In Proceedings of the Usenix Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI '06), Cambridge, MA, July 2006.
[6] J. Binkley, B. Massey, "Ourmon and Network Monitoring Performance". Proceedings of the Spring 2005 USENIX Conference, Freenix track, Anaheim, April 2005.
[7] Tax D., Duin R., "Support vector data description". Machine Learning, pp. 45-66, May 2004.
[8] CERT Advisory CIAD-2004-10, Multiple Vulnerabilities in Microsoft Products, /advisories/ciad-2004-10.htm, April 2004.
[9] P. Barford, V. Yegneswaran, "An Inside Look at Botnets", Special Workshop on Malware Detection, Advances in Information Security, Springer Verlag, pp. 3-7, July 2006.
[10] R. Pang and V. Paxson, "A High-level Programming Environment for Packet Trace Anonymization and Transformation". In SIGCOMM '03: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, September 2003.
[11] Weiss G.M., "Mining with rarity: a unifying framework", ACM SIGKDD Explorations, pp. 7-19, May 2004.
