Applications of String Kernels in SVMs


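HALCON create_class_svm parameters

The HALCON operator create_class_svm takes a set of parameters that define the feature dimension of the SVM classifier, the kernel type, and other related settings. In the HALCON operator reference its inputs are NumFeatures, KernelType, KernelParam, Nu, NumClasses, Mode, Preprocessing, and NumComponents, and it returns SVMHandle.

The parameters commonly discussed in connection with it are described briefly below.

NumFeatures: the number of features. This parameter specifies how many features each input sample contains.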

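NumSupportVectors: the number of support vectors. This is not an input of create_class_svm: the number of support vectors is a result of training, so HALCON determines it automatically rather than accepting it as a setting.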

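KernelType: the kernel function type. An SVM can use different kernels, such as a linear kernel, a polynomial kernel, or a radial basis function (RBF) kernel, and this parameter selects which one is used.

KernelParam: the kernel parameter. Some kernels need an additional parameter that defines their behavior, for example the width parameter γ of the RBF kernel or the degree of a polynomial kernel. This parameter supplies that value.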

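ClassLabels: the class labels. create_class_svm itself only takes the number of classes (NumClasses); the label of each sample is supplied together with the sample.

TrainData: the training data, that is, the sample feature vectors used to train the SVM classifier.

TrainLabels: the training labels, that is, the class label belonging to each training sample. In HALCON, training samples and their labels are not passed to create_class_svm; they are added afterwards with add_sample_class_svm, and the classifier is then trained with train_class_svm.

GenParamName, GenParamValue: optional generic name/value pairs of the kind HALCON operators use for additional settings. Settings such as the stopping criterion belong to the training step rather than to create_class_svm.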

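When using create_class_svm, adjust these parameters to your specific task and data set. Choosing an appropriate kernel and kernel parameter is usually the key step in training an effective SVM classifier. You can also use techniques such as cross-validation to evaluate classifier performance under different parameter settings and select the best configuration.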

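fitcsvm parameters

In MATLAB, the `fitcsvm` function trains a support vector machine (SVM) classifier for one-class and two-class problems. Its general call form is:

```matlab
SVMModel = fitcsvm(X, Y, 'ParameterName', ParameterValue, ...)
```

Here `X` is the feature matrix of the training data and `Y` is the label vector of the training data. The behavior of the training is controlled through `'ParameterName', ParameterValue` pairs.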

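Some commonly used parameters are:

1. 'KernelFunction' (default 'linear' for two-class learning): the type of kernel function. Common options are:
   - 'linear': linear kernel.
   - 'rbf' or 'gaussian': Gaussian radial basis function kernel.
   - 'polynomial': polynomial kernel.
2. 'BoxConstraint': controls the strength of the soft margin. A larger value penalizes misclassified training points more heavily, which tends to narrow the margin.
3. 'KernelScale': the kernel scale parameter. The predictors are divided by this value before the kernel is evaluated.
4. 'Standardize' (default false): whether to standardize the input data before training.
5. 'ClassNames': the names of the class labels.
6. 'KernelOffset': a nonnegative offset added to each element of the Gram (kernel) matrix.
7. 'Prior': the prior probability of each class.
8. 'Cost': the misclassification cost for the different classes.
9. 'Nu': used in one-class learning, where it controls the trade-off between keeping most training examples in the positive class and keeping the weights of the score function small.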
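To make the role of 'KernelScale' concrete, here is a minimal Python sketch (not MATLAB code). It assumes the convention that the predictors are divided by the kernel scale `s` before a Gaussian kernel exp(-||x - z||^2) is applied; under that assumption the scale corresponds to `gamma = 1 / s**2` in scikit-learn's RBF kernel. The helper name `gamma_from_kernel_scale` is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def gamma_from_kernel_scale(kernel_scale):
    """Hypothetical helper: RBF gamma equivalent to dividing inputs by kernel_scale."""
    return 1.0 / kernel_scale ** 2


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
s = 2.5  # assumed kernel scale

# Gaussian kernel evaluated on predictors that were first divided by s
diff = X[:, None, :] / s - X[None, :, :] / s
K_scaled = np.exp(-(diff ** 2).sum(axis=-1))

# The same Gram matrix from scikit-learn's RBF kernel with the matching gamma
K_gamma = rbf_kernel(X, gamma=gamma_from_kernel_scale(s))

print(np.allclose(K_scaled, K_gamma))
```

A larger kernel scale therefore means a smaller gamma, that is, a smoother decision boundary.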

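Data Mining and Data Analysis (Finance and Accounting): Support Vector Machines (SVM) and Applications

SVM principles and basic concepts

The problem

Given two groups of data points in the plane, how can they be separated so that new data arriving later can be assigned to one side or the other?

1. What is an SVM

An SVM (support vector machine, also known as a support vector network) is a binary classification model; its basic model is defined …

2. Basic SVM concepts

2.1 Distance from a point to the separating surface

Let the plane (line) be $g(x) = w^T x + w_0 = 0$, where $w$ is the normal of the plane (line) and $w_0$ is the offset (often written $b$). Write a point $x$ as

$$x = x_p + r\,\frac{w}{\|w\|},$$

where $x_p$ is the component of $x$ in the plane and $r$ is the distance to be found. Then

$$g(x) = w^T x_p + w_0 + r\,\frac{w^T w}{\|w\|}.$$

Because $x_p$ lies in the plane, $w^T x_p + w_0 = 0$, and since $w^T w = \|w\|^2$ the expression becomes

$$g(x) = r\,\|w\|, \qquad\text{so}\qquad r = \frac{g(x)}{\|w\|} = \frac{w^T x + w_0}{\|w\|}.$$

A distance should be positive, but this value can come out positive or negative, so the absolute value is needed:

$$r = \frac{|w^T x + w_0|}{\|w\|}.$$

The absolute value cannot be differentiated, so a constraint is added instead. When $g(x) > 0$ we set $y = 1$, and when $g(x) < 0$ we set $y = -1$; that is, $y\,(w^T x + w_0) > 0$ with $y \in \{1, -1\}$. The distance from a point to the plane then becomes

$$r = \frac{y\,(w^T x + w_0)}{\|w\|}.$$

Summary

Suppose the line (plane) has the equation $w^T x + w_0 = 0$ and we are given a point set $\{x_1, x_2, \ldots, x_n\}$. Which points are closest to the line? By geometry, the points that minimize $|w^T x + w_0|$ are the ones closest to the plane.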
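The distance formula can be checked numerically. The sketch below is illustrative only: the hyperplane coefficients `w`, `w0` and the sample points are made-up values, and `signed_distance` is a hypothetical helper name. It computes the signed distance, derives the labels from its sign, and picks the point closest to the plane.

```python
import numpy as np


def signed_distance(w, w0, x):
    """Signed distance (w^T x + w0) / ||w|| of point(s) x from the plane."""
    w = np.asarray(w, dtype=float)
    return (np.dot(x, w) + w0) / np.linalg.norm(w)


# Assumed example plane 3*x1 + 4*x2 - 5 = 0 and three sample points
w = np.array([3.0, 4.0])
w0 = -5.0
points = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])

r = signed_distance(w, w0, points)      # positive on one side, negative on the other
labels = np.where(r > 0, 1, -1)         # y = 1 if g(x) > 0, y = -1 if g(x) < 0
distances = labels * r                  # y * (w^T x + w0) / ||w|| is never negative
closest = points[np.argmin(np.abs(r))]  # the point minimizing |w^T x + w0|

print(r, labels, distances, closest)
```

Multiplying by the label plays the role of the absolute value, which is what makes the expression usable in a differentiable optimization problem.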

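SVM in R: example code

SVM is a widely used machine learning algorithm for both classification and regression problems. This article introduces the basic principle of SVM and gives example code for it in R.

SVM (Support Vector Machine) is a classification algorithm based on statistical learning theory. Its basic idea is to find an optimal hyperplane in feature space that separates samples of different classes. This hyperplane is called the separating hyperplane; it is chosen to maximize the margin, that is, the distance between the hyperplane and the nearest samples of each class.

In an SVM, the samples are first mapped (implicitly, through a kernel function) into a high-dimensional feature space, and the optimal hyperplane is then sought in that space. To find it, we define an objective function and solve it with an optimization algorithm.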

In R, SVMs are available through the e1071 package. First install and load the package:

```R
install.packages("e1071")
library(e1071)
```

Next, the svm function can be used to train an SVM model.

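Suppose we have a data set X with two features and a corresponding label vector y, where y is 1 for positive samples and -1 for negative samples. A linear SVM model can be trained with the following code:

```R
# Combine features and labels into one data frame; y must be a factor
# so that svm() performs classification rather than regression.
train <- data.frame(X, y = factor(y))
model <- svm(y ~ ., data = train, kernel = "linear")
```

In this code, y ~ . means that all features are used to predict y, data = train names the data frame that holds both the features and the labels, and kernel = "linear" selects the linear kernel.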

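After training, the predict function classifies new samples. For example:

```R
# Column names must match the feature names used in training.
new_data <- data.frame(feature1 = c(1, 2, 3), feature2 = c(4, 5, 6))
predictions <- predict(model, newdata = new_data)
```

This code creates a new data set new_data, classifies it with predict, and stores the result in the predictions variable.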

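Worked example: solving the SVM dual problem

The support vector machine (SVM) is a powerful machine learning algorithm used for classification and regression analysis. In classification, an SVM looks for a hyperplane that separates the data points of different classes with the largest possible margin. Finding it involves solving a dual problem, an optimization problem whose solution gives the maximum-margin hyperplane while, in the soft-margin case, keeping the classification error small.

Suppose we have a simple data set of two-dimensional points, each with a label (positive or negative class). We can train an SVM model on these points and use it to predict the labels of new, unseen points.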

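The following simple example shows how an SVM is used in practice; the dual problem is solved inside the library.

1. **Data preparation**:
   * Suppose we have 8 data points, 4 in the positive class (labeled +1) and 4 in the negative class (labeled -1).
   * The data points are:

```python
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 2], [1, 3], [0, 2], [0, 3]]
y = [1, 1, -1, -1, 1, 1, -1, -1]
```

2. **Using the SVM**:
   * We will use the SVM implementation in scikit-learn.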

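   * First, the data must be in a form the SVM can work with (here, a list of feature vectors and a list of labels).
   * We use a linear kernel, because the data are linearly separable.

3. **Solving the dual problem**:
   * The goal of the SVM is to find the hyperplane that maximizes the margin between the positive and negative classes. This is done by solving the dual problem, an optimization problem with one multiplier per training point.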

4. **Training the model**:

```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SVM classifier
clf = svm.SVC(kernel='linear')

# Train the model
clf.fit(X_train, y_train)

# Predict with the model
y_pred = clf.predict(X_test)

# Print the prediction accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```

5. **Interpreting the results**:
   * After training, we can inspect how the model classifies the training data.
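To connect the fitted model back to the dual problem, the sketch below inspects the dual solution that scikit-learn exposes. It is an illustration under stated assumptions: it fits on all 8 points instead of the train/test split, and it uses a large `C` (an assumed value) to approximate the hard-margin problem. `dual_coef_` holds the products y_i * alpha_i for the support vectors, so the primal weight vector can be rebuilt as w = sum_i alpha_i y_i x_i.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0],
              [1, 2], [1, 3], [0, 2], [0, 3]], dtype=float)
y = np.array([1, 1, -1, -1, 1, 1, -1, -1])

# Large C (assumed value) approximates the hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

signed_alpha = clf.dual_coef_            # shape (1, n_SV): y_i * alpha_i per support vector
w = signed_alpha @ clf.support_vectors_  # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]

print("support vectors:\n", clf.support_vectors_)
print("y_i * alpha_i:", signed_alpha)
print("w =", w, "b =", b)
print("sum of y_i * alpha_i:", signed_alpha.sum())  # dual equality constraint
```

For this data the classes are separated by the first coordinate alone, so the recovered hyperplane should be close to 2·x1 - 1 = 0, with margin boundaries at x1 = 0 and x1 = 1.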

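Using SVM for pattern recognition in MATLAB

The support vector machine (SVM) is a commonly used pattern recognition method, and MATLAB provides an implementation of it. An SVM separates sample classes by finding an optimal hyperplane in feature space. This article describes the general steps for using SVM for pattern recognition in MATLAB. The first step is to prepare the data set.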

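Second, perform feature selection and preprocessing. Feature selection is a critical step for SVMs: well-chosen features carry the most discriminative information and improve classification performance. Preprocessing, such as normalization, ensures that the features have similar scales.

Then split the data set into a training set and a test set. MATLAB's cvpartition function can be used for this. In general, the training set is used to train the SVM model and the test set to evaluate its performance.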

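Next, choose a suitable kernel function. An SVM uses a kernel to map the data into a high-dimensional feature space in which data that were not linearly separable become separable. In older MATLAB releases the kernel was selected with the 'kernel_function' option of svmtrain; svmtrain has since been removed, and in current releases the kernel (linear, polynomial, Gaussian, and so on) is selected with the 'KernelFunction' option of fitcsvm.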

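Then set the SVM parameters and train the model on the training set. Parameters that need tuning include the regularization parameter C (the box constraint, which controls how soft the margin is) and the kernel parameters. Their values directly affect classification performance. They can be tuned with a grid search or adjusted manually.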

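Then test the SVM model on the test set. In older releases this was done with svmclassify, whose inputs are the trained SVM model and the test-set feature vectors; with a model trained by fitcsvm, use predict instead.

Finally, evaluate the performance. MATLAB's confusionmat function computes the confusion matrix of the classification results, from which accuracy, recall, F1 score, and other metrics can be derived.

Beyond these steps, cross-validation, dimensionality reduction, and similar methods can further improve classification performance.

In summary, using SVM for pattern recognition in MATLAB consists of preparing the data set, feature selection and preprocessing, splitting the data set, choosing a suitable kernel function, setting the SVM parameters, training the SVM model on the training set, testing it on the test set, and evaluating its performance.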
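The same sequence of steps can be sketched outside MATLAB. The following Python example uses scikit-learn as an analogue, not the MATLAB API: the Iris data set, the split ratio, and the parameter grid are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Prepare the data set (Iris is used here only as a stand-in)
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 3. Preprocessing (feature scaling) and classifier in one pipeline
pipe = make_pipeline(StandardScaler(), SVC())

# 4. Choose kernel and C by grid search with 5-fold cross-validation
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# 5. Test on held-out data and evaluate with a confusion matrix
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
accuracy = cm.trace() / cm.sum()

print("best parameters:", search.best_params_)
print(cm)
print("accuracy:", accuracy)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the test folds do not leak into the preprocessing.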

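A brief account of the basic principles and applications of support vector machines

I. Basic principles

The Support Vector Machine (SVM) is a popular and powerful machine learning algorithm, widely used for classification and regression problems. It is based on the structural risk minimization principle from statistical learning theory and classifies by maximizing the classification margin.

1. The concept of the support vector machine

In an SVM, data points are treated as points (vectors) in a feature space, which may be high-dimensional. The SVM partitions the feature space by finding a hyperplane (the decision boundary) that separates data points of different classes.

2. Linearly separable SVM

Data are called linearly separable when a hyperplane can separate the classes completely. The goal of the linearly separable SVM is to find the best such hyperplane, the one that maximizes the distance from the hyperplane to the nearest positive and negative samples. This hyperplane is called the optimal separating hyperplane.

3. Linearly non-separable SVM

In practice, data are often not perfectly linearly separable. In that case a kernel function can be used to map the problem, which is not linearly separable in the low-dimensional space, into a high-dimensional space where a linear separation becomes possible.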

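II. Applications

As a classic machine learning algorithm, the SVM has been widely applied in many fields.

1. Image classification. SVMs perform well in image classification. Images are represented as high-dimensional vectors in feature space, and the SVM classifies them, for example in face recognition and handwritten digit recognition.

2. Text classification. SVMs also achieve high accuracy in text classification. Text is represented in the vector space model and mapped into feature space, where the SVM classifies it, for example in spam filtering and sentiment analysis.

3. Financial forecasting. SVMs are widely used in financial forecasting. For stock, foreign exchange, and options markets, an SVM can learn from historical data to predict future price trends and support investment decisions.

4. Bioinformatics. SVMs are also widely used in bioinformatics. By analyzing biological data such as gene sequences, they can classify and predict in problems such as protein structure, gene function, and mutation prediction, supporting bioinformatics research.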

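Summary of key points on support vector machines

I. Principles of SVM

1. Linearly separable SVM

In two dimensions, the most common way to separate data points by class is with a straight line. In higher dimensions, this line becomes a hyperplane. The SVM looks for the hyperplane that puts one class on one side and the other class on the other side. Mathematically, the hyperplane can be written as $w^Tx+b=0$, where $w$ is the normal vector (which determines the orientation of the hyperplane), $x$ is the feature vector of a data point, and $b$ is the bias of the hyperplane.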

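2. Soft-margin SVM

In the real world, much data is not linearly separable; that is, no hyperplane separates the data completely. The soft-margin SVM allows some data points to lie on the wrong side of the hyperplane by introducing slack variables $\xi_i \ge 0$. The optimization objective becomes minimizing $\frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$ subject to $y_i(w^Tx_i+b) \ge 1-\xi_i$, where $C$ is a parameter that trades off margin width against misclassification.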

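3. Nonlinear SVM

When the data cannot be separated well by any hyperplane in the original space, the kernel trick is used. Its basic idea is to map the data from the original space into a high-dimensional space so that data that are not linearly separable in the original space become linearly separable there. With a kernel function $K(x_i, x_j) =\Phi(x_i)^T\Phi(x_j)$, the computation in the high-dimensional space never has to be carried out explicitly.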
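A small numerical check makes the identity $K(x_i, x_j) = \Phi(x_i)^T\Phi(x_j)$ concrete. The sketch assumes the homogeneous degree-2 polynomial kernel on two-dimensional inputs, whose explicit feature map is known in closed form; the function names `phi` and `poly_kernel` and the sample vectors are illustrative choices.

```python
import numpy as np


def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])


def poly_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed without leaving the original 2-D space."""
    return float(np.dot(x, z)) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the 3-D feature space
implicit = poly_kernel(x, z)              # same value from the kernel alone

print(explicit, implicit)
```

Both routes give the same number, but the kernel route never builds the higher-dimensional vectors, which is what matters when the feature space is very large or infinite-dimensional.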

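4. Multi-class SVM

For multi-class problems, SVMs can be combined using the one-vs-one or one-vs-rest approach. One-vs-one trains a classifier on each pair of classes and determines the final class by voting; one-vs-rest treats one class as positive and all other classes as negative, training one SVM per class. Both approaches handle multi-class classification effectively.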
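The difference between the two schemes shows up in the number of binary SVMs that get trained. The sketch below uses scikit-learn's generic wrappers on a synthetic four-class data set; the data set and its parameters are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with k = 4 classes (assumed example)
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1.0, random_state=0)

# One-vs-one: one binary SVM per pair of classes, k * (k - 1) / 2 in total
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# One-vs-rest: one binary SVM per class, k in total
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

n_ovo = len(ovo.estimators_)
n_ovr = len(ovr.estimators_)
print("one-vs-one classifiers:", n_ovo)
print("one-vs-rest classifiers:", n_ovr)
```

One-vs-one trains more classifiers, but each one sees only the samples of two classes; one-vs-rest trains fewer classifiers, each on the full data set.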

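II. The SVM optimization problem

The SVM optimization problem can be solved through Lagrange duality. Within this framework, the solution of the primal problem is obtained by solving its dual problem. Applied to SVMs, Lagrange duality turns the search for the optimal $w$ and $b$ into a search for the optimal Lagrange multipliers $\alpha$.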
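As a worked sketch of this step, stated for the hard-margin case and assuming $n$ training pairs $(x_i, y_i)$ with $y_i \in \{1, -1\}$, the Lagrangian of the primal problem is

$$L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0.$$

Setting the derivatives with respect to $w$ and $b$ to zero gives

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Substituting these back into $L$ eliminates $w$ and $b$ and leaves the dual problem in $\alpha$ alone:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0.$$

For the soft-margin objective with parameter $C$, the only change is that the constraint becomes $0 \le \alpha_i \le C$. Because the data enter only through the inner products $x_i^T x_j$, replacing them with $K(x_i, x_j)$ gives the kernelized SVM.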


KEDRI-NICT Project Report - APPENDIX: D

String Kernel Based SVM for Internet Security Implementation

Zbynek Michlovský¹, Shaoning Pang¹, Nikola Kasabov¹, Tao Ban², and Youki Kadobayashi²

¹ Knowledge Engineering & Discovery Research Institute, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand
{spang,nkasabov}@
² Information Security Research Center, National Institute of Information and Communications Technology, Tokyo, 184-8795 Japan
bantao@nict.go.jp, youki-k@is.aist-nara.ac.jp

Abstract. For network intrusion and virus detection, ordinary methods detect malicious network traffic and viruses by examining packets, flow logs, or memory contents for signatures of the attack. This implies that if no signature is known or created in advance, attack detection is problematic. To address the detection of unknown attacks, we develop in this paper a network traffic and spam analyzer using string kernel based SVM (support vector machine) supervised machine learning. The proposed method is capable of detecting network attacks without known or previously determined attack signatures, because the SVM automatically learns attack signatures from traffic data. For application to internet security, we have implemented the proposed method for spam email detection over the SpamAssassin and E. M. Canada datasets, and for network application authentication via analysis of real connection data. The obtained accuracies above 99% demonstrate the usefulness of string kernel SVMs in network security, both for detecting 'abnormal' traffic and for protecting 'normal' traffic.

1 Introduction

As computers and the Internet become more and more integrated into everyday life, higher security requirements are imposed on computer and network systems. One of the many ways to increase security is to use an Intrusion Detection System (IDS). Intrusion detection is the process of monitoring events occurring in a computer system or network and analyzing them for signs of possible incidents. IDSs, as described in [10], can be grouped into two
categories: statistical anomaly based IDS and signature based IDS. The idea of statistical anomaly based IDS is to detect intrusions by comparing traffic with a model of normal traffic and looking for deviations. Because network traffic is so diverse, normal traffic is difficult to model: normal email relaying or peer-to-peer queries may show some of the characteristics of intrusion traffic. Moreover, even abnormal traffic does not necessarily constitute an intrusion or attack. Hence, anomaly detection often has a high false alarm rate and is seldom used in practice. For signature based IDS, network traffic is examined for predetermined attack patterns known as signatures. A signature consists of a string of characters (or bytes). Nowadays many intrusion detection systems also support regular expressions and even behavioral fingerprints [13]. The difficulty of signature based IDS is that only intrusions whose signatures are known can be detected, and the collection of signatures must be updated constantly to mitigate emerging threats [14].

C.S. Leung, M. Lee, and J.H. Chan (Eds.): ICONIP 2009, Part II, LNCS 5864, pp. 530–539, 2009. © Springer-Verlag Berlin Heidelberg 2009

Another approach for enhancing network security is to authenticate and protect legitimate network traffic. Traditional traffic authentication based on network protocol (e.g. the port number) is becoming increasingly inaccurate and is not appropriate for identifying P2P and other new types of network traffic. Other methods are mostly based on protocol anomaly detection, which is highly limited because legitimate Internet traffic does not strictly conform to any specific traffic model. Those methods therefore often suffer from high false positive rates in real applications. For example, legitimate SMTP traffic could be identified as malicious if a misconfigured MTA server adds suspicious fields to the header of an email message. Thus, it is likely
to mis-authenticate valid SMTP traffic when relying on the standard SMTP network protocol alone.

For either attack detection or legitimate network traffic authentication, most signatures and rules are created by security experts who analyze network traffic and host logs after intrusions have occurred. Sifting through thousands of lines of log files and looking for characteristics that uniquely identify an intrusion is a vast and error-prone undertaking. To overcome this shortcoming and detect unknown attacks (i.e. attacks whose signature has not been determined), we researched machine learning for string content recognition. The motivation is to train a classifier to distinguish between malicious and harmless flows or memory dumps, and to use the trained classifier to classify real network flows and memory dumps.

The support vector machine (SVM) is known as one of the most successful classification algorithms for data mining. In this work we address the analysis of string, rather than numerical, data, and implement string kernel SVMs for network security in the form of intrusion detection and network application authentication.
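To illustrate what classifying strings with an SVM can look like in practice, here is a minimal Python sketch. It is not the authors' implementation: it uses scikit-learn's `SVC` with a precomputed Gram matrix rather than LIBSVM directly, a simple contiguous n-gram count kernel, made-up toy payload strings, and port numbers as class labels; it also omits any scaling of the kernel matrix.

```python
from collections import Counter

import numpy as np
from sklearn.svm import SVC


def ngram_counts(s, n):
    """Count every contiguous substring of length n in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def ngram_kernel(s, t, n=3):
    """Inner product of the n-gram count vectors of s and t."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return float(sum(c * ct[u] for u, c in cs.items()))


def gram_matrix(A, B, n=3):
    """Kernel values between every string in A (rows) and B (columns)."""
    return np.array([[ngram_kernel(a, b, n) for b in B] for a in A])


# Toy connection payloads (invented for illustration), labelled by port number
train = ["GET /index.html HTTP/1.1", "GET /images/logo.png HTTP/1.1",
         "SSH-2.0-OpenSSH_5.1", "SSH-2.0-OpenSSH_4.7p1"]
labels = [80, 80, 22, 22]
test = ["GET /about.html HTTP/1.1", "SSH-2.0-OpenSSH_5.3"]

# Train on the (train x train) kernel matrix, predict from (test x train)
K_train = gram_matrix(train, train)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, labels)
pred = clf.predict(gram_matrix(test, train))
print(pred)
```

The classifier never sees a hand-written signature; it only sees pairwise string similarities, which is the property that makes the approach attractive for unknown traffic.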
In spite of the limited training efficiency of SVMs, the advantage of string kernel SVMs is that they need no complete knowledge of attack signatures or normal application features, because a string kernel SVM learns this knowledge automatically during training.

In this work, we develop SVM based string kernel methods according to different mathematical expressions of the similarity of two strings or substrings. For network security, we derive string kernel SVMs for automatic attack (i.e. spam email) signature analysis, conducting spam filtering without previously determined spam signatures. Moreover, we have used string kernel SVMs to authenticate legitimate network applications, learning the SVM from connection differences against normal connections.

2 SVM and Kernels Theory

Support vector machines (SVMs) are a group of supervised learning methods applicable to classification or regression. An SVM maps the data points into a high dimensional feature space, where a linear learning machine is used to find a maximal margin separation [7,8]. One of the main statistical properties of the maximal margin solution is that its performance does not depend on the dimensionality of the space in which the separation takes place. In this way, it is possible to work in very high dimensional spaces, such as those induced by kernels, without overfitting. Kernels give support vector machines the capability of implicitly mapping non-linearly separable data points into a different, higher dimensional space, where they are more separable than in the original space. This method is also called the kernel trick [7].

A kernel function K(x, y) can be expressed as a dot product in a high dimensional space. If the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite, i.e. for any finite subset {x_1, ..., x_n} of X and any set {c_1, ..., c_n} of real numbers

$$\sum_{i,j} K(x_i, x_j)\, c_i c_j \ge 0, \qquad (1)$$

then there must exist a function φ(x) whose range is in an inner product space of possibly high dimension, such
that K(x, y) = ⟨φ(x), φ(y)⟩. The kernel method allows a linear algorithm to be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the function φ is never explicitly computed. The kernel representation of data amounts to a nonlinear projection of the data into a high-dimensional space where it is easier to separate into classes [12]. The most popular kernels suitable for SVMs are, e.g., the polynomial kernel, the Gaussian radial basis kernel, and the hyperbolic tangent kernel [11]. All of these kernels operate on numerical data. For our purpose it is necessary to use string kernels, which are described in the following section.

3 String Kernels Used in SVM

Regular kernels for SVMs work only on numerical data, which is unsuitable for internet security, where huge amounts of string data are present. To extend SVMs to string data processing, we implemented the following string kernel algorithms in our experiments.

3.1 Gap-Weighted Subsequence Kernel

The theory of the subsequence kernel is described in the book Kernel Methods for Pattern Analysis [2]. The main idea behind the gap-weighted subsequence kernel is to compare strings by means of the subsequences they contain: the more subsequences they share and the fewer gaps these contain, the more similar the strings are. To reduce the dimensionality of the feature space, we consider non-contiguous substrings of fixed length p. The feature space of the gap-weighted subsequence kernel is defined as

$$\phi^p_u(s) = \sum_{\mathbf{i}:\, u = s(\mathbf{i})} \lambda^{l(\mathbf{i})}, \quad u \in \Sigma^p, \qquad (2)$$

where λ ∈ (0,1) is the decay factor, i is the index vector of an occurrence of the subsequence u = s(i) in the string s, and l(i) is the length of that occurrence in s. We weight each occurrence of u with the exponentially decaying factor λ^{l(i)}. The associated kernel is defined as

$$\kappa_p(s,t) = \langle \phi^p(s), \phi^p(t) \rangle = \sum_{u \in \Sigma^p} \phi^p_u(s)\, \phi^p_u(t). \qquad (3)$$

Evaluating Eq. (3) requires an intermediate dynamic programming table DP_p whose
entries are

$$DP_p(k,l) = \sum_{i=1}^{k} \sum_{j=1}^{l} \lambda^{k-i+l-j}\, \kappa^S_{p-1}\big(s(1:i),\, t(1:j)\big). \qquad (4)$$

The kernel is then evaluated recursively as

$$\kappa^S_p(sa, tb) = \begin{cases} \lambda^2\, DP_p(|s|,|t|) & \text{if } a = b; \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$

from which it follows that, for a single value of p, the complexity of computing κ^S_p is O(|s||t|). Thus, the overall computational complexity of κ_p(s,t) is O(p|s||t|).

3.2 Levenshtein Distance

The Levenshtein (or edit) distance [4] counts the differences between two strings. The distance is the number of substitutions, deletions, or insertions required to transform a string s of length n into a string t of length m. The formal definition of the Levenshtein distance [17] is as follows. Given a string s, let s(i) stand for its i-th character. For two characters a and b, define

$$r(a,b) = \begin{cases} 0 & \text{if } a = b; \\ 1 & \text{otherwise.} \end{cases} \qquad (6)$$

For two strings s and t of length n and m, respectively, an (n+1) × (m+1) array d furnishes the required values of the Levenshtein distance L(s,t). The calculation of d is a recursive procedure. First set d(i,0) = i for i = 0, 1, ..., n and d(0,j) = j for j = 0, 1, ..., m. Then, for all other pairs i, j, we have

$$d(i,j) = \min\big(d(i-1,j) + 1,\; d(i,j-1) + 1,\; d(i-1,j-1) + r(s(i), t(j))\big). \qquad (7)$$

The distance is L(s,t) = d(n,m). In our implementation, we use D = e^{−λ·d(n,m)} as the kernel value to obtain better results. Analogously to the subsequence kernel above, the computational complexity of the Levenshtein distance is O(|s||t|). When s and t have the same length n, the complexity is O(n²).

3.3 Bag of Words Kernel

The bag of words kernel represents a document as an unordered collection of words, disregarding grammar and word order. Words are any sequences of letters from the basic alphabet separated by punctuation or spaces. We represent a bag as a vector in a space in which each dimension is associated with one term from the dictionary:

$$\phi: d \mapsto \phi(d) = \big(tf(t_1,d),\, tf(t_2,d),\, \ldots,\, tf(t_N,d)\big) \in \mathbb{R}^N, \qquad (8)$$

where tf(t_i, d) is the frequency of the term t_i in the document d. Hence, a document is mapped into a space whose dimensionality N is the size of the dictionary, typically a
very large number [2].

3.4 N-Gram Kernel

N-grams transform documents into high dimensional feature vectors in which each feature corresponds to a contiguous substring [5]. The feature space associated with the n-gram kernel is defined by

$$\kappa(s,t) = \langle \phi^n(s), \phi^n(t) \rangle = \sum_{u \in \Sigma^n} \phi^n_u(s)\, \phi^n_u(t), \qquad (9)$$

where φ^n_u(s) = |{(v_1, v_2) : s = v_1 u v_2}|, u ∈ Σ^n. We computed the n-gram kernel with a naive approach, so the time complexity is O(n|s||t|).

4 Experiments and Discussions

For internet security, one useful approach is to actively detect and filter spam or attacks by building Bayesian filters. Another practical approach is to protect legitimate network communication by authenticating every type of legitimate network application. In this section, we apply string kernel SVMs to the SpamAssassin public mail corpus for email spam detection, and experiment with the authentication of 12 categories of standard network applications.

For multi-class classification, we used the support vector machine software LIBSVM [1]. In our experiments, we used kernel matrices precomputed from the prepared training and testing datasets. All values in the precomputed kernel matrices were scaled to the interval [-1, 1]. The optimal parameters of the string kernel functions were determined by cross-validation tests on the training datasets. The kernel matrices were then used as input to LIBSVM [1].

4.1 Spam Detection

For the spam detection experiment, we used 5500 ham messages from the SpamAssassin public mail corpus [18] and 24038 spam messages from E. M. Canada [19]. The ham dataset consists of two categories: Easy Ham, 5000 non-spam messages without any spam signatures, and Hard Ham, 500 non-spam messages that resemble spam in many respects (unusual HTML markup, colored text, spam-sounding phrases, etc.). Each email message has a header, a body, and possibly some attachments. Note that, for convenience of comparison, we employed exactly the same data setup as in [20]: a training dataset of 23630 (19230 spam
vs. 4400 ham) messages and a testing dataset of 5918 (4808 spam vs. 1110 ham) messages. As in [20], we intended to determine which part of an email message has a critical influence on the classification results. To this end, we prepared four subsets: Subject, Body, Header, and All. The Subject subset uses only the subject field of the email message, with all Subject data normalized to a length of 100 characters. The Body subset is the body of the email message normalized to a length of 1000 characters. The Header subset is the header section of the email message normalized to a length of 100 characters. The All subset includes the From field and the Subject field of the header section plus the whole body of the email message, and every instance of the All subset is normalized to a length of 1200 characters.

Table 1. Results of email classification (accuracy, %) using each kernel function, with a comparison to [20]

| Features | N-gram | Subsequence | Edit Distance | Bag of word | Ref. Acc. [20] |
|---|---|---|---|---|---|
| Subject | 96.38 | 96.64 | 95.78 | 81.16 | 92.64 |
| Body | 99.28 | 99.23 | 97.19 | 81.04 | 84.41 |
| Header | 99.80 | 99.80 | 99.75 | 82.68 | 92.12 |
| All | 99.34 | 99.38 | 98.02 | 81.24 | 90.13 |

Table 1 presents the percentage of correctly classified email messages for each type of string kernel and email subset; the accuracy reported in [20] is given in the Ref. Acc. column for comparison. As the table shows, the results reached by the N-gram, Subsequence, and Edit Distance kernel SVMs exceed the accuracy of the reference paper [20], whereas the Bag of word kernel falls below it on every subset. The results for the Header subset demonstrate that the first 100 characters of a message header are enough for correct spam classification. Results for the other subsets are consistently good, although they seem more susceptible to spammers' tricks.

Among the 4 string kernels, the outstanding performance of the N-gram and Subsequence kernel functions indicates that classification with substring/subsequence kernels is more suitable for spam detection than kernels using the whole string, such as Edit Distance. The Bag of word kernel is problematic
for spam detection; this could be explained by spammers normally using the same words that occur in ham, so that word-based similarity does not distinguish spam from ham.

4.2 Network Application Authentication

In this experiment, we used data from TCP network traffic produced by common network applications using different communication protocols such as http, https, imap, pop3, ssh, and ftp, and by some applications using dynamic ports, such as BitTorrent. All network data was captured and sorted by the program Wireshark [15] into separate files according to protocol, and then split into individual flows (connections) by the program tcpflow [16]; see Fig. 1.

Fig. 1. Schema of data preparation for network application authentication. Network traffic data are sorted into separate files according to protocol with the program Wireshark [15] and then split into individual flows using the program tcpflow [16].

At the preprocessing stage, we removed the traffic data of unreliable protocols such as UDP, and reordered the traffic data by type of connection using the program tcpflow [16]. All connections were shortened to a length of 450 bytes and 750 bytes, respectively; connections shorter than 450/750 bytes were normalized by repeating their content. Every connection is labelled with its connection port number, within the interval [1, 49151]. Connections using dynamic ports, or the same port as other applications, were labelled with unoccupied ports. For example, Http, Msnms, and Ocsp connections all use port 80; to avoid the clash, we label Msnms and Ocsp as ports 6 and 8, respectively. A summary of all connections is given in Table 2.

Table 3 presents the best accuracies obtained by the 4 types of string kernel. The general accuracy (Gen. Acc.) is the ratio of correctly classified connections to the total number of connections. The parameters Lambda and C were obtained from 5-fold cross-validation tests. The parameter Substring length is the length of the contiguous (non-contiguous) substring for the N-gram (Subsequence) kernel function.

Table 2. Number of connections (flows) for each protocol for training and testing

| Protocol name (port number) | Class label | Training set | Testing set |
|---|---|---|---|
| Http (80) | 80 | 675 | 235 |
| BitTorrent (dynamic) | 4 | 299 | 99 |
| Https (443) | 443 | 292 | 90 |
| Imap (993) | 993 | 42 | 14 |
| Pop3 (110) | 993 | 32 | 10 |
| Aol (5190) | 5190 | 21 | 7 |
| Ftp (21) | 21 | 12 | 4 |
| Msnms (80) | 6 | 9 | 4 |
| Microsoft-ds (445) | 445 | 9 | 3 |
| Ocsp (80) | 8 | 8 | 3 |
| Ssh (22) | 22 | 6 | 2 |
| Pptp (1723) | 1723 | 2 | 1 |
| Sum | | 1407 | 472 |

Table 3. The best results of network application classification for each kernel function

| Kernel function | Substring length | Connection length | C of C-SVC | Lambda | Gen. Acc. |
|---|---|---|---|---|---|
| Subsequence kernel | 4 | 750 | 65536 | 0.5 | 99.58% |
| N-gram kernel | 4 | 450 | 4096 | 0.5 | 99.58% |
| Edit distance kernel | - | 450 | 16 | 0.001 | 99.15% |
| Bag of word kernel | - | 750 | 1 | 0.25 | 49.75% |

As seen, the Subsequence and N-gram kernel functions give the best general accuracy of 99.58%, which means that only 2 of the 472 connections are misclassified (two Msnms connections are misclassified as Http connections). It is worth noting that the N-gram kernel function performs extremely well at distinguishing network applications with the shorter (450 chars) connection instances. However, the Bag of word kernel function is unable to recognize network applications, because this kernel distinguishes network applications based on word similarity: when a word (a continuous sequence of characters between any two gaps) represents a whole network connection, it is often too long for the kernel to differentiate applications.

Fig. 2 shows the relationship between the substring length and the general accuracy over network flows (connections) of length 450 and 750 chars, respectively. As seen, the general accuracy decreases with the length of the substring for both the N-gram and Subsequence kernels. This suggests that a longer substring normally leads to decreased classification performance for these two string kernel functions.

In general, the results in Table 3 and Fig. 2 demonstrate the effectiveness of applying the Edit Distance, Subsequence, and N-gram kernel functions to recognizing hidden network traffic, including encrypted connections. The Edit Distance
kernel identifies a global similarity of the string (connection), yet it gives a surprisingly high 99.15% authentication accuracy. This implies that functions based on complete string comparison are also suitable for network application recognition to some extent.

Fig. 2. Relation between the substring length and general accuracy for network flows (connections) with length 450 and 750 chars, respectively

5 Conclusions and Future Work

In this paper, we propose a new network security technique for spam and hidden network application detection. Our technique is based on known string kernel functions with precomputed kernel matrices. In our implementation, we used support vector machines with the optimal configuration for each kernel function.

This paper makes two major contributions. First, a study of spam detection. Our email classification results show the excellent ability of string kernel SVMs to separate spam from relevant emails. Kernel functions using substrings/subsequences are evidently less susceptible to spammers' tricks than functions that compare whole strings. The best classification accuracy was reached on the Header email subset, which indicates that the email header has a critical influence on spam classification. Second, detection of hidden network application traffic. Our experimental analysis shows that the first 450 bytes of a TCP connection are sufficient for accurately distinguishing network applications. The results show that the suitable functions for application recognition from TCP connections are the N-gram, Subsequence, and Edit Distance functions.
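Of the functions named here, the Edit Distance function is the simplest to write down. The following Python sketch implements the standard Levenshtein dynamic program and an exponential similarity of the form exp(-λ · distance); it is an illustration rather than the authors' code, the default `lam` value is an assumption, and a similarity built this way is not guaranteed to be positive semi-definite for every data set.

```python
import math


def levenshtein(s, t):
    """Edit distance between s and t via the standard O(|s||t|) dynamic program."""
    n, m = len(s), len(t)
    prev = list(range(m + 1))             # row d(0, j) = j
    for i in range(1, n + 1):
        curr = [i] + [0] * m              # d(i, 0) = i
        for j in range(1, m + 1):
            r = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + r)    # substitution or match
        prev = curr
    return prev[m]


def edit_kernel(s, t, lam=0.001):
    """Similarity exp(-lam * distance); lam is an assumed example value."""
    return math.exp(-lam * levenshtein(s, t))


print(levenshtein("kitten", "sitting"))
print(edit_kernel("GET / HTTP/1.1", "GET / HTTP/1.0"))
```

Only two rows of the table are kept in memory at a time, so the space cost is linear in the length of the second string while the time cost stays O(|s||t|).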
Our experiments show that the best results in network application recognition were obtained with short substring lengths as the kernel parameter.

For future work, we will continue to develop new string kernels, and address further network security tasks, including memory dump testing and network intrusion detection.

References

1. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), .tw/~cjlin/libsvm
2. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York (2004)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)
4. Charras, C., Lecroq, T.: Sequence comparison (1998), http://www-igm.univ-mlv.fr/~lecroq/seqcomp/index.html
5. Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444
6. Fisk, M., Varghese, G.: Applying Fast String Matching to Intrusion Detection (September 2002)
7. Aizerman, A., Braverman, E.M., Rozoner, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964)
8. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: COLT 1992: Proceedings of the fifth annual workshop on Computational learning theory, pp. 144–152. ACM, New York (1992)
9. Yuan, G.-X., Chang, C.-C., Lin, C.-J.: LIBSVM: libsvm experimental code for string inputs, http://140.112.30.28/~cjlin/libsvmtools/string/libsvm-2.88-string.zip
10. Scarfone, K., Mell, P.: Guide to intrusion detection and prevention systems (IDPS). In: NIST: National Institute of Standards and Technology (2007), /publications/nistpubs/800-94/SP800-94.pdf
11. Vapnik, V.N.: The nature of statistical learning. Springer, New York (1995)
12. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge
(2000)
13. Caswell, B., Beale, J., Foster, J.C., Faircloth, J.: Snort 2.0 Intrusion Detection. Syngress (2003), http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/1931836744
14. Whitman, M.E., Mattord, H.J.: Principles of Information Security. Course Technology Press, Boston (2004)
15. …bs, G., et al.: Wireshark: network protocol analyzer, /
16. Elson, J.: tcpflow: tcpflow reconstructs the actual data streams and stores each flow in a separate file for later analysis, /jelson/software/tcpflow/
17. Bogomolny, A.: Distance Between Strings, /doyouknow/Strings.shtml
18. SpamAssassin public mail corpus, /publiccorpus/
19. Spam dataset, http://www.em.ca/7Ebruceg/spam/
20. …i, C.-C.: An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems 20, 249–254 (2007)