OPTIMIZING SVMS FOR COMPLEX CALL CLASSIFICATION
Performance Optimization and Improvement of Support Vector Machines

The support vector machine (Support Vector Machine, SVM) is a widely used supervised learning algorithm, applied in pattern recognition, text classification, image processing, and other fields. In practice, however, SVMs suffer from several performance bottlenecks. To further improve their performance and efficiency, and to address their shortcomings on large-scale datasets, researchers have proposed a variety of optimization and improvement methods. This article introduces SVM performance optimization and improvement from several angles.

1. Hard-margin SVM. The hard-margin SVM is the most basic form of the algorithm: its goal is to find an optimal hyperplane that separates the sample points of two different classes. However, the hard-margin SVM places a very strict requirement on the data: it must be linearly separable. For data that is not linearly separable, the hard-margin SVM cannot be used. For this reason, researchers proposed the soft-margin SVM.

2. Soft-margin SVM. The soft-margin SVM allows some samples to fall on the wrong side of the separating hyperplane by introducing slack variables that control the margin violations. This makes it better suited to linearly inseparable data and gives it a degree of tolerance to noisy samples. In practice, however, the performance of the soft-margin SVM is still affected by many factors and benefits from further improvement and tuning, as the sketch below illustrates.
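To make the soft-margin trade-off concrete, here is a minimal sketch (assuming scikit-learn, which this section does not itself prescribe; the synthetic dataset and the values of C are purely illustrative) showing how the regularization parameter C controls the tolerance for margin violations:

```python
# Hedged sketch: the soft-margin parameter C trades off margin width against
# training errors. Assumes scikit-learn; dataset and C values are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A noisy, not perfectly separable toy problem
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1,
                           random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C)   # smaller C -> wider margin, more slack
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"C={C:>6}: mean CV accuracy = {scores.mean():.3f}")
```

A small C widens the margin and tolerates more violations, while a very large C approaches hard-margin behaviour and is more sensitive to noise.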
3. Kernel functions and nonlinear SVMs. In practice, many datasets are not linearly separable, and a linear SVM cannot achieve good results on them. To solve this problem, researchers proposed kernel SVMs. A kernel function maps the data from the original space into a high-dimensional feature space in which it is more likely to be linearly separable. Commonly used kernels include the linear kernel, the polynomial kernel, and the Gaussian (RBF) kernel. By using kernel functions, SVMs can handle much more complex classification problems with better accuracy.

4. Multi-class SVMs. The SVM was originally formulated for binary classification, i.e. separating data into two classes. Many real-world problems, however, involve more than two classes. To handle them, researchers proposed multi-class SVM schemes; the two most common are one-vs-one (OvO) and one-vs-rest (OvR). The one-vs-one approach decomposes the multi-class problem into many binary problems, training one classifier for each pair of classes. The one-vs-rest approach trains one classifier per class, treating that class as positive and all remaining classes as negative. A sketch of both schemes follows.
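The following is a minimal sketch of the two decompositions just described (assuming scikit-learn; the iris dataset and RBF kernel are illustrative choices, not requirements):

```python
# Hedged sketch of one-vs-one and one-vs-rest multi-class decompositions.
# Assumes scikit-learn; dataset and kernel are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # 3-class toy problem

ovo = OneVsOneClassifier(SVC(kernel="rbf"))     # k*(k-1)/2 binary classifiers
ovr = OneVsRestClassifier(SVC(kernel="rbf"))    # k binary classifiers

for name, clf in [("one-vs-one", ovo), ("one-vs-rest", ovr)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```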
Commonly Used SVM Kernel Functions

The support vector machine (SVM) is a widely used classification algorithm that separates data into two classes by finding an optimal hyperplane. In practice the data is often not linearly separable, in which case a kernel function is used to map it into a higher-dimensional space where it becomes linearly separable. This section introduces the kernel functions most commonly used with SVMs.

1. Linear kernel. The linear kernel is the simplest kernel: it keeps the data in the original feature space without any transformation. It is suitable when the data is linearly separable; it is fast to compute but performs poorly on nonlinearly separable data.

2. Polynomial kernel. The polynomial kernel maps the data into a higher-dimensional space, implementing a nonlinear transformation through a polynomial function. It has the form K(x, y) = (x·y + c)^d, where c and d are hyperparameters. It suits data with moderate nonlinear structure, but the hyperparameters must be tuned to get good results.

3. Gaussian (RBF) kernel. The Gaussian kernel is one of the most widely used kernels; it implicitly maps the data into an infinite-dimensional feature space. It has the form K(x, y) = exp(-||x - y||^2 / (2σ^2)), where σ is a hyperparameter. It suits data with complex nonlinear structure, and its hyperparameter must be tuned.

4. Sigmoid kernel. The sigmoid kernel maps the data into a higher-dimensional space through a sigmoid-shaped transformation. It has the form K(x, y) = tanh(α x·y + β), where α and β are hyperparameters. It suits data with some nonlinear structure, and its hyperparameters likewise need tuning. The sketch below makes the four formulas concrete.
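The four kernel formulas above can be written directly in a few lines; the following sketch (plain NumPy, with illustrative hyperparameter values) is only meant to make them concrete:

```python
# Hedged sketch: direct NumPy implementations of the four kernel formulas above.
# Hyperparameter values are illustrative defaults, not recommendations.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=0.01, beta=0.0):
    return np.tanh(alpha * np.dot(x, y) + beta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))
```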
To summarize, the kernels most commonly used with SVMs are the linear, polynomial, Gaussian, and sigmoid kernels. Different kernels suit different kinds of data, so the kernel and its hyperparameters should be chosen according to the problem at hand. In practice, cross-validation or similar procedures can be used to select the kernel and hyperparameters that give the best classification results, as in the sketch below.
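A minimal sketch of such a cross-validated search, assuming scikit-learn and an illustrative parameter grid:

```python
# Hedged sketch: selecting a kernel and its hyperparameters by cross-validation.
# Assumes scikit-learn; the grid and dataset are illustrative starting points.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```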
SVM Algorithm Description and Optimization Algorithms

SVM (Support Vector Machine) is a widely used machine learning algorithm for classification and regression analysis. Its basic idea is to construct an optimal hyperplane in feature space that separates samples of different classes. This section reviews the basic principle of the SVM, how it is applied to classification and regression, and a common optimization algorithm.

The basic principle of the SVM is to find the hyperplane that maximizes the margin between classes, which leads to better classification performance. Since samples can be represented as vectors in feature space, the SVM can be viewed as searching for the hyperplane that best separates the two classes in that space. To find this optimal hyperplane, the SVM relies on the support vectors, i.e. the samples closest to the hyperplane. The distance from the support vectors to the hyperplane is the margin, and the optimal hyperplane is the one that maximizes it.

In the linearly separable case, the SVM minimizes a loss function subject to constraints. The loss combines the margin with the misclassified samples, while the constraints restrict the hyperplane; solving this optimization problem yields the parameters of the optimal hyperplane. In the nonlinearly separable case, the SVM uses a kernel function to map the samples from the low-dimensional input space into a high-dimensional feature space in which they become linearly separable.

Although the SVM is mostly used for classification, it can also be applied to regression. In regression, the goal is to find a hyperplane such that the distance of the points to it is as small as possible and below a given threshold. SVM regression introduces slack variables that allow some points to deviate from the hyperplane, which makes the model robust to outliers and yields better regression results. A short usage sketch follows.
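A short sketch of epsilon-insensitive SVM regression as described above (assuming scikit-learn; the synthetic data and parameter values are illustrative):

```python
# Hedged sketch of support vector regression with an epsilon-insensitive tube.
# Assumes scikit-learn; data and parameters are illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy sine curve

# epsilon sets the width of the tube in which deviations are ignored;
# C penalizes points falling outside the tube.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)
print("number of support vectors:", len(reg.support_))
print("prediction at x=2.5:", reg.predict([[2.5]])[0])
```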
In practice, the performance of an SVM depends on many factors, such as the distribution of the dataset, the number of samples, and the choice of features. Many improved algorithms have been proposed to further optimize SVM training; one of the most common is described below.

1. Sequential Minimal Optimization (SMO): SMO is a simple and efficient SVM training algorithm. It decomposes the large optimization problem into many small subproblems and solves them with heuristics: at each step it selects two variables to update, and it iteratively optimizes such pairs until the optimal solution is reached.
Optimizing SVM Parameters with a Genetic Algorithm

A genetic algorithm is an optimization method inspired by natural evolution. It simulates the evolutionary process by applying genetic operators (crossover and mutation) to a population of individuals and selecting among them in order to find an optimal solution. The support vector machine (Support Vector Machine, SVM) is a very effective classification algorithm that builds a separating hyperplane from the most informative samples in the dataset, and tuning its parameters can improve the accuracy and stability of the classifier.

The general steps for optimizing SVM parameters with a genetic algorithm are as follows (a code sketch is given after the list):

1. Define the optimization target: decide which SVM parameters to tune, such as the penalty coefficient C, the kernel type and its parameters, or slack-related settings; these parameters determine the model's performance.
2. Design the gene encoding: map the parameters to a genetic representation, using binary, integer, or floating-point encoding. For example, a parameter with range [0, 1] can be encoded as a floating-point gene.
3. Initialize the population: randomly generate an initial population in which each individual represents one combination of SVM parameter values.
4. Evaluate fitness: train each individual's model on the training data and use its accuracy (or another metric) on held-out data as the individual's fitness.
5. Selection: choose good individuals for reproduction, e.g. by fitness ranking or roulette-wheel selection.
6. Crossover: combine pairs of selected individuals to produce new ones, using single-point, multi-point, or uniform crossover.
7. Mutation: randomly perturb the new individuals to increase population diversity, e.g. by changing a gene's value or regenerating it at random.
8. Update the population: merge the offspring produced by crossover and mutation into the population.
9. Repeat steps 4-8 until a termination criterion is met (e.g. a maximum number of generations, or the population fitness no longer improves).
10. Pick the best individual: the fittest individual in the final population gives the optimal SVM parameters.

Through these steps the genetic algorithm searches the parameter space for the best solution; by trying many parameter combinations it can improve the SVM's performance. Note that these are only the general steps; in practice they may need to be adapted to the specific problem, and additional techniques such as local search can be introduced to improve search efficiency.
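The following sketch follows the steps listed above for the common case of tuning C and gamma of an RBF-kernel SVM. It assumes scikit-learn and NumPy; the encoding (log10 of each parameter), population size, selection scheme, and mutation rate are illustrative assumptions rather than a reference implementation:

```python
# Hedged sketch of the genetic-algorithm loop described above, tuning C and gamma.
# All numeric settings are illustrative; fitness is 3-fold cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)

BOUNDS = np.array([[-2.0, 3.0],    # log10(C)
                   [-4.0, 1.0]])   # log10(gamma)

def fitness(genome):
    C, gamma = 10.0 ** genome
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(clf, X, y, cv=3).mean()

def random_genome():
    return rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1])

pop = [random_genome() for _ in range(12)]
for generation in range(10):
    scored = sorted(pop, key=fitness, reverse=True)
    parents = scored[:6]                                # truncation selection
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = rng.choice(len(parents), size=2, replace=False)
        w = rng.uniform(size=2)
        child = w * parents[a] + (1 - w) * parents[b]   # blend crossover
        child += rng.normal(scale=0.2, size=2)          # Gaussian mutation
        children.append(np.clip(child, BOUNDS[:, 0], BOUNDS[:, 1]))
    pop = parents + children                            # update the population

best = max(pop, key=fitness)
print("best log10(C), log10(gamma):", np.round(best, 2),
      "CV accuracy:", round(fitness(best), 3))
```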
A Worked Example: Solving the SVM Dual Problem

The support vector machine (SVM) is a powerful machine learning algorithm for classification and regression analysis. In classification, the SVM tries to find a hyperplane that separates data points of different classes with the largest possible margin. This involves solving a dual problem: an optimization problem that maximizes the margin while minimizing the error.

Suppose we have a simple dataset of two-dimensional points, each labeled positive or negative. We can train an SVM on these points and use it to predict the labels of new, unseen points. The following example shows how this works in practice.

1. **Data preparation**: Assume we have 8 data points, 4 in the positive class (labeled +1) and 4 in the negative class (labeled -1):

```python
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 2], [1, 3], [0, 2], [0, 3]]
y = [1, 1, -1, -1, 1, 1, -1, -1]
```

2. **Using the SVM**: We will use the scikit-learn SVM implementation. First, the data must be put into a form the SVM can consume. We use a linear kernel, since the data is linearly separable.

3. **Solving the dual problem**: The goal of the SVM is to find the hyperplane that maximizes the margin between the positive and negative classes. This is achieved by solving the dual problem, an optimization problem that maximizes the margin while minimizing the error.

4. **Training the model**:

```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SVM classifier
clf = svm.SVC(kernel='linear')

# Train the model
clf.fit(X_train, y_train)

# Predict with the trained model
y_pred = clf.predict(X_test)

# Print the prediction accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```

5. **Interpreting the results**: After training, we can inspect how the model classifies the training data.
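To connect the fitted model back to the dual problem, one can inspect the support vectors and the dual coefficients that scikit-learn exposes. The sketch below refits on the full eight-point toy set (rather than the train/test split above) purely so the dual solution is easy to read; it is an illustrative continuation, not part of the original example:

```python
# Hedged sketch: inspecting the dual solution of the fitted classifier.
# dual_coef_ stores alpha_i * y_i for each support vector.
from sklearn import svm

X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 2], [1, 3], [0, 2], [0, 3]]
y = [1, 1, -1, -1, 1, 1, -1, -1]

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
print("primal weights w:", clf.coef_, "bias b:", clf.intercept_)
```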
Support Vector Machine Principles and Optimization Methods in Machine Learning

The support vector machine (Support Vector Machine, SVM) is a very commonly used machine learning algorithm, mainly for classification and regression problems. Its basic principle is to find an optimal hyperplane in feature space that separates the data samples. The main ingredients of SVM optimization are convex optimization, kernel functions, and soft-margin maximization.

The SVM is grounded in statistical learning theory and the principle of structural risk minimization. Its basic idea is to map the samples from the input space into a high-dimensional feature space and to find, in that space, the hyperplane that maximizes the distance of the closest samples to the hyperplane. This yields a classifier that separates the different classes well.

First, convex optimization is at the core of the SVM. The goal is to find the hyperplane that maximizes the distance of the samples to the hyperplane; this distance is the margin, and it measures how reliably the samples are classified. The convex optimization problem is to find the maximum-margin hyperplane subject to the constraint that every sample lies at least the margin away from the hyperplane. This can be cast as a quadratic program, and solving the constrained optimization problem yields the optimal hyperplane; a small numerical sketch of the dual form follows.
Second, kernel functions are another key component of the SVM. Real applications often involve very high-dimensional, even infinite-dimensional, feature spaces. To avoid the corresponding computational cost, a kernel function lets the high-dimensional computation be carried out implicitly: it maps the samples into the feature space and computes their similarity there as an inner product, without ever forming the high-dimensional vectors explicitly. Common kernels include the linear, polynomial, and Gaussian kernels, and the appropriate one should be chosen for the problem at hand.

Finally, soft-margin maximization is an important refinement of the SVM. In practice the samples are rarely exactly linearly separable, and even the best hyperplane may misclassify some points. To avoid overfitting and improve robustness, a certain amount of classification error is tolerated: some samples are allowed to be misclassified. Soft-margin maximization therefore seeks, on top of the convex optimization formulation, the hyperplane that simultaneously maximizes the margin and minimizes the classification error.
Sequential Minimal Optimization for SVM Training

Sequential Minimal Optimization (SMO) is an algorithm for solving the optimization problem behind the support vector machine (Support Vector Machine, SVM). Its idea is to decompose the large quadratic programming problem into a series of subproblems that are easy to solve, and to approach the optimum by solving these subproblems iteratively.

The SVM is a binary classification algorithm that looks for a hyperplane separating the two classes such that the distance between the hyperplane and the support vectors (the samples closest to it) is maximized; this makes the classifier more robust and improves generalization. In real problems the data often cannot be separated perfectly by a linear hyperplane, so the SVM introduces kernel functions that map the data into a high-dimensional feature space, turning a problem that is not linearly separable in the original low-dimensional space into one that is linearly separable in the high-dimensional space. Common kernels include the linear, polynomial, and Gaussian kernels.

The SVM optimization problem can be written as a quadratic program whose objective is convex and whose constraints are linear inequalities. Traditional solvers have to handle a large quadratic program directly, which is computationally expensive; SMO reduces the cost by turning the large problem into a sequence of two-variable subproblems.

Concretely, in each iteration SMO selects two Lagrange multipliers to update and solves the corresponding two-variable subproblem. To drive the solution toward the KKT conditions, it picks one multiplier that violates the KKT conditions most severely and a second one according to a heuristic rule; it then solves the subproblem for this pair, updates the two multipliers, and updates the model parameters and the threshold. The subproblems can be solved in closed form or with a simple optimizer such as coordinate descent, and they are usually simplified further using the box constraints, the KKT conditions, and the linear equality constraint to clip the solution.

SMO is a classic, simple, and efficient algorithm for training SVMs: by decomposing the optimization problem into a series of small subproblems and solving them iteratively, it approaches the optimal solution at much lower computational cost. A simplified sketch follows.
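The sketch below implements a simplified version of this scheme for a linear-kernel, soft-margin SVM: it picks a multiplier that violates the KKT conditions, picks the second one at random instead of using the full heuristics, solves the two-variable subproblem in closed form, and clips the result to the box constraints. It is a teaching aid under those simplifying assumptions, not a production solver:

```python
# Hedged sketch of simplified SMO for a linear-kernel, soft-margin SVM.
# Random second-multiplier choice replaces the full selection heuristics.
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                          # linear kernel Gram matrix
    alpha = np.zeros(n)
    b = 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            # KKT violation check for alpha_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(n - 1)
                j = j if j < i else j + 1            # random j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:                     # box bounds for alpha_j
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # update the threshold b
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                  # primal weights for the linear kernel
    return alpha, w, b

X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 2.0],
              [-2.0, -2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha, w, b = simplified_smo(X, y, C=1.0)
print("alpha:", np.round(alpha, 3), "w:", np.round(w, 3), "b:", round(b, 3))
```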
SVM Time Complexity Analysis and Recommendations for Large Datasets

The support vector machine (Support Vector Machine, SVM) is a widely used machine learning algorithm for classification and regression. As datasets keep growing, however, SVMs can run into time complexity problems. This section analyzes the time complexity of the SVM and offers some suggestions for handling large datasets.

First, the time complexity. During training, the complexity of the SVM depends mainly on the number of support vectors, i.e. the training samples closest to the decision boundary, which determine the classifier. If the training set contains N samples of which M become support vectors, the training time is on the order of O(N*M^2), where the M^2 factor comes from computing inner products among support vectors. When the number of support vectors is large, training time therefore grows substantially.

For large datasets this can become a real problem. Several strategies can help:

1. Reduce the data dimensionality. Large datasets often have high-dimensional features, which increase the SVM's computational cost. Feature selection or dimensionality reduction techniques (such as principal component analysis) can reduce the dimensionality and thus the training time; a pipeline sketch follows.
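A minimal sketch of this strategy, assuming scikit-learn (the digits dataset and the choice of 20 components are illustrative):

```python
# Hedged sketch: dimensionality reduction (PCA) before SVM training.
# Assumes scikit-learn; the number of components is an illustrative choice.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # 64 input features

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=20),   # keep 20 principal components
                     SVC(kernel="rbf"))
print("CV accuracy with PCA(20):",
      round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```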
2. Sample the data. For very large datasets, the amount of data can be reduced by sampling: for example, random sampling or cluster-based sampling can select a subset of the original data for training, cutting down the computation time.

3. Parallelize the computation. SVM training can be accelerated with parallel computation, for example by using multi-core processors or distributed frameworks (such as Spark) to compute the kernel inner products in parallel.

4. Learn incrementally. For large datasets, the model can be trained incrementally: new samples are folded into an existing model step by step instead of retraining the whole model from scratch, which saves a great deal of computation. A sketch using an incrementally trained linear classifier follows.
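A sketch of incremental training. Note that scikit-learn's kernel SVC has no incremental mode, so this example substitutes SGDClassifier with hinge loss, i.e. a linear SVM objective trained with stochastic gradient descent; the batch size and parameters are illustrative:

```python
# Hedged sketch: incremental (out-of-core style) training with partial_fit.
# SGDClassifier with hinge loss stands in for a linear SVM here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)

classes = np.unique(y)
for start in range(0, len(X), 2000):         # feed the data in chunks
    Xb, yb = X[start:start + 2000], y[start:start + 2000]
    clf.partial_fit(Xb, yb, classes=classes)

print("training accuracy:", round(clf.score(X, y), 3))
```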
In addition to the above strategies, one can also consider SVM variants based on approximation, such as approximate kernel computations or fast algorithms for linear SVMs. These can reduce the computational cost considerably while keeping classification accuracy high, as in the sketch below.
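A sketch of one such approximation, assuming scikit-learn: a Nystroem approximation of the RBF kernel feeding a fast linear SVM (the component count and gamma are illustrative):

```python
# Hedged sketch: approximate RBF kernel via Nystroem features + linear SVM.
# Parameters are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

approx = make_pipeline(Nystroem(kernel="rbf", gamma=0.02, n_components=300),
                       LinearSVC())
print("CV accuracy (Nystroem + LinearSVC):",
      round(cross_val_score(approx, X, y, cv=5).mean(), 3))
```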
In summary, SVMs can face time complexity challenges on large datasets, but dimensionality reduction, data sampling, parallel computation, incremental learning, and approximate algorithms can all reduce the computation time effectively.
OPTIMIZING SVMS FOR COMPLEX CALL CLASSIFICATION

Patrick Haffner, Gokhan Tur, Jerry H. Wright
AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ, 07932, USA
haffner, gtur, jwright@

ABSTRACT

Large margin classifiers such as Support Vector Machines (SVM) or Adaboost are obvious choices for natural language document or call routing. However, how to combine several binary classifiers to optimize the whole routing process, and how this process scales when it involves many different decisions (or classes), is a complex problem that has only received partial answers [1, 2]. We propose a global optimization process based on an optimal channel communication model that allows a combination of possibly heterogeneous binary classifiers. As in Markov modeling, computational feasibility is achieved through simplifications and independence assumptions that are easy to interpret. Using this approach, we have managed to decrease the call-type classification error rate for AT&T's How May I Help You (HMIHY) natural dialog system by 50%.

1. INTRODUCTION

Since the first demonstration of the HMIHY system [3], Interactive Voice Response (IVR) systems increasingly rely on natural language call routing, where a computer attempts to recognize many possible outcomes to an open-ended prompt. The traditional solution to sentence classification is the bag-of-words approach used in Information Retrieval. Because of the very large dimension of the input space, large margin classifiers such as SVMs [4, 5] or Adaboost [1] were found to be very good candidates. To take into account context when two sentences are compared, one can match N-grams (a substring of N words) rather than words [1]. Sub-word features can also be considered [6].

Large margin classifiers were initially demonstrated on binary classification problems, where the definition of the margin is unambiguous, and the reasons why this margin leads to good generalization are reasonably well understood. Usually, their performance can be expressed as a single number: the binary classification accuracy. Unfortunately, generalizing these binary classifiers to Multi-Class Classifiers (MCC) is not straightforward (in the rest of the paper, the word classifier will only cover the binary case). The definition of a multi-class margin whose maximization is supposed to optimize the generalization, and the resulting optimization problem, lead to solutions which may be too complex to be practical [7]. The most popular approach is to combine binary classifiers and to produce a vector whose components are the output of one binary classifier; this vector is then used to make the multiclass decision. Error Correcting Output Codes (ECOC) provide a good framework for this approach [2]. However, this leaves us with the choice of the code. In the 1-vs-others case, we have one classifier per class, with the positive examples taken from that class and the negative examples from the other classes; a class is recognized when the corresponding classifier yields the largest output. In the 1-vs-1 case, each classifier only discriminates one class from another, so one classifier is needed for every pair of classes.

After choosing a multiclass combination of binary classifiers, we are still left with a large number of error criteria to choose from, most of them based on a combination of precision and recall for each class [5] or on how the classes are ranked [1]. Another problem which has seldom been addressed is the choice of the model, which is usually assumed to be the same for all the classifiers. However, some of the classifiers have to learn tasks with richer data than others, and may benefit from more complex models [6]. Ideally, one would like to define a framework that provides us with a complete solution, covering all choices in a globally optimal way. Section 2 describes an approach that, under well understood simplifying assumptions, attempts to get closer to this ideal.

2. A CHANNEL OPTIMIZATION FRAMEWORK

We have seen in the introduction that defining the optimal large margin multiclass architecture is still an open problem, with a large variety of solutions proposed for each of the critical choices: (i) the loss function that must be minimized, (ii) the output code (i.e. the set of binary classifiers to choose) and (iii) the optimization procedure to minimize this loss. Note the order in which we want to make these choices: given an application, the loss function should be uniquely determined (and measured in dollars or customer satisfaction), and all the other choices must be derived from this imposed loss function. In practice, there are many ways to "idealize" a real loss, and we start with the most common one for telecommunication applications.

2.1. The optimal communication channel

The first simplifying assumption is that we have an optimal communication channel. Take the example of interactive call routing: the objective is to obtain as much information as possible from the user within the smallest interaction time. We assume that after a sufficient number of confirmation prompts, the user request is satisfied, so the goal is to minimize the coding costs of such prompts. For a problem with two possible outcomes, the ideal communication channel would work in three steps. First, the user sends the request x. Second, the computer replies (see footnote 1) with an estimate of the posterior probability of each class; usually, a logistic remapping is applied to the classifier output f(x) to obtain a probability ranging from 0 to 1, P(x) = 1 / (1 + exp(-(a*f(x) + b))). Third, the user sends enough bits to correct the result so that it matches the true label. Using an optimal entropic coder (see footnote 2), this cost corresponds to the Kullback-Leibler (KL) divergence between the target t (1 for the positive class, 0 for the negative class) and the estimate P(x):

    KL(t || P) = t log(t / P(x)) + (1 - t) log((1 - t) / (1 - P(x)))        (1)

Footnote 1: Our ideal scheme assumes no cost to send the estimate, as we can duplicate a model of the computer that computes it on the user side.
Footnote 2: In practice, the computer would ask for a confirmation that the outcome is the positive class whenever P(x) > 0.5, and the user would send back 0 if correct and 1 if wrong. So we need to encode a sequence of bits where the 0 is much more likely than the 1. If the interactivity could be ignored, arithmetic coding could compress this sequence to the entropic limit.

2.2. Class independence assumption

This first assumption gave us the loss function we shall minimize, namely the KL divergence. However, when moving to a problem with more than 2 outcomes, the optimization of this KL divergence cannot always be reduced to binary classifications. The second assumption is to uncouple the problem, and consider that we have independent problems for each class. This means that observing that an example belongs to class i and observing that it belongs to class j are independent, non-exclusive events. An obvious consequence is that an example can belong to several different classes (multiple label examples). Denoting P_i the estimated probability of observing class i, one can show that the total KL divergence is the sum of the per-class KL divergences between the P_i and the target outputs. This assumption gives us the code: we build an MCC out of 1-vs-others binary classifiers, and each classifier can be optimized separately. Other conditional independence assumptions lead to the 1-vs-1 case, but they are significantly more complex.

2.3. Optimization: the right balance between regularized and exhaustive search learning

Learning should attempt to find the function f over the input data that minimizes an estimate of the expected KL divergence under the true distribution of the data. Techniques based on gradient descent or Expectation-Maximization could directly minimize this loss on a training set. However, because of the high dimensionality of the input, overfitting would be a major issue, and we prefer to rely on techniques with good control over generalization.

If the number of hypotheses corresponding to different parameter configurations in the model is finite, we can consider exhaustive search learning: for each hypothesis, train the model and measure the loss on a set of examples (called the validation set); the hypothesis with the minimum loss is kept. Note that we should only choose between a small number of hypotheses, as uniform convergence bounds ([4] page 123) show that the estimate on the validation set may overfit the generalization error by an amount that grows with the number of hypotheses.

Regularized learning should be used when the hypothesis space becomes too large. Typically, a penalty term over the size of the parameters is added to prevent excessive overfitting on a set of examples (called the training set). In the case of SVMs, this penalty is the inverse of the margin. Unfortunately, not every parameter in our model can be subjected to regularized learning, as we do not know how to penalize some of the parameters. Also, efficient optimization algorithms require a convex form, which cannot be obtained after logistic remapping. In practice, we should separate parameters between the vast majority that can be estimated through regularized learning of a convex function, and a small minority that we can afford to estimate through exhaustive search.

The function that performs binary classification before logistic remapping is a good candidate for regularized learning. Using SVMs with a kernel K, this function is parameterized by the multipliers alpha_i, with f(x) = sum_i alpha_i y_i K(x_i, x); when alpha_i > 0, x_i is called a support vector. With the logistic remapping above, the loss corresponding to a misclassified example in Eq. (1) is

    log(1 + exp(-y (a*f(x) + b)))        (2)

While, for convexity reasons, SVMs cannot be directly optimized for this loss, Eq. (2) shows it can be approximated with the hinge loss traditionally used in SVMs [4]. Bayesian approaches [8] (Gaussian processes, Relevance Vector Machines) should lead to a finer level of understanding of how binary SVMs can be used in a probabilistic model. The alpha_i that maximize the margin while minimizing the loss are the Lagrange multipliers of a convex optimization problem that we solve with an iterative procedure that updates the N maximum gradient violators, first implemented in SVMlight [9].

The parameters left to exhaustive search are the parameters a and b in the logistic remapping, the kernel parameters, and the parameter C (split into separate values if we use different ones for positive and negative examples) that weights the SVM loss. Parameters a and b can be chosen using a univariate logistic regression, to minimize the total KL divergence [10] on a first set of validation data. For each choice of kernel or C parameters, the SVM needs to be retrained, so we can only afford a small set of parameter hypotheses. In summary, for each hypothesis, we train the SVM on a training set, obtain a and b on a first validation set, and choose the hypothesis that minimizes the KL divergence on a second validation set. To maximize the amount of validation data available, and avoid reducing the amount of training data, we can use a double N-fold cross-validation process to build these sets over all the training data available.

3. EXPERIMENTS

We now evaluate our ideas for SVM-based call classification on the spoken language understanding (SLU) component of AT&T's HMIHY natural dialog system. In this system, users are asking questions about their bill, calling plans, etc. Within the SLU component, we are aiming at classifying the input telephone calls into classes (call-types), such as Billing Credit or Calling Plans [11, 12]. First, Section 3.1 describes an implementation and measures the impact of a proper optimization of the SVMs against baseline approaches whose performance is considered state-of-the-art for text classification, namely Boostexter [1] and linear SVMs [5]. Second, Section 3.2 shows how this approach was used to reduce the call-type classification error rate in the actual system.

3.1. Implementation and validation

Our methodology has been implemented as a software package called Llama where, besides the training data, the user only needs to specify the list of hypothesis kernels to explore. Validation sets are set apart automatically and used to determine the optimal kernel for each class. To our knowledge, this is the first implementation of multi-class SVMs with heterogeneous kernels. Llama optimization also takes advantage of the fact that the kernel cache can be shared between classifiers. For strict binary SVM classification, we verified that its performance is the same as SVMlight.

To show that this methodology significantly improves performance over a simple recombination of binary SVMs, a large amount of data, representative of the problems that need to be solved, was needed. We used 27892 training utterances, representing 48 call-types. This recently labeled data was preferred over our standard data set which is used in Section 3.2: a much larger number of examples allows more significant results, and more call-types makes classification more difficult. We did not use easy-to-classify dialog act classes such as "Yes", "No" or "Hello". Because our methodology does not explicitly model rejection yet, we also avoided the "Other" class used to train an explicit rejection model.

Results are reported for both unigram (Table 1) and bigram (Table 2) inputs. In the case of unigrams, a sparse feature vector detecting word occurrences in the best sentence recognized by the speech recognizer is computed (features extracted from full lattices are described in the next section; the definition of efficient lattice kernels is the topic of a separate study: C. Cortes, P. Haffner and M. Mohri, "Lattice Kernels for Spoken-Dialog Classification", submitted to ICASSP'03). In both cases, the Euclidean norm of the feature vector is normalized to 1. We compared SVMs with linear and degree-2 polynomial kernels. As a baseline, we also report results on Boostexter using real Adaboost.MH with logistic loss (it has been used for this type of task [13] and we found it to be as good as or better than Adaboost with exponential loss).

The following SVM configurations were tried: SVMp uses a polynomial kernel of degree p, with C=1 and without logistic remapping; SVMp-map adds logistic remapping; SVMp-mapC additionally lets C be chosen independently for each classifier (between 1 and 2). In addition, SVM12map represents an optimal recombination of SVM1map and SVM2map, where linear and degree-2 polynomials are selected independently for each classifier.

Table 1. Error rates using 6058 unigrams as input.

    MCC          Hamming Error   Multiclass Test   Multiclass Train   Threshold Error
    Boostexter        -               33.7              19.9              15.1 (*)
    SVM1            4641              36.9              32.39             15.1 (*)
    SVM1map         4704              35.1              27.8              15.1
    SVM1mapC        4789              33.7              25.6              14.1
    SVM2            3820              31.9               7.92             11.3 (*)
    SVM2map         3807              31.1               6.64             11.2
    SVM2mapC        3846              31.1               5.81             11.8
    SVM12map        3807              31.1               6.64             11.2

Table 2. Error rates using 15002 bigrams as input.

    MCC          Hamming Error   Multiclass Test   Multiclass Train   Threshold Error
    Boostexter        -               34.2               8.4              14.5 (*)
    SVM1            4015              34.1              22.5              13.1 (*)
    SVM1map         3968              31.6              15.2              12.2
    SVM2            3777              31.8               3.56             11.1 (*)
    SVM2map         3750              30.8               3.29             11.1
    SVM12map        3749              30.8               3.29             11.0

In both tables, the first column reports the Hamming error, which is the number of binary misclassifications accumulated over the 48 classes and 7105 test utterances. The other columns report a multiclass error, which is the percentage of examples where the class with the largest output does not match one of the target classes; note that this error may not be fully correlated with the Hamming error. The test error is computed over 7105 test utterances, the train error over 27892 training utterances, and the threshold error over the 4263 test utterances (a 40% reject rate) with the largest output for the top candidate. The asterisk marks MCCs which were not subjected to logistic remapping, and where using the same threshold for all the classes may be questionable.

Because they perform large margin classification in the same feature space (with a different metric though), linear SVMs and Boostexter have similar performances (as illustrated in the case of bigram input, Boostexter may overfit, and no obvious regularization parameter is available to control this; the error rate reported in Table 2 corresponds to full convergence, and stopping learning at the minimum of the test error would have yielded a lower error). After optimizing SVM1 performance with logistic remapping, SVM1map applied to unigrams can be improved using either degree-2 polynomials or bigrams, with similar gains in both cases. The features induced by degree-2 polynomial kernels correspond to every pair of words found in an utterance (not just consecutive words). Our best choice is SVM12map on bigrams: because it uses linear SVMs whenever possible, it is faster than SVM2map. Note that no further significant improvements were observed with degree-3 polynomial kernels or trigram inputs, and results are not reported here for lack of space. In summary, properly optimizing the SVM architecture results in a reduction of the threshold error, which is our most significant measure of performance.

A surprising result found in Table 1 is that the automatic selection of C performed in SVMp-mapC actually increases the Hamming error when compared to SVMp-map, while properly reducing the KL error over the test data. This discrepancy suggests that the hinge approximation in Eq. (2) is too strong. A similar phenomenon is observed with SVM12map: a significant reduction in the KL error over SVM2map does not translate into a reduction of the Hamming error. We tried automatic selection of C or the kernel to minimize each classifier's binary error rather than the KL error, but results actually got worse, with excessive overfitting.

3.2. Deployment in the HMIHY system

The SLU component of HMIHY is a series of modules. The input of the SLU component is a compressed form of an ASR lattice, called a word confusion network or sausage [11]. First a preprocessor converts certain groups of words into a single token (e.g. converting the tokens "A T and T" into "ATT"). Then, Salient Grammar Fragments (SGFs) [14] are matched in the preprocessed utterance, filtered and parsed [11] to be used as features for the call-type classifier. SGFs are graphs combining salient phrases. How to automatically acquire salient phrases from a corpus of transcribed and labeled training data, and cluster them into SGFs in the format of Finite State Machines (FSMs) [15], is described in [11, 14]. In the previous classifier [14], the learned SGFs are attached to given call-types with weights, and the classifier decides on the call-type by combining the SGFs occurring in an utterance. With SVMs, we obtain the SGFs as features, but let the SVM decide how to exploit their occurrences in an utterance. Thanks to the modular structure of the SLU component, this has been a minor change in the implementation.

During our experiments, we used 7647 utterances as our training data and 1405 utterances as our test data. All the utterances are responses to the first greeting in the dialog (e.g. "Hello, This is AT&T. How May I Help You?"). The word error rate for these utterances is 31.7% and the oracle accuracy, defined as the accuracy of the path in the lattice closest to the transcription, is 91.7%. Sausages yield about the same word error rate and oracle accuracy while being about 100 times smaller than lattices. We used Mangu et al.'s algorithm to convert lattices into sausages [16]. In order to evaluate the change in performance due to the change in classifier, we keep the ASR engine, the SGF features and other components unchanged.

Fig. 1. Improvements in call classification performance. [ROC curves; figure not reproduced in this text version.]

Figure 1 presents the results using the SVM12map architecture on 19 call-types. Besides the number of call-types, another major difference with the previous section is that we explicitly train a garbage class called "Others". Results are presented in the form of an ROC curve, in which we vary the rejection threshold. The utterances for which the top classifier output is below that threshold are rejected. The correct classification rate is the ratio of correct decisions among the accepted utterances. The false rejection rate is the percentage of rejected utterances which were labeled as one of the non-"Other" call-types. Hence, there is a trade-off between correct classification and false rejection as the rejection threshold varies, and the ultimate aim is to reach a correct classification rate of 100% and a false rejection rate of 0%. Rank-2 curves compute the performance using the top two classifier outputs; we have found that these curves give an indication of performance in a dialog system where we have the opportunity for confirmation and error correction. The error rate has decreased by about 50% relative on the rank-2 ROC curve for almost all thresholds, which shows the effectiveness of SVMs for this task. For rank-1, we have managed to improve the classification accuracy by 4-5 percentage points absolute.

4. CONCLUSION

Approaches based on probabilistic modeling have been very successful in speech and language understanding, and have allowed large architectures to be built that are globally consistent. This work describes a first step toward the optimal embedding of binary large margin classifiers into such models, and immediately shows significant improvements in performance. It also suggests strategies to further optimize the recombination of binary classifiers. Using 1-vs-1 classifiers (instead of 1-vs-others) would better model the fact that classes are exclusive. Replacing the hinge loss in SVMs with better approximations of the logistic loss [8, 13] should be considered. An explicit probabilistic modeling of the rejection would allow a better representation of the task.

5. ACKNOWLEDGMENTS

The authors wish to thank Allen Gorin for initiating and strongly supporting this collaborative project. They are also grateful to Dilek Hakkani-Tur and Tirso Alonso for their help with the experiments.

6. REFERENCES

[1] R. E. Schapire and Y. Singer, "Boostexter: A boosting-based system for text categorization," Machine Learning, vol. 39, no. 2/3, pp. 135-168, 2000.
[2] E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing multiclass to binary: A unifying approach for margin classifiers," in Proc. of ICML, 2000.
[3] A. L. Gorin, G. Riccardi, and J. H. Wright, "How May I Help You?," Speech Communication, vol. 23, pp. 113-127, 1997.
[4] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[5] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," in Proc. of ECML-98, Springer Verlag, 1998.
[6] M. Larson, S. Eickeler, G. Paass, E. Leopold, and J. Kindermann, "Exploring sub-word features and linear support vector machines for German spoken document classification," in Proc. of ICSLP, 2002.
[7] Y. Guermeur, A. Elisseeff, and H. Paugam-Moisy, "A new multi-class SVM based on a uniform convergence result," in Proc. of IJCNN, 2000.
[8] B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[9] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[10] J. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods," in NIPS, MIT Press, 1999.
[11] G. Tur, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür, "Improving spoken language understanding using word confusion networks," in Proc. of ICSLP, 2002.
[12] A. L. Gorin, G. Riccardi, and J. H. Wright, "Automated natural spoken dialog," IEEE Computer Magazine, vol. 35, no. 4, pp. 51-56, April 2002.
[13] R. Schapire, M. Rochery, M. Rahim, and N. Gupta, "Incorporating prior knowledge in boosting," in Proc. of ICML, 2002.
[14] J. Wright, A. Gorin, and G. Riccardi, "Automatic acquisition of salient grammar fragments for call-type classification," in Proc. of Eurospeech, Rhodes, Greece, September 1997, pp. 1419-1422.
[15] M. Mohri, F. Pereira, and M. Riley, "FSM Library - General purpose finite state machine software tools," /sw/tools/fsm.
[16] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: word error minimization and other applications of confusion networks," Computer Speech and Language, vol. 14, no. 4, pp. 373-400, 2000.