Text classification using string kernels


Text Classification Study (3): Feature Weighting (TF-IDF) and Feature Extraction

Feature weighting with TF-IDF is a commonly used feature extraction technique in text classification: it converts text data into numerical features so that machine learning algorithms can process and analyse them.

In this article we introduce the TF-IDF feature weight and its rationale, and then discuss commonly used feature extraction methods.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It combines how often a term occurs in a document (term frequency) with how informative the term is across the whole corpus (inverse document frequency); the TF-IDF value of a term expresses its importance within a document.

TF-IDF is computed as:

    TFIDF = TF × IDF

where TF is the term frequency, i.e. the number of times the term occurs in the document, and IDF is the inverse document frequency, a measure of the term's importance across the corpus, computed as:

    IDF = log(N / (n + 1))

where N is the total number of documents in the corpus and n is the number of documents that contain the term. The added 1 in the denominator avoids division by zero.

The TF-IDF value thus serves as a feature weight reflecting how important a term is within a document. In particular, terms that occur frequently in one document but rarely in the rest of the corpus receive high TF-IDF values, making them more discriminative for classification.
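As a concrete illustration, here is a minimal Python sketch of exactly the formula above, using raw counts for TF and the add-one denominator for IDF; the toy documents and the whitespace tokenization are invented for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents,
    using the IDF variant from the text: idf = log(N / (n + 1))."""
    N = len(docs)
    df = Counter()                 # n: number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)          # raw term frequency within this document
        weights.append({t: tf[t] * math.log(N / (df[t] + 1)) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
for w in tfidf(docs):
    print(w)
```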

For text classification, feature extraction is usually performed first to turn the raw text into numerical features, which are then fed to a machine learning algorithm for training and prediction. The goal of feature extraction is to pull the relevant information out of the text while preserving as much semantic content as possible.

Commonly used feature extraction methods include the following (a small sketch follows the list):

1. Bag of words: treat the text as an unordered bag of terms, ignoring word order and recording only which terms occur. Each term becomes a feature, represented by its raw frequency or its TF-IDF value.

2. n-gram model: extend the bag of words by also considering combinations of adjacent words, taking every run of n consecutive words as a feature. In a bigram model, for example, each pair of adjacent words is a feature.

3. Word2Vec: use a neural model to represent words as dense vectors that preserve semantic relationships between words.
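A minimal sketch of the first two representations (bag of words and n-gram counts) in plain Python; the tokenized sentence is an invented example.

```python
from collections import Counter

def ngram_features(tokens, n=2):
    """Bag-of-words counts for n=1, n-gram counts for n>=2."""
    if n == 1:
        return Counter(tokens)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
print(ngram_features(tokens, 1))  # unigram (bag-of-words) counts
print(ngram_features(tokens, 2))  # bigram counts
```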

Machine Learning Design Knowledge Test: 53 Multiple-Choice Questions

1. In machine learning, the main goal of supervised learning is: A) learning from unlabelled data; B) learning from labelled data; C) optimising model complexity; D) reducing the use of computational resources.
2. Which of the following algorithms belongs to unsupervised learning? A) linear regression; B) decision trees; C) cluster analysis; D) support vector machines.
3. In model evaluation, the main purpose of cross-validation is to: A) increase model complexity; B) reduce dataset size; C) assess the model's generalisation ability; D) speed up training.
4. Which of the following is not a feature selection method? A) principal component analysis (PCA); B) recursive feature elimination (RFE); C) grid search; D) variance threshold.
5. In deep learning, convolutional neural networks (CNN) are mainly used for: A) text analysis; B) image recognition; C) audio processing; D) recommender systems.
6. Which activation function is the most commonly used in neural networks? A) linear; B) step; C) ReLU; D) hyperbolic tangent.
7. In machine learning, overfitting is usually caused by: A) an overly simple model; B) too much data; C) an overly complex model; D) improper data preprocessing.
8. Which technique is used to handle class imbalance? A) data augmentation; B) resampling; C) feature selection; D) model ensembling.
9. In natural language processing (NLP), the main purpose of word embeddings is to: A) improve computational efficiency; B) reduce vocabulary size; C) capture semantic relations between words; D) increase text length.
10. Which of the following algorithms is not an ensemble learning method? A) random forest; B) AdaBoost; C) gradient boosting machine (GBM); D) logistic regression.
11. In machine learning, the ROC curve is used to evaluate: A) model accuracy; B) model complexity; C) generalisation ability; D) the performance of classification models.
12. Which of the following is not a data preprocessing step? A) missing-value handling; B) feature scaling; C) model training; D) data standardisation.
13. In machine learning, L1 regularisation is mainly used for: A) reducing model complexity; B) increasing the number of features; C) feature selection; D) improving model precision.
14. Which method can be used to handle time series data? A) principal component analysis (PCA); B) linear regression; C) the ARIMA model; D) decision trees.
15. The main difference between bagging and boosting lies in: A) how data are processed; B) model complexity; C) how samples are used; D) how features are selected.
16. Which algorithm is suitable for recommender systems? A) k-means clustering; B) collaborative filtering; C) logistic regression; D) random forest.
17. In machine learning, A/B testing is mainly used for: A) model selection; B) feature engineering; C) model evaluation; D) user experience optimisation.
18. Which method can be used to handle missing data? A) deleting samples with missing values; B) mean imputation; C) median imputation; D) all of the above.
19. In machine learning, the bias-variance trade-off mainly concerns: A) model complexity; B) dataset size; C) the model's generalisation ability; D) the number of features.
20. Which of the following algorithms belongs to reinforcement learning? A) Q-learning; B) linear regression; C) decision trees; D) support vector machines.
21. In machine learning, the main purpose of feature engineering is to: A) reduce data volume; B) increase model complexity; C) improve model performance; D) simplify data processing.
22. Which method can be used to handle multi-class problems? A) one-vs-all; B) one-vs-one; C) hierarchical clustering; D) all of the above.
23. In machine learning, the cross-entropy loss function is mainly used for: A) regression problems; B) classification problems; C) clustering problems; D) reinforcement learning.
24. Which of the following algorithms does not belong to deep learning? A) convolutional neural networks (CNN); B) recurrent neural networks (RNN); C) random forest; D) long short-term memory networks (LSTM).
25. In machine learning, the main purpose of gradient descent is to: A) reduce the number of features; B) optimise model parameters; C) increase data volume; D) speed up computation.
26. Which method can be used to handle text data? A) bag of words; B) TF-IDF; C) word embeddings; D) all of the above.
27. In machine learning, the main purpose of regularisation is to: A) reduce the number of features; B) prevent overfitting; C) increase data volume; D) speed up computation.
28. Which algorithm is suitable for anomaly detection? A) linear regression; B) decision trees; C) support vector machines; D) isolation forest.
29. In machine learning, the main purpose of ensemble learning is to: A) improve the performance of a single model; B) combine the strengths of multiple models; C) reduce data volume; D) increase model complexity.
30. Which method can be used to handle high-dimensional data? A) principal component analysis (PCA); B) feature selection; C) feature extraction; D) all of the above.
31. In machine learning, the main purpose of k-means clustering is: A) classification; B) regression; C) clustering; D) prediction.
32. Which algorithm is suitable for time series forecasting? A) linear regression; B) the ARIMA model; C) decision trees; D) support vector machines.
33. In machine learning, grid search is mainly used for: A) feature selection; B) model selection; C) data preprocessing; D) model evaluation.
34. Which method can be used to handle categorical features? A) one-hot encoding; B) label encoding; C) feature hashing; D) all of the above.
35. In machine learning, the main use of the AUC-ROC curve is: A) evaluating the performance of classification models; B) evaluating regression models; C) evaluating clustering models; D) evaluating reinforcement learning models.
36. Which of the following algorithms does not belong to supervised learning? A) linear regression; B) decision trees; C) cluster analysis; D) support vector machines.
37. In machine learning, the main purpose of feature scaling is to: A) reduce the number of features; B) improve model performance; C) increase data volume; D) simplify data processing.
38. Which method can be used for text classification? A) bag of words; B) TF-IDF; C) word embeddings; D) all of the above.
39. In machine learning, the main advantages of decision trees are: A) easy to understand and interpret; B) computationally efficient; C) insensitive to missing values; D) all of the above.
40. Which algorithm is suitable for image segmentation? A) convolutional neural networks (CNN); B) recurrent neural networks (RNN); C) random forest; D) support vector machines.
41. In machine learning, L2 regularisation is mainly used for: A) reducing model complexity; B) increasing the number of features; C) feature selection; D) improving model precision.
42. Which method can be used to handle seasonality in time series data? A) moving averages; B) seasonal decomposition; C) differencing; D) all of the above.
43. In machine learning, the main purpose of bagging is to: A) reduce model variance; B) reduce model bias; C) increase data volume; D) speed up computation.
44. Which algorithm is suitable for processing sequence data? A) convolutional neural networks (CNN); B) recurrent neural networks (RNN); C) random forest; D) support vector machines.
45. In machine learning, the main purpose of AdaBoost is to: A) reduce model variance; B) reduce model bias; C) increase data volume; D) speed up computation.
46. Which method can be used for sentiment analysis of text data? A) bag of words; B) TF-IDF; C) word embeddings; D) all of the above.
47. In machine learning, the main advantage of support vector machines (SVM) is: A) suitability for high-dimensional data; B) computational efficiency; C) insensitivity to missing values; D) all of the above.
48. Which algorithm is suitable for analysing user behaviour in recommender systems? A) collaborative filtering; B) content-based filtering; C) hybrid filtering; D) all of the above.
49. In machine learning, the main types of cross-validation include: A) k-fold cross-validation; B) leave-one-out cross-validation; C) random-split cross-validation; D) all of the above.
50. Which method can be used to handle image data? A) convolutional neural networks (CNN); B) recurrent neural networks (RNN); C) random forest; D) support vector machines.
51. In machine learning, the main advantage of gradient boosting machines (GBM) is: A) suitability for high-dimensional data; B) computational efficiency; C) insensitivity to missing values; D) all of the above.
52. Which algorithm is suitable for outlier detection in anomaly detection? A) linear regression; B) decision trees; C) support vector machines; D) isolation forest.
53. In machine learning, the main purpose of feature extraction is to: A) reduce the number of features; B) improve model performance; C) increase data volume; D) simplify data processing.

Answers: 1. B; 2. C; 3. C; 4. C; 5. B; 6. C; 7. C; 8. B; 9. C; 10. D; 11. D; 12. C; 13. C; 14. C; 15. C; 16. B; 17. D; 18. D; 19. C; 20. A; 21. C; 22. D; 23. B; 24. C; 25. B; 26. D; 27. B; 28. D; 29. B; 30. D; 31. C; 32. B; 33. B; 34. D; 35. A; 36. C; 37. B; 38. D; 39. D; 40. A; 41. A; 42. D; 43. A; 44. B; 45. B; 46. D; 47. A; 48. D; 49. D; 50. A; 51. D; 52. D; 53. B.

Jin Kangrong: A Chinese Text Classification Method Based on the Random Forest Algorithm

1. The Random Forest algorithm is widely used in Chinese text classification.

2. The algorithm combines multiple decision trees to improve classification accuracy.

3. It can effectively handle high-dimensional, sparse feature spaces.

4. It has been successfully applied in sentiment analysis, topic classification, and news categorization.

5. It can handle unbalanced datasets in text classification tasks.

6. By using feature importance measures, the algorithm can identify the most influential features in the classification process (see the sketch after this list).

7. The Random Forest algorithm is computationally efficient and scales to large datasets.
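The preview names no implementation, but as an illustration of point 6, here is a hedged scikit-learn sketch: the library choice, toy texts and labels are assumptions made for the example, not part of the source. It trains a random forest on TF-IDF features and ranks terms by the forest's impurity-based importances.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy sentiment data, only to make the example runnable
texts = ["great movie, loved it", "terrible film, boring",
         "wonderful acting throughout", "boring and awful plot"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Rank terms by the forest's feature importance scores
ranked = sorted(zip(vec.get_feature_names_out(), clf.feature_importances_),
                key=lambda p: -p[1])
for term, score in ranked[:5]:
    print(term, round(score, 3))
```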

Text Feature Selection Methods in Natural Language Processing

Natural language processing (NLP) is an important branch of artificial intelligence that aims to enable computers to understand and process human language.

In NLP, text feature selection is a key step: it extracts the most relevant and useful features from large volumes of text data for downstream tasks such as text classification, sentiment analysis, and machine translation.

Text feature selection refers to a family of algorithms and techniques for choosing the most representative and discriminative features from raw text. These features can be words, phrases, sentences, or higher-level semantic units. The goal is to find a set of features that best separates the different text categories or best expresses the different semantic contents.

Several methods and techniques are in common use. The first family is frequency based: features are selected according to how often they occur across the text collection. Typical examples are term frequency (TF) and inverse document frequency (IDF). TF is the number of times a feature occurs in one document, while IDF reflects how rare the feature is across the whole collection. Multiplying TF by IDF yields an importance score for the feature, which can then be used for selection.

Another common family is based on information gain, a measure of how important a feature is for a classification task. The information gain of each candidate feature with respect to the class label is computed, and features are selected accordingly: the larger the information gain, the more the feature contributes to classification and the more likely it is to be selected.

Beyond these, there are other selection criteria such as mutual information and the chi-square test. Mutual information measures the correlation between two random variables and can be used to score features against the class label. The chi-square test is a statistical test for whether two variables are significantly associated and can likewise be used for feature selection.

In practice, multiple feature selection methods are often combined. For example, a frequency-based method can first select a pool of important features, which an information-gain-based method then filters further. In this way the strengths of the different methods complement each other and a better feature set is obtained. A minimal sketch of filter-style selection follows.
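As an illustration of filter-style selection, here is a small sketch using scikit-learn's chi-square scorer with SelectKBest; the library, the toy corpus, and the choice of k are assumptions made for the example, not something the text prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Invented two-class toy corpus (finance vs. sports)
texts = ["stock prices rise", "market falls sharply",
         "team wins final", "player scores goal"]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Keep the 4 terms most associated with the class label by chi-square score
selector = SelectKBest(chi2, k=4).fit(X, labels)
kept = selector.get_support()
print([t for t, keep in zip(vec.get_feature_names_out(), kept) if keep])
```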

A Text Filtering System Based on Topic and Sentiment Classification

MIN Jin, HUANG Xuanjing (Department of Computer Science and Engineering, Fudan University, Shanghai 200433). Computer Engineering, Vol. 33, No. 2, January 2007, pp. 163-164. Article ID: 1000-3428(2007)02-0163-02; Document code: A; CLC number: TP18.

Abstract: Text filtering is the procedure of retrieving documents relevant to the requirements of specific users from a large-scale text data stream. This paper introduces a text filtering system that merges topic classification based on the vector space model with sentiment classification based on support vector machines. The experimental results show that the method achieves high classification precision and recall.

Key words: text filtering; text classification; sentiment classification; support vector machine (SVM)

With the spread of the Internet, online electronic documents arrive in floods and often leave users at a loss, giving rise to the so-called "information overload" and "information disorientation" problems: information is vast, yet users do not know how to find what interests them, and even the useful information they do find is frequently mixed with a great deal of "noise".

IEC 61854 — Overhead Lines: Requirements and Tests for Spacers

INTERNATIONAL STANDARD IEC 61854, first edition, 1998-09. Overhead lines – Requirements and tests for spacers. © IEC 1998, copyright all rights reserved. International Electrotechnical Commission, 3, rue de Varembé, Geneva, Switzerland.

CONTENTS

Foreword
1 Scope
2 Normative references
3 Definitions
4 General requirements
4.1 Design
4.2 Materials (4.2.1 General; 4.2.2 Non-metallic materials)
4.3 Mass, dimensions and tolerances
4.4 Protection against corrosion
4.5 Manufacturing appearance and finish
4.6 Marking
4.7 Installation instructions
5 Quality assurance
6 Classification of tests
6.1 Type tests (6.1.1 General; 6.1.2 Application)
6.2 Sample tests (6.2.1 General; 6.2.2 Application; 6.2.3 Sampling and acceptance criteria)
6.3 Routine tests (6.3.1 General; 6.3.2 Application and acceptance criteria)
6.4 Table of tests to be applied
7 Test methods
7.1 Visual examination
7.2 Verification of dimensions, materials and mass
7.3 Corrosion protection test (7.3.1 Hot dip galvanized components other than stranded galvanized steel wires; 7.3.2 Ferrous components protected from corrosion by methods other than hot dip galvanizing; 7.3.3 Stranded galvanized steel wires; 7.3.4 Corrosion caused by non-metallic components)
7.4 Non-destructive tests
7.5 Mechanical tests (7.5.1 Clamp slip tests: longitudinal and torsional; 7.5.2 Breakaway bolt test; 7.5.3 Clamp bolt tightening test; 7.5.4 Simulated short-circuit current test and compression and tension tests; 7.5.5 Characterisation of the elastic and damping properties; 7.5.6 Flexibility tests; 7.5.7 Fatigue tests: general, subspan oscillation, aeolian vibration)
7.6 Tests to characterise elastomers (7.6.1 General; 7.6.2 Tests; 7.6.3 Ozone resistance test)
7.7 Electrical tests (7.7.1 Corona and radio interference voltage (RIV) tests; 7.7.2 Electrical resistance test)
7.8 Verification of vibration behaviour of the bundle/spacer system
Annex A (normative) Minimum technical details to be agreed between purchaser and supplier
Annex B (informative) Compressive forces in the simulated short-circuit current test
Annex C (informative) Characterisation of the elastic and damping properties – Stiffness-Damping Method
Annex D (informative) Verification of vibration behaviour of the bundle/spacer system
Bibliography
Table 1 – Tests on spacers
Table 2 – Tests on elastomers

FOREWORD

1) The IEC (International Electrotechnical Commission) is a worldwide organization for standardization comprising all national electrotechnical committees (IEC National Committees). The object of the IEC is to promote international co-operation on all questions concerning standardization in the electrical and electronic fields. To this end and in addition to other activities, the IEC publishes International Standards. Their preparation is entrusted to technical committees; any IEC National Committee interested in the subject dealt with may participate in this preparatory work. International, governmental and non-governmental organizations liaising with the IEC also participate in this preparation. The IEC collaborates closely with the International Organization for Standardization (ISO) in accordance with conditions determined by agreement between the two organizations.

2) The formal decisions or agreements of the IEC on technical matters express, as nearly as possible, an international consensus of opinion on the relevant subjects since each technical committee has representation from all interested National Committees.

3) The documents produced have the form of recommendations for international use and are published in the form of standards, technical reports or guides and they are accepted by the National Committees in that sense.

4) In order to promote international unification, IEC National Committees undertake to apply IEC International Standards transparently to the maximum extent possible in their national and regional standards. Any divergence between the IEC Standard and the corresponding national or regional standard shall be clearly indicated in the latter.

5) The IEC provides no marking procedure to indicate its approval and cannot be rendered responsible for any equipment declared to be in conformity with one of its standards.

6) Attention is drawn to the possibility that some of the elements of this International Standard may be the subject of patent rights. The IEC shall not be held responsible for identifying any or all such patent rights.

International Standard IEC 61854 has been prepared by IEC technical committee 11: Overhead lines. The text of this standard is based on documents 11/141/FDIS and 11/143/RVD; full information on the voting for the approval of this standard can be found in the report on voting indicated above. Annex A forms an integral part of this standard. Annexes B, C and D are for information only.

1 Scope

This International Standard applies to spacers for conductor bundles of overhead lines. It covers rigid spacers, flexible spacers and spacer dampers. It does not apply to interphase spacers, hoop spacers and bonding spacers.

NOTE – This standard is written to cover the line design practices and spacers most commonly used at the time of writing. There may be other spacers available for which the specific tests reported in this standard may not be applicable.

In many cases, test procedures and test values are left to agreement between purchaser and supplier and are stated in the procurement contract. The purchaser is best able to evaluate the intended service conditions, which should be the basis for establishing the test severity. In annex A, the minimum technical details to be agreed between purchaser and supplier are listed.

2 Normative references

The following normative documents contain provisions which, through reference in this text, constitute provisions of this International Standard. At the time of publication of this standard, the editions indicated were valid. All normative documents are subject to revision, and parties to agreements based on this International Standard are encouraged to investigate the possibility of applying the most recent editions of the normative documents indicated below. Members of IEC and ISO maintain registers of currently valid International Standards.

IEC 60050(466):1990, International Electrotechnical Vocabulary (IEV) – Chapter 466: Overhead lines
IEC 61284:1997, Overhead lines – Requirements and tests for fittings
IEC 60888:1987, Zinc-coated steel wires for stranded conductors
ISO 34-1:1994, Rubber, vulcanized or thermoplastic – Determination of tear strength – Part 1: Trouser, angle and crescent test pieces
ISO 34-2:1996, Rubber, vulcanized or thermoplastic – Determination of tear strength – Part 2: Small (Delft) test pieces
ISO 37:1994, Rubber, vulcanized or thermoplastic – Determination of tensile stress-strain properties
ISO 188:1982, Rubber, vulcanized – Accelerated ageing or heat-resistance tests
ISO 812:1991, Rubber, vulcanized – Determination of low temperature brittleness
ISO 815:1991, Rubber, vulcanized or thermoplastic – Determination of compression set at ambient, elevated or low temperatures
ISO 868:1985, Plastics and ebonite – Determination of indentation hardness by means of a durometer (Shore hardness)
ISO 1183:1987, Plastics – Methods for determining the density and relative density of non-cellular plastics
ISO 1431-1:1989, Rubber, vulcanized or thermoplastic – Resistance to ozone cracking – Part 1: Static strain test
ISO 1461 (to be published), Hot dip galvanized coatings on fabricated ferrous products – Specifications
ISO 1817:1985, Rubber, vulcanized – Determination of the effect of liquids
ISO 2781:1988, Rubber, vulcanized – Determination of density
ISO 2859-1:1989, Sampling procedures for inspection by attributes – Part 1: Sampling plans indexed by acceptable quality level (AQL) for lot-by-lot inspection
ISO 2859-2:1985, Sampling procedures for inspection by attributes – Part 2: Sampling plans indexed by limiting quality level (LQ) for isolated lot inspection
ISO 2921:1982, Rubber, vulcanized – Determination of low temperature characteristics – Temperature-retraction procedure (TR test)
ISO 3417:1991, Rubber – Measurement of vulcanization characteristics with the oscillating disc curemeter
ISO 3951:1989, Sampling procedures and charts for inspection by variables for percent nonconforming
ISO 4649:1985, Rubber – Determination of abrasion resistance using a rotating cylindrical drum device
ISO 4662:1986, Rubber – Determination of rebound resilience of vulcanizates

An Improved KNN Text Classification Method

ZHONG Jiang, LIU Ronghui (College of Computer Science, Chongqing University, Chongqing 400044, China). Computer Engineering and Applications, 2012, 48(2): 142-144. DOI: 10.3778/j.issn.1002-8331.2012.02.041; Article ID: 1002-8331(2012)02-0142-03; Document code: A; CLC number: TP18.

Abstract: In text categorization, the large dimension of the feature space and the imbalanced distribution of the training samples degrade classification results. To address this problem, the paper proposes an improved KNN method: latent semantic analysis is used to reduce the dimensionality of the text feature matrix, and an improved, density-based KNN classifier performs the categorization. Experimental results show that the proposed method effectively improves text categorization precision.

Key words: feature reduction; latent semantic analysis; K-Nearest Neighbour (KNN); text categorization
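The paper's density-based KNN modification is not described in this preview, so the sketch below only reproduces the overall pipeline the abstract names: TF-IDF features, LSA dimensionality reduction via truncated SVD, then a standard KNN classifier. The scikit-learn APIs and the toy data are assumptions, not the authors' code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented two-class toy corpus
texts = ["interest rates rise again", "bank cuts lending rates",
         "striker scores twice", "coach praises the team"]
labels = ["finance", "finance", "sports", "sports"]

# TF-IDF -> truncated SVD (the usual realisation of LSA) -> plain KNN
pipe = make_pipeline(TfidfVectorizer(),
                     TruncatedSVD(n_components=2, random_state=0),
                     KNeighborsClassifier(n_neighbors=3))
pipe.fit(texts, labels)
print(pipe.predict(["rates fall at the bank"]))
```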

Local Feature Word Selection and Its Application in Text Classification

Contents

Chinese Abstract
Abstract
Chapter 1 Introduction
1.1 Research background and significance
1.2 Research review at home and abroad
1.3 The work and structure of this paper
Chapter 2 The main models and methods of text classification
2.1 Text preprocessing
2.2 Text representation model (2.2.1 Boolean Logic Model; 2.2.2 Vector Space Model; 2.2.3 Co-occurrence Latent Semantic Vector Space Model)
2.3 Feature selection (2.3.1 Document Frequency; 2.3.2 Word Strength; 2.3.3 Mutual Information; 2.3.4 Information Gain; 2.3.5 Chi-square Statistics)
2.4 Classification algorithm (2.4.1 K-Nearest Neighbours; 2.4.2 Random Forest)
2.5 Evaluation of text categorization effect
Chapter 3 Selection of Local Feature Words Based on RF and CWA
3.1 Feature selection algorithm based on Random Forest (3.1.1 CART Decision Tree; 3.1.2 Construction and classification of RF; 3.1.3 Importance assessment of features in RF)
3.2 Research on and lessons from the MI feature selection method (3.2.1 MI Feature Selection Method; 3.2.2 Lessons from the MI Feature Selection Method)
3.3 Research on Co-word Analysis (3.3.1 Co-word Analysis; 3.3.2 Problems in applying CWA to the text representation model)
3.4 Selection of Local Feature Words Based on RF and CWA (3.4.1 The idea of the Local Feature Selection Method; 3.4.2 Concrete algorithm)
Chapter 4 Experiments and results analysis
4.1 Experimental structure and arrangement
4.2 Experimental data (4.2.1 Data sources; 4.2.2 Dividing training set and test set)
4.3 Analysis of experimental results (4.3.1 First set of data; 4.3.2 Second set of data)
Chapter 5 Summary and Prospect
5.1 Summary of this thesis
5.2 Further work
References
Research achievements
Acknowledgements
Personal profile and contact information
Letter of commitment
Authorization statement

Abstract (excerpt): In a highly information-driven society, information technology and the Internet are being updated and adopted ever faster, and the information resources stored as text in electronic databases have grown correspondingly large and complex. Given the basics of how people process information, automatic text classification has become the key technology for handling large-scale, continuously updated text data.

A Text Classification Method Based on Term Frequency Classifier Ensemble

JIANG Yuan, ZHOU Zhihua (National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093; Department of Computer Science and Technology, Nanjing University, Nanjing 210093). Journal of Computer Research and Development, 43(10): 1681-1687, 2006. Received 2006-04-29; revised 2006-05-29. Supported by the National Natural Science Foundation of China (60505013) and the Jiangsu Provincial Natural Science Foundation for Innovative Talents (BK2005412). ISSN 1000-1239 / CN 11-1777/TP. CLC number: TP18.

Abstract: In this paper, a method of text classification based on term frequency classifier ensemble is proposed. A term frequency classifier is a kind of simple classifier obtained after calculating terms' frequencies over the texts in the corpus. Though the generalization ability of a term frequency classifier is not strong, it is a qualified base learner for an ensemble because of its low computational cost, its flexibility in updating with new samples and classes, and the feasibility of improving generalization with the help of ensemble paradigms. An improved AdaBoost algorithm is used to build the ensemble, which employs a scheme of compulsive weight updating to avoid early stopping and is therefore more suitable for text classification. Experimental results on the Reuters-21578 corpus show that the proposed method can achieve good performance in text classification tasks.

Key words: text classification; machine learning; ensemble learning; term frequency classifier; AdaBoost

With the development of Internet technology, more and more information appears on web pages, and the attendant problem is how to retrieve the needed information quickly and effectively. Many existing search engines, such as Google, Yahoo, WebCrawler and Lycos, play an important role in information retrieval. Web documents contain a large amount of textual information, and the retrieval [1], categorization [2], selection [3] and filtering [4] of text are usually based on text classification, which makes effective and accurate text classification techniques the key.

The task of text classification is to assign the texts in a corpus to predefined categories. Typically, texts that already carry category labels are used as training data; after learning, the system can assign a new text to the class of maximal similarity. Naive Bayes, decision trees, k-nearest neighbours, support vector machines and other machine learning methods have all been applied successfully to text classification [5]. Some researchers have also tried combinations of several different classifiers [6-7]. In the late 1990s, ensemble learning entered the field and became one of its research hotspots: for example, Weiss et al. [8] used an ensemble of decision trees for text classification and applied it successfully to e-mail filtering, and Schapire et al. [9] used an ensemble of decision stumps in the text classification system BoosTexter, also with good results.

This paper proposes a text classification method based on an ensemble of term frequency classifiers. The term frequency classifiers used in the ensemble not only have a small computational cost but are also easy to update when text samples, or even categories, are added. Experimental results on the standard Reuters-21578 corpus show that the method achieves very good text classification performance.

1 Ensemble learning

Ensemble learning solves the same problem by training multiple versions of a base learner. Since an ensemble usually achieves stronger generalization than a single learner, research on ensemble learning was ranked by Dietterich as the first of the four major directions of current machine learning research [10]. Building an ensemble generally involves two steps: first train multiple versions of the base learner, i.e. obtain multiple individual learners, and then combine these individual learners. By the way the individual learners are generated, ensemble methods fall roughly into two families [11]. One family is represented by AdaBoost [12], in which the individual learners are generated sequentially and the results of one round influence the learners generated in later rounds; Arc-x4 [13] and MultiBoost [14] also belong to this family. The other family is represented by Bagging [15], in which the individual learners are generated in parallel and do not interfere with one another; Wagging [16], p-Bagging [16] and GASEN [17] also belong here. Ensemble learning has also been applied successfully to character recognition [18], face recognition [19], image analysis [20], medical diagnosis [21] and other domains.

This work mainly involves the AdaBoost algorithm [12], which takes a data set {(x_1, y_1), …, (x_n, y_n)} as training data, where x_i is an instance in the instance space X and y_i is the concept label of x_i, with Y defined on {-1, +1}. After a base learner is chosen, AdaBoost calls it repeatedly; in each round, AdaBoost adjusts the distribution over the training data, which can be realized by maintaining a set of weights. Initially the instance weights are equal; in each later round, the instances misclassified in the previous round receive larger weights, so that these harder instances get more attention. For algorithmic details see [12].

2 Term frequency classifiers and their ensemble

2.1 What is special about text classification

Text comes in many forms and in large volume. If each word is taken as a feature, classifying a large batch of texts is essentially a task of classifying high-dimensional feature vectors in a high-dimensional space. In open applications, new usable samples may arrive at any time, and these new samples may even belong to new categories; the text classification system is then required to exploit the new samples fully while keeping the cost of updating as low as possible. Since text classification is often performed online, the system must also behave well in real time.

2.2 Properties an ideal base classifier should have

Given the above characteristics, an ideal learning algorithm for text classification should combine high classification accuracy, low computational cost, high speed, and easy updating with new data. Usually, complex learning algorithms pay a large price in computational cost and speed to achieve high accuracy, whereas simple learning algorithms are cheaper and faster but less accurate; and updating on new data is a difficulty shared by most learning algorithms. Since ensemble learning can boost weak classifiers into strong classifiers with high generalization ability, the base classifier used in the ensemble no longer needs to be a complex classifier that is expensive to compute and hard to update; it can instead be a fast, efficient and easily updated base classifier that better fits the special requirements of text classification, and its own generalization ability need not be strong.

2.3 Term frequency classifiers

A base classifier can be defined from word occurrence. Concretely, a vocabulary V of all occurring words can be built over the corpus of all texts. Each text d_i is represented as a vector (u_i1, …, u_in), where u_ik indicates whether the k-th word t_k of V occurs in text d_i: u_ik = 1 if it occurs, and u_ik = 0 otherwise. A base classifier can then be defined as

    I_{t_k}(d_i) = 1 if u_ik = 1, and 0 if u_ik = 0.    (1)

This paper calls such a base classifier a term occurrence classifier (TOC). Note that a TOC only considers whether a word occurs in a text at all, so its value is 1 whether the word occurs once or many times. In practice, however, the frequency with which a word occurs in a text conveys meaning: generally, once the words on the stop list are excluded, the more frequently a word occurs in a text, the greater its influence on the classification of that text. A word together with its occurrence frequency can therefore be taken as a base classifier. If text d_i is represented as a vector (o_i1, …, o_im), where o_ia is the number of times the a-th word t_a of V occurs in d_i, the base classifier is defined as

    h_{t_a, f}(d_i) = 1 if o_ia >= f, and 0 if o_ia < f,    (2)

where f is a frequency with which the word t_a occurs in texts (a minimal sketch of such a classifier follows this preview). Over a corpus, the same word has different frequencies in different texts, so every word and each of its possible frequencies yields a corresponding version of the base classifier. If the corpus contains M words, the TOC has M possible versions; if the a-th word has l_a possible frequencies, the TFC has sum_{a=1..M} l_a possible versions. Clearly sum_{a=1..M} l_a >= M, i.e. there are many more base classifier versions for the ensemble to choose from, which is convenient for ensemble learning.

It is worth noting that the term frequency classifier (TFC) has small computational cost and high speed. For other types of classifiers, statistics about the documents and words in the corpus must first be collected to form the training data, and only then does the training phase begin; a term frequency classifier, by contrast, is already formed once the word frequencies and document information have been counted. Its computational cost is thus far smaller than that of other classifiers. Moreover, by the definition of the TFC, when the documents in the training set change, e.g. when new documents join the training set, only the terms and frequencies in the new documents need to be examined: for terms and frequencies already in the vocabulary, the counts of the corresponding categories are updated; terms and frequencies not yet in the vocabulary are added to it together with their category counts. This completes the update of the TFC. With other types of base classifiers, e.g. the decision trees used by Weiss et al. [8], any change to the training set requires retraining on all the data to produce a new decision tree, so even if only a few documents are added or changed, the cost of training the new tree is roughly the same as that of building the original tree on the original training set. In real-world text classification problems, training samples and even categories may keep growing during use, so the term frequency classifier has a great advantage as a base classifier for ensembles in text classification applications. The idea of the term occurrence classifier has also been used in earlier ensemble-based text classification methods: in the BoosTexter system of Schapire et al. [9], for example, the base learner assigns a real value to the current text according to whether a word occurs, which is then used for discrimination. That method is compared with the term frequency classifier later in this paper.

2.4 The improved AdaBoost algorithm

In the standard AdaBoost algorithm [12], the training error e_t of the weak classifier in round t is

    e_t = Pr_{i ~ D_t}[h_t(x_i) != y_i],    (3)

where D_t is the weight distribution over the training samples in round t, h_t is the current version of the base classifier, and x_i and y_i are an instance and its class. For two-class problems, the training error of the final classifier H produced by AdaBoost is at most prod_t [2 sqrt(e_t (1 - e_t))].
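A minimal sketch of the term frequency classifier of equation (2): a TFC fires exactly when its word reaches its frequency threshold. The tokenised document is invented, and the improved AdaBoost ensemble the paper builds around many such classifiers is not reproduced here.

```python
from collections import Counter

def make_tfc(term, f):
    """Term frequency classifier h_{t,f} from equation (2):
    outputs 1 iff `term` occurs at least f times in the document."""
    def h(tokens):
        return 1 if Counter(tokens)[term] >= f else 0
    return h

doc = "the cat sat on the mat with the hat".split()
print(make_tfc("the", 2)(doc), make_tfc("the", 4)(doc), make_tfc("dog", 1)(doc))
# -> 1 0 0: "the" occurs three times, so only the threshold-2 classifier fires
```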

Research on Visual Inspection Algorithms for Defects in Textured Objects (graduate thesis)

Abstract

In fiercely competitive, automated industrial production, machine vision plays a decisive role in product quality control, and its application to defect inspection has become increasingly common. Compared with conventional inspection techniques, automated visual inspection systems are more economical, faster, more efficient and safer. Textured objects are ubiquitous in industrial production: substrates used in semiconductor assembly and packaging, light-emitting diodes, printed circuit boards in modern electronic systems, and cloth and fabrics in the textile industry can all be regarded as objects with textured features. This thesis focuses on defect inspection techniques for textured objects, providing efficient and reliable inspection algorithms for their automated inspection.

Texture is an important feature for describing image content, and texture analysis has been applied successfully to texture segmentation and texture classification. This study proposes a defect inspection algorithm based on texture analysis and reference comparison. The algorithm tolerates the image registration errors caused by object distortion and is robust to the influence of texture. It aims to provide rich and physically meaningful descriptions of the detected defect regions, such as their size, shape, brightness contrast and spatial distribution. When a reference image is available, the algorithm can be used to inspect both homogeneously and non-homogeneously textured objects, and it also achieves good results on untextured objects.

Throughout the inspection process we adopt steerable-pyramid texture analysis and reconstruction. Unlike traditional wavelet texture analysis, we add to the wavelet domain a tolerance control algorithm for object distortion and texture influence, in order to tolerate object distortion and remain robust to texture. Finally, steerable-pyramid reconstruction guarantees that the physical meaning of the defect regions is recovered accurately. In the experimental stage we inspected a series of images of practical value; the results show that the proposed defect inspection algorithm for textured objects is efficient and easy to implement.

Keywords: defect detection, texture, object distortion, steerable pyramid, reconstruction

Text Classification using String Kernels

Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Chris Watkins
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
huma, john, nello, chrisw@
June, 2000 (received May 15, 2000)

Abstract

We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel with a standard word feature space kernel [4] is made, showing encouraging results.

1 Introduction

Standard learning systems (like neural networks or decision trees) operate on input data after they have been transformed into feature vectors from an n-dimensional space. There are cases, however, where the input data cannot be readily described by explicit feature vectors: for example biosequences, images, graphs and text documents. For such datasets, the construction of a feature extraction module can be as complex and expensive as solving the entire problem. An effective alternative to explicit feature extraction is provided by kernel methods.

Kernel-based learning methods use an implicit mapping of the input data into a high dimensional feature space defined by a kernel function, i.e. a function returning the inner product between the images of two data points in the feature space. The learning then takes place in the feature space, provided the learning algorithm can be entirely rewritten so that the data points only appear inside dot products with other data points. Several linear algorithms can be formulated in this way, for clustering, classification and regression. The most typical example of kernel-based systems is the Support Vector Machine (SVM) [8][1], which implements linear classification.

One interesting property of kernel-based systems is that, once a valid kernel function has been selected, one can practically work in spaces of any dimensionality without paying any computational cost, since the feature mapping is never effectively performed. In fact, one does not even need to know what features are being used.

In this paper we examine the use of a kernel method based on string alignment for text categorization problems. A standard approach [3] to text categorisation makes use of the so-called bag of words (BOW) representation, mapping a document to a bag (i.e. a set that counts repeated elements), hence losing all the word order information and only retaining the frequency of the terms in the document. This is usually accompanied by the removal of non-informative words (stop words) and by the replacing of words by their stems, so losing inflection information. This simple technique has recently been used very successfully in supervised learning tasks with Support Vector Machines (SVM) [3].

In this paper we propose a radically different approach, which considers documents simply as symbol sequences and makes use of specific kernels. The approach is entirely subsymbolic, in the sense that it considers the document just as a unique long sequence, and yet it is still capable of capturing topic information.
We build on recent advances [9, 2] that demonstrated how to build kernels over general structures like sequences. The most remarkable property of such methods is that they map documents to vectors without explicitly representing them, by means of sequence alignment techniques. A dynamic programming technique makes the computation of the kernels very efficient (linear in the documents' length).

It is surprising that such a radical strategy, extracting only alignment information, delivers positive results in topic classification, comparable with the performance of problem-specific strategies: it seems that in some sense the semantics of a document can be at least partly captured by the presence of certain substrings of symbols.

Support Vector Machines [1] are linear classifiers in a kernel defined feature space. The kernel is a function K(x, z) which returns the dot product ⟨φ(x), φ(z)⟩ of the feature vectors of two inputs x and z. Choosing very high dimensional feature spaces ensures that the required functionality can be obtained using linear classifiers. The computational difficulties of working in such feature spaces are avoided by using a dual representation of the linear functions in terms of the training set. The danger of overfitting by resorting to such a high dimensional space is averted by maximising the margin, or a related soft version of this criterion, a strategy that has been shown to ensure good generalisation despite the high dimensionality [6, 7].

2 A Kernel for Text Sequences

In this section we describe a kernel between two text documents. The idea is to compare them by means of the substrings they contain: the more substrings in common, the more similar they are. An important point is that such substrings do not need to be contiguous, and the degree of contiguity of one such substring in a document determines how much weight it will have in the comparison. For example: the substring 'c-a-r' is present both in the word 'card' and in the word 'custard', but with different weighting. For each such substring there is a dimension of the feature space, and the value of such a coordinate depends on how frequently and how compactly such a string is embedded in the text. In order to deal with non-contiguous substrings, it is necessary to introduce a decay factor λ that can be used to weight the presence of a certain feature in a text (see Definition 1 for more details).

Example. Consider the words cat, car, bat, bar. If we consider only k = 2, we obtain an 8-dimensional feature space, where the words are mapped as follows:

          c-a   c-t   a-t   b-a   b-t   c-r   a-r   b-r
    cat   λ²    λ³    λ²    0     0     0     0     0
    car   λ²    0     0     0     0     λ³    λ²    0
    bat   0     0     λ²    λ²    λ³    0     0     0
    bar   0     0     0     λ²    0     0     λ²    λ³

The unnormalized kernel between car and cat is K(car, cat) = λ⁴, whereas the normalized version is obtained as follows: K(car, car) = K(cat, cat) = 2λ⁴ + λ⁶, and hence K̂(car, cat) = λ⁴ / (2λ⁴ + λ⁶) = 1 / (2 + λ²).
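The following minimal Python sketch (written for this text, not the paper's algorithm) reproduces the example by computing the feature map of equation (2) in Definition 1 below by brute-force enumeration of index pairs; this enumeration is exponential in general and only feasible for toy strings.

```python
from itertools import combinations
from collections import defaultdict

def phi(s, k, lam):
    """Brute-force feature map: for every k-subsequence u = s[i],
    add lam ** l(i), where l(i) spans first to last matched index."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        feats[u] += lam ** (idx[-1] - idx[0] + 1)
    return feats

def kernel(s, t, k, lam):
    fs, ft = phi(s, k, lam), phi(t, k, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

lam = 0.5
k_ct = kernel("car", "cat", 2, lam)
k_cc = kernel("car", "car", 2, lam)
print(k_ct == lam ** 4)                               # True: K(car, cat) = λ⁴
print(abs(k_ct / k_cc - 1 / (2 + lam ** 2)) < 1e-12)  # normalised value 1/(2+λ²)
```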
However, for interesting substring sizes (e.g. k > 4) direct computation of all the relevant features would be impractical even for moderately sized texts, and hence explicit use of such a representation would be impossible. But it turns out that a kernel using such features can be defined and calculated in a very efficient way by using dynamic programming techniques.

We derive the kernel by starting from the features and working out their inner product. In this case there is no need to prove that it satisfies Mercer's conditions (symmetry and positive semi-definiteness), since they will follow automatically from its definition as an inner product. This kernel is based on work [9, 2] mostly motivated by bioinformatics applications. It maps strings to a feature vector indexed by all k-tuples of characters. A k-tuple will have a non-zero entry if it occurs as a subsequence anywhere (not necessarily contiguously) in the string. The weighting of the feature will be the sum over the occurrences of the k-tuple of a decaying factor of the length of the occurrence.

Definition 1 (String subsequence kernel) Let Σ be a finite alphabet. A string is a finite sequence of characters from Σ, including the empty sequence. For strings s, t, we denote by |s| the length of the string s = s₁…s_|s|, and by st the string obtained by concatenating the strings s and t. The string s(i : j) is the substring s_i…s_j of s. We say that u is a subsequence of s if there exist indices i = (i₁, …, i_|u|), with 1 ≤ i₁ < … < i_|u| ≤ |s|, such that u_j = s_{i_j} for j = 1, …, |u|, or u = s[i] for short. The length l(i) of the subsequence in s is i_|u| − i₁ + 1. We denote by Σⁿ the set of all finite strings of length n, and by Σ* the set of all strings:

    Σ* = ∪_{n=0..∞} Σⁿ.    (1)

We now define feature spaces Fₙ = ℝ^(Σⁿ). The feature mapping φ for a string s is given by defining the u coordinate φ_u(s) for each u ∈ Σⁿ. We define

    φ_u(s) = Σ_{i : u = s[i]} λ^{l(i)},    (2)

for some λ ≤ 1. These features measure the number of occurrences of subsequences in the string, weighting them according to their lengths. Hence, the inner product of the feature vectors for two strings s and t gives a sum over all common subsequences weighted according to their frequency of occurrence and lengths:

    K_n(s, t) = Σ_{u ∈ Σⁿ} φ_u(s) φ_u(t) = Σ_{u ∈ Σⁿ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^{l(i) + l(j)}.

In order to derive an effective procedure for computing such a kernel, we introduce an additional function which will aid in defining a recursive computation for this kernel. Let

    K'_i(s, t) = Σ_{u ∈ Σ^i} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^{|s| + |t| − i₁ − j₁ + 2},   i = 1, …, n − 1,

that is, counting the length to the end of the strings s and t instead of just l(i) and l(j). We can now define a recursive computation for K'_i and hence compute K_n.

Definition 2 (Recursive computation of the subsequence kernel)

    K'_0(s, t) = 1, for all s, t,
    K'_i(s, t) = 0, if min(|s|, |t|) < i,
    K_i(s, t) = 0, if min(|s|, |t|) < i,
    K'_i(sx, t) = λ K'_i(s, t) + Σ_{j : t_j = x} K'_{i−1}(s, t(1 : j − 1)) λ^{|t| − j + 2},   i = 1, …, n − 1,
    K_n(sx, t) = K_n(s, t) + Σ_{j : t_j = x} K'_{n−1}(s, t(1 : j − 1)) λ².

The correctness of this recursion follows from observing how the length of the strings has increased, incurring a factor of λ for each extra character, until the full length of n characters has been attained.
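Below is a direct Python transcription of Definition 2 (a sketch written for this text, not code from the paper), memoised on string prefixes; it runs in roughly O(n·|s|·|t|²) time, matching the analysis in section 3.

```python
from functools import lru_cache

def subsequence_kernel(s, t, n, lam):
    """Evaluate K_n(s, t) by the recursion of Definition 2."""

    @lru_cache(maxsize=None)
    def Kprime(i, s, t):                      # K'_i(s, t)
        if i == 0:
            return 1.0
        if min(len(s), len(t)) < i:
            return 0.0
        x, head = s[-1], s[:-1]               # split s as (head, last char x)
        total = lam * Kprime(i, head, t)      # λ K'_i(s, t) term
        for j, c in enumerate(t):             # sum over positions where t_j = x
            if c == x:                        # 1-based exponent |t| - (j+1) + 2
                total += Kprime(i - 1, head, t[:j]) * lam ** (len(t) - j + 1)
        return total

    @lru_cache(maxsize=None)
    def K(i, s, t):                           # K_i(s, t)
        if min(len(s), len(t)) < i:
            return 0.0
        x, head = s[-1], s[:-1]
        total = K(i, head, t)
        for j, c in enumerate(t):
            if c == x:
                total += Kprime(i - 1, head, t[:j]) * lam ** 2
        return total

    return K(n, s, t)

print(subsequence_kernel("car", "cat", 2, 0.5))  # 0.0625 = λ⁴, as in the example
```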
small part of the kernel matrix is actually used by our implementation of SVM.Special care in the implementation of the kernel described in Definition1can significantly speed-up its evaluation.As can be seen from the description of the recursion in Definition2,its computation takes time proportional to,as6the outermost recursion is over the sequence length and for each length and each additional character in and a sum over the sequence must be evaluated.The complexity of the computation can be reduced to,byfirst eval-uatingand observing that we can then evaluate with the recursion, Now observe thatprovided does not occur in,whileThese observations together give an recursion for computing. Hence,we can evaluate the overall kernel in time.4Experimental ResultsOur aim was to test the efficacy of this new approach to feature extraction for text categorization,and to compare with a state–of-the-art system such as the one used in[4].Expecially,we wanted to see how the performance is affected by the tunable parameter(we have used values5and5).As expected,using longer substrings in the comparison of two documents gives an improved performance.We used the same dataset as that reported in[4],namely the Reuters-21578[5]. We performed all of our experiments on a subset of four categories,‘earn’,‘acq’,‘crude’,and‘corn’.Wefirst made a comparison between our version of their approach for the training set sizes reported in that paper,in order to verify that we could reproduce their performance.The results we obtained are given in Table1 together with the breakeven points reported in[4]for the linear kernel applied to the features.They indicate a very close match between the two results and confirm that our program is giving virtually identical performance.7Given a test document to be classified in two classes(positive and negative),there are4possible outcomes:False Positive(FP)if the systems labels it as apositive while it is a negative;False Negative(FN)if the system labels it as a negative while it is a positive;True Positive(TP)and True Negative(TN)if thesystem correctly predicts the label.In the following we will use,,, to denote the number of true positives,true negatives,false positives and false negatives,respectively.Note that with this notation the number of positivepoints in the test set can be written as,the number of negative points as,and the test set size asA confusion matrix can be used to summarize the performance of the classi-fier:CorrectPredictedP N P TP FP N FN TNand thus a perfect predictor would have a diagonal confusion matrix.We now define:precision recallAnd we define the F1estimator as:F12precision recall precision+recallWe applied the two different kernels to a smaller dataset of380training ex-amples and90test examples.The only difference in the experiments was the kernel used.The splits of the data were identical with the sizes and numbers of positive examples in training and test sets given in Table1for the four categories considered.The initial experiments all used a sequence length of5for the string subse-quences kernel.We set.The results obtained are shown in Table2where the precision,recall and F1values are shown for both kernels.The results are much better in one category(‘corn’),similar for the‘acq’cate-gory and much worse for the categories‘earn’and‘crude’.They certainly indicate that the new kernel can outperform the more classical approach,but equally the performance is not reliably better.A further experiment was performed for one of8Precision Recall Comparison 
JoachimsF1Breakevenearn97.9897.980.980.982acq95.6190.960.9320.926crude85.9584.130.8500.86corn89.0987.500.8830.86#train#testout of370out of90earn15240acq11425crude7615corn3810Table1:F1and Class frequencies for the4categoriesPrecision Recall F1W-K5S-K W-K5S-K W-K5S-Kearn 1.00.3180.350.5250.5180.396acq0.750.440.120.1330.2070.204crude 1.00.1670.1330.1330.2350.148corn 1.00.5830.10.70.1820.636Table2:Precision,Recall and F1numbers for4categories for the two kernels: word kernel(W-K)and subsequences kernel(5S-K)9the categories on which the new kernel performed poorly.The subsequence length was increased to6for the most frequent category‘earn’.The results are presented in Table3.The increase in sequence length to6has made a significant improve-Precision Recall F1W-K6S-K W-K6S-K W-K6S-Kearn 1.00.7690.35 1.00.5180.870Table3:Precision,Recall and F1numbers for the‘earn’category for two kernels: word kernel(W-K)and subsequences kernel(6S-K)ment in the performance of the subsequences kernel,which now outperforms the word feature kernel.We are aware that the experimental results presented in this section cover too few runs for any definite conclusions to be drawn.They do,however,indicate that the new kernel is certainly worthy of further investigation.Repeated runs on random splits of the data will be performed to evaluate statistically the effects we are observing,both in the length of sequences and kernel types tested.5ConclusionsThe paper has presented a novel kernel for text analysis,and tested it on a cat-egorization task,which relies on evaluating an inner product in a very high di-mensional feature space.For a given sequence length(was used in the experiments reported)the features are indexed by all strings of length.Direct computation of all the relevant features would be impractical even for moderately sized texts.The paper has presented a dynamic programming style computation for computing the kernel directly from the input sequences without explicitly cal-culating the feature vectors.Further refinements of the algorithm have resulted in a practical alternative to the more standard word feature based kernel used in previous SVM applications to text classification[4].We have presented an experimental comparison of the word feature kernel with our subsequences kernel on a benchmark dataset with encouraging results.The results reported here are very preliminary and many questions remain to be resolved.First more extensive experiments1are required 1Thefinal version of the paper will include more experiments.10to gain a more reliable picture of the performance of the new kernel,including the effect of varying the subsequence length and the parameter.The evaluation of the new kernel is still relatively time consuming and more research is needed to investigate ways of expediting this phase of the computation.The results also suggest that the choice of one value for the sequence length may be too restrictive and a kernel composed of a weighted sum of the sequence kernels for several different lengths would be more robust.This would not add significantly to the computation since thefinal stage of computing from is relatively cheap.References[1]N.Cristianini and J.Shawe-Taylor.An Introduction to Support V ector Ma-chines.Cambridge University Press,2000.[2]D.Haussler.Convolution kernels on discrete structures.Technical ReportUCSC-CRL-99-10,University of California in Santa Cruz,Computer Science Department,July1999.[3]T.Joachims.Text categorization with support vector machines:Learningwith many relevant 
features.Technical Report23,LS VIII,University of Dortmund,1997.[4]T.Joachims.Text categorization with support vector machines.In Proceed-ings of European Conference on Machine Learning(ECML),1998.[5]David Lewis.Reuters-21578collection.Technical re-port,Available at:/˜lewis/reuters21578.html,1987.[6]J.Shawe-Taylor,P.L.Bartlett,R.C.Williamson,and M.Anthony.Structuralrisk minimization over data-dependent hierarchies.IEEE Transactions on Information Theory,44(5):1926–1940,1998.[7]J.Shawe-Taylor and N.Cristianini.Margin distribution and soft margin.InA.J.Smola,P.Bartlett,B.Sch¨o lkopf,andC.Schuurmans,editors,Advancesin Large Margin Classifiers.MIT Press,1999.[8]V.Vapnik.The Nature of Statistical Learning Theory.Springer Verlag,1995.11[9]C.Watkins.Dynamic alignment kernels.Technical Report CSD-TR-98-11,Royal Holloway,University of London,Computer Science department,Jan-uary1999.12。
