生物信息文献

合集下载

生物医学文献关键信息抽取

生物医学文献关键信息抽取生物医学文献关键信息抽取生物医学文献关键信息抽取是一种重要的技术，用于从大量的生物医学文献中提取出关键信息。

这项技术在生物医学研究领域具有广泛的应用，可以帮助研究人员快速获得所需的信息，加快研究进展。

生物医学文献涵盖了大量的研究成果和知识，但由于信息量庞大，研究人员往往需要花费大量的时间和精力来筛选和提取有用的信息。

而生物医学文献关键信息抽取技术的出现，为处理这个问题提供了一个有效的解决方案。

生物医学文献关键信息抽取的过程可以分为以下几个步骤。

首先，需要对文献进行预处理，包括文本清洗和分词等操作，以便后续的处理。

然后，通过使用自然语言处理和机器学习等技术，将文本中的关键信息进行识别和提取。

这些关键信息可以是疾病名称、基因表达、药物剂量等。

最后，将提取出的关键信息整理和存储，以便进一步的分析和应用。

生物医学文献关键信息抽取的技术主要依赖于自然语言处理和机器学习的方法。

自然语言处理技术可以帮助将文本转化为计算机可以理解和处理的形式，例如将文本进行分词、词性标注和句法分析等操作。

而机器学习技术则可以通过训练模型，自动学习和识别文本中的关键信息。

生物医学文献关键信息抽取技术的应用非常广泛。

一方面，它可以帮助研究人员高效地获取所需的信息，提高研究效率。

另一方面，它也可以用于构建生物医学知识库和数据集，为生物医学研究提供更丰富的资源。

此外，生物医学文献关键信息抽取技术还可以应用于药物研发、临床决策支持等领域，为医学科学的发展做出贡献。

尽管生物医学文献关键信息抽取技术已经取得了一定的进展，但仍然存在一些挑战和问题。

例如，生物医学文献中的文本结构复杂多样，存在大量的领域专有名词和术语，这对于关键信息的准确提取提出了挑战。

此外，由于医学知识的快速更新和演进，需要不断更新和改进抽取模型，以适应新的研究进展。

综上所述，生物医学文献关键信息抽取技术是一项重要的技术，可以帮助研究人员快速获得所需的信息，推动生物医学研究的进展。

生物信息技术论文

生物信息技术论文二十一世纪是生命科学高速发展的时代，生物信息技术对人类的影响之大将不可预料。

下面是小编精心推荐的生物信息技术论文，希望你能有所感触!生物信息技术论文篇一信息技术改变生物教学摘要：随着新课改的不断深入，信息技术与学科课程的整合是当前基础教育改革的一个新视点。

在生物学科的教学中，新教材教学难度增加了，对教师的要求也更高了。

生物教学课本中涉及的图、文、形、像很多，这要求学生在学习过程中发挥主观能动性，去看、去听、去想。

信息技术可以化静为动，化抽象为直观，吸引学生注意，降低理解难度。

信息技术与生物教学整合，可以创新教学模式、增大课堂容量、突出重点、解决难点，可以增强学生学习兴趣，提高教学效果，优化教学过程，培养学生能力。

本文就信息技术与生物课程整合的本质、方法和意义等做了一定的阐述。

关键词：信息技术生物教学课程改革二十一世纪是生命科学高速发展的时代，生命科学对人类的影响之大将不可预料。

生物学是生命科学的基础课程，生物老师在这次教育教学改革中应该积极探索，大胆尝试。

教师在教学中，必须深入研究和恰当地设计、开发、运用信息，从努力实践到积极创新，开发制作适用于课堂教学的优质教育资源，优化课堂教学，力求最大限度地提高教学效率，学生能够应用现代信息技术更好地掌握生物学知识，获取更多的生物学信息。

今天的教师，不能满足于一支粉笔、一张利口，博闻强记、引经据典的传统教学，而应不断努力、不断探索、不断尝试将生物课堂教学与信息技术达到有效整合。

所谓整合就是根据学科教学需要，充分发挥计算机的工具性功能，使计算机溶入学科教学中，从而提高教学质量，促进教学改革，培养具有创造能力和创新精神的中学生。

整合并非是计算机与生物学科的简单结合，也并不能够解决生物教学中的所有问题，而是从实际出发，寻找最佳结合点，突出教学重点，解决难点，探索规律，启发思维，从而提高生物学科的教育教学质量。

但是现在的一些老师和学生对于信息技术与生物学科教学的整合认识存在着很多误区：有的认为直接照搬网络上下载的课件上课就是整合课了;有的认为课堂上只要用了多种电教媒体就是整合课;有的认为在机房上课，网络环境下上课，就是整合课。

我国生物信息学领域文献计量学研究

一
２２论文的地区分布．ｌ９４年一２０９年间，国有３９０全０个省、市、自治区发表了生物信息学论文，地区但差异悬殊。我国生物信息学研究论文的地区分布极不平衡，京以１３篇的发文量遥遥北ｌ１领先十其它地区，占总论文量的２％；１其次是上海５１篇，１湖南２５篇，东２ｌ篇一ｆ述９广５：居下前四位的地区发文量占总量的４％，Ｊ１ｎ‘ 以视为我国生物信息学研究的核心地区。其它发文量居１前ｌ二２位内的地区包括：＝、重汀苏庆、浙汀、陕西、Ｉ东、天津、辽宁和四川，１１这８个地区总的发文量是ｌ８４２篇，占总量的２％，８是我国生物信息学研究的有机力量。但是，贾州、青海、汀西、宁夏、海南、河北和广西七省的累计发文量才１８箱，占总量的５只约３悦明这些地区的生物信息学研究较为％，薄弱，待加强。值得注意的是，亟有美国（６篇）、英围（１篇）、付兰（篇）１、德国（篇）１和新加坡（篇）１的作者在我国的棚关期刊 ቤተ መጻሕፍቲ ባይዱ 发表生物信息学论文，体现了科研领域国际间的交流与合作，同时也可以看出我国的相关期刊努力吸引海外优秀论文，寻求在国际范围内的学术
１概述１１文献计量学研究概况．
文献是科学知识的载体，是科研成果的主要表现形式。文献的产生、分布、增长、老化和引证都与科学的交叉渗透和发展进化息息相关。文献信息源的定量研究开始十２０世纪初，渐形成了以布拉德福（ｒｄｏｄ定律、逐Ｂａｆｒ）齐普夫（ｉｆＺｐ）定律、洛特卡（ｏｋ）、文献Ｌｔａ定律增长规律、文献老化规律、文献引用规律人大规律为主体的、比较完整、系统的文献计量学体系，在后来的研究中得到不断的发展和完并善。用文献计量学方法研究科学活动特征ｊ科学发展规律，是文献计量学应用研究的重要内容。

生物信息文献总结范文

摘要：随着生物技术的飞速发展，生物信息学作为一门新兴的交叉学科，在疾病研究中的应用越来越广泛。

本文对生物信息学在疾病研究中的应用进行了综述，并分析了近年来生物信息学在疾病研究中的最新进展。

一、引言生物信息学是生物学、计算机科学和数学相互交叉的学科，利用计算机技术对生物数据进行处理、分析和解释。

在疾病研究中，生物信息学通过对大量生物数据的挖掘和分析，为疾病的发生、发展和治疗提供了新的思路和方法。

二、生物信息学在疾病研究中的应用1. 基因组学研究基因组学是研究生物体基因组的结构和功能的一门学科。

生物信息学在基因组学中的应用主要体现在以下几个方面：（1）基因注释：通过对基因组序列进行注释，确定基因的功能、位置和表达水平。

（2）基因发现：通过生物信息学方法，从基因组数据中识别新的基因和基因家族。

（3）基因变异分析：分析基因变异与疾病之间的关系，为疾病诊断和治疗提供依据。

2. 蛋白质组学研究蛋白质组学是研究生物体蛋白质组成和功能的一门学科。

生物信息学在蛋白质组学中的应用主要体现在以下几个方面：（1）蛋白质序列分析：通过生物信息学方法，分析蛋白质序列的结构、功能和进化关系。

（2）蛋白质相互作用网络分析：构建蛋白质相互作用网络，揭示蛋白质之间的相互作用关系。

（3）蛋白质功能预测：通过生物信息学方法，预测蛋白质的功能和调控机制。

3. 转录组学研究转录组学是研究生物体基因表达水平的一门学科。

生物信息学在转录组学中的应用主要体现在以下几个方面：（1）基因表达数据分析：通过生物信息学方法，分析基因表达数据，识别差异表达基因。

（2）基因调控网络分析：构建基因调控网络，揭示基因之间的调控关系。

（3）生物标记物发现：通过生物信息学方法，发现与疾病相关的生物标记物。

三、生物信息学在疾病研究中的最新进展1. 大数据分析随着生物技术的快速发展，生物数据量急剧增加。

大数据分析技术在生物信息学中的应用，使得研究人员能够从海量数据中挖掘有价值的信息。

中国生物医学文献数据库

中国生物医学文献数据库中国生物医学文献数据库旨在收录国内外生物医学领域的文献资料，提供便捷的文献检索和信息服务，为科学研究和医疗保健提供有力的支持。

一、背景介绍生物医学是一门涉及医学、生物学和化学等多个学科的科学，它以生物学为基础，通过研究人类疾病的基本机制和生理过程，以及药理学、毒理学的应用，最终实现人体健康的目标。

生物医学领域的研究范围非常广泛，包括但不限于癌症、心脏病、肝脏病、神经退行性疾病等各种疾病的发生机制、预防、治疗和康复等方面。

在当今医学领域，研究人员需要了解最新的科学发展、新的诊断和治疗方法、药物、分子筛选方法、等等。

同时，不同的国家/地区之间的生物医学研究领域的不同也非常大。

因此，高质量、及时地获取并筛选出与自己研究领域相关的文献资料变得非常重要。

在这样的背景下，中国生物医学文献数据库应运而生。

二、数据库内容和特点中国生物医学文献数据库是由中国知网（CNKI）联合多个医学院校、医疗机构和研究机构共同开发和维护的。

自建库以来，便致力于构建一个丰富、全面、高效且更加便捷的文献检索平台，争取成为全球生物医学领域最具代表性、权威性的数据库之一。

1. 数据库内容中国生物医学文献数据库涵盖了生物医学领域包括基础医学、临床医学、药学、医学检验、生物医学工程、生物医药等多个领域的期刊、学位论文、会议论文等文献类型。

现在，该库中含有200万余篇文献资料，而每年新增的文献数量也在不断增长。

2. 特点（1）涵盖全面中国生物医学文献数据库收录了国内外生物医学相关的文献资料，对于涉及多个学科的论文，也会给出相关研究领域的分类。

因此，网站上面的文献检索可以很直观地找到和您研究领域相关的文献资料。

（2）检索方便该数据库通过提供多种检索方式（如按主题、作者、机构、关键词、文献类型等）和过滤工具（如通用、期刊、学位论文等）来满足用户不同需求。

同时，它还提供了摘要、关键词、作者、机构、文献类型等多种信息检索，帮助用户快速定位需要的文献资料。

生物信息学应用论文3200字_生物信息学应用毕业论文范文模板

生物信息学应用论文3200字_生物信息学应用毕业论文范文模板生物信息学应用论文3200字(一)：应用生物信息学方法筛选食管鳞癌的关键基因论文[摘要]目的筛选食管鳞癌的关键基因，为肿瘤的发病机制研究提供新的思路。

方法检索GEO数据库中食管鳞癌基因表达芯片，分析差异表达基因并获得共同差异基因；利用在线数据库DAVID进行GO和KEGG通路富集分析；通过String数据库和Cytoscape软件分析获取链接度最高的10个关键基因，并在TCGA数据库中验证。

结果共筛选出204个差异表达基因。

GO分析显示其生物学过程富集在细胞分裂、细胞器断裂和细胞周期等163个条目中；细胞学组分富集在细胞外、细胞质和细胞器腔内等48个条目中；分子功能富集在调控肽酶活性、与细胞外基质结合等46个条目中。

KEGG通路富集在局部黏附、p53信号通路、错配修复等12个条目中。

筛选出10个链接度最高的Hub基因，且通过TCGA数据库验证其全部在食管鳞癌组织中高表达（P<0.01）。

结论CDK1、CCNA2、RFC4、CCNB1、TOP2A、AURKA、CDC6、BUB1、BUB1B、PLK1是食管鳞癌的关键基因，可能是食管鳞癌的生物标志和治疗靶点。

[关键词]食管鳞癌；关键基因；生物信息学；基因芯片根據WHO统计，全世界每年约有40万人死于食管癌，其中我国约20万人，占世界的一半[1]。

食管癌主要有两个亚型——食管鳞癌和腺癌，我国食管癌患者主要为鳞癌。

目前食管癌的发生发展及转移机制尚不清楚，因此进一步研究其发病机制，建立有效的预防和诊疗方法，是迫切需要解决的问题。

本研究通过分析GEO数据库[2]中食管鳞癌的相关芯片数据，旨在挖掘食管鳞癌的关键基因，利用生物信息学方法探讨其可能的发病机制，为进一步的基础与临床研究提供方向。

1资料与方法1.1一般资料资料来源GEO在线数据库，下载食管鳞癌全基因组表达谱芯片数据集。

入选条件：①全基因组RNA表达谱芯片；②人食管鳞癌组织与配对的癌旁正常组织。

生物文献总结范文

摘要：本文对近年来生物领域的研究进展进行了总结，重点分析了基因编辑技术、细胞治疗、生物信息学以及生物制药等方面的突破性成果，旨在为读者提供生物领域的研究动态和未来发展趋势的参考。

一、引言生物科学作为一门重要的自然科学，近年来取得了显著的进展。

随着生物技术的快速发展，人类对生命现象和疾病机理的认识不断深入。

本文将对近年来生物领域的研究进展进行总结，以期为读者提供有益的参考。

二、基因编辑技术1. CRISPR-Cas9技术：CRISPR-Cas9技术作为一种新型基因编辑工具，具有高效、简单、低成本等优点。

该技术已成功应用于基因治疗、作物改良等领域。

2. TALENs技术：TALENs技术是一种基于转录激活因子样效应器核酸酶的基因编辑技术，与CRISPR-Cas9技术类似，具有高度的灵活性和精确性。

三、细胞治疗1. 干细胞治疗：干细胞具有自我更新和分化为多种细胞类型的能力，在治疗多种疾病中具有广阔的应用前景。

近年来，干细胞治疗在血液系统疾病、神经退行性疾病等方面的研究取得了显著成果。

2. CAR-T细胞治疗：CAR-T细胞治疗是一种利用T细胞对肿瘤细胞进行识别和杀伤的治疗方法。

近年来，CAR-T细胞治疗在血液肿瘤治疗中取得了显著疗效。

四、生物信息学1. 生物大数据分析：随着高通量测序技术的快速发展，生物大数据分析成为生物信息学的重要研究方向。

通过对海量生物数据的分析，有助于揭示生命现象和疾病机理。

2. 蛋白质组学：蛋白质组学是研究生物体内所有蛋白质的组成、结构和功能的研究领域。

近年来，蛋白质组学在疾病诊断、药物研发等方面取得了显著进展。

五、生物制药1. 生物仿制药：生物仿制药是指与已批准的生物制品具有相同的安全性和疗效的药物。

近年来，生物仿制药的研发和应用逐渐成为生物制药领域的研究热点。

2. 抗体药物：抗体药物是一种针对特定靶点的生物药物，具有高度特异性和靶向性。

近年来，抗体药物在肿瘤、自身免疫性疾病等领域的治疗中取得了显著成果。

pubmed数据库入口ncbi

PubMed数据库入口：NCBI简介PubMed 是一种由美国国家图书馆（National Library of Medicine，NLM）所提供的生物医学文献数据库，是全球最大的生物医学文献存储库之一。

作为生物医学研究领域的重要工具，PubMed 提供了大量的生物医学文献信息，覆盖了基础医学、临床医学、生物技术等多个研究领域。

PubMed 的数据库入口是由美国国家生物技术信息中心（National Center for Biotechnology Information，NCBI）负责维护的。

NCBI 是一个提供了一系列生物技术信息资源和工具的综合性生物技术数据库，帮助研究人员获取和分析生物技术相关的数据。

在本文中，我们将重点介绍 PubMed 数据库的入口，即 NCBI 网站的功能和特点。

NCBI 网站NCBI 网站（https:/// ）是一个免费提供生物医学信息服务的网站，具有丰富全面的生物医学数据库资源。

在 NCBI 网站上，用户可以访问PubMed 数据库以及其他诸如 GenBank、BLAST、PubChem 等数据库。

NCBI 网站提供了直观友好的用户界面，让用户可以轻松访问并获取所需的生物医学文献信息。

下面我们将介绍 PubMed 数据库的一些主要功能和特点。

PubMed 数据库功能文献搜索作为一个重要的生物医学文献数据库，PubMed 提供了强大的搜索功能，帮助用户快速准确地检索感兴趣的文献。

用户可以根据关键词、作者、期刊等信息进行搜索，也可以通过高级搜索来进一步精确检索。

文献浏览用户搜索到感兴趣的文献后，可以通过 PubMed 的文献浏览功能查看摘要和全文。

摘要提供了文章的主要内容和结论，方便用户了解文献的核心内容。

全文则提供了更详细的信息，包括方法、结果和讨论等。

文献收藏用户在浏览文献时，可以将感兴趣的文献添加到自己的收藏夹中。

这样，用户可以方便地管理和访问自己收藏的文献，随时查阅。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

A Novel Method for Protein Secondary Structure Prediction Using Dual-Layer SVM and ProﬁlesJian Guo,1,2†Hu Chen,1†Zhirong Sun1*,and Yuanlie Lin21Institute of Bioinformatics,State Key Laboratory of Biomembrane and Membrane Biotechnology,Department of Biological Sciences and Biotechnology,Tsinghua University,Beijing,China2Department of Mathematical Sciences,Tsinghua University,Beijing,ChinaABSTRACT A high-performance method was developed for protein secondary structure predic-tion based on the dual-layer support vector machine (SVM)and position-speciﬁc scoring matrices (PSSMs).SVM is a new machine learning technology that has been successfully applied in solving prob-lems in theﬁeld of bioinformatics.The SVM’s perfor-mance is usually better than that of traditional machine learning approaches.The performance was further improved by combining PSSM proﬁles with the SVM analysis.The PSSMs were generated from PSI-BLAST proﬁles,which contain important evolu-tion information.Theﬁnal prediction results were generated from the second SVM layer output.On the CB513data set,the three-state overall per-residue accuracy,Q3,reached75.2%,while segment overlap (SOV)accuracy increased to80.0%.On the CB396 data set,the Q3of our method reached74.0%and the SOV reached78.1%.A web server utilizing the method has been constructed and is available at /pmsvm.Proteins 2004;54:738–743.©2004Wiley-Liss,Inc.Key words:protein structure prediction;protein secondary structure;support vector ma-chine;position-speciﬁc scoring matri-ces;PSI-BLASTINTRODUCTIONA large number of genome sequences have been pro-duced in high-throughput experiments.The next step is to analyze these genome and protein sequences toﬁnd new gene functions.1The prediction of protein structure and function from amino acid sequences is one of the most important problems in molecular biology.This problem is becoming more pressing as the number of known protein sequences is explored as a result of genome and other sequencing projects,and the protein sequence–structure gap is widening rapidly.2,3Therefore,computational tools to predict protein structures are badly needed to narrow the widening gap.Although the prediction of three-dimensional(3D)protein structures is the ultimate goal, the structure still cannot be accurately predicted directly from sequences.An intermediate but useful step is to predict the protein secondary structure,which provides some knowledge and simpliﬁes the complicated3D struc-ture prediction problem.The fundamental elements of the secondary structure ofproteins are␣-helices,␤-sheets,coils,and turns.Some methods have been developed for deﬁning various protein secondary structure elements from the atomic coordinates in the Protein Data Bank(PDB),such as DSSP,4STRIDE,5 and DEFINE.6According to DSSP,8types of protein secondary structure elements were classiﬁed and denoted by letters:H(␣-helix),E(extended␤-strand),G(310helix), I(␲-helix),B(isolated␤-strand),T(turn),S(bend)and“_”(coil).The8classes are usually reduced to three states, helix(H),sheet(E),and coil(C)by different reduction methods.7Thus,the secondary structure prediction can be analyzed as a typical three-state pattern recognition or classiﬁcation problem,where the secondary structure class of a given amino acid residue in a protein is predicted based on its sequence features.Since the1970s,many methods have been developed for predicting protein secondary structures.Early works usu-ally relied on the single-residue statistics in various second-ary structural elements,for example,the Chou–Fasman method8and the Garnier–Osguthorpe–Robson(GOR I) method.9Nearly20years later,a signiﬁcant improvement was made in the PHD method,10which is a three-level neural network including some machine learning tech-niques.After the PHD method,many further neural networks and machine learning reﬁnements were devel-oped.11–13Several machine learning approaches have suc-cessfully predicted protein secondary structures,and pre-diction accuracies were further improved.In2001,Hua and Sun14introduced a new method,support vector ma-chine(SVM),which is based on statistical learning theory (SLT).The SVM method achieved good segment overlap accuracy,SOVϭ76.2%,and good three-state overallper-residue accuracy,Q3ϭ73.5%.14Here,we describe an improved dual-layer SVM com-bined with a position-speciﬁc scoring matrix(PSSM)gener-†These two authors contributed equally in this work.Grant sponsor:Fondational Science Research Grant of Tsinghua University(JC2001043);863Projects(2002AA234041);973Project (2003CB715903);NSFC(90303017).*Correspondence to:Zhirong Sun,Institute of Bioinformatics,State Key Laboratory of Biomembrane and Membrane Biotechnology,De-partment of Biological Sciences and Biotechnology,Tsinghua Univer-sity,Beijing10084,China.E-mail:sunzhr@Received18June2003;Accepted3September2003PROTEINS:Structure,Function,and Bioinformatics54:738–743(2004)©2004WILEY-LISS,INC.ated from PSI-BLAST.The combined method,which isreferred to as PMSVM,provides a good SOV of80.0%and Q3of76.2%,which is nearly3%higher than the simpleSVM method’s SOV and Q3.14The method is also com-pared with existing prediction methods.The results show that our method more effectively predicts secondary struc-tures.MATERIALS AND METHODSData SetTwo data sets are frequently used in protein secondary structure predictions to test algorithms.One is the RS126 data set,which include126protein chains and was developed by Rost and Sander.10The other data set,which is called CB513,is much larger.It was constructed by Cuff and Barton,7and contains513protein chains.Almost all sequences in the RS126set are included in the CB513set. Both are nonhomologous,but the homology measurement of CB513is more strict than in the RS126set.Removal of protein chains contained in both the RS126set and the CB513set gives another data set,which include396 protein sequences and is named the CB396set.RS126was mostly used to develop early prediction methods with CB513set and CB396,now widely used.The CB513and CB396sets were used to compare the present algorithm with other prediction method.The Deﬁnition of Protein Secondary StructureThe automatic assignments of secondary structure to experimentally determined3D structures are usually per-formed using DSSP,4STRIDE,5and DEFINE.6This work exclusively used the DSSP assignments,which distinguish the secondary structure into8categories:H(␣-helix),G(310helix),I(␲-helix),E(extended␤-strand),B(isolated␤-strand),T(turn),S(bend),and coil(“_”).The8structure classes were reduced into3classes.There are four main methods to perform the reduction process.(1)DSSP:H,G to H;E,B to E;all other states to C;(2)DSSP:H to H;E to E;all other states to C;(3)DSSP:H,G,I to H;E to E;all other states to C;and(4)DSSP:H,G to H;E to E;all other states to C.In this article,deﬁnition(1)was adopted,because it is considered to be the strictest deﬁnition,which usually results in lower prediction accuracy than other deﬁnitions. PSI-BLAST ProﬁlesThis work used multiple-sequence alignment proﬁles generated from the PSI-BLAST15program for each protein chain in the CB513and CB396sets.First,we obtained a database,which contained all known databases:all nonre-dundant GenBank translations,PDB,SwissPort,PIR databank,and PRF databank.Then the low-complexity regions,transmembrane regions,and coiled-coil segments were removed from the database.A program named pﬁlt was used to remove these regions.16Then,encoded BLAST data bankﬁles were generated fromﬁltered FASTAﬁles. Finally,the PSI-BLAST program was used to query each protein in the CB513and CB396sets against theﬁltered NR database to generate PSSM proﬁles.These proﬁles were scaled to the required0–1range using the standard logistic functionf͑x͒ϭ11ϩexp͑Ϫx͒,where x is the raw proﬁle matrix value.These proﬁles were then used as the input information to theﬁrst-layer SVM. Support Vector MachineThe SVM is a new machine learning method that developed rapidly and has been widely used in many kinds of pattern recognition problems.The basic method of SVM is to transform the samples into a high-dimension Hilbert space and to seek a separating hyperplane in this space. The separating hyperplane,which is called the optimal separating hyperplane(OSH),is chosen in such a way as to maximize its distance from the closest training samples. As a supervised machine learning technology,SVM is well-founded theoretically on statistical learning theory. SVM has been successfully applied to manyﬁelds of pattern recognition,including object recognition,17speaker identiﬁcation,18and text categorization.19The SVM usu-ally outperforms other machine learning technologies, including Neural Networks and K-Nearest Neighbor clas-siﬁers.In recent years,the SVM has been used in bioinfor-matics,including gene expression proﬁle classiﬁcation, detection of remote protein homologies and recognition of translation initiation sites.Hua and Sun14used a single-layer SVM to analyze protein secondary structure with excellent prediction results(in this article,this method is called the simple SVM).More details about SVM can be found in Vapnik’s publications.20,21Here,we describe a dual-layer SVM system used to predict secondary structure.The dual-layer SVM system combined with the PSI-BLAST proﬁles provides more accurate prediction than Hua and Sun’s14simple SVM prediction system.Coding SchemeAs with Hua and Sun’s work,14the present analysis used the classical local coding scheme of the protein sequences with a sliding window.PSI-BLAST matrix with n rows and20columns can be deﬁned for single sequence with n residues.For theﬁrst layer in the prediction system,each residue is coded as a21-dimensional vector, where theﬁrst20elements of the vector are the correspond-ing elements in PSI-BLAST matrix.For the second layer, the vector corresponding to a residue has4elements, where theﬁrst3elements represent the3secondary structures(H,E,C).The last unit was added in order to allow a window to extend over the N-and the C-terminus. If the window length is l,the dimension of the feature vector is21*l for theﬁrst layer and4*l for the second layer. Prediction System StructureA dual-layer SVM structure was used in the prediction system(see Fig.1).Theﬁrst layer is an SVM classiﬁer that classiﬁes each residue of each sequence into the3second-ary structure classes(H,E,or C).The one-against-restSTRUCTURE PREDICTION USING SVM AND PROFILES739Fig.1.The dual-layer architecture of the PMSVM system.The system include three parts:the PSI-BLAST proﬁle,the ﬁrst layer,and the second layer.The proﬁle is tranformed into a number of 21*15demension vectors using the slide-window method.These vectors are input into the ﬁrst-layer SVM.The outputs of the ﬁrst-layer SVM are a number of 3D vectors representing the probability that the residue belongs to that ing the slide-window method,the outputs of the ﬁrst-layer SVM are tranformed into a number of 4*13dimensional vector,which are used as the inputs of the second-layer SVM.The ﬁnal decisions are based on the outputs of the second-layer SVM.740J.GUO ET AL.strategy was used for the multiclass classiﬁcation,so there were three outputs for each residue.The outputs represent the probability that the residue belongs to that class.Since the consecutive patterns are correlated (e.g.,a helix con-tains at least 4consecutive patterns,and a sheet contains at least 3consecutive patterns),the second-layer SVM classiﬁer ﬁltered successive outputs from the ﬁrst layer.The target outputs of the second layer were the same as the ﬁrst layer.As with the ﬁrst-layer SVM,the second layer also uses the one-against-rest strategy,with each residue classiﬁed into the class with the largest output value.Training and TestingSeven-fold cross-validation was used on the CB396and CB513data sets to test the method’s efﬁciency.The whole data set was randomly divided into 7subsets of equal size.In each validation,one subset was used for testing while the rest was used for training.Several parameters were regulated to optimize the training.This analysis used the radial basis function (RBF)kernel in both the ﬁrst-and the second-layer SVM,where ␥is a parameter to be deter-mined.The analysis used the soft-margin SVM,so the regularization parameter C also needed to be regulated.␥1and C 1were deﬁned as the gamma parameter and the regularization parameter in the ﬁrst-layer SVM,while ␥2and C 2were deﬁned as the gamma parameter and the regularization parameter in the second-layer SVM.For the CB513data set,␥1ϭ0.05,C 1ϭ2.3,␥2ϭ2.5,and C 2ϭ2.0;for the CB396data set,␥1ϭ0.05,C 1ϭ2.0,␥2ϭ2.4,and C 2ϭ2.5.K ͑x i ,x j ͒ϭexp(Ϫ␥͑x i Ϫx j ͒2)(1)Reliability IndexThe prediction reliability index (RI)was used to assess the effectiveness of the approaches for the prediction of the secondary structure of a new sequence.The RI offers an excellent tool for focusing on key regions having high prediction accuracy.There are different deﬁnitions of the RI.Here,we used a deﬁnition similar to that proposed by Rost and Sander 10:RI ϭINTEGER [(maximal_output(I)Ϫsecond_largest_output(I)]/0.5).If the value of RI Ͼ9,then set RI ϭ9,so the value of RI is an integer between 0and 9.The distribution of the prediction accuracy with different RIs is illustrated in Figure 2.The prediction accuracy of residues with higher RI values is much better than those with lower RI values.Therefore,the deﬁnition of RI reﬂects the prediction reliability.RESULTS AND DISCUSSIONSeveral standard performance measures were used to assess prediction accuracy.The three-state overall per-residue accuracy (Q 3),the Matthew’s correlation coefﬁ-cients (C H ,C E ,C C ),and the SOV were used to evaluate the accuracy.10,22,23The per-residue accuracies for each typeof secondary structure (Q H ,Q E ,Q C ,Q H pre ,Q E pre ,Q C pre)were also calculated.The PMSVM method was comparedwithFig.2.The Q 3distribution on different Reliability indices (from 0to 9).STRUCTURE PREDICTION USING SVM AND PROFILES741Hua and Sun’s simple SVM method and the famous PHDmethod.The results from the PMSVM method are very good.On the CB513set,the SOV was80.0%,nearly4% higher than that of the simple SVM method(76.2%).The three-state per-residue accuracy Q3was75.2%,which is nearly2%higher than the simple SVM method(73.5%) and3%higher than the PHD method.The results obtained on the CB396set was slightly lower than the results on the CB513.Cuff and Barton7also found that many other methods have slightly lower accuracies with the CB396 set.More comparisons with other methods are shown in Table I.The prediction accuracies using only theﬁrst-layer SVMhave been computed.Although the value of Q3was nearly the same as with the dual-layer prediction method,the SOV was about2%lower.The results reﬂect the fact that the second layerﬁlters some noise from theﬁrst layer and improves the accuracies.A web prediction system was developed using the PMSVM method and is available at http://www.bioinfo. /pmsvm.This webpage was tested with several new protein sequences in the PDB with good results.The webserver was also used to predict some secondary structures of severe acute respiratory syndrome (SARS)proteins,which we hope will provide useful infor-mation to experimental biologists.Further improvements of the prediction method will be made in future work.SVM is one of the best available machine learning methods,but it is still a passive learning method.In recent years,boosting methods and active learning methods have developed rapidly.Boost-ing is a general method for improving the accuracy of any given learning algorithm.The active learning method actively selects a subset of samples and trains the classiﬁcation system on the subset to achieve more accurate prediction results.It is our hope that the combination of boosting or active learning with the SVM will achieve higher prediction accuracies.The second need is to furtherﬁlter the noise and outliers in the prediction process.If the window length is not appropri-ate,or the training samples are not independent and identical,the noise-to-signal ratio will increase.The central SVM may help to reduce noise and outliers. Another idea is to use the wavelet transform method to ﬁlter the outputs of theﬁrst and second layers.Wavelet transformsﬁlter signal noise and outliers;therefore, they should improve the prediction accuracy.The third aspect is to combine the information of other alignment proﬁles with the PSI-BLAST proﬁle.Cuff and Barton’s work showed that combining PSI-BLAST with HM-MER2proﬁles improved the predictions compared to using the PSI-BLAST proﬁles only.Therefore,practical strategies may be developed to fuse different informa-tion from different alignment proﬁles.ACKNOWLEDGMENTSOur thanks to J.A.Cuff and G.J.Barton for providing the CB513data set,to D.T.Jones for providing the useful pﬁlt program,and to Thorsten Joachims for providing the SVM light program.REFERENCES1.Thorton JM.From genome to function.Science2001;292:2095–2097.2.Bairoch A,Apweiler R.The SWISS-PROT protein sequence databank and its supplement TrEMBL in2000.Nucleic Acids Res 2000;28:45–48.3.Berman HM,Westbrook J,Feng Z,Gilliland G,Bhat TN,WeissigH,Shindyalov IN,Bourne PE.The Protein Data Bank.Nucleic Acids Res2000;28:235–242.4.Kabsch W,Sander C.Dictionary of protein secondary structure:pattern recognition of hydrogen bonded and geometrical features.Biopolymers1983;22:2577–2637.5.Frishman D,Argos P.Knowledge-based secondary structureassignment.Proteins1995;23:566–579.6.Richards FM,Kundrot CE.Identiﬁcation of structural motifs fromprotein coordinate data:secondary structure andﬁrst-level super-secondary structure.Proteins1988;3:71–84.7.Cuff JA,Barton GJ.Evaluation and improvement of multiplesequence methods for protein secondary structure prediction.Proteins1999;34:508–519.8.Chou PY,Fasman GD.Prediction of protein conformation.Bio-chemistry1974;13:211–215.9.Garnier J,Osguthorpe DJ,Robson B.Analysis and implications ofsimple methods for predicting the secondary structure of globular proteins.J Mol Biol1978;120:97–120.10.Rost B,Sander C.Prediction of secondary structure at better than70%accuracy.J Mol Biol1993;232:584–599.220–223.11.Riis SK,Krogh A.Improving prediction of protein secondarystructure using structured neural networks and multiple se-quence alignments.J Comput Biol1996;3:163–183.12.Baldi P,Brunak S,Frasconi P,Soda G,Pollastri G.Exploiting thepast and the future in protein secondary structure prediction.Bioinformatics1999;15:937–946.13.Chandonia JM,Karplus M.New methods for accurate predictionof protein secondary structure.Proteins1999;35:293–306.14.Hua S,Sun Z.A novel method of protein secondary structureTABLE parison with the results of the PHD,the Simple SVM and our PMSVMMethod SOV(%)Q3(%)QH(%)QE(%)QC(%)QHpre(%)QEpre(%)QCpre(%)CHCECCPHD173.570.87266727360—0.60.520.51 SVM174.671.27358757766690.610.510.52 PHD2—72.17062797764720.630.530.52 SVM276.273.57560797967700.650.530.54 PMSVM180.075.280.471.572.879.466.476.40.710.610.61 PMSVM278.174.079.369.37279.466.473.60.70.60.59 PHD,SVM1:Results obtained on the RS126set.PHD2:Results obtained on another data set which contains250protein chains(Rost and Sander).22SVM2:Results obtained on CB513set.PMSVM1:Result obtained on CB513set.First-layer SVM parameters:␥ϭ0.05,Cϭ2.3,Second-layer SVM parameters:␥ϭ2.5,Cϭ2. PMSVM2:Result obtained on CB396set.First-layer SVM parameters:␥ϭ0.05,Cϭ2.0,Second-layer SVM parameters:␥ϭ2.5,Cϭ2.742J.GUO ET AL.prediction with high segment overlap measure:support vector machine approach.J Mol Biol2001;308:397–407.15.Altschul SF,Madden TL,Schaffer AA,Zhang JH,Zhang Z,MillerW,Lipman DJ.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs.Nucleic Acids Res1997;25: 3389–3402.16.Jones DT.Protein secondary structure prediction based on position-speciﬁc scoring matrices.J Mol Biol1999;292:195–202.17.Roobaert D,Hulle MM.View based3D object recognition withsupport vector machines.In:Proceedings of the IEEE Interna-tional Workshop on Neural Networks for Signal Processing IEEE Press:Wisconsin;1999.p77–84.18.Schmidt M,Grish H.Speaker identiﬁcation via support vectorclassiﬁers.In:Proceeding of the International Conference on Acoustics,Speech and Signal Processing.Long Beach,CA:IEEEE Press;1996.p105–108.19.Drucker H,Wu D,Vapnik V.Support vector machines for spamcategorization.IEEE Trans Neural Networ1999;10:1048–1054.20.Vapnik V.The nature of statistical learning theory.New York:Springer-Verlag;1995.21.Vapnik V.Statistical learning theory.New York:Wiley;1998.22.Rost B,Sander C,Schneider R.Redeﬁning the goals of proteinsecondary structure prediction.J Mol Biol1994;235:13–26.23.Zemla A,Venclovas C,Fidelis K,Rost B.A modiﬁed deﬁnition ofSOV,a segment based measure for protein secondary structure prediction assessment.Proteins1999;34:220–223.STRUCTURE PREDICTION USING SVM AND PROFILES743。