Bioinformatics Integration Framework for Metabolic Pathway Data-Mining


Introduction to Bioinformatics (English)

Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, statistics, and other disciplines to analyze and interpret biological data. At its core, bioinformatics leverages computational tools and algorithms to process, manage, and mine biological information, enabling a deeper understanding of the molecular basis of life and its diverse phenomena.

The field of bioinformatics has exploded in recent years, driven by the exponential growth of biological data generated by high-throughput sequencing technologies, proteomics, genomics, and other omics approaches. This data deluge has presented both challenges and opportunities for researchers. On one hand, the sheer volume and complexity of the data require sophisticated computational methods for analysis. On the other hand, the wealth of information contained within these data holds the promise of transformative insights into the functions, interactions, and evolution of biological systems.

The core tasks of bioinformatics encompass genome annotation, sequence alignment and comparison, gene expression analysis, protein structure prediction and function annotation, and the integration of multi-omic data. These tasks require a range of computational tools and algorithms, often developed by bioinformatics experts in collaboration with biologists and other researchers.

Genome annotation, for example, involves the identification of genes and other genetic elements within a genome and the prediction of their functions. This process involves the use of bioinformatics algorithms to identify protein-coding genes, non-coding RNAs, and regulatory elements based on sequence patterns and other features. The resulting annotations provide a foundation for understanding the genetic basis of traits and diseases.

Sequence alignment and comparison are crucial for understanding the evolutionary relationships between species and for identifying conserved regions within genomes. Bioinformatics algorithms, such as BLAST and multiple sequence alignment tools, are widely used for these purposes. These algorithms enable researchers to compare sequences quickly and accurately, revealing patterns of conservation and divergence that inform our understanding of biological diversity and function.

Gene expression analysis is another key area of bioinformatics. It involves the quantification of the levels of mRNAs, proteins, and other molecules within cells and tissues, and the interpretation of these data to understand the regulation of gene expression and its impact on cellular phenotypes. Bioinformatics tools and algorithms are essential for processing and analyzing the vast amounts of data generated by high-throughput sequencing and other experimental techniques.

Protein structure prediction and function annotation are also important areas of bioinformatics. The structure of a protein determines its function, and bioinformatics methods can help predict the three-dimensional structure of a protein based on its amino acid sequence. These predictions can then be used to infer the protein's function and to understand how it interacts with other molecules within the cell.

The integration of multi-omic data is a rapidly emerging area of bioinformatics. It involves the integration and analysis of data from different omics platforms, such as genomics, transcriptomics, proteomics, and metabolomics. This approach enables researchers to understand the interconnectedness of different biological processes and to identify complex relationships between genes, proteins, and metabolites.

In addition to these core tasks, bioinformatics also plays a crucial role in translational research and personalized medicine. It enables the identification of disease-associated genes and the development of targeted therapeutics. By analyzing genetic and other biological data from patients, bioinformatics can help predict disease outcomes and guide treatment decisions.

The future of bioinformatics is bright. With the continued development of high-throughput sequencing technologies and other omics approaches, the amount of biological data available for analysis will continue to grow. This will drive the need for more sophisticated computational methods and algorithms to process and interpret these data. At the same time, the integration of bioinformatics with other disciplines, such as artificial intelligence and machine learning, will open up new possibilities for understanding the complex systems that underlie life.

In conclusion, bioinformatics is an essential field for understanding the molecular basis of life and its diverse phenomena. It leverages computational tools and algorithms to process, manage, and mine biological information, enabling a deeper understanding of the functions, interactions, and evolution of biological systems. As the amount of biological data continues to grow, the role of bioinformatics in research and medicine will become increasingly important.
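As a small, hedged illustration of the sequence comparison step described above, the following sketch uses Biopython's PairwiseAligner to align two short DNA fragments locally; the sequences and scoring values are arbitrary examples chosen for demonstration, not data taken from this text.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"          # Smith-Waterman-style local alignment
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

seq_a = "ACCGGTATGCTAGCTAGGCT"
seq_b = "TTACCGGTATGTTAGCTAGG"

alignments = aligner.align(seq_a, seq_b)
best = alignments[0]            # highest-scoring local alignment
print("score:", best.score)
print(best)                     # textual view of the aligned region
```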

Bioinformatics (English)

Bioinformatics is a rapidly growing field that combines biology, computer science, and information technology to analyze and interpret biological data. It has become an essential tool in modern scientific research, particularly in the fields of genomics, proteomics, and molecular biology. The advent of high-throughput sequencing technologies has led to an exponential increase in the amount of biological data available, and bioinformatics provides the means to manage, analyze, and interpret this vast amount of information.

At its core, bioinformatics involves the use of computational methods and algorithms to store, retrieve, organize, analyze, and interpret biological data. This includes tasks such as DNA and protein sequence analysis, gene and protein structure prediction, and the identification of functional relationships between different biological molecules. Bioinformatics also plays a crucial role in the development of new drugs, the understanding of disease mechanisms, and the study of evolutionary processes.

One of the primary applications of bioinformatics is in the field of genomics, where it is used to analyze and interpret DNA sequence data. Researchers can use bioinformatics tools to identify genes, predict their function, and study the genetic variation within and between species. This information is essential for understanding the genetic basis of diseases, developing personalized medicine, and exploring the evolutionary history of different organisms.

In addition to genomics, bioinformatics is also widely used in proteomics, the study of the structure and function of proteins. Bioinformatics tools can be used to predict the three-dimensional structure of proteins, identify protein-protein interactions, and study the role of proteins in biological processes. This information is crucial for understanding the mechanisms of disease, developing new drugs, and engineering proteins for industrial and medical applications.

Another important application of bioinformatics is in the field of systems biology, which aims to understand the complex interactions and relationships between different biological components within a living organism. By using computational models and simulations, bioinformaticians can study how these components work together to produce the observed behavior of a biological system. This knowledge can be used to develop new treatments for diseases, optimize agricultural practices, and gain a deeper understanding of the fundamental principles of life.

Bioinformatics is also playing a crucial role in the field of personalized medicine, where the goal is to tailor medical treatments to the unique genetic and molecular profile of each individual patient. By using bioinformatics tools to analyze a patient's genetic and molecular data, healthcare providers can identify the most effective treatments and predict the likelihood of adverse reactions to certain drugs. This approach has the potential to improve patient outcomes, reduce healthcare costs, and pave the way for more targeted and effective therapies.

In addition to its scientific applications, bioinformatics is also having a significant impact on various industries, such as agriculture, environmental science, and forensics. In agriculture, bioinformatics is used to improve crop yields, develop new pest-resistant varieties, and optimize the use of resources. In environmental science, bioinformatics is used to study the impact of human activities on ecosystems, identify new sources of renewable energy, and monitor the health of the planet. In forensics, bioinformatics is used to analyze DNA evidence and identify individuals involved in criminal activities.

Despite the many benefits of bioinformatics, there are also significant challenges and ethical considerations that need to be addressed. As the amount of biological data continues to grow, there is an increasing need for efficient data storage, management, and security systems. Additionally, the interpretation of biological data requires a deep understanding of both biological and computational principles, which can be a barrier for some researchers and healthcare providers. Furthermore, the use of bioinformatics in areas such as personalized medicine and forensics raises important ethical questions regarding privacy, informed consent, and the potential for discrimination based on genetic information. It is crucial that the development and application of bioinformatics technologies be guided by strong ethical principles and robust regulatory frameworks to ensure that they are used in a responsible and equitable manner.

In conclusion, bioinformatics is a rapidly evolving field that is transforming the way we understand and interact with the biological world. By combining the power of computer science and information technology with the insights of biology, bioinformatics is enabling new discoveries, improving human health, and shaping the future of scientific research. As the field continues to grow and evolve, it will be important for researchers, policymakers, and the public to work together to ensure that the benefits of bioinformatics are realized in a responsible and ethical manner.

The Human Genome Project: Glossary Entry for Bioinformatics

Bioinformatics is a field that combines biology, computer science, and information technology. It involves the development and use of computational tools and techniques to manage, analyze, and interpret biological data. Bioinformatics is used in a wide range of research areas, including genomics, proteomics, drug discovery, and disease diagnosis.

Key concepts in bioinformatics:
Genomics: the study of the structure and function of genomes.
Proteomics: the study of the structure and function of proteins.
Transcriptomics: the study of the structure and function of transcripts.
Metabolomics: the study of the structure and function of metabolites.
Bioinformatics databases: databases that store and manage biological data.
Bioinformatics tools: software tools that are used to analyze and interpret biological data.

Applications of bioinformatics:
Drug discovery: bioinformatics is used to identify new drug targets and to design new drugs.
Disease diagnosis: bioinformatics is used to develop new diagnostic tests for diseases.
Personalized medicine: bioinformatics is used to develop personalized treatment plans for patients.
Evolutionary biology: bioinformatics is used to study the evolution of species.

Challenges in bioinformatics:
Data explosion: the amount of biological data is growing rapidly, making it difficult to manage and analyze.
Data integration: biological data is often stored in different formats and in different databases, making it difficult to integrate and analyze.
Algorithm development: new algorithms are needed to analyze and interpret complex biological data.

Despite these challenges, bioinformatics is a rapidly growing field with the potential to revolutionize the way we understand and treat diseases.

Bioinformatics Exam Questions

Bioinformatics — I. Definitions

Silicon cloning (in silico cloning): a method that uses information from public databases, together with computer software analysis, to infer the coding sequence of a target gene and thereby assist the cloning of a full-length cDNA.

BLAST: the Basic Local Alignment Search Tool, a local alignment search tool used to compare a query sequence against a database. The earliest versions did not allow gaps, but the versions in current use permit gapped alignments.

Entrez: a database retrieval system maintained by NCBI. It covers nucleotide, protein and MEDLINE abstract databases, and very thorough links have been established among these three databases. One can therefore go from a DNA sequence to its protein product and to the related literature; in addition, each entry carries "neighboring" information pointing to entries closely related to the one queried. The nucleotide databases in Entrez are GenBank, EMBL and DDBJ; the protein databases are Swiss-Prot, PIR, PRF and PDB.

PSI-BLAST: an iterative search method that improves the rate at which similar sequences are found compared with BLAST and FASTA.
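The Entrez cross-links described above can also be exercised programmatically. The sketch below is a minimal illustration using Biopython's Bio.Entrez module; the e-mail address and the accession number are placeholders chosen for the example, not values taken from this text.

```python
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # placeholder; NCBI requires a contact address

# Fetch a protein record in FASTA format (the accession is only an example).
fetch = Entrez.efetch(db="protein", id="NP_000508.1", rettype="fasta", retmode="text")
protein = SeqIO.read(fetch, "fasta")
fetch.close()
print(protein.id, len(protein.seq), "residues")

# Follow Entrez "neighboring" links from the protein entry to related PubMed records.
uid = Entrez.read(Entrez.esearch(db="protein", term="NP_000508.1[Accession]"))["IdList"][0]
for linkset in Entrez.read(Entrez.elink(dbfrom="protein", db="pubmed", id=uid)):
    for linkdb in linkset.get("LinkSetDb", []):
        pmids = [link["Id"] for link in linkdb["Link"]]
        print("Linked PubMed IDs:", pmids[:5])
```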

ORF: an open reading frame (ORF) is a part of a gene sequence containing a stretch of bases that can encode a protein and is not interrupted by a stop codon. The exons encoding a protein join to form one continuous ORF. When a new gene is identified and its DNA sequence read, the corresponding protein sequence is still not immediately clear, because without further information a DNA sequence can be read and translated in six frames (three per strand, corresponding to three different reading start positions). ORF identification consists of examining these six reading frames and deciding which of them contains a stretch of DNA bounded by a start codon and a stop codon but containing no internal stop codon; a sequence meeting these conditions is likely to correspond to a genuine, single gene product. Identifying an ORF is a prerequisite for demonstrating that a new DNA sequence is part or all of a gene encoding a specific protein.

Similarity / identity: similarity describes, in a sequence alignment, how high a proportion of identical DNA bases or amino acid residues is shared between the query sequence and the target sequence.
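As a concrete illustration of the six-frame scan described in the ORF definition above, the following is a minimal, self-contained Python sketch. The 100-codon minimum length is an arbitrary illustrative threshold, not a value taken from the text.

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq: str, min_codons: int = 100):
    """Yield (start, end) indices of start->stop stretches in one reading frame."""
    start = None
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i                       # open a candidate ORF at the start codon
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3          # no internal stop codon by construction
            start = None

def six_frame_orfs(seq: str, min_codons: int = 100):
    """Scan both strands in all three frames, as in the definition above."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s[frame:], min_codons):
                # Coordinates are relative to the strand being scanned.
                yield strand, frame, frame + start, frame + end

if __name__ == "__main__":
    dna = "ATGAAATTTGGGTAA" * 3      # toy sequence, for demonstration only
    for strand, frame, start, end in six_frame_orfs(dna, min_codons=5):
        print(f"strand {strand} frame {frame}: ORF at {start}..{end}")
```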

Biological Products: Introduction, Lecture 1 (Part 2)

(IV) The position of modern biotechnology in scientific, technological and economic development
Modern biotechnology, together with information technology, new-materials technology and new-energy technology, ranks among the four pillars of science and technology that will affect national economies and people's livelihoods in the 21st century, and among the strongest growth points of the national economy. Governments, the scientific community and entrepreneurs attach great importance to it, support it vigorously, and have issued preferential policies to stimulate its development. Biological products are one of the main areas of research and development within modern biotechnology.
4) Others.
VI. Recombinant hormones
1) Polypeptide and protein hormones (Polypeptide Protein Hormone); 2) Steroid hormones (Steroid Hormone);
3) Amino acid hormones (Amino Acid Hormone);
2. Treatment

① Immune sera
Treating the corresponding infection or infectious disease with a specific immune serum is generally called specific serotherapy. The immune sera among biological products mostly refer to antitoxins, antibacterial sera and antiviral sera. ② Vaccines. At present, not many vaccines are used therapeutically. Brucella vaccine is used to treat brucellosis; BCG and anaerobic corynebacterium vaccines are often used as immune adjuvants in the treatment of certain diseases; ecological vaccines, such as flora-promoting preparations, are used to adjust the normal intestinal flora and thereby prevent intestinal diseases.
... a science concerned with the sources, structural characteristics, applications, production processes,
principles, current status, existing problems and development prospects
of these classes of biological products.
(II) Origin and development of modern biotechnology
Biotechnology has developed on the basis of the development and
convergence of several fundamental disciplines, and it has itself gone through several generations...
Biological products are a special kind of drug. They play
a decisive role in the occurrence, development and control of
infectious diseases of humans, animals and plants, and they are
indispensable for safeguarding the health of individuals and populations,
so the study of biological products is a fairly important course.
(II) Applications of biological products (application)

[bioinfo] Bioinformatics: Where Code Meets Biology

Note: nearly eight years have passed since I entered the bioinformatics field.

Bioinformatics brings together my three favourite subjects: biology, computer science and mathematics.

Yet if someone suddenly asked me what bioinformatics is, I still could not give an answer that satisfies me.

Hence this post.

Origins. It is said that in 1970 the Dutch scientists Paulien Hogeweg and Ben Hesper first coined the word "bioinformatica" in Dutch, and the English word "bioinformatics" was first used in 1978.

The two scientists used the term to mean: "The study of information processes in biotic systems." The definition contains two key phrases: biotic systems and information processes.

"Information processes" here, however, is not easy to grasp.

In addition, the renaming of a well-known journal in the field, "Bioinformatics", offers another way to gauge how widely the term "bioinformatics" came to be accepted.

The journal was founded in 1985 under the name Computer Applications in the Biosciences (CABIOS); it is also the official journal of the International Society for Computational Biology (ISCB), and it adopted its current name in 1998.

Definitions from different periods. Wikipedia [definition 1]. First, Wikipedia's explanation of bioinformatics: "Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology. The primary goal of bioinformatics is to increase the understanding of biological processes." This definition stresses the interdisciplinary nature of the field and the understanding of biological data, and regards DNA, RNA and protein sequence data as the most important kinds of biological data.

A review of feature selection techniques in bioinformatics

Abstract

Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.

1 INTRODUCTION

During the last decade, the motivation for applying feature selection (FS) techniques in bioinformatics has shifted from being an illustrative example to becoming a real prerequisite for model building. In particular, the high dimensional nature of many modelling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analyses and literature mining, has given rise to a wealth of feature selection techniques being presented in the field.

In this review, we focus on the application of feature selection techniques. In contrast to other dimensionality reduction techniques like those based on projection (e.g. principal component analysis) or compression (e.g. using information theory), feature selection techniques do not alter the original representation of the variables, but merely select a subset of them. Thus, they preserve the original semantics of the variables, hence offering the advantage of interpretability by a domain expert.

While feature selection can be applied to both supervised and unsupervised learning, we focus here on the problem of supervised learning (classification), where the class labels are known beforehand. The interesting topic of feature selection for unsupervised learning (clustering) is a more complex issue, and research into this field is recently getting more attention in several communities (Liu and Yu, 2005; Varshavsky et al., 2006).

The main aim of this review is to make practitioners aware of the benefits, and in some cases even the necessity, of applying feature selection techniques. Therefore, we provide an overview of the different feature selection techniques for classification: we illustrate them by reviewing the most important application fields in the bioinformatics domain, highlighting the efforts done by the bioinformatics community in developing novel and adapted procedures. Finally, we also point the interested reader to some useful data mining and bioinformatics software packages that can be used for feature selection.

2 FEATURE SELECTION TECHNIQUES

As many pattern recognition techniques were originally not designed to cope with large amounts of irrelevant features, combining them with FS techniques has become a necessity in many applications (Guyon and Elisseeff, 2003; Liu and Motoda, 1998; Liu and Yu, 2005). The objectives of feature selection are manifold, the most important ones being: (a) to avoid overfitting and improve model performance, i.e. prediction performance in the case of supervised classification and better cluster detection in the case of clustering, (b) to provide faster and more cost-effective models and (c) to gain a deeper insight into the underlying processes that generated the data.
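Objective (c), insight into the variables themselves, is what distinguishes selection from the projection-based reduction (e.g. PCA) mentioned in the introduction. The minimal sketch below contrasts the two; scikit-learn is assumed purely as illustrative tooling and is not prescribed by this review.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic "expression matrix": 60 samples x 500 features, 10 of them informative.
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)

# Projection: every new axis mixes all 500 original features, losing their semantics.
pca = PCA(n_components=10).fit(X)
print("PCA component 0 uses", int(np.sum(pca.components_[0] != 0)), "original features")

# Feature selection: the retained columns keep their original identity.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("Filter kept original feature indices:", selector.get_support(indices=True))
```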
However, the advantages of feature selection techniques come at a certain price, as the search for a subset of relevant features introduces an additional layer of complexity in the modelling task. Instead of just optimizing the parameters of the model for the full feature subset, we now need to find the optimal model parameters for the optimal feature subset, as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset (Daelemans et al., 2003). As a result, the search in the model hypothesis space is augmented by another dimension: the one of finding the optimal subset of relevant features. Feature selection techniques differ from each other in the way they incorporate this search in the added space of feature subsets in the model selection.

In the context of classification, feature selection techniques can be organized into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods. Table 1 provides a common taxonomy of feature selection methods, showing for each technique the most prominent advantages and disadvantages, as well as some examples of the most influential techniques.

Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice for a technique suited to the goals and resources of practitioners in the field.

Filter techniques assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, this subset of features is presented as input to the classification algorithm. Advantages of filter techniques are that they easily scale to very high-dimensional datasets, they are computationally simple and fast, and they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, and then different classifiers can be evaluated.

A common disadvantage of filter methods is that they ignore the interaction with the classifier (the search in the feature subset space is separated from the search in the hypothesis space), and that most proposed techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. In order to overcome the problem of ignoring feature dependencies, a number of multivariate filter techniques were introduced, aiming at the incorporation of feature dependencies to some degree.

Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The evaluation of a specific subset of features is obtained by training and testing a specific classification model, rendering this approach tailored to a specific classification algorithm. To search the space of all feature subsets, a search algorithm is then 'wrapped' around the classification model.
However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. These search methods can be divided in two classes: deterministic and randomized search algorithms. Advantages of wrapper approaches include the interaction between feature subset search and model selection, and the ability to take into account feature dependencies. A common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost.

In a third class of feature selection techniques, termed embedded techniques, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses. Just like wrapper approaches, embedded approaches are thus specific to a given learning algorithm. Embedded methods have the advantage that they include the interaction with the classification model, while at the same time being far less computationally intensive than wrapper methods.

3 APPLICATIONS IN BIOINFORMATICS

3.1 Feature selection for sequence analysis

Sequence analysis has a long-standing tradition in bioinformatics. In the context of feature selection, two types of problems can be distinguished: content and signal analysis. Content analysis focuses on the broad characteristics of a sequence, such as tendency to code for proteins or fulfillment of a certain biological function. Signal analysis on the other hand focuses on the identification of important motifs in the sequence, such as gene structural elements or regulatory elements.

Apart from the basic features that just represent the nucleotide or amino acid at each position in a sequence, many other features, such as higher order combinations of these building blocks (e.g. k-mer patterns) can be derived, their number growing exponentially with the pattern length k. As many of them will be irrelevant or redundant, feature selection techniques are then applied to focus on the subset of relevant variables.

3.1.1 Content analysis

The prediction of subsequences that code for proteins (coding potential prediction) has been a focus of interest since the early days of bioinformatics. Because many features can be extracted from a sequence, and most dependencies occur between adjacent positions, many variations of Markov models were developed. To deal with the high amount of possible features, and the often limited amount of samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which used interpolation between different orders of the Markov model to deal with small sample sizes, and a filter method (χ2) to select only relevant features. In further work, Delcher et al. (1999) extended the IMM framework to also deal with non-adjacent feature dependencies, resulting in the interpolated context model (ICM), which crosses a Bayesian decision tree with a filter method (χ2) to assess feature relevance. Recently, the avenue of FS techniques for coding potential prediction was further pursued by Saeys et al. (2007), who combined different measures of coding potential prediction, and then used the Markov blanket multivariate filter approach (MBF) to retain only the relevant ones.

A second class of techniques focuses on the prediction of protein function from sequence. The early work of Chuzhanova et al. (1998), who combined a genetic algorithm in combination with the Gamma test to score feature subsets for classification of large subunits of rRNA, inspired researchers to use FS techniques to focus on important subsets of amino acids that relate to the protein's functional class (Al-Shahib et al., 2005). An interesting technique is described in Zavaljevsky et al. (2002), using selective kernel scaling for support vector machines (SVM) as a way to assess feature weights, and subsequently remove features with low weights.

The use of FS techniques in the domain of sequence analysis is also emerging in a number of more recent applications, such as the recognition of promoter regions (Conilione and Wang, 2005), and the prediction of microRNA targets (Kim et al., 2006).

3.1.2 Signal analysis

Many sequence analysis methodologies involve the recognition of short, more or less conserved signals in the sequence, representing mainly binding sites for various proteins or protein complexes. A common approach to find regulatory motifs is to relate motifs to gene expression levels using a regression approach. Feature selection can then be used to search for the motifs that maximize the fit to the regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification approach is chosen to find discriminative motifs. The method is inspired by Ben-Dor et al. (2000) who use the threshold number of misclassification (TNoM, see further in the section on microarray analysis) to score genes for relevance to tissue classification. From the TNoM score, a P-value is calculated that represents the significance of each motif. Motifs are then sorted according to their P-value.

Another line of research is performed in the context of the gene prediction setting, where structural elements such as the translation initiation site (TIS) and splice sites are modelled as specific classification problems. The problem of feature selection for structural element recognition was pioneered in Degroeve et al. (2002) for the problem of splice site prediction, combining a sequential backward method together with an embedded SVM evaluation criterion to assess feature relevance. In Saeys et al. (2004), an estimation of distribution algorithm (EDA, a generalization of genetic algorithms) was used to gain more insight in the relevant features for splice site prediction. Similarly, the prediction of TIS is a suitable problem to apply feature selection techniques. In Liu et al. (2004), the authors demonstrate the advantages of using feature selection for this problem, using the feature-class entropy as a filter measure to remove irrelevant features.

In future research, FS techniques can be expected to be useful for a number of challenging prediction tasks, such as identifying relevant features related to alternative splice sites and alternative TIS.

3.2 Feature selection for microarray analysis

During the last decade, the advent of microarray datasets stimulated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques, because of their large dimensionality (up to several tens of thousands of genes) and their small sample sizes (Somorjai et al., 2003).
Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain.

In order to deal with these particular characteristics of microarray data, the obvious need for dimension reduction techniques was realized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and soon their application became a de facto standard in the field. Whereas in 2001, the field of microarray analysis was still claimed to be in its infancy (Efron et al., 2001), a considerable and valuable effort has since been done to contribute new and adapt known FS methodologies (Jafari and Azuaje, 2006). A general overview of the most influential techniques, organized according to the general FS taxonomy of Section 2, is shown in Table 2.

Table 2. Key references for each type of feature selection technique in the microarray domain.

3.2.1 The univariate filter paradigm: simple yet efficient

Because of the high dimensionality of most microarray analyses, fast and efficient FS techniques such as univariate filter methods have attracted most attention. The prevalence of these univariate techniques has dominated the field, and up to now comparative evaluations of different classification and FS techniques over DNA microarray datasets only focused on the univariate case (Dudoit et al., 2002; Lee et al., 2005; Li et al., 2004; Statnikov et al., 2005). This domination of the univariate approach can be explained by a number of reasons:

– the output provided by univariate feature rankings is intuitive and easy to understand;
– the gene ranking output could fulfill the objectives and expectations that bio-domain experts have when wanting to subsequently validate the result by laboratory techniques or in order to explore literature searches. The experts could not feel the need for selection techniques that take into account gene interactions;
– the possible unawareness of subgroups of gene expression domain experts about the existence of data analysis techniques to select genes in a multivariate way;
– the extra computation time needed by multivariate gene selection techniques.

Some of the simplest heuristics for the identification of differentially expressed genes include setting a threshold on the observed fold-change differences in gene expression between the states under study, and the detection of the threshold point in each gene that minimizes the number of training sample misclassification (threshold number of misclassification, TNoM (Ben-Dor et al., 2000)). However, a wide range of new or adapted univariate feature ranking techniques has since then been developed. These techniques can be divided into two classes: parametric and model-free methods (see Table 2).

Parametric methods assume a given distribution from which the samples (observations) have been generated. The two-sample t-test and ANOVA are among the most widely used techniques in microarray studies, although the usage of their basic form, possibly without justification of their main assumptions, is not advisable (Jafari and Azuaje, 2006). Modifications of the standard t-test to better deal with the small sample size and inherent noise of gene expression datasets include a number of t- or t-test like statistics (differing primarily in the way the variance is estimated) and a number of Bayesian frameworks (Baldi and Long, 2001; Fox and Dimmic, 2006).
Although Gaussian assumptions have dominated the field, other types of parametrical approaches can also be found in the literature, such as regression modelling approaches (Thomas et al., 2001) and Gamma distribution models (Newton et al., 2001).

Due to the uncertainty about the true underlying distribution of many gene expression scenarios, and the difficulties to validate distributional assumptions because of small sample sizes, non-parametric or model-free methods have been widely proposed as an attractive alternative to make less stringent distributional assumptions (Troyanskaya et al., 2002). Many model-free metrics, frequently borrowed from the statistics field, have demonstrated their usefulness in many gene expression studies, including the Wilcoxon rank-sum test (Thomas et al., 2001), the between-within classes sum of squares (BSS/WSS) (Dudoit et al., 2002) and the rank products method (Breitling et al., 2004).

A specific class of model-free methods estimates the reference distribution of the statistic using random permutations of the data, allowing the computation of a model-free version of the associated parametric tests. These techniques have emerged as a solid alternative to deal with the specificities of DNA microarray data, and do not depend on strong parametric assumptions (Efron et al., 2001; Pan, 2003; Park et al., 2001; Tusher et al., 2001). Their permutation principle partly alleviates the problem of small sample sizes in microarray studies, enhancing the robustness against outliers.

We also mention promising types of non-parametric metrics which, instead of trying to identify differentially expressed genes at the whole population level (e.g. comparison of sample means), are able to capture genes which are significantly disregulated in only a subset of samples (Lyons-Weiler et al., 2004; Pavlidis and Poirazi, 2006). These types of methods offer a more patient specific approach for the identification of markers, and can select genes exhibiting complex patterns that are missed by metrics that work under the classical comparison of two prelabelled phenotypic groups. In addition, we also point out the importance of procedures for controlling the different types of errors that arise in this complex multiple testing scenario of thousands of genes (Dudoit et al., 2003; Ploner et al., 2006; Pounds and Cheng, 2004; Storey, 2002), with a special focus on contributions for controlling the false discovery rate (FDR).

3.2.2 Towards more advanced models: the multivariate paradigm for filter, wrapper and embedded techniques

Univariate selection methods have certain restrictions and may lead to less accurate classifiers by, e.g. not taking into account gene–gene interactions. Thus, researchers have proposed techniques that try to capture these correlations between genes.

The application of multivariate filter methods ranges from simple bivariate interactions (Bø and Jonassen, 2002) towards more advanced solutions exploring higher order interactions, such as correlation-based feature selection (CFS) (Wang et al., 2005; Yeoh et al., 2002) and several variants of the Markov blanket filter method (Gevaert et al., 2006; Mamitsuka, 2006; Xing et al., 2001).
The Minimum Redundancy-Maximum Relevance (MRMR) (Ding and Peng, 2003) and Uncorrelated Shrunken Centroid (USC) (Yeung and Bumgarner, 2003) algorithms are two other solid multivariate filter procedures, highlighting the advantage of using multivariate methods over univariate procedures in the gene expression domain.

Feature selection using wrapper or embedded methods offers an alternative way to perform a multivariate gene subset selection, incorporating the classifier's bias into the search and thus offering an opportunity to construct more accurate classifiers. In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics (Blanco et al., 2004; Jirapech-Umpai and Aitken, 2005; Li et al., 2001; Ooi and Tan, 2003), although also a few examples use sequential search techniques (Inza et al., 2004; Xiong et al., 2001). An interesting hybrid filter-wrapper approach is introduced in Ruiz et al. (2006), crossing a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method.

Another characteristic of any wrapper procedure concerns the scoring function used to evaluate each gene subset found. As the 0–1 accuracy measure allows for comparison with previous works, the vast majority of papers uses this measure. However, recent proposals advocate the use of methods for the approximation of the area under the ROC curve (Ma and Huang, 2005), or the optimization of the LASSO (Least Absolute Shrinkage and Selection Operator) model (Ghosh and Chinnaiyan, 2005). ROC curves certainly provide an interesting evaluation measure, especially suited to the demand for screening different types of errors in many biomedical scenarios.

The embedded capacity of several classifiers to discard input features and thus propose a subset of discriminative genes has been exploited by several authors. Examples include the use of random forests (a classifier that combines many single decision trees) in an embedded way to calculate the importance of each gene (Díaz-Uriarte and Alvarez de Andrés, 2006; Jiang et al., 2004). Another line of embedded FS techniques uses the weights of each feature in linear classifiers, such as SVMs (Guyon et al., 2002) and logistic regression (Ma and Huang, 2005). These weights are used to reflect the relevance of each gene in a multivariate way, and thus allow for the removal of genes with very small weights.

Partially due to the higher computational complexity of wrapper and to a lesser degree embedded approaches, these techniques have not received as much interest as filter proposals. However, an advisable practice is to pre-reduce the search space using a univariate filter method, and only then apply wrapper or embedded methods, hence fitting the computation time to the available resources.

3.3 Mass spectra analysis

Mass spectrometry technology (MS) is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with their corresponding signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15 500 data points in the spectrum between 500 and 20 000 m/z, and the number of points even grows using higher resolution instruments.

For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity.
As Somorjai et al. (2003) explain, the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as it is the case with gene expression datasets. Although the amount of publications on mass spectrometry based data mining is not comparable to the level of maturity reached in the microarray analysis domain, an interesting collection of methods has been presented in the last 4–5 years (see Hilario et al., 2006; Shin and Markey, 2006 for recent reviews) since the pioneering work of Petricoin et al. (2002).

Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the following crucial step is to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a predictive feature, thus applying FS techniques over initial huge pools of about 15 000 variables (Li et al., 2004; Petricoin et al., 2002), up to around 100 000 variables (Ball et al., 2002). On the other hand, a great deal of the current studies performs aggressive feature extraction procedures using elaborated peak detection and alignment techniques (see Coombes et al., 2007; Hilario et al., 2006; Shin and Markey, 2006 for a detailed description of these techniques). These procedures tend to seed the dimensionality from which supervised FS techniques will start their work in less than 500 variables (Bhanot et al., 2006; Ressom et al., 2007; Tibshirani et al., 2004). A feature extraction step is thus advisable to set the computational costs of many FS techniques to a feasible size in these MS scenarios.

Table 3 presents an overview of FS techniques used in the domain of mass spectrometry. Similar to the domain of microarray analysis, univariate filter techniques seem to be the most common techniques used, although the use of embedded techniques is certainly emerging as an alternative. Although the t-test maintains a high level of popularity (Liu et al., 2002; Wu et al., 2003), other parametric measures such as the F-test (Bhanot et al., 2006), and a notable variety of non-parametric scores (Tibshirani et al., 2004; Yu et al., 2005) have also been used in several MS studies. Multivariate filter techniques, on the other hand, are still somewhat underrepresented (Liu et al., 2002; Prados et al., 2004).

Table 3. Key references for each type of feature selection technique in the domain of mass spectrometry.

Wrapper approaches have demonstrated their usefulness in MS studies by a group of influential works. Different types of population-based randomized heuristics are used as search engines in the major part of these papers: genetic algorithms (Li et al., 2004; Petricoin et al., 2002), particle swarm optimization (Ressom et al., 2005) and ant colony procedures (Ressom et al., 2007). It is worth noting that while the first two references start the search procedure in ∼15 000 dimensions by considering each m/z ratio as an initial predictive feature, aggressive peak detection and alignment processes reduce the initial dimension to about 300 variables in the last two references (Ressom et al., 2005; Ressom et al., 2007).

An increasing number of papers uses the embedded capacity of several classifiers to discard input features. Variations of the popular method originally proposed for gene expression domains by Guyon et al. (2002), using the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the MS domain (Jong et al., 2004; Prados et al., 2004; Zhang et al., 2006). Based on a similar framework, the weights of the input masses in a neural network classifier have been used to rank the features' importance in Ball et al. (2002). The embedded capacity of random forests (Wu et al., 2003) and other types of decision tree-based algorithms (Geurts et al., 2005) constitutes an alternative embedded FS strategy.

4 DEALING WITH SMALL SAMPLE DOMAINS

Small sample sizes, and their inherent risk of imprecision and overfitting, pose a great challenge for many modelling problems in bioinformatics (Braga-Neto and Dougherty, 2004; Molinaro et al., 2005; Sima and Dougherty, 2006). In the context of feature selection, two initiatives have emerged in response to this novel experimental situation: the use of adequate evaluation criteria, and the use of stable and robust feature selection models.

4.1 Adequate evaluation criteria
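One evaluation criterion widely recommended for small-sample settings is to keep feature selection inside the cross-validation loop, so that accuracy estimates are not inflated by selection bias. The sketch below (again assuming scikit-learn, purely as an illustration and not taken from this review) shows the pattern with a Pipeline that refits the selector on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small-sample, high-dimensional synthetic data (50 samples, 2000 features).
X, y = make_classification(n_samples=50, n_features=2000, n_informative=10,
                           random_state=0)

# Because selection lives inside the pipeline, no test-fold information
# leaks into the choice of features.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("clf", SVC(kernel="linear")),
])
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print("CV accuracy with in-fold selection: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```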

English essay: Prospects of biometric technology in government health affairs management services

Biometric technology, particularly in the realm of healthcare, has seen significant advancements in recent years. Its integration into government health affairs management services holds immense potential, revolutionizing how health services are delivered and managed. This article explores the prospects of biometric technology in the government healthcare sector.

One of the primary applications of biometric technology in healthcare is in patient identification. Traditional methods, such as ID cards or passwords, are prone to errors and can be easily forged or lost. Biometric identifiers, such as fingerprints or facial recognition, offer a more secure and reliable way to verify a patient's identity. This can help prevent medical errors, such as administering the wrong treatment or medication to a patient.

Another area where biometric technology can have a significant impact is in the management of health records. By using biometric identifiers, such as fingerprints or iris scans, healthcare providers can quickly and accurately access a patient's medical history, medications, and other relevant information. This can lead to more efficient and personalized care, as healthcare providers can make more informed decisions based on a patient's medical history.

Biometric technology can also improve the efficiency of healthcare services. For example, by using biometric identifiers, such as fingerprints, healthcare providers can quickly check in patients and access their medical records, reducing wait times and administrative burdens. This can lead to cost savings for healthcare providers and improved satisfaction for patients.

Furthermore, biometric technology can enhance security and privacy in healthcare. Biometric identifiers are unique to each individual and cannot be easily replicated or stolen. This can help protect patient information from unauthorized access and ensure the confidentiality of medical records.

In conclusion, biometric technology holds immense potential in the government healthcare sector. Its applications in patient identification, health record management, efficiency improvement, and security enhancement can lead to significant benefits for both healthcare providers and patients. As the technology continues to advance, we can expect to see even greater integration of biometric technology in healthcare, improving the quality and accessibility of healthcare services.

Using bioinformatics to automatically extract disease-related information from the biomedical literature

Using bioinformatics to automatically extract disease-related gene point-mutation information from the biomedical literature. The term bioinformatics was first coined and used by the American scholar Dr. Hwa A. Lim (林华安).

Bioinformatics is the product of many intersecting disciplines, involving biology, mathematics, physics, computer science, information science and other fields.

In the narrow sense, bioinformatics is the acquisition, storage, analysis and interpretation of biological information, while computational biology refers to the development of the corresponding algorithms and computer applications for those purposes.

There is no strict boundary between the two disciplines, and together they are referred to as bioinformatics.

An important goal of biomedical research is to link mutations to the corresponding disease phenotypes.

Most disease-related mutation data, however, are buried in the biomedical literature as free text and lack the structure needed for convenient retrieval and lookup.

The rapid turnover of information and the ever-growing body of literature make extracting this mutation information difficult.

Mutation information for proteins and DNA is stored in databases such as Mendelian Inheritance in Man (OMIM) and Swiss-Prot.

Data-mining methods can extract mutation information from these databases with an accuracy of 0.98, but there is as yet no method that correctly and automatically maps mutations to the associated diseases.

Existing algorithms can identify point mutations (for example MutationFinder) or mutations together with the names of their associated genes and proteins (for example MEMA and MuteXt).

Most "mutation + gene" approaches rely on their own interfaces and algorithms to represent point-mutation information and to collect textual data.

For example, Mutation Grab uses a graph-based method, whereas MutationMiner uses structural visualization for presentation.

All of these methods, however, focus on the correctness of extracting point mutations and the associated genes.

A new, efficient approach identifies point mutations in the biomedical literature together with their relationships to disease phenotypes.

It combines data mining and sequence analysis to identify point mutations and the related diseases.

The PubMed engine is used to retrieve a set of abstracts from MEDLINE.

The vocabulary is restricted to MEDLINE's Medical Subject Headings (MeSH).
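As a hedged illustration of the retrieval-plus-extraction step just described, the sketch below uses Biopython's Entrez interface to run a MeSH-restricted PubMed query and a simple regular expression for protein point mutations written in the wild-type/position/mutant style (e.g. A123T). The query terms, the e-mail placeholder and the regex are illustrative assumptions, not the patterns used by MutationFinder or the other tools named above.

```python
import re
from Bio import Entrez, Medline

Entrez.email = "your.name@example.org"   # placeholder; NCBI requires a contact address

# MeSH-restricted PubMed query (terms chosen only for illustration).
query = '"Point Mutation"[MeSH Terms] AND "Neoplasms"[MeSH Terms]'
ids = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=20))["IdList"]

handle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")
records = Medline.parse(handle)

# Wild-type residue, position, mutant residue in one-letter amino acid codes.
point_mutation = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d{1,5})([ACDEFGHIKLMNPQRSTVWY])\b")

for rec in records:
    abstract = rec.get("AB", "")
    hits = point_mutation.findall(abstract)
    if hits:
        print(rec.get("PMID"), hits)
handle.close()
```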

Bioinformatics

Chien-Yu Chen Graduate school of biotechnology and bioinformatics, Yuan Ze University
Course Information

Web page
– .tw/~cychen/course/922BioInfo/922-BioInfo-CourseInformation.htm
Functional Genomics – Research Issues

Which genes are expressed in which tissues? How is the expression of a gene affected by extracellular influences? Which genes are expressed during the development of an organism? What is the effect of misregulated expression of a gene? What patterns of gene expression cause a disease or lead to disease progression? What patterns of gene expression influence response to treatment?
Reading Frames
Consider the double-stranded DNA sequence below
5' CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG 3' 3' GTTACCGATCCATGATACATACTCTAGTACTAGAAATGTTTAGGCTC 5'
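To connect the slide's example to code, the sketch below (Biopython assumed) takes the top strand shown above and prints the translation of all six reading frames; trailing bases that do not complete a codon are trimmed before translation.

```python
from Bio.Seq import Seq

top = Seq("CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG")  # 5'->3' strand from the slide

for label, strand in (("+", top), ("-", top.reverse_complement())):
    for frame in range(3):
        sub = strand[frame:]
        sub = sub[: len(sub) - len(sub) % 3]   # trim the incomplete final codon
        print(f"{label}{frame + 1}: {sub.translate()}")
```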

Bioinformatics Integration Framework for Metabolic Pathway Data-Mining

Tomás Arredondo V.¹, Michael Seeger P.², Lioubov Dombrovskaia³, Jorge Avarias A.³, Felipe Calderón B.³, Diego Candel C.³, Freddy Muñoz R.³, Valeria Latorre R.², Loreine Agulló², Macarena Cordova H.², and Luis Gómez²

¹ Departamento de Electrónica, e-mail: tarredondo@elo.utfsm.cl
² Millennium Nucleus EMBA, Departamento de Química
³ Departamento de Informática
Universidad Técnica Federico Santa María, Av. España 1680, Valparaíso, Chile

Abstract. A vast amount of bioinformatics information is continuously being introduced to different databases around the world. Handling the various applications used to study this information presents a major data management and analysis challenge to researchers. The present work investigates the problem of integrating heterogeneous applications and databases towards providing a more efficient data-mining environment for bioinformatics research. A framework is proposed and GeXpert, an application using the framework towards metabolic pathway determination, is introduced. Some sample implementation results are also presented.

1 Introduction

Modern biotechnology aims to provide powerful and beneficial solutions in diverse areas. Some of the applications of biotechnology include biomedicine, bioremediation, pollution detection, marker assisted selection of crops, pest management, biochemical engineering and many others [3,14,22,21,29]. Because of the great interest in biotechnology there has been a proliferation of separate and disjoint databases, data formats, algorithms and applications. Some of the many types of databases currently in use include: nucleotide sequences (e.g. Ensembl, GenBank, DDBJ, EMBL), protein sequences (e.g. SWISS-PROT, InterPro, PIR, PRF), enzyme databases (e.g. Enzymes), metabolic pathways (e.g. ERGO, KEGG: Kyoto Encyclopedia of Genes and Genomes database) and literature references (e.g. PubMed) [1,4,11,6,23]. The growth in the amount of information stored has been exponential: since the 1970s, as of April 2004 there were over 39 billion bases in Entrez NCBI (National Center of Bioinformatics databases), while the number of abstracts in PubMed has been growing by 10,000 abstracts per week since 2002 [3,13].

The availability of this data has undoubtedly accelerated biotechnological research. However, because these databases were developed independently and are managed autonomously, they are highly heterogeneous, hard to cross-reference, and ill-suited to process mixed queries. Also, depending on the database being accessed, the data in them is stored in a variety of formats including a host of graphic formats, RAW sequence data, FASTA, PIR, MSF, CLUSTALW, and other text-based formats including XML/HTML. Once the desired data is retrieved from one of the database(s) it typically has to be manually manipulated and addressed to another database or application to perform a required action such as database and homology search (e.g. BLAST, Entrez) or sequence alignment and gene analysis (e.g. ClustalW, T-Coffee, Jalview, GenomeScan, Dialign, Vector NTI, Artemis) [8].

Beneficial application developments would occur more efficiently if the large amounts of biological feature data could be seamlessly integrated with data from literature, databases and applications for data-mining, visualization and analysis. In this paper we present a framework for bioinformatic literature, database, and application integration. An application based on this framework is shown that supports metabolic pathway research within a single easy-to-use graphical interface with assisting fuzzy logic decision support. To our knowledge, an integration framework encompassing all these areas has not been attempted before. Early results have shown that the integration framework and application could be useful to bioinformatics researchers.

In Section 2, we describe current metabolic pathway research methodology. In Section 3 we describe existing integrating architectures and applications. Section 4 describes the architecture, and GeXpert, an application using this framework, is introduced. Finally, some conclusions are drawn and directions of future work are presented.

2 Metabolic Pathway Research

For metabolic pathway reconstruction, experts have traditionally used a time-intensive iterative process [30]. As part of this process, genes first have to be selected as candidates for encoding an enzyme within a potential metabolic pathway within an organism. Their selection then has to be validated with literature references (e.g. non-hypothetical genes in GenBank) and using bioinformatics tools (e.g. BLAST: Basic Local Alignment Search Tool) for finding orthologous genes in various other organisms. Once a candidate gene has been determined, sequence alignment of the candidate gene with the sequence of the organism under study has to be performed in a different application for gene locations (e.g. Vector NTI, ARTEMIS). Once the genes required have been found in the organism, the metabolic pathway has to be confirmed experimentally in the laboratory. For example, using the genome sequence of Acidithiobacillus ferrooxidans diverse metabolic pathways have been determined.

One major group of organisms that is currently undergoing metabolic pathway research is bacteria. Bacteria possess the highest metabolic versatility of the three domains of living organisms. This versatility stems from their expansion into different natural niches, a remarkable degree of physiological and genetic adaptability and their evolutionary diversity. Microorganisms play a main role in the carbon cycle and in the removal of natural and man-made waste chemical compounds from the environment [17]. For example, Burkholderia xenovorans LB400 is a bacterium capable of degrading a wide range of PCBs [7,29].

Because of the complex and noisy nature of the data, any selection of candidate genes as part of metabolic pathways is currently only done by human experts prior to biochemical verification. The lack of integration and standards in database, application and file formats is time consuming and forces researchers to develop ad hoc data management processes that could be prone to error. In addition, the possibility of using soft computing based pattern detection and analysis techniques (e.g. fuzzy logic) has not been fully explored as an aid to the researcher within such environments [1,23].

3 Integration Architectures

The trend in the field is towards data integration. Research projects continue to generate large amounts of raw data, and this is annotated and correlated with the data in the public databases. The ability to generate new data continues to outpace the ability to verify them in the laboratory and therefore to exploit them. Validation experiments and the ultimate conversion of data to validated knowledge need expert human involvement with the data and in the laboratory, consuming time and resources. Any effort of data and system integration is an attempt towards reducing the time spent by experts unnecessarily, which could be better spent in the lab [5,9,20]. The biologist or biochemist not only needs to be an expert in his field but must also stay up to date with the latest software tools or even develop his own tools to be able to perform his research [16,28]. One example of these types of tools is BioJava, which is an open source set of Java components such as parsers, file manipulation tools, sequence translation and proteomic components that allow extensive customization but still require the development of source code [25]. The goal should be integrated user-friendly systems that would greatly facilitate the constructive cycle of computational model building and experimental verification for the systematic analysis of an organism [5,8,16,20,28,30].

Sun et al. [30] have developed a system, IdentiCS, which combines the identification of coding sequences (CDS) with the reconstruction, comparison and visualization of metabolic networks. IdentiCS uses sequences from public databases to perform a BLAST query to a local database with the genome in question. Functional information from the CDSs is used for metabolic reconstruction. One shortcoming is that the system does not incorporate visualization or ORF (Open Reading Frame) selection and it includes only one application for metabolic sequence reconstruction (BLAST).

Information systems for querying, visualization and analysis must be able to integrate data on a large scale. Visualization is one of the key ways of making the researcher's work easier. For example, the spatial representation of the genes within the genome shows the location of the gene, reading direction and metabolic function. This information is made available by applications such as Vector NTI and Artemis. The function of the gene is interpreted through its product, normally the protein in a metabolic pathway, whose description is available from several databases such as KEGG or ERGO [24]. Usually, the metabolic pathways are represented as a flowchart of several levels of abstraction, which are constructed manually. Recent advances in metabolic network visualization include virtual reality use [26], mathematical model development [27], and a modeling language [19]. These techniques synthesize the discovery of genes and are available in databases such as ExPASy, but they should be transparent to the researcher through their seamless integration into the research application environment.

The software engineering challenges and opportunities in integrating and visualizing the data are well documented [16,28]. There are some application servers that use a SOAP interface to answer queries. Linking such tools into a unified working environment is non-trivial and has not been done to date [2].
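As a concrete, present-day illustration of the kind of programmatic service access such integration layers wrap, the sketch below retrieves a reference pathway from KEGG over its public REST interface. This is an assumption made purely for illustration; it is not the SOAP mechanism discussed above, nor the exact interface used by the KEGG component described later in this paper.

```python
import urllib.request

BASE = "https://rest.kegg.jp"   # KEGG's public REST interface (an illustrative choice)

def kegg(path: str) -> str:
    """Fetch one KEGG REST resource and return the response body as text."""
    with urllib.request.urlopen(f"{BASE}/{path}") as response:
        return response.read().decode()

# Reference pathway entry for glycolysis / gluconeogenesis as a flat-file record.
record = kegg("get/map00010")
print(record.splitlines()[0])                 # header line of the pathway entry

# Enzymes (EC numbers) cross-linked to the same pathway, one mapping per line.
for line in kegg("link/ec/map00010").splitlines()[:5]:
    print(line)
```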
One recent approach towards human-centered integration is BioUse, a web portal developed by the Human Centered Software Engineering Group at Concordia University. BioUse provided an adaptable interface to NCBI, BLAST and ClustalW, which attempted to shield the novice user from unnecessary complexity. As users became increasingly familiar with the application, the portal added shortcuts and personalization features for different kinds of users (e.g. pharmacologists, microbiologists) [16]. BioUse was limited in scope as a prototype and its website is no longer available. This line of research is continued in the CO-DRIVE project, but it is not yet complete [10].

4 Integration Framework and Implementation

This project has focused on the development of a framework using open standards for integrating heterogeneous databases, web services and applications/tools. The framework has been applied in GeXpert, an application for bioinformatics metabolic pathway reconstruction research in bacteria. In addition, the application uses fuzzy logic to help select the best candidate genes or sequences for specified metabolic pathways. Previously, this categorization was typically done in an ad hoc manner by manually combining various criteria such as e-value, identities, gaps, positives and score. Fuzzy logic enables an efficiency enhancement by providing an automated sifting mechanism for a very labor-intensive bioinformatics procedure currently performed by researchers.

GeXpert is used to find, build and edit metabolic pathways (central or peripheral), perform protein searches in NCBI, and perform nucleotide comparisons of other organisms against the sequenced one (using tblastn). The application can also search for 3D models associated with a protein or enzyme (using the Cn3D viewer), generate ORF diagrams for the sequenced genome, generate reports on the progress of the project, and aid in the selection of BLAST results using T-S-K (Takagi-Sugeno-Kang) fuzzy logic.
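The following sketch illustrates the kind of zero-order Takagi-Sugeno-Kang inference that can rank BLAST hits from normalized scores. The three inputs, the membership functions and the handful of rules shown here are illustrative assumptions chosen for a compact example; they are not the actual rule base used by GeXpert (described in Section 4.1).

/**
 * Minimal zero-order Takagi-Sugeno-Kang (T-S-K) sketch for ranking BLAST
 * hits from normalized criteria (0..1), where higher is assumed to be
 * better: e-value quality, identity fraction and gap quality.
 * Membership functions and rules are illustrative only.
 */
public class TskBlastRanker {

    /** Triangular membership function with peak at b and support (a, c). */
    static double tri(double x, double a, double b, double c) {
        if (x <= a || x >= c) {
            return 0.0;
        }
        return x < b ? (x - a) / (b - a) : (c - x) / (c - b);
    }

    // Three coarse fuzzy sets per input (low, medium, high) on [0, 1].
    static double low(double x)    { return tri(x, -0.5, 0.0, 0.5); }
    static double medium(double x) { return tri(x,  0.0, 0.5, 1.0); }
    static double high(double x)   { return tri(x,  0.5, 1.0, 1.5); }

    /**
     * Zero-order T-S-K: each rule fires with the product of its antecedent
     * memberships and contributes a constant consequent (a quality score);
     * the crisp output is the firing-strength-weighted average.
     */
    static double score(double evalueQuality, double identity, double gapQuality) {
        // Each rule: { antecedent firing strength, constant consequent }.
        double[][] rules = {
            { high(evalueQuality)   * high(identity)   * high(gapQuality),   1.0 },
            { high(evalueQuality)   * medium(identity) * medium(gapQuality), 0.7 },
            { medium(evalueQuality) * medium(identity) * medium(gapQuality), 0.5 },
            { medium(evalueQuality) * low(identity)    * medium(gapQuality), 0.3 },
            { low(evalueQuality)    * low(identity)    * low(gapQuality),    0.0 },
        };

        double weighted = 0.0, total = 0.0;
        for (double[] rule : rules) {
            weighted += rule[0] * rule[1];
            total += rule[0];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    public static void main(String[] args) {
        // Hypothetical normalized values for two candidate hits.
        System.out.printf("strong hit: %.2f%n", score(0.95, 0.85, 0.90));
        System.out.printf("weak hit:   %.2f%n", score(0.30, 0.40, 0.55));
    }
}

A full system would of course enumerate a complete rule base over all membership functions rather than the five hand-picked rules above; the weighted-average defuzzification is what makes the ranking a single crisp score per hit.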
The architecture is implemented in three layers: presentation, logic and data [18]. As shown in Figure 1, the presentation layer provides both a web-based and a desktop-based interface to perform the tasks previously mentioned.

Fig. 1. High Level Framework Architecture

The logic layer consists of a core engine and a web server. The core engine performs bioinformatic processing functions and uses different tools as required: BLAST for protein and sequence alignment, ARTEMIS for analysis of nucleotide sequences, Cn3D for 3D visualization, and a T-S-K fuzzy logic library for candidate sequence selection. The Web Server component provides an HTTP interface to the GeXpert core engine.

As seen in Figure 1, the data layer includes a communications manager and a data manager. The communications manager is charged with obtaining data (protein and nucleotide sequences, metabolic pathways and 3D models) from various databases and application sources using various protocols (application calls, TCP/IP, SOAP). The data manager implements data persistence as well as temporary cache management for all objects related to the research process.

4.1 Application Implementation: GeXpert

GeXpert [15] is an open source implementation of the framework previously described. It is built on Eclipse [12] for multiplatform support. The GeXpert desktop application as implemented consists of the following:

– Metabolic pathway editor: tasked with editing or creating metabolic pathways, which it shows as directed graphs of enzymes and compounds.
– Protein search viewer: charged with showing the results of protein searches with user-specified parameters.
– Nucleotide search viewer: shows nucleotide search results given user-specified proteins/parameters.
– ORF viewer: used to visualize surrounding nucleotide ORFs with a map of colored arrows. The colors indicate the ORF status (found, indeterminate, erroneous or ignored).

The GeXpert core component implements the application logic. It consists of the following elements:

– Protein search component: manages protein search requests and their results.
– Nucleotide search component: manages nucleotide search requests and their results. In addition, it calls the fuzzy component with the specified search parameters in order to determine the best candidates for inclusion in the metabolic pathway.
– Fuzzy component: attempts to determine the quality of nucleotide search results using fuzzy criteria [1]. The following normalized (0 to 1) criteria are used: e-value, identities, gaps. Each has five membership functions (very low, low, medium, high, very high); the number of rules used is 243 (3^5).
– ORF component: identifies the ORFs present in a requested genome region.
– Genome component: manages requests for genome regions and their results.

The GeXpert communications manager receives requests from the GeXpert core module to obtain and translate data from external sources. This module consists of the following subcomponents:

– BLAST component: calls the BLAST application, indicating the protein to be analyzed and the organism database to be used.
– BLAST parser component: translates BLAST results from XML format into objects.
– Cn3D component: sends three-dimensional (3D) protein models to the Cn3D application for display.
– KEGG component: obtains and translates metabolic pathways received from KEGG.
– NCBI component: performs 3D protein model and document searches from NCBI.
– EBI component: performs searches for proteins and documentation in the EBI (European Bioinformatics Institute) databases.

The GeXpert data manager is tasked with administering the loading and storage of application data:

– Cache component: in charge of keeping temporary search results to improve throughput. It also implements aging of cached data.
– Application data component: performs data persistence of metabolic pathways, 3D protein models, protein searches, nucleotide searches, user configuration data and project configuration for future use.
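As an illustration of the task performed by the BLAST parser component listed above, the sketch below maps the HSP entries of an NCBI BLAST XML report onto plain Java objects using only the JDK's DOM API. The element names follow the standard NCBI BLAST XML report; the BlastHit class and the identity-fraction calculation are illustrative assumptions, not GeXpert's actual object model.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Minimal sketch of a BLAST XML parser: reads the HSPs of a report
 * (e.g. produced with -outfmt 5) into simple value objects. Assumes
 * every Hsp element carries the fields read here, as in a standard report.
 */
public class BlastXmlParser {

    /** Plain value object for one high-scoring segment pair. */
    public static class BlastHit {
        public final double eValue;
        public final double identityFraction; // identities / alignment length
        public final int gaps;

        public BlastHit(double eValue, double identityFraction, int gaps) {
            this.eValue = eValue;
            this.identityFraction = identityFraction;
            this.gaps = gaps;
        }
    }

    public static List<BlastHit> parse(File reportXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(reportXml);

        List<BlastHit> hits = new ArrayList<>();
        NodeList hsps = doc.getElementsByTagName("Hsp");
        for (int i = 0; i < hsps.getLength(); i++) {
            Element hsp = (Element) hsps.item(i);
            double eValue = Double.parseDouble(text(hsp, "Hsp_evalue"));
            double identities = Double.parseDouble(text(hsp, "Hsp_identity"));
            double alignLen = Double.parseDouble(text(hsp, "Hsp_align-len"));
            int gaps = Integer.parseInt(text(hsp, "Hsp_gaps"));
            hits.add(new BlastHit(eValue, identities / alignLen, gaps));
        }
        return hits;
    }

    /** Text content of the first child element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        for (BlastHit hit : parse(new File("tblastn_hits.xml"))) {
            System.out.printf("e-value %.2e, identity %.2f, gaps %d%n",
                    hit.eValue, hit.identityFraction, hit.gaps);
        }
    }
}

Values such as the identity fraction can then be normalized and handed to the fuzzy component for ranking.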
Fig. 2. Metabolic Pathway Viewer

Fig. 3. Protein Search Viewer

Fig. 4. Nucleotide Search Viewer

4.2 Application Implementation: GeXpert User Interface and Workflow

GeXpert is meant to be used in metabolic pathway research following a workflow similar to that of IdentiCS [30]. The workflow and some sample screenshots are given below:

1. The user must provide the sequenced genome of the organism to be studied.
2. The metabolic pathway of interest must be created (it can be based on the pathway of a similar organism). In the example in Figure 2, the glycolysis/gluconeogenesis metabolic pathway was imported from KEGG.
3. For each metabolic pathway a key enzyme must be chosen in order to start a detailed search. Each enzyme may be composed of one or more proteins (subunits). Figure 3 shows the result of the search for the protein homoprotocatechuate 2,3-dioxygenase.
4. Perform a search for amino acid sequences (proteins) in other organisms. Translate these sequences into nucleotides (using tblastn) and perform an alignment against the genome of the organism under study. If this search does not give positive results, return to the previous step and search for another protein sequence. In Figure 4, we show that the protein homoprotocatechuate 2,3-dioxygenase has been found in the organism under study (Burkholderia xenovorans LB400) on chromosome 2 (contig 481), starting at position 2225819 and ending at position 2226679. The system also shows the nucleotide sequence for this protein.
5. For a DNA sequence found, visualize the ORF map (using the ORF viewer) and identify whether there is an ORF that contains a large part of said sequence. If this is not the case, go back to the previous step and choose another sequence.
6. Verify whether the DNA sequence of the ORF found corresponds to the chosen sequence or a similar one (using blastx). If not, choose another sequence.
7. Mark the ORF of the enzyme subunit as found and start searching the surrounding ORFs for sequences capable of coding the other subunits of the enzyme or other enzymes of the metabolic pathway.
8. For the genes that were not found in the surrounding ORFs, repeat the entire process.
9. For the enzyme, the 3D protein model can be obtained and viewed if it exists (using Cn3D as integrated into the GeXpert interface).
10. The process is concluded by the generation of reports with the genes found and with associated documentation that supports the information about the proteins utilized in the metabolic pathway.

5 Conclusions

The integrated framework approach presented in this paper is an attempt to enhance the efficiency and capability of bioinformatics researchers. GeXpert is a demonstration that the integration framework can be used to implement useful bioinformatics applications. GeXpert has so far proven to be a useful tool for our researchers; it is currently being enhanced for functionality and usability improvements. The current objective of the research group is to use GeXpert to discover new metabolic pathways for bacteria [29].

In addition to our short-term goals, the following development items are planned for the future: using fuzzy logic in batch mode, web services and a web client, blastx for improved verification of the ORFs determined, a multi-user mode to enable multiple users and groups with potentially different roles to share in a common research effort, peer-to-peer communication to enable the interchange of documents (archives, search results, research items) and thus a network of collaboration in different or shared projects, and the use of intelligent/evolutionary algorithms to enable learning based on researcher feedback in GeXpert.

Acknowledgements

This research was partially funded by Fundación Andes.

References

1. Arredondo, T., Neelakanta, P.S., DeGroff, D.: Fuzzy Attributes of a DNA Complex: Development of a Fuzzy Inference Engine for Codon-'Junk' Codon Delineation. Artif. Intell. Med. 35(1-2) (2005) 87-105
2. Barker, J., Thornton, J.: Software Engineering Challenges in Bioinformatics. Proceedings of the 26th International Conference on Software Engineering, IEEE (2004)
3. Bernardi, M., Lapi, M., Leo, P., Loglisci, C.: Mining Generalized Association Rules on Biomedical Literature. In: Moonis, A., Esposito, F. (eds): Innovations in Applied Artificial Intelligence. Lect. Notes Artif. Int. 3353 (2005) 500-509
4. Brown, T.A.: Genomes. John Wiley and Sons, NY (1999)
5. Cary, M.P., Bader, G.D., Sander, C.: Pathway information for systems biology (Review Article). FEBS Lett. 579 (2005) 1815-1820
6. Claverie, J.M.: Bioinformatics for Dummies. Wiley Publishing (2003)
7. Cámara, B., Herrera, C., González, M., Couve, E., Hofer, B., Seeger, M.: From PCBs to highly toxic metabolites by the biphenyl pathway. Environ. Microbiol. 6 (2004) 842-850
8. Cohen, J.: Computer Science and Bioinformatics. Commun. ACM 48(3) (2005) 72-79
9. Costa, M., Collins, R., Anterola, A., Cochrane, F., Davin, L., Lewis, N.: An in silico assessment of gene function and organization of the phenylpropanoid pathway metabolic networks in Arabidopsis thaliana and limitations thereof. Phytochem. 64 (2003) 1097-1112
10. CO-DRIVE project: http://hci.cs.concordia.ca/www/hcse/projects/CO-DRIVE/
11. Durbin, R.: Biological Sequence Analysis. Cambridge, UK (2001)
12. Eclipse project: http://www.eclipse.org
13. Entrez NCBI Database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
14. Gardner, D.: Using genomics to help predict drug interactions. J. Biomed. Inform. 37 (2004) 139-146
15. GeXpert SourceForge page: http://sourceforge.net/projects/gexpert
16. Javahery, H., Seffah, A., Radhakrishnan, T.: Beyond Power: Making Bioinformatics Tools User-Centered. Commun. ACM 47(11) (2004)
17. Jiménez, J.I., Miñambres, B., García, J., Díaz, E.: Genomic insights in the metabolism of aromatic compounds in Pseudomonas. In: Ramos, J.L. (ed): Pseudomonas, vol. 3. NY: Kluwer Academic Publishers (2004) 425-462
18. Larman, C.: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. Prentice Hall PTR (2004)
19. Loew, L.W., Schaff, J.C.: The Virtual Cell: a software environment for computational cell biology. Trends Biotechnol. 19(10) (2001)
20. Ma, H., Zeng, A.: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19 (2003) 270-277
21. Magalhaes, J., Toussaint, O.: How bioinformatics can help reverse engineer human aging. Aging Res. Rev. 3 (2004) 125-141
22. Molidor, R., Sturn, A., Maurer, M., Trajanoski, Z.: New trends in bioinformatics: from genome sequence to personalized medicine. Exp. Gerontol. 38 (2003) 1031-1036
23. Neelakanta, P.S., Arredondo, T., Pandya, S., DeGroff, D.: Heuristics of AI-Based Search Engines for Massive Bioinformatic Data-Mining: An Example of Codon/Noncodon Delineation Search in a Binary DNA Sequence. Proceedings of IICAI (2003)
24. Papin, J.A., Price, N.D., Wiback, S.J., Fell, D.A., Palsson, B.O.: Metabolic Pathways in the Post-genome Era. Trends Biochem. Sci. 18(5) (2003)
25. Pocock, M., Down, T., Hubbard, T.: BioJava: Open Source Components for Bioinformatics. ACM SIGBIO Newsletter 20(2) (2000) 10-12
26. Rojdestvenski, I.: VRML metabolic network visualizer. Comp. Biol. Med. 33 (2003)
27. SBML: Systems Biology Markup Language. http://www.sbml.org/index.psp
28. Segal, T., Barnard, R.: Let the shoemaker make the shoes - An abstraction layer is needed between bioinformatics analysis, tools, data, and equipment: An agenda for the next 5 years. First Asia-Pacific Bioinformatics Conference, Australia (2003)
29. Seeger, M., Timmis, K.N., Hofer, B.: Bacterial pathways for degradation of polychlorinated biphenyls. Mar. Chem. 58 (1997) 327-333
30. Sun, J., Zeng, A.: IdentiCS - Identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence. BMC Bioinformatics 5:112 (2004)
