基于SVM识别的人类盒式外显子(IJEM-V1-N1-9)

合集下载

利用外显子组测序检测一个家系突变的分析方法介绍201412

Fastq文件示例>>
第二步：测序质量评估及过滤
• 评估数据产量和质量（Illumina报告示例），并根据需要去除接头污染和低质量序列，如：
– FastQC可对Illumina和ABI SOLiD测序序列质量进行快速评估（FastQC质量报告示例） – FASTX-Toolkit和Galaxy即可评估序列质量，还可去除污染碱基和低质量碱基并对序列进行质量过滤
变异注释工具比较
(Pabinger, et al. Brief in Bioinform, 2013)
实际应用中，具体运用某个特定的软件是可以根据需要调整、优化的
常用注释工具ANNOVAR
• /annovar/
• 较全面的功能注释，广为使用 • 需在本地安装注释数据库，如dbSNP、 1000genomes、SIFT、DGV等，按需灵活使用 • 可基于基因注释、基于区间注释，还可过滤 • 对于SNP和indel，结果包括基因注释、氨基酸置换预测评分、保守性预测评分、dbSNP ID、千人基因组变异频率、NHLBI-ESP 6500 个外显子测序变异频率等 • Annovar注释结果示例
– 目前是验证DNA序列突变的金标准
• 全基因组或全外显子组的第二代测序(Nextgeneration sequencing, NGS)(Illumina: 30-150bp)
– 优点：是通量高，成本较低 – 缺点：需PCR易引入误差，容易在高GC和同聚物的区域出现错误，无法对高重复区域和单倍体型或杂合子序列等这些复杂区域进行测序
可在线使用的注释工具 SeattleSeq Annotation
• /SeattleSeqAnnotation137/
• 可接受多种输入格式，如Maq、GFF、CASAVA、VCF、自定义格式、一行一基因型格式、GATK BED • 可根据NCBI 全基因注释、或CCDS（仅编码区）、或NCBI和CCDS两者兼有 • 注释的结果内容较SnpEff丰富，但不及ANNOVAR全面

基于深度学习框架的循环染色体异常细胞识别

微创、敏感、经济的肿瘤早期诊断方法。荧光原位杂交（ｆｌｕｏｒｅｓｃｅｎｃｅｉｎｓｉｔｕｈｙｂｒｉｄｉｚａｔｉｏｎ，ＦＩＳＨ）通过计算
荧光探针在细胞核内产生的增益，可以准确地检测ＣＡＣ中基因异常的繁殖。然而，基于ＦＩＳＨ的ＣＡＣ
识别存在细胞核重叠和荧光信号形态多样的问题，人工检测荧光信号是困难的。方法本研究基于四色
ｔｈｒｏｕｇｈｈｅａｔｍａｐｒｅｇｒｅｓｓｉｏｎａｎｄｃｕｓｔｏｍｉｚｉｎｇａｌｉｇｈｔｗｅｉｇｈｔｔａｒｇｅｔｄｅｔｅｃｔｉｏｎｎｅｔｗｏｒｋ．ＲｅｓｕｌｔｓＴｈｅＣＡＣＮＥＴ
ａｃｈｉｅｖｅ９２６７％ｏｎｉｎｔｅｒｓｅｃｔｉｏｎｏｖｅｒｕｎｉｏｎ（ＩｏＵ）ｉｎｎｕｃｌｅｕｓｓｅｇｍｅｎｔａｔｉｏｎ，ａｎｄｔｈｅａｖｅｒａｇｅｐｒｅｃｉｓｉｏｎ（ＡＰ）ｏｆ
ｔｈｅｐｒｏｂｌｅｍｓｏｆｏｖｅｒｌａｐｐｉｎｇｃｅｌｌｎｕｃｌｅｉａｎｄｄｉｖｅｒｓｅｆｌｕｏｒｅｓｃｅｎｔｓｉｇｎａｌｍｏｒｐｈｏｌｏｇｙ，ａｎｄｍａｎｕａｌｄｅｔｅｃｔｉｏｎｏｆ
· ５５２·
北京生物医学工程第４２卷
ＣｏｍｍｕｎｉｃａｔｉｏｎｓＴｅｃｈｎｏｌｏｇｙ，Ｂｅｉｊｉｎｇ１００１９１；
２ＺｈｕｈａｉＳａｎｍｅｄＢＩＯＴＥＣＨＬｔｄ，Ｚｈｕｈａｉ，ＧｕａｎｇｄｏｎｇＰｒｏｖｉｎｃｅ５１９０６０
Ｃｏｒｒｅｓｐｏｎｄｉｎｇａｕｔｈｏｒ：ＷＵＴｏｎｇｎｉｎｇ（Ｅ⁃ｍａｉｌ：ｗｕｔｏｎｇｎｉｎｇ＠ｃａｉｃｔａｃｃｎ）
ｅｎｈａｎｃｅｓｔｈｅｏｖｅｒｌａｐｐｉｎｇｃｅｌｌｎｕｃｌｅｉｓｅｇｍｅｎｔａｔｉｏｎｔｈｒｏｕｇｈｃｏｍｂｉｎｉｎｇａｔｔｅｎｔｉｏｎｍｅｃｈａｎｉｓｍａｎｄｅｄｇｅｃｏｎｓｔｒａｉｎｔ

C.parvum全基因组序列

DOI: 10.1126/science.1094786, 441 (2004);304Science et al.Mitchell S. Abrahamsen,Cryptosporidium parvum Complete Genome Sequence of the Apicomplexan, (this information is current as of October 7, 2009 ):The following resources related to this article are available online at/cgi/content/full/304/5669/441version of this article at:including high-resolution figures, can be found in the online Updated information and services,/cgi/content/full/1094786/DC1 can be found at:Supporting Online Material/cgi/content/full/304/5669/441#otherarticles , 9 of which can be accessed for free: cites 25 articles This article 239 article(s) on the ISI Web of Science. cited by This article has been /cgi/content/full/304/5669/441#otherarticles 53 articles hosted by HighWire Press; see: cited by This article has been/cgi/collection/genetics Genetics: subject collections This article appears in the following/about/permissions.dtl in whole or in part can be found at: this article permission to reproduce of this article or about obtaining reprints Information about obtaining registered trademark of AAAS.is a Science 2004 by the American Association for the Advancement of Science; all rights reserved. The title Copyright American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the Science o n O c t o b e r 7, 2009w w w .s c i e n c e m a g .o r g D o w n l o a d e d f r o m3.R.Jackendoff,Foundations of Language:Brain,Gram-mar,Evolution(Oxford Univ.Press,Oxford,2003).4.Although for Frege(1),reference was established rela-tive to objects in the world,here we follow Jackendoff’s suggestion(3)that this is done relative to objects and the state of affairs as mentally represented.5.S.Zola-Morgan,L.R.Squire,in The Development andNeural Bases of Higher Cognitive Functions(New York Academy of Sciences,New York,1990),pp.434–456.6.N.Chomsky,Reﬂections on Language(Pantheon,New York,1975).7.J.Katz,Semantic Theory(Harper&Row,New York,1972).8.D.Sperber,D.Wilson,Relevance(Harvard Univ.Press,Cambridge,MA,1986).9.K.I.Forster,in Sentence Processing,W.E.Cooper,C.T.Walker,Eds.(Erlbaum,Hillsdale,NJ,1989),pp.27–85.10.H.H.Clark,Using Language(Cambridge Univ.Press,Cambridge,1996).11.Often word meanings can only be fully determined byinvokingworld knowledg e.For instance,the meaningof “ﬂat”in a“ﬂat road”implies the absence of holes.However,in the expression“aﬂat tire,”it indicates the presence of a hole.The meaningof“ﬁnish”in the phrase “Billﬁnished the book”implies that Bill completed readingthe book.However,the phrase“the g oatﬁn-ished the book”can only be interpreted as the goat eatingor destroyingthe book.The examples illustrate that word meaningis often underdetermined and nec-essarily intertwined with general world knowledge.In such cases,it is hard to see how the integration of lexical meaning and general world knowledge could be strictly separated(3,31).12.W.Marslen-Wilson,C.M.Brown,L.K.Tyler,Lang.Cognit.Process.3,1(1988).13.ERPs for30subjects were averaged time-locked to theonset of the critical words,with40items per condition.Sentences were presented word by word on the centerof a computer screen,with a stimulus onset asynchronyof600ms.While subjects were readingthe sentences,their EEG was recorded and ampliﬁed with a high-cut-off frequency of70Hz,a time constant of8s,and asamplingfrequency of200Hz.14.Materials and methods are available as supportingmaterial on Science Online.15.M.Kutas,S.A.Hillyard,Science207,203(1980).16.C.Brown,P.Hagoort,J.Cognit.Neurosci.5,34(1993).17.C.M.Brown,P.Hagoort,in Architectures and Mech-anisms for Language Processing,M.W.Crocker,M.Pickering,C.Clifton Jr.,Eds.(Cambridge Univ.Press,Cambridge,1999),pp.213–237.18.F.Varela et al.,Nature Rev.Neurosci.2,229(2001).19.We obtained TFRs of the single-trial EEG data by con-volvingcomplex Morlet wavelets with the EEG data andcomputingthe squared norm for the result of theconvolution.We used wavelets with a7-cycle width,with frequencies ranging from1to70Hz,in1-Hz steps.Power values thus obtained were expressed as a per-centage change relative to the power in a baselineinterval,which was taken from150to0ms before theonset of the critical word.This was done in order tonormalize for individual differences in EEG power anddifferences in baseline power between different fre-quency bands.Two relevant time-frequency compo-nents were identiﬁed:(i)a theta component,rangingfrom4to7Hz and from300to800ms after wordonset,and(ii)a gamma component,ranging from35to45Hz and from400to600ms after word onset.20.C.Tallon-Baudry,O.Bertrand,Trends Cognit.Sci.3,151(1999).tner et al.,Nature397,434(1999).22.M.Bastiaansen,P.Hagoort,Cortex39(2003).23.O.Jensen,C.D.Tesche,Eur.J.Neurosci.15,1395(2002).24.Whole brain T2*-weighted echo planar imaging bloodoxygen level–dependent(EPI-BOLD)fMRI data wereacquired with a Siemens Sonata1.5-T magnetic reso-nance scanner with interleaved slice ordering,a volumerepetition time of2.48s,an echo time of40ms,a90°ﬂip angle,31horizontal slices,a64ϫ64slice matrix,and isotropic voxel size of3.5ϫ3.5ϫ3.5mm.For thestructural magnetic resonance image,we used a high-resolution(isotropic voxels of1mm3)T1-weightedmagnetization-prepared rapid gradient-echo pulse se-quence.The fMRI data were preprocessed and analyzedby statistical parametric mappingwith SPM99software(http://www.ﬁ/spm99).25.S.E.Petersen et al.,Nature331,585(1988).26.B.T.Gold,R.L.Buckner,Neuron35,803(2002).27.E.Halgren et al.,J.Psychophysiol.88,1(1994).28.E.Halgren et al.,Neuroimage17,1101(2002).29.M.K.Tanenhaus et al.,Science268,1632(1995).30.J.J.A.van Berkum et al.,J.Cognit.Neurosci.11,657(1999).31.P.A.M.Seuren,Discourse Semantics(Basil Blackwell,Oxford,1985).32.We thank P.Indefrey,P.Fries,P.A.M.Seuren,and M.van Turennout for helpful discussions.Supported bythe Netherlands Organization for Scientiﬁc Research,grant no.400-56-384(P.H.).Supporting Online Material/cgi/content/full/1095455/DC1Materials and MethodsFig.S1References and Notes8January2004;accepted9March2004Published online18March2004;10.1126/science.1095455Include this information when citingthis paper.Complete Genome Sequence ofthe Apicomplexan,Cryptosporidium parvumMitchell S.Abrahamsen,1,2*†Thomas J.Templeton,3†Shinichiro Enomoto,1Juan E.Abrahante,1Guan Zhu,4 Cheryl ncto,1Mingqi Deng,1Chang Liu,1‡Giovanni Widmer,5Saul Tzipori,5GregoryA.Buck,6Ping Xu,6 Alan T.Bankier,7Paul H.Dear,7Bernard A.Konfortov,7 Helen F.Spriggs,7Lakshminarayan Iyer,8Vivek Anantharaman,8L.Aravind,8Vivek Kapur2,9The apicomplexan Cryptosporidium parvum is an intestinal parasite that affects healthy humans and animals,and causes an unrelenting infection in immuno-compromised individuals such as AIDS patients.We report the complete ge-nome sequence of C.parvum,type II isolate.Genome analysis identiﬁes ex-tremely streamlined metabolic pathways and a reliance on the host for nu-trients.In contrast to Plasmodium and Toxoplasma,the parasite lacks an api-coplast and its genome,and possesses a degenerate mitochondrion that has lost its genome.Several novel classes of cell-surface and secreted proteins with a potential role in host interactions and pathogenesis were also detected.Elu-cidation of the core metabolism,including enzymes with high similarities to bacterial and plant counterparts,opens new avenues for drug development.Cryptosporidium parvum is a globally impor-tant intracellular pathogen of humans and animals.The duration of infection and patho-genesis of cryptosporidiosis depends on host immune status,ranging from a severe but self-limiting diarrhea in immunocompetent individuals to a life-threatening,prolonged infection in immunocompromised patients.Asubstantial degree of morbidity and mortalityis associated with infections in AIDS pa-tients.Despite intensive efforts over the past20years,there is currently no effective ther-apy for treating or preventing C.parvuminfection in humans.Cryptosporidium belongs to the phylumApicomplexa,whose members share a com-mon apical secretory apparatus mediating lo-comotion and tissue or cellular invasion.Many apicomplexans are of medical or vet-erinary importance,including Plasmodium,Babesia,Toxoplasma,Neosprora,Sarcocys-tis,Cyclospora,and Eimeria.The life cycle ofC.parvum is similar to that of other cyst-forming apicomplexans(e.g.,Eimeria and Tox-oplasma),resulting in the formation of oocysts1Department of Veterinary and Biomedical Science,College of Veterinary Medicine,2Biomedical Genom-ics Center,University of Minnesota,St.Paul,MN55108,USA.3Department of Microbiology and Immu-nology,Weill Medical College and Program in Immu-nology,Weill Graduate School of Medical Sciences ofCornell University,New York,NY10021,USA.4De-partment of Veterinary Pathobiology,College of Vet-erinary Medicine,Texas A&M University,College Sta-tion,TX77843,USA.5Division of Infectious Diseases,Tufts University School of Veterinary Medicine,NorthGrafton,MA01536,USA.6Center for the Study ofBiological Complexity and Department of Microbiol-ogy and Immunology,Virginia Commonwealth Uni-versity,Richmond,VA23198,USA.7MRC Laboratoryof Molecular Biology,Hills Road,Cambridge CB22QH,UK.8National Center for Biotechnology Infor-mation,National Library of Medicine,National Insti-tutes of Health,Bethesda,MD20894,USA.9Depart-ment of Microbiology,University of Minnesota,Min-neapolis,MN55455,USA.*To whom correspondence should be addressed.E-mail:abe@†These authors contributed equally to this work.‡Present address:Bioinformatics Division,Genetic Re-search,GlaxoSmithKline Pharmaceuticals,5MooreDrive,Research Triangle Park,NC27009,USA.R E P O R T S SCIENCE VOL30416APRIL2004441o n O c t o b e r 7 , 2 0 0 9 w w w . s c i e n c e m a g . o r g D o w n l o a d e d f r o mthat are shed in the feces of infected hosts.C.parvum oocysts are highly resistant to environ-mental stresses,including chlorine treatment of community water supplies;hence,the parasite is an important water-and food-borne pathogen (1).The obligate intracellular nature of the par-asite ’s life cycle and the inability to culture the parasite continuously in vitro greatly impair researchers ’ability to obtain purified samples of the different developmental stages.The par-asite cannot be genetically manipulated,and transformation methodologies are currently un-available.To begin to address these limitations,we have obtained the complete C.parvum ge-nome sequence and its predicted protein com-plement.(This whole-genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accession AAEE00000000.The version described in this paper is the first version,AAEE01000000.)The random shotgun approach was used to obtain the complete DNA sequence (2)of the Iowa “type II ”isolate of C.parvum .This isolate readily transmits disease among numerous mammals,including humans.The resulting ge-nome sequence has roughly 13ϫgenome cov-erage containing five gaps and 9.1Mb of totalDNA sequence within eight chromosomes.The C.parvum genome is thus quite compact rela-tive to the 23-Mb,14-chromosome genome of Plasmodium falciparum (3);this size difference is predominantly the result of shorter intergenic regions,fewer introns,and a smaller number of genes (Table 1).Comparison of the assembled sequence of chromosome VI to that of the recently published sequence of chromosome VI (4)revealed that our assembly contains an ad-ditional 160kb of sequence and a single gap versus two,with the common sequences dis-playing a 99.993%sequence identity (2).The relative paucity of introns greatly simplified gene predictions and facilitated an-notation (2)of predicted open reading frames (ORFs).These analyses provided an estimate of 3807protein-encoding genes for the C.parvum genome,far fewer than the estimated 5300genes predicted for the Plasmodium genome (3).This difference is primarily due to the absence of an apicoplast and mitochondrial genome,as well as the pres-ence of fewer genes encoding metabolic functions and variant surface proteins,such as the P.falciparum var and rifin molecules (Table 2).An analysis of the encoded pro-tein sequences with the program SEG (5)shows that these protein-encoding genes are not enriched in low-complexity se-quences (34%)to the extent observed in the proteins from Plasmodium (70%).Our sequence analysis indicates that Cryptosporidium ,unlike Plasmodium and Toxoplasma ,lacks both mitochondrion and apicoplast genomes.The overall complete-ness of the genome sequence,together with the fact that similar DNA extraction proce-dures used to isolate total genomic DNA from C.parvum efficiently yielded mito-chondrion and apicoplast genomes from Ei-meria sp.and Toxoplasma (6,7),indicates that the absence of organellar genomes was unlikely to have been the result of method-ological error.These conclusions are con-sistent with the absence of nuclear genes for the DNA replication and translation machinery characteristic of mitochondria and apicoplasts,and with the lack of mito-chondrial or apicoplast targeting signals for tRNA synthetases.A number of putative mitochondrial pro-teins were identified,including components of a mitochondrial protein import apparatus,chaperones,uncoupling proteins,and solute translocators (table S1).However,the ge-nome does not encode any Krebs cycle en-zymes,nor the components constituting the mitochondrial complexes I to IV;this finding indicates that the parasite does not rely on complete oxidation and respiratory chains for synthesizing adenosine triphosphate (ATP).Similar to Plasmodium ,no orthologs for the ␥,␦,or εsubunits or the c subunit of the F 0proton channel were detected (whereas all subunits were found for a V-type ATPase).Cryptosporidium ,like Eimeria (8)and Plas-modium ,possesses a pyridine nucleotide tran-shydrogenase integral membrane protein that may couple reduced nicotinamide adenine dinucleotide (NADH)and reduced nico-tinamide adenine dinucleotide phosphate (NADPH)redox to proton translocation across the inner mitochondrial membrane.Unlike Plasmodium ,the parasite has two copies of the pyridine nucleotide transhydrogenase gene.Also present is a likely mitochondrial membrane –associated,cyanide-resistant alter-native oxidase (AOX )that catalyzes the reduction of molecular oxygen by ubiquinol to produce H 2O,but not superoxide or H 2O 2.Several genes were identified as involved in biogenesis of iron-sulfur [Fe-S]complexes with potential mitochondrial targeting signals (e.g.,nifS,nifU,frataxin,and ferredoxin),supporting the presence of a limited electron flux in the mitochondrial remnant (table S2).Our sequence analysis confirms the absence of a plastid genome (7)and,additionally,the loss of plastid-associated metabolic pathways including the type II fatty acid synthases (FASs)and isoprenoid synthetic enzymes thatTable 1.General features of the C.parvum genome and comparison with other single-celled eukaryotes.Values are derived from respective genome project summaries (3,26–28).ND,not determined.FeatureC.parvum P.falciparum S.pombe S.cerevisiae E.cuniculiSize (Mbp)9.122.912.512.5 2.5(G ϩC)content (%)3019.43638.347No.of genes 38075268492957701997Mean gene length (bp)excluding introns 1795228314261424ND Gene density (bp per gene)23824338252820881256Percent coding75.352.657.570.590Genes with introns (%)553.9435ND Intergenic regions (G ϩC)content %23.913.632.435.145Mean length (bp)5661694952515129RNAsNo.of tRNA genes 454317429944No.of 5S rRNA genes 6330100–2003No.of 5.8S ,18S ,and 28S rRNA units 57200–400100–20022Table parison between predicted C.parvum and P.falciparum proteins.FeatureC.parvum P.falciparum *Common †Total predicted proteins380752681883Mitochondrial targeted/encoded 17(0.45%)246(4.7%)15Apicoplast targeted/encoded 0581(11.0%)0var/rif/stevor ‡0236(4.5%)0Annotated as protease §50(1.3%)31(0.59%)27Annotated as transporter ࿣69(1.8%)34(0.65%)34Assigned EC function ¶167(4.4%)389(7.4%)113Hypothetical proteins925(24.3%)3208(60.9%)126*Values indicated for P.falciparum are as reported (3)with the exception of those for proteins annotated as protease or transporter.†TBLASTN hits (e Ͻ–5)between C.parvum and P.falciparum .‡As reported in (3).§Pre-dicted proteins annotated as “protease or peptidase”for C.parvum (CryptoGenome database,)and P.falciparum (PlasmoDB database,).࿣Predicted proteins annotated as “trans-porter,permease of P-type ATPase”for C.parvum (CryptoGenome)and P.falciparum (PlasmoDB).¶Bidirectional BLAST hit (e Ͻ–15)to orthologs with assigned Enzyme Commission (EC)numbers.Does not include EC assignment numbers for protein kinases or protein phosphatases (due to inconsistent annotation across genomes),or DNA polymerases or RNA polymerases,as a result of issues related to subunit inclusion.(For consistency,46proteins were excluded from the reported P.falciparum values.)R E P O R T S16APRIL 2004VOL 304SCIENCE 442 o n O c t o b e r 7, 2009w w w .s c i e n c e m a g .o r g D o w n l o a d e d f r o mare otherwise localized to the plastid in other apicomplexans.C.parvum fatty acid biosynthe-sis appears to be cytoplasmic,conducted by a large(8252amino acids)modular type I FAS (9)and possibly by another large enzyme that is related to the multidomain bacterial polyketide synthase(10).Comprehensive screening of the C.parvum genome sequence also did not detect orthologs of Plasmodium nuclear-encoded genes that contain apicoplast-targeting and transit sequences(11).C.parvum metabolism is greatly stream-lined relative to that of Plasmodium,and in certain ways it is reminiscent of that of another obligate eukaryotic parasite,the microsporidian Encephalitozoon.The degeneration of the mi-tochondrion and associated metabolic capabili-ties suggests that the parasite largely relies on glycolysis for energy production.The parasite is capable of uptake and catabolism of mono-sugars(e.g.,glucose and fructose)as well as synthesis,storage,and catabolism of polysac-charides such as trehalose and amylopectin. Like many anaerobic organisms,it economizes ATP through the use of pyrophosphate-dependent phosphofructokinases.The conver-sion of pyruvate to acetyl–coenzyme A(CoA) is catalyzed by an atypical pyruvate-NADPH oxidoreductase(Cp PNO)that contains an N-terminal pyruvate–ferredoxin oxidoreductase (PFO)domain fused with a C-terminal NADPH–cytochrome P450reductase domain (CPR).Such a PFO-CPR fusion has previously been observed only in the euglenozoan protist Euglena gracilis(12).Acetyl-CoA can be con-verted to malonyl-CoA,an important precursor for fatty acid and polyketide biosynthesis.Gly-colysis leads to several possible organic end products,including lactate,acetate,and ethanol. The production of acetate from acetyl-CoA may be economically beneficial to the parasite via coupling with ATP production.Ethanol is potentially produced via two in-dependent pathways:(i)from the combination of pyruvate decarboxylase and alcohol dehy-drogenase,or(ii)from acetyl-CoA by means of a bifunctional dehydrogenase(adhE)with ac-etaldehyde and alcohol dehydrogenase activi-ties;adhE first converts acetyl-CoA to acetal-dehyde and then reduces the latter to ethanol. AdhE predominantly occurs in bacteria but has recently been identified in several protozoans, including vertebrate gut parasites such as Enta-moeba and Giardia(13,14).Adjacent to the adhE gene resides a second gene encoding only the AdhE C-terminal Fe-dependent alcohol de-hydrogenase domain.This gene product may form a multisubunit complex with AdhE,or it may function as an alternative alcohol dehydro-genase that is specific to certain growth condi-tions.C.parvum has a glycerol3-phosphate dehydrogenase similar to those of plants,fungi, and the kinetoplastid Trypanosoma,but(unlike trypanosomes)the parasite lacks an ortholog of glycerol kinase and thus this pathway does not yield glycerol production.In addition to themodular fatty acid synthase(Cp FAS1)andpolyketide synthase homolog(Cp PKS1), C.parvum possesses several fatty acyl–CoA syn-thases and a fatty acyl elongase that may partici-pate in fatty acid metabolism.Further,enzymesfor the metabolism of complex lipids(e.g.,glyc-erolipid and inositol phosphate)were identified inthe genome.Fatty acids are apparently not anenergy source,because enzymes of the fatty acidoxidative pathway are absent,with the exceptionof a3-hydroxyacyl-CoA dehydrogenase.C.parvum purine metabolism is greatlysimplified,retaining only an adenosine ki-nase and enzymes catalyzing conversionsof adenosine5Ј-monophosphate(AMP)toinosine,xanthosine,and guanosine5Ј-monophosphates(IMP,XMP,and GMP).Among these enzymes,IMP dehydrogenase(IMPDH)is phylogenetically related toε-proteobacterial IMPDH and is strikinglydifferent from its counterparts in both thehost and other apicomplexans(15).In con-trast to other apicomplexans such as Toxo-plasma gondii and P.falciparum,no geneencoding hypoxanthine-xanthineguaninephosphoribosyltransferase(HXGPRT)is de-tected,in contrast to a previous report on theactivity of this enzyme in C.parvum sporo-zoites(16).The absence of HXGPRT sug-gests that the parasite may rely solely on asingle enzyme system including IMPDH toproduce GMP from AMP.In contrast to otherapicomplexans,the parasite appears to relyon adenosine for purine salvage,a modelsupported by the identification of an adeno-sine transporter.Unlike other apicomplexansand many parasitic protists that can synthe-size pyrimidines de novo,C.parvum relies onpyrimidine salvage and retains the ability forinterconversions among uridine and cytidine5Ј-monophosphates(UMP and CMP),theirdeoxy forms(dUMP and dCMP),and dAMP,as well as their corresponding di-and triphos-phonucleotides.The parasite has also largelyshed the ability to synthesize amino acids denovo,although it retains the ability to convertselect amino acids,and instead appears torely on amino acid uptake from the host bymeans of a set of at least11amino acidtransporters(table S2).Most of the Cryptosporidium core pro-cesses involved in DNA replication,repair,transcription,and translation conform to thebasic eukaryotic blueprint(2).The transcrip-tional apparatus resembles Plasmodium interms of basal transcription machinery.How-ever,a striking numerical difference is seenin the complements of two RNA bindingdomains,Sm and RRM,between P.falcipa-rum(17and71domains,respectively)and C.parvum(9and51domains).This reductionresults in part from the loss of conservedproteins belonging to the spliceosomal ma-chinery,including all genes encoding Smdomain proteins belonging to the U6spliceo-somal particle,which suggests that this par-ticle activity is degenerate or entirely lost.This reduction in spliceosomal machinery isconsistent with the reduced number of pre-dicted introns in Cryptosporidium(5%)rela-tive to Plasmodium(Ͼ50%).In addition,keycomponents of the small RNA–mediatedposttranscriptional gene silencing system aremissing,such as the RNA-dependent RNApolymerase,Argonaute,and Dicer orthologs;hence,RNA interference–related technolo-gies are unlikely to be of much value intargeted disruption of genes in C.parvum.Cryptosporidium invasion of columnarbrush border epithelial cells has been de-scribed as“intracellular,but extracytoplas-mic,”as the parasite resides on the surface ofthe intestinal epithelium but lies underneaththe host cell membrane.This niche may al-low the parasite to evade immune surveil-lance but take advantage of solute transportacross the host microvillus membrane or theextensively convoluted parasitophorous vac-uole.Indeed,Cryptosporidium has numerousgenes(table S2)encoding families of putativesugar transporters(up to9genes)and aminoacid transporters(11genes).This is in starkcontrast to Plasmodium,which has fewersugar transporters and only one putative ami-no acid transporter(GenBank identificationnumber23612372).As a first step toward identification ofmulti–drug-resistant pumps,the genome se-quence was analyzed for all occurrences ofgenes encoding multitransmembrane proteins.Notable are a set of four paralogous proteinsthat belong to the sbmA family(table S2)thatare involved in the transport of peptide antibi-otics in bacteria.A putative ortholog of thePlasmodium chloroquine resistance–linkedgene Pf CRT(17)was also identified,althoughthe parasite does not possess a food vacuole likethe one seen in Plasmodium.Unlike Plasmodium,C.parvum does notpossess extensive subtelomeric clusters of anti-genically variant proteins(exemplified by thelarge families of var and rif/stevor genes)thatare involved in immune evasion.In contrast,more than20genes were identified that encodemucin-like proteins(18,19)having hallmarksof extensive Thr or Ser stretches suggestive ofglycosylation and signal peptide sequences sug-gesting secretion(table S2).One notable exam-ple is an11,700–amino acid protein with anuninterrupted stretch of308Thr residues(cgd3_720).Although large families of secretedproteins analogous to the Plasmodium multi-gene families were not found,several smallermultigene clusters were observed that encodepredicted secreted proteins,with no detectablesimilarity to proteins from other organisms(Fig.1,A and B).Within this group,at leastfour distinct families appear to have emergedthrough gene expansions specific to the Cryp-R E P O R T S SCIENCE VOL30416APRIL2004443o n O c t o b e r 7 , 2 0 0 9 w w w . s c i e n c e m a g . o r g D o w n l o a d e d f r o mtosporidium clade.These families —SKSR,MEDLE,WYLE,FGLN,and GGC —were named after well-conserved sequence motifs (table S2).Reverse transcription polymerase chain reaction (RT-PCR)expression analysis (20)of one cluster,a locus of seven adjacent CpLSP genes (Fig.1B),shows coexpression during the course of in vitro development (Fig.1C).An additional eight genes were identified that encode proteins having a periodic cysteine structure similar to the Cryptosporidium oocyst wall protein;these eight genes are similarly expressed during the onset of oocyst formation and likely participate in the formation of the coccidian rigid oocyst wall in both Cryptospo-ridium and Toxoplasma (21).Whereas the extracellular proteins described above are of apparent apicomplexan or lineage-specific in-vention,Cryptosporidium possesses many genesencodingsecretedproteinshavinglineage-specific multidomain architectures composed of animal-and bacterial-like extracellular adhe-sive domains (fig.S1).Lineage-specific expansions were ob-served for several proteases (table S2),in-cluding an aspartyl protease (six genes),a subtilisin-like protease,a cryptopain-like cys-teine protease (five genes),and a Plas-modium falcilysin-like (insulin degrading enzyme –like)protease (19genes).Nine of the Cryptosporidium falcilysin genes lack the Zn-chelating “HXXEH ”active site motif and are likely to be catalytically inactive copies that may have been reused for specific protein-protein interactions on the cell sur-face.In contrast to the Plasmodium falcilysin,the Cryptosporidium genes possess signal peptide sequences and are likely trafficked to a secretory pathway.The expansion of this family suggests either that the proteins have distinct cleavage specificities or that their diversity may be related to evasion of a host immune response.Completion of the C.parvum genome se-quence has highlighted the lack of conven-tional drug targets currently pursued for the control and treatment of other parasitic protists.On the basis of molecular and bio-chemical studies and drug screening of other apicomplexans,several putative Cryptospo-ridium metabolic pathways or enzymes have been erroneously proposed to be potential drug targets (22),including the apicoplast and its associated metabolic pathways,the shikimate pathway,the mannitol cycle,the electron transport chain,and HXGPRT.Nonetheless,complete genome sequence analysis identifies a number of classic and novel molecular candidates for drug explora-tion,including numerous plant-like and bacterial-like enzymes (tables S3and S4).Although the C.parvum genome lacks HXGPRT,a potent drug target in other api-complexans,it has only the single pathway dependent on IMPDH to convert AMP to GMP.The bacterial-type IMPDH may be a promising target because it differs substan-tially from that of eukaryotic enzymes (15).Because of the lack of de novo biosynthetic capacity for purines,pyrimidines,and amino acids,C.parvum relies solely on scavenge from the host via a series of transporters,which may be exploited for chemotherapy.C.parvum possesses a bacterial-type thymidine kinase,and the role of this enzyme in pyrim-idine metabolism and its drug target candida-cy should be pursued.The presence of an alternative oxidase,likely targeted to the remnant mitochondrion,gives promise to the study of salicylhydroxamic acid (SHAM),as-cofuranone,and their analogs as inhibitors of energy metabolism in the parasite (23).Cryptosporidium possesses at least 15“plant-like ”enzymes that are either absent in or highly divergent from those typically found in mammals (table S3).Within the glycolytic pathway,the plant-like PPi-PFK has been shown to be a potential target in other parasites including T.gondii ,and PEPCL and PGI ap-pear to be plant-type enzymes in C.parvum .Another example is a trehalose-6-phosphate synthase/phosphatase catalyzing trehalose bio-synthesis from glucose-6-phosphate and uridine diphosphate –glucose.Trehalose may serve as a sugar storage source or may function as an antidesiccant,antioxidant,or protein stability agent in oocysts,playing a role similar to that of mannitol in Eimeria oocysts (24).Orthologs of putative Eimeria mannitol synthesis enzymes were not found.However,two oxidoreductases (table S2)were identified in C.parvum ,one of which belongs to the same families as the plant mannose dehydrogenases (25)and the other to the plant cinnamyl alcohol dehydrogenases.In principle,these enzymes could synthesize protective polyol compounds,and the former enzyme could use host-derived mannose to syn-thesize mannitol.References and Notes1.D.G.Korich et al .,Appl.Environ.Microbiol.56,1423(1990).2.See supportingdata on Science Online.3.M.J.Gardner et al .,Nature 419,498(2002).4.A.T.Bankier et al .,Genome Res.13,1787(2003).5.J.C.Wootton,Comput.Chem.18,269(1994).Fig.1.(A )Schematic showing the chromosomal locations of clusters of potentially secreted proteins.Numbers of adjacent genes are indicated in paren-theses.Arrows indicate direc-tion of clusters containinguni-directional genes (encoded on the same strand);squares indi-cate clusters containingg enes encoded on both strands.Non-paralogous genes are indicated by solid gray squares or direc-tional triangles;SKSR (green triangles),FGLN (red trian-gles),and MEDLE (blue trian-gles)indicate three C.parvum –speciﬁc families of paralogous genes predominantly located at telomeres.Insl (yellow tri-angles)indicates an insulinase/falcilysin-like paralogous gene family.Cp LSP (white square)indicates the location of a clus-ter of adjacent large secreted proteins (table S2)that are cotranscriptionally regulated.Identiﬁed anchored telomeric repeat sequences are indicated by circles.(B )Schematic show-inga select locus containinga cluster of coexpressed large secreted proteins (Cp LSP).Genes and intergenic regions (regions between identiﬁed genes)are drawn to scale at the nucleotide level.The length of the intergenic re-gions is indicated above or be-low the locus.(C )Relative ex-pression levels of CpLSP (red lines)and,as a control,C.parvum Hedgehog-type HINT domain gene (blue line)duringin vitro development,as determined by semiquantitative RT-PCR usingg ene-speciﬁc primers correspondingto the seven adjacent g enes within the CpLSP locus as shown in (B).Expression levels from three independent time-course experiments are represented as the ratio of the expression of each gene to that of C.parvum 18S rRNA present in each of the infected samples (20).R E P O R T S16APRIL 2004VOL 304SCIENCE 444 o n O c t o b e r 7, 2009w w w .s c i e n c e m a g .o r g D o w n l o a d e d f r o m。

数据挖掘_概念与技术(第三版)部分习题答案汇总

1.4 数据仓库和数据库有何不同？有哪些相似之处？答：区别：数据仓库是面向主题的，集成的，不易更改且随时间变化的数据集合，用来支持管理人员的决策，数据库由一组内部相关的数据和一组管理和存取数据的软件程序组成，是面向操作型的数据库，是组成数据仓库的源数据。

它用表组织数据，采用ER数据模型。

相似：它们都为数据挖掘提供了源数据，都是数据的组合。

1.3 定义下列数据挖掘功能：特征化、区分、关联和相关分析、预测聚类和演变分析。

使用你熟悉的现实生活的数据库，给出每种数据挖掘功能的例子。

答：特征化是一个目标类数据的一般特性或特性的汇总。

例如，学生的特征可被提出，形成所有大学的计算机科学专业一年级学生的轮廓，这些特征包括作为一种高的年级平均成绩(GPA：Grade point aversge)的信息，还有所修的课程的最大数量。

区分是将目标类数据对象的一般特性与一个或多个对比类对象的一般特性进行比较。

例如，具有高GPA 的学生的一般特性可被用来与具有低GPA 的一般特性比较。

最终的描述可能是学生的一个一般可比较的轮廓，就像具有高GPA 的学生的75%是四年级计算机科学专业的学生，而具有低GPA 的学生的65%不是。

关联是指发现关联规则，这些规则表示一起频繁发生在给定数据集的特征值的条件。

例如，一个数据挖掘系统可能发现的关联规则为：major(X, “computing science”) ⇒ owns(X, “personal computer”)[support=12%, confidence=98%] 其中，X 是一个表示学生的变量。

这个规则指出正在学习的学生，12%（支持度）主修计算机科学并且拥有一台个人计算机。

这个组一个学生拥有一台个人电脑的概率是98%（置信度，或确定度）。

分类与预测不同，因为前者的作用是构造一系列能描述和区分数据类型或概念的模型（或功能），而后者是建立一个模型去预测缺失的或无效的、并且通常是数字的数据值。

基因组结构性变异检测的方法及DNA-seq（转载）

基因组结构性变异检测的⽅法及DNA-seq（转载）⼈类基因组中的变异和⼈类的演化、疾病风险等⽅⾯都有着密切的联系。

当前⼆代短读长⾼通量测序技术（NGS），虽然能够让测序成本⼤⼤降低，但这种短读长的测序⽅法也给基因组的变异检测（特别是结构性变异检测）带来了不⼩的挑战。

SNP和Indel⼤家应该都见得⽐较多了，因此在这篇⽂章⾥我将主要讨论常见结构性变异的检测⽅法和有关软件以及它们的⼀些优缺点。

变异的分类在开始之前，有必要先梳理⼀下⼈类基因组上的变异种类，按照⽬前业界的看法可以分为如下三个⼤类：单碱基变异，即单核苷酸多态性(SNP)，最常见也最简单的⼀种基因组变异形式；很短的Insertion 和 Deletion，也常被我们合并起来称为Indel。

主要指在基因组某个位置上发⽣较短长度的线性⽚段插⼊或者删除的现象。

强调线性的原因是，这⾥的插⼊和删除是有前后顺序的与下述的结构性变异不同。

Indel长度通常在50bp以下，更多时候甚⾄是不超过10bp，这个长度范围内的序列变化可以通过Smith-Waterman 的局部⽐对算法来准确获得，并且也能够在⽬前短读长的测序数据中较好地检测出来；基因组结构性变异（Structure Variantions，简称SVs），这篇⽂章的重点，通常就是指基因组上⼤长度的序列变化和位置关系变化。

类型很多，包括长度在50bp以上的长⽚段序列插⼊或者删除（Big Indel）、串联重复（Tandem repeate）、染⾊体倒位（Inversion）、染⾊体内部或染⾊体之间的序列易位（Translocation）、拷贝数变异（CNV）以及形式更为复杂的嵌合性变异。

图1. 结构性变异的不同种类值得⼀提的是，研究⼈员对基因组的结构性变异发⽣兴趣，主要还是由于在研究中发现：SVs对基因组的影响⽐起SNP更⼤，⼀旦发⽣往往会给⽣命体带来重⼤影响，⽐如导致出⽣缺陷、癌症等；有研究发现，基因组上的SVs⽐起SNP⽽⾔，更能代表⼈类群体的多样性特征；稀有且相同的⼀些结构性变异往往和疾病（包括癌症）的发⽣相互关联甚⾄还是其直接的致病诱因。

宁波大学基因工程考研题库-精

第一章绪论1.分子生物学要研究的主要内容是（）（）（）参考答案是：基因工程；基因表达调控研究；结构分子生物学第二章DNA结构一、填空题(6分)1.天然存在的DNA分子形式为右手（）型螺旋参考答案是：B2.Cot曲线方程为：（）参考答案是：C/C0=1/（1+K2C0t）3.DNA复性必须满足两个条件：（）（）参考答案是：盐浓度必须高；温度必须适当高4.DNA 携带有两类不同的遗传信息，即（）信息和（）信息。

参考答案是：基因编码；基因选择性表达二、判断题(12分)1.拓扑异构酶I解旋需要ATP酶。

参考答案是：不正确2.假基因通常与它们相似的基因位于相同的染色体上。

参考答案是：不正确3.水蜥的基因组比人的基因组大。

参考答案是：正确4.一段长度100bp的DNA，具有4100种可能的序列组合形式。

参考答案是：正确5.单个核苷酸通过磷酸二酯键连接到DNA骨架上。

参考答案是：正确6.在高盐和低温条件下由DNA单链杂交形成的双螺旋表现出几乎完全的互补性，这一过程可看作是一个复性（退火）反应。

参考答案是：不正确7.生物的遗传密码只存在于细胞核中参考答案是：不正确8.Top I解旋需要ATP参考答案是：不正确9.琼脂糖凝胶电泳—EBr电泳法分离纯化超螺旋DNA原理是根据EBr可以较多地插入到超螺旋DNA 中，因而迁移速度较快参考答案是：不正确10.高等真核生物的大部分DNA是不编码蛋白质的参考答案是：正确11.Cot1/2与基因组复杂性有关参考答案是：正确12.Cot1/2与基因组大小有关参考答案是：正确三、单选题(4分)1.DNA变性是由于（D）A 磷酸二酯键断裂；B 多核苷酸解离；C 碱基的甲基化修饰；D 互补碱基之间氢键断裂；E 糖苷键断裂；2.1953年，Watson 和Crick提出（A）A 多核苷酸DNA链通过氢键连接成一个双螺旋；B DNA的复制是半保留的，常常形成亲本-子代双螺旋杂合链；C 三个连续的核苷酸代表一个遗传密码；D 遗传物质通常是DNA而非RNA。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。