GATK Usage Explained in Detail: plob's Most Comprehensive Manual


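GBTK Installation Steps

GBTK (Genome Background Toolkit) is a toolkit for genome background computation, used to study genome organization, repetitive sequences, gene function, and related topics.

This article walks through the GBTK installation steps in detail so that users can complete the installation without problems.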

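Installing GBTK involves the following steps.

Step 1: Prerequisites
Before installing GBTK, make sure your computer meets the following requirements:

1. Operating system: GBTK can be installed on Windows, Mac OS, and Linux. Make sure the operating system is up to date.

2. Software: GBTK depends on a Python environment and several Python libraries, so install Python and set the relevant environment variables before installing GBTK.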

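Step 3: Unpack the installation package
Choose a suitable location on your computer and unpack the "gbtk.tar.gz" file there.

You can do this from the command line or with an archive utility.

When unpacking finishes, you will have a folder named "gbtk".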

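Step 4: Install the dependencies
Before installing GBTK itself, install the libraries it requires.

These include numpy, matplotlib, and biopython.

You can install them with pip.

Open a terminal or command prompt and run the following commands:

    pip install numpy
    pip install matplotlib
    pip install biopython

Step 5: Configure GBTK
Once the dependencies are installed, GBTK must be set up so that it works correctly.

Open a terminal or command prompt, change into the "gbtk" folder unpacked earlier, and run:

    python setup.py install

This command configures GBTK and installs it on your computer.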

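Step 6: Verify the installation
After completing the steps above, you can verify that GBTK installed successfully by running a simple command.

Open a terminal or command prompt and enter:

    gbtk -h

If the installation succeeded, GBTK's command-line help is displayed.

This shows that GBTK is installed on your computer.

This completes the installation of GBTK.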
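The checks in steps 4 and 6 can also be scripted. The sketch below is a minimal, hypothetical helper (the name `check_install` is an illustration, not part of GBTK): it reports which of the required libraries cannot be imported and whether a `gbtk` executable is on the PATH. It assumes that Biopython is imported under the module name `Bio`.

```python
import importlib.util
import shutil

# pip package name -> import name (Biopython is imported as "Bio").
REQUIRED = {"numpy": "numpy", "matplotlib": "matplotlib", "biopython": "Bio"}


def check_install(executable="gbtk", required=REQUIRED):
    """Return (missing_packages, executable_path_or_None).

    Hypothetical helper: it only checks importability and PATH lookup,
    it does not run the tool itself.
    """
    missing = sorted(
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    )
    return missing, shutil.which(executable)


if __name__ == "__main__":
    missing, path = check_install()
    print("missing packages:", ", ".join(missing) or "none")
    print("gbtk executable:", path or "not found on PATH")
```

An empty list of missing packages together with a non-empty executable path corresponds to the state in which `gbtk -h` is expected to work.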

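GATK4 Basic Concepts

GATK stands for Genome Analysis ToolKit. It is software for identifying variants in high-throughput sequencing data and is one of the most widely used SNP-calling tools.

GATK was originally designed for analyzing human whole-exome and whole-genome data. As it has developed, it has come to support other species as well, and it can also detect CNVs and SVs.

The official website provides complete analysis workflows, known as the Best Practices. The latest release at the time of writing is GATK4. Compared with earlier versions, GATK4 bundles the Picard tools. It is written in Java and requires Java 1.8.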

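The download link is /gatk/download/. Installation proceeds as follows:

    wget https:///broadinstitute/gatk/releases/download/4.0.4.0/gatk-4.0.4.0.zip
    unzip gatk-4.0.4.0.zip
    tree -L 1 gatk-4.0.4.0/
    gatk-4.0.4.0/
    ├── gatk
    ├── gatk-completion.sh
    ├── gatkcondaenv.yml
    ├── GATKConfig.EXAMPLE.properties
    ├── gatkdoc
    ├── gatk-package-4.0.4.0-local.jar
    ├── gatk-package-4.0.4.0-spark.jar
    ├── gatkPythonPackageArchive.zip
    └── README.md

After unpacking, the gatk executable is all you need.

A simple command checks whether the program is installed correctly:

    gatk --list

This command prints all available subcommands. If the list appears, the program is installed correctly.

Some of the listed subcommands are inherited from Picard, so there is no longer any need to mix separate Picard and GATK installations as with earlier versions.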

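The GATK4 Best Practices provide five pipelines:

1. Germline SNPs + Indels
2. Somatic SNVs + Indels
3. RNAseq SNPs + Indels
4. Germline CNVs
5. Somatic CNVs

These five pipelines can be grouped by whether the material under study is DNA or RNA: DNA sequencing (1, 2, 4, 5) and RNA sequencing (3).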

6 GATK4 Complete Workflow

0 Define variables

    source activate wes
    #GATK=~/biosoft/gatk/gatk-4.1.2.0/gatk
    ref=/mnt/f/kelly/bioTree/server/wesproject/hg38/Homo_sapiens_assembly38.fasta
    snp=/mnt/f/kelly/bioTree/server/wesproject/hg38/dbsnp_146.hg38.vcf.gz
    indel=/mnt/f/kelly/bioTree/server/wesproject/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

1 Mark PCR duplicate reads

    sample=SRR7696207
    echo $sample
    gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I $sample.bam -O ${sample}_marked.bam -M $sample.metrics 1>log.mark 2>&1

After the run finishes, the directory contains the following files:

    ├── [ 17K] log.mark
    ├── [3.8G] SRR7696207.bam
    ├── [5.0G] SRR7696207_marked.bam
    ├── [3.3K] SRR7696207.metrics

2 FixMateInformation

    gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation -I ${sample}_marked.bam -O ${sample}_marked_fixed.bam -SO coordinate 1>${sample}_log.fix 2>&1

This produces the marked_fixed.bam file.
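Both steps above share one command pattern: `gatk`, the Java options, the tool name, then the tool's arguments. When the workflow is run for many samples, it can help to assemble these commands programmatically. The following is a minimal sketch under that assumption. The helper names (`gatk_cmd`, `mark_duplicates`, `fix_mate_information`) are hypothetical, and the options are copied from the commands shown in this section. Nothing is executed unless the resulting list is passed to `subprocess.run`.

```python
import shlex

# Java options copied from the commands in this section.
JAVA_OPTS = "-Xmx20G -Djava.io.tmpdir=./"


def gatk_cmd(tool, *args, gatk="gatk", java_opts=JAVA_OPTS):
    """Build the argument list for one GATK tool invocation."""
    return [gatk, "--java-options", java_opts, tool, *args]


def mark_duplicates(sample):
    """Step 1: mark PCR duplicates in <sample>.bam."""
    return gatk_cmd(
        "MarkDuplicates",
        "-I", f"{sample}.bam",
        "-O", f"{sample}_marked.bam",
        "-M", f"{sample}.metrics",
    )


def fix_mate_information(sample):
    """Step 2: fix mate information and sort by coordinate."""
    return gatk_cmd(
        "FixMateInformation",
        "-I", f"{sample}_marked.bam",
        "-O", f"{sample}_marked_fixed.bam",
        "-SO", "coordinate",
    )


if __name__ == "__main__":
    for cmd in (mark_duplicates("SRR7696207"), fix_mate_information("SRR7696207")):
        # Print a shell-quoted form; use subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Keeping each command as a list avoids shell quoting problems with the space inside the Java options string.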

structure User Guide

Documentation for structure software: Version 2.3

Jonathan K. Pritchard (a), Xiaoquan Wen (a), Daniel Falush (b) [1] [2] [3]

(a) Department of Human Genetics, University of Chicago
(b) Department of Statistics, University of Oxford

Software from /structure.html

April 21, 2009

[1] Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa Hubisz.
[2] The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of Oxford.
[3] Discussion and questions about structure should be addressed to the online forum at structure-software@. Please check this document and search the previous discussion before posting questions.

Contents

1 Introduction
  1.1 Overview
  1.2 What’s new in Version 2.3?
2 Format for the data file
  2.1 Components of the data file
  2.2 Rows
  2.3 Individual/genotype data
  2.4 Missing genotype data
  2.5 Formatting errors
3 Modelling decisions for the user
  3.1 Ancestry Models
  3.2 Allele frequency models
  3.3 How long to run the program
4 Missing data, null alleles and dominant markers
  4.1 Dominant markers, null alleles, and polyploid genotypes
5 Estimation of K (the number of populations)
  5.1 Steps in estimating K
  5.2 Mild departures from the model can lead to overestimating K
  5.3 Informal pointers for choosing K; is the structure real?
  5.4 Isolation by distance data
6 Background LD and other miscellania
  6.1 Sequence data, tightly linked SNPs and haplotype data
  6.2 Multimodality
  6.3 Estimating admixture proportions when most individuals are admixed
7 Running structure from the command line
  7.1 Program parameters
  7.2 Parameters in file mainparams
  7.3 Parameters in file extraparams
  7.4 Command-line changes to parameter values
8 Front End
  8.1 Download and installation
  8.2 Overview
  8.3 Building a project
  8.4 Configuring a parameter set
  8.5 Running simulations
  8.6 Batch runs
  8.7 Exporting parameter files from the front end
  8.8 Importing results from the command-line program
  8.9 Analyzing the results
9 Interpreting the text output
  9.1 Output to screen during run
  9.2 Printout of Q
  9.3 Printout of Q when using prior population information
  9.4 Printout of allele-frequency divergence
  9.5 Printout of estimated allele frequencies (P)
  9.6 Site by site output for linkage model
10 Other resources for use with structure
  10.1 Plotting structure results
  10.2 Importing bacterial MLST data into structure format
11 How to cite this program
12 Bibliography

1 Introduction

The program structure implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. The method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, and identifying migrants and admixed individuals.

Briefly, we assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their
genotypes indicate that they are admixed. It is assumed that within populations, the loci are at Hardy-Weinberg equilibrium, and linkage equilibrium. Loosely speaking, individuals are assigned to populations in such a way as to achieve this.

Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers including microsatellites, SNPs and RFLPs. The model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can’t handle markers that are extremely close together. Starting with version 2.0, we can now deal with weakly linked markers.

While the computational approaches implemented here are fairly powerful, some care is needed in running the program in order to ensure sensible answers. For example, it is not possible to determine suitable run-lengths theoretically, and this requires some experimentation on the part of the user. This document describes the use and interpretation of the software and supplements the published papers, which provide more formal descriptions and evaluations of the methods.

1.1 Overview

The software package structure consists of several parts. The computational part of the program was written in C. We distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data file supplied by the user. There is also a Java front end that provides various helpful features for the user including simple processing of the output. You can also invoke structure from the command line instead of using the front end.

This document includes information about how to format the data file, how to choose appropriate models, and how to interpret the results. It also has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.

1.2 What’s new in Version 2.3?

The 2.3 release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative
enough for the usual structure models to provide accurate inference, but (2) the sampling locations are correlated with population membership. In this situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009).

We hope to release further improvements in the coming months.

             loc_a  loc_b  loc_c  loc_d  loc_e
George   1     -9    145     66      0     92
George   1     -9     -9     64      0     94
Paula    1    106    142     68      1     92
Paula    1    106    148     64      0     94
Matthew  2    110    145     -9      0     92
Matthew  2    110    148     66      1     -9
Bob      2    108    142     64      1     94
Bob      2     -9    142     -9      0     94
Anja     1    112    142     -9      1     -9
Anja     1    114    142     66      1     94
Peter    1     -9    145     66      0     -9
Peter    1    110    145     -9      1     -9
Carsten  2    108    145     62      0     -9
Carsten  2    110    145     64      1     92

Table 1: Sample data file. Here MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9. Also, POPFLAG=0, LOCDATA=0, PHENOTYPE=0, EXTRACOLS=0. The second column shows the geographic sampling location of individuals. We can also store the data with one row per individual (ONEROWPERIND=1), in which case the first row would read “George 1 -9 -9 145 -9 66 64 0 0 92 94”.

2 Format for the data file

The format for the genotype data is shown in Table 2 (and Table 1 shows an example). Essentially, the entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, and the loci are in columns. The user can make several choices about format, and most of these data (apart from the genotypes!) are optional.

For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, where each locus is in two consecutive columns. Unless you plan to use the linkage model (see below) the order of the alleles for a single individual does not matter. The pre-genotype data columns (see below) are recorded twice for each individual.
(More generally, for n-ploid organisms, data for each individual are stored in n consecutive rows unless the ONEROWPERIND option is used.)

2.1 Components of the data file

The elements of the input file are as listed below. If present, they must be in the following order; however, most are optional (as indicated) and may be deleted completely. The user specifies which data are present, either in the front end, or (when running structure from the command line), in a separate file, mainparams. At the same time, the user also specifies the number of individuals and the number of loci.

2.2 Rows

1. Marker Names (Optional; string) The first row in the file can contain a list of identifiers for each of the markers in the data set. This row contains L strings of integers or characters, where L is the number of loci.

2. Recessive Alleles (Data with dominant markers only; integer) Data sets of SNPs or microsatellites would generally not include this line. However if the option RECESSIVEALLELES is set to 1, then the program requires this row to indicate which allele (if any) is recessive at each marker. See Section 4.1 for more information. The option is used for data such as AFLPs and for polyploids where genotypes may be ambiguous.

3. Inter-Marker Distances (Optional; real) The next row in the file is a set of inter-marker distances, for use with linked loci. These should be genetic distances (e.g., centiMorgans), or some proxy for this based, for example, on physical distances. The actual units of distance do not matter too much, provided that the marker distances are (roughly) proportional to recombination rate. The front end estimates an appropriate scaling from the data, but users of the command line version must set LOG10RMIN, LOG10RMAX and LOG10RSTART in the file extraparams. The markers must be in map order within linkage groups. When consecutive markers are from different linkage groups (e.g., different chromosomes), this should be indicated by the value -1. The first marker is also assigned the value -1. All other distances are non-negative. This row
contains L real numbers.

4. Phase Information (Optional; diploid data only; real number in the range [0,1]). This is for use with the linkage model only. This is a single row of L probabilities that appears after the genotype data for each individual. If phase is known completely, or no phase information is available, these rows are unnecessary. They may be useful when there is partial phase information from family data or when haploid X chromosome data from males and diploid autosomal data are input together. There are two alternative representations for the phase information: (1) the two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, respectively. The phase line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0); (2) the phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1). The first entry should be filled in with 0.5 to fill out the line to L entries. For example, the following data input would represent the information from a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model:

102  156  165  101  143  105  104  101
100  148  163  101  143   -9   -9   -9
0.5  0.5  0.5  0.5  0.5  1.0  1.0  1.0

where -9 indicates “missing data”, here missing due to the absence of a second X chromosome, the 0.5 indicates that the autosomal loci are unphased, and the 1.0s indicate that the X chromosome loci have been maternally inherited with probability 1.0, and hence are phased. The same information can be represented with the markovphase model. In this case the input file would read:

102  156  165  101  143  105  104  101
100  148  163  101  143   -9   -9   -9
0.5  0.5  0.5  0.5  0.5  0.5  1.0  1.0

Here, the two 1.0s indicate that the first and second, and second and third X chromosome loci are perfectly in phase with each other. Note that the site by site output under these two models will be different. In the first case, structure would output the assignment probabilities
for maternal and paternal chromosomes. In the second case, it would output the probabilities for each allele listed in the input file.

5. Individual/Genotype data (Required) Data for each sampled individual are arranged into one or more rows as described below.

2.3 Individual/genotype data

Each row of individual data contains the following elements. These form columns in the data file.

1. Label (Optional; string) A string of integers or characters used to designate each individual in the sample.

2. PopData (Optional; integer) An integer designating a user-defined population from which the individual was obtained (for instance these might designate the geographic sampling locations of individuals). In the default models, this information is not used by the clustering algorithm, but can be used to help organize the output (for example, plotting individuals from the same pre-defined population next to each other).

3. PopFlag (Optional; 0 or 1) A Boolean flag which indicates whether to use the PopData when using learning samples (see USEPOPINFO, below). (Note: A Boolean variable (flag) is a variable which takes the values TRUE or FALSE, which are designated here by the integers 1 (use PopData) and 0 (don’t use PopData), respectively.)

4. LocData (Optional; integer) An integer designating a user-defined sampling location (or other characteristic, such as a shared phenotype) for each individual. This information is used to assist the clustering when the LOCPRIOR model is turned on. If you simply wish to use the PopData for the LOCPRIOR model, then you can omit the LocData column and set LOCISPOP=1 (this tells the program to use PopData to set the locations).

5. Phenotype (Optional; integer) An integer designating the value of a phenotype of interest, for each individual. (φ(i) in table.) (The phenotype information is not actually used in structure. It is here to permit a smooth interface with the program STRAT which is used for association mapping.)

6. Extra Columns (Optional; string) It may be convenient for the user to include additional data in the input file
which are ignored by the program. These go here, and may be strings of integers or characters.

7. Genotype Data (Required; integer) Each allele at a given locus should be coded by a unique integer (e.g., microsatellite repeat score).

2.4 Missing genotype data

Missing data should be indicated by a number that doesn’t occur elsewhere in the data (often -9 by convention). This number can also be used where there is a mixture of haploid and diploid data (e.g., X and autosomal loci in males). The missing-data value is set along with the other parameters describing the characteristics of the data set.

2.5 Formatting errors

We have implemented reasonably careful error checking to make sure that the data set is in the correct format, and the program will attempt to provide some indication about the nature of any problems that exist. The front end requires returns at the ends of each row, and does not allow returns within rows; the command-line version of structure treats returns in the same way as spaces or tabs.

One problem that can arise is that editing programs used to assemble the data prior to importing them into structure can introduce hidden formatting characters, often at the ends of lines, or at the end of the file. The front end can remove many of these automatically, but this type of problem may be responsible for errors when the data file seems to be in the right format. If you are importing data to a UNIX system, the dos2unix function can be helpful for cleaning these up.

3 Modelling decisions for the user

3.1 Ancestry Models

There are four main models for the ancestry of individuals: (1) no admixture model (individuals are discretely from one population or another); (2) the admixture model (each individual draws some fraction of his/her genome from each of the K populations); (3) the linkage model (like the admixture model, but linked loci are more likely to come from the same population); (4) models with informative priors (allow structure to use information about sampling locations: either to assist clustering with
weak data, to detect migrants, or to pre-define some populations). See Pritchard et al. (2000a) and Hubisz et al. (2009) for more on models 1, 2, and 4 and Falush et al. (2003a) for model 3.

1. No admixture model. Each individual comes purely from one of the K populations. The output reports the posterior probability that individual i is from population k. The prior probability for each population is 1/K. This model is appropriate for studying fully discrete populations and is often more powerful than the admixture model at detecting subtle structure.

2. Admixture model. Individuals may have mixed ancestry. This is modelled by saying that individual i has inherited some fraction of his/her genome from ancestors in population k. The output records the posterior mean estimates of these proportions. Conditional on the ancestry vector, q(i), the origin of each allele is independent. We recommend this model as a starting point for most analyses. It is a reasonably flexible model for dealing with many of the complexities of real populations. Admixture is a common feature of real data, and you probably won’t find it if you use the no-admixture model. The admixture model can also deal with hybrid zones in a natural way.

Label  Pop   Flag  Location  Phen  ExtraCols          Loc 1     Loc 2     Loc 3     ....  Loc L
                                                      M_1       M_2       M_3       ....  M_L
                                                      r_1       r_2       r_3       ....  r_L
                                                      -1        D_{1,2}   D_{2,3}   ....  D_{L-1,L}
ID(1)  g(1)  f(1)  l(1)      φ(1)  y(1)_1,...,y(1)_n  x(1,1)_1  x(1,1)_2  x(1,1)_3  ....  x(1,1)_L
ID(1)  g(1)  f(1)  l(1)      φ(1)  y(1)_1,...,y(1)_n  x(1,2)_1  x(1,2)_2  x(1,2)_3  ....  x(1,2)_L
                                                      p(1)_1    p(1)_2    p(1)_3    ....  p(1)_L
ID(2)  g(2)  f(2)  l(2)      φ(2)  y(2)_1,...,y(2)_n  x(2,1)_1  x(2,1)_2  x(2,1)_3  ....  x(2,1)_L
ID(2)  g(2)  f(2)  l(2)      φ(2)  y(2)_1,...,y(2)_n  x(2,2)_1  x(2,2)_2  x(2,2)_3  ....  x(2,2)_L
                                                      p(2)_1    p(2)_2    p(2)_3    ....  p(2)_L
....
ID(i)  g(i)  f(i)  l(i)      φ(i)  y(i)_1,...,y(i)_n  x(i,1)_1  x(i,1)_2  x(i,1)_3  ....  x(i,1)_L
ID(i)  g(i)  f(i)  l(i)      φ(i)  y(i)_1,...,y(i)_n  x(i,2)_1  x(i,2)_2  x(i,2)_3  ....  x(i,2)_L
                                                      p(i)_1    p(i)_2    p(i)_3    ....  p(i)_L
....
ID(N)  g(N)  f(N)  l(N)      φ(N)  y(N)_1,...,y(N)_n  x(N,1)_1  x(N,1)_2  x(N,1)_3  ....  x(N,1)_L
ID(N)  g(N)  f(N)  l(N)      φ(N)  y(N)_1,...,y(N)_n  x(N,2)_1  x(N,2)_2  x(N,2)_3  ....  x(N,2)_L
                                                      p(N)_1    p(N)_2    p(N)_3    ....  p(N)_L

Table 2: Format of the data file, in two-row format. Most of these components are optional (see text for details). M_l is an identifier for marker l. r_l indicates which allele, if any, is recessive at each marker (dominant genotype data only). D_{i,i+1} is the distance between markers i and i+1. ID(i) is the label for individual i, g(i) is a predefined population index for individual i (PopData); f(i) is a flag used to incorporate learning samples (PopFlag); l(i) is the sampling location of individual i (LocData); φ(i) can store a phenotype for individual i; y(i)_1,...,y(i)_n are for storing extra data (ignored by the program); (x(i,1)_l, x(i,2)_l) stores the genotype of individual i at locus l. p(i)_l is the phase information for marker l in individual i.

3. Linkage model. This is essentially a generalization of the admixture model to deal with “admixture linkage disequilibrium”, i.e., the correlations that arise between linked markers in recently admixed populations. Falush et al. (2003a) describe the model and computations in more detail.

The basic model is that, t generations in the past, there was an admixture event that mixed the K populations. If you consider an individual chromosome, it is composed of a series of “chunks” that are inherited as discrete units from ancestors at the time of the admixture. Admixture LD arises because linked alleles are often on the same chunk, and therefore come from the same ancestral population.

The sizes of the chunks are assumed to be independent exponential random variables with mean length 1/t (in Morgans). In practice we estimate a “recombination rate” r from the data that corresponds to the rate of switching from the present chunk to a new chunk [1]. Each chunk in individual i is derived independently from population k with probability q(i)_k, where q(i)_k is the proportion of that individual’s ancestry from population k.

Overall, the new model retains the main elements of the admixture model, but all the alleles that are on a single chunk have to come from the same population. The new MCMC algorithm integrates over the possible chunk
sizes and break points. It reports the overall ancestry for each individual, taking account of the linkage, and can also report the probability of origin of each bit of chromosome, if desired by the user.

This new model performs better than the original admixture model when using linked loci to study admixed populations. It achieves more accurate estimates of the ancestry vector, and can extract more information from the data. It should be useful for admixture mapping. The model is not designed to deal with background LD between very tightly linked markers.

Clearly, this model is a big simplification of the complex realities of most real admixed populations. However, the major effect of admixture is to create long-range correlation among linked markers, and so our aim here is to encapsulate that feature within a fairly simple model.

The computations are a bit slower than for the admixture model, especially with large K and unphased data. Nonetheless, they are practical for thousands of sites and individuals and multiple populations. The model can only be used if there is information about the relative positions of the markers (usually a genetic map).

4. Using prior population information. The default mode for structure uses only genetic information to learn about population structure. However, there is often additional information that might be relevant to the clustering (e.g., physical characteristics of sampled individuals or geographic sampling locations). At present, structure can use this information in three ways:

• LOCPRIOR models: use sampling locations as prior information to assist the clustering; for use with data sets where the signal of structure is relatively weak [2]. There are some data sets where there is genuine population structure (e.g., significant F_ST between sampling locations), but the signal is too weak for the standard structure models to detect. This is often the case for data sets with few markers, few individuals, or very weak structure.

To improve performance in this situation, Hubisz
et al. (2009) developed new models that make use of the location information to assist clustering. The new models can often provide accurate inference of population structure and individual ancestry in data sets where the signal of structure is too weak to be found using the standard structure models.

Briefly, the rationale for the LOCPRIOR models is as follows. Usually, structure assumes that all partitions of individuals are approximately equally likely a priori. Since there is an immense number of possible partitions, it takes highly informative data for structure to conclude that any particular partition of individuals into clusters has compelling statistical support. In contrast, the LOCPRIOR models take the view that in practice, individuals from the same sampling location often come from the same population. Therefore, the LOCPRIOR models are set up to expect that the sampling locations may be informative about ancestry. If the data suggest that the locations are informative, then the LOCPRIOR models allow structure to use this information.

[1] Because of the way that this is parameterized, the map distances in the input file can be in arbitrary units, e.g., genetic distances, or physical distances (under the assumption that these are roughly proportional to genetic distances). Then the estimated value of r represents the rate of switching from one chunk to the next, per unit of whatever distance was assumed in the input file. E.g., if an admixture event took place ten generations ago, then r should be estimated as 0.1 when the map distances are measured in cM (this is 10 × 0.01, where 0.01 is the probability of recombination per centiMorgan), or as 10^-4 = 10 × 10^-5 when the map distances are measured in KB (assuming a constant crossing-over rate of 1 cM/MB). The prior for r is log-uniform. The front end tries to make some guesses about sensible upper and lower bounds for r, but the user should adjust these to match the biology of the situation.

[2] Daniel refers to this as “Better priors for worse data.”

Hubisz et al. (2009) developed a pair of LOCPRIOR models: for no-admixture and for admixture. In both cases, the underlying model (and the likelihood) is the same as for the standard versions. The key difference is that structure is allowed to use the location information to assist the clustering (i.e., by modifying the prior to prefer clustering solutions that correlate with the locations).

The LOCPRIOR models have the desirable properties that (i) they do not tend to find structure when none is present; (ii) they are able to ignore the sampling information when the ancestry of individuals is uncorrelated with sampling locations; and (iii) the old and new models give essentially the same answers when the signal of population structure is very strong.

Hence, we recommend using the new models in most situations where the amount of available data is very limited, especially when the standard structure models do not provide a clear signal of structure. However, since there is now a great deal of accumulated experience with the standard structure models, we recommend that the basic models remain the default for highly informative data sets (Hubisz et al., 2009).

To run the LOCPRIOR model, the user must first specify a “sampling location” for each individual, coded as an integer. That is, we assume the samples were collected at a set of discrete locations, and we do not use any spatial information about the locations. (We recognize that in some studies, every individual may be collected at a different location, and so clumping individuals into a smaller set of discrete locations may not be an ideal representation of the data.) The “locations” could also represent a phenotype, ecotype, or ethnic group.
The locations are entered into the input file either in the PopData column (set LOCISPOP=1), or as a separate LocData column (see Section 2.3). To use the LOCPRIOR model you must first specify either the admixture or no-admixture models. If you are using the Graphical User Interface version, tick the “use sampling locations as prior” box. If you are using the command-line version, set LOCPRIOR=1. (Note that LOCPRIOR is incompatible with the linkage model.)

Our experience so far is that the LOCPRIOR model does not bias towards detecting structure spuriously when none is present. You can use the same diagnostics for whether there is genuine structure as when you are not using a LOCPRIOR. Additionally it may be helpful to look at the value of r, which parameterizes the amount of information carried by the locations. Values of r near 1, or < 1, indicate that the locations are informative. Larger values of r indicate that either there is no population structure, or that the structure is independent of the locations.

• USEPOPINFO model: use sampling locations to test for migrants or hybrids; for use with data sets where the data are very informative. In some data sets, the user might find that pre-defined groups (e.g., sampling locations) correspond almost exactly to structure clusters, except for a handful of individuals who seem to be misclassified. Pritchard et al. (2000a) developed a formal Bayesian test for evaluating whether any individuals in the sample are immigrants to their supposed populations, or have recent immigrant ancestors.

Note that this model assumes that the predefined populations are usually correct. It takes quite strong data to overcome the prior against misclassification. Before using the USEPOPINFO model, you should also run the program without population information to ensure that the pre-defined populations are in rough agreement with the genetic information.

To use this model set USEPOPINFO to 1, and choose a value of MIGRPRIOR (which is ν in Pritchard et al. (2000a)). You might choose something in the
range 0.001 to 0.1 for ν.

The pre-defined population for each individual is set in the input data file (see PopData). In this mode, individuals assigned to population k in the input file will be assigned to cluster k in the structure algorithm. Therefore, the predefined populations should be integers between 1 and MAXPOPS (K), inclusive. If PopData for any individual is outside this range, their q will be updated in the normal way (i.e., without prior population information, according to the model that would be used if USEPOPINFO was turned off [3]).

• USEPOPINFO model: pre-specify the population of origin of some individuals to assist ancestry estimation for individuals of unknown origin. A second way to use the USEPOPINFO model is to define “learning samples” that are pre-defined as coming from particular clusters. structure is then used to cluster the remaining individuals. Note: In the Front End, this option is switched on using the option “Update allele frequencies using only individuals with POPFLAG=1”, located under the “Advanced Tab”.

Learning samples are implemented using the PopFlag column in the data file. The pre-defined population is used for those individuals for whom PopFlag=1 (and whose PopData is in (1...K)). The PopData value is ignored for individuals for whom PopFlag=0. If there is no PopFlag column in the data file, then when USEPOPINFO is turned on, PopFlag is set to 1 for all individuals. Ancestry of individuals with PopFlag=0, or with PopData not in (1...K) are updated according to the admixture or no-admixture model, as specified by the user. As noted above, it may be helpful to set α to a sensible value if there are few individuals without predefined populations.

This application of USEPOPINFO can be helpful in several contexts. For example, there may be some individuals of known origin, and the goal is to classify additional individuals of unknown origin. For example, we might collect data from a set of dogs of known breeds (numbered 1...K), and then use structure to estimate the ancestry for additional dogs of
unknown (possibly hybrid) origin. By pre-setting the population numbers, we can ensure that the structure clusters correspond to pre-defined breeds, which makes the output more interpretable, and can improve the accuracy of the inference. (Of course, if two pre-defined breeds are genetically identical, then the dogs of unknown origin may be inferred to have mixed ancestry.)

Another use of USEPOPINFO is for cases where the user wants to update allele frequencies using only a subset of the individuals. Ordinarily, structure analyses update the allele frequency estimates using all available individuals. However there are some settings where you might want to estimate ancestry for some individuals, without those individuals affecting the allele frequency estimates. For example you may have a standard collection of learning samples, and then periodically you want to estimate ancestry for new batches of genotyped

[3] If the admixture model is used to estimate q for those individuals without prior population information, α is updated on the basis of those individuals only. If there are very few such individuals, you may need to fix α at a sensible value.
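The two-row data file layout described in Section 2 can be checked before a run with a short script. The sketch below is an illustration only and is not part of the structure distribution. It assumes the settings of the Table 1 example (MARKERNAMES=1, LABEL=1, POPDATA=1, MISSING=-9, no other optional columns), unique labels, and the marker names written as `loc_a` to `loc_e`. The function names `parse_two_row` and `to_one_row` are hypothetical. It derives NUMINDS and NUMLOCI and reproduces the one-row (ONEROWPERIND=1) form quoted in the Table 1 caption.

```python
SAMPLE = """\
loc_a loc_b loc_c loc_d loc_e
George 1 -9 145 66 0 92
George 1 -9 -9 64 0 94
Paula 1 106 142 68 1 92
Paula 1 106 148 64 0 94
Matthew 2 110 145 -9 0 92
Matthew 2 110 148 66 1 -9
Bob 2 108 142 64 1 94
Bob 2 -9 142 -9 0 94
Anja 1 112 142 -9 1 -9
Anja 1 114 142 66 1 94
Peter 1 -9 145 66 0 -9
Peter 1 110 145 -9 1 -9
Carsten 2 108 145 62 0 -9
Carsten 2 110 145 64 1 92
"""


def parse_two_row(text, missing=-9):
    """Parse a two-row diploid file with marker names, labels and PopData.

    Returns (markers, individuals); missing alleles are stored as None.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    markers, rows = lines[0], lines[1:]
    individuals = {}
    for label, pop, *alleles in rows:
        if len(alleles) != len(markers):
            raise ValueError(f"{label}: expected {len(markers)} loci, got {len(alleles)}")
        calls = [None if int(a) == missing else int(a) for a in alleles]
        entry = individuals.setdefault(label, {"pop": int(pop), "rows": []})
        entry["rows"].append(calls)
    return markers, individuals


def to_one_row(label, entry, missing=-9):
    """Format one individual with the two alleles of each locus side by side."""
    first, second = entry["rows"]
    fields = [label, str(entry["pop"])]
    for a, b in zip(first, second):
        fields.append(str(missing if a is None else a))
        fields.append(str(missing if b is None else b))
    return " ".join(fields)


if __name__ == "__main__":
    markers, individuals = parse_two_row(SAMPLE)
    print("NUMINDS =", len(individuals), "NUMLOCI =", len(markers))
    print(to_one_row("George", individuals["George"]))
```

A file that declares the wrong number of loci, or that has a row with a missing column, raises a `ValueError` naming the affected individual, which is the kind of formatting problem Section 2.5 warns about.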

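plink tagsnp usage

Outline: 1. Introduce PLINK and tag SNPs. 2. Explain how tag SNPs are reported with PLINK. 3. Explain the relevant options. 4. Give a worked example. 5. Summarize the strengths of the approach.

PLINK and tag SNPs are both widely used in bioinformatics.

PLINK is a toolset for whole-genome association and population-genetic analysis of genotype data such as single-nucleotide polymorphisms (SNPs) and insertion/deletion (indel) markers.

A tag SNP is a SNP in strong linkage disequilibrium (LD) with neighbouring variants, so that it can stand in for them. Selecting tag SNPs helps researchers identify and analyse genetic markers quickly and efficiently.

Tag SNP reporting is built into PLINK (the `--show-tags` option), so no separate tagsnp program has to be installed.

Usage:

1. Make sure PLINK is installed correctly.

2. Open a command-line interface (CMD on Windows, Terminal on Mac and Linux) and enter:

```
plink --file <input prefix> --show-tags all
```

Here `<input prefix>` is the prefix of the files holding the genotype data, normally PLINK-format `.ped`/`.map` files (use `--bfile` instead for binary `.bed`/`.bim`/`.fam` files).

3. Wait for the run to finish. Results are written to the current directory under the default prefix `plink` unless `--out` is given.

Commonly used options:

- `--chr`: restrict the analysis to the given chromosomes.

- `--out`: set the output file prefix.

- `--tag-kb`: maximum distance, in kb, between a variant and its tags.

- `--tag-r2`: minimum r² for one variant to count as a tag of another.

For example, to report tag SNPs for a genotype data set covering chromosomes 1-22 and write the output under the prefix `snps`:

```
plink --file chr1-22 --chr 1-22 --show-tags all --out snps
```

Overall, tag SNP selection in PLINK is a convenient and practical way to extract a reduced set of SNP markers for downstream genetic analysis.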

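Gatling User Guide

I. Introduction

Gatling is a modern load-testing tool written in Scala, widely used for performance and load testing of web applications.

It is efficient, scalable and easy to use, and can simulate large numbers of users accessing the target system at the same time in order to test the system's performance and stability.

II. Installation

1. Download the Gatling package and unpack it into a directory of your choice.

2. Configure the Java environment variables so that Gatling can run.

III. Writing a test script

1. Open the user-files folder in the Gatling directory and create a folder named simulations if it does not already exist.

2. In the simulations folder, create a file ending in .scala to hold the test script.

3. Write the test script in Scala: define scenarios, user behaviour and requests.

IV. Building the load scenario

1. In the test script, define a scenario with the scenario method and give it a name.

2. Within the scenario, use the exec method to define user actions, such as sending HTTP requests.

3. Repeat counts, pauses and similar parameters can be set on these actions to mimic real user behaviour.

V. Configuring test parameters

1. Open the conf folder in the Gatling directory and find gatling.conf.

2. Adjust the relevant settings in gatling.conf. The load profile itself (test duration, number of concurrent users, target system address) is normally defined in the simulation.

3. More advanced settings, such as assertions and report generation, can be configured as needed.

VI. Running the test

1. Open a command line and change to the bin folder in the Gatling directory.

2. Run gatling.sh -s <simulationClassName>, where <simulationClassName> is the class name of the test script.

3. Gatling starts the load test and prints progress and results as it runs.

VII. Analysing the results

1. When the test finishes, Gatling generates a test report.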

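Introduction to GAMIT/GLOBK and Its Applications (including sh_gamit and sh_glred batch processing)

2012-4-6, Second Geodetic Survey Brigade (国测二大队)

1. Introduction to GAMIT/GLOBK
Mainstream high-precision GPS processing software (for class A and class B GPS networks):
GAMIT/GLOBK (Massachusetts Institute of Technology, MIT, USA); GIPSY/OASIS (Jet Propulsion Laboratory, JPL, USA); Bernese (University of Bern, Switzerland)
In China, GAMIT/GLOBK is the main choice, typically as GAMIT + CosaGPS.

3. Installing GAMIT/GLOBK
I. Edit Makefile.config
II. Make the install script executable: chmod +x install_software
III. Run ./install_software (the defaults are fine)

4. GAMIT/GLOBK processing workflow
IV. Prepare the setup files (sittbl.)
Station accuracy control: tight constraints on high-accuracy known coordinates, loose constraints on the coordinates of points to be determined.
V. Baseline processing
sh_gamit -s 2008 275 279 -expt c008 -noftp > sh_gamit.log
Options:
-d yr days: the days to process, given as a year followed by days of year, e.g. 1997 153 156 178 (process days 153, 156 and 178 of 1997)
-s yr d1 d2: d1 and d2 are the first and last day of year to process, e.g. 1997 153 178 (process days 153 through 178 of 1997)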
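
The -d and -s options of sh_gamit take a year and days of year rather than calendar dates. A small helper for converting between the two can reduce mistakes when preparing a run. This is an illustrative sketch only; the function names are hypothetical and not part of GAMIT.

```python
from datetime import date, timedelta
from typing import List


def doy_to_date(year: int, doy: int) -> date:
    """Convert a year and day of year (1-based) to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


def date_to_doy(d: date) -> int:
    """Convert a calendar date to its day of year (1-based)."""
    return d.timetuple().tm_yday


def expand_range(year: int, d1: int, d2: int) -> List[date]:
    """List the calendar dates covered by an '-s yr d1 d2' style range."""
    return [doy_to_date(year, doy) for doy in range(d1, d2 + 1)]


# The range used in the sh_gamit example: -s 2008 275 279
for day in expand_range(2008, 275, 279):
    print(date_to_doy(day), day.isoformat())
```

Because 2008 is a leap year, day 275 falls on 1 October; in a non-leap year the same day of year is 2 October.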

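6 GATK4 Complete Workflow

1. Data preparation

Before analysing data with GATK4, the genomic data must be prepared.

This includes the sequencing data (raw reads, or BAM files aligned to the reference genome), the reference genome sequence file and the annotation files.

2. Quality control

Quality control is a key step in the analysis.

Filter the sequencing data before variant calling: discard low-quality reads and trim primer and adapter sequences, so that the downstream analysis is reliable.

3. Variant detection

GATK4 provides several tools for detecting different types of variants, including single-nucleotide variants (SNVs), insertions and deletions (indels), structural variants and copy-number variants.

With these tools, variants are called from the sequencing data against the reference genome, producing the corresponding variant files.

4. Variant filtering

After variant detection, the called sites must be filtered to remove likely false positives and low-quality calls.

GATK4 provides a set of tools for variant filtering. Sites can be screened on several criteria, such as coverage, quality score and frequency, to obtain a high-confidence variant set.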
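
The kind of threshold-based screening described in step 4 can be illustrated with a toy filter. This is a sketch only and does not call GATK; the `Variant` fields and the threshold values are hypothetical choices for illustration, not GATK defaults.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    qual: float         # call quality score
    depth: int          # read coverage at the site
    allele_freq: float  # fraction of reads supporting the variant allele


def passes(v: Variant, min_qual: float = 30.0, min_depth: int = 10,
           min_allele_freq: float = 0.05) -> bool:
    """Keep a site only if it meets every threshold."""
    return (v.qual >= min_qual
            and v.depth >= min_depth
            and v.allele_freq >= min_allele_freq)


def filter_variants(variants: List[Variant], **thresholds) -> List[Variant]:
    """Return the sites that pass all thresholds, in their original order."""
    return [v for v in variants if passes(v, **thresholds)]


calls = [
    Variant("chr1", 1000, qual=55.0, depth=40, allele_freq=0.48),
    Variant("chr1", 2000, qual=12.0, depth=35, allele_freq=0.50),  # low quality
    Variant("chr2", 3000, qual=60.0, depth=4, allele_freq=0.50),   # low coverage
]
for v in filter_variants(calls):
    print(v.chrom, v.pos)
```

In practice the thresholds depend on the data type and sequencing depth, so they are passed in as parameters rather than fixed.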

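5. Variant annotation

Linking the filtered variants to known genome annotation gives detailed information about each variant: its type, its position in the genome, and the genes and functions it affects.

GATK4 provides an annotation tool (Funcotator) that annotates variants against genome annotation files and helps researchers understand their biological significance.

6. Interpretation of results

The last step is to interpret the variant sites.

Using annotation tools and databases, the function and pathogenicity of the variants can be analysed further.

For example, from the functional annotation of a site one can predict its effect on a gene's coding sequence, regulatory elements or non-coding RNA, and from there infer the functional impact of the variant and its possible relevance to disease.

Summary: GATK4 provides a comprehensive set of tools and algorithms for analysing genomic data.

The workflow covers data preparation, quality control, variant detection, variant filtering, variant annotation and interpretation of results.

With GATK4, researchers can analyse and interpret genomic data in depth, revealing relationships between genetic variation and disease and supporting life-science research and clinical diagnosis.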

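Mutect2 usage

Mutect2 is a widely used tool for analysing tumour genomes. It detects somatic single-nucleotide variants (SNVs) and small insertions/deletions (indels), typically by comparing a tumour sample with a matched normal sample.

This article describes how to use Mutect2.

I. Prerequisites

Prepare the following files before using Mutect2:
1. The reference genome sequence (FASTA format)
2. Sequencing data for the matched normal tissue and tumour tissue of the sample (BAM format)

II. Installation and environment setup

1. Install GATK. Mutect2 is distributed as part of the GATK package.

Download the latest release from the official website and install it.

2. Mutect2 itself needs no separate installation; releases of GATK, including Mutect2, are also available from the GATK GitHub page.

Installation is similar to other packages; just follow the instructions.

3. Configure environment variables. So that the command-line tool can be found, add the GATK directory to the system path.

This can be done by editing .bashrc or .bash_profile.

III. Running Mutect2

1. Prepare the input files. Before running Mutect2, the reference sequence and sample data must be in the form GATK expects: the reference FASTA needs an index (.fai) and a sequence dictionary (.dict), and the BAM files must be coordinate-sorted, indexed and carry read groups.

Where data has to be converted between FASTQ and unaligned BAM, the SamToFastq and FastqToSam commands in the Picard toolkit can be used.

2. Run Mutect2. The command has the following form:

```
gatk Mutect2 \
  -R reference.fasta \
  -I tumor.bam \
  -I normal.bam \
  -normal normal_sample_name \
  -O output.vcf
```

Here -R gives the reference genome sequence file, -I gives the sequencing data for the tumour and the matched normal tissue, -normal gives the sample name (the SM read-group value) of the normal BAM, and -O gives the output file name.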
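
When the same call has to be repeated for many tumour/normal pairs, it can help to assemble the argument list in a script. The sketch below only builds and prints the command; it does not run GATK. The function name is hypothetical, the file names are the placeholders from the example above, and `-normal` is the GATK4 option that names which input sample is the matched normal.

```python
import shlex
from typing import List


def mutect2_command(reference: str, tumor_bam: str, normal_bam: str,
                    normal_sample: str, output_vcf: str) -> List[str]:
    """Build the argument list for a tumour/normal Mutect2 call."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-I", normal_bam,
        "-normal", normal_sample,  # SM value of the normal BAM's read group
        "-O", output_vcf,
    ]


cmd = mutect2_command("reference.fasta", "tumor.bam", "normal.bam",
                      "normal_sample_name", "output.vcf")
# Print a shell-safe version; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems when paths contain spaces.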

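3. Parse the output file. Mutect2 writes a VCF file containing the candidate somatic SNVs and indels.

The calls should then be filtered; GATK provides FilterMutectCalls for Mutect2 output, and the general-purpose VariantFiltration tool for additional hard filters.

IV. Conclusion

With the steps above, Mutect2 can be used to analyse tumour genome data and obtain the somatic SNVs and indels of the sample.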

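Introduction to the GLOBK Adjustment Software and Automated Processing with GAMIT/GLOBK

❖ GLOBK can now accept results output by other GPS data processing software such as GIPSY and Bernese, as well as classical geodetic and SLR observation data.
2020/6/7, School of Geodesy and Geomatics, Wuhan University
II. GLOBK software modules
❖ The GLOBK software modules fall into four categories:
✓ Format conversion module (htoglb) ✓ Computation modules (GLRED, GLOBK and GLORG) ✓ GMT plotting module ✓ Other auxiliary modules
❖ Format conversion module (htoglb)
➢ This module converts the solution files of GPS, VLBI, SLR and other analysis software into the binary H-files required by GLOBK.
➢ The following file types are currently supported:
✓ GAMIT H-files;
✓ SINEX files from GPS (or other space-geodetic techniques);
✓ …
❖ GLOBK adjustment result files:
✓ Unconstrained adjustment results: the prt file, e.g. globk_expt_yrdoy.prt
✓ Constrained adjustment results: the org file, e.g. globk_expt_yrdoy.org
❖ Points to note when running GLOBK:
✓ GLOBK is based on a linear model; when the corrections to station coordinates or orbital parameters are large, for example:
❖ Generating binary H-files
✓ Copy the svnav.dat file into the soln folder;
✓ Copy the H-files produced by GAMIT (ASCII text format) into the glbf folder;
❖ Using the combined H-file, run glred/glorg again to obtain the time series;
❖ Run globk/glorg to obtain the station velocities.
V. Steps for network adjustment with GLOBK
❖ First create a working directory, e.g. globk_test;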


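Detailed Guide to Using GATK

I. Things to know before using GATK

(1) GATK has been tested mainly on human whole-genome and exome sequencing data, all in Illumina format. No analysis methods are yet provided for other data formats (e.g. Ion Torrent) or other experimental designs (e.g. RNA-Seq).

(2) GATK is software for frontier research and is continually updated and corrected, so it is best to download the latest version when calling variants. The current version is 2.8.1 (2014-02-25).

Download site:

(3) Some steps of the GATK workflow (see the figure below) require known variant information. GATK provides known variants for human only; they can be downloaded from the GATK FTP site (GATK resource bundle).

For non-human genomes the known-variant set must be built by the user; GATK documents the procedure in detail.

(4) During BQSR and VQSR, GATK uses R to draw some plots, so before running GATK check that R and the required packages are installed. The packages needed include ggplot2, gplots, bitops, caTools, colorspace, gdata, gsalib, reshape and RColorBrewer.

If plotting fails, the error message names the package that needs to be installed.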

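II. The GATK workflow

GATK best practice has three major stages: raw data processing --> variant calling --> preliminary analysis.

Raw data processing

1. Filter and align (map) the raw FASTQ files

For Illumina data, bwa is recommended for mapping.

The bwa alignment steps are roughly as follows:

(1) Build an index for the reference genome.
Example: bwa index -a bwtsw hg19.fa

Note on building the index: bwa offers two indexing algorithms, both based on the BWT, selected with -a is and -a bwtsw.

-a bwtsw does not work for short reference sequences; the reference must be at least 10 Mb. -a is is the default; it is not suitable for large references, which must be no larger than 2 GB.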
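
The size rule for the two indexing algorithms can be turned into a small helper that picks the value for -a from the reference length. This is a sketch under stated assumptions: the function names are hypothetical, the limits are taken from the text above and read as 10 million and 2 billion bases, and the default `is` is chosen whenever both algorithms would work.

```python
from typing import Iterable

TEN_MB = 10_000_000       # lower limit for bwtsw, as stated in the text
TWO_GB = 2_000_000_000    # upper limit for is, as stated in the text


def count_bases(fasta_lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA lines, skipping '>' header lines."""
    return sum(len(line.strip()) for line in fasta_lines
               if not line.startswith(">"))


def choose_bwa_index_algorithm(reference_bases: int) -> str:
    """Return the value to pass to 'bwa index -a'."""
    if reference_bases > TWO_GB:
        return "bwtsw"    # 'is' cannot handle references this large
    # Below 10 Mb only 'is' works; in between, keep the default 'is'.
    return "is"


print(count_bases([">chr1", "ACGT", "AC"]))
print(choose_bwa_index_algorithm(3_100_000_000))  # roughly human genome sized
```

For a human reference such as hg19 the helper selects bwtsw, which matches the example command above.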

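(2) Find the SA coordinates of the input reads.

For paired-end data each reads file is processed separately; single-end data has only one file.

paired end:
bwa aln -t 4 -I hg19.fa read1.fq.gz > read1.fq.gz.sai
bwa aln -t 4 -I hg19.fa read2.fq.gz > read2.fq.gz.sai
single end:
bwa aln -l 30 -k 2 -t 4 -I hg19.fa read.fq.gz > read.fq.gz.sai

Main options:

-o int: maximum number of gap opens allowed.

-e int: maximum number of gap extensions allowed, which limits how long each gap can grow.

-d int: disallow a long deletion within int bp of the 3' end.

-i int: disallow an indel within int bp of either end of the read.

-l int: use the first int bases of the read as the seed. If the seed is set longer than the read, seeding is disabled. A value of 25-35 is recommended, used together with -k 2.

-k int: maximum edit distance in the seed; the default is 2. Used together with -l.

-t int: number of threads to use.

-R int: affects paired-end mapping only. When a read has no more than this many equally best hits, bwa goes on to consider suboptimal alignments.

Increasing this value improves pairing accuracy at the cost of run time, especially for short reads (around 32 bp).

-I: the input is in Illumina 1.3+ format. This option takes no value.

-B int: set the barcode length.

The first int bases from the 5' end are taken as the barcode. When -B is positive, the barcode of each read is trimmed before mapping and written to the BC SAM tag; for paired-end data the barcodes from both ends are concatenated.

-b: the input is in BAM format.

This is a rather unusual feature: it re-aligns the reads in a BAM file produced by other software, for example:
bwa aln -b hg19.fa read.bam > read.fq.gz.sai

(3) Generate the alignment file in SAM format.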

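If a read aligns to several positions, one of them is chosen at random.

Example, single end:
bwa samse hg19.fa read.fq.gz.sai read.fq.gz > read.fq.gz.sam
Options:
-n int: if a read has more than int hits, the XA tag is not written.

-r str: define the read group header line, e.g.

'@RG\tID:foo\tSM:bar'. If the header is not defined at this step, it has to be added later for the GATK analysis.

paired end:
bwa sampe -a 500 hg19.fa read1.fq.gz.sai read2.fq.gz.sai read1.fq.gz read2.fq.gz > read.sam
Options:
-a int: maximum insert size.

-o int: maximum number of occurrences of a read for it to be used in pairing. A read with more occurrences is treated as single end.

Lowering this value speeds up the run; for reads shorter than 30 bp a lower -o is recommended.

-r str: define the read group header line.

Same as for single end.

-n int: maximum number of alignments written to the output for each read pair.

From the resulting SAM file, extract the records that aligned (awk is enough); the result can be used directly in the GATK analysis.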
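
Extracting the aligned records amounts to dropping every record whose FLAG has the "segment unmapped" bit (0x4) set, as defined in the SAM specification. The source suggests awk; the sketch below shows the same idea in Python for illustration, with hypothetical function names and two made-up records.

```python
from typing import Iterable, Iterator

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_mapped(sam_record: str) -> bool:
    """Return True if the record's FLAG (second column) lacks the 0x4 bit."""
    flag = int(sam_record.rstrip("\n").split("\t")[1])
    return not (flag & UNMAPPED)


def keep_mapped(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only those records that are mapped."""
    for line in lines:
        if line.startswith("@") or is_mapped(line):
            yield line


sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
for line in keep_mapped(sam):
    print(line)
```

Header lines are kept so that the filtered file remains a valid input for the later Picard and GATK steps.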

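Note: in downstream SNP calling, GATK calls SNPs chromosome by chromosome.

When preparing the raw SAM file you can therefore split it by chromosome first, which speeds up processing.

Be aware, however, that when there is too little data per file this may affect the later VQSR analysis.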

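2. Reorder the SAM file

The SAM file produced by BWA is sorted lexicographically (chr10, chr11…chr19, chr1, chr20…chr22, chr2, chr3…chrM, chrX, chrY), whereas GATK calls SNPs in karyotypic order (chrM, chr1, chr2…chr22, chrX, chrY), so the original SAM file has to be reordered.

This can be done with ReorderSam from picard-tools.

e.g.
java -jar picard-tools-1.96/ReorderSam.jar I=hg19.sam O=hg19.reorder_00.sam REFERENCE=hg19.fa
Notes:
1) The header for this step can be added by hand, but make sure that every sequence listed in the header also has corresponding records below it.

Although the GATK website says chrM may be placed either first or last, putting chrM last can cause errors.

2) Build the index of the reference sequence before reordering,

e.g. samtools faidx hg19.fa.

The resulting index file is hg19.fa.fai.

3) If the large file was split into smaller files in the previous step, the header can be added by hand and this step run afterwards.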
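
The difference between the two orderings in this step is easy to see with a custom sort key. The sketch below is illustrative only; ReorderSam does the real work by following the order of the reference file. The key function is hypothetical and handles just the hg19-style names used in the text.

```python
from typing import List, Tuple


def karyotypic_key(name: str) -> Tuple[int, int]:
    """Sort key giving chrM, chr1..chr22, chrX, chrY order."""
    core = name[3:] if name.startswith("chr") else name
    if core in ("M", "MT"):
        return (0, 0)          # mitochondrial genome first
    if core.isdigit():
        return (1, int(core))  # autosomes in numeric order
    if core == "X":
        return (2, 0)
    if core == "Y":
        return (2, 1)
    return (3, 0)              # anything else keeps its relative order at the end


contigs: List[str] = ["chr10", "chr1", "chrY", "chr2", "chrM", "chrX"]
print(sorted(contigs))                      # plain string (lexicographic) sort
print(sorted(contigs, key=karyotypic_key))  # karyotypic sort
```

A plain string sort places chr10 before chr2, which is exactly the mismatch that makes the reorder step necessary.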

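3. Convert the SAM file to BAM (BAM is a binary format and faster to process)

This step can be done with samtools view.

e.g. samtools view -bS hg19.reorder_00.sam -o hg19.sam_01.bam

4. Sort the BAM file

This step sorts the records on each chromosome by coordinate, from smallest to largest.

It can be done with SortSam from picard-tools.

e.g.
java -jar picard-tools-1.96/SortSam.jar INPUT=hg19.sam_01.bam OUTPUT=hg19.sam.sort_02.bam SORT_ORDER=coordinate

5. Add a header (read group) to the BAM file

GATK 2.0 and later no longer support variant calling on files without a header.

The header can be added during BWA alignment with the -r option.

If -r was not used during BWA alignment, add this step.

It can be done with AddOrReplaceReadGroups from picard-tools.

e.g.
java -jar picard-tools-1.96/AddOrReplaceReadGroups.jar I=hg19.sam.sort_02.bam O=hg19.reorder.sort.addhead_03.bam ID=hg19ID LB=hg19ID PL=illumina PU=hg19PU SM=hg19
ID: ID of the input read group; LB: library name of the read group; PL: sequencing platform (illumina or solid); PU: platform unit (name of the run); SM: sample name.

Note: avoid adding the header by hand at this step. The author tried several times; the hand-written header looked identical to the one produced by the software, yet the program would not run.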
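
For reference, a read group header is a single `@RG` line whose `TAG:VALUE` fields are separated by tab characters, as required by the SAM specification. The sketch below builds such a line from the fields listed above; the function names are hypothetical, and the values are the ones used in the examples in this guide.

```python
from typing import Optional


def read_group_line(rg_id: str, sample: str, library: str,
                    platform: str = "illumina",
                    platform_unit: Optional[str] = None) -> str:
    """Build a SAM @RG header line with tab-separated TAG:VALUE fields."""
    fields = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    if platform_unit is not None:
        fields.append(("PU", platform_unit))
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def escape_tabs(line: str) -> str:
    """Replace real tabs with the two characters '\\t', as written for bwa's -r option."""
    return line.replace("\t", "\\t")


line = read_group_line("hg19ID", "hg19", "hg19ID", platform_unit="hg19PU")
print(escape_tabs(line))
```

Generating the line programmatically guarantees real tab separators, which are easy to lose when a header is typed or pasted by hand.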

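6. Merge

If one sample was sequenced across several lanes, the BAM files of the individual lanes can be merged before the next step.

e.g.
java -jar picard-tools-1.70/MergeSamFiles.jar INPUT=lane1.bam INPUT=lane2.bam INPUT=lane3.bam INPUT=lane4.bam … INPUT=lane8.bam OUTPUT=sample.bam

7. Duplicates Marking

During library preparation, PCR amplification introduces some bias: certain sequences are over-amplified.

At the alignment stage, these identical over-amplified sequences map to the same position in the genome.

Such over-amplified reads are not independent observations of the genome and cannot serve as evidence for variant calling, so the duplicates created by PCR amplification should be removed as far as possible. This step can be done with picard-tools.

De-duplication sets a flag on these sequences to mark them, so that GATK can recognise them.

Alternatively, REMOVE_DUPLICATES=true discards the duplicated sequences.

Whether duplicates are marked or removed should make no real difference to the result; the example in the official GATK workflow only marks them and does not remove them.

Duplicates are defined here as follows: if two reads have the same length and align to the same position in the genome, they are considered to have arisen from PCR amplification and are marked.

e.g.
java -jar picard-tools-1.96/MarkDuplicates.jar REMOVE_DUPLICATES=false MAX_… INPUT=hg19.reorder.sort.addhead_03.bam OUTPUT=hg19.reorder.sort.addhead.dedup_04.bam METRICS_…
Note: dedup only needs to be done at the library level. For example, if several libraries were built for one sample, run dedup on each library; there is no need to merge all libraries into one sample before running dedup.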
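
The duplicate definition given above (same length, same mapped position) can be sketched as a toy marking routine. This is a simplified illustration of that definition only; Picard's MarkDuplicates applies more elaborate rules and chooses which read of a group to keep unmarked. The `Read` type and function name are hypothetical; 0x400 is the SAM FLAG bit for "PCR or optical duplicate".

```python
from typing import List, NamedTuple

DUPLICATE = 0x400  # SAM FLAG bit: PCR or optical duplicate


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int
    length: int
    flag: int


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Flag every read after the first that shares chromosome, position and length."""
    seen = set()
    marked = []
    for read in reads:
        key = (read.chrom, read.pos, read.length)
        flag = read.flag | DUPLICATE if key in seen else read.flag
        seen.add(key)
        marked.append(read._replace(flag=flag))
    return marked


reads = [
    Read("a", "chr1", 100, 50, 0),
    Read("b", "chr1", 100, 50, 0),   # same position and length as "a"
    Read("c", "chr1", 100, 49, 0),   # same position, different length
    Read("d", "chr2", 100, 50, 16),
]
for r in mark_duplicates(reads):
    print(r.name, r.flag, bool(r.flag & DUPLICATE))
```

Marking rather than deleting keeps every read in the file, which corresponds to running with REMOVE_DUPLICATES=false.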