Nucleotide BLAST

Nucleotide BLAST programs:

• blastn: nucleotide sequence alignment.

• MegaBLAST: searches a batch of EST sequences, long cDNA sequences, or genomic sequences.

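BLAST (Basic Local Alignment Search Tool) is a tool for aligning nucleotide and protein sequences. The BLAST web page provides the BLAST programs, an overview, usage instructions, and answers to frequently asked questions (URL: https://www.360docs.net/doc/2c9826088.html,/BLAST/).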

BLAST Program Selection Guide：

https://www.360docs.net/doc/2c9826088.html,/blast/producttable.shtml#tab31

When you run a BLASTn search, the system offers three program options: Highly similar sequences (megablast), More dissimilar sequences (discontiguous megablast), and Somewhat similar sequences (blastn).

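The first option, megablast, compares highly similar DNA sequences. The best way to identify an unknown DNA sequence is to check whether it already exists in a public database. Megablast is a sequence comparison tool designed specifically for long sequence fragments that are highly similar (95% similarity or higher). In addition to the Expect value field for alignment significance, Megablast provides a percent identity field. When comparing sequences, you can adjust both parameters together to optimize the search results.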

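The second option, discontiguous megablast, is generally chosen when the sequences differ more than megablast can handle. The basic principle of the algorithm is to split the query sequence into small fragments, called words, and compare each word against the database sequences. When a word matches exactly, it is used as a seed and extended in both directions to obtain similar sequences that meet the search criteria. The words used by discontiguous megablast are discontiguous, which makes its search sensitivity the highest of the three programs. The template type option has three settings: coding (0), non-coding (1), and both (2). In coding mode, following the wobble principle for the third codon base, the third base is ignored and not compared as long as the first and second bases match exactly. For the same word length, discontiguous megablast is more sensitive than blastn.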
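The coding-mode idea described above can be illustrated with a minimal Python sketch. The template string `110110110` and the function name `template_match` are assumptions made for illustration only; they are not the templates or code that NCBI uses. A `1` marks a position that must match exactly and a `0` marks a position that is ignored.

```python
def template_match(query_word: str, subject_word: str, template: str) -> bool:
    """Return True if both words agree at every position marked '1' in the template."""
    if not (len(query_word) == len(subject_word) == len(template)):
        raise ValueError("words and template must have the same length")
    return all(
        q == s
        for q, s, t in zip(query_word, subject_word, template)
        if t == "1"
    )


# Hypothetical coding-style template: compare codon positions 1 and 2, skip position 3.
CODING_TEMPLATE = "110110110"

query = "GCTGAAAAA"    # codons GCT GAA AAA
subject = "GCCGAGAAG"  # codons GCC GAG AAG (same amino acids, different third bases)

print(template_match(query, subject, CODING_TEMPLATE))  # True
print(template_match(query, subject, "1" * 9))          # False: a contiguous word needs all 9 bases
```

The two words differ only at third codon positions, so the discontiguous template still produces a seed where a contiguous word of the same span would not.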

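The third option, Somewhat similar sequences (blastn), can compare sequences whose similarity is very low. It uses the same algorithm as discontiguous megablast, except that its words are contiguous. Blastn words are shorter than megablast words, so blastn is more sensitive than megablast but runs more slowly.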

Note: word size is one of the main parameters affecting BLAST sensitivity, and its value should be chosen case by case.

NCBI BLASTn:

https://www.360docs.net/doc/2c9826088.html,/public_documents/vibe/details/NcbiBlastn.html

Standard nucleotide-nucleotide BLAST

Takes nucleotide sequences and compares them against the NCBI nucleotide databases. It is better at finding sequences that are similar, but not identical, to your query.

The BLAST nucleotide algorithm finds similar sequences by generating an indexed table or dictionary of short subsequences called words for both the query and the database. The program can then rapidly find initial exact matches to the query words by simply looking up a particular word in the database dictionary. These initial matches serve as starting points for longer alignments that are generated in several steps, ending with a final gapped alignment.
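The word-lookup step can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it indexes only one subject sequence rather than a whole database, skips the extension steps, and the function names `build_word_index` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence: str, word_size: int) -> dict:
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query: str, subject: str, word_size: int) -> list:
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    index = build_word_index(subject, word_size)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            seeds.append((q, s))
    return seeds


subject = "TTACGTACGGA"
query = "CGTACG"
print(find_seeds(query, subject, word_size=4))
# [(0, 3), (1, 4), (2, 1), (2, 5)]
```

Each pair is an initial exact match that a real search would then try to extend into a longer alignment.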

One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words (word size). The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms since the initial exact match can be shorter. The word size is adjustable in blastn and can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity. This word size can also be increased to increase the search speed and limit the number of database hits.

Search for short and near exact matches

This search is useful for primers or short nucleotide motifs. Short sequences (fewer than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons are that the significance threshold governed by the expect value parameter is set too stringently and the default word size parameter is set too high. You can adjust both the word size and the expect value in the parameter table to work with short sequences.

A common use of this is to check the specificity of primers used in the polymerase chain reaction (PCR) or hybridization. A useful way to check a pair of PCR primers is to concatenate them and search them as one sequence. The forward primer and the reverse primer can simply be pasted together with a string of ten or more N's between the two sequences. Since BLAST looks for local alignments and searches both strands, there is no need to reverse complement one of the primers before doing the concatenation or the search.
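The concatenation step can be done by hand, or with a small helper such as the sketch below. The function name `join_primers` and the two primer sequences are made up for illustration and do not correspond to any real assay.

```python
def join_primers(forward: str, reverse: str, spacer_length: int = 10) -> str:
    """Join two primers with a run of N's so they can be searched as one query."""
    if spacer_length < 10:
        raise ValueError("use a spacer of at least ten N's")
    return forward.upper() + "N" * spacer_length + reverse.upper()


# Hypothetical primer pair; neither primer is reverse complemented.
forward_primer = "AGCTGACCTGAGGAGAC"
reverse_primer = "TTGCCATCAGGAACCTG"

query = join_primers(forward_primer, reverse_primer)
print(">primer_pair")
print(query)
```

The printed FASTA record can be pasted into the query box as a single sequence.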

Notes

Nucleotide-nucleotide searches are not the recommended way to find homologous protein coding regions in other organisms. It is better to perform searches at the protein level, either with translations of the nucleotide sequences or by direct protein-protein BLAST. This is because of the degeneracy of the genetic code, the greater information available in amino acid sequence, and the more sophisticated algorithm in protein-protein BLAST.

The query sequence should contain no ambiguous bases. Consensus motifs with degenerate bases will not work for this type of search.

Parameters Setting

[COMPOSITION_BASED_STATISTICS] Run the search with the tweak parameter set to true. This automatically performs a gapped alignment, so also setting UNGAPPED_ALIGNMENT is unnecessary and will trigger a warning message from NCBI rather than generating results.

• Value: yes, no

• Default: no

[DATABASE] Valid database name

• Value: see nucleotide databases

• Default: nr

[EXPECT] The statistical significance threshold (expectation value) for reporting matches. If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower thresholds are more stringent, leading to fewer chance matches being reported.

• Value: double type value

• Default: 10.0

[ENTREZ_QUERY] Entrez query to limit the BLAST search

• Value: Entrez query format

• Default: empty

[FILTER] Sequence filter identifier

• L for Low Complexity

• R for Human Repeats

• m for Mask for Lookup

[GAP_OPEN_COSTS] Gap open costs

• Value: integer values

• Default: 5 for nuc-nuc, 11 for proteins, non-affine for megablast

[GAP_EXTEND_COSTS] Gap extend costs

• Value: space separated float values

• Default: 2 for nuc-nuc, 1 for proteins, non-affine for megablast

[HITLIST_SIZE] Number of hits to keep

• Value: integer value

• Default: 20

[LCASE_MASK] Enable masking of lower case in query

• Value: yes, no

• Default: no

[NUCL_PENALTY] Penalty for a nucleotide mismatch (blastn only)

• Value: negative integer value

• Default: -3

[NUCL_REWARD] Reward for a nucleotide match (blastn only)

• Value: integer value

• Default: 1

OTHER_ADVANCED

*[DROPOFF] Dropoff (X) for BLAST extensions in bits (default if zero), not applicable for megablast

• Value: integer value

• Default: 20 for nuc-nuc, 7 for other programs

*[FINAL_X_DROPOFF] Final X dropoff value for gapped alignment (in bits), not applicable for megablast

• Value: integer value

• Default: 50 for nuc-nuc (blastn), 25 for other programs

*[DB_LENGTH] Effective length of the database (use zero for real size)

• Value: real value

• Default: 0

[PROGRAM] BLAST program name

• Value: blastn, blastp, blastx, tblastn, tblastx

• Default: blastn

[QUERY_BELIEVE_DEFLINE] Whether to believe the defline in the FASTA query

• Value: yes, no

• Default: no

[QUERY_FROM] Start of subsequence (one-based offset)

• Value: integer value

• Default: 0

[QUERY_TO] End of subsequence (one-based offset)

• Value: integer value

• Default: 0, meaning no subsequence is used

[SEARCHSP_EFF] Effective length of the search space

• Value: integer value

• Default: 0

[SERVICE] BLAST service to be performed

• Value: plain, psi, phi, rpsblast, megablast

• Default: plain

[THRESHOLD] Threshold for extending hits

• Value: integer value

• Default: ???

[UNGAPPED_ALIGNMENT] Whether to perform an ungapped alignment. Do not set this parameter to TRUE or YES when using COMPOSITION_BASED_STATISTICS, since that setting automatically performs a gapped alignment; if this parameter is on, it will trigger a warning message from NCBI rather than generating results.

• Value: yes, no

• Default: no

[WORD_SIZE] The search word size

• Value: integer value; 2 or 3 for proteins, 7 or greater for nuc

• Default: 3 for proteins, 11 for nuc-nuc, 28 for megablast
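As a rough illustration of how these settings fit together, the sketch below collects the nucleotide-nucleotide defaults listed above into a Python dictionary and checks a few of the stated value ranges before a search is submitted. The dictionary layout, the underscore spelling of the keys, and the `check_settings` helper are assumptions for illustration; this is not an NCBI client and it sends nothing to any server.

```python
# Defaults copied from the parameter list above (nucleotide-nucleotide search).
NUCLEOTIDE_DEFAULTS = {
    "PROGRAM": "blastn",
    "DATABASE": "nr",
    "EXPECT": 10.0,
    "WORD_SIZE": 11,
    "NUCL_REWARD": 1,
    "NUCL_PENALTY": -3,
    "GAP_OPEN_COSTS": 5,
    "GAP_EXTEND_COSTS": 2,
    "HITLIST_SIZE": 20,
}

VALID_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def check_settings(overrides=None) -> dict:
    """Merge overrides into the defaults and check the documented value ranges."""
    settings = {**NUCLEOTIDE_DEFAULTS, **(overrides or {})}
    if settings["PROGRAM"] not in VALID_PROGRAMS:
        raise ValueError(f"unknown program: {settings['PROGRAM']}")
    if settings["WORD_SIZE"] < 7:
        raise ValueError("nucleotide word size must be 7 or greater")
    if settings["NUCL_PENALTY"] >= 0:
        raise ValueError("mismatch penalty must be a negative integer")
    return settings


# Looser settings for a short query: smaller word size, larger expect value.
print(check_settings({"WORD_SIZE": 7, "EXPECT": 1000.0}))
```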

https://www.360docs.net/doc/2c9826088.html,/blast/megablast.shtml

MEGABLAST Search

Mega BLAST uses a greedy algorithm [1] for the nucleotide sequence alignment search. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used (see explanation below), it is up to 10 times faster than more common sequence similarity programs. Mega BLAST can also efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm.

Default parameters

Word size.

Word size is roughly the minimal length of an identical match an alignment must contain if it is to be found by the algorithm. Mega BLAST is most efficient with word sizes of 16 and larger, although a word size as low as 8 can be used.

If the value W of the word size is divisible by 4, it guarantees that all perfect matches of length W + 3 will be found and extended by the Mega BLAST search. Perfect matches as short as W might also be found, although this is not guaranteed. Any value of W not divisible by 4 is equivalent to the nearest value divisible by 4 (with 4i+2 equivalent to 4i).
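The rounding rule can be written out as a short Python sketch. The function names are hypothetical, and the code only restates the rule given above: 4i+1 and 4i+2 round down to 4i, and 4i+3 rounds up to 4i+4.

```python
def effective_word_size(w: int) -> int:
    """Round W to the nearest multiple of 4, with 4i+2 treated as 4i."""
    remainder = w % 4
    if remainder == 3:
        return w + 1          # 4i+3 is nearest to 4i+4
    return w - remainder      # 4i, 4i+1 and 4i+2 all map to 4i


def guaranteed_match_length(w: int) -> int:
    """Length of a perfect match that is guaranteed to be found (W + 3)."""
    return effective_word_size(w) + 3


for w in (28, 29, 30, 31):
    print(w, effective_word_size(w), guaranteed_match_length(w))
# 28 28 31
# 29 28 31
# 30 28 31
# 31 32 35
```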

Gapping parameters

By default, non-affine gapping parameters are assumed. This means that the gap opening penalty is 0, and the gap extension penalty E can be computed from the match reward r and mismatch penalty q by the formula: E = r/2 - q. The non-affine version of Mega BLAST requires significantly less memory and is also significantly faster. Affine gapping parameters can also be used, preferably with larger word sizes. Non-affine gapping parameters tend to yield alignments with more gaps, but the gap lengths are shorter.
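A small worked example of the formula, as a Python sketch. It assumes that q is the signed mismatch score, entered as a negative number in the same way as the NUCL_PENALTY parameter, so that subtracting it yields a positive gap cost; the function name is hypothetical.

```python
from fractions import Fraction


def nonaffine_gap_extension(reward: int, penalty: int) -> Fraction:
    """Gap extension penalty E = r/2 - q, where q is the (negative) mismatch score."""
    return Fraction(reward, 2) - penalty


# With match reward r = 1 and mismatch penalty q = -3:
e = nonaffine_gap_extension(1, -3)
print(e, float(e))  # 7/2 3.5
```

Using `Fraction` keeps the half-integer result exact instead of relying on floating-point arithmetic.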

X-dropoff value

As in BLAST, this value provides a cutoff threshold for the extension algorithm tree exploration. When the score of a given branch drops below the current best score minus the X-dropoff, the exploration of this branch stops. However, the actual values of the X-dropoff for Mega BLAST and for traditional nucleotide BLAST algorithms are not necessarily compatible. That is, with the same word size, match, mismatch and gapping penalties and with the same X-dropoff, the two algorithms might produce different results, which can be remedied by changing the X-dropoff value for one of the algorithms.
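The stopping rule can be illustrated with a simplified ungapped extension in Python. This is only a sketch of the X-dropoff idea: it extends in one direction without gaps, works in raw score units rather than bits, and is neither the greedy Mega BLAST algorithm nor the traditional BLAST implementation. The function name and default values are assumptions.

```python
def extend_right(query, subject, q_start, s_start,
                 reward=1, penalty=-3, x_dropoff=20):
    """Extend rightwards without gaps; return (best_score, best_length)."""
    score = best_score = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        if query[q_start + length] == subject[s_start + length]:
            score += reward
        else:
            score += penalty
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_dropoff:
            break  # score fell below best score minus X-dropoff
    return best_score, best_length


# Eight matching bases followed by mismatches; a small X stops the extension early.
print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_dropoff=5))
# (8, 8)
```

A larger `x_dropoff` lets the extension continue through more mismatches before giving up, at the cost of exploring more of the sequence.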