用ClustalX做多序列比对

用ClustalX做多序列比对
用ClustalX做多序列比对

Using ClustalX for multiple sequence alignment

Jarno Tuimala

December 2004

All rights reserved. The PDF version of this leaflet or parts of it can be used in Finnish universities as course material, provided that this copyright notice is included. However, this publication may not be sold or included as part of other publications without permission of the publisher.

Index

Index (3)

Quick Start (4)

1. Open ClustalX (4)

2. Read in the FastA-formatted sequences (4)

3. Modify the output format option, if necessary (5)

4. Create an alignment (5)

Creating the input file for multiple sequence alignment (6)

Multiple alignment theory (7)

Getting the data into ClustalX (8)

Setting up the alignment parameters (9)

Pairwise alignment parameters (9)

Multiple alignment parameters (11)

Alignment output-format (12)

Creating the alignment (13)

Writing alignment as Postscript (14)

Assessing the quality of the alignment (15)

Advanced alignment strategies (16)

Advanced options (17)

Do alignment from the guide tree (17)

Profile alignment (17)

Using secondary structure information in the profile alignment (19)

Quick Start

In order to make a multiple sequence alignment using ClustalX, you should have your sequences in FastA format. If you do not know haw to do this, check the chapter “Creating the input file for multiple sequence alignment”.

1. Open ClustalX

After starting ClustalX, and you will see a window that looks something like the one below.

2. Read in the FastA-formatted sequences

Pull down the File-menu, and choose Load Sequences menu item. Navigate to the folder (subdirectory) that contains the input file (text-file containing the sequences in FastA-format), and choose that file. Sequences should appear in the ClustalX window.

The left pane (in the figure above) lists the sequences according to the name that follows “>” symbol in the input file. The right pane shows the beginning of each sequence. You can scroll to the right to see the rest of each sequence by using the scroll bar at the bottom of the pane.

3. Modify the output format option, if necessary

Before aligning the sequences, you should make sure the output format options (from menu Alignment -> output format options) are set correctly. If you’d like to continue with phylogenetic analysis using Phylip package, you should select PHYLIP format. Note, that you should always save the Clustal formatted sequence alignment, also. Here’s an example of the output format option settings:

4. Create an alignment

In order to make the actual alignment, select “Do complete alignment” from the menu Alignment. At that point ClustalX asks for output file names. Your sequence alignment is automatically saved in those files once the alignment is ready. After the alignment has been successfully calculated, a new view will appear, and it might look something like that:

Now that the alignment has been created, you can close ClustalX, and use the generated alignment files in other programs.

Creating the input file for multiple sequence alignment

Here, ClustalX is going to be used for sequence alignment. It, like any other computer program requires the data it manipulates (the input file) to be in a format it can recognize. You can use your favourite word processor to create the input file, but I use Notepad.

In the previous chapters, we pasted the found sequences into the text editor. Often the unedited files look like this:

gi|15146064|gb|AY040893.1| Homo sapiens individual VP37 mitochondrial control region GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCA CGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC ATCCTGTTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTAAAGTGTGTTAATTAATT AATGCTTGTAGGACATAATAATAACAATTG gi|15146065|gb|AY040894.1| Homo sapiens individual VP5 mitochondrial control region GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCA CGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATT

ClustalX can recognize several formats for the sequences, but we will use the FastA format, because that’s the one most easily downloaded from the databanks. The FastA format can be recognized, because the first line begins with the “>” character. This line contains the title of the sequence. The sequence will start from the next line. The “>” character should be followed by one word (only letters or numbers and _), which ClustalX will use as the name for the sequence in the multiple alignment that it creates. Clustal treats everything between “>” and the first space as the sequence name. I suggest you to save the original title, and just enter the new name (up to 10 characters but not more) for the sequence and one space after that. This way you can still save the Genbank accession number in the same file as the sequences. You will need the accession number, if you’re ever going to publish your results, so save them!

For the sake of clarity, we will put a blank line after the first sequence before the second sequence. The order of the sequences in the file is not important.

After these formatting procedures, the aforementioned sequences should look like this.

>hs_vp37 gi|15146064|gb|AY040893.1| Homo sapiens individual VP37 mitochondrial control GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCA CGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC ATCCTGTTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTAAAGTGTGTTAATTAATT AATGCTTGTAGGACATAATAATAACAATTG

>hs_vp5 gi|15146065|gb|AY040894.1| Homo sapiens individual VP5 mitochondrial control GGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCA CGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTC ATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATT

Save the sequences in this FastA-format as a plain text file (also known as ASCII). In the current Word version XP, the sequences should be saved as plain text, and then the encoding should be changed to MS-DOS in order to make sure that the file works in every system.

Multiple alignment theory

Dynamic programming can be used to align multiple sequences also. It creates an optimal alignment, but cannot be used for more than five or so sequences because of the calculation time. Therefore, progressive method of multiple sequence alignment is often applied.

Clustal performs a global-multiple sequence alignment by the progressive method. The steps include:

a) Perform pair-wise alignment of all the sequences by dynamic programming

b) Use the alignment scores to produce a phylogenetic tree by neighbor-joining

c) Align the multiple sequences sequentially, guided by the phylogenetic tree

Thus, the most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments to produce a multiple sequence alignment showing in each column the sequence variations among the sequences.

Sequence contributions to the multiple sequence alignment are weighted according to their relationships on the predicted evolutionary tree. Weights are based on the distance of each sequence from the root. The alignment scores between two positions of the multiple sequence alignment are then calculated using the resulting weights as multiplication factors.

As more sequences are added to the profile, gaps accumulate and influence the alignment of further sequences. Clustal calculates gaps in a novel way designed to place them between conserved domains. Gaps found in the initial alignments remain fixed. New gaps are then introduced into the multiple alignment when more sequences are added, but gaps can never be deleted, only added. Clustal also implements methods, which try to compensate for the scoring matrix (e.g., PAM), expected number of gaps, and differencies in sequence length.

Clustal has advanced options:

a) Add sequences with weight

b) Add weights to different sequence positions

c) Add a sequence or alignment to an alignment

d) Use user-defined tree for alignment

Some of these will be discussed in the next chapters.

The problem with progressive alignment is the dependence of the ultimate multiple sequence alignment on the initial pair-wise alignments. The very first sequences to be aligned are the most closely related on the sequence tree. If these sequences align very well, there will be few errors in the initial alignments. However, the more distantly related these sequences, the more errors will be made, and these errors will be propagated to the multiple sequence alignment. A second problem with the progressive alignment

method is the choise of suitable scoring matrices and gap penalties that apply to the set of sequences.

Getting the data into ClustalX

Start ClustalX, and you will see a window that looks something like the one below.

Pull down the File-menu, and choose Load Sequences menu item. Navigate to the folder (subdirectory) that contains the input file (text-file containing the sequences in FastA-format), and choose that file.

The left pane (in the figure above) lists the sequences according to the name that follows “>” symbol in the input file. The right pane shows the beginning of each sequence. You can scroll to the right to see the rest of each sequence by using the scroll bar at the bottom of the pane.

Setting up the alignment parameters

The alignment is done is several succeeding steps: (from Clustal documentation)

1. Reset All Gaps (Alignment->Alignment parameters, Edit->Remove all Gaps)

2. Refine Pairwise Alignment Parameters (Alignment->Alignment parameters)

3. Refine Multiple Alignment Parameters (Alignment->Alignment parameters)

4. Refine Output Format Options (Alignment->Output Format Options)

5. Write Alignment as Postscript (File->Write Alignment as Postscript)

6. Assess the quality of the alignment

a. Not satisfied -> Go to step 1.

b. Satisfied -> Refine the alignment by hand

Pairwise alignment parameters

In order to create the pairwise alignment, ClustalX needs to know what penalties to assign for the creation of a gap and for the extension of that gap. Choose Pairwise Alignment Parameters from the Alignment-menu. You will see a dialog box like the one below.

The first choise, Pairwise Alignments, allows you to choose between Slow-Accurate and Fast-Approximate methods. The Slow-method is preferred, but if you are aligning so many sequences or the sequences are so long that the program takes a long time to run,

you may want to use the Fast-method. The Fast-method uses a k-tuple method for pairwise alignment, whereas Slow-method uses a full dynamic programming algorith. The box shows the default values for Gap Opening and Gap Extension. Decreasing the gap penalties will allow the introduction of more gaps, and less mismatches. This may result is matches that do not reflect homology (identity by descent). Increasing gap penalties wil have an opposite effect, but may result in missing matches that actually are homologies.

Weight matrix parameters can be changed too. The IUB DNA weight matrix scores matches as 1.9 and mismatches as 0, except that it scores all X’s and N’s as matches to any IUB ambiguity symbols. The Protein Weight Matrices are equivalent to the same matrices used as evolutionary models in the production of the dendrogram. All the matrices have their strenghts and down-sides: PAM has been used for years, but is now somewhat outdated, and the Gonnet can be more appropriate for your purposes. BLOSUM seems to be best for searching databases.

You can create and load in your own matrix into ClustalX. For the description of the file format, take a look at the file matrices.h in the ClustalX-folder:

For amino acids:

char *amino_acid_order = "ABCDEFGHIKLMNPQRSTVWXYZ";

short blosum30mt[]={

4,

0, 5,

-3, -2, 17,

0, 5, -3, 9,

0, 0, 1, 1, 6,

-2, -3, -3, -5, -4, 10,

0, 0, -4, -1, -2, -3, 8,

-2, -2, -5, -2, 0, -3, -3, 14,

0, -2, -2, -4, -3, 0, -1, -2, 6,

0, 0, -3, 0, 2, -1, -1, -2, -2, 4,

-1, -1, 0, -1, -1, 2, -2, -1, 2, -2, 4,

1, -2, -2, -3, -1, -2, -2, 2, 1, 2, 2, 6,

0, 4, -1, 1, -1, -1, 0, -1, 0, 0, -2, 0, 8,

-1, -2, -3, -1, 1, -4, -1, 1, -3, 1, -3, -4, -3, 11,

1, -1, -2, -1, 2, -3, -2, 0, -2, 0, -2, -1, -1, 0, 8,

-1, -2, -2, -1, -1, -1, -2, -1, -3, 1, -2, 0, -2, -1, 3, 8,

1, 0, -2, 0, 0, -1, 0, -1, -1, 0, -2, -2, 0, -1, -1, -1, 4,

1, 0, -2, -1, -2, -2, -2, -2, 0, -1, 0, 0, 1, 0, 0, -3, 2, 5,

1, -2, -2, -2, -3, 1, -3, -3, 4, -2, 1, 0, -2, -4, -3, -1, -1, 1, 5,

-5, -5, -2, -4, -1, 1, 1, -5, -3, -2, -2, -3, -7, -3, -1, 0, -3, -5, -3, 20,

0, -1, -2, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, -1, 0, -1, 0, 0, 0, -2, -1,

-4, -3, -6, -1, -2, 3, -3, 0, -1, -1, 3, -1, -4, -2, -1, 0, -2, -1, 1, 5, -1, 9, 0, 0, 0, 0, 5, -4, -2, 0, -3, 1, -1, -1, -1, 0, 4, 0, -1, -1, -3, -1, 0,2,4}; /*

For the DNA:

char *nucleic_acid_order = "ABCDGHKMNRSTUVWXY";

short clustalvdnamt[]={

10,

0, 0,

0, 0, 10,

0, 0, 0, 0,

0, 0, 0, 0, 10,

0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

Multiple alignment parameters

Choose Multiple Alignment Parameters from the Alignment-menu. A new dialog box will appear.

The pairwise and multiple alignment options are set independently, because Clustal needs to know both. As have been already discussed, a matrix of pairwise alignments is first calculated, and on the basis of those distances, a multiple alignment is formed. Using different settings for these alignment steps gives us more flexibility to affect how the alignment is done.

Compared to pairwise alignment there are a couple of new settings. Delay Divergent Sequences determines how different two sequences must be in order for their alignment to be delayed. This tries to compensate for the bias introduced by progressive alignment method.

DNA transition weight can be modified. Weight 0 means that transitions are scored as mismatches, and weight 1 mean that transitions are given the same weight as transversions. For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score.

You don’t have to choose the individual Weight Matrix any more, because we know how dissimilar the sequences are by now. ClustalX will automatically choose the most appropriate matrix inside one matrix series. So, if you changed the matrix series to be used in the pairwise alignment options, be sure to change it here too.

Alignment output-format

The last thing to be modified before performing the alignment is the output-format. When ClustalX creates an alignment it writes the aligned sequences into a file. There are different multiple alignment formats, which might be needed depending on what program you have planned to use for further analyses.

For the phylogenetic construction using Phylip-package, choose the Phylip as an output format. You should also always remember to write the aligned sequences in the Clustal-format, because it is convenient for both publishing the alignment and refining the alignment in the Notepad. If you plan to publish the sequences, turn the sequence numbering on.

You can modify the output-format by selecting the Output Format Options menu item form the Aligment-menu.

Creating the alignment

Finally, it is time to create the alignment. Choose the Do Complete Alignment under the Alignment menu.

ClustalX tells you what it is doing, but it usually locks the computer, so do not plan to use the computer for any other purposes while the alignment is still under construction.

One word of warning, though. ClustalX does not understand spaces in the names of the folders. Therefore, for example, My Documents, cannot be used in the path.

After the alignment is done, the main window will be updated with the aligned sequences.

The bases are colored, which makes the assessment of alignment that much easier. The histogram (or line) below the ruler indicates the degree of similarity. Peaks indicate positions of high similarity, and valleys positions of low similarity.

The grey line just above the sequences is used to mark strongly conserved positions. The “*” character indicates positions that have been fully conserved.

Writing alignment as Postscript

It is possible to use the files we have just created to construct a phylogenetic tree, but the quality and the value of that tree will be no better than the quality of the alignment, and we have not yet considered quality.

It is also important to understand that no matter how dissimilar the sequences are, ClustalX will always produce an alignment. The mere existence of the alignment does not mean that the sequences are related. It is up to the user to ensure that the sequences in the data set are actually homologous and therefore alignable.

The quality of the alignment is easier to check if there exists hard copies of the sequences. You have two possibilities. You might print the ClustalX-format alignment, which exists in the text-format, and use that, but then you’ll lose the information of the colors.

The other possibility is to install Ghostscript and Ghostview programs, which allow the user to manipulate Postscript documents, and print the alignment as postscript, which includes the colors.

Go to File-emu and select Write Alignment as Postscript. A dialog box opens.

You have some options to change, but after you are satisfied with them, click OK. Open the postscript-file in Ghostview. The Ghostview-program might give you some warnings about the incorrect type of the postscript-file, but ignore them by clicking on the Cancel-button.

Now you can print this alignment by selecting the printer-button in the left and top corner of the window.

Assessing the quality of the alignment

In practice, many alignments are produced, and they are compared together. You can start by using the default ClustalX options for pairwise and multiple alignment options. In the next alignments try lowering and highering the gap options by 50%. Produce, say 10-20 different alignments, and then compare those together. Often you need to try more than 10 different settings in order to find the best alignment. What you also need to keep in mind is that alignment is not an absolute thing. It is a best guess according to some algorithm used by a computer program or by an experienced human eye. It is necessary for the user to carefully examine each alignment to see if it makes biological sense.

After printing the alignments you need to examine them to see if most of the gaps make sense. If many of the gaps seem to be arbitrary (i.e., you think you could have done better by eye), then you need to improve the alignment. If there are large regions that are present in only one or two sequences, you may need to delete those regions from the alignment. In practice, most of the programs used for phylogenetic tree reconstruction, will delete all the column containing any gaps from the analysis, so you don’t need to bother.

There are some guidelines on how to assess the quality. First remember, that we want to maximize similarity, and minimize dissimilarity. Therefore, the number of gaps is one parameter you should pay attention to. Use the histogram in the bottom of the alignment and the “*” characters above it to assess the conservation of different areas of the alignment.

It is also biologically unsound to assume that there are many gaps with equal spaces between them. Usually gaps are clustered, and are more common in certain areas than in other areas. So, look for the number and length of the conserved blocks of columns. If the pattern of the gaps looks like they have been randomly inserted, choose another alignment. This is, of course, assuming that the sequences are relatively closely related. If you are aligning an area where there are no functional genes, the above’s all you can do. If you have some knowledge on the functional regions of the sequence you are aligning, you can use this information when assessing the quality of the alignment.

The functional regions are often more or less conserved between the relatively closely related sequences. Therefore, quite a few gaps should be inserted into those areas, and most of the gaps should be inserted into less well conserved areas, for example, in the spacer regions between alfa-helixes.

Although it is very time consuming, attempting to improve the alignment through this process of examination and modification of penalties is probably the single most important thing you can do to ensure a high-quality alignment and make a high-quality phylogeny estimation possible.

Advanced alignment strategies

If you have very difficult sequences to align, you can try iterative alignment procedure in order to get a better estimate of the real alignment. First, produce an initial alignment with some quite closely fitting parameter values. Then, produce a new alignment from this initial alignment without removing gaps before this second alignment. You can iterate the alignment using the same settings, but doing the alignment based on the previous alignment multiple times. Sooner or later the alignment will stabilize, and will not change anymore with the same parameters. This is the best, “iterated”, alignment for those settings.

Normally, it is important to reset the gaps before producing the alignment with new settings, but this is not done with iterative alignment.

You can also realign only a part of the sequences. Hold down the left mouse button, and paint a selection. Then, from the Alignment menu select Realign Selected Residue Range. ClustalX will then do the alignment using the current setting only for the selected residues. You can also produce a new alignment for selected sequences (Realign Selected Sequences). Sequences can be selected by holding down the left mouse-button, and then dragging downwards in the sequence name list.

Advanced options

Do alignment from the guide tree

If you have some data on the relationships of the sequences, you can construct a tree in the Newick-format, and use that for producing a multiple sequence alignment. Produce a tree like the one below:

(

(

Pan_verus1:0.02428,

Pan_verus2:0.01474)

:0.03203,

(

Pan_velle1:0.00437,

Pan_velle2:0.00579)

:0.01402,

(

Pan_trog1:0.00306,

Pan_trog2:0.00306)

:0.05015);

Save the tree in a file in text-format.

In Clustal, select the Do Alignment From Guide Tree from the Alignment menu. This way you can try to get around the difficulties of the progressive alignment, which might create wrong alignments, if the sequences are very divergent. This can also be used to combine morphological data into the sequence alignment step: the knowledge of the relationships of the taxons can be used to guide the alignment process.

Profile alignment

Profile alignment is used for a couple of purposes, but we first discuss how to align a new sequence into an existing alignment.

There is a pull-down menu on the main window top left-hand corner. From the menu change to Profile Alignment Mode. The Alignment view window is now split into two parts. The upper part contains the alignment we just created, and lower is empty. The upper part is called “profile 1” and the lower part is “profile 2”.

In the File-menu you have the options to load profiles 1 and 2. Now we already have the profile one available, so we only need to load the profile 2. Let’s do that. The main window is updated, and look something like the one below.

From the alignment-menu, select first the option “Align Sequences to Profile 1”. After that select “Align Profile 2 to Profile 1”. This will create an alignment file, where all the sequences are together. Note that you should always first align the sequences in the profile 1 before aligning those (already aligned) sequences to the profile 2!

This is a very handy way to add new sequences into an existing alignment. Otherwise you would have needed to calculate the initial alignment again, which could have been very laborous in the case of many or very long sequences.

Using secondary structure information in the profile alignment

As has been shortly discussed above the sequence alignment can be guided by the secondary structure of the protein, if such information is available. Often such alignments are more biologically plausible than the alignments done “randomly”.

In ClustalX the gap penalties are raised at core alpha helix (A) or beta strand (B) residues. The structure information can be used only in the Profile Alignment Mode. These gap penalties cannot be used in the multiple alignment mode. There are two ways to include structure information in Clustal, but here we present only the easier one, which describes the domain areas of the protein. Then the penalties are adjusted in the ClustalX dialog box.

First we need to create the input files. In the first input file, which the first sequence of all the sequences to aligned, a descriptions of the domains (helix or strand) is included. The second input file contains the rest of the sequences in the FastA-format.

The information about the domains is most easily acquired from the SWISS-PROT descriptions. Find the relevant information from https://www.360docs.net/doc/8f14255885.html,/swissprot/, and after you have acquired the SRS results, click on the Accession Number link on the top of the page. This will take you to the plain text description.

From the description find the lines starting with two capital letters: ID, FT, SQ, and the sequence. Copy those lines into a text file (i.e., into Notepad) and save the file. It should now something like the one below.

ID XRC1_HUMAN STANDARD; PRT; 633 AA.

FT HELIX 315 403 BRCT 1.

FT HELIX 538 629 BRCT 2.

SQ SEQUENCE 633 AA; 69525 MW; 30CC2421345ABFC2 CRC64;

MPEIRLRHVV SCSSQDSTHC AENLLKADTY RKWRAAKAGE KTISVVLQLE KEEQIHSVDI GNDGSAFVEV LVGSSAGGAG EQDYEVLLVT SSFMSPSESR SGSNPNRVRM FGPDKLVRAA AEKRWDRVKI VCSQPYSKDS PFGLSFVRFH SPPDKDEAEA PSQKVTVTKL GQFRVKEEDE SANSLRPGAL FFSRINKTSP VTASDPAGPS YAAATLQASS AASSASPVSR AIGSTSKPQE SPKGKRKLDL NQEEKKTPSK PPAQLSPSVP KRPKLPAPTR TPATAPVPAR AQGAVTGKPR GEGTEPRRPR AGPEELGKIL QGVVVVLSGF QNPFRSELRD KALELGAKYR PDWTRDSTHL ICAFANTPKY SQVLGLGGRI VRKEWVLDCH RMRRRLPSRR YLMAGPGSSS EEDEASHSGG SGDEAPKLPQ KQPQTKTKPT QAAGPSSPQK PPTPEETKAA SPVLQEDIDI EGVQSEGQDN GAEDSGDTED ELRRVAEQKE HRLPPGQEEN GEDPYAGSTD ENTDSEEHQE PPDLPVPELP DFFQGKHFFL YGEFPGDERR KLIRYVTAFN GELEDYMSDR VQFVITAQEW DPSFEEALMD NPSLAFVRPR WIYSCNEKQK LLPHQLYGVV PQA

The ID line gives description of the sequence and what it codes. The FT lines describe what kind of domains are present in the protein. Those should say either HELIX or STRAND (always double-check those; they might be inaccurate). The SQ line gives molecular weight and some other information of the succeeding amino acid sequence.

For checking the secondary structures, go to http://www.embl-heidelberg.de/predictprotein/submit_def.html and paste in the first protein sequence. In a short while the results will be emailed to you. From the results, you’ll find a description: AA |MPEIRLRHVVSCSSQDSTHCAENLLKADTYRKWRAAKAGEKTISVVLQLEKEEQIHSVDI|

PHD sec | EEEEEEEEEE HHHHHHHHH HHHHHHHHHHH EEEEEE EEEEE |

Rel sec |993678997772578764999998651136878999946883699885235453354451|

detail:

prH sec |000000000000000116899988764467888899962100000000001111000000|

prE sec |006778898875210000000000000000000000000003799886431212566664|

prL sec |993210001114688872000001225432111000027886200012557665322225|

subset: SUB sec |LL.EEEEEEEE.LLLLL.HHHHHHHH...HHHHHHHH.LLL.EEEEEE..L.L..E..E.|

accessibility

3st: P_3 acc | eebebebbbbbbeeebeebbeebbebee ebebbeeeeebbbbbbebeeeeebbbbeb|

10st: PHD acc |397060600000099707600770070774570600777770000006067776000060|

Rel acc |002325036846201120058314433100121164412342639850325320404504|

subset: SUB acc |.....b..bbbb.......bb..bb.........bbe...e.b.bbb...e...b.bb.b|

The line marked with AA gives the original sequence, and the the next line, starting with PHD sec gives the predicted secondary structures. Rel sec gives the reliability of the secondary strusture prediction. H mean helix, and E mean sheet. You can use this information for the description of the structures for ClustalX input.

You have just created the first input file. The second file should include the other sequences you’re interested in in FastA-format (see above).

After preparing these input files, go to ClustalX, and switch to Profile Alignment Mode. From the File menu select Load Profile 1, and search for the first input file. If ClustalX recognizes the file to contain the weights for the gaps it asks you whether to use the penalties or not.

If you have formatted the files as described above, but ClustalX does not recognize the weights, go to Alignment->Alignment Parameters->Secondary Structure parameters. From the opening dialog box turn the Use Profile Secondary Structure option on (yes). After that try to load the first input file again.

Clustalx 多重序列比对图解教程(图解使用)

Clustalx 多重序列比对图解教程(By Raindy) 本帖首发于Raindy'blog,转载请保留作者信息,谢谢!欢迎有写生物学软件专长的战友,加入生信教程写作群:,接头暗号:你所擅长的生物学软件名称 软件简介: CLUSTALX-是CLUSTAL多重序列比对程序的Windows版本。Clustal X为进行多重序列和轮廓比对和分析结果提供一个整体的环境。 序列将显示屏幕的窗口中。采用多色彩的模式可以在比对中加亮保守区的特征。窗口上面的下拉菜单可让你选择传统多重比对和轮廓比对需要的所有选项。 主要功能: 你可以剪切、粘贴序列以更改比对的顺序; 你可以选择序列子集进行比对; 你可以选择比对的子排列(Sub-range)进行重新比对并可插入到原始比对中; 可执行比对质量分析,低分值片段或异常残基将以高亮显示。 当前版本:1.83 PS:如果你是新手或喜欢中文界面,推荐使用本人汉化的Clustalx 1.81版链接地址::ist&ID=7435(请完整复制) 应用:Clustalx比对结果是构建系统发育树的前提 实例:植物呼肠孤病毒属外层衣壳蛋白P8(AA序列)为例 流程:载入序列―>编辑序列―>设置参数―>完全比对―>比对结果 1.载入序列:运行ClustalX,主界面窗口如下所图(图1),依次在程序上方的菜单栏选择“File”-“Load Sequence”载入待比对的序列,如图2所示,如果当前已载入序列,此时会提示是否替换现有序列(Replace existing sequences),根据具体情形选择操作。

图1

图2 2.编辑序列:对标尺(Ruler)上方的序列进行编辑操作,主要有Cut sequences(剪切序列)、Paste sequences(粘贴)、Select All sequences(选定所有序列),Clear sequence Selection(清除序列选定)、Search for string(搜索字串)、Remove All gaps(移除序列空位)、Remove Gap-Only Columns(仅移除选定序列的空位)

多序列比对软件Clustalw使用方法

多序列比对软件Clustalw使用方法2011年06月23日星期四 16:44 Clustal的基本思想是基于相似序列通常具有进化相关性这一假设。比对过程中,先对所有的序列进行两两比对并计算它们的相似性分数值,然后根据相似性分数值将它们分成若干组,并在每组之间进行比对,计算相似性分数值。根据相似性分数值继续分组比对,直到得到最终比对结果。比对过程中,相似性程度较高的序列先进行比对,而距离较远的序列添加在后面。作为程序的一部分,Clusal可以输出用于构建进化树的数据。 Clustal程序有许多版本,ClustalW(Thompson等,1994),根据对亲缘关系较近的序列间空位情况,确定如何在亲缘关系较远的序列之间插入空位。同样,相似性较高的序列比对结果中的残基突变信息,可用于改变某个特殊位置空位罚分值的大小,推测该位点的序列变异性。ClustalW是一种渐进的多序列比对方法,先将多个序列两两比对构建距离矩阵,反应序列之间两两关系;然后根据距离矩阵计算产生系统进化指导树,对关系密切的序列进行加权;然后从最紧密的两条序列开始,逐步引入临近的序列并不断重新构建比对,直到所有序列都被加入为止。 ClustalX是CLUSTAL多重序列比对程序的Windows版本。Clustal X为进行多重序列和轮廓比对和分析结果提供一个整体的环境。 软件下载地址: 使用步骤如下: Step 1: 软件初始化界面

Step 2:选择1 进入如下界面 Step 3:输入序列名1Seq_650_300.txt.txt 进入如下界面

Step 4:选择2 进入如下界面 Step 5:选择9 进入如下界面 Step 6:选择1 进入如下界面

多重序列比对的数学模型与算法

多重序列比对的数学模型与算法 自美国提出组织的人类基因组计划(Human Genome Proreet)简称为HGP 以来,美国每年拔出相当大的经费支持,日本、法国、英国、德国等纷纷响应,它们的工作使新的交叉学科生物信息论得以诞生和发展,生物信息论是用数理和信息科学的观点、理论和方法去研究生命现象,组织和分析呈指数增长的生物学数据。生物信息学是一门综合学科,是计算机科学、数学、物理、生物学的结合。生物信息学的基础是各种数据库的建立和分析工具的发展。目前,生物学数据库已达500个以上,共有四大类:基因组数据库,核酸和蛋白质一级结构数据库、生物大分子三维空间结构数据库及其以她们为基础构建的二级数据库。生物信息学主要研究基因组测序及其信息分析、生物大分子的结构与功能预测及其模拟和药物设计、大规模基因表达数据的分析与基因芯片设计,以及基因与蛋白质相互作用网络等四方面的问题。 多重序列比对是计算分子生物学中最重要的运算。多重序列比对的基本问题就是找出适当安排删减与插入尽量少的空格,使得两个序列达到最大程度的一致的方案。比如给出下列三个序列: AC_G AGTCC (1) ACT 我们适当安排删减与插入空格得到: ACG___ A_GTCC (2) AC_T__ (2)就是多重序列的一个比对。 局部分段比对是其中更为常见的运算。上世纪80年代,Smith-Waterman提出了两个序列的局部比对的明确的模型。1998—1999年,相继出现利用k-tuple 的快速容错分段比对搜索法。2002年开始出现对完整基因组及其异常基因的比

较研究以及多重序列比对问题的研究,2003年刘军Mayetri Gupta和刘军得到Motif的搜索算法。 人类基因组计划后,目前已经进入后基因时代,主要就是对人类基因组计划实施得到的基本数据库进行信息分析、加工和利用,提取有用信息,用来研究生命现象中的重大问题。多重序列比对问题是生物信息学的基本问题,多重序列比对技术也是生物信息学的基本工具,有着十分广泛的应用,比如基因是否为同一个家族,癌症患者的基因与正常时的基因比对分析等等。因此,请您们就基因的多重序列比对,设计合理的衡量比对好坏的定量描述模型,建立多重序列比对的基本问题的数学模型,并设计一种求解的算法。最后就附录一中的12个序列,请您们利用你们得到的模型与算法,给出使序列有最大相似程度的比对。 附录一: CATTTCTTTTTAGGGATTTTAAAAGTTGTCTTTTCTT CATTTCTTTTTAAGGTTTTAAAAATTGTCTTTTTT CATTTCTTTTTAAGGGTTTTAAAAATTGTCTTTTCTT CATTTTTTCTTAAGTGTTTTGGTATTTATCTTTTTCTT CATTTTTGCTTATGTATTTATAGTGGGTTGTCTTTTTGACTT CATTTCTTTTGAAGTGATTTGAGATTTATCTTTTTCTT CATTTCTTTTTAAGGGTTTTAAAAATTGTCTTTTCTT CATTTCTTTTTATGTTGAGATATTTGTCTGTTTTCTT CATTTTTACTATGTGTTGATTGTGGATTGTCTTTTCTT CATTTCTTTTATTGAGTGAAGAAGAGATTTTGTCTTGTTTTGAT CATTTTTCTTAGTGTTTTGGTATTTATCTTTTTCTT CATTTCTTTTAAGGGTTTTAAAAATTGTCTTTTCTT

实验3 两条序列比对与多序列比对

实验三:两条序列比对与多序列比对 实验目的: 学会使用MegAlign,ClustalX和MUSCLE进行两条序列和多条序列比对分析 实验内容: 双序列比对是使两条序列产生最高相似性得分的序列排列方式和空格插入方式。两条序列比对是生物信息学最基础的研究手段。第一次实验我们用dotplot方法直观地认识了两条序列比对。但是dotplot仅仅是展示了两条序列中所有可能的配对,并不是真正意义上的序列比对。这里介绍进行两条序列比对的软件-MegAlign。 多序列比对是将多条序列同时比对,使尽可能多的相同(或相似)字符出现在同一列中。多序列比对的目标是发现多条序列的共性。如果说序列两两比对主要用于建立两条序列的同源关系,从而推测它们的结构和功能,那么,同时比对多条序列对于研究分子结构、功能及进化关系更为有用。多序列比对对于系统发育分析、蛋白质家族成员鉴定、蛋白质结构预测、保守模块的搜寻等具有非常重要的作用。我们这节课主要学习多条序列比对的软件-ClustalX, MUSCLE。 一、MegAlign DNASTAR公司的Lasergene软件包是一个比较全面的生物信息学软件,它包含了7个模块。其中MegAlign可进行两条或多条序列比对分析。 1. 两条序列比对 1.1 安装程序 解压DNASTAR Lasergene软件压缩包,双击Lasergene710WinInstall.exe文件,按照默认路径安装软件到自己电脑上。 1.2 载入序列 a.点击开始-程序-Lasergene-MegAlign,打开软件。 我们首先用演示序列(demo sequence)学习软件的使用。演示序列所在位置:C:\Program files\ DNASTAR\ Lasergene\ Demo Megalign\ Histone Sequences\。 b. 点击主菜单File—Enter sequence-选择序列所在文件夹,选择序列tethis21.seq和tethis22.seq,点击Add,这两条序列将出现在右侧selected sequences框中(Figure 3.1),选择完毕点击Done回到程序页面。 Figure 3.1 载入序列

多重序列比对及系统发生树的构建

多重序列比对及系统发生树的构建 【实验目的】 1、熟悉构建分子系统发生树的基本过程,获得使用不同建树方法、建树材料和建树参数对建树结果影响的正确认识; 2、掌握使用Clustalx进行序列多重比对的操作方法; 3、掌握使用Phylip软件构建系统发生树的操作方法。 【实验原理】 在现代分子进化研究中,根据现有生物基因或物种多样性来重建生物的进化史是一个非常重要的问题。一个可靠的系统发生的推断,将揭示出有关生物进化过程的顺序,有助于我们了解生物进化的历史和进化机制。 对于一个完整的进化树分析需要以下几个步骤:⑴ 要对所分析的多序列目标进行比对(alignment)。 ⑵ 要构建一个进化树(phyligenetic tree)。构建进化树的算法主要分为两类:独立元素法(discrete character methods)和距离依靠法(distance methods)。所谓独立元素法是指进化树的拓扑形状是由序列上的每个碱基/氨基酸的状态决定的(例如:一个序列上可能包含很多的酶切位点,而每个酶切位点的存在与否是由几个碱基的状态决定的,也就是说一个序列碱基的状态决定着它的酶切位点状态,当多个序列进行进化树分析时,进化树的拓扑形状也就由这些碱基的状态决定了)。而距离依靠法是指进化树的拓扑形状由两两序列的进化距离决定的。进化树枝条的长度代表着进化距离。独立元素法包括最大简约性法(M aximum Parsimony methods)和最大可能性法(Maximum Likelihood methods);距离依靠法包括除权配对法(UPGMAM)和邻位相连法(Neighbor-joining)。⑶ 对进化树进行评估,主要采用Bootstraping法。进化树的构建是一个统计学问题,我们所构建出来的进化树只是对真实的进化关系的评估或者模拟。如果我们采用了一个适当的方法,那么所构建的进化树就会接近真实的"进化树"。模拟的进化树需要一种数学方法来对其进行评估。不同的算法有不同的适用目标。一般来说,最大简约性法适用于符合以下条件的多序列:i 所要比较的序列的碱基差别小,ii 对于序列上的每一个碱基有近似相等的变异率,iii 没有过多的颠换/转换的倾向,iv 所检验的序列的碱基数目较多(大于几千个碱基);用最大可能性法分析序列则不需以上的诸多条件,但是此种方法计算极其耗时。如果分析的序列较多,有可能要花上几天的时间才能计算完毕。UPGMAM(Unweighted pair group method with arithmetic mean)假设在进化过程中所有核苷酸/氨基酸都有相同的变异率,也就是存在着一个分子钟。这种算法得到的进化树相对来说不是很准确,现在已经很少使用。邻位相连法是一个经常被使用的算法,它构建的进化树相对准确,而且计算快捷。其缺点是序列上的所有位点都被同等对待,而且,所分析的序列的进化距离不能太大。另外,需要特别指出的是对于一些特定多序列对象来说可能没有任何一个现存算法非常适合它。 CLUSTALX和PHYLIP软件能够实现上述的建树步骤。CLUSTALX是Windows界面下的多重序列比对软件。PH YLIP是多个软件的压缩包,功能极其强大,主要包括五个方面的功能软件:i,DNA和蛋白质序列数据的分析软件。ii,序列数据转变成距离数据后,对距离数据分析的软件。 iii,对基因频率和连续的元素分析的软件。iv,把序列的每个碱基/氨基酸独立看待(碱基/氨基酸只有0和1的状态)时,对序列进行分析的软件。v,按照DOLLO简约性算法对序列进行分析的软件。vi,绘制和修改进化树的软件。

vector nti 11 使用教程 第十章_多重序列比对

第十章多重序列比对 Vector NTI的多重序列比对程序和其他的比对软件比较起来非常的方便实用,操作接口也很简单,比对的结果可以存取和输出。NTI有两种序列比对程序,一种为AlignX,可以用在核酸序列和蛋白质序列比对;另一种为AlignX Blocks,只能用在蛋白质序列比对。 如何开始进行序列比对?用户可以从程序集开启档案(图10.1): 图10.1 由程序集开启AlignX 或者是从主程序中开启(图10.2): 图10.2 由主程序开启AlignX 用户也可以在主程序(图10.3)具有操作序列的情况下开启AlignX-Align Selected Molecules,使用者的序列会直接加载到AlignX中: 图10.3 在操作序列的情况下开启AlignX的方法

开启AlignX之后,使用者会见到图10.4的画面: 图10.4在操作序列的情况下开启AlignX 首先用户要把序列加载Vector NTI程序中,可以点选或者从左上方的Project →Add Files把序列档案加载,请注意文件名不可以过长,檔名过长会造成程序进行比对时无法完全显示文件名(图10.5): 图10.5 输入的档名注意不可过长 选取档案后按下开启就可以加载程序中,若比对的序列很多时可以用鼠标圈选欲分析的序列后选择开启。序列档案加载的时候程序会询问该序列为核酸序列或是蛋白质序列,点选好以后再点选Import就可以了(图10.6):

图10.6 载入时,会询问序列的性质,核酸序列或蛋白质序列 接下来程序的左上方会出现使用者加载的序列(图10.7),序列加载完成以后就可以开始进行比对的操作: 图10.7 成功载入序列的画面 进行比对前,先把欲比对的序列用鼠标进行圈选(图10.8): 图10.8 选取欲比对之序列 只要按下或是从上方Align→Align Selected Sequence(图10.9)就会进行比对运算:

多重序列比对及系统发生树的构建

多重序列比对及系统发生树的构建 作者:佚名来源:生物秀时间:2007-12-31 【实验目的】 1、熟悉构建分子系统发生树的基本过程,获得使用不同建树方法、建树材料和建树参数对建树结果影响的正确认识; 2、掌握使用Clustalx进行序列多重比对的操作方法; 3、掌握使用Phylip软件构建系统发生树的操作方法。 【实验原理】 在现代分子进化研究中,根据现有生物基因或物种多样性来重建生物的进化史是一个非常重要的问题。一个可靠的系统发生的推断,将揭示出有关生物进化过程的顺序,有助于我们了解生物进化的历史和进化机制。 对于一个完整的进化树分析需要以下几个步骤:⑴要对所分析的多序列目标进行比对(alignment)。⑵要构建一个进化树(phyligenetic tree)。构建进化树的算法主要分为两类:独立元素法(discrete character methods)和距离依靠法(distance methods)。所谓独立元素法是指进化树的拓扑形状是由序列上的每个碱基/氨基酸的状态决定的(例如:一个序列上可能包含很多的酶切位点,而每个酶切位点的存在与否是由几个碱基的状态决定的,也就是说一个序列碱基的状态决定着它的酶切位点状态,当多个序列进行进化树分析时,进化树的拓扑形状也就由这些碱基的状态决定了)。而距离依靠法是指进化树的拓扑形状由两两序列的进化距离决定的。进化树枝条的长度代表着进化距离。独立元素法包括最大简约性法(Maximum Parsimony methods)和最大可能性法(Maximum Likelihood methods);距离依靠法包括除权配对法(UPGMAM)和邻位相连法(Neighbor-joining)。⑶对进化树进行评估,主要采用Bootstraping法。进化树的构建是一个统计学问题,我们所构建出来的进化树只是对真实的进化关系的评估或者模拟。如果我们采用了一个适当的方法,那么所构建的进化树就会接近真实的“进化树”。模拟的进化树需要一种数学方法来对其进行评估。不同的算法有不同的适用目标。一般来说,最大简约性法适用于符合以下条件的多序列:i 所要比较的序列的碱基差别小,ii 对于序列上的每一个碱基有近似相等的变异率,iii 没有过多的颠换/转换的倾向,iv 所检验的序列的碱基数目较多(大于几千个碱基);

多序列比对

在寻找基因和致力于发现新蛋白的努力中,人们习惯于把新的序列同已知功能的蛋白序列作比对。由于这些比对通常都希望能够推测新蛋白的功能,不管它们是双重比对还是多序列比对,都可以回答大量的其它的生物学问题。举例来说,面对一堆搜集的比对序列,人们会研究隐含于蛋白之中的系统发生的关系,以便于更好地理解蛋白的进化。人们并不只是着眼于某一个蛋白,而是研究一个家族中的相关蛋白,看看进化压力和生物秩序如何结合起来创造出新的具有虽然不同但是功能相关的蛋白。研究完多序列比对中的高度保守区域,我们可以对蛋白质的整个结构进行预测,并且猜测这些保守区域对于维持三维结构的重要性。 显然,分析一群相关蛋白质时,很有必要了解比对的正确构成。发展用于多序列比对的程序是一个很有活力的研究领域,绝大多数方法都是基于渐进比对(progressive alignment)的概念。渐进比对的思想依赖于使用者用作比对的蛋白质序列之间确实存在的生物学上的或者更准确地说是系统发生学上的相互关联。不同算法从不同方面解决这一问题,但是当比对的序列大大地超过两个时(双重比对),对于计算的挑战就会很令人生畏。在实际操作中,算法会在计算速度和获得最佳比对之间寻求平衡,常常会接受足够相近的比对。不管最终使用的是什么方法,使用者都必须审视结果的比对,因为再次基础上作一些手工修改是十分必要的,尤其是对保守的区域。 由于本书偏重于方法而不是原理,这里只讨论一小部分现成的程序。我们从两个多序列比对的方法开始,接下去是一系列的利用蛋白质家族中已知的模体或是式样的方法,最后讨论两个具有赠送的方法,因为绝大多数公开的算法不能达到出版物的数量。在本章结尾部分将会列出更详细的多序列比对的算法。 渐进比对方法 CLUSTAL W CLUSTAL W算法是一个最广泛使用的多序列比对程序,在任何主要的计算机平台上都可以免费使用。这个程序基于渐进比对的思想,得到一系列序列的输入,对于每两个序列进行双重比对并且计算结果。基于这些比较,计算得到一个距离矩阵,反映了每对序列 Bioinformatics: A Practical Guide to the Analysis of genes and Proteins Edited by A.D. Baxevanis and B.E.E. Ouellette ISBN 0-471-191965. pages 172-188. Copyright ? 1998 Wiley – Liss. Inc.

Clustal多重序列比对图解教程图解使用

C l u s t a l x多重序列比对图解教程(B y R a i n d y) 本帖首发于Raindy'blog 软件简介: CLUSTALX-是CLUSTAL多重序列比对程序的Windows版本。ClustalX为进行多重序列和轮廓比对和分析结果提供一个整体的环境。 序列将显示屏幕的窗口中。采用多色彩的模式可以在比对中加亮保守区的特征。窗口上面的下拉菜单可让你选择传统多重比对和轮廓比对需要的所有选项。 主要功能: 你可以剪切、粘贴序列以更改比对的顺序; 你可以选择序列子集进行比对; 你可以选择比对的子排列(Sub-range)进行重新比对并可插入到原始比对中; 可执行比对质量分析,低分值片段或异常残基将以高亮显示。 当前版本:1.83 PS:如果你是新手或喜欢中文界面,推荐使用本人汉化的Clustalx1.81版 链接地址:ist&ID=7435(请完整复制) 应用:Clustalx比对结果是构建系统发育树的前提 实例:植物呼肠孤病毒属外层衣壳蛋白P8(AA序列)为例 流程:载入序列―>编辑序列―>设置参数―>完全比对―>比对结果 1.载入序列:运行ClustalX,主界面窗口如下所图(图1),依次在程序上方的菜单栏选择“File”-“LoadSequence”载入待比对的序列,如图2所示,如果当前已载入序列,此时会提示是否替换现有序列(Replaceexistingsequences),根据具体情形选择操作。

图1

图2 2.编辑序列:对标尺(Ruler)上方的序列进行编辑操作,主要有Cutsequences(剪切序列)、Pastesequences(粘贴)、SelectAllsequences(选定所有序列),ClearsequenceSelection(清除序列选定)、Searchforstring(搜索字串)、RemoveAllgaps(移除序列空位)、 RemoveGap-OnlyColumns(仅移除选定序列的空位)

MAFFT多重序列比对图解教程

MAFFT多重序列比对图解教程 2014年07月14日? Bioinformatics ?字号小中大?暂无评论?阅读793 次[点击加入在线收藏夹] 【絮语】 一提到多重序列比对,很多人禁不住就想到ClustalW(Clustalx为ClustalW的GUI版),其实有一款多重序列比对软件-MAFFT,不论从比对速度(Muscle>MAFFT>ClustalW>T-Coffee),还是比对准确性(MAFFT>Muscle>T-Coffee>ClustalW)来说,其相比于ClustalW(或ClustalX)有过之而无不及,所以这里强烈推荐使用MAFFT这款多重比对软件。 PS: 不同比对软件的比较,有兴趣的童鞋可以下载这篇文章看看:Alignment uncertainty and genomic analysis. Science, 2008 MAFFT官方网站:http://mafft.cbrc.jp/alignment/software/ 支持平台:Mac OS X 、Linux、Windows Windows 32位版本:http://mafft.cbrc.jp/alignment/software/mafft-7.037-win32.zip 64位版本:http://mafft.cbrc.jp/alignment/software/mafft-7.037-win64.zip 请根据自己操作系统选择相应版本下载 图1 MAFFT主界面 简明操作流程: 1.载入序列文件将FASTA格式的待比对序列文件(如:TMV.fas)复制MAFFT的根目录下(当然也可以放任意位置,只有找得到),双击“mafft.bat”启动MAFFT,此时提示输入文件(Input file?),在@后面输入示例的TMV.fas,也可以直接将文件拖入窗口(注意有个+,说明当前是拖放状态),如下图所示:

用ClustalX做多序列比对分析

用ClustalX做多序列比对分析图示 1、打开程序 如下图所示: 2、Load Sequnce, 载入序列 如下图所示: fasta格式的文件关键不在于文件名的后缀是什么,而是在于序列的格式。fasta的格式是: 1、第一行以>开头,紧接着序列的注释和描述。 2、第二行是纯序列atgcg.... 其他序列再起一行,如此下去就可以了。 如: >seq1 |this is a example atgattggaacttgacgt.... >seq2 |this is another example ttgagttgaccgtgacgtgag.....

3、选择序列文件,FASTA格式的 如下图所示: 4、用文本编辑器察看FASTA序列文件容,这里用的是记事本,推荐用EditPlus或者Ultraedit

如下图所示: 5、序列Load进去之后如下图所示:

6、Do Complete Alignment, 通常情况下直接选这个即可,无须修改比对参数 如下图所示: 7、点Do Complete Alignment之后弹出的文件对话框,.dnd的是输出的指导树文件,.aln的是序列比对结果,它们都是纯文本文件 如下图所示:

点“ALIGN”之后开始等待,如果序列不多,很快就可以算完,如果数据很多,可能要等一段时间,这时候可以用眼睛盯着ClustalX的状态栏,那里会有程序运行状态和现在正在比对那两条序列的提示信息,看看可以消磨时间。。。 8、比对结束之后,我们可以看到这个结果 如下图所示:

9、这时候我们可以发现ClustalX已经生成了.dnd和.aln两个文件,仍然用文本编辑器打开来看,这时.aln文件,这个文件可以用Mega2做进一步的bootstrap进化树分析 如下图所示: 10、这是.dnd文件(指导树) 如下图所示:

多重序列比对

第三章序列比较 3.3 序列多重比对 与序列两两比对不一样,序列多重比对(Multiple Alignment)的目标是发现多条序列的共性。如果说序列两两比对主要用于建立两条序列的同源关系和推测它们的结构、功能,那么,同时比对一组序列对于研究分子结构、功能及进化关系更为有用。例如,某些在生物学上有重要意义的相似性只能通过将多个序列对比排列起来才能识别。同样,只有在多序列比对之后,才能发现与结构域或功能相关的保守序列片段。对于一系列同源蛋白质,人们希望研究隐含在蛋白质序列中的系统发育的关系,以便更好地理解这些蛋白质的进化。在实际研究中,生物学家并不是仅仅分析单个蛋白质,而是更着重于研究蛋白质之间的关系,研究一个家族中的相关蛋白质,研究相关蛋白质序列中的保守区域,进而分析蛋白质的结构和功能。序列两两比对往往不能满足这样的需要,难以发现多个序列的共性,必须同时比对多条同源序列。 图3.14是从多条免疫球蛋白序列中提取的8个片段的多重比对。这8个片段的多重比对揭示了保守的残基(一个是来自于二硫桥的半胱氨酸,另一个是色氨酸)、保守区域(特别是前4个片段末端的Q-PG)和其他更复杂的模式,如1位和3位的疏水残基。实际上,多重序列比对在蛋白质结构的预测中非常有用。

多重比对也能用来推测各个序列的进化历史。从图3.14可以看出,前4条序列与后4条序列可能是从两个不同祖先演化而来,而这两个祖先又是由一个最原始的祖先演化得到。实际上,其中的4个片段是从免疫球蛋白的可变区域取出的,而另4个片段则从免疫球蛋白的恒定区域取出。当然,如果要详细研究进化关系,还必须取更长的序列进行比对分析。 对于多重序列比对的定义,实际上是两个序列的推广。设有k个序列s1, s2, ... ,s k,每个序列由同一个字母表中的字符组成,k大于2;通过插入操作,使得各序列s1, s2, ... ,s k的长度一样,从而形成这些序列的多重比对。如果将各序列在垂直方向排列起来,则可以根据每一列观察各序列中字符的对应关系,如图3.14。 通过序列的多重比对,可以得到一个序列家族的序列特征。当给定一个新序列时,根据序列特征,可以判断这个序列是否属于该家族。对于多序列比对,现有的大多数算法都基于渐进比对的思想,在序列两两比对的基础上逐步优化多序列比对的结果。进行多序列比对后,可以对比对结果进行进一步处理,例如构建序列的特征模式,将序列聚类,构建分子进化树等。 3.3.1 SP模型 SP 模型(Sum-of-Pairs,逐对加和)是一种多重序列比对的评价模型。在多重比对中,首先要对所得到的比对进行评价,以确定其优劣。例如,对图3.14中的8条序列进行比对,可以得到另外两种结果,如图3.15所示。那么,这样的三个多重比对,哪一个更好呢?这就需要有一种方法来评价一个多重比对。

相关文档
最新文档