常用数据分析方法

常用的分析数据方法
Molecular Dynamics Hands-on Session II
Tutorial on MD simulations using Gromacs 3.3.3
Tsjerk A. Wassenaar
Analysis of data from molecular dynamics simulation
When the simulation has finished, it is time to continue with the analysis of the data. The analysis comprises of two stages. First, it is necessary to perform some standard checks to assess the quality of the simulations. If the results from these analyses show that the simulations are fine, then the analyses required to answer the research question asked can be performed.

As a side note, many of the analysis tools produce graphs in the form of .xvg files. these files can be viewed with the program xmgrace, but they can also be viewed in a terminal using the python script xvg2ascii.py.

Quality assurance
The quality assurance (QA) involves tests for the convergence of thermodynamic parameters, such as the temperature, the pressure, the potential and the kinetic energy. Convergence is also checked in terms of the structure, through the root mean square deviation (RMSD) against the starting structure and the average structure. Next to that, it has to be checked that there have not been interactions between adjacent periodic images, as such interactions could lead to unphysical effects. A final QA test involves the calculation of the root mean square fluctuations of atoms, which can be compared to crystallographic b-factors.

Convergence of energy terms
We start of with the extraction of some thermodynamic parameters from the energy file. The following properties will be investigated: temperature, pressure, potential energy, kinetic energy, unit cell volume, density, and the box dimensions. Energy analysis is performed with the tool g_energy. This program reads in the energy file, which is produced during the simulation (1LW9-MD.edr). g_energy will ask for a term to extract from the energy file and will produce a graph from that. Give the following command to run g_energy:

g_energy -f 1LW9-MD.edr -o temperature.xvg

This will give a list of the energy and related terms which are stored in the .edr energy file. In our case, there are 68 terms in the energy file, each of which may be extracted and graphed. The first nine entries correspond to different energy terms in the force field. Also note the entries from 47 onward, which list the terms split over the groups Protein and Non-Protein, including the interactions between the two. Type "temperature" or "14" and press Return twice. Have a look at the graph with the program xmgrace, and see how the temperature converges to the value specified (300 K). Also note the fluctuations in the temperature:

xmgrace temperature.xvg

Referencing the energy terms by name facilitates automatic processing of the energy file. Using 'echo' and a pipe ("|") to redirect output from one program to the other, the query from g_energy can be answered automatically. Copy and paste or type the following lines to

extract the other terms:

echo pressure 0 | g_energy -f 1LW9-MD.edr -o pressure.xvg
echo potential kinetic 0 | g_energy -f 1LW9-MD.edr -o energy.xvg
echo volume 0 | g_energy -f 1LW9-MD.edr -o volume.xvg
echo density 0 | g_energy -f 1LW9-MD.edr -o density.xvg
echo box-x box-y box-z 0 | g_energy -f 1LW9-MD.edr -o box.xvg

Have a look at each of these files and see how the values converge. If the values have not converged, this indicates that the simulation has not yet reached thermal equilibrium and that it should be extended before doing further analysis. Moreover, the period over which equilibration takes place should not be included in the analysis. Here, for the sake of simplicity, we will neglect these considerations and use the results from the simulation as is.

Minimum distances between periodic images
One of the most important things to check in terms of quality assurance is that there have not been direct interactions between periodic images. Since periodic images are identical, such interactions are unphysical self-interactions. Imagine that a protein with a dipole would have direct interactions. Then the attraction between the two ends of the same molecule over the periodic boundary would affect and invalidate the results relating to the native behaviour of the protein. To verify that such interactions have not taken place, we calculate the minimal distance between periodic images at each time. This is done using g_mindist.

g_mindist -f 1LW9-MD.xtc -s 1LW9-MD.tpr -od minimal-periodic-distance.xvg -pi

Root mean square fluctuations
Next to inspection of energies and such, convergence of the simulation towards equilibrium should also be inspected from the relaxation of the structure. Usually, such relaxation is only considered in terms of the Cartesian distance from the structure to a reference structure, e.g. the crystal structure. This distance is termed the root mean square deviation (RMSD). However, it is recommended also to investigate the relaxation towards the average structure, i.e. the RMSD with respect to the average. The reasons for this will be set out in the next paragraph. But to calculate the RMSD against the average structure, requires first to obtain the average. This structure can be obtained as a side product from calculation the root mean square fluctuations (RMSF). The RMSF captures, for each atom, the fluctuation about its average position. This gives insight into the flexibility of regions of the protein and corresponds to the crystallographic b-factors (temperature factors). Usually, one would expect similar profiles for the RMSF and the b-factors and this can be used to investigate whether the simulation results are in accordance with the crystal structure. The RMSF (and the average structure) are calculated with g_rmsf. The -oq options allows to calculate b-factors and add them to a reference structure:

g_rmsf -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o rmsf-all-atom.xvg -ox average.pdb -oq b

factors.pdb

Have a look at the graph of the RMSF with xmgrace and identify the flexible and rigid regions. Also view the two pdb files. Color the structure bfactors.pdb according to b-factors and inspect the flexible regions. The average structure is an unphysical structure. Have a look at some of the side chains, and notice the effect of averaging over conformations.

Convergence of RMSD
Now that we also have the average structure, we can calculate the RMSD. The RMSD is commonly used as an indicator of convergence of the structure towards an equilibrium state. As mentioned above, the RMSD is merely a distance measure. The RMSD is calculated using the program g_rms. First calculate the RMSD for all protein atoms, using the starting structure as a reference:

g_rms -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o rmsd-all-atom-vs-start.xvg

Have a look at the graph, and note at what time the RMSD levels off and at which value. Then calculate the RMSD again, but now only selecting the backbone atoms:

g_rms -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o rmsd-backbone-vs-start.xvg

This time the RMSD settles at a lower value, which is the result of excluding the, often flexible, side chain atoms. In both cases the RMSD increases to a plateau value. This means that the structure of the protein reaches a certain distance from the reference structure and then keeps that distance more or less. However, with increasing distance, the amount of conformations available also increases. This means that, although the RMSD has reached a plateau value, the structure may still be progressing towards its equilibrium state. For this reason, it is advisable also to check the convergence towards the average structure:

g_rms -f 1LW9-MD.xtc -s average.pdb -o rmsd-all-atom-vs-average.xvg
g_rms -f 1LW9-MD.xtc -s average.pdb -o rmsd-backbone-vs-average.xvg

Compare the resulting graphs to the previous ones. Note at which point the RMSD values level off.

Convergence of radius of gyration
As a final part of the QA, we calculate the radius of gyration. This measure gives an indication of the shape of the molecule at each time. The radius of gyration compares to the experimentally obtainable hyrdodynamic radius. It can be calculated using g_gyrate. This program will also give the individual components, which correspond to the eigenvalues of the matrix of inertia. This means that the first individual component corresponds to the longest axis of the molecule, while the last corresponds to the smallest. In effect, the three axes give a global indication of the shape of the molecule. Issue the command

g_gyrate -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o radius-of-gyration.xvg

Have a look at the radius of gyration and the individual components and note how each of these progress to an equilibrium value.

Structural analysis: properties derived from configurations
Having assured that the simulation has converged to an equilibrium state, it is time to do some real analysis. Analysis

of simulation data can be divided into several categories. The first category comprises of the interpretation of single configurations according to some metric to obtain a value, or a number of values, for each time point. The RMSD and the radius of gyration are examples. Such properties can be called configurational, dependent or instantaneous properties. Next to that, one can analyze processes in the time domain, e.g. through averaging, as (auto)correlations or fluctutations. In this section, a number of common analyses will be performed, each of which yields a time series of values directly derived from the trajectory (the coordinates over time).

Solvent accessible surface area
One property which can be of interest is the surface area of the protein which is accessible to solvent, commonly referred to as the solvent accessible surface (SAS) or the solvent accessible surface area (SASA). This can be further divided into a hydrophilic SAS and a hydrophobic SAS. In addition, the SAS can be used together with some empirical parameters to obtain an estimate for the free energy of solvation. All four of these parameters are calculated by the program g_sas. This program also allows to calculate the average SAS over time per residue and/or per atom. Issue the following command, specifying Protein both for the group to calculate the SAS for and for the output group, and have a look at the output files.

g_sas -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o solvent-accessible-surface.xvg -oa atomic-sas.xvg -or residue-sas.xvg

Hydrogen bonds
Another property which can be informative is the number of hydrogen bonds, either internally (Protein - Protein) or between the protein and the surrounding solvent. The presence or not of a hydrogen bond is inferred from the distance between a donor-H - acceptor pair and the donor - H - acceptor angle. To calculate the hydrogen bonds issue the following commands, and have a look at the output files using xmgrace:

g_hbond -f 1LW9-MD.xtc -s 1LW9-MD.tpr -num hydrogen-bonds-intra-protein.xvg
g_hbond -f 1LW9-MD.xtc -s 1LW9-MD.tpr -num hydrogen-bonds-protein-water.xvg

Salt bridges
Besides hydrogen bonds, proteins also often form salt bridges between oppositely charged residues. These can have an important stabilizing effect on the structure of the protein, in particular when they are embedded in a hydrophobic environment, such as the core of the protein. The existence of salt bridges can be investigated with g_saltbr. If requested (through the flag -sep) This program will generate an output file for every pair of oppositely charged residues which at some point in the trajectory are within a certain cut off distance from each other (here 1.0 nm, specified through the option -t). Execute the following command and look at the output generated.

g_saltbr -f 1LW9-MD.xtc -s 1LW9-MD.tpr -t 1.0 -sep

Secondary structure
Among the most common parameters to judge protein structure is the assignment of secondary structure

elements, such as α-helices and β-sheets. One such assignment is provided by dssp. This program, which is not part of the gromacs distribution, but can be obtained at the CMBI, Radboud University. Gromacs does provide an interface to dssp, to allow the calculation of secondary structure for each frame of a trajectory. To use this, first set an environment variable DSSP, which points to the location of the program. Then run do_dssp as follows:

setenv DSSP /home/coursead/Wassenaar/MD/dssp
do_dssp -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o secondary-structure.xpm -sc secondary-structure.xvg

The file secondary-structure.xvg contains a time series, listing the numbers of residues associated with each type of secondary structure per frame. More detailed information is in the .xpm file, which gives a color coded assignment of the secondary structure per residue over time. The .xpm file can be viewed with e.g. the Gimp, but some useful metadata can be added with the gromacs tool xpm2ps and the result can be viewed with gview or a similar program:

xpm2ps -f secondary-structure.xpm -o secondary-structure.eps

Ramachandran (phi/psi) plots
The phi and psi torsion angles of the protein backbone are two parameters which are useful to gain insight into the structural properties of a protein. The plot of phi against psi is called a Ramachandran plot and certain regions of this plot are characteristic of secondary structure elements or of amino acids. Other regions are considered forbidden (inaccessible). Projection of the phi/psi angles over time may give insight in structural transitions. These angles can be calculated by the program g_rama, albeit in a bit of a crude way, namely put all together in a single file. To investigate single residues, the linux progam 'grep' can be used to select them out of the plot.

g_rama -f 1LW9-MD.xtc -s 1LW9-MD.tpr -o ramachandran.xvg

Analysis of dynamics and time-averaged properties
Analysis of interatomic distances, NOE's from simulations
Relaxation and order parameters
Configurational principal component analysis
The use of principal component analysis in molecular dynamics focuses on revealing the structure of atomic fluctuations.
An often used, but often little understood, method of analysis is principal component analysis (PCA) of the trajectory. This method, which is sometimes also referred to as 'essential dynamics' (ED), aims to identify large scale collective motions of atoms and thus reveal the structures underlying the atomic fluctuations. The fluctuations of particles in a molecular dynamics simulation are by definition correlated due to interactions between the particles. The degree of correlation will vary and notably particles which are directly connected through bonds or lie in the vicinity of each other will move in a concerted manner. The correlations between the motions of the particles give rise to structure in the total fluctuations in the system and for a macromolecule this structure

is often directly related to its function or (bio)physical properties. Thus, the study of the structure of the atomic fluctuations can give valuable insight in the behaviour of such macromolecules. However, it does require a certain level of understanding of linear algebra methods and multivariate statistics to interpret the results and identify the shortcomings of the method. In particular, the aim of principal component analysis is to describe the original data in terms of new variables which are linear combinations of the original ones. This is also the most important problem with PCA: it only offers an interpretation in terms of linear relationships between atomic motions.

The first step in PCA is the construction of the covariance matrix, which captures the degree of collinearity of atomic motions for each pair of atoms. The covariance matrix is, by definition, a symmetric matrix. This matrix is subsequently diagonalized, yielding a matrix of eigenvectors and a diagonal matrix of eigenvalues. Each of the eigenvectors describes a collective motion of particles, where the values of the vector indicate how much the corresponding atom participates in the motion. The associated eigenvalue gives equals the sum of the fluctuation described by the collective motion per atom, and thus is a measure for the total motility associated with an eigenvector. Usually most of the motion in the system (>90%) is described by less than 10 eigenvectors or principal components.

The construction and diagonalization of the covariance matrix can be performed with the program g_covar. Issue the following command to perform the analysis:

g_covar -s 1LW9-MD.tpr -f 1LW9-MD.xtc -o eigenvalues.xvg -v eigenvectors.trr -ascii covariances.dat

Select the backbone when asked for a selection. Constructing and diagonalizing the covariance matrix requires some time. When the program has finished, have a look at the plot of the eigenvalues and calculate how much of the total motility is explained by the first ten eigenvectors.