AMBER 2015 LONDON WORKSHOP - HANDS ON ANALYSIS SESSION

Example of Combined Cluster Analysis

by Daniel R. Roe

Cluster Analysis with CPPTRAJ

One way of determining structure populations from simulations is cluster analysis. Clustering is a means of partitioning data so that data points inside a cluster are more similar to each other than they are to points outside a cluster. In the context of molecular simulation, this means grouping similar conformations together. Similarity is determined by a distance metric - the smaller the distance, the more similar the structures. One commonly used distance metric is coordinate RMSD.

It is very important to note that there is no one "correct" way to perform cluster analysis. There are many different algorithms and distance metrics available, and different combinations will work better for certain systems. Cluster analysis usually requires some trial and error before satisfactory results are obtained. This tutorial is just one example of cluster analysis. For a more in-depth discussion on cluster analysis of MD trajectories, users are encouraged to read the 2007 paper by Shao et al. on the subject.

In this example we will use CPPTRAJ to perform both cluster analysis and combined clustering analysis (as a means of ascertaining convergence). CPPTRAJ supports a variety of clustering algorithms, distance metrics, clustering metrics, and output options. This example will use trajectories of the tetranucleotide rGACC that were generated using multidimensional replica exchange (MREMD, 24 x Temperature, 8 x aMD). Although the trajectories were originally generated with explicit TIP3P water, the solvent has been removed to decrease the size of the trajectories. This data and analysis was first reported (in part) in Roe et al., J. Phys. Chem. B, 2014, 118(13), pp 2543-3552.

Files

Note: You can also get files using wget

wget http://www.amber.utah.edu/PRACE-BioExcel-Seasonal-School-2019/Cluster/FILENAME

rGACC.nowat.parm7: Topology file, rGACC + 3 Na+ ions.
rGACC.MREMD1.nowat.nc.40: rGACC M-REMD trajectory @ 300 K, run 1.
rGACC.MREMD2.nowat.nc.40: rGACC M-REMD trajectory @ 300 K, run 2.
cpptraj.cluster.in: CPPTRAJ input for clustering single trajectory.
cpptraj.combined.cluster.in: CPPTRAJ input for combined clustering.

Note that although input is provided in files, users are encouraged to use the interactive mode (if on tegner) to become better familiar with CPPTRAJ workflow and command options. If on beskow, must do via batch.

Performing Cluster Analysis with CPPTRAJ

Step 1: Loading topology and trajectory

We will begin by performing cluster analysis on a single trajectory. Start CPPTRAJ by typing 'cpptraj'. Load the topology and trajectory with the following commands:

parm rGACC.nowat.parm7
trajin rGACC.MREMD1.nowat.nc.40

Since these trajectories contain ions and we are only interested in rGACC itself, we can remove them (ensuring they do not show up in any output structures) with a strip command. We will also use the outprefix keyword so that the corresponding stripped topology will be written as 'noions.rGACC.nowat.parm7'. Note that this step is optional and does not have to be done in order to perform clustering.

strip :Na+ outprefix noions

Step 2: Clustering Command

Now we will issue the clustering command. There are many options for clustering which can be grouped into 4 general categories: Clustering Algorithm, Distance Metric, Output, and Coordinate Output.

cluster C0 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 10 random \
        out cnumvtime.dat \
        sil Sil \
        summary summary.dat \
        info info.dat \
        cpopvtime cpopvtime.agr normframe \
        repout rep repfmt pdb \
        singlerepout singlerep.nc singlerepfmt netcdf \
        avgout Avg avgfmt restart

We will briefly look at each option with more detail:

C0: Cluster output data set(s) name.

CLUSTERING OPTIONS:

dbscan: Use the DBSCAN (density-based) clustering algorithm.
minpoints: Minimum # of points to form a cluster.
epsilon: Distance cutoff for forming cluster.
sievetoframe: Restore sieved frames by comparing to all cluster frames, not just centroid.

DISTANCE METRIC OPTIONS:

rms : Use RMSD of atoms in as distance metric.
sieve 10: Typically generating the pair-wise distance matrix (the distance of every frame to every other frame) is a very time and memory consuming part of the clustering calculation. "Sieving" is a way to reduce the expense of this step by using "total / 10" frames for initial clustering. The sieved frames are then added to those clusters as an additional step. Note that in most cases you will want to also use the random keyword to make the sieve random (instead of ordered); however it is not used here in order to make this example reproducible.

OUTPUT OPTIONS:

out <file>: Write cluster number versus time to 'file'. Note that since the DBSCAN algorithm has a concept of "noise", any noise frames will be assigned to cluster -1 (no cluster).
summary <file>: Write overall clustering summary to 'file'.
info <file>: Write detailed cluster results (including DBI, pSF etc) to 'file'.
cpopvtime <file> normframe: Write cluster population vs time to 'file', normalized by # frames.

COORDINATE OUTPUT OPTIONS:

repout <prefix> repfmt pdb: Write cluster representatives to 'prefix.cX.fmt' with PDB format, where X is the cluster number and 'fmt' is the default format extension.
singlerepout <file> singlerepfmt netcdf: Write all cluster representatives to 'file' with NetCDF format.
avgout <prefix> avgfmt restart: Write average over all frames in each cluster to 'prefix.cX.fmt' with Amber restart file format.

Once your input has been read in, type run to begin trajectory processing and analysis. Be patient, as this may take about a minute running on a fairly modern CPU. Once clustering has completed you can type quit to exit CPPTRAJ. You will now have many output files containing clustering results. If you check 'summary.dat' you should see 16 total clusters. The various output options are explained in detail in the Amber manual.

One particularly interesting output are the cluster populations versus time, written to 'cpopvtime.agr' (xmgrace format). In particular one can see that at the beginning of the trajectory the cluster populations are changing rapidly over time. As the run progresses the cluster populations gradually stabilize until they reach their final values somewhere around 70,000 frames. This is an indication that the cluster populations are reaching an equilibrium and the results may be suitable for things like thermodynamic analysis. However, a better way of ascertaining this is to compare the results to an independent run.

Combined Cluster analysis with CPPTRAJ

"Convergence" is an important concept in molecular simulations. It is a way of gauging how effectively the underlying conformational landscape is being sampled. A simulation starting from a single point (i.e. conformation) may become kinetically trapped in a few local energy wells; it is difficult to ascertain whether this has occurred from a single simulation. However, if additional simulations of a system are started with different initial conditions and/or different initial conformations (ideally both), eventually they should sample the same conformations in the same populations; once this has occurred the simulations are said to have "converged". When two or more simulations have converged and structure populations are no longer significantly changing, one can start to calculate things like thermodynamic quantities with some degree of certainty.

Cluster analysis can be used to compare populations of structures in separate runs as a means of determining convergence. However, it can be challenging to compare clusters between different trajectory runs; they may either be present in each trajectory in different populations or a conformation may occur in one trajectory but not the other. In order to facilitate the comparison of clusters between different trajectories, CPPTRAJ supports "combined clustering", where cluster analysis is performed on two (or more!) combined trajectories, then the results are partitioned based on the original trajectories.

Step 1: Loading topology and trajectories

Since we are going to be combining the two trajectories before clustering, we need to know how many frames are in each so we can tell the cluster command where to partition the combined clustering results. This can be done with cpptraj on the command line:

cpptraj -p rGACC.nowat.parm7 -y rGACC.MREMD1.nowat.nc.40 -tl

Here the '-p' flag loads the topology, the '-y' flag loads the trajectory, and the '-tl' flag is used to report the number of frames. The output should be:

Frames: 83843

So we know we need to split the combined trajectories at frame 83843.

Now start CPPTRAJ by typing 'cpptraj'. This will bring up the interactive prompt. Load the topology and trajectories (and strip ions):

parm rGACC.nowat.parm7
trajin rGACC.MREMD1.nowat.nc.40
trajin rGACC.MREMD2.nowat.nc.40
strip :Na+ outprefix noions

Step 2: Clustering command for combined cluster analysis

Now enter the clustering command. Most lines will look similar to the previous run, except we will be adding keywords for doing combined cluster analysis.

cluster C1 \
        dbscan minpoints 25 epsilon 0.9 sievetoframe \
        rms :1@N2,O6,C1',P,:2@H2,N6,C1',P,:3@O2,H5,C1',P,:4@O2,H5,C1',P \
        sieve 20 \
        out combined.cnumvtime.dat \
        info combined.info.dat \
        summarysplit split.dat splitframe 83843

There are two new keywords here:

summarysplit <file>: Write cluster information for split to file.
splitframe 83843:Split clustering at frame 83843 (where first traj stops).

Once your input has been read in, type run to begin trajectory processing and analysis. Be patient, as this may take about a minute and a half to two minutes running on a fairly modern CPU. Once clustering has completed you can type quit to exit CPPTRAJ.

The main output from the combined clustering analysis is in 'split.dat'. The first two lines of this file describe how the splitting of the combined input frames was done and how many total frames comprise each section:

# 1st <= 83843 < 2nd
# 1st= 83843  2nd= 108887

Next comes several columns:

#Cluster: Cluster number.
Total: Total number of frames in the cluster.
Frac: Fraction of total frames in the cluster.
C#: Grace-compatible cluster color. Colors start from 1. Clusters 14 and beyond are all assigned color 15.
Color: Name of the Grace-compatible color (using Grace's default scheme).
NumIn1st NumIn2nd: Number of frames of that cluster that fall into the first trajectory and second trajectory respectively.
Frac1 Frac2: Fraction of frames of that cluster that fall into the first trajectory and second trajectory respectively.
First1 First2: Frame at which the cluster was first encountered in the first trajectory and second trajectory respectively.

The cluster populations between the first and second trajectories agree quite well (max absolute difference in fraction population about 0.015), indicating that the two trajectories are relatively well-converged.