JED: a Java Essential Dynamics Program for comparative analysis of protein trajectories

Background: Essential Dynamics (ED) is a common application of principal component analysis (PCA) to extract biologically relevant motions from atomic trajectories of proteins. Covariance and correlation based PCA are two common approaches to determine PCA modes (eigenvectors) and their eigenvalues. Protein dynamics can be characterized in terms of Cartesian coordinates or internal distance pairs. In understanding protein dynamics, a comparison of trajectories taken from a set of proteins for similarity assessment provides insight into conserved mechanisms. Comprehensive software is needed to facilitate comparative analysis with user-friendly features that are rooted in best practices from multivariate statistics.
Results: We developed a Java based Essential Dynamics toolkit called JED to compare the ED from multiple protein trajectories. Trajectories from different simulations and different proteins can be pooled for comparative studies. JED implements Cartesian-based coordinates (cPCA) and internal distance pair coordinates (dpPCA) as options to construct covariance (Q) or correlation (R) matrices. Statistical methods are implemented for treating outliers, benchmarking sampling adequacy, characterizing the precision of Q and R, and reporting partial correlations. JED outputs results as text files that include transformed coordinates for aligned structures, several metrics that quantify protein mobility, PCA modes with their eigenvalues, and displacement vector (DV) projections onto the top principal modes. PyMOL scripts together with PDB files allow movies of individual Q- and R-cPCA modes to be visualized, as well as the essential dynamics occurring within user-selected time scales. Subspaces defined by the top eigenvectors are compared using several statistical metrics to quantify similarity/overlap of high dimensional vector spaces. Free energy landscapes can be generated for both cPCA and dpPCA.
Conclusions: JED offers a convenient toolkit that encourages best practices in applying multivariate statistics methods to perform comparative studies of essential dynamics over multiple proteins. For each protein, Cartesian coordinates or internal distance pairs can be employed over the entire structure or user-selected parts to quantify similarity/differences in mobility and correlations in dynamics to develop insight into protein structure/function relationships.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1676-y) contains supplementary material, which is available to authorized users.

Java Essential Dynamics (JED) is a Java library (a package of programs) for analyzing protein trajectories. The trajectories may be derived from any molecular dynamics simulation method that outputs a trajectory as a set of PDB files. The program can handle single chain PDB files with no chain identifier as well as multi chain PDB files that use chain IDs. The user may specify the set of residues to be considered for the analysis, and this set need not be contiguous. A variety of utility tools related to Principal Component Analysis (PCA) provide users with additional features not found in MD simulation packages. This stand-alone statistical software package is well suited for quantitatively comparing differences in protein dynamics. In particular, JED provides convenient tools to help with comparative analysis of protein dynamics from multiple trajectories. JED is capable of running on any platform with a suitable Java Runtime Environment (JRE).

Expected Input to JED:
Ideally, each PDB structure should follow the standard PDB format, although deviations from the standard often work fine. The first residue label must start at 1 or higher. No 0 or negative numbers are allowed for residue labels. Preprocessing of PDB files should be done with the external software that generates the conformational ensembles before using JED. It is convenient and recommended to label PDB files using leading zeros in the file names to simplify tracking time progression. For example, if a simulation generates 100,000 frames in the trajectory, it is best to name the PDB files like <file_name_000000>, <file_name_000001>, … <file_name_099999>. In this way, all frames are specified relative to the starting structure in sequential order.
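The effect of leading zeros on file ordering can be illustrated with a short Java sketch. The base name "frame" and the ".pdb" extension are hypothetical choices for illustration only; JED does not require these names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class FrameNames {
    /** Builds a zero-padded name such as frame_000042.pdb (base name and extension are hypothetical). */
    static String paddedName(String base, int frame, int width) {
        return String.format(Locale.ROOT, "%s_%0" + width + "d.pdb", base, frame);
    }

    /** Names for frames 0..count-1, sorted lexicographically as a plain file listing would be. */
    static List<String> sortedNames(String base, int count, int width) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(paddedName(base, i, width));
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // With a width of 6, lexicographic order matches frame order.
        System.out.println(sortedNames("frame", 12, 6));
        // With a width of 1 (no padding), frame_10 sorts before frame_2.
        System.out.println(sortedNames("frame", 12, 1));
    }
}
```

With padding, the sorted list follows the simulation order; without it, frames 10 and 11 are read before frame 2.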

JED Preprocessing Output:
As a preprocessing step, JED reads in all PDB files in a specified directory and aligns all the structures in the trajectory to a specified reference structure using a quaternion alignment algorithm. A matrix of the PDB coordinates that were read, obtained from all the residues in the input PDB files, is created so that it can be used for all subsequent JED runs. A list of all the residues (residue list) found in the PDB files (along with the chain IDs when appropriate) is generated. The original and transformed conformation RMSD are determined for each member structure in the trajectory relative to the specified reference structure. The residue RMSD (also commonly referred to as RMSF) is determined from the entire trajectory. An edited PDB file is also generated where the B-factors are replaced with the residue RMSD values for visualization purposes. The Z-scores for the variables are also calculated. This output is always generated and cannot be turned off.

Carbon Alpha Atoms:
The current implementation of JED only considers Cα atoms. As such, we speak about residues because the information is tied to Cα atoms that represent dynamics of residues at a coarse grained level of description. For example, the distance between two residues is modeled in JED as the distance between the two Cα atoms associated with the two residues. Working only with Cα atoms allows the Cα atom labels to be synonymous with residue labels. For a single chain protein, this is a simple 1 to 1 mapping. For multiple chain proteins, JED also tracks the chain ID.

Different Types of PCA:
The core element of essential dynamics is to perform PCA. JED implements two variations of PCA. The first and most common method is based on Cartesian coordinates (cPCA). The cPCA using n residues will yield eigenvectors having 3n components, each corresponding to one Cartesian coordinate. The second method is based on internal coordinates using residue-pair distances (dpPCA). The dpPCA, using n residue-pairs, will yield eigenvectors having n components, each corresponding to one of the inter-residue distance pairs. As a special case, an all-to-all comparison can be performed. However, an all-to-all comparison is computationally intense unless a small subset of residues is being considered.
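To make the dpPCA variables concrete, the following Java sketch computes one distance per requested residue pair for a single frame, and counts the pairs in an all-to-all comparison. The class and method names are hypothetical and are not part of JED, and the count n(n-1)/2 assumes that "all-to-all" means every distinct pair of residues.

```java
final class DistancePairs {
    /** Euclidean distance between two C-alpha positions given as {x, y, z}. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** One dpPCA variable per requested pair; each pair holds two 0-based residue indices. */
    static double[] pairDistances(double[][] calpha, int[][] pairs) {
        double[] d = new double[pairs.length];
        for (int k = 0; k < pairs.length; k++) {
            d[k] = distance(calpha[pairs[k][0]], calpha[pairs[k][1]]);
        }
        return d;
    }

    /** Number of distinct pairs among n residues (assumed meaning of an all-to-all comparison). */
    static long allToAllPairs(int n) {
        return (long) n * (n - 1) / 2;
    }
}
```

Under this assumption, 100 residues already give 4950 distance variables, which illustrates why an all-to-all comparison becomes expensive for anything but a small subset.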

PCA Models:
PCA methods are performed using a covariance matrix (Q), a correlation matrix (R) and a partial correlation matrix (P).
The correlation matrix is a normalized version of the covariance matrix. The results obtained from Q and R generally differ somewhat due to the inherent statistical biases in each approach. The P matrix is obtained from the inverse of the covariance matrix, which is then normalized. The current implementation automatically considers all three types of statistical metrics, and allows these metrics to be compared.
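The normalization that turns Q into R can be sketched as follows, using the standard definition R_ij = Q_ij / sqrt(Q_ii Q_jj). This is a minimal illustration that assumes every variance Q_ii is strictly positive; it is not JED's own code, and the class name is hypothetical. The P matrix is not shown.

```java
final class CovToCorr {
    /** Normalizes a covariance matrix into a correlation matrix (assumes positive diagonal entries). */
    static double[][] correlationFromCovariance(double[][] q) {
        int n = q.length;
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // Divide each covariance by the product of the two standard deviations.
                r[i][j] = q[i][j] / Math.sqrt(q[i][i] * q[j][j]);
            }
        }
        return r;
    }
}
```

Because every variable is rescaled to unit variance, variables with large fluctuations no longer dominate R the way they can dominate Q.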

Conditioning of the sample Q Matrix:
JED has functionality to remove outliers prior to PCA using two approaches. First, the user can specify the percent (expressed as a decimal in [0, 1]) of outlier structures to be removed from the sample based on conformation RMSD. The most deviant structures are tagged as outliers and subsequently removed from the sample. Although not the recommended approach because it discards massive amounts of data, it is included in JED because it is a commonly used method. Second, the user can specify a z-score cutoff (a float > 0) such that when the value of a PCA variable has a |deviation| from the variable mean that exceeds the z-score cutoff, it is identified as an outlier. Each PCA entry identified as an outlier is replaced with the mean of its variable. This process is done per variable over all frames, and each PCA entry is treated independently. This is the recommended method because a frame is never thrown away. Rather, only outlier entries (a small fraction of all variables) within a frame are modified in a way that preserves the mean. Both methods are intended to reduce the condition number of Q and to improve the estimator for the population covariance matrix. The first method of conditioning is often employed for protein dynamics (if at all). The second method of conditioning is commonly used in the field of statistics, and is the preferred method due to its superior effectiveness. Note that without conditioning, the results from PCA risk being highly skewed (having statistical bias) due to the presence of outliers. PCA results are always highly dependent on the quality of sampling. Therefore, it is strongly recommended to use the z-score cutoff conditioning method in all applications to avoid misinterpreting the PCA results. Since the R and P matrices derive from the Q matrix, this same conditioning process also improves the R and P matrices. To monitor the effect of outlier removal, different cutoffs should be considered and compared.
Both outlier removal methods can be turned off independently simply by setting percent to 0 and/or z-score cutoff to 0.
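The z-score conditioning described above can be sketched for a single variable as follows. This is an illustration under stated assumptions, not JED's implementation: the class name is hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and the mean and standard deviation are computed once from the unmodified data.

```java
final class ZScoreConditioning {
    /**
     * Takes the values of one PCA variable (one value per frame) and returns a copy in which
     * every entry whose |z-score| exceeds the cutoff is replaced with the variable mean.
     * A cutoff of 0 turns the step off, mirroring the convention described in the text.
     */
    static double[] replaceOutliers(double[] values, double zCutoff) {
        double[] out = values.clone();
        int n = values.length;
        if (zCutoff <= 0 || n < 2) {
            return out;
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSquares / (n - 1)); // sample standard deviation (assumed)
        if (sd == 0.0) {
            return out; // constant variable: nothing can be an outlier
        }
        for (int i = 0; i < n; i++) {
            if (Math.abs(values[i] - mean) / sd > zCutoff) {
                out[i] = mean;
            }
        }
        return out;
    }
}
```

Applying this routine to each row of the coordinate matrix in turn treats every variable independently, so no frame is ever discarded.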

Animated Visualization of cPCA modes:
JED computes the root mean square deviation (RMSD) and mean squared deviation (MSD) of cPCA modes without weighting, and by weighting the modes by their corresponding eigenvalue. The RMSD and MSD characteristics of cPCA modes can be animated directly on the protein 3D structures. The user can specify the number of Cartesian modes to animate, beginning with mode one. The animation of a mode is done by creating a set of 20 PDB files that capture the displacement of each residue's atoms for each requested mode using a sine function to produce atomic displacements in proportion to eigenvector components. A scale-factor parameter is used to control the amount of displacement in the modes. A PyMOL™ script is generated to animate the frames. A movie for an individual mode looks like a vibrational mode having a sinusoidal periodic motion. A scale-factor parameter of 1 provides physically realistic levels of atomic displacements. However, the user may wish to increase the scale-factor to emphasize motions more clearly. Because the eigenvectors from dpPCA cannot be mapped to residue displacements in a simple way, visualization is not provided.
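A sinusoidal mode animation of this kind can be sketched as below. The phase sampling (one full period spread evenly over the frames) and the absence of eigenvalue weighting are assumptions made for illustration; the exact amplitudes JED writes to its 20 PDB files may differ, and the names used here are hypothetical.

```java
final class ModeAnimation {
    /**
     * Returns numFrames coordinate sets, each equal to the reference coordinates displaced
     * along one mode by scale * sin(2*pi*k/numFrames) times the eigenvector components.
     */
    static double[][] frames(double[] reference, double[] mode, double scale, int numFrames) {
        double[][] frames = new double[numFrames][reference.length];
        for (int k = 0; k < numFrames; k++) {
            double amplitude = scale * Math.sin(2.0 * Math.PI * k / numFrames);
            for (int i = 0; i < reference.length; i++) {
                frames[k][i] = reference[i] + amplitude * mode[i];
            }
        }
        return frames;
    }
}
```

Increasing the scale argument exaggerates the motion without changing its direction, which corresponds to the role of the scale-factor parameter.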

Animated Visualization of Essential Dynamics:
JED provides a PyMOL™ script to show movies for the superposition of PCA modes. It is assumed that the modes vibrate in phase through a time dependent sinusoidal function that governs the mode amplitude. Since the relative amplitude of higher frequency modes decreases rapidly, the user can select a window of modes (lowest to highest consecutively) to visualize the essential dynamics of the protein at different time scales, which is set by the lowest frequency mode in the window. A user specifies the first PCA mode number to define the leading edge of the window, along with the number of modes in the window. A good window size is usually 5 modes. The user can generate different movies for the essential dynamics at different time scales by sliding the window (say 5 modes in size). For example, on the slowest time scale, the user could select modes 1 through 5, while a selection from 16 to 20 would show a much faster time scale. JED by default provides an animation of the essential dynamics for the top N modes, where N is MIN(5, Ncalculated).

Dimension Reduction Level:
The primary purpose of applying PCA to capture the essential dynamics of a protein is to reduce the large dimension of variables to a much smaller number of variables that captures the greatest variance in protein motion. The Q, R, and P matrices, once diagonalized, provide a set of eigenvalues and eigenvectors. The eigenvalues for proteins typically fall off fast for the first several modes, out of possibly thousands of modes. The number of dimensions needed to provide a fair assessment of the essential dynamics in a protein is system-dependent. The user can specify any number (say 20, which typically is more than needed) to obtain results for all possible selections, ranging from 1 up to the maximum value that is selected. As such, the user can see how the added dimensions help glean more information. Eventually, the user must decide the optimal number of dimensions to use for representing the essential dynamics based on one's objectives. For Q, R and P matrices, the eigenvectors with largest eigenvalues are deemed most important. The eigenvalues for Q and R are always positive, and they are always negative for P. It therefore is the case that the maximum eigenvalue from P has a magnitude that is always closest to zero.

Displacement Vectors:
A set of displacement vectors (DVs) based on the full conformational space is calculated using a specified reference structure. Those DVs are then projected onto a set of eigenvector directions to create delta vector projections (DVPs), which are similar to principal components (PCs). The PCs are delta vector projections, but according to the standard definition used in statistics, they are always relative to the mean conformation position as defined in the construction of the Q, R, or P matrix. In studying the essential dynamics of a protein, it is common to use a reference structure that has a particular physical or biochemical meaning, which is why we call these displacements DVPs, and not PCs. The DVPs are useful to visualize protein motions. For example, if the first two eigenvector directions are selected (those eigenvectors associated with the highest and second highest eigenvalues) the DVPs can be plotted for each frame to construct the trajectory in conformational space projected onto a two dimensional cross-section. Other eigenvector directions can be specified, allowing the user to investigate how the trajectory projects into the space defined by each eigenvector. The DVPs are given using normalized inner products, as well as weighted by the corresponding eigenvalue. The different methods highlight the structure of the data and provide scaling for visualization.
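The projection of a displacement vector onto an eigenvector can be sketched as follows. The sketch divides the inner product by the eigenvector length only; whether JED's "normalized inner product" also normalizes the displacement vector is not stated here, so this choice is an assumption, and the eigenvalue-weighted variant is not shown. The names are hypothetical.

```java
final class DisplacementProjection {
    /**
     * Projects the displacement (frame - reference) onto the direction of an eigenvector.
     * The eigenvector is normalized to unit length before taking the inner product.
     */
    static double dvp(double[] frame, double[] reference, double[] eigenvector) {
        double dot = 0.0;
        double normSquared = 0.0;
        for (int i = 0; i < frame.length; i++) {
            dot += (frame[i] - reference[i]) * eigenvector[i];
            normSquared += eigenvector[i] * eigenvector[i];
        }
        return dot / Math.sqrt(normSquared);
    }
}
```

Calling this for every frame with the first and second eigenvectors gives the two coordinates needed for a two dimensional plot of the trajectory.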

Post PCA Comparative Subspace Analysis:
JED performs a subspace analysis (SSA) on the two equidimensional sets of eigenvectors generated from the Q, R, and P variants of PCA. The results provide a relative comparison for different subspace dimensions starting with 1 dimension up to the dimension chosen by the user (when selecting the number of Cartesian or Distance modes to process) in a sequential fashion. This allows one to quantitatively determine how different the PCA results are due to the choice of PCA model, while assessing the size of the essential subspace. Additional analysis can be done using driver programs from the subspace analysis class. To perform comparative tests, it is best practice to first generate equidimensional sets of eigenvectors from each trajectory of interest, as well as from a pooled trajectory to use as a reference set, while ensuring that the subsets of residues being analyzed are identical. Subspace analysis is done by comparing the sets of eigenvectors, directly or iteratively, and determining the root mean square inner products (RMSIPs), Principal Angles (PAs), cumulative overlap (COs), cosine products (CPs), vectorial angular sum (VAS), and the maximum angle between subspaces of the given vector space. JED produces summary log files for all of these analyses.
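Of the metrics listed above, the RMSIP is the simplest to write down: for two sets of D orthonormal eigenvectors it is the square root of the sum of all squared inner products divided by D. The following Java sketch uses that standard definition and assumes both sets contain the same number of unit-length vectors of equal dimension; it is an illustration with hypothetical names, not JED's subspace analysis class.

```java
final class SubspaceOverlap {
    /**
     * Root mean square inner product between two sets of eigenvectors.
     * Each row of a and b is one unit-length eigenvector; both sets must have D rows.
     * Returns 1 for identical subspaces and 0 for mutually orthogonal subspaces.
     */
    static double rmsip(double[][] a, double[][] b) {
        int d = a.length;
        double sum = 0.0;
        for (double[] u : a) {
            for (double[] v : b) {
                double dot = 0.0;
                for (int i = 0; i < u.length; i++) {
                    dot += u[i] * v[i];
                }
                sum += dot * dot;
            }
        }
        return Math.sqrt(sum / d);
    }
}
```

Evaluating this for D = 1 up to the chosen number of modes reproduces the kind of sequential comparison described above.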

II. Using JED

JED Install Instructions:
Java is platform independent and JREs exist for all common architectures. JED requires JRE version 1.7 or higher. JED can be run from compiled source or from executable jar files. While JED can be installed in any directory that is part of your Java classpath, the source code must be compiled on the local machine to ensure runtime integrity. When compiling from source, be sure to also compile the JAMA MATRIX (http://math.nist.gov/javanumerics/jama/) and KDE (https://github.com/decamp/kde) packages, as JED uses these libraries. Alternatively, no source code or compilation is needed to run the executable jar files. These can be placed in any directory that is on the Java classpath. Either the Java environment variable CLASSPATH should be correctly set to run Java programs at the command prompt, or the -cp option should be added to the java command, which allows you to specify the path that contains your Java classes.

Expected Memory Requirements:
For most applications, a 64-bit OS is required to address memory needs. On high performance computer clusters, make sure the 64-bit JRE is installed. Memory use is demanding because JED loads the complete covariance matrix (among other data structures). Typically 8 to 32 GB of RAM is needed depending on the size of the protein. For very large proteins with thousands of residues and/or tens of thousands of frames, make available as much memory per node as possible. On most platforms, Java performance can be optimized by specifying parameters at runtime for heap space.
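A rough lower bound on the memory needed can be estimated before a run. The sketch below assumes a dense covariance matrix of double precision values (8 bytes each) for a cPCA on N residues, and counts only one copy of that matrix; the other data structures JED holds are not included, so actual use will be higher. The class name is hypothetical. The maximum heap available to the JVM is controlled with the standard -Xmx launcher option.

```java
final class MemoryEstimate {
    /** Bytes for one dense 3N x 3N matrix of doubles (a single copy, nothing else counted). */
    static long covarianceBytes(int residues) {
        long dim = 3L * residues;
        return dim * dim * 8L;
    }

    public static void main(String[] args) {
        int residues = 1000; // example size, chosen for illustration
        long needed = covarianceBytes(residues);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("One covariance matrix for " + residues + " residues: " + needed + " bytes");
        System.out.println("Maximum heap available to this JVM: " + maxHeap + " bytes");
        if (needed > maxHeap) {
            System.out.println("Consider raising the heap limit with the -Xmx option.");
        }
    }
}
```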

Two Kinds of JED Drivers:
There are two driver programs for JED: JED_Driver runs a single job using parameters specified in the input file, while JED_Batch_Driver runs a batch of jobs sequentially. The first is suited for running a single job at the command line or when using submit scripts on computer cluster resources. This can be implemented using job arrays so that your jobs run in parallel rather than sequentially. The second is suited for running multiple jobs on a single computer so that a user can submit a batch of jobs, perhaps overnight, and then come back later with all jobs finished without having to launch each one separately. It could be that a user will organize batch jobs in terms of similar conditions, such that it could make sense to run multiple batches in parallel on high performance clusters.

Note:
The input file formats for the two driver programs are NOT equivalent.

Input File and Data for JED Driver:
JED requires an input file for job parameters. The format of this file will be described below. The run command takes only one argument, which is the name of the input file that includes the absolute path to the file. If no argument is specified, then JED assumes that the default input file name is used and the file is located in the same directory from which the Java Virtual Machine (JVM) was called. The default input file names are:
JED_Driver.txt for JED_Driver.java (or .jar file)
JED_Batch_Driver.txt for JED_Batch_Driver.java (or .jar file)
Each job should be assigned to its own directory, which must contain either the PDB files to read (for Pre-Processing runs) or the Coordinates Matrix to process (for all Analytical runs), along with the reference PDB file and residue lists for specifying the subsets of interest: Cartesian, and/or Distance Pairs.

JED Command Line format:
To run JED_Driver at the command prompt or within a PBS script, you can use one of the following commands:
java -d64 JED_Driver "/path/to/your/input/file.txt" (runs the compiled java program)
java -jar -d64 JED_Driver.jar "/path/to/your/input/file.txt" (runs the executable jar file)
To run JED_Batch_Driver at a command prompt or in a PBS script, you can use one of the following commands:
java -d64 JED_Batch_Driver "/path/to/your/input/file.txt"
java -jar -d64 JED_Batch_Driver.jar "/path/to/your/input/file.txt"

Remember to include command line switches to optimize the Java runtime environment for your jobs.
Note: Different platforms may have slight variants to the options, such as -d 64 with a space, versus without a space.

Organization of Output Files:
Output files from JED are written to subdirectories within the working directory, structured to organize the multitude of files produced in a meaningful manner. The start of this directory tree (the root) is named "JED_RESULTS_$description", where $description is a user set parameter that succinctly describes the job. Limbs of the tree separate Cartesian PCA (cPCA), distance-pair PCA (dpPCA), and mode visualization analysis (VIZ) when present. Each of these in turn contains limbs for Q (COV), R (CORR), and P (PCORR) compartmentalization. Each PCA directory contains 3 subdirectories for the subspace analysis (SSA). All output file names include the number of residues or residue pairs in the selected subset for reference, plus a description of the file contents.

Current Limitations:
Initial input of the protein trajectory must be done using PDB files that are expected to conform to the standard format, or a matrix of PDB coordinates containing the alpha carbon atomic positions only (see below for a description of this file). Only carbon-alpha atomic positions are used to create the coordinates matrix for essential dynamic analysis.
Each PDB file must have the exact same number of residues. The matrix of alpha carbon coordinates is determined from the first PDB file read. If other files in the working directory do not match exactly, then the array sizes will not match and the program will crash. If JED crashes during the reading of PDB files, this is probably the reason.
While JED can process a PDB file with missing residues and various numbering schemes, it cannot interpret files that have alternate conformations within a given frame based on fractional occupancy values. Only a single conformation per frame is allowed. Note that the original residue coordinates in the PDB files are mapped to the rows of the coordinates matrix. A user should preprocess all PDB files and verify that they are error free and do not have ambiguities.
Note: JED reads all PDB files within a specified directory. Separate trajectories should be kept in different directories.

III. Overview of Using JED
A Preliminary Run must be performed to generate the JED formatted coordinate matrix file for all the alpha carbons in the PDB files. This makes subsequent subset analyses much faster to perform. It also serves to guarantee that the specified residues for subset selection are correctly represented in matrix form. After this initialization step, the PDB files can be deleted or archived, with the exception of the reference PDB file. The reference PDB is needed to make movies. Once the coordinate matrix is created, it should be used for all subsequent analyses, using different residue subsets and different job parameters.
The name of the coordinate matrix file produced from the PDB files is: "original_PDB_coordinates.txt". The matrix packing is as follows: Rows are coordinate variables and columns are frames. For N residues, there are 3N rows: N x-coordinates, N y-coordinates, and N z-coordinates, stacked in that order.
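The packing described above implies a simple index rule, sketched here in Java with 0-based indices: for residue i out of N, the x, y and z values sit in rows i, N + i and 2N + i, and the column index is the frame. The class and method names are hypothetical; the sketch only illustrates the layout and does not read the JED file.

```java
final class CoordinatePacking {
    /** Row holding the x-coordinate of a residue (0-based) in a 3N-row matrix. */
    static int rowX(int residue, int n) {
        return residue;
    }

    /** Row holding the y-coordinate of a residue. */
    static int rowY(int residue, int n) {
        return n + residue;
    }

    /** Row holding the z-coordinate of a residue. */
    static int rowZ(int residue, int n) {
        return 2 * n + residue;
    }

    /** Extracts {x, y, z} for one residue in one frame from a matrix with 3N rows. */
    static double[] position(double[][] matrix, int residue, int frame) {
        int n = matrix.length / 3;
        return new double[] {
            matrix[rowX(residue, n)][frame],
            matrix[rowY(residue, n)][frame],
            matrix[rowZ(residue, n)][frame]
        };
    }
}
```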

This matrix contains all the residues in the PDB files and thus can be used for any subset of those residues. When a subset is chosen, a new correspondence set is generated and a new transformation is done to optimize the alignment of the structures. This removes overall translation and rotation for each subset chosen.
In subsequent analyses, it is critical that no residues are requested that do not exist in the PDB files! JED maps the specified residue list to an internal list that is aligned to the rows of the coordinates matrix. JED generates a residue list file for all residues it finds in the PDB files that it reads. Therefore, this residue-list file should be edited with care.

Note:
The most critical step when using JED is the creation of the input file. The input file must have the correct format (shown in examples below) and the entries must be accurate with proper ordering. If either of these conditions is violated, the program will crash, or worse, the results could be corrupt. To avoid producing unintended results, JED provides error-handling feedback during most crashes so that the problems can be understood and addressed.

Common Causes for JED to Crash
- If any specified directory cannot be found or if any specified file cannot be found, JED will crash.
- If unexpected format is found in any of the input files, JED will crash.
The JED driver programs employ many consistency error checks during the reading of the input files and the execution of the program. There are checks to validate the number formats of numeric data, and to ensure enough parameters were specified for a particular job. There are checks to ensure that the input files have the correct format/number of columns, and to ensure the number of modes requested does not exceed the actual number of modes available. JED also verifies that directories and files exist before performing any analysis. In many cases, missing or problematic parameter settings are set to a default value. The developers have attempted to provide meaningful information when the program crashes to facilitate making the necessary corrections. The specified input file is echoed to standard out, as well as assignment of parameters. Error messages are directed to standard error. In the case that a Java runtime exception is thrown, a stack trace will be sent to standard error. Please refer to the Appendix for creating properly formatted input files.

The PDB files (including the PDB reference file) must be in the working directory. The JED input file may be in the working directory.
This pre-processing step will read all PDB files in the working directory, but will perform no PCA.
The purpose of this is to generate the matrix of coordinates for performing subset analyses efficiently.

ii. Root Output Files:
These are written to the root of the JED Results directory tree: /working/directory/JED_Results_Description/
- JED LOG providing a summary of the job parameters and results: JED_Log.txt
- PDB READ LOG listing all the PDB files read, in the order they were read: PDB_READ_Log.txt
- Coordinates matrix from all the alpha carbon coordinates in the PDB files: original_PDB_coordinates.txt
- Transformed coordinates matrix, which aligns all the frames to the reference frame: ss_$num_res_transformed_PDB_coordinates.txt
- List of all residues found in the PDB files for subsequent editing and use:

iii. JED Driver Input File Format: The Preliminary Run
Notes: This is a whitespace separated file with 6 lines.
Line 1
Field 1: specifies read flag, whether to read PDB files (0 or 1), where 1 = yes
Field 2: specifies multi flag, if the PDB files are Multi Chain (0 or 1)
Line 2
Field 1: specifies the working directory (String)
Line 3
Field 1: specifies the description (String) for the requested job
Field 2: specifies the reference PDB (String) for the requested job

Key Points:
- The Read flag MUST be set to 1 to perform the pre-processing runs
  - When the Read flag is set to 1, all PCAs are turned off
- The Multi flag must be set to 0 for Single Chain PDBs with no Chain IDs
- The Multi flag must be set to 1 or "multi" for Multi Chain PDBs with Chain IDs
  - Multi Chain PDBs must have unique chain identifiers for every chain
  - Missing chain identifiers will cause JED to crash
- The file to use in all subsequent JED analyses is the original_PDB_coordinates matrix

B. Debugging Crashes Part I:
Things that will generally make your life miserable…

i. Simple mistakes:
- Are the Read and Multi flags set correctly?
- Does the path to the input file exist?
- Does the input file exist in the proper location?
- Does the input-file start on the first line?
- Is the number format correct? (20.0 will NOT parse as an integer)
- Did you forget a parameter declaration?
- Does the working directory string end in "/" for Linux or "\\" for Windows?
- Does the working directory exist?
- Does the working directory contain PDB files of different sizes?
- Does the working directory contain the reference PDB file?
- Does the reference PDB file exist?
- Does the reference PDB file correspond to the trajectory?

ii. Subtle mistakes:
- The directory contains PDB files in non-standard format.
- The directory contains PDB files with fractional occupancy data.
- The directory contains PDB files with 2 or more chains, but no chain IDs.
- The directory contains PDB files with missing chain IDs.

Problem: If the PDB file names are sorted in a different order than how they were generated, then the conformation RMSD results will not reflect what actually occurred in the simulation.
Fix: Naming the PDB files sequentially by padding the numbers with leading zeros will ensure proper sorting to prevent this problem caused by the operating system.

Problem: If the conformation RMSD is very different from what you expect, you may be using PDB files that contain occupancy information. JED does not use that information. Your results will not be accurate.
Fix: Always perform error checking on your PDB files before using JED.

Problem: Trajectories from pooled data do not track original trajectories when they were individually analyzed.
Fix: If you pool data together, make sure the combined matrix is constructed in the order you think it is, and the reference structure is the one you think it is. If done without error, you can always parse the output files by the same divisions to obtain information about each component trajectory, which can be colored differently and/or plotted separately, etc.

Problem: When making comparative analysis using cPCA across different proteins the covariance matrices look very different even though expectations are they should be similar.
Fix: Do not use different reference structures when making comparisons between different trajectories. Always use the same reference structure for all cPCA procedures in order to directly compare the data.

C. Performing Only cPCA

i. Run command:
java -jar -d64 JED_Driver.jar "/path/JED_Driver.txt"
The working directory must contain: The coordinates matrix, the PDB reference file, and the Cartesian residue list.
The purpose of this type of run is to perform Essential Dynamics using cPCA based on Q, R, and P. The user specifies the subset of interest for the analysis, which may be the entire protein or a sub-region, which can be non-contiguous, by providing a residue list file. This task is simplified since JED has already created a list of all the residues in the protein.
The user can simply edit this file. Keeping a copy of the original is usually best practice. The cPCA results are written to the sub-directory "cPCA" and the visualization of the top modes (when selected) are written to the subdirectory "VIZ". The directory cPCA has sub directories for the Q, R and P analysis, as does the VIZ directory.

ii. Root Output Files:
These are written to the root of the JED Results directory tree:

If a requested residue cannot be found in the reference file, then JED will crash with an error message stating that a requested residue could not be found.

D. Performing Only dpPCA

i. Run command:
java -jar -d64 JED_Driver.jar "/path/JED_Driver.txt"
The working directory must contain: The coordinates matrix, the PDB reference file, and the residue list. JED input file may be in the working directory.
The purpose of this type of run is to perform Essential Dynamics using dpPCA based on Q, R and P models. The user specifies the set of residue pairs of interest for the analysis by providing a residue pair list file. For Single Chain PDBs, this file has two columns in which the pairs of interest are listed. For Multi Chain PDBs, the file has four columns: the first two for the chain ID and residue number of the first residue, and the third and fourth for the chain ID and residue number of the second residue. The dpPCA results are written to the sub-directory "dpPCA". Note that for dpPCA no transform is needed since internal distances are used for coordinates, and no visualization can be done in JED for the distance modes. The eigenvectors from dpPCA are easy to interpret because their components correspond directly to the extension or compression of the distance pairs specified. The directory dpPCA has sub directories for the Q, R and P analysis, as well as for the subspace analysis (parallel to the cPCA method). Rationally selected distance pairs can be considered to investigate experimental findings in critical areas like binding pockets or clefts.
Note: Unfortunately, the dpPCA results cannot be visualized as no simple mapping can be made to the residues.

ii. Root Output Files:
These are written to the root of the JED Results directory tree: /working/directory/JED_Results_Description/
JED LOG providing a summary of the job parameters and results: JED_Log.txt

iii. dpPCA Output Files:
These are written to the /dpPCA subdirectory of the JED Results directory tree:

Note that all entries in the residue list are checked against the reference PDB file.

E. Debugging Crashes Part II:
Things that will generally make your life miserable…

i. Simple mistakes:
Did you set the Read and Multi flags correctly? Are you requesting to read PDBs when you are doing an analytical run? Did you request cPCA but not specify a Cartesian residue list? Did you request dpPCA but not specify a Distance Residue Pair List? Did you set the number of modes appropriately? Are you requesting residues that are not in the reference PDB?

ii. Subtle mistakes:
Did you request more PCA modes than actually exist?
For example (cPCA): If your Cartesian subset contains 12 residues and you ask for 50 modes, then you are going to get error messages, because there are only 36 Cartesian modes in total (3 per residue).
For example (dpPCA): If your Distance Pairs List contains 5 pairs and you request 10 modes, then you are going to get error messages: Because there are only 5 distance-pair modes in total.
In the above cases, JED will attempt to reset the offending value.
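The mode limits in the two examples above follow from simple counting, which can be checked before a job is submitted. The sketch below is an assumption about what a sensible reset looks like (clamping the request to the number of available modes). It is not taken from the JED source, and the class and method names are hypothetical.

```java
/** Hypothetical pre-flight check (not part of JED) for the requested number of modes. */
final class ModeCountSketch {

    /** cPCA: three Cartesian coordinates (x, y, z) per residue in the subset. */
    static int maxCartesianModes(int numResidues) {
        return 3 * numResidues;
    }

    /** dpPCA: one variable per distance pair. */
    static int maxDistancePairModes(int numPairs) {
        return numPairs;
    }

    /** Assumed reset rule: never request more modes than exist. */
    static int clamp(int requested, int available) {
        return Math.min(requested, available);
    }

    public static void main(String[] args) {
        // The two cases from the text: 12 residues with 50 modes, 5 pairs with 10 modes.
        System.out.println(clamp(50, maxCartesianModes(12)));
        System.out.println(clamp(10, maxDistancePairModes(5)));
    }
}
```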
If your trajectory has not equilibrated, then you must address the problem of outliers. If you do not, then the covariance matrix will be highly ill-conditioned and may cause the eigenvalue decomposition to fail. You can check the variables in statistics packages that compute the Kaiser-Meyer-Olkin (KMO) statistic as well as the Measure of Sampling Adequacy (MSA) for each coordinate variable to critically assess your data. If the data are not well suited for PCA, you can condition the variables by setting the z-cutoff in JED between 2.0 and 3.0 when running your jobs. This type of conditioning is not very sophisticated, but it has the effect of lowering the condition numbers of Q and R as well as un-dilating the high and low regions of the eigenspectrum. In particular, it does not alter the ordinality of the eigenvalues, but it does correct the distortion that arises from undersampling when trying to estimate the population covariance matrix from a poor sample covariance matrix.
Examination of the correlation matrix, in conjunction with the partial correlation matrix, can provide insight into the amount of correlation between the variables, and how many variables are conditionally independent.
Note: The KMO and MSA are determined by using the correlations and partial correlations.
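For readers who want to see how the two statistics relate to the matrices, the sketch below uses the standard textbook definitions: the KMO is the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, and the MSA of variable i is the same ratio restricted to row i. The class and method names are hypothetical, and this is not claimed to be JED's internal implementation.

```java
/** Hypothetical helper (not part of JED): KMO and per-variable MSA. */
final class SamplingAdequacySketch {

    /** Overall KMO from a correlation matrix r and a partial correlation matrix p. */
    static double kmo(double[][] r, double[][] p) {
        double sumR2 = 0.0;
        double sumP2 = 0.0;
        for (int i = 0; i < r.length; i++) {
            for (int j = 0; j < r.length; j++) {
                if (i == j) continue; // off-diagonal elements only
                sumR2 += r[i][j] * r[i][j];
                sumP2 += p[i][j] * p[i][j];
            }
        }
        // NaN if every off-diagonal element of both matrices is zero.
        return sumR2 / (sumR2 + sumP2);
    }

    /** MSA for variable i: the same ratio restricted to row i. */
    static double msa(double[][] r, double[][] p, int i) {
        double sumR2 = 0.0;
        double sumP2 = 0.0;
        for (int j = 0; j < r.length; j++) {
            if (i == j) continue;
            sumR2 += r[i][j] * r[i][j];
            sumP2 += p[i][j] * p[i][j];
        }
        return sumR2 / (sumR2 + sumP2);
    }

    public static void main(String[] args) {
        // Made-up matrices, for illustration only.
        double[][] r = {{1.0, 0.6, 0.3}, {0.6, 1.0, 0.4}, {0.3, 0.4, 1.0}};
        double[][] p = {{1.0, 0.5, 0.1}, {0.5, 1.0, 0.3}, {0.1, 0.3, 1.0}};
        System.out.println("KMO = " + kmo(r, p));
        System.out.println("MSA(0) = " + msa(r, p, 0));
    }
}
```

Values near 1 indicate that the partial correlations are small relative to the correlations, which is the situation in which PCA is well suited to the data.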

F. Visualizing Cartesian Modes as an Animation
To visualize cPCA modes as an animation, you must be running a job with cPCA selected. To generate the output files, you need to set the number of modes VIZ > 0 (and ≤ number of available cPCA modes), and set the mode amplitude.
If the mode amplitude is not set, the default value of 1.5 will be applied. A mode amplitude of 1.5 usually provides good-looking movies; however, movie characteristics are somewhat subjective, depending on the system of interest and what is being shown. Adjusting the mode amplitude should be done by trial and error after JED completes the calculation. The output from this job creates movies for vibrations of the top modes chosen for visualization for all three models. JED perturbs the reference structure within the selected sub-region based on the eigenvectors. Using a sine function, the dynamics show as harmonic vibrations. To capture one cycle, 20 structures (PDBs) are used as frames for these movies, which repeat indefinitely using PyMol(TM) scripts to show periodic motion.
Note: Setting the modes VIZ flag to zero turns off the visualization feature.
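The frame generation described above can be sketched as follows. The sketch assumes the simplest reading of the text: each frame displaces the reference coordinates along the mode eigenvector by the mode amplitude times a sine whose phase runs over one cycle. Any additional scaling that JED may apply is not reproduced, and the class and method names are hypothetical.

```java
/** Hypothetical sketch (not JED's code) of sine-based mode animation frames. */
final class ModeAnimationSketch {

    /**
     * Returns numFrames coordinate sets covering one cycle of a harmonic vibration.
     * reference and mode are flattened coordinate vectors (x1, y1, z1, x2, ...).
     */
    static double[][] frames(double[] reference, double[] mode, double amplitude, int numFrames) {
        if (reference.length != mode.length) {
            throw new IllegalArgumentException("reference and mode must have the same length");
        }
        double[][] out = new double[numFrames][reference.length];
        for (int k = 0; k < numFrames; k++) {
            double phase = 2.0 * Math.PI * k / numFrames; // runs from 0 towards 2*PI
            double scale = amplitude * Math.sin(phase);
            for (int i = 0; i < reference.length; i++) {
                out[k][i] = reference[i] + scale * mode[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up single-atom example, for illustration only.
        double[] reference = {0.0, 0.0, 0.0};
        double[] mode = {1.0, 0.0, 0.0};
        double[][] movie = frames(reference, mode, 1.5, 20);
        System.out.println("frames generated: " + movie.length);
    }
}
```

Because the last frame stops one step short of a full cycle, looping the frames gives continuous periodic motion without a repeated structure.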
Additionally, JED constructs an Essential Modes Visualization that is composed of a superposition of the top chosen modes or the top 5, whichever is less. The process is similar to how individual modes are handled, with frequencies increasing and amplitudes decreasing as mode number increases. When the ratios of eigenvalues are not whole numbers, the PyMol(TM) script that cycles indefinitely will show a discontinuity because higher frequencies will not be multiples of the lowest frequency. Because the relative amplitudes of higher modes decrease rapidly, contributions to essential motion from higher frequency modes can be windowed to look at different time scales.
These files will be located in the /VIZ subdirectory of the root of the JED results tree: /working/directory/JED_Results_Description/VIZ/
The Q results will be in the subdirectory /COV.
The R results will be in the subdirectory /CORR.
The P results will be in the subdirectory /PCORR.

G. Performing Multiple PCAs
The working directory must contain: The coordinates matrix, the PDB reference file, and the residue lists.
JED is capable of doing cPCA (with or without visualization), and dpPCA simultaneously. All outputs are delivered as discussed for the individual components.
JED expects the input file to follow this format regarding the order of the residue lists:
If cPCA then Cartesian Residue List
If dpPCA then Distance Pair Residue List
An important advantage of JED is that it is highly configurable to perform many types of Essential Dynamics analysis concurrently. Combining this with cluster resources, or just using the batch feature (discussed in the next section), allows a user to process a great deal of data efficiently.

Key Points:
- The Read flag MUST be set to 0
- The Multi flag must be set to 0 for Single Chain PDBs with no Chain IDs
- The Multi flag must be set to 1 or "multi" for Multi Chain PDBs with Chain IDs
- If the number of cPCA modes > 0, then there must be a Cartesian residue list specified
- If the number of dpPCA modes > 0, then there must be a Distance Pair residue list specified

E. Debugging Batch Crashes:
Handling batch problems is more difficult than for single jobs. Any problem in any job can cause a crash. Thus, it is good to track the standard output and error streams to record the cause of any problems. You will then also know which jobs ran successfully, so that you can edit the batch input file to finish the remaining jobs.

i. Simple mistakes:
Are the Read and Multi flags set correctly? Is the number format correct? (20.0 will NOT parse as an integer) Do ALL the working directories, residue lists, and PDB reference files exist? Do ALL the working directories end in "/" for Linux or "\\" for Windows? Does the working directory contain ALL of the required files? (2 Types of PCA = 2 residue list input files)

ii. More subtle mistakes:
Do any of the jobs have residue sets that will not allow the batch parameters to apply?
For example, too few residues for the number of modes specified. Does the directory contain PDB files in non-standard format or with fractional occupancy data? Does the reference PDB file correspond to the trajectory? (JED must map the ref pdb to the coords matrix) Does the directory contain Multi Chain PDB files with missing chain IDs?
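Two of the simple mistakes listed in this section (a number such as 20.0 where an integer is expected, and a working directory without a trailing separator) are easy to catch before a batch is submitted. The following is a minimal sketch of such a check. The class and method names are hypothetical and are not part of JED.

```java
/** Hypothetical pre-flight checks (not part of JED) for batch input values. */
final class BatchInputCheckSketch {

    /** True only if the text parses as a Java int; "20.0" does not. */
    static boolean parsesAsInteger(String text) {
        try {
            Integer.parseInt(text.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** True if the directory string ends in "/" (Linux) or a backslash (Windows). */
    static boolean endsWithSeparator(String directory) {
        return directory.endsWith("/") || directory.endsWith("\\");
    }

    public static void main(String[] args) {
        // Made-up values, for illustration only.
        System.out.println("20 is an integer: " + parsesAsInteger("20"));
        System.out.println("20.0 is an integer: " + parsesAsInteger("20.0"));
        System.out.println("trailing separator: " + endsWithSeparator("/working/directory/"));
    }
}
```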

A. Pooling Data:
It is often useful to pool trajectory statistics. This can be done in JED by combining coordinate files and then performing the usual analysis. To combine the coordinate files, there is a utility program called Pool_Driver.java that will combine multiple matrices into one. Each matrix is appended to the last column of the preceding matrix. Of course, the number of rows in the coordinate files must match.
The matrices to combine are specified by an input file called POOL.txt that the user must construct correctly.

Notes:
This file specifies 2 jobs, with 3 matrices to combine for job 1, and 4 matrices to combine for job 2. Be sure that each path and matrix file exists.

iii. Output File format:
The output is a single, augmented matrix with the same number of rows as each of the input matrices and a number of columns equal to the sum of the columns in all of the input matrices.
The output file name is: Pooled_Coordinates_Matrix_$description_$number-of-input-matrices.txt
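The column-wise pooling described in this section can be sketched as follows. This is an illustration of the operation only, not the code in Pool_Driver.java. The class and method names are hypothetical, and the matrices are assumed to be rectangular and already loaded into memory.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not Pool_Driver.java) of column-wise pooling of coordinate matrices. */
final class PoolSketch {

    /** Appends each matrix to the right of the preceding one; row counts must match. */
    static double[][] pool(List<double[][]> matrices) {
        int rows = matrices.get(0).length;
        int totalCols = 0;
        for (double[][] m : matrices) {
            if (m.length != rows) {
                throw new IllegalArgumentException("All matrices must have the same number of rows");
            }
            totalCols += m[0].length;
        }
        double[][] pooled = new double[rows][totalCols];
        int offset = 0; // first free column in the pooled matrix
        for (double[][] m : matrices) {
            int cols = m[0].length;
            for (int r = 0; r < rows; r++) {
                System.arraycopy(m[r], 0, pooled[r], offset, cols);
            }
            offset += cols;
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Made-up matrices, for illustration only.
        double[][] a = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] b = {{5.0}, {6.0}};
        System.out.println(Arrays.deepToString(pool(Arrays.asList(a, b))));
    }
}
```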

B. Subspace Analysis:
Once JED Driver has been run on multiple trajectories, as well as pooled trajectories, an analysis can be done to compare how similar the essential subspaces derived from those trajectories are to each other. JED contains a program called Subspace_Analysis.java along with 3 driver programs that perform those functions. The core program takes as input two matrices of eigenvectors derived from PCA (or NMA, ANM, etc.). The matrices must have the same number of rows and columns, meaning the vectors being compared come from the same vector space having the same dimension. For example, as part of an analysis you might choose to process 20 cPCA modes while examining 10 different experimental conditions, plus pooled data. As long as all subsets in the analysis are the same, then they all can be directly compared to one another. Note that modes from other methods (e.g. ENM or ANM) can also be compared if they have the same dimension, but care should be taken to identify which eigenvectors from different methods map into similar subspaces.
Like most of the JED programs, the subspace analysis program driver reads an input file called SSA.txt to obtain runtime information. This file must be constructed properly to perform the analysis correctly. The three driver programs are SSA_Driver.java, FSSA_Driver.java, and FSSA_Iterated_Driver.java and are different in how much analysis is requested. The SSA_Driver gives full outputs for non-iterated subspace comparison including both log files and individual flat files. The FSSA_Driver is a light-weight version with only RMSIP and PA output in the log files. The Iterated version performs a recursive variation of the above where all equidimensional subspaces are compared up to the size that was provided, for example, from 1 to 20 by step-size 1 for a 20 column input file.
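To make the RMSIP comparator concrete, the sketch below uses the standard definition: the square root of the mean, over the subspace dimension, of the squared inner products between every pair of eigenvectors from the two sets. It assumes that eigenvectors are stored as matrix columns and are orthonormal. The class and method names are hypothetical, and this is not the code in Subspace_Analysis.java.

```java
/** Hypothetical sketch (not Subspace_Analysis.java) of the RMSIP between two subspaces. */
final class RmsipSketch {

    /**
     * Root mean square inner product of two eigenvector sets.
     * Each matrix is [rows][dim], with one eigenvector per column.
     */
    static double rmsip(double[][] a, double[][] b) {
        int rows = a.length;
        int dim = a[0].length;
        if (b.length != rows || b[0].length != dim) {
            throw new IllegalArgumentException("Matrices must have the same number of rows and columns");
        }
        double sum = 0.0;
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                double dot = 0.0; // inner product of column i of a with column j of b
                for (int k = 0; k < rows; k++) {
                    dot += a[k][i] * b[k][j];
                }
                sum += dot * dot;
            }
        }
        return Math.sqrt(sum / dim);
    }

    public static void main(String[] args) {
        // Made-up orthonormal vectors in a 4-dimensional space, for illustration only.
        double[][] a = {{1, 0}, {0, 1}, {0, 0}, {0, 0}};
        double[][] b = {{0, 0}, {0, 0}, {1, 0}, {0, 1}};
        System.out.println("RMSIP(a, a) = " + rmsip(a, a));
        System.out.println("RMSIP(a, b) = " + rmsip(a, b));
    }
}
```

An RMSIP of 1 means the two subspaces coincide and 0 means they are orthogonal, which is why the two inputs must have the same number of rows and columns.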

D. Essential Mode Visualization:
JED contains a program called VIZ_Driver.java that allows the user to view the superposition of a set of selected modes. A starting mode is specified as the leading edge of a window. The number of modes to visualize determines the size of that window. To enhance the log-coloring scheme, low and high thresholds can be adjusted for a percentage of frames to be set to minimum and maximum values, broadening the coloring of the inter-threshold range. The mode amplitude parameter determines the amount of displacement from equilibrium for visualizing the modes. This driver gives the user considerable control over the visualization process, including the number of frames to generate and the number of cycles to capture for the slowest mode. All required information is specified in the input file VIZ.txt.

iv. Output File format:
The output is a set of PDBs of size $num_frames and sets of twenty PDBs for the individual modes if requested by setting $do_individual to "1". Also, PyMol TM scripts are generated to animate the sets of PDB files (.pml files).
Comparators include RMSIP and Principle Angles, for the essential subspace and iterated comparisons from dim 1 to 3
Additional log files can be found in the /SSA directory tree.
Performing Cartesian Mode Visualization on Top 3 cPCA modes.
Sets of 20 structures were generated to animate each selected cPCA mode, for the COV, CORR, and PCORR PCA models.
Atoms of each residue were perturbed along the mode eigenvector using a sine function ranging from 0 to 2PI.
A PyMol(TM) script was generated for each mode to play the mode structures as a movie.
MODE AMPLITUDE = 2.500
Maximum possible angle between two subspaces of this dimension is 201 degrees