VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces

Background MCMC-based methods are important for Bayesian inference of phylogeny and related parameters. Although being computationally expensive, MCMC yields estimates of posterior distributions that are useful for estimating parameter values and are easy to use in subsequent analysis. There are, however, sometimes practical difficulties with MCMC, relating to convergence assessment and determining burn-in, especially in large-scale analyses. Currently, multiple software are required to perform, e.g., convergence, mixing and interactive exploration of both continuous and tree parameters. Results We have written a software called VMCMC to simplify post-processing of MCMC traces with, for example, automatic burn-in estimation. VMCMC can also be used both as a GUI-based application, supporting interactive exploration, and as a command-line tool suitable for automated pipelines. Conclusions VMCMC is a free software available under the New BSD License. Executable jar files, tutorial manual and source code can be downloaded from https://bitbucket.org/rhali/visualmcmc/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1505-3) contains supplementary material, which is available to authorized users.


INTRODUCTION
VMCMC has not yet been released, and the information below is up-to-date. Feedback is much appreciated, should you want to try VMCMC out. We have a stable stand-alone version uploaded on this site.

Visual Markov Chain Monte Carlo
VMCMC is an application for analyzing output of MCMC (Markov Chain Monte Carlo) chains. It deals with various metrics for assessing MCMC convergence, as well as summary statistics of inferred realvalued parameters, tree topologies, etc. VMCMC is designed for tab-delimited MCMC files but can also be used with MCMC output from the C++ version of PrIME, some applications of JPrIME (e.g., DLRS and DLTRS), MrBayes and BEAST. VMCMC is a Java application that entails many computational algorithms and statistical techniques to analyze MCMC chains statistically and visually. The command line output can be parsed using any JSON parser and therefore VMCMC has the ability to be automated (thereby can be used as part of pipeline).

Input
Nothing or alternatively a MCMC file.

Output for command line
JSON formatted file with inquired statistics. Below, you will find information on the input and output of the application, while explanations of vital program options and some sample files can be found at the end of the document. If you want to get started quickly, we suggest that you download and try Example 1, and have a look at it while reading through the other parts of the guide.

DOWNLOAD
VMCMC is distributed as a Java executable and is included in the VMCMC JAR file, whose up-to-date version can be obtained from here.
Older VMCMC executables can be found here.

REQUIREMENTS
VMCMC requires Java SE 6 but can also run on later versions. However, on version 7 (at least), the 2D plotting functionality is slower than on version 6 and therefore, the Graph panel takes a lot of time to load and significant delay is observed. Hence, Java SE 6 is recommended for VMCMC.

REFERENCES
VMCMC is not yet accepted or published but we expect it to be published in some reputed journal soon. If you use it, please look at this site for updates on citation.

MCMC File
The MCMC file should be a tab delimited file containing samples consisting of one or more parameters. Additionally, it can be computed with any suitable Bayesian phylogeny inference program (PrIMe, JPrIMe, BEAST or MrBayes currently) and should be provided without any editing e.g. following is a snapshot of an output from JPrIMe and the complete example can be followed in Example 1: Note that VMCMC expects standard MCMC chain output from these software. If the numeric parameters in these software contain any String values or the tree provided is not in Newick or Nexus format or the output from these software is edited (except for sample deletion or whole line deletions from data) in particular the header line is removed, then VMCMC will give an error. So please make sure that the input to VMCMC is exactly as output by the originating software.

OUTPUT
Output is typically shown in GUI. However, with proper options, it can also be directed to a file in JSON format and a JSON parser can be employed to extract and understand the output. Following is the output of the posterior tree distribution through command line using the option "-p".The file with the samples will look something like: 6/25 7 RUNNING 7.1 Starting the application VMCMC can be started as a command-line Java application as well as with graphical user interface. You would start VMCMC graphical user interface by running e.g.: or preferably without filename.
or for command line output, use the corresponding option.
where the JAR file of course refer to your current setup. See also the section on options below. You may need to increase the default heap size allocated by Java in case of very large MCMC files usually in excess of 20k samples. The memory requirements will primarily depend on the size of your MCMC data. We recommend using Oracle's Java HotSpot? virtual machine (VM), in which case a recommended heap size of 512 MB and a maximum heap size of 1024 MB would be specified with: Note that these options refer to Java's VM itself, and are not options of VMCMC. Tuning other aspects of a Java VM is a non-trivial area and is probably not necessary. Should you still wish to do so, please have a look at these documents: Oracle Java HotSpot VM FAQand Oracle Java HotSpot VM Options.

7/25
8 Convergence diagnostics and other analysis tools Several MCMC-tailored analysis softwares exist, among them: • The CRAN R package CODA. • Tracer.

Alternate to VMCMC!
Generally, CRAN R is suitable for many types of analyses. For instance, you could read and plot the duplication rate in R thus: To get a quick glimpse of the tree space, you could even use a shell such as bash (assuming the tree is in column 9 and we only look at the last 5,000 samples): This will quickly identify the maximum probability tree among the 5,000 samples. But rest assured, VMCMC is much easier to use with a lot more functionality and much more deterministic e.g. for convergence diagnostics and tree parameters.

OPTIONS
VMCMC comes with a large number of user options; please don't feel overwhelmed! All options are detailed below. You can always type to get a more up-to-date but brief description of all available options. Notice that some options have precedence over others and will display the highest prioritized option of the selected ones only. VMCMC computed Geweke convergence and burn in estimator for the MCMC chain shown on stdout. It does not set the burnin to Geweke estimated burnin for other options but will estimate and output the burnin on command line. By default is turned off. • -e Effective Sample Size test result on command line only. VMCMC computed estimated sample size burn-in and convergence estimator for the MCMC chain shown on stdout. It does not set the burnin to ESS estimated burnin for other options but will estimate and output the burnin on command line. By default is turned off. • -r Gelman-Rubin convergence test result on command line only. VMCMC computed gelman rubin burn-in and convergence estimator for the MCMC chain shown on stdout. It does not set the burnin to Gelman-Rubin estimated burnin for other options but will estimate and output the burnin on command line. By default is turned off. • -p Display Posterior distribution of trees. By default is turned off.

General options
• -b<int> Burn-in samples to remove from trace for command line analysis for single chains only.
By default is -1. • -b1<int> Burn-in samples to remove from trace for command line analysis for first chain among parallel chains. By default is -1. • -b2<int> Burn-in samples to remove from trace for command line analysis for second chain among parallel chains. By default is -1. • -m Calculate and display MAP tree. By default is turned off.
• -ct Run the global convergence tests only. By default is turned off.
• -sd Sample a data point uniformly from converged MCMC chain. By default is turned off.
• -o Output command line options on standard output or in a specified file. By default is set to stdout. • -pa Path where to make the output file for command line VMCMC. Default: ./.
• -a<float> Alpha value/Level of significance for parallel chain analysis. Default: 0.05.

First window
If VMCMC is ran without any arguments, then the first Graphical User Interface appears with File menu. The user can choose if he wants to open example runs from various software from Examples menu or if they want to open a file from the hard disk. Furthermore, it contains the help menu referring to this website tutorial and also the names of members of the team responsible for VMCMC in the About menu.
From the command line, the first window can be accessed by the following command.

Convergence Diagnostics
VMCMC has several established parameter convergence diagnostics like Geweke and Gelman Rubin. These are displayed on the right panel for each parameter. While Geweke follow the traditional definitions, Gelman Rubin parameter burnin estimates are defined by "the first sample value for which gelman rubin diagnostic results as converged for a parameter". From the command line, the Geweke convergence diagnostic for each parameter can be calculated by the following command.
From the command line, the Gelman Rubin convergence diagnostic for each parameter can be calculated by the following command.
The ESS Max or Sahlin-Höhna ESS Burnin estimator (SHEB) is however not so commonly used. Please note that it does not display the ESS value for a given burnin but determines the sample number at which the ESS value is maximized for a given parameter chain. Sebastian Höhna and Kristoffer Sahlin should be credited for proposing this diagnostic for a single parameter. It will be interesting to show the actual ESS value observed for this burnin as well in a separate column for making convergence assessment decision easy (e.g., by using the Tracer heuristic of greater than 200 for converged and for less than 100 as not converged). Please refer to Kristoffer Sahlin Master thesis titled "Estimating convergence of Markov chain Monte Carlo simulations" for theory and estimation method for ESS Max (exact formula on page 15).

10/25
From the command line, the ESS M ax convergence diagnostic for each parameter can be calculated by the following command.
From the command line, the convergence assessment and burnin estimates of all three convergence diagnostics for each parameter can be calculated by the following command.
Of particular interest are the global convergence diagnostics which are not available or have not been established yet. What is the overall convergence status of the whole MCMC chain and not of the individual parameters? If the chain has converged, what is the universal burnin for all numeric parameters? VMCMC attempts to answer these questions using the global convergence diagnostics and we are in the process of establishing these burnin estimation and convergence assessment diagnostics for the complete run including all parameters. A basic study that establishes why these diagnostics are important will be coming soon in the near future! The global convergence diagnostics are shown in the left panel below.
1. ESS Last Burnin is the maximum of all parameter estimates for ESS M ax or Sahlin-Höhna ESS Burnin estimator (SHEB). We have used the maximum of all parameter burnins concept using this estimator to estimate a universal burnin for multiple parameters. 2. Geweke Last Burnin is the maximum of all parameter estimates for Geweke and not converged if any parameter in the chain has not converged according to Geweke convergence diagnostic. 3. Gelman-Rubin LB is the maximum of all parameter estimates for Gelman-Rubin and not converged if any parameter in the chain has not converged according to Gelman-Rubin convergence diagnostic. 4. Standaradized ESS standardizes all numeric parameters and then estimates convergence for the whole vector per sample (treating it as a n-dimensional data with n = number of parameters) instead of a single point per sample. This removes the bias for each parameter and brings the scale to the same level. 5. Normalized ESS normalizes all numeric parameters and then estimates convergence for the whole vector per sample (treating it as a n-dimensional data with n = number of parameters) instead of a single point per sample. This removes the bias for each parameter and brings the scale to between 0 and 1 for all parameters.
From the command line, the global convergence diagnostics can be applied by the following command and the burnin and convergence can be assessed for the complete chain.
$ j a v a −Xmx2000m −Xms1800m −j a r VMCMC −X.Y. Z . j a r <MCMC F i l e > −c t −o <Output> Please note that by default, the burnin suggested by ESS LastBurnin is selected as the burnin for all parameters in the chain for the GUI-based VMCMC and from the command line based VMCMC (if no burnin has been supplied) for those analysis that require burnin.

Parameter Trace and Statistics
MCMC trace is another important characteristic of MCMC chains and trace of selected parameter from the dropdown menu is shown graphically on the right hand side of the tab. Mixing of MCMC for numeric parameters is seen and visual estimation of convergence and burnin can be performed using the parameter trace.
Parameter statistics for the selected parameter are shown on the left hand side with minimum and maximum values, arithmetic, harmonic and geometric means, arithmetic and geometric standard deviations, Bayesian confidence interval and burnin and convergence estimates are shown. These values can be used to verify the mixing of each parameter numerically and in case of command-line automate this verification of mixing as well as accuracy of MCMC chains (if true parameter value, if known, exists within 95% confidence interval).
From the command line, the parameter statistics can be determined by the following command with the specified burnin or with the burnin estimated from ESS LastBurnin if burnin is not specified and determine the credible interval using the specified confidence level. An important characteristic of VMCMC is the ability to extract a selected interval and display it in a new tab (which is like zooming in on a particular window and estimating its statistics and visualizing its trace closely).
Another option available from command line is to determine the convergence test results and parameter statistics by the following command.

Tabular Representation of Data
Numeric parameters are presented in tabular format, so that values of each parameters are easily readable and user can see the data in a good (Microsoft excel-like) presentation.

Tree Parameter View and Properties
Tree parameter is an important parameter of MCMC analysis. Estimating the true posterior distribution and searching for MAP tree is a difficult task and in particular determining equivalent trees with different Newick and Nexus representation is not possible using conventional cut, tail, sort and uniq method. Therefore first the branch lengths of trees are removed and then we estimate the posterior distribution for trees. The output is sorted and displayed with tree numberings, number of trees with this topology and frequency within the chain after removing burnin samples (determined by ESSMax) on the left hand panel. At the top are the newick string(s) of the selected tree(s). At the right hand panel, the selected tree(s) are shown in cladogram(s) using forester library.
From the command line, the tree posterior can be determined by the following command with the specified burnin or with the burnin estimated from ESS LastBurnin if burnin is not specified. An important characteristic of VMCMC is its ability to display how a numeric parameter behaves for a selected tree from the tree posterior, i.e., what were the parameter statistics for a selected numeric parameter (selected from drop-down menu) for a selected tree in the tree panel. Therefore selecting tree(s) from left hand side and pressing Mark Selection in Graph button will highlight the corresponding regions in parameter trace in Graph tab. Switching to Graph tab, one can visualize the ares, this tree topology is observed and Clear Tree Markings will remove these marks. Notice that each tree selected has a unique color although at this moment, there is no mapping available for a color to a tree.

Consensus Tree Properties
Another interesting property of the trees is to see which trees are closer to each other and which are not. This can be done in two ways, through the consensus tree which shows the maximum splits based on frequency of selected trees and secondly through the Multi-Dimensional Scaling (MDS) technique discussed later. The consensus tree can be determined for all trees if one or none of the trees is selected. If two or more trees are selected, then the consensus tree for these trees will be computed based on the majority splits on tree frequency.

15/25
The second way to estimate distance between all trees is Multi-Dimensional Scaling (MDS) technique which displays the tree posterior in 2 Dimensional space as an edgeless graph, where each tree is represented by a labelled vertex (whose size is proportional to tree frequency and whose label is T + the tree number). First a 2Dimensional distance matrix is constructed, which takes as input all tree topologies with frequency greater than 0.2% and Robinson Fould's distances between two trees are computed and stored in the matrix. Then this matrix is passed to MDS scaling module, which fits these trees onto two dimensions according to the scaled distance between the trees and minimizing the difference (error).

Parallel Chain Analysis for Continuous Parameters
If two chains have ran in parallel with the same data and using the same software, how can we determine, if the samples have been sampled from the same posterior distribution? More importantly, if so, then how can we combine this information and get maximum usage of parallel chains? VMCMC addresses this question in three steps. The first way is that Mann Whitney U test is applied for numeric parameters for a given level of significance and decision for this parameter is shown at the bottom left side of the parallel chain trace window. The second step is pooling of tree parameters after MCMC convergence and using this data to infer tree related properties like tree posterior distribution and tree analysis tab etc. Please note that after removing the specified/default burnin samples from both traces, remaining samples in both chains are combined and statistics for a combined trace are displayed.

Parallel Chain Analysis for Tree Parameters using Splits
VMCMC uses tree splits to identify if both the tree parameter posterior distribution have been sampled from the same distribution using the Chi square for two independent samples test for a given level of significance. We acknowledge MrBayes for this measure. Burnin value are taken from trace window and after removing the samples in burnin period, the tree split bins are computed on which Chi square test is applied and the Chi Square test statistic, p-value and acceptance/rejection of null hypothesis are displayed on the left hand side. Please note that after removing the specified/default burnin samples from both traces, the remaining trees in the posterior of both chains are combined and displayed for the combined trace. The Mann Whitney U test for continuous parameters and Chi Square test for tree splits can also be applied through command-line with pre-specified burnin for both chains or with the burnin calculated from ESS Lastburnin for both chains with a specified alpha value.

Sample data
This option is only available from command line. If one wants to randomly select a single sample from converged part of the chain, then the following command can be used with the specified burnin or with the burnin estimated from ESS LastBurnin if burnin is not specified.
VMCMC is a simple application with single input even for parallel analysis. Following are a few tips in order to use GUI for single chain and parallel chain analyses.
1. Single chain analysis: The input file for single chain analyses is a single MCMC file for CODA, PrIMe and JPrIMe and two files (with same names but with different extensions where one ends with .p and other with .t) for MrBayes?. For GUI, first run the program without providing any arguments. Then click the File menu on the first GUI and click the "Open File" submenu. Select the MCMC file for JPrIMe, PrIMe and CODA and the file ending with .p for MrBayes. Note that "File" menu does not contain any "Parallel Analysis" submenu. This is hidden and will appear once a file has been opened using "Open File" submenu.
2. Parallel chain analyses: Parallel chain analyses is performed on two chains for the same program ran on the same data. Furthermore at this point of time, the number of samples in both chains must be same because the trace window is unable to handle empty data for one chain at this point of time. The Chi-Square test and Mann-Whitney U-test as well as other algorithms and testing utilities are however not effected by this limitation.
Go to File → Parallel Chain Analysis and select the second file under the same restrictions for Single chain analyses. For JPrIMe, CODA and PrIMe files, select the chain with which you want to perform Parallel Chain analysis of the first chain. For MrBayes?, select the file with extension ".p" of the chain with which you want to perform Parallel Chain analysis of the first chain. Note that parallel analysis can only be performed if a single file has been opened using the "Open File" submenu otherwise this submenu will not appear on "File" menu. Furthermore, before this analysis, "File" menu contained "Parallel Analysis" submenu. This is now hidden and will not appear once a parallel analysis has been performed. To perform another parallel analysis, open a file using "Open File" submenu and it will be possible to perform a parallel analysis with this opened chain.

EXAMPLES
All these examples are already in the application itself. When one runs VMCMC, they can be found under the Examples submenu. However, the original data files are not visible in the application examples and so here we give the example files for better understanding of the input and functionality of VMCMC. Read Useful Tips and GUI in line with these example files to run and understand the output and functionality provided by VMCMC. Q2) When I run VMCMC using other programs e.g. pDLRS, it runs fine but the convergence tests are always giving not converged. Why? A2) VMCMC assumes that the first column of MCMC input to VMCMC is the iteration number, which is neglected and not not used in MCMC analysis. pDLRS has a repetitive column of iteration number, which can never converge and as ESSMax is VMCMC's current solution to estimating convergence, therefore the output from pDLRS can never converge. We are planning to add an option to the user, where he can specify to neglect particular column(s) for MCMC analysis and so presence of such a column can be handled by the program.
Q3) I get error message that seem to be related to the input format and which states that VMCMC input does not belong to the specified list of software (MrBayes, CODA, JPrIMe and PrIME). What could be wrong?
A3) This happens, when the parser is unable to parse MCMC input. Possible reason could be tinkering with headers in the files. Make sure that the original file output from software is provided to VMCMC. removing of samples is allowed but whole lines must be removed and in case of MrBayes, the tree and the numeric parameter files must correspond to each other. Also the Newick tree structure and Nexus tree structure must be maintained.
Q4) I get a warning that large number of trees are present in the posterior and then the Tree Analysis tab seems to be struck and not working sometimes. Why is that? A4) You must wait for some time and the results will be displayed on the Tree Analysis tab. Such a thing occurs whenever the number of trees in tree posterior with greater than 0.2% frequency is more than 45. A warning is generated to the user since the 2D matrix computation for Robinson's Fould distance is very time consuming. Q5) I am getting error message on screen related to parsing of numeric columns, when I give a single or parallel chain to VMCMC. Why is this error message generated? A5) One reason for such a behavior is when values like INFINITY appear in numeric columns. The numeric data can not be parsed and an appropriate exception is raised for such a case. Make sure that the data is correct and numeric columns contain strictly numeric values.
Q6) I notice that the Graph panel takes a lot of time to load and almost all interactive functionalities (slider, burnin specification and parameter selection) are very slow on this tab. Is that a bug or natural behaviour of this tab? A6) We did not notice this behaviour in Java SE 6. However, when VMCMC is run on Java SE 7, such a behaviour is observed. On a little investigation, Oracle mentions that the Java SE 7 has some significant delays in drawing 2D plots as compared to Java SE 6. Therefore, it is plausible that the reason for these delays are usage of Java SE 7 and not Java SE 6. That is why we recommend to run VMCMC on Java SE 6. Q7) Does VMCMC support MrBayes with stationary trees (e.g. with just .p file)?
A7) VMCMC supports MrBayes files (.p with .t as well as just .p) files so MRBayes runs with stationary/fixed tree topology is supported. Note that we can even open .mcmc file of MrBayes, which is a tab separated file but which do not correspond to parameter posterior samples.

23/25
VMCMC Tutorial Q8) Whereas log files can become large, that's generally not a real problem in terms of memory management. The same can not be said about output files containing tree samples. Should some advice be given concerning the magnitude of .trees files (in terms of the number of sampled trees)?
A8) If the output file contains very large number of samples (e.g., more than tens of thousands of samples) or if not enough memory is available for VMCMC, the processing speed of VMCMC becomes very slow. However, the progress bar has been implemented exactly to show that VMCMC is working and has not crashed or gotten stuck. However we have tried with MCMC trace file generated from JPrIMe with one hundred thousand very large trees (with more than 200 leaves) and the response time is about 20 seconds to load the complete file. So we would not like to limit the number of sampled iterations but can say that waiting time increases exponentially with the number of sampled iterations and one should thereby expect longer waiting times.
Q9) Why are "ESS Normalized" and "ESS Standardized" not displayed when the GUI is loaded for an MCMC file? Is it a bug? A9) "ESS Normalized" and "ESS Standardized" are joint burnin estimators and convergence diagnostics and take a long time to compute. This computation is being done using a threaded architecture and we wanted to show the other results right away without waiting for these computations. Therefore, we load the GUI as soon as the file is read and other computations are done but leave the two joint ESS estimators blank until the values have been computed. This is particularly noticeable if a MCMC file with large number of samples is provided. So it is not a bug but an intended feature.
Q10) For MrBayes output, it is very confusing that the default files that can be opened are *.mcmc files. This mislead me to open the *.mcmc output from a MrBayes run which gave completely nonsense results A10) MCMC tools like PrIMe and JPrIMe have used the .mcmc suffix for their output and VMCMC was specifically designed for most popular MCMC tools. Therefore, one should open .p file for MrBayes and avoid opening .mcmc file.
Any further questions, please send to the corresponding authors.