- Correspondence
- Open Access
Equivalent input produces different output in the UniFrac significance test
https://doi.org/10.1186/1471-2105-15-278
© Long et al.; licensee BioMed Central Ltd. 2014
- Received: 11 December 2012
- Accepted: 21 July 2014
- Published: 13 August 2014
Abstract
Background
UniFrac is a well-known tool for comparing microbial communities and assessing statistically significant differences between communities. In this paper we identify a discrepancy in the UniFrac methodology that causes semantically equivalent inputs to produce different outputs in tests of statistical significance.
Results
The phylogenetic trees that are input into UniFrac may or may not contain abundance counts. An isomorphic transform can be defined that will convert trees between these two formats without altering the semantic meaning of the trees. UniFrac produces different outputs for these equivalent forms of the same input tree. This is illustrated using metagenomics data from a lake sediment study.
Conclusions
Results from the UniFrac tool can vary greatly for the same input depending on the arbitrary choice of input format. Practitioners should be aware of this issue and use the tool with caution to ensure consistency and validity in their analyses. We provide a script to transform inputs between equivalent formats to help researchers achieve this consistency.
Keywords
- Unifrac
- Computer methodologies
- Metagenomics
Background
UniFrac is a tool for comparing microbial communities while taking phylogenetic distance between organisms into account. It is available as a web service hosted by the University of Colorado at Boulder at http://bmf2.colorado.edu/unifrac/index.psp, and is described by Lozupone, Hamady and Knight [1, 2]. According to recent work by the authors, UniFrac is a popular tool that has been cited in over 150 publications [3].
Simple input tree for UniFrac. Simple example of UniFrac input that does not use abundance counts.
Input tree for UniFrac with abundance counts. A tree input into UniFrac where abundance counts are included at each leaf node in compact form.
It is exclusively the weighted version of the algorithm that is the subject of the remainder of this paper.
Results and discussion
Overview
In this section, we establish the nature of the discrepancy that leads to the UniFrac significance test producing different results on equivalent inputs. In order for two objects to be equivalent, we must show two things: first, that the two objects are isomorphic, and second, that both objects maintain the same semantic interpretation. Intuitively, two objects are isomorphic to each other if a procedure exists to transform one to the other and back again. A more mathematical definition of isomorphism follows in the next section.
In the preceding section, we described how trees input to the UniFrac program may or may not contain abundance counts. Below, we define a transformation for converting trees back and forth between these two formats. We briefly discuss how both tree formats have the same semantics. We then demonstrate that the UniFrac tool produces different outputs for the equivalent forms of the same input tree using a minimal example. Such a discrepancy is dangerous to researchers; it can lead them to false conclusions since their results may depend on the choice of input representation, making reproduction of results difficult. Finally, we illustrate the misleading results that can arise from this discrepancy using metagenomic data from a lake sediment study.
Isomorphism of Trees With and Without Abundance Counts
Expanded input tree for UniFrac. An expanded tree that is equivalent to that shown in Figure 2, but which does not use abundance counts.
Since we now have a fully reversible transformation, we have satisfied the conditions to prove that the trees in Figure 2 and Figure 3 are isomorphic. Henceforth, we will refer to trees with abundance counts like the tree in Figure 2 as the compact form and larger trees like the one in Figure 3 as the expanded form.
Semantics of compact and expanded trees
Both the compact and expanded form trees have the same semantics, and thus are valid input for the UniFrac significance test. In both cases, each leaf node represents an OTU and the count associated with it is an abundance count of that OTU (in the case of the expanded form, the counts are all implicitly 1). A distance value is attached to each branch; by means of their construction, these values are the same between both trees, except for the new branches in the expanded tree, which have an attached value of zero. Typically, these distances are expected to be edit distances between biological sequences, but the UniFrac formulation allows these values to be any distance metric. However, in mathematics it is standard that any function described as a distance function must satisfy a small number of very basic properties, one of which is the coincidence axiom, which states two objects have a distance of zero between them if and only if they are identical. Therefore, the semantics of a branch of zero length must necessarily be the same across both trees.
We have now shown that the compact and expanded form trees are both isomorphically and semantically equivalent and merely use a different visual representation. We would therefore expect any numeric calculations based on these trees to yield the same result.
UniFrac P-Values on compact and expanded trees
We can now ask the question: what should be the result of a UniFrac significance test on the tree in Figure 3? Although we have only very briefly discussed how UniFrac distances are calculated, in the case of Figure 3 it is clear that all labelings other than the original (and of course the completely symmetric swap of the Sample1 and Sample2 labels) have an equal or smaller distance between the two samples than the original labeling itself. This is because the two longer, top level branches are at least partially shared by both samples in any scenario other than the original labeling, which will result in a smaller UniFrac distance. The p-value of the UniFrac significance test for this example is equal to the probability of randomly generating the original labeling. There are possible ways to assign the labels to this tree, and since symmetric labelings are equivalent for our purposes (e.g. it doesn’t matter whether the first four or the last four nodes are all labeled with an A, so long as the As are all together), the probability of generating the original labeling at random is 2/70 or approximately 0.028.
We now upload the data for this tree (see Additional files 1 and 2) to the UniFrac web tool and perform a UniFrac significance test. We note that as the web tool by default generates 100 random labelings, we expect some variance in the p-values if the analysis is run multiple times, but we expect the range of the variance to be small. As expected, several trials performed by the authors yielded p-values in the range of 0 to 0.04.
Finally, we will compare this to the results for the tree in Figure 2. As we have previously shown, this tree is equivalent to the one in Figure 3, but because this tree is in compact form, the input files given to UniFrac are visually different from those for the tree in Figure 3 (contrast Additional files 3 and 4 with Additional files 1 and 2). For this tree, the resulting UniFrac significance value given by the web tool is 1.0 (and this result remains consistent no matter how many times we repeat this same input). This means that all randomly generated labelings had an equal or greater UniFrac distance between the two samples than in the original labeling. We can explain this discrepancy in output if we assume that, during the generation of random trees, the abundance counts are permanently associated with each sample label. In other words, in the case of Figure 2, whichever leaf node is associated with Sample1 automatically also has an abundance count of 4. Personal correspondence with the UniFrac authors has confirmed that is indeed the way the abundance counts are handled, and thus the discrepancy is not due to a code error nor to simple variance due to a relatively small number of random labelings. Because the compact-form tree has only two symmetric leaf nodes, it is irrelevant if the sample labels are swapped, and thus all randomly generated trees have the same UniFrac distance between samples as the original tree, resulting in a p-value of 1. Of course, this is completely different from the p-values in the range of 0 to 0.04 obtained for the tree in Figure 3, and so there is a discrepancy in output generated from semantically equivalent input.
We note that in this section, we have endeavoured to use the simplest possible example to illustrate the discrepancy between expanded and compact form trees. The problem is demonstrable with more complex trees as well.
Other tools for comparing phylogenetic trees
We briefly investigated two other recent tools used for metagenomic analysis: Parallel-META [4] and Meta-Storms [5]. As Parallel-Meta does not directly compare comunities using phylogenetic trees, the disrepancy we describe above was not applicable. Meta-Storms does compare communities using phylogenetic trees in a manner similar to UniFrac, so we used Meta-Storms to analyze a simple artificial data-set. The data consisted of two environments, A and B, each containing 100 sequences. All sequences in environment A were identical to each other and all sequences in environment B were identical to each other, but completely different from the sequences in environment A. Meta-Storms reported that these two environments were not different from one another, mirroring the UniFrac result from the preceding section. Thus we conclude it is likely that Meta-Storms suffers from the same methodological problem that we identify in UniFrac above.
Discussion on the discrepancy
The primary goal of this paper is to clearly demonstrate the discrepancy in output stemming from equivalent input for the UniFrac significance test. Our goal is not to make a definitive statement as to which output is better. However, here we will present some brief arguments and examples as to why the output produced when using expanded form trees as input may be more intuitive, scientifically useful, and robust.First, consider again the example in Figure 2, except now suppose that the abundance count for each leaf node is not 4 but 10,000, and that furthermore, the weight on each branch is not 0.1 but some much higher value, say 0.9. The p-value of a weighted UniFrac significance test on such an input tree is 1.0 for the same reasons we outlined in the preceding section, so the two communities would not be deemed significantly different. This seems an odd conclusion to draw, as this input tree represents two large, disjoint communities of individuals with a large phylogenetic distance between them. A significance test resulting from the expanded form of such a tree, on the other hand, yields a p-value of nearly 0, indicating that the two environments are indeed different with near-certainty.
Second, again using Figure 2 as a starting point, consider the case where the 4 individuals in Sample1 are not in fact identical, but are grouped together due to a clustering process (say, clustering DNA sequences at 97% sequence identity, a common operation in metagenomic studies), and similarly for Sample2. Let us then say that we wish to investigate the effect of this clustering by building a new tree where the sequences are not clustered at all. Such a tree would look very similar to the tree in Figure 3, except that the 0 weights on all the short branches would instead be some small, non-zero value (somewhere from 0 to 0.03 in this case). Note that because these branch weights are non-zero, this new tree is not a mere isomorphic transform of the original tree. However, a UniFrac significance test on this new tree would yield a p-value similar to the value obtained for the tree in Figure 3 (which was 0.028), and thus very different from the p-value of 1.0 for the tree in Figure 2. Thus, were we to use the result from the compact-form tree in Figure 2 as our starting point, we would conclude that the clustering process had a very large effect on determining community similarity, when in fact the clustering makes almost no difference if we start from the (equivalent) expanded form instead. This effect suggests that analysis of expanded form trees leads to more robust and consistent results in general.
Effect of the discrepancy on sample data
Normalized pairwise UniFrac distances on sample lake sediment data
Sample | A1 | A2 | B1 | B2 | C1 | C2 |
---|---|---|---|---|---|---|
A1 | 0.09 | 0.33 | 0.35 | 0.31 | 0.33 | |
A2 | 0.34 | 0.36 | 0.31 | 0.32 | ||
B1 | 0.10 | 0.15 | 0.19 | |||
B2 | 0.17 | 0.19 | ||||
C1 | 0.10 |
Weighted UniFrac p-values on compact form of lake sediment data
Sample | A1 | A2 | B1 | B2 | C1 | C2 |
---|---|---|---|---|---|---|
A1 | 0.03 | 0.05 | 0.02 | 0.13 | 0.22 | |
A2 | 0.03 | 0.05 | 0.15 | 0.26 | ||
B1 | 0.06 | 0.49 | 0.44 | |||
B2 | 0.32 | 0.55 | ||||
C1 | 0.48 |
Weighted UniFrac p-values on expanded form of lake sediment data
Sample | A1 | A2 | B1 | B2 | C1 | C2 |
---|---|---|---|---|---|---|
A1 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | |
A2 | 0.00 | 0.00 | 0.00 | 0.00 | ||
B1 | 0.00 | 0.00 | 0.00 | |||
B2 | 0.00 | 0.15 | ||||
C1 | 0.74 |
Conclusions
In this paper, we identify a discrepancy in the UniFrac methodology for testing of significant difference in community structure, showing how two equivalent inputs could generate vastly different outputs. UniFrac users need to be aware of this issue so as to avoid misleading and inconsistent results. We provide an example of the effect this discrepancy can have on real data. Finally, we provide software to perform the isomorphic transform so that data can be submitted to the existing UniFrac service in a consistent form.
Methods
Weighted UniFrac significance test with and without abundance counts
We generated p-values for Figures 2 and 3 using the UniFrac web service located at http://bmf2.colorado.edu/unifrac/index.psp. For Figure 2, Additional files 3 and 4 were used as the Newick tree and the Environment file, respectively. For Figure 3, Additional files 1 and 2 were used as the Newick tree and Environment file. In both cases, the "Select Analysis" option was set to "UniFrac Significance". The "Type of Test" option was set to "Each pair of environments". The "Number of Permutations" option was set to 100 (the default). The "Use Abundance Weights" option was set to "Yes".
Declarations
Acknowledgements
The authors gratefully acknowledge Sarah Klatt for laboratory work that provided sample data, as well as Amir Muhammadzadeh, Stephen Johnson and Julian Miller for helpful discussions and their general participation in the MAVEN project. The authors thank Steven Siciliano for discussions that initiated this investigation. This work was supported by MAVEN, a project funded by Western Economic Diversification Canada and Enterprise Saskatchewan, under the provisions of the Canada-Saskachewan Western Economic Development Partnership Agreement. Oversight of the MAVEN project is provided by Genome Prairie and Dr. Reno Pontarollo.
Authors’ Affiliations
References
- Lozupone C, Hamady M, Knight R: UniFrac - an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics. 2006, 7: 371-View ArticlePubMed CentralPubMedGoogle Scholar
- Lozupone C, Hamady M, Kelley S, Knight R: Quantitative and qualitative Beta diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol. 2007, 73: 1576-1585.View ArticlePubMed CentralPubMedGoogle Scholar
- Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R: UniFrac: an effective distance metric for microbial community comparison. ISME J. 2011, 5 (2): 169-172.View ArticlePubMed CentralPubMedGoogle Scholar
- Su X, Xu J, Ning K: Parallel-META: a high-performance computational pipeline for metagenomic data analysis. IEEE International Conference on Systems Biology (ISB). 2011, Zhuhai, China: Institute of Electrical and Electronics Engineers (IEEE), 173-178.Google Scholar
- Su X, Xu J, Ning K: Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data. Bioinformatics. 2012, 28 (19): 2493-2501.View ArticlePubMedGoogle Scholar
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75 (73): 7537-7541.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.