PhySIC_IST: cleaning source trees to infer more informative supertrees
 Celine Scornavacca^{1, 2},
 Vincent Berry^{2},
 Vincent Lefort^{2},
 Emmanuel JP Douzery^{1} and
 Vincent Ranwez^{1}
DOI: 10.1186/1471-2105-9-413
© Scornavacca et al; licensee BioMed Central Ltd. 2008
Received: 29 January 2008
Accepted: 04 October 2008
Published: 04 October 2008
Abstract
Background
Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees.
Results
To overcome this problem, we propose to infer non-plenary supertrees, i.e. supertrees that do not necessarily contain all the taxa present in the source trees, discarding those whose position greatly differs among source trees or for which insufficient information is provided. We detail a variant of the PhySIC veto method called PhySIC_IST that can infer non-plenary supertrees. PhySIC_IST aims at inferring supertrees that satisfy the same appealing theoretical properties as PhySIC, while being as informative as possible under this constraint. The informativeness of a supertree is estimated using a variation of the CIC (Cladistic Information Content) criterion that takes into account both the presence of multifurcations and the absence of some taxa. Additionally, we propose a statistical preprocessing step called STC (Source Trees Correction) to correct the source trees prior to the supertree inference. STC is a liberal step that removes the parts of each source tree that significantly conflict with other source trees. Combining STC with a veto method allows an explicit trade-off between veto and liberal approaches, tuned by a single parameter.
Performing large-scale simulations, we observe that STC+PhySIC_IST infers much more informative supertrees than PhySIC, while preserving a low type I error compared to the well-known MRP method. Two biological case studies on animals confirm that the STC preprocess successfully detects anomalies in the source trees while STC+PhySIC_IST provides well-resolved supertrees agreeing with current knowledge in systematics.
Conclusion
The paper introduces and tests two new methodologies, PhySIC_IST and STC, that demonstrate the interest in inferring non-plenary supertrees as well as preprocessing the source trees. An implementation of the methods is available at: http://www.atgc-montpellier.fr/physic_ist/.
Background
A phylogeny, or phylogenetic tree, is a representation of the evolutionary relationships among species. A well-known problem in biological classification is to combine phylogenetic information to produce more inclusive phylogenies. One way is to use supertree methods, which combine overlapping source trees, inferred from primary data (e.g. amino acids, SINEs or morphological traits). Supertree methods are also useful, teamed with supermatrix methods, in a divide-and-conquer approach to reconstruct very large phylogenies: first, the set of data is divided into subsets that are analyzed individually, then the resulting phylogenies are combined to reconstruct the global phylogeny [1, 2].
Supertree methods can be classified into two categories, depending on the way they deal with topological conflicts, i.e. different arrangements of the same taxa among source trees. Liberal methods resolve conflicts, asking source trees to vote and opting for the topological alternative that maximizes an optimization criterion [3–7]. The hope is that each taxon is erroneously placed in only a few source trees and that this erroneous information will be overcome by the large number of source trees where the taxon is correctly placed. The most widespread liberal method is Matrix Representation with Parsimony (MRP, [3]). Supertrees proposed by liberal methods are often highly resolved and accurate, though several authors have shown that this approach sometimes yields supertrees containing clades that contradict all source trees [8–10]. In contrast, veto methods do not allow the resulting tree to contain clades that contradict source trees. Some examples of veto methods are semi-strict consensus [8], SMAST and SMCT [11, 12], PhySIC [13] and extensions of the strict consensus (e.g. [14, 15]).
A recent method, PhySIC, returns a supertree with appealing theoretical properties. First, since it is a veto method, it does not contain relationships contradicting the source trees (non-contradiction property, denoted by PC). In addition, it only infers relationships that are present in a source tree or collectively induced by several source trees (induction property, denoted by PI). The latter property ensures that the method does not make arbitrary inferences. These features provide an unambiguous phylogenetic framework that is well suited for taxonomic revisions as well as for other applications where the reliability of the supertree is crucial.
The algorithm presented in this paper, called PhySIC_IST (PHYlogenetic Signal with Induction and non-Contradiction Inserting a Subset of Taxa), looks for a supertree that satisfies the PC and PI properties. PhySIC_IST allows multifurcations in input trees to be resolved thanks to the information present in other source trees. To deal with topological conflicts PhySIC_IST allows, like SMAST and SMCT, the insertion of only a subset of the species present in the source trees. Moreover, PhySIC_IST can also propose new multifurcations to avoid contradicting source trees, while SMAST and SMCT can only remove taxa. The aim of PhySIC_IST is not only to find a supertree T (plenary or not) that satisfies PC and PI but to find the most informative supertree satisfying both properties. Choosing the most informative alternative among several candidate supertrees requires one to be able to compare trees including potentially different subsets of the source taxa (such as ST_{1} and ST_{2} in figure 2). The informativeness of a candidate supertree is computed by a variation of the CIC (Cladistic Information Content) criterion [16]. This measure has roots in information theory and is basically a function of the number of complete binary trees that are compatible with the evaluated supertree: the fewer such trees, the more informative the supertree.
The resolution of supertrees computed by veto methods can be poor when considering large numbers of source trees. Indeed, adding more trees provides more information on the relative position of some taxa, but at the same time increases the number of local conflicts. To handle large collections of source trees, one has to resort to the liberal approach, which arbitrates between the conflicts arising among source trees. The most common way to deal with incongruent source trees is to use a supertree method that takes ad hoc decisions (according to a chosen objective criterion) in the face of individual conflicts met when building the supertree. The second and much less explored way is to preprocess the data according to a statistical procedure and then to apply a veto method, not contradicting the retained information that was estimated to be reliable. In this paper, we follow the latter approach, which has the advantage of making the removal of conflicts between source trees explicit. More precisely, we introduce a preprocessing step to detect and correct anomalies in the source trees. This step, called STC (Source Trees Correction), analyzes the contradictions among the source trees; for each contradiction, it evaluates the possible topological alternatives and drops the alternative(s) that is (are) statistically less supported (with a threshold chosen by the user). Then STC modifies each source tree (using a scheme similar to that of PhySIC_IST; see Methods) so that it does not contain the dropped alternatives and yet remains as informative as possible. In other words, STC aims at correcting the source trees that propose an anomalous phylogenetic position for some taxa (due to lateral gene transfers, long branch attractions, paralogy ...). For example, if source trees contain two contradicting resolutions, one present in 99% of the trees and the other one present in 1% of the trees, we can reasonably think that the latter resolution is an anomaly and ignore it.
If the user approves the proposed modifications, the PhySIC_IST veto method is then applied to the modified source trees. The resulting supertree satisfies both PI and PC properties for the collection of modified source trees. If the user is not satisfied with the modified source trees, they can change the threshold and restart the procedure, or choose to skip it. In this way, the liberal component of the supertree inference is not only made explicit but also interactive and parametrized. PhySIC_IST and STC were implemented using the BIO++ libraries [17], and are available from: http://www.atgc-montpellier.fr/physic_ist/.
Results and Discussion
In this section we present results of large-scale simulations conducted to evaluate both the resolution and the accuracy of PhySIC_IST supertrees. These results help to measure both the improvement offered by PhySIC_IST on the previous version of the method, and the effectiveness of the STC preprocess. We also validate the new methodology by applying STC+PhySIC_IST to two biological case studies.
Simulations
The informativeness of supertrees is frequently compared using type II error, i.e. the number of triplets of the model tree that are not present in the supertree divided by the number of triplets in the model tree. It seems to us that the CIC_{N} is more appropriate when comparing the informativeness of supertrees. Indeed, if a triplet r ∈ $\mathcal{R}$ is included in the computation of the type II error, this may be either because it is not resolved in the supertree or because an alternative resolution has been proposed. In contrast, the CIC_{N} strictly measures the information contained in the supertree, whether it is accurate or not. The accuracy of the supertree is separately measured using the type I error. Because of this ambiguity of the type II error and for consistency with the optimization criterion of PhySIC_IST, CIC_{N} plots are provided instead of type II error plots.
Improvement of PhySIC_IST on PhySIC
The increase in resolution of PhySIC_IST in comparison to PhySIC is noteworthy (figure 4), whatever the deletion ratio. More precisely, the average CIC_{N} of PhySIC_IST supertrees is 1.5 times that of PhySIC (over all simulation conditions). Since CIC_{N} is measured on a logarithmic scale, this means a considerable improvement on PhySIC. This different behaviour of the two methods is due, most of the time, to the fact that PhySIC_IST is allowed to infer non-plenary supertrees. Indeed, removing just one taxon is sometimes enough to make all source trees agree on a large subset of taxa. As veto methods are not allowed to contradict source trees, keeping the rogue taxa in the supertree means proposing a multifurcation for the surrounding subset of taxa, as done by PhySIC. PhySIC_IST escapes this situation by not including the rogue taxa in the supertree, and is hence able to obtain a relatively high resolution for the remaining taxa.
Meanwhile, the type I error of PhySIC_IST (figure 5) is always below 1% (except for d = 75% and k = 10) and decreases markedly as the number of source trees increases. From the experimental results, it could appear that there is a choice to be made between the two methods since PhySIC displays a significantly lower type I error rate (see figure 5), but this is mainly due to the fact that the trees reconstructed by PhySIC can be much less resolved, as expected from a plenary veto method applied to a large number of source trees. Thus, on practical data sets, PhySIC_IST is always to be preferred to PhySIC.
The foreseeable but undesirable behavior of veto supertree methods when facing large numbers of source trees can be overcome by an explicit liberal preprocessing of the input trees, such as the STC proposed in this paper.
The CIC_{N} values of the PhySIC_IST supertrees decrease as the number of "not-inserted" taxa increases, i.e. as the size of the supertrees decreases. This is expected given the role played by this number in the CIC_{N} formula (see the section on the CIC criterion). More interestingly, PhySIC_IST supertrees overall have CIC values rather close to max CIC values, i.e. PhySIC_IST supertrees are close to being fully resolved. Moreover, as the size of the supertrees decreases, CIC_{N} values of PhySIC_IST supertrees and max CIC values decrease at a similar pace, the gap between both values narrowing slightly for the smallest supertrees. Thus, overall, the resolution degree of output supertrees appears to be only slightly dependent on the number of taxa inserted in the supertree. The only exception to this rule happens for the conditions d = 75% with k = 10 and k = 20. In these cases, which are the most extreme conditions in terms of overlap between the taxa sets of source trees, the two curves decrease with different slopes. We now detail results obtained when resorting to the STC statistical preprocess.
Efficiency of the STC preprocess
Figures 4 and 5 report simulation results for STC+PhySIC and STC+PhySIC_IST, when fixing the STC threshold to 95%, i.e. a 5% probability that a detected anomaly is not actually an anomaly (see the Methods section for more details). The resolution of both PhySIC and PhySIC_IST greatly increases thanks to the preprocessing step in most simulation conditions (25%, 50% and mixed deletion ratios d). The STC preprocess has no effect for d = 75%, where the low overlap between source trees impedes detecting anomalies.
STC+PhySIC_IST is on average 1.5 times more informative than STC+PhySIC according to the CIC_{N} measure. This replicates the gap observed between the methods without the preprocess, confirming the improvement of PhySIC_IST on PhySIC. The fact that the STC preprocess allows the PhySIC and PhySIC_IST supertrees to be more resolved without significantly changing the type I error shows that this preprocessing step corrects the source trees in an appropriate way.
When only considering results with STC (Table (b) in figure 6), the average percentage of discarded taxa decreases with the number of source trees and increases with d. Thus, as more information is provided, supertrees are more and more informative, as usually happens with the liberal approach (e.g. see results for MRP in figure 4). Indeed, giving more information to STC brings out anomalies more and more clearly, and thus tends to modify the source trees more and more accurately.
Comparison of liberal and veto methods
As expected, the resolution of supertrees obtained with MRP tends to increase with the number of source trees. In fact, MRP is a liberal method and adding trees supplies more information. Unexpectedly, its type I error does not decrease considerably when adding more trees to the analysis.
As already mentioned, the resolution of supertrees inferred by the two veto methods tends to decrease when including more trees (figure 4, 25%, 75% and mixed deletion rates d). In contrast, their type I error decreases markedly as the number of source trees increases. By applying the STC preprocess to PhySIC and PhySIC_IST, the two methods behave like liberal methods, i.e. the resolution of supertrees increases with the number of trees (as already explained, except for d = 75%). This behavior is less apparent for PhySIC. Indeed, when faced with an insufficient number of triplets to satisfy the PI property, PhySIC cannot benefit from the improvement with respect to PC achieved by the STC preprocess.
Note that in all conditions, MRP provides trees that are, on average, more resolved than those of the other methods. Thus, MRP appears to be the most liberal supertree method among those investigated. This is not a surprise as, when two alternative resolutions conflict with one another, the MRP parsimony criterion favors that supported by the highest number of source trees, while the STC preprocess favors a resolution only when it is statistically more supported than the other (see Methods section for a precise description of STC). However, favoring more resolved supertrees also leads to more errors in trees. Indeed, the type I error of PhySIC and PhySIC_IST, with and without STC preprocess, is smaller than that of MRP (except for the marginal condition d = 75% and k = 10).
The important question of whether less resolved but more correct supertrees should be preferred to the opposite alternative, can only be answered by knowing the subsequent use of the inferred supertree (see [13] for a list of cases where the former alternative is to be preferred.)
Plots of the type II error are not presented but they show the same relationships between the analyzed methods.
Case study focused on placental mammals
With exons longer than 3000 bp, the PhySIC_IST supertree is extensively multifurcated, with only three obvious clades recovered (Figure 8(a)): the two muroid rodents (Mus + Rattus), the two hominoids (Homo + Pan), and the catarrhine primates (hominoids + Macaca). This reflects the fact that the source trees contain topological conflicts. A closer look at the source trees shows, for instance, that there is likely a long branch attraction of the long muroid branch by the marsupial outgroup for the alignment composed of Pan, Macaca, Mus, Rattus, Bos, Canis, and Monodelphis exons orthologous to human exon 3 of the CELSR3SLC26A6 gene (EnsEMBL transcript and exon references ENST00000383733 and ENSE00001498361). In the absence of the rabbit (Oryctolagus) orthologue that would break the muroid branch, Mus + Rattus are artefactually attracted towards the most basal position among placentals. This example illustrates the existence of conflicting resolutions among triplets of different source trees. Thus, without the STC preprocess, satisfying the PC condition results in a highly multifurcated supertree. In contrast, applying the STC preprocess leads to a more resolved supertree (Figure 8(b)). The two remaining multifurcations involve (i) the rabbit relative to muroids and primates, and (ii) the armadillo (Dasypus), elephant (Loxodonta), and tenrec (Echinops) relative to the other placentals. This probably reflects the lack of phylogenetic signal for these taxa among the 50 source trees.
With exons longer than 2000 bp, the PhySIC_IST supertree is extensively multifurcated, with only two obvious clades recovered (Figure 8(c)): Mus + Rattus and Homo + Pan. The greater number of source trees introduces additional conflicts within primates as compared to ortho_{3000}. Additionally, the supertree lacks the taxon Macaca. The reason is that, in the source tree reconstructed from the ENSE00001300737 exon (EnsEMBL release 41), Pan is unexpectedly more closely related to Macaca than to Homo. This anomaly appears in only one of the 157 source trees, but this prevents pure veto methods from recovering the correct resolution for the clade. Indeed, inserting Macaca while preserving PC implies losing the clade Homo + Pan, and hence leads to a completely multifurcated tree on the 12 taxa except for the trivial clade Mus + Rattus. This supertree T' has a CIC_{N} value lower than that of the supertree T lacking Macaca (CIC_{N}(T', 12) = 0.35 while CIC_{N}(T, 12) = 0.435). For this reason, the taxon Macaca is not inserted. In contrast, STC+PhySIC_IST infers a plenary supertree (Figure 8(d)), the above-mentioned anomaly being overcome by a significant number of correct resolutions in other source trees. This supertree is also fully resolved, unlike the supertree obtained from ortho_{3000}, as STC benefits from the signal of the 107 additional source trees present in ortho_{2000}. The supertree topology is in agreement with the current view on placental phylogenetics, which depicts the monophyly of euarchontoglires (rodents + lagomorphs + primates), laurasiatherians (Bos + Canis), boreoeutherians (the grouping of the latter two clades), afrotherians (Loxodonta + Echinops), and xenarthrans (Dasypus) + afrotherians [22, 24–26].
Case study focused on animals
The case study based on OrthoMaM only involved 12 species. To illustrate how PhySIC_IST performs on larger studies, we analyzed an animal phylogenomic data set containing 94 proteins (approximately 20,000 unambiguous amino acid positions) for 79 species, i.e. three poriferans (sponges), 5 cnidarians (sea anemones), and 71 bilaterians (chordates, urchins, mollusks, annelids, flatworms, roundworms, crustaceans, and insects) [27].
Conclusion
In this paper we propose a new supertree veto method (PhySIC_IST), running in polynomial time (see appendix in the supplementary material for details), that returns supertrees satisfying desirable theoretical properties (PC and PI). The simulations and the biological case studies confirm the practical effectiveness of PhySIC_IST, showing that this variant of PhySIC proposes supertrees that are much more informative than those inferred by the original PhySIC algorithm, while the type I error remains low (less than 1%). Additionally, we introduce a statistical preprocess of the source trees to detect and correct artifactual positions of taxa. This preprocess can be performed for any collection of source trees and hence benefits any veto supertree method. This approach has the advantage of separating the liberal resolution of conflicts among source trees from the assemblage of the supertree. This makes explicit the choices made to arbitrate between conflicting source trees, and allows the user to choose the extent to which the source trees can be modified. In practice, STC+PhySIC_IST closes the gap between veto and liberal methods. This is the first practical method that provides informative and reliable non-plenary supertrees. The program is available for online execution and download at http://www.atgc-montpellier.fr/physic_ist/.
Methods
Definitions
We first recall notations used in the field, then we give a formal statement of the computational problem tackled by PhySIC_IST.
Notations
A tree T displays a set $\mathcal{R}$ of triplets when $\mathcal{R}\subseteq rt(T)$; a set $\mathcal{R}$ of triplets is compatible if there is at least one tree T that displays $\mathcal{R}$. A compatible set of triplets $\mathcal{R}$ induces a triplet r, denoted by $\mathcal{R}\vdash r$, if and only if all trees displaying $\mathcal{R}$ contain r.
The PI and PC properties
Given a collection $\mathcal{T}$ of trees and a tree T with L(T) ⊆ L($\mathcal{T}$), $\mathcal{R}$(T, $\mathcal{T}$) denotes the set of triplets of $\mathcal{T}$ for which T proposes a resolution; i.e. $\mathcal{R}$(T, $\mathcal{T}$) = {ABC ∈ rt($\mathcal{T}$) such that rt(T) contains at least one of the possible triplets on A, B, C}. We denote by $\overline{r}$ the triplets contradicting r, i.e. the two alternative triplets for the same set of three taxa present in r. If both r and at least one of the triplets contradicting r are present in rt($\mathcal{T}$), we say that the taxa of r are involved in a direct contradiction. Using these notations, we recall the PI and PC properties [13]:

T satisfies PI for $\mathcal{T}$ if and only if for all r ∈ rt(T), it holds that $\mathcal{R}(T,\mathcal{T})\vdash r$.

T satisfies PC for $\mathcal{T}$ if and only if for all r ∈ rt(T) and all $\overline{r}$, it holds that $\mathcal{R}(T,\mathcal{T})\nvdash\overline{r}$.
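To make the triplet notation concrete, the following minimal sketch enumerates rt(T) for a rooted tree and lists the sets of three taxa involved in a direct contradiction within a collection of source trees. The nested-tuple tree encoding and the function names (`leaves`, `rooted_triplets`, `direct_contradictions`) are assumptions made for this illustration; they are not taken from the PhySIC_IST implementation, and the sketch does not test the induction relation used by PI and PC.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def rooted_triplets(tree):
    """Return rt(tree) as a set of (frozenset({a, b}), c) pairs, read as ab|c."""
    triplets = set()
    if isinstance(tree, str):
        return triplets
    child_leaves = [leaves(child) for child in tree]
    for i, inside in enumerate(child_leaves):
        # Taxa hanging from the other children of this node.
        outside = set().union(*(s for j, s in enumerate(child_leaves) if j != i))
        # a and b share a child subtree, c does not: the node resolves ab|c.
        for a, b in combinations(sorted(inside), 2):
            for c in outside:
                triplets.add((frozenset((a, b)), c))
    for child in tree:
        triplets |= rooted_triplets(child)
    return triplets

def direct_contradictions(source_trees):
    """Return the 3-taxon sets resolved in more than one way across the sources."""
    resolutions = {}
    for tree in source_trees:
        for pair, third in rooted_triplets(tree):
            resolutions.setdefault(pair | {third}, set()).add((pair, third))
    return {taxa for taxa, found in resolutions.items() if len(found) > 1}

t1 = (("A", "B"), "C")          # displays AB|C
t2 = (("A", "C"), "B")          # displays AC|B, contradicting t1
t3 = ((("A", "B"), "C"), "D")   # fully resolved, agrees with t1
print(direct_contradictions([t1, t2]))
```

A multifurcation contributes no triplet for the taxa it leaves unresolved, which is why collapsing an edge can only remove triplets from rt(T), never add a contradicting one.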
The CIC criterion
In the case of non-plenary supertrees, n_{R}(T, n) depends on the multifurcations of T (since they reflect an ambiguity) and on the number of source taxa missing in T (since T contains no information for them). More formally, given a collection $\mathcal{T}$ of input trees and a candidate supertree T, the number of permitted binary trees for T referring to $\mathcal{T}$ is the number of binary trees T' such that L(T') = L($\mathcal{T}$) and T'|L(T) (the restriction of T' to the taxa of T) refines T. We observe that, for each internal node u_{i} with a number c_{i} of children, we have (2c_{i} − 3)!! possible resolutions [30]. Moreover, if L(T) ⊂ L($\mathcal{T}$), we have to insert all missing taxa, i.e. those in L($\mathcal{T}$) − L(T). A rooted binary tree of i taxa has 2(i − 1) branches; so, there are 2i − 1 possible positions for the (i + 1)^{th} taxon, taking into consideration the possibility of insertions above the root. We detail in the appendix how the value of n_{R}(T, n) can be computed. In figures 4 and 10 we refer to CIC_{N}(T, n) as the normalized value of CIC(T, n), i.e. CIC(T, n) divided by the maximum value it can take for n taxa.
Another way to compare the information of different trees is to compare their number of triplets. However, the CIC criterion better takes into account missing taxa. For instance, consider the trees T_{1} and T_{2} in figure 10. The former is completely resolved but lacks taxon H, while the latter contains all taxa but is highly unresolved. Searching for the tree that maximizes the number of triplets would lead one to prefer T_{2} (since |rt(T_{1})| = 35 while |rt(T_{2})| = 48). However, it seems more reasonable to favor the tree that maximizes the value of the CIC criterion (in this case T_{1}, since CIC_{N}(T_{1}, 8) = 0.78, while CIC_{N}(T_{2}, 8) = 0.54).
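The counting argument above can be checked numerically. The sketch below is an illustration only: it assumes the usual definition CIC(T, n) = −log(n_{R}(T, n) / n_{R}(n)) with n_{R}(n) = (2n − 3)!! rooted binary trees on n taxa, a definition that is not restated in this section, and it uses a hypothetical nested-tuple encoding of trees. Under these assumptions, any fully resolved tree on seven of eight taxa, such as T_{1}, has n_{R} = 13 and a normalized value of about 0.78, in agreement with the figure quoted above.

```python
import math

def double_factorial(k):
    """Return k!! for odd k >= -1, with (-1)!! = 1!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def leaf_count(tree):
    """Number of leaves of a nested-tuple tree (leaves are strings)."""
    return 1 if isinstance(tree, str) else sum(leaf_count(c) for c in tree)

def refinements(tree):
    """Number of binary trees on L(tree) that refine `tree`."""
    if isinstance(tree, str):
        return 1
    count = double_factorial(2 * len(tree) - 3)   # (2c_i - 3)!! per internal node
    for child in tree:
        count *= refinements(child)
    return count

def permitted_trees(tree, n):
    """n_R(T, n): binary trees on n taxa whose restriction to L(T) refines T."""
    count = refinements(tree)
    for i in range(leaf_count(tree), n):
        count *= 2 * i - 1                        # positions for the (i + 1)-th taxon
    return count

def cic_n(tree, n):
    """Normalized CIC under the assumed definition; needs n >= 3."""
    total = double_factorial(2 * n - 3)           # all rooted binary trees on n taxa
    return 1 - math.log(permitted_trees(tree, n)) / math.log(total)

# Hypothetical fully resolved topology on 7 of the 8 taxa (taxon H missing).
t1 = (((((("A", "B"), "C"), "D"), "E"), "F"), "G")
print(permitted_trees(t1, 8), round(cic_n(t1, 8), 2))
```

Because the normalization is a ratio of logarithms, the result does not depend on the base of the logarithm.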
Statement of the computational problem considered
We previously explained why it is important that supertrees satisfy the PI and PC properties. Among the supertrees, that satisfy these properties, some may be more informative than others, as can be measured by the CIC criterion. This gives rise to the following optimisation problem:
Problem: MOST INFORMATIVE INDUCED AND NON-CONTRADICTING SUPERTREE (MIICS)
Input: a collection $\mathcal{T}$ of rooted trees.
Output: a tree T such that:
(i) L(T) ⊆ L($\mathcal{T}$)
(ii) T satisfies PI and PC for $\mathcal{T}$
(iii) CIC(T, |L($\mathcal{T}$)|) is maximum among the trees satisfying (i) and (ii).
We conjecture this problem to be intrinsically hard since it is a variant of the MIST (Maximum Identifying Subset of rooted Triplets) problem and of the ST (Triplet Supertree) problem, both shown to be NP-hard [31–34]. PhySIC_IST is a polynomial-time heuristic to solve the MIICS problem. Note that it is a heuristic only with respect to point (iii), since it always outputs a supertree satisfying (i) and (ii).
Rooting the source trees
When PhySIC_IST is provided with unrooted source trees, it first has to root them. There are several approaches to root phylogenetic trees, among which are the outgroup, the molecular clock, and the non-reversible model of character-state changes. It has been shown that the outgroup criterion is consistently able to identify the root [35]. The software incorporates a rooting tool that automates the procedure. This tool accepts as input different levels θ_{i} of outgroup, each one being a list of taxa. The rooting procedure considers each unrooted source tree separately. For a given source tree T, it determines the first θ_{i} such that θ_{i} ∩ L(T) ≠ ∅. Then the tree is rooted on the branch leading to the smallest subtree hosting all outgroup taxa of θ_{i}. If the proposed outgroup is not monophyletic, the tree T is discarded from the analysis. This procedure does not alter the resolution inside the ingroup nor in the different outgroup levels that can be present in the tree.
Rooting trees is not trivial, hence outgroup levels have to be chosen carefully.
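The selection of the outgroup level can be sketched as follows. This is an illustration under stated assumptions, not the rooting tool shipped with the software: the unrooted tree is encoded by its leaf set and by one side of each internal branch (a hypothetical encoding), the function name `choose_outgroup` is invented here, and returning None when no outgroup taxon is present at all is our choice, since the text does not specify that case.

```python
def choose_outgroup(levels, leaf_set, splits):
    """Return the outgroup clade to root on, or None if the tree is discarded.

    levels   -- outgroup levels theta_1, theta_2, ... (lists of taxa), by priority
    leaf_set -- L(T), the taxa of the unrooted source tree
    splits   -- one side of each internal branch of T, as frozensets of taxa
    """
    for level in levels:
        present = frozenset(level) & frozenset(leaf_set)
        if not present:
            continue                  # no taxon of this level in T: try the next one
        if len(present) == 1:
            return present            # a single leaf is trivially monophyletic
        complement = frozenset(leaf_set) - present
        if present in splits or complement in splits:
            return present            # some branch isolates exactly the outgroup
        return None                   # outgroup not monophyletic: discard T
    return None                       # assumption: no outgroup taxon, tree not rooted

# Unrooted tree ((A,B),C,(D,E)): two internal branches.
taxa = {"A", "B", "C", "D", "E"}
branches = {frozenset("AB"), frozenset("DE")}
print(choose_outgroup([["X", "Y"], ["D", "E"]], taxa, branches))
```

Only the first level that intersects L(T) is examined, so a non-monophyletic first level discards the tree even if a later level would have been monophyletic, as in the procedure described above.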
Inferring informative and reliable supertrees: PhySIC_IST
and we order taxa in decreasing priority order.
Then, we build the starting backbone tree, formed of a root node to which are connected two leaves corresponding to the first two taxa in the priority list.
Supports
The different kinds of insertions
Once the algorithm has ordered the taxa in a priority list and built the seed backbone tree from the first two taxa, it proceeds with the insertion of remaining taxa in decreasing priority order. The easiest algorithm would be the one which chooses, at each step, the taxon whose insertion leads to the highest increase of the CIC, with the proviso that PC and PI remain satisfied. Unfortunately, this approach is too slow and unusable in practice. A faster way is to choose the best taxon, without testing all taxa, based on information already available. First of all, we are sure that, if all source trees support the insertion of a taxon in a region, inserting it in this region will not create contradictions between the source trees and the supertree. Thus this insertion will not violate PC. Additionally, if the region supported by source trees is not limited to a node or an edge, it means that the information we have is not enough to choose where the taxon has to be inserted. Such an insertion will surely violate PI. These considerations make insertions supported by all trees more appealing than insertions supported by only a part of them, and the insertions on a region well delimited more attractive than insertions on a larger region. This is the reason why in PhySIC_IST the insertions of taxa are done in four successive steps, each step being less restrictive than the previous ones in its requirements for inserting taxa. The strictest steps are done first, in order to maximize the chances for future taxa to be inserted and to maximize the CIC of the computed supertree. These four steps are differentiated according to two parameters, all and cons, each taking two values. The all parameter indicates whether taxa should be inserted only when a maximum support is observed for them somewhere in the backbone tree (all = true), or whether, in the absence of places with maximum support, places of maximal support should be considered (all = false). 
By maximum support at a position we mean that all source trees containing the taxon agree that it could be inserted at the given position. Note though that there might be several places of maximum support for inserting a taxon, due to a lack of overlap between the source trees and the taxa already in the backbone tree.
The case where all = false leads the backbone tree to temporarily contradict at least one source tree. This means that some of its edges have to be collapsed to ensure that the backbone tree still satisfies PC after the insertions. The collapsing of a minimal number of edges is performed by calling the Check_{PC} procedure; an analogous test to check PI is performed by calling the Check_{PI} procedure [13]. If this collapsing decreases the value of CIC of the tree compared to its value prior to the insertion, then the insertion is cancelled. Overall, the insertions with all = true promise a more resolved supertree and are hence performed first, namely during the first two insertion stages, while the latter two run with all = false. The parameter cons indicates whether the insertion procedure should insert taxa only when there is a single best supported position for them (cons = false) or whether consensus insertions are also allowed (cons = true). A consensus insertion means inserting a taxon on a node when all best supported places for the taxon are edges incident to the node. In this case, the insertion of the taxon does not contradict the source trees. Insertions with cons = true are always on a node; insertions with cons = false are therefore preferable, because inserting a taxon on an edge provides a tree with a higher CIC than inserting it on a node. Thus, for each value of all, a step with cons = false is first performed, followed by a step with cons = true. During each insertion stage (see insertion procedure in the pseudocode in appendix), all taxa not yet inserted in the backbone tree are considered. If the current taxon is inserted (by the roundIns procedure in the pseudocode), then the algorithm tries to insert, always in priority order, all taxa previously considered that could not be inserted before. 
These taxa have higher priority than the taxa following the current one, and the insertion of the current taxon may circumscribe the supported positions of some of them to a part of the tree small enough for their insertion to become possible. After each insertion, the problematic branches are collapsed to ensure that the backbone tree still satisfies PC. After several taxa have been inserted, the backbone tree may fail to satisfy PI. However, using the Check_{PI} procedure to collapse problematic edges suffices to ensure that the backbone tree satisfies the property again. Unlike Check_{PC}, which runs after every insertion, Check_{PI} collapses branches only after each insertion stage. The reason is that some edges of the backbone tree can fail to satisfy PI only temporarily and satisfy it again after the insertion of other taxa. By contrast, if the backbone tree contradicts a source tree, it will keep contradicting it no matter which taxa are inserted afterwards; it is thus preferable to detect this immediately to avoid problems that may arise while inserting the remaining taxa.
The improvement of PhySIC_IST over PhySIC shown in figure 4 is a consequence of three fundamental differences between the two methods. First, the new version operates by successive insertions of taxa on a backbone tree and is not based on a revised version of the Build algorithm [36]; PhySIC_IST can therefore frequently find relations between taxa that PhySIC cannot detect because its analysis is stopped by a connected component of the Aho graph. Second, the two methods do not have the same optimization criterion: PhySIC aims at finding the supertree satisfying PI and PC that proposes a resolution for as many triplets as possible, while PhySIC_IST looks for the supertree satisfying PC and PI that maximizes the value of CIC.
Last, PhySIC_IST can propose nonplenary supertrees, i.e. it does not insert taxa that would decrease the CIC of the supertree, whereas PhySIC necessarily proposes a supertree that contains all taxa present in at least one source tree.
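The ordering of the four insertion stages and the retry rule described above can be illustrated with a small sketch. It is not the authors' implementation: the function name run_insertion_stages and the callbacks try_insert and after_stage are hypothetical. try_insert stands for everything the text attributes to a single insertion attempt (finding the supported positions, collapsing edges with Check_{PC}, cancelling the insertion if the CIC decreases), and after_stage stands for the Check_{PI} pass run once per stage.

```python
from typing import Callable, List, Sequence, Tuple

# Stage order as described in the text: all = true before all = false,
# and for each value of all, cons = false before cons = true.
STAGES: Tuple[Tuple[bool, bool], ...] = (
    (True, False), (True, True), (False, False), (False, True),
)


def run_insertion_stages(
    taxa_by_priority: Sequence[str],
    try_insert: Callable[[str, bool, bool], bool],
    after_stage: Callable[[], None] = lambda: None,
) -> Tuple[List[Tuple[str, bool, bool]], List[str]]:
    """Schedule insertion attempts over the four stages.

    try_insert(taxon, all_sources, cons) is a HYPOTHETICAL callback that
    returns True when the taxon was inserted in the backbone tree.
    after_stage() is a HYPOTHETICAL hook for the per-stage PI check.
    Returns the log of successful insertions and the taxa left out.
    """
    pending = list(taxa_by_priority)  # taxa not yet inserted, in priority order
    log: List[Tuple[str, bool, bool]] = []
    for all_sources, cons in STAGES:
        i = 0
        while i < len(pending):
            if try_insert(pending[i], all_sources, cons):
                log.append((pending.pop(i), all_sources, cons))
                # A successful insertion may unlock taxa skipped earlier:
                # retry them first, in priority order, before moving on.
                i = 0
            else:
                i += 1
        after_stage()
    return log, pending
```

Taxa that remain in the second returned list after the last stage are simply left out, which is what makes the resulting supertree nonplenary.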
The STC preprocess
where n = i + max(t). This value is compared to the quantile corresponding to the threshold τ given by the user, i.e. x_{0} such that Prob{x < x_{0}} = 1 − τ. If χ^{2} > x_{0}, the STC preprocess rejects H_{0} and inserts the triplet associated with i in $\mathcal{W}(\mathcal{T})$, i.e. the set of dropped triplets. Note that the two tests performed on each non-null coordinate are not independent. The user should therefore treat the threshold as a tuning parameter rather than interpret it as the probability that the STC drops a triplet that underlies a real anomaly. After that, the STC preprocess modifies the source trees by applying PhySIC_IST to each T_{j} ∈ $\mathcal{T}$, with $\mathcal{R}$ = $\mathcal{R}$(T_{j}) and $\mathcal{R}$ = $\mathcal{W}(\mathcal{T})$. In this way, we force the source trees not to contain the dropped triplets. As a result, each modified tree may contain new multifurcations or lack some of its former taxa (if the phylogenetic position of these taxa varies greatly within the forest). PhySIC_IST is then applied to the modified source trees. If the user does not agree with the source tree modifications, they can change τ and restart the STC procedure, or choose to skip it.
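The comparison of the statistic with the quantile x_{0} can be sketched as follows. The number of degrees of freedom is not stated in the passage above, so this sketch assumes one degree of freedom; that choice, and the function names chi2_quantile_df1 and drops_triplet, are assumptions made for illustration only. With one degree of freedom, χ^{2} is the square of a standard normal variable, so x_{0} is the square of the standard normal quantile at 1 − τ/2.

```python
from statistics import NormalDist


def chi2_quantile_df1(tau: float) -> float:
    """Return x0 such that Prob{x < x0} = 1 - tau.

    ASSUMPTION: x follows a chi-square law with ONE degree of freedom,
    i.e. x = z**2 with z standard normal, hence
    x0 = (Phi^{-1}(1 - tau / 2))**2.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - tau / 2.0)
    return z * z


def drops_triplet(chi2_stat: float, tau: float) -> bool:
    """True when H0 is rejected, i.e. the triplet joins the dropped set."""
    return chi2_stat > chi2_quantile_df1(tau)
```

Under this assumption, a smaller τ gives a larger x_{0}, so fewer triplets are dropped and the preprocess behaves more like a pure veto method; a larger τ drops more triplets and makes it more liberal.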
Abbreviations
PC: Property of non-Contradiction
PI: Property of Induction
PhySIC: PHYlogenetic Signal with Induction and non-Contradiction
PhySIC_IST: PHYlogenetic Signal with Induction and non-Contradiction Inserting a Subset of Taxa
MRP: Matrix Representation with Parsimony
SMAST: Maximum Agreement SuperTree
SMCT: Maximum Compatible SuperTree
Declarations
Acknowledgements
We would like to thank Gilles Caraux for helpful comments and discussions on the statistical aspects of the STC preprocess and Alexis Criscuolo for invaluable advice in setting the simulation protocol. We thank the anonymous referees for helpful suggestions. This work has been supported by the Conseil Scientifique of the University Montpellier 2, and by the Research Networks Program in BIOINFORMATICS of the High Council for Scientific and Technological Cooperation between France and Israel. This publication is the contribution no. 2008064 of the Institut des Sciences de l'Évolution de Montpellier (UMR 5554-CNRS).
References
1. Bininda-Emonds ORP, Stamatakis A: Taxon sampling versus computational complexity and their impact on obtaining the Tree of Life. In Reconstructing the Tree of Life: Taxonomy and Systematics of Species Rich Taxa. Volume 72. Edited by: Hodkinson T, Parnell J. New York: Systematics Association Special Series, CRC Press; 2006:77–95.
2. Bininda-Emonds ORP: Supertree construction in the genomic age. Methods Enzymol 2005, 395: 745–57.
3. Baum BR, Ragan MA: The MRP method. In Phylogenetic supertrees: combining information to reveal the Tree of Life. Edited by: Bininda-Emonds O. Kluwer; 2004, 17–34.
4. Semple C, Steel M: A supertree method for rooted trees. Discrete Appl Math 2000, 105: 147–158.
5. Page RDM: Modified mincut supertrees. In Proceedings of the 2nd International Workshop on Algorithms in Bioinformatics (WABI'02). Edited by: Guigó R, Gusfield D. 2002, 537–552.
6. Snir S, Rao S: Using max cut to enhance rooted trees consistency. IEEE/ACM Trans Comput Biol Bioinformatics 2006, 3(4):323–333.
7. Chen D, Eulenstein O, Fernández-Baca D, Gordon Burleigh JG: Improved Heuristics for Minimum-Flip Supertree Construction. Evolutionary Bioinformatics 2006, 2: 401–4103.
8. Goloboff PA, Pol D: Semi-strict supertrees. Cladistics 2002, 18(5):514–525.
9. Goloboff PA: Minority-rule supertrees? MRP, Compatibility, and MinFlip may display the least frequent groups. Cladistics 2005, 21: 282–294.
10. Cotton JA, Slater CSC, Wilkinson M: Discriminating Supported and Unsupported Relationships in Supertrees using Triplets. Syst Biol 2006, 55(2):345–350.
11. Berry V, Nicolas F: Maximum Agreement and Compatible Supertrees. In Proceedings of CPM, LNCS. Edited by: Sahinalp SC, Muthukrishnan S, Dogrusoz U. 2004, 3109: 205–219.
12. Berry V, Nicolas F: Maximum Agreement and Compatible Supertrees. JDA 2007, 5(3):564–591.
13. Ranwez V, Berry V, Criscuolo A, Fabre PH, Guillemot S, Scornavacca C, Douzery EPJ: PhySIC: a Veto Supertree Method with Desirable Properties. Syst Biol 2007, 56(5):798–817.
14. Gordon AG: Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labelled leaves. J Classif 1986, 3: 335–348.
15. Huson DH, Nettles SM, Warnow TJ: Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J Comput Biol 1999, 6(3–4):369–386.
16. Thorley J, Wilkinson M, Charleston M: The information content of consensus trees. In Advances in Data Science and Classification. Studies in Classification, Data Analysis, and Knowledge Organization. Edited by: Rizzi A, Vichi M, Bock HH. 1998, 91–98.
17. Dutheil J, Gaillard S, Bazin E, Glemin S, Ranwez V, Galtier N, Belkhir K: Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics 2006, 7: 188.
18. Criscuolo A, Berry V, Douzery EJP, Gascuel O: SDM: a Fast Distance-based Approach for (Super)Tree Building in Phylogenomics. Syst Biol 2006, 55(5):740–755.
19. Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 1980, 16(2):111–20.
20. Eulenstein O, Chen D, Burleigh JG, Fernández-Baca D, Sanderson MJ: Performance of flip supertree construction with a heuristic algorithm. Syst Biol 2004, 53: 299–308.
21. Guindon S, Gascuel O: A simple, fast and accurate method to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696–704.
22. Ranwez V, Delsuc F, Ranwez S, Belkhir K, Tilak M, Douzery EPJ: OrthoMaM: A database of orthologous genomic markers for placental mammal phylogenetics. BMC Evol Biol 2007, 7: 241+. [http://kimura.univ-montp2.fr/orthomam/html/]
23. Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts; 2003.
24. Hallstrom BM, Kullberg M, Nilsson MA, Janke A: Phylogenomic Data Analyses Provide Evidence that Xenarthra and Afrotheria Are Sister Groups. Mol Biol Evol 2007, 24(9):2059–2068.
25. Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W: Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res 2007, 17(4):413–421.
26. Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, Gascuel O, Grossman LI, Romero R, Goodman M: Genomics, biogeography, and the diversification of placental mammals. Proc Nat Acad Sci 2007, 104(36):14395–14400.
27. Lartillot N, Philippe H: Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philos Trans R Soc Lond B Biol Sci 2008.
28. Jobb G, von Haeseler A, Strimmer K: TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol 2004, 4: 18.
29. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sorensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 2008, 452(7188):745–U5.
30. Semple C, Steel MA: Phylogenetics. Oxford Lecture Series in Mathematics and its Applications, Volume 24. Oxford University Press; 2003.
31. Guillemot S, Berry V: Finding a largest subset of rooted triples identifying a tree is an NP-hard task. Tech rep, LIRMM, Univ Montpellier; 2007.
32. Wu BY: Constructing the maximum consensus tree from rooted triples. Journal of Combinatorial Optimization 2004, 29: 29–39.
33. Jansson J: On the complexity of inferring rooted evolutionary trees. Proceedings of GRACO 2001, Electron Notes in Disc Math 2001, 7: 50–53.
34. Bryant D: Building Trees, Hunting for Trees, and Comparing Trees: theory and method in phylogenetic analysis. PhD thesis. University of Canterbury; 1997.
35. Huelsenbeck J, Bollback J, Levine A: Inferring the root of a phylogenetic tree. Syst Biol 2002, 51: 32–43.
36. Aho AV, Sagiv Y, Szymanski TG, Ullman JD: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comp 1981, 10(3):405–421.
37. Fienberg SE: The analysis of cross-classified categorical data. Cambridge, MA: MIT Press; 1977.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.