Skip to main content


Selecting informative subsets of sparse supermatrices increases the chance to find correct trees



Character matrices with extensive missing data are frequently used in phylogenomics with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. Drawbacks of these selections are their exclusive reliance on data coverage without consideration of actual signal in the data which might, thus, not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristics implemented in a software called mare which (1) assesses information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30%.


With matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30% Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. A selection of a data subset with the herein proposed approach increased the chance to recover correct partial trees more than 10-fold. The selection of data subsets with the herein proposed simple hill climbing procedure performed well either considering the information content or just a simple presence/absence information of genes. We also applied our approach on an empirical data set, addressing questions of vertebrate systematics. With this empirical dataset selecting a data subset with high information content and supporting a tree with high average boostrap support was most successful if information content of genes was considered.


Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis outperforming the usually used simple selections of taxa and genes with high data coverage.


In most phylogenomic studies supermatrices of concatenated presumably orthologous genes are used for tree inference [1-18]. Due to the failure of consistently identifying orthologous genes among taxa [2] and/or due to general sparse sequence data availability these supermatrices frequently display a low data coverage, down to 8% [2]. Simulation studies showed that in these instances chances of recovering a correct and robust tree can drastically decrease [1, 19]. Additionally, Wiens [20, 21], Philippe [22], Sanderson [1, 19, 23], Driskell [2], Hartmann [24] and colleagues showed that low gene data coverage of single taxa can already impede the success of tree reconstructions. In contrast, other simulation studies demonstrated that single taxa with low gene data coverage can help breaking up long branches and thus improve tree reconstructions [20, 21, 25-28]. These mentioned analyses of empirical and simulated data demonstrate that confounding effects of low gene data coverage on tree inference can hardly be generalized [1, 3, 11, 29-36].

Despite these unresolved issues many investigators select sets of taxa with high gene data coverage assuming that the high gene data coverage will improve the robustness of tree inferences [3, 4, 9, 11, 16, 17]. However, these threshold criteria are arbitrary and do not take into account potential phylogenetic signal of the data. Those approaches might not lead to the desired increase of tree robustness. For example, tree robustness will not increase, if high gene data coverage is achieved by selecting highly conservative orthologous genes with low phylogenetic signal. Alternatively, a robust tree might result if taxa with low gene data coverage but highly informative genes have been selected, Driskell et al. [2] e.g. report an example of plausible tree reconstructions based on a supermatrix with a gene data coverage of just 8-16%. Both cases illustrate that gene data coverage and phylogenetic resolution are not necessarily correlated. Consequently, the practice of selecting data based solely on data coverage is potentially problematic. Therefore, we have developed an approach which focuses on the analyses of selected optimal data subsets (SOS) which have high data coverage and phylogenetic signal. Crucial for this approach is the assessment of potential signal of genes and the development of a heuristics to select such an SOS.

Different quartet mapping approaches have been used to assess potential signal within genes [37, 38]. Among these, geometry mapping is demonstrably the most conservative estimator [37] and the application to genes of supermatrices is straightforward. Consequently, we have chosen the geometry mapping approach [37-40] to assess potential signal of genes in the development of our heuristics.

In order to select an optimal set of taxa and genes, Sanderson and colleagues [23] suggested selecting sets of full data coverage (maximal bicliques [41, 42]). However, the identification of the maximal (maximum) biclique is a NP-complete problem [42, 43] and, thus, there is no guarantee to find the maximal (maximum) biclique. Additionally, Sanderson et al. [23] found that selections of maximal bicliques resulted in very small subsets of size < 15 taxa and < 10 genes. Sanderson’s approach is, thus, not suitable to reconstruct phylogenetic relationships of many taxa. A possible solution might be the selection of quasi-bicliques [44, 45], which potentially combine a much larger set of taxa and genes accepting a predefined level of missing data. This promising direction however has the drawback that it is not time-efficient.

Alternatively Hartmann et al. [24] and Cheng et al. [46] introduced two approaches directly applicable to sequence data. The first approach of Hartmann et al. [24] is a masking technique (REAP) which masks multiple sequence alignments according to predefined thresholds of gap frequencies of sites. The approach of Cheng et al. [46] is a statistical correction for missing data (SIA). A comparison of these two approaches demonstrated that REAP performed better, a result which is compatible with the results of Sanderson’s biclique approach. However, both, alignment masking (REAP) and the biclique approach optimize data only with respect to data coverage and without considering potential signal among genes.

Here, we introduce a simple hill climbing algorithm to select optimal data subsets (SOS) which are assembled by considering data coverage and potential signal of genes. We start with the assumption that any taxon and gene can potentially contribute to the total signal of the matrix. However, taxa or genes with incomplete data coverage and low signal can potentially also contribute noise or cause biases to the total signal of the supermatrix. Therefore, we successively mask taxa and genes of low signal and/or data coverage generating a submatrix of higher data coverage and signal. With this approach we deliberately discard taxa and genes because of their low data coverage and/or potential low signal. The proposed hill climbing algorithm delivers an optimal solution of this trade-off. Using simulated and empirical data, we compare the performance of the herein proposed approach with an often applied approach of simply selecting data subsets using predefined thresholds of data coverage only.


The approach can be separated into two parts: (1) the determination of information content of genes, taxa and the concatenated supermatrix and (2) the selection of an optimal subset (SOS) of taxa and genes.

Information content of genes, taxa and matrices

Before we define the information content of genes, taxa and matrices used in our approach, we have to introduce the concepts of data coverage representation matrices.

A concatenated supermatrix of N taxa and n gene nucleotide/amino acid sequence alignments can be represented as a matrix B with entries b ij

B: b ij =(10),(taxa:i:1N,genes:j:1n)

with b ij = (1) for a present and b ij = (0) for an absent gene nucleotide/amino acid sequence j for a taxon i. We call this matrix B the data coverage representation matrix.

We define the information content of a gene j, q j , as the relative data coverage of this gene, defined as

q j = i = 1 N b ij N ,taxa:i:1N.

Likewise, the information content of a taxon i, p i is defined as

p i = j = 1 n b ij n ,genes:j:1n.

We define the information content, P, of a matrix B as

P(B)= i = 1 N j = 1 n p i N × n = i = 1 N j = 1 n q j N × n

with 0 ≤ P (B),p i ,q j ≤ 1. To determine the potential signal of genes we use geometry mapping [37] extended to the amino acid level. Nieselt-Struwe et al. [37] showed that for a given quartet of sequences, relative support for each of the three possible topologies s1,s2,s3 can be computed as

s i = δ i /( δ 1 + δ 2 + δ 3 )

with δ i support for tree T i , 0 ≤ s i ≤ 1 and i s i =1. Support values δ i can be computed with any optimality criterion. Relative support values can be interpreted as baricentric coordinates of a bipartite simplex graph S  with vectors s = (s1,s2,s3):

S= i = 1 3 s i e i | s 1 + s 2 + s 3 = 1 , 0 s 1 , s 2 , s 3 1

with e i as unit vectors. Within S, areas T1,T2,T3 at vertices can be defined for resolved quartets, T1,2,T1,3,T2,3 for partly resolved quartets, and T for star-like, unresolved topologies of quartets [37, see Figure 1]. For all possible quartets k j of a gene j, k j = N 4 with N the number of taxa, all vectors s m = (s1,s2,s3),(m : 1 … k) can be calculated, and the frequency of vectors in areas T1,T2, and T3 determine potential signal, t j of a gene j[37].

t j = T 1 + T 2 + T 3 T 1 + T 2 + T 3 + T 1 , 2 + T 1 , 3 + T 2 , 3 + T

We relaxed the definition of signal by calculating the frequency of vectors in areas T1,T2,T3,T1,2,T1,3,T2,3.

t j ̂ = T 1 + T 2 + T 3 + T 1 , 3 + T 2 , 3 + T 1 , 2 T 1 + T 2 + T 3 + T 1 , 2 + T 1 , 3 + T 2 , 3 + T

Our approach will, thus, be a more optimistic estimator of potential signal. Signal t j ̂ will be 0 t j ̂ 1 (examples of simulated data, Figure 1).

Figure 1

Simplex graphs of three different partitions of simulated data. The area of ϕ corresponds to the inner triangle in light blue, the area of ϕ r  to the surrounding region. Attractors of topologies T1,T2,T3 are found in the corner sections of the outer triangle in dark blue.

Geometry mapping is a conservative estimator of t j ̂ , however, within a narrow range of short internal and long terminal branch lengths, geometry mapping opts for the wrong tree, a classical case of long branch attraction [37]. This phenomenon might inflate the estimation of t j ̂ under certain circumstances.

Nieselt-Struwe and colleagues [37] showed that for any alphabet of characters of finite length, e.g. nucleotides or amino acids, an enumeration of character states among four sequences can be used to calculate support for all three possible topologies. They further showed that a weight matrix M, defining dissimilarity measures between characters, can equivalently be used to calculate distances between sequences. Therefore, we used BLOSUM62, the amino acid substitution matrix introduced by Henikoff [47], to calculate distances between sequences in correspondence to equation (8) in Nieselt-Struwe et al. [37].

We use t j ̂ of each gene j to update entries of matrix B. For each gene j, entries of matrix B = (b ij ) are scaled with the corresponding t j ̂ values. We call this matrix a weighted data coverage representation matrix B, in short, a weighted matrix B, in the following:

B : b ij = ( 0 b ij t j ̂ 1 ) , ( taxa : i : 1 N , genes : j : 1 n )

Substituting b ij for b ij results in weighted forms of equations 1 and 2. The information content of a gene j, q j , represents in its weighted form a product of relative data coverage and potential signal of genes.

Selection of an optimal subset (SOS) of taxa and genes

We consider a subset(=submatrix) of taxa and genes optimal, if it has a high information content, P (B) and contains as many taxa and genes as possible. If we discard genes or taxa with low q j or p i respectively, we will increase P of the matrix, but will loose information on the excluded taxa and genes. A simple optimization can be performed, searching for the highest possible P while excluding as few taxa/genes as possible.

First, a data coverage representation matrix B is generated from the concatenated supermatrix of multiple gene nucleotide/amino acid sequences corresponding to equation (1). Secondly, for each gene j, ≤20,000 quartets are randomly drawn without duplication and t j ̂ is calculated. For each gene j, entries of B = (b ij ) are scaled with the corresponding t j ̂ values, generating a weighted matrix B corresponding to equation (6). Thirdly, we use a simple hill climbing procedure to select an optimal subset (SOS) of taxa and genes. Elimination of taxa or genes starts with dropping either a taxon or gene with the lowest information content p i or q j , generating a new matrix B with P (B). In case of ties between q j and p i , genes will be excluded. Since taxa or genes with lowest information content will be dropped, P (B) > P (B) (it is trivial to show that this will always be the case). After each elimination step, information content of taxa (p i ) and genes (q j ) are recalculated. Every gene represented by less than 4 taxa is automatically dropped from the matrix. Gene overlap between taxa is monitored to a minimum of three taxa and two genes. If the matrix B does not fulfill this criterion, the next best B in terms of P is selected.

Continuous elimination of taxa or genes with low p i or q j will generate a ‘trivial’ SOS containing few taxa and one gene. Therefore, we define an optimality function f (P)

f(P)=1-|(λ- P α × ( 1 - P ) )|ifP<1

with α as a scaling factor (default set to α = 3) and λ as the size ratio between reduced B and original matrix B

λ= N B × n B N B × n B .

During the process of elimination of taxa and/or genes, P will continually increase, and λ will continually decrease. f(P) will reach a maximum of 1. With a scaling factor α = 2, the maximum will be at the intersection of P and λ, with α = 3 it will be reached later, favoring an SOS with a higher P (Figures 2 and 3). If f (P) = 1 the process of elimination stops.

Figure 2

Optimality function and its effects.

Figure 3

Influence of different α values on the identification of reduction optima in simulated data matrices. The dark bold descending line corresponds to λ, size reduction; the grey bold ascending line to the increase in total informativeness.

The outlined procedure is a simple hill climbing heuristics without guarantee of finding a globally optimal solution due to the interaction of p i and q j . The approach can be applied either to B or B. It should be pointed out that removal of taxa will have an influence on the calculation of t j ̂ which is not recalculated during the process of matrix reduction. This simplification greatly speeds up the heuristics. An iterative recalculation of t j ̂ can potentially improve the selection of an informative dataset and will be further studied.

Calculation time for this heuristics grows with the number of taxa (N) and genes (n). Therefore, it is time efficient, O (N + n)2. The algorithm reduces matrices in a deterministic way which makes matrix reduction reproducible. However, different equally optimal solutions will not be found under identical parameter settings.

By varying the scaling parameter α, however, an SOS of high P (α ≥ 3), versus an SOS of more taxa and genes with lower P (α ≤ 3) can be found.

Simulated data

Our simulations were not set up with the intention of fully exploring the performance of matrix reductions depending on super matrix characteristics, but were set up in order to illustrate the potential of the method in four different cases, resembling observed situations of empirical data.

Simulated data with random distribution of missing data

For two different sets of genes, differing in relative evolutionary rates among genes (Figure 4), we simulated 100 (50 taxa × 50 genes) supermatrices each, composed of genes with 400 amino acids (aa), concatenated for each taxon to 20,000 aa length using Seq-Gen [48] and the BLOSUM62 matrix. For these simulations, we used a topology derived from empirical data with realistic distribution of branch lengths (Figure 5A). Evolutionary rates of genes varied from 0.001 to 15.00 relative rate differences, to mimic different signal strength (Figures 4 and 6). Within each gene, site rates were homogeneous. In order to generate supermartices with missing data, we removed amino acid sequences of taxa using a Binomial distribution with a probability of retaining data entries for each taxon and gene of 0.7 (average data coverage of 0.29, Table 1). This set up generated supermatrices with randomly distributed missing data, closely resembling the observed data coverage of published concatenated supermatrices of Dunn and colleagues [4].

Figure 4

Histograms of heterogeneity of signal among genes of simulated data. On the left, set 1, the histogram of simulated data shows relative low heterogeneity of signal among genes, on the right, set 2, the histogram shows relative high heterogeneity of signal among genes, with a higher percentage of genes of low potential information content.

Figure 5

Topologies with branch lengths used for data simulation. Different branch lengths between tree A and B are labeled in light grey in tree B.

Figure 6

Examples of presence/absence B and edge-weighted Bdata used in simulations. Matrices of data availability in the upper panels, transformed edge-weighted matrices in the lower panels. All matrices are sorted. (A,B) 20% data availability, relative (A) low and (B) high heterogeneity of potential signal, missing data Gaussian distributed. (C,D) 10% data availability, relative (C) low and (D) high heterogeneity of potential signal, missing data following a power-law non-random distribution.

Table 1 Summary of simulation results

Simulated data with power-law and non-random distribution of missing data

For two different sets of genes, differing in relative evolutionary rates among genes (Figure 4), we further simulated 100 (50 taxa × 50 genes) supermatrices each, composed of genes with 400 aa, concatenated for each taxon to 20,000 aa length. We used again the topology derived from empirical data with realistic distribution of branch lengths (Figure 5B). We changed seven branch lengths to introduce potential long branch attraction (Figure 5B). In order to generate supermartices with missing data, we followed a proposal of Li and colleagues [49]. These authors showed that the distribution of missing data in many empirical supermatrices is best described by applying a power law function of the probability of having data. Following their observation, we assigned to each taxon and gene a probability of having data randomly drawn from f (x) = (1/10 x-1/2) - 0.1, for x randomly selected with equal probability from, 0 ≤ x ≤ . Additionally, we constrained data assignment to having at least one gene for each taxon. Following this approach, we concatenated supermatrices with a distribution of missing data approximately similar to observed empirically supermatrices (Misof, unpubl.) (average data coverage 0.13, Table 1). Finally, we raised the probability of data coverage for four predefined taxa, mimicking the often seen high coverage of a few taxa for which genomes are available.

Selecting subsets from simulated data and tree reconstructions

Selecting subsets with the hill climbing algorithm

SOS’s were selected using the mare software (mare: matrix reduction) which implements the herein described novel approach. For each supermatrix, trees were reconstructed 1) using the original supermatrix (data coverage 0.3), 2) an SOS of B and 3) an SOS of B. Trees were reconstructed with RAxML 7.0.0 [50, 51]. The BLOSUM62 amino acid substitution matrix with Γ distributed among site rate heterogeneity was used to account for different substitution rates among genes.

To compare reconstructed trees with the correct trees used in data simulations, we used standardized quartet distances between shared taxa [24, 52-55]. QDistances (d QD ) were standardized in relation to all quartets of shared taxa. We recorded d QD ’s of trees inferred from the unreduced matrix and of the two SOS’s derived from B and B.

Selecting subsets with predefined thresholds of data coverage

From supermatrices with power-law and non-random distribution of missing data we selected subsets in two different ways: (1) we selected all genes with data coverage above or equal to 0.4 and (2) we selected all taxa with data coverage above or equal to 0.04 and all genes with data coverage above or equal to 0.4 (adapted to the new number of taxa). We recorded d QD ’s of trees inferred from unreduced matrices and from subsets.

Selecting subsets from empirical data and tree reconstructions

We studied the performance of using the hill climbing algorithm with matrices B and B using the published empirical metazoan data set of Driskell et al. [2] comprising 1,131 putative orthologous genes for 70 taxa (Metazoa, Fungi + outgroup). Additionally, we selected data subsets of the Driskell supermatrix applying predefined thresholds of gene - and taxa coverage (Table 2). All ML analyses using RAxML v7.2.6 or 7.2.8 were executed with rapid bootstrapping (PROTCAT) and best tree search (PROTGAMMA) in one step (-f a, 500 or 1,000 BS replicates) and the empirical substitution matrix WAG [56]. A posteriori bootstop tests were performed to test for a sufficient number of bootstrap replicates [57]. All analyses were conducted using RAxML HYBRID and PTHREADS versions on HPC Linux clusters, 8 nodes with 8 or 12 cores each, at the Regionales Rechenzentrum Köln (RRZK) using Cologne High Efficient Operating Platform for Science (CHEOPS). Further, we compared the effects of data reduction on tree robustness with the resolution score as introduced by Holland and colleagues [58]. This resolution score, RS, calculated as the sum of bootstrap support values ≥50 divided by the number of taxa N - 3, represents a measure of average bootstrap support and, thus, robustness of trees.

Table 2 Comparison of matrix reductions with empirical data using mare and simple predefined thresholds


Performance with simulated data

Tree reconstructions based on unreduced supermatrices with a Gaussian distribution of missing data did not yield correct trees except for one case in set 1 (columns (org) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). The variability of d QD values was low (columns (org) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). Tree reconstructions based on all SOSs (unweighted and weighted reductions of set1 and set2) of these supermatrices performed much better (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). Compared with trees derived from unreduced supermatrices, SOSs supported more often correct trees, but had a higher frequencies of wrong quartets (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). However, there was no clear difference of mean d QD values between trees based on SOSs derived from B (uw) or B (w) (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). Trees based on SOSs of B (w) had a much lower amplitude of d QD values (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). SOSs derived from B contained on average more taxa (Table 1).

Figure 7

Summary of results of tree reconstructions of the four different set ups. In this figure, columns (org) refer to tree reconstructions based on unreduced matrices, columns (w) refer to tree reconstructions based on reduced weighted matrices B, columns (uw) refer to tree reconstructions based on reduced unweighted matrices B, columns (ca, cb) refer to tree reconstructions based on the application of thresholds to genes only and columns (cc, cd) refer to tree reconstructions based on the application of thresholds to genes and taxa concerning data coverage. Set1 refers to datasets with relatively low among gene rate heterogeneity and set 2 refers to datasets with relatively high among gene rate heterogeneity. Columns labeled (gaussian) illustrate results derived from datasets with gaussian distribution of missing data, columns labeled (power-law) illustrate results derived from datasets with power-law and non-random distribution of missing data and columns labeled (thresholds) illustrate results derived from datasets reduced by applied predefined thresholds of data coverage. In (A) the distribution of d QD values for each simulation and set up is presented as box plots with median (black bar), quartiles (box), whiskers (doted range) and outliers (circles). The smaller d QD , the more similar is the reconstructed tree to the tree used in data simulations. In (B) columns represent the frequency of correct tree reconstruction for each set up.

Tree reconstructions based on the unreduced matrix with power-law non-random distribution of missing data did not recover correct trees for set 1 and set 2. In both cases variability of d QD values was high (columns (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7A,B, Table 1). Tree reconstructions based on all SOSs (unweighted and weighted reductions of set 1 and set2) clearly outperformed reconstructions based on the unreduced matrices (columns (org), (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7A,B, Table 1). The absolute number of correct trees was again higher for all SOSs (unweighted and weighted reductions of set 1 and set2) compared with the number of correct trees inferred from the unreduced matrices. In cases of low relative rate differences among genes, set 1, SOSs derived from B (uw) performed worse compared to SOSs derived from B (w), in cases of high relative rate differences among genes, set 2, the opposite was observed (columns (org), (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7B, Table 1).

Data subsets derived from matrices with power-law non-random distribution of missing data using predefined thresholds of gene coverage supported trees with lower mean d QD values (columns (ca), (cb) in Figure 7A) in comparison with mean d QD values of trees inferred from SOSs selected with our approach (column (w), (uw) for set 1 and set 2 of the power-law data in Figure 7A, Table 1). The mean d QD values were higher and the amplitude of d QD was large (columns (ca), (cb) in Figure 7A). Data subsets from matrices with power-law non-random distribution of missing data using combined thresholds of data coverage for genes and taxa did support trees with mean d QD values (columns (cc), (cd) in Figure 7A) comparable with mean d QD values of trees inferred from SOSs of set 1 and set 2 selected with our approach (column (w), (uw) for set 1 and set 2 of the power-law data in Figure 7A, Table 1). The amplitude of d QD values however was large (columns (cc), (cd) in Figure 7A). Applying only thresholds for gene data coverage yielded a lower absolute number of correct trees (columns (ca), (cb) in Figure 7B) compared with our approach, but the absolute number of correct trees was comparable or even higher if combined thresholds of taxa and genes were used (columns (cc), (cd) in Figure 7, Table 1).

In summary, reduction of supermatrices often increased the chance to find a correct tree, but not consistently. SOSs derived from B did not always support correct trees more often compared with SOSs derived from B, but had a much smaller amplitude of d Q values. Data subsets derived from predefined thresholds supported fewer correct trees if only applied to genes but supported comparable numbers of correct trees if used with combined thresholds of data coverage for taxa and genes.

Performance with empirical data

We applied our approach to the published metazoan data set of Driskell et al. [2] comprising 1,131 genes for 70 taxa (Metazoa, Fungi + outgroup). The data coverage was low (0.0836), the matrix information content was low (P = 0.0657). Most genes are represented only by few taxa (e.g. Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Sus scofra). We excluded six taxa of which the complete genome was available from the original matrix showing the highest coverage (Homo sapiens, Mus musculus, Rattus norwegicus, Sus scofra, Bos taurus, and Gallus gallus) and selected an SOS from these data. With this procedure we removed the most extreme heterogeneity of data coverage among taxa prior to the selection of an SOS.

Selecting an SOS resulted in a data subset of 48 taxa and 45 genes with a data coverage of 0.316 and P = 0.223. Thus, a SOS was found with a 10.24% loss of taxa and a 9.08-fold increase in data coverage and a 16.043-fold gain in P. However, all outgroup taxa including slime molds, fungi and nematodes had been excluded. We compared tree reconstructions based on 1) the original unreduced supermatrix with 64 taxa (1000 bs replicates, 469,480 aa) and 2) the SOS of 48 taxa and 45 genes (1,000 bs replicates, 11,198 aa). An a posteriori bootstop test (default MR-based bootstopping criterion, WRF average of 100 random splits) revealed that 1,000 BS were by far sufficient for both analyzed data sets.

Tree reconstructions with the 64-taxa set resulted in trees with polyphyletic Tetrapoda, Actinopterygii, monophyletic Marsupialia + Monotrema, and largely unresolved basal splits within Theria (Figure 8A).

Figure 8

Driskell trees. Tree reconstructions based on the (A) 64-taxa set and on (B) the reduced data. Nodes with less than 50% are collapsed. Black dots: 100% BS support.

The tree based on the SOS was more congruent to general taxonomic views. The topology showed moderately supported monophyletic Tetrapoda, and resolution within Ungulates and Carnivora (Figure 8). However, for example Actinopterygii remained paraphyletic and relationships of Marsupialia and Monotrema were not resolved. The resolution score RS increased from 82.148% (unreduced supermatrix including 64 taxa and 1,1131 genes) to 87.38% (SOS). We also compared reductions of the original Driskell supermatrix using different parameter settings in our approach and simple thresholds of data masking (Table 2). Applying predefined thresholds of gene and taxa coverage never resulted in matrices with comparable resolution scores and comparable number of taxa. Our approach outperformed the application of simple thresholds.


We show that supermatrices of simulated amino acid sequence data with low data coverage and relative rate differences among genes can support biased tree inference or low robustness of trees. It can be suspected that these effects will even be stronger for empirical data. These conclusions corroborate results of Hartmann [24], in many aspects Philippe [22] and Wiens and colleagues [28]. Effective techniques to reduce these potential biases in tree inference are therefore clearly needed.

Masking supermatrices and deleting rogue taxa after tree reconstructions could be suitable measures as has been applied by Dunn and colleagues [4]. In their analysis these authors selected taxa and genes according to predefined cutoff values of data coverage. The application of cutoff values considers only the extent of missing data which might favor the selection of the most conserved genes readily identified among all taxa in the data. Additionally, Dunn et al. [4] deleted rogue taxa after tree reconstruction based on an idea introduced by Thorley and colleagues [59, 60]. The major drawback of their approach is that robustly misplaced taxa will not be identified. In this respect, a formal approach to masking of supermatrices as proposed here could be an alternative worth to consider.

We propose to select a subset of taxa and genes with a maximal information content. In doing so, it is necessary to first assess potential signal of genes, for which we use extended geometry mapping (eGM) [37-40]. We opted for geometry mapping, because it tends to be more conservative in discriminating between resolved and star-like trees in contrast to likelihood mapping [61]. Additionally, eGM is easily applied to nucleotide and amino acid sequence data without the need of tree reconstructions. It is, thus, a technically convenient but, admittedly, coarse way of estimating potential signal.

Secondly, it is necessary to select optimal subsets of supermatrices based on the information content of taxa and genes. The information content of taxa and genes is calculated as the ratio of potential signal and data coverage. By introducing this optimality criterion we can select taxa and genes which contribute most signal in tree reconstructions. We select a data subset in a stepwise function penalizing size reduction of the supermatrix and favoring higher matrix information content, monitoring but ignoring optimization of connectivity in the matrix. Our approach is time efficient but will not be effective in discovering a globally optimal subset in terms of taxa/gene overlap (‘connectivity’) and information content. This is in contrast to the approach of Yan [44] in which the quasi-biclique with the highest level of connectivity (‘largest grove’) is searched for.

Improved heuristics considering information content and connectivity in our approach are certainly conceivable. However, the distribution of missing data following a power-law distribution in empirical data suggests that simple hill climbing procedures will be effective in identifying a good (optimal) subset of taxa and genes in terms of matrix information content. The flexibility of our approach offers even the chance to use different parameter settings of the optimality function to identify alternative SOSs.

We observed high amplitudes of d QD values of trees based on SOSs in our simulations. These amplitudes were even higher in SOS’s based on simple data coverage representations. We interpret this occasional high error rate as a possible phenomenon of insufficient taxon sampling in SOSs which might pronounce long branch attraction (LBA), or, alternatively, that connectivity in SOSs was not sufficient to potentially support just one tree [62]. This interpretation highlights a problem of all methods of data reduction. Every reduction process, at least partially, counteracts efforts to reduce biases in tree reconstructions due to insufficient taxon or gene sampling. The analyses of Wiens and colleagues [20, 21, 28] showed that LBA effects can disappear, if data exhibiting LBA are recoded as missing. This implies that an identification of LBA taxa before concatenation and reduction of data would be important. However, we do not have a grip yet on a reliable identification of biases in tree reconstructions which could guide a preselection of taxa. An immediate, however unsatisfying, solution is probably the reconstruction of trees with and without suspect taxa.

Our simulations showed that in the presence of heterogeneous signal among genes the new heuristics increased the chance of finding a correct tree. It is, thus, an alternative to the computationally much more demanding quasi-biclique approach [44, 45]. SOSs derived from B or B matrices did not differ extensively in their success rate of correct tree reconstructions with simulated data, with small advantages for the B in cases of power-law non-random distribution of missing data. However, the analyses of the empirical data imply that tree reconstructions based on SOSs derived from B will result in improved tree robustness.


Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis outperforming the usually used simple selections of taxa and genes with high data coverage. The approach prresented here is will be of general inportance in phylogenomic studies based on large concatenated superalignments with incomplete data coverage. It clearly offers an alternative to threshold based data selection.


  1. 1.

    Sanderson MJ, Driskell AC: The challenge of constructing large phylogenetic trees. Trends Plant Sci. 2003, 8: 374-379. 10.1016/S1360-1385(03)00165-1.

  2. 2.

    Driskell AC, Ané C, Burleigh JG, McMahon MM, O’Meara BC, Sanderson MJ: Prospects for building the tree of life from large sequence databases. Science. 2004, 306: 1172-1174. 10.1126/science.1102036.

  3. 3.

    Philippe H, Delsuc F, Brinkmann H, Lartillot N: Phylogenomics. Annu Rev Ecol Evol Syst. 2005, 36: 541-562. 10.1146/annurev.ecolsys.35.112202.130205.

  4. 4.

    Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sorensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature. 2008, 452: 745-749. 10.1038/nature06614.

  5. 5.

    Bourlat SJ, Nielsen C, Economou AD, Telford MJ: Testing the new animal phylogeny: a phylum level molecular analysis of the animal kingdom. Mol Phylogenet Evol. 2008, 49: 23-31. 10.1016/j.ympev.2008.07.008.

  6. 6.

    de Queiroz A, Gatesy J: The supermatrix approach to systematics. Trends Ecol Evol (Amst). 2006, 22: 34-41.

  7. 7.

    Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6: 361-375.

  8. 8.

    Galtier N, Daubin V: Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond, B, Biol Sci. 2008, 363: 4023-4029. 10.1098/rstb.2008.0144. [],

  9. 9.

    Hausdorf B, Helmkampf M, Meyer A, Witek A, Herlyn H, Bruchhaus I, Hankeln T, Struck TH, Lieb B: Spiralian phylogenomics supports the resurrection of Bryozoa comprising Ectoprocta and Entoprocta. Mol Biol Evol. 2007, 24: 2723-2729. 10.1093/molbev/msm214.

  10. 10.

    Murphy WJ, Pevzner PA, O’Brien SJ: Mammalian phylogenomics comes of age. Trends Genet. 2004, 20: 631-639. 10.1016/j.tig.2004.09.005.

  11. 11.

    Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Quéinnec E, Da Silva C, Wincker P, Le Guyader H, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Wörheide G, Manuël M: Phylogenomics revives traditional views on deep animal relationships. Curr Biol. 2009, 19: 706-712. 10.1016/j.cub.2009.02.052.

  12. 12.

    Regier JC, Shultz JW, Ganley ARD, Hussey A, Shi D, Ball B, Zwick A, Stajich JE, Cummings MP, Martin JW, Cunningham CW: Resolving arthropod phylogeny: exploring phylogenetic signal within 41 kb of protein-coding nuclear gene sequence. Syst Biol. 2008, 57: 920-938. 10.1080/10635150802570791.

  13. 13.

    Shedlock AM, Botka CW, Zhao S, Shetty J, Zhang T, Liu JS, Deschavanne PJ, Edwards SV: Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci USA. 2007, 104: 2767-2772. 10.1073/pnas.0606204104.

  14. 14.

    Smith SA, Beaulieu JM, Donoghue MJ: Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol Biol. 2009, 9: 37-10.1186/1471-2148-9-37.

  15. 15.

    Roeding F, Hagner-Holler S, Ruhberg H, Ebersberger I, von Haeseler A, Kube M, Reinhardt R, Burmester T: EST sequencing of Onychophora and phylogenomic analysis of Metazoa. Mol Phylogenet Evol. 2007, 45: 942-951. 10.1016/j.ympev.2007.09.002.

  16. 16.

    Simon S, Strauss S, von Haeseler A, Hadrys H: A phylogenomic approach to resolve the basal pterygote divergence. Mol Biol Evol. 2009, 12: 2719-2730.

  17. 17.

    Struck T, Paul C, Hill N, et al: Phylogenomic analyses unravel annelid evolution. Nature. 2011, 471: 452-456. 10.1038/471452a.

  18. 18.

    Kocot K, Cannon J, Todt C, et al: Phylogenomics reveals deep molluscan relationships. Nature. 2011, 477: 452-456. 10.1038/nature10382.

  19. 19.

    Sanderson MJ: Construction and annotation of large phylogenetic trees. Aust Syst Bot. 2007, 20: 287-301. 10.1071/SB07006.

  20. 20.

    Wiens JJ: Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol. 2003, 52: 528-538. 10.1080/10635150390218330.

  21. 21.

    Wiens JJ: Missing data and the design of phylogenetic analyses. J Biomed Inform. 2006, 39: 34-42. 10.1016/j.jbi.2005.04.001.

  22. 22.

    Philippe H, Snell EA, Bapteste E, Lopez P, Holland PWH, Casane D: Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004, 21: 1740-1752. 10.1093/molbev/msh182.

  23. 23.

    Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Mol Biol Evol. 2003, 20: 1036-1042. 10.1093/molbev/msg115.

  24. 24.

    Hartmann S, Vision TJ: Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?. BMC Evol Biol. 2008, 8: 95-10.1186/1471-2148-8-95.

  25. 25.

    Poe S: Sensitivity of phylogeny estimation to taxonomic sampling. Syst Biol. 1998, 47: 18-31. 10.1080/106351598261003.

  26. 26.

    Kearny M, Clark JM: Problems due to missing data in phylogenetic analyses including fossils: a critical review. J Vertebr Paleontology. 2003, 23: 263-274. 10.1671/0272-4634(2003)023[0263:PDTMDI]2.0.CO;2.

  27. 27.

    Wiens JJ: Can incomplete taxa rescue phylogenetic analyses from long-branch attraction?. Syst Biol. 2005, 54: 731-742. 10.1080/10635150500234583.

  28. 28.

    Wiens JJ, Moen DS: Missing data and the accuracy of Bayesian phylogenetics. J Syst Evol. 2008, 46: 307-314.

  29. 29.

    Phillips MJ, Delsuc F, Penny D: Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004, 21: 1455-1458. 10.1093/molbev/msh137.

  30. 30.

    Jeffroy O, Brinkmann H, Delsuc F, Philippe H: Phylogenomics: the beginning of incongruence?. Trends Genet. 2006, 22: 225-231. 10.1016/j.tig.2006.02.003.

  31. 31.

    Rodríguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H: Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol. 2007, 56: 389-399. 10.1080/10635150701397643.

  32. 32.

    Ho SYW, Jermiin LS: Tracing the decay of the historical signal in biological sequence data. Syst Biol. 2004, 53: 623-637. 10.1080/10635150490503035.

  33. 33.

    Inagaki Y, Nakajima Y, Sato M, Sakaguchi M, Hashimoto T: Gene sampling can bias multi-gene phylogenetic inferences: the relationship between red algae and green plants as a case study. Mol Biol Evol. 2009, 26: 1171-1178. 10.1093/molbev/msp036.

  34. 34.

    Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD: The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst Biol. 2004, 53: 638-643. 10.1080/10635150490468648.

  35. 35.

    Rosenberg MS, Kumar S: Taxon sampling, bioinformatics, and phylogenomics. Syst Biol. 2003, 52: 119-124. 10.1080/10635150390132894.

  36. 36.

    Leigh JW, Susko E, Baumgartner M, Roger AJ: Testing congruence in phylogenomic analysis. Syst Biol. 2008, 57: 104-115. 10.1080/10635150801910436.

  37. 37.

    Nieselt-Struwe K, von Haeseler A: Quartet-Mapping, a generalization of the likelihood-mapping procedure. Mol Biol Evol. 2001, 18: 1204-1219. 10.1093/oxfordjournals.molbev.a003907.

  38. 38.

    Grünewald S, Forslund K, Dress A, Moulton V: QNet: An agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol. 2007, 24: 532-538.

  39. 39.

    Eigen M, Winkler-Oswatitsch R, Dress A: Statistical geometry in sequence space: a method of quantitative comparative sequence analysis. Proc Natl Acad Sci USA. 1988, 85: 5913-5917. 10.1073/pnas.85.16.5913.

  40. 40.

    Nieselt-Struwe K: Graphs in sequence spaces: a review of statistical geometry. Biophys Chem. 1997, 66: 111-131. 10.1016/S0301-4622(97)00064-1.

  41. 41.

    Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B: Consensus algorithms for the generation of all maximal bicliques. DIMACS Technical Reports 2002-52, Rutgers University, Piscataway, NJ, USA 2002. [],

  42. 42.

    Dias VM, de Figueiredob CM, Szwarcfiter JL: On the generation of bicliques of a graph. Discrete Appl Math. 2007, 155: 1826-1832. 10.1016/j.dam.2007.03.017.

  43. 43.

    Dawande M, Keskinocak P, Swaminathan J, Tayur S: On bipartite and multipartite clique problems. J Algorithms. 2001, 41: 388-403. 10.1006/jagm.2001.1199.

  44. 44.

    Yan C, Burleigh JG, Eulenstein O: Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol Phylogenet Evol. 2005, 30 (3): 528-535.

  45. 45.

    Li J, Sim K, Liu G, Wong L: Maximal quasi-bicliques with balanced noise tolerance: concepts and co-clustering applications. Proceedings of the SIAM International Conference on Data Mining SDM 2008, April 24-26, 2008. 2008, Atlanta, Georgia, USA: SIAM,

  46. 46.

    Cheng F, Hartmann S, Gupta M, Ibrahim JG, Vision TJ: A hierarchical model for incomplete alignments in phylogenetic inference. Bioinformatics. 2009, 25: 592-598. 10.1093/bioinformatics/btp015. [],

  47. 47.

    Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915.

  48. 48.

    Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235-238.

  49. 49.

    Li W, Liu Y: Modeling species-genes data for efficient phylogenetic inference. Proceedings LSS Computational Systems Bioinformatics Conference, August, 2007., Volume 6. 2007, LSS - Life Sciences Society, 429-440. [],

  50. 50.

    Stamatakis A: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006. 2006, Rhodes Island, Greece: IEEE,

  51. 51.

    Ott M, Zola J, Aluru S, Stamatakis A: Large-scale Maximum Likelihood-based phylogenetic analysis on the IBM BlueGene/L. Proceedings of ACM/IEEE Supercomputing conference 2007. 2007, New York, Reno, Nevada: ACM,

  52. 52.

    Mailund T, Pedersen CNS: QDist-quartet distance between evolutionary trees. Bioinformatics. 2004, 20: 1636-1637. 10.1093/bioinformatics/bth097.

  53. 53.

    Christiansenm C, Mailund T, Pedersen CNS, Randers M: Algorithms for computing the quartet distance between trees of arbitrary degree. Edited by: Casadio R, Myers G. 2005, Springer, 77-88.

  54. 54.

    Christiansen C, Mailund T, Pedersen CNS, Randers M, Stissing MS: Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms Mol Biol. 2006, 1: 16-10.1186/1748-7188-1-16.

  55. 55.

    Stissing M, Mailund T, Pedersen CN, Brodal GS, Fagerberg R: Computing the all-pairs quartet distance on a set of evolutionary trees. J Bioinform Comput Biol. 2008, 6: 37-50. 10.1142/S0219720008003266.

  56. 56.

    Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001, 18: 691-699. 10.1093/oxfordjournals.molbev.a003851.

  57. 57.

    Pattengale N, Alipour M, Bininda-Emonds O, Moret B, Gottlieb E, Stamatakis A: How many bootstrap replicates are necessary?. J Comput Biol. 2010, 17: 337-354. 10.1089/cmb.2009.0179.

  58. 58.

    Holland B, Clarke A, Meudt H: Optimizing Automated AFLP Scoring Parameters to Improve Phylogenetic Resolution. Syst Biol. 2008, 57: 347-366. 10.1080/10635150802044037.

  59. 59.

    Thorley JL, Wilkinson M: Testing the phylogenetic stability of early tetrapods. J Theor Biol. 1999, 200 (3): 343-344. 10.1006/jtbi.1999.0999.

  60. 60.

    Thorley JL, Page RDM: RadCon: phylogenetic tree comparison and consensus. Bioinformatics. 2000, 16: 486-487. 10.1093/bioinformatics/16.5.486.

  61. 61.

    Strimmer K, von Haeseler A: Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci USA. 1997, 94: 6815-6819. 10.1073/pnas.94.13.6815. [],

  62. 62.

    Steel M, Sanderson MJ: Characterizing phylogenetically decisive taxon coverage. Applied Mathematics Letters. 2009,

Download references


We acknowledge the important input from all lab members of the zmb, in particular members of the bioinformatics group, including Harald Letsch, Christoph Mayer, Roman Stocsits and Wolfgang Wägele. We thank also John G. Burleigh and Mike Sanderson for kindly providing the metazoan data set of Driskell et al. 2004. The manuscript profited from many constructive comments in particular from comments of anonymous reviewers. B.Mi. and K.Me. were supported by the DFG grant MI 649/6-3, B.M.v.R. was supported by grant WA530/34, and P.K. was supported by grant WA530/33. This is a publication of the Molecular Biology Unit (zmb) of the ZFMK, Bonn.

We provide a software package to perform the proposed matrix reduction. mare is an open source software, a C++ executable is available from

Author information

Correspondence to Bernhard Misof.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

B.Mi., B.Me. conceived the study, designed the setup and performed all analyses. B.Mi. wrote the paper with comments and revisions from K.Me., K.Mi., P.K., B.v.R. and B.Me. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Misof, B., Meyer, B., von Reumont, B.M. et al. Selecting informative subsets of sparse supermatrices increases the chance to find correct trees. BMC Bioinformatics 14, 348 (2013).

Download citation


  • Data Coverage
  • Correct Tree
  • Hill Climbing
  • Data Subset
  • Tree Reconstruction