In most phylogenomic studies, supermatrices of concatenated, presumably orthologous genes are used for tree inference [1–18]. Because orthologous genes cannot be identified consistently among taxa and/or because sequence data are generally sparse, these supermatrices frequently display low data coverage, down to 8%. Simulation studies showed that in these instances the chances of recovering a correct and robust tree can decrease drastically [1, 19]. Additionally, Wiens [20, 21], Philippe, Sanderson [1, 19, 23], Driskell, Hartmann and colleagues showed that low gene data coverage of even single taxa can impede the success of tree reconstructions. In contrast, other simulation studies demonstrated that single taxa with low gene data coverage can help break up long branches and thus improve tree reconstructions [20, 21, 25–28]. Taken together, these analyses of empirical and simulated data demonstrate that the confounding effects of low gene data coverage on tree inference can hardly be generalized [1, 3, 11, 29–36].
Despite these unresolved issues, many investigators select sets of taxa with high gene data coverage, assuming that high coverage will improve the robustness of tree inferences [3, 4, 9, 11, 16, 17]. However, the coverage thresholds used for this selection are arbitrary and do not take into account the potential phylogenetic signal of the data, so such approaches might not lead to the desired increase in tree robustness. For example, tree robustness will not increase if high gene data coverage is achieved by selecting highly conserved orthologous genes with low phylogenetic signal. Conversely, a robust tree might result if taxa with low gene data coverage but highly informative genes have been selected. Driskell et al., for example, report plausible tree reconstructions based on a supermatrix with a gene data coverage of just 8–16%. Both cases illustrate that gene data coverage and phylogenetic resolution are not necessarily correlated. Consequently, the practice of selecting data based solely on data coverage is potentially problematic. We have therefore developed an approach that focuses on the analysis of selected optimal data subsets (SOS), which combine high data coverage with high phylogenetic signal. Crucial for this approach are the assessment of the potential signal of genes and the development of a heuristic to select such an SOS.
Different quartet mapping approaches have been used to assess the potential signal within genes [37, 38]. Among these, geometry mapping has been shown to be the most conservative estimator, and its application to the genes of a supermatrix is straightforward. Consequently, we have chosen the geometry mapping approach [37–40] to assess the potential signal of genes in the development of our heuristic.
To select an optimal set of taxa and genes, Sanderson and colleagues suggested selecting sets with full data coverage (maximal bicliques [41, 42]). However, identifying the maximal (maximum) biclique is an NP-complete problem [42, 43], and thus there is no guarantee of finding the maximal (maximum) biclique in reasonable computing time. Additionally, Sanderson et al. found that selections of maximal bicliques resulted in very small subsets of fewer than 15 taxa and fewer than 10 genes. Sanderson's approach is therefore not suitable for reconstructing the phylogenetic relationships of many taxa. A possible solution might be the selection of quasi-bicliques [44, 45], which can potentially combine a much larger set of taxa and genes by accepting a predefined level of missing data. This promising direction, however, has the drawback that it is not time-efficient.
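The relation between data coverage and bicliques can be illustrated with a minimal sketch. It assumes that the supermatrix is summarized as a presence/absence matrix (rows are taxa, columns are genes, 1 means the gene is sequenced for that taxon). The matrix `PRESENCE` and the function names `coverage` and `is_biclique` are hypothetical and serve illustration only; they are not taken from any of the cited implementations.

```python
from typing import Optional, Sequence


def coverage(matrix: Sequence[Sequence[int]],
             taxa: Optional[Sequence[int]] = None,
             genes: Optional[Sequence[int]] = None) -> float:
    """Fraction of filled cells in the submatrix given by taxa rows and gene columns."""
    taxa = range(len(matrix)) if taxa is None else taxa
    genes = range(len(matrix[0])) if genes is None else genes
    cells = [matrix[t][g] for t in taxa for g in genes]
    return sum(cells) / len(cells) if cells else 0.0


def is_biclique(matrix: Sequence[Sequence[int]],
                taxa: Sequence[int],
                genes: Sequence[int]) -> bool:
    """True if every selected taxon has data for every selected gene (coverage of 100%)."""
    return all(matrix[t][g] == 1 for t in taxa for g in genes)


# Hypothetical toy presence/absence matrix: 4 taxa x 4 genes.
PRESENCE = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]

full_coverage = coverage(PRESENCE)                     # 9 of 16 cells are filled
sub_is_biclique = is_biclique(PRESENCE, [0, 1, 3], [0, 1])
```

In this toy matrix the full data set has a coverage of 9/16, whereas the subset of taxa 0, 1 and 3 with genes 0 and 1 is a biclique. Verifying a given subset is cheap; the hard part is searching among all possible subsets for the largest one.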
Alternatively, Hartmann et al. and Cheng et al. introduced two approaches that are directly applicable to sequence data. The approach of Hartmann et al. is a masking technique (REAP) that masks multiple sequence alignments according to predefined thresholds of the gap frequencies of sites. The approach of Cheng et al. is a statistical correction for missing data (SIA). A comparison of these two approaches demonstrated that REAP performed better, a result that is compatible with the results of Sanderson's biclique approach. However, both alignment masking (REAP) and the biclique approach optimize data only with respect to data coverage, without considering the potential signal among genes.
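The general idea of masking alignment sites by gap frequency can be sketched as follows. This is an illustration of threshold-based site masking only and does not reproduce the REAP implementation; the function name `mask_gappy_sites`, the threshold value and the toy alignment are assumptions made for this example.

```python
from typing import List, Sequence


def mask_gappy_sites(alignment: Sequence[str],
                     max_gap_fraction: float = 0.5,
                     gap_chars: str = "-?") -> List[str]:
    """Drop alignment columns whose fraction of gap characters exceeds the threshold."""
    n_seqs = len(alignment)
    n_sites = len(alignment[0])
    keep = [
        i for i in range(n_sites)
        if sum(seq[i] in gap_chars for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Hypothetical toy alignment: the fourth column consists of gaps only.
TOY_ALIGNMENT = ["ACG-T", "A-G-T", "AC--T"]
masked = mask_gappy_sites(TOY_ALIGNMENT, max_gap_fraction=0.5)
```

The criterion used here depends only on how many sequences have data at a site, not on how informative the retained sites are, which is the limitation noted above.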
Here, we introduce a simple hill-climbing algorithm to select optimal data subsets (SOS), which are assembled by considering both the data coverage and the potential signal of genes. We start with the assumption that any taxon and any gene can potentially contribute to the total signal of the matrix. However, taxa or genes with incomplete data coverage and low signal can also contribute noise or introduce biases to the total signal of the supermatrix. Therefore, we successively mask taxa and genes of low signal and/or low data coverage, generating a submatrix of higher data coverage and signal. With this approach, we deliberately discard taxa and genes because of their low data coverage and/or potentially low signal, trading matrix size for coverage and signal. The proposed hill-climbing algorithm delivers an optimal solution to this trade-off. Using simulated and empirical data, we compare the performance of the proposed approach with the frequently applied approach of simply selecting data subsets using predefined thresholds of data coverage only.
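The successive masking described above can be sketched as a generic greedy hill climb. This is a minimal sketch under stated assumptions, not the algorithm as implemented by the authors: it assumes that each cell of the matrix already carries an information value between 0 and 1 (0 for missing data, otherwise a signal estimate for the gene), and it uses a hypothetical objective function, the mean information of the retained cells multiplied by the retained fraction of the matrix raised to `alpha`. The names `score`, `hill_climb`, `alpha` and the toy matrix `INFO` are all introduced for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def score(info: Matrix, taxa: Sequence[int], genes: Sequence[int],
          alpha: float = 0.5) -> float:
    """Hypothetical objective: mean cell information times relative matrix size ** alpha."""
    total_cells = len(info) * len(info[0])
    kept_cells = len(taxa) * len(genes)
    mean_info = sum(info[t][g] for t in taxa for g in genes) / kept_cells
    return mean_info * (kept_cells / total_cells) ** alpha


def hill_climb(info: Matrix, alpha: float = 0.5) -> Tuple[List[int], List[int], float]:
    """Greedily mask the single taxon or gene whose removal raises the score most."""
    taxa = list(range(len(info)))
    genes = list(range(len(info[0])))
    best = score(info, taxa, genes, alpha)
    while True:
        candidates = []
        if len(taxa) > 1:  # never remove the last taxon
            candidates += [([x for x in taxa if x != t], genes) for t in taxa]
        if len(genes) > 1:  # never remove the last gene
            candidates += [(taxa, [x for x in genes if x != g]) for g in genes]
        if not candidates:
            return taxa, genes, best
        cand_taxa, cand_genes = max(
            candidates, key=lambda c: score(info, c[0], c[1], alpha))
        cand_score = score(info, cand_taxa, cand_genes, alpha)
        if cand_score <= best:  # no single removal improves the score: stop
            return taxa, genes, best
        taxa, genes, best = cand_taxa, cand_genes, cand_score


# Hypothetical toy matrix: taxon 3 and gene 2 carry almost no information.
INFO = [
    [0.9, 0.8, 0.0],
    [0.9, 0.7, 0.0],
    [0.8, 0.9, 0.1],
    [0.0, 0.0, 0.1],
]
kept_taxa, kept_genes, final_score = hill_climb(INFO)
```

On this toy matrix the climb first masks gene 2, then taxon 3, and then stops because no further single removal increases the score. In this sketch the exponent `alpha` controls how strongly a loss of matrix size is penalized; as with any greedy hill climb, the result is only guaranteed to be optimal with respect to single removals from the final subset.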