We have automatically collected and sanitized all publicly available human mtDNA sequences, classifying them by completeness into flexible and strict databases. The former include all reasonably coherent sequences, a relatively heterogeneous set owing to a historical pool of genomes whose control region is unavailable; the latter are restricted to structurally comparable, full-length sequences. Preliminary tests on single sequences (length, composition, equality) have further allowed us to cluster potentially related groups and to isolate unusual data for inspection.
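The flexible/strict partition described above can be sketched as a simple filter; the thresholds and the rejection policy below are illustrative assumptions, not the actual criteria used in our pipeline.

```python
# Illustrative sketch: partition raw human mtDNA records into "strict"
# (structurally comparable, full-length) and "flexible" (coherent but
# possibly partial) sets. Thresholds here are hypothetical.
REF_LEN = 16569  # length of the rCRS reference sequence

def classify(records, full_tol=100, max_ambiguous=0.02):
    strict, flexible = [], []
    for name, seq in records:
        seq = seq.upper()
        n_frac = sum(c not in "ACGT" for c in seq) / len(seq)
        if n_frac > max_ambiguous:
            continue  # too many unknown positions: set aside for inspection
        if abs(len(seq) - REF_LEN) <= full_tol:
            strict.append(name)
        flexible.append(name)  # strict sequences also belong to flexible sets
    return strict, flexible
```

Note that strict sequences are a subset of the flexible set, matching the inclusion relation between the two databases.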
Many of the potential problems that need to be addressed arise from data ambiguity. Incomplete information is the most serious drawback, whether in the form of incomplete sequences or, especially, of completely unknown positions, whose detection should be left to subsequent steps. These imperfections blur, to an extent, the clean results that high-quality datasets should offer, as well as the simplicity of the methods upon which they rely.
In the future, the importance of correctly representing individual sequences and their underlying semantics should be stressed: adopting a formal ontology to describe sequences and their features would greatly aid data classification and manipulation, and would also help in designing simpler, more accurate queries. We will also study integration and coherence between multiple primary data sources, as well as the application of sequence identity criteria to this end (we have addressed the latter in connection with parsimony models in ). Of special concern are the treatment of ambiguous characters according to their significance (be it missing information, artificial gaps or, most notably, heteroplasmy) and the adequate machine representation of sequences.
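One possible machine representation distinguishing the three meanings of non-ACGT characters is sketched below. The IUPAC set mappings are standard; the three-way interpretation policy, and in particular treating multi-base ambiguity codes as potential heteroplasmy, is our assumption for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the sets of bases they denote.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}

def interpret(symbol):
    """Classify a sequence character by its intended semantics
    (gap, missing information, plain base, or possible heteroplasmy)."""
    if symbol == "-":
        return ("gap", set())
    if symbol == "N":
        return ("missing", IUPAC["N"])
    bases = IUPAC[symbol]
    return ("base" if len(bases) == 1 else "heteroplasmy", bases)
```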
From partially curated sequence datasets, we have built high-quality alignments efficiently using structural subproblem decomposition techniques. We have then used the results to study the relations between individual sequences and to detect compositional anomalies by means of distance measures. The set of subproblems we have presented allows semantically sound, fast divisions that result in biologically meaningful subalignments. Whereas this basic partition suffices at present, we ought to consider the relative scaling of computing power and dataset growth (and the associated processing costs). Should further reductions in the overall cost of the alignment become necessary, unambiguous, conserved regions could be used to perform intragene splitting. The number of sequences per individual alignment could also be reduced by classifying and clustering related sequence groups.
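The intragene splitting idea can be illustrated as follows: if a short region is unambiguous and strictly conserved across all sequences, each sequence can be cut there and the halves aligned independently. This is a hypothetical sketch, not the decomposition actually implemented.

```python
def split_at_anchor(seqs, anchor):
    """Split every sequence around the first occurrence of a conserved
    anchor region; returns None if the anchor is not shared by all
    sequences, in which case no intragene split is possible."""
    left, right = [], []
    for s in seqs:
        i = s.find(anchor)
        if i < 0:
            return None  # anchor not conserved: keep one alignment block
        left.append(s[:i])
        right.append(s[i + len(anchor):])
    return left, right
```

Because the anchor is identical in all sequences, the two resulting subalignments can be computed separately and concatenated without loss.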
Simple edit distances have been used to perform basic data classification in support of automated curation processes. The intraspecies and interspecies group distributions reported in [21, 23] have been confirmed and refined with extensive Homo sapiens mitochondrial datasets. These results encourage us to research improved preprocessing and clustering measures; distances can be computed using special-purpose, possibly exact, pairwise alignment algorithms such as Needleman-Wunsch and Smith-Waterman [29, 30]. Legitimate yet incomplete sequences (i.e., those found in flexible sets but absent from strict sets) may be processed separately to guarantee homogeneity, or jointly with the homologous regions of complete sequences, depending on the distance model. Likewise, the effects of ambiguity on sequence alignment should be investigated more thoroughly.
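As a reference point, the textbook dynamic-programming recurrence behind such exact pairwise algorithms is shown below as a unit-cost edit distance (match 0, mismatch/indel 1); a production scoring model for mtDNA curation would use tuned substitution and gap costs instead.

```python
def edit_distance(a, b):
    """Wagner-Fischer dynamic programming, the unit-cost special case of
    Needleman-Wunsch global alignment scoring; O(len(a) * len(b)) time,
    O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]
```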
Although we have removed especially disruptive data from our input sets, some conditions, displaced position numbering in particular, may be corrected automatically. This, however, requires that local databases store the corrected sequences, overriding any faulty copies found in public databases until those are updated, at which point a renewed quality check could be performed. Such corrections also spare us from treating mtDNA sequences as circular, in favor of simpler, conventional methods. Moreover, we intend to exploit the structure and conservation of the human mtDNA molecule to further improve computational costs and alignment quality.
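Correcting displaced numbering amounts to rotating the circular molecule back to a canonical origin, after which linear methods apply. A minimal sketch follows; the start motif used in practice would be a fixed rCRS landmark, and the one in the test below is a made-up stand-in.

```python
def canonical_rotation(seq, start_motif):
    """Rotate a circular mtDNA sequence so that it begins at a fixed
    reference motif; afterwards the sequence can be handled linearly."""
    doubled = seq + seq  # linear view of the circular molecule
    i = doubled.find(start_motif)
    if i < 0 or i >= len(seq):
        return None  # motif absent: flag the record for manual inspection
    return seq[i:] + seq[:i]
```

Doubling the sequence also catches motifs that straddle the artificial linearization point, which is precisely the case a naive linear search would miss.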
We have demonstrated the applicability of our approach by reconstructing updated, current and complete human mitochondrial phylogenies that integrate the control region into the analysis; we have also carried out preliminary analyses on them, using the trees to test several quality assessment criteria. The main improvements over previous phylogenies are: the use of a well-founded, systematic methodology spanning all stages of the reconstruction; the exposition of said methodology; the study of its scalability and repeatability over time and growing datasets; and the customizability of the procedures according to the requirements of both inputs and outputs.
Efficiency has been achieved by combining biologically sound problem partitioning with effective parallelization of compatible subproblems on distributed systems. Thus, algorithmic complexity is offset and problem structure is turned into a computational advantage: periodic reconstruction becomes feasible, as does accommodation of dataset growth. To this end, both fundamental problem dimensions (number of sequences and sequence length) can be attacked through known or inferred properties.
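Since the subproblems produced by the decomposition are mutually independent, they parallelize trivially. In the minimal illustration below a thread pool stands in for the distributed back end, and `align_block` is a placeholder scorer rather than a real aligner.

```python
from concurrent.futures import ThreadPoolExecutor

def align_block(block):
    # Placeholder cost function standing in for a real subalignment;
    # here "cost" is simply the total sequence length in the block.
    return sum(len(s) for s in block)

def align_all(blocks, workers=4):
    """Run independent subalignments concurrently and collect results
    in the original block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_block, blocks))
```

Because `pool.map` preserves input order, the per-block results can be concatenated directly into the final alignment.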
As a result, we produce automated (save for the inclusion of dubious data), high-quality trees which, coupled with an appropriate computational framework, yield workable representations that we can annotate, extend and analyze easily, as we have done to produce some of the results presented throughout the paper. Some interesting problems remain to be dealt with in the near future. From the end-user standpoint, the ability to define and add attributes to the tree, as well as to query and interact with it, is fundamental (we have recently addressed this problem in ). The main shortcomings concern visual interaction with such huge trees, particularly in combination with annotations and their intensive exploration. On the other hand, most formats lack extension capabilities; we have found phyloXML to be the only reasonable choice for such complex tasks.
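The extension mechanism phyloXML offers is its `<property>` element. The sketch below builds an annotated leaf with the standard library; the element and attribute names (`ref`, `datatype`, `applies_to`) follow the phyloXML schema, while the `curation:quality` attribute itself is a hypothetical example of a user-defined annotation.

```python
import xml.etree.ElementTree as ET

def annotated_leaf(name, quality):
    """Build a phyloXML <clade> carrying a custom quality annotation
    via the schema's generic <property> extension element."""
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    prop = ET.SubElement(clade, "property",
                         ref="curation:quality", datatype="xsd:double",
                         applies_to="clade")
    prop.text = str(quality)
    return clade
```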
Besides user-defined custom rules, special-purpose attributes and filters could be defined to analyze biological patterns of sequence quality and mark leaves as potential outliers, if they were not removed in previous sanitizing steps; likewise, such procedures could be applied iteratively to refine the original datasets. Another obvious improvement is the elaboration (and automation) of an adequate descriptive formalism for mutations: for instance, merging indels; or detecting the gene where a mutation takes place, whether it is synonymous and, if not, what change it effects on the amino acid sequence.
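The synonymous/non-synonymous distinction can be sketched with a codon lookup. Only a handful of codons of the vertebrate mitochondrial code are included below, and the mutation is assumed to fall at a known in-frame coding position; a real formalism would cover the full code and frame detection.

```python
# Fragment of the vertebrate mitochondrial genetic code (note TGA = Trp
# and ATA = Met, which differ from the standard nuclear code).
MT_CODE = {"TTA": "L", "TTG": "L", "CTA": "L",
           "ATA": "M", "ATG": "M", "TGA": "W", "TGG": "W"}

def describe(codon, pos, new_base):
    """Describe a point mutation at offset `pos` within `codon` as
    synonymous or as the amino-acid change it effects."""
    mutated = codon[:pos] + new_base + codon[pos + 1:]
    before, after = MT_CODE[codon], MT_CODE[mutated]
    if before == after:
        return "synonymous"
    return "%s->%s" % (before, after)
```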
Yet another interesting aspect concerns the qualitative evaluation and comparison of different alignments and trees. This comprises everything from model selection and sensitivity analysis to posterior tree scoring and topological distances. A related prospect deserving further attention is the addition of general constraints to reflect known biological properties, which may further simplify certain tasks and favor decomposition, possibly including past results as guidelines. The conservation of such properties in the outputs can also be used as a qualitative measure of correctness.
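A standard topological distance is the Robinson-Foulds count; for rooted trees it reduces to comparing the collections of clades (leaf sets) each tree induces. The sketch below takes trees as nested tuples, a simplification of any real tree data structure.

```python
def _leaves(tree):
    """Collect the leaf labels under a nested-tuple tree."""
    if isinstance(tree, tuple):
        return [l for sub in tree for l in _leaves(sub)]
    return [tree]

def clades(tree, acc=None):
    """Gather every internal node's leaf set as a frozenset."""
    if acc is None:
        acc = set()
    if isinstance(tree, tuple):
        acc.add(frozenset(_leaves(tree)))
        for sub in tree:
            clades(sub, acc)
    return acc

def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: clades in exactly one tree."""
    return len(clades(t1) ^ clades(t2))
```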
In addition to phylogeny-supported curation, it is possible to conceive procedures for tree-driven data correction, determining the simplest ways to integrate discordant data consistently with the scoring model. These ideas can be of use to resolve ambiguity and to elegantly integrate missing regions in flexible databases without greatly affecting tree scores, as usually happens when unknown information is treated as the absence of biological features.
Tree optimality and robustness are among the most difficult qualities to evaluate. Statistical methods provide approximate answers, subject to a given evolutionary model, at the cost of greatly increased computational loads. In addition, more general phylogenetic networks could be used to mark ambiguous hotspots while retaining the information of the main tree. Likewise, polytomous trees are not strictly undesirable, since consistently unresolved nodes may point to relevant evolutionary properties, as has been noted before.
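The classic statistical approach here is the nonparametric bootstrap over alignment columns, whose cost is one full reconstruction per replicate. A minimal sketch of the resampling step:

```python
import random

def bootstrap_columns(alignment, seed=0):
    """Resample alignment columns with replacement; `alignment` is a
    list of equal-length strings, one per sequence. Each replicate
    feeds a separate tree reconstruction whose clades are then counted
    to estimate support values."""
    rng = random.Random(seed)
    n = len(alignment[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(row[c] for c in cols) for row in alignment]
```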
To summarize, our intent is to keep improving tree reconstruction from both computational and biological standpoints, as well as to add to and extract useful information from the results. We believe formalization of knowledge and automation are key to carrying out these objectives, supported by expert assistance to the information systems designed to this end. As phylogenies become recipients and organizers of information, interoperability with external systems becomes of the utmost importance.
Both biological and computational goals can be greatly aided by integration and cooperation with existing efforts in the study of human mitochondrial diversity at the sequence level, such as MITOMAP and HmtDB, and at the tree level, like PhyloTree; this should be one of the first steps to take. Additionally, improved and specialized algorithms can take advantage of the special structural features of mtDNA and of the size and density of growing datasets, both to learn or infer new information from them and to use it to assist in and improve the reconstruction of phylogenies. Finally, reliable information systems must be matured to handle all these tasks and make them easily available to researchers.