We have automatically collected and sanitized all publicly available human mtDNA sequences, classifying them by completeness into flexible and strict databases. The former include all reasonably coherent sequences, a relatively heterogeneous set owing to a historical pool of genomes whose control region is unavailable; the latter are restricted to structurally comparable, full-length sequences. Preliminary tests on single sequences (length, composition, equality) have further allowed us to cluster potentially related groups and to isolate unusual data for inspection.
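The flexible/strict partition described above can be sketched as a simple filter; the thresholds and the rejection policy below are illustrative assumptions, not the actual criteria used in our pipeline.

```python
# Illustrative sketch: partition raw human mtDNA records into "strict"
# (structurally comparable, full-length) and "flexible" (coherent but
# possibly partial) sets. Thresholds here are hypothetical.
REF_LEN = 16569  # length of the rCRS reference sequence

def classify(records, full_tol=100, max_ambiguous=0.02):
    strict, flexible = [], []
    for name, seq in records:
        seq = seq.upper()
        n_frac = sum(c not in "ACGT" for c in seq) / len(seq)
        if n_frac > max_ambiguous:
            continue  # too many unknown positions: set aside for inspection
        if abs(len(seq) - REF_LEN) <= full_tol:
            strict.append(name)
        flexible.append(name)  # strict sequences also belong to flexible sets
    return strict, flexible
```

Note that strict sequences are a subset of the flexible set, matching the inclusion relation between the two databases.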
Many of the potential problems that need to be addressed arise from data ambiguity. Incomplete information is the most serious drawback, whether in the form of incomplete sequences or, especially, of completely unknown positions, whose detection should be left to subsequent steps. These imperfections blur, to an extent, the clean results that high-quality datasets should offer, as well as the simplicity of the methods upon which they rely.
In the future, the importance of correctly representing individual sequences and their underlying semantics should be stressed: adopting a formal ontology to describe sequences and their features would greatly aid data classification and manipulation, and would also help in designing simpler, more accurate queries. We will also study integration and coherence between multiple primary data sources, as well as the application of sequence identity criteria to this end (we have addressed the latter in connection with parsimony models in ). Of special concern are the treatment of ambiguous characters according to their significance (be it missing information, artificial gaps or, most notably, heteroplasmy) and the adequate machine representation of sequences.
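One possible machine representation distinguishing the three meanings of non-ACGT characters is sketched below. The IUPAC set mappings are standard; the three-way interpretation policy, and in particular treating multi-base ambiguity codes as potential heteroplasmy, is our assumption for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the sets of bases they denote.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}

def interpret(symbol):
    """Classify a sequence character by its intended semantics
    (gap, missing information, plain base, or possible heteroplasmy)."""
    if symbol == "-":
        return ("gap", set())
    if symbol == "N":
        return ("missing", IUPAC["N"])
    bases = IUPAC[symbol]
    return ("base" if len(bases) == 1 else "heteroplasmy", bases)
```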
From partially curated sequence datasets, we have built high-quality alignments efficiently using structural subproblem decomposition techniques. We have then used the results to study the relations between individual sequences and to detect compositional anomalies by means of distance measures. The set of subproblems we have presented allows semantically sound, fast divisions that result in biologically meaningful subalignments. Whereas this basic partition suffices at present, we ought to consider the relative scaling of computing power and dataset growth (and the associated processing costs). Should further reductions in the overall cost of the alignment become necessary, unambiguous, conserved regions could be used to perform intragene splitting. The number of sequences per individual alignment could also be reduced by classifying and clustering related sequence groups.
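The intragene splitting idea can be illustrated as follows: if a short region is unambiguous and strictly conserved across all sequences, each sequence can be cut there and the halves aligned independently. This is a hypothetical sketch, not the decomposition actually implemented.

```python
def split_at_anchor(seqs, anchor):
    """Split every sequence around the first occurrence of a conserved
    anchor region; returns None if the anchor is not shared by all
    sequences, in which case no intragene split is possible."""
    left, right = [], []
    for s in seqs:
        i = s.find(anchor)
        if i < 0:
            return None  # anchor not conserved: keep one alignment block
        left.append(s[:i])
        right.append(s[i + len(anchor):])
    return left, right
```

Because the anchor is identical in all sequences, the two resulting subalignments can be computed separately and concatenated without loss.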
Simple edit distances have been used to perform basic data classification in support of automated curation processes. The intraspecies and interspecies group distributions reported in [21, 23] have been confirmed and refined with extensive Homo sapiens mitochondrial datasets. These results encourage us to research improved preprocessing and clustering measures; distances can be computed using special-purpose, possibly exact, pairwise alignment algorithms such as Needleman-Wunsch and Smith-Waterman [29, 30]. Legitimate yet incomplete sequences (i.e., those found in flexible sets but absent from strict sets) may be processed separately to guarantee homogeneity, or jointly with the homologous regions of complete sequences, depending on the distance model. Likewise, the effects of ambiguity on sequence alignment should be investigated more thoroughly.
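As a reference point, the textbook dynamic-programming recurrence behind such exact pairwise algorithms is shown below as a unit-cost edit distance (match 0, mismatch/indel 1); a production scoring model for mtDNA curation would use tuned substitution and gap costs instead.

```python
def edit_distance(a, b):
    """Wagner-Fischer dynamic programming, the unit-cost special case of
    Needleman-Wunsch global alignment scoring; O(len(a) * len(b)) time,
    O(len(b)) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]
```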
Although we have removed especially disruptive data from our input sets, some conditions, displaced position numbering in particular, may be corrected automatically. This, however, requires that local databases store the corrected sequences, overriding any faulty copies found in public databases until those are updated, at which point a renewed quality check could be performed. Such corrections also spare us from treating mtDNA sequences as circular, in favor of simpler, conventional methods. Moreover, we intend to exploit the structure and conservation of the human mtDNA molecule to further improve computational costs and alignment quality.
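Correcting displaced numbering amounts to rotating the circular molecule back to a canonical origin, after which linear methods apply. A minimal sketch follows; the start motif used in practice would be a fixed rCRS landmark, and the one in the test below is a made-up stand-in.

```python
def canonical_rotation(seq, start_motif):
    """Rotate a circular mtDNA sequence so that it begins at a fixed
    reference motif; afterwards the sequence can be handled linearly."""
    doubled = seq + seq  # linear view of the circular molecule
    i = doubled.find(start_motif)
    if i < 0 or i >= len(seq):
        return None  # motif absent: flag the record for manual inspection
    return seq[i:] + seq[:i]
```

Doubling the sequence also catches motifs that straddle the artificial linearization point, which is precisely the case a naive linear search would miss.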
We have demonstrated the applicability of our approach by reconstructing updated, current and complete human mitochondrial phylogenies that integrate the control region into the analysis; we have also carried out preliminary analyses on them, using the trees to test several quality assessment criteria. The main improvements over previous phylogenies are: the use of a well-founded, systematic methodology spanning all stages of the reconstruction; the exposition of said methodology; the study of its scalability and repeatability over time and growing datasets; and the customizability of the procedures according to the requirements of both inputs and outputs.
Efficiency has been achieved by combining biologically sound problem partitioning with effective parallelization of compatible subproblems on distributed systems. Thus, algorithmic complexity is offset and problem structure is turned into a computational advantage: periodic reconstruction becomes feasible, as does accommodation of dataset growth. To this end, both fundamental problem dimensions (number of sequences and sequence length) can be attacked through known or inferred properties.
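Since the subproblems produced by the decomposition are mutually independent, they parallelize trivially. In the minimal illustration below a thread pool stands in for the distributed back end, and `align_block` is a placeholder scorer rather than a real aligner.

```python
from concurrent.futures import ThreadPoolExecutor

def align_block(block):
    # Placeholder cost function standing in for a real subalignment;
    # here "cost" is simply the total sequence length in the block.
    return sum(len(s) for s in block)

def align_all(blocks, workers=4):
    """Run independent subalignments concurrently and collect results
    in the original block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(align_block, blocks))
```

Because `pool.map` preserves input order, the per-block results can be concatenated directly into the final alignment.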
As a result, we produce automated (save for the inclusion of dubious data), high-quality trees which, coupled with an appropriate computational framework, yield workable representations that we can annotate, extend and analyze easily, as we have done to produce some of the results presented throughout the paper. Some interesting problems remain to be dealt with in the near future. From the end-user standpoint, the ability to define and add attributes to the tree, as well as to query and interact with it, is fundamental (we have recently addressed this problem in ). The main shortcomings concern visual interaction with such huge trees, particularly in combination with annotations and their intensive exploration. On the other hand, most formats lack extension capabilities; we have found phyloXML to be the only reasonable choice for such complex tasks.
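The extension mechanism phyloXML offers is its `<property>` element. The sketch below builds an annotated leaf with the standard library; the element and attribute names (`ref`, `datatype`, `applies_to`) follow the phyloXML schema, while the `curation:quality` attribute itself is a hypothetical example of a user-defined annotation.

```python
import xml.etree.ElementTree as ET

def annotated_leaf(name, quality):
    """Build a phyloXML <clade> carrying a custom quality annotation
    via the schema's generic <property> extension element."""
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    prop = ET.SubElement(clade, "property",
                         ref="curation:quality", datatype="xsd:double",
                         applies_to="clade")
    prop.text = str(quality)
    return clade
```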
Besides user-defined custom rules, special-purpose attributes and filters could be defined to analyze biological patterns of sequence quality and mark leaves as potential outliers, if they were not removed in previous sanitizing steps; likewise, such procedures could be applied iteratively to refine the original datasets. Another obvious improvement is the elaboration (and automation) of an adequate descriptive formalism for mutations: for instance, merging indels; or detecting the gene where a mutation takes place, whether it is synonymous and, if not, what change it effects on the amino acid sequence.
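The synonymous/non-synonymous distinction can be sketched with a codon lookup. Only a handful of codons of the vertebrate mitochondrial code are included below, and the mutation is assumed to fall at a known in-frame coding position; a real formalism would cover the full code and frame detection.

```python
# Fragment of the vertebrate mitochondrial genetic code (note TGA = Trp
# and ATA = Met, which differ from the standard nuclear code).
MT_CODE = {"TTA": "L", "TTG": "L", "CTA": "L",
           "ATA": "M", "ATG": "M", "TGA": "W", "TGG": "W"}

def describe(codon, pos, new_base):
    """Describe a point mutation at offset `pos` within `codon` as
    synonymous or as the amino-acid change it effects."""
    mutated = codon[:pos] + new_base + codon[pos + 1:]
    before, after = MT_CODE[codon], MT_CODE[mutated]
    if before == after:
        return "synonymous"
    return "%s->%s" % (before, after)
```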
Yet another interesting aspect concerns the qualitative evaluation and comparison of different alignments and trees. This comprises everything from model selection and sensitivity analysis to posterior tree scoring and topological distances. A related prospect deserving further attention is the addition of general constraints to reflect known biological properties, which may further simplify certain tasks and favor decomposition, possibly including past results as guidelines. The conservation of such properties in the outputs can also be used as a qualitative measure of correctness.
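A standard topological distance is the Robinson-Foulds count; for rooted trees it reduces to comparing the collections of clades (leaf sets) each tree induces. The sketch below takes trees as nested tuples, a simplification of any real tree data structure.

```python
def _leaves(tree):
    """Collect the leaf labels under a nested-tuple tree."""
    if isinstance(tree, tuple):
        return [l for sub in tree for l in _leaves(sub)]
    return [tree]

def clades(tree, acc=None):
    """Gather every internal node's leaf set as a frozenset."""
    if acc is None:
        acc = set()
    if isinstance(tree, tuple):
        acc.add(frozenset(_leaves(tree)))
        for sub in tree:
            clades(sub, acc)
    return acc

def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: clades in exactly one tree."""
    return len(clades(t1) ^ clades(t2))
```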
In addition to phylogeny-supported curation, it is possible to conceive procedures for tree-driven data correction, determining the simplest ways to integrate discordant data consistently with the scoring model. These ideas can be of use to resolve ambiguity and to elegantly integrate missing regions in flexible databases without greatly affecting tree scores, as usually happens when unknown information is treated as the absence of biological features.
Tree optimality and robustness are among the most difficult qualities to evaluate. Statistical methods provide approximate answers, subject to a given evolutionary model, at the cost of greatly increased computational loads. In addition, more general phylogenetic networks could be used to mark ambiguous hotspots while retaining the information of the main tree. Likewise, polytomous trees are not strictly undesirable, since consistently unresolved nodes may point to relevant evolutionary properties, as has been noted before.
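The classic statistical approach here is the nonparametric bootstrap over alignment columns, whose cost is one full reconstruction per replicate. A minimal sketch of the resampling step:

```python
import random

def bootstrap_columns(alignment, seed=0):
    """Resample alignment columns with replacement; `alignment` is a
    list of equal-length strings, one per sequence. Each replicate
    feeds a separate tree reconstruction whose clades are then counted
    to estimate support values."""
    rng = random.Random(seed)
    n = len(alignment[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(row[c] for c in cols) for row in alignment]
```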
To summarize, our intent is to keep improving tree reconstruction from both computational and biological standpoints, as well as to add to and extract useful information from the results. We believe formalization of knowledge and automation are key to carrying out these objectives, supported by expert assistance to the information systems designed to this end. As phylogenies become recipients and organizers of information, interoperability with external systems becomes of the utmost importance.
Both biological and computational goals can be greatly aided by integration and cooperation with existing efforts in the study of human mitochondrial diversity at the sequence level, such as MITOMAP and HmtDB, and at the tree level, like PhyloTree; this should be one of the first steps to take. Additionally, improved and specialized algorithms can take advantage of the special structural features of mtDNA and of the size and density of growing datasets, both to learn or infer new information from them and to use it to assist in and improve the reconstruction of phylogenies. Finally, reliable information systems must be matured to handle all these tasks and make them easily available to researchers.