FIGENIX: Intelligent automation of genomic annotation: expertise integration in a new software platform
- Philippe Gouret†1,
- Vérane Vitiello1,
- Nathalie Balandraud1,
- André Gilles1,
- Pierre Pontarotti†1 and
- Etienne GJ Danchin†1, 2
© Gouret et al; licensee BioMed Central Ltd. 2005
Received: 21 April 2005
Accepted: 05 August 2005
Published: 05 August 2005
Two of the main objectives of the genomic and post-genomic era are the structural and functional annotation of genomes, which consist of detecting the position and structure of genes and inferring their function (as well as detecting other genomic features). Both structural and functional annotation require the complex chaining of numerous software tools, algorithms and methods under the supervision of a biologist. Automating these pipelines is necessary to manage the huge amounts of data released by sequencing projects. Several pipelines already automate some of this chaining, but they still require a substantial contribution from biologists to supervise and control the results at various steps.
Here we propose an innovative automated platform, FIGENIX, which includes an expert system capable of substituting for human expertise at several key steps. FIGENIX currently automates complex pipelines of structural and functional annotation under the supervision of the expert system (which makes it possible, for example, to make key decisions, check intermediate results or refine the dataset). The quality of the results produced by FIGENIX is comparable to that obtained by expert biologists, with a drastic gain in time and the avoidance of errors due to manual manipulation of data.
The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community. They could easily be adapted to new or more specialized pipelines, such as the annotation of miRNAs, the classification of complex multigenic families, or the annotation of regulatory elements and other genomic features of interest.
Detecting genes, their organization, structure and function is a major challenge of the genomic and post-genomic era. Two fields of genomic biology are dedicated to this task and are known as structural and functional annotation. Structural annotation refers to the task of detecting genes, their location on a biological sequence, their exon/intron structure and predicting the protein sequences that they encode. Functional annotation aims to predict the biological function of genes and proteins.
Structural annotation methods can be classified into several types:
Ab-initio methods, based on content sensors and detectors, discriminate between coding and non-coding regions and then decipher a putative gene.
Homology-based methods use evolutionary conservation concepts to deduce gene localization and structure.
Hybrid methods couple these two approaches and usually present the best compromise between sensitivity and specificity in gene detection.
Computational methods of functional annotation are mainly divided into two types:
Similarity-based approaches, which infer a function from the pairwise similarity of a given sequence with a sequence of known function. These approaches have been criticized for their propensity to propagate annotation errors, to deduce false homology relationships [3, 4], and thus to produce systematic errors.
Phylogenomic inference approaches, based on the evolutionary history of and relationships between biological sequences. These methods avoid most false homology inference problems and make it possible to distinguish between orthologous and paralogous genes [4, 6]. Orthologous genes, which are produced by a speciation event, are more likely to share the same function than paralogous genes, which originate from duplications. These methods are also able to detect potential functional shifts through the study of the evolutionary behavior of genes. Nevertheless, these methods require a high degree of biological expertise, are time consuming and complex, and are difficult to automate in their entirety [4, 8, 9].
Aside from detecting protein-coding genes and predicting their function, structural and functional annotation also have other aims, such as detecting regulatory elements, repetitive elements, non-protein-coding genes (e.g. miRNAs), or other important genomic features.
Whatever the objective, structural and functional annotation usually require the complex chaining of various algorithms, software and methods, each with its own set of parameters and output format. At key steps of these "pipelines", expert biologists are often required to make important decisions, modify the dataset, compare intermediate results, and manually handle and convert several files, which is labor intensive and can be error prone. For the treatment of the huge amounts of data released by sequencing projects, automation of these pipelines is an absolute necessity. Several attempts have been made to develop annotation platforms automating some of these pipelines, particularly in the field of structural annotation (for example the Ensembl pipeline or the Otto system). With regard to functional annotation, several platforms automate pairwise similarity-based approaches [9, 10, 12, 13], and fewer have automated the more complex phylogenomic inference approaches [4, 14]. While these latter platforms save time and avoid errors due to the manual manipulation of files, they still rely heavily on the intervention of human experts at various steps.
Here we present an automated annotation platform featuring an expert system that substitutes for human expertise at various steps and thus allows more complete automation than previously considered. The expert system models the biologists' expertise and is able to compare intermediate results from different methods, to modify the dataset, and to evaluate the significance of predictions, along with other tasks usually performed by biologists. The FIGENIX platform currently automates 8 different pipelines of structural and functional annotation. These include, in particular, a structural annotation pipeline, which is a hybrid method coupling ab-initio and homology-based approaches, and a functional annotation pipeline that fully automates a complex phylogenomic inference method. The present manuscript focuses specifically on the phylogenomic functional inference pipeline, which illustrates how an expert system allows automation of complex chaining that usually requires substantial, non-trivial human intervention.
FIGENIX is an intranet/extranet server system usable through any recent Web browser accepting JAVA 2 Plugin installation. FIGENIX is freely available to academic users through the web interface. Users first have to contact us to request a login and password. The source code is available upon request under the GNU GPL (General Public License).
FIGENIX's technical architecture
PROLOG rules: syntax and semantics, example
% X belongs to a list if X is at the start of the list
element(X, [X|_]).
% or X belongs to a list if the list starts with Y different from X and X belongs to the tail of the list
element(X, [Y|L]) :- different(X, Y), element(X, L).
A user of such a program queries it like this:
> element(9, [5, 3, 4, 7]).
answer = fail
> element(3, [5, 3, 4, 7]).
answer = ok
> element(X, [5, 3, 4, 7]).
answer = ok, X = 5 or X = 3 or X = 4 or X = 7
Complex pipeline example: phylogenomic inference
To illustrate the potential of an expert system in automating complex pipelines that require human intervention, we focus on the phylogenomic functional inference pipeline. As introduced above, phylogenomic functional inference is labor intensive and time consuming, and requires a high level of expertise and human intervention at various steps. For these reasons, such functional annotation approaches, while clearly more reliable than similarity-based approaches, have been considered impossible or very difficult to automate without dramatically sacrificing quality (by substituting general default parameters and decisions for human expertise).
1. Creation of a dataset of sequences homologous to the sequence of interest.
2. Multiple alignment of these sequences, with elimination of data producing bias, noise, or distorting the evolutionary signal.
3. Phylogenetic reconstruction based on the multiple alignment using several different methods.
4. Inference of orthologs and paralogs through comparison of gene trees with a reference species tree.
5. Retrieval of experimentally verified functional data for orthologs and paralogs to the query sequence, on Web databases (Gene Ontology, MGI and NCBI's dbEST).
In each of these five steps, human intervention is required multiple times. At step 1, for example, the biologist chooses from a BLAST output the sequences that are most likely to be homologous to the query sequence. At step 2, the biologist eliminates sequences that produce biases in the alignment or have a divergent composition, and masks sites with highly divergent evolution. At step 3, the topologies of the trees produced by the different methods are compared to check whether they are congruent. At step 4, the biologist compares the topology of the gene tree with the topology of a reference species tree, then deduces the position of duplication and speciation events, and finally infers orthology and paralogy relationships. Once orthologs to a sequence of interest have been identified, biologists usually look for known functional data in other species and infer, for example, a likely biochemical function for the unknown gene (step 5).
A complex pipeline's computational translation
Each node of the graph, i.e. each unit, takes one or more streams as input and builds a new stream as output, which is transferred to the input of one or more related units in the direction of the stream.
Units' jobs can be executed in parallel. A "rendez-vous" type of synchronization, in which a unit starts its work only when the complete set of related input streams is present, is thus possible (see the 3 phylogenetic tree building units on figure 2) but not mandatory (a unit's work can also be started by the arrival of a single input stream). This kind of parallelism, with explicit and coarse granularity at the unit level, allows us to benefit from multi-processor hardware architectures and also, with an appropriate deployment, from distribution over several CPUs.
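The "rendez-vous" synchronization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FIGENIX's actual engine: the `Unit` class, the `run_pipeline` function and the toy unit names are hypothetical, and a thread pool stands in for the platform's own scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


class Unit:
    """Hypothetical pipeline node: consumes input streams, produces one output stream."""

    def __init__(self, name, work, inputs=()):
        self.name = name
        self.work = work            # callable taking one argument per input stream
        self.inputs = list(inputs)  # upstream units feeding this unit


def run_pipeline(targets):
    """Run units in parallel; a unit starts working once all its inputs are present."""
    futures = {}
    with ThreadPoolExecutor() as pool:
        def schedule(unit):
            if unit.name not in futures:
                # Upstream units are always submitted before the unit itself.
                upstream = [schedule(u) for u in unit.inputs]
                # Rendez-vous: the job blocks until every input stream is available.
                futures[unit.name] = pool.submit(
                    lambda: unit.work(*(f.result() for f in upstream)))
            return futures[unit.name]

        for unit in targets:
            schedule(unit)
        return {name: f.result() for name, f in futures.items()}


# Toy graph: one alignment feeds three tree builders, which meet in a fusion unit.
align = Unit("align", lambda: "MSA")
nj = Unit("nj", lambda msa: f"NJ({msa})", [align])
mp = Unit("mp", lambda msa: f"MP({msa})", [align])
ml = Unit("ml", lambda msa: f"ML({msa})", [align])
fusion = Unit("fusion", lambda a, b, c: [a, b, c], [nj, mp, ml])
results = run_pipeline([fusion])
```

In this toy graph the three tree-building units can run concurrently, while the fusion unit waits for all three of them.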
We call "algorithmic units", or "A-units", the units that perform a mathematical computation.
Like other adaptable and flexible pipeline systems, we chose not to rewrite new software for each algorithmic step. We preferred to use the "reference" publicly available software in its command-line version (e.g. sequence similarity search is done by the local BLASTALL runtime, downloadable from the NCBI web site). Thus the BLAST process is driven by an A-unit which wraps its input/output streams.
We used the same approach for all software (gene prediction, domain detection, phylogenetic reconstruction, multiple alignment...). Plugging existing software into our pipelines without modification allows us to use the most advanced bioinformatics software with very easy maintenance. It also allows easy evolution of the platform by integrating new software or replacing older versions with the most up-to-date ones. New versions of applications (such as BLAST or HMMPFAM) are not directly and automatically updated in FIGENIX; they are first tested, validated and, if needed, adapted (due to possible changes in the input/output formats).
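The idea of an A-unit wrapping the input/output streams of an unmodified command-line program can be sketched as follows. This is a minimal illustration, not FIGENIX code: the `AUnit` class is hypothetical, and a Python one-liner stands in for a real tool such as the BLASTALL runtime so that the example runs anywhere.

```python
import subprocess
import sys


class AUnit:
    """Hypothetical wrapper: feeds an input stream to a program, returns its output."""

    def __init__(self, name, command):
        self.name = name
        self.command = list(command)  # argv of the wrapped, unmodified program

    def run(self, input_stream: str) -> str:
        completed = subprocess.run(
            self.command,
            input=input_stream,   # sent to the program's standard input
            capture_output=True,
            text=True,
            check=True,           # raise CalledProcessError on a non-zero exit status
        )
        return completed.stdout


# Stand-in for a real tool: a tiny program that upper-cases whatever it reads.
upper = AUnit("upper", [sys.executable, "-c",
                        "import sys; sys.stdout.write(sys.stdin.read().upper())"])
output = upper.run(">query\nmkvl\n")
```

Because the wrapper only knows about a command line and two streams, replacing a tool by a newer version amounts to changing the command and, if the formats changed, the converters around it.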
The "tool" units, or "T-units" category contains units like enumerators, data accumulators, multiplexers/demultiplexers, simple filters, data converters and so on (e.g. converting data from GENSCAN output data to GFF format).
"Result" units, or R-units, are in charge of the most important genomic results production. Those results are intended to be the components of a scientific report produced by an annotation task started by the biologist.
The interface with the expert system is made through two types of dedicated units. Their role is to "substitute for" human expertise and "memory". Some of them keep information necessary for later reasoning; they are named expert knowledge units, or EK-units. Others make decisions concerning stream direction inside the graph or produce, on the output stream, new data resulting from the analysis of the current situation in the data world of the task. These units are named expert decision units, or ED-units. This part of the analysis is based on empirical rules specified by biologists, rather than on an algorithmic approach.
EK and ED units are thus gateways to the expert system, whose purpose is to make decisions using genomic knowledge and the data provided by EK units during pipeline processing. Like a human, this expert system has a "memory" and an "intelligence" (limited to the problems managed by our system) used to "supervise" the execution of a pipeline.
Pipelines themselves are coded as XML files. We are developing a GUI (graphical user interface) for pipeline editing, dedicated to biologists. Scientists will be able to construct their own data flows by chaining the available tool units. Semantic checks will prevent invalid constructions. Users can propose that an application not currently available in FIGENIX be included as a new A-unit. This makes it possible, for example, to replace the application currently used in an available pipeline with a new, more accurate or better adapted one. Users can also decide to share their custom pipelines with other FIGENIX users.
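To give an idea of what an XML-coded pipeline with a semantic check might look like, here is a small sketch. The XML vocabulary (`pipeline`, `unit`, `id`, `type`, `inputs`) is entirely hypothetical and is not FIGENIX's actual schema; the check shown only verifies that every input stream refers to a declared unit.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline description; element and attribute names are assumptions.
PIPELINE_XML = """
<pipeline name="demo">
  <unit id="blast" type="A"/>
  <unit id="filter" type="ED" inputs="blast"/>
  <unit id="align" type="A" inputs="filter"/>
  <unit id="report" type="R" inputs="align"/>
</pipeline>
"""


def load_pipeline(xml_text):
    """Parse the XML into a dict: unit id -> {"type": ..., "inputs": [...]}."""
    root = ET.fromstring(xml_text)
    units = {}
    for node in root.findall("unit"):
        inputs = node.get("inputs", "").split()  # whitespace-separated unit ids
        units[node.get("id")] = {"type": node.get("type"), "inputs": inputs}
    return units


def check_pipeline(units):
    """Return a list of problems; an empty list means every input is declared."""
    problems = []
    for unit_id, unit in units.items():
        for source in unit["inputs"]:
            if source not in units:
                problems.append(f"{unit_id}: unknown input '{source}'")
    return problems


units = load_pipeline(PIPELINE_XML)
problems = check_pipeline(units)
```

A real semantic control would also need to check stream types and detect cycles; those checks are left out of this sketch.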
Expert system usefulness examples
To illustrate the importance of a rules-based system, we selected two key examples in which the expert system substitutes for human expertise to take important decisions, to compare intermediate results, or deduce biological information.
One simple example is taken from step 3 of the phylogenomic inference approach summarized previously, which consists of reconstructing phylogenetic trees from a multiple alignment, then comparing the topologies of the trees produced by different methods and producing a unique consensus tree on which all data are projected. The other, more complex example is taken from step 2, which consists of producing a reliable multiple alignment by eliminating sequences and masking positions that produce biases in the alignment or are improper for phylogenetic reconstruction. This step is crucial in the phylogenomic approach because, depending on the quality of the alignment in terms of phylogenetic signal and noise, it may not be possible to produce a reliable phylogeny.
Example 1: trees consensus
In FIGENIX's phylogenomic inference approach, three phylogenetic trees are produced with three different approaches: the Neighbor Joining (NJ) method, the Maximum Parsimony (MP) method, and the "Quartet Puzzling" Maximum Likelihood (ML) method. Usually, at the end of this step, an expert biologist manually examines the topology of each tree, runs different tests to compare the trees one to one, and finally tries to project all the information from the three trees onto a unique consensus topology. This process is necessary to check whether the three reconstruction methods give congruent results or only partially congruent subtrees of the original trees. Depending on these congruence tests, conclusions can be drawn for the whole tree or only for subtrees. The process also makes it possible to evaluate the reliability of the tree.
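The kind of decision an expert decision unit has to make here can be sketched as a simple rule over the results of the topology comparison tests. The sketch below is an assumption-laden illustration, not the platform's actual rule set: it assumes two tests per tree, a 0.05 significance threshold, and that a tree is kept as congruent when it is either the "best" tree or not significantly worse than it in both tests. The function name and data layout are hypothetical.

```python
def congruent_trees(p_values, alpha=0.05):
    """Return the labels (n, p, l) of the trees that are not rejected by any test.

    p_values maps a method label to its two test results; each result is either
    the string "best" or a p-value. The threshold alpha is an assumption.
    """
    label = ""
    for method in "npl":  # n = Neighbor Joining, p = Parsimony, l = Likelihood
        results = p_values[method]
        if all(r == "best" or r >= alpha for r in results):
            label += method
    return label


# Values read from the data table that follows, taking the two values per method
# as the two tests (an assumption); "<0.0001" is represented by its upper bound.
example = {
    "n": (0.1266, 0.2170),
    "p": ("best", "best"),
    "l": (0.0001, 0.0010),
}
decision = congruent_trees(example)
```

Under these assumptions, the NJ and MP trees are considered congruent and the ML tree is set aside, so the fusion would involve only two of the three trees.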
Data allowing the expert system to decide what kind of fusion must be done
Neighbor joining: 0.1266; 0.2170
Maximum Parsimony: best; best
Maximum Likelihood: <0.0001*; 0.0010
All possible cases provided by tree topologies comparison tests
Neighbor joining (n)
Maximum Likelihood (l)
Interpretation of phylogenetic trees topologies comparison tests
3 trees fusion on NJ labeled npl_A
3 trees fusion on NJ labeled npl_A
3 trees fusion on NJ labeled npl_K
3 trees fusion on NJ labeled npl_A
«best» tree if the same in the two tests with congruent tree labeled: pl or lp or nl or ln or np or pn
3 trees fusion on NJ labeled npl_T
Example 2: multiple alignment masking for sites not evolving under neutrality
At step 2 of phylogenomic inference approaches, a multiple alignment of putative homologous sequences is produced. Before being sent for phylogenetic reconstruction, multiple alignments need to be corrected for various biases. Among those corrections, sites having high rates of evolution must be removed from the multiple alignment. Similarly, sites for which the rate of substitution is highly divergent in two or more paralogous groups, suggesting possible "non-neutrality", should also be removed. Indeed, phylogenetic reconstruction methods do not tolerate sites that deviate strongly from neutral evolution and the molecular clock. Sites not respecting this rule potentially produce errors in tree reconstruction; they thus have to be masked.
Biologists usually determine these groups by just looking at the tree. After an in-depth analysis of their experience and reasoning, it seems that the knowledge to be modeled can be summarized in this sentence: "Paralogy groups contained in a phylogenetic tree are the biggest sub-trees containing sequences from different species (groups of sequences containing only one species are equivalent to a single node), but containing no sequence belonging to the species chosen as the "out group" parameter by the biologist, if any".
This is typically the kind of knowledge that can be modeled in the expert system and that is detailed in Appendix 2.
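As an illustration only, the sentence quoted above can be read literally as a short recursive procedure. The sketch below is written in Python rather than as expert-system rules, and rests on assumptions: trees are nested tuples, leaves are strings of the form `GENE_SPECIES`, and the function names are hypothetical.

```python
def leaves(node):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def species(leaf):
    """Assumed naming convention: the species code follows the last underscore."""
    return leaf.rsplit("_", 1)[1]


def paralogy_groups(node, outgroup=None):
    """Biggest sub-trees with several species and no outgroup sequence."""
    members = leaves(node)
    seen = {species(m) for m in members}
    if outgroup not in seen:
        # A single-species sub-tree counts as one node, not as a group.
        return [members] if len(seen) > 1 else []
    if isinstance(node, str):
        return []  # an outgroup leaf
    groups = []
    for child in node:  # the sub-tree contains the outgroup: look deeper
        groups.extend(paralogy_groups(child, outgroup))
    return groups


# Toy tree: two duplicated gene groups plus one fly sequence used as outgroup.
tree = (("A1_HSA", "A1_MMU"), ("A2_HSA", ("A2_MMU", "A2_GGA")), "A_DME")
groups = paralogy_groups(tree, outgroup="DME")
```

Because the traversal goes from the root downwards and stops at the first sub-tree that satisfies the rule, the groups it returns are the largest ones.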
These two examples clearly show the value of this approach for modeling knowledge and reasoning in a small, concise and easily maintainable ruleset. They are taken from the phylogenomic inference pipeline, which is intentionally over-summarized in this section into 5 main steps (detailed in the supplement). The whole phylogenomic inference pipeline included in FIGENIX contains 50 different steps (figure 2). Each of these steps automates processes that usually require the manual intervention of a biologist, and 14 of them, the "expert steps", require the knowledge and decisions of expert biologists. This last category of steps has so far accounted for the main difficulties in automating pipelines such as the one described here in their full complexity.
Results and discussion
The 8 pipeline models currently available in FIGENIX
The phylogenomic functional inference pipeline shown in this paper and detailed in the supplement.
Builds a FASTA database, eliminating redundant sequences obtained from two different query databases. For example, it mixes proteins coming from the NR and Ensembl databases and eliminates duplicates.
Composition of the two previous pipelines. This pipeline first builds a temporary protein database (mixing two different databases and eliminating duplicates). The phylogenomic inference process is then run using the built database.
Builds a FASTA database, mixing sequences obtained on the one hand from a given filtered database and on the other hand from a database of automatically clustered ESTs. For example, it allows mixing proteins coming from NR with translations of EST contigs from the NCBI dbEST database.
Composition of the TwinESTMatix and ProtPhyloGenix pipelines. Phylogenomic inference on FASTA databases built with TwinESTMatix. This allows construction of phylogenetic trees mixing proteins and translated EST contigs.
Runs our structural annotation method (mixing ab-initio and homology information) on DNA sequences up to ~50 kb (due to current computational power limitations) to predict genes. For larger DNA sequences, SlidingGenePredix can be used.
Applies the GenePredix pipeline on a sliding window. This allows gene prediction on larger DNA sequences and bypasses the ~50 kb limitation.
Composition of GenePredix and ProtPhyloGenix pipelines. This model allows automatic structural and functional annotation of DNA sequences. Indeed it produces gene prediction in DNA sequences using GenePredix, and then performs phylogenomic functional inference for each putative gene using ProtPhyloGenix.
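The database-building pipelines listed above merge two sequence sources and drop redundant entries. A minimal sketch of that idea follows; it assumes that "redundant" simply means an identical sequence (the platform's actual criterion is not described here), and the function names and toy records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_fasta(*databases):
    """Merge FASTA texts, keeping the first occurrence of each distinct sequence."""
    seen, merged = set(), []
    for text in databases:
        for header, sequence in parse_fasta(text):
            key = sequence.upper()  # assumption: redundancy = identical sequence
            if key not in seen:
                seen.add(key)
                merged.append((header, sequence))
    return merged


# Toy inputs standing in for two protein databases.
first = ">nr1\nMKV\nLLA\n>nr2\nMSTT\n"
second = ">ens1\nMKVLLA\n>ens2\nMPQ\n"
merged = merge_fasta(first, second)
```

Here `ens1` is dropped because its sequence is already present as `nr1`, even though the first copy was split over two lines.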
Validation and performance of FIGENIX's results
Complete automation of complex pipelines through the use of an expert system, although it provides obvious gains in time, does not in itself guarantee the quality of the results produced. We addressed this question by evaluating the quality of the results and the performance of FIGENIX's pipelines.
Structural annotation results
Performance of two Ab-initio methods vs. FIGENIX's structural annotation method
False positive (overprediction)
Correct full length protein prediction
The platform itself chooses, from a BLASTX output, the reference protein sequence to compare with the raw DNA sequence for gene prediction.
Extension of BLAST's high scoring pairs (HSPs) to splice donor and acceptor sites and to start and stop codons is done under the supervision of the expert system, as is the alignment of predicted proteins with the reference protein.
Phylogenomic inference results
Concerning the phylogenomic inference pipeline, several phylogenies produced by FIGENIX have already been validated in a peer-reviewed article. These phylogenies turned out to be congruent with previously published phylogenies (e.g. the PSME, TAP and GRP78 families [40, 41]). Additionally, as the pipeline automated and implemented in the platform is based on the methods developed in our lab and published in 2002, we compared the phylogenies produced today by FIGENIX's pipeline with the 31 trees published in 2002 and the 38 trees published in 2003, all of which were manually produced in our lab. All the trees produced by FIGENIX led to the same inference of orthologs and paralogs as the 69 trees published in 2002 and 2003, with similar confidence (bootstrap) values, and naturally with additional sequences in the phylogenies produced today, owing to the automatically updated databases in FIGENIX. In this case also, the phylogenies produced by the platform were congruent (with additional species) with previously published phylogenies (e.g. the RXR, Notch, C3-4-5, PBX, and LMP families). The quality of the phylogenies produced by FIGENIX's pipeline is thus comparable to that of phylogenies produced by expert biologists through the manual chaining of algorithmic tools and software. The major difference is that, while it usually takes one to several weeks to manually produce phylogenies of this quality, it takes minutes to a few hours with FIGENIX.
We do not show here all the intermediate results produced by the execution of the task, nor details of the parameters used for each tree building algorithm, but FIGENIX users can consult all the genomic results produced and the associated parameters via the Web interface.
To automatically detect duplication (D-labeled nodes) and speciation (S-labeled nodes) events from the fusion tree (figure 5), we use the detection algorithm of the Forester JAVA library. To compare our consensus tree with a reference tree, we do not use the tree of life provided with the Forester library but, instead, a minimal species tree dynamically extracted from a local copy of the NCBI taxonomy's tree of life for each dataset (other reference trees can be chosen). Once duplications are detected, the platform automatically deduces the sequences orthologous to the query sequence (here the human Notch1 protein, labeled "NOTCH_HSA"). At the end of this step, known and experimentally verified functions for all these sequences are automatically searched, as shown in the functional report on Figure 6.
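Once the nodes of a gene tree carry duplication (D) or speciation (S) labels, deducing orthologs and paralogs of a query follows from the standard definition: two sequences are orthologous when their last common ancestor node is a speciation, and paralogous when it is a duplication. The sketch below illustrates that deduction only; it does not use or reproduce the Forester library, and the tree encoding, function names and toy Notch-like tree are assumptions made for the example.

```python
def leaf_names(node):
    """Leaves are strings; internal nodes are (event, [children]) with event S or D."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [name for child in children for name in leaf_names(child)]


def classify(node, query):
    """Return (orthologs, paralogs) of query, or None if query is not under node."""
    if isinstance(node, str):
        return ([], []) if node == query else None
    event, children = node
    for index, child in enumerate(children):
        found = classify(child, query)
        if found is not None:
            orthologs, paralogs = found
            # Leaves in sibling sub-trees meet the query at this node.
            others = [name
                      for other_index, other in enumerate(children)
                      if other_index != index
                      for name in leaf_names(other)]
            (orthologs if event == "S" else paralogs).extend(others)
            return orthologs, paralogs
    return None


# Toy labeled tree (not real data): one duplication inside the vertebrates.
tree = ("S", [
    ("D", [
        ("S", ["NOTCH1_HSA", "NOTCH1_MMU"]),
        ("S", ["NOTCH2_HSA", "NOTCH2_MMU"]),
    ]),
    "N_DME",
])
orthologs, paralogs = classify(tree, "NOTCH1_HSA")
```

In this toy tree the fly sequence is orthologous to the query although it branches outside the duplication, which is the co-orthology situation discussed later in the comparison with pairwise methods.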
The execution of the whole pipeline (run on the NR database) takes 25 minutes on the platform (running on a DELL POWEREDGE 1600SC dual-processor Xeon 2.4 GHz with 1 GB RAM). The quality of the results is comparable to that of the results published in 2002 by Abi-Rached et al. on the Notch family, which took around one week of work by expert biologists. The gain in time here is evident and is obtained without compromising the quality of the results.
Performance and the size limitations of the input sequence both depend on several parameters and on the type of pipeline used. For phylogenetic inference, the size of the query protein, the number of homologs, and the number of domains all affect the global performance of the pipeline. Typically, FIGENIX can handle phylogenomic inference tasks in less than an hour for proteins of up to 1000 amino acids having up to 50 homologs. Concerning structural annotation pipelines, the size of the input sequence as well as the predicted gene density and complexity (in terms of number of exons/introns) all have an impact on performance. To date, we have annotated amphioxus cosmids of around 40 kb, with a mean of 5 predicted genes, in less than half an hour per cosmid. We have already tested FIGENIX with sequences several hundred kb long, but not yet with longer genomic portions. The annotation of whole eukaryotic genomes would probably need more computational power. However, the global architecture of the platform has been designed to support multiple CPUs and can thus potentially handle the annotation of whole genomes with appropriate computational power.
Pairwise-based vs. phylogenomic-based homology prediction methods
Specific differences between FIGENIX's phylogenomic inference pipeline and other software
Homologous sequences search on any NCBI-formatted database including nr, Swissprot and Ensembl.
Homologous sequences search limited to Swissprot and trEMBL.
Homologous sequences search on any NCBI-formatted database including nr.
Choice of the scope of phylomes by the user (root = all phylomes by default)
No choice of the scope of phylomes by the user.
Choice of the scope of phylomes by the user.
Automatic detection of domains on the query sequence.
Manual input of a domain that must be present in pfam and for which pairwise distances must have been precalculated.
Phylogenetic reconstruction at BLAST's high scoring pairs (HSPs) level converted after corrections in multiple sequence alignment (MSA).
Expert system selection of domains and repeats whose evolutionary behaviour are congruent.
Phylogenetic reconstruction on a single domain provided by the user.
No test for domains congruence. Phylogenies constructed on a corrected alignment with a HMM profile.
When no domain is found phylogenetic reconstruction on the "alignable" portion of the query sequence.
No reconstruction possible when no known domain is present on the query sequence.
Phylogenetic reconstruction possible regardless the presence of a known domain on the query sequence.
Elimination of sites not evolving under neutral evolution.
No elimination of sites producing biases in phylogenetic reconstruction.
No elimination of sites producing biases in phylogenetic reconstruction.
Elimination of sequences having a divergent amino acids composition
No elimination of sequences with divergent composition.
No test for sequence composition but selection for sequences producing significant alignments with the query HMM.
Phylogenetic reconstruction with three different methods and projection on a consensus tree.
Phylogenetic reconstruction with one single method (NJ).
Choice of reconstruction method (NJ by default) but only one method at a time and no fusion with multiple methods.
Comparison of the consensus tree with NCBI reference tree of life containing around 200,000 taxa.
Comparison of the NJ tree with a reference tree of life containing around 2,500 taxa.
Comparison of the one-method tree with NCBI reference tree of life containing around 200,000 taxa.
Automatic detection of speciation and duplications, of orthologs and paralogs.
Automatic detection of speciation and duplications, of orthologs and paralogs.
Functionality not available. Possibility to scan a database of trees for a given topology.
Automatic extraction of experimentally verified functional information for all detected orthologs and paralogs.
Functionality not available
Functionality not available
While the comparison between pairwise-based and phylogenomic-based approaches to detecting homology relationships can appear biased, it illustrates what kind of information is missed by the widely used pairwise approaches, and what kind of systematic errors they are likely to produce and spread through biological databases. The comparison of FIGENIX's pipeline with other automated phylogenomic inference software is discussed in the next section.
In the field of structural and functional annotation, the Ensembl and BioPipe automated systems propose quite similar frameworks. Independently of the implementation differences detailed previously, however, FIGENIX adds a new concept, embodied by expertise units (or E units), which are responsible for crucial points in the automation of the annotation process. They constitute "native" expert module gateways that have no counterpart in the Ensembl or BioPipe architectures. Such architectures thus still rely heavily on human expertise and cannot fully automate processes such as phylogenomic inference.
Comparison with other software proposing expertise integration
Unlike Ensembl or BioPipe, the overall approach in FIGENIX can to some extent be compared to the MAGPIE system [48, 49], which also includes a kind of expert system. However, FIGENIX's automated pipelines are data flows circulating, in a specific order, through computation tools. The expert system acts at specific points to make decisions, or to extract or correct data. In contrast, in the MAGPIE system, computations are done independently on asynchronously incoming data, and a PROLOG daemon produces logical deductions, verifying them against the results computed from the data.
Other major differences in the concept and architecture of these two systems can be listed. For example, while MAGPIE was designed for local installation on a biologist's workstation, FIGENIX was designed as a server made accessible through the internet without the need to install any software other than a JAVA 2 browser plugin.
Differences which are not at the architecture or conceptual level reside in the type of biological applications which have been integrated in these two different systems. While MAGPIE automates processes mainly dedicated to structural annotation, FIGENIX additionally integrates Phylogenomic inference pipelines.
Comparison with other automated phylogenomic inference software
Phylogenomic inference is, as stated in the Background, a labor-intensive, complex and highly human-dependent process. These are the main reasons why other, less complex and more straightforward processes of functional and homology inference (i.e. pairwise-based ones) have been considered for automation. But, as seen in the previous section, these automated processes ignore some of the functional information that could be deciphered through phylogenomic inference.
Comparison of homology inference between FIGENIX's pipeline and Homologene
Paralogy relationship missed
Co-orthology relationship missed
Orthologs not detected in Taxa
Different orthology assignment
Notch2, Notch3, and Notch4 are not detected as paralogs of Notch1.
Notch2, Notch3, and Notch4 are not detected as co-orthologous to Drosophila N.
Amphibian Ray-finned fish Cephalochordata Arachnida
3 different C.elegans genes are detected for Hs Notch1, Notch2, and Notch3, suggesting that duplications giving rise to this family took place before the divergence between protostomes and deuterostomes, and that Notch2, and Notch3 were lost in Drosophila.
Calmegin and Calreticulin are not detected as paralogs of Calnexin.
Calmegin is not detected as a Human co-ortholog to Drosophila CG9906 gene.
Amphibian Ray-finned fish
Calmegin is detected as orthologous to a Drosophila gene other than CG9906, suggesting that Calmegin and Calnexin already existed as two duplicates before the divergence between protostomes and deuterostomes and that Calmegin was secondarily lost in C. elegans.
ENPEP, LNPEP, ERAP, LRAP, and ANPEP are not detected as paralogous to TRHDE.
Each human gene of this family has been assigned a distinct ortholog in protostomes (e.g. Drosophila) suggesting this multigenic family emerged before the separation of Protostomes and Deuterostomes.
PSMB8 is not detected as paralogous to PSMB5
PSMB8 is not detected as co-orthologous to the same Drosophila gene than PSMB5.
Ray-finned fish Avian Cephalochordata. Amphibian
PSMB5 and PSMB8 are each assigned a distinct Drosophila ortholog suggesting they already existed as two copies in the last common ancestor of human and Drosophila.
PSMB10 is not detected as paralogous to PSMB7.
PSMB10 is not detected as co-orthologous to the same Drosophila gene than PSMB7.
Ray-finned fish Avian Cephalochordata. Amphibian
PSMB7 and PSMB10 are each assigned a distinct Drosophila ortholog suggesting they already existed as two copies in the last common ancestor of human and Drosophila.
Cathepsins L, M, P, R
Human Cathepsin R
Cathepsins L, M and P are not detected as paralogous to Cathepsin R.
Amphibian Avian Ray-finned fish
Each cathepsin gene is assigned a distinct drosophila ortholog suggesting the cathepsin family emerged before the separation between human and Drosophila.
None (not a multigenic family)
None (not a multigenic family)
Fungi Bovine Schistosoma Avian
None (not a multigenic family)
Amphibian Aplysia Lepidopteran Avian Schistosoma
TAP1, TAP2, ABCB9, MDR1
TAP2, ABCB9, and MDR1 are not detected as paralogous to TAP1.
Drosophila Avian Amphibian Ray-finned fish
TAP1, and TAP2 are each assigned a distinct C.elegans ortholog and none in Drosophila, suggesting there was already two copies of these genes in the last common ancestor of these two species, and that the two copies were secondary lost in the Drosophila lineage.
PSME1, PSME2, PSME3
PSME2, and PSME3 are not detected as paralogous to PSME1.
Protostomes Ray-finned fish
NLN is not detected as paralogous to THOP1
NLN is not detected as co-orthologous to the same N.crassa gene than THOP1.
FIGENIX's phylogenomic inference pipeline also differs in specific ways from each of the two methods (Table 8). None of the compared methods currently offers functionalities such as the fusion of trees constructed by different methods, or tests on the congruence of domains and repeats and on their evolutionary behavior.
Reliable automation is an absolute necessity for the structural and functional annotation of the huge amounts of genomic data coming from increasingly prolific sequencing projects. Many automated pipelines and genomic annotation platforms already exist to answer various biological questions. However, to the best of our knowledge, no publicly available pipeline or platform yet includes an expert system (with "artificial intelligence") allowing such complete automation, or the automation of processes as complex as those handled by FIGENIX. The FIGENIX platform can currently detect protein-coding genes in raw nucleic sequences, infer their putative function through phylogenomic inference, cluster ESTs and integrate them in phylogenomic analyses, and gather associated expression data. Several other complex pipelines, whose automation was impossible so far because human intervention was absolutely required at several steps, can now be considered through FIGENIX.
Availability and requirements
Project name: FIGENIX
Project home page: http://www.up.univ-mrs.fr/evol/figenix/
Operating system(s): Platform independent (accessible through a web browser)
Other requirements: JAVA 1.4.2 JRE plugin for web browsers.
License: free for academic users (contact us to request a login and password); source code is available upon request under the GNU General Public License.
Any restrictions to use by non-academics: collaboration contract needed
Appendix 1 – PROLOG code for the fusion of trees built with 3 different methods
We modeled biologists' interpretation in a very natural way in PROLOG by these rules:
fusion(npl_A) :- full_congruence(templeton, _), full_congruence(kishino-hasegawa, _).
fusion(npl_A) :- full_congruence(templeton, _), partial_congruence(kishino-hasegawa, _).
fusion(npl_A) :- full_congruence(kishino-hasegawa, _), partial_congruence(templeton, _).
fusion(npl_T) :- full_congruence(templeton, _), no_congruence(kishino-hasegawa, _).
fusion(npl_K) :- full_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(no_fusion) :- partial_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(no_fusion) :- no_congruence(kishino-hasegawa, _), partial_congruence(templeton, _).
fusion(no_fusion) :- no_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(Label) :- partial_congruence(kishino-hasegawa, Label), partial_congruence(templeton, Label).
These rules can be easily maintained. For example, we could decide to perform the fusion on the "best" tree rather than always on the NJ tree, as is done today by default in the first five cases. The rules would then look like this:
    full_congruence(kishino-hasegawa, Best),
The information supplied by the EK unit during pipeline execution takes the following form:
topology(NameOfTest, Best, [Label1, Val1], [Label2, Val2]).
(e.g. topology(templeton, n, [p, 0.15], [l, 0.01]). means that, for the Templeton test, the tree with the best topology is the one built with Neighbor Joining, that the tree built with Maximum Parsimony is congruent with it at a rate of 0.15, and that the tree built with Maximum Likelihood is congruent with it at a rate of 0.01.)
Here are the rules for the congruence tests:
% congruence is full when both comparison rates are greater than or equal to the chosen threshold
full_congruence(Test, Best) :-
    topology(Test, Best, [_, Val1], [_, Val2]),
    Val1 >= 0.05,
    Val2 >= 0.05.
% there is no congruence when both comparison rates are lower than the chosen threshold
no_congruence(Test, Best) :-
    topology(Test, Best, [_, Val1], [_, Val2]),
    Val1 < 0.05,
    Val2 < 0.05.
% congruence is partial when exactly one of the comparison rates is lower than the chosen threshold
% the label associated with the fusion type is simply the concatenation of the label of the "best" tree (see above) and the label of its congruent tree
partial_congruence(Test, Label) :-
    topology(Test, Best, [Label1, Val1], [Label2, Val2]),
    Val1 >= 0.05,
    Val2 < 0.05,
    concat_labels(Best, Label1, Label).
partial_congruence(Test, Label) :-
    topology(Test, Best, [Label1, Val1], [Label2, Val2]),
    Val1 < 0.05,
    Val2 >= 0.05,
    concat_labels(Best, Label2, Label).
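To see how the fusion and congruence rules of this appendix combine, the following minimal sketch can be loaded as a single file, for instance in SWI-Prolog. It is only an illustration under stated assumptions: the two topology/4 facts are invented values, not results produced by FIGENIX, and concat_labels/3 is not defined in the appendix, so it is assumed here to be a plain atom concatenation.

```prolog
% Hypothetical facts, in the form an EK unit would supply (values invented).
% Labels: n = Neighbor Joining, p = Maximum Parsimony, l = Maximum Likelihood.
topology(templeton,        n, [p, 0.15], [l, 0.01]).
topology(kishino-hasegawa, n, [p, 0.20], [l, 0.03]).

% Assumed helper (not given in the appendix): join two atoms into one label.
concat_labels(A, B, Label) :- atom_concat(A, B, Label).

% Congruence rules of the appendix, with the 0.05 threshold.
full_congruence(Test, Best) :-
    topology(Test, Best, [_, V1], [_, V2]), V1 >= 0.05, V2 >= 0.05.
no_congruence(Test, Best) :-
    topology(Test, Best, [_, V1], [_, V2]), V1 < 0.05, V2 < 0.05.
partial_congruence(Test, Label) :-
    topology(Test, Best, [L1, V1], [_, V2]), V1 >= 0.05, V2 < 0.05,
    concat_labels(Best, L1, Label).
partial_congruence(Test, Label) :-
    topology(Test, Best, [_, V1], [L2, V2]), V1 < 0.05, V2 >= 0.05,
    concat_labels(Best, L2, Label).

% Fusion rules of the appendix.
fusion(npl_A) :- full_congruence(templeton, _), full_congruence(kishino-hasegawa, _).
fusion(npl_A) :- full_congruence(templeton, _), partial_congruence(kishino-hasegawa, _).
fusion(npl_A) :- full_congruence(kishino-hasegawa, _), partial_congruence(templeton, _).
fusion(npl_T) :- full_congruence(templeton, _), no_congruence(kishino-hasegawa, _).
fusion(npl_K) :- full_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(no_fusion) :- partial_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(no_fusion) :- no_congruence(kishino-hasegawa, _), partial_congruence(templeton, _).
fusion(no_fusion) :- no_congruence(kishino-hasegawa, _), no_congruence(templeton, _).
fusion(Label) :- partial_congruence(kishino-hasegawa, Label), partial_congruence(templeton, Label).
```

With these invented facts, both tests find the Maximum Parsimony tree congruent with the NJ tree and the Maximum Likelihood tree not congruent, so each test is partially congruent with the label np. Only the last fusion rule applies, and the query ?- fusion(Label). answers Label = np. Replacing the facts with values that are all at or above 0.05 would instead select the first rule and give npl_A.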
Appendix 2 – Commented PROLOG code for the detection of paralogy groups
Each node of a domain's phylogenetic tree, given to the "expert system" by an EK unit, can have many children but, for implementation reasons, we code the tree as a binary tree. Each node is a term like this:
node(TheSpecies, LeftChild, RightChild)
In the annotated tree, each node knows how many sequences it contains and has the full list of the different species it includes:
node(NumberOfSequences, AllSpecies, LeftChild, RightChild)
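As an illustration of these two term shapes, the facts below encode a small hypothetical tree, ((human, mouse), chicken), before and after annotation. The predicate names raw_tree/1 and annotated_tree/1, the placeholder atom internal used for inner nodes, and the choice of species are assumptions made for this sketch; none of the three species is taken to be the outgroup, and the atom no marks an absent child, as in the rules of this appendix.

```prolog
% Raw tree ((human, mouse), chicken): node(TheSpecies, LeftChild, RightChild).
% 'internal' is a placeholder chosen here for inner nodes (assumption).
raw_tree(node(internal,
              node(internal,
                   node(human, no, no),
                   node(mouse, no, no)),
              node(chicken, no, no))).

% Same tree once annotated:
% node(NumberOfSequences, AllSpecies, LeftChild, RightChild).
annotated_tree(node(3, [human, mouse, chicken],
                    node(2, [human, mouse],
                         node(1, [human], no, no),
                         node(1, [mouse], no, no)),
                    node(1, [chicken], no, no))).
```

Each annotated node carries the number of sequences below it and the list of species they come from, so the root of this example holds 3 sequences from 3 distinct species.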
The main PROLOG rule for groups' detection is:
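% detecting paralogy groups in a phylogenetic tree implies annotating the tree nodes with species information, then searching for the biggest groups with different species
paralogy_groups(PhylogeneticTree, ParalogyGroups) :-
    subtree_species(PhylogeneticTree, AnnotatedTree),
    biggest_groups(AnnotatedTree, ParalogyGroups).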
(Rules with the same signature express a "logical OR" between them)
% a leaf node whose species differs from the one chosen as outgroup can belong to a paralogy group
% (*) the ! character in a PROLOG rule means that, if the rule succeeds, the PROLOG engine does not try the other rules with the same signature
subtree_species(node(Species, no, no), node(1, [Species], no, no)) :-
    Species \= OutgroupSpecies, !.
% a leaf node whose species is the one chosen as outgroup cannot belong to a paralogy group
subtree_species(node(_, no, no), node(1, no, no, no)) :- !.
% annotating a node that has only one child is equivalent to annotating this child
% (we have pseudo nodes to force the binary structure)
subtree_species(node(_, Child, no), AnnotatedNode) :- subtree_species(Child, AnnotatedNode), !.
% annotating a sub-tree with two children is equivalent to annotating the children and compiling the species found
subtree_species(node(Species, LeftChild, RightChild), node(N, SpeciesList, Left, Right)) :-
    subtree_species(LeftChild, Left),
    subtree_species(RightChild, Right),
    Left = node(NL, SpeciesListL, _, _),
    Right = node(NR, SpeciesListR, _, _),
    compile_annotations(NL, SpeciesListL, NR, SpeciesListR, N, SpeciesList).
% two sub-trees with the same unique species merge into a leaf of this species
compile_annotations(_, [Species], _, [Species], 1, [Species]).
% if one of the two sub-trees is invalidated for merging, the compilation is a tree invalidated for merging
% however, we still compute the total number of sequences in the sub-tree
compile_annotations(NL, no, NR, _, N, no) :- is(N, NL + NR).
compile_annotations(NL, _, NR, no, N, no) :- is(N, NL + NR).
% if no species is common to the two sub-trees, we can merge all species
compile_annotations(NL, SpeciesListL, NR, SpeciesListR, N, SpeciesList) :-
    is(N, NL + NR),
    intersection(SpeciesListL, SpeciesListR, CommonSpecies),
    CommonSpecies = [],
    concat(SpeciesListL, SpeciesListR, SpeciesList).
% search biggest paralogy groups
biggest_groups(node(N, no, Child1, Child2), Groups) :-
    biggest_groups(Child1, Groups1),
    biggest_groups(Child2, Groups2),
    concat(Groups1, Groups2, Groups), !.
% accept group if more than 4 different species
biggest_groups(Group, [Group]) :-
    Group = node(N, TaxeIds, _, _),
    N >= 4, !.
% reject subtree as a group
biggest_groups(_, []).
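To try the rules of this appendix on a small tree, the following self-contained variant can be loaded in a PROLOG system such as SWI-Prolog. It is a sketch under assumptions, not the code run by FIGENIX: the outgroup is declared through a hypothetical outgroup_species/1 fact, the list built-ins append/3, member/2 and memberchk/2 stand in for the concat and intersection helpers, sub-trees that share some species are invalidated by an added fallback branch, and example_tree/1 is an invented tree.

```prolog
% Assumption: the outgroup species is declared as a fact (hypothetical name).
outgroup_species(drosophila).

% Annotate the tree, then collect the biggest valid groups.
paralogy_groups(Tree, Groups) :-
    subtree_species(Tree, Annotated),
    biggest_groups(Annotated, Groups).

% Outgroup leaf: its species list is invalidated (no).
subtree_species(node(Species, no, no), node(1, no, no, no)) :-
    outgroup_species(Species), !.
% Any other leaf is a one-species group.
subtree_species(node(Species, no, no), node(1, [Species], no, no)) :- !.
% Pseudo node with a single child: annotate the child.
subtree_species(node(_, Child, no), Annotated) :- !,
    subtree_species(Child, Annotated).
% Inner node: annotate both children, then compile their annotations.
subtree_species(node(_, LeftChild, RightChild), node(N, Species, Left, Right)) :-
    subtree_species(LeftChild, Left),
    subtree_species(RightChild, Right),
    Left  = node(NL, SpeciesL, _, _),
    Right = node(NR, SpeciesR, _, _),
    compile_annotations(NL, SpeciesL, NR, SpeciesR, N, Species).

% Two sub-trees holding the same single species collapse into one leaf.
compile_annotations(_, [S], _, [S], 1, [S]) :- !.
% Disjoint, valid species lists are merged; anything else is invalidated
% (the overlapping-species case is an assumption of this sketch).
compile_annotations(NL, SpeciesL, NR, SpeciesR, N, Species) :-
    N is NL + NR,
    (   SpeciesL \== no, SpeciesR \== no,
        \+ ( member(X, SpeciesL), memberchk(X, SpeciesR) )
    ->  append(SpeciesL, SpeciesR, Species)
    ;   Species = no
    ).

% Invalidated node: look for groups in both children.
biggest_groups(node(_, no, Child1, Child2), Groups) :- !,
    biggest_groups(Child1, Groups1),
    biggest_groups(Child2, Groups2),
    append(Groups1, Groups2, Groups).
% Valid node with N >= 4: accept it as a group.
biggest_groups(node(N, Species, L, R), [node(N, Species, L, R)]) :-
    N >= 4, !.
% Anything else (absent child, small group) is rejected.
biggest_groups(_, []).

% Invented tree: one outgroup leaf plus two duplicated four-species clades.
example_tree(node(root, node(drosophila, no, no), node(dup, GroupA, GroupB))) :-
    GroupA = node(a, node(a1, node(human, no, no), node(mouse, no, no)),
                     node(a2, node(chicken, no, no), node(zebrafish, no, no))),
    GroupB = node(b, node(b1, node(human, no, no), node(mouse, no, no)),
                     node(b2, node(chicken, no, no), node(zebrafish, no, no))).
```

On this invented tree, the query ?- example_tree(T), paralogy_groups(T, Groups), length(Groups, Count). gives Count = 2: the node joining the two clades is invalidated because both clades contain the same species, so each four-species clade is reported as a separate paralogy group, and the outgroup leaf is discarded.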
Thanks to all students and scientists who worked with us in the laboratory these last three years, and especially: Sandrine Jacob for the Web interface; Laurent Abi-Rached for technical support and help in the development of the phylogenomic annotation pipeline; Olivier Richard and Mathieu Blanc for the automatic functional information retrieval; Antoine Schellenberger for results display tools; and Alexandre Vienne, Jeffrey Rasmussen and Céline Brochier for discussions and critical review of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.