Proteinortho: Detection of (Co-)orthologs in large-scale analysis
© Lechner et al; licensee BioMed Central Ltd. 2011
Received: 13 December 2010
Accepted: 28 April 2011
Published: 28 April 2011
Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practice. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases.
The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes.
Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.
Genome annotation largely depends on the determination of sequence intervals that are homologous, and if possible, orthologous to sequences of known identity and function in related genomes. Orthologous genes (orthologs) are derived from a common ancestor by a speciation event. Orthologs are of particular interest because they can be expected to have maintained at least part of their (ancestral) biological function. For protein-coding genes, several well-known databases, including InParanoid, OrthoMCL-DB, COG-database, HomoloGene, eggNOG, OMA Browser and Ensembl Compara compile such information. Their content is restricted to data previously published in comprehensive databases of protein sequences such as UniProt. Updates with additional proteomic data are thus published relatively infrequently. Modern high-throughput technologies, however, produce huge amounts of protein data and even larger amounts of transcript data that are computationally translated to putative polypeptide sequences. It would therefore often be desirable to generate the orthology relation for a particular dataset, so that the availability of orthology data does not limit the set of species or genes that can be included.
The computation of genome-wide orthology data, however, is a challenging and time-consuming task with the currently available tools. In many cases, orthologs cannot be identified unambiguously by means of sequence comparison. The main difficulty arises from the presence of paralogs (homologous genes within the same genome), which can make it very difficult to recognize the correct ortholog among the other homologs. Gene duplications following the speciation, furthermore, create two or more genes in one lineage that are, collectively, orthologous to one or more genes in another lineage. Such genes are known as co-orthologs.
The most widely used approach to identify (putative) orthologs between two species is the reciprocal best alignment heuristic [11–15]. This approach was more recently extended, e.g. in OrthoMCL and MultiParanoid, to detect (co-)orthologs within multiple species. All these tools, however, are limited to relatively small sets of species. In practice, analyzing the complete proteomes of more than about 50 prokaryote species goes beyond the capabilities of standard hardware and requires access to supercomputer resources. This limitation arises both from technical issues such as insufficient parallelization and from an algorithmic design that requires all reasonable alignments for each input protein to be held in memory for efficient access in the clustering stage. Proteinortho is specifically designed to deal with hundreds of species together containing millions of proteins. It achieves this performance both by optimizing the implementation and by modifying the reciprocal best alignment method in a way that allows alignment processing on the fly.
Results and Discussion
As for other approaches to large-scale orthology detection, the starting point is a complete collection of pairwise comparisons, typically performed using blast. For simplicity of presentation, we assume that the individual sequences that are compared represent proteins, although algorithms and pipelines are applicable also to other sequence data such as non-coding RNA genes or conserved DNA regions. Typically, the results of pairwise comparisons are ranked by similarity, for instance based on blast statistics, evolutionary distances, or genome rearrangement analysis [7, 16, 18]. High-ranking alignments across multiple species then have to be combined in order to determine orthologous groups. However, these groups usually do not readily provide detailed insights since they can contain large numbers of related genes for each species. Hence, meaningful units have to be identified. For this purpose, a variety of clustering algorithms have been applied to determine Clusters of Orthologous Groups (COGs). The MCL algorithm, for instance, uses a stochastic flow simulation to determine meaningful COGs [16, 19, 20]. In addition, MultiParanoid explicitly searches and tags in-paralogs, i.e., recent paralogs that represent species-specific gene expansions. This strategy requires a direct comparison of proteins within each species. Alternatively, data were curated by manual postprocessing.
We will argue here that orthology determination can be understood as the problem of finding nearly disjoint maximal nearly-complete multipartite subgraphs in an edge-weighted directed graph whose vertices are the proteins in the input set, and whose edges connect certain pairs of similar proteins of different species. The edge weights $\omega_{x \to y}$ encode the similarity of x and y. In our implementation, the bit score of the blast alignment (x → y) will serve as edge weight. An E-value cut-off is used beyond which blast alignments are not included in this directed graph.
The symmetric subgraph of this directed graph, containing only reciprocal best alignments, can be regarded as an undirected graph ϒ_RBAH. By construction, any two vertices are connected by an edge in ϒ_RBAH if and only if they are orthologs. A set of orthologs therefore corresponds to a complete multipartite subgraph of ϒ_RBAH in which every species is represented at most once. Furthermore, we note that these subgraphs are disjoint, i.e., ortholog sets correspond to the connected components of ϒ_RBAH.
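The construction of the reciprocal best alignment graph can be sketched as follows. This is a minimal illustration rather than the Proteinortho implementation: the function name `reciprocal_best_hits`, the representation of proteins as `(species, protein)` tuples, and the toy bit scores are assumptions made for this example only.

```python
from collections import defaultdict

def reciprocal_best_hits(scores):
    """Return the undirected edges {x, y} that are reciprocal best alignments.

    scores maps (query, target) -> bit score; query and target are
    (species, protein) tuples taken from different species.
    """
    # best[x][species] = (bit score, target) of the top alignment of x in that species
    best = defaultdict(dict)
    for (x, y), s in scores.items():
        species = y[0]
        if species not in best[x] or s > best[x][species][0]:
            best[x][species] = (s, y)

    edges = set()
    for x, per_species in best.items():
        for _, y in per_species.values():
            # keep the edge only if x is also the best alignment of y in x's species
            back = best.get(y, {}).get(x[0])
            if back is not None and back[1] == x:
                edges.add(frozenset((x, y)))
    return edges

# Toy data (hypothetical): b2 is a more distant homolog of a1 than b1.
example_scores = {
    (("A", "a1"), ("B", "b1")): 200.0,
    (("A", "a1"), ("B", "b2")): 150.0,
    (("B", "b1"), ("A", "a1")): 190.0,
    (("B", "b2"), ("A", "a1")): 140.0,
}
example_edges = reciprocal_best_hits(example_scores)
```

In this toy input only the pair a1/b1 survives, because the best alignment of b2 points to a1 while the best alignment of a1 points to b1.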
This has two advantages: it retains edges to likely co-orthologs while at the same time reducing the number of edges that are inserted into the directed graph.
The symmetric part ϒ* of the resulting directed graph now retains more edges than ϒ_RBAH. In particular, it includes all the edges connecting similar co-orthologs. On the other hand, the threshold at fairly high bit scores disconnects at least most of the more distant homologs. Sets of (co-)orthologs thus appear in ϒ* as nearly complete multipartite subgraphs. Typically they will contain more than one node from the same species, among them, in particular, all in-paralogs. Although this approach strongly reduces the problem with spurious edges, ϒ* may also contain additional edges connecting two or more sets of (co-)orthologs.
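The effect of a relative bit-score threshold can be illustrated with a short sketch. The selection rule itself is not reproduced in the text above, so the rule used here is an assumption: an alignment x → y is kept if its bit score reaches at least a fraction f of the best bit score of x within the species of y, and an undirected edge is kept if both directions pass. The function name and the toy scores are hypothetical; f = 0.95 is the default similarity value quoted in the methods.

```python
from collections import defaultdict

def adaptive_reciprocal_hits(scores, f=0.95):
    """Symmetric part of the thresholded alignment graph (ASSUMED rule, see lead-in).

    scores maps (query, target) -> bit score; query and target are
    (species, protein) tuples taken from different species.
    """
    # best[x][species] = highest bit score of x against any protein of that species
    best = defaultdict(dict)
    for (x, y), s in scores.items():
        best[x][y[0]] = max(s, best[x].get(y[0], 0.0))

    # directed edges whose score is within a factor f of the best one
    kept = {(x, y) for (x, y), s in scores.items() if s >= f * best[x][y[0]]}

    # an undirected edge requires both directions to pass the threshold
    return {frozenset((x, y)) for (x, y) in kept if (y, x) in kept}

# Toy data (hypothetical): b1 and b2 are near-identical co-orthologs of a1,
# b3 is a distant homolog.
a1, b1, b2, b3 = ("A", "a1"), ("B", "b1"), ("B", "b2"), ("B", "b3")
toy_scores = {
    (a1, b1): 200.0, (b1, a1): 200.0,
    (a1, b2): 195.0, (b2, a1): 195.0,
    (a1, b3): 80.0,  (b3, a1): 80.0,
}
relaxed = adaptive_reciprocal_hits(toy_scores, f=0.95)
strict = adaptive_reciprocal_hits(toy_scores, f=1.0)
```

With f = 1.0 the sketch reduces to plain reciprocal best alignments, whereas f = 0.95 additionally keeps the edge to the co-ortholog b2 and still drops the distant homolog b3.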
The problem of finding maximal complete multipartite subgraphs of a graph is NP-complete. Furthermore, we have seen above that ϒ* may lack a few edges which should connect orthologs ("false negatives"), while at the same time there are also some additional "false positive" edges. In a two-species comparison there is no information that could compensate for missing and spurious edges, while in the multi-species case, the graph ϒ* is in a sense "self-correcting" since we can formulate orthology detection as an optimization problem. More precisely, we search for a decomposition of ϒ* into a disjoint collection of complete multipartite subgraphs so that the number of edge insertions and deletions is minimized.
Since no efficient approaches to this combinatorial optimization problem seem to be known, it appears fruitful to resort to a heuristic approach that employs a somewhat different point of view: nearly complete multipartite subgraphs are very dense subgraphs, which in our case either form connected components on their own, or which are connected to other dense clusters by a few additional edges. The problem thus is to determine for each connected component of ϒ* whether it is sufficiently densely connected, and if not, to partition it into its densely connected components by removing the spurious edges connecting them. Here, we approach this issue by means of spectral partitioning; see Additional File 1 for a detailed description.
We remark, finally, that one could efficiently add the explicit determination of in-paralogs after ϒ* has been constructed, although currently this is not implemented in Proteinortho. Following e.g. InParanoid, sets of in-paralogs are subsets of proteins from the same species within the same connected component of ϒ* that are more similar to each other than to any protein in another species. It is sufficient, thus, to determine alignment scores for pairs of nodes from the same species within connected components of ϒ*. As in MultiParanoid, in-paralogs could be collapsed to a single node.
Most importantly, the algorithm outlined in the previous section avoids the memory bottleneck that limits previous approaches. Suppose our input set comprises N species with, on average, m genes. The size of the input is thus n = N × m proteins. Instead of storing all n × n pairwise blast scores, Proteinortho processes the comparisons between any two species A and B immediately. First, the blast alignments are filtered by two additional criteria: (1) the alignment must exhibit a minimum level of sequence identity, and (2) the alignment must cover at least a minimum fraction of the query protein. This second rule ensures that fusion genes such as rice OsUK/UPRT1 are eventually assigned as homologs of the dominating part of the protein. Then equ. (2) is evaluated for all x in A, so that Proteinortho directly constructs the sparse graph, while the full graph of all alignments does not need to be stored at all. Proteinortho therefore uses chained arrays, requiring only n × k entries, where k is the average number of nearly optimal blast alignments per gene, and k = a × N, where a is the average number of (co-)orthologs of a gene in a single species. The value of a is independent of the size of the dataset. Empirically, we found a ≤ 1 in all datasets investigated so far. Thus Proteinortho saves a factor $n^2/(N^2 m a) = m/a \geq m$ of memory. Note that prokaryotes have $m \approx 10^3 \ldots 10^4$ proteins.
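As a worked example of this estimate, the following sketch plugs in the size of the bacterial dataset reported in this article (717 genomes, 2,155,620 proteins). Setting a = 1 is an assumption that uses the empirical upper bound quoted above, and the entry counts ignore the number of bytes per entry.

```python
# Dataset size as reported for the domain-wide bacterial analysis.
N = 717            # number of species (genomes)
n = 2_155_620      # total number of proteins
m = n / N          # average number of proteins per species

a = 1.0            # ASSUMPTION: empirical upper bound a <= 1

dense_entries = n * n      # storing all pairwise blast scores
k = a * N                  # nearly optimal alignments kept per protein
sparse_entries = n * k     # entries held in the chained arrays

saving = dense_entries / sparse_entries
print(f"average proteome size m: {m:.0f}")
print(f"dense entries:  {dense_entries:.3e}")
print(f"sparse entries: {sparse_entries:.3e}")
print(f"saving factor (equals m / a): {saving:.0f}")
```

The saving factor is of the order of the average proteome size, which is consistent with the range of m given above for prokaryotes.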
First we reduce the problem by determining the connected components of ϒ*, since these can be treated separately. We use the well-known breadth-first search approach to this end. In order to check whether a connected component Ξ is sufficiently dense to represent a single set of co-orthologs, we compute its normalized algebraic connectivity. Here n is the number of vertices of Ξ and α2 is the second-smallest eigenvalue of the graph Laplacian L = D - A of Ξ, where A is the adjacency matrix of Ξ and D is the diagonal matrix of the vertex degrees. The eigenvalue α2 can be computed iteratively, see Additional File 1. Large values of the normalized algebraic connectivity indicate dense clusters that most likely correspond to coherent sets of (co-)orthologs. Small values, on the other hand, indicate that Ξ has low connectivity and either consists of two or more dense components or has (nearly) tree-like protrusions. Very large components can arise when genes duplicate frequently and diverge quickly according to the duplication-degeneration-complementation (DDC) model.
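Both steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of Additional File 1: the eigenvalue is approximated by a plain power iteration on the shifted matrix cI - L with the constant vector projected out, the normalization is assumed to be α2/n (the exact formula is not reproduced in the text above), and all function names and the toy graph are hypothetical.

```python
from collections import deque
import math

def connected_components(adj):
    """Breadth-first search; adj maps each vertex to the set of its neighbours."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

def fiedler(adj, comp, iters=10000, tol=1e-13):
    """Approximate alpha2 and its eigenvector for the connected component comp."""
    n = len(comp)
    if n < 2:
        return 0.0, [0.0] * n
    idx = {v: i for i, v in enumerate(comp)}
    deg = [len(adj[v]) for v in comp]
    c = 2.0 * max(deg) + 1.0                     # strictly above the largest eigenvalue of L
    x = [math.sin(i + 1.0) for i in range(n)]    # deterministic, non-constant start vector
    alpha2 = float("inf")
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]              # project out the trivial eigenvector
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
        lx = [deg[i] * x[i] - sum(x[idx[w]] for w in adj[comp[i]]) for i in range(n)]
        rq = sum(xi * yi for xi, yi in zip(x, lx))   # Rayleigh quotient x^T L x
        if abs(alpha2 - rq) < tol:
            alpha2 = rq
            break
        alpha2 = rq
        x = [c * xi - yi for xi, yi in zip(x, lx)]   # power step with c*I - L
    return alpha2, x

# Toy component (hypothetical): two triangles joined by one spurious edge p3-q1.
toy = {
    "p1": {"p2", "p3"}, "p2": {"p1", "p3"}, "p3": {"p1", "p2", "q1"},
    "q1": {"p3", "q2", "q3"}, "q2": {"q1", "q3"}, "q3": {"q1", "q2"},
}
comps = connected_components(toy)
alpha2, vec = fiedler(toy, comps[0])
normalized = alpha2 / len(comps[0])   # ASSUMED normalization alpha2 / n
```

For this toy component the normalized value falls below the default threshold of 0.1 quoted in the methods, so the component would be a candidate for splitting.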
The "Fiedler vector" x2, i.e., the eigenvector of L to eigenvalue α2, can be used to find a partition of Ξ into two connected components, one consisting of the vertices for which x2 has positive entries and one for which x2 has negative entries. This decomposition is iterated until Ξ is partitioned into components with algebraic connectivity above a certain threshold value and tree-like pieces, which most likely correspond to false-positive edges of ϒ*. In order to speed up the computation, trees are therefore removed from the component Ξ before the algebraic connectivity and the Fiedler vector are computed. This is achieved by iteratively removing a vertex of degree 1 and its adjacent edge. This step is not performed if Proteinortho is used to compare only two species.
We remark that the memory and CPU consumption for the clustering step of OrthoMCL can be drastically reduced by using a novel algorithm, reaching a performance that is theoretically comparable to spectral partitioning as used by Proteinortho (see Additional File 1). Both require only the storage of edge or adjacency lists. The current implementation of spectral partitioning could be further optimized, e.g. by employing the Lanczos algorithm for computing the eigenvalues. Spectral partitioning on average scales as $O(n^2 k)$. This leads to an expected runtime of $O(N^3)$ for Proteinortho, which is comparable to the $O(N^3 \log N)$ complexity bound reported for COG clustering.
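The tree-removal step and the sign-based split can be sketched as follows. The function names and the toy graph are hypothetical, and assigning zero entries of the Fiedler vector to the non-positive side is a choice made for this example only.

```python
def prune_trees(adj):
    """Return a copy of adj with degree-1 vertices removed iteratively."""
    g = {v: set(nb) for v, nb in adj.items()}
    leaves = [v for v, nb in g.items() if len(nb) == 1]
    while leaves:
        v = leaves.pop()
        # skip vertices that were already removed or are no longer leaves
        if v not in g or len(g[v]) != 1:
            continue
        (w,) = g.pop(v)          # the single neighbour of the leaf
        g[w].discard(v)          # remove the adjacent edge
        if len(g[w]) == 1:       # the neighbour may have become a leaf itself
            leaves.append(w)
    return g

def split_by_sign(vertices, fiedler_vector):
    """Partition vertices by the sign of their Fiedler vector entry."""
    positive = [v for v, x in zip(vertices, fiedler_vector) if x > 0]
    rest = [v for v, x in zip(vertices, fiedler_vector) if x <= 0]
    return positive, rest

# Toy component (hypothetical): a triangle x-y-z with a tail z-t1-t2.
tailed = {
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y", "t1"},
    "t1": {"z", "t2"}, "t2": {"t1"},
}
core = prune_trees(tailed)
```

Pruning first shrinks the Laplacian whose eigenvector has to be computed, since tree-like protrusions cannot belong to a dense cluster.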
Evaluation of Proteinortho
We compared Proteinortho with the COG-database and OrthoMCL. The latter is the main competitor in terms of speed and memory. The COG-database provides a manually curated dataset that can be regarded as more reliable than fully automated approaches. For benchmark analysis, a set of 16 randomly chosen bacteria from three different classes (six Gram-positive bacilli, six gamma- and four alpha-proteobacteria) was used. The input set comprises 53,623 protein sequences.
In order to demonstrate that Proteinortho is suitable for large-scale analysis we asked: which proteins can be found in all bacterial species? Proteins that are conserved domain-wide are likely to be useful for the construction of a phylogeny of eubacteria as an alternative to the prevalent usage of 16S rRNA sequences. They can also serve as protein-based markers for identifying novel bacterial species as members of an established phylogenetic group. In addition, they can give insight into the basic protein equipment of bacterial life. Hence, we applied Proteinortho to the set of all eubacterial proteomes available at NCBI at the beginning of 2009 (Additional File 3).
The input dataset comprises 2,155,620 proteins annotated in 717 bacterial genomes. The Proteinortho run took less than two weeks using 50 processor cores (Intel Xeon at 2.00-2.33 GHz) distributed over multiple PCs. Only 2 GB of memory was required. OrthoMCL could not be employed for this task on the hardware available in our lab. Extrapolating from the benchmarks in Figure 3, we estimate that hundreds of gigabytes of memory and years of runtime would have been required.
Proteinortho identified 152 proteins as core of the bacterial protein complement, occurring in at least 90% of all 717 free-living and endosymbiotic bacteria. Of these, 32 are ribosomal subunits. The 30 apparently most indispensable proteins, occurring in 99% of all bacteria, are:
Elongation factor Tu (often co-orthologous to elongation factor 1-alpha)
Elongation factor G
Translation initiation factor IF-2
RNA polymerase subunits β and β'
ATP-dependent metalloprotease FtsH
F0F1 ATP synthase subunits α and β
ribosomal protein of the 30S rRNA subunit
ribosomal proteins of the 50S rRNA subunit
Nevertheless, about one third of the 30 most conserved proteins could not be recovered in the genomes of the two species with the smallest proteomes in our dataset: Candidatus Carsonella ruddii PV and Candidatus Sulcia muelleri GWSS. Both are endosymbionts that are considered as organelle-like [31, 32]. Numerous genes that are otherwise considered to be essential for life have been reported as missing in both species. A more detailed and larger list of domain-wide common proteins can be downloaded at http://bioinf.pharmazie.uni-marburg.de/supplements/proteinortho/.
Proteinortho implements a blast-based approach to determine sets of (co-)orthologous proteins or nucleic acid sequences that generalizes the reciprocal best alignment heuristic. The software is optimized for large datasets, and in particular provides a drastic reduction of the memory requirements compared to earlier tools. It can therefore be run on off-the-shelf PC hardware for large datasets. Our implementation scales very well with the number of available processor cores. The blast searches can be trivially parallelized and distributed easily to multiple PCs without the need for a cluster management system, while deployment to existing cluster infrastructure is also supported.
Proteinortho views orthology detection as a variant of graph clustering since co-orthologous sets correspond to maximal complete multipartite subgraphs, which at the same time are well separated from each other. Due to the unavoidable noise in the real data, however, co-orthologous sets appear as dense subgraphs without clearly recognizable low-weight cuts. This property is measured quite well by the algebraic connectivity. At the same time, low-weight cuts between dense regions are identified very well by the corresponding Fiedler vector. We therefore employ spectral partitioning instead of a direct graph clustering approach. The quality of the co-orthologous sets proposed by Proteinortho is comparable to the performance of OrthoMCL.
Both time and memory requirements are significantly reduced compared to earlier approaches, enabling applications that were infeasible before. For instance, we applied Proteinortho to the complete set of 2.1 million proteins from the 717 bacterial genomes available at NCBI at the beginning of 2009. We found 30 proteins that are present in more than 99% of the investigated proteomes.
All analyses with Proteinortho and OrthoMCL were performed using default values unless described otherwise. These are E-value < $10^{-10}$, 25% identity, and a Markov Inflation Index of 1.5 for OrthoMCL, and E-value < $10^{-10}$, 25% identity, adaptive best alignments similarity of f = 0.95, and algebraic connectivity > 0.1 for Proteinortho. OrthoMCL version 1.4 was downloaded from http://OrthoMCL.org/common/downloads/.
Speed and memory benchmarks were performed multiple times using the proteome of Escherichia coli K12 substr. MG1655 from the NCBI. For benchmarking purposes, the protein ids were renamed systematically to prevent duplicated ids, which cannot be handled by Proteinortho. A script continuously monitored the memory consumption and reported the peak for each run (Figure 3).
For the domain-wide commons we applied Proteinortho with default values of the parameters. Bacterial proteomes and genomes were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) in March 2009. A detailed list can be found in Additional File 3. In order to recover missing annotation, we selected all orthologous groups covering at least 75% of all species as good candidates for domain-wide commons. The unique set of sequences of each orthologous group was blasted against all genomes that lack an annotated ortholog using tblastn (E-value < $10^{-20}$). The sequence of the best alignment was then added to the orthologous group.
For evaluation we used the proteome data from the COG-database (ftp://ftp.ncbi.nih.gov/pub/COG/COG/) downloaded in November 2009. We chose Bacillus halodurans, Bacillus subtilis, Lactococcus lactis, Listeria innocua, Streptococcus pneumoniae TIGR4 and Streptococcus pyogenes M1 GAS from the Gram-positive bacilli class; Buchnera sp. APS, Escherichia coli K12, Pasteurella multocida, Salmonella typhimurium LT2, Vibrio cholerae and Yersinia pestis from the gamma proteobacteria class; and Brucella melitensis, Caulobacter vibrioides, Mesorhizobium loti and Rickettsia prowazekii from the alpha proteobacteria class. Both Proteinortho and OrthoMCL were applied to this set. All groups with proteins covering at least 6 species were compared to the COG-database, as illustrated in Figure 4 and Figure 5.
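The selection of candidate groups for this recovery step can be sketched as follows. The data layout, the function name `candidate_groups` and the toy groups are assumptions made for illustration. The sketch only determines which groups qualify and which species would still have to be searched with tblastn; the search itself is not shown.

```python
def candidate_groups(groups, all_species, min_fraction=0.75):
    """Map each sufficiently widespread group to the species it does not cover yet.

    groups maps a group id to a list of (species, protein) members.
    """
    selected = {}
    for gid, members in groups.items():
        present = {species for species, _ in members}
        if len(present) >= min_fraction * len(all_species):
            # genomes of these species are the targets of the follow-up search
            selected[gid] = set(all_species) - present
    return selected

# Toy data (hypothetical): four species, two orthologous groups.
species = {"A", "B", "C", "D"}
toy_groups = {
    "g1": [("A", "a1"), ("B", "b1"), ("C", "c1")],   # covers 75% of the species
    "g2": [("A", "a2"), ("B", "b2")],                # covers 50% of the species
}
todo = candidate_groups(toy_groups, species)
```

Here only g1 passes the 75% criterion, and species D is the one genome in which a missing ortholog would be searched.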
The source code of Proteinortho can be obtained under the GPLv2 (or later) from http://www.bioinf.uni-leipzig.de/Software/proteinortho/
This work was supported in part by DFG STA-850/6, STA-850/2, EU-project "Quantomics" and VW-project "Complex networks" and SPP 1258: "Sensory and regulatory RNAs in Prokaryotes".
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19(2):99–113. doi:10.2307/2412448
- Berglund AC, Sjölund E, Ostlund G, Sonnhammer EL: InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Res 2008, (36 Database):D263–266.
- Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006, (34 Database):D363–368.
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28: 33–36. doi:10.1093/nar/28.1.33
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008, (36 Database):D13–21.
- Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P: eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 2008, (36 Database):D250–254.
- Schneider A, Dessimoz C, Gonnet GH: OMA Browser-exploring orthologous relations across 352 complete genomes. Bioinformatics 2007, 23(16):2180–2182. doi:10.1093/bioinformatics/btm295
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E: Ensembl 2007. Nucleic Acids Res 2007, (35 Database):D610–617.
- UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 2010, (38 Database):D142–8.
- Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005, 39: 309–38. doi:10.1146/annurev.genet.39.073003.114725
- Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y: Predicting function: from genes to genomes and back. J Mol Biol 1998, 283(4):707–725. doi:10.1006/jmbi.1998.2144
- Rivera MC, Jain R, Moore JE, Lake JA: Genomic evidence for two functionally distinct gene classes. Proc Natl Acad Sci USA 1998, 95(11):6239–6244. doi:10.1073/pnas.95.11.6239
- Hirsh AE, Fraser HB: Protein dispensability and rate of evolution. Nature 2001, 411(6841):1046–1049. doi:10.1038/35082561
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314(5):1041–1052. doi:10.1006/jmbi.2000.5197
- Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 2002, 12(6):962–968.
- Li L, Stoeckert CJ Jr, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178–2189. doi:10.1101/gr.1224503
- Alexeyenko A, Tamas I, Liu G, Sonnhammer EL: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006, 22(14):e9–15. doi:10.1093/bioinformatics/btl213
- Fu Z, Chen X, Vacic V, Nan P, Zhong Y, Jiang T: MSOAR: a high-throughput ortholog assignment system based on genome rearrangement. J Comput Biol 2007, 14(9):1160–1175. doi:10.1089/cmb.2007.0048
- van Dongen SM: Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht, The Netherlands; 2000.
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–1584. doi:10.1093/nar/30.7.1575
- Cornaz D: A linear programming formulation for the maximum complete multipartite subgraph problem. Math Program, Ser B 2006, 105: 329–344. doi:10.1007/s10107-005-0656-6
- Guattery S, Miller GL: On the performance of spectral graph partitioning methods. In Proceedings of the sixth annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1995:233–242.
- Sikdar M, Kim JS: Expression of a natural fusion gene for uracil phosphoribosyltransferase and uridine kinase from rice shows growth retardation by 5-fluorouridine or 5-fluorouracil in Escherichia coli. African Journal of Biotechnology 2010, 9(9):1295–1303.
- Hopcroft J, Tarjan R: Efficient algorithms for graph manipulation. Commun ACM 1973, 16: 372–378. doi:10.1145/362248.362272
- Fiedler M: Algebraic Connectivity of Graphs. Czechoslovak Math J 1973, 23: 298–305.
- Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151(4):1531–1545.
- Fiedler M: A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Math J 1975, 25: 619–633.
- Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A: A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 2010, 26: 1481–1487. doi:10.1093/bioinformatics/btq229
- Lanczos C: An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators. J Res Natl Bureau Standards 1950, 45: 255–281.
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 2007, 35(21):7188–7196. doi:10.1093/nar/gkm864
- Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori M: The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 2006, 314(5797):267. doi:10.1126/science.1134196
- McCutcheon JP, Moran NA: Parallel genomic evolution and metabolic interdependence in an ancient symbiosis. Proc Natl Acad Sci USA 2007, 104(49):19392–19397. doi:10.1073/pnas.0708855104
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.