RAPSearch: a fast protein similarity search tool for short reads
© Ye et al; licensee BioMed Central Ltd. 2011
Received: 27 July 2010
Accepted: 15 May 2011
Published: 15 May 2011
Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search--a key step to achieve annotation of protein-coding genes in these short reads, and identification of their biological functions--faces daunting challenges because of the very sizes of the short read datasets.
We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.
RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metageomics has also been demonstrated.
Similarity search is one of the very first computational analyses in the annotation of a genomic/metagenomic dataset. Among many computational tools for this task, BLAST is most commonly used [1, 2], owning to its two main advantages: the statistical model for measuring the significance of local sequence similarities [3, 4] and its speed. BLAST pioneered the "seed-extension" approach (seed is a match of short subsequences between two sequences, from which the full alignment between the two proteins may be derived by extension), which runs much faster than the rigorous Smith-Waterman local alignment algorithm  but achieves approximately the optimal alignment in practice.
The rapid advance of genome sequencing, in particular the NGS techniques that enabled the generation of DNA sequences at several magnitudes higher throughput than the conventional DNA sequencers, has posed new challenges for sequence comparison. For instance, it will take a 1000-CPU computer cluster approximately a month to search a set of DNA sequences of 20G bases total (which can be acquired by one run of a single Illumina/Solexa sequencer) against a protein database of current size (e.g. NCBI NR protein dataset with ~4G amino acids) by using BLAST search. For clarity, we do not specify BLAST programs in this paper--if the input sequences are DNA sequences, BLASTX will be used, which translates DNA sequences in all 6 frames. Note that the current metagenomic approach, i.e. the direct DNA sequencing of environmental species, represents an urgent need for this kind of database searching . In metagenomic studies, BLAST results are used not only for the identification of protein/gene families (thus the functionality of a microbial community) [7, 8], but also for the taxonomic analysis of the microbial community (as in MEGAN  and Phymm ). Without further improvement, database searching will become a bottleneck for the downstream computational analysis of high-throughput sequencing data.
The bioinformatics community was aware of the speed limitation of BLAST tools in the application for the whole genome alignment. Consequently, many ultra-fast tools were developed to replace BLASTN and megaBLAST in aligning long genomic sequences, e.g. Mummer , PatternHunter [12, 13], BLAT  and BLASTZ , which typically use sophisticated algorithms to select seeds that can be chained to form long alignments. Most of these novel algorithms, however, are not directly applicable to protein sequence database searching owning to a few distinctions between protein and genomic sequences. Protein sequences (typically of hundreds to thousands residues) are much shorter than genome sequences (typically of millions of bases), while the alphabet of protein sequences (i.e., 20 amino acids) is much larger than that of DNA sequences (i.e., 4 bases). Furthermore, some amino acids (e.g., ILE and LEU) are chemically similar, and often can be replaced by one another without changing the global structure (i.e. the fold) or the function of a protein. Hence, the alignment of homologous genomic sequences usually contain relatively long and exact matched seeds (e.g., the maximal exact matches, MEMs , or maximum unique matches, MUMs , or gapped seeds ), whereas the alignment of homologous protein sequences often contain only short exact matches. As a result, in order to achieve high sensitivity, protein database searching tools must retain many false seeds in the fast seed detection step for the seed extension, which is time-consuming.
We proposed a new kind of seeds for protein sequence comparison based on a reduced (compressed) amino acid alphabet (wherein similar amino acids are clustered), and developed a new protein database search tool RAPSearch, based on the representation of protein sequences in the reduced amino acid alphabet. Previous studies have shown that the homologous proteins with low sequence identities may still share significant common sequence patterns, for example, the sequence profiles (derived from a multiple alignment) or strings of reduced alphabets  (derived from individual sequences). RAPSearch follows the seed-extension approach--it first attempts to identify maximal exact matches (MEMs) between the reduced alphabet sequence of a query protein and the reduced alphabet sequence of all proteins in the database using suffix array (see Methods), and then uses the same heuristic extension algorithm as used in BLAST  to extend and evaluate each of these seeds.
We compared the performance of RAPSearch with BLAST, BLAT and HMMER3. BLAT achieves 50 times faster, as compared to BLAST, for protein alignments at sensitivity settings typically used when comparing vertebrate sequences . BLAT was designed to achieve fast similarity search for closely related sequences, and therefore it may not work for detecting remote protein homologs (as shown in our comparison; see Results section). HMMER was developed to achieve accurate detection of remote homologs by using profile Hidden Markov Models (HMM) of protein families as compared to the other database search tools. The new HMMER3 achieves similar speed as BLAST by implementing a new probabilistic model of local sequence alignment and a new heuristic acceleration . Our comparison demonstrates that RAPSearch achieves much faster similarity search than BLAST (and so HMMER3), and its speed is slightly lower than BLAT but achieves much higher sensitivity as compared to BLAT.
RAPSearch adopts the seed-extension approach of BLAST, which identifies the seeds, the maximal exact matches (MEMs) between the reduced alphabet sequence of a query protein and the reduced alphabet sequence of all proteins in the database, followed by extending and evaluating each of these seeds. RAPSearch employs a linear time algorithm to retrieve the MEMs, which first builds a suffix array and a corresponding longest common prefix (LCP) array to index all proteins in the database , and then traverses the suffix array based on each query protein. All identified MEMs are subject to a heuristic extension algorithm including an ungapped extension and then gapped extension, similar to BLAST.
Protein sequence seeds using a reduced amino acid alphabet
The first reduced amino acid alphabet was introduced by Dill in the hydrophobic-polar (HP) model for the study of the folding of globular proteins . Since then, there are more than 50 reduced alphabets of different size that have been proposed for various purposes . A recent study even demonstrated that reduced alphabet is more sensitive and selective in identifying remote homologous proteins . These observations suggested that homologous proteins exhibit a higher sequence identity on the reduced alphabet than that on the 20-aa alphabet, indicating that it is possible to design efficient and sensitive seeds based on a reduced amino acid alphabet.
To select an appropriate reduced amino acid alphabet for RAPSearch, we carried out the following experiments. Using the BaliBase database http://www-bio3d-igbmc.u-strasbg.fr/balibase/, we collected 10,000 pairs of distant homologous proteins that share ~20-40% sequence identify, and 10,000 pairs of proteins from different families (which serve as non-homologous proteins). For each alphabet and each length, we computed the coverage and efficiency of the corresponding seeds: the coverage is defined as the fraction of homolog proteins containing at least one seed match, and the efficiency is defined as the log ratio of the numbers of homologous and non-homologous proteins containing at least one seed match.
Seed identification by using suffix array
An essential procedure in RAPSearch (and other seed-extension tools like BLAST, which uses hash table instead of suffix array) is how to choose appropriate seeds for extension (which is more time consuming than finding the seed itself). A commonly used strategy is to define a cutoff for the minimum seed size--a small cutoff may result in a huge amount of seeds to be extended (thus slow down the similarity search), whereas a large cutoff may miss some seeds that otherwise may lead to significant alignments. BLAST uses minimum size of 3 (residues for proteins) but also requires that there are two seeds in the same diagonal that span no more than a certain length. RAPSearch will extend all single seeds that have been identified by looking up in the suffix array of target protein sequences. As RAPSearch uses reduced alphabets to present proteins, RAPSearch can use longer seed cutoff, thus achieving faster similarity search.
Minimal seed selection algorithm
RAPSearch generates seeds of a minimal length of 6-9 amino acids, with longer seeds for frequent words and shorter seeds for rare words. RAPSearch decides the minimum length of seeds starting at a particular query position based on the frequency of the 6-mers starting at that position. Once the minimum length of the seeds is selected for a particular position, all the seeds of at least the required length will be retrieved by looking up in the pre-computed suffix array of protein similarity search database. The minimum seed length selection algorithm is shown as follows (for a given position i in a query sequence).
Seed length selection algorithm (position i):
minseed ← 6
addlen ← 0
hexmerF ← the frequency of the 6-mer starting at position i
medianF ← the median of the frequencies of all 6-mers in the protein database
expectF ← hexmerF
if (expectF > medianF):
# aaComp(k) is the frequency of the corresponding amino acid at position k
while expectF * aaComp(i + addlen + 1) < medianF:
addlen ← addlen + 1
expectF ← expectF * aaComp(i + addlen + 1)
minseed ← minseed + addlen
Seeds with mismatches
We further consider seeds with mismatches (these mismatches that can not be handled by reduced alphabets). Long seeds (at least 10 aa) that allow at most one mismatch either at position 3, 4, 5, or 6 as in the following patterns, OOOXOOOOOO, OOOOOOXOOO, OOOOXOOOOO and OOOOOXOOOO (where X indicates a mismatch, and Os indicate exact matches). We replace the residue at each of the positions allowing mismatches (marked with X) by one of the reduced amino acids in the reduced alphabet in turn to search for exact matches, achieving identification of seeds with mismatches using suffix array.
Ungapped and gapped alignment
We implemented ungapped and gapped extension procedures following the same approach used in BLAST .
Statistical significance evaluation
We used the statistical evaluation method from BLAST, and used the same model and parameters (for BLOSUM62 substitution matrix) as BLAST to evaluate the significance of the resulting local alignment.
Protein similarity search databases and other datasets
We tested RAPSearch on several public metagenomic datasets with various read lengths [7, 24]. The nucleotide sequences were downloaded from the NCBI short read archive, and the MG-RAST server http://metagenomics.nmpdr.org/. The protein similarity search databases we used include a 98% non-redundant dataset (prepared by using CD-HIT ) of protein sequences from prokaryotic genomes, plasmid and viral genomes, collected in the IMG 3.0 http://img.jgi.doe.gov, eggNOG database (of sequences that have COG annotations) (downloaded from http://eggnog.embl.de/), and NCBI non-redundant (nr) database (downloaded from NCBI ftp site). The complete genomes (Escherichia coli K12 substr MG1655, NC_000913; Salmonella typhi, NC_003198; and Desulfococcus oleovorans Hxd3, NC_009943) and their gene annotations we used for the simulation study were downloaded from the NCBI ftp site.
Other computational tools
RAPSearch was compared to BLAST, BLAT and HMMER. The BLAT source codes were downloaded from http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc34.zip. The default filtering option in BLAST automatically masks low complexity regions of amino acids by using the SEG approach  prior to similarity search. The SEG masking is also implemented in RAPSearch. For comparison purpose, SEG was also applied to the six frame translations of the short reads for (protein) BLAT similarity search.
Selection of reduced amino acid alphabet
A list of reduced amino acid alphabets
Size of the alphabet
Amino acid groups
P G E K R Q D S N T H C I V W Y F A L M
[AEHKQR] [FILMVWY] [CST] [DN] [GP]
[EKQR] [IV] [LY] F [AM] W [HT] C [DNS] [GP]
G [ADEKNQRST] [CFHILMVWY] P
G D N [AEFIKLMQRVW] Y H C T S P
[LIVFMY] W C [DNTSKEQRAGP] H
[YFLIVM] C W [DNTSQKERAG] H P
[LVIMC] [ASGTP] [FYW] [EDNQ] [KRH]
A [KR] [EDNQ] C G H [ILVM] [FYW] P [ST]
[PG] [EKRQ] [DSNTHC] [IVWYF] [ALM]
P G [EKRQ] [DSN] T [HC] [IV] [WYF] A [LM]
Selection of seed lengths
We also observed that, given a protein sequence database, words of the same length (e.g., hexamers) are of very different frequencies in the database because of the composition bias (some amino acids are more frequent than others), with a small number of extremely frequent words. For example, in a protein database that has 1,231,398,422 amino acids (collected from IMG 3 database; see Methods), the most frequent hexamer has 1,351,564 copies, whereas the median frequency of all hexamers is a mere 176 (see Supplementary Figure 1 in Additional File 1 for the frequency distribution for the hexmers). If a fixed minimal seed length (e.g., 6) was used, the matches between a few extremely frequent words would dominate the seed pool, the majority of which will not lead to meaningful, long alignments. Instead, we used the minimal seed selection algorithm to automatically determine the minimum length of the seeds starting at a particular position of a query sequence based on the frequency of its first 6 residues--which can be retrieved from a pre-computed lookup table quickly--and the amino acid composition for the positions after. Using this strategy, seed matches of length 6-9 will be detected by RAPSearch among query sequences and the protein sequences in the database. This length range 6-9 works best in practice, and Figure 2B also shows this is where the seeds (of murphy.10 reduced alphabet) achieve highest efficiency.
Comparison of RAPSearch with BLAST and BLAT
The performance of RAPSearch as compared to BLAST.
Number of reads
Read length (nt)
Reads with homologs (by BLAST)
Running time (CPU hours)
Reads with homologs found in the IMG protein databasea
SRR020796 (2%) b
Performance comparison of similarity search tools on the same query dataset (4440037) against different protein similarity search databases.
Running time (CPU hours)
Reads with homologs found in the protein database (E-value cutoff = 1e-3)
Extended COG a
Comparison of RAPSearch with HMMER3
We were only able to compare RAPSearch with HMMER3 on one similarity search database, the extended COG database (eggNOG database) , for which we built HMMs of the COG families using Muscle , a multiple alignment program, and the HMM builder from the HMMER3 package. (A direct comparison of RAPSearch with HMMER3 is difficult for other similarity search databases, as HMMER is based on profile HMMs of protein families whereas RAPSearch searches against a protein sequence dataset.) The comparison shows that HMMER3 does not necessarily achieve higher sensitivity as compared to RAPSearch (and BLAST) for the short reads at the same similarity significance level. For example, for the query dataset 4440037, in total 5,975 reads have annotations at an E-value cutoff of 1e-3 based on the HMMER3 search results, whereas in total 6,230 reads have annotations based on the RAPSearch search results. In addition, HMMER3 may be too slow (although it is as fast as BLAST) for some applications, such as the similarity search of large datasets of short reads.
Evolutionary distance matters
Comparison of the factions of reads with homologs at different evolutionary distances that are detected by different similarity search tools.
Apply E-value cutoff?c
E-value = 0.1
E-value = 0.001
E-value = 0.1
E-value = 0.001
The comparison of RAPSearch with other tools demonstrates that RAPSearch can be used to achieve a fast protein similarity search with minimal loss of sensitivity. This improvement relieves the computational need for database search for short reads that are derived from NGS techniques. Instead of running BLAST searches on a computer farm with many CPUs, a RAPSearch search against a large protein database (such as the IMG database) for a dataset from a single run of 454 sequencer or Illumina sequencer can be achieved on a single PC with multiple cores, or a small computer cluster.
The tests of RAPSearch on detecting similarities at different evolutionary distances (results summarized in Table 4) showed that RAPSearch and BLAT missed more similarity hits between distantly related proteins. The same trend was observed in the tests on real metagenomic datasets (Figure 3), in which RAPSearch missed proportionally more hits at more stringent E-value cutoffs (the sensitivity loss is even worse for BLAT). This problem may be alleviated when more reference genomes become available and more proteins are added to the similarity search database.
A simple calculation shows that RAPSearch can potentially achieve >120 (106/203) speedup as compared to BLAST: a seed (of a minimal size of 3) can be found in 1 out of 203 possible matches in BLAST, and a seed can be found in 1 out of 106 matches in RAPSearch (which uses seeds of 6 amino acids or longer in the murphy.10 alphabet of size 10). The actual speedup varies, depending on the read length and the nature of the microbial community from which a metagenomic dataset is derived. Generally, RAPSearch achieved more significant speedup on shorter reads, and on the datasets derived from communities with better-characterized microbial organisms (such as the human-associated microbial communities, as in the TS50 and TS28 datasets). RAPSearch achieved ~20-90 times speedup as compared to BLAST on the metagenomic datasets that we tested; this represents the speedup that RAPSearch can achieve on a typical metagenomic dataset in practice.
We will work on several improvements of RAPSearch, aiming to further accelerate the similarity search. The first improvement is to implement a new version of RAPSearch that supports multiple threads, best utilizing the multiple cores that a modern computer typically has. The second strategy we will try is to pre-process the queries to eliminate redundant similarity searches.
Availability and Requirements
RAPSearch (which stands for R educed A lphabet based P rotein similarity Search) was implemented in C++, and has been tested extensively in linux platforms. The inputs of RAPSearch can be either amino acid sequences or nucleotide sequences (which will be translated in 6 frames). RAPSearch produces result files that are similar to BLAST outputs. RAPSearch source codes are available as Supplementary Software (Additional File 2). The details of the comparison between the BLAST and RAPSearch search results, and the source codes of RAPSearch can be found at the RAPSearch website, http://omics.informatics.indiana.edu/mg/RAPSearch.
This research was supported by the NIH grant 1R01HG004908 (Development of New Tools for Computational Analysis of Human Microbiome Project Data) and the NSF CAREER award DBI-0845685. The authors thank the University Information Technology Services team in Indiana University for their help with high-performance computing (for BLAST search). The authors would like to thank the anonymous reviewers for their helpful suggestions.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America 1990, 87(6):2264–2268. 10.1073/pnas.87.6.2264PubMed CentralView ArticlePubMedGoogle Scholar
- Karlin S, Altschul SF: Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences of the United States of America 1993, 90(12):5873–5877. 10.1073/pnas.90.12.5873PubMed CentralView ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Wooley JC, Ye Y: Metagenomics: Facts and Artifacts, and Computational Challenges. Journal of Computer Science and Technology 2010, 25(1):71–81. 10.1007/s11390-010-9306-4View ArticleGoogle Scholar
- Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes. Nature 2008, 452(7187):629–632. 10.1038/nature06810View ArticlePubMedGoogle Scholar
- Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008, 9: 386. 10.1186/1471-2105-9-386PubMed CentralView ArticlePubMedGoogle Scholar
- Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17(3):377–386. 10.1101/gr.5969107PubMed CentralView ArticlePubMedGoogle Scholar
- Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 2009, 6(9):673–676. 10.1038/nmeth.1358PubMed CentralView ArticlePubMedGoogle Scholar
- Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucl Acids Res 1999, 27(11):2369–2376. 10.1093/nar/27.11.2369PubMed CentralView ArticlePubMedGoogle Scholar
- Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics 2002, 18(3):440–445. 10.1093/bioinformatics/18.3.440View ArticlePubMedGoogle Scholar
- Li M, Ma B, Kisman D, Tromp J: Patternhunter II: highly sensitive and fast homology search. J Bioinform Comput Biol 2004, 2(3):417–439. 10.1142/S0219720004000661View ArticlePubMedGoogle Scholar
- Kent WJ: BLAT--The BLAST-Like Alignment Tool. Genome research 2002, 12: 656–664.PubMed CentralView ArticlePubMedGoogle Scholar
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-Mouse Alignments with BLASTZ. Genome Research 2003, 13: 103–107. 10.1101/gr.809403PubMed CentralView ArticlePubMedGoogle Scholar
- Bray N, Pachter L: MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome research 2004, 14: 693–699. 10.1101/gr.1960404PubMed CentralView ArticlePubMedGoogle Scholar
- Bork P, Gibson TJ: Applying motif and profile searches. Methods Enzymol 1996, 266: 162–184.View ArticlePubMedGoogle Scholar
- Bork P, Sander C, Valencia A: An ATPase domain common to prokaryotic cell cycle proteins, sugar kinases, actin, and hsp70 heat shock proteins. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(16):7290–7294. 10.1073/pnas.89.16.7290PubMed CentralView ArticlePubMedGoogle Scholar
- Eddy SR: A new generation of homology search tools based on probabilistic inference. Genome Inform 2009, 23(1):205–211.PubMedGoogle Scholar
- Manber U, Myers G: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 1991, 22(5):935–948.View ArticleGoogle Scholar
- Dill KA: Theory for the folding and stability of globular proteins. Biochemistry 1985, 24(6):1501–1509. 10.1021/bi00327a032View ArticlePubMedGoogle Scholar
- Peterson EL, Kondev J, Theriot JA, Phillips R: Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 2009, 25(11):1356–1362. 10.1093/bioinformatics/btp164PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15(1):87–88. 10.1093/bioinformatics/15.1.87View ArticlePubMedGoogle Scholar
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature 2009, 457(7228):480–484. 10.1038/nature07540PubMed CentralView ArticlePubMedGoogle Scholar
- Li W, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001, 17(3):282–283. 10.1093/bioinformatics/17.3.282View ArticlePubMedGoogle Scholar
- Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol 1996, 266: 554–571.View ArticlePubMedGoogle Scholar
- Schafmeister CE, LaPorte SL, Miercke LJW, Stroud RM: A designed four helix bundle protein with native-like structure. Nat Struct Mol Biol 1997, 4(12):1039–1046. 10.1038/nsb1297-1039View ArticleGoogle Scholar
- Riddle DS, Santiago JV, Bray-Hall ST, Doshi N, Grantcharova VP, Yi Q, Baker D: Functional rapidly folding proteins from simplified amino acid sequences. Nat Struct Mol Biol 1997, 4(10):805–809. 10.1038/nsb1097-805View ArticleGoogle Scholar
- Murphy LR, Wallqvist A, Levy RM: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng 2000, 13(3):149–152. 10.1093/protein/13.3.149View ArticlePubMedGoogle Scholar
- Wommack KE, Bhavsar J, Ravel J: Metagenomics: read length matters. Appl Environ Microbiol 2008, 74(5):1453–1463. 10.1128/AEM.02181-07PubMed CentralView ArticlePubMedGoogle Scholar
- Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P: eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 2008, (36 Database):D250–254.Google Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–1797. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, Wortman JR, Rusch DB, Mitreva M, Sodergren E, Chinwalla AT, Feldgarden M, Gevers D, Haas BJ, Madupu R, Ward DV, Birren BW, Gibbs RA, Methe B, Petrosino JF, Strausberg RL, Sutton GG, White OR, Wilson RK, Durkin S, Giglio MG, Gujja S, Howarth C, Kodira CD, Kyrpides N, Mehta T, et al.: A catalog of reference genomes from the human microbiome. Science 2010, 328(5981):994–999.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.