- Open Access
SPINGO: a rapid species-classifier for microbial amplicon sequences
© Allard et al. 2015
- Received: 19 March 2015
- Accepted: 17 September 2015
- Published: 8 October 2015
Taxonomic classification is a cornerstone for the characterisation and comparison of microbial communities. Currently, most existing methods are either slow, restricted to specific communities, highly sensitive to taxonomic inconsistencies, or limited to genus-level classification. As crucial microbiota information hinges on high taxonomic resolution, it is imperative to increase resolution to species level wherever possible.
In response to this need we developed SPINGO, a flexible, stand-alone software application dedicated to high-resolution assignment of sequences to species level using partial 16S rRNA gene sequences from any environment. SPINGO compares favourably to other methods in terms of classification accuracy, and is as fast as or faster than those that have higher error rates. As a demonstration of its flexibility for other types of target genes, we also successfully applied SPINGO to cpn60 amplicon sequences.
SPINGO is an accurate, flexible and fast method for low-level taxonomic assignment. This combination is becoming increasingly important for rapid and accurate processing of amplicon data generated by newer next generation sequencing technologies.
- Microbiota composition
- 16S rRNA gene amplicons
Analysis of microbial communities (microbiota) sampled directly from their natural environment, without clonal culturing, is a rapidly evolving field with wide-ranging applications within ecology, agriculture and medicine. A relatively straightforward approach for characterizing and comparing microbiota is to sequence variable regions of the ubiquitous 16S rRNA gene following amplification using universal primer pairs. The resulting sequence reads can either be analysed as groups of similar sequences (operational taxonomic units: OTUs), or as raw reads. In either case, taxonomic classification of the resulting sequence reads is a crucial component for characterising microbiota composition. The most common tool for this is the RDP-Classifier, which generally assigns partial 16S rRNA gene sequences down to genus level. There is, however, a need among investigators to increase the taxonomic resolution to include species assignments wherever possible. For example, a genus like Streptococcus has species that are considered either beneficial (S. thermophilus) or pathogenic (S. pneumoniae); thus it is crucial to be able to identify species with good accuracy whenever sequence specificity allows it. In addition, a subset of the Gram-positive and endospore-forming bacterial species have traditionally been structured into Clostridium clusters, primarily based on 16S rRNA gene similarities. Many of the species in these clusters belong to genera other than Clostridium, often due to discrepancies between their traditionally characterised phenotypes and molecular phylogeny. As there are established primer combinations for many of these clusters, which are frequently used by microbiologists to elucidate microbiota community structure, there is a need to link high-throughput data derived from culture-independent methods to these more targeted and traditional methods.
So far, the few existing methods that can be used for species classification “out-of-the-box” are rather limited and not designed for such purposes: they are either applied to a very restricted set of species [3, 4], or are only suitable for reads from soon-obsolete technologies like 454 Pyrosequencing due to their low classification speed. Even though broad taxonomic assignment of representative OTU sequences is the main objective for UCLUST as implemented by the assign_taxonomy.py script within the QIIME software suite, it does have the capacity for species-level assignment when Greengenes is used as a reference database. However, this applies only to a minor subset of OTUs, as Greengenes has only 627 unique species (version 13.5) compared to 12,394 species in the RDP database (version 11.2) compliant with the NCBI Taxonomy. While both databases have uneven representation of taxa, this is more prominent for Greengenes, where the most abundant species is Faecalibacterium prausnitzii (15 % of sequences with species classification), compared to the RDP database, where the most abundant species is Bacillus subtilis (2 %). Both the Java and mothur implementations of the RDP-classifier can also be used for species classification; however, these methods were designed for broader taxonomic classification and require re-training using non-default databases. A versatile species-classifier should be able to classify sequences from very diverse environments, and also be capable of efficiently processing millions of amplicon sequences generated by more contemporary and low-cost high-throughput technologies, e.g. Illumina MiSeq, within a reasonable time-frame. This sequencing technology now routinely generates 300 bp long paired-end reads, thereby facilitating coverage of several adjacent variable regions of the 16S rRNA gene when overlapping paired-end reads are merged.
Here, we present SPINGO (Species-level IdentificatioN of metaGenOmic amplicons), a stand-alone software application capable of classifying assignable species sampled from any environment. Its flexible design, accuracy and speed allow for frequent taxonomy updates, facilitating even more precise high-resolution classifications without becoming a computational bottleneck for downstream analysis.
Construction of a species reference database
Full-length (≥1200 bp) bacterial and archaeal 16S rRNA gene sequences were obtained from the Ribosomal Database Project version 11.2 (http://rdp.cme.msu.edu/). All sequences were labelled with species names according to the NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/guide/taxonomy/), which is readily available and distributes the original nomenclature as deposited with the submitted sequence (http://www.ncbi.nlm.nih.gov/WebSub/html/requirements.html). Only sequences with complete binomial (genus + species) names were retained, and identical sequences from the same species were removed in order to reduce the training dataset. Sequences that were identical but associated with multiple species were, on the other hand, retained, as such sequences represent species that are not assignable using our algorithm outlined below. Thus, the resulting SPINGO reference database only contained full-length, species-labelled 16S rRNA gene sequences, which were non-redundant within each species. For example, if Species A has sequences ACG/ACC/ACC/CCC before this operation, it will afterwards only have sequences ACG/ACC/CCC. From this SPINGO database of 95,210 sequences and 12,394 unique species, a taxonomy mapping file was created linking the original sequence identifiers with a two-level hierarchy comprising both genus and species names, as well as Clostridium clusters where applicable. For the latter, a lookup table linking species names with these clusters had previously been compiled [2, 8]. Although it is not the main aim of SPINGO, genus-level classification is also enabled by default to broaden its application. The taxonomy mapping file can be re-used by the make_database.py script to facilitate future updates or reconstruction based on other types of taxonomic hierarchies.
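The redundancy-removal step described above can be illustrated with a minimal sketch. This is not the code in make_database.py; the function name dedupe_within_species, the (species, sequence) record layout and the "Species B" record are hypothetical and chosen only to reproduce the ACG/ACC/ACC/CCC example.

```python
from collections import defaultdict

def dedupe_within_species(records):
    """Drop repeated sequences within a species, keeping first-seen order.

    records: iterable of (species, sequence) pairs (hypothetical layout).
    Returns the retained records and the set of sequences that are shared
    by more than one species (these are kept, not removed).
    """
    seen = set()
    kept = []
    species_by_seq = defaultdict(set)
    for species, seq in records:
        species_by_seq[seq].add(species)
        if (species, seq) not in seen:
            seen.add((species, seq))
            kept.append((species, seq))
    shared = {seq for seq, names in species_by_seq.items() if len(names) > 1}
    return kept, shared

# Species A as in the text, plus a hypothetical Species B sharing "ACC".
records = [
    ("Species A", "ACG"), ("Species A", "ACC"),
    ("Species A", "ACC"), ("Species A", "CCC"),
    ("Species B", "ACC"),
]
kept, shared = dedupe_within_species(records)
print(kept)
print(shared)
```

Species A ends up with ACG/ACC/CCC, while the copy of ACC labelled Species B is retained so that the sequence can later be recognised as shared between species.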
For each query sequence, the database is searched using both the forward and reverse complement of the query, and a list of the reference sequence(s) giving the highest score is retrieved. For each of the taxonomic levels in the two-level hierarchy (genus and species), as well as Clostridium cluster, an assignment is made at that level if the annotations of the reference sequences are in agreement; otherwise the assignment is considered to be ambiguous. If an assignment is made at any taxonomic level, a bootstrapping process, similar to that of the RDP-classifier, is performed to provide a confidence estimate of the taxonomic assignment. Briefly, for each bootstrap trial at a given k-mer size ksize, a subset q_k of Q_k is sampled at random, where |q_k| = |Q_k|/ksize. The taxonomic annotations at each level for the reference sequences giving the highest S_{q,R} are obtained. The confidence estimate is then calculated as the proportion of retrieved sequences with a taxonomic assignment matching that of Q_k at the same level. A low confidence estimate indicates that many reference sequences have a similar (but not identical) set of k-mers (low distinctiveness), while a high confidence estimate indicates that there are few reference sequences with a similar composition (high distinctiveness).
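The search and bootstrap procedure above can be sketched in a few lines of Python. This is an illustrative reading of the description, not SPINGO's C++ implementation. Several points are assumptions: the score is taken to be the number of k-mers shared between query and reference, the better-scoring orientation of the query is used, and the confidence pools the retrieved references over all trials. All function names and the toy sequences are hypothetical.

```python
import random

def kmers(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def top_hits(query_kmers, ref_kmers):
    """Reference ids sharing the most k-mers with the query (assumed score)."""
    scores = {rid: len(query_kmers & rk) for rid, rk in ref_kmers.items()}
    best = max(scores.values())
    return [rid for rid, s in scores.items() if s == best], best

def classify(query, refs, labels, k=8, trials=100, seed=0):
    """Return (label, confidence), or ("AMBIGUOUS", 0.0) if top hits disagree."""
    rng = random.Random(seed)
    ref_kmers = {rid: kmers(seq, k) for rid, seq in refs.items()}
    # Search both orientations and keep the one with the higher top score.
    orientations = [kmers(query, k), kmers(revcomp(query), k)]
    qk = max(orientations, key=lambda q: top_hits(q, ref_kmers)[1])
    hits, _ = top_hits(qk, ref_kmers)
    names = {labels[h] for h in hits}
    if not qk or len(names) != 1:
        return "AMBIGUOUS", 0.0
    name = names.pop()
    # Bootstrap: sample |Q_k| / ksize k-mers per trial (at least one).
    pool = sorted(qk)
    sample_size = max(1, len(pool) // k)
    matched = total = 0
    for _ in range(trials):
        subset = set(rng.sample(pool, sample_size))
        boot_hits, _ = top_hits(subset, ref_kmers)
        total += len(boot_hits)
        matched += sum(labels[h] == name for h in boot_hits)
    return name, matched / total

# Toy data: two references with no 4-mers in common.
refs = {"r1": "ATATTAAATTGCCGGGCGGC", "r2": "ACACACACTCTCTCTCAGAG"}
labels = {"r1": "Genus alpha", "r2": "Genus beta"}
result = classify("TTAAATTGCCGG", refs, labels, k=4)
print(result)
```

In this toy case the two references share no 4-mers, so every bootstrap subset retrieves only r1. If a third reference with the same sequence as r1 but a different label were added, the top hits would disagree and the query would be reported as ambiguous.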
Creation of validation datasets
To evaluate SPINGO and demonstrate its utility for species classification we used two different approaches. First, we used a 10-fold cross validation with the SPINGO database on four different methods for species classification: SPINGO, the mothur-implementation of the RDP-classifier (v1.34.1), UCLUST (v1.2.22q; default method in QIIME’s assign_taxonomy.py) and BLASTn (v.2.2.28), while keeping database, k-mer size (8-mer) and number of bootstrap runs (100) constant across compared methods. All these methods use enumeration of k-mers at an early stage, but differ significantly in how these counts are processed in the downstream analysis. A key difference between SPINGO and the other algorithms is that SPINGO identifies sequences for indistinguishable species and discards them as ambiguous candidates, whereas the other methods will always classify the query sequence even if there are multiple conflicting hits. Even so, by specifying a non-default option it is also possible to list all ambiguous species hits. SPINGO is thus designed to classify relatively short sequences where the percentage deviation from a reference sequence is relatively small. One can view k-mer counting as a proxy for standard pairwise sequence alignment based on sequence similarity, but as there are still some important differences it can be useful to briefly outline situations where false positives and false negatives will occur. For example, if a sequence is made of two regions A = ATATTAAATT and B = GCCGGGCGGC, the 4-mers would be ATAT TATT ATTA TTAA TAAA AAAT AATT ATTG TTGC TGCC GCCG CCGG CGGG GGGC GGCG GCGG CGGC, while if A and B were switched the 4-mers would be GCCG CCGG CGGG GGGC GGCG GCGG CGGC GGCA GCAT CATA ATAT TATT ATTA TTAA TAAA AAAT AATT. The only k-mers unique to one of the two arrangements are those spanning the junction between the regions (ATTG TTGC TGCC in the first, GGCA GCAT CATA in the second). Thus, the k-mer similarity score would be high (14/17), but an alignment score would be low, resulting in a false positive.
A similar situation could occur at the start or end of a sequence. For example, if there is a substitution at the start of sequence 5′-ATTTGCG, which has k-mers ATTT TTTG TTGC TGCG, to 5′-GTTTGCG, the new k-mers are GTTT TTTG TTGC TGCG, resulting in a k-mer similarity score of 3/4 against the original sequence. However, if there is instead a substitution in the middle, to 5′-ATTCGCG, the new k-mers are ATTC TTCG TCGC CGCG, resulting in a k-mer similarity score of 0, much lower than an alignment score (false negative). False negatives will also occur if a query sequence contains a large number of errors equally spread along the sequence, as the k-mer score will be lower than what a global alignment score would be. Nevertheless, a sequence that is not classified due to a large number of mutations or sequencing errors should not be classified, even if there is a high global similarity. This makes sense in a situation where different species may differ in only a small number of bases. So while these issues are worth considering, our empirical data show that they do not adversely affect the classifier performance. The false positive rate will be more greatly affected by mislabelled sequences in the database. As for false negatives, SPINGO does not try to predict which species are not in a sample - absence of evidence is not evidence of absence - so that discussion is purely academic.
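The worked k-mer examples above can be checked with a short script. It is a sketch that assumes the similarity score is the fraction of query k-mers also found in the reference, which reproduces the quoted values of 14/17, 3/4 and 0; the helper names kmer_list and kmer_similarity are hypothetical and not part of SPINGO.

```python
from fractions import Fraction

def kmer_list(seq, k=4):
    """All overlapping k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_similarity(query, ref, k=4):
    """Fraction of query k-mers that also occur in ref (assumed score)."""
    query_kmers = kmer_list(query, k)
    ref_kmers = set(kmer_list(ref, k))
    shared = sum(km in ref_kmers for km in query_kmers)
    return Fraction(shared, len(query_kmers))

A = "ATATTAAATT"
B = "GCCGGGCGGC"

# Swapped regions: only the three junction-spanning 4-mers differ.
swapped = kmer_similarity(A + B, B + A)
# Substitution at the first base: only the first 4-mer changes.
edge_sub = kmer_similarity("GTTTGCG", "ATTTGCG")
# Substitution in the middle: every 4-mer overlaps the changed base.
middle_sub = kmer_similarity("ATTCGCG", "ATTTGCG")

print(swapped, edge_sub, middle_sub)
```

The middle substitution removes every shared 4-mer because the sequence is only seven bases long, so each window overlaps the changed position. In longer reads a single substitution affects at most k of the k-mers.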
For each 10th of the SPINGO database, 12 different variable 16S rRNA gene regions were extracted using the V-ripper script (Additional file 1 and GitHub distribution) and classified. Second, we obtained three different datasets, based on a simulation, a mock community and a real-life environmental sample. For the simulation, we created a dataset of 10,067 full-length 16S rRNA gene sequences, each representing one type strain, from the SILVA Living Tree Project version 11.5 using the NCBI Taxonomy nomenclature. This facilitated a like-for-like comparison with the SPINGO database, which contains sequences from the RDP database but with species names labelled according to the NCBI Taxonomy. A hold-out evaluation database was created by removing 9,607 sequences from the SPINGO database that were present in the SILVA database. Variable regions V1-V3 (6,046 sequences), V3-V5 (5,860) and V6-V9 (5,241) were extracted from the SILVA database using previously described primers with the V-ripper script and subsequently classified using the evaluation database not containing the 9,607 test sequences. In addition, we classified sequences derived from a mock community of 21 known bacterial species in even composition. The 454 Pyrosequencing reads covering the hyper-variable regions V1-V3, V3-V5 and V6-V9 were filtered using UCHIME with the “Gold” database (http://microbiomeutil.sourceforge.net) as reference to remove chimeric sequences. Sequences were considered to be correctly classified if the unambiguously assigned species was a known component of the mock community. To also explore a real biological environment, we analyzed amplicon sequences based on the three primer combinations referred to above for a stool sample originating from a healthy male subject (sample SRS019089 from the Human Microbiome Project http://hmpdacc.org/HM16STR/healthy).
SPINGO’s accuracy and target versatility were finally demonstrated and evaluated on amplicon sequences derived from the universal house-keeping gene cpn60. Here, a 10-fold cross validation was performed on 6,690 amplicon sequences of the cpn60 Universal Target region (~500 bp) for which there was a full species name, which were downloaded from cpnDB on March 4th 2015 (http://www.cpndb.ca/search.php). The scripts and syntax used for evaluation are available in Additional file 1.
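The 10-fold partitioning used in the cross validation can be sketched as follows. The text does not state how sequences were allocated to the ten parts, so the random, unstratified shuffle is an assumption, as are the function name k_fold_splits and the toy identifiers. The actual evaluation scripts are those in Additional file 1.

```python
import random

def k_fold_splits(ids, n_folds=10, seed=0):
    """Yield (train_ids, test_ids) pairs, one per fold.

    Assumption: identifiers are shuffled once and dealt into n_folds
    parts; each part serves as the test set exactly once.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test_ids in enumerate(folds):
        train_ids = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_ids, test_ids

# Toy identifiers standing in for reference sequence ids.
sequence_ids = [f"seq{n}" for n in range(95)]
splits = list(k_fold_splits(sequence_ids))
print(len(splits), [len(test) for _, test in splits])
```

In each round the classifier would be trained on the nine training parts, and the variable regions extracted from the held-out part would be classified against it.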
Species classification of non-16S rRNA gene sequences
Here we present and demonstrate the utility and performance of SPINGO, a rapid, accurate and flexible classifier that improves the taxonomic resolution of 16S rRNA gene amplicons down to species level. While its primary target is species from any type of environmental sample, it can also be adapted to arbitrary classification hierarchies, like Clostridium clusters, which are commonly used for characterising mammalian gut microbiota. SPINGO was consistently the most accurate species-classifier when compared to the other methods. Finally, the efficient algorithm provides a significant speed-up compared to existing classifiers which, when combined with its high accuracy, makes SPINGO a particularly valuable tool now that amplicons are, more than ever, sequenced in the hundreds of millions.
The source code, executables and documentation are available at https://github.com/GuyAllard/SPINGO.
Project name: SPINGO
Operating system(s): Linux
Programming language: C++ / Python
Other requirements: To compile from source the following development libraries are required - Boost.program_options, Boost.serialization and Boost.thread
License: GNU GPL version 3
Restrictions for use by non-academics: None
This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant Number 11/SIRG/B2162 and SFI/12/RC/2273. Ian B Jeffery is funded under Grant Number 13/SIRG/2128. We thank Dr Todd DeSantis for valuable discussions on taxonomic hierarchies.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73(16):5261–7.
- Collins MD, Lawson PA, Willems A, Cordoba JJ, Fernandez-Garayzabal J, Garcia P, et al. The phylogeny of the genus Clostridium: proposal of five new genera and eleven new species combinations. Int J Syst Bacteriol. 1994;44(4):812–26.
- Conlan S. Species-level analysis of DNA sequence data from the NIH Human Microbiome Project. PLoS ONE. 2012;7(10):e47075.
- Fettweis JM, Serrano MG, Sheth NU, Mayer CM, Glascock AL, Brooks JP, et al. Species-level classification of the vaginal microbiome. BMC Genomics. 2012;13 Suppl 8:S17.
- Nakayama J, Jiang J, Watanabe K, Chen K, Ninxin H, Matsuda K, et al. Up to species-level community analysis of human Gut microbiota by 16S rRNA amplicon pyrosequencing. Bioscience of Microbiota, Food and Health. 2013;32(2):69–76.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7(5):335–6.
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72(7):5069–72.
- Claesson MJ, Cusack S, O'Sullivan O, Greene-Diniz R, de Weerd H, Flannery E, et al. Microbes and health sackler colloquium: composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proc Natl Acad Sci U S A. 2010;108:4586-591. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3063589/pdf/pnas.201000097.pdf.
- Refaeilzadeh P, Tang L, Liu H. Cross-Validation. In: Encyclopedia of Database Systems. Springer USA; 2009: 532–538.
- Munoz R, Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, et al. Release LTPs104 of the All-species living tree. Syst Appl Microbiol. 2011;34(3):169–70.
- Ward DV. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE. 2012;7(6):e39315.
- Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21(3):494–504.
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. 2011;27(16):2194–200.
- Links MG, Chaban B, Hemmingsen SM, Muirhead K, Hill JE. mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences. Microbiome. 2013;1(1):23.