Construction of a species reference database
Full-length (≥1200 bp) bacterial and archaeal 16S rRNA gene sequences were obtained from the Ribosomal Database Project version 11.2 (http://rdp.cme.msu.edu/). All sequences were labelled to species names according to the NCBI (http://www.ncbi.nlm.nih.gov/guide/taxonomy/), which is readily available and distributes the original nomenclature as deposited with the submitted sequence (http://www.ncbi.nlm.nih.gov/WebSub/html/requirements.html). Only sequences with complete binomial (genus + species) names were retained, and identical sequences from the same species were removed in order to reduce the training dataset. Sequences that were identical, but associated to multiple species were on the other hand retained, as such sequences represent species that are not assignable using our algorithm outlined below. Thus, the resulting SPINGO reference database only contained full-length, species-specific 16S rRNA gene sequences, which were non-redundant for each species. For example, if Species A has sequences ACG/ACC/ACC/CCC before this operation, it will afterwards only have sequences ACG/ACC/CCC. From this SPINGO database of 95,210 sequences and 12,394 unique species, a taxonomy mapping file was created linking the original sequence identifiers with a two-level hierarchy comprising both genus and species names, as well as Clostridium clusters where applicable. For the latter, a lookup table linking species names with these clusters had previously been compiled [2, 8]. Albeit not the main aim of SPINGO, genus-level classification is also enabled by default to broaden its application for high taxonomic resolution. The taxonomy mapping file can be re-used by the make_database.py script to facilitate future updates or reconstruction based on other types of taxonomic hierarchies.
Algorithm
Assignment of amplicons to the closest known species is based on the reference database described above. This database is loaded into memory and indexed by k-mers using an inverted index structure, a
$$ {S}_{Q,R}=\frac{\left|{Q}_K\cap {R}_K\right|}{\left|{Q}_K\right|} $$
commonly used index data structure for storing words (k-mers), which allows for rapid retrieval of all sequences which contain a given word (k-mer). Query sequences are then compared to the reference database using the following similarity score: given a query sequence Q and a reference sequence R, QK is the set of overlapping and fixed length k-mers present in Q, and RK is the set of overlapping k-mers in R. A similarity score SQ,R is calculated as the number of k-mers shared between the reference and query sequence, normalized by the number of unique k-mers in the query sequence thus giving a number in the range [0, 1].
For each query sequence, the database is searched using both the forward and reverse complement of the query and a list of the reference sequence(s) giving the highest score is retrieved. For each of the taxonomic levels in the two-level hierarchy, genus and species level, as well as clostridium cluster, an assignment is made at that level if the annotations of the reference sequences are in agreement, otherwise the assignment is considered to be ambiguous. If an assignment is made at any taxonomic level, a bootstrapping process, similar to that of the RDP-classifier [1], is performed to provide a confidence estimate of the taxonomic assignment. Briefly, for each bootstrap trial at a given k-mer size ksize, a subset qk of QK is sampled at random, where |qk| = |Qk|/ksize. The taxonomic annotations at each level for the reference sequences giving the highest Sq,R are obtained. The confidence estimate is then calculated as the proportion of retrieved sequences with a taxonomic assignment matching that of Qk at the same level. A low confidence estimate indicates that many reference sequences have a similar (but not identical) set of k-mers (low distinctiveness), while a high confidence estimate indicates that there are few reference sequences with a similar composition (high distinctiveness).
Creation of validation datasets
To evaluate SPINGO and demonstrate its utility for species classification we used two different approaches. First, we used a 10-fold cross validation [9] with the SPINGO database on four different methods for species classification: SPINGO, the mothur-implementation of the RDP-classifier (v1.34.1), UCLUST (v1.2.22q; default method in QIIME’s assign_taxonomy.py) and BLASTn (v.2.2.28), while keeping database, k-mer size (8-mer) and number of bootstrap runs (100) constant across compared methods. All these methods use enumeration of k-mers at an early stage, but differ significantly in how these counts are processed in the downstream analysis. A key difference between SPINGO and the other algorithms is that SPINGO identifies sequences for indistinguishable species and discards them as ambiguous candidates, whereas the other methods will always classify the query sequence even if there are multiple conflicting hits. Even so, by specifying a non-default option it is also possible to list all ambiguous species hits. SPINGO is thus designed to classify relatively short sequences where the percentage deviation from a reference sequence is relatively small. One can view k-mer counting as a proxy for standard pairwise sequence alignment based on sequence similarity, but as there still are some important differences it can be useful to briefly outline situations where false positive and negatives will occur. For example, if a sequence is made of two regions A = ATATTAAATT and B = GCCGGGCGGC the k-mers would be ATAT TATT ATTA TTAA TAAA AAAT AATT ATTG
TTGC
TGCC GCCG CCGG CGGG GGGC GGCG GCGG CGGC, while if A and B where switched the k-mers would be GCCG CCGG CGGG GGGC GGCG GCGG CGGC GGCA
GCAT
CATA ATAT TATT ATTA TTAA TAAA AAAT AATT, with the k-mers unique to either A or B underscored. Thus, the k-mer similarity score would be high (14/17), but an alignment score would be low resulting in a false positive. A similar situation could occur at the start or end of a sequence: For example, if there is a substitution at the start of sequence 5′-ATTTGCG, which has k-mers ATTT TTTG TTGC TGCG, to 5′-GTTTGCG the new k-mers are GTTT TTTG TTGC TGCG, resulting in a k-mer similarity score of 3/4 against the original sequence. However, if there instead is a substitution in the middle to 5′-ATTCGCG the new k-mers are ATTC TTCG TCGC CGCG, resulting in k-mer similarity score of 0, much lower than an alignment score (false negative). False negatives will also occur if a query sequence contains a large number of errors equally spread along the sequence, as the k-mer score will be lower than what a global alignment score would be. Nevertheless, a sequence that is not classified due to a large number of mutations or sequencing errors should not be classified, even if there is a high global similarity. This makes sense in a situation where different species may differ in only a small number of bases. So while these issues are worth considering, our empirical data shows that they do not adversely affect the classifier performance. False positive rate will be more greatly affected by mislabelled sequences in the database. As for false-negatives, SPINGO does not try to predict which species are not in a sample - absence of evidence is not evidence of absence - so that discussion is purely academic.
For each 10th of the SPINGO database, 12 different variable 16S rRNA gene regions were extracted using the V-ripper script (Additional file 1 and GitHub distribution) and classified. Second, we obtained three different datasets, based on a simulation, a mock community and a real-life environmental sample. For the simulation, we created a dataset of 10,067 full-length 16S rRNA gene sequences, each representing one type strain, from the SILVA Living Tree Project version 11.5 [10] using the NCBI Taxonomy nomenclature,. This facilitated a like-for-like comparison with the SPINGO database which contains sequences from the RDP database, but with species names labelled according to the NCBI Taxonomy. A hold-out evaluation database was created by removing 9,607 sequences from the SPINGO database that were present in the SILVA database. Variable regions V1-V3 (6,046 sequences), V3-V5 (5,860) and V6-V9 (5,241) were extracted from the SILVA database using previously described primers [11] with the V-ripper script and subsequently classified using the evaluation database not containing the 9,607 test sequences. In addition, we classified sequences derived from a mock community of 21 known bacterial species in even composition [12]. The 454 Pyrosequencing reads covering the hyper-variable regions V1-V3, V3-V5 and V6-V9 were chimera filtered using UCHIME [13] with the “Gold” database (http://microbiomeutil.sourceforge.net) as reference to remove chimeric sequences. Sequences were considered to be correctly classified if the unambiguously assigned species was a known component of the mock community. To also explore a real biological environment we analyzed amplicon sequences based on the three primer combinations referred to above for a stool sample originating from a healthy male subject (sample SRS019089 from the Human Microbiome Project http://hmpdacc.org/HM16STR/healthy).
SPINGO’s accuracy and target versatility was finally demonstrated and evaluated on amplicon sequences derived from the universal house-keeping gene cpn60. Here, a 10-fold cross validation was performed on 6,690 amplicon sequences of the cpn60 Universal Target region (~500 bp) for which there was a full species name, which were downloaded from cpnDB [14] on March 4th 2015 (http://www.cpndb.ca/search.php). The scripts and syntax used for evaluation are available in the Additional file 1.