AKE - the Accelerated k-mer Exploration web-tool for rapid taxonomic classification and visualization
© Langenkämper et al.; licensee BioMed Central. 2014
Received: 17 July 2014
Accepted: 12 November 2014
Published: 13 December 2014
With the advent of low-cost, fast sequencing technologies, metagenomic analyses have become possible. However, the large data volumes gathered by these techniques and the unpredictable diversity captured in them remain a challenge for computational biology.
In this paper we address the problem of rapid taxonomic assignment with small and adaptive data models (< 5 MB) and present the accelerated k-mer explorer (AKE). Acceleration in AKE’s taxonomic assignments is achieved by a special machine learning architecture that is well suited to modeling data collections that are intrinsically hierarchical. We report reasonably good classification accuracy for ranks down to order, observed in a study on real-world data (Acid Mine Drainage, Cow Rumen).
We show that the execution time of this approach is orders of magnitude shorter than competitive approaches and that accuracy is comparable. The tool is presented to the public as a web application (url: https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, password: bmcbioinfo).
Metagenomics is the direct sequencing and analysis of environmental samples. Metagenomic studies are used in a variety of fields, including biomedical studies and ecological diversity studies. As a first step after sequencing, the taxonomic composition is estimated and taxonomic categories are assigned to the data. This is a challenging problem due to the sequence length and the complexity of the data captured. The analysis of 16S rRNA sequences is a prominent step in determining the taxonomic composition. This approach has limitations, e.g. the copy number can vary by an order of magnitude, and we therefore focus on whole metagenome analysis.
Multiple tools exist that are able to predict the class of a genomic sample sequence, most of them using alignments (e.g. Megan4, (Web)Carma3, MG-Rast). As these can be very time consuming, alternative approaches based on profile features have been proposed (Phylopythia(S), NBC, TAC-ELM, TAXSOM, PhymmBL, Kraken, taxy). Sequence data are transformed into profile features, i.e. feature vectors that consist of various measurements describing the nucleotide composition of the sequence. Frequently employed characteristics are G/C content and k-mer occurrence. The speedup of these techniques comes at the cost of a loss of accuracy compared to the alignment-based methods. Nevertheless, it has been shown that k-mer profiles are distinctive enough for binning in metagenomic studies and for classification up to certain levels in the tree of life.
For benchmarking we compared AKE with Phylopythia(S) and NBC. Phylopythia classifies profile features with an SVM-based classifier architecture. The web-based version is called PhylopythiaS. It uses one of two models: either a generic model for classification or a sample-specific one, which can be generated by the user prior to classification. We benchmarked against the parameterless generic model.
NBC implements the naive Bayes classifier for taxonomic assignment as a web application. The k-mer length as well as the genomes to match against can be chosen. We chose the Bacteria/Archaea genomes to match against and a k-mer length of 6 for benchmarking.
This paper presents AKE (Accelerated k-mer Exploration web-tool), a computational approach to rapid taxonomic assignment for an immediate response to new data. Rapid taxonomic assignment can be of interest when data sets from many samples are to be analyzed immediately or when new data sets are generated rapidly by filtering and fusion. The result of AKE is a rapid taxonomic assignment presented as a web-based, interactive and dynamic visualization. AKE's computational speed is achieved by (1) using refined k-mer profile features, (2) a data-driven, i.e. learned, hierarchical and descriptive model, which provides the basis for classification and visualization, and (3) parallel computing. This work is based on a previous paper by Martin et al. and shares its features and binning method, namely the H2SOM. However, the classifier architecture is different, and Martin et al. do not provide a web interface for visualization of results. Furthermore, the execution speed is increased by parallelization and a faster implementation. To boost classification accuracy, a rejection class containing non-specific profiles is introduced to the model, i.e. some data may be rejected if it is unspecific. This results in a web-accessible system for low-performance computers that features an immediate first visual inspection of new data. The accuracy is comparable to that of similar approaches, but the execution time is shorter. The tool is publicly available as a web application (https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, pw: bmcbioinfo), which facilitates ease of use. This relieves users of the burden of resource-intensive operations on their own systems, e.g. analyses on small-scale computers in laboratories are made possible. Furthermore, no software packages have to be downloaded and installed. The only requirement is an up-to-date web browser (roughly, not older than 2 years).
Recent reports of IMG4 (progress report: http://img.jgi.doe.gov/w/doc/releaseNotes.pdf) show a rapidly growing number of available metagenomes. Likewise, the number of PubMed hits for the term “metagenomics” has grown massively in recent years, showing the importance of the field.
The following Methods section describes the features, methods and data used in this study and how these are used to build a classification system for metagenomic data. In the Results and discussion section we present the performance on two real world data sets, compared to similar approaches. Furthermore, the differences in runtime are reported. The Conclusion sums up the results of this study.
To use the sequences $S^{(\zeta)}$, $\zeta = 0,\dots,n$, with a mathematical model like the H2SOM, features $x^{(\zeta)}$, $\zeta = 0,\dots,n$, have to be computed for the sequence reads. For this purpose, k-mer profiles with three different normalizations are used. They are listed here with basic explanations; further details are given in the referenced literature.
To reduce a bias towards frequent k-mers the vectors are normalized to unit length.
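As an illustration of the unit-length normalization, the following is a minimal sketch of a k-mer profile computed from a single sequence. It is not the authors' C implementation and covers only the scaling to unit length, not the three normalizations of the paper; the function name `kmer_profile`, the restriction to the alphabet A, C, G, T and the skipping of ambiguous bases are assumptions made for this example.

```python
from itertools import product
import math


def kmer_profile(sequence, k=4):
    """Count overlapping k-mers and scale the count vector to unit length.

    Hypothetical helper for illustration; k=4 is the value the paper
    reports as most promising.
    """
    alphabet = "ACGT"
    # Fixed ordering of all 4**k possible k-mers -> vector positions.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # assumption: k-mers with N etc. are skipped
            counts[pos] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0.0:
        return counts  # no valid k-mer found, return the zero vector
    return [c / norm for c in counts]


profile = kmer_profile("ACGTACGTTTGACCA", k=2)
```

After normalization every profile has Euclidean length 1, so long and short sequences become comparable by their composition rather than by their absolute k-mer counts.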
The H2SOM classifier
To create a descriptive model of the k-mers, a Hyperbolic Self-Organizing Map is used. The Self-Organizing Map is a neural network proposed by Teuvo Kohonen. Many variants have been proposed since, but all share the basic setup, which consists of a set of neurons $(u_i, z_i)_{i=1,\dots,I}$ arranged in a grid, with $z_i$ being the grid coordinate and $u_i$ the attached neural unit, also called the prototype. The architecture of the grid differs between variants.
In the Hyperbolic SOM (HSOM), the algorithm is defined in non-Euclidean space. The Hyperbolic Hierarchical SOM (H2SOM), as used in this paper, introduces a hierarchical grid structure to the hyperbolic version.
In metagenomics, the H2SOM has already been applied for visual exploration and binning. It has been shown that clustering genome data with an HSOM correlates better with the tree-of-life structure than standard SOM clustering.
The number of neural units in the grid of an H2SOM grows exponentially with the number of rings r. This leads to a more trustworthy mapping but dramatically increases the time required to search for the best matching unit (BMU) during training. Leveraging the hierarchical structure of the grid, a beam search is applied to approximate the global BMU in each training step. The search starts with the central neural unit as the initial BMU. For a beam width of w=1, it continues by recursively choosing the BMU among the children of the last winning neural unit until it reaches the current periphery of the grid. The BMU determined for the last ring is an approximation of the global BMU. For values of w>1, the search proceeds in the same way, but the children of w different neural units are searched for the BMU. It has been shown that this strategy accelerates the training significantly while staying close to, or even surpassing, the performance of a global search.
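The beam search described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the grid is modeled as a plain tree in which every node of a ring has children in the next ring, and the class `Node` and the function `beam_search_bmu` are hypothetical names introduced for this example.

```python
import math


class Node:
    """Hypothetical grid node: a prototype vector plus its child nodes."""

    def __init__(self, prototype, children=None):
        self.prototype = prototype
        self.children = children or []


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def beam_search_bmu(root, x, beam_width=1):
    """Approximate the global BMU by descending the hierarchy ring by ring.

    Assumes all leaves lie on the same (outermost) ring.
    """
    beam = [root]  # the search starts at the central unit
    while True:
        candidates = [child for node in beam for child in node.children]
        if not candidates:
            # Periphery reached: the best unit of the last ring is the result.
            return min(beam, key=lambda n: distance(n.prototype, x))
        candidates.sort(key=lambda n: distance(n.prototype, x))
        beam = candidates[:beam_width]  # keep the w best units of this ring


# Tiny two-ring example grid with 2-dimensional prototypes.
a = Node([1.0, 0.0], [Node([2.0, 0.0]), Node([1.0, 1.0])])
b = Node([-1.0, 0.0], [Node([-2.0, 0.0]), Node([-1.0, 1.0])])
root = Node([0.0, 0.0], [a, b])
bmu = beam_search_bmu(root, [1.2, 0.9], beam_width=1)
```

With w=1 only the children of one unit per ring are compared, so the number of distance computations grows with the number of rings instead of the total number of units.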
The H2SOM depends on parameters that need to be optimized: the number of rings r, the spread factor s, the neighborhood adaption modifier n and the learning rate ε. The algorithm is very robust against changes in ε and n, but the parameters determining size and architecture (r, s) are important. Using cross validation, the parameters r=5 and s=8 were determined to create a good descriptive model (see Additional file 2).
Taxonomic labeling of unsupervised neural networks
Here, a special label denotes the rejection class, and α is a threshold value.
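One plausible reading of a threshold-based labeling with a rejection class is sketched below. The rule shown here is an assumption, not the paper's equation: a neural unit receives the majority taxon of the training samples mapped to it if that taxon's share reaches α, and the rejection label otherwise. The names `REJECT` and `purity_label` are hypothetical; α = 0.8 is the threshold reported in the Results.

```python
from collections import Counter

REJECT = "rejected"  # hypothetical name for the rejection class


def purity_label(labels_at_node, alpha=0.8):
    """Return the majority label if its share reaches alpha, else REJECT.

    Assumed rule for illustration; labels_at_node holds the taxon labels
    of all training samples whose best matching unit is this node.
    """
    if not labels_at_node:
        return REJECT  # a unit without any training sample stays unlabeled
    label, count = Counter(labels_at_node).most_common(1)[0]
    purity = count / len(labels_at_node)
    return label if purity >= alpha else REJECT


pure = purity_label(["Euryarchaeota"] * 9 + ["Nitrospirae"])
mixed = purity_label(["Euryarchaeota", "Euryarchaeota", "Nitrospirae"])
```

Raising α makes the labeled units more reliable but sends more units, and therefore more query sequences, to the rejection class.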
An H2SOM labeled in one of the above ways can be used for classification. To assign a sample ξ (a sequence), the profile feature vector $x^{(\xi)}$ is computed, employing the same k-mer normalization strategy as used for labeling the model. For assignment, one of the functions defined in the following is chosen.
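A nearest neighbor assignment against a labeled model can be sketched as below. This is an illustration under assumptions, not the paper's assignment function: the model is reduced to a flat list of prototype vectors with one label per prototype, a global search replaces the beam search, and the names `assign_nearest`, `prototypes` and `node_labels` are hypothetical.

```python
import math


def assign_nearest(x, prototypes, node_labels):
    """Return the label of the prototype closest to x (Euclidean distance)."""
    best = min(range(len(prototypes)),
               key=lambda i: math.dist(x, prototypes[i]))
    return node_labels[best]


# Toy model with three labeled prototypes; "rejected" marks an unspecific unit.
prototypes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
node_labels = ["Euryarchaeota", "Nitrospirae", "rejected"]
label = assign_nearest([0.9, 0.1], prototypes, node_labels)
```

A query that lands on a unit carrying the rejection label is reported as rejected instead of being forced into a taxon.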
For training, sequences exceeding 4 kb from the NCBI full genome database (bacteria/virus, 2014/04/13) were used. A list of GI numbers (http://www.ncbi.nlm.nih.gov/Class/FieldGuide/glossary.html#GI) is provided (see Additional file 3). From these sequence data, four different data sets were generated for model building. For this purpose, the sequences $S^{(\zeta)}$ were cut to different lengths.
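Cutting a genome sequence into fragments of a fixed length can be sketched as follows. How the authors handled overlaps and leftover ends is not stated, so this example assumes consecutive, non-overlapping fragments and drops a shorter remainder; the function name `cut_fragments` is hypothetical.

```python
def cut_fragments(sequence, length):
    """Split a sequence into consecutive, non-overlapping fragments.

    Assumption for illustration: a trailing piece shorter than `length`
    is discarded.
    """
    last_start = len(sequence) - length
    return [sequence[i:i + length] for i in range(0, last_start + 1, length)]


fragments = cut_fragments("ACGTACGTAC", 4)
```

Running this with several values of `length` on the same genomes yields one training set per fragment length.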
Results and discussion
For parameter optimization and model evaluation, a cross validation study (see Additional file 2) was done. The most promising k-mer length was determined to be k=4. For larger values of k, the classification accuracy increased in part, but the speed decreased. The two labeling strategies (Eqs. 7, 8) for building the taxonomic model were combined with the three different classification algorithms (Eqs. 9, 10, 11). A trade-off between correctness of assignment and number of rejections was observed for all six variants. A good balance between assignment correctness, number of rejections and execution speed was obtained using purity voting (Eq. 8) for model construction and nearest neighbor selection (Eq. 9) for taxonomic assignment.
Thus, for the following real-world data set examples, purity voting with a threshold of α=0.8 for labeling (Eq. 8) and the nearest neighbor strategy (Eq. 9) for assignment were used as the most promising settings. For the H2SOM algorithm, an architecture with r=5 rings and s=8 neighbors was chosen.
Acid mine drainage
The Acid Mine Drainage data set was taken at Iron Mountain in California. The community comprises five highly abundant species, namely Ferroplasma Types I and II and a Thermoplasmatales species, all of phylum Euryarchaeota, and Leptospirillum sp. Group I and II of phylum Nitrospirae. The data were obtained from the DOE Joint Genome Institute (http://img.jgi.doe.gov, taxon 2001200000) along with their taxonomic affiliation and consist of 1183 scaffolds with approximately 10 Mb of sequence information.
We compared AKE with similar approaches, including NBC and PhyloPythiaS with generic and sample-specific models. All results were obtained using a model derived from the 15 Kb data set of NCBI genomes mentioned above. We did not explore the possibility of generating a sample-specific model, but expect it to have a similar positive influence as in the cited study. When using the web service, the parameters given above are applied.
Please note that further classification results are provided online within AKE. These include the results of the AMD and cow rumen data sets with classification down to order, as well as a reference composition for these data sets visualized with AKE. Furthermore, the analysis of simulated data sets is provided.
The application is written in Python using a C extension for fast computation. The authors implemented the k-mer counting as well as the H2SOM. The execution times were measured using Python’s time() function. All experiments were repeated ten times, and the mean value is stated below. The machine used is the same web server that serves the results for the web interface. It is a virtual machine running two Intel Xeon E5450 CPUs at 3 GHz with 32 GB main memory, operated by Sun Solaris 10. The application is multi-threaded using 4 threads.
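The measurement protocol described here (ten repetitions, mean of the wall-clock times taken with `time.time()`) can be sketched as follows. The helper `mean_runtime` and the stand-in workload are hypothetical and only illustrate the procedure, not the authors' benchmark code.

```python
import time


def mean_runtime(func, repeats=10):
    """Run func `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.time()
        func()
        durations.append(time.time() - start)
    return sum(durations) / len(durations)


def workload():
    """Stand-in for one classification run (assumption for this example)."""
    return sum(i * i for i in range(10000))


mean_seconds = mean_runtime(workload, repeats=10)
```

Averaging over repeated runs smooths out fluctuations caused by other processes on a shared web server.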
[Table: Execution times of AKE; recoverable labels: (k-mers), Cow rumen (bins), Cow rumen (scaffolds)]
[Table: Performance comparison of PhylopythiaS, NBC, WebCarma and AKE for the AMD data set on phylum level]
The evaluation of different web-based taxonomic classifiers shows that the runtime differs dramatically, from a second (AKE) to minutes (PhylopythiaS), an hour (NBC) and almost a week (WebCarma), due to algorithmic features and implementation details. AKE is faster than the other applications because it only needs to compute the Euclidean distance between the descriptive model and the data to be classified, whereas the others need to compute alignments (WebCarma) or apply decision functions (Phylopythia, NBC). Furthermore, optimized C code and multi-threading accelerate the application. The neural network used is especially suited to generating a hierarchical, compact, descriptive model, which allows fast queries using a beam search to limit the number of Euclidean distance computations. Although there might be methods reported to be equally fast and more accurate, to the authors' knowledge there exists no web-based solution that performs equally well in terms of execution time and accuracy for generic metagenome data. Because accuracy drops significantly for ranks lower than order, we do not report these here; our focus in development lay on acceleration and a dynamic web-based visualization system.
AKE is a fast taxonomic assignment tool for first visual inspection of whole metagenome data sets. Its web-based dynamic visualization allows fast analyses even on low performance computers without installation of software. Furthermore, the web-based approach enables a cooperative analysis of data with colleagues.
Data for the AMD comparison study, except the AKE results, were kindly provided by Kaustubh Patil and Alice McHardy. This work was supported by the German Federal Ministry of Education and Research [grant 01 |H11004 “ENHANCE”] to Daniel Langenkämper. We acknowledge support of the publication fee by Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University.
- Nakao R, Abe T, Nijhof AM, Yamamoto S, Jongejan F, Ikemura T, Sugimoto C: A novel approach, based on BLSOMs (batch learning self-organizing maps), to the microbiome analysis of ticks. ISME J. 2013, 7 (5): 1003-1015. doi:10.1038/ismej.2012.171
- Teeling H, Gloeckner FO: Current opportunities and challenges in microbial metagenome analysis-a bioinformatic perspective. Brief Bioinform. 2012, 13 (6): 728-742. doi:10.1093/bib/bbs039
- Liu Z, DeSantis TZ, Andersen GL, Knight R: Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res. 2008, 36 (18): 120-120. doi:10.1093/nar/gkn491
- Koslicki D, Foucart S, Rosen G: Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics. 2013, 29 (17): 2096-2102. doi:10.1093/bioinformatics/btt336
- Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P: A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008, 72 (4): 557-578. doi:10.1128/MMBR.00009-08
- Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC: Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011, 21 (9): 1552-1560. doi:10.1101/gr.120618.111
- Gerlach W, Jünemann S, Tille F, Goesmann A, Stoye J: WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics. 2009, 10 (1): 430. doi:10.1186/1471-2105-10-430
- Gerlach W, Stoye J: Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011, 39 (14): e91. doi:10.1093/nar/gkr225
- Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008, 9 (1): 386. doi:10.1186/1471-2105-9-386
- McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2006, 4 (1): 63-72. doi:10.1038/nmeth976
- Patil KR, Roune L, McHardy AC: The PhyloPythiaS web server for taxonomic assignment of metagenome sequences. PLoS One. 2011, 7 (6): 38581-38581. doi:10.1371/journal.pone.0038581
- Rosen GL, Reichenberger ER, Rosenfeld AM: NBC: the Naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2010, 27 (1): 127-129. doi:10.1093/bioinformatics/btq619
- Rasheed Z, Rangwala H: Metagenomic taxonomic classification using extreme learning machines. J Bioinform Comput Biol. 2012, 10 (5): 1250015. doi:10.1142/S0219720012500151
- Weber M, Teeling H, Huang S, Waldmann J, Kassabgy M, Fuchs BM, Klindworth A, Klockow C, Wichels A, Gerdts G, Amann R, Glöckner FO: Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics. ISME J. 2010, 5 (5): 918-928. doi:10.1038/ismej.2010.180
- Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009, 6 (9): 673-676. doi:10.1038/nmeth.1358
- Wood D, Salzberg S: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014, 15 (3): 46. doi:10.1186/gb-2014-15-3-r46
- Meinicke P, Aßhauer KP, Lingner T: Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics. 2011, 27 (12): 1618-1624. doi:10.1093/bioinformatics/btr266
- Foerstner KU, von Mering C, Hooper SD, Bork P: Environments shape the nucleotide composition of genomes. EMBO Rep. 2005, 6 (12): 1208-1213. doi:10.1038/sj.embor.7400538
- Karlin S, Mrazek J: Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci U S A. 1997, 94 (19): 10227-10232. doi:10.1073/pnas.94.19.10227
- Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999, 16 (10): 1391-1399. doi:10.1093/oxfordjournals.molbev.a026048
- Martin C, Diaz NN, Ontrup J, Nattkemper TW: Hyperbolic SOM-based clustering of DNA fragment features for taxonomic visualization and classification. Bioinformatics. 2008, 24 (14): 1568-1574. doi:10.1093/bioinformatics/btn257
- Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, Grechkin Y, Ratner A, Jacob B, Huang J, Williams P, Huntemann M, Anderson I, Mavromatis K, Ivanova NN, Kyrpides NC: IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res. 2011, 40 (Database issue): 115-122. doi:10.1093/nar/gkr1044
- Kohonen T: Self-organized formation of topologically correct feature maps. Biol Cybern. 1982, 43 (1): 59-69. doi:10.1007/BF00337288
- Ritter H: Self-organizing maps on non-euclidean spaces. Kohonen Maps. 1999, 73: 97-110. doi:10.1016/B978-044450270-4/50007-3
- Ontrup J, Ritter H: A hierarchically growing hyperbolic self-organizing map for rapid structuring of large data sets. In Proceedings of the 5th Workshop on Self-Organizing Maps, Marie Cottrell (Paris 1 Panthéon-Sorbonne University). Paris (France); 2005.
- Martin C, Diaz NN, Ontrup J: Genome feature exploration using hyperbolic self-organising maps. In 6th international workshop on self-organizing maps WSOM; 2007.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428 (6978): 37-43. doi:10.1038/nature02340
- Hess M, Sczyrba A, Egan R, Kim T-W, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T, Mackie RI, Pennacchio LA, Tringe SG, Visel A, Woyke T, Wang Z, Rubin EM: Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011, 331 (6016): 463-467. doi:10.1126/science.1200387
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007, 4 (6): 495-500.
- Ondov BD, Bergman NH, Phillippy AM: Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2010, 12: 385-385. doi:10.1186/1471-2105-12-385
- Bostock M, Ogievetsky V, Heer J: D3 data-driven documents. IEEE Trans Vis Comput Graph. 2011, 17 (12): 2301-2309. doi:10.1109/TVCG.2011.185
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.