Volume 13 Supplement 2
Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
© Hughes et al.; licensee BioMed Central Ltd. 2012
Published: 13 March 2012
Modern pyrosequencing techniques make it possible to study complex bacterial populations, for example through 16S rRNA gene sequences, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.
Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.
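As a rough illustration of the pairwise step, the sketch below fills a Needleman-Wunsch dynamic-programming table for two sequences and converts the resulting global alignment into a distance. The scoring values (match +1, mismatch -1, linear gap -2), the function names, and the choice of one minus fractional identity as the distance are assumptions made for this example; the text does not state the parameters actually used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (identical columns, alignment length).

    Scoring values are illustrative assumptions, not taken from the source.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns and total columns.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1                     # gap in b
        else:
            j -= 1                     # gap in a
        length += 1
    return identical, length


def nw_distance(a, b):
    """Assumed distance: 1 minus the fraction of identical alignment columns."""
    identical, length = needleman_wunsch(a, b)
    return 1.0 - identical / length if length else 0.0


def distance_matrix(seqs):
    """All pairwise distances; each pair is computed independently of the rest."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = nw_distance(seqs[i], seqs[j])
    return dist
```

Because each matrix entry depends on only one pair of sequences, the pairs can be divided among workers with no communication, which is what makes the computation pleasingly parallel. Under these assumed scores, `nw_distance("ACGT", "AGGT")` returns 0.25 (three identical columns in an alignment of length four).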
This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS.
Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
The continued advancement of pyrosequencing techniques has made it possible for scientists to study complex bacterial populations, for example through 16S rRNA gene sequences, directly from environmental or clinical samples without the need for involved and time-consuming laboratory purification. As a result, raw sequence reads awaiting analysis have accumulated rapidly in recent years, placing an extreme burden on existing software systems. Alignment of sequences across these large data sets (100,000+ sequences) is of particular interest for the purposes of sequence classification and identification of potential gene clusters and families, but such analysis cannot be completed manually and represents a daunting computational task. The aim of this work is the development of an efficient and effective pipeline for clustering large quantities of raw biosequence reads.
One technique often used in sequence clustering is multiple sequence alignment (MSA), which employs heuristic methods in an attempt to determine optimal alignments across an entire sample. However, global pairwise sequence alignment algorithms have previously been reported to identify microbial richness better than MSA techniques in genes with hypervariable regions, such as 16S rRNA, while also offering superior computational scaling. For these studies, genetic distances produced by the Needleman-Wunsch pairwise alignment algorithm were converted to Cartesian coordinates through Multidimensional Scaling (MDS) for the purpose of clustering and visualization.
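To make the conversion from pairwise distances to Cartesian coordinates concrete, the following is a minimal sketch of one common MDS algorithm, SMACOF (stress majorization with unit weights). The choice of SMACOF, the function names, the two-dimensional target space, and the iteration count are assumptions for illustration; the text does not say which MDS variant or settings were used.

```python
import math
import random


def stress(delta, coords):
    """Sum of squared differences between target and embedded distances."""
    n = len(coords)
    return sum((delta[i][j] - math.dist(coords[i], coords[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof(delta, dim=2, iters=200, seed=0):
    """Embed n items in `dim` dimensions from an n x n distance matrix.

    Each iteration applies the Guttman transform for unit weights,
    which never increases the stress.
    """
    rng = random.Random(seed)
    n = len(delta)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        updated = []
        for i in range(n):
            acc = [0.0] * dim
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j])
                ratio = delta[i][j] / d if d > 1e-12 else 0.0
                for c in range(dim):
                    acc[c] += ratio * (coords[i][c] - coords[j][c])
            updated.append([v / n for v in acc])
        coords = updated
    return coords
```

A distance matrix built from pairwise alignments can be passed directly as `delta`. Every iteration touches all pairs of points, so the cost per iteration grows quadratically with the number of sequences; this is the cost that motivates placing most points by interpolation instead.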
Results and discussion
[Configurations shown: full calculation on entire data set; interpolation with 50,000 in-sample and 50,000 out-of-sample sequences; interpolation with 10,000 in-sample and 90,000 out-of-sample sequences.]
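The in-sample and out-of-sample split named in the configurations above can be sketched as follows: the in-sample sequences are embedded with full MDS, and each out-of-sample sequence is then placed using only its distances to in-sample points, whose coordinates stay fixed. The sketch below uses the k nearest in-sample points and an iterative majorization update. The function names, the value of k, the iteration count, and the centroid starting point are assumptions for illustration, not necessarily the implementation used in this work.

```python
import math


def point_stress(x, anchors, deltas):
    """Squared mismatch between target distances and distances from x to anchors."""
    return sum((d - math.dist(x, p)) ** 2 for p, d in zip(anchors, deltas))


def interpolate_point(dists_to_sample, sample_coords, k=3, iters=100):
    """Place one out-of-sample point without moving any in-sample point.

    dists_to_sample[i] is the original-space distance (e.g. an alignment
    distance) from the new point to in-sample point i.
    """
    # Keep only the k in-sample points closest in the original space.
    nearest = sorted(range(len(sample_coords)),
                     key=lambda i: dists_to_sample[i])[:k]
    anchors = [sample_coords[i] for i in nearest]
    deltas = [dists_to_sample[i] for i in nearest]

    dim = len(anchors[0])
    centroid = [sum(p[c] for p in anchors) / len(anchors) for c in range(dim)]
    x = list(centroid)                 # assumed starting guess
    for _ in range(iters):
        step = [0.0] * dim
        for p, delta in zip(anchors, deltas):
            d = math.dist(x, p)
            ratio = delta / d if d > 1e-12 else 0.0
            for c in range(dim):
                step[c] += ratio * (x[c] - p[c])
        # Majorization update: centroid plus the averaged, rescaled offsets.
        x = [centroid[c] + step[c] / len(anchors) for c in range(dim)]
    return x
```

Each out-of-sample point is placed independently of the others and needs distances only to in-sample points, so the all-pairs distance matrix over the full data set is never required. A smaller in-sample set lowers cost further but gives each interpolated point fewer reference positions, which is the trade-off between performance and intra-cluster detail discussed in this section.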
This study demonstrates the effectiveness of combining the Needleman-Wunsch genetic distance algorithm with Multidimensional Scaling (MDS) to enable visual identification of sequence clusters in a large sample of raw reads from the 16S rRNA gene. In addition, the use of interpolative MDS and the Twister Iterative MapReduce runtime provides significant improvement in overall computational throughput while maintaining the basic structure of the resultant sequence space. Further investigation is needed to determine the optimal ratio of in-sample to out-of-sample data set sizes in order to strike the proper balance between performance and intra-cluster detail. Future plans include the study of other genomes and scaling up these studies to cluster millions of sequence reads in the span of a single pipeline run.
List of abbreviations used
rRNA: ribosomal ribonucleic acid.
This study was funded by the NIH and the American Recovery & Reinvestment Act, grant number 5 RC2 HG 005806-02.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 2, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S2
- Sun Y, Cai Y, Liu L, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucl Acids Res. 2009, 37 (10): e76-10.1093/nar/gkp285.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4.
- Qiu X, Fox GC, Yuan H, Bae S, Chrysanthakopoulos G, Nielsen HF: Parallel clustering and dimensional scaling on multicore systems. Invited talk at the 2008 High Performance Computing & Simulation Conference (HPCS 2008) In Conjunction With The 22nd European Conference on Modelling and Simulation (ECMS 2008): 3-6 June 2008; Nicosia, Cyprus. 2008.
- Bae S, Choi J, Qiu J, Fox G: Dimension reduction and visualization of large high-dimensional data via interpolation. Proceedings of ACM HPDC 2010 conference: 20-25 June 2010; Chicago. 2010.
- Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S, Qiu J, Fox G: Twister: a runtime for iterative MapReduce. Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference: 20-25 June 2010; Chicago. 2010.
- Ekanayake J: Architecture and performance of runtime environments for data intensive scalable computing. PhD thesis. 2010, Indiana University, School of Informatics and Computer Science.
- Neethu S, Tang H, Doak T, Ye Y: Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics. Pac Symp Biocomput. 2011: 165-76.
- Ye Y: Identification and quantification of abundant species from pyrosequences of 16S rRNA by consensus alignment. Proceedings of BIBM 2010: 18-21 December 2010; Hong Kong. 2010, 153-157.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.