Constructing position-sensitive frequency vectors
In this section, we present the foundation for the analysis and clustering framework. The basic unit of analysis for our case is a word inside a segment of a sequence. In the case of global frequency analysis used by alignment-free methods [9], a word of length p is referred to as an oligomer or p-mer. In case of DNA sequences, the words are formed from the set of alphabet {A, C, T, G}. In our analysis, we are interested in the distribution of words within specific regions (segments) of a sequence.
Definition 1. Given a DNA sequence × of length L, a segment S
i,l
is defined as a subsequence starting at position i and having length l, where l < L − (i − 1).
Next, we define the distribution of words within a segment.
Definition 2. Given a segment S
i,l
, we define NSV
i,l
= 〈f1, f2, . . . , 〉 as a vector of length 4p where each element f
i
, i = {1, 2, . . . , 4p}, represents the count of a possible p-mer in the segment. This vector represents the word frequency distribution in S
i,l
and is referred to the segment's Numerical Summarization Vectors (NSV).
The set of NSVs for an entire sequence can be created by partitioning it into equal sized segments that may or may not overlap. For example, a sequence of length 1500 base pairs (bp) can be divided into 15 segments each of length 100 without any overlap between them. The word frequency distribution will later be used to find similar segments. Thus, the word size parameter p controls how well the similarity between segments is approximated with larger values leading to better approximation while smaller values lead to faster computation. We find that p = 3, i.e. we count the occurrence of tri-mers within a segment, produces good results while creating NSVs of length 43 or 64. Cutting a sequence into segments and then creating NSVs is illustrated in Figure 1. We currently do not take into account any unknown characters, such as "N" that may be present due to sequencing errors or ambiguous sequencing.
Data stream clustering
After constructing NSVs for the entire set of sequences to be examined, we use high-throughput data stream techniques [17] to cluster similar segments. We consider each NSV as a data point in a stream of consecutive NSVs. The clustering algorithm then adds one NSV after the other to the cluster model by adding it to an existing cluster if it is within a user defined clustering threshold from its center or otherwise creating a new cluster with the NSV as its first member. This idea is illustrated in Figure 1. In addition to the clusters we also retain order information in the form of a directed graph (shown in Figure 1). The exact clustering procedure is discussed in [18].
Definition 3. A GenModel M is defined as a directed graph where the vertices are the set of clusters of NSVs and the edges E represent the ordering of the NSVs in the sequences. Each cluster contains metadata such as location in the original sequence and the sequence IDs.
We can now reframe the problem of alignment as the problem of finding similar subsequences (segments) via clustering.
Definition 4. Each cluster C in a GenModel M represents a set of similar segments and is referred to as a quasi-alignment and the segments are said to be quasi-aligned.
By using clustering, we avoid the expensive alignment process and yet obtain information about local similarity between multiple sequences. The data stream clustering algorithm makes just one single pass through the segments and thus it has a linear time complexity in terms of the length and number of sequences. Also, new sequences can be added very efficiently to an existing model.
Similarity measures for clustering
Clustering algorithms use similarity measures for comparing individual data points. In our case, the sequence data is converted into fixed dimension NSVs representing the word frequency distribution within segments. NSVs can be compared using standard measures for vectors such as Manhattan distance, Euclidean distance, squared Euclidean distance, Kullback-Leibler discrepancy and Mahalanobis distance [9]. They can also be compared using other measures such as the number of shared words between two segments. For example, Simrank [19] compares the number of matching p-mers (typically with p = 7) for fast sequence search.
The distance between NSVs can be related back to the difference between the original sequence strings also. The difference between two sequences is measured in terms of edit distance [20], which is the minimum number of point mutations required to change one sequence into another. A point mutation can be an insertion, deletion, or substitution. Ukkonen [21] has proposed that the edit distance between two strings can be approximated by the Manhattan distance between their q-gram profiles (which in our case will be the p-mer profile). The Manhattan distance between two frequency vectors x and y obtained from the segments s
x
and s
y
is defined as:
(1)
The Manhattan distance simply computes the number of p-mers that are different between the two sequences. It can be shown that the Manhattan distance gives a lower bound for the edit distance between the two segments.
(2)
The reasoning behind the bound above is that any insertion, deletion, or substitution in the segment will create at most p new p-mers and destroy p existing p-mers. Note that in theory it is possible to create two completely different sequences with the same q-gram profile (see [21]), however, this is very unlikely if we deal with biological sequences which are expected to have a certain degree of similarity (e.g., caused by conserved regions or homology).
The relationship in equation 2 can be used to determine a reasonable clustering threshold for the data stream clustering algorithm in [18] for a given word size p. For example, we often use a segment size of 100 bases with 3-mers and a Manhattan clustering threshold of 30. Equation 2 shows that this threshold means that the edit distance between two segments needs to be at least 30/6 = 5 to put them into two separate clusters. Note also that position specific p-mer frequency clustering is not restricted to using Manhattan distance, it can be used with any proximity measure defined on the frequency counts in NSVs.
Discovery of conserved regions from GenModels
GenModels provide vital information about the similarity between segments in the form of quasi-alignments. This allows us to identify regions that are highly similar across multiple sequences. For sequences related by evolution, such as those from the same taxonomic unit, these segments are known as conserved regions.They are likely to be responsible for a particular function or provide a needed structural characteristic.
As an illustration, Figure 2 shows a GenModel created from the 16S rRNA sequences belonging to the phylum GN06. We used the unaligned version of the sequences from the Greengenes database [22]. The available 13 sequences range in length between 1374 and 1525 bases. For building the model, these sequences were broken down into non-overlapping segments of size l = 100 bases each, which were then aggregated with 3-mers (p = 3) and clustered using Manhattan distance and a clustering threshold of 30. The resulting GenModel contains 54 clusters or quasi-alignments. The plot shows each of the quasi-alignment as circles uniquely identified by an id. The circle size is proportional to the size of the clusters (i.e., number of segments participating in the quasi-alignment) and arrows represent the direction of the transitions between them. A stronger arrow indicates that the transition occurs with a higher probability. For example, Figure 2 shows that one of the common transition paths is the quasi-alignment sequence 26 → 27 → 28 → 4 → . . . → 12 → 13 → 14 indicating that a large fraction of the sequences share some common sequence segments. In addition the plot shows that almost all sequences go through a few quasi-alignments (e.g., 4, 6, 10 and 14) which represent candidates for regions that may be highly conserved in the set of analyzed sequences. The sequences share a common higher level taxonomy (phylum) and thus we expect relatively stronger quasi-alignments as compared to random sequences. Interesting in Figure 2 is the almost completely separate path of smaller clusters starting with quasi-alignment number 16. This indicates that a single or a few sequences are significantly different from the majority of the sequences in the set. This might be due to several reasons, such as mis-classification of the sequence, or a sequencing error whereby some of the initial bases may have been removed.
In Figure 3, we visualize the largest quasi-alignments found in the GenModel along the approximately 1500 bases (x-axis). The top part of Figure 3 shows the segments grouped into the 5 most popular quasi-alignments as red horizontal lines. In this model, all red horizontal lines are exactly 100 bases long because a segment length of 100 was used. The segments that are part of the same quasi-alignment are joined by vertical dotted lines and the cluster id from Figure 2 is shown on top. We see that the well preserved segments are found in quasi-alignment 4, 6, 31, 10 and 14 which correspond to the largest clusters in the model where almost all sequences converge in Figure 2. The lower part of Figure 3 shows a measure of consensus for each segment i.e. proportion of sequences clustered into the most popular quasi-alignment. For example, it shows that all of the sequences in the nucleotide region 900-1000 converge in quasi-alignment 10. Similarly, all except one sequence converge in quasi-alignment 4, 6 and 14 for the nucleotide regions 300-400, 500-600 and 1300-1400, respectively. This is an indication that these segments are highly similar and could be conserved regions of the sequences.
To validate our claim that the segments that are clustered together into a quasi-alignment are indeed highly similar, we performed traditional Multiple Sequence Alignment (MSA) on the segments that are part of quasi-alignment 10. We used Clustal [8] available through the software JalView [23, 24] to perform MSA and visualize the alignment in Figure 4. The results show that the segments have an average pairwise alignment score of 0.94 (out of a maximum possible of 1.00) with a large majority of segments being almost identical and having 100% pairwise alignment. Figure 4 and the MSA results indicate that the segments in quasi-alignment 10 are indeed very similar. We have performed a similar analysis on the other quasi-alignment shown in Figure 3 and verified that the segments have a high degree of base-wise identity.
It is interesting to note that in Figure 4 some sequences have bases that are "shifted" by a certain amount. For example, the first sequence shown with id 159470, has its bases shifted to the right by 29 bases. This can be the result of insertions/deletions (indels) in the sequences due to evolutionary processes. If this offset becomes too large, then it can interfere with clustering segments. This problem can be removed by using overlapping segments, i.e., considering many or all possible offsets. This increases the time complexity in the worst case by a fixed factor of l (segment length). It also makes makes visualizing quasi-alignments more complicated and therefore we will restrict the discussion in this paper to non-overlapping segment.
Further validation of the location of the found conserved regions can be obtained by looking at biological evidence available in the literature. Studies have reported that 16S rRNA contains regions that are highly conserved within each species, but variable between species. These regions are known as hypervariable regions [25, 26], which are characteristic for each species, and have applications such as PCR amplification using universal primers [25]. It has also been reported that hypervariable regions are flanked on both sides by regions that are highly conserved across multiple species [27, 28]. These flanking regions are conserved even for sequences exhibiting wide genomic diversity such as environmental or biological samples.
Nine identified hypervariable regions in 16S rRNA consist of nucleotides number 69-99, 137-242, 433-497, 576-682, 822-879, 986-1043, 1117-1173, 1243-1294 and 1435-1465, and are denoted by V1 through V9, respectively [25]. The sequence data of the phylum GN06 contains 13 sequences potentially from multiple species, which in the data have not been identified and hence are coded as unknown in the Greengenes database. Since the data contains several species, we would expect greater variations in the hypervariable regions than in the flanking, preserved regions. The top part of Figure 3 shows the 5 largest quasi-alignments (clusters) found for the GenModel for the sequences from GN06. The positions of the hypervariable regions are shown as blue lines labeled V1 through V9 at the top of the plot. It is very clear that our algorithm identifies regions that flank the hypervariable regions. For example, quasi-alignment 10 covers the nucleotide bases between the hypervariable regions V5 and V6. Similarly, quasi-alignment 4, 6, 31, and 14 cover the bases between hypervariable regions V2, V3, V4, V5, V6, V8 and V9. The plot in the lower part of Figure 3 also confirms this finding. The peaks of the plots indicate those segments that have a high consensus i.e. a strong quasi-alignment. The region between bases 900 and 1000 share a perfect consensus, i.e., all the segments belong to the same quasi-alignment. As discussed earlier, this area lies between hypervariable regions and is thus is expected to be more conserved for a sample containing multiple species.
Implementation details
We have implemented an open source software package using the R framework called QuasiAlign which can be downloaded from http://r-forge.r-project.org/projects/mmsa/. This package has methods to quickly load a large set of sequence files, that can be in FASTA format with Greengenes [22] annotations, into a relational database and can be used to easily filter sequences belonging to any taxonomic rank. This package is built on top of a number of other packages including Biostrings [29] for handling sequences, and the data stream clustering package rEMM [30, 31]. It provides a complete interface for managing sequences, creating word frequencies distributions (NSVs) and creating and analyzing GenModels. Several other useful functions, such as those for metagenomic classification, are also available. More details can be obtained from the package documentation [32].
For the analysis in this paper, we have used the default parameters for creating NSVs and GenModels. The parameters for NSVs are a segment length of l = 100 bases with no overlap between and segments and a word size of p = 3. For creating GenModels, the default is Manhattan distance with a clustering threshold distance of 30 which requires a minimum edit distance of 5 between two segments to place the into separate clusters (see equation 2 above).
Large scale experiments
Position sensitive p-mer clustering can work with a set of DNA/RNA sequences or even fragments of sequences from any source. This is a valuable asset for metagenomic analysis and also fits in nicely with the requirements of Next Generation Sequencing methods. All experiments in this paper can be reproduced using the QuasiAlign [32] package.
Dataset used
The method presented here for discovering conserved regions is general enough to be applied to any set of sequences. For this analysis, we used the more than 400,000 16S rRNA sequences obtained from the Greengenes project [22]. The 16S gene is widely used for phylogenetic analysis as it is highly conserved for different species of bacteria and archaea. The sequences of this gene have remained more or less constant over time and evolutionary cycles. Further, it contains several distinct regions, known as hypervariable regions, that are very specific and unique for each individual species [25] and are widespread used for sequence identification and classification. The package QuasiAlign allows us to directly import FASTA sequences with Greengenes annotations into a relational database for further analysis.