- Methodology article
- Open Access

# A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

- David J Russell
^{1}Email author, - Samuel F Way
^{1}, - Andrew K Benson
^{2, 3}and - Khalid Sayood
^{1}

**11**:601

https://doi.org/10.1186/1471-2105-11-601

© Russell et al; licensee BioMed Central Ltd. 2010

**Received:**27 May 2010**Accepted:**17 December 2010**Published:**17 December 2010

## Abstract

### Background

We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.

### Results

The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

### Conclusions

We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

## Keywords

- Basis Sequence
- Representative Sequence
- Suffix Tree
- FASTA File
- Jaccard Coefficient

## Background

The amount of biological information being gathered is growing faster than the rate at which it can be analyzed. Data clustering, which compresses the problem space by reducing redundancy, is one viable tool for managing the explosive growth of data. In general, clustering algorithms are designed to operate on a large set of related values, eventually generating a smaller set of elements that represent groups of similar data points. A central data element may then be used as the sole representative of a group.

Significant clustering work relating to bioinformatics may be traced to the late 1990 s when methods for quick generation of nonredundant (NR) protein databases were developed. These combined identical or nearly identical protein sequences into single entries [1–3]. The primary benefits of these methods include faster searches of the NR protein databases and reduced statistical bias in the query results [1]. Similarly, computer programs such as those in ICAtools [4] were developed for compressing DNA databases by removing redundant sequences found via clustering resulting in faster database queries. Note that the use of the term "clustering" in these applications differs from another use often found in the literature where clustering refers to generating a phylogenetic distance matrix, such as in [5]. The operation of clustering used in this work identifies groups of sequences related by phylogeny; and it additionally applies to redundancy removal by identifying a sequence that suitably represents similar sequences.

Recently, DNA/RNA clustering has attracted attention for a variety of reasons. The drive to lower the expense of genome sequencing has led to the development of high-throughput sequencing technologies capable of generating millions of sequence fragments simultaneously. A clustering preprocessing step can be used to remove a great amount of fragment redundancy which, in turn, allows for quicker fragment reassembly.

One of the more popular DNA/RNA clustering algorithms is CD-HIT-EST [6] which was based on the protein clustering methods of [2, 3] and was developed for clustering DNA/RNA database data such as non-intron-containing expressed sequence tags (ESTs).

A major application of CD-HIT has been for clustering large data sets from microbiota analysis (e.g. [7]), often as a preprocessing step to create sets of highly related sequences representing operational taxonomic units (OTUs). These OTUs are subsequently used as a basis for estimating species diversity between treatment groups or quantitative relationships of taxa between treatment groups. Alternatively, representative sequences from the OTUs are used for phylogeny-based analyses.

A recent effort in [8] to develop software tools which reduce the time required by BLAST [9] to search large biological databases has resulted in a set of programs, including UBLAST and USEARCH, that reduce the search time by orders of magnitude. As part of the work, an additional clustering program called UCLUST was created which utilizes the heuristic algorithm provided by USEARCH. UCLUST generates results that dramatically improve upon the time required by CD-HIT.

This work presents GramCluster, a fast and accurate algorithm for clustering large data sets of 16S rDNA sequences based on the inherent grammar of DNA and RNA sequences. Lempel-Ziv parsing [10] is used to estimate the grammar of each sequence to provide a distance metric among sequences. The implementation of this algorithm allows for fast and accurate clustering of biological information. The following sections describe the algorithm and present results, including comparisons with the CD-HIT-EST algorithm and the recently developed UCLUST algorithm.

## Results and Discussion

### Grammar

Necessary concepts for understanding how a grammar model is specified are briefly reviewed in this section. An *alphabet*, Σ, is a finite, nonempty set of symbols from which finite-length sequences, or *strings*, are formed. Strings are constructed via the binary operation of *concatenation* which begins with a copy of the left string and appends a copy of the right string. A *language*, *L*, is then defined as a subset of strings selected from the set of all strings over an alphabet, and a *problem* is defined as the question of deciding whether a given string is a member of some particular language. That is, given a string, *w* ∈ Σ*, and *L*, a language over Σ, decide if *w* ∈ *L*.

As *L* may be infinite, it is useful to have a compact description of the strings in *L*. Such an abstract model is called a *grammar*, *G*. Typically, a grammar is specified by the 4-tuple *G* = (*V*, *T*, *P*, *S*), where *V* is the set of *variables* and *T* is the set of *terminals* which are symbols that form the strings of *L*. *P* is the set of *productions*, each of which represent the recursive definition of *L*; and *S* ∈ *V* is the *start symbol*, which is the variable that defines *L*. Each production consists of a *head* variable followed by the production operator → and a *body* string of zero or more terminals and variables. Each production represents one way to form strings in *L* from the head variable.

*G*= (

*V*,

*T*,

*P*,

*S*), the language,

*L*, is defined by

That is, *L*(*G*) is the set of all strings derived from *S*.

It was observed in [11], that a grammar, *G*, used to model a string can be converted to an LZ77 representation in a simple way. The term LZ77 refers to Lempel-Ziv dictionary-based lossless compression detailed in [10] and [12]. Subsequently, an algorithm was presented in [13] to use an inverted process to map an LZ77-compressed sequence into a grammar. While the inverted process is more involved, it demonstrates the fact that Lempel-Ziv compression can be thought of as inferring a grammar from the sequence it compresses. The original concept behind abstract grammars is that a grammar, *G*, is meant to completely describe the underlying structure of a corpus of sequences. Because most naturally occurring sequences contain repetition and redundancy, grammars are often able to describe sequences efficiently.

### Algorithm

*S*, is regarded as input to the algorithm with

*S*= {

*s*

_{1},...,

*s*

_{ N }}, where

*s*

_{ i }is the

*i*th sequence and

*i*∈ {1,...,

*N*}. The goal of the algorithm is to partition

*S*where each sequence is grouped with similar sequences from

*S*such that all sequences within each resulting cluster are more similar to each other than sequences from other clusters. The final partition is represented by the set of clusters,

*C*= {

*c*

_{1},...,

*c*

_{ M }}, where

*c*

_{ j }is the

*j*th cluster and

*j*∈ {1,...,

*M*}. The algorithm initially generates a suffix tree,

*t*

_{ i }, and grammar dictionary,

*d*

_{ i }, associated with each sequence,

*s*

_{ i }. For each sequence,

*s*

_{ i }, these data structures are used to determine if an existing cluster contains sufficiently similar sequences to

*s*

_{ i }or if a new cluster needs to be created. If a cluster,

*c*

_{ j }∈

*C*, already exists with similar sequences, the sequence

*s*

_{ i }is added to

*c*

_{ j }. However, if no cluster contains similar sequences, a new cluster containing only

*s*

_{ i }is added to

*C*. This clustering continues for all sequences in

*S*. The algorithm is described in more detail below with reference to the various blocks in Figure 1.

#### Dictionary Creation

One of the core processes of the clustering algorithm is the formation of a distance estimate between an unprocessed sequence, *s*_{
i
}, and each cluster, *c*_{
j
}, already in the partition, *C*. To this end, one sequence, called the representative sequence, is used to represent all other sequences within each cluster. The distance between *s*_{
i
} and ${s}_{{r}_{j}}\in {c}_{j}$, where ${s}_{{r}_{j}}$ represents *c*_{
j
}, is used to determine if *s*_{
i
} should be added to *c*_{
j
}.

Each sequence, *s*_{
i
}, is compared with, at most, the set of representative sequences, $\{{s}_{{r}_{j}}|{s}_{{r}_{j}}\text{represents}{c}_{j}\in C\}$, to discover the correct cluster for *s*_{
i
}.

*f*

^{1}=

*s*

_{ i }(1), is set to the first residue of the corresponding sequence, and only the first element,

*s*

_{ i }(1), is visible to the algorithm. At the

*k*th iteration of the procedure, the

*k*th residue is appended to the fragment resulting from the (

*k -*1)th step; and the visible sequence is checked. If

*f*

^{ k }∉

*s*

_{ i }(1,...,

*k -*1), then

*f*

^{ k }is considered a new rule and so added to the dictionary, ${d}_{i}^{k}={d}_{i}^{k-1}\int \left\{{f}^{k}\right\}$; and the fragment is reset, ${f}^{k}=\varnothing $. However, if

*f*

^{ k }∈

*s*

_{ i }(1,...,

*k -*1), then the current dictionary contains enough rules to reproduce the current fragment, i.e., ${d}_{i}^{k}={d}_{i}^{k-1}$. In either case, the iteration completes by appending the

*k*th residue to the visible sequence.

This procedure continues until the visible sequence is equal to the entire sequence, at which time the size of the dictionary, |*d*_{
i
}*|* is determined for use in the metric calculation. The correlation of the LZ-based distance with phylogenetic distance was exploited in [16] to obtain phylogenies for a set of mammalian species using complete mitochondrial DNA and for the superfamily *Cavioidea* using exon#10 of the growth hormone receptor (GHR) gene, the transthyretin (TTH) gene, and the 12 S rRNA gene. In [19], the same distance metric was used to obtain phylogenies for fungal species using the cytochrome b gene and internal transcribed spacer regions of the rDNA gene complex.

#### Suffix Tree Construction

*L*suffix substrings of a length-

*L*sequence [20–22]. For example, a suffix tree for the sequence "gagacat" is schematically shown in Figure 3. All seven suffixes {gagacat, agacat, gacat, acat, cat, at, t} are found by tracing a unique path from the root node to one of the seven leaf nodes along solid lines. One valuable use of suffix trees is searching for substrings which can be thought of as the preffix of a suffix. By using a suffix tree, a length-

*L*sequence can be completely scanned for a length-

*F*fragment in $\mathcal{O}$(

*F*) time as opposed to $\mathcal{O}$ (

*L*) for a brute force search. Also depicted in Figure 3 are the dashed-line suffix links which are a fundamental feature for linear-time construction of the suffix tree [22].

A sequence, *s*_{
i
} , can be converted into a suffix tree, *t*_{
i
} , in linear time and then searched for substrings in linear time based on the fragment length. As will be shown, suffix tree sequence representation is important for reducing the time required for GramCluster to complete all necessary grammar-based comparisons.

#### Clustering

*c*

_{ j }∈

*C*, until a cluster is found where the distance between the representative sequence, ${s}_{{r}_{j}}$ and

*s*

_{ i },${D}_{j}=\text{dist}({s}_{i},{s}_{r})$ is less than a user-defined threshold,

*T*. Once this condition is met, the cluster is updated,

*c*

_{ j }=

*c*

_{ j }∪ {

*s*

_{ i }}; and processing in this block terminates. If no clusters meet the condition of

*D < T*, a new cluster is created with

*s*

_{ i }as its first member.

The following sections describe the cluster data structure, the representative sequence selection method, and the grammar-based distance calculation.

#### Cluster Data Structure

In order to follow the cluster classification process, it is helpful to understand the data structure used to represent each cluster. In particular, every cluster uses a list of suffix trees, *t*_{
i
} , and dictionary sizes, |*d*_{
i
} |, to identify its set of sequences. The remaining components contained in the data structure are used to determine and specify the representative sequence, ${s}_{{r}_{j}}$, of the cluster, *c*_{
j
} . A good selection for ${s}_{{r}_{j}}$ is a sequence that appears grammatically similar to all other sequences within the cluster. This implies the need to estimate the grammar-based distance between all sequences of the cluster, a computationally expensive task. To avoid this cost, GramCluster selects only a few specific sequences in the cluster, that we will call "basis sequences," to which all others are compared. The representative sequence, ${s}_{{r}_{j}}$, can be determined by considering the sets of relative distances between all sequences and each basis sequence. The centroid of the cluster is then defined as the vector containing the mean values of each set of relative distances. The sequence with relative distances nearest to the centroid is selected as ${s}_{{r}_{j}}$.

*Acetobacter*,

*Achromobacter*,

*Borrelia*,

*Flavobacterium*). Of the two initially selected basis sequences, one came from

*Acetobacter*and the second from

*Flavobacterium*. Then, the pair of distances between each sequence and the basis sequences was computed and plotted. As can be seen from the plot, the sequences group into clusters which correspond to their genus. Note that the basis sequences are not orthogonal; however, use is made of the fact that the grammar-based distances tend to obey the transitive property such that if

*D*

_{ b }is close to

*D*

_{ c }, then

*s*

_{ b }and

*s*

_{ c }tend to be grammatically similar to each other. The example in Figure 5 demonstrates this by the use of basis sequences from

*Acetobacter*(genus one) and

*Flavobacterium*(genus four). One would expect that comparing all sequences to one sequence would provide separation between the sequences from the same genus as the basis sequence and the rest. However, sequences from the other genera also form into clusters as a result of sequences being compared to a single basis sequence. In our example, all forty sequences are compared to just two sequences; and four clear clusters appear. The method presented here for building vectors of distances relative to basis sequences is similar to the concept of embedding presented in [5]. The work of [5] details an algorithm called mBed that operates on a set of sequences to generate a distance matrix representing a phylogenetic guide tree, a process that is closely related to the data clustering problem presented here. The mBed algorithm selects a subset of

*t*seed reference sequences that are not close together relative to a distance metric. Then each sequence has a

*t*-dimensional vector associated with it where each coordinate value is the distance between the sequence and the respective reference sequence. The distance used in [5] was selected to be the

*k*-tuple distance measure of [23] and implemented in ClustalW [24]. The basis sequence concept used in this work is similar, with the grammar-based distance metric replacing the

*k*-tuple distance measure being the primary difference. Additionally, a single reference subset is used in [5] to build all vectors. The algorithm presented here creates vectors for each sequence contained in a cluster relative to basis sequences also sampled from the same cluster.

#### Representative Sequence Selection

*s*

_{ i }to the representative sequence of cluster

*c*

_{ j }∈

*C*. For clusters containing many sequences, a representative sequence is determined using the basis sequence method described above. In this case, only the representative sequence, ${s}_{{r}_{j}}$, is compared to

*s*

_{ i }

*s*

_{ i }

The minimum distance, $\underset{k}{\mathrm{min}}\left\{{D}_{k}\right\}$, is used as the classification metric.

#### Grammar-Based Distance Calculation

The distance metric used in GramCluster is a modified form of the grammar-based distance metric introduced in [16, 18] and used in [17].

The original distance metric is computed by concatenating the two sequences being compared into a single sequence and then performing the operations detailed in Figure 2. Formally, consider the process of comparing sequences *s*_{
m
} and *s*_{
n
} . Initially, the dictionary, ${d}_{m,n}^{1}={d}_{m}$, is set to that of sequence *s*_{
m
} , a fragment, *f*^{1} = *s*_{
n
} (1), is set to the first residue of the *n* th sequence, and the visible sequence is all of *s*_{
m
} .

*d*

_{m,n}|. When complete, more grammatically similar sequences will have a new dictionary size with fewer entries as compared to sequences that are less grammatically similar. Therefore, the size of the new dictionary, |

*d*

_{m,n}|, will be close to the size of the original dictionary, |

*d*

_{ m }

*|*. The distance between the sequences is estimated using the dictionary sizes, in particular

This particular metric accounts for differences in sequence lengths and normalizes accordingly. Smaller values of *D* indicate a stronger similarity. Intuitively, sequences with a similar grammar should be clustered with each other.

While this grammar-based distance metric works well, it requires that the extended sequence be rescanned for every residue in the second sequence. This means that *s*_{
m
} will be rescanned completely for every character in *s*_{
n
} . This process is repeated as many times as the number of sequences compared to *s*_{
m
} . As a result, approximately 75% of the computation is devoted to string searching and concatenation. To improve the execution time, we introduce two significant modifications described below.

#### Fragment Markers

*k*th character in the second sequence, the first sequence is completely scanned along with the initial

*k*- 1 portion of the second sequence. However, this is quite unnecessary since many fragments formed from the second sequence were already found in the second sequence during the initial scan. Formally, consider sequences

*s*

_{ m }and

*s*

_{ n }which have already had their own dictionaries created in a previous step. Now suppose the concatenated sequence

*s*

_{m·n}is being processed for the

*k*th character in

*s*

_{ n }, at which point there is a nonempty fragment,

*f*

^{ k }. The process begins with the fragment completely composed of consecutive letters from

*s*

_{ n }, which means that this fragment has already been created once before when

*s*

_{ n }was processed by itself. As long as

*f*

^{ k }was previously found within

*s*

_{ n }(1,...,

*k -*1), there will be no new information gained by scanning

*s*

_{ m·n }(1,...,

*|s*

_{ m }

*|*+

*k -*1), because it is certain to be there since

*s*

_{ n }(1,...,

*k -*1) ⊂

*s*

_{ m·n }(1,...,

*|s*

_{ m }

*|*+

*k -*1). So, there is no need to scan for fragments that have been previously found during any distance calculation. The inverse statement is also true: fragments not previously found do need to be scanned for during a distance calculation. This is implemented as shown in Figure 6 in which fragment

*f*

^{ k }∉

*= s*

_{ i }(1,...,

*k -*1), so

*k*is added to a list of marked fragment indices.

The same distance metric given by (1) is used, but there is no longer a need to perform string concatenation; and only the first string is scanned for the marked fragments from the second string. Formally, consider the process of comparing sequences *s*_{
m
} and *s*_{
n
} . Initially, the dictionary, ${d}_{m,n}^{1}={d}_{m}$, is set to that of sequence *s*_{
m
} , a fragment, *f*^{marked(1)}, is set to the first marked substring of the *n* th sequence, and the visible sequence is always just *s*_{
m
} . The algorithm simply scans *s*_{
m
} for an occurrence of the fragment and adds one to the dictionary if the fragment is not found. Either way, the fragment is updated to the next marked substring of *s*_{
n
} ; and *s*_{
m
} is scanned again. This continues for all marked fragments from *s*_{
n
} resulting in a new dictionary size, | *d*_{
m,n
} |. This fragment marking process significantly reduces the total number of substring searches performed, as well as the character concatenations that would be otherwise required.

The second optimization involves a time-efficient method of searching a string for a substring of characters, a very relevant problem for suffix trees.

#### Suffix Tree Searches

As stated previously, a length-*L* sequence stored in a suffix tree data structure can be completely scanned for a length-*F* fragment in $\mathcal{O}$(*F*) time. To see why this is true, consider the simple example depicted in Figure 3. Every suffix is represented in the data structure as a unique path beginning at the root node and traversing along a solid line to a leaf node. Any substring occurring in this string has to be the start of a suffix, so searching for a substring amounts to finding a suffix that begins with the substring. Consider searching "gagacat" for the substring fragment "gac" which is present in the string. The first step is to find a branch beginning with "g" leaving the root, which is found as the third entry in the data structure.

Following the branch to the internal node indicates that all suffixes in this tree that begin with "g" are always followed by an "a," which is also true of the fragment. At the internal node, the next step is to search for any branch that begins with "c," which is found as the second entry in the data structure, concluding the search. Next, consider searching for the substring fragment "gact," which follows the previous search to the internal node and includes identifying the branch beginning with "c." The final step is looking at the subsequent character along the branch, which is "a," and does not match. This search finishes having determined that "gact" is not a substring of "gagacat." The use of the suffix tree in this context means that the time necessary for identifying whether previously marked fragments from sequence *s*_{
n
} are present in sequence *s*_{
m
} is $\mathcal{O}$(*F* ).

#### Algorithm Complexity

The algorithm complexity of GramCluster may be broken into three pieces, beginning with the generation of each sequence grammar dictionary, *d*_{
i
} for *i* ∈ {1, *..., N*}, where *N* is the number of sequences. Suppose the average sequence length is *L*, then each *d*_{
i
} results in complexity $\mathcal{O}$ (*L*), so all dictionaries are generated with complexity $\mathcal{O}$ (*LN*). Next, each suffix tree, *t*_{
i
} , has a complexity $\mathcal{O}$ (*L*^{2}), so all sequences are converted into trees with complexity $\mathcal{O}$(*L*^{2}*N*). Finally, suppose the average number of clusters is *M*. As an upper bound, all clusters are scanned until each sequence is classified and each scanning process has complexity $\mathcal{O}$(*L*). The result is a total scanning complexity of $\mathcal{O}$(*LMN*). Thus, the entire time complexity for GramCluster is $\mathcal{O}$(*LN* + *L*^{2}*N* + *LMN*), which simplifies to $\mathcal{O}$(*L*^{2}*N* + *LMN*).

Regarding the memory complexity of GramCluster and continuing with *N* as the number of sequences, suppose the average sequence header length in the FASTA file is *H*. Because every header line is stored for subsequent file output, this memory complexity is $\mathcal{O}$(*HN*). As before, if the average sequence length is *L*, then each sequence is stored in $\mathcal{O}$(*L*). The worst-case memory usage for the clusters themselves occurs if every cluster created has an incomplete set of basis sequences. In this case, each cluster has a memory complexity of $\mathcal{O}$(*C* + *B* + *BC* + *LC*) where *C* is the number of sequences held within the cluster and *B* is the number of basis sequences per cluster. Because there are *N* sequences stored in memory during this worst-case scenario, a final upper bound on the memory complexity is $\mathcal{O}$((*H* + *B* + *L*)*N*) in which the most significant component has a memory complexity of $\mathcal{O}$(*LN*).

### Testing

We performed several clustering experiments to validate the proposed algorithm, GramCluster version 1.3 (see Additional File 1). The training procedures for obtaining the default parameters are described in the Methods section. In particular, we used GramCluster to cluster sets of 16S rDNA sequences. As detailed in the Methods section, the resulting clusters were analyzed for correctness whereby the genus of each sequence was compared to that of all other sequences in the data set. Correct classification is considered when sequences belonging to the same genus fall into the same cluster. Likewise, incorrect classification occurs when sequences belonging to different genera are placed into the same cluster. Each output set was analyzed using several statistical quality metrics described in the Methods section. For comparison, CD-HIT-EST (no version given, archive created on 4/27/2009) [6] and UCLUST version 3.0.617 [8] were also used to cluster the same 16S rDNA sequences and analyzed using the same quality metrics.

#### Experiments with Moderate-Sized Data Set

Results indicate that CD-HIT-EST achieved 17.5% in-cluster classification and 99.7% sequence differentiation out of the 2,050 total clusters determined. That is, for sequences that were supposed to be in the same cluster, CD-HIT-EST placed them together 17.5% of the time; and for sequences that were not supposed to be in the same cluster, it correctly kept them in different clusters 99.7% of the time. Improved results for UCLUST show 30.4% and 99.8% in-cluster classification and sequence differentiation out of the 1,680 total clusters determined. By comparison, GramCluster achieved 84.5% in-cluster classification and 99.0% sequence differentiation out of the 2,447 total clusters identified. Clearly, GramCluster provides a significant improvement in clustering sequences correctly. This improvement can be further observed using common statistical measures for evaluating the performance of clustering algorithms [25] described in the Methods section. These measures are shown for GramCluster, CD-HIT-EST, and UCLUST operating on a set of 74,709 16S rDNA genes obtained from 2,255 different genera. The Jaccard Coefficient and Folkes and Mallows Index exceed those of CD-HIT-EST four-fold and over two-fold, respectively. The CPU execution time of GramCluster (1342 seconds) is on the same order as that of CD-HIT-EST (8277 seconds), which is considered ultra-fast [26]. The UCLUST CPU execution time (89 seconds) is much faster than GramCluster, however its quality metrics fall significantly short of those provided by GramCluster.

#### Experiments with Large Data Set

#### Varying Command Line Options

Next, we consider the effect of varying the command line options primarily responsible for affecting the resulting data set partition. We ran two additional clustering experiments on the original set of sequences with GramCluster and UCLUST. The GramCluster experiments had both grammar-based distance thresholds altered from the default setting of 0.13 to 0.15 and 0.11. Similarly, the UCLUST experiments had the identity threshold altered from the default setting of 90% to 85% and 95%.

#### Experiments Clustering on Species

The final experiment operated on the original set of sequences, but the partitioning was based on the sequence species instead of their genus.

## Conclusions

The primary goal of this work was to introduce a computationally efficient clustering algorithm which can be used for clustering large datasets with high accuracy. The algorithm introduced was validated against a specific class of datasets containing 16S rDNA sequences but was designed to cluster any set of RNA, DNA, or protein sequences. The grammar-based distance work introduced in [16, 18] and previously used in [17] was modified to generate an estimation of the proper classification in which sequences are to be grouped. Results from clusters generated were presented in an attempt to study the overall quality of the resultant classifications as well as the computation time necessary to achieve the outputs. Accurate clustering of large numbers of biological sequences in an efficient amount of time is an important and challenging problem with a wide spectrum of applications. In this work, we adapted existing ideas in a novel way and introduced significant improvements. The proposed algorithm achieved higher-quality clusters compared to existing methods while operating at similar, high-speed execution times.

## Methods

### Experiments

All results presented in the Testing section were generated by compiling and executing the respective clustering programs on the same computer, specifically an Apple MacBook Pro with an Intel Core 2 Duo operating at 2.53 GHz with 4 Gb of system memory and a 3 Mb L2 cache. In the case of UCLUST, the binary was downloaded from the author's website. The experiments were conducted using various versions of FASTA files containing 74,709 16S rDNA sequences from 7,043 different species of 2,255 genera obtained from the Ribosomal Database Project http://rdp.cme.msu.edu. For example, the second set of experiments involved a processed version of the FASTA file to simulate the application of clustering a large set of unknown fragments that typically result from high-throughput sequencing technologies, such as 454 pyrosequencing. In particular, every sequence was reduced to only the first 200 bases; and then the entire file was repeated 14 times for a total of 1,045,926 sequences from 2,255 genera.

*SS*), 2) the pair should be in different clusters but they are in the same cluster (

*DS*), 3) the pair should be in the same cluster but they are in different clusters (

*SD*), and 4) the pair should be in different clusters and they are in different clusters (

*DD*). The goal of a clustering algorithm is to obtain maximal values for

*SS*and

*DD*and minimal values for

*DS*and

*SD*. The three metrics all operate on combinations of these counts in order to provide an indication as to the quality of actual clustering versus ideal clustering, as follows:

are provided. Given all sequence pair comparisons, the total number that implies a pair of sequences belong to the same genus is (*SS* + *SD*). Of that total, only *SS* pairs were actually classified into the same cluster. Thus, the in-cluster classification is the percent of sequence-to-sequence pairs that have correctly clustered sequences together out of all that should be clustered together. Similarly, the total number of sequence pair comparisons that imply two sequences do not belong to the same genus is (*DS* + *DD*). Out of the total, only *DD* pairs were correctly separated into different clusters. The sequence differentiation used here was the percent of sequence pair comparisons that have correctly classified sequences apart out of the total that should not be clustered together.

We repeated the first two experiments in the Testing section using two different random permutations of the FASTA file (results not shown). All programs produced very similar results, thereby demonstrating that the order in which sequences are input to the algorithms does not affect the resulting clusters. In order to identify the best set of default parameters for the GramCluster implementation, we used two different training methods. In the first method, we randomly selected 10% of the sequences for training while the remaining 90% were used for testing. In the second method, we randomly divided the genera into two sets, one containing about 10% of the sequences and the other containing 90% of the sequences. The smaller set was again used as a training set to obtain the parameters for the algorithm. The default parameters ended up being the same as those found in the first training experiment. In particular, a grammar-based threshold of 0.13 was found to produce the best overall clustering metrics based on genera.

We applied the same training methods to identify the best thresholds for GramCluster when clustering based on species. In this case, the best overall clustering metrics based on species occurred when the grammar-based threshold of 0.03 was applied.

### Command Line Options

- 1.
-B <value> Specify the full basis amount. The value specified in this option represents the number of nonidentical sequences added to a cluster before a centroid sequence is determined. If this option is not specified, the default value is 4 sequences.

- 2.
-b <value> Specify the grammar distance identical threshold. The value specified in this option represents the grammar-based distance threshold for two sequences to be consider grammatically identical. When a new sequence is added to a cluster, it has a distance less than one of the thresholds (specified by -C or -G). In the event that two sequences are very similar (or identical), this threshold prevents the new sequence from becoming a basis sequence. If this option is not specified, the default value is 0.01.

- 3.
-C <value> Specify the grammar distance-to-centroid maximum threshold. The value specified in this option represents the grammar-based distance threshold to the centroid sequence. If a distance calculated between a new sequence and the centroid sequence is less than this value, then the new sequence is added to the cluster. If this option is not specified, the default value is 0.13.

- 4.
-G <value> Specify the grammar distance maximum threshold. The value specified in this option represents the grammar-based distance threshold to all basis sequences for clusters that do not have a centroid already determined. If a distance calculated between a new sequence and any basis sequence is less than this value, then the new sequence is added to the cluster. If this option is not specified, the default value is 0.13.

- 5.
-c Turn on complete cluster searching. This causes the algorithm to scan every cluster for the lowest distance before adding it. The default is greedy cluster searching, which causes sequences to be added to the first cluster presenting a distance lower than the specified thresholds.

- 6.
-R Turn on reverse complement checking. This causes GramCluster to check both the input sequence as well as its reverse complement against each cluster representative. The lowest resulting distance is used for classification.

Note that the -C and -G options specify thresholds that function similar to the identity percentage thresholds used by other clustering programs, such as CD-HIT-EST and UCLUST. However, the thresholds function just the opposite, whereby sequences are only added if their grammar-based distance is calculated as a value below the threshold value. In contrast, the identity percent thresholds of CD-HIT-EST and UCLUST require sequences to have a metric score higher than the threshold before they are added to the respective cluster.

### Availability

The source code for the current version of GramCluster may be downloaded from http://bioinfo.unl.edu/

## Declarations

### Acknowledgements

We would like to thank the National Institutes of Health (NIH) for partial funding of this work and Nina Murray for her careful critique of the paper. We would also like to thank the editor and anonymous referees for their insightful comments. KS thanks NIH for support under grant K25AI068151.

## Authors’ Affiliations

## References

- Holm L, Sander C: Removing Near-Neighbour Redundancy from Large Protein Sequence Collections.
*Bioinformatics*1998, 14(5):423–429. 10.1093/bioinformatics/14.5.423View ArticlePubMedGoogle Scholar - Li W, Jaroszewski L, Godzik A: Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases.
*Bioinformatics*2001, 17(3):282–283. 10.1093/bioinformatics/17.3.282View ArticlePubMedGoogle Scholar - Li W, Jaroszewski L, Godzik A: Tolerating some Redundancy Significantly Speeds up Clustering of Large Protein Databases.
*Bioinformatics*2002, 18: 77–82. 10.1093/bioinformatics/18.1.77View ArticlePubMedGoogle Scholar - Parsons JD: Improved Tools for DNA Comparison and Clustering.
*Computer Applications in the Biosciences*1995, 11(6):603–613.PubMedGoogle Scholar - Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG: Sequence Embedding for Fast Construction of Guide Trees for Multiple Sequence Alignment. Algorithms for Molecular Biology 2010., 5(21):Google Scholar
- Li W, Godzik A: Cd-hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences.
*Bioinformatics*2006, 22(13):1658–1659. 10.1093/bioinformatics/btl158View ArticlePubMedGoogle Scholar - Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R: Bacterial Community Variation in Human Body Habitats Across Space and Time.
*Science*2009, 326: 1694–1697. 10.1126/science.1177486View ArticlePubMedPubMed CentralGoogle Scholar - Edgar RC: Search and Clustering Orders of Magnitude Faster than BLAST.
*Bioinformatics*2010, 26(19):2460–2461. 10.1093/bioinformatics/btq461View ArticlePubMedGoogle Scholar - Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs.
*Nucleic Acids Research*1997, 25(17):3389–3402. 10.1093/nar/25.17.3389View ArticlePubMedPubMed CentralGoogle Scholar - Lempel A, Ziv J: On the Complexity of Finite Sequences.
*IEEE Transactions on Information Theory*1976, 22: 75–81. 10.1109/TIT.1976.1055501View ArticleGoogle Scholar - Nevill-Manning CG, Witten IH: Compression and Explanation using Hierarchical Grammars.
*The Computer Journal*1997, 40(2/3):103–116. 10.1093/comjnl/40.2_and_3.103View ArticleGoogle Scholar - Ziv J, Lempel A: A Universal Algorithm for Sequential Data Compression.
*IEEE Transactions on Information Theory*1977, 23(3):337–343. 10.1109/TIT.1977.1055714View ArticleGoogle Scholar - Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Rasala A, Sahai A, Shelat A: Approximating the Smallest Grammar: Kolmogorov Complexity in Natural Models. In
*STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing*. New York, NY, USA: ACM; 2002:792–801.View ArticleGoogle Scholar - Ziv J, Lempel A: Compression of Individual Sequences via Variable-Rate Coding.
*IEEE Transactions on Information Theory*1978, 24(5):530–536. 10.1109/TIT.1978.1055934View ArticleGoogle Scholar - Benedetto D, Caglioti E, Loreto V: Language Trees and Zipping. Physical Review Letters 2002., 88(4): 10.1103/PhysRevLett.88.048702Google Scholar
- Otu HH, Sayood K: A New Sequence Distance Measure for Phylogenetic Tree Construction.
*Bioinformatics*2003, 19(16):2122–2130. 10.1093/bioinformatics/btg295View ArticlePubMedGoogle Scholar - Russell DJ, Otu HH, Sayood K: Grammar-Based Distance in Progressive Multiple Sequence Alignment. BMC Bioinformatics 2008., 9(306):Google Scholar
- Puglisi A, Benedetto D, Caglioti E, Loreto V, Vulpiani A: Data Compression and Learning in Time Sequences Analysis.
*Physica D: Nonlinear Phenomena*2003, 180: 92–107. 10.1016/S0167-2789(03)00047-2View ArticleGoogle Scholar - Bastola DR, Otu HH, Doukas SE, Sayood K, Hinrichs SH, Iwen PC: Utilization of the Relative Complexity Measure to Construct a Phylogenetic Tree for Fungi.
*Mycological Research*2004, 108(2):117–125. 10.1017/S0953756203009079View ArticlePubMedGoogle Scholar - Weiner P: Linear Pattern Matching Algorithms.
*14th Annual Symposium on Switching and Automata Theory*1973, 1–11. full_textView ArticleGoogle Scholar - McCreight EM: A Space-Economical Suffix Tree Construction Algorithm.
*Journal of the ACM*1976, 23(2):262–272. 10.1145/321941.321946View ArticleGoogle Scholar - Ukkonen E: On-Line Construction of Suffix Trees.
*Algorithmica*1995, 14(3):249–260. 10.1007/BF01206331View ArticleGoogle Scholar - Wilbur WJ, Lipman DJ: Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.
*Proceedings of the National Academy of Sciences of the United States of America*1983, 80: 726–730. 10.1073/pnas.80.3.726View ArticlePubMedPubMed CentralGoogle Scholar - Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.
*Nucleic Acids Research*1994, 22(22):4673–4680. 10.1093/nar/22.22.4673View ArticlePubMedPubMed CentralGoogle Scholar - Halkidi M, Batistakis Y, Vazirgiannis M: On Clustering Validation Techniques.
*Journal of Intelligent Information Systems*2001, 17(2–3):107–145. 10.1023/A:1012801612483View ArticleGoogle Scholar - Li W: Analysis and Comparison of Very Large Metagenomes with Fast Clustering and Functional Annotation. BMC Bioinformatics 2009., 10(359):Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.