FSBC: fast string-based clustering for HT-SELEX data

Background The combination of systematic evolution of ligands by exponential enrichment (SELEX) and deep sequencing is termed high-throughput (HT)-SELEX, which enables searching aptamer candidates from a massive amount of oligonucleotide sequences. A clustering method is an important procedure to identify sequence groups including aptamer candidates for evaluation with experimental analysis. In general, aptamer includes a specific target binding region, which is necessary for binding to the target molecules. The length of the target binding region varies depending on the target molecules and/or binding styles. Currently available clustering methods for HT-SELEX only estimate clusters based on the similarity of full-length sequences or limited length of motifs as target binding regions. Hence, a clustering method considering the target binding region with different lengths is required. Moreover, to handle such huge data and to save sequencing cost, a clustering method with fast calculation from a single round of HT-SELEX data, not multiple rounds, is also preferred. Results We developed fast string-based clustering (FSBC) for HT-SELEX data. FSBC was designed to estimate clusters by searching various lengths of over-represented strings as target binding regions. FSBC was also designed for fast calculation with search space reduction from a single round, typically the final round, of HT-SELEX data considering imbalanced nucleobases of the aptamer selection process. The calculation time and clustering accuracy of FSBC were compared with those of four conventional clustering methods, FASTAptamer, AptaCluster, APTANI, and AptaTRACE, using HT-SELEX data (>15 million oligonucleotide sequences). FSBC, AptaCluster, and AptaTRACE could complete the clustering for all sequence data, and FSBC and AptaTRACE performed higher clustering accuracy. FSBC showed the highest clustering accuracy and had the second fastest calculation speed among all methods compared. Conclusion FSBC is applicable to a large HT-SELEX dataset, which can facilitate the accurate identification of groups including aptamer candidates. Availability of data and materials FSBC is available at http://www.aoki.ecei.tohoku.ac.jp/fsbc/.


Background
Systematic evolution of ligands by exponential enrichment (SELEX) is an experimental method for identifying aptamers, which bind to specific target molecules with high affinity and specificity [1,2]. SELEX is an iterative method with multiple rounds for the enrichment of aptamers from the initial oligonucleotide random library. Each round consists of selection with target molecules and amplification with polymerase chain reaction(PCR). Aptamers are RNA or short single-stranded DNA molecules, which fold into a three-dimensional structure and bind different types of target molecules such as proteins [3], small molecules [4], toxins [5], ions [6], and cell surfaces [7]. Owing to the wide variety of possible target molecules, aptamers are commonly used for therapeutics [8], clinical diagnostics [9], the high-throughput multi-protein measurement [10], imaging [11], and biosensors [12].
A next-generation sequencing (NGS), which was originally developed for whole-genome sequencing, is available for analysis of large oligonucleotide pools obtained by SELEX to acquire an enormous sequence dataset for predicting aptamer candidates. This combined use of SELEX and NGS is referred to as high-throughput SELEX (HT-SELEX). It is not reasonable to evaluate the binding affinity with all observed sequences from NGS. In general, dozens of candidate aptamers are selected from the HT-SELEX data for evaluation with experimental analysis considering cost and time-consuming. In other words, the list of dozens of candidate aptamers is required from HT-SELEX data for evaluation with experimental analysis. Clustering for HT-SELEX data is an effective process to identify the sequence groups which are related to aptamer candidates, or noise sequences such as non-specific binding sequences, bead-binders, and PCR biased sequences which are easy to be enriched by PCR. Clustering is also useful to identify different types of aptamers such as different binding epitopes and for understanding the diversity and enrichment of oligonucleotide sequence pools. Figure 1 describes the typical procedure of selecting different types of aptamer candidates from the clustering results for binding verification with experimental analysis.
Several clustering methods have been developed for HT-SELEX data to date, including FASTAptamer [13], AptaCluster [14,15], APTANI [16], and AptaTRACE [17]. FASTAptamer generates clusters based on Levenshteindistance (LD) which represents the full length of sequence similarity with highly ranked sequences. AptaCluster first roughly groups sequences with local sensitive hashing (LSH) and then generates clusters with the short k-mer sequence similarity. APTANI and AptaTRACE identify clusters with short motifs considering the nucleic acid secondary structure. APTANI estimates motifs from a single round of SELEX data whereas AptaTRACE estimates motifs by tracing the changes of frequency between multiple rounds.
It is often observed that the most enriched sequence does not show the binding affinity to the target molecules. These noise sequences are likely to be generated by PCR bias (some oligonucleotide molecules are easy to be enriched by PCR) or non-specific binding of other molecules such as beads with charge effect. Typically, aptamers harbor a specific sequence region, which is necessary for binding to the target molecules, although noise oligonucleotide sequences generally do not include such a target binding region. Hence, determining the sequence clusters with such a target binding region could be an effective approach to choose aptamer candidates. The length of the target binding region varies according to the target molecules, epitopes, and/or binding styles. Thus, estimating target binding regions with different lengths is required. Although AptaTRACE was designed for detecting the candidate motifs as target binding regions, it has a limitation of the length of motifs and requires multiple rounds of SELEX data, which increases the sequencing cost.
To overcome these limitations, we developed the fast string-based clustering (FSBC) method. FSBC estimates clusters considering different lengths of over-represented strings as target binding regions. FSBC was also designed for fast calculation with search space reduction of over-represented strings using only a single round of HT-SELEX data, especially in the final round of SELEX, considering the imbalance of nucleobases of the aptamer selection process. FSBC implemented with R [18] is available at http://www.aoki.ecei.tohoku.ac.jp/fscb/.

Overview of the clustering algorithm
FSBC is composed of two parts: selection of overrepresented strings with different lengths and sequence clustering based on the selected over-represented strings. For over-represented string selection, we propose a new score calculation method that considers the imbalanced ratios of nucleobases due to the selection process of SELEX. Figure 2 shows the outline of the FSBC algorithm.

String score definition
For a set of nucleobases = {A, C, G, T(U)}, which represents adenine, cytosine, guanine, and thymine/uracil, respectively, the probability of each nucleobase is given as p j , (j ∈ ), the string is s, the length of the string is |s|, and the number of nucleobases in the string is n s,j . The probability of an L-mer oligonucleotide including string s, P s,L , Fig. 1 Procedure from obtaining clustering results to experimental analysis. HT-SELEX sequence data are grouped into different clusters according to cluster ranking. Sequences with a high frequency from high-ranked clusters are synthesized and evaluated for binding affinity with experimental analysis is then described as the following recurrent equation: where T is a set of self-overlapping regions of s, and n t,j , (t ∈ T ) is the number of nucleobases of the selfoverlapping regions. For example, if string s is "ATATA", the set of self-overlapping regions T ATATA is {A, ATA}. If L < |s|, then P s,L = 0. The terms P s,L−1 , Q, QP s,L−|s| , and Q t∈T q −1 P s,L−|s|+|t| − P s,L−|s|+|t|−1 represent the probability that a sequence has the string from 1 to L − |s| − 1, a sequence has the string at L − |s|, a sequence has the string both from 1 to L − |s| − 1 and at L − |s|, and The lower panel shows the clustering based on selected strings ranked according to the Z * -score, which is normalized Z-score for strings of different lengths a sequence has the string at the self-overlapping position, respectively. Figure S1 shows a graphical representation of Eq. (1). In stringology, the probability calculation is the same approach with "missing words in random text" [19], and the self-overlapping region is the same meaning with "string overlaps" [20].
The lengths of observed sequences obtained using NGS vary owing to insertions and/or deletions during the SELEX process. Stoltenburg and Strehiltz described that around 78% of sequences had an expected length of random regions, while the other 22% of sequences are different from the original length of random region [21]. Therefore, the probability P s,L was adjusted for different lengths of sequences using the following equation: where N is the number of observed sequences and L i is the length of the i-th sequence. Let the frequency of observed sequences including string s be F s . P s follows a binomial distribution. If N is a large enough number for F s , a random variable representing the difference between F s and P s normalized by the standard deviation of the binomial distribution then shows an approximate normal Gaussian distribution. Hence, the Z-score for string s is derived according to the following equation: ( 3 )

Selection of over-represented strings
Before selection of the over-represented strings, the probability of each nucleobase,p j , is estimated with the following equation: where n j is the number of observed nucleobases. These estimated probabilities are then used for calculation of the Z-scores. Since the ratios of nucleobases in the SELEX pool can change owing to the systematic selection bias of SELEX, the Z-score is calculated based on the balance of nucleobases using Eqs. (1) -(4). Over-represented strings with lengths ranging from l min to l max are selected while reducing the search space from all possible combinations by comparing Z-scores. Selection of over-represented strings is then conducted according to the following process: 1 Enumerate all l min -length strings and calculate their Z-scores. Exclude string whose Z-scores are less than 0. 2 Substitute l ← l min .
3 Enumerate extended strings by adding a nucleobase and calculate their Z-scores. Exclude extended strings whose Z-scores are less than those of the pre-extended strings. 4 If l + 1 > l max , then finish the selection of over-represented strings. 5 Substitute l ← l + 1, and go to 3.
The algorithm for estimating over-represented strings reduces the search space by comparing of Z-scores between the post-extended and pre-extended strings. Thus, the number of selected strings, m, is much smaller than the exhaustive enumeration of all strings: m l max l=l min | | l . This search space minimization provides a huge reduction in the calculation time for an HT-SELEX dataset.

Clustering with selected over-represented strings
While extending the string length, the strings with higher Z-scores are selected for search space reduction. For evaluating the different lengths of strings equally, the normalization of the Z-score was performed. The normalized Z-score for string s, referred to as Z * s , is calculated with the following equation: whereμ |s| andσ |s| are the mean and standard deviation, respectively, of the Z-score of selected strings with length |s|. The strings are then ordered by Z * . Parametersμ |s| and σ |s| are estimated with only selected strings. Therefore, there are no guarantees of Gaussian distribution of Z * . The clustering is then achieved according to the following process: 1 Substitute i ← 1.
2 Extract sequences including the i-th strings from the sequence dataset, where a set of extracted sequences is referred to as the i-th cluster. Remove extracted sequences from the sequence dataset. 3 If there are no sequences remaining, finish the clustering. 4 Substitute i ← i + 1, and go to step 2.

Data
The publicly available whole-cell SELEX dataset of human embryonic stem cells [22] was used for comparing the calculation speed and clustering accuracy. The SELEX was finished at the fifth round and nineteen sequences were evaluated for binding affinity with flow cytometry. According to the binding evaluation, eight of nineteen sequences showed higher fluorescent intensity and those sequences were defined as target-binding sequences.

Calculation time
The sequence data were filtered with different frequency cut-offs (1, 10, and 100) to vary the size of the dataset. The numbers of sequences included with frequency cut-offs of 1, 10, and 100 were 15,327,604 (4,381,160), 8,799,219 (156,587), and 4,947,522 (6,193) with 1, 10, and 100, respectively; the numbers of non-redundant sequences are indicated in parentheses.
The five different algorithms, namely FASTAptamer, AptaCluster, APTANI, AptaTRACE, and FSBC, were compared with respect to calculation time. The fifth round HT-SELEX data, which was the last round of SELEX, were used for FASTAptamer, AptaCluster, APTANI, and FSBC. The fourth and fifth round HT-SELEX data were used for AptaTRACE because Apta-TRACE requires multiple rounds of HT-SELEX data.
FASTAptamer was performed with an edit distance option of 7 (according to the user guide), and the maximum cluster number was set to 100 to reduce the calculation time. AptaCluster was performed with the default options. The options for APTANI were no-filtering of frequency, fixed length for HT-SELEX data, and primer information for estimation of the secondary structure. There are no further options for reducing the calculation time except for frequency filtering; thus, we did not change any options for APTANI. AptaTRACE was performed with the background sequence option as 1,000 because AptaTRACE demonstrated the best accuracy with that parameter. The options of FSBC were l min = 5 and l max = 10.

Clustering accuracy
Filtered data (frequency ≥ 10) of the fifth round, which was the final round of the SELEX, was applied for comparing the accuracy of the clustering methods because FASTAptamer and APTANI did not complete the clustering with the entire sequence dataset. The same parameters indicated in the previous subsection for AptaCluster and APTANI were applied for evaluating the clustering accuracy. The option of the maximum number of clusters for FASTAptamer was not used for the evaluation of clustering accuracy. Changing the parameters of LD and motif length did not improve the accuracy of FASTAptamer and AptaCluster, respectively. For AptaTRACE, the background sequence option was set as 1,000 because Apta-TRACE showed the highest accuracy with that option. The options for FSBC were l min = {3, 4, 5} and l max = 10. FSBC was also applied to the entire sequence dataset and the filtered data (frequency ≥ 100) to evaluate the potential bias of frequency filtering, and missing aptamer sequences due to the sequence frequency filtering.
The sequences with binding/non-binding information were sorted with cluster ranking for each method. For evaluating the cluster ranking and binding sequences, the receiver operating characteristic (ROC) curves were generated according to the order of cluster ranking with the binding information. The area under the curve (AUC) values were calculated based on the area of the ROC curves. FSBC was also applied to all of the sequence data from the third and fourth rounds of SELEX to evaluate the possibility of the detection of aptamers in early rounds.

Comparison with exhaustive enumeration of strings
Due to the search space reduction, there are no guarantees that the top-ranked strings of exhaustive enumeration are included in the selected strings. Hence, we verified whether the top-ranked strings of exhaustive enumeration was included in the selected strings or not. The missing rate of the top ten ranked strings of exhaustive enumeration was also evaluated for each length. Table 1 shows the calculation time for each method and the dataset size. The first column shows the clustering methods, and the second to seventh columns represent Table 1 Clustering calculation time for each method with datasets of different sizes. Sequences (≥ 10) and sequences (≥ 100) represent filtered data with frequency cutoff. DNF indicates did not finish. DNF 1 : FASTAptamer did not complete the calculation for the entire sequence dataset in 7 days. DNF 2 : APTANI showed a calculation error after the prediction of the secondary structure, which took 25 h AptaCluster showed the fastest calculation time for clustering, followed by FSBC. However, FASTAptamer was the slowest of the five methods and did not complete the clustering of the entire dataset in 7 days, even when changing the clustering number option to "-c 100" to reduce the calculation time. APTANI also could not complete the calculation for the entire dataset due to an error after the secondary structure prediction, which required 25 h. AptaTRACE calculated clustering with parallel computing, hence the real-time was much smaller than CPU time.

Accuracy
The clustering result for each algorithm is shown in Table 2. The columns indicate the oligonucleotide sequences excluding both ends of the primers, sequence ID, ranking of frequency, frequency, binding information, and cluster ranking for each method. AptaCluster has a two-ranking system for clustering, including frequency and diversity, corresponding to the frequency of sequences in the cluster and the number of nonredundant sequences in the cluster, respectively. APTANI does not include any functions for ordering clusters; thus, we used frequency and diversity for this purpose as performed by AptaCluster. The binding information was already defined by the verification of experimental analysis using flow cytometry [22]. Sequence IDs, seq1 to seq8 are defined as binding sequences whereas sequence IDs seq9 to seq19 are not the binding sequences. Sequence ID seq8 was not included since it was filtered out based on the frequency cut-off before clustering. The strings selected by FSBC are underlined and in uppercase in the table. The order of sequences in Table 2 is based on the ranking of the frequency on binding/nonbinding sequences. FASTAptamer, AptaCluster (frequency/diversity), APTANI (frequency/diversity), Apta-TRACE and FSBC estimated 2,380, 136,350, 2,348, 13, and 155 clusters, respectively.
Among the five methods, only FSBC and Apta-TRACE provided a top-ranked cluster that included binding sequences. By contrast, the top-ranked clusters obtained with FASTAptamer, AptaCluster (frequency), and APTANI, and the second top-ranked cluster obtained by AptaCluster (diversity) included the topranked sequence of "frequency", which did not show binding ability. Similarly, APTANI (diversity) yielded a topranked cluster including sequence ID seq17, which also did not bind to the target molecules. FASTAptamer, Apta-Cluster (frequency), AptaCluster (diversity), APTANI (frequency), and APTANI (diversity) showed 6, 7, 5, 7, and 290 as the highest ranked clusters including binding sequences, respectively, and these ranks were all lower than those with non-binding sequences. FSBC and Apta-TRACE grouped all binding sequences from sequence ID seq1 to seq7 into two clusters with cluster ranks 1 and 5. However, AptaTRACE missed sequence ID seq6, and sequence ID seq7 was grouped with sequence ID seq17 which did not show the binding affinity. FASTAptamer grouped sequence ID seq2 and seq4 into the same cluster, which was ranked fifteenth. APTANI (diversity) showed the same cluster ranking from sequence ID, seq5 to seq8; however, these ranks were simply tied but the sequences did not group in the same cluster. AptaCluster (frequency/diversity) and APTANI (frequency) did not group any binding sequences into the same cluster. Table  S1 shows the same result of FSBC with all sequences (nofiltering with frequency cutoff ) and filtered data (≥100) under the option l min = 5. Similar to the result in Table 2, all binding sequences were in the higher-ranked clusters rather than in the clusters ranked with non-binding sequences.
FSBC selected a total of 1,003 strings, and the top 24 strings are shown in Table S2. The selected overrepresented strings "ATGGACTTCGG" and "GACTT", ranked 1 and 12, respectively, were included in cluster 1 and 5 in Table 2. The selected string "GACTT" is a part of string "ATGGACTTCGG". The distribution of the Zscores and Z * -scores of the selected strings for each length of string is shown in Figure S2.
The relation between cluster ranking and frequency of oligonucleotide sequences with each method is displayed in Fig. 3, in which the red, blue, and gray dots represent binding, non-binding, and non-evaluated sequences to the target molecules, respectively. The top-ranked clusters obtained by FASTAptamer, AptaCluster (frequency), and APTANI (frequency) included the non-binding sequence of the highest frequency. AptaCluster (diversity) and APTANI (diversity) included the non-binding sequence of the highest frequency in higher ranked cluster than those including binding sequences. By contrast, FSBC and AptaTRACE grouped the binding sequences with lower frequencies in the top-ranked cluster.
The ROC curve and AUC value for each clustering method are displayed in Fig. 4. FSBC with options l min = 4 and l min = 5 clearly distinguished binding from nonbinding sequences, i.e. the AUC value equals to 1. The AUC value was slightly lower (0.96) when the FSBC options l min = 3 were applied, because some non-binding sequences were grouped into the same binding cluster. AptaTRACE also demonstrates a higher AUC value because AptaTRACE detected the target binding regions in the higher-ranked clusters. However, the other clustering methods resulted in lower AUCs because non-binding sequences with high frequency were included in the Table 2 Cluster ranking. AptaCluster (Freq.), AptaCluster (Div.), and APTANI (Freq.), APTANI (Div.) represent the cluster ranking of frequency and diversity (the number of non-redundant sequences) of AptaCluster and APTANI, respectively. Sequences with a frequency of less than 10 were excluded before the clustering analysis. Because FASTAptamer and APTANI did not finish with all sequence data. *: This sequence is filtered as the frequency is less than 10. **: The ranking of clusters is tied; however, the sequences are not grouped in the same cluster. ***: These sequences did not include any motifs estimated by AptaTRACE, thus the sequences are not grouped into any clusters higher-ranked clusters. FSBC with option l min = 5 also showed that the AUC value equals to 1 for all sequence data and filtered data (frequency ≥ 100) in Table S1. The clustering results for third-and fourth-round data are summarized in Table S3 and Table S4, respectively. FSBC could identify aptamer sequences in the third and first clusters from third-and fourth-round data. AUC values for the third and fourth rounds are 0.89 and 1, respectively.

Comparison with exhaustive enumeration of strings
The top-ranked over-represented string of exhaustive enumeration was included in the selected strings for each length. The missing rate of the top 10 ranked strings of exhaustive enumeration for each length is shown in Table  S5. Top 10 ranked 10-mer strings of the exhaustive enumeration include 6 selected strings. Thus, the missing rate of the 10-mer string is 0.4.

Discussion
Our newly developed algorithm for clustering, FSBC, showed the second fastest calculation speed with HT-SELEX data. AptaCluster displayed a remarkably fastest calculation time. FASTAptamer and APTANI could not complete the clustering for all of the sequence data; hence, only FSBC, AptaCluster, and AptaTRACE are available for applications with a real HT-SELEX dataset. FSBC selected a total of 1,003 strings, which was much smaller than all exhaustive enumeration of strings: 10 i=5 4 i = 1, 397, 760. The ratio of the number of selected strings over all combinations is 1, 003/1, 397, 760 = 0.0007175767. Hence, the minimization of the search space was an effective method for finding over-represented strings of longer lengths such as 10-mer. FSBC was designed for handling a single-round sequence data from SELEX. This approach could also be helpful to reduce the sequencing cost and the calculation time compared to other methods such as MPBind [22] and AptaTRACE [17], which require multiple rounds of sequence data from SELEX pools.
Importantly, FSBC and AptaTRACE distinguished binding sequences as high-ranked clusters, whereas the other clustering methods categorized non-binding sequences with high frequency under high-ranked clusters. This demonstrates that FASTAptamer, AptaCluster, and APTANI are more sensitive to the frequency of sequences rather than to enrichment of over-represented strings. Thus, if the SELEX pool contains numerous non-binding sequences due to PCR bias, FASTAptamer, AptaCluster, and APTANI might place these PCR biased sequences in the high-ranked clusters. The sequencing data used for the current study includes enriched strings among the binding sequences, and FSBC and Apta-TRACE could accurately detect these strings as the estimated target binding region. AptaTRACE detected binding sequences with higher-ranked clusters, however, the cluster of rank 5 includes both binding and non-binding sequences. Consequently, FSBC showed a better result for cluster ranking in this study.
This proposed string score calculation method can be extended to combine with other outcomes. In this study, we defined the outcome according to nucleobases: nucleobase = {A, C, G, T(U)}. However, other outcomes can also be defined, such as the oligonucleotide secondary structure: structure = {H, B, S, M, E, I, G}, which represent the structure of the hairpin loop, bulge, stem, multi-loop, external loop, internal loop, and G-quadruplex, respectively. A set of outcomes can be extended as = nucleobase × structure . If the set is extended to include the secondary structure, FSBC is available for searching Fig. 4 Receiver operating characteristic (ROC) curves of different clustering methods. "Freq." and "Div." in the parentheses (after AptaCluster and APTANI) indicate the cluster ranking with frequency and diversity (the number of non-redundant sequences) in the cluster for the respective method. AUC indicates the area under the curve over-represented strings with a specific secondary structure. However, the calculation time will also increase with increasing the number of elements of . Hence, to obtain the fastest calculation with FSBC, = nucleobase is the reasonable outcome. This string scoring method can also be used for other types of sequence analysis such as for amino acid sequences. In other words, if is defined based on amino acids, Eq. (1) can be used for finding over-represented strings among amino acid sequences.
FSBC does not consider insertion/deletions or degenerated nucleobases, because the method was designed to reduce the calculation time to enable estimating longer over-represented strings in a huge dataset. Since the size of clusters is much smaller than the size of the entire sequence dataset, other motif-estimating methods such as MEME [24] can be used for more accurate estimation of candidate motifs.
Due to a lack of publicly accessible HT-SELEX data with binding information, only one HT-SELEX dataset was used. Sequence data could differ depending on the target molecules, SELEX methods, and initial bias of SELEX. Hence, the evaluation with other HT-SELEX data should be performed. After there will be enough dataset of HT-SELEX data publicly available for evaluation, the clustering methods need to be summarized. Moreover, only a single clustering method cannot cover all types of SELEX datasets. Thus, the most suitable clustering approach is to compare and summarize the results of different clustering methods.

Conclusion
We proposed a new and rapid string-based clustering method for HT-SELEX data. Our clustering method could complete the calculation from a huge dataset in a reasonable time, even though the method is designed to estimate longer over-represented strings such as 10-mer. Importantly, our clustering method could identify enriched strings that were included in binding sequences estimated as the target binding region of the aptamer. Overall, FSBC