Fig. 5 | BMC Bioinformatics

Fig. 5

From: Removing duplicate reads using graphics processing units

Multi-step clustering. To simplify the graphical representation, we assume that multi-step clustering is enabled for prefixes longer than 5 nucleotides. In this example, 15 reads are clustered by analyzing prefixes of 10 nucleotides. (a) Initially, the prefixes are split into two chunks of 5 nucleotides. In the figure, the nucleotides of the first chunk are shown in blue and those of the second chunk in red. The clustering consists of three steps. (b) Reads are clustered by sorting them according to the first chunk of their prefixes (sorting A in the figure). This clustering generates 5 clusters of different sizes (C_A1, C_A2, C_A3, C_A4, C_A5). Reads clustered together are shown with the same background color. Subsequently, reads are clustered by sorting them according to the second chunk of their prefixes (sorting B in the figure). This clustering generates 6 clusters that are independent of those of the previous clustering (C_B1, C_B2, C_B3, C_B4, C_B5, C_B6). (c) A new array is initialized and partitioned according to the sizes of the clusters of A. The reads of each cluster in B are copied into the new array, into the partition associated with the cluster they belong to in A. Each read is copied into the first free position of that partition. The process is represented in the C box. Each row therein represents the copying of the reads of one cluster of B: the left side shows where the reads are copied, and the right side shows how the clusters are split after each iteration. Initially, the reads R13 and R6 of C_B1 are copied into the new array. R13 belongs to C_A5 in A and R6 belongs to C_A2 in A. Being the first reads to be analyzed, each is copied into the first position of the partition of its cluster in A. Clusters C_A2 and C_A5 are only partially filled after this step. This implies that the reads in these clusters are not all identical according to the second chunk of their prefixes. In fact, R15 and R1 have not been clustered together with R6 in B. Similarly, R6 has not been clustered together with R13 in B. Therefore, the clusters are split (see the first row in the C box). Cluster C_A2 is split into two clusters: one contains R6 and the other (of size 2) is still empty. Similarly, C_A5 is split into two clusters of size 1. The process is iterated (as represented in the C box) until all clusters in B have been analyzed. The final sorting generates 11 clusters.
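
The three steps described in the caption can be sketched in a few lines of Python. This is only an illustrative, sequential CPU sketch under stated assumptions, not the authors' GPU implementation: the function names `cluster_by_key` and `multi_step_clustering`, the parameter `chunk_len`, and the toy prefixes are all hypothetical, and the sketch handles exactly two chunks. Reads are identified by their index in the input list.

```python
from itertools import groupby


def cluster_by_key(read_ids, key):
    """Sort read ids by `key` and group reads that share the same key."""
    ordered = sorted(read_ids, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]


def multi_step_clustering(prefixes, chunk_len):
    """Cluster reads on a two-chunk prefix (hypothetical helper).

    Assumes every prefix has at least 2 * chunk_len characters.
    Returns a list of clusters, each a list of read indices.
    """
    ids = range(len(prefixes))

    # Step 1 (sorting A): cluster on the first chunk.
    clusters_a = cluster_by_key(ids, lambda i: prefixes[i][:chunk_len])
    # Step 2 (sorting B): cluster on the second chunk, independently of A.
    clusters_b = cluster_by_key(
        ids, lambda i: prefixes[i][chunk_len:2 * chunk_len]
    )

    # Remember which cluster of A each read belongs to.
    a_of = {r: a for a, cluster in enumerate(clusters_a) for r in cluster}

    # Step 3: partition a new array by the sizes of the clusters of A.
    # next_free[a] is the first free position of partition a.
    next_free, offset = [], 0
    for cluster in clusters_a:
        next_free.append(offset)
        offset += len(cluster)

    merged = [None] * len(prefixes)
    starts = set()  # positions where a final cluster begins
    for cluster in clusters_b:
        touched = set()
        for r in cluster:
            a = a_of[r]
            if a not in touched:
                # First read of this B cluster in partition a:
                # the partition is split here.
                starts.add(next_free[a])
                touched.add(a)
            merged[next_free[a]] = r
            next_free[a] += 1

    # Cut the filled array at the recorded split positions.
    cuts = sorted(starts)
    return [merged[s:e] for s, e in zip(cuts, cuts[1:] + [len(merged)])]


# Toy input: 5 reads, prefixes of 4 nucleotides, chunks of 2.
prefixes = ["AACC", "AAGG", "TTCC", "AACC", "TTCC"]
print(multi_step_clustering(prefixes, chunk_len=2))  # [[0, 3], [1], [2, 4]]
```

Because all reads of one B cluster are copied before the next B cluster is processed, reads that agree on both chunks end up contiguous within their A partition, so the final clusters match those obtained by comparing the full prefix.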
