An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads

Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies.


Background
Next generation sequencing has been responsible for numerous advances in the biological sciences allowing for the rapid production of sequencing data at rates not previously possible. Next generation sequencing has allowed for much innovative research in fields such as cancer genomics [1][2][3], epigenetics [4,5], and metagenomics [6,7]. These instruments are capable of producing several millions to billions of short reads in a single run. These reads are usually a small fraction of the original genome and do not contain much information individually. The massive amount of data that next generation sequencing technologies have produced has necessitated the development of efficient algorithms for short read analysis. Next generation sequencing technologies generate reads at high levels of genome coverage causing many of the reads to overlap. Specialized software programs called assemblers utilize these overlap relationships to organize, order, and align reads to produce long stretches of continuous sequence called contigs, which can be used for downstream analysis. Graph models for providing structure for the reads and their overlap relationships form the foundation of many of these assembly algorithms [8].
Metagenomics is a field of research that focuses on the sequencing of communities of organisms. This adds an additional layer of complexity to the analysis of short reads produced from metagenomics samples containing multiple sources of genetic information. Often these reads must be clustered or binned into their respective genomes before assembly or analysis of the reads can take place to avoid chimeric assembly results [9]. Multiple clustering and binning algorithms have been developed to address this issue [10][11][12]. While assembly results have been shown to be substantially improved by clustering metagenomics data before sequence assembly [13], overlap relationships retained by the assembly overlap graph are lost, leading to the removal of key global read overlap relationships and read similarities.
To address this issue, we previously introduced a short read analysis algorithm [14] that utilizes an overlap graph to model reads and their overlap relationships. Our algorithm utilizes Heavy Edge Matching (HEM) and graph coarsening methods [15] to efficiently reduce the overlap graph iteratively and to generate clusters of reads. At each graph coarsening iteration the algorithm outputs the reduced graph, producing a series of graphs representing the original read dataset across a spectrum of granularities. In comparison, most previous methods rely on a single graph to represent the read dataset, which may not capture all dataset features. The goal of our research is to create a multilevel approach that will allow for the extraction and analysis of dataset features at different information granularities that can be integrated into the assembly or analysis process. In our previous work, we applied our algorithm to cluster simulated reads representing a metagenomics dataset produced by the 454 technology. We then applied our algorithm to 454 bacterial datasets downloaded from NCBI to test our algorithm's ability to efficiently reduce and store the overlap graph. In this paper, we test the scalability of our algorithm and its ability to accurately cluster simulated Illumina read datasets at different genome coverages. We compare our algorithm's performance when applied to simulated 454 reads versus simulated Illumina reads. We also conduct a study using an Illumina metagenomics dataset downloaded from NCBI to evaluate the scalability of our algorithm. The obtained results show that our algorithm was able to substantially reduce the Illumina metagenomics overlap graph size and is scalable for large datasets. Results also demonstrate that our algorithm is practical for both 454 and Illumina data.

Results and discussion
In this section, we apply our graph coarsening and clustering algorithm to three Illumina metagenomics read datasets simulated at 5x, 15x, and 25x genome coverage. We evaluate our algorithm's graph coarsening and clustering results and compare them to results obtained by clustering a similar 454 metagenomics read dataset. Finally, we apply our algorithm to a large Illumina metagenomics dataset to demonstrate its scalability and ability to reduce large datasets. For all datasets, we report read overlapping and graph coarsening runtimes when ran on single or multiple compute nodes in a high performance computing environment.

Metagenomics clustering of simulated Illumina and 454 reads
For this study we generated Illumina read datasets from the eight reference genomes downloaded from NCBI RefSeq [16] used to generate the 454 metagenomics dataset in [14]. These reference genomes were selected at various levels of homology. Half of the bacterial genomes were chosen from the phylum Firmicutes and the remaining half were chosen from the phylum Actinobacteria. Pairs of reference genomes were chosen from the same order.
The software package ART [17] was used to simulate Illumina reads with a read length of 100 bp from each genome at 5x, 15x, and 25x coverage. The characteristics of the 454 metagenomics dataset and the Illumina datasets can be found in table 1 developed in [14] and table 2, respectively. Eight compute nodes on the commercial strength Firefly cluster located at the Holland Computing Center were used for read overlapping of the simulated Illumina metagenomics dataset [18]. After this was completed, the graph coarsening algorithm was applied to the read overlaps that were output by the overlap algorithm. Graph coarsening was run on a single node on the Firefly computer. The minimum for the ratio of nodes successfully matched to total graph size was .01. The minimum edge density was set to a threshold of 50.
Clusters were generated from the reduced graph at each graph coarsening iteration. We assigned each cluster and its reads a classification at the species level by majority read vote. Our algorithm's cluster classification error rate was defined to be the percentage of misclassified reads. The error rates for the classifications at the species level for various graph coarsening iterations of the simulated Illumina datasets are shown in Figure 1.a. 1.b and 1. c display the node and edge counts for each graph coarsening iteration, respectively. The read overlapping and graph coarsening runtimes can be found in Figure 2.
For the purpose of comparing our algorithm's performance when applied to datasets generated by different sequencing technologies, we reran our algorithm on the 454 metagenomics dataset in [14]. All runtime parameters remained the same for both the simulated Illumina and 454 read datasets. The error rates for read classification at the species level for each graph coarsening iteration of the simulated 454 read dataset can be found in Figure 3.a.
The node and edge counts can be found in Figure 3.b and 3.c, respectively. Node counts are also recorded in table 3. The graph coarsening algorithm was able to reduce the node count of original overlap graph 1.37 fold, 1.79 fold, and 2.02 fold for the 5x, 15x, and 25x read datasets, respectively, indicating that graph coarsening effectiveness may increase with increasing dataset coverage.
The results from the graph coarsening and clustering of the simulated Illumina and 454 read dataset demonstrate that the read error rates for the two sequencing  technologies are comparable and that our algorithm can be successfully applied to both 454 and Illumina reads. Read overlapping and graph coarsening runtimes demonstrate the scalability of the algorithm for datasets of increasing size and genome coverage. For the largest simulated Illumina dataset, read overlapping was completed on eight nodes in less than twelve hours. Graph coarsening was completed on a single node in less than forty-five minutes for the largest simulated Illumina dataset. The read overlapping and graph coarsening runtimes for the remaining datasets can be found in tables 4 and 5, respectively.

Illumina bioreactor metagenomics dataset
The results from the simulated metagenomics study demonstrated the algorithm's ability to reveal incremental levels of information in read datasets and that it can be extended to both 454 and Illumina read datasets. However, for this algorithm to be practical for a wide range of sequencing applications, we must demonstrate the scalability of this algorithm for large read datasets. For this purpose, we applied our algorithm to a large Illumina bioreactor metagenomics dataset and evaluated its runtime and ability to reduce a large overlap graph. We downloaded an Illumina bioreactor metagenomics dataset from the NCBI sequence read archive [19]. Table 2 describes the characteristics of this dataset. Paired-end reads were split for a total of 9,641,139 single reads. Any low quality read ends were trimmed with a minimum quality score of twenty using the FASTQ Quality Trimmer of the FASTX-toolkit [20].
Read overlapping was completed on thirty nodes of the Firefly computing cluster [18]. We then applied the graph coarsening algorithm to the overlap relationships produced by the read overlapper. We applied our graph coarsening algorithm multiple times to the dataset with varying numbers of compute nodes added to the parallel merge sort algorithm. The minimum for the ratio of nodes successfully matched to graph size was .01. The minimum edge density threshold was set to 50. The node counts for each graph coarsening iteration are recorded in table 3. The runtime for the read overlapping of the Illumina bioreactor dataset is shown in table 4. Figure 4 shows the runtime for the graph coarsening algorithm versus the number of compute nodes added to the parallel merge sort algorithm. The results demonstrate that our algorithm is scalable for larger metagenomics datasets. Figure 5 displays the reduction in node and edge counts per graph coarsening iteration. The number of edges was reduced approximately 168 times from the original overlap graph by the twenty-sixth graph coarsening iteration. The number of nodes was reduced approximately twelve times from the original overlap graph by the twenty-sixth graph coarsening iteration. The greatest reduction in edge counts occurred in the first ten graph coarsening iterations and the greatest reduction in node counts occurred in the first fourteen graph coarsening iterations.

Conclusions
In this paper, we introduced a graph coarsening and clustering algorithm that is able to model reads at multiple levels across a spectrum of information granularities. We demonstrated our algorithm's ability to cluster simulated Illumina metagenomics data at different levels of genome coverage. Clustering error rates of the algorithm applied to simulated Illumina metagenomics reads are comparable to error rates for clustering a simulated 454 metagenomics dataset with similar dataset characteristics. This suggests that our algorithm can be successfully applied to both Illumina and 454 read data. Algorithm runtimes were practical for all datasets. The largest simulated Illumina read dataset required less than twelve hours to complete read overlapping on eight compute nodes. Graph coarsening was completed on a single node in less than forty-five minutes. Read overlapping of the simulated 454 read dataset required less than an hour on eight compute nodes. Graph coarsening required less than seven minutes on a single compute node. The graph coarsening algorithm was more effectively able to reduce the higher coverage simulated Illumina datasets than the lower coverage Illumina datasets.
Finally, we applied our algorithm to a large Illumina metagenomics dataset to demonstrate its scalability. By utilizing parallel computing, our read overlapping algorithm was able to complete less than eight hours. The graph coarsening algorithm completed within approximately seven to twelve hours depending on the number of nodes added to the parallel merge sort portion of the graph coarsening algorithm. The algorithm was able to reduce the number of edges in the overlap graph nearly 168 fold while recording each graph level on disk, allowing a researcher to access the overlap graph at various levels of complexity. We plan to expand our graph coarsening algorithm to a full sequence assembler which will be contrasted to currently available assembly methods. We also plan to conduct further in-depth studies on the impact of input parameters on the graph coarsening process. Most current approaches rely on one overlap graph to capture a single snapshot of the reads and their overlap relationships. In contrast, our proposed assembly algorithm will rely on a series of coarsened graphs to capture both local and global dataset features.
The goal of our research is to develop an analysis method that will allow us to extract features of the read dataset at multiple information granularities to incorporate into the read analysis and assembly process. In the future, we plan to configure the algorithm such that clusters or nodes can be selected at different levels of information granularity. For example, if a node in a reduced graph is over-collapsed, we can select its child nodes from the previous graph iteration instead. We can continue with this zooming-in and zooming-out process, selecting child nodes from previous graph iterations until the desired node criteria is achieved or optimized. 5x Illumina n/a n/a n/a n/a n/a n/a n/a 15x Illumina n/a n/a n/a n/a n/a n/a n/a 25x Illumina 3342376 3330806 n/a n/a n/a n/a n/a 5x Illumina n/a n/a n/a n/a n/a n/a n/a 15x Illumina n/a n/a n/a n/a n/a n/a n/a 25x Illumina n/a n/a n/a n/a n/a n/a n/a 454 Reads n/a n/a n/a n/a n/a n/a n/a 5x Illumina n/a n/a n/a n/a n/a n/a n/a 15x Illumina n/a n/a n/a n/a n/a n/a n/a 25x Illumina n/a n/a n/a n/a n/a n/a n/a 454 Reads n/a n/a n/a n/a n/a n/a n/a   This will facilitate a customizable, intelligent approach to the read analysis and assembly problem.

Read overlapping
We have developed an exact match read overlapper for discovering read overlap relationships. All reads are concatenated into a single string and indexed by a suffix array data structure [21]. In succession, each read is split into its l-k-1 component k-mer seeds, where l is the length of the read. The read overlapper searches the suffix array for exact matches to the k-mer seeds. If an exact match is found, the reference read that contains the hit is selected for comparison to the query read. The reference and query reads are subject to a q-gram filter [22]. If the reads pass the filter, then the hit is extended into a full alignment using a banded Needle-Wunsch algorithm [23]. The alignment length and identity score are used to evaluate the quality of the produced overlap relationship. If the overlap does not meet minimum user-thresholds, then it is excluded from the final read overlap set. The ids of the query and reference reads are recorded along with the alignment length and identity for each overlap that meets minimum threshold requirements. After the completion of the read overlapping process, the overlap set is ordered by parallel merge sort into an edge list. The query and reference read ids become, respectively, the labels of the source and destination nodes of the edges in the overlap graph. Edges that have the same source node are grouped into edge sets within the edge list. Each edge set is sorted by overlap length, from longest overlap length to smallest. The edge list is then indexed by a graph data structure.
To address computational requirements for large datasets, we split the read dataset into subsets which are indexed by suffix arrays and sent to compute nodes in pairs to be ran in parallel in high performance computing environments. Figure 6 gives a flow diagram for the read overlapping process.

Graph theoretic model
Graph theory has become an important tool in many areas of bioinformatics research. The overlap graph forms the structural foundation of our read analysis and clustering algorithm. For this graph theoretic model, there is a one-to-one correspondence between the reads in the read dataset and nodes in the overlap graph. Edges connecting nodes in the overlap graph represent overlap relationships between reads. Each edge stores its corresponding overlap's alignment length and identity score. The overlap graph shares many similarities with the interval graph. The interval graph is one of the perfect graphs in graph theory and has many defined properties, making it a robust model for many applications [24]. An example of the overlap graph and interval graph developed in [14] can be found in Figure 7. We use succinct dictionary data structures [25] to index the nodes and edges of the overlap graph in a highly efficient manner allowing us to store our graph in O(n) + O(m) + 64m bits, where n and m are the total nodes and edges in the graph, respectively. More details on the structure and efficiency of the graph data structures can be found in [14].

Graph coarsening
We apply Heavy Edge Matching (HEM) to produce a series of coarsened graphs [15]. Graph coarsening provides levels of information about the structure of a graph over a spectrum of graph granularities. A very simple overview of global graph features exists at the coarsest graph levels, while more complex graph details are retained in earlier coarsening iterations and the original overlap graph.
Here we provide a description of our HEM algorithm for graph coarsening. Our graph coarsening algorithm attempts to find a maximal matching, a maximal set of edges without common endpoints, over a graph while maximizing for edge weight. The endpoints of these edges are then merged to form supernodes in a new coarsened graph. The steps of our algorithm are as follows. Given an overlap graph G n = (V n , E n ), the graph coarsening algorithm visits its nodes in succession over Warnke and Ali BMC Bioinformatics 2013, 14(Suppl 11):S7 http://www.biomedcentral.com/1471-2105/14/S11/S7 a user-specified number of passes, with the nodes with the largest edge weights visited randomly first followed by nodes with smaller edge weights. In the initial overlap graph, the edge weight is set to the alignment overlap length. We desire to visit nodes with larger overlap lengths first because the larger the overlap shared between two reads, the greater the likelihood that the reads are adjacent to one another in the original target sequence. The edges of the currently selected node u are visited from largest edge weight to smallest. The selected node u is matched to its first available neighbor v and the search through its edges is terminated. If no such neighbor exists or if the edge overlap length falls below a user-provided minimum during the edge search, then u remains unmatched. After the matching process is complete, the endpoints of each edge (u, v) in the matching are merged into a single supernode z in a new coarsened graph, G n+1 . Each unmatched node is mapped to a new node in the coarsened graph. Each edge from the original graph G n is mapped to a new edge in the coarsened graph G n+1 . The endpoints of an edge in G n+1 are the supernodes that contain the endpoints of its corresponding edge in G n as component nodes. If two edges in the coarsened graph share the same supernode endpoints, then the edges are merged into a single edge and their weights are summed to form a new edge weight. An example of graph coarsening via HEM can be found in Figure 8 developed in [14].
This graph coarsening process can be applied to the newly coarsened graph G n+1 to produce an even coarser graph G n+2 . This iterative process of node matching and merging produces a series of coarsened graphs G 0 , G 1 , G 2 ... G n , where N(|G 0 |) ≥ N(|G 1 |) ≥ N(|G 2 |) ... ≥ N(|G n |) representing the dataset across a spectrum of information levels. Graph coarsening is terminated when the ratio of the number of nodes successfully matched to graph size falls below a user minimum.
Four array data structures are used to hold critical information describing the graph coarsening process. For each graph G 0 , G 1 , G 2 ... G n in the series of coarsened graphs, there are two arrays, node_weights and edge_weights. For each supernode z in a graph G n , node_weights n records the number of child nodes descended from z in G 0 and edge_weights n records the total weight of the edges induced by the child nodes. Let z be a supernode in a graph G n+1 and u and v be its child nodes in G n . We use these arrays to calculate node density with the following equation.
edge_density(u,v) = edge_density(z) =  During the matching process, a node u will be matched to its neighbor v only if edge_density(u, v) is greater than a user-provided minimum.
Each graph is also assigned two additional arrays, node_map and node_map_inverse. For each node u in a graph G n , node_map n , records the label of the supernode that u is mapped to in G n+1 . For each supernode z in a graph G n+1 , node_map_inverse (n+1) records the labels of its child nodes in G n . Let u and v be nodes in G n that are mapped to supernode z in G n+1 , then node_map n [u] = node_map n [v] = z. If z is a supernode in G n+1 , then node_map_inverse (n+1) [2*z] = u and  node_map_inverse (n+1) [2*z+1] = v, where (u, v) is an edge in a matching M applied to G n . If z only has one child node u, then node_map_inverse (n+1) [2*z] = u and node_map_inverse (n+1 )[2*z+1] = -1.
After the completion of matching, the edges of G n+1 are generated from the edges of G n . The algorithm labels each edge e = (u, v) in G n to form a new edge e new = (node_map n [u], node_map n [v]) in G n+1 . If node_map n [u] = node_map n [v], then e new is not included in the edge set. The edges are ordered into an edge list by parallel merge sort such that edges with the same source node are grouped into edge sets within the edge list. The edge sets are ordered by descending edge weight. Any edges with the same endpoints in G n+1 are merged and their weights are summed. The new edge set is indexed by graph data structures to form G n+1 . Pseudocode for the graph coarsening process is shown in Figure 9.

Read cluster generation
Our algorithm uses the node_map_inverse arrays to recover reads clusters from a given coarsened graph G n . Recall that if z is a supernode in G n , then node_map_ inverse n [2*z] = u and node_map_inverse n [2*z+1] = v, where u and v are child nodes of z in G n-1 . The labels of the child nodes of u in G n-2 would therefore be given by node_map_inverse (n-1) [2*node_map_inverse n [2*z] ] and node_map_inverse (n-1) [2*node_map_inverse n [2*z] +1]. The child nodes of v would be given by node_map_inverse (n-1) [2*node_map_inverse n [2*z+1] ] and node_map_inverse [2*node_map_inverse n [2*z+1] +1]. This nested iteration through the node_map_inverse arrays continues until the all of the labels of the child nodes of z in G 0 are recovered. Since the label of each node in G 0 is the id of the read to which it corresponds to in the read dataset, we can use the child node labels to generate the read cluster belonging to the supernode z.

Edge relabeling
Graph traversal of the reduced overlap graph is used to determine an ordering of the nodes in the original, full overlap graph. The end points of each edge in the original overlap graph are relabeled according to the node ordering recovered from the reduced overlap graph. The goal of the edge relabeling process is to organize the edges within the original overlap graph such that many of the edges with common endpoints are close to one Figure 9 Pseudocode. Pseudocode is shown for the graph coarsening procedure. another in the graph data structure, facilitating more efficient access to overlap graph information. More details on the reduced graph traversal and edge relabeling process and experiments examining its effectiveness can be found in [14].