### Read overlapping

We have developed an exact-match read overlapper for discovering read overlap relationships. All reads are concatenated into a single string and indexed by a suffix array data structure [21]. Each read is then split into its *l* − *k* + 1 component k-mer seeds, where *l* is the length of the read. The read overlapper searches the suffix array for exact matches to the k-mer seeds. If an exact match is found, the reference read containing the hit is selected for comparison to the query read. The reference and query reads are subjected to a q-gram filter [22]. If the reads pass the filter, the hit is extended into a full alignment using a banded Needleman-Wunsch algorithm [23]. The alignment length and identity score are used to evaluate the quality of the resulting overlap relationship. If the overlap does not meet minimum user-specified thresholds, it is excluded from the final read overlap set. For each overlap that meets the thresholds, the ids of the query and reference reads are recorded along with the alignment length and identity. After the read overlapping process completes, the overlap set is ordered by parallel merge sort into an edge list. The query and reference read ids become, respectively, the labels of the source and destination nodes of the edges in the overlap graph. Edges that share a source node are grouped into edge sets within the edge list, and each edge set is sorted by overlap length, from longest to shortest. The edge list is then indexed by a graph data structure. To address the computational requirements of large datasets, we split the read dataset into subsets that are indexed by suffix arrays and sent to compute nodes in pairs to be run in parallel in high-performance computing environments. Figure 6 gives a flow diagram of the read overlapping process.
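The seeding step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a naive suffix array over the concatenated reads and a plain binary search for exact k-mer matches, and omits the q-gram filter and the banded Needleman-Wunsch extension that follow in the pipeline.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def kmer_seeds(read, k):
    """A read of length l yields l - k + 1 overlapping k-mer seeds."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def find_seed(text, sa, seed):
    """Binary-search the suffix array; return one match position or -1."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(seed)] < seed:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:sa[lo] + len(seed)] == seed:
        return sa[lo]
    return -1

reads = ["ACGTAC", "GTACGG"]
text = "$".join(reads) + "$"   # concatenate reads; '$' marks read boundaries
sa = build_suffix_array(text)
hits = [find_seed(text, sa, s) for s in kmer_seeds(reads[1], 4)]
```

A hit position identifies which reference read contains the match, selecting that read for the filtering and alignment stages.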

### Graph theoretic model

Graph theory has become an important tool in many areas of bioinformatics research. The overlap graph forms the structural foundation of our read analysis and clustering algorithm. In this graph theoretic model, there is a one-to-one correspondence between the reads in the read dataset and the nodes in the overlap graph. Edges connecting nodes in the overlap graph represent overlap relationships between reads. Each edge stores its corresponding overlap's alignment length and identity score. The overlap graph shares many similarities with the interval graph. The interval graph is a perfect graph and has many well-defined properties, making it a robust model for many applications [24].

An example of the overlap graph and interval graph developed in [14] can be found in Figure 7. We use succinct dictionary data structures [25] to index the nodes and edges of the overlap graph in a highly efficient manner, allowing us to store our graph in O(*n*) + O(*m*) + 64*m* bits, where *n* and *m* are the total numbers of nodes and edges in the graph, respectively. More details on the structure and efficiency of the graph data structures can be found in [14].

### Graph coarsening

We apply Heavy Edge Matching (HEM) to produce a series of coarsened graphs [15]. Graph coarsening provides levels of information about the structure of a graph over a spectrum of graph granularities. A very simple overview of global graph features exists at the coarsest graph levels, while more complex graph details are retained in earlier coarsening iterations and the original overlap graph.

Here we provide a description of our HEM algorithm for graph coarsening. The algorithm attempts to find a maximal matching, a maximal set of edges without common endpoints, over a graph while maximizing the total edge weight of the matching. The endpoints of the matched edges are then merged to form supernodes in a new coarsened graph. The steps of the algorithm are as follows. Given an overlap graph G_{n} = (V_{n}, E_{n}), the graph coarsening algorithm visits its nodes over a user-specified number of passes; nodes with the largest edge weights are visited first, in random order, followed by nodes with smaller edge weights. In the initial overlap graph, the edge weight is set to the alignment overlap length. We visit nodes with larger overlap lengths first because the larger the overlap shared between two reads, the greater the likelihood that the reads are adjacent to one another in the original target sequence. The edges of the currently selected node *u* are visited from largest edge weight to smallest. The selected node *u* is matched to its first available neighbor *v*, and the search through its edges is terminated. If no such neighbor exists, or if the edge overlap length falls below a user-provided minimum during the edge search, then *u* remains unmatched. After the matching process is complete, the endpoints of each edge (*u*, *v*) in the matching are merged into a single supernode *z* in a new coarsened graph, G_{n+1}. Each unmatched node is mapped to a new node in the coarsened graph. Each edge from the original graph G_{n} is mapped to a new edge in the coarsened graph G_{n+1}. The endpoints of an edge in G_{n+1} are the supernodes that contain the endpoints of its corresponding edge in G_{n} as component nodes. If two edges in the coarsened graph share the same supernode endpoints, they are merged into a single edge and their weights are summed to form a new edge weight.
An example of graph coarsening via HEM, developed in [14], can be found in Figure 8.
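The matching pass above can be sketched under simplifying assumptions: the graph is a dict mapping each node to a neighbor list of (neighbor, weight) pairs pre-sorted by descending weight, and the randomized visit order within weight tiers and the edge-density test are left out.

```python
def heavy_edge_matching(adj, min_weight=0):
    """One heavy-edge-matching pass over adjacency dict adj."""
    matched = {}
    # Visit nodes whose heaviest incident edge is largest first.
    order = sorted(adj, key=lambda u: adj[u][0][1] if adj[u] else 0,
                   reverse=True)
    for u in order:
        if u in matched:
            continue
        for v, w in adj[u]:            # edges visited heaviest first
            if w < min_weight:
                break                  # remaining edges are lighter still
            if v != u and v not in matched:
                matched[u], matched[v] = v, u   # (u, v) joins the matching
                break
    return matched
```

Each matched pair becomes one supernode in the coarsened graph; unmatched nodes carry over as singleton supernodes.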

This graph coarsening process can be applied to the newly coarsened graph G_{n+1} to produce an even coarser graph G_{n+2}. This iterative process of node matching and merging produces a series of coarsened graphs G_{0}, G_{1}, G_{2} ... G_{n}, where |G_{0}| ≥ |G_{1}| ≥ |G_{2}| ≥ ... ≥ |G_{n}| in node count, representing the dataset across a spectrum of information levels. Graph coarsening is terminated when the ratio of the number of nodes successfully matched to the total number of nodes in the graph falls below a user-specified minimum.
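The outer loop and its termination criterion can be sketched as follows. Here `coarsen_once` and `num_nodes` are hypothetical callables standing in for one matching-and-merging pass and a node count; iteration stops when the matched fraction drops below the user minimum, as described above.

```python
def coarsen_series(g0, coarsen_once, num_nodes, min_ratio):
    """Build the series G_0, G_1, ... until too few nodes are matched."""
    graphs = [g0]
    while True:
        g_next, matched = coarsen_once(graphs[-1])
        # Terminate when the matched-node fraction falls below min_ratio.
        if matched / num_nodes(graphs[-1]) < min_ratio:
            break
        graphs.append(g_next)
    return graphs
```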

Four array data structures are used to hold critical information describing the graph coarsening process. For each graph G_{0}, G_{1}, G_{2} ... G_{n} in the series of coarsened graphs, there are two arrays, *node_weights* and *edge_weights*. For each supernode *z* in a graph G_{n}, *node_weights*_{n} records the number of child nodes descended from *z* in G_{0} and *edge_weights*_{n} records the total weight of the edges induced by the child nodes. Let *z* be a supernode in a graph G_{n+1} and *u* and *v* be its child nodes in G_{n}. We use these arrays to calculate edge density with the following equation.

*edge_density(u,v) = edge_density(z) =*

\frac{2\left(ew\left[u\right]+ew\left[v\right]+w\left(u,v\right)\right)}{\left(nw\left[u\right]+nw\left[v\right]\right)\left(nw\left[u\right]+nw\left[v\right]-1\right)}

where *ew*[*u*] = *edge_weights*_{n}[*u*] and *nw*[*u*] = *node_weights*_{n}[*u*].

During the matching process, a node *u* will be matched to its neighbor *v* only if *edge_density(u, v*) is greater than a user-provided minimum.
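The density test can be written out directly from the equation above; here `nw` and `ew` stand for the *node_weights* and *edge_weights* arrays of the current level, and `w_uv` for the weight of the edge (*u*, *v*) being considered.

```python
def edge_density(nw, ew, u, v, w_uv):
    """Edge density of the supernode formed by merging u and v."""
    n = nw[u] + nw[v]          # child nodes the merged supernode would hold
    return 2 * (ew[u] + ew[v] + w_uv) / (n * (n - 1))

def may_match(nw, ew, u, v, w_uv, min_density):
    """Allow the match only if density exceeds the user minimum."""
    return edge_density(nw, ew, u, v, w_uv) > min_density
```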

Each graph is also assigned two additional arrays, *node_map* and *node_map_inverse*. For each node *u* in a graph G_{n}, *node_map*_{n} records the label of the supernode that *u* is mapped to in G_{n+1}. For each supernode *z* in a graph G_{n+1}, *node_map_inverse*_{(n+1)} records the labels of its child nodes in G_{n}. Let *u* and *v* be nodes in G_{n} that are mapped to supernode *z* in G_{n+1}; then *node_map*_{n}[*u*] = *node_map*_{n}[*v*] = *z*. If *z* is a supernode in G_{n+1}, then *node_map_inverse*_{(n+1)}[2*z*] = *u* and *node_map_inverse*_{(n+1)}[2*z*+1] = *v*, where (*u*, *v*) is an edge in a matching M applied to G_{n}. If *z* has only one child node *u*, then *node_map_inverse*_{(n+1)}[2*z*] = *u* and *node_map_inverse*_{(n+1)}[2*z*+1] = −1.

After the completion of matching, the edges of G_{n+1} are generated from the edges of G_{n}. The algorithm relabels each edge *e* = (*u*, *v*) in G_{n} to form a new edge *e*_{new} = (*node_map*_{n}[*u*], *node_map*_{n}[*v*]) in G_{n+1}. If *node_map*_{n}[*u*] = *node_map*_{n}[*v*], then *e*_{new} is not included in the edge set. The edges are ordered into an edge list by parallel merge sort such that edges with the same source node are grouped into edge sets within the edge list. The edge sets are ordered by descending edge weight. Any edges with the same endpoints in G_{n+1} are merged and their weights are summed. The new edge set is indexed by graph data structures to form G_{n+1}. Pseudocode for the graph coarsening process is shown in Figure 9.

### Read cluster generation

Our algorithm uses the *node_map_inverse* arrays to recover read clusters from a given coarsened graph G_{n}. Recall that if *z* is a supernode in G_{n}, then *node_map_inverse*_{n}[2*z*] = *u* and *node_map_inverse*_{n}[2*z*+1] = *v*, where *u* and *v* are the child nodes of *z* in G_{n−1}. The labels of the child nodes of *u* in G_{n−2} are therefore given by *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*]] and *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*]+1]. The child nodes of *v* are given by *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*+1]] and *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*+1]+1]. This nested iteration through the *node_map_inverse* arrays continues until all of the labels of the child nodes of *z* in G_{0} are recovered. Since the label of each node in G_{0} is the id of the read to which it corresponds in the read dataset, we can use the child node labels to generate the read cluster belonging to the supernode *z*.
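The nested expansion above is naturally expressed as a recursion. In this sketch, `inv_maps[n]` stands for the *node_map_inverse* array of G_{n}; expanding a supernode of G_{n} down to level 0 yields the ids of the reads it contains.

```python
def read_cluster(inv_maps, n, z):
    """Recover the read ids belonging to supernode z of G_n."""
    if n == 0:
        return [z]                        # labels in G_0 are read ids
    cluster = []
    for child in (inv_maps[n][2 * z], inv_maps[n][2 * z + 1]):
        if child != -1:                   # -1 marks a missing second child
            cluster += read_cluster(inv_maps, n - 1, child)
    return cluster
```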

### Edge relabeling

Graph traversal of the reduced overlap graph is used to determine an ordering of the nodes in the original, full overlap graph. The endpoints of each edge in the original overlap graph are relabeled according to the node ordering recovered from the reduced overlap graph. The goal of the edge relabeling process is to organize the edges of the original overlap graph such that edges with common endpoints lie close to one another in the graph data structure, facilitating more efficient access to overlap graph information. More details on the reduced-graph traversal and edge relabeling process, and experiments examining its effectiveness, can be found in [14].
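The relabeling step can be sketched as follows, assuming `order` is the node visit order recovered from traversing the reduced graph; `rank` maps old labels to new ones so that edges sharing endpoints end up near each other once the edge list is re-sorted.

```python
def relabel_edges(edges, order):
    """Relabel (u, v, w) edge endpoints by traversal rank and re-sort."""
    rank = {old: new for new, old in enumerate(order)}
    return sorted((rank[u], rank[v], w) for u, v, w in edges)
```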