### Read overlapping

We have developed an exact-match read overlapper for discovering read overlap relationships. All reads are concatenated into a single string and indexed by a suffix array data structure [21]. Each read is then split into its *l* − *k* + 1 component k-mer seeds, where *l* is the length of the read. The read overlapper searches the suffix array for exact matches to the k-mer seeds. If an exact match is found, the reference read containing the hit is selected for comparison to the query read. The reference and query reads are subjected to a q-gram filter [22]. If the reads pass the filter, the hit is extended into a full alignment using a banded Needleman-Wunsch algorithm [23]. The alignment length and identity score are used to evaluate the quality of the resulting overlap relationship. If the overlap does not meet minimum user-specified thresholds, it is excluded from the final read overlap set. For each overlap that meets the thresholds, the ids of the query and reference reads are recorded along with the alignment length and identity. After the read overlapping process completes, the overlap set is ordered by parallel merge sort into an edge list. The query and reference read ids become, respectively, the labels of the source and destination nodes of the edges in the overlap graph. Edges that share a source node are grouped into edge sets within the edge list, and each edge set is sorted by overlap length, from longest to shortest. The edge list is then indexed by a graph data structure. To address the computational requirements of large datasets, we split the read dataset into subsets that are indexed by suffix arrays and sent to compute nodes in pairs to be run in parallel in high-performance computing environments. Figure 6 gives a flow diagram of the read overlapping process.
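The seeding step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a naive suffix array over the concatenated reads and a plain binary search for exact k-mer matches, and omits the q-gram filter and the banded Needleman-Wunsch extension that follow in the pipeline.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def kmer_seeds(read, k):
    """A read of length l yields l - k + 1 overlapping k-mer seeds."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def find_seed(text, sa, seed):
    """Binary-search the suffix array; return one match position or -1."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(seed)] < seed:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:sa[lo] + len(seed)] == seed:
        return sa[lo]
    return -1

reads = ["ACGTAC", "GTACGG"]
text = "$".join(reads) + "$"   # concatenate reads; '$' marks read boundaries
sa = build_suffix_array(text)
hits = [find_seed(text, sa, s) for s in kmer_seeds(reads[1], 4)]
```

A hit position identifies which reference read contains the match, selecting that read for the filtering and alignment stages.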

### Graph theoretic model

Graph theory has become an important tool in many areas of bioinformatics research. The overlap graph forms the structural foundation of our read analysis and clustering algorithm. In this graph theoretic model, there is a one-to-one correspondence between the reads in the read dataset and the nodes in the overlap graph. Edges connecting nodes in the overlap graph represent overlap relationships between reads. Each edge stores its corresponding overlap's alignment length and identity score. The overlap graph shares many similarities with the interval graph. The interval graph is a perfect graph and has many well-defined properties, making it a robust model for many applications [24].

An example of the overlap graph and interval graph developed in [14] can be found in Figure 7. We use succinct dictionary data structures [25] to index the nodes and edges of the overlap graph in a highly efficient manner, allowing us to store our graph in O(*n*) + O(*m*) + 64*m* bits, where *n* and *m* are the total numbers of nodes and edges in the graph, respectively. More details on the structure and efficiency of the graph data structures can be found in [14].

### Graph coarsening

We apply Heavy Edge Matching (HEM) to produce a series of coarsened graphs [15]. Graph coarsening provides levels of information about the structure of a graph over a spectrum of graph granularities. A very simple overview of global graph features exists at the coarsest graph levels, while more complex graph details are retained in earlier coarsening iterations and the original overlap graph.

Here we provide a description of our HEM algorithm for graph coarsening. The algorithm attempts to find a maximal matching, a maximal set of edges without common endpoints, over a graph while maximizing the total edge weight of the matching. The endpoints of the matched edges are then merged to form supernodes in a new coarsened graph. The steps of the algorithm are as follows. Given an overlap graph G_{n} = (V_{n}, E_{n}), the graph coarsening algorithm visits its nodes over a user-specified number of passes; nodes with the largest edge weights are visited first, in random order, followed by nodes with smaller edge weights. In the initial overlap graph, the edge weight is set to the alignment overlap length. We visit nodes with larger overlap lengths first because the larger the overlap shared between two reads, the greater the likelihood that the reads are adjacent to one another in the original target sequence. The edges of the currently selected node *u* are visited from largest edge weight to smallest. The selected node *u* is matched to its first available neighbor *v*, and the search through its edges is terminated. If no such neighbor exists, or if the edge overlap length falls below a user-provided minimum during the edge search, then *u* remains unmatched. After the matching process is complete, the endpoints of each edge (*u*, *v*) in the matching are merged into a single supernode *z* in a new coarsened graph, G_{n+1}. Each unmatched node is mapped to a new node in the coarsened graph. Each edge from the original graph G_{n} is mapped to a new edge in the coarsened graph G_{n+1}. The endpoints of an edge in G_{n+1} are the supernodes that contain the endpoints of its corresponding edge in G_{n} as component nodes. If two edges in the coarsened graph share the same supernode endpoints, they are merged into a single edge and their weights are summed to form a new edge weight.
An example of graph coarsening via HEM, developed in [14], can be found in Figure 8.
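The matching pass above can be sketched under simplifying assumptions: the graph is a dict mapping each node to a neighbor list of (neighbor, weight) pairs pre-sorted by descending weight, and the randomized visit order within weight tiers and the edge-density test are left out.

```python
def heavy_edge_matching(adj, min_weight=0):
    """One heavy-edge-matching pass over adjacency dict adj."""
    matched = {}
    # Visit nodes whose heaviest incident edge is largest first.
    order = sorted(adj, key=lambda u: adj[u][0][1] if adj[u] else 0,
                   reverse=True)
    for u in order:
        if u in matched:
            continue
        for v, w in adj[u]:            # edges visited heaviest first
            if w < min_weight:
                break                  # remaining edges are lighter still
            if v != u and v not in matched:
                matched[u], matched[v] = v, u   # (u, v) joins the matching
                break
    return matched
```

Each matched pair becomes one supernode in the coarsened graph; unmatched nodes carry over as singleton supernodes.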

This graph coarsening process can be applied to the newly coarsened graph G_{n+1} to produce an even coarser graph G_{n+2}. This iterative process of node matching and merging produces a series of coarsened graphs G_{0}, G_{1}, G_{2} ... G_{n}, where |G_{0}| ≥ |G_{1}| ≥ |G_{2}| ≥ ... ≥ |G_{n}| in node count, representing the dataset across a spectrum of information levels. Graph coarsening is terminated when the ratio of the number of nodes successfully matched to the total number of nodes in the graph falls below a user-specified minimum.
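The outer loop and its termination criterion can be sketched as follows. Here `coarsen_once` and `num_nodes` are hypothetical callables standing in for one matching-and-merging pass and a node count; iteration stops when the matched fraction drops below the user minimum, as described above.

```python
def coarsen_series(g0, coarsen_once, num_nodes, min_ratio):
    """Build the series G_0, G_1, ... until too few nodes are matched."""
    graphs = [g0]
    while True:
        g_next, matched = coarsen_once(graphs[-1])
        # Terminate when the matched-node fraction falls below min_ratio.
        if matched / num_nodes(graphs[-1]) < min_ratio:
            break
        graphs.append(g_next)
    return graphs
```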

Four array data structures are used to hold critical information describing the graph coarsening process. For each graph G_{0}, G_{1}, G_{2} ... G_{n} in the series of coarsened graphs, there are two arrays, *node_weights* and *edge_weights*. For each supernode *z* in a graph G_{n}, *node_weights*_{n} records the number of child nodes descended from *z* in G_{0} and *edge_weights*_{n} records the total weight of the edges induced by the child nodes. Let *z* be a supernode in a graph G_{n+1} and *u* and *v* be its child nodes in G_{n}. We use these arrays to calculate edge density with the following equation.

*edge_density(u,v) = edge_density(z) =*

\frac{2\left(ew\left[u\right]+ew\left[v\right]+w\left(u,v\right)\right)}{\left(nw\left[u\right]+nw\left[v\right]\right)\left(nw\left[u\right]+nw\left[v\right]-1\right)}

where *ew*[*u*] = *edge_weights*_{n}[*u*] and *nw*[*u*] = *node_weights*_{n}[*u*].

During the matching process, a node *u* will be matched to its neighbor *v* only if *edge_density(u, v*) is greater than a user-provided minimum.
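The density test can be written out directly from the equation above; here `nw` and `ew` stand for the *node_weights* and *edge_weights* arrays of the current level, and `w_uv` for the weight of the edge (*u*, *v*) being considered.

```python
def edge_density(nw, ew, u, v, w_uv):
    """Edge density of the supernode formed by merging u and v."""
    n = nw[u] + nw[v]          # child nodes the merged supernode would hold
    return 2 * (ew[u] + ew[v] + w_uv) / (n * (n - 1))

def may_match(nw, ew, u, v, w_uv, min_density):
    """Allow the match only if density exceeds the user minimum."""
    return edge_density(nw, ew, u, v, w_uv) > min_density
```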

Each graph is also assigned two additional arrays, *node_map* and *node_map_inverse*. For each node *u* in a graph G_{n}, *node_map*_{n} records the label of the supernode that *u* is mapped to in G_{n+1}. For each supernode *z* in a graph G_{n+1}, *node_map_inverse*_{(n+1)} records the labels of its child nodes in G_{n}. Let *u* and *v* be nodes in G_{n} that are mapped to supernode *z* in G_{n+1}; then *node_map*_{n}[*u*] = *node_map*_{n}[*v*] = *z*. If *z* is a supernode in G_{n+1}, then *node_map_inverse*_{(n+1)}[2*z*] = *u* and *node_map_inverse*_{(n+1)}[2*z*+1] = *v*, where (*u*, *v*) is an edge in a matching M applied to G_{n}. If *z* has only one child node *u*, then *node_map_inverse*_{(n+1)}[2*z*] = *u* and *node_map_inverse*_{(n+1)}[2*z*+1] = −1.

After the completion of matching, the edges of G_{n+1} are generated from the edges of G_{n}. The algorithm relabels each edge *e* = (*u*, *v*) in G_{n} to form a new edge *e*_{new} = (*node_map*_{n}[*u*], *node_map*_{n}[*v*]) in G_{n+1}. If *node_map*_{n}[*u*] = *node_map*_{n}[*v*], then *e*_{new} is not included in the edge set. The edges are ordered into an edge list by parallel merge sort such that edges with the same source node are grouped into edge sets within the edge list. The edge sets are ordered by descending edge weight. Any edges with the same endpoints in G_{n+1} are merged and their weights are summed. The new edge set is indexed by graph data structures to form G_{n+1}. Pseudocode for the graph coarsening process is shown in Figure 9.

### Read cluster generation

Our algorithm uses the *node_map_inverse* arrays to recover read clusters from a given coarsened graph G_{n}. Recall that if *z* is a supernode in G_{n}, then *node_map_inverse*_{n}[2*z*] = *u* and *node_map_inverse*_{n}[2*z*+1] = *v*, where *u* and *v* are the child nodes of *z* in G_{n−1}. The labels of the child nodes of *u* in G_{n−2} are therefore given by *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*]] and *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*]+1]. The child nodes of *v* are given by *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*+1]] and *node_map_inverse*_{(n−1)}[2 *node_map_inverse*_{n}[2*z*+1]+1]. This nested iteration through the *node_map_inverse* arrays continues until all of the labels of the child nodes of *z* in G_{0} are recovered. Since the label of each node in G_{0} is the id of the read to which it corresponds in the read dataset, we can use the child node labels to generate the read cluster belonging to the supernode *z*.
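The nested expansion above is naturally expressed as a recursion. In this sketch, `inv_maps[n]` stands for the *node_map_inverse* array of G_{n}; expanding a supernode of G_{n} down to level 0 yields the ids of the reads it contains.

```python
def read_cluster(inv_maps, n, z):
    """Recover the read ids belonging to supernode z of G_n."""
    if n == 0:
        return [z]                        # labels in G_0 are read ids
    cluster = []
    for child in (inv_maps[n][2 * z], inv_maps[n][2 * z + 1]):
        if child != -1:                   # -1 marks a missing second child
            cluster += read_cluster(inv_maps, n - 1, child)
    return cluster
```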

### Edge relabeling

Graph traversal of the reduced overlap graph is used to determine an ordering of the nodes in the original, full overlap graph. The endpoints of each edge in the original overlap graph are relabeled according to the node ordering recovered from the reduced overlap graph. The goal of the edge relabeling process is to organize the edges of the original overlap graph such that edges with common endpoints lie close to one another in the graph data structure, facilitating more efficient access to overlap graph information. More details on the reduced-graph traversal and edge relabeling process, and experiments examining its effectiveness, can be found in [14].
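The relabeling step can be sketched as follows, assuming `order` is the node visit order recovered from traversing the reduced graph; `rank` maps old labels to new ones so that edges sharing endpoints end up near each other once the edge list is re-sorted.

```python
def relabel_edges(edges, order):
    """Relabel (u, v, w) edge endpoints by traversal rank and re-sort."""
    rank = {old: new for new, old in enumerate(order)}
    return sorted((rank[u], rank[v], w) for u, v, w in edges)
```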