Inferring viral quasispecies spectra from 454 pyrosequencing reads

Background RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html. Conclusions ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.


Methods
Cost Formula Derivation. First, let us consider a simplified model where all quasispecies are uniformly distributed and reads beginning positions follow uniform distribution as well. Let b(u) and e(u) be the beginning and the end positions of the read u, respectively. And let A be an event that two reads u, v from the same quasispecies Q are connected with an edge (u, v) in the transitively reduced graph, and j differences among u and v occur in the overlap due to sequencing errors. In the other words, the event A is a product of the following independent events: (1) read u exists, (2) read v exists, starting at position b(v) = b(u) + ∆ (overhang ∆ is some shift) (3) no read w from the same quasispecies Q satisfies b(u) < b(w) < b(v), and (4) there are j sequencing errors in overlap o = e(u) − b(v) of reads u and v.
Given N reads, originated from q quasispecies of length L, the probability that read u starts at a position b(u) is N Lq . The probability of the event ∆ > k is the probability that there is no read from quasispecies Q starting at any position in {b(u) + 1, b(u) + 2, . . . , b(u) + k}, that is Then P r(∆ = k) = N Lq 2 p k−1 .
Next we calculate the probability of having j genotyping errors among overlapped positions as follows: where ε is a genotyping error rate. Finally, the probability of event A is: is a shift (overhang) between starting positions of reads u and v. If ∆ Lq/N then v is more likely to be a read from another quasispecies Q and differences from quasispecies Q and Q in interval between b(u) and b(v) are the cause of large ∆. Therefore, measures our uncertainty that (u, v) is a true edge. In practice, quasispecies frequencies are under some unknown distribution F . Since it is impossible to approximate F , we assume that two identical copies of the same quasispecies genome correspond to two different entities, which follow uniform distribution.
In experimental results, reads starting positions tend to follow uniform distribution with small amount of outliers. If it is not a case, the cost formula can be adjusted by taking into account number of reads starting in interval between b(u) and b(v). However, when a candidate path is constructed, at each step, algorithm chooses between edges spanning almost the same positions, thus, non-adjusting cost formula to the coverage does not influence the construction of a candidate path.

Frequency Estimation via EM-algorithm.
Once a set of candidate sequences is obtained, their maximum-likelihood frequencies are calculated by the EM based algorithm. Iteratively, we estimate missing probability p q,r that read r comes from a candidate sequence q with j sequencing errors and maximize likelihood of an approximated model.
First, we create a bipartite graph G = {Q R, E} such that each candidate sequence is represented as a vertex q ∈ Q, and each read is represented as a vertex r ∈ R. With each vertex q ∈ Q, we associate unknown frequency f q of the candidate sequence. And with each vertex r ∈ R, we associate read observed frequency o r . Then for each pair q, r, we add an edge (q, r) weighted by probability of the read r being produced by the candidate sequence q with j genotyping errors: where l is length of read sequence, and ε is the genotyping error rate.
After initializing frequencies f q q∈Q at random, the algorithm repeatedly performs the next two steps until convergence: E-step: For each pair q, r, compute the expected value p q,r that read r comes from candidate sequence q under the assumption that frequencies f q q∈Q are correct by the following formula: M-step: For each q ∈ Q, update value of f q to the portion of reads being originated by the candidate sequence q among all observed reads in the sample, i.e.: Currently, convergence of EM algorithm is determined at the tolerance level 0.005.
Rationale for Max-Bandwidth Path. Previously, we show that cost of an edge should be correlated to its overhang (shift) ∆. If we view the costs on edges as edge lengths, then the most probable paths for quasispecies are the shortest paths in the graph. So we can choose shortest path for each vertex to build the set of candidate paths.
In experiments on error-free reads, we consider the family of edge cost functions cost k (u, v) = e ∆ (u,v) k . Figure 1 shows the number of the shortest paths in candidate set as a function of k. Smaller values of k yield fewer paths, and surprisingly, no correct candidate quasispecies are lost with decreasing of k. In the limiting case k = 0, the resulted paths are maximum bandwidth paths. Example of Read Graph Construction.   Vertices circled by bold lines correspond to superreads. Vertices circled by dashed lines correspond to subreads. Vertices that correspond to superread and its subreads are circled by the same color. Colors are as follows: red is for "G", dark blue is for "A", green is for "C", blue is for "T", pink is for "N" and purple is for "D". For example, R1 and R4 have 2 differences in the overlap, R1 and R6 as well as R1 and R6 have 1 difference. Fig. 6. Example of read graph (n = 2, m = 2) if only superreads are represented by vertices in the read graph. Read graph is shown before transitive reduction (at the left) and after transitive reduction was applied (at the right).

Data Sets
Reads generated by FlowSim from known HCV quasispecies. Additional Read Statistics. 99.96% of aligned reads has at least one indel with respect to the reference: 99.97% of deletions and 99.6% of insertions are 1bp long. Only 1.1% of aligned reads has unknown value(s).
454 Pyrosequencing Reads from HCV Samples. Additional Read Statistics. 72% of aligned reads has at least one deletion with respect to the reference: 98% of deletions are 1bp long, 1.5% has length 2, and the rest 0.5% has length 3. 77% of aligned reads has at least one insertion: 86% of insertions have length equaled to 1, and 9.8% have length equaled to 3. Only 7% of aligned reads has at least one unknown value. Assuming that only once encountered insertions caused by typing errors, we found that the insertion error rate is at least 0.025%. 454 Pyrosequencing Reads from HIV Samples. Additional Read Statistics. 87% of aligned reads has at least one deletion with respect to the reference: 99.97% of deletions are 1bp long. 99% of aligned reads has at least one insertion: 85% of insertions have length equaled to 1, 10% have length equaled to 2, and 3.5% have length equaled to 3. 11.6% of aligned reads has at least one unknown value.