Jumping profile Hidden Markov Model
A jumping profile Hidden Markov Model (jpHMM) as introduced in this section combines profile HMMs with the jumping alignment (JALI) approach introduced by Spang et al. [9]. The data that a jpHMM models are a sequence family = {S1,...,S
n
} together with a multiple sequence alignment A of . In addition, we assume that we have a partition of into k subtypes 1,...,
k
such that each sequence S
i
belongs to exactly one of the subtypes. Our jpHMM approach can be seen as a generalization of the 'jumping alignment' algorithm. In JALI, each position of a database sequence is aligned to one reference sequence S
i
∈ . By contrast, jpHMM aligns parts of the input sequence to entire subtypes of the input multiple alignment. Thus, JALI corresponds to the special case in our approach where each subtype
i
consists of exactly one sequence. As with standard profile HMMs, each match state in our model is derived from one column of the input alignment A. However, in our model we define match states specific for the subtypes. Thus, a column in the query alignment A may correspond to up to k match states, and a match state is specified by two indices, the corresponding column of the multiple alignment A and the subtype it belongs to.
For a given subtype
i
, a column is modeled as a match state only if it is a consensus column for that subtype. Consequently, a column i in the alignment A may be modeled as a match state for some of the subtypes but not for other subtypes. In addition to the match states, we have distinct insert and delete states for each subtype just as in standard profile HMMs. In our notation, match state M
i,j
is the j-th match state within the i-th subtype, and I
i,j
and D
i,j
are the insert and delete states corresponding to M
i,j
. There is a single Begin state and a single End state, respectively, for the entire model. Further, there are general, not subtype-specific insert and delete states just after the Begin state and just before the End state. From the delete state immediately after the Begin state, each match state from each subtype can be reached. Similarly, from each match state, the delete state directly before the End state can be reached. These states have been introduced to deal with the fact that the sequences are often incompletely sequenced and are missing the initial or terminal part.
Note that the sub-model associated with a subtype
i
in our jpHMM corresponds to a standard profile HMM for
i
. Thus, our model can be seen as the union of k standard profile HMMs with additional transitions between these standard HMMs. The underlying multiple alignment A induces a quasi partial order relation on the set of all states of our jpHMM. We say that a match or delete state T is (strictly) to the left of a match or delete state R if the alignment column associated to T is (strictly) to the left of the column associated with R. This ordering is related to the quasi partial order relation ∘A defined on the set of all sites of a multiple alignment introduced in [28] in the context of consistency of alignments. As for standard HMMs, the states of a jpHMM are connected by transitions to which transition probabilities are assigned. Transitions are possible within one subtype
i
as in standard profile HMMs, e.g. from one match state a transition is possible to the next match state or to the corresponding insert and delete states of
i
. In addition, our model allows transitions between different subtypes as shown in Figures 3 and 4. Transitions between subtypes are called jumps.
Transitions between subtypes are more complicated than within subtypes since not every alignment column is represented in every subtype as a match state. Thus, it is not obvious from which state in one subtype we can jump to which state in another subtype, so we need to specify which jumps between subtypes are allowed. Generally, there are two reasons to limit the number of possible jumps between states. (a) To reduce the computer resources required by our algorithm, we need to limit the number of possible transitions between states. (b) More importantly, we need to make sure that a path through our model cannot jump to the left or too far to the right. A jump to the left would have the biological meaning of a tandem repeat of a certain part of the sequence, which we do not allow. A jump to the right that overjumps consensus columns in one of the two subtypes involved in the jump means that some part of one of those subtypes is deleted with respect to the query sequence. This is possible but should be punished as in a standard profile HMM by using the alternative path, a chain of delete states. This exclusion of forward jumps similarly reduces the number of transitions as done in [29]. In our approach, we imposed the following rules:
r1 For two subtypes
i
and
j
, the algorithm can jump from a match state of
i
only to a match state or a delete state of
j
, and from an insert state or delete state of
i
a jump is possible only to a match state of
j
.
r2 A jump from a state T in
i
is possible only to the left-most state in
j
that is strictly to the right of T.
r3 A jump from a state M
i,k
, D
i,k
or I
i,k
to a state in
j
that is to the right of Mi,k+1is not possible.
Rule r1 reduces the number of possible transitions in our model. Rules r2 and r3 ensure that there are no insertions or deletions introduced during a jump without using insert or delete states.
Parameter estimation
A jpHMM has a large number of parameters that need to be specified, namely the emission probabilities of match and insert states, the transition probabilities within subtypes and the probability of the jumps between different subtypes. With the exception of the probabilities of jumps, which is discussed below, the above probabilities can be estimated based on observed frequencies. Given the topology of the jpHMM, each of the sequences in the given multiple sequence alignment defines a unique path through the states, and gives rise to observed emissions and transitions. For example, a particular residue that is aligned in a consensus column is emitted from the corresponding match state of the subtype the respective sequence belongs to. To give another example, an insert region of length l gives rise to one transition from the preceding match state to the corresponding insert state, l - 1 transitions from that insert state to itself and one transition from the insert state to the next match state.
The generalized problem for estimating each emission distribution and each distribution of possible transitions out of a state is the following. We are given a count vector = (n1,...,n
s
), where s is the number of emissions (or transitions, respectively) out of this state. For example, we have s = 4 in case of nucleotide emissions. n
i
is the number of times the i th emission (or transition) is observed. These observed frequencies are distributed according to a multinomial distribution with parameters = (p1,...,p
s
), where p
i
is the probability of observing option i. For this problem of estimating given , we chose a Bayesian approach as in [30]. This means we assume a prior distribution on the set of all possible , and then estimate by the following conditional expectation
E(|).
We model this prior knowledge using a Dirichlet distribution [30] which has parameters = (α1,...,α
s
). These parameters can be interpreted as pseudocounts that are added to the observed counts. For the emission probabilities we estimated the parameters of the prior distribution with a Maximum Likelihood approach [30] based on the input multiple alignment. For the transition probabilities we used the parameters of the prior distribution taken from [31]. Those were shown to perform better than the parameters derived by Maximum Likelihood.
In contrast to the transitions within a subtype of the jpHMM, jumps between subtypes cannot be observed in the input alignment data. Since we cannot estimate the corresponding jump probabilities from observed frequencies, we use a fixed, empirically derived value P
j
for the probability of observing a jump. If in a given state of the jpHMM, there are several possibilities for a jump, this probability is evenly distributed between the possible jumps. In other words, if we have K options to jump into another subtype, the probability of each of these jumps is given by P
j
/K. In the application of our program to the identification of HIV and HCV recombinants we use a jump probability of P
j
= 10-9 which we derived by optimizing the results by comparing them to published HIV intersubtype recombination breakpoints. Taking into account that the transition and jump probabilities out of each state must sum up to 1, we scale the non-jump transition probabilities, i.e. the probabilities for transitions within the same subtype by multiplying them by (1 - P
j
), if jumps are possible out of this state.
Alignment algorithm and efficiency
The jumping alignment of a query sequence S = s1, s2,...,s
n
and a given multiple alignment is determined by searching the most probable path Q* through the jpHMM that emits S as described above. This is done with a dynamic programming algorithm, the Viterbi algorithm. For each position i = 1,..., n of the query sequence S and for each state q of the jpHMM we calculate the probability δ
i
(q) of the prefix s1,...,s
i
of the query sequence and the most probable path through the jpHMM ending in state q and emitting s1,...,s
i
. These probabilities are called Viterbi values and the following recursion holds.
Here, q' ranges over all states of the model, t
q' q
is the probability of the transition from state q' to state q and is the probability of emitting nucleotide si+1out of state q. The Viterbi values can be computed by increasing i with the states sorted from left to right. By backtracking we can construct the most probable path Q* (see Figure 4) and thus the jumping alignment. This algorithm has a complexity of O(nℓk) in time and O(nℓ) in space where n is the length of the query sequence, k the number of subtypes in the alignment and ℓ the number of states in the jpHMM.
In the case of very long alignments this may require too much time and memory for current computer hardware. For example, genomes of the HIV-1 group M have a length of roughly 10, 000 nucleotides. Thus, given a multiple alignment of 14 (sub-)subtypes of such sequences and a query sequence of length ~10, 000 in a straightforward implementation we would need to calculate and store roughly 10, 000 · 10, 000 · 14 · 3 = 4 · 2 · 109 floating point numbers.
To accelerate the computation and to save memory we bound the number of considered states in each step i by using the beam-search algorithm [32, 33]. The idea behind this algorithm is to exclude possible irrelevant paths in each step and to restrict the search space to 'promising' paths. If an alignment of an initial part is much worse than another alignment of that part then we do not try to extend that low quality alignment. This is achieved by computing and storing in each step i a modified Viterbi value ≤ δ
i
(q) only for a subset i of the states, namely those states q whose modified Viterbi value is not much lower than the optimal local solution = max
q
These states are called 'active' states and the set i of active states of step i is determined by
The modified Viterbi value of the inactive states is set to 0, and does not need to be stored. In the next step i + 1 of the recursion the modified Viterbi value needs only be computed for states, which can be reached from a state in the subset of active states i through a path with one emission. This speeds up the computation of the recursion.
In the tradeoff between memory efficiency and speed (large ) against accuracy (small )) we chose a beam in the order that allows maximal accuracy within the limits of currenct PC hardware: = 10-20. In Figure 5 and 6 we sketch the set of columns of activated states in the multiple alignment for an example with HIV sequences. Using this beam search heuristic very rarely affects the output of the computation but the time and memory savings are immense. The average number of active states per input sequence position in this example is 1,690 which compares to roughly 10,000 · 14 · 3 = 4 · 2 · 105 states if the beam search heuristic was not used. For the HIV-1 sequences that we tested, the average number of active states was between 1,620 (CRF 12, length = 8,760 nt) and 2,862 (CRF 11, length = 9,768 nt). The CPU time for those sequences using the beam search heuristics is 7.2 min (CRF 12) and 13.6 min (CRF 11) on a Linux PC with 3 GB RAM and 3.2 GHz. This includes model building as well as a search of one sequence against the model, but most of the CPU time was consumed by the second step.
Test results
Results on HIV genomic sequences
To evaluate the accuracy of our jpHMM approach on HIV-1 sequences, we used two different types of test data including simulated as well as real-world sequence data.
First, we wanted to know to what extend our method is able to recognize subtypes in artificial sequences produced by the underlying probabilistic model itself. This test can be considered as a minimal check of our model. We sampled 800 random sequences according to the transition and emission probabilities of the HMM built for HIV-1 subtyping. Since the "jump probability" in our model is rather small, each of our 800 artificial sequences consisted of one subtype only without any recombinations. For each of these sequences our method correctly predicted the underlying single subtype, and no jumps between different subtypes were predicted. Moreover, for 752 of our 800 sequences, 100 % of the individual sequence positions were assigned to the correct subtype. In the remaining 48 sequences, the only differences between the sampled and predicted paths were in the lengths of short unclassified regions at the ends of the sequence. All in all, 99.99 % of the sequence positions in our test sequences were assigned to the correct subtype.
Further, we used simulated inter-subtype recombinant sequences with known breakpoints. Artificial recombinant sequences were created in the following way: for each simulated sequence, two real-world 'parent' sequences were taken from two different clades of HIV-1. For a fixed value X, these sequences were split at every X-th nucleotide, and a simulated recombinant sequence was composed of alternating segments of length X from these two parent sequences. Thus, for two parent sequences P 1 and P 2 and, for example, X = 1000, the first 1000 nucleotides of the artificial recombinant are from P 1, nucleotides 1001 to 2000 are from P 2, residues 2001 to 3000 are from P 1 etc. In the present study, we used values of X = 500, 1000, 1500. This way, known breakpoints were introduced into both relatively conserved regions, and highly variable regions, and the performance of the jpHMM method could be assessed in both contexts. Here, we distinguish between recombinant sequences with parents from different subtypes which we call inter-subtype recombinants and recombinants with parents from the same subtype but different sub-subtypes, which we refer to as inter sub-subtype recombinants. Sub-subtypes are clearly distinguishable, established lineages that occur within the subtypes, but do not have the minimal genetic distances required to be considered an independent subtype. For historical reasons the B and D clades are called subtypes, but in fact the distances between these two clades correspond to a sub-subtype distance [15].
The sequences used in the inter-subtype recombinants have been created using parent sequences from the following subtypes (GenBank accession numbers of the parent sequences are in parentheses): A1 and C (A1: AF193275, C:AY463217); A1 and D (A1: AF193275, D:AF133821); A1 and G (A1: AF193275, G: AF450098); B and C (B: AF042101, C: AY463217); B and F1 (B: AF042101, F1:AY173958); B and 01 (B: AF042101, 01:AB032741). The sequences used in the inter sub-subtype artificial recombinants have been created from the following sub-subtypes and parents, respectively: A1 and A2 (A1: AF413987, A2:AF286240); A2 and A1 (A2:AF286241, A1:AF539405); B and D (B: AF538302, D:AJ320484); D and B (D: AJ488926, B:AY352275); F1 and F2 (F1:AY173957, F2:AF377956); F2 and F1 (F2: AY371158, F1:AY173958). We selected the above combinations of subtypes for the parent sequences, because they correspond to known real-world recombinants. From each subtype, we selected parent sequences for which breakpoints are assumed to be reliably annotated. Figure 7A illustrates the creation of these artificial recombinants, Figure 7B shows the evolutionary relations of the subtypes and inter sub-subtypes used for our simulated recombinants.
Based on these artificial recombinants, we evaluated the performance of our jpHMM tool and compared it with Simplot [12], a widely used HIV subtyping tool. We measured the distances between the predicted and the real breakpoints, and assessed the differences in prediction accuracy using non-parametric statistics, namely calculating the (a) the median value of the distances and (b) the interquartile range, and comparing distributions using the Wilcoxon signed-rank test implemented with R http://www.r-project.org. As shown in Figure 8, our method consistently showed much better predictions of the artificial recombinant breakpoints than Simplot. In the inter-subtype sequences, jpHMM's median value for the distances between the predicted positions and the real breakpoints is 10, with an interquartile range from 4 to 15. By contrast, Simplot's median is 54, while its interquartile range is from 19 to 72. The difference between the predictions of jpHMM and Simplot is significant with p < 10-9 in the Wilcoxon signed-rank test.
The inter sub-subtype simulated sequences are more similar to each other, and so it becomes a more difficult problem to distinguish breakpoints. Here, jpHMM's median value for the distances between the predicted and the actual breakpoints is 9, the interquartile range is from 3.5 to 19, while Simplot's median value is 84 and its interquartile range is 19.5 to 122. The accuracy difference between jpHMM and Simplot's predictions are significant with p < 10-7 in the Wilcoxon signed-rank test. Finally, Figure 8 also shows there were no particular breakpoints that were consistently hard to define, rather whether or not a particular breakpoint (for example, position 1000) was accurately resolved depended on the particular combination of sequences; the artificial breakpoints we introduced were embedded in both conserved and variable regions of HIV. Introducing breakpoints at intervals of 500 and 1500 gave comparable results (data not shown) to the 1000 base intervals included in Figures 7 and 8. Finally, the breakpoint definition methods in Simplot uses a chi squared statistic for resolution [34].
We have tried to compare jphmm and Simplot chi square results to the suite of programs available in the RDP package [35], including RDP, GENECONV, MaxChi, Chimaera, Siscan. While Simplot and jphmm readily recognized the artificially generated breakpoints in our recombinants, shown in Figure 7, and could distinguish the parental subtypes, the other methods missed many of the breakpoints and often assigned incorrect subtype designations. The LARD [36] program appears not to be designed for recognition of multiple breakpoints. Bootscanning works well for correct identification of the subtypes of parental fragments in a recombinant genome, whether using the Simplot or RDP implementation, but does not attempt to optimally resolve breakpoints. Another algorithm to detect recombination events has been described in [37, 38]. However, this method is limited to detect chimeric sequences that are recombinations of only two sequences with only one breakpoint. While the jumping alignment program JALI [8, 9] can be adapted to run on DNA sequences, its computer memory requirements are far too high for applying it to the test data in our study. Thus we used the chi squared method [11, 34] implemented in Simplot [12] for Figure 8, as it gave the best results among existing methods.
In addition to simulated recombinants, we used real-world circulating recombinant forms (CRFs) for which the recombination breakpoints have been very carefully defined, and published in the literature. Here, we compared only our jpHMM predictions to the published data but not predictions by Simplot, since the published breakpoints already mainly rely on predictions by Simplot or similar methods. Thus, it would be redundant to compare CRFs to our own revised Simplot predictions. We tested reference sequences from 12 different CRFs, namely CRF02 to CRF08 and CRF10 to CRF14. These recombinants are known to be well annotated. Figure 9 shows the published genome map of CRF02 together with the subtypes as predicted by our jpHMM software. For the 12 CRFs that we analyzed, subtypes predicted by jpHMM roughly correspond to the previously published subtypes: for 70% of the CRFs with breakpoints, the breakpoints predicted by jpHMM are located in a distance of < 150nt from the corresponding published breakpoints (with an average distance of 27nt). Discrepancies between the published and jpHMM predicted recombinant sequences were found in the H, J and K-containing CRFs.
Results for the hepatitis C virus
We analyzed the two recombinant strains of HCV that were available until mid-2005, the 1b/2k St Petersburg recombinants (AY587845, [22]) and an artificial 1a/2a recombinant (AF177037, [39]). In both cases, the jpHMM method accurately reconstructed the recombinant. jpHMM located the breakpoint for the 1b/2k St Petersburg recombinant between nucleotide 3186 and 3187 (in HCV-H77 numbering). The original authors manually pinpointed this breakpoint to the exact same nucleotide, based on a Simplot graphic and a sequence alignment.
The location of the breakpoint of the artificial 1a/2a recombinant was estimated to be at position 2759/2760 (HCV-H77 numbering), while the cross-over point in the actual artificial recombinant was reported to be between the fourth and fifth nucleotide of a mutated restriction site Nde 1 located at position 2761–66 of their reference sequence (the boundary of p7 and NS2). However, in the same reference sequence (AF177037) the only site (CATATG) was located at positions 2773–2778, which corresponds to 2762–2767 of HCV-H77. This would place the breakpoint at position 2765/2766, so in this case the prediction is off by 6 nucleotides.
In both recombinants, the genotype of the 5'UTR part of the sequence was misidentified, as 1a for the 1a/2a recombinant and as 2a for the 1b/2k recombinant. Thus, a spurious breakpoint was postulated for both sequences, at position 238/239 for the 1b/2k recombinant and at position 349/350 for the 1a/2a recombinant. This region of the HCV genome, around 350 nucleotides long, is known to be too highly conserved to contain a good phylogenetic signal, and often cannot even be used to phylogenetically distinguish different genotypes, let alone subtypes (CK, unpublished results), so it is not surprising that the jpHMM method is unable to make an accurate determination. As a consequence, for any automatic recombination detection method to work accurately, we expect that this region will have to be excluded a priori; and conversely, because this region does contain little phylogenetic information, detection of recombination will be almost impossible.