Deconvoluting the diversity of within-host pathogen strains in a multi-locus sequence typing framework

Background Bacterial pathogens exhibit an impressive amount of genomic diversity. This diversity can be informative of evolutionary adaptations, host-pathogen interactions, and disease transmission patterns. However, capturing this diversity directly from biological samples is challenging. Results We introduce a framework for understanding the within-host diversity of a pathogen using multi-locus sequence types (MLST) from whole-genome sequencing (WGS) data. Our approach consists of two stages. First we process each sample individually by assigning it, for each locus in the MLST scheme, a set of alleles and a proportion for each allele. Next, we associate to each sample a set of strain types using the alleles and the strain proportions obtained in the first step. We achieve this by using the smallest possible number of previously unobserved strains across all samples, while using those unobserved strains which are as close to the observed ones as possible, at the same time respecting the allele proportions as closely as possible. We solve both problems using mixed integer linear programming (MILP). Our method performs accurately on simulated data and generates results on a real data set of Borrelia burgdorferi genomes suggesting a high level of diversity for this pathogen. Conclusions Our approach can apply to any bacterial pathogen with an MLST scheme, even though we developed it with Borrelia burgdorferi, the etiological agent of Lyme disease, in mind. Our work paves the way for robust strain typing in the presence of within-host heterogeneity, overcoming an essential challenge currently not addressed by any existing methodology for pathogen genomics.


Complexity of the Allele Diversity Problem Proof of NP-completeness of the Allele Diversity Problem
We transform the Allele Diversity Problem into a decision problem as follows. Input: • a set of reads R = {r 1 , r 2 , . . .}; • a set of n alleles A = {a 1 , . . . , a n } for the chosen locus; • a target cost C.
Question: Is there a read cover whose cost is at most C? It is clear that this problem is in NP, since given a particular subset B of alleles, it is easy to check, in polynomial time, that they form a read cover and that the cost of this read cover is at most C.
To show its NP-completeness we use a reduction from the 3-dimensional matching problem, formulated as follows: Input: • three disjoint sets, X, Y and Z, each of size n; • a collection of m ≥ n triples {(x i , y i , z i )} 1≤i≤m , with x i ∈ X, y i ∈ Y, z i ∈ Z ∀ i.
Question: Is there a three-dimensional matching, i.e. a subset M ⊆ {1, 2, . . . , m} with |M | = n such that ∪ i∈M {x i , y i , z i } = X ∪ Y ∪ Z (i.e. the corresponding triples contain each element exactly once)? From a given input to the 3-dimensional matching problem we construct an instance of the allele diversity problem (with reads over a binary alphabet) as follows: Select 3n binary strings, each of length L − 2, such that the Hamming distance between any pair of them is at least 1 (for instance by using a Hamming code), pad them with a single 0 at the beginning and at the end to a total length L, and associate exactly one of them to each of the elements of X ∪ Y ∪ Z. For an element e denote by s(e) the associated binary string.
Let the set of reads be R := {s(e)} e∈X∪Y ∪Z . There are exactly 3n reads. We assume that the base quality scores at all the positions of all the reads equal the maximnum possible base quality score, so that after scaling, each mismatch contributes exactly 1 to the objective function.
For each triplet (x i , y i , z i ) we construct an allele of length 5L, s(x i )1 L s(y i )1 L s(z i ), where 1 L is the string of L 1's and the product is via concatenation. In other words, A := {s(x i )1 L s(y i )1 L s(z i )} 1≤i≤m . Lastly, we set C := n. Since L can be chosen to be linear in n, and the codewords can be constructed in linear time with respect to n as well, this reduction can be performed in polynomial time.
We now prove the following Claim: There is a read cover with cost at most n if and only if there is a three-dimensional matching.
Proof : By construction, s(e) maps to any allele based on a triplet containing e with no errors, and to any allele based on a triplet not containing e with at least 1 error (since the encodings of the elements differ in at least one position, and a mapping overlapping with the "sentinel" string 1 L also incurs at least one mismatch due to the presence of a 0 at either end). It follows that we can get a score of at most n if we pick the n alleles corresponding to a three-dimensional matching, since there are no errors on any of the reads and n alleles are used in this case.
On the other hand, suppose that we do get a score of at most n. Since each allele can explain at most 3 reads with no errors, the case of no errors requires at least n alleles for the 3n reads, and the only way that this can be achieved with exactly n alleles is if each allele explains three different reads with no errors, meaning that the triples corresponding to the n chosen alleles form a three-dimensional matching. If there are k > 0 errors, then at most n − k alleles can be used. It follows that at most 3(n − k) of the 3n reads incur no errors, and the remaining 3k reads must each incur at least one error, so the number of errors incurred must be at least 3k, contradicting the hypothesis that there were only k errors. This proves the claim.

The ADP and the Uncapacitated Facility Location Problem
Our formulation of the ADP is very similar to that of the Uncapacitated Facility Location Problem (UFLP), in which a set of customers needs to be served by facilities placed in a subset of a finite set of possible locations, such that the total cost of opening the facilities plus the distances between each customer and the closest facility is minimized. Indeed, if we consider reads to be customers, alleles to be facilities, and distances to be the sum of normalized base quality scores of the corresponding mappings, the ADP becomes a special case of the UFLP. However, our problem is somewhat simpler than the general UFLP because the set of possible distances is limited. We exploit this fact in our ILP formulation which, unlike the formulation for the UFLP, has fewer variables and constraints.
Related to the Uncapacitated Facility Location Problem, already known to be NP-hard, our NP-hardness proof is of independent interest because it actually produces a set of reads and a set of alleles that have the desired distances, whereas not every instance of the UFLP may be obtainable from actual reads and alleles.

Alternative Formulation of the Allele Diversity Problem
The seemingly intuitive way to formulate the ADP is to introduce a decision variable b ij representing the assignment of read r i to allele a j . Here we present the alternative formulation of ADP based on this idea as an integer linear program. Given there are n alleles and m reads, we denote the set of reads R = {r 1 , ..., r m } and the set of alleles A = {a 1 , ..., a n }. Also, m ij is the score for mapping read r i to allele a j , the entries of the matrix input for ADP as described previously. Next, the decision variables and objective function are as follows: • z j = 1 if allele a j is chosen, and 0 otherwise. • b ij = 1 if the mapping of read r i to allele a j is chosen and 0 otherwise.
The constraints are: • All reads are covered: • If read r i is assigned to allele a j , then allele a j must be chosen:

Proof of the Equivalence of Both Formulations
Denote our formulation as (1) and the alternative formulation as (2). We prove that both formulations are equivalent by showing that a feasible solution of (1) can be translated to a feasible solution of (2), with both having the same objective value and vice versa. ("⇒") Prove that a feasible solution of (1) can be translated into a feasible solution of (2). Given a feasible solution s 1 = (x; Y) where x = (x j ) and Y = (y ik ). We denote a solution for (2), s 2 = (z; B) where z = (z j ) and B = (b ij ). Also, recall that the matrix input for the ADP is M = (m ij ) m×n .
• For all j, assuming the indices for both refer to the same allele, let z j = x j .
• For all i, due to the 2nd constraint in (1), ∃! k * such that y ik * = 1 and y ik = 0 ∀k = k * . Denote Claim: s 2 is a feasible solution for (2) with same objective value as (1). Proof: • Constraint 1: The summation of the allele variables is the same in both formulations. For the other summation term: Prove that a feasible solution of (2) can be translated to a feasible solution of (1). Given a feasible solution s 2 = (z; B).
• Constraint 2: • Objective value: The summation of the allele variables is the same. For the other summation term:

Proof of NP-completeness of the Strain Diversity Problem
We turn the Strain Diversity Problem into a decision problem by putting the objective function value into the input. This decision problem version is then formulated as follows: . .} of all alleles selected for locus j in sample i, together with the set P ij = {p ij1 , p ij2 , . . .} of proportions of these alleles; • A database Ω of known strain types; • An error bound ∈ [0, 1]; • A target cost C.
Let the set of all possible strain types for sample i, the Cartesian product For simplicity, assuming the weights for each novel strain are set to 1. Question: Is there a set of proportions π ih for each strain type V ih ∈ V i in each sample i, satisfying the validity constraint such that the total cost is at most equal to the target C, i.e.
This problem is clearly in NP because given a set of proportions π ih , it is easy to check, in linear time in the size of the input, that the validity constraints are satisfied, and that the cost is at most C.
To show its NP-completeness we use a reduction from the 3-partition problem, formulated as follows: Input: A multiset of n = 3m positive integers x 1 , x 2 , . . . , x n with sum S = mB for some integer B, and satisfying B 4 < x i < B 2 for each 1 ≤ i ≤ n. Question: Does there exist a partition of this multiset into m disjoint triples T 1 , T 2 , . . . , T m , such that the sum of each triple is exactly B?
From a given input to the 3-partition problem we construct the following instance of SDP. Let us choose = 1 4m . Let us look at a single sample with two loci, 1 and 2, where the first locus has the n = 3m possible alleles g 1 , g 2 , . . . , g n and the second one has the m possible alleles h 1 , h 2 , . . . , h m .
Let the target allele distribution for locus 1 be g 1 with proportion x1 S , g 2 with proportion x2 S , . . . , g n with proportion xn S , and let the target allele distribution for locus 2 be h i with proportion 1 m for each 1 ≤ i ≤ m. Finally, let the library strains be Ω := ∅, and let the target cost be C := n. It is clear that the transformation is polynomial in the input size. We now prove the following Claim: The original problem has a "yes" answer if and only if the transformed version has cost at most C.
Proof : Suppose that there is a partition of the multiset into disjoint triples T 1 , T 2 , . . . , T m . Then we can take the 3 integers, say x j1 , x j2 , x j3 , in a triple T j , and use the 3 new strains g j1 − h j with proportion xj 1 S , g j2 − h j with proportion xj 2 S , and g j3 − h j with proportion xj 3 S , to cover the allele h j completely, with no error (since construction). Thus, we need a total of n = 3m strains to cover all the alleles in both loci with no error, so the cost of our solution is indeed n.
Conversely, suppose that there is a solution with cost at most n. Note first that because = 1 4m = B 4Bm = B 4S < xi S for any 1 ≤ i ≤ n, it follows that no allele in locus 1 can be eliminated from the solution while satisfying the validity constraint. It follows that at least n = 3m strains are needed in any valid solution, so the cost cannot be smaller than n.
Furthermore, in order to get a cost of exactly n, there needs to be no error on the proportions and exactly n = 3m strains, so each allele in locus 1 must be in exactly one strain. Thus, the allele h j for locus 2 must be associated with a subset T j of alleles of locus 1, and the T j must be disjoint and therefore form a partition of the alleles of locus 1. Since B 4 < x i < B 2 for each i, we must have exactly 3 alleles in each subset T j , since adding up 2 of the xi S results in a sum below the target proportion B S = 1 m for h j , and adding up 4 of them results in a sum above it, thus incurring a non-zero error on the proportions.
By considering the sum of the proportions in each triple T j containing three alleles, say g j1 , g j2 , g j3 , we see that we must have As the T j are disjoint, this proves that the answer to the original problem is "yes", completing the proof.

WGS data preprocessing.
Before running the pipeline, we map the reads of each sample to a database of alleles for each gene. We use the memory efficient short read mapper Bowtie (v0.12.7) for read mapping. Next, we use SAMTOOLS (v1.3.1) to process the SAM file and extract information such as reads that are mapped as a pair, the alleles that are involved in the mappings, the number of mismatches, the base qualities, and the position of the mismatches.

Definition of Earth-Mover's Distance
be the set of true strains T i with its respective proportion t i and P = {P j , p j } |P| i=1 be the set of predicted strains P j with predicted proportion p j . Also, denote d ij as the edit distance between T i and P j . EMD measures the minimum amount of work to transform the predicted distribution defined by P into true distribution defined by T , where the amount of work is also weighted by distances d ij . It involves solving a transportation problem which can be modeled as a network flow problem, where the problem is to obtain a flow F = {f ij } which minimizes ij f ij d ij and the EM D = ij fij dij ij fij . It can be thought as a network flow problem in the following sense: we assign fractions of each P i to any T j (flow f ij ), constrained by the maximum amount of fractions that P i and T j can give and take, which is p i and t j (capacities), where the assignments f ij are weighted by d ij .

Results of Kallisto on ADP simulation
As Kallisto introduces large number of alleles with extremely small abundances, we only retained alleles with larger than 1% proportion.

Results of strainEST on full pipeline simulation
Recall that full pipeline simulation involves generating reads after strains are chosen, and our pipeline first solves the ADP problem then the SDP problem, by taking results of ADP as an input. StrainEST infers strains and abundances, and we compute the statistics for ADP based on the strains and abundances inferred.
In this section, we present the results for our algorithm and strainEST on 2 different set of experiments: (1) benchmark simulation: Only existing strains are chosen for simulation, where we consider k = {1, 2} number of strains (2) EvoMods simulation: 4 evolutionary mechanisms. Remark: As strainEST can only take a maximum of 100 genomes, we randomly choose 100 existing strains as the database for simulation.

Results on 4 EvoMods
The results for different evolutionary mechanisms presented for our algorithm are different from those presented in the Results section, due to the database size constraint by strainEST.

Minimum Spanning Tree of Strains Found in 24 Samples
Figure 1: Nodes with prefix 'n' represents novel strains, 'e' represents existing.
We can observe clusters of strains which are very close to each other, for example, the circled tree branch in the upper left, consists nodes n25, n26, e6, e3, n13, n20, n14, n27, n28, n29. These nodes can be grouped as a cluster with strains of edit distance at most 5 bases. It is worth taking note that this distance is taken across 8 genes where each is approximately 580 bases long. Furthermore, mapping back to samples which contain these strains, n13 and n14 are in sample SRR2034335; n25, n26, n27, n28, n29 are in sample SRR2034342. Similar observations were seen on nodes which are in the star structure circled in Figure2 where any 2 nodes in the structure differ by at most 11 bases. These observation indicates that either strains in these samples mutate into extremely similar strains, or it indicates that wrong allele call leads to relatively higher number of novel strains inferred by our method.

Box Plots of Our Method and strainEST on Benchmark Simulation
Recall that the benchmark simulation is one of the simulations used to compare to strainEST.