We consider the scenario in which a set of *N* individuals is to be sequenced for an application such as a disease association study or fusion-gene detection. The most straightforward approach would be to barcode each individual and sequence them separately. When *N* is large, or when the desired coverage is high, this approach is infeasible due to budget constraints. A few methods have been suggested to tackle this problem using a set of overlapping pools [9–11]. These methods are based on the following generic idea. Let the sequences of the individuals be represented by a matrix *G* = {*g*_{ij}} of dimension *N* × *m*, where *m* is the length of the genome. *g*_{ij} ∊ {0, 1, 2} is the number of occurrences of a genetic variant in position *j* of individual *i*; such a variant could be a single nucleotide polymorphism (SNP), a copy number variant (CNV), or a gene fusion, as discussed in the introduction. The pooling-based approach considers a {0, 1} matrix *A* of dimension *T* × *N*, representing a set of pools. Each row of *A* corresponds to a DNA pool; individual *j* participates in the *i*-th pool if and only if *A*_{ij} = 1. Thus, the matrix *A* provides a compact description of the study design. When the study is performed, under an error-free model, the pooling results are given as *Y* = *AG*. In principle, one can now decode the genotypes by finding a solution to the set of equations *AX* = *Y*. In reality, though, the pooling results are not as accurate, and therefore current methods use a rounded version of *Y*: for every *i*, *j*, we define *c*_{ij} = 1 if *y*_{ij} > 0 and *c*_{ij} = 0 otherwise. Thus, if we replace the SUM operation by an OR operation, then *AG* = *C*. Using this information, Erlich et al. [9], Prabhu et al. [10], and Shental et al. [11] provide decoding algorithms that find which individuals have *g*_{ij} > 0. For simplicity of exposition, we will assume from now on that only one variant is considered, so that *G* is a column vector of length *N* and *Y* and *C* are column vectors of length *T*.
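The error-free pooling step and its {0, 1} rounding can be sketched as follows (a minimal illustration; the design matrix and genotypes below are toy values, not a real study design):

```python
def simulate_pools(A, G):
    """Error-free pool results Y = A*G and their {0,1} rounding C."""
    T, N = len(A), len(A[0])
    Y = [sum(A[i][j] * G[j] for j in range(N)) for i in range(T)]
    C = [1 if y > 0 else 0 for y in Y]
    return Y, C

# Toy design: T = 3 pools over N = 4 individuals
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 1]]
G = [2, 0, 1, 0]  # genotype vector for a single variant

Y, C = simulate_pools(A, G)
print(Y)  # [2, 1, 3]
print(C)  # [1, 1, 1]
```

The rounding from *Y* to *C* is exactly where the resolution loss discussed below occurs: *C* records only which pools contain a carrier, not how many variant copies each pool received.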

### A lower bound on decoding accuracy

Unfortunately, by collapsing the data to a {0, 1} matrix, resolution is lost, and therefore there is no hope of decoding all genetic variants from the pools if the number of pools is not large. Note that for a given variant there are 3^{N} possible genotype vectors, while the number of possible column vectors *C* is 2^{T}. Therefore, in order to be able to decode all individual genotypes we need 2^{T} ≥ 3^{N}, or *T* ≥ *N* log_{2} 3 > *N*. Even without rounding, the number of possible count vectors *B* is at most (2*N*)^{T}, and therefore even in the error-free case we need (2*N*)^{T} ≥ 3^{N}, or *T* ≥ *N* log 3/log(2*N*). In practice, the pooling decoding methods work well when the allele frequency is low, under an error-free model. For a variant of allele frequency α, the number of possible genotype vectors is roughly (*N* choose α*N*), which is approximately 2^{−*N*α log_{2} α} for small α. Therefore, in the case where the rounded results are provided (the matrix *C*), we need 2^{T} ≥ (*N* choose α*N*), or *T* ≥ −*N*α log α; if we use the full information given by the matrix *B*, we need (2*N*)^{T} ≥ (*N* choose α*N*), or *T* ≥ −*N*α log α/log(2*N*). Note that these are lower bounds, and it may theoretically be the case that a larger number of pools is required; however, it is easy to see that if *A* is chosen as a random matrix in which each entry is 1 with probability 0.5, then the bounds given here are tight up to a constant factor (the proof is omitted from this version). Moreover, in [9–11] it is shown that using the matrix *C* one can decode variants of sufficiently low allele frequency with *T* = *O*(log *N*) pools, which is consistent with the bounds we provide here.

Since a random matrix provides a good decoding scheme in theory, we followed this intuition and generated a matrix *A* in which half of the entries in each row are 1 and the other half are 0. To obtain a better design matrix, we use local search: we repeatedly permute a random row or a random column and check whether the Hamming distance between the permuted row/column and the others has increased. If so, we keep the change; otherwise, we revert it. After performing 1000 such permutations, we obtain a matrix whose rows and columns are farther apart, which improves our ability to decode.
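The local-search construction can be sketched as follows. This is a simplified variant of the procedure described above: it permutes only rows (column moves are omitted for brevity), and it scores a candidate by the minimum pairwise Hamming distance between rows, which is one reasonable reading of "farther apart"; `local_search_design` and its parameters are illustrative names, not the paper's implementation:

```python
import random

def min_row_distance(A):
    """Smallest pairwise Hamming distance among the rows of A."""
    def ham(a, b):
        return sum(x != y for x, y in zip(a, b))
    T = len(A)
    return min(ham(A[i], A[j]) for i in range(T) for j in range(i + 1, T))

def local_search_design(T, N, iters=1000, seed=0):
    """Random balanced design improved by local search: shuffle a random
    row; keep the shuffle only if the minimum pairwise Hamming distance
    between rows increases, otherwise revert."""
    rng = random.Random(seed)
    base = [1] * (N // 2) + [0] * (N - N // 2)
    A = []
    for _ in range(T):
        row = base[:]
        rng.shuffle(row)
        A.append(row)
    best = min_row_distance(A)
    for _ in range(iters):
        i = rng.randrange(T)
        saved = A[i][:]
        rng.shuffle(A[i])
        score = min_row_distance(A)
        if score > best:
            best = score
        else:
            A[i] = saved  # revert the move
    return A

A = local_search_design(T=8, N=16)
# shuffling a row preserves its balance: each row keeps N/2 ones
```

Because a shuffle permutes entries within a row, every row keeps exactly *N*/2 ones throughout the search, so the balanced-design property is invariant under the moves.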

### Incorporating imputation into decoding

As described above, from an information-theoretic point of view, decoding the genotype vector is possible only when the allele frequency is low and therefore the genotype vector is sparse. For this reason, both Erlich et al. [9] and Shental et al. [11] make the connection between decoding and compressed sensing [19], where decoding succeeds because the desired vector is sparse. We therefore suggest incorporating imputation results into the decoding scheme; this allows us to overcome the information-theoretic bound for the following reason. We can represent the true genotypes *G* as a sum of the (rounded) imputation predictions *I*, *i*_{ij} ∊ {0, 1, 2}, and a set of imputation residual errors *R*, *r*_{ij} ∊ {−2, −1, 0, 1, 2}, where *G* = *I* + *R*. Then, the observed data can be represented as the pool results *Y* = *AG* = *AI* + *AR*. Now, note that *R* is a sparse vector and *I* is known; therefore, the information-theoretic lower bound above does not hold in our case, and there may be an algorithm that can decode the genotypes based on the sequencing and the imputation. In principle, we can solve for the imputation residual errors by solving the set of equations *AX* = *Y* − *AI*. Once we obtain the residual vector, we can recover the actual genotypes. In practice, as described below, we use the imputation dosage, so our algorithm theoretically searches over the entire space and not only over sparse vectors, but the search is pruned for dense vectors based on a likelihood model. As we show in the results section, this yields improved imputation accuracy for high allele frequency SNPs.
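To make the role of sparsity concrete, the residual decoding *AX* = *Y* − *AI* can be sketched by brute force, searching residuals in order of increasing sparsity (feasible only for tiny *N*; the design, genotypes, and `decode_with_imputation` helper are toy illustrations, not the algorithm used in the paper):

```python
from itertools import combinations, product

def decode_with_imputation(A, Y, I, max_changes=2):
    """Search for G = I + R with A*G = Y, preferring sparse residuals R."""
    T, N = len(A), len(A[0])
    def pools(G):
        return [sum(A[i][j] * G[j] for j in range(N)) for i in range(T)]
    for k in range(max_changes + 1):            # sparsest residual first
        for idx in combinations(range(N), k):
            for vals in product([-2, -1, 1, 2], repeat=k):
                G = I[:]
                for p, v in zip(idx, vals):
                    G[p] = I[p] + v
                if all(0 <= g <= 2 for g in G) and pools(G) == Y:
                    return G
    return None

A = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1]]
G_true = [1, 0, 2, 0]
I = [1, 0, 1, 0]   # imputation is wrong for one individual
Y = [sum(A[i][j] * G_true[j] for j in range(4)) for i in range(3)]
print(decode_with_imputation(A, Y, I))  # [1, 0, 2, 0]
```

Because the search only has to cover the few positions where imputation erred, far fewer pools are needed than would be required to pin down *G* from scratch, which is exactly the argument above.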

### Pooling using read counts

Our approach differs from previous approaches in that we consider the matrix *Y* and not *C*. As discussed above, this should allow a gain of approximately a factor of log *N* in the number of pools needed, at least for higher allele frequencies. However, in order to do so, we need to explicitly model the sequencing errors. The error model may differ depending on the application at hand. We describe here the model we use for the detection of mutations (SNP calling). There are three main sources of noise that we include in the model:

1. There are slight differences in the concentration of each individual's DNA in each pool. This *pooling noise* is modeled as normally distributed noise added to each non-zero element of *A*, with mean 0 and variance *σ*_{p}. Thus, we set *Â*_{ij} = *A*_{ij} + *ε*_{ij}, where *ε*_{ij} ~ *N*(0, *σ*_{p}) whenever *A*_{ij} = 1, and *Â*_{ij} = 0 otherwise.

2. There is variance in the coverage of any specific region of the genome. We denote by *L* the length of the sequenced genomic region; if the total number of bases sequenced is *λLN*, then we expect each base to be covered by λ reads on average. λ is often termed the expected *coverage*. We denote the numbers of reads covering individual *i* by *r*_{i1}, *r*_{i2} (corresponding to the two chromosomal copies). Then, *r*_{ij} is Poisson distributed with mean *m*_{i}. Prabhu et al. [10] showed that the *m*_{i} are approximately drawn from a Gamma distribution with *α* = 6.3 and *β* = λ/*α* for Illumina Solexa sequencers. We note that for a given variant it is easy to infer the value of *m*_{i}, since it is shared across all individuals in all pools. Thus, we have

   *m*_{i} ~ Γ(*α*, *β*), *r*_{ij} ~ *Poisson*(*m*_{i})

3. The third source of error is sequencing error. The sequencing error rate depends on the location of the base in the read, but since the location of the base is uniformly distributed, we simply model the error rate by a constant probability *ε* for a substitution (1% is an acceptable estimate).

The above procedure produces a matrix *Â* of noisy pools and a pair of vectors (*R*^{0}, *R*^{1}) of noisy sequence read counts; the number of sequence reads for each individual is generated by a Poisson distribution with an expectation that depends on the genotype *g*_{i} and the coverage *m*_{i}, followed by a Binomial distribution to model the errors, as explained above. *R*^{0} corresponds to the reads with the major allele, while *R*^{1} corresponds to the reads with the minor allele. Note that even if *g*_{i} = 0, if *ε* > 0 then the expected number of reads with the minor allele will be greater than 0 because of errors. The pooling results are given by (*Y*^{0}, *Y*^{1}), where *Y*^{k} = *ÂR*^{k}.
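The generative model can be sketched end to end as follows. This is an illustrative simulation under stated assumptions: each chromosomal copy is taken to contribute *m*_{i}/2 reads on average, and sequencing errors are drawn per read; the exact bookkeeping in the paper may differ:

```python
import math
import random

def poisson(rng, lam):
    """Knuth's Poisson sampler (adequate for the small means used here)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_reads(A, G, lam=30.0, sigma_p=0.05, eps=0.01, seed=0):
    rng = random.Random(seed)
    T, N = len(A), len(A[0])
    # 1. pooling noise: Gaussian perturbation of the non-zero entries of A
    Ahat = [[a + rng.gauss(0.0, sigma_p) if a else 0.0 for a in row]
            for row in A]
    # 2. coverage: m_i ~ Gamma(alpha, beta) with beta = lam / alpha
    alpha = 6.3
    m = [rng.gammavariate(alpha, lam / alpha) for _ in range(N)]
    R0, R1 = [], []
    for j in range(N):
        # reads per allele (assumption: each copy contributes m_j / 2)
        minor = poisson(rng, m[j] * G[j] / 2.0)
        major = poisson(rng, m[j] * (2 - G[j]) / 2.0)
        # 3. sequencing errors: each read flips allele with probability eps
        to_major = sum(rng.random() < eps for _ in range(minor))
        to_minor = sum(rng.random() < eps for _ in range(major))
        R1.append(minor - to_major + to_minor)
        R0.append(major - to_minor + to_major)
    Y1 = [sum(Ahat[i][j] * R1[j] for j in range(N)) for i in range(T)]
    Y0 = [sum(Ahat[i][j] * R0[j] for j in range(N)) for i in range(T)]
    return Y0, Y1

Y0, Y1 = simulate_reads([[1, 1, 0], [0, 1, 1]], [0, 1, 2])
```

Note that an individual with *g* = 0 can still contribute minor-allele reads through the `to_minor` error term, which is the effect described in the text.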

### A likelihood model

Given the pooling results *Y*, we need a decoding algorithm that estimates *G* from *Y*. To do so, we define a likelihood model which can evaluate each putative solution. Our likelihood model takes into account the error model as well as population genetics data and external information when available. We decompose the likelihood *L*(*G; Pools*) into several functions, and take their product as a composite likelihood.

#### Hardy-Weinberg Equilibrium

We first note that the overall allele frequency *p* of the SNP can be estimated as the average across all pools. We can now compute the Hardy-Weinberg (HW) probability of the observed genotypes,

*Pr*^{HW}(*G*) = ((1 − *p*)^{2})^{*n*_{0}} (2*p*(1 − *p*))^{*n*_{1}} (*p*^{2})^{*n*_{2}},

where *n*_{0}, *n*_{1}, *n*_{2} are the genotype counts in *G*. Using Bayes' law we have

*Pr*^{HW}(*G* | *Pools*) ∝ *Pr*^{HW}(*Pools* | *G*) *Pr*^{HW}(*G*)

Assuming no prior information, we observe that maximizing *Pr*^{HW}(*Pools* | *G*) is equivalent to maximizing *Pr*^{HW}(*G* | *Pools*). We denote the resulting HW likelihood term by *f*^{HW}(*G*).

#### Likelihood of the observed reads

We compute the probability of the observed reads in the pools given *G*, based on the noise model. Note that the only unknowns in the noise model are the concentrations of the individuals in the different pools, since the coverage in any given region can easily be estimated. Assume that λ is the coverage. Then, the number of reads with the minor (or major) allele contributed by individual *j* in pool *i* is Poisson distributed with mean *Â*_{ij}*λG*_{j} (or *Â*_{ij}*λ*(2 − *G*_{j})). Since a sum of Poisson random variables is Poisson distributed, each pool result *Y*_{i} is Poisson distributed with a known expectation, and thus we can write its likelihood. We denote this function by *f*^{noise}(*G*). In order to find the concentration values *Â*_{ij} we need to use external information. One possibility is to genotype a small set of SNPs across the population and use those as ground truth in order to tune the values of *Â*_{ij}. These SNPs provide a set of linear equations for the values of *Â*; for each pool we have one equation per SNP, and the number of variables is *N*. Therefore, genotyping as many as *O*(*N*) SNPs and using a least-squares approach guarantees an accurate estimate of *Â*_{ij}.
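The least-squares tuning of *Â* from calibration SNPs can be sketched per pool as follows (a pure-Python sketch via the normal equations; the genotype matrix, the concentrations, and the helper names are toy values, not data from the study):

```python
def solve(M, b):
    """Gaussian elimination with partial pivoting for a square system."""
    n = len(M)
    S = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(S[r][c]))
        S[c], S[piv] = S[piv], S[c]
        for r in range(n):
            if r != c:
                f = S[r][c] / S[c][c]
                for k in range(c, n + 1):
                    S[r][k] -= f * S[c][k]
    return [S[i][n] / S[i][i] for i in range(n)]

def estimate_concentrations(genos, pool_counts):
    """Least squares for one pool: genos is an S x N matrix of calibration
    SNP genotypes, pool_counts the S observed pool totals; returns the N
    concentrations via the normal equations (G^T G) w = G^T y."""
    S, N = len(genos), len(genos[0])
    GtG = [[sum(genos[s][a] * genos[s][b] for s in range(S)) for b in range(N)]
           for a in range(N)]
    Gty = [sum(genos[s][a] * pool_counts[s] for s in range(S)) for a in range(N)]
    return solve(GtG, Gty)

# Toy check: 6 calibration SNPs, 3 individuals, known concentrations
conc = [0.9, 1.1, 1.0]
genos = [[2, 0, 1], [0, 1, 2], [1, 2, 0], [2, 1, 0], [0, 2, 1], [1, 1, 1]]
y = [sum(c * g for c, g in zip(conc, row)) for row in genos]
est = estimate_concentrations(genos, y)
print([round(e, 6) for e in est])  # [0.9, 1.1, 1.0]
```

With noiseless pool totals the estimate is exact; with noise, more calibration SNPs than individuals (as in the *O*(*N*) argument above) averages the error down.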

#### Likelihood of imputed data

Due to the bounds given on the possibility of detection, it is clear that without external information we will not be able to do much better than detecting rare SNPs. One natural choice of external data is the genotypes of the individuals measured using microarrays. Today's genotyping technology is extremely cheap compared to sequencing, and the genotyping of thousands of individuals is feasible within a given study. The genotype information, however, covers fewer than a million SNPs and another million CNVs across the genome, while many other genetic variants are left unmeasured. To cope with this, imputation methods have been developed, in which nearby SNPs are used to impute unmeasured variants using the linkage disequilibrium structure of the genome [16, 20]. This process, however, is inevitably noisy, especially when imputing SNPs of low allele frequency or SNPs in regions of low linkage disequilibrium. Together with the pooling information, we are able to provide much more accurate calling of the imputed SNPs across all ranges of allele frequencies and linkage disequilibrium patterns. The output of an imputation method typically provides a distribution over the possible genotypes: for each individual *i*, we can assume we are given probabilities *h*_{i}(0), *h*_{i}(1), *h*_{i}(2), where *h*_{i}(*k*) is the probability that individual *i* has *G*_{i} = *k*. We can now use the imputation results in our likelihood model by writing *f*^{imp}(*G*) = ∏_{i} *h*_{i}(*G*_{i}).
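The imputation term is a simple product over individuals; in log space it can be sketched as (the probability table below is an invented example):

```python
import math

def imputation_loglik(G, H):
    """log f_imp(G) = sum_i log h_i(G_i); H[i] = [h_i(0), h_i(1), h_i(2)]."""
    return sum(math.log(H[i][g]) for i, g in enumerate(G))

H = [[0.90, 0.09, 0.01],   # individual 0: imputation favors genotype 0
     [0.20, 0.50, 0.30]]   # individual 1: imputation favors genotype 1
print(round(imputation_loglik([0, 1], H), 4))  # log(0.9 * 0.5) = -0.7985
```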

### A decoding algorithm using linear programming

We use a linear program to bound the possible errors in each of the pools. If the coverage for the SNP is λ, the pools should roughly satisfy λ*ÂG* = *Y*. We can therefore solve the following linear program:

minimize ‖λ*ÂG* − *Y*‖_{1} + *β*‖*G* − *I*‖_{1}, subject to 0 ≤ *G*_{j} ≤ 2 for every *j*

The linear program provides a lower bound on the best possible *l*_{1} distance between λ*ÂG* and *Y*, as well as returning a solution *G* which is close to the imputation prediction *I*. *β* is a parameter that trades off the relative importance of being close to the imputation vector against being consistent with the pools. Note that if *Y*_{j} is distributed as Poisson with expectation *µ*_{j}, and we write *Y*_{j} − *µ*_{j} = *x*_{j}, then by Stirling's approximation

log *Pr*(*Y*_{j}) = −*µ*_{j} + *Y*_{j} log *µ*_{j} − log(*Y*_{j}!) ≈ −*x*_{j}^{2}/(2*µ*_{j})

Therefore, we get that minimizing the *l*_{1} distance between λ*ÂG* and *Y* approximately maximizes the likelihood of the pool results.
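The decoding objective can be sketched by brute force for tiny *N* (an exhaustive stand-in over integer genotype vectors; the actual method solves the continuous relaxation with an LP solver, and the design and genotypes below are toy values):

```python
from itertools import product

def decode_exhaustive(Ahat, Y, I, lam, beta=1.0):
    """Minimize |lam*Ahat*G - Y|_1 + beta*|G - I|_1 over G in {0,1,2}^N.
    Feasible only for tiny N: 3**N candidates."""
    T, N = len(Ahat), len(Ahat[0])
    def objective(G):
        fit = sum(abs(lam * sum(Ahat[i][j] * G[j] for j in range(N)) - Y[i])
                  for i in range(T))
        return fit + beta * sum(abs(g - x) for g, x in zip(G, I))
    return min(product((0, 1, 2), repeat=N), key=objective)

Ahat = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
G_true = [1, 0, 2]
lam = 10.0
Y = [lam * sum(Ahat[i][j] * G_true[j] for j in range(3)) for i in range(3)]
I = [1, 0, 1]  # imputation disagrees in one position
print(decode_exhaustive(Ahat, Y, I, lam))  # (1, 0, 2)
```

Because λ is large relative to *β*, consistency with the pools dominates, and the decoder corrects the imputation error rather than copying *I*.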

### Application to gene fusion detection

In order to detect gene fusions, we make several changes and extensions to the model presented above. The major additional complication in the detection of fusion genes is that each sample may have a different expression level for a particular fusion gene. Even if we include the same amount of RNA from each tumor in each pool, the relative concentration of each gene will differ across samples. However, this concentration is approximately constant across pools. Let *e*_{ij} be the normalized expression level of a particular variant *j* (in this case a fusion gene) in individual *i*. Whether or not individual *i* carries variant *j* is encoded as *G* = {*g*_{ij}}, *g*_{ij} ∊ {0, 1}. We define the matrix *H* = {*e*_{ij}*g*_{ij}} as the concentrations of the samples; the results of the pools (assuming no noise) will then be *Y* = *AH* instead of *Y* = *AG* as in the genotyping application. In this application we can also assume that the matrix *G* is sparse, but in order to perform the decoding we must also estimate *e*_{ij} for the non-zero values of *g*_{ij}.

It is possible to estimate the *e*_{ij} because they are constant across pools; however, this introduces additional complexity into the optimization. We take advantage of the fact that fusion genes are very rare and that most fusion genes are not shared across tumors. We constrain the optimization to allow a maximum of *k* tumors to contain a given fusion. We note that we only need to estimate the values of *e*_{ij} corresponding to non-zero elements of *g*_{ij}. To perform the optimization we enumerate over all genotype vectors with at most *k* non-zero entries, and for each vector we estimate the corresponding *e*_{ij} values.

Since optimizing the likelihood function for each possible genotype vector is computationally impractical, we first solve a linear least-squares problem as a method to quickly eliminate poor solutions. Let *A** be the matrix consisting of only the columns of *A* corresponding to the non-zero entries of the genotype vector. If *x* is a vector whose length equals the number of non-zero elements of the genotype vector, the solution to *A** *x* = *Y* provides an approximate estimate of the values of *e*_{ij}. We can incorporate errors by appending a column of all 1s to *A** and a corresponding entry to *x* representing the amount of error expected in each pool. For the top 100 estimates obtained using the pseudo-inverse, we then perform a grid search over the values of *e*_{ij} using the likelihood function described above.
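For the special case *k* = 1 (a fusion private to a single tumor), the least-squares estimate has a closed form, which makes the screening step easy to sketch (toy design and noiseless pools; `detect_single_fusion` is a hypothetical helper, not the paper's implementation):

```python
def detect_single_fusion(A, Y):
    """Assume exactly one tumor carries the fusion (k = 1): for each
    candidate column a of A, the least-squares expression estimate is
    e = (a . Y) / (a . a); return the candidate with smallest l1 residual."""
    T, N = len(A), len(A[0])
    best = None
    for j in range(N):
        a = [A[i][j] for i in range(T)]
        denom = sum(x * x for x in a)
        if denom == 0:
            continue
        e = sum(x * y for x, y in zip(a, Y)) / denom
        resid = sum(abs(y - e * x) for x, y in zip(a, Y))
        if best is None or resid < best[0]:
            best = (resid, j, e)
    return best[1], best[2]

A = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 1]]
Y = [3.5, 0.0, 3.5]   # noiseless pools: tumor 2 carries it, expression 3.5
print(detect_single_fusion(A, Y))  # (2, 3.5)
```

For *k* > 1, the same screening idea applies with *A** built from *k* columns and solved by the pseudo-inverse, as described above.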