- Research article
- Open Access

# Quantitative group testing-based overlapping pool sequencing to identify rare variant carriers

- Chang-Chang Cao
^{1}, - Cheng Li
^{1}and - Xiao Sun
^{1}Email author

**15**:195

https://doi.org/10.1186/1471-2105-15-195

© Cao et al.; licensee BioMed Central Ltd. 2014

**Received:**6 November 2013**Accepted:**10 June 2014**Published:**17 June 2014

## Abstract

### Background

Genome-wide association studies have revealed that rare variants are responsible for a large portion of the heritability of some complex human diseases. This highlights the increasing importance of detecting and screening for rare variants. Although the massively parallel sequencing technologies have greatly reduced the cost of DNA sequencing, the identification of rare variant carriers by large-scale re-sequencing remains prohibitively expensive because of the huge challenge of constructing libraries for thousands of samples. Recently, several studies have reported that techniques from group testing theory and compressed sensing could help identify rare variant carriers in large-scale samples with few pooled sequencing experiments and a dramatically reduced cost.

### Results

Based on quantitative group testing, we propose an efficient overlapping pool sequencing strategy that allows the efficient recovery of variant carriers in numerous individuals with much lower costs than conventional methods. We used random *k*-set pool designs to mix samples, and optimized the design parameters according to an indicative probability. Based on a mathematical model of sequencing depth distribution, an optimal threshold was selected to declare a pool positive or negative. Then, using the quantitative information contained in the sequencing results, we designed a heuristic Bayesian probability decoding algorithm to identify variant carriers. Finally, we conducted *in silico* experiments to find variant carriers among 200 simulated *Escherichia coli* strains. With the simulated pools and publicly available Illumina sequencing data, our method correctly identified the variant carriers for 91.5–97.9% variants with the variant frequency ranging from 0.5 to 1.5%.

### Conclusions

Using the number of reads, variant carriers could be identified precisely even though samples were randomly selected and pooled. Our method performed better than the published DNA Sudoku design and compressed sequencing, especially in reducing the required data throughput and cost.

## Keywords

- Quantitative group testing
- Random
*k*-set pool design - Overlapping pool sequencing
- Rare variants

## Background

Rare variants are responsible for a large portion of the heritability of some common complex human diseases [1, 2]. Genome-wide association studies have focused on the contribution of variants of low minor allele frequency (MAF 0.5–5%), or of rare variants (MAF < 0.5%) [2]. The functional and evolutionary impacts of rare variants have been reported; therefore, large-scale screening for disease-associated rare variants is becoming increasingly important [3, 4]. One major application of rare variants genotyping is in screens for rare genetic disorders such as Tay–Sachs disease and thalassemia [5].

Because of the extremely low frequency of rare variants, sample sizes must be large enough to guarantee efficient observations. The cost of DNA sequencing has dropped dramatically with the introduction of the massively parallel sequencing technologies. However, identifying rare variant carriers by sequencing individual samples separately remains prohibitively expensive because of the huge challenge of constructing sequencing libraries for thousands of samples [6, 7]. Barcoding has been developed as a powerful approach to cost-effectively determine the genotype of each individual [7]. To further reduce the cost of large-scale screens for rare variant carriers, several techniques based on the group testing theory [8] and compressed sensing [9, 10] to construct overlapping pool sequencing strategies have been used. These strategies have helped decrease the sequencing times for rare variant carrier identification and further lower the cost [11–14].

Because a large number of samples are pooled together and then sequenced, overlapping pool sequencing uses fewer pools to identify rare variant carriers among large numbers of individuals. Thus, overlapping pool sequencing can vastly reduce the time and cost for DNA library preparation. Some overlapping pool sequencing programs return true/false values after testing each pool; this scheme was adopted by Erlich et al. [11], Prabhu and Pe’Er [12], and Cao et al. [14], who used the number of normal and variant reads in each pool to determine whether a pool contained carriers. However, the quantitative information in the sequencing data, which can be used to estimate the percentage of variant chromosomes in a pool, is missed in these methods. Quantitative group testing is an alternative scheme that takes the quantitative information into account, thus rare variant carriers can be identified efficiently [13].

We propose an efficient random overlapping pool sequencing strategy with quantitative group testing for the identification of rare variant carriers using massively parallel sequencing data. Because of the excellent performance of random designs in classic group testing [15, 16], we employed a random *k*-set pool design [17] to mix samples. The parameters of the random *k*-set pool design can be selected appropriately according to an indicative probability value. Based on a depth model for pooled sequencing, we calculated the optimal cut-off of the number of reads containing variants to distinguish pools containing variant carriers from those that do not. Using the quantitative information contained in the sequencing data, we designed a heuristic Bayesian decoding algorithm to identify variant carriers accurately. Compared with the DNA Sudoku algorithm [11] and compressed sequencing [13], our method required less data throughput. Finally, we applied our method to identify variant carriers among 200 simulated *Escherichia coli* strains using simulated pools and publicly available Illumina sequencing data. The results showed that our method could successfully identify carriers for 91.5–97.9% of the variants with frequencies ranging from 0.5 to 1.5%.

## Methods

### Random *k*-set pool design

Random *k*-set pool designs were first proposed by Bruno et al. [15] for efficient DNA clone library screening. In such a design, each clone is pooled in exact *k* pools that are chosen with equal chance. Random *k*-set pool designs are easy to specify for any number of pools and are known to be efficient in classic group testing, but they have not been used in real situations, partly because of the presence of different sample sets with identical test sets that are indistinguishable when the test results are qualitative [16]. However, this problem can be overcome by quantitative tests such as sequencing experiments.

*n*samples containing

*d*positive samples, the basic objective of a random

*k*-set pool design is to identify all the positive samples with a small number of pooled tests. In such a design, each sample occurs in exact

*k*pools, and a pool is defined as positive only when it contains at least one positive sample; otherwise, it is defined as negative. For a random

*k*-set pool design with

*t*pools, Hwang [17] calculated the probability that a given set of

*i*pools is a negative one (Eq. (1)) and the expected number of negative pools (Eq. (2).

where *n_min* and *n_max* are the minimum and maximum number of negative pools, respectively, and *h* is a temporary variable. *n_max* = *t - k*, and *n_min* = 0 (if *dk* ≥ *t*) or *t - dk* (if *dk* < *t*).

*PI*; Eq. (5)) to evaluate the performance of random

*k*-set designs in quantitative group testing.

*PI*is the probability that positive pools are more than the sum of unresolved negative samples and real positive samples; therefore, designs with high

*PI*s will perform better than designs with low

*PI*s.

where *p_min* and *p_max* are the minimum and maximum number of positive pools, respectively, *p_min* = *t* - *n_max*, and *p_max* = *t* - *n_min*. The derivation of Eq. (5) is given in Appendix 1.

### Optimal cut-off value for pooled sequencing

For overlapping pool sequencing, the DNA samples are mixed and then sequenced. Samples from variant carriers are treated as positive and samples from normal individuals are treated as negative. To distinguish positive pools containing variant carriers from negative pools consisting of normal individuals, the cut-off for the number of reads containing variants must be selected carefully to declare whether a pool contains carriers or not. Ideally, the cut-off value must guarantee that the minimum error rates are obtained, including false-positive and false-negative rates.

*D*, the depth that represents the number of times a nucleotide is read follows a negative binomial distribution $\mathit{NB}\left(\frac{\mathit{D}}{\mathit{r}-1},\frac{1}{\mathit{r}}\right)$ where

*r*is the variance/mean ratio;

*r*is related to sequencing platforms and genomes and can be estimated from sequencing results (Eq. (6)).

*N*samples, given sequencing depth

*D*and average sequencing error rate

*p*

_{ error }, the probabilities

*P*

_{ nv }(

*N*

_{ v }) and

*P*

_{ pv }(

*N*

_{ v }) that

*N*

_{ v }reads containing variants are observed in negative pools and positive pools, respectively, are given by Eqs. (7) and (8). For a locus sequenced

*i*times, the number of sequencing errors follows a binomial distribution

*Bin*(

*i*,

*p*

_{ error }).

*P*

_{ nn }(

*N*

_{ n }) and

*P*

_{ pn }(

*N*

_{ n }) that

*N*

_{ n }reads without variants are observed in negative pools and positive pools, respectively, are given by Eqs. (9) and (10).

where *p* is the percentage of variant chromosomes in the pool. In a positive pool with *N* diploid DNA samples and *c* heterozygous variant carriers, ignoring the bias in mixing samples, the percentage of the variant chromosomes is *p* = *c*/(2 *N*), while for haploid samples *p* = *c*/*N*. The derivations of Eqs. (7)–(10) are given in Appendix 1.

*P*

_{ nv }(

*N*

_{ v }) and

*P*

_{ pv }(

*N*

_{ v }), the formula for the false-positive rate

*F*

_{ p }and false-negative rate

*F*

_{ n }in classifying pools with a cut-off value

*T*can be constructed (Eqs. (11) and (12)).

*T*can be defined as the value that minimizes the maximum values of

*F*

_{ n }and

*F*

_{ p }.

### Decoding algorithm

*A*is the set of variant carriers, the probability that the sequencing result

*O*is observed is given by Eq. (13).

where *O*_{
iv
} and *O*_{
in
} are the number of reads with and without variants in the *i*^{th} pool. Given that *A* is the set of variant carriers, if the *i*^{th} pool is negative, then *P*(*O*_{
iv,
}*O*_{
in
}*|A*) = *P*_{
nv
}(*O*_{
iv
})*P*_{
nn
}(*O*_{
in
}); otherwise, *P*(*O*_{
iv,
}*O*_{
in
}*|A*) = *P*_{
pv
}(*O*_{
iv
})*P*_{
pn
}(*O*_{
in
}).

For the second step, we designed a Bayesian decoding algorithm based on Eq. (13). To identify variant carriers among haploid samples, after excluding resolved negative samples, the rest of the samples form a set *A*_{
0
} = {*S*_{
1
},…,*S*_{
c
}}. First, assuming that all the samples in *A*_{
0
} are variant carriers, we calculate the probability of observing the sequencing results and denote it as *Pmax_0*. Next, replace one positive sample in set *A*_{
0
} with a negative sample in turn and repeat the probability calculation. Denote the derived set that results in the maximum probability as *A*_{
1
} and the corresponding probability as *Pmax_1*. Replace *A*_{
0
} with *A*_{
1
} and repeat this step until *A*_{
c
} and *Pmax_c* are obtained. Finally, the set *A*_{
i
} that results in the maximum corresponding probability *Pmax_i* is defined as the set of variant carriers. These steps are written as Algorithm 1.

Two kinds of positive samples need to be considered while identifying variant carriers among diploid samples: heterozygous carriers and homozygous carriers. First, suppose that there are only heterozygous variant carriers; this is analogous to finding variant carriers among haploid samples. Then we present Algorithm 2 which is very similar to Algorithm 1 to recognize heterozygous and homozygous variant carriers.

Our decoding procedure to identify variant carriers among diploid samples is summarized in Algorithm 3. First, we suppose that only heterozygous variant carriers exist and run Algorithm 1 to find the set of variant carriers *C*_{
0
}. Then, run Algorithm 2 to recognize heterozygous and homozygous variant carriers..

## Results and discussion

### Optimal cut-off value

*r*of 3 performed much better than the Poisson distribution. Therefore,

*r*was set to 3 in our simulation unless otherwise stated.For a pool consisting of 10 diploid samples, we calculated the false-positive and false-negative probabilities with different cut-off values when only one heterozygous variant carrier was allowed in the positive pool. The average sequencing error rate and whole depth were set to 0.01 and 600×, respectively (Figure 1a). The results verified the importance of selecting an appropriate cut-off value. With smaller cut-off values, the probability of misclassifying negative pools as positive is high. With higher cut-off values, some positive pools will be misclassified. Both will lower the speed and accuracy of decoding. From the results we can infer that the optimal cut-off value is 14; here both the false-negative and false-positive probabilities are very low (Figure 1a).In most studies, because of the rarity of positive samples, the number of positive and negative pools is unequal. Therefore, selecting a cut-off value based on the expected number of errors in classifying pools is a more practical approach. For instance, when there are 30 negative and 10 positive pools mentioned above, the optimal cut-off value is 16 (Figure 1b). In the following simulation experiments, we adopted this kind of scheme unless otherwise stated.

As mentioned, the variance/mean ratio *r* is related to the sequencing platforms and genomes. Because the observed variation is significantly greater than the mean of the sequencing depth, *r* is larger than 1. Different values for *r* yield distinct distributions. The pooling design mentioned above consisting of 30 negative and 10 positive pools was used to estimate the effect of *r* on our methods. We calculated the least depth required for each pool to make the expected number of errors in classifying pools smaller than 1 by increasing the depth gradually (see Additional file 1: Figure S1). From the results, we can see that our method performed even better for smaller *r*, which required less data throughput.

### Performance of the pipeline

To evaluate the performance of our method, we conducted substantial simulations to identify four heterozygous variant carriers among 100 haploid samples. 1000 replicates were done for each pair of design parameters: pool number *t* and column weight *k*. The pooling matrix was designed by collecting random binary vectors with length *t* and weight *k*, meaning that each sample was mixed in *k* of *t* pools. Identical vectors were deleted and the steps were repeated until 100 vectors were obtained to form the matrix.

We used the random function in Perl to simulate the number of reads with and without variants in pooled sequencing. Sequencing error and mixing bias were added to the simulation procedure to bring it closer to a real situation. Sequencing error follows a binomial distribution in sequencing results, and in the simulation the average sequencing error rate was set as 0.01. Mixing bias is caused by the practical difficulty of mixing exactly equal amounts of DNA samples. Based on the study conducted by Shental et al. [13], a random variable following the Gaussian distribution was added to each non-zero element of the pooling matrix to simulate mixing bias. The standard deviation of the Gaussian distribution was 0.05, reflecting up to 5% average noise in the mixed quantities of each sample.

*k*negatively influenced the correct decoding rates. Moreover, a large

*k*meant more pooling procedures and increased experimental costs. Therefore, a proper column weight

*k*is critical for conducting experiments successfully and efficiently.

### Cost-effective overlapping pool sequencing

The column weight *k* denotes the mixing times for each sample in a random *k*-set pool design. For a given number of pools, a *k* that is too large or too small will lower the decoding accuracy. We designed an indicative probability *PI*, which reflects the performance of random *k*-set designs that could be used to choose the optimal column weight *k*.

*k*under the condition that 30 pools were allowed to identify four heterozygous variant carriers among 100 diploid samples by conducting 1000 replicates for each

*k*. Next,

*PI*was computed based on Eq. (5) and the results are shown in Figure 3. A strong correlation was observed between

*PI*and the correct decoding rate (Pearson correlation coefficient = 0.92,

*p*-value = 9.8e-06), especially before the correct decoding rate reached the saturation point (

*k*= 6, Pearson correlation coefficient = 0.98,

*p*-value = 3.4e-3). The

*PI*values and correct decoding rates for identifying variant carriers were also obtained under different scenarios (see Additional file 1: Figure S2). All the scenarios showed strong correlations between

*PI*and the correct decoding rate before the correct decoding rate reached the saturation point.

For a given pool number *t*, we defined the optimal *k* as the minimum that obtains the maximum *PI* value, which could maximize the correct decoding rate. Designs with optimal *k* require fewer pools or lower sequencing depth. In practice, the optimal *k* is selected by calculating the *PI* value without the need to conduct simulations, thereby greatly reducing the computational time required.

**Five candidate designs to identify one heterozygous variant carrier among 100 individuals**

ID of candidate design | # of pools | Data throughput (Gb) | Cost |
---|---|---|---|

I | 10 | 567.0 | $35,051.0 |

II | 20 | 292.8 | $25,518.4 |

III | 30 | 268.8 | $29,246.4 |

IV | 40 | 272.4 | $34,437.2 |

V | 50 | 265.2 | $39,055.6 |

Sequencing separately | 100 | 72.6 | $53,847.8 |

**Most cost-effective designs for different scenarios**

Sequencing region (Mb) | Sample size | Frequency of variant | # of pools | Data throughput (Gb) | Cost saving | |
---|---|---|---|---|---|---|

Haploid sample | 5 | 200 | 0.5% | 20 | 83.4 | 85.7% |

5 | 200 | 1% | 30 | 124.8 | 78.6% | |

5 | 200 | 1.5% | 40 | 128.8 | 73.4% | |

Diploid sample | 30 | 200 | 0.25% | 30 | 669.6 | 53.4% |

30 | 100 | 0.5% | 20 | 292.8 | 52.6% | |

30 | 100 | 1% | 30 | 534.6 | 20.3% |

### Comparisons with current methods

In 2009, benefiting from the Chinese remainder theorem, Erlich et al. [11] put forward the DNA Sudoku design for overlapping pool sequencing. A pattern consistency decoding algorithm was also developed by Erlich et al. [11] to identify variant carriers with the DNA Sudoku design. In 2010, Shental et al. [13] developed a method called compressed sequencing to identify rare variants and their carriers by borrowing techniques from compressed sensing. Two designs were proposed in compressed sequencing: one used pools with a random half of the samples and the other used pools with sizes equal to the square root of the number of samples. We compared the performance of our method in identifying rare variant carriers with the performances of these two methods.

To identify variant carriers in 100 diploid samples, the DNA Sudoku design with parameter *d*_{
0
} = 2 was employed that required 36 pools. To maintain consistency, only 36 pools were allowed for the random *k*-set pool design and compressed sequencing. Since the expected number of positive and negative pools was not clear for the DNA Sudoku design, the cut-off value for the number of reads containing variants to declare a pool to be positive was set based on the false-negative and false-positive rates, and not on the expected number of errors in the classifying pools.

### Simulation experiment

We applied our method to identify variant carriers among 200 simulated *E. coli* strains. Illumina sequencing reads of two *E. coli* strains were downloaded from GenBank’s Short Read Archive (O157:H7 strain [SRA: ERR018562]) and BGI’s FTP site (O104:H4 strain, ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482). We treated the O157:H7 strain as the variant carrier and the O104:H4 strain as the normal sample. Bowtie0.12.9 [23] was used to map the O157:H7 reads to the O104:H4 genome, and SAMtools 0.1.19 [24] was used to call single base mutations. Because the mean depth was 134× for O157:H7, mutations with depths lower than 130 or higher than 140 were removed to control the quality; the remaining 1271 mutations were used in the analysis.

*in silico*. Considering up to 5% average noise in the DNA quantities of each sample in the pooling procedure, the number of reads for each sample was revised with a random coefficient following a Gaussian distribution to simulate reality. Bowtie was used to map pooled reads, and Perl scripts were used to count the reads with and without variants that were mapped at the loci of variants. After the decoding procedure, variant carriers could be identified correctly for 91.5–97.9% variants. This result was consistent with the design capability (Table 3).

**Correct decoding rate of our method in the identification of variant carriers**

Experiment | Frequency of variant | Variant carriers | Correct decoding rate |
---|---|---|---|

1 | 0.5% | 4 | 97.9% |

2 | 1% | 164 | 93.5% |

3 | 1.5% | 31 | 91.5% |

## Conclusions

Here, an efficient method that harnesses random *k*-set pool designs and massively parallel sequencing technologies to identify rare variant carriers is presented. The parameters of the random *k*-set pool design can be selected appropriately depending on an indicative probability. According to the depth model for pooled sequencing, the optimal cut-off value to separate negative pools from positive pools was designed. Taking advantage of the quantitative information in the sequencing results, a heuristic Bayesian decoding algorithm to identify the variant carriers was developed. Compared with the DNA Sudoku design and compressed sequencing, our method showed potential advantages, especially in decreasing the required data throughput. Finally, we applied our method to identify variant carriers among 200 simulated *E. coli* strains using simulated pools and Illumina sequencing data. Our method successfully identified variant carriers at reduced experimental costs.

For the accurate identification of variant carriers, the sequencing depth and pool number must be adequate to overcome sequencing errors and mixing bias. Considering the trade-off between the pool number and data throughput, substantial simulations need to be performed to verify whether a design is capable of identifying all the variant carriers correctly. Because the overall cost of overlapping pool sequencing stems from the pooling procedure, library construction, and data production, the optimal design depends on the whole cost.

Our decoding algorithm identifies the variant carriers by maximizing the posterior probability, and does not depend too much on the rarity of variants. Therefore, our approach can succeed even for low frequency variants. Furthermore, the sequencing qualities that indicate the sequencing error probabilities could be integrated into the calculation of the posterior probability in the decoding procedure to improve the accuracy. Compared with compressed sequencing, our decoding procedure was very time-consuming because of the substantial calculation of the posterior probability. This will be improved in future work.

Further improvement could be made with a reasonable depth model. Although in many studies negative binomial distribution rather than Poisson distribution has been used to fit the sequencing depth, numerous different models exist. We could not determine which model fit the depth distribution best because, in previous studies, these models have not been compared. Additionally, different sequencing procedures and platforms, such as exome sequencing and whole genome sequencing, produce distinct depth distributions. We aim to employ a better depth model to improve the performance of our method.

Our method has the advantage over compressed sequencing because required data throughput is reduced. However, because each sample is sequenced multiple times, the required data throughput is still substantial. Third-generation sequencing technologies [25, 26], which significantly reduce the cost for data production, may help to overcome this drawback. We expect that our method could be applied not only in sequencing experiments but also in other fields as long as the pooled experimental results contain quantitative information about the number of positive samples.

## Appendix 1: Derivations of Eq. (5) and Eqs. (7)–(10)

*PI*is the probability that positive pools are more than the sum of unresolved negative samples and real positive samples. If,

*N*

_{ p }is the number of positive pools, $\overline{\mathit{N}}$ is the number of unresolved negative samples, and

*d*is the number of positive samples, then

*PI*can be written as A(1).

where *p_min* and *p_max* are the minimum and maximum number of positive pools, respectively.

*N*

_{ p }=

*i*indicates that there are

*t*-

*i*negative pools,

*P*(

*N*

_{ p }=

*i*) can be formulated as A(2). Because $\mathit{P}\left(\overline{\mathit{N}}+\mathit{d}\le \mathit{i}\right)$ = $\mathit{P}\left(\overline{\mathit{N}}\le \mathit{i}-\mathit{d}\right)$, $\mathit{P}\left(\overline{\mathit{N}}+\mathit{d}\le \mathit{i}\right)$ can be formulated as A(3). After integrating A(1)–A(3),

*PI*can be formulated as A(4).

*N*

_{ v }reads containing variants are observed in a negative pool (

*P*

_{ nv }(

*N*

_{ v })), and

*N*

_{ n }reads without variants are observed in a negative pool (

*P*

_{ nn }(

*N*

_{ n })), respectively. Briefly,

*P*

_{ nv }(

*N*

_{ v }) can be written as A(5).

*P*(

*i*) is the probability that

*i*reads are obtained, and

*P*

_{ e }(

*Nv*|

*i*) is the probability that

*N*

_{ v }errors occur among these

*i*reads. Because the depth follows a negative binomial distribution and sequencing errors follow a binomial distribution, these two probabilities can be formulated as A(6) and A(7). In A(6),

*D*and

*r*are the mean depth of coverage for pooled sequencing and the variance/mean ratio, respectively. In A(7),

*p*

_{ error }is the mean sequencing error rate.

*P*

_{ nv }(

*N*

_{ v }) can be formulated as A(8).

*P*

_{ nn }(

*N*

_{ n }) (A(9)) is similar to the derivation for

*P*

_{ nv }(

*N*

_{ v }).

*N*

_{ v }reads containing variants are observed in a positive pool (

*P*

_{ pv }(

*N*

_{ v })) and

*N*

_{ n }reads without variants are observed in a positive pool (

*P*

_{ pn }(

*N*

_{ n })), respectively. The observations of a variant in a positive pool consist of two parts: real variants from variant chromosomes, and false variants resulting from sequencing errors. Briefly,

*P*

_{ pv }(

*N*

_{ v }) can be written as A(10) where

*P*

_{ N }(

*x*) stands for the probability that

*x*reads containing variants stemming from the sequencing results of normal chromosomes, and

*P*

_{ P }(

*O*-

*x*) denotes the probability that

*O*-

*x*reads contain variants from variant chromosomes.

*P*

_{ N }(

*x*) and

*P*

_{ P }(

*N*

_{ v }-

*x*) can be formulated as A(11) and A(12). The only difference is the mean sequencing depth of coverage. Because the percentages of variant chromosomes and normal chromosomes are

*p*and 1 -

*p*, respectively, the mean depths of coverage for sequencing variant chromosomes and normal chromosomes are

*pD*and (1 -

*p*)

*D*, respectively.

*P*

_{ pv }(

*N*

_{ v }) can be obtained by integrating A(10)–A(12), which is shown as A(13).

*P*

_{ pn }(

*N*

_{ n }) can be obtained as shown in A(14).

## Declarations

### Acknowledgements

This work was supported by the National Basic Research Program of China (No. 2012CB316501) and the National Natural Science Foundation of China (61073141).

## Authors’ Affiliations

## References

- Bodmer W, Bonilla C: Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet. 2008, 40 (6): 695-701. 10.1038/ng.f.136.View ArticlePubMed CentralPubMedGoogle Scholar
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747-753. 10.1038/nature08494.View ArticlePubMed CentralPubMedGoogle Scholar
- Nelson MR, Wegmann D, Ehm MG, Kessner D, Jean PS, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D: An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012, 337 (6090): 100-104. 10.1126/science.1217876.View ArticlePubMed CentralPubMedGoogle Scholar
- Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G: Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012, 337 (6090): 64-69. 10.1126/science.1219240.View ArticlePubMed CentralPubMedGoogle Scholar
- Golan D, Erlich Y, Rosset S: Weighted pooling—practical and cost-effective techniques for pooled high-throughput sequencing. Bioinformatics. 2012, 28 (12): i197-i206. 10.1093/bioinformatics/bts208.View ArticlePubMed CentralPubMedGoogle Scholar
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.View ArticlePubMedGoogle Scholar
- Patterson N, Gabriel S: Combinatorics and next-generation sequencing. Nat Biotechnol. 2009, 27 (9): 826-827. 10.1038/nbt0909-826.View ArticlePubMedGoogle Scholar
- Ding-Zhu D, Hwang FK: Combinatorial group testing and its applications. 2000, APPLIED MATHEMATICS: SERIES ON, 12-Google Scholar
- Candes EJ, Romberg JK, Tao T: Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. 2006, 59 (8): 1207-1223. 10.1002/cpa.20124.View ArticleGoogle Scholar
- Donoho DL: Compressed sensing. IEEE Trans Inf Theory. 2006, 52 (4): 1289-1306.View ArticleGoogle Scholar
- Erlich Y, Chang K, Gordon A, Ronen R, Navon O, Rooks M, Hannon GJ: DNA Sudoku—harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res. 2009, 19 (7): 1243-1253. 10.1101/gr.092957.109.View ArticlePubMed CentralPubMedGoogle Scholar
- Prabhu S, Pe’Er I: Overlapping pools for high-throughput targeted resequencing. Genome Res. 2009, 19 (7): 1254-1261. 10.1101/gr.088559.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Shental N, Amir A, Zuk O: Identification of rare alleles and their carriers using compressed se (que) nsing. Nucleic Acids Res. 2010, 38 (19): e179-e179. 10.1093/nar/gkq675.View ArticlePubMed CentralPubMedGoogle Scholar
- Cao C-C, Li C, Huang Z, Ma X, Sun X: Identifying rare variants with optimal depth of coverage and cost-effective overlapping pool sequencing. Genet Epidemiol. 2013, 37: 820-830. 10.1002/gepi.21769.View ArticlePubMedGoogle Scholar
- Bruno WJ, Knill E, Balding DJ, Bruce D, Doggett N, Sawhill W, Stallings R, Whittaker CC, Torney DC: Efficient pooling designs for library screening. Genomics. 1995, 26 (1): 21-30. 10.1016/0888-7543(95)80078-Z.View ArticlePubMedGoogle Scholar
- Ngo HQ, Du DZ: A survey on combinatorial group testing algorithms with applications to DNA library screening. Discrete mathematical problems with medical applications. 2000, 55: 171-182.Google Scholar
- Hwang F: Random k-set pool designs with distinct columns. Probability in the Engineering and Informational Sciences. 2000, 14 (1): 49-56.View ArticleGoogle Scholar
- Barillot E, Lacroix B, Cohen D: Theoretical analysis of library screening using a N-dimensional pooling strategy. Nucleic Acids Res. 1991, 19 (22): 6241-6247. 10.1093/nar/19.22.6241.View ArticlePubMed CentralPubMedGoogle Scholar
- Sarin S, Prabhu S, O’Meara MM, Pe’er I, Hobert O: Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat Methods. 2008, 5 (10): 865-10.1038/nmeth.1249.View ArticlePubMed CentralPubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-10.1186/gb-2010-11-10-r106.View ArticlePubMed CentralPubMedGoogle Scholar
- Miller CA, Hampton O, Coarfa C, Milosavljevic A: ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS One. 2011, 6 (1): e16327-10.1371/journal.pone.0016327.View ArticlePubMed CentralPubMedGoogle Scholar
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE: Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009, 461 (7261): 272-276. 10.1038/nature08250.View ArticlePubMed CentralPubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.View ArticlePubMed CentralPubMedGoogle Scholar
- Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H: Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol. 2009, 4 (4): 265-270. 10.1038/nnano.2009.12.View ArticlePubMedGoogle Scholar
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323 (5910): 133-138. 10.1126/science.1162986.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.