 Software
 Open Access
PDA: Pooled DNA analyzer
BMC Bioinformatics volume 7, Article number: 233 (2006)
Abstract
Background
Association mapping using abundant single nucleotide polymorphisms (SNPs) is a powerful tool for identifying disease susceptibility genes for complex traits and exploring possible genetic diversity. Genotyping large numbers of SNPs individually is performed routinely but is cost-prohibitive for large-scale genetic studies. DNA pooling is a reliable and cost-saving alternative genotyping method. However, no software has been developed for complete pooled-DNA analyses, including data standardization, allele frequency estimation, and single/multipoint DNA pooling association tests. This motivated the development of the software, 'PDA' (Pooled DNA Analyzer), to analyze pooled DNA data.
Results
We developed the software, PDA, for the analysis of pooled-DNA data. PDA was originally implemented in the MATLAB^{®} language, but it can also be executed on a Windows system without installing MATLAB^{®}. PDA provides estimates of the coefficient of preferential amplification and allele frequency. PDA considers an extended single-point association test, which can compare allele frequencies between two DNA pools constructed under different experimental conditions. Moreover, PDA also provides novel chromosome-wide multipoint association tests based on p-value combinations and a sliding-window concept. This new multipoint testing procedure overcomes a computational bottleneck of conventional haplotype-oriented multipoint methods in DNA pooling analyses and can handle data sets having a large pool size and/or large numbers of polymorphic markers. All of the PDA functions are illustrated in four bona fide examples.
Conclusion
PDA is simple to operate and does not require that users have a strong statistical background. The software is available at http://www.ibms.sinica.edu.tw/%7Ecsjfann/first%20flow/pda.htm.
Background
The millions of single nucleotide polymorphisms (SNPs) now available are ideal for association analyses that identify important genetic variants in populations as well as genes that predispose to diseases involving complex traits [1, 2]. Although the cost of individual genotyping has been reduced drastically over the years, the use of DNA pooling has reduced the cost even further, especially for large-scale studies. The first DNA pooling study was performed to identify the association between HLA class II loci and disease genes predisposing to type 1 diabetes [3]. DNA pooling was later used to estimate the allele frequency of short tandem repeats and SNPs, map disease susceptibility genes [4, 5], and identify polymorphisms [6–8]. A comprehensive review of the history of DNA pooling, the methods and algorithms involved, and their applications can be found in [9] and [10].
DNA pooling is highly efficient. Many researchers have investigated the performance of DNA pools in estimating allele frequency and have measured the impact of pooling on association test results. The results show that allele frequencies can be estimated accurately and precisely using DNA pools after accounting for the coefficient of preferential amplification (CPA) [11, 12]; moreover, the test power is high and the false-positive rate is well controlled [11, 13]. These promising results suggest that DNA pooling studies are reliable and cost-saving relative to individual genotyping studies. This motivated the development of the software, Pooled DNA Analyzer (PDA), to analyze pooled DNA data.
Although many single-point pooled DNA association tests have been developed, multipoint analysis still presents a challenge due to the large numbers of genotypic combinations in DNA pools. The difficulty increases substantially with the pool size and/or the number of SNPs involved. Several of the recently proposed advanced multipoint estimations and tests have been haplotype-oriented [14–17]; nevertheless, all such methods require a small pool size and a small number of SNPs to limit both the computational complexity and running time. To address the current computational challenges of analyzing DNA pools, PDA provides the sliding-window empirical p-value test (SWEPT), which has advantages with respect to statistical computation, data implementation and practical application. The SWEPT method is particularly applicable when the analysis involves a large amount of data, because it overcomes the computational bottleneck of conventional haplotype-oriented multipoint methods in DNA pooling analyses.
Implementation
PDA was developed on the MATLAB^{®} software platform for Windows systems (MS Windows^{®} 98/ME and MS Windows^{®} NT/2000/XP/2003). For MATLAB^{®} users, PDA can be run with a user-friendly graphical interface in which users merely click checkboxes to carry out data analysis. The PDA user interface is shown in Figure 1. For those who have no access to or little knowledge of the MATLAB^{®} system, we used the MATLAB^{®} compiler to generate standalone executables of PDA, which can be deployed on machines without installing MATLAB^{®}. The guide to the installation and initialization of PDA on Windows is given in Appendix A (See Additional File 1). The working directories for PDA are described in Appendix B (See Additional File 2). PDA's input and output data formats are explained in Appendices C and D (See Additional files 3 and 4), respectively. Finally, the compiled version of PDA is demonstrated in Appendix E (See Additional File 5).
Interface of PDA, item functions and operation procedures
There are seven main items in the PDA menu, i.e., input/output directory, number of groups studied, data type for CPA estimation, bootstrapped standard error (s.e.) of CPA estimates, allele frequency estimates, single-point pooled DNA association test and multipoint pooled DNA association test.
Item 1. Input/Output directory: The directories of input and output files must be specified. PDA will read data from the assigned input directory and automatically save outputs in the output directory. The format of input and output is illustrated in Appendices C and D (See Additional files 3 and 4).
Item 2. Number of groups studied: PDA can analyze one-group or two-group DNA pooling data. For one-group studies, users can estimate CPA and calculate adjusted allele frequency by checking the box 'One group'. For two-group studies (e.g., case-control studies), users check the box 'Two groups' and determine whether to carry out association tests after calculating estimates for CPA and allele frequency. PDA supports statistical inference under either equal or unequal CPAs in the two groups, and the user may choose as needed. Check 'Yes' for equal CPA inference or 'No' for unequal CPA inference.
Item 3. Data type for CPA estimation: Two types of data are acceptable. The first type is peak intensity data from genotyping experiments. The second type is raw CPA/heterozygote ratio from empirical studies or databases. If peak intensity data are inputted, then users should provide the number of pairs of peak intensities for each locus.
Item 4. Calculation of the bootstrapped s.e. of the CPA estimate: Bootstrapping is a resampling technique used to estimate the s.e. of CPA. Users can determine whether s.e. is to be calculated. If users want to calculate the bootstrapped s.e. then they should check 'Yes' and assign a number of bootstrap replications between 10 and 1000. A larger number of bootstrap replications will take longer to calculate but yields a more reliable estimate.
Item 5. Estimation of adjusted allele frequency: Users can check 'Yes' to calculate the adjusted allele frequencies or 'No' to omit the calculation.
Item 6. Single-point pooled DNA association test: Users can carry out association tests only for the analysis of a two-group study. Because the test statistic of association tests depends on experimental error, users must assign a proper value for the experimental standard error, σ_{E}, if an association test is conducted.
Item 7. Multipoint pooled DNA association test: Users can carry out association tests only for the analysis of a two-group study. If they check 'Yes', they must set seven options to conduct this test. The seven options are as follows. (1) Data type for the association test. Two types of data are acceptable: peak intensity data or raw p-values from previous single-point association tests. (2) Map information. Users can check 'Yes' to provide information on marker positions for the later graphical display of multipoint p-values or check 'No' to ignore the intermarker distances. (3) Weight function. Users can choose to assign equal weights to all marker loci by checking 'Equal weight' or provide a set of weights by checking 'User-specified weight'. (4) Threshold value of truncation. PDA provides a function to truncate insignificant p-values in the analysis. The value is between 0 and 1, and p-values greater than the threshold will be excluded from the analysis. (5) Number of Monte Carlo simulations. Users must provide a suitable number of simulations between 500 and 10000. A large number of simulations increases the accuracy of the empirical p-value estimation, but a longer computational time may be required. (6) Window size, defined as the number of markers in a window prior to p-value truncation. Users should specify a suitable number of markers in a window according to the attributes of their data. Window size must be ≥ 2, with the upper limit being the total number of SNPs in the study. (7) SWEPT statistics. PDA provides three statistics for multipoint association tests; i.e., multiplicative, additive and minimum p-value statistics.
The statistical theory is introduced in the next section.
Results
Methodology
We developed PDA based on a four-stage procedure, which combines the concept of a three-stage DNA pooling experiment [11] with the procedure of a novel multipoint association test, SWEPT. The functions make PDA useful for a complete analysis of pooled DNA data.
Firstly, PDA provides estimates for the CPA, which affects allele frequency estimation and association testing in a pooled DNA study. For a diallelic SNP with alleles A and a, the CPA represents the relative magnitude of the averaged amplified intensities of the two alleles and is defined mathematically as κ = μ_{ A }/μ_{ a }, where μ_{ A } and μ_{ a } are the average peak intensities of alleles A and a. The parameters can be estimated from heterozygous individuals, who provide a standard 50:50 ratio for the pair of peak intensities of the two alleles. When κ = 1, there is no preferential amplification; when κ > 1, the first allele is more likely to be amplified than the second; when κ < 1, the second allele is more likely to be amplified than the first. PDA provides three distinct estimates for the CPA: the arithmetic mean adjustment ${\widehat{\kappa}}_{\text{H}}$, the unbiased adjustment ${\widehat{\kappa}}_{\text{U}}$ and the geometric mean adjustment ${\widehat{\kappa}}_{\text{G}}$, along with the corresponding bootstrap standard errors [11]. Let n_{heter} denote the number of heterozygous individuals and let {${h}_{A}^{\text{I}}$(j), ${h}_{a}^{\text{I}}$(j), j = 1,...,n_{heter}} denote the pairs of peak intensities of heterozygous individuals derived from individual genotyping. The mathematical formulas of the three CPA estimators are presented as follows:
where ${\overline{h}}_{A}^{\text{I}}={n}_{\text{heter}}^{-1}\cdot {\displaystyle {\sum}_{j=1}^{{n}_{\text{heter}}}{h}_{A}^{\text{I}}\left(j\right)}$ and ${\overline{h}}_{a}^{\text{I}}={n}_{\text{heter}}^{-1}\cdot {\displaystyle {\sum}_{j=1}^{{n}_{\text{heter}}}{h}_{a}^{\text{I}}\left(j\right)}$. For each SNP, the estimated CPA will inform users of the magnitude of the difference in amplification between two alleles.
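As a rough illustration of how such estimates might be computed from heterozygote data, the following Python sketch returns three CPA estimates and their bootstrap standard errors. The exact estimator forms used here are assumptions inferred from the estimator names and from the definition of the mean intensities above: a mean of the individual ratios for ${\widehat{\kappa}}_{\text{H}}$, a ratio of the mean intensities for ${\widehat{\kappa}}_{\text{U}}$ and a geometric mean of the individual ratios for ${\widehat{\kappa}}_{\text{G}}$. The function names, the paired resampling scheme and the intensity values are hypothetical and are not taken from PDA.

```python
import math
import random
import statistics


def cpa_estimates(h_A, h_a):
    """Three CPA estimates from paired heterozygote peak intensities.

    Assumed forms (inferred from the estimator names, not confirmed here):
      kappa_H: arithmetic mean of the individual ratios h_A(j) / h_a(j)
      kappa_U: ratio of the mean intensities, mean(h_A) / mean(h_a)
      kappa_G: geometric mean of the individual ratios
    """
    if len(h_A) != len(h_a) or len(h_A) == 0:
        raise ValueError("need two equally long, non-empty intensity lists")
    ratios = [a / b for a, b in zip(h_A, h_a)]
    kappa_H = statistics.fmean(ratios)
    kappa_U = statistics.fmean(h_A) / statistics.fmean(h_a)
    kappa_G = math.exp(statistics.fmean([math.log(r) for r in ratios]))
    return kappa_H, kappa_U, kappa_G


def bootstrap_se(h_A, h_a, n_boot=200, seed=0):
    """Bootstrap s.e. of each estimate, resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(h_A)
    replicates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(
            cpa_estimates([h_A[i] for i in idx], [h_a[i] for i in idx])
        )
    # One column per estimator; its standard deviation is the bootstrap s.e.
    return tuple(statistics.stdev(column) for column in zip(*replicates))


# Hypothetical peak intensities for three heterozygous individuals.
h_A = [2.0, 4.0, 3.0]
h_a = [1.0, 1.0, 2.0]
kappa_H, kappa_U, kappa_G = cpa_estimates(h_A, h_a)
se_H, se_U, se_G = bootstrap_se(h_A, h_a, n_boot=200)
```

Each bootstrap replicate resamples whole individuals, so the pairing of the two intensities is preserved. The number of replicates plays the role of the value chosen in Item 4 of the interface.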
Secondly, PDA provides adjusted estimates for allele frequencies and the standard errors corresponding to the three different CPAs. Let $\widehat{\kappa}$ be the estimated CPA. The adjusted allele frequency of allele A is estimated by ${\widehat{p}}_{A}={h}_{A}/\left({h}_{A}+\widehat{\kappa}\times {h}_{a}\right)$, where h_{ A }and h_{ a }denote the peak intensity of alleles A and a in a DNA pool [12]. These analyses can be applied to studies of a single group or two groups, and the information will help users understand the genetic distribution of their groups.
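The adjustment formula above is simple enough to check by hand. The short Python sketch below applies it to the figures reported for SNP 639 in Example 1, where the unadjusted frequency is 0.806 and the unbiased CPA estimate is 2.320. The function name is hypothetical, and the two pool intensities are illustrative values chosen only so that their ratio reproduces the unadjusted frequency.

```python
def adjusted_allele_frequency(h_A, h_a, kappa=1.0):
    """Allele A frequency in a pool, p_A = h_A / (h_A + kappa * h_a).

    kappa = 1 gives the unadjusted estimate.
    """
    if h_A < 0 or h_a < 0 or kappa <= 0 or (h_A + h_a) == 0:
        raise ValueError("intensities must be non-negative and kappa positive")
    return h_A / (h_A + kappa * h_a)


# Illustrative pool intensities giving an unadjusted frequency of 0.806.
h_A_pool, h_a_pool = 0.806, 0.194
unadjusted = adjusted_allele_frequency(h_A_pool, h_a_pool)            # kappa = 1
adjusted = adjusted_allele_frequency(h_A_pool, h_a_pool, kappa=2.320)
print(round(unadjusted, 3), round(adjusted, 2))
```

With κ greater than 1, the intensity of allele A is inflated relative to allele a, so the adjustment lowers the estimated frequency of allele A, here from 0.806 to about 0.64.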
Thirdly, PDA provides a single-point association mapping of two groups (e.g., case-control studies or comparative studies of two groups). Let n_{G1} and n_{G2} be the numbers of individuals in groups G1 and G2; ${\widehat{\kappa}}_{\text{G}1}$ and ${\widehat{\kappa}}_{\text{G}2}$ are the estimated CPAs in groups G1 and G2; $D={\widehat{p}}_{A}^{\text{G}1}-{\widehat{p}}_{A}^{\text{G}2}$ denotes the difference of the estimated allele frequencies of allele A between two groups. The test statistic of single-point association mapping with adjustment for preferential amplification is $X={D}^{2}/\widehat{V}\left(D\right)$, where the estimated variance is
where $\widehat{V}\left({\widehat{\kappa}}_{\text{G1}}\right)$ and $\widehat{V}\left({\widehat{\kappa}}_{\text{G2}}\right)$ are the bootstrapped variances of the estimated CPAs in groups G1 and G2, and ${\widehat{\sigma}}_{\text{E}}$ is the experimental standard error, which can be estimated by calculating the root mean square error based on a hierarchical experimental design [18] or by calculating the square root of variance components based on the restricted maximum likelihood method [19]. The asymptotic distribution of the test statistic X is a chi-square distribution with one degree of freedom. This test reduces to the single-point association test proposed in [11] if the CPAs in the two groups are equal. The test statistic and p-value are calculated and used to identify important SNPs. Association studies that compare more than two groups can be further analyzed by combining pairwise analyses with multiple testing correction.
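The final step of the single-point test, converting the statistic X into a p-value under a chi-square distribution with one degree of freedom, can be sketched in Python as follows. The sketch takes the variance estimate as an input instead of computing it, and the variance value in the usage lines is purely hypothetical. The function names are illustrative and are not part of PDA.

```python
import math


def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variable with one degree of freedom.

    For Z standard normal, P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


def single_point_test(p_g1, p_g2, var_d):
    """Return (X, p-value) for X = D^2 / V(D), with D = p_g1 - p_g2."""
    if var_d <= 0:
        raise ValueError("the variance estimate must be positive")
    d = p_g1 - p_g2
    x = d * d / var_d
    return x, chi2_1df_pvalue(x)


# Hypothetical variance estimate; the two frequencies are illustrative inputs.
x_stat, p_value = single_point_test(0.93, 0.84, var_d=0.0015)

# Statistics reported in Example 2 for the unbiased adjustment.
for x in (5.54, 11.51, 2.95):
    print(x, round(chi2_1df_pvalue(x), 3))
```

For the statistics 5.54, 11.51 and 2.95 this conversion gives 0.019, 0.001 and 0.086, which agrees with the p-values reported for the unbiased adjustment in Example 2.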
Fourthly, PDA provides a multipoint association test. A sliding-window empirical p-value method is introduced into pooled DNA analysis. Define {v_{1},...,v_{ N }} to be a p-value vector of N SNPs from single-point association tests, with the SNPs ordered according to their genetic or physical map positions. Let k denote the size of a sliding window. The SWEPT statistics, based on multiplicative and additive models in the i th window with window size k, are represented as follows: for i = 1,...,N + 1 − k,
and ${Z}_{A}\left(i,k\right)={\displaystyle {\sum}_{j=i}^{i+k-1}{w}_{ij}\times {v}_{j}\times I[{v}_{j}<\mu ]},$
where μ is the threshold of the p-value truncation and I[A] is the usual indicator that takes the value of 1 if event A is true; otherwise, it takes the value of 0. The nonnegative w_{ ij } is a standardized weight of the p-value, v_{ j }, in the i th window (i.e., the weights in the window sum to one). The standardized weight is calculated by dividing the original weight by the sum of all original weights in the window. The multiplicative SWEPT statistic is a sliding-window extension of the truncated product method [20], and the additive SWEPT statistic is an extension of the test statistic in [21]. The third statistic is the minimum p-value in the window as follows:
${Z}_{\text{Min}}\left(i,k\right)={\mathrm{min}}_{j=i,\dots ,i+k-1}\left\{{v}_{j}\right\}$, i = 1,...,N + 1 − k.
The minimum SWEPT statistic extends the technique of taking the minimum score, which performs well in terms of test power and type I error and has been used broadly in genetic studies [22, 23].
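To make the window mechanics concrete, the following Python sketch computes three statistics for every sliding window. The additive and minimum statistics follow the definitions above. The multiplicative statistic is written as a weighted truncated product of p-values, which is an assumption based on the truncated product method and may differ in detail from the form used in PDA. The function name and the ten p-values are hypothetical.

```python
import math


def swept_statistics(pvalues, k, weights=None, mu=1.0):
    """Return one (z_mult, z_add, z_min) tuple per sliding window of size k."""
    n = len(pvalues)
    if not 2 <= k <= n:
        raise ValueError("window size must be between 2 and the number of SNPs")
    if any(not 0.0 < v <= 1.0 for v in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    if weights is None:
        weights = [1.0] * n  # equal weights
    results = []
    for i in range(n - k + 1):  # windows i = 1, ..., N + 1 - k
        v = pvalues[i:i + k]
        raw = weights[i:i + k]
        total = sum(raw)
        w = [x / total for x in raw]  # standardized: sums to one in the window
        keep = [1.0 if vj < mu else 0.0 for vj in v]  # indicator I[v_j < mu]
        z_add = sum(wj * vj * ij for wj, vj, ij in zip(w, v, keep))
        # Assumed multiplicative form: weighted truncated product of p-values.
        z_mult = math.exp(
            sum(wj * ij * math.log(vj) for wj, vj, ij in zip(w, v, keep))
        )
        z_min = min(v)
        results.append((z_mult, z_add, z_min))
    return results


# Hypothetical single-point p-values for ten ordered SNPs.
example_p = [0.02, 0.005, 0.01, 0.7, 0.02, 0.9, 0.2, 0.4, 0.4, 0.9]
windows = swept_statistics(example_p, k=5)
print(len(windows))  # N + 1 - k = 10 + 1 - 5 = 6 windows
```

With ten SNPs and a window size of 5 there are six windows. Lowering `mu` drops the larger p-values from the additive and multiplicative statistics while leaving the window boundaries unchanged.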
There are other efficient p-value combinations, such as the rank truncated product method [24], which may be considered in PDA in the future. Extension of these methods using sliding windows will help screen important genetic markers in large-scale chromosome-wide pooled DNA association studies. By default, PDA performs multipoint analysis by using p-value data obtained from the proposed single-point association test; however, PDA also provides options for the use of p-value data yielded from other single-point methods.
To assess the statistical significance of the SWEPT in each window, PDA applies a Monte Carlo procedure recommended in [20] to calculate an empirical p-value. The procedure generates the correlated p-value vector V with a correlation matrix ∑ from an independent p-value vector V_{0}, based on the following correlation-invariant transformation
V = 1 − Φ(C^{−1}Φ^{−1}(1 − V_{0})),
where Φ(.) is the cumulative distribution function of a standard normal random variable and C is a lower triangular matrix satisfying the Cholesky decomposition, ∑ = CC^{T}. We estimated the correlation matrix ∑ using an autocorrelation function of p-values. We recalculated the SWEPT statistics based on the generated p-value vector, V. This procedure was repeated B times to yield {Z^{(b)}(i, k), b = 1,...,B}. Hence, the empirical p-value of the i th window with window size k can be calculated as follows:
where Z*(i,k) is the corresponding SWEPT value based on real data. The SWEPT offers several advantages over conventional DNA pooling analyses. (1) SWEPT can work well even in cases where only p-value data are available; hence, it can analyze data from different study designs and is applicable to meta-analysis. Because SWEPT allows a p-value truncation, it also handles data containing unpublished insignificant p-values. (2) The SWEPT statistics make adjustments for preferential amplification, a critical aspect that has never been considered before in pooled DNA multipoint analyses. (3) The simplicity of the SWEPT statistics lowers processing time and significantly reduces the computational complexity. (4) The SNPs involved in multipoint analyses can be determined conveniently once the window size has been determined, thereby avoiding the common perplexity of selecting SNPs in haplotype-oriented or other multipoint analyses. (5) SWEPT is comprehensive in that it covers conventional single-point test statistics and can be applied to the analysis of individual genotyping data, although this aspect is not the primary concern of PDA.
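The Monte Carlo step can be sketched in Python as shown below, under several stated assumptions. First, the sketch applies the Cholesky factor C directly to the normal scores, since that is the operation that gives independent scores the covariance CC^{T} = ∑; this reading of the transformation is an assumption. Second, it uses only the minimum p-value statistic and counts a simulated window as at least as extreme when its minimum is no larger than the observed one, which is also an assumption about the form of the empirical p-value. Third, the correlation matrix is supplied by the caller; the first-order autoregressive matrix in the example is a hypothetical stand-in for an estimate obtained from an autocorrelation function. All function names are illustrative.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cholesky(sigma):
    """Lower triangular C with sigma = C C^T (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(r + 1):
            acc = sigma[r][s] - sum(c[r][t] * c[s][t] for t in range(s))
            c[r][s] = math.sqrt(acc) if r == s else acc / c[s][s]
    return c


def correlated_pvalues(c, rng):
    """Draw one p-value vector whose normal scores have covariance C C^T."""
    n = len(c)
    v0 = []
    while len(v0) < n:  # independent uniforms on the open interval (0, 1)
        u = rng.random()
        if u > 0.0:
            v0.append(u)
    z0 = [STD_NORMAL.inv_cdf(1.0 - u) for u in v0]
    z = [sum(c[r][s] * z0[s] for s in range(r + 1)) for r in range(n)]
    return [1.0 - STD_NORMAL.cdf(zr) for zr in z]


def empirical_pvalues_min(pvalues, k, sigma, n_sim=1000, seed=0):
    """Empirical p-value of the minimum-p statistic in each window of size k."""
    rng = random.Random(seed)
    c = cholesky(sigma)
    n = len(pvalues)
    observed = [min(pvalues[i:i + k]) for i in range(n - k + 1)]
    hits = [0] * len(observed)
    for _ in range(n_sim):
        v = correlated_pvalues(c, rng)
        for i, z_obs in enumerate(observed):
            if min(v[i:i + k]) <= z_obs:  # simulated value at least as extreme
                hits[i] += 1
    return [h / n_sim for h in hits]


def ar1_correlation(n, rho):
    """Hypothetical correlation matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Hypothetical single-point p-values for ten ordered SNPs.
example_p = [0.02, 0.005, 0.01, 0.7, 0.02, 0.9, 0.2, 0.4, 0.4, 0.9]
emp = empirical_pvalues_min(example_p, k=5, sigma=ar1_correlation(10, 0.3),
                            n_sim=2000)
```

The number of simulations corresponds to option (5) of Item 7; larger values reduce the Monte Carlo error of each empirical p-value at the cost of running time.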
Real data analysis
We give four examples to illustrate the functions of PDA: (1) One-group allele frequency estimation. (2) Two-group single-point DNA pooling studies. (3) Two-group multipoint association test based on peak intensity data. (4) Two-group multipoint analysis based on p-value data. Throughout this paper, we set the working directory to 'C:\Program Files\MATLAB71\PDA'. All input data files for these four examples are distributed with PDA and saved in the example directory, 'C:\Program Files\MATLAB71\PDA\Example'.
Example 1: one-group single-point analysis
We used the six-SNP data set published in our previous paper [11] to illustrate the one-group analysis, the purpose being to estimate allele frequency. The operation procedures are illustrated in Appendix F (See Additional File 6).
Table 1 and Table 2 present the results from PDA for the six SNPs. Table 1 shows the estimated results for CPA. The 1^{st} column shows the SNP number. The 2^{nd} column shows the SNP name. The 3^{rd} column shows the number of heterozygous individuals. Three adjustments (${\widehat{\kappa}}_{\text{H}}$, ${\widehat{\kappa}}_{\text{U}}$, ${\widehat{\kappa}}_{\text{G}}$) are shown along with the corresponding s.e. For example, for the 6^{th} SNP with SNP name 639, there are 36 heterozygous individuals used to calculate the CPA adjustment. The arithmetic mean adjustment is 2.288, with s.e. 0.038; the unbiased adjustment is 2.320, with s.e. 0.006; the geometric mean adjustment is 2.265, with s.e. 0.009.
In Table 2, PDA provides the allele frequency estimates. The 1^{st} column shows the SNP number. The 2^{nd} column shows the SNP name. The 3^{rd} panel shows the unadjusted allele frequencies and the corresponding s.e. The 4^{th}, 5^{th} and 6^{th} panels show the allele frequency estimates based on the three adjustments (${\widehat{\kappa}}_{\text{H}}$, ${\widehat{\kappa}}_{\text{U}}$, ${\widehat{\kappa}}_{\text{G}}$) along with their respective s.e. values. For example, the unadjusted allele frequency of the 1^{st} allele of SNP 639 is 0.806 (the allele frequency of the 2^{nd} allele is 0.194), and the s.e. is 0.051. After applying CPA adjustments, the adjusted allele frequency estimate is about 0.64 and the s.e. is 0.06. The three different adjustments yield similar results. In this example, the allele frequency is seriously overestimated if the CPA adjustment is ignored.
Example 2: two-group single-point analysis
In this example, we analyze the data set from our previous project that compared the allele distributions of three main Taiwan subgroups in the human major histocompatibility complex (MHC) region. We selected two subgroups (Hakka and Han groups) and 4 SNPs for the illustrations. The operation procedures are illustrated in Appendix F (See Additional File 6).
The results are shown in Tables 3, 4 and 5. Table 3 shows the CPA estimates along with the s.e. values for these four SNPs. The unbiased CPA estimates are 1.68, 1.60, 1.39 and 1.77, and the corresponding s.e. values are 0.013, 0.010, 0.009 and 0.007.
Table 4 shows the allele frequency estimates along with s.e. Based on the unbiased adjustment of CPA, the allele frequency estimates (s.e. values) of SNPs 6260, 6267, 6272 and 6415 in the Hakka group are 0.93 (0.013), 0.94 (0.012), 0.60 (0.024) and 0.13 (0.016), respectively. The allele frequency estimates (s.e. values) of SNPs in the Han group are 0.84 (0.018), 0.82 (0.019), 0.62 (0.024) and 0.19 (0.019), respectively.
In Table 5, PDA conducted association tests using the four SNPs to compare the allele distributions between the Hakka and Han groups. Firstly, the association test without CPA adjustment was conducted. The chi-square statistic and the corresponding p-value were calculated for each SNP. Secondly, the modified association statistics X based on the three different CPA adjustments were calculated. The s.e. of experimental error was set to 0.02 according to our previous study [8]. For example, the association test based on the unbiased adjustment yields chi-square statistics of 5.54, 11.51, 0.23 and 2.95 and p-values of 0.019, 0.001, 0.634 and 0.086, respectively. The conclusions from the unadjusted association test and the adjusted association test are quite different.
In our previous project, these four SNPs were also genotyped individually, and the allele-based association test based on individual genotyping data yielded exact p-values of 0.00795, 0.00006, 0.52346 and 0.23972, respectively, for these four SNPs. The conclusions are consistent with the results from the adjusted association tests and demonstrate the importance of CPA adjustment.
Example 3: two-group multipoint analysis based on peak intensity data
In this example, we illustrate a multipoint analysis, an important utility of PDA. We analyzed 10 SNPs from our MHC study to screen for potential candidate regions that could distinguish Hakka and Han groups. The operation procedures are illustrated in Appendix F (See Additional File 6).
The results are shown in Tables 6, 7, 8 and 9. Table 6 shows the CPA estimates along with s.e. values for the ten SNPs. Table 7 shows the allele frequency estimates along with s.e. values. Table 8 shows the single-point pooled DNA association tests comparing the allele distributions between the Hakka and Han groups. The results show that only SNP 6419 is significant; the p-value is 0.019 for the statistic without CPA adjustment, whereas it is 0.025 after CPA adjustment.
Table 9 shows the multipoint pooled DNA association tests. The output first describes the input information of the analysis. In this example, peak intensity data, map information and equal weights were used in the analysis, and the p-values were not truncated. We carried out 10000 Monte Carlo simulations to calculate the empirical p-value. The size of each window was 5, and the multiplicative p-value statistic was used. Using these settings, multipoint tests based on different CPAs were conducted. The results are also presented in Figure 2, where the p-values were transformed by taking the negative base-10 logarithm. For example, based on the unbiased adjustment of CPA, the p-values for the six sliding windows (with window size 5) are 0.047, 0.84, 0.229, 0.718, 0.629 and 0.874.
In our previous project, these ten SNPs were also genotyped individually, and the allele-based association test based on individual genotyping data yielded exact p-values for these ten SNPs: 0.0216, 0.0052, 0.0115, 0.6859, 0.0232, 0.9440, 0.1628, 0.4468, 0.4082 and 0.9443. However, the previous single-point pooled DNA test only identified SNP 6419. In this case, the important SNPs, 6421 and 6422, were not identified by the single-point association tests; however, the two SNPs are included in the region from SNPs 6421 to 6424, which was identified by a multipoint analysis based on a sliding window with size 5.
Example 4: two-group multipoint analysis based on p-value data
In this example, we illustrate the implementation of p-value analysis using PDA. To conduct multipoint association tests, we used the same 10 SNPs as in Example 3, based on the p-values derived from a single-point pooled DNA association test with unbiased adjustment of CPA. The operation procedures are illustrated in Appendix F (See Additional File 6).
Because only the p-value of each SNP was input, the procedures for the CPA estimate, allele frequency estimate and single-point association test cannot be carried out in this analysis. Only multipoint association tests can be conducted.
First, PDA shows the input information for the analysis in this example, as follows: p-value data were used; no map information was provided; user-specified weights were used; the threshold value of truncation was 1; the number of Monte Carlo simulations was 10000; the size of each window was 5; the SWEPT statistic was calculated using the additive model. The results are summarized in Table 10 and are presented in Figure 3. Table 10 shows the SWEPT statistics and p-values for the six regions, each of which contains five SNPs. Because the same SNP data were used in Examples 3 and 4, it is not surprising that the results are similar to those in Example 3.
Discussion
CPA estimation is based on peak intensity data of heterozygous individuals. Occasionally, data of heterozygous individuals from a pilot study may not be available. Publicly accessible CPA databases for SNPs provide important information [25, 26]. PDA allows for allele frequency estimation and association testing by directly inputting CPA values of SNPs of interest. This function enables PDA to analyze large numbers of SNPs from public databases in pooled DNA analysis.
PDA provides an extended single-point association test allowing for different CPAs between the two groups being compared. This test reduces to the conventional test in [11] if equal CPAs in the two groups are assumed. If typing of case and control DNA pools is performed at the same time under the same experimental conditions, then the reduced test should be applied. However, if the DNA pools of case and control groups are typed at different times or in different environments, e.g., in a meta-analysis or a sequential analysis, then the extended test should be performed.
Haplotype-scoring [27] and locus-scoring approaches [28] are the two main categories of association tests for disease gene mapping; however, it is currently unclear which method is superior when analyzing individual genotyping data. We are the first to introduce a locus-scoring approach to analyze pooled DNA data. The SWEPT method considered in PDA is a locus-scoring approach, which does not require inference of phase-unknown haplotypes; hence the locus-scoring approach has several advantages, among which is the reduction of computational burden. Until there is a breakthrough in the economic efficiency of haplotyping, the locus-scoring approach is preferred to the haplotype-scoring approach when performing pooled DNA analyses.
Weights for different SNPs in each window may affect the significance of a multipoint association test. If there is no prior knowledge in this regard, then equal weights can be employed. The other strategy is to consider weights according to genetic/physical or linkage disequilibrium maps of SNPs [29]. Using information of haplotype maps to improve the estimation of allele frequency difference at each single locus for association mapping has been considered in [30]. In our method, a SNP should be assigned a higher weight if the SNP marker is closer to the anchor in the center of a window. Anchors scan over the chromosome region of interest simultaneously when sliding windows move from the start to the end of all SNPs.
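One way to realize anchor-based weighting is sketched below in Python. The text above does not specify the weight function, so the inverse-distance form used here is purely an assumption, as are the function name, the choice of the central marker as the anchor and the marker positions.

```python
def anchor_weights(positions):
    """Standardized weights for one window, larger for markers near the anchor.

    The anchor is taken to be the central marker of the window, and the raw
    weight 1 / (1 + distance) is a hypothetical choice.
    """
    if len(positions) < 2:
        raise ValueError("a window needs at least two markers")
    anchor = positions[len(positions) // 2]
    raw = [1.0 / (1.0 + abs(x - anchor)) for x in positions]
    total = sum(raw)
    return [r / total for r in raw]  # weights in the window sum to one


# Hypothetical marker positions (arbitrary map units) in a window of size 5.
evenly_spaced = anchor_weights([0.0, 1.0, 2.0, 3.0, 4.0])
unevenly_spaced = anchor_weights([10.0, 10.5, 12.0, 15.0, 15.2])
```

Because the weights are standardized within each window, only their relative sizes matter, and recomputing them as the window slides keeps the anchor at the window center.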
The sliding-window procedure emphasizes a local effect: it assumes that the neighboring SNPs provide sufficient information for the window of interest and that SNPs outside the window do not affect the inference for the window once the SNPs within the window have been considered. Only a small proportion of the SNPs is considered each time, making the sliding-window approach a convenient and practical procedure for chromosome-wide studies once the window size is determined. A sliding-window size of 5 for the selection of genetic markers for association tests with individual genotyping data was suggested in [31], but the authors warned that this value might not be suitable in certain situations. We suggest that the genetic background of the studied region be considered and that several window sizes around 5 be analyzed to yield reliable results.
Conclusion
PDA provides simultaneous analyses of the CPA adjustment, adjusted allele frequency estimate and single/multipoint DNA pooling association tests that are usually essential for complete DNA pooling studies. All of the PDA functions are illustrated in the four bona fide examples contained in the program. PDA is simple to operate and does not require that users have a strong statistical background.
Availability and requirements
PDA software can be downloaded from the web site: http://www.ibms.sinica.edu.tw/%7Ecsjfann/first%20flow/pda.htm.
Project name: DNA pooling project
Project home page: http://www.ibms.sinica.edu.tw/%7Ecsjfann/first%20flow/pda.htm
Operating system: MS Windows^{®}
Programming language: MATLAB^{®}
Other requirements: No
License: PDA license
Any restrictions to use by nonacademics: On request and citation
Abbreviations
PDA: Pooled DNA analyzer
CPA: Coefficient of preferential amplification
SWEPT: Sliding-window empirical p-value test
References
 1.
Hirschhorn JN, Daly MJ: Genomewide association studies for common diseases and complex traits. Nat Rev Genet 2005, 6: 95–108. 10.1038/nrg1521
 2.
Wang WYS, Barratt BJ, Clayton DG, Todd JA: Genomewide association studies: The theoretical and practical concerns. Nat Rev 2005, 6: 109–118. 10.1038/nrg1522
 3.
Arnheim N, Strange C, Erlich H: Use of pooled DNA samples to detect linkage disequilibrium of polymorphic restriction fragments and human disease: studies of the HLA class II loci. Proc Natl Acad Sci USA 1985, 82: 6970–6974. 10.1073/pnas.82.20.6970
 4.
Mohlke KL, Erdos MR, Scott LJ, Fingerlin TE, Jackson AU, Silander K, Hollstein P, Boehnke M, Collins FS: Highthroughput screening for evidence of association by using mass spectrometry genotyping on DNA pools. Proc Natl Acad Sci USA 2002, 99: 16928–16933. 10.1073/pnas.262661399
 5.
Herbon N, Werner M, Braig C, Gohlke H, Dütsch G, Illig T, Altmüller J, Hampe J, Lantermann A, Schreiber S, Bonifacio E, Ziegler A, Schwab S, Wildenauer D, van den Boom D, Braun A, Knapp M, Reitmeir P, Wjst M: Highresolution SNP scan of chromosome 6p21 in pooled samples from patients with complex diseases. Genomics 2003, 81: 510–518. 10.1016/S08887543(02)000356
 6.
Buetow KH, Edmonson M, MacDonald R, Clifford R, Yip P, Kelley J, Little DP, Strausberg R, Koester H, Cantor CR, Braun A: Highthroughput development and characterization of a genomewide collection of genebased single nucleotide polymorphism markers by chipbased matrixassisted laser desorption/ionization timeofflight mass spectrometry. Proc Natl Acad Sci USA 2001, 98: 581–584. 10.1073/pnas.021506298
 7.
Nelson MR, Marnellos G, Kammerer S, Hoyal CR, Shi MM, Cantor CR, Braun A: Large-scale validation of single nucleotide polymorphisms in gene regions. Genome Res 2004, 14: 1664–1668. 10.1101/gr.2421604
 8.
Yang HC, Lin CH, Hung SI, Fann CSJ: Polymorphism validation using DNA pools prior to conducting large-scale genetic studies. Ann Hum Genet, in press.
 9.
Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA pooling: A tool for large-scale association studies. Nat Rev Genet 2002, 3: 862–871. 10.1038/nrg930
 10.
Yang HC, Fann CSJ: Association mapping using pooled DNA. In Linkage Disequilibrium and Association Mapping. Edited by: Collins A. New Jersey: The Humana Press Inc; 2006.
 11.
Yang HC, Pan CC, Lu RCY, Fann CSJ: New adjustment factors and sample size calculation in a DNA-pooling experiment with preferential amplification. Genetics 2005, 169: 399–410. 10.1534/genetics.104.032052
 12.
Hoogendoorn B, Norton N, Kirov G, Williams N, Hamshere ML, Spurlock G, Austin J, Stephens MK, Buckland PR, Owen MJ, O'Donovan MC: Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Genet 2000, 107: 488–493. 10.1007/s004390000397
 13.
Visscher PM, Le Hellard S: Simple method to analyze SNP-based association studies using DNA pools. Genet Epidemiol 2003, 24: 291–296. 10.1002/gepi.10240
 14.
Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, Kamatani N: Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet 2003, 72: 384–398. 10.1086/346116
 15.
Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol 2003, 24: 74–82. 10.1002/gepi.10195
 16.
Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc Natl Acad Sci USA 2003, 100: 7225–7230. 10.1073/pnas.1237858100
 17.
Zeng D, Lin DY: Estimating haplotype-disease associations with pooled genotype data. Genet Epidemiol 2005, 28: 70–82. 10.1002/gepi.20040
 18.
Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann Hum Genet 2002, 66: 393–405. 10.1046/j.1469-1809.2002.00125.x
 19.
Downes K, Barratt BJ, Akan P, Bumpstead SJ, Taylor SD, Clayton DG, Deloukas P: SNP allele frequency estimation in DNA pools and variance components analysis. Biotechniques 2004, 36: 840–845.
 20.
Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining p-values. Genet Epidemiol 2002, 22: 170–185. 10.1002/gepi.0042
 21.
Edgington ES: An additive model for combining probability values from independent experiments. J Psychol 1972, 80: 351–363.
 22.
Zheng G: Use of max and min scores for trend tests for association when the genetic model is unknown. Stat Med 2003, 22: 2657–2666. 10.1002/sim.1474
 23.
Yu K, Gu CC, Province M, Xiong CJ, Rao DC: Genetic association mapping under founder heterogeneity via weighted haplotype similarity analysis in candidate genes. Genet Epidemiol 2004, 27: 182–191. 10.1002/gepi.20022
 24.
Dudbridge F, Koeleman BPC: Rank truncated product of p-values, with application to genome-wide association scans. Genet Epidemiol 2003, 25: 360–366. 10.1002/gepi.10264
 25.
Simpson CL, Knight J, Butcher LM, Hansen VK, Meaburn E, Schalkwyk LC, Craig IW, Powell JF, Sham PC, Al-Chalabi A: A central resource for accurate allele frequency estimation from pooled DNA genotyped on DNA microarrays. Nucleic Acids Res 2005, 33: e25. 10.1093/nar/gni028
 26.
The Database of Coefficient of Preferential Amplification/Hybridization [http://www.ibms.sinica.edu.tw/%7Ecsjfann/first%20flow/database.htm]
 27.
Morris RW, Kaplan NL: On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genet Epidemiol 2002, 23: 221–233. 10.1002/gepi.10200
 28.
Seaman SR, Müller-Myhsok B: Rapid simulation of p values for product methods and multiple-testing adjustment in association studies. Am J Hum Genet 2005, 76: 399–408. 10.1086/428140
 29.
Yang HC, Lin CY, Fann CSJ: A unified multilocus association test [abstract]. Am J Hum Genet 2005, 77: s2393.
 30.
Hinds DA, Seymour AB, Durham LK, Banerjee P, Ballinger DG, Milos PM, Cox DR, Thompson JF, Frazer KA: Application of pooled genotyping to scan candidate regions for association with HDL cholesterol levels. Human Genomics 2004, 1: 421–434.
 31.
Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG: Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 2003, 73: 115–130. 10.1086/376561
Acknowledgements
We are grateful to Mei-Chu Huang and Yu-Jen Liang for testing the prototype of the PDA software. We thank the three anonymous reviewers for their insightful comments, which have improved the presentation of our manuscript. This research was supported in part by grants NSC 932320B0010.26 and Academia Sinica 91IBMS2PPC of Taiwan.
Authors' contributions
HCY conceived the statistical methods and experimental designs and prepared the manuscript. CCP programmed the software. CYL and CSJF contributed to the discussion and preparation of the manuscript. All authors have approved the final manuscript.
Rights and permissions
Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yang, HC., Pan, CC., Lin, CY. et al. PDA: Pooled DNA analyzer. BMC Bioinformatics 7, 233 (2006). https://doi.org/10.1186/1471-2105-7-233
Keywords
 Association Test
 Heterozygous Individual
 Allele Frequency Estimate
 Multipoint Analysis
 Preferential Amplification