PDA: Pooled DNA analyzer

Background: Association mapping using abundant single nucleotide polymorphisms is a powerful tool for identifying disease susceptibility genes for complex traits and exploring possible genetic diversity. Genotyping large numbers of SNPs individually is performed routinely but is cost prohibitive for large-scale genetic studies. DNA pooling is a reliable and cost-saving alternative genotyping method. However, no software has been developed for complete pooled-DNA analyses, including data standardization, allele frequency estimation, and single/multipoint DNA pooling association tests. This motivated the development of the software 'PDA' (Pooled DNA Analyzer) to analyze pooled DNA data.
Results: We developed the software PDA for the analysis of pooled-DNA data. PDA was originally implemented in the MATLAB® language, but it can also be executed on a Windows system without installing MATLAB®. PDA provides estimates of the coefficient of preferential amplification and allele frequency. PDA considers an extended single-point association test, which can compare allele frequencies between two DNA pools constructed under different experimental conditions. Moreover, PDA also provides novel chromosome-wide multipoint association tests based on p-value combinations and a sliding-window concept. This new multipoint testing procedure overcomes a computational bottleneck of conventional haplotype-oriented multipoint methods in DNA pooling analyses and can handle data sets having a large pool size and/or large numbers of polymorphic markers. All of the PDA functions are illustrated in four bona fide examples.
Conclusion: PDA is simple to operate and does not require that users have a strong statistical background. The software is available at .


Background
The millions of single nucleotide polymorphisms (SNPs) now available are ideal for association analyses that identify important genetic variants in populations as well as genes that predispose individuals to diseases involving complex traits [1,2]. Although the cost of individual genotyping has been reduced drastically over the years, the use of DNA pooling has reduced the cost even further, especially for large-scale studies. The first DNA pooling study was performed to identify the association between HLA class II loci and disease genes predisposing to type 1 diabetes [3]. DNA pooling was later used to estimate the allele frequency of short tandem repeats and SNPs, map disease susceptibility genes [4,5], and identify polymorphisms [6][7][8]. A comprehensive review of the history of DNA pooling, the methods and algorithms involved, and their applications can be found in [9] and [10]. DNA pooling is highly efficient. Many researchers have investigated the performance of DNA pools in estimating allele frequency and have measured the impact of pooling on association test results. The results show that allele frequencies can be estimated accurately and precisely using DNA pools after accounting for the coefficient of preferential amplification (CPA) [11,12]; moreover, the test power is high and the false-positive rate is well controlled [11,13]. These promising results suggest that DNA pooling studies are reliable and cost-saving relative to individual genotyping studies. This motivated the development of the software, Pooled DNA Analyzer (PDA), to analyze pooled DNA data.
Figure 1. Interface of PDA.
Although many single-point pooled DNA association tests have been developed, multipoint analysis still presents a challenge because of the large number of genotypic combinations in DNA pools. The difficulty increases substantially with the pool size and/or the number of SNPs involved. Several recently proposed multipoint estimation and testing methods are haplotype oriented [14][15][16][17]; however, all such methods require a small pool size and a small number of SNPs to keep the computational complexity and running time manageable. To address these computational challenges, PDA provides the sliding-window empirical p-value test (SWEPT), which has advantages with respect to statistical computation, data implementation and practical application. The SWEPT method is particularly applicable when the analysis involves a large amount of data, and it overcomes the computational bottleneck of conventional haplotype-oriented multipoint methods in DNA pooling analyses.

Implementation
PDA was developed on the MATLAB® software platform for Windows systems (MS Windows® 98/ME and MS Windows® NT/2000/XP/2003). For MATLAB® users, PDA can be run with a user-friendly graphical interface in which users merely click checkboxes to carry out data analysis. The PDA user interface is shown in Figure 1. For those who have no access to or little knowledge of the MATLAB® system, we used the MATLAB® compiler to generate standalone executables of PDA, which can be deployed on machines without installing MATLAB®. The guide to the installation and initialization of PDA on Windows is given in Appendix A (See Additional File 1). The working directories for PDA are described in Appendix B (See Additional File 2). PDA's input and output data formats are explained in Appendices C and D (See Additional files 3 and 4), respectively. Finally, the compiled version of PDA is demonstrated in Appendix E (See Additional File 5).

Interface of PDA, item functions and operation procedures
There are seven main items in the PDA menu: input/output directory, number of groups studied, data type for CPA estimation, bootstrapped standard error (s.e.) of CPA estimates, allele frequency estimates, single-point pooled DNA association test and multipoint pooled DNA association test.
Item 1. Input/Output directory: The directories of input and output files must be specified. PDA reads data from the assigned input directory and automatically saves outputs in the output directory. The formats of input and output are illustrated in Appendices C and D (See Additional files 3 and 4).
Item 2. Number of groups studied: PDA can analyze one-group or two-group DNA pooling data. For one-group studies, users can estimate the CPA and calculate the adjusted allele frequency by checking the box 'One group'. For two-group studies (e.g., case-control studies), users check the box 'Two groups' and decide whether to carry out association tests after calculating estimates for the CPA and allele frequency. PDA offers a choice between equal and unequal CPA statistical inference. Check 'Yes' for equal CPA inference or 'No' for unequal CPA inference.
Item 3. Data type for CPA estimation: Two types of data are acceptable. The first type is peak intensity data from genotyping experiments. The second type is raw CPA/heterozygote ratios from empirical studies or databases. If peak intensity data are input, then users should provide the number of pairs of peak intensities for each locus.
PDA provides a function to truncate insignificant p-values in the analysis. The threshold is a value between 0 and 1, and p-values greater than the threshold are excluded from the analysis. (5) Number of Monte Carlo simulations. Users must provide a suitable number of simulations between 500 and 10000. A larger number of simulations increases the accuracy of the empirical p-value estimation but may require a longer computational time. (6) Window size, defined as the number of markers in a window prior to p-value truncation. Users should specify a suitable number of markers in a window according to the attributes of their data. The window size must be at least 2, with the upper limit being the total number of SNPs in the study. (7) SWEPT statistics. PDA provides three statistics for multipoint association tests: multiplicative, additive and minimum p-value statistics.
Figure 2. The transformed p-values of multiplicative SWEPT statistic based on different CPAs by using peak intensity data in Example 3.
Figure 3. The transformed p-values of multiplicative SWEPT statistic using p-value data in Example 4.
The statistical theory is introduced in the next section.
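The ranges stated for the multipoint options can be summarized as a small validation sketch. The class and field names below are hypothetical and are not part of PDA; the window-size constraint is read here as "at least 2 and at most the total number of SNPs", and the bounds on the truncation threshold are assumed to be inclusive.

```python
from dataclasses import dataclass

STATISTICS = ("multiplicative", "additive", "minimum")


@dataclass
class SweptSettings:
    """Hypothetical container for the multipoint options; not PDA's own format."""
    n_snps: int           # total number of SNPs in the study
    window_size: int      # markers per window, before p-value truncation
    n_simulations: int    # number of Monte Carlo simulations
    truncation: float     # p-values above this threshold are excluded
    statistic: str        # one of STATISTICS

    def validate(self) -> None:
        """Raise ValueError if a setting falls outside the documented range."""
        if not 2 <= self.window_size <= self.n_snps:
            raise ValueError("window size must lie between 2 and the number of SNPs")
        if not 500 <= self.n_simulations <= 10000:
            raise ValueError("number of simulations must lie between 500 and 10000")
        if not 0.0 <= self.truncation <= 1.0:
            raise ValueError("truncation threshold must lie between 0 and 1")
        if self.statistic not in STATISTICS:
            raise ValueError(f"statistic must be one of {STATISTICS}")


# Settings matching Example 3: 10 SNPs, window size 5, 10000 simulations,
# no truncation (threshold 1), multiplicative statistic.
example3 = SweptSettings(10, 5, 10000, 1.0, "multiplicative")
example3.validate()
```

A threshold of 1 keeps every p-value, which corresponds to running the analysis without truncation.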

Methodology
We developed PDA based on a four-stage procedure, which combines the concept of a three-stage DNA pooling experiment [11] with the procedure of a novel multipoint association test, SWEPT. The functions make PDA useful for a complete analysis of pooled DNA data.
Figure 4. Interface of the execution of PDA on machines without MATLAB® installed.
Firstly, PDA provides estimates for the CPA, which affects allele frequency estimation and association testing in a pooled DNA study. For a diallelic SNP with alleles A and a, the CPA represents the relative magnitude of the average amplified intensities of the two alleles and is defined mathematically as $\kappa = \mu_A/\mu_a$, where $\mu_A$ and $\mu_a$ are the average peak intensities of alleles A and a. These parameters can be estimated from heterozygous individuals, who provide a standard 50:50 ratio for the pair of peak intensities of the two alleles. When $\kappa = 1$, there is no preferential amplification; when $\kappa > 1$, the first allele is more likely to be amplified than the second; when $\kappa < 1$, the second allele is more likely to be amplified than the first. PDA provides three distinct estimates for the CPA: the arithmetic mean adjustment, the unbiased adjustment and the geometric mean adjustment, along with the corresponding bootstrap standard errors [11]. Let $n_{heter}$ denote the number of heterozygous individuals. The three CPA estimators are computed from the $n_{heter}$ pairs of peak intensities (one pair for each heterozygous individual $j$, $j = 1, \ldots, n_{heter}$) derived from individual genotyping. For each SNP, the estimated CPA informs users of the magnitude of the difference in amplification between the two alleles.
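The idea of estimating the CPA from heterozygotes and attaching a bootstrap standard error can be sketched in a few lines of Python. This is a minimal illustration, not PDA's code: the three estimator forms used here (mean of the per-individual ratios, ratio of the mean intensities, and geometric mean of the ratios) are common choices assumed for illustration, and no claim is made about which of them corresponds to each of the three adjustments in [11]. The function names and the peak-intensity values are made up.

```python
import math
import random
import statistics


def cpa_estimates(pairs):
    """Return three illustrative CPA estimates from heterozygote peak pairs.

    pairs: list of (h_A, h_a) tuples, one per heterozygous individual.
    """
    ratios = [h_A / h_a for h_A, h_a in pairs]
    mean_A = statistics.fmean([h_A for h_A, _ in pairs])
    mean_a = statistics.fmean([h_a for _, h_a in pairs])
    return {
        "mean_of_ratios": statistics.fmean(ratios),
        "ratio_of_means": mean_A / mean_a,
        "geometric_mean": math.exp(statistics.fmean([math.log(r) for r in ratios])),
    }


def bootstrap_se(pairs, n_boot=1000, seed=0):
    """Bootstrap standard error of each estimate, resampling individuals."""
    rng = random.Random(seed)
    draws = {name: [] for name in cpa_estimates(pairs)}
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        for name, value in cpa_estimates(resample).items():
            draws[name].append(value)
    return {name: statistics.stdev(values) for name, values in draws.items()}


# Made-up peak intensities for four heterozygous individuals.
pairs = [(1.20, 1.00), (1.05, 0.95), (1.30, 1.10), (0.98, 0.90)]
estimates = cpa_estimates(pairs)
errors = bootstrap_se(pairs)
```

Resampling whole individuals, rather than the two intensities separately, keeps each pair of peaks together, which matters because the two peaks of one individual are measured in the same reaction.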
Secondly, PDA provides adjusted estimates for allele frequencies and the standard errors corresponding to the three different CPAs. Let $\hat{\kappa}$ be the estimated CPA. The adjusted allele frequency of allele A is estimated by $\hat{p}_A = h_A/(h_A + \hat{\kappa} h_a)$, where $h_A$ and $h_a$ denote the peak intensities of alleles A and a in a DNA pool [12]. These analyses can be applied to studies of a single group or two groups, and the information helps users understand the genetic distribution of their groups.
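A short sketch of the CPA-adjusted allele frequency follows. It assumes that each pooled peak intensity is proportional to the allele's frequency times its average per-allele intensity, so that with $\kappa = \mu_A/\mu_a$ the frequency of allele A is $h_A/(h_A + \kappa h_a)$. The function name is hypothetical and the numbers are made up.

```python
def adjusted_allele_frequency(h_A, h_a, kappa=1.0):
    """Frequency of allele A in a pool, corrected for preferential amplification.

    h_A, h_a: peak intensities of alleles A and a in the pool.
    kappa: estimated CPA (mu_A / mu_a); kappa = 1 means no correction.
    """
    if h_A < 0 or h_a < 0 or kappa <= 0:
        raise ValueError("intensities must be non-negative and kappa positive")
    return h_A / (h_A + kappa * h_a)


# A pool with true frequency 0.3 and kappa = 2 gives peaks proportional to
# 2 * 0.3 = 0.6 (allele A) and 1 * 0.7 = 0.7 (allele a).
unadjusted = adjusted_allele_frequency(0.6, 0.7)           # ignores the CPA
adjusted = adjusted_allele_frequency(0.6, 0.7, kappa=2.0)  # recovers about 0.3
```

Without the correction, the over-amplified allele A would appear more frequent than it is.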
Thirdly, PDA provides single-point association mapping for two groups (e.g., case-control studies or comparative studies of two groups). Let $n_{G1}$ and $n_{G2}$ be the numbers of individuals in groups G1 and G2. The test statistic X of single-point association mapping with adjustment for preferential amplification is built from the difference in the estimated allele frequencies of allele A between the two groups and the estimated variance of that difference. The estimated variance involves the estimated CPAs in groups G1 and G2, the bootstrapped variances of these estimated CPAs, and the experimental standard error, which can be estimated by calculating the root mean square error based on a hierarchical experimental design [18] or by calculating the square root of the variance components obtained with the restricted maximum likelihood method [19]. The asymptotic distribution of the test statistic X is a chi-square distribution with one degree of freedom. This test reduces to the single-point association test proposed in [11] if the CPAs in the two groups are equal. The test statistic and p-value are calculated and used to identify important SNPs. Association studies that compare more than two groups can be analyzed further by combining pair-wise analyses with multiple testing correction.
Fourthly, PDA provides a multipoint association test, SWEPT, based on three statistics computed within each sliding window: the multiplicative, additive and minimum p-value statistics. In these statistics, I(A) is the usual indicator that takes the value 1 if event A is true and 0 otherwise. The non-negative $w_{ij}$ is a standardized weight of the p-value, $v_j$, in the ith window (i.e., the weights in the window sum to one). The standardized weight is calculated by dividing the original weight by the sum of all original weights in the window. The multiplicative SWEPT statistic is a sliding-window extension of the truncated product method [20], and the additive SWEPT statistic is an extension of the test statistic in [21]. The third statistic is the minimum p-value in the window. The minimum SWEPT statistic extends the technique of taking the minimum score, which performs well in terms of test power and type 1 error and has been used broadly in genetic studies [22,23].
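The sliding-window combination of p-values can be sketched as follows. This is an illustration under stated assumptions, not PDA's implementation: the multiplicative statistic is taken as a weighted truncated product, the additive statistic as a weighted sum of the retained p-values, and the minimum statistic as the smallest p-value in the window. How truncation enters the additive statistic is an assumption here, and all function names are hypothetical.

```python
import math


def standardize(raw_weights):
    """Divide each original weight by the sum of the weights in the window."""
    total = sum(raw_weights)
    return [w / total for w in raw_weights]


def window_statistic(pvalues, raw_weights, tau=1.0, kind="multiplicative"):
    """Combine the p-values of one window; p-values above tau are dropped."""
    weights = standardize(raw_weights)
    kept = [(p, w) for p, w in zip(pvalues, weights) if p <= tau]
    if kind == "multiplicative":
        # Weighted truncated product; an empty product equals 1.
        return math.prod(p ** w for p, w in kept)
    if kind == "additive":
        return sum(w * p for p, w in kept)
    if kind == "minimum":
        return min(pvalues)
    raise ValueError(f"unknown statistic: {kind}")


def swept(pvalues, k, raw_weights=None, tau=1.0, kind="multiplicative"):
    """Statistic for every window of k consecutive markers."""
    raw_weights = raw_weights or [1.0] * k
    return [
        window_statistic(pvalues[i:i + k], raw_weights, tau, kind)
        for i in range(len(pvalues) - k + 1)
    ]


# Made-up single-point p-values for six markers, window size 5.
pv = [0.01, 0.5, 0.5, 0.5, 0.5, 0.5]
multiplicative = swept(pv, 5)
minimum = swept(pv, 5, kind="minimum")
```

With equal weights and no truncation, the multiplicative statistic is the geometric mean of the p-values in the window, so a single very small p-value pulls the whole window down.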
There are other efficient p-value combinations, such as the rank truncated product method [24], which may be considered in PDA in the future. Extension of these methods using sliding windows will help screen important genetic markers in large-scale chromosome-wide pooled DNA association studies. By default, PDA performs multipoint analysis by using p-value data obtained from the proposed single-point association; however, PDA also provides options for the use of p-value data yielded from other single-point methods.
To assess the statistical significance of the SWEPT statistic in each window, PDA applies the Monte Carlo procedure recommended in [20] to calculate an empirical p-value. The procedure generates a correlated p-value vector V with correlation matrix $\Sigma$ from an independent p-value vector $V_0$ using a correlation-invariant transformation defined in terms of $\Phi(\cdot)$, the cumulative distribution function of a standard normal random variable, and C, a lower triangular matrix satisfying the Cholesky decomposition $\Sigma = CC^T$. We estimated the correlation matrix $\Sigma$ using an autocorrelation function of the p-values. The SWEPT statistics are then recalculated from the generated p-value vector V. This procedure is repeated B times to yield $\{Z^{(b)}(i,k), b = 1, \ldots, B\}$. The empirical p-value of the ith window with window size k is then calculated by comparing these B simulated values with $Z^*(i,k)$, the corresponding SWEPT value based on the real data. The SWEPT offers several advantages over conventional DNA pooling analyses. (1) SWEPT works well even when only p-value data are available; hence, it can analyze data from different study designs and is applicable to meta-analysis. Because SWEPT allows p-value truncation, it also handles data in which insignificant p-values are unpublished. (2) The SWEPT statistics adjust for preferential amplification, a critical aspect that has not previously been considered in pooled DNA multipoint analyses. (3) The simplicity of the SWEPT statistics lowers processing time and significantly reduces the computational complexity. (4) The SNPs involved in a multipoint analysis are determined conveniently once the window size has been chosen, thereby avoiding the common difficulty of selecting SNPs in haplotype-oriented or other multipoint analyses. (5) SWEPT is comprehensive in that it covers conventional single-point test statistics and can be applied to the analysis of individual genotyping data, although this aspect is not the primary concern of PDA.
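The Monte Carlo procedure described above can be sketched with the Python standard library. Several details are assumptions made for illustration rather than PDA's actual choices: the transformation is taken to be V = 1 - Φ(C Φ⁻¹(1 - V₀)), the correlation matrix is a made-up autoregressive form ρ^|i-j| standing in for the autocorrelation-based estimate, the statistic is an unweighted product of the p-values in one window, and the empirical p-value is the proportion of simulated statistics no larger than the observed one. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()  # standard normal distribution


def cholesky(sigma):
    """Lower triangular C with sigma = C C^T (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][m] * c[j][m] for m in range(j))
            if i == j:
                c[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def ar1_correlation(k, rho):
    """Made-up correlation matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(k)] for i in range(k)]


def correlated_pvalues(c, rng):
    """Turn independent uniform p-values into correlated ones via normal scores."""
    k = len(c)
    # Clamp to the open interval (0, 1) so that inv_cdf is always defined.
    v0 = [min(max(rng.random(), 1e-12), 1 - 1e-12) for _ in range(k)]
    z0 = [ND.inv_cdf(1 - u) for u in v0]
    z = [sum(c[i][j] * z0[j] for j in range(i + 1)) for i in range(k)]
    return [1 - ND.cdf(x) for x in z]


def empirical_pvalue(observed, sigma, n_sim=1000, seed=0, statistic=math.prod):
    """Share of simulated window statistics that are <= the observed one."""
    rng = random.Random(seed)
    c = cholesky(sigma)
    z_star = statistic(observed)
    hits = sum(statistic(correlated_pvalues(c, rng)) <= z_star for _ in range(n_sim))
    return hits / n_sim


sigma = ar1_correlation(5, 0.3)
p_small = empirical_pvalue([0.001] * 5, sigma, n_sim=500)  # strong signal
p_large = empirical_pvalue([0.9] * 5, sigma, n_sim=500)    # no signal
```

The comparison direction suits statistics for which small values indicate association, such as a product of p-values or a minimum p-value.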

Real data analysis
We give four examples to illustrate the functions of PDA: (1) One-group allele frequency estimation. (2) Two-group single-point DNA pooling studies. (3) Two-group multipoint association test based on peak intensity data. (4) Two-group multipoint analysis based on p-value data. Throughout this paper, we set the working directory to 'C:\Program Files\MATLAB71\PDA'. All input data files for these four examples are distributed with PDA and saved in the example directory, 'C:\Program Files\MATLAB71\PDA\Example'.

Example 1: one-group single-point analysis
We used the six SNP data published in our previous paper [11] to illustrate the one-group analysis, the purpose being to estimate allele frequency. The operation procedures are illustrated in Appendix F (See Additional File 6).

Example 2: two-group single-point analysis
In this example, we analyze the data set from our previous project that compared the allele distributions of three main Taiwan subgroups in the human major histocompatibility complex (MHC) region. We selected two subgroups (Hakka and Han groups) and 4 SNPs for the illustrations. The operation procedures are illustrated in Appendix F (See Additional File 6).
The results are shown in Tables 3, 4, 5. In Table 5, PDA conducted association tests using the four SNPs to compare the allele distributions between Hakka and Han groups. Firstly, the association test without applying CPA adjustment was conducted. The chi square statistic and the corresponding p-value were calculated for each SNP. Secondly, modified association statistics X based on the three different CPA adjustments were conducted. The s.e. of experimental error was set to be 0.02 according to our previous study [8]. For example, the association test based on the unbiased adjustment yields chi square statistics 5.54, 11.51, 0.23 and 2.95 and p-values 0.019, 0.001, 0.634 and 0.086 respectively. The conclusions from the unadjusted association test and adjusted association test are quite different.
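Because the test statistic is asymptotically chi-square with one degree of freedom, its p-value can be obtained from the complementary error function. The short sketch below recomputes p-values from the four chi-square statistics quoted above for the unbiased adjustment; the function name is hypothetical.

```python
import math


def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    If Z is standard normal, P(Z**2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square statistics reported for the four SNPs (unbiased adjustment).
reported = {5.54: 0.019, 11.51: 0.001, 0.23: 0.634, 2.95: 0.086}
recomputed = {x: chi2_1df_pvalue(x) for x in reported}
for x, p in recomputed.items():
    print(f"X = {x:5.2f}  p = {p:.4f}  (reported {reported[x]})")
```

The recomputed values agree with the reported p-values up to the rounding of the printed statistics.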
In our previous project, these four SNPs were also genotyped individually and analyzed with an allele-based association test.

Example 3: two-group multipoint analysis based on peak intensity data
In this example, we illustrate a multipoint analysis, an important utility of PDA. We analyzed 10 SNPs from our MHC study to screen for potential candidate regions.
The results are shown in Tables 6, 7, 8 and 9. Table 6 shows the CPA estimates along with s.e. values for the ten SNPs. Table 7 shows the allele frequency estimates along with s.e. values. Table 8 shows the single-point pooled DNA association tests comparing the allele distributions between the Hakka and Han groups. The results show that only SNP 6419 is significant; the p-value is 0.019 for the statistic without CPA adjustment, whereas it is 0.025 after CPA adjustment. Table 9 shows the multipoint pooled DNA association tests. The output first lists the input information of the analysis. In this example, peak intensity data, map information and equal weights were used, and the p-values were not truncated. We carried out 10000 Monte Carlo simulations to calculate the empirical p-value. The size of each window was 5, and the multiplicative p-value statistic was used. Using these settings, multipoint tests based on the different CPAs were conducted. The results are also presented in Figure 2, where the p-values were transformed by taking −log10. For example, based on the unbiased adjustment of the CPA, the p-values for the six sliding windows (with window size 5) are 0.047, 0.84, 0.229, 0.718, 0.629 and 0.874.
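The window count and the transformation used for plotting follow from simple arithmetic: 10 SNPs with a window size of 5 give 10 − 5 + 1 = 6 sliding windows, and each empirical p-value is plotted as −log10(p). The sketch below uses the six p-values quoted above for the unbiased adjustment; the function name is hypothetical.

```python
import math


def window_ranges(n_snps, k):
    """1-based (first, last) marker indices of every window of size k."""
    return [(i, i + k - 1) for i in range(1, n_snps - k + 2)]


windows = window_ranges(10, 5)  # (1, 5), (2, 6), ..., (6, 10)

# Empirical p-values of the six windows (unbiased CPA adjustment).
pvalues = [0.047, 0.84, 0.229, 0.718, 0.629, 0.874]
transformed = [-math.log10(p) for p in pvalues]

# Windows whose p-value is below 0.05 exceed -log10(0.05) on the plot.
significant = [w for w, t in zip(windows, transformed) if t > -math.log10(0.05)]
```

On this scale, only the first window, covering SNPs 1 to 5, rises above the 0.05 line.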
In our previous project, these ten SNPs were also genotyped individually, and the allele-based association test based on individual genotyping data yielded exact p-values for these SNPs.

Example 4: two-group multipoint analysis based on p-value data
Because only the p-value of each SNP was provided as input, the procedures for the CPA estimate, allele frequency estimate and single-point association test cannot be carried out in this analysis. Only multipoint association tests can be conducted. First, PDA shows the input information for the analysis in this example, as follows: p-value data were used; no map information was provided; user-specified weights were used; the threshold value of truncation was 1; the number of Monte Carlo simulations was 10000; the size of each window was 5; the SWEPT statistic was calculated using the additive model. The results are summarized in Table 10 and are presented in Figure 3. Table 10 shows the SWEPT statistics and p-values for the six regions, each of which contains five SNPs. Because the same SNP data were used in Examples 3 and 4, it is not surprising that the results are similar to those in Example 3.

Discussion
CPA estimation is based on peak intensity data of heterozygous individuals. Data on heterozygous individuals from a pilot study may occasionally be unavailable. Publicly accessible CPA databases for SNPs provide important information in such cases [25,26]. PDA allows for allele frequency estimation and association testing by directly inputting the CPA values of the SNPs of interest. This function enables PDA to analyze large numbers of SNPs from public databases in pooled DNA analysis.
PDA provides an extended single-point association test that allows for different CPAs in the two groups being compared. This test reduces to the conventional test in [11] if equal CPAs in the two groups are assumed. If case and control DNA pools are typed at the same time under the same experimental conditions, then the reduced test should be applied. However, if the DNA pools of the case and control groups are typed at different times or in different environments, e.g., in a meta-analysis or a sequential analysis, then the extended test should be performed.
Haplotype-scoring [27] and locus-scoring approaches [28] are the two main categories of association tests for disease gene mapping; however, it is currently unclear which method is superior when analyzing individual genotyping data. We are the first to introduce a locus-scoring approach for the analysis of pooled DNA data. The SWEPT method considered in PDA is a locus-scoring approach, which does not require inference of phase-unknown haplotypes; hence it has several advantages, among which is the reduction of the computational burden. Until there is a breakthrough in the cost efficiency of haplotyping, the locus-scoring approach is preferable to the haplotype-scoring approach for pooled DNA analyses.
Weights for different SNPs in each window may affect the significance of a multipoint association test. If there is no prior knowledge in this regard, then equal weights can be employed. Another strategy is to choose weights according to genetic/physical or linkage disequilibrium maps of the SNPs [29]. Using haplotype map information to improve the estimation of the allele frequency difference at each single locus for association mapping has been considered in [30]. In our method, a SNP should be assigned a higher weight if it is closer to the anchor in the center of a window. The anchors scan over the chromosome region of interest as the sliding windows move from the first SNP to the last.
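One way to give markers near the anchor a higher weight is sketched below. The inverse-distance form 1/(1 + d), with d measured in marker positions from the window center, is purely hypothetical and is not PDA's weighting scheme; a genetic, physical or linkage disequilibrium distance could be substituted for d.

```python
def anchor_weights(k):
    """Standardized weights for a window of k markers, largest at the center.

    Raw weight of marker j is 1 / (1 + |j - center|); the weights are then
    divided by their sum so that they add up to one.
    """
    center = (k - 1) / 2
    raw = [1.0 / (1.0 + abs(j - center)) for j in range(k)]
    total = sum(raw)
    return [w / total for w in raw]


weights = anchor_weights(5)  # symmetric around the third marker
```

Equal weights are recovered by replacing the raw weights with a constant, which is the choice suggested when no prior knowledge is available.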
The sliding-window procedure emphasizes a local effect: it assumes that the neighboring SNPs provide sufficient information for the window of interest and that SNPs outside the window do not affect the inference for the window once the SNPs within it have been considered. Only a small proportion of the SNPs is considered each time, making the sliding-window approach a convenient and practical procedure for chromosome-wide studies once the window size is determined. A sliding-window size of 5 was suggested in [31] for the selection of genetic markers for association tests with individual genotyping data, although the authors cautioned that this value might not be suitable in certain situations. We suggest that the genetic background of the studied region be considered and that several window sizes around 5 be analyzed to yield reliable results.

Conclusion
PDA provides simultaneous analyses of the CPA adjustment, adjusted allele frequency estimates and single/multipoint DNA pooling association tests, which are usually essential for complete DNA pooling studies. All of the PDA functions are illustrated in the four bona fide examples contained in the program. PDA is simple to operate and does not require that users have a strong statistical background.