Family-based and case-control association designs have both been used in many genome-wide association studies (GWAS). In a GWAS, where approximately 1 million markers are tested, the major challenge is distinguishing true positives from the many false positives. Many GWAS datasets have been deposited in public databases such as the database of Genotypes and Phenotypes (dbGaP), and the Wellcome Trust Case Control Consortium (WTCCC) provides a large number of case-control samples for public analysis. These resources provide a crucial opportunity to increase power by combining datasets. However, doing so requires flexible analytic methods that can accommodate diverse study designs (e.g., family and case-control).
Currently available software for combining case-control and family data has restrictions. Most programs, such as SCOUT, CHRR and UNPHASED, assume that samples are drawn from a homogeneous population, which may not be a reasonable assumption for data from a large consortium. FamCC can account for population stratification and uses nuclear families with an arbitrary number of siblings, but it requires parental genotype data, which are often unavailable for late-onset diseases. To overcome these restrictions, we developed the Combined APL test (CAPL), a novel and powerful statistical test that accommodates both family and case-control datasets and accounts for population stratification using a clustering algorithm.
CAPL is an extension of the family-based Association in the Presence of Linkage (APL) test, which compares the observed number of alleles in affected siblings with its expected value, conditional on parental genotypes, under the null hypothesis of no linkage or no association. CAPL can use nuclear families with one or more affected sibs and can properly infer missing parental genotypes in the presence of linkage by accounting for the identity-by-descent (IBD) parameters. Unrelated cases and controls are treated in CAPL as families with one sibling and two missing parents, so that they can be integrated into the family-based framework. Ward's clustering algorithm is used to identify subpopulations, and parental mating-type probabilities are calculated conditional on the subpopulation information. The EM algorithm is used to estimate the allele frequencies, IBD parameters and probabilities of origin in the presence of population substructure. The variance of the CAPL statistic is estimated with a bootstrap approach: for each bootstrap replicate, samples are resampled with replacement and the EM algorithm is rerun. The clustering algorithm is also included in the bootstrap procedure to account for the variation introduced by clustering. Under various simulation scenarios, CAPL has been shown to have correct type I error rates and more power than other association tests that combine case-control and family data, such as UNPHASED, SCOUT, CHRR and FamCC.
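The resample-then-rerun-EM structure of the variance estimate can be illustrated with a deliberately simple stand-in. The sketch below is not the CAPL EM algorithm: it assumes a toy problem in which only a dominant phenotype (A- versus aa) is observed for each individual, Hardy-Weinberg proportions hold, and the quantity of interest is the frequency of allele A. The function names `em_allele_freq` and `bootstrap_variance`, the tolerance and the starting value are all hypothetical choices made for illustration.

```python
import random


def em_allele_freq(n_dominant, n_recessive, tol=1e-10, max_iter=1000):
    """Toy EM estimate of the frequency p of a dominant allele A.

    Only the phenotype is observed (A- vs. aa), so the AA/Aa split among
    dominant individuals is the missing data. Assumes Hardy-Weinberg
    proportions. Returns (estimate, number of iterations used).
    """
    n = n_dominant + n_recessive
    if n == 0 or n_dominant == 0:
        return 0.0, 0  # no copy of A can be inferred from the data
    p = 0.5  # arbitrary starting value
    for iteration in range(1, max_iter + 1):
        q = 1.0 - p
        # E-step: expected numbers of AA and Aa among dominant individuals.
        denom = p * p + 2.0 * p * q
        exp_aa_hom = n_dominant * (p * p) / denom
        exp_het = n_dominant * (2.0 * p * q) / denom
        # M-step: allele counting on the completed data.
        p_new = (2.0 * exp_aa_hom + exp_het) / (2.0 * n)
        if abs(p_new - p) < tol:
            return p_new, iteration
        p = p_new
    return p, max_iter


def bootstrap_variance(phenotypes, n_replicates=200, seed=1):
    """Bootstrap variance of the EM estimate.

    `phenotypes` is a list of 1 (dominant) / 0 (recessive). Each replicate
    resamples individuals with replacement and reruns the EM from scratch.
    """
    rng = random.Random(seed)
    n = len(phenotypes)
    estimates = []
    for _ in range(n_replicates):
        resample = [phenotypes[rng.randrange(n)] for _ in range(n)]
        n_dom = sum(resample)
        p_hat, _ = em_allele_freq(n_dom, n - n_dom)
        estimates.append(p_hat)
    mean = sum(estimates) / n_replicates
    return sum((e - mean) ** 2 for e in estimates) / (n_replicates - 1)
```

In this toy setting the EM fixed point satisfies q^2 = n_recessive / n, so the estimate can be checked against a closed form; no such shortcut is assumed for the CAPL parameters, which is why each replicate there requires a full EM run.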
Generally, 20-40 EM iterations are required for the parameter estimates to converge, and 200-1000 bootstrap replicates are performed in CAPL for the variance estimate, with the EM algorithm rerun in every replicate. For a GWAS with approximately 1 million markers, this amounts to roughly 4 × 10^9 to 4 × 10^10 EM iterations in total. The CAPL algorithm is therefore computationally intensive and can be inefficient for analyzing GWAS data. The same is true of other association methods that infer missing parental mating types from sample allele frequencies, such as UNPHASED, which relies on a quasi-Newton algorithm to obtain maximum likelihood estimates. However, because CAPL analyzes each marker independently, the per-marker analyses can be run in parallel to reduce the run time.
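Because markers are analyzed independently, the work can be divided among workers with no communication between them. The following minimal sketch shows one possible block partition of marker indices and a pool that processes the blocks concurrently. It is an illustration only, not the CAPL implementation: `block_bounds`, `analyze_marker`, `analyze_block` and `analyze_all` are hypothetical names, and `analyze_marker` is a placeholder for the per-marker test.

```python
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n_markers, n_workers, worker):
    """Half-open range [start, stop) of marker indices for one worker.

    Blocks are contiguous and differ in size by at most one marker.
    """
    base, extra = divmod(n_markers, n_workers)
    start = worker * base + min(worker, extra)
    stop = start + base + (1 if worker < extra else 0)
    return start, stop


def analyze_marker(marker_index):
    # Placeholder for the real per-marker analysis (hypothetical).
    return marker_index * marker_index


def analyze_block(bounds):
    """Analyze every marker in one block, in index order."""
    start, stop = bounds
    return [analyze_marker(m) for m in range(start, stop)]


def analyze_all(n_markers, n_workers):
    """Split markers into blocks and analyze the blocks concurrently."""
    blocks = [block_bounds(n_markers, n_workers, w) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of `blocks`, so the
        # concatenated output is ordered by marker index.
        per_block = list(pool.map(analyze_block, blocks))
    return [result for block in per_block for result in block]
```

A thread pool is used here only to keep the sketch short and portable. In CPython, threads do not speed up CPU-bound pure-Python code, so a real speedup would need separate processes or compiled code; the same block partition applies unchanged if each block is handed to a process or a cluster node instead.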
We implemented CAPL using the POSIX threads (pthreads) library and Open MPI, an open-source implementation of the Message Passing Interface, so that it can be executed in a computer cluster environment. We used computer simulations to demonstrate that CAPL can analyze GWAS datasets within a reasonable amount of time. The CAPL software package will be a useful tool for combining existing family and case-control GWAS datasets in the presence of population stratification.