An integrative method for scoring candidate genes from association studies: application to warfarin dosing

Background A key challenge in pharmacogenomics is the identification of genes whose variants contribute to drug response phenotypes, which can include severe adverse effects. Pharmacogenomics GWAS attempt to elucidate genotypes predictive of drug response. However, the size of these studies has severely limited their power and potential application. We propose a novel knowledge integration and SNP aggregation approach for identifying genes impacting drug response. Our SNP aggregation method characterizes the degree to which uncommon alleles of a gene are associated with drug response. We first use pre-existing knowledge sources to rank pharmacogenes by their likelihood to affect drug response. We then define a summary score for each gene based on allele frequencies and train linear and logistic regression classifiers to predict drug response phenotypes. Results We applied our method to a published warfarin GWAS data set comprising 181 individuals. We find that our method can increase the power of the GWAS to identify both VKORC1 and CYP2C9 as warfarin pharmacogenes, where the original analysis had only identified VKORC1. Additionally, we find that our method can be used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders. Conclusions Our method offers a new route for candidate pharmacogene discovery from pharmacogenomics GWAS, and serves as a foundation for future work in methods for predictive pharmacogenomics.


Background
Genome-wide association studies (GWAS), which generally assay common variation and rely on relatively large effects, are now being widely applied to perform genome-wide scans for variants associated with pharmacological phenotypes [1,2]. In the case of complex diseases, GWAS have yielded associated variants with modest overall effects. However, in the case of pharmacological phenotypes, initial results from pharmacogenomic GWAS appear to indicate a greater ability to discover variants with substantial effect size [1,3]. Nevertheless, pharmacogenomic GWAS suffer many of the limitations of disease GWAS in that follow up studies are often required to elucidate the causative genes and variants latent in the GWAS results [4][5][6]. Additionally, pharmacogenomic GWAS are also limited in power by small cohort sizes [7]. Amongst the drugs whose pharmacological variance has been evaluated using the GWAS approach, warfarin (Coumadin) has emerged as a prominent pharmacogenomics case study with great translational potential.
Given the its broad use, narrow therapeutic range, and severity of side effects, a comprehensive pharmacogenomic characterization of warfarin dose-response offers the potential for substantial clinical impact [8,9]. Retrospective studies revealed the role of VKORC1, CYP2C9, and CYP4F2 in mediating abnormal variations in warfarin dose response [10][11][12], explaining approximately 30%, 10% and 5% of the variance in drug response respectively [13]. The relationship between variants in these genes and atypical warfarin dose response has been subsequently confirmed by GWAS analysis [14,15]. Recently, a method to estimate stable warfarin dose was developed by integrating information on patient VKORC1 and CYP2C9 genotypes with clinical factors [16]. Although this method incorporated genotype information for only two warfarin pharmacogenes, the genotype-based method was able to explaiñ 49% of the variance in stable dose, substantially outperforming the pure-clinical and fixed-dose approaches. The elucidation of additional large-effect pharmacogenes for warfarin and other genomic features might serve to dramatically increase the ability to predict warfarin dose response from genotype. Therefore, in this study, we propose a novel method for increasing the power of pharmacogenomic GWAS and detecting pharmacogenes predictive of drug response. Our method characterizes the degree to which uncommon alleles of a gene correlate with drug response. If uncommon alleles are associated with atypical drug response phenotypes, then the gene is considered a predictive pharmacogene and a putative marker for drug response.

Results
Using knowledge to limit the hypothesis testing space We identified 228 genes which were likely to be pharmacogenes for warfarin and which also contained SNPs measured in the Cooper warfarin response GWAS data set (See Methods). There were 3,856 SNPs contained within these 228 genes. We tested each of the 3,856 SNPs with a univariate linear regression model for its ability to predict warfarin dosage, the exact analysis performed by Cooper, et al., however, with the advantage of testing fewer hypotheses. As expected, the results of this analysis closely resemble the results of the Cooper analysis except that a much less strict significance threshold is necessary to correct for multiple hypothesis testing (1.3e-5 as opposed to 1.0e-7). However, even with this lower threshold only one SNP, rs10871454 (VKORC1), was significant after correction (Table 1).
Derivation and evaluation of a candidate gene-score based on allele frequencies W assigned each SNP to a gene if the SNP was within 5 kbp of the boundary of the gene. Some SNPs mapped to more than one gene. We then aggregated the SNPs into genes using this mapping and calculate each gene's pHap gene-scores (See Methods). Each gene-score was then tested with a univariate linear regression model for its ability to predict warfarin dosage, again, the same analysis performed by Cooper, except that we now are testing only 228 hypotheses and each gene-score is a summary of a set of SNPs frequencies. In this analysis both VKORC1 and CYP2C9 are significant for predicting stable warfarin dosage ( Table 2). VKORC1 and CYP2C9 pass the corrected significance threshold of 1e-3 with p-values of 9.1e-7 and 9.6e-5 respectively.

Extreme dose response warfarin response models
The gene-score was calculated for each of the 228 genes in the WSP for each patient in two classes: low dose patients and the compliment. The distributions of genescores of the two classes were tested for the null hypothesis, namely that the means of the distributions were equal. Two genes significantly distinguish the two classes after corrected for multiple hypothesis testing, VKORC1 and UBE3A with p-values of 1.2e-4 and 5.4e-5 respectively. 18 other genes had p-values that were less  than 0.05, but not significant after multiple hypothesis testing (Table 3).
A logistic regression classification model was trained on VKORC1 and UBE3A gene-scores and evaluated with 10-fold cross validation ( Figure 1). The AUROC was 0.721 (significance p-value < 0.01). A second logistic regression classification model was trained on the 20 genes that had a p-value ≤ 0.05 ( Figure 1). The AUROC of this classifier was 0.886 (significance p-value < 0.01).
A logistic regression classification model was trained on VKORC1, CYP2C9, and QTRTD1 and evaluated with 10-fold cross validation ( Figure 2). The AUROC was 0.693 (significance p-value < 0.01). A second logistic regression classification model was trained on the 15 genes that had a p-value ≤0.05 ( Figure 2). The AUROC of this classifier was 0.764 (significance p-value < 0.01).

Discussion
In this study we present a novel method for integrating pre-existing pharmacological knowledge and scoring candidate genes from association studies. We have demonstrated that this filtering and scoring method is capable of identifying candidate genes explaining a large proportion of warfarin dose response phenotypes.
Our pHap value is a simple measure of how "extreme" an individual's variants are in a particular gene: if all the SNPs show minor alleles, then the individual has a very high pHap value for that gene. If all the SNPs show major alleles, then the individual has a low value for that gene. We have shown that this aggregate measure of genotype has the advantage of aggregating genetic variation in order to reduce the number of hypotheses tested. This measure cannot only increase the power to identify candidate genes that explain dosage variation but also can identify candidate genes for extreme phenotypes.
We validate our methods on the warfarin GWAS study by Cooper et al, which attempted to identify candidate genes by examining a SNPs ability to predict dosage in a univariate linear regression model. The power of the Cooper analysis, however, was limited by the low number of patients in the cohort, a common problem in drug GWAS [7]. When our methods were applied to the same data set we were able to identify the two best characterized genes responsible for variation in warfarin dose response, VKORC1 and CYP2C9. It is interesting to note that the knowledge filtration method alone will not identify both of these two genes ( Table  1). The pHap, which summarizes the aggregate contribution of a set of SNPs, is also needed in order to identify CYP2C9 significantly (Table 2).
We also show that our method can significantly identify features for a machine learning classification algorithm. We achieve high performance when classifying between low/not low-dose patients and between high/ not-high dose patients (AUROCs of 0.886 and 0.764 respectively, Figures 1 and 2). Using an empirical bootstrapping approach we demonstrated that the genes our method identifies are significant. In both cases the AUROC of the model was the highest observed, corresponding to a p-value < 0.01. We acknowledge that the described method is dependant being able to identify potentially important pharmacogenes. It is important to note that the algorithm we employed to rank genes for their potential to be pharmacogenomic does not require the drug to be previously known and will predict pharmacogenes for novel chemical structures.
The results of this study offer support for the applicability of allele-based pharmacogenetic models for the prediction of drug response phenotype. In addition, it opens up new avenues for candidate pharmacogene discovery. In future work we plan to improve our method through the investigation of more sophisticated classification methods and identification of additional genetic and genomic features predictive of drug response. We will also seek to expand the application of our method to additional drugs, with the overarching aim of developing a method that is predictive and robust across a broad range of drugs. We also plan to investigate the warfarin-associated pharmacogenes identified by our approach and work to biologically validate their putative role in warfarin response.

Conclusions
We have developed a novel approach that incorporates pharmacogenomic knowledge integration and a gene scoring system based on SNP aggregation to enhance pharmacogene discovery from pharmacogenomics GWAS. We applied this approach to a published warfarin GWAS data set comprising 181 individuals and found that our method can increase the power of the GWAS to identify established warfarin pharmacogenes VKORC1 and CYP2C9, and implicate several novel warfarin pharmacogenes. Additionally, we find that our method can be  Table 3), and the second, dashed line, on only those genes that were significant after multiple hypothesis testing correction (AUROC=0.721, p≤0.001, Table 3). Both classifiers have empirical p-value significance of less than 0.01 when tested using bootstrapping. used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders, establishing a basis of direct clinical utility for the approach. Based on the performance observed with the warfarin GWAS, we recommend future work to extend the approach and apply it to additional pharmacogenomic GWAS as well as GWAS characterising other traits of clinical interest.

Warfarin GWAS data
For validation, a dataset of 181 patient genotypes (175 Caucasian, 6 Hispanic) and stable warfarin dosages was obtained from a warfarin GWAS study described by Cooper et al. [15]. The list of SNPs measured by the Cooper et al. study was filtered so that the set of SNPs had a maximum pairwise r-squared linkage disequilibrium score of 0.2. The resulting set was then queried against the SCAN SNP and CNV Annotation Database to determine whether the SNPs were either contained within a given gene or within 5kbp upstream or downstream of that gene. One exception to the SCAN SNP mapping was made so that rs10871454 mapped to VKORC1. This SNP is in perfect linkage disequilibrium with rs9923231, a VKORC1 SNP [15].
Using knowledge to limit the hypothesis testing space GWAS studies are hindered by multiple hypothesis testing corrections that significantly limit the power of analysis on smaller patient data sets. Pharmacogenomic knowledge bases integrate large amounts of data from the literature and biological experiments. Recently, an algorithm, called the PGxPipeline, was published that integrates pharmacogenomics and drug-binding databases to rank genes for their likelihood to be pharmacogenomic for a given drug [17]. The PGxPipeline was used to rank genes for their likelihood to be involved in the pharmacokinetics or pharmacodynamics of warfarin metabolism and action. This ranking of genes was then repeated for each of 486 other drugs for which pharmacogenetic interactions are known. These other drug gene rankings were then used to apply a significance score to the warfarin gene rankings. Only the most significant (p-value ≤ 0.05) warfarin genes were used the following analysis. 786 such genes were identified to have significant likelihood scores to be potential warfarin pharmacogenes. Of those, 228 contained SNPs that were measured by the Cooper GWAS study. We define this genes set to be the Warfarin-Specific Pharmacogenome (WSP). We limited our analysis to just those 3,856  Table 3), and the second, dashed line, on only those genes that were significant after multiple hypothesis testing correction (AUROC=0.693, p≤0.001, Table 3). Both classifiers have empirical p-value significance of less than 0.01 when tested using bootstrapping.
SNPs that were in these 228 genes. The univariate linear regression analysis used in the Cooper study was repeated for these 3,856 SNPs (Table 1).
Derivation and evaluation of a candidate gene-score based on allele frequencies We aggregated he 3,856 SNPs by gene and a computed the gene-score for each gene based on the SNP allele frequencies. These features are based on the straightforward assumption that genes with a preponderance of low-frequency alleles in individuals with extreme drug response phenotypes are more likely to be modulating that response [18]. Variants of pharmacogenes are important in determining drug response. As a proxy to the potential of variants in a pharmacogene to effect drug response we define the pHap score. To calculate the pHap score for a given patient and a given gene we first compute the negative log of each genotype frequency for each SNP in the given gene. We then take the sum of these values and call this the pHap score. The pHap score for patient, i, and gene, j, is defined as where N j is the number of SNPs in gene j and f i,j,k is the frequency of the allele of patient i for SNP k of gene j. Therefore, there is a pHap score for each gene in the WSP for each patient. In total we generated 228 pHap scores for each patient (for the 228 included genes). We then fit each gene-score in a univariate linear regression model to the dosage data and tested for the gene-score's ability to predict the dosage data (Table 2).

Redefinition to a classification problem
To test the method's ability to identify genomic features that are predictive of phenotypes the patients were divided into classes based on their stable dose. The first two classes consisted of patients who required a low stable dose of warfarin (≤3mg/day) and the compliment set of patients. The latter two classes consisted of patients who required a high stable dose of warfarin (≥7mg/day) and the compliment set of patients. Dividing the patients into these two sets of classes (low/not low and high/not high) redefines the task as a classification problem. This allows us to train machine-learning algorithms on the features identified by our method and, if the features are discriminatory, predict the drug response classification of the patient.

Identification of features to distinguish extreme dose patients
Patients were divided into two classes, those that required a low stable dose of warfarin and those who did not. For each patient we calculated the gene-score of each gene and tested for the two classes of scores for the null hypothesis using a Students t test. Since 228 null hypotheses were being tested the p-values were corrected using the conservative Bonferroni multiple hypothesis correction method. The new threshold for significance was set at 0.001 (Table 3).
We trained two logistic regression classifiers. The first classifier we trained was on only the genes that were significant after correcting for multiple hypothesis testing and the second was trained on all genes that had a p-value ≤ 0.05. The classifiers were evaluated using 10fold cross validation. In 10-fold cross validation the classifier is trained on 9/10th of the data and the remaining 1/10th of the data is reserved to evaluate the performance of the classifier. This process is repeated 10 times and the average performance is plotted in a receiver operating characteristic (ROC) curve. The area under the ROC curve (AUROC) is a summary statistic for overall the performance of the classifier (Figure 1).
In order to evaluate the significance of the classification models an empirical p-value was derived. To derive an empirical p-value, we trained the classifiers on randomly chosen genes from the set of 228 in the WSP. The number of genes chosen at random corresponded to the number of genes used in the two classifiers. We repeated this process 100 times for each classifier.
The analogous analysis was repeated for the highdose/not-high-dose patient classifications (Table 4, Figure 2).