Drug perturbation gene set enrichment analysis (dpGSEA): a new transcriptomic drug screening approach

Background In this study, we demonstrate that our modified Gene Set Enrichment Analysis (GSEA) method, drug perturbation GSEA (dpGSEA), can detect phenotypically relevant drug targets through a unique transcriptomic enrichment that emphasizes biological directionality of drug-derived gene sets. Results We detail our dpGSEA method and show its effectiveness in detecting specific perturbation of drugs in independent public datasets by confirming fluvastatin, paclitaxel, and rosiglitazone perturbation in gastroenteropancreatic neuroendocrine tumor cells. In drug discovery experiments, we found that dpGSEA was able to detect phenotypically relevant drug targets in previously published differentially expressed genes of CD4+T regulatory cells from immune responders and non-responders to antiviral therapy in HIV-infected individuals, such as those involved with virion replication, cell cycle dysfunction, and mitochondrial dysfunction. dpGSEA is publicly available at https://github.com/sxf296/drug_targeting. Conclusions dpGSEA is an approach that uniquely enriches on drug-defined gene sets while considering directionality of gene modulation. We recommend dpGSEA as an exploratory tool to screen for possible drug targeting molecules.

of available knowledge, we propose a transcriptomically-driven drug screening approach that utilizes enrichment methods to determine candidate therapeutics.
Discovery-based enrichment methods can be used for finding matching transcriptomic signatures of drug-disease comparisons [8]. One approach, referred to as the signature reversion principle, has been successful in diverse therapeutic settings [9][10][11]. It assumes that a drug induced gene expression signature will be correlated with change of the transcriptome in disease to a healthy or healthier state [2]. Our premise is that a negatively correlated gene profile of a drug-perturbated transcriptome can be exploited in in silico drug screening methodologies.
Gene set enrichment techniques are well established in providing biological context in -omics studies, particularly in transcriptomic studies where summarizing the overall biology of a particular contrast or linear model by pathways enhances interpretability. We have used gene set enrichment techniques in a variety of transcriptional studies that compare or contrast the human host response to infectious or chronic illness [12][13][14]. Of the various enrichment approaches [15], Gene Set Enrichment Analysis (GSEA), Database for Annotation, Visualization and Integrated Discovery (DAVID) and Gene Ontology (GO) are gold standards in pathway and gene set enrichment for transcriptomic analyses [16][17][18][19], but unfortunately, their direct application in drug screening may not be ideal due to the lack of incorporation of drug-gene modulatory information. While other popular approaches such as gene2drug, DSEA, sscMAP, L1000cds, and CMAP-native methods may include such information, they lack the statistical rigor of GSEA [20][21][22][23][24]: none perform error rate analysis, calculate score normalizations, provide enrichment driver genes, or are tailored for transcriptomic analyses.
By performing enrichment on disease-associated gene signatures while using drug perturbation defined gene sets, entire transcriptomes can be probed for potential drugs or therapeutics. We propose a modified version of GSEA, namely drug perturbation GSEA (dpGSEA), to perform a unique drug-defined gene set enrichment analysis for screening therapeutics downstream of transcriptomic or proteomic studies. We describe dpGSEA as an analysis tool that emphasizes enrichment of counteracting gene expression between drug-gene and disease-gene profiles and provides an easily interpretable set of statistics to determine effectiveness of screened drugs. By using proto-matrices to capture a-priori drug perturbated gene signatures rather than gene sets, we believe our approach is well suited for transcriptomic-based therapeutic screening and enrichment.

Methods
We provide a comparison between dpGSEA and related approaches in Fig. 1 and detailed definition, notation, framework, statistic, normalization and error rate notes in the Additional file 1: Methods.

dpGSEA gene set priors
An overview of the dpGSEA processing, including the proto-matrix is shown in Fig. 2. dpGSEA utilizes transcriptomic signatures of drug perturbated cell lines from Broad Institute's connectivity map project (CMAP) and the library of integrated networkbased cellular signatures (LINCS) projects to produce annotated gene sets rather than curated lists, like those from MSigDB [16,24,25]. These gene sets are organized into proto-matrices as defined by gene signature cutoffs of ranked top fold change or statistical significance. The proto-matrix itself contains information including genes acted on by a specific drug and the directionality in which it is influenced, that is, whether the drug induces up or down regulation of the gene.
To generate the proto-matrices, differential expression (DE) analysis using default LIMMA-voom parameters in Bioconductor was performed on the CMAP and LINCS data [26]. For each drug, a DE experiment was conducted using the correspondingly batched DMSO sample as controls, while remaining effects were linearly corrected. The resulting genes were ranked by fold change and statistical significance to generate a specific signature, that is, the top 10, 20 or 50 genes acted upon by a specific drug with the cell-line information retained (these are labeled as "Sig Rank 10" or "FC Rank 20", etc. with the first label denoting fold change or significance and the last label denoting the number of top ranked genes).

dpGSEA scoring statistics
Similar to the approach of GSEA (see Additional file 1: Methods), we consider a list L of annotated genes rank-ordered by increasing T (j) , for j ∈ {1, . . . , p} . Our method detects an enrichment of high values of T (j) in the positive tail of gene set S k . This Fig. 1 Comparisons between dpGSEA and similar approaches. a dpGSEA's primary differences compared to GSEA include usage of a-priori gene set information derived from the Broad Institute's connectivity map project (CMAP) and the library of integrated network-based cellular signatures (LINCS) projects organized as proto-matrices, an absolute statistical significance ranked approach rather than a fold change ranked approach, and a novel statistic to evaluate the drug target. Both approaches utilize a random walk running sum statistic to calculate enrichment scores. dpGSEA requires two inputs from the user to run. b dpGSEA is listed with comparable techniques that utilize GSEA-like approaches. Our approach uses the significance of a gene as well as directionality along with the generation of a novel statistic, the target compatibility score. We also show the driver genes for each drug translates into finding evidence of a leading-edge subset in gene set S k , in which the values of T (j) are maximal: 1. The traditional Enrichment Score, denoted U k = U (S k ) = U T 1 , . . . , T p |S k = ES k which is calculated for each gene set S k , as the maximum deviation from 0 of a weighted running sum, for j ∈ {1, . . . , p} , in the gene set S k , relative to its complement S k . Formally, our first gene-specific Global Test Statistic can be written as: Where | | denotes the absolute value, max(·) denotes the maximum function with respect to gene index l ∈ {1, . . . , p} , ω is a parameter describing the weight of the tail in the random walk (see remarks below), and I[σ (j) ∈ S k ] is the indicator function on whether the jth rank-ordered gene, belongs to gene set S k and is the inverse sign referring to the counter directionality for disease-gene and drug-gene, for which is calculated for each gene set S k , for k ∈ {1, . . . , K } , as the absolute distance between the point of maximum enrichment score and the point where the rankordered T (j) is minimal in absolute value, typically a zero fold-change or zero correlation gene index. This involves the computation of two gene indices: (1) the gene rank maximizer of the ES k statistic (leading edge upper bound), denoted l max k , and (2) the gene rank minimizer of the rank-ordered T (j) , denoted lmin : where l max k = arg max l∈{1,...,p} v k (l) and lmin = arg min j∈{1,...,p} T (j) where arg max(·) and arg min(·) denote the maximizer and minimizer functions with respect to gene index l ∈ {1, . . . , p} and j ∈ {1, . . . , p} , respectively.

Normalization, significance, and error rate
Normalization places ES and TCS scores on respective comparable scales. A null distribution is created by gene label permutation of list L while retaining original gene label rank-ordering; this is performed for 1000 permutations. The normalization factor is the change of scale obtained by the mean of the scores generated by the permuted distributions, and the normalized score is then obtained by simply dividing the true score by this normalization factor. The significance of the true score is determined by the proportion of permuted scores that are greater than the true score, and our null hypothesis states that the true score is no different from those generated by random gene label permutation.
The multiple testing problem is addressed by our procedure carried out to control the False Discovery Rate (FDR). After a full experimental run of dpGSEA, the FDR is calculated by comparison of the proportion of all permuted null normalized scores for every drug screened greater than the specific score of a drug in question. This is performed for each ES and TCS respectively and is the approach utilized by GSEA.

Testing dpGSEA
We approached testing dpGSEA in a two-fold manner. (1) We determined if dpGSEA was able to positively identify a perturbated drug from an external DE experiment through positively correlated gene modulation, as opposed to the signature reversion principle. (2) We used dpGSEA as intended, an exploratory tool for drug screening, to determine if the therapeutics detected have biological or phenotypic relevance to a disease in question.
For the first case, we tested third party gene signatures, not those from CMAP or LINCS, derived from gastroenteropancreatic neuroendocrine tumor cells (GEPNTs) perturbated by fluvastatin, parbendazole (against drug-defined gene sets present and generated from CMAP), paclitaxel, rosiglitazone (against drug-defined gene sets present and generated from LINCS), and doxorubicin (against drug-defined gene sets present and generated from both CMAP and LINCS) (Gene Expression Omnibus (GEO) #GSE98894) [27]. Drug perturbation DE for GEPNTs was performed using LIMMAvoom and matching signatures were detected using dpGSEA.
For the second case, drug screening, we applied dpGSEA to our recent study of differential gene expression in CD4+T regulatory cells (Tregs) from immune responders (IR) and nonresponders (INR) to antiviral therapy in HIV-infected individuals (GEO #GSE106792) [28]. This study assessed HIV-infected individuals for their ability to reconstitute the CD4+T cell pool in response to antiretroviral treatment and what candidate mechanisms were behind poor clinical outcomes and greater risk for morbidity and mortality with respect to INR status. Mitochondrial Treg mechanisms were implicated to be the cause of the cell cycle halting [28]. We analyzed this dataset with dpGSEA to determine whether we could identify drugs that may take advantage of differentially expressed genes (DEGs) involved in mitochondrial dysfunction or immune function as a whole in INRs.

Results
Our case study results for detection of GEPNTs drug perturbations by dpGSEA that pass the FDR α = 0.05 threshold are shown in Table 1A and B for both ES and TCS, respectively, along with the specific proto-matrix used. It is worth mentioning that not every GEPNTs drug perturbation was positively identified by every proto-matrix by ES and TCS FDR thresholds, but we were able to positively identify all perturbations in most case with the exception of rosiglitazone by ES FDR and fluvastatin by TCS FDR. Paclitaxel perturbations were most frequently positively identified by both scores and primarily by significance-based LINCS proto-matrices, while other drugs varied in their positive findings with respect to the proto-matrix utilized. Table 2 shows the most statistically significant ES drug discoveries for the INR versus IR case study where mitochondrial and immunological associated drugs were found. Oseltamivir-carboxylate, the active metabolite of Tamiflu, an antiviral, prevents the release of progeny influenza virions while simultaneously modulating human sialidases which have been found to be localized in the mitochondria and involved in the regulation of cell apoptosis [29,30]. Ibutilide, an antiarrhythmic, has been shown to inhibit endoplasmic reticulum and mitochondrial stress mechanisms [31]. These findings are consistent with the INR mitochondrial dysfunction while showing targetable transcription that may increase antiviral activity and/or prevent cell cycle disruption of Treg function. Notably, other drugs within statistical significance of 0.05, such as fibronil and telmesteine (p = 0.015, p = 0.017 respectively) have targets (CPT1A, and IDH2 respectively) suggested to be representative of fatty acid oxidation and energy production of mitochondrial dysfunction congruent with previous research [28].
When comparing dpGSEA with traditional GSEA, we find that the ranked ordering of paclitaxel perturbated cell lines are significantly different suggesting a substantial difference compared to our approach (Fig. 3). Wilcoxon signed rank tests are not significant (p > 0.90) when comparing dpGSEA with GSEA for both ES significance ranking and TCS significance ranking (Fig. 3e). Comparisons between ES and TCS rankings within dpGSEA and GSEA for comparable proto-matrices showed a maximum positional shift of 3 for perturbated cell line ranking (Top 50 Rank GSEA: MCF7 from 2nd to 5th) and most ranking shifts between ES and TCS within 2 spots (0 spot shift: 11; 1 shift: 9; 2shift: 5; 3shift: 1). Figure 4 compares dpGSEA score and significance trends against those of the CMAP native and the gene2drug approaches. Two approaches were not included in our evaluation: the sscMAP approach is no longer available and the L1000cds approach does not provide a significance estimate. Here, we use the CMAP top 20 ranked significantly proto

Table 1 A rank-ordered list by (A) ES and (B) TCS p value of positively correlated validation tests using an external RNA-seq dataset
Each row represents a positively identified drug (rosiglitazone, fluvastatin, parbendazole, paclitaxel, or doxorubicin) by dpGSEA in GEPNTs perturbation versus GEPNTs DMSO control DE of our first test case. All findings shown pass an FDR threshold of α = 0.05 for ES in A and TCS in B. The leading-edge driver genes are listed in the "Genes" column and the specific proto-matrix the positive results were detected in are listed in the "Proto-Matrix" column. Positively identified paclitaxel perturbations were most frequent while other drugs were found in only some of the proto-matrices utilized matrix as our a-priori signature for dpGSEA and correspondingly equivalent inputs (top 20 genes by significance) for other approaches. It should be noted that directionality is integrated into the dpGSEA enrichment approach producing only positive scores as reflected by the one-sided distribution of Fig. 4b. The ranked drugs that pass nominal GSEA-defined FDR, and Benjamini-Hochberg (BH)-defined FDR adjusted thresholds are also shown. Using the GSEA-FDR threshold [16], we find many drug screens that pass FDR = 0.05. This is in stark contrast to those approaches without inherent error rate analysis (Fig. 4c, d) that have few or no screened drugs that pass the BH-FDR threshold at the same level. Therefore, the dpGSEA screened drug results provide a richer and more reliable ranking of drugs for the clinician. In addition, it should be noted that this result is achieved by dpGSEA despite the fact that the GSEA-defined FDR procedure is inherently more conservative (less inductive of downward bias) than the BH-defined FDR procedures [32], especially in cases of lower α values (Additional file 2: Figure S1). Last but not the least, note the statistically significant findings of screened drugs (highlighted in green in Fig. 4) that pass a designated FDR significance threshold (0.05) unique to dpGSEA's novel TCS statistic. The distributions of scores and significance can be found in Additional file 3: Figure  S2 and Additional file 4: Figure S3 which show trends between both normalized scores and their respective transformed p values along with leading edge gene set sizes. Each plot shows one completed run of dpGSEA with each drug and their respective scores

Table 2 A rank-ordered list of the most statistically significant ES for drugs found when performing an enrichment for INR vs. IR Treg cells using top 50 p value rank protomatrix (derived from LINCs)
Both normalized ES and normalized TCS along with their respective statistical significance are shown. The leading-edge genes are also displayed, indicating genes driving the enrichment scores. Note that the highest ranked result, oseltamivircarboxylate, suppresses influenza virions production modulates human sialidases, known for mitochondrial involvement and regulation of cell apoptosis and significance shown. We can see that, as expected, scores trend positively with significance, and that TCS significance tends to favor smaller sets of driver genes (R = 0.72) while ES does not (R = 0.03) in Additional file 4: Figures S3C and S3D. Furthermore, Additional file 4: Figure S3 shows a comparison between the positively identified fluvastatin perturbation for various proto-matrices. As a-priori signature sizes increase, we see fluvastatin migrate from lack of statistical significance to close and beyond TCS significance at p < 0.05 and ultimately to ES significance at p < 0.05. This may suggest that TCS is more capable of detecting enrichment for smaller gene set sizes.

Discussion
The accurate portrayal of disease-gene and drug-gene complementary expression was the impetus for the development of dpGSEA. There are two features of dpGSEA that underscore its novelty in comparison to GSEA and other approaches, namely our indicator function denoting complementary disease-gene and drug-gene expression and the utilization of drug-derived gene set priors that include drug-gene modulation information. GSEA, in its current state, is not capable of producing results that can be interpreted with directionality of modulation by gene set priors for enrichment. Indeed, MSigDB gene sets, for example, only contain gene membership information. In cases where enrichment does take modulation of expression within a gene set into account, such as those defined in the C6 and C7 collections, the representation of a single biologically-defined gene set is dichotomized into up and down regulated groups [16]. This is less than ideal as interpretation of enrichment must be contextualized with two scores and two significance levels, making biological interpretations difficult in cases where two sets of estimates may not be congruent. Further, our results differ from those generated by traditional GSEA as shown by notable changes in rank of paclitaxel perturbated cell lines in Fig. 3, suggesting that the signature reversion principle plays a role when it comes to directionally influenced enrichment. In addition, with respect to GSEA but unlike other methods, we report FDR results in dpGSEA analyses that reflect a combination of both sensitivity and specificity metrics. However, in order to generate more specific accuracy metrics results like specificity and sensitivity, a simulation study of joint true drug perturbation and true DE (i.e. where the truth would be known for both) would be required to allow us to compare our candidate drug end results. Fig. 4 Comparisons between the trends of scores and significance for dpGSEA and the CMAP native and gene2drug approach. Each point within each plot represents a screened drug's significance and score within an equivalent run for dpGSEA (plots a, b), CMAP native (plot c), and gene2drug (plot d). Screened drugs that pass a designated FDR significance threshold (0.05) are shown in red, and screened drugs highlighted in green show statistically significant findings unique to dpGSEA's novel TCS. Total number of screened drugs within specific significance thresholds are also shown, and it should be noted that the number passing FDR α = 0.05 using the GSEA-defined FDR threshold (plot a, b) is 121 while those that pass the BH defined threshold are 3 and 0 for CMAP and gene2drug, respectively (plot c, d) When compared to other drug screening approaches shown in Fig. 1b, we uniquely use both degree of modulation, as represented by DE significance, and directionality for enrichment. Methods such as gene2drug and DSEA require less conventional inputs, which will allow for application beyond transcriptomics but requires users to query with a single gene, a set of pathways, or a set of drugs without considering directionality of modulation [20,21]. Although these approaches are versatile, dpGSEA takes advantage of the statistics generated in a DE experiment, making it uniquely postured for tackling transcriptomic drug screening. CMAP-native, L1000cds, and sscMAP approaches consider directionality, but not DE significance, and instead use ordered lists or sets [22][23][24]. Furthermore, we retain important aspects pivotal in GSEA's success, such as score normalization and true FDR analysis in our approach [22,24]. When comparing results between dpGSEA, CMAP native, and gene2drug, we see our approach provides for a greater number of drug screens that pass error correction. Our intrinsically less conservative GSEA-defined measurement of error is more appropriate for drug screening compared to the BH procedure. In our and other test cases the BH procedure shows strong bias towards exclusion of possible positive drug screens as shown in Fig. 4c, d, especially in cases of screens with high statistical significance, where the bias is most substantial (Additional file 1: Figure S1). The BH procedure, and others like it, is insufficient in understanding false discovery in these drug screening approaches which calls for an inherent method such as the one we have applied. Furthermore, for exploratory screenings, a less conservative error analysis that maintains strict statistical rigor is ideal. As a result, dpGSEA is fundamentally different from the aforementioned approaches, and we believe it can be an effective tool for drug screening for transcriptomic DE experiments. Furthermore, our novel statistic, TCS, serves as an alternative to the traditional ES by emphasizing gene rank with a DE experiment rather than statistical significance. It provides for another valid means of screening and, as shown in Fig. 4, elucidates a substantial number of otherwise ignored, but possibly important and effective, drug screens. This allows future studies another avenue for justification of exploration for a specific drug or gene target of interest if ES significance is not met. When testing dpGSEA we were able to positively identify drug perturbations of paclitaxel, parbendazole, doxorubicin, rosiglitazone, and fluvastatin in GEPNTs, but we want to emphasize that dpGSEA's primary purpose is discovery screening rather than identification. Our identification testing is a proof-of-concept for how our approach, in theory, can effectively apply the signature reversion principle in enrichment and detect drug perturbation signals for an external data set. We believe our true use-case test of dpG-SEA on INR versus IR DE where mitochondrial and immunological associated drugs were found, is more revealing of dpGSEA approach's capabilities.
Our scores, analogous to traditional GSEA scores, are rigorously generated while adjusted for false discovery to ensure the best possible accuracy. With respect to analytical studies based on DE analysis, i.e. all transcriptomic enrichment approaches, inferences made by dpGSEA will rely upon the validity of the prior DE results generated for the first stage of the dpGSEA framework. In line with this point, a recent study supports the importance of ranking statistics in GSEA. As the authors state, "An important parameter, which could affect the final result, is the choice of a metric for the ranking of genes. Applying a default ranking metric may lead to poor results." [33] Hence, the