An overview of the framework
The pipeline is divided into two color-coded sections as shown in Fig. 1. The first section (colored red) contains steps for construction of PD modules, and the second section (colored green) contains steps to perform modular drug repositioning. The construction of PD modules was carried out in 6 steps: 1) We filtered genes with significant expression changes between the case vs. control samples (with the False Discovery Rate set to 0.05), using a Bayesian inference technique available in the limma package in R [14]. 2). We performed WGCNA, which yields clusters (modules) of highly correlated genes having significant changes across three tissues. 3) We compiled PD-specific GWAS candidate genes and performed one layer extension to generate a gene regulatory network by retrieving the gene-gene regulatory relationship from the PAGER database [15]. 4) We performed enrichment analysis by finding overlapping genes shared between co-expression clusters and GWAS candidate genes, extracting these enriched clusters as PD-specific modules. 5) We constructed PD-specific network module by retrieving the gene-gene interactions for the genes in PD-specific modules from the HAPPI-2 database [16]. 6) Finally, we annotated PD-specific modules with functional groups using ClueGO [17].
The Drug repositioning section (green) was comprised of four steps. First, we calculated a P-score, which is an intuitive pharmacology score that combines the probability for each interaction and the weight of the drug-target interaction using data from the DMAP database (see details in Methods). Second, we calculated the RP-score, which is a measure of Relevant Protein importance in the PD modules network (see details in Methods). Third, we calculated the Drug Effect Score (DES) of each module. Finally, the DESS was calculated across all modules. Using these steps, we obtained a ranked “modular drug list” consisting of candidate treatments based on PD-specific modules.
Preparation of PD-specific omics, gene-gene interaction, and drug-protein regulation data
Datasets from whole genome expression transcriptional profiling (on the GSE8397-GPL-96 array) were retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8397). In the gene expression profile, 47 samples from PD patients and controls were used in three brain regions: the Frontal Gyrus (FG: 8 tissue samples), Lateral Substantia (LS: 16 tissue samples) and Medial Substantia (MS: 23 tissue samples) [12]. SNP data was obtained from a PD paper [11], in which a GWAS was carried out. We mapped probe IDs to gene symbols using the NCBI microarray toolkit and assigned gene expression scores by the averaging probe expression values after adjustment and trimming of background noises by using the standard deviation of the mean values from all samples. Since the standard deviation of the mean values were small enough (0.02 in this study), no samples had been trimmed. After performing probe transformation and synonymous gene merging on data from the Affymetrix Human Genome U133A Array [HG-U133A] and Affymetrix Human Genome U133B Array [HG-U133B], 12,995 genes were mapped by 22,283 probes in the merged matrix from the two arrays. In the prior study, 54 genes were reported as having had significant enrichment [11] in GWAS. The PAGER database [15] was used to obtain gene-gene regulatory relationships (22,127 pairs curated from 645,385 in total). The HAPPI-2 database [16] was used to obtain protein-protein interaction (PPI) data. This integrated protein interaction database comprehensively integrates weighted human protein-protein interaction data from a wide variety of protein-protein database sources. After mapping the proteins to genes using UniProt IDs, we obtained 2,658,799 gene-gene interactions. The drug-target regulatory relationships data was from the DMAP database [18], which consisted of curated 438,004 drug-protein regulatory relationships.
PD-specific network module identifications
Whole-genome expression data on 12,995 genes was filtered down to 2895 candidate genes, based on a multi-group empirical Bayesian (eBayes) moderated t-test with p-value ≤ 0.05. Next, we performed WGCNA to cluster these genes based on their co-expression. To do this, we first performed our pipeline steps to identify excessive missing values and outlier microarray samples. The detection of the outlier was performed by trimming the hierarchy tree of average Euclidean distance method using cutoff tree height of 100. Second, we chose an exponent for soft thresholding based on analysis of network topology, to further reduce noise and amplify stronger connections in the scale-free topological model. Third, we performed one-step network construction and module detection using hierarchy tree of unsigned TOM-based dissimilarity distance. Fourth, we visualized the genes in modules in a hierarchy tree based on average linkage clustering [13]. Fifth, we analyzed the cluster (Principal Components) and sample (expression data) correlation using Pearson correlation and asymptotic p-value.
An initial regulatory relational network was seeded using the 54 candidate genes identified by Moran et al. and expanded using the gene-gene regulatory relationship data. The resulting expanded regulatory relational network consists of 288 genes and 1983 gene-gene regulatory relationships. Subsequently, we performed the enrichment testing of the genes in the expanded regulatory relational network to measure enrichment in the co-expression clusters using the hypergeometric test and assigned the f(pts) score using the formula:
$$ \mathrm{f}\left(\mathrm{pts}\right)=\mathsf{\operatorname{sign}}\left(\frac{\mathrm{K}}{\mathrm{N}}-\frac{\mathrm{k}}{\mathrm{n}}\right)\times \boldsymbol{\mathsf{\log}}\left(\frac{\left(\genfrac{}{}{0pt}{}{\mathrm{K}}{\mathrm{k}}\right)\left(\genfrac{}{}{0pt}{}{\mathrm{N}-\mathrm{K}}{\mathrm{n}-\mathrm{k}}\right)}{\left(\genfrac{}{}{0pt}{}{\mathrm{N}}{\mathrm{n}}\right)}\right) $$
(1)
where N is the total number of the genes in co-expression clusters, K is the number of overlapping genes between co-expression clusters and genetic candidate genes, n is the number of the genes in the co-expression cluster selected, and k is the overlap genes between selected co-expression cluster and the genetic candidate genes. A positive value for f(pts) indicates the over-representation in the expanded regulatory relational network. PD-specific modules were defined as the over-represented co-expression clusters. We then generated the network of PD-specific modules by applying high-confidence gene-gene interactions (as indicated by 3-star or above in the HAPPI-2 database).
In the final step, we performed ClueGO analysis to elucidate mechanisms involved in the PD-specific modules. We applied Bonferroni correction and selected those with post-correction p-value ≤ 0.05 and Kappa score ≥ 0.5 (moderate network strength or stronger) [17].
Modular drug repositioning
DESS was calculated using the P-score from the DMAP database, the RP-score from the PD modules, and the module enrichment score f(pts) from the PD modules. We calculated a P-score, an intuitive pharmacology score in the DMAP database, via a probability-weighted summary of all the evidence mined from literature or other drug target databases to determine an overall mechanism of “edge action” for each specific chemical-protein interaction using conf(d,p):
$$ \mathsf{conf}\left(d,p\right)={\sum}_{i=1}^N\left({prob}_i\left(d,p\right)\times {\operatorname{sign}}_i\right) $$
(2)
where d and p are specific drugs and proteins, respectively. N is the number of types of evidence for the interaction between d and p. prob
i
(d, p) is confidence in each type of evidence i with a value within the range of [0,1]. sign
i
has a value of 1 if the evidence i suggests activation and a value of −1 if the evidence i suggests inhibition. Afterwards, to rank each interaction, we used the algorithm in HAPPI [19] by assigning a weight(p) for the proteins interacting with each drug using the following formula adapted from [20].
$$ weight(p)=k\times \mathit{\ln}\left({\sum}_{p,q\in NET} conf\left(p,q\right)\right)-\mathit{\ln}\left({\sum}_{p,q\in NET}N\left(p,q\right)\right) $$
(3)
Here, p and q are proteins in the protein interaction network, k is an empirical constant (k = 2 in this study), conf(p,q) is the confidence score of interaction between protein p and q assigned by HAPPI-2, and N(p, q) holds the value of 1 if protein p interacts with q or the value of 0 if protein p does not interact with q. Thus, the foregoing probabilities and weights for each interaction were combined into P-score(d,p), which includes both information on each drug’s effects on interacting proteins and the importance of the protein in the protein-protein interaction network:
$$ P- score\left(d,p\right)= conf\left(p,q\right)\times weight(p) $$
(4)
We applied each gene’s RP-score calculation in a manner similar to formula (3) in PD-specific modules using the formula:
$$ RP- score={e}^{k\times \ln \left({\sum}_{p,q\in ModuleNET} conf\left(p,q\right)\right)-\ln \left({\sum}_{p,q\in ModuleNET}N\left(p,q\right)\right)} $$
(5)
where p and q are the indexes of proteins from the selected module, k is a constant (k = 2 in this study). The term conf(p, q) is the interaction confidence score assigned by HAPPI-2, where conf(p, q)
∈ [0,1].
Further, we calculated a DES(d,m) by using the drug weight score and the module gene RP-score according to the formula:
$$ DES\left(d,m\right)={\sum}_{i\in ModT\arg et}^n\left[{\operatorname{sign}}_d\times {\log}_2\left(p-\mathrm{score}\left(d,i\right)\right)\times {\log}_2\left( RP-\mathrm{score}\left(d,i\right)\right)\times {p}_i\right] $$
(6)
where m is the module, i is the index of the proteins in the PD-specific module, sign
d
is the direction of the effect drug d on protein expression, and P-score(d, i) is the pharmacology score of the drug d to target i. p
i
is the priority score which indicates the source of the candidate. We assigned a value of p
i=1/20 when the candidates were from GWAS, p
i=1/21 when the candidates were from regulatory one-layer extension of GWAS, and p
i=1/22 when the other candidates were from the same module.
The DESS(d,M) was calculated by integrating all PD-specific modules according to the expression:
$$ DESS\left(d,M\right)={\sum}_{m\epsilon Module}^M DES\left(d,m\right)\times {f}_m(pts) $$
(7)
where M is the module set of module m, f
m
(pts) is probability mass function (pmf) transform score of the PD-specific module m. An example of how the DESS score is calculated for a drug is shown in Fig. 2. Based on the total DESS, modular drugs (drugs selected based on their predicted effect at a module level) and their targets in the modules were collected. We pulled out modular drugs or drugs selected based on their predicted effect at a module level alongside their associated targets. Finally, we applied a single regulatory layer expansion and retrieved drug-target regulatory relationships (DMAP database) and protein-protein interactions (HAPPI-2 database) to generate the “extended modular drug-target network”.