DualGCN: a dual graph convolutional network model to predict cancer drug response

Ma, Tianxing; Liu, Qiao; Li, Haochen; Zhou, Mu; Jiang, Rui; Zhang, Xuegong

doi:10.1186/s12859-022-04664-4

Volume 23 Supplement 4

The 20th International Conference on Bioinformatics (InCoB 2021)

Research
Open access
Published: 15 April 2022

Tianxing Ma¹,
Qiao Liu²,
Haochen Li³,
Mu Zhou⁴,
Rui Jiang¹ &
…
Xuegong Zhang ORCID: orcid.org/0000-0002-9684-5643^1,3

BMC Bioinformatics volume 23, Article number: 129 (2022)

Abstract

Background

Drug resistance is a critical obstacle in cancer therapy. Characterizing cancer drug response is important for improving anti-cancer drug treatment and guiding anti-cancer drug design. Abundant genomic and drug response resources for cancer cell lines provide unprecedented opportunities for such studies. However, cancer cell lines cannot fully reflect heterogeneous tumor microenvironments. Transferring knowledge learned from in vitro cell lines to single-cell and clinical data is a promising direction for better understanding drug resistance. Most current studies include single nucleotide variants (SNVs) as features and focus on improving the prediction of cancer drug response on cell lines. However, SNVs cannot be called reliably from clinical tumor samples or single-cell data. This makes it difficult to generalize such SNV-based models to clinical tumor data or to single-cell level studies.

Results

We present a new method, DualGCN, a unified Dual Graph Convolutional Network model to predict cancer drug response. DualGCN encodes both chemical structures of drugs and omics data of biological samples using graph convolutional networks. Then the two embeddings are fed into a multilayer perceptron to predict drug response. DualGCN incorporates prior knowledge on cancer-related genes and protein–protein interactions, and outperforms most state-of-the-art methods while avoiding using large-scale SNV data.

Conclusions

The proposed method outperforms most state-of-the-art methods in predicting cancer drug response without the use of large-scale SNV data. These favorable results indicate its potential to be extended to clinical and single-cell tumor samples and to advance precision medicine.

Background

Anti-cancer drugs have played important roles in cancer therapy in recent years. However, the occurrence of drug resistance limits the effectiveness of anti-cancer drugs [1]. It is essential to fully explore the cancer drug response (CDR) underlying comprehensive biological systems.

Cancer drug response can be studied with cancer cell line models. Drug response on these models is quantitatively described by the half-maximal inhibitory concentration (IC50). The IC50 depicts the amount of drug needed to inhibit cancer cell growth by half. A smaller IC50 indicates that the drug is relatively more powerful. Comprehensive genetic and pharmacologic characterizations of cancer cell line models are collected by projects such as Cancer Cell Line Encyclopedia (CCLE) [2], Catalogue of Somatic Mutations in Cancer (COSMIC) [3], and Genomics of Drug Sensitivity in Cancer (GDSC) [4]. Such data enable researchers to develop predictive machine learning models of anti-cancer drug sensitivity [5,6,7,8,9]. These models consist of two parts that are responsible for encoding drugs and cell lines separately. Drugs are represented through one-hot encoding using simplified molecular-input line-entry system (SMILES) data [7, 8]. Genomic mutations have been reported to have significantly different patterns across cell lines [4]. They are widely used as features of cancer cell lines, and are encoded by models such as multilayer perceptrons (MLP) [7] and convolutional neural networks (CNN) [8, 9]. However, drug resistance could not be fully discovered using these in vitro cancer cell lines. It has been revealed that tumors are highly heterogeneous [10], and tumor microenvironments have essential influences on tumor progression [11,12,13]. Such heterogeneity and interaction could not be reflected with in vitro cancer cell lines only. Emerging single-cell data and clinical data show the potential to decipher complex tumor microenvironments and to unlock drug response [14,15,16]. Transferring knowledge studied from in vitro cancer cell lines to single-cell and clinical data is a promising avenue [14].

Current methods have limitations that hinder their generalization to single-cell and clinical data. First, most existing methods include SNVs as features to improve predictive ability on cancer cell lines. However, SNVs cannot always be called reliably from cancer samples. High-frequency genomic aberrations and aneuploidy are common in cancers, and these variations reduce SNV detection efficiency [17]. Similarly, reliably detecting SNVs at all hotspots simultaneously from single-cell data is unattainable, because both sequencing coverage and sequencing depth in single-cell data are too low [18, 19]. Second, current methods encode gene features as separate units. However, recent evidence from single-cell studies shows that the tumor microenvironment is a complex system [11]. Tumor cells interact with surrounding cells. Such interactions form a biological network, and the whole ecosystem contributes to drug response [20,21,22]. These observations inspired us to develop a method that does not use SNVs as features and that treats cancer samples as systems of interacting proteins.

In this paper, we propose a novel deep learning model called DualGCN. It consists of dual graph convolutional networks (GCN) [23] and takes drug structures and omics data as input to predict cancer drug response. One GCN module learns intrinsic chemical features of drugs. Nodes in this module represent atoms of drugs, and edges indicate connections between the atoms. Meanwhile, the other GCN module incorporates protein–protein interactions (PPI) and extracts the underlying biological features of cancer samples. Nodes in this module represent proteins, and edges indicate protein–protein interactions. In this study, we used gene expression and copy number variation as gene features. These features have been demonstrated to be vital to depict cancer cell types in recent single-cell studies [24,25,26,27,28]. We conducted extensive experiments and demonstrated that our method outperforms most state-of-the-art methods while avoiding the use of SNVs. In addition, we conducted a case study on clinical cancer patients with DualGCN and showed its potential to be extended to clinical and single-cell cancer samples.

Results and discussion

Overview of DualGCN

DualGCN takes chemical structure data of drugs and gene features of cancer samples as inputs and outputs the drug response (IC50). The concept of DualGCN is shown in Fig. 1. The top panel of Fig. 1 is a GCN module (named drug-GCN below) that encodes the chemical structure of a drug. Nodes in this module represent the atoms of the drug, and edges between nodes indicate connections between atoms. Atom features are computed with a previously published algorithm [29]. The bottom panel of Fig. 1 is another GCN module (named bio-GCN below) that encodes biological features of cancer samples. It is built on PPI networks and takes features of cancer-related genes as inputs. We used gene expression (Expr.) and copy number variation (CNV) as gene features in this study. Recent studies have demonstrated that these gene features play important roles in decoding cancer cell types [26,27,28]. Both GCN modules use ReLU as the activation function and adopt batch normalization [30] and dropout [31] to improve model robustness. The two embeddings from the drug-GCN module and the bio-GCN module are then concatenated and fed into a multilayer perceptron that predicts the response of the given cancer sample to the given drug. Detailed settings of the model can be found in Additional file 1: Table S1.

Assessment of methods

We evaluated the performance of DualGCN as well as baselines including support vector machine (SVM), random forest, Lasso regression, ridge regression, CDRscan [7], and DeepCDR [8]. The evaluation was conducted on 86,530 drug-cell line pairs. These data included 208 drugs and 525 cell lines covering 27 kinds of cancers. Data preparation and configurations of baselines are described in the “Methods” section. The evaluation was conducted with five-fold cross-validation (CV). We used evaluation metrics including Pearson’s correlation coefficient, Spearman’s correlation coefficient, and root mean square error (RMSE).
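
The three evaluation metrics can be computed as in the following minimal sketch. It uses only the Python standard library; the function names are illustrative and are not taken from the DualGCN source code. Ties in the Spearman ranking are resolved with average ranks, which is a common convention assumed here rather than one stated in the text.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

In a five-fold cross-validation setting, these functions would be applied to the held-out predictions of each fold.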

DualGCN achieves strong predictive performance without the use of SNVs. It reached a Pearson’s correlation of 0.925, a Spearman’s correlation of 0.907, and an RMSE of 1.079. It significantly outperformed the traditional methods, including SVM, random forest, Lasso regression, and ridge regression (Table 1). Detailed configurations and results of these methods can be found in Additional file 1: Tables S5, S6, and S7. We also compared DualGCN with deep learning models. DualGCN improved consistently over CDRscan on all evaluation metrics: Pearson’s correlation, Spearman’s correlation, and RMSE improved by 0.014, 0.013, and 0.094, respectively. DeepCDR achieved slightly higher predictive performance than DualGCN; the differences in Pearson’s correlation, Spearman’s correlation, and RMSE were 0.003, 0.003, and 0.013, respectively. This advantage, however, depends on large-scale SNV data. DeepCDR contains several sub-networks that encode multi-omics data. We evaluated its performance without SNVs by removing the corresponding sub-network and denote this variant DeepCDR (-). Pearson’s correlation, Spearman’s correlation, and RMSE of DeepCDR (-) dropped to 0.900, 0.877, and 1.265, respectively. DualGCN outperformed DeepCDR (-) by a clear margin without relying on tens of thousands of SNVs: the improvements in Pearson’s correlation, Spearman’s correlation, and RMSE were 0.025, 0.030, and 0.186, respectively.

There are two major reasons SNV data should be treated with caution. First, different projects collected SNVs in different ways and used different references (the human reference genome or normal tissues) in their SNV calling algorithms. Thus, SNVs might not be comparable across data from different sources. Second, studying drug responses on in vitro cancer cell lines alone cannot fully reveal the mechanisms of drug resistance. Transferring knowledge learned from in vitro cancer cell lines to single-cell and clinical data is an important direction [14]. However, SNVs covering all candidate loci cannot be called reliably from clinical and single-cell tumor data [17,18,19]. In addition, recent evidence shows that whole tumors act on drugs collectively [12]. Such studies continue to uncover protein–protein interactions that influence cancer progression and drug response [13]. DeepCDR encodes different features of the same unit (gene) separately, which makes it difficult to incorporate newly discovered interacting protein pairs. DualGCN, in contrast, encodes genes as basic units and achieves strong predictive performance without SNV data. These properties indicate its potential to absorb new biological knowledge and to be generalized to studies on clinical data and at single-cell resolution.

Table 1 Performance comparison

DualGCN achieves consistently high performance across different types of cancers. Pearson’s correlation coefficients on different cancers ranged from 0.893 to 0.942 (Fig. 2a). The highest and the lowest coefficients were obtained on lung squamous cell carcinoma (LUSC) and neuroblastoma (NB), respectively. Scatterplots of these two cases are shown in Fig. 2b, c. We also evaluated the performance across drugs. Pearson’s correlation coefficients for different drugs varied widely, from 0.132 to 0.861 (Fig. 2d). The highest and the lowest coefficients were obtained on CAY10603 and cetuximab, respectively. Scatterplots of these two cases are shown in Fig. 2e, f. We performed principal component analysis (PCA) on the SMILES representations of the drugs and observed that the latent representations of CAY10603 and cetuximab were close in the low-dimensional space. This result indicates that the encoded structures of these two drugs are similar, even though the prediction performance on them differed markedly (Additional file 1: Figure S2). In addition, we found that the IC50 of cetuximab was much higher than that of other drugs. These findings suggest that low prediction performance on a drug may result from its IC50 values lying far from the overall distribution.

Ablation analysis

We conducted ablation studies to evaluate the effects of different gene features on DualGCN by taking only one kind of feature as the input. The results are shown in Table 2. CNV data contributed more than gene expression data to our model. In addition, using gene expression and CNV data together yielded higher predictive performance than using either kind of feature alone.

Table 2 Ablation study on gene features

A case study on clinical cancer patients

We conducted a case study on clinical breast invasive carcinoma (BRCA) patients using the trained DualGCN model. Gene features and drug response annotations of patients were obtained from The Cancer Genome Atlas Program (TCGA) [32]. Drug response is recorded differently for in vitro cancer cell lines and for clinical cancer data. Drug response annotations of clinical cancer data are qualitative grades, whereas responses of cancer cell lines are quantified by the IC50. We first binarized the clinical drug response annotations of patients into “sensitive” and “resistant” and treated these binary labels as ground truth. Then, we predicted the drug responses of patients and calculated the corresponding drug sensitivity score (DSS). A high DSS indicates sensitivity, and a low DSS indicates resistance. Detailed descriptions of the annotation transformation and the definition of the DSS are given in the “Methods” section. We used the DSS values of the cancer samples as discrimination thresholds for the receiver operating characteristic (ROC) curve. We observed a modest consistency between the predicted drug responses and clinical annotations. The area under the ROC curve (AUC) was 0.661 (95% confidence interval: 0.558 to 0.765; Additional file 1: Figure S3). Future studies may need to combine single-cell cancer data and cellular interactions to further decode cell-type composition and cancer drug resistance mechanisms.

Conclusions

Anti-cancer drugs have played important roles in cancer treatments. However, resistance to anti-cancer drugs continues to be a serious challenge. Studying drug response on tumors is essential to improve the treatment of cancers and guide anti-cancer drug design. Cancer cell line models have been widely used for such research. However, tumors are heterogeneous and consist of different cell types and complex interactions. Studying in vitro cancer cell lines only cannot fully decode the mechanisms of drug resistance. Emerging single-cell technologies are powerful toolkits to explore cell-type composition and cellular interactions in tumors. Transferring drug response knowledge obtained from cell line models to clinical and single-cell data is an important direction. Single nucleotide variants are widely used as features of cancer cell lines in current cancer drug response studies. However, detecting SNVs covering all candidate genomic loci from clinical tumor data is not always reliable, let alone from single-cell data. Such SNV-based models are hard to extend to studies on clinical data and at single-cell resolution.

In this study, we developed a unified dual graph convolutional network model, DualGCN, to predict cancer drug response. DualGCN encodes both drugs and cancer samples using graph convolutional networks with protein–protein interactions embedded. We demonstrated that DualGCN gained high predictive abilities without the use of SNV data. Such advances indicate its potential to be further extended to clinical and single-cell data. Meanwhile, recent single-cell tumor studies have constantly discovered important interactions in tumors. DualGCN sets genes as units of the encoding system with links across them. Such structures make it easy to absorb newly discovered protein interactions essential to tumor progression and drug resistance. We organized a case study on analyzing clinical cancer samples using knowledge learned from cell line models, and observed a modest consistency between the predicted drug responses and clinical annotations.

In addition, we note a limitation of the proposed method. The units of the module encoding cancer samples are genes, so input features must be at the gene level. This structure provides a convenient interface for incorporating the interacting protein pairs that are continually discovered in cancer research. However, signals that are not at the gene level, such as histone modifications, are hard to encode into the module directly.

In summary, we introduce a method, DualGCN, that achieves high predictive abilities on cancer drug response without using SNV data. The method could be extended to clinical and single-cell data and has the potential to promote the development of precision medicine.

Methods

Drug and cell line data preparation

Drug data were downloaded from the GDSC (version: GDSC1) [4]. We kept only drugs that were recorded in PubChem [33]. In addition, drugs that shared a PubChem identifier but had different GDSC identifiers were filtered out. In total, we collected 208 drugs. Detailed descriptions of these drugs can be found in Additional file 1: Table S2. We then transformed the drug chemical structure data into feature vectors of the atoms of each drug using a previously published algorithm [29]. The dimension of these feature vectors was $l_{d} = 75$. These feature vectors have been shown to reflect intrinsic properties of drugs, such as atom type, atom connectivity, and degrees of freedom.

Gene features of cancer cell lines were downloaded from CCLE (version: 19Q2) [2]. We filtered out cell lines if (1) either gene expression or CNV data were unavailable, (2) cancer type annotations were missing, or (3) the sample size of the corresponding cancer type was less than 10. In total, we collected 525 cell lines covering 27 kinds of cancers. Detailed descriptions of these cell lines can be found in Additional file 1: Table S3. Gene expression data were represented as $\log_{2} \left( {TPM + 1} \right)$, where $TPM$ denotes transcripts per million. CNV data were represented as $\log_{2} \left( {CN + 1} \right)$, where $CN$ represents the relative copy number. We then applied z-score normalization to these gene features.
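
The two transformations described above can be sketched as follows. This is an illustrative example using NumPy, not the authors' preprocessing code. The text does not state along which axis the z-score is computed; the sketch assumes that each gene (column) is normalized across cell lines (rows), and the small constant `eps` is added here only to avoid division by zero.

```python
import numpy as np

def log_transform(values):
    """log2(x + 1), applied to TPM values or relative copy numbers."""
    return np.log2(np.asarray(values, dtype=float) + 1.0)

def zscore(matrix, eps=1e-8):
    """Z-score each gene (column) across cell lines (rows).

    The normalization axis and eps are assumptions of this sketch.
    """
    mean = matrix.mean(axis=0, keepdims=True)
    std = matrix.std(axis=0, keepdims=True)
    return (matrix - mean) / (std + eps)

# Toy matrix: 3 cell lines (rows) x 2 genes (columns), made-up TPM values.
tpm = np.array([[0.0, 3.0], [1.0, 7.0], [3.0, 15.0]])
expr = log_transform(tpm)
expr_z = zscore(expr)
```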

Cancer drug response data (IC50) were downloaded from GDSC (version: GDSC1) [4]. The IC50 describes the amount of drug needed to inhibit cancer cell growth by half. In GDSC, the IC50 is recorded in µM and transformed with the natural logarithm. In total, we collected 86,530 drug-cell line pairs.

Construction of drug-GCN module

The drug-GCN module takes the feature matrix and adjacency matrix of a drug as inputs. It considers each drug as a graph where nodes represent atoms of the drug and edges indicate connections between atoms. This module extracts intrinsic chemical attributes using the graph convolutional network algorithm [23]. Different drugs have different numbers of atoms (from 5 to 96 in this study), so the sizes of the raw drug graphs $G_{{d \text{-}raw}}$ vary. We first built a fixed-scale graph $G_{d}$ and then embedded the raw drug graph $G_{{d \text{-}raw}}$ into it. This operation ensures that the drug-GCN module handles all drugs in a unified way. The number of nodes $N_{d}$ of graph $G_{d}$ is 100.

Mathematically, raw drug graph $G_{{d \text{-}raw\left( i \right)}} = \left( {X_{{d \text{-}raw\left( i \right)}} , A_{{d \text{-}raw\left( i \right)}} } \right)$ is a sub-graph of the fixed-scale graph $G_{d\left( i \right)} = \left( {X_{d\left( i \right)} , A_{d\left( i \right)} } \right)$. Additional nodes in $G_{d\left( i \right)}$ are filled with zeros,

$$X_{d\left( i \right)} = \left( {\begin{array}{*{20}l} {X_{{d \text{-}raw\left( i \right)}} } \hfill \\ {0_{c1\left( i \right)} } \hfill \\ \end{array} } \right)\quad A_{d\left( i \right)} = \left( {\begin{array}{*{20}l} {A_{{d \text{-}raw\left( i \right)}} } \hfill & {0_{c2\left( i \right)} } \hfill \\ {0_{c3\left( i \right)} } \hfill & {0_{c4\left( i \right)} } \hfill \\ \end{array} } \right)$$

where $X_{d\left( i \right)} \in {\mathbb{R}}^{{N_{d} \times l_{d} }}$ denotes the feature matrix of the fixed-scale graph $G_{d\left( i \right)}$. $A_{d\left( i \right)} \in {\mathbb{R}}^{{N_{d} \times N_{d} }}$ denotes binary adjacency matrix of $G_{d\left( i \right)}$. Similarly, $X_{{d \text{-}raw\left( i \right)}} \in {\mathbb{R}}^{{N_{i} \times l_{d} }}$ and $A_{{d \text{-}raw\left( i \right)}} \in {\mathbb{R}}^{{N_{i} \times N_{i} }}$ denote the feature matrix and adjacency matrix of $G_{{d \text{-}raw\left( i \right)}}$, respectively. $N_{i}$ denotes the number of atoms of drug $i$. $0_{c1\left( i \right)}$, $0_{c2\left( i \right)}$, $0_{c3\left( i \right)}$, and $0_{c4\left( i \right)}$ are zero matrices.
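
The zero-padding defined above can be sketched in a few lines of NumPy. The function name `pad_drug_graph` is hypothetical; only the sizes $N_{d} = 100$ and $l_{d} = 75$ come from the text.

```python
import numpy as np

N_D = 100  # number of nodes of the fixed-scale graph (from the text)
L_D = 75   # dimension of the atom feature vectors (from the text)

def pad_drug_graph(x_raw, a_raw, n_nodes=N_D):
    """Embed a raw drug graph into a fixed-scale graph by zero-padding.

    x_raw: (N_i, l_d) atom feature matrix; a_raw: (N_i, N_i) adjacency matrix.
    """
    n_atoms, n_feat = x_raw.shape
    if n_atoms > n_nodes:
        raise ValueError("drug has more atoms than the fixed-scale graph")
    x = np.zeros((n_nodes, n_feat), dtype=x_raw.dtype)
    a = np.zeros((n_nodes, n_nodes), dtype=a_raw.dtype)
    x[:n_atoms] = x_raw            # top block: raw features, rest stays zero
    a[:n_atoms, :n_atoms] = a_raw  # top-left block: raw adjacency
    return x, a

# Toy drug with 3 atoms in a chain; the feature values are placeholders.
x_raw = np.ones((3, L_D))
a_raw = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x_pad, a_pad = pad_drug_graph(x_raw, a_raw)
```

The padded nodes have no edges and all-zero features, so they are isolated from the real atoms in the graph.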

According to the GCN algorithm [23], we have,

$$H_{d}^{\left( {l + 1} \right)} = {\text{ReLU}}\left( {\tilde{D}_{d}^{ - \frac{1}{2}} \tilde{A}_{d} \tilde{D}_{d}^{ - \frac{1}{2}} H_{d}^{\left( l \right)} W_{d}^{\left( l \right)} } \right) \tag{1}$$

where $H_{d}^{\left( l \right)}$ is the output of layer $l$, and $H_{d}^{\left( 0 \right)}$ is the initial feature matrix $X_{d}$. $\tilde{A}_{d} = A_{d} + I_{d}$ is a modified adjacency matrix with self-connections. $I_{d}$ is an identity matrix. Diagonal matrix $\tilde{D}_{d}$ is a degree matrix of $\tilde{A}_{d}$ with $\tilde{D}_{d} \left[ {k,k} \right] = \mathop \sum \limits_{m} \tilde{A}_{d} \left[ {k,m} \right]$. $W_{d}^{\left( l \right)}$ represents weights of the layer $l$.
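
A single propagation step of Eq. (1) can be written directly in NumPy, as in the minimal sketch below. It covers only the propagation rule; the batch normalization and dropout used in the full model are omitted, and the function name `gcn_layer` is illustrative.

```python
import numpy as np

def gcn_layer(h, a, w):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W), following Eq. (1)."""
    a_tilde = a + np.eye(a.shape[0])     # add self-connections
    deg = a_tilde.sum(axis=1)            # diagonal of D~ (>= 1 due to self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row k and column m by deg^{-1/2}.
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ w, 0.0)  # ReLU

# Toy example: two connected nodes, one input and one output feature.
a = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([[1.0], [3.0]])
w = np.array([[1.0]])
out = gcn_layer(h, a, w)
```

In this toy graph both nodes have degree 2 after adding self-connections, so each output is the average of the two input features scaled by the weight.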

Detailed configurations of the drug-GCN module can be found in Additional file 1: Table S1.

Construction of bio-GCN module

The bio-GCN module takes the gene features of cancer samples as inputs. Gene expression and CNV data were used in this study. These gene features were first fed into a two-layer MLP, and the resulting latent features were used as the node features of genes. The module considers each cancer sample as a graph where nodes are proteins (genes) and edges indicate interactions between proteins. The protein–protein interaction information was obtained from the STRING database (version 11.0, Taxonomy ID: 9606) [34]. We kept only proteins that are known to be related to cancers. These cancer-related proteins (genes) were collected from COSMIC [3] and TCGA [32]. We finally obtained 697 cancer-related genes (Additional file 1: Table S4) and 55,140 protein–protein interaction pairs among them.

Mathematically, the biological graph of cancer sample $j$ is denoted by $G_{b\left( j \right)} = \left( {X_{b\left( j \right)} , A_{b\left( j \right)} } \right)$. $X_{b\left( j \right)} \in {\mathbb{R}}^{{N_{b} \times l_{b} }}$ and $A_{b\left( j \right)} \in {\mathbb{R}}^{{N_{b} \times N_{b} }}$ denote the feature matrix and adjacency matrix, respectively. $N_{b}$ denotes the number of nodes. $l_{b}$ denotes dimension of features of genes. $A_{b\left( j \right)}$ is a symmetric binary matrix. $A_{b\left( j \right)} \left[ {k,m} \right] = A_{b\left( j \right)} \left[ {m,k} \right] = 1$ if gene $k$ and gene $m$ have interactions in the PPI network. Otherwise, $A_{b\left( j \right)} \left[ {k,m} \right] = A_{b\left( j \right)} \left[ {m,k} \right] = 0$.
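
The symmetric binary adjacency matrix defined above can be built from a list of interaction pairs as in the following sketch. The gene identifiers are placeholders, the function name is hypothetical, and dropping pairs that involve genes outside the node set (as well as self-pairs) is an assumption of this example.

```python
import numpy as np

def build_ppi_adjacency(genes, ppi_pairs):
    """Return a symmetric binary adjacency matrix over the given genes."""
    index = {gene: i for i, gene in enumerate(genes)}
    a = np.zeros((len(genes), len(genes)), dtype=np.int8)
    for g1, g2 in ppi_pairs:
        # Skip pairs with genes outside the node set and self-pairs (assumption).
        if g1 in index and g2 in index and g1 != g2:
            k, m = index[g1], index[g2]
            a[k, m] = a[m, k] = 1
    return a

# Placeholder identifiers, not real entries of the 697-gene list.
genes = ["G1", "G2", "G3"]
pairs = [("G1", "G2"), ("G3", "G9"), ("G2", "G1")]
adj = build_ppi_adjacency(genes, pairs)
```

Because the PPI network is shared, the same adjacency matrix can be reused for every cancer sample; only the node features differ between samples.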

Then, the bio-GCN module uses the graph convolutional network algorithm to extract intrinsic biological features of the cancer sample. The propagation rule is the same as Eq. (1). Detailed configurations of the bio-GCN module can be found in Additional file 1: Table S1.

Configurations of baselines

We compared DualGCN with six baselines, including DeepCDR [8], CDRscan [7], SVM, random forest, Lasso regression, and ridge regression. We additionally collected SNV data from the CCLE because they were necessary when using some of the baselines. We finally collected 27,180 SNVs within the cancer-related genes. We encoded the SNV features as binary vectors with one denoting the occurrence of a mutation.
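
As a small illustration of the binary SNV encoding, the sketch below maps the set of mutations observed in one cell line onto a fixed catalog of SNVs. The identifiers are made up for the example, and the function name is hypothetical.

```python
def encode_snvs(observed, snv_catalog):
    """Binary vector over the catalog: 1 if the SNV occurs in the cell line."""
    observed = set(observed)
    return [1 if snv in observed else 0 for snv in snv_catalog]

# Made-up SNV identifiers; the study used a catalog of 27,180 SNVs.
catalog = ["snv_a", "snv_b", "snv_c", "snv_d"]
vector = encode_snvs(["snv_c", "snv_a"], catalog)
```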

DeepCDR [8] encodes each type of omics data with a separate CNN. Its genomic features include SNVs, gene expression, and copy number variation, and it encodes drug data using graph convolutional networks. We also tested the performance of DeepCDR without SNV data by removing the corresponding CNN module; this modified version is denoted by DeepCDR (-). CDRscan [7] encodes SNVs using a CNN and represents drugs through one-hot encoding of SMILES data. A SMILES string is a sequence of characters that represent atoms and their connectivity. We obtained the SMILES (isomeric type) of drugs by parsing the related XML files from PubChem. In addition, we tested SVM, random forest, Lasso regression, and ridge regression using SNVs as features of cell lines and one-hot encoded SMILES as features of drugs. For SVM, we applied the radial basis function (RBF), polynomial, and sigmoid kernels. For random forest, we tested several numbers of trees (n = 50, 100, 200). For Lasso regression, we set the coefficient alpha to 0.01, 0.1, and 0.5. For ridge regression, we set alpha to 0.1, 0.5, 1.0, and 2.0.
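
One-hot encoding of SMILES strings can be sketched as follows. The text does not specify the tokenization or the padding length used for the baselines; this example assumes character-level tokens (so a two-letter element symbol such as Cl is split into two characters) and zero rows as padding up to a fixed length.

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) 0/1 matrix.

    Character-level tokens and zero-padding are assumptions of this sketch.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

# Two simple, well-known SMILES strings: ethanol and formaldehyde.
strings = ["CCO", "C=O"]
alphabet = sorted(set("".join(strings)))  # ['=', 'C', 'O']
encoded = one_hot_smiles("CCO", alphabet, max_len=4)
```

Flattening the matrix row by row gives the fixed-length feature vector that a model such as an SVM or a random forest can consume.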

Clinical cancer data preparation

We conducted a case study on clinical cancer patients using DualGCN. First, we curated data of patients whose drug response information was available in TCGA. Patients with breast invasive carcinoma (BRCA) had the largest number of records (195) and were included in this case study. Then, we downloaded the gene features of these cancer patients through Firehose Broad GDAC (http://gdac.broadinstitute.org/). Gene expression data of patients were transformed as $\log_{2} \left( {TPM + 1} \right)$. CNV data were originally at the segment level, and we transformed them to the gene level. Suppose $K$ segments overlap a given gene, and the length of each overlapped region is denoted by $l_{s} \left( {s = 1, 2, \ldots ,K} \right)$. The length of the gene is denoted by $L$. The relative copy number ratio of each segment is denoted by $c_{s} \left( {s = 1, 2, \ldots ,K} \right)$. We extracted the locations of genes from Ensembl (GRCh37) [35]. The gene-level, log-transformed CNV value is given by the following formula,

$$\log_{2} \left( {\sum\limits_{s = 1}^{K} c_{s} \frac{l_{s}}{L} + \left( {1 - \sum\limits_{s = 1}^{K} \frac{l_{s}}{L}} \right) + 1} \right)$$
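
The formula above weights each segment's copy number ratio by the fraction of the gene it covers and assigns a ratio of 1 to the uncovered part. A minimal sketch follows; it assumes half-open coordinates `[start, end)` and segments that do not overlap each other, neither of which is stated in the text, and the function name is hypothetical.

```python
import math

def gene_level_cnv(segments, gene_start, gene_end):
    """Gene-level log2 CNV from (start, end, ratio) segments.

    Coordinates are treated as half-open and segments as mutually
    non-overlapping; both are assumptions of this sketch.
    """
    gene_len = gene_end - gene_start
    weighted = 0.0  # sum of c_s * l_s / L
    covered = 0.0   # sum of l_s / L
    for seg_start, seg_end, ratio in segments:
        overlap = min(seg_end, gene_end) - max(seg_start, gene_start)
        if overlap > 0:
            weighted += ratio * overlap / gene_len
            covered += overlap / gene_len
    return math.log2(weighted + (1.0 - covered) + 1.0)
```

For a gene that is fully covered by one segment with ratio 1, or not covered at all, the value is $\log_{2}(2) = 1$.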

There is a noticeable difference in analyzing drug response from in vitro cancer cell lines and clinical cancer data. In clinical cancer data, drug response annotations are qualitative rather than quantitative. Drug responses are labeled as four types in TCGA: (1) complete response, (2) partial response, (3) clinical progressive disease, and (4) stable disease. We binarized such labels into “sensitive” and “resistant”. We considered drugs to be sensitive if annotations in TCGA were (1) complete response or (2) partial response. We considered drugs to be resistant if annotations were (3) clinical progressive disease or (4) stable disease. On the other hand, drug responses on cell lines are quantified by IC50. However, the range of IC50 of each drug is different (Figure S1 in Additional file 1). We thereby introduced a metric, drug sensitivity score (DSS), to transform drug responses into the same scale and to make responses comparable across drugs,

$$DSS = \left( { - 1} \right)^{I\left( {IC50 > MSC} \right)} \ln \left( {\frac{\left| {IC50 - MSC} \right|}{MSC} + 1} \right)$$

where MSC denotes the maximum screening concentration of the drug, which we collected from the GDSC, and $I\left( \cdot \right)$ is the indicator function. If $IC50 > MSC$, then $I(IC50 > MSC) = 1$. This indicates that the given drug is not sufficient to kill the cancer cells, and the DSS is smaller than 0. If $IC50 < MSC$, then $I(IC50 > MSC) = 0$. This indicates that the given drug has the potential to kill the cancer cells, and the DSS is larger than 0. The larger the DSS is, the more sensitive the sample is to the drug. Gene features and drug response annotations of clinical samples are given in Additional file 2: Table S8.
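
The DSS definition translates into a short function, sketched below. The text does not spell out the scale on which the DSS is evaluated; this example assumes that the IC50 and the MSC are given in the same linear concentration units and that the MSC is positive.

```python
import math

def drug_sensitivity_score(ic50, msc):
    """DSS = (-1)^{I(IC50 > MSC)} * ln(|IC50 - MSC| / MSC + 1).

    Assumes ic50 and msc share the same linear units and msc > 0.
    """
    sign = -1.0 if ic50 > msc else 1.0
    return sign * math.log(abs(ic50 - msc) / msc + 1.0)
```

For example, with an MSC of 2, an IC50 of 1 gives a positive score (sensitive), an IC50 of 4 gives a negative score (resistant), and an IC50 equal to the MSC gives 0.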

We predicted the IC50 of drugs on clinical cancer patients and calculated the DSS. We then adopted the ROC curve to analyze the consistency between our predictions and the binary clinical annotations obtained from the TCGA.
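
Using every DSS value as a discrimination threshold and integrating the resulting ROC curve is equivalent to computing the fraction of (sensitive, resistant) pairs in which the sensitive sample has the higher score, with ties counted as one half. The sketch below uses this pairwise form; it is an illustration with made-up scores, not the authors' evaluation code, and it does not compute the confidence interval.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels: 1 for "sensitive", 0 for "resistant"; scores: DSS values.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up scores for four patients: two sensitive, two resistant.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Here three of the four sensitive-resistant pairs are ordered correctly, so the AUC is 0.75.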

Availability of data and materials

The source code is available at https://github.com/horsedayday/DualGCN.

Abbreviations

SNV: Single nucleotide variant
CDR: Cancer drug response
IC50: Half-maximal inhibitory concentration
CCLE: Cancer Cell Line Encyclopedia
COSMIC: Catalogue of Somatic Mutations in Cancer
GDSC: Genomics of Drug Sensitivity in Cancer
SMILES: Simplified molecular-input line-entry system
MLP: Multilayer perceptron
CNN: Convolutional neural network
GCN: Graph convolutional network
PPI: Protein–protein interaction
Expr.: Gene expression
CNV: Copy number variation
SVM: Support vector machine
CV: Cross-validation
RMSE: Root mean square error
TCGA: The Cancer Genome Atlas Program
DSS: Drug sensitivity score
ROC: Receiver operating characteristic
AUC: Area under the curve
RBF: Radial basis function

References

Vasan N, Baselga J, Hyman DM. A view on drug resistance in cancer. Nature. 2019;575:299–309. https://doi.org/10.1038/s41586-019-1730-1.
Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–7. https://doi.org/10.1038/nature11003.
Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45:D777–83. https://doi.org/10.1093/nar/gkw1121.
Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2013;41:D955–61. https://doi.org/10.1093/NAR/GKS1111.
Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15:1–12. https://doi.org/10.1186/gb-2014-15-3-r47.
Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, et al. Modeling precision treatment of breast cancer. Genome Biol. 2013;14:1–14. https://doi.org/10.1186/gb-2013-14-10-r110.
Chang Y, Park H, Yang HJ, Lee S, Lee KY, Kim TS, et al. Cancer Drug Response Profile scan (CDRscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Sci Rep. 2018;8:1–11. https://doi.org/10.1038/s41598-018-27214-6.
Liu P, Li H, Li S, Leung KS. Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network. BMC Bioinform. 2019;20:1–14. https://doi.org/10.1186/s12859-019-2910-6.
Liu Q, Hu Z, Jiang R, Zhou M. DeepCDR: a hybrid graph convolutional network for predicting cancer drug response. Bioinformatics. 2020;36(Supplement_2):I911–8. https://doi.org/10.1093/bioinformatics/btaa822.
Dagogo-Jack I, Shaw AT. Tumour heterogeneity and resistance to cancer therapies. Nat Rev Clin Oncol. 2018;15:81–94. https://doi.org/10.1038/nrclinonc.2017.166.
Hinshaw DC, Shevde LA. The tumor microenvironment innately modulates cancer progression. Cancer Res. 2019;79:4557–67. https://doi.org/10.1158/0008-5472.CAN-18-3962.
Tang T, Huang X, Zhang G, Hong Z, Bai X, Liang T. Advantages of targeting the tumor immune microenvironment over blocking immune checkpoint in cancer immunotherapy. Signal Transduct Target Ther. 2021;6:1–13. https://doi.org/10.1038/s41392-020-00449-4.
Ni Y, Zhou X, Yang J, Shi H, Li H, Zhao X, et al. The role of tumor-stroma interactions in drug resistance within tumor microenvironment. Front Cell Dev Biol. 2021;9:1206.
Wu Z, Lawrence PJ, Ma A, Zhu J, Xu D, Ma Q. Single-cell techniques and deep learning in predicting drug response. Trends Pharmacol Sci. 2020;41:1050–65. https://doi.org/10.1016/j.tips.2020.10.004.
Prieto-Vila M, Usuba W, Takahashi RU, Shimomura I, Sasaki H, Ochiya T, et al. Single-cell analysis reveals a preexisting drug-resistant subpopulation in the luminal breast cancer subtype. Cancer Res. 2019;79:4412–25. https://doi.org/10.1158/0008-5472.CAN-19-0122.
Ho YJ, Anaparthy N, Molik D, Mathew G, Aicher T, Patel A, et al. Single-cell RNA-seq analysis identifies markers of resistance to targeted BRAF inhibitors in melanoma cell populations. Genome Res. 2018;28:1353–63. https://doi.org/10.1101/gr.234062.117.
Adey A, Burton JN, Kitzman JO, Hiatt JB, Lewis AP, Martin BK, et al. The haplotype-resolved genome and epigenome of the aneuploid HeLa cancer cell line. Nature. 2013;500:207–11. https://doi.org/10.1038/nature12064.
Gawad C, Koh W, Quake SR. Single-cell genome sequencing: current state of the science. Nat Rev Genet. 2016;17:175–88. https://doi.org/10.1038/nrg.2015.16.
Ma T, Li H, Zhang X. Discovering single-cell eQTLs from scRNA-seq data only. bioRxiv. 2021. https://doi.org/10.1101/2021.06.10.447906.
Armingol E, Officer A, Harismendy O, Lewis NE. Deciphering cell–cell interactions and communication from gene expression. Nat Rev Genet. 2021;22:71–88. https://doi.org/10.1038/s41576-020-00292-x.
Kumar MP, Du J, Lagoudas G, Jiao Y, Sawyer A, Drummond DC, et al. Analysis of single-cell RNA-Seq identifies cell–cell communication associated with tumor characteristics. Cell Rep. 2018;25:1458-1468.e4. https://doi.org/10.1016/j.celrep.2018.10.047.
Wu F, Fan J, He Y, Xiong A, Yu J, Li Y, et al. Single-cell profiling of tumor heterogeneity and the microenvironment in advanced non-small cell lung cancer. Nat Commun. 2021;12:1–11. https://doi.org/10.1038/s41467-021-22801-0.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv. 2017. https://arxiv.org/abs/1609.02907v4.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344:1396–401.
Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189–96. https://doi.org/10.1126/science.aad0501.
Chen YP, Yin JH, Li WF, Li HJ, Chen DP, Zhang CJ, et al. Single-cell transcriptomics reveals regulators underlying immune cell diversity and immune subtypes associated with prognosis in nasopharyngeal carcinoma. Cell Res. 2020;30:1024–42. https://doi.org/10.1038/s41422-020-0374-x.
Kim N, Kim HK, Lee K, Hong Y, Cho JH, Choi JW, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat Commun. 2020;11:1–15. https://doi.org/10.1038/s41467-020-16164-1.
Lee HW, Chung W, Lee HO, Jeong DE, Jo A, Lim JE, et al. Single-cell RNA sequencing reveals the tumor microenvironment and facilitates strategic choices to circumvent treatment failure in a chemorefractory bladder cancer patient. Genome Med. 2020;12:1–21. https://doi.org/10.1186/s13073-020-00741-6.
Ramsundar B, Eastman P, Walters P, Pande V. Deep learning for the life sciences. O'Reilly Media; 2019.
Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR; 2015. p. 448–56.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–58.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45:1113–20. https://doi.org/10.1038/ng.2764.
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47:D1102–9. https://doi.org/10.1093/nar/gky1033.
Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47:D607–13. https://doi.org/10.1093/nar/gky1131.
Yates AD, Achuthan P, Akanni W, Allen J, Allen J, Alvarez-Jarreta J, et al. Ensembl 2020. Nucleic Acids Res. 2020;48:D682–8. https://doi.org/10.1093/nar/gkz966.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 23 Supplement 4, 2022: The 20th International Conference on Bioinformatics (InCoB 2021). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-23-supplement-4.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC 61721003 and 62050178) and Tsinghua-Fuzhou Institute for Data Technology Grant TFIDT2021005. The publication costs were funded by NSFC 61721003. The funding bodies were not involved in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Authors and Affiliations

MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST and Department of Automation, Tsinghua University, Beijing, 100084, China
Tianxing Ma, Rui Jiang & Xuegong Zhang
Department of Statistics, Stanford University, Stanford, CA, 94305, USA
Qiao Liu
School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing, 100084, China
Haochen Li & Xuegong Zhang
SenseBrain Research, San Jose, CA, 95131, USA
Mu Zhou

Contributions

T.M., Q.L., and M.Z. conceived and designed the study. T.M. and Q.L. performed the experiments. T.M., Q.L., and H.L. performed the analysis and wrote the manuscript. X.Z. and R.J. supervised the study. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xuegong Zhang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary figures and Supplementary tables S1–S7 for additional results. Figure S1. IC50 and MSC of drugs. Figure S2. PCA of structures of drugs. Figure S3. ROC curve on clinical cancer patients. Table S1. Parameter settings of DualGCN. Table S2. Descriptions of drugs. Table S3. Descriptions of cell lines. Table S4. List of cancer-related genes. Table S5. Results of SVM regression with various kernels. Table S6. Results of random forest with various number of trees. Table S7. Results of Lasso regression with various alpha.

Additional file 2: Supplementary Table S8 for gene features and clinical annotations of the TCGA data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ma, T., Liu, Q., Li, H. et al. DualGCN: a dual graph convolutional network model to predict cancer drug response. BMC Bioinformatics 23 (Suppl 4), 129 (2022). https://doi.org/10.1186/s12859-022-04664-4

Received: 24 March 2022
Accepted: 04 April 2022
Published: 15 April 2022
DOI: https://doi.org/10.1186/s12859-022-04664-4
