Mdwgan-gp: data augmentation for gene expression data based on multiple discriminator WGAN-GP

Li, Rongyuan; Wu, Jingli; Li, Gaoshi; Liu, Jiafei; Xuan, Junbo; Zhu, Qi

doi:10.1186/s12859-023-05558-9

Research
Open access
Published: 13 November 2023

BMC Bioinformatics volume 24, Article number: 427 (2023)

Abstract

Background

Although gene expression data play significant roles in biological and medical studies, their applications are hampered by the difficulty and high cost of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied to augmenting gene expression data. However, mode collapse or over-fitting may occur for small training samples because the method adopts only one discriminator.

Results

In this study, MDWGAN-GP, an improved data augmentation approach built on a generative adversarial network with multiple discriminators, is proposed. In addition, a novel method based on a linear graph convolutional network is devised for enriching training samples. Extensive experiments were conducted on real biological data.

Conclusions

The experimental results have demonstrated that compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher quality generated gene expression data in most cases.

Introduction

Over the last two to three decades, the rapid development of genome sequencing technology has made it possible to measure the expression levels of thousands of genes from a biological sample simultaneously. Since gene expression data, extracted by various gene profiling technologies, directly reflect the physiological state and diseases of the human body [1], many computational techniques such as regression, classification and clustering can be applied to them to uncover disease mechanisms, propose novel drug targets, provide a basis for comparative genomics, and address a wide range of fundamental biological problems [2].

Nevertheless, gene expression profile data are fundamentally limited in sample size, diversity, and the speed at which they can be gathered [3], owing to the ethical challenges [4] and high cost of gathering gene expression data through biological experiments. For example, the per-person costs in 2018 were US$604-1932 for exome sequencing and US$2006-3347 for whole genome sequencing [5]. In addition, considerable bias and noise, which result from errors in the splicing process of short reads [6] and various batch effects [7], make it a great challenge to use gene expression data effectively. Therefore, it is desirable to generate biologically plausible synthetic gene expression data, which can be applied in downstream tasks such as marker gene detection, cell type clustering, gene association identification, and cancer stage prediction [3]. In recent years, data augmentation (DA) methods, which are capable of enriching data sets and mitigating data imbalance and data noise issues, have been extensively studied for generating synthetic gene expression data.

To the best of our knowledge, data augmentation methods for generating gene expression data generally fall into three categories: sample-based, simulator-based, and generative adversarial network-based. The sample-based methods include random sampling [8], mean sampling [9], resampling [10], and oversampling [11, 12], which are prone to overfitting [13] or distribution marginalization [14]. The simulator-based methods [15, 16] generate synthetic transcriptomics datasets based on known regulatory networks. Since they perform similarly to random simulators [2], they cannot simulate the key features of gene expression data [17]. With the rapid development of deep learning technology, the Generative Adversarial Network (GAN)-based methods, which are able to produce more diverse and higher quality samples than the former two categories, have received major attention [1, 2]. They are also the subject of this paper.

In 2020, Chaudhari et al. [18] first proposed the modified generator GAN (MG-GAN), which is fed with original data along with minimalistic multivariate noise to generate data with a Gaussian distribution. In 2021, Kwon et al. [19] indicated that GANs are not effective with whole genes, and expanded RNA expression data for selected significant genes using GANs. Both methods adopt the original unconditioned generative model, which has no control over the modes of the data being generated [20]. In 2022, Ahmed et al. [21] developed the omicsGAN method to integrate two omics data and their interaction network into a Wasserstein Generative Adversarial Network (WGAN) [22]. Nevertheless, gradient explosion is common when training WGAN. In 2020, Marouf et al. [23] adopted conditional single-cell generative adversarial neural networks (cscGAN) to produce single-cell RNA-seq data. The method learns non-linear gene-gene dependencies from complex, multiple cell type samples and uses this information to generate realistic cells of defined types. In 2022, Han et al. [1] put forward the Gene-CWGAN method, which stabilizes the distribution of generated samples with a dataset partition method, and adopts a constraint penalty term to improve the diversity of generated samples. In the same year, Viñas et al. [2] proposed a new simulator (referred to as S-WGAN-GP in this paper) based on WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty) [24]. S-WGAN-GP concatenates the sample covariates with the input features and samples the class labels from the real distribution. The S-WGAN-GP simulator can be used at a larger scale to produce tissue- and organ-specific transcriptomics data.

Mode collapse is a serious issue when training generative adversarial networks. Improving the diversity of both the training samples and the feedback signals may be an effective way to alleviate it. In the approaches mentioned above, the diversity of feedback signals may be constrained because only one discriminator is adopted in the GANs. Therefore, in this paper, the collaboration of multiple discriminators is explored. The main contributions of this paper are summarized as follows:

1. The multiple discriminator WGAN-GP (MDWGAN-GP) model is proposed. It can ensure the high quality of the generated gene expression data. Multiple discriminators are adopted to prevent mode collapse by providing more feedback signals to the generator.
2. A novel approach based on a linear graph convolutional network (GCN) is put forward to enrich training samples, avoiding the over-fitting or mode collapse caused by small sample sizes in high dimensional data.
3. Pan-cancer gene expression datasets were produced to demonstrate the effectiveness of the MDWGAN-GP approach. A data preprocessing method is applied to select the genes with high confidence or top ranking from protein-protein interaction networks, so as to relieve the curse of dimensionality encountered in training. Extensive experiments were conducted to compare the quality of the gene expression data generated by the MDWGAN-GP method and by other state-of-the-art ones.

Preliminaries

Conditional generative adversarial network

The conditional generative adversarial network (CGAN) [20] attempts to generate samples with specified labels from input labels and noise. Like the standard generative adversarial network (GAN) [25], a CGAN model consists of a generation network G and a discrimination network D. Given some noise z and conditional information y (e.g. category labels, data with different modalities), the generator G learns to produce synthetic samples similar to the real distribution. The discriminator D needs to distinguish whether an input sample comes from the real distribution p(x) or was produced by the generator G from noise drawn from p(z). The loss function of CGAN can be formulated as:

$$\begin{aligned} \mathop {\min }\limits _G \mathop {\max }\limits _D V(D,G) = {E_{x\sim {p{(x)}}}}[\log D(x|y)] + {E_{z\sim {p{(z)}}}}[\log (1 - D(G(z|y)|y))] \end{aligned}$$

(1)

Conditional Wasserstein generative adversarial network with gradient penalty

Unlike CGAN, the Wasserstein generative adversarial network (WGAN) [22] generates samples from input noise alone. It applies the Wasserstein distance instead of the Jensen-Shannon (JS) divergence to evaluate the distance between the distributions of the real samples and the generated ones, making the training process more stable and faster than that of the standard generative adversarial network. The Wasserstein generative adversarial network with gradient penalty (WGAN-GP) [24] is a modified model based on WGAN that penalizes the norm of the gradient of the discriminator with respect to its input. In 2020, Zheng et al. [26] further improved the WGAN-GP model by adding conditional information and proposed the CWGAN-GP model, whose loss function can be formulated as:

$$\begin{aligned} \begin{aligned} \mathop {\min }\limits _G \mathop {\max }\limits _D V(D,G) = {E_{x\sim {p{(x)}}}}[D(x|y)] - {E_{z\sim {p{(z)}}}}[D(G(z|y)|y)]+\\ \lambda {E_{{\hat{x}}\sim {p{({{\hat{x}}})}}}}[{(||{\nabla _{{\hat{x}}}}D({{\hat{x}}}|y)|{|_2} - 1)^2}], \end{aligned} \end{aligned}$$

(2)

where ${E_{{\hat{x}}\sim {p{({{\hat{x}}})}}}}[{(||{\nabla _{{\hat{x}}}}D({{\hat{x}}}|y)|{|_2} - 1)^2}]$ is the gradient penalty term.
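To make the penalty term in Equation (2) concrete, the following NumPy sketch evaluates it for a toy critic whose input gradient is available in closed form. The critic `D(x) = tanh(x W1) . w2`, the sampling of $\hat{x}$ on straight lines between real and generated samples (as in the original WGAN-GP formulation [24]), and all variable names are illustrative assumptions; the conditioning on y is omitted for brevity, and a real implementation would rely on automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, W1, w2):
    """Toy critic D(x) = tanh(x W1) . w2, returning one score per row of x."""
    return np.tanh(x @ W1) @ w2

def critic_input_grad(x, W1, w2):
    """Closed-form gradient of the toy critic with respect to its input."""
    h = np.tanh(x @ W1)                   # shape (batch, hidden)
    return ((1.0 - h ** 2) * w2) @ W1.T   # shape (batch, features)

def gradient_penalty(real, fake, W1, w2, rng):
    """Mean of (||grad_x D(x_hat)||_2 - 1)^2 over interpolated samples x_hat."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W1, w2), axis=1)
    return np.mean((grad_norm - 1.0) ** 2)

n_features, n_hidden, batch = 5, 8, 16
W1 = rng.normal(size=(n_features, n_hidden))
w2 = rng.normal(size=n_hidden)
real = rng.normal(size=(batch, n_features))            # stand-in for real samples
fake = rng.normal(loc=0.5, size=(batch, n_features))   # stand-in for generated samples

lam = 10.0  # penalty weight lambda
wasserstein_term = critic(real, W1, w2).mean() - critic(fake, W1, w2).mean()
penalty_term = lam * gradient_penalty(real, fake, W1, w2, rng)
print(wasserstein_term, penalty_term)
```

The penalty is zero only when the critic's gradient has unit norm at every interpolated point, which is how the term encourages the 1-Lipschitz behaviour required by the Wasserstein distance.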

Graph convolutional network

The emerging graph convolutional networks (GCNs) [27,28,29] are able to extract spatial correlations well in non-Euclidean structures and maintain shift-invariance. Let G=(V, E) be an undirected graph, where V and E represent the set of nodes $v_{i}$ $\in$ $V$ (i=1,2,...,n) and the set of edges ($v_{i}$,$v_{j}$)$\in$ $E$, respectively. $A$ $\in$ $R^{n\times n}$ is the adjacency matrix of G, where $A_{ij}$ indicates whether there is an edge between $v_{i}$ and $v_{j}$, or the similarity between them based on a similarity measure. Let $H^{(l)}$ represent the graph node representations at the l-th ($l$ $\in$ $N$) layer; the propagation rule for calculating the graph node representations at the $(l+1)$-th layer is formulated as:

$$\begin{aligned} H^{(l+1)}=f\left( {\widetilde{D}}^{-\frac{1}{2}} \widetilde{A}{\widetilde{D}}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right) , \end{aligned}$$

(3)

where f($\cdot$) is a non-linear activation function, $\widetilde{A}$=A+I is the adjacency matrix with added self-loops, and $W^{(l)}$ is the weight matrix of the l-th layer. ${\widetilde{D}}^{-\frac{1}{2}}\widetilde{A}{\widetilde{D}}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix, where ${\widetilde{D}}_{ii}$=$\sum _{j=1}^{n}{\widetilde{A}}_{ij}$.

Proposed method

Recently, Viñas et al. [2] proposed a WGAN-GP based simulator, S-WGAN-GP, to generate specific tumour gene expression data. Although conditional restrictions are added, mode collapse or over-fitting may still occur for small training samples because only one discriminator is adopted. In addition, WGAN-GP has some inherent defects, such as unstable training and failure to generate diverse samples [1, 30]. Therefore, in this section, an improved data augmentation approach, the multiple discriminator WGAN-GP (MDWGAN-GP) model, is proposed. We begin by enriching the training samples with linear graph convolution [31, 32]; then a generative adversarial network with multiple discriminators is devised based on WGAN-GP. The details are described below. The source code of MDWGAN-GP can be downloaded from https://github.com/lryup/MDWGAN-GP.

Enriching training samples

It is generally accepted that enriched training samples help a GAN capture the original distribution [33]. Inspired by methods used to enrich image training samples, e.g., rotation, flipping, and cropping, a novel approach suitable for gene expression data is proposed. Let $X_1$ be a raw gene expression matrix with n rows (samples) and m columns (genes), where each entry represents the expression level of a given gene in a particular sample. A pair of K-Nearest Neighbors (KNN) graphs [34, 35] $G_E$ and $G_C$ are built from matrix $X_1$ based on Euclidean distance and Cosine distance, respectively. Each vertex denotes a sample, and an edge indicates a strong relationship between the two connected samples. Linear graph convolution is performed to update the vertices (samples), i.e., to aggregate the information of their neighbors. The updated gene expression matrices $X_2$ and $X_3$ are computed as follows:

$$\begin{aligned} X_2= & {} f\left( {\widetilde{D}_E}^{-\frac{1}{2}}\widetilde{A}_ E{\widetilde{D}_E}^{-\frac{1}{2}}X_1\right) , \end{aligned}$$

(4)

$$\begin{aligned} X_3= & {} f\left( {\widetilde{D}_C}^{-\frac{1}{2}}\widetilde{A}_ C{\widetilde{D}_C}^{-\frac{1}{2}}X_1\right) , \end{aligned}$$

(5)

where f($\cdot$) is a linear activation function. $\widetilde{A}_E$=$A_E$+I (resp. $\widetilde{A}_C$=$A_C$+I), where $A_E$ and $A_C$ are the adjacency matrices of graphs $G_E$ and $G_C$, respectively. ${\widetilde{D}}_E{_{ii}}$=$\sum _{j=1}^{n}{\widetilde{A}}_E{_{ij}}$, ${\widetilde{D}}_C{_{ii}}$=$\sum _{j=1}^{n}{\widetilde{A}}_C{_{ij}}$.
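Equations (4) and (5) can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the number of neighbours `k`, the use of binary (0/1) edges, the symmetrization of the KNN graph, and the function names are assumptions made here, and the dense pairwise-distance computation is only suitable for small toy inputs.

```python
import numpy as np

def knn_adjacency(dist, k):
    """Symmetric 0/1 adjacency linking each sample to its k nearest neighbours."""
    n = dist.shape[0]
    d = dist.copy()
    np.fill_diagonal(d, np.inf)             # a sample is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)               # make the graph undirected

def linear_graph_conv(A, X):
    """Compute D~^{-1/2} (A + I) D~^{-1/2} X with an identity activation."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_hat @ X

def enrich(X1, k=3):
    """Return (X2, X3) built from the Euclidean and cosine KNN graphs of X1."""
    diff = X1[:, None, :] - X1[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=2))
    unit = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
    cosine = 1.0 - unit @ unit.T
    X2 = linear_graph_conv(knn_adjacency(euclid, k), X1)
    X3 = linear_graph_conv(knn_adjacency(cosine, k), X1)
    return X2, X3

rng = np.random.default_rng(0)
X1 = rng.normal(size=(12, 6))   # 12 samples, 6 genes (toy data)
X2, X3 = enrich(X1, k=3)
print(X2.shape, X3.shape)
```

Because no weight matrix or non-linearity is involved, each row of $X_2$ and $X_3$ is a degree-weighted combination of a sample and its neighbours, so the enriched matrices keep the same shape as $X_1$.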

Adversarial simulator for augmenting gene expression data

It has been reported that adopting multiple discriminators can improve the stability of the optimization process [33]. In this subsection, an adversarial simulator MDWGAN-GP with three discriminators is devised, as shown in Fig. 1.

Figure 1a shows the S-WGAN-GP model, and Fig. 1b illustrates the structure of MDWGAN-GP proposed in this paper. In the MDWGAN-GP model, the distribution of the original data is expected to be learned from the two updated gene expression matrices $X_2$ and $X_3$ in addition to the raw gene expression matrix $X_1$. Hence two more discriminators, $D_2$ and $D_3$, are added and fed with $X_2$ and $X_3$, respectively. Nevertheless, it is worth noting that the generator is still expected to learn principally from the raw samples $X_1$ rather than the updated ones, which play auxiliary roles in the training process.

The objective function

In a generative adversarial network, the generator tries to produce samples that look real enough to trick the discriminator, while the discriminator attempts to distinguish the generated samples from the real ones. Here the objective functions are designed for one generator and three discriminators in MDWGAN-GP, as illustrated in Equation (6):

$$\begin{aligned} \begin{aligned} V(D_i,G) = {E_{X_i\sim {p{(X_i)}}}}[D_i(X_i|Y)]-{E_{Z\sim {p{(Z)}}}}[D_i(G(Z|Y)|Y)]+\\ \lambda {E_{{\hat{X}}_i\sim {p{({{\hat{X}}_i})}}}}[{(||{\nabla _{{\hat{X}}_i}}D_i({{\hat{X}}_i |Y})|{|_2} - 1)^2}], i=1,2,3, \end{aligned} \end{aligned}$$

(6)

where Y indicates the conditional labels, and $\lambda$ is a hyperparameter determining the strength of the gradient penalty ${E_{{\hat{X}}_i\sim {p{({{\hat{X}}_i} )}}}}[{(||{\nabla _{{\hat{X}}_i}}D_i({{\hat{X}}_i|Y})|{|_2} - 1)^2}]$. $X_i$ denotes the real samples, Z denotes the noise samples, and $\hat{X_i}$ represents the samples randomly chosen from the real ones or the generated ones. The overall optimization objectives of the generator and the discriminators are formulated as Equation (7) and Equation (8):

$$\begin{aligned}{} & {} {\mathop {\mathrm{{min}}}\limits _G V( {D_1,D_2,D_3,G}) = V({D_1,G}) + \frac{{{\lambda _g}}}{{2}}[V(D_2,G)+V(D_3,G)]}, \end{aligned}$$

(7)

$$\begin{aligned}{} & {} \mathop {\mathrm{{max}}}\limits _{D_1,D_2,D_3} V({D_1,D_2,D_3,G})=V({D_1,G})+\frac{{{\lambda _d}}}{{2}}[V(D_2,G)+V(D_3,G)], \end{aligned}$$

(8)

where ${\lambda _g}$ and ${\lambda _d}$ denote two small adjustable parameters assisting model learning. All discriminators are trained through weight sharing to improve model performance [33].
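The weighting in Equations (7) and (8) amounts to simple arithmetic once the three per-discriminator values $V(D_i,G)$ are available. The short sketch below illustrates it with the weights reported in the parameter settings ($\lambda_g$=0.2, $\lambda_d$=0.02); the function name and the numeric values of $V(D_i,G)$ are hypothetical and chosen only for illustration.

```python
def combined_objective(v1, v2, v3, weight):
    """V(D1,G) + weight/2 * [V(D2,G) + V(D3,G)], as in Equations (7) and (8)."""
    return v1 + 0.5 * weight * (v2 + v3)

LAMBDA_G = 0.2    # weight used in the generator objective
LAMBDA_D = 0.02   # weight used in the discriminator objective

# Hypothetical values of V(D_i, G) for one mini-batch.
v1, v2, v3 = 1.0, 0.8, 0.6
generator_objective = combined_objective(v1, v2, v3, LAMBDA_G)
discriminator_objective = combined_objective(v1, v2, v3, LAMBDA_D)
print(generator_objective, discriminator_objective)
```

With small weights, the terms from $D_2$ and $D_3$ shift the objective only slightly, which matches the stated intent that the enriched matrices play an auxiliary role next to the raw samples $X_1$.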

Architecture

Figure 2 shows the architecture of the proposed simulator MDWGAN-GP. The generator G receives a noise vector Z and a conditional label Y as input and produces a vector $X'$ of synthetic expression values. The discriminator $D_i$ (i=1,2,3) takes either a real gene expression sample $X_i$ or a synthetic sample $X'$, together with a conditional label Y, and tries to distinguish whether the input sample is real or fake. Matrices $X_2$ and $X_3$ are produced with a linear graph convolution on the sample graphs $G_E$ and $G_C$, which are constructed from matrix $X_1$ based on Euclidean distance and Cosine distance, respectively.

Experimental details

The effectiveness of MDWGAN-GP is verified through extensive experiments. We began by comparing the performance of CGAN [20], CWGAN [36], CWGAN-GP [26], Gene-CWGAN [1], S-WGAN-GP [2], and MDWGAN-GP in terms of the similarity dist($\cdot$, $\cdot$) on fifteen datasets, and the diversity of the samples generated by these models through sample dimension visualization. Then we compared the models in terms of the classification ability of the generated samples. Next, we compared the models in terms of the correlations among key genes. Finally, we compared the differentially expressed genes identified using the generated datasets with those identified using the real ones.

Data preparation and parameter settings

In the experiments, real biological datasets are acquired from four databases:

(1) The Cancer Genome Atlas (TCGA). It is a public biospecimen repository which aims to augment the understanding of the molecular mechanisms of cancers. The database contains high-throughput genomic data from over 20,000 primary cancer and matched healthy samples spanning 33 cancer-types.

(2) The Genotype-Tissue Expression (GTEx). It is also a public resource built to study tissue-specific gene expression and regulation. It contains samples collected from 54 non-diseased tissue sites across nearly 1000 individuals [37].

(3) The String dataset. String is a database which records known and predicted protein-protein interactions, including physical as well as functional connections. The latest Human Protein Interaction Network version 11.5 was adopted in the experiments.

(4) The HumanNet dataset. HumanNet [38] is a database that covers 99.8% of human protein-coding genes. The latest functional gene network (HumanNet-FN) version 3 [39] was adopted in the experiments.

The data preparation was conducted as follows. Firstly, the raw RNA-seq sample datasets of TCGA and GTEx were acquired from Wang et al. [40]. Fifteen common tissues between the TCGA and GTEx datasets were selected to construct the GT dataset, which consisted of 9,147 samples and 18,154 genes. Secondly, the String PPI network consisted of 11,938,499 edges and 19,385 proteins, of which 360,783 edges and 14,220 proteins were retained after filtering out the edges with a score less than 800. Protein IDs were converted to gene IDs and then to gene names with the Genome Reference Consortium Human Build 38 (GRCh38) database and the R packages AnnotationDbi and org.Hs.eg.db. After dropping duplicates, which arise because some proteins correspond to multiple genes, 13,035 genes remained. Thirdly, among the 977,495 edges and 18,458 genes of HumanNet, 15,443 genes and 97,749 edges were left by choosing the top 10% most reliable edges. Finally, the genes that did not belong to the String or the HumanNet PPI networks were dropped from the GT dataset, leaving 9,147 samples and 10,612 genes. Both logarithmic transformation and z-score normalization were applied to the gene expression values. The number of samples in each of the fifteen common tissues is listed in Table 1.

Table 1 The number of samples of the fifteen common tissues

In the experiments, 10% of the samples in all datasets were randomly selected as the training set, and the remaining 90% were used as the test set. Both the generator and the discriminator models included two fully connected hidden layers, each with 256 neurons. The hidden layers adopted the ReLU activation function, and the output layer used no activation function. The RMSProp optimizer was used with a learning rate of 0.0005 [41]. The hyperparameters were set as follows: $\lambda$=10 [24], $\lambda _g$=0.2, and $\lambda _d$=0.02 [33]. The training process was terminated when the validation score dist($D^X$, $D^Z$) had not improved for 20 consecutive evaluations, or when the maximum of 500 iterations was reached.
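The stopping rule described above can be sketched as a small helper. The `evaluate` callback is hypothetical: it stands for one training iteration followed by the computation of the validation score dist($D^X$, $D^Z$), where higher is better. The defaults mirror the reported patience of 20 and maximum of 500 iterations; everything else is an assumption of this sketch.

```python
def train_with_early_stopping(evaluate, max_iters=500, patience=20):
    """Run evaluate(it) until the score stops improving or max_iters is hit.

    Returns the best score and the iteration at which it was observed.
    """
    best_score, best_iter, stale = float("-inf"), -1, 0
    for it in range(max_iters):
        score = evaluate(it)
        if score > best_score:
            best_score, best_iter, stale = score, it, 0
        else:
            stale += 1
            if stale >= patience:
                break   # no improvement for `patience` consecutive evaluations
    return best_score, best_iter

# Toy score curve: improves linearly for 30 iterations, then plateaus.
calls = []
def toy_evaluate(it):
    calls.append(it)
    return min(it, 30) / 30.0

best_score, best_iter = train_with_early_stopping(toy_evaluate)
print(best_score, best_iter, len(calls))
```

In practice the model parameters from the best iteration would also be saved, which this sketch leaves out.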

Evaluation index

In this section, the evaluation indexes for estimating the performance of a generative model are described. Assume that $X_{m_1\times n}$ and $Z_{m_2\times n}$ are a pair of matrices recording real and synthetic gene expression observations, respectively. Their rows denote a set of $m_1$ real cancer samples and $m_2$ synthetic ones, respectively, their columns denote a set of n genes, and their entries are real numbers, i.e., $x_{ij}$, $z_{ij}$ $\in$ $R$. Let $D^X$ and $D^Z$ be a pair of $n$ $\times$ $n$ symmetric matrices corresponding to X and Z. In matrix $D^X$ (resp. $D^Z$), each entry $d_{jk}^{X}$ (resp. $d_{jk}^{Z}$) records the pairwise distance between the j-th and the k-th genes, i.e., the Pearson correlation coefficient between columns $x_{-j}$ (resp. $z_{-j}$) and $x_{-k}$ (resp. $z_{-k}$), as defined in Equation (9) (resp. Equation (10)):

$$\begin{aligned} \begin{array}{*{20}{c}} {d_{jk}^{X}=\frac{\mathop \sum \limits _{i=1}^{m_1}(x_{ij} -{\bar{x}}_{-j})(x_{ik}-{\bar{x}}_{-k})}{ \sqrt{\mathop \sum \limits _{i=1}^{m_1}(x_{ij} -{\bar{x}}_{-j})^2}\sqrt{\mathop \sum \limits _{i=1}^{m_1} (x_{ik}-{\bar{x}}_{-k})^2}} } \end{array} \end{aligned}$$

(9)

$$\begin{aligned} \begin{array}{*{20}{c}} {d_{jk}^{Z}=\frac{\mathop \sum \limits _{i=1}^{m_2}(z_{ij} -{\bar{z}}_{-j})(z_{ik}-{\bar{z}}_{-k})}{ \sqrt{\mathop \sum \limits _{i=1}^{m_2}(z_{ij} -{\bar{z}}_{-j})^2}\sqrt{\mathop \sum \limits _{i=1}^{m_2} (z_{ik}-{\bar{z}}_{-k})^2}} } \end{array} \end{aligned}$$

(10)

where ${\bar{x}}_{-j}$=$\frac{\sum \limits _{i=1}^{m_1}x_{ij}}{m_1}$, ${\bar{x}}_{-k}$=$\frac{\sum \limits _{i=1}^{m_1}x_{ik}}{m_1}$, ${\bar{z}}_{-j}$=$\frac{\sum \limits _{i=1}^{m_2}z_{ij}}{m_2}$, ${\bar{z}}_{-k}$=$\frac{\sum \limits _{i=1}^{m_2}z_{ik}}{m_2}$.

Let dist($D^X$, $D^Z$) represent the similarity between matrices $D^X$ and $D^Z$, measuring whether the pairwise correlations between genes in the real data are correlated with those in the synthetic data, as defined in Equation (11) [2]:

$$\begin{aligned} \begin{array}{*{20}{c}} dist(D^X, D^Z)=\frac{2}{n(n-1)}\sum \limits _{i=1}^{n-1}\sum \limits _{j=i+1}^n(\frac{d^X_{ij} -\mu (D^X)}{\sigma (D^X)})(\frac{d^Z_{ij}-\mu (D^Z)}{\sigma (D^Z)}), \end{array} \end{aligned}$$

(11)

where $\mu (D^X)$ and $\sigma (D^X)$ are defined as Equation (12) and Equation (13), and $\mu (D^Z)$ and $\sigma (D^Z)$ are defined accordingly.

$$\begin{aligned} \begin{array}{*{20}{c}} {\mu (D^X)=\frac{2}{n(n-1)}\sum \limits _{i=1}^{n-1}\sum \limits _{j=i+1}^n d^X_{ij}} \end{array} \end{aligned}$$

(12)

$$\begin{aligned} \begin{array}{*{20}{c}} {\sigma (D^X)=\sqrt{\frac{2}{n(n-1)}\sum \limits _{i=1}^{n-1}\sum \limits _{j=i+1}^n {(d^X_{ij}-\mu (D^X))}^2}} \end{array} \end{aligned}$$

(13)
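The similarity score can be sketched in NumPy as the Pearson correlation between the upper-triangular entries of the two gene-gene correlation matrices. The averaging over the n(n-1)/2 gene pairs (so that the score is bounded by 1 in absolute value, as the reported values suggest), the toy data, and the function names are choices of this sketch.

```python
import numpy as np

def gene_correlation_matrix(X):
    """Pearson correlation between every pair of genes (columns of X)."""
    return np.corrcoef(X, rowvar=False)

def dist(DX, DZ):
    """Pearson correlation between the upper-triangular entries of DX and DZ."""
    iu = np.triu_indices_from(DX, k=1)       # entries with i < j
    a, b = DX[iu], DZ[iu]
    a = (a - a.mean()) / a.std()             # population std, as in Equation (13)
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(200, 8)) @ mixing                  # toy "real" samples
Z = X[:100] + rng.normal(scale=0.5, size=(100, 8))      # toy "synthetic" samples

DX = gene_correlation_matrix(X)
DZ = gene_correlation_matrix(Z)
print(dist(DX, DZ))
```

A score close to 1 indicates that gene pairs that are strongly correlated in the real data are also strongly correlated in the synthetic data; note that X and Z may contain different numbers of samples, since only the n x n correlation matrices are compared.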

In addition, the classification performance obtained by taking advantage of the synthetic gene expression data is also adopted to measure the performance of generative model, as depicted from Equation (14) to Equation (18):

$$\begin{aligned} Accuracy= & {} \frac{TP+TN}{TP+FP+FN+TN} \end{aligned}$$

(14)

$$\begin{aligned} Precision= & {} \frac{TP}{TP+FP} \end{aligned}$$

(15)

$$\begin{aligned} Recall= & {} \frac{TP}{TP+FN} \end{aligned}$$

(16)

$$\begin{aligned} F1\text{-}score= & {} \frac{2\times Precision\times Recall}{Precision+Recall} \end{aligned}$$

(17)

$$\begin{aligned} Mcc= & {} \frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times (TP+FP)\times (TN+FN) \times (TN+FP)}} \end{aligned}$$

(18)

where TP (resp. TN) denotes the number of positive (resp. negative) samples correctly labeled by the classifier. FP (resp. FN) represents the number of negative (resp. positive) samples incorrectly labeled as positive (resp. negative) ones. Mcc denotes Matthews correlation coefficient.
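Equations (14) to (18) translate directly into code. The sketch below computes all five metrics from the four confusion-matrix counts; the function name and the example counts are illustrative, and the sketch assumes every denominator is non-zero.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fn) * (tn + fp)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts for a cancer-vs-normal classifier.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, Mcc uses all four counts in both numerator and denominator, so it remains informative when the normal and cancer classes are imbalanced.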

Comparison of similarity dist($\cdot$, $\cdot$) of different models

In Table 2, the similarity dist($\cdot$, $\cdot$) is compared among the different models. For each dataset, the generated sample set has the same size as the corresponding test set. From this table we can see that the presented model MDWGAN-GP outperforms the other models on 11 of the 15 datasets. Its average dist($\cdot$, $\cdot$) over all datasets is 0.704, which is clearly higher than those of the other five models.

Table 2 Comparisons of similarity between the real and generated samples

In addition, Figs. 3, 4 and 5 compare the distributions of the generated samples and the real samples for the first 11 genes, intuitively reflecting the diversity of the generated samples. In all figures, the horizontal axis indicates the gene index, and the vertical axis denotes the gene expression value. The red line represents the real samples, and the blue one represents the generated samples.

From Figs. 3, 4 and 5 we can see that, compared with the samples generated by the other five models, those generated by MDWGAN-GP generally have distributions more similar to the real samples. The samples generated by CGAN concentrate in a very narrow range, indicating that the original and generated data distributions have a negligible overlapping area, because the JS divergence adopted by CGAN may lead to vanishing gradients and mode collapse [42]. CWGAN adopts the Wasserstein distance to address mode collapse. However, it generates samples deviating from the original values owing to gradient explosion resulting from the absence of a gradient penalty [24]. CWGAN-GP effectively avoids gradient explosion by adding a gradient penalty. Nevertheless, because the true value range of each feature is unknown and the output-layer activation function of CWGAN-GP forcibly limits the generation space [1], the diversity of its samples remains poor at the distribution margins. Gene-CWGAN expands the generation space by removing the tanh activation function of the CWGAN-GP generator, and curbs the expansion of learning fluctuation with a constraint penalty term [1]. Nevertheless, the generated samples may deviate from the original ones. As shown in Fig. 4d, the maximum original values of the 0th and the 3rd genes are 7 and 8, respectively, while the maximum generated values are close to 9 and 10, respectively. Similar to Gene-CWGAN, S-WGAN-GP also expands the generation space by removing the tanh activation function, and it can generate sample data under specified conditions. To further improve model stability and the diversity of generated samples, the MDWGAN-GP method produces enriched training samples and employs multiple discriminators. As shown in Figs. 3, 4 and 5, the samples generated by MDWGAN-GP have more satisfactory diversity at the distribution margins.

Comparison of classification ability of samples generated based on different models

As illustrated in Tables 3, 4 and 5, the classification ability of the generated samples is evaluated in terms of classifying normal and cancer samples. In the experiments, three classical classification methods, i.e., random forests (RF) [43], K-nearest neighbors (KNN) [44], and multilayer perceptron (MLP) [45], were adopted. The number of trees $n_{estimators}$ was 200 for RF, the number of neighbours K was 5 for KNN, and two hidden layers with 128 units and the ReLU activation function were adopted for MLP. For each method, the average results of ten runs are presented. It can be seen from the three tables that, for all three methods, the samples generated with the MDWGAN-GP model achieve the best classification performance in the vast majority of cases. Furthermore, with the RF and KNN classifiers, the samples generated with the MDWGAN-GP model even yield better classification performance than the real samples (denoted as “Real” in the three tables).

Table 3 Comparisons of classifying the normal and the cancer samples (Accuracy%)

Table 4 Comparisons of classifying the normal and the cancer samples (F1-score%)

Table 5 Comparisons of classifying the normal and the cancer samples (Mcc%)

Furthermore, to intuitively reflect the clustering ability of the samples generated by MDWGAN-GP, we compare the clustering results on the real samples with those on the generated ones. As shown in Figs. 6 and 7, three datasets, i.e., Colon, Thyroid and Lung, were adopted. The dimensionality of each sample was reduced to two with t-SNE [46]. From Fig. 6 we can see that the generated samples almost overlap with the real ones. Moreover, the clustering results in Fig. 7 demonstrate that the generated samples present better linear separability than the real ones, indicating that it might be better to perform differential analysis between normal and cancer tissues using the generated datasets.

Ablation experiments

As mentioned before, the training samples were enriched with linear graph convolution in MDWGAN-GP. Here a series of ablation experiments were conducted on the GT dataset. The training set was constructed by randomly selecting 10% of the samples from each tissue, and the remaining 90% of the samples were used as the test set. Figure 8 compares the similarity between the real data and the generated data in terms of dist($\cdot$, $\cdot$). In this figure, MDWGAN-GP-C (resp. MDWGAN-GP-E) represents the model adopting only the Cosine distance (resp. Euclidean distance). As can be seen from the figure, the MDWGAN-GP model has the highest dist($\cdot$, $\cdot$) among the four models. In the subsequent two subsections, experiments were conducted to further test the usability of the samples generated with MDWGAN-GP.

Comparison of the correlations among key genes

The ten most frequently mutated genes in human cancers [47] were chosen as key genes. The correlations among them are calculated and presented based on the generated and the real expression data, respectively. As can be seen in Fig. 9, a pair of $10 \times 10$ symmetric matrices record the distances $d_{jk}$ (j,k=1,2,...,10) among the ten key genes. Figure 9a represents the correlations based on the real samples, while Fig. 9b represents those based on the samples generated by MDWGAN-GP. It can be seen that the distances among genes calculated from the two kinds of samples are close, indicating that the correlations among genes in the generated data closely approximate those in the real data.

Comparison of differentially expressed genes (DEGs)

As analyzed above, it might be better to conduct differential analysis between normal and cancer tissues using the generated datasets than using the real ones. In this section, the differentially expressed genes identified from the generated datasets are further compared with those identified from the real ones. Eighty percent of all pan-cancer samples were randomly selected as the training set, and the same number of samples were generated with MDWGAN-GP. The R package DESeq2 was used to calculate the fold change and p-value for each gene from the denormalized generated expression data, and the genes with $|\log_2(\text{fold change})|$ greater than 3 and p-values less than 0.05 were selected as differentially expressed genes. For convenience of description, we use “real-DEGs” and “fake-DEGs” to denote the DEGs identified from the real and the generated datasets, respectively.
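The selection rule can be sketched as a simple filter over per-gene statistics. The sketch assumes the log2 fold changes and p-values have already been computed (for example, exported from DESeq2); the function name, argument names, and toy values are hypothetical.

```python
import numpy as np

def select_degs(gene_names, log2_fold_change, p_values,
                lfc_cutoff=3.0, p_cutoff=0.05):
    """Return genes with |log2 fold change| > lfc_cutoff and p-value < p_cutoff."""
    lfc = np.asarray(log2_fold_change, dtype=float)
    p = np.asarray(p_values, dtype=float)
    # Comparisons involving NaN are False, so genes with missing
    # statistics are excluded automatically.
    keep = (np.abs(lfc) > lfc_cutoff) & (p < p_cutoff)
    return [gene for gene, flag in zip(gene_names, keep) if flag]

# Toy statistics for four hypothetical genes.
genes = ["G1", "G2", "G3", "G4"]
lfc = [3.5, -4.2, 2.9, 5.0]
pvals = [0.01, 0.001, 0.0001, 0.2]
print(select_degs(genes, lfc, pvals))
```

Here G3 is rejected by the fold-change cutoff despite its small p-value, and G4 is rejected by the p-value cutoff despite its large fold change, so both conditions must hold for a gene to be counted as a DEG.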

As shown in Table 6, for most cancer types, the number of fake-DEGs is close to that of real-DEGs. Additionally, breast cancer was taken as an example to analyze the association between DEGs and cancers. Firstly, among the top 286 real-DEGs (resp. fake-DEGs), 165 (resp. 177) breast cancer related genes were identified based on the DisGeNET database (v7.0) [48]. Thus, the number of breast cancer related DEGs obtained from the generated data is greater than that obtained from the real data.
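To put the two counts on a common scale, they can be expressed as shares of the 286 top DEGs. The sketch below does this arithmetic and also shows how such a count could be obtained from a disease gene set; the helper name `related_fraction` and the toy gene sets are hypothetical, and only the numbers 165, 177 and 286 come from the text.

```python
def related_fraction(degs, disease_genes):
    """Return (number of DEGs found in disease_genes, their share of all DEGs)."""
    hits = set(degs) & set(disease_genes)
    return len(hits), len(hits) / len(degs)


# Hypothetical toy sets illustrating the counting step.
toy_degs = {"gene_A", "gene_B", "gene_C", "gene_D"}
toy_disease_genes = {"gene_B", "gene_D", "gene_X"}
n_hits, share = related_fraction(toy_degs, toy_disease_genes)
print(n_hits, share)

# Shares implied by the counts reported in the text.
real_share = 165 / 286
fake_share = 177 / 286
print(f"{real_share:.1%} vs {fake_share:.1%}")  # 57.7% vs 61.9%
```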

Table 6 Comparisons of the number of differentially expressed genes

Secondly, the clusterProfiler package of R [49] was used to conduct enrichment analysis for the DEGs based on the KEGG database [50]. As displayed in Fig. 10, real-DEGs and fake-DEGs are each enriched in nine biological pathways. The color of each bar indicates the degree of significance, and its length gives the number of enriched DEGs. Among the two groups of enriched biological pathways, seven breast cancer related pathways are enriched by both real-DEGs and fake-DEGs. The PPAR signaling pathway has been reported as a potential biomarker for the diagnosis of breast cancer [51,52,53]. Cytokine-cytokine receptor interaction plays an important role in the development and metastasis of breast cancer [54]. Aberrant AMPK signaling pathways may play a role in the regulation of growth, survival and the development of drug resistance in triple-negative breast cancer [55]. The IL-17 signaling pathway has been demonstrated to promote the proliferation, invasion and metastasis of breast cancer cells, and is significantly associated with poor prognosis in breast cancer patients [56]. The regulation of lipolysis in adipocytes pathway promotes the proliferation and migration of breast cancer cells [57]. The tyrosine metabolism pathway regulates the development of breast cancer [58]. The proximal tubule bicarbonate reclamation pathway indirectly regulates the proliferation of breast cancer cells through TASK-2 [59]. In addition, two further breast cancer related biological pathways, i.e., Viral protein interaction with cytokine and cytokine receptor and the Adipocytokine signaling pathway, are enriched only by the fake-DEGs. Viral protein interaction with cytokine and cytokine receptor has been reported to be significant for breast cancer [60]. The Adipocytokine signaling pathway can mediate the survival, growth, invasion, and metastasis of breast cancer cells through different cellular and molecular mechanisms, thus reducing survival time and contributing to malignancy [61]. Figure 11 (resp. Figure 12) further illustrates the top five pathways enriched by real-DEGs (resp. fake-DEGs) in terms of adjusted p-values. The steel-blue nodes represent the pathways, and their size indicates the number of enriched DEGs. The other, smaller colored nodes represent the DEGs, and their color indicates the value of $\log_2(\text{fold change})$.
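Pathway enrichment of this kind is commonly assessed with an over-representation test based on the hypergeometric distribution, followed by a multiple-testing correction. The sketch below illustrates that idea in plain Python; it is not the clusterProfiler implementation, the Benjamini-Hochberg correction is an assumed choice (the text only mentions adjusted p-values), and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_degs, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of background genes
    n_pathway:  background genes annotated to the pathway
    n_degs:     number of DEGs tested
    n_overlap:  DEGs annotated to the pathway
    """
    upper = min(n_pathway, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_degs - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 20000 background genes, 286 DEGs, three pathways
# given as (genes in pathway, DEGs in pathway).
pathways = {"pathway_1": (75, 9), "pathway_2": (290, 12), "pathway_3": (60, 1)}
raw = [enrichment_p_value(20000, k, 286, x) for k, x in pathways.values()]
for name, p_adj in zip(pathways, benjamini_hochberg(raw)):
    print(name, p_adj)
```

Ranking pathways by the adjusted value mirrors the ordering used to pick the top five pathways for each DEG set.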

Conclusions and future directions

Since gathering gene expression data through biological experiments is both difficult and expensive, generating them through computational approaches has attracted great attention. In this study, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is put forward. A novel method is designed for enriching training samples based on a linear graph convolutional network. Compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher quality generated gene expression data in most cases. In addition, some critical biomarkers, enriched in several significant biological pathways, are identified based on the generated data. All of these have been verified through extensive experiments performed on real biological data.

However, during the experiments, we found that GAN and its improved versions have the inherent defect of being difficult to train. It has been reported that the diffusion model can ensure sample diversity by adding and removing noise step by step [62]. It is expected to perform well in generating high-quality and diverse gene expression data, which will be studied in the future.

Availability of data and materials

The datasets used in this paper and the source code of MDWGAN-GP are available at https://github.com/lryup/MDWGAN-GP.

References

Han F, Zhu S, Ling Q, Han H, Li H, Guo X, Cao J. Gene-cwgan: a data enhancement method for gene expression profile based on improved cwgan-gp. Neural Computing Appl. 2022;1–15:16325–39.
Viñas R, Andrés-Terré H, Liò P, Bryson K. Adversarial generation of gene expression data. Bioinformatics. 2022;38(3):730–7.
Lee M. Recent advances in generative adversarial networks for gene expression data: a comprehensive review. Mathematics. 2023;11(14):3055.
Buccitelli C, Selbach M. mrnas, proteins and the emerging principles of gene expression control. Nat Rev Genet. 2020;21(10):630–44.
Gordon LG, White NM, Elliott TM, Nones K, Beckhouse AG, Rodriguez-Acevedo AJ, Webb PM, Lee XJ, Graves N, Schofield DJ. Estimating the costs of genomic sequencing in cancer control. BMC Health Serv Res. 2020;20(1):1–11.
Harris RS, Cechova M, Makova KD. Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data. Bioinformatics. 2019;35(22):4809–11.
Zang C, Wang T, Deng K, Li B, Hu S, Qin Q, Xiao T, Zhang S, Meyer CA, He HH. High-dimensional genomic data bias correction and data integration using mancie. Nat Commun. 2016;7(1):1–8.
Kuhn K, Baker SC, Chudin E, Lieu M-H, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS. A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res. 2004;14(11):2347–56.
Eldar YC. Mean-squared error sampling and reconstruction in the presence of noise. IEEE Trans Signal Process. 2006;54(12):4619–33.
Park S-W, Hao W-D, Leung CS. Reconstruction of uniformly sampled sequence from nonuniformly sampled transient sequence using symmetric extension. IEEE Trans Signal Process. 2011;60(3):1498–501.
Blagus R, Lusa L. Smote for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):1–16.
Gu Q, Wang X-M, Wu Z, Ning B, Xin C-S. An improved smote algorithm based on genetic algorithm for imbalanced data classification. J Digital Infor Manag. 2016;14(2):92–103.
Li X, Zhang L. Unbalanced data processing using deep sparse learning technique. Futur Gener Comput Syst. 2021;125:480–4.
Huang, D.H., Liu, D., Wen, M., Dong, X.L., Wen, M., Zhao, X.H.: A clustering method of gas load based on fcm-smote. In: E3S Web of Conferences, vol. 257, p. 01032 (2021). EDP Sciences
Van den Bulcke T, Van Leemput K, Naudts B, van Remortel P, Ma H, Verschoren A, De Moor B, Marchal K. Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006;7(1):1–12.
Schaffter T, Marbach D, Floreano D. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–70.
Maier R, Zimmer R, Küffner R. A turing test for artificial expression data. Bioinformatics. 2013;29(20):2603–9.
Chaudhari P, Agrawal H, Kotecha K. Data augmentation using mg-gan for improved cancer classification on gene expression data. Soft Comput. 2020;24(15):11381–91.
Kwon C, Park S, Ko S, Ahn J. Increasing prediction accuracy of pathogenic staging by sample augmentation with a gan. PLoS ONE. 2021;16(4):0250458.
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Ahmed KT, Sun J, Cheng S, Yong J, Zhang W. Multi-omics data integration by generative adversarial network. Bioinformatics. 2022;38(1):179–86.
Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017). PMLR
Marouf M, Machart P, Bansal V, Kilian C, Magruder DS, Krebs CF, Bonn S. Realistic in silico generation and augmentation of single-cell rna-seq data using generative adversarial networks. Nat Commun. 2020;11(1):1–12.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. Advances in neural information processing systems 30 (2017)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural information processing systems 27 (2014)
Zheng M, Li T, Zhu R, Tang Y, Tang M, Lin L, Ma Z. Conditional wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Inf Sci. 2020;512:1009–23.
Kipf TN, Welling M: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
Wu F, Souza A., Zhang T, Fifty C, Yu T, Weinberger K: Simplifying graph convolutional networks. In: International Conference on Machine Learning, pp. 6861–6871 (2019). PMLR
Zhang S, Tong H, Xu J, Maciejewski R. Graph convolutional networks: a comprehensive review. Comput Social Netw. 2019;6(1):1–23.
Petzka H, Fischer A., Lukovnicov D: On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894 (2017)
Tian X, Ding CH, Chen S, Luo B, Wang X. Regularization graph convolutional networks with data augmentation. Neurocomputing. 2021;436:92–102.
Wang Y, Wang Y, Yang J, Lin Z. Dissecting the diffusion process in linear graph convolutional networks. Adv Neural Inf Process Syst. 2021;34:5758–69.
Tran N-T, Tran V-H, Nguyen N-B, Nguyen T-K, Cheung N-M. On data augmentation for gan training. IEEE Trans Image Process. 2021;30:1882–97.
Grün D. Revealing dynamics of gene expression variability in cell state space. Nat Methods. 2020;17(1):45–9.
Wang J, Ma A, Chang Y, Gong J, Jiang Y, Qi R, Wang C, Fu H, Ma Q, Xu D. scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nat Commun. 2021;12(1):1–11.
Jin Q, Luo X, Shi Y, Kita K: Image generation method based on improved condition gan. In: 2019 6th international conference on systems and informatics (ICSAI), pp. 1290–1294 (2019). IEEE
G Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.
Hwang S, Kim CY, Yang S, Kim E, Hart T, Marcotte EM, Lee I. Humannet v2: human gene networks for disease research. Nucleic Acids Res. 2019;47(D1):573–80.
Kim CY, Baek S, Cha J, Yang S, Kim E, Marcotte EM, Hart T, Lee I. Humannet v3: an improved database of human gene networks for disease research. Nucleic Acids Res. 2022;50(D1):632–9.
Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, Minet T, Ochoa A, Gross BE, Iacobuzio-Donahue CA. Unifying cancer and normal rna sequencing data from different sources. Scientific data. 2018;5(1):1–8.
Tijmen T, Hinton G: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4(2), 26–31 (2012)
Li W, Xu L, Liang Z, Wang S, Cao J, Ma C, Cui X. Sketch-then-edit generative adversarial network. Knowl-Based Syst. 2020;203: 106102.
Rigatti SJ. Random forest. J Insur Med. 2017;47(1):31–9.
Peterson LE. K-nearest neighbor. Scholarpedia. 2009;4(2):1883.
Karlik B, Olgac AV. Performance analysis of various activation functions in generalized mlp architectures of neural networks. Int J Artif Intell Expert Syst. 2011;1(4):111–22.
Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research 9(11) (2008)
Mendiratta G, Ke E, Aziz M, Liarakos D, Tong M, Stites EC. Cancer gene mutation frequencies for the us population. Nat Commun. 2021;12(1):5961.
Piñero J, Saüch J, Sanz F, Furlong LI. The disgenet cytoscape app: exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 2021;19:2960–7.
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L. clusterprofiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021;2(3): 100141.
Kanehisa M, Goto S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Baranova A.: Ppar ligands as potential modifiers of breast carcinoma outcomes. PPAR research 2008 (2008)
Xu Y, Shu D, Shen M, Wu Q, Peng Y, Liu L, Tang Z, Gao S, Wang Y, Liu S: Development and validation of a novel ppar signaling pathway-related predictive model to predict prognosis in breast cancer. Journal of Immunology Research 2022 (2022)
Sultan G, Zubair S, Tayubi IA, Dahms H-U, Madar IH. Towards the early detection of ductal carcinoma (a common type of breast cancer) using biomarkers linked to the ppar ($\gamma$) signaling pathway. Bioinformation. 2019;15(11):799.
Méndez-García LA, Nava-Castro KE, Ochoa-Mercado T, Palacios-Arreola MI, Ruiz-Manzano RA, Segovia-Mendoza M, Solleiro-Villavicencio H, Cázarez-Martínez C, Morales-Montor J. Breast cancer metastasis: are cytokines important players during its development and progression? J Interferon & Cytokine Res. 2019;39(1):39–55.
Cao W, Li J, Hao Q, Vadgama JV, Wu Y. Amp-activated protein kinase: a potential therapeutic target for triple-negative breast cancer. Breast Cancer Res. 2019;21(1):1–10.
Song X, Wei C, Li X. The potential role and status of il-17 family cytokines in breast cancer. Int Immunopharmacol. 2021;95: 107544.
Balaban S, Shearer RF, Lee LS, van Geldermalsen M, Schreuder M, Shtein HC, Cairns R, Thomas KC, Fazakerley DJ, Grewal T. Adipocyte lipolysis links obesity to breast cancer growth: adipocyte-derived fatty acids drive breast cancer cell proliferation and migration. Cancer & metabolism. 2017;5(1):1–14.
Acevedo DS, Fang WB, Rao V, Penmetcha V, Leyva H, Acosta G, Cote P, Brodine R, Swerdlow R, Tan L. Regulation of growth, invasion and metabolism of breast ductal carcinoma through ccl2/ccr2 signaling interactions with met receptor tyrosine kinases. Neoplasia. 2022;28: 100791.
Cid LP, Roa-Rojas HA, Niemeyer MI, González W, Araki M, Araki K, Sepúlveda FV. Task-2: a k2p k+ channel with complex regulation and diverse physiological functions. Front Physiol. 2013;4:198.
Ye Q, Han X, Wu Z. Bioinformatics analysis to screen key prognostic genes in the breast cancer tumor microenvironment. Bioengineered. 2020;11(1):1280–300.
Li J, Han X. Adipocytokines and breast cancer. Curr Probl Cancer. 2018;42(2):208–14.
Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis. Adv Neural Inf Process Syst. 2021;34:8780–94.

Acknowledgements

The authors are grateful to anonymous referees for their helpful comments.

Funding

This research is supported by the National Natural Science Foundation of China under Grant No. 62366007, Guangxi Natural Science Foundation under Grant No. 2022GXNSFAA035625, the National Natural Science Foundation of China under Grant No. 62302107, “Bagui Scholar” Project Special Funds, Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

Author information

Authors and Affiliations

College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
Rongyuan Li & Qi Zhu
Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
Jingli Wu, Jiafei Liu & Junbo Xuan
Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, China
Gaoshi Li

Contributions

RL participated in the data collection, data preprocessing, model design, and draft writing. JW participated in the concept, design and critical revision on the manuscript. GL and JL participated in the syntax modification of this paper. JX and QZ analyzed the experiments. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jingli Wu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Li, R., Wu, J., Li, G. et al. Mdwgan-gp: data augmentation for gene expression data based on multiple discriminator WGAN-GP. BMC Bioinformatics 24, 427 (2023). https://doi.org/10.1186/s12859-023-05558-9

Received: 04 August 2023
Accepted: 06 November 2023
Published: 13 November 2023
DOI: https://doi.org/10.1186/s12859-023-05558-9
