MDWGAN-GP: data augmentation for gene expression data based on multiple discriminator WGAN-GP
BMC Bioinformatics volume 24, Article number: 427 (2023)
Abstract
Background
Although gene expression data play significant roles in biological and medical studies, their application is hampered by the difficulty and high cost of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied to augmenting gene expression data. However, mode collapse or overfitting may occur for small training samples because only one discriminator is adopted in the method.
Results
In this study, an improved data augmentation approach, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method based on a linear graph convolutional network is devised for enriching training samples. Extensive experiments were conducted on real biological data.
Conclusions
The experimental results demonstrate that, compared with other state-of-the-art methods, the MDWGAN-GP method produces higher-quality generated gene expression data in most cases.
Introduction
Over the last two to three decades, the rapid development of genome sequencing technology has made it possible to measure the expression levels of thousands of genes from a biological sample simultaneously. Since gene expression data, extracted by various gene profiling technologies, directly reflect the physiological state and disease of the human body [1], many computational techniques such as regression, classification and clustering can be applied to them to uncover disease mechanisms, propose novel drug targets, provide a basis for comparative genomics, and address a wide range of fundamental biological problems [2].
Nevertheless, gene expression profile data are fundamentally limited in sample size, diversity, and the speed at which they can be gathered [3], owing to ethical challenges [4] and the high cost of gathering gene expression data through biological experiments. For example, the per-person costs were US$604–1932 for exome sequencing and US$2006–3347 for whole genome sequencing in 2018 [5]. In addition, substantial bias and noise, resulting from errors in the splicing process of short reads [6] and from various batch effects [7], make it a great challenge to take advantage of gene expression data effectively. Therefore, it is desirable to generate biologically plausible synthetic gene expression data, which can be applied in downstream tasks such as marker gene detection, cell type clustering, gene association identification, and cancer stage prediction [3]. In recent years, data augmentation (DA) methods, which are capable of enriching data sets and mitigating data imbalance and data noise issues, have been extensively studied for generating synthetic gene expression data.
To the best of our knowledge, data augmentation methods for generating gene expression data generally fall into three categories: sample-based, simulator-based, and generative adversarial network-based. The sample-based methods include random sampling [8], mean sampling [9], resampling [10], and oversampling [11, 12], which are prone to overfitting [13] or distribution marginalization [14]. The simulator-based methods [15, 16] generate synthetic transcriptomics datasets based on known regulatory networks. Since they perform similarly to random simulators [2], they cannot reproduce the key features of gene expression data [17]. With the rapid development of deep learning technology, the Generative Adversarial Network (GAN)-based method, which can produce more diverse and higher-quality samples than the former two categories, has received major attention [1, 2]. It is also the subject of this paper.
In 2020, Chaudhari et al. [18] first proposed the modified generator GAN (MG-GAN), which is fed with original data along with minimalistic multivariate noise to generate data with a Gaussian distribution. In 2021, Kwon et al. [19] indicated that GANs are not effective with whole genes, and expanded RNA expression data for selected significant genes using GANs. Both methods adopt the original unconditioned generative model, which has no control over the modes of the data being generated [20]. In 2022, Ahmed et al. [21] developed the omicsGAN method to integrate two omics data types and their interaction network into a Wasserstein Generative Adversarial Network (WGAN) [22]. Nevertheless, gradient explosion is common when training WGAN. In 2020, Marouf et al. [23] adopted conditional single-cell generative adversarial neural networks (cscGAN) to produce single-cell RNA-seq data. cscGAN learns nonlinear gene-gene dependencies from complex, multiple cell type samples and uses this information to generate realistic cells of defined types. In 2022, Han et al. [1] put forward the Gene-CWGAN method, which stabilizes the distribution of generated samples with a dataset partition method and adopts a constraint penalty term to improve the diversity of generated samples. In the same year, Viñas et al. [2] proposed a new simulator (referred to as SWGAN-GP in this paper) based on WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty) [24]. SWGAN-GP concatenates the sample covariates with the input features and samples the class labels from the real distribution. The SWGAN-GP simulator can be used at a larger scale to produce tissue- and organ-specific transcriptomics data.
In the process of training generative adversarial networks, mode collapse is a serious concern. Improving the diversity of training samples, as well as of feedback signals, may be an effective way to alleviate the problem. Among the above-mentioned approaches, the diversity of feedback signals may be constrained because only one discriminator is adopted in the GANs. Therefore, in this paper, the collaboration of multiple discriminators is explored. The main contributions of this paper are summarized as follows:

1. The multiple discriminator WGAN-GP (MDWGAN-GP) model is proposed to ensure the high quality of the generated gene expression data. Multiple discriminators are adopted to prevent mode collapse by providing more feedback signals to the generator.

2. A novel approach based on a linear graph convolutional network (GCN) is put forward to enrich training samples, avoiding the overfitting or mode collapse caused by small sample size in high-dimensional data.

3. Pan-cancer gene expression datasets were produced to demonstrate the effectiveness of the MDWGAN-GP approach. A data preprocessing method selects the genes with high confidence or top ranking from protein-protein interaction networks, so as to relieve the curse of dimensionality encountered in training. Extensive experiments were conducted to compare the quality of generated gene expression data between the MDWGAN-GP method and other state-of-the-art ones.
Preliminaries
Conditional generative adversarial network
The conditional generative adversarial network (CGAN) [20] attempts to generate samples of specified labels from input labels and noise. Like the standard generative adversarial network (GAN) [25], a CGAN model consists of a generation network G and a discrimination network D. Given some noise z and conditional information y (e.g. category labels, data with different modalities), the generator G learns to produce synthetic samples similar to the real distribution. The discriminator D needs to distinguish whether the input sample comes from the real data distribution p(x) or was produced by the generator G from noise drawn from p(z). The loss function of CGAN can be formulated as:
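In the notation above, the CGAN objective of [20] is usually written as follows. This is the standard form from the literature, given here as a sketch under the assumption that p(x) and p(z) denote the real-data and noise distributions:

\[
\min _{G}\max _{D} V(D,G) = E_{x\sim p(x)}[\log D(x|y)] + E_{z\sim p(z)}[\log (1 - D(G(z|y)))]
\]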
Conditional Wasserstein generative adversarial network with gradient penalty
Different from CGAN, the Wasserstein generative adversarial network (WGAN) [22] tries to generate samples with just input noise. It applies the Wasserstein distance instead of the Jensen-Shannon (JS) divergence to evaluate the distance between the distributions of the real samples and the generated ones, making the training process more stable and faster than that of the standard generative adversarial network. The Wasserstein generative adversarial network with gradient penalty (WGAN-GP) [24] is a modified model based on WGAN that penalizes the norm of the gradient of the discriminator with respect to its input. In 2020, Zheng et al. [26] further improved the WGAN-GP model by adding conditional information and proposed the CWGAN-GP model, whose loss function can be formulated as:
where \(E_{{\hat{x}}\sim p({\hat{x}})}[(\Vert \nabla _{{\hat{x}}}D({\hat{x}}|y)\Vert _2 - 1)^2]\) is the gradient penalty term.
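The CWGAN-GP loss referred to above can be written as below, assuming it follows the WGAN-GP form of [24] with every discriminator input conditioned on y. The symbols \({\tilde{x}}=G(z|y)\) for a generated sample and \(\lambda\) for the penalty weight are assumptions consistent with the rest of the paper, and \({\hat{x}}\) is drawn between real and generated samples as in [24]:

\[
L = E_{{\tilde{x}}\sim p({\tilde{x}})}[D({\tilde{x}}|y)] - E_{x\sim p(x)}[D(x|y)] + \lambda E_{{\hat{x}}\sim p({\hat{x}})}[(\Vert \nabla _{{\hat{x}}}D({\hat{x}}|y)\Vert _2 - 1)^2]
\]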
Graph convolutional network
The emerging graph convolutional networks (GCNs) [27,28,29] are able to extract spatial correlations well in non-Euclidean structures and maintain shift-invariance. Let G=(V, E) be an undirected graph, where V and E represent the set of nodes \(v_{i}\) \(\in\) \(V\) (i=1,2,...,n) and edges (\(v_{i}\),\(v_{j}\))\(\in\) \(E\), respectively. \(A\) \(\in\) \(R^{n\times n}\) is the adjacency matrix of G, where \(A_{ij}\) indicates whether there is an edge between \(v_{i}\) and \(v_{j}\), or the similarity between them based on a similarity measure. Let \(H^{(l)}\) represent the graph node representations at the lth (\(l\) \(\in\) \(N\)) layer; the propagation rule for calculating the graph node representations at the \((l+1)\)th layer is formulated as:
where f(\(\cdot\)) is a nonlinear activation function, \(\widetilde{A}\)=A+I, and \(W^{(l)}\) is the weight matrix of the lth layer. \({\widetilde{D}}^{-\frac{1}{2}}\widetilde{A}{\widetilde{D}}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix with self-loops, where \({\widetilde{D}}_{ii}\)=\(\sum _{j=1}^{n}{\widetilde{A}}_{ij}\).
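With the quantities defined above, the propagation rule takes the familiar form used for GCNs in the literature. It is restated here as a sketch consistent with those definitions:

\[
H^{(l+1)} = f\left( {\widetilde{D}}^{-\frac{1}{2}}\widetilde{A}{\widetilde{D}}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)
\]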
Proposed method
Recently, Viñas et al. [2] proposed a WGAN-GP-based simulator, SWGAN-GP, to generate specific tumour gene expression data. Although conditional restrictions are added, mode collapse or overfitting may still occur for small training samples because only one discriminator is adopted. In addition, WGAN-GP harbours some inherent defects, such as unstable training and failure to generate diverse samples [1, 30]. Therefore, in this section, an improved data augmentation approach, the multiple discriminator WGAN-GP (MDWGAN-GP) model, is proposed. We begin by enriching the training samples with linear graph convolution [31, 32]; then a generative adversarial network with multiple discriminators is devised based on WGAN-GP. The concrete descriptions are as follows. The source code of the MDWGAN-GP method can be downloaded from https://github.com/lryup/MDWGANGP.
Enriching training samples
It is generally accepted that enriched training samples help a GAN capture the original distribution [33]. Inspired by the methods used to enrich image training samples, i.e., rotation, flipping, and cropping, a novel approach suitable for gene expression data is proposed. Given a raw gene expression matrix \(X_1\) with n rows (samples) and m columns (genes), where each entry represents the expression level of a given gene in a particular sample, a pair of K-Nearest Neighbors (KNN) graphs [34, 35] \(G_E\) and \(G_C\) are built from matrix \(X_1\) based on Euclidean distance and Cosine distance, respectively. Each vertex denotes a sample, and an edge indicates a strong relationship between the two connected samples. Linear graph convolution is performed to update the vertices (samples), i.e., to aggregate the information of their neighbors. The updated gene expression matrices \(X_2\) and \(X_3\) are given as follows:
where f(\(\cdot\)) is a linear activation function. \(\widetilde{A}_E\)=\(A_E\)+I (resp. \(\widetilde{A}_C\)=\(A_C\)+I), where \(A_E\) and \(A_C\) are the adjacency matrices of graphs \(G_E\) and \(G_C\), respectively. \(({\widetilde{D}}_E)_{ii}\)=\(\sum _{j=1}^{n}({\widetilde{A}}_E)_{ij}\), \(({\widetilde{D}}_C)_{ii}\)=\(\sum _{j=1}^{n}({\widetilde{A}}_C)_{ij}\).
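The enrichment step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: edges are unweighted, the graph is symmetrized, the propagation is a single step \({\widetilde{D}}^{-\frac{1}{2}}\widetilde{A}{\widetilde{D}}^{-\frac{1}{2}}X_1\) with identity activation and no weight matrix, and the function names and the value k=3 are hypothetical.

```python
import numpy as np

def knn_adjacency(X, k, metric="euclidean"):
    """Symmetric 0/1 KNN adjacency matrix over the rows (samples) of X."""
    if metric == "euclidean":
        sq = np.sum(X ** 2, axis=1)
        dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    elif metric == "cosine":
        unit = X / np.linalg.norm(X, axis=1, keepdims=True)
        dist = 1.0 - unit @ unit.T
    else:
        raise ValueError(f"unknown metric: {metric}")
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros_like(dist)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nearest.ravel()] = 1.0
    return np.maximum(A, A.T)                 # make the graph undirected

def linear_graph_conv(X, A):
    """One propagation step D^{-1/2} (A + I) D^{-1/2} X with identity activation."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return (d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]) @ X

rng = np.random.default_rng(0)
X1 = rng.normal(size=(8, 5))                  # toy matrix: 8 samples, 5 genes
X2 = linear_graph_conv(X1, knn_adjacency(X1, k=3, metric="euclidean"))
X3 = linear_graph_conv(X1, knn_adjacency(X1, k=3, metric="cosine"))
```

Each row of `X2` and `X3` is a degree-weighted mixture of a sample and its neighbours, so both matrices keep the shape of `X1` while smoothing each sample toward similar ones.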
Adversarial simulator for augmenting gene expression data
It has been reported that adopting multiple discriminators can improve the stability of the optimization process [33]. In this subsection, an adversarial simulator MDWGAN-GP with three discriminators is devised, as shown in Fig. 1.
Figure 1a shows the SWGAN-GP model, and Fig. 1b illustrates the structure of the MDWGAN-GP model proposed in this paper. In the MDWGAN-GP model, the distribution of the original data is expected to be learned from the two updated gene expression matrices \(X_2\) and \(X_3\) in addition to the raw gene expression matrix \(X_1\). Hence two more discriminators, \(D_2\) and \(D_3\), are added and fed with \(X_2\) and \(X_3\), respectively. Nevertheless, it is worth noting that the generator is still expected to learn principally from the raw samples \(X_1\) rather than from the updated ones, which play auxiliary roles in the training process.
The objective function
In a generative adversarial network, the generator tries to produce samples that look real enough to trick the discriminator, while the discriminator attempts to distinguish the generated samples from the real ones. Here the objective functions are designed for one generator and three discriminators in MDWGAN-GP, as illustrated in Equation (6):
where Y indicates the conditional labels. \(\lambda\) is a hyperparameter determining the strength of the gradient penalty \(E_{{\hat{X}}_i\sim p({\hat{X}}_i)}[(\Vert \nabla _{{\hat{X}}_i}D_i({\hat{X}}_i|Y)\Vert _2 - 1)^2]\). \(X_i\) denotes the real samples, Z denotes the noise samples, and \({\hat{X}}_i\) represents the samples randomly chosen from the real ones or the generated ones. The overall optimization objective functions of the generator and the discriminators are formulated as Equation (7) and Equation (8):
where \({\lambda _g}\) and \({\lambda _d}\) denote two small adjustable parameters assisting model learning. All discriminators are trained through weight sharing to improve model performance [33].
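For orientation, the per-discriminator objective of Equation (6) can be sketched as below, assuming each discriminator \(D_i\) follows the conditional WGAN-GP form described in the Preliminaries. The name \(L_{D_i}\) is an assumption introduced only for this sketch:

\[
L_{D_i} = E_{Z\sim p(Z)}[D_i(G(Z|Y)|Y)] - E_{X_i\sim p(X_i)}[D_i(X_i|Y)] + \lambda E_{{\hat{X}}_i\sim p({\hat{X}}_i)}[(\Vert \nabla _{{\hat{X}}_i}D_i({\hat{X}}_i|Y)\Vert _2 - 1)^2], \quad i=1,2,3
\]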
Architecture
Figure 2 shows the architecture of the proposed simulator MDWGAN-GP. The generator G receives a noise vector Z and a conditional label Y as input and produces a vector \(X'\) of synthetic expression values. The discriminator \(D_i\) (i=1,2,3) takes either a real gene expression sample \(X_i\) or a synthetic sample \(X'\), in addition to a conditional label Y, and tries to distinguish whether the input sample is real or fake. Matrices \(X_2\) and \(X_3\) are produced by linear graph convolution on the sample graphs \(G_E\) and \(G_C\), which are constructed from matrix \(X_1\) based on Euclidean distance and Cosine distance, respectively.
Experimental details
The effectiveness of MDWGAN-GP is verified through extensive experiments. We began by comparing the performance of CGAN [20], CWGAN [36], CWGAN-GP [26], Gene-CWGAN [1], SWGAN-GP [2], and MDWGAN-GP in terms of the similarity dist(\(\cdot\), \(\cdot\)) on fifteen datasets, and the diversity of the samples generated by these models through sample dimension visualization. Then we compared the models by the classification ability of their generated samples. Next, we compared the models in terms of the correlations among key genes. Finally, we compared the differentially expressed genes identified using the generated datasets with those identified using the real ones.
Data preparation and parameter settings
In the experiments, real biological datasets are acquired from four databases:
(1) The Cancer Genome Atlas (TCGA). It is a public biospecimen repository which aims to improve the understanding of the molecular mechanisms of cancers. The database contains high-throughput genomic data from over 20,000 primary cancer and matched healthy samples spanning 33 cancer types.
(2) The Genotype-Tissue Expression (GTEx). It is also a public resource, built to study tissue-specific gene expression and regulation. It contains samples collected from 54 non-diseased tissue sites across nearly 1000 individuals [37].
(3) The STRING dataset. STRING is a database which records known and predicted protein-protein interactions, including physical as well as functional connections. The latest Human Protein Interaction Network, version 11.5, was adopted in the experiments.
(4) The HumanNet dataset. HumanNet [38] is a database that covers 99.8% of human protein-coding genes. The latest functional gene network (HumanNet-FN), version 3 [39], was adopted in the experiments.
The data preparation was conducted as follows. Firstly, the raw RNA-seq sample datasets of TCGA and GTEx were acquired from Wang et al. [40]. Fifteen tissues common to the TCGA and GTEx datasets were selected to construct the GT dataset, which consisted of 9,147 samples and 18,154 genes. Secondly, the STRING PPI network consisted of 11,938,499 edges and 19,385 proteins, of which 360,783 edges and 14,220 proteins were retained after filtering out the edges with a score less than 800. The mapping from protein ID to gene ID, and then to gene name, was conducted with the Genome Reference Consortium Human Build 38 (GRCh38) database and the R packages AnnotationDbi and org.Hs.eg.db. Then 13,035 genes remained after dropping duplicates, since some proteins correspond to multiple genes. Thirdly, among the 977,495 edges and 18,458 genes of HumanNet, 15,443 genes and 97,749 edges were left by choosing the top 10% most reliable edges. Finally, the genes that did not belong to the STRING or the HumanNet PPI networks were dropped from the GT dataset, leaving 9,147 samples and 10,612 genes. Both logarithmic transformation and z-score standardization were adopted to normalize the gene expression values. The numbers of samples of the fifteen common tissues are listed in Table 1.
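The normalization step mentioned above could look like the following sketch. The text does not state the base of the logarithm, the pseudocount, or the axis of standardization, so the use of log2(x + 1) and a per-gene z-score here are assumptions, and the function name is hypothetical.

```python
import numpy as np

def normalize_expression(counts):
    """log2(x + 1) transform followed by a per-gene (column-wise) z-score."""
    logged = np.log2(counts + 1.0)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0                       # constant genes map to zero
    return (logged - mean) / std

counts = np.array([[0.0, 10.0, 100.0],
                   [3.0, 20.0, 100.0],
                   [7.0, 40.0, 100.0]])      # toy matrix: 3 samples, 3 genes
Z = normalize_expression(counts)
```

After this step every non-constant gene has zero mean and unit variance across samples, which keeps genes with very different dynamic ranges on a comparable scale during training.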
In the experiments, 10% of the samples in each dataset were randomly selected as the training set, while the remaining 90% served as the test set. Both the generator and the discriminator models included two fully connected hidden layers, each of which had 256 neurons. The hidden layers adopted the ReLU activation function, and the output layer did not use any. The RMSProp optimizer was used with a learning rate of 0.0005 [41]. The hyperparameters were set as follows: \(\lambda\)=10 [24], \(\lambda _g\)=0.2, and \(\lambda _d\)=0.02 [33]. The training process was terminated when the validation score dist(\(D^X\), \(D^Z\)) had not improved for 20 consecutive evaluations, or when it reached the maximum of 500 iterations.
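The stopping rule described above can be sketched in plain Python. The callback `evaluate` is hypothetical: it stands for one training iteration followed by computing the validation score, where a higher score is better.

```python
def train_with_early_stopping(evaluate, max_iters=500, patience=20):
    """Run until `evaluate` fails to improve `patience` times in a row.

    `evaluate(step)` is a hypothetical callback that performs one training
    iteration and returns the validation score (higher is better).
    """
    best_score = float("-inf")
    best_step = -1
    stale = 0
    for step in range(max_iters):
        score = evaluate(step)
        if score > best_score:
            best_score, best_step, stale = score, step, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_score, best_step, step + 1

# Toy score curve that rises until step 30 and then plateaus.
best, best_at, steps_run = train_with_early_stopping(lambda s: min(s, 30) / 30.0)
```

In the toy run the score stops improving after step 30, so the loop ends once 20 further evaluations bring no gain. A score that keeps improving runs for the full 500 iterations.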
Evaluation index
In this section, the evaluation indexes for estimating the performance of a generative model are described. Assume that \(X_{m_1\times n}\) and \(Z_{m_2\times n}\) are a pair of matrices recording real and synthetic gene expression observations, respectively. Their rows denote a set of \(m_1\) real cancer samples and \(m_2\) synthetic ones, respectively, their columns denote a set of n genes, and their entries are real numbers, i.e., \(x_{ij}\), \(z_{ij}\) \(\in\) \(R\). Let \(D^X\) and \(D^Z\) be a pair of \(n\) \(\times\) \(n\) symmetric matrices corresponding to X and Z. In matrix \(D^X\) (resp. \(D^Z\)), each entry \(d_{jk}^{X}\) (resp. \(d_{jk}^{Z}\)) records the pairwise distance between the jth and the kth genes, i.e., the Pearson correlation coefficient between columns \(x_{j}\) (resp. \(z_{j}\)) and \(x_{k}\) (resp. \(z_{k}\)), as defined in Equation (9) (resp. Equation (10)):
where \({\bar{x}}_{j}\)=\(\frac{\sum \limits _{i=1}^{m_1}x_{ij}}{m_1}\), \({\bar{x}}_{k}\)=\(\frac{\sum \limits _{i=1}^{m_1}x_{ik}}{m_1}\), \({\bar{z}}_{j}\)=\(\frac{\sum \limits _{i=1}^{m_2}z_{ij}}{m_2}\), \({\bar{z}}_{k}\)=\(\frac{\sum \limits _{i=1}^{m_2}z_{ik}}{m_2}\).
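With the column means defined above, the Pearson correlation coefficient between genes j and k takes its usual form. It is written here for the real data as a sketch of Equation (9); the expression for \(d_{jk}^{Z}\) is obtained by replacing x with z and \(m_1\) with \(m_2\):

\[
d_{jk}^{X} = \frac{\sum _{i=1}^{m_1}(x_{ij}-{\bar{x}}_{j})(x_{ik}-{\bar{x}}_{k})}{\sqrt{\sum _{i=1}^{m_1}(x_{ij}-{\bar{x}}_{j})^2}\sqrt{\sum _{i=1}^{m_1}(x_{ik}-{\bar{x}}_{k})^2}}
\]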
Let dist(\(D^X\), \(D^Z\)) represent the similarity between matrices \(D^X\) and \(D^Z\), measuring whether the pairwise correlations between genes in the real data are correlated with those in the synthetic data, as defined in Equation (11) [2]:
where \(\mu (D^X)\) and \(\sigma (D^X)\) are defined as Equation (12) and Equation (13), and \(\mu (D^Z)\) and \(\sigma (D^Z)\) are defined accordingly.
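A minimal NumPy sketch of this similarity follows, assuming that dist(\(\cdot\), \(\cdot\)) is the Pearson correlation between the off-diagonal entries of \(D^X\) and \(D^Z\), as the reference to [2] and to the mean \(\mu\) and standard deviation \(\sigma\) suggests. The function name and the toy data are hypothetical.

```python
import numpy as np

def correlation_similarity(X, Z):
    """Pearson correlation between the gene-gene correlation matrices of X and Z."""
    D_x = np.corrcoef(X, rowvar=False)            # n x n matrix D^X
    D_z = np.corrcoef(Z, rowvar=False)            # n x n matrix D^Z
    upper = np.triu_indices(D_x.shape[0], k=1)    # each gene pair once, no diagonal
    return float(np.corrcoef(D_x[upper], D_z[upper])[0, 1])

rng = np.random.default_rng(0)
mixing = rng.normal(size=(6, 6))
X = rng.normal(size=(200, 6)) @ mixing            # "real" samples with correlated genes
Z = rng.normal(size=(150, 6)) @ mixing            # "synthetic" samples, same structure
score = correlation_similarity(X, Z)
```

The score is 1 when the synthetic data reproduce the gene-gene correlation structure of the real data exactly, and it decreases as the two correlation structures diverge. The numbers of real and synthetic samples do not need to match.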
In addition, the classification performance obtained with the synthetic gene expression data is also adopted to measure the performance of a generative model, as defined in Equations (14) to (18):
where TP (resp. TN) denotes the number of positive (resp. negative) samples correctly labeled by the classifier. FP (resp. FN) represents the number of negative (resp. positive) samples incorrectly labeled as positive (resp. negative) ones. Mcc denotes the Matthews correlation coefficient.
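The following sketch computes common confusion-matrix metrics from these four counts. Only Mcc is named explicitly in the text, so the choice of accuracy, precision, recall and F1 as the remaining four measures is an assumption, as are the function name and the toy counts.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these toy counts the accuracy is 0.85 and the recall is 0.8. Mcc uses all four cells of the confusion matrix, which makes it more informative than accuracy when the normal and cancer classes are imbalanced.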
Comparison of similarity dist(\(\cdot\), \(\cdot\)) of different models
In Table 2, the similarity dist(\(\cdot\), \(\cdot\)) is compared among the different models. For each dataset, the generated sample set has the same size as the corresponding test set. From this table we can see that the presented model MDWGAN-GP outperforms the other models on 11 of the 15 datasets. Its average dist(\(\cdot\), \(\cdot\)) over all of the datasets is 0.704, which is higher than those of the other five models.
In addition, Figs. 3, 4 and 5 compare the distributions of the generated samples and the real samples for the first 11 genes, intuitively reflecting the diversity of the generated samples. In all figures, the horizontal coordinates indicate the gene index, and the vertical ones denote the gene expression values. The red line represents the real samples, and the blue one represents the generated samples.
From Figs. 3, 4 and 5 we can see that, compared with the samples generated by the other five models, those generated by MDWGAN-GP generally have distributions more similar to the real samples. The samples generated by CGAN concentrate in a very narrow range, indicating that the original data distribution and the generated data distribution have a negligible overlapping area, because the JS divergence adopted by CGAN may lead to vanishing gradients and mode collapse [42]. CWGAN adopts the Wasserstein distance to solve the problem of mode collapse. However, it generates samples deviating from the original values due to gradient explosion resulting from the absence of a gradient penalty [24]. CWGAN-GP avoids gradient explosion effectively by adding the gradient penalty. Nevertheless, because the true value range of each feature is unknown and the output layer activation function of CWGAN-GP forcibly limits the generation space [1], the diversity of its samples remains poor at the distribution margins. Gene-CWGAN expands the generation space by removing the tanh activation function of the CWGAN-GP generator, and avoids the growth of learning fluctuations with a constraint penalty term [1]. Nevertheless, the generated samples may deviate from the original ones. As shown in Fig. 4d, the maximum original values of the 0th and the 3rd genes are 7 and 8, respectively, while their maximum generated values are close to 9 and 10, respectively. Similar to Gene-CWGAN, SWGAN-GP also expands the generation space by removing the tanh activation function, and it can generate sample data with specified conditions. In order to further improve model stability and the diversity of generated samples, the MDWGAN-GP method produces enriched training samples and feeds them to multiple discriminators. As shown in Figs. 3, 4 and 5, the samples generated by MDWGAN-GP have more satisfactory diversity at the distribution margins.
Comparison of classification ability of samples generated based on different models
As illustrated in Tables 3, 4 and 5, the classification ability of the generated samples is evaluated in terms of classifying the normal and the cancer samples. In the experiments, three classical classification methods, i.e., random forests (RF) [43], K-nearest neighbors (KNN) [44], and multilayer perceptron (MLP) [45], were adopted. The number of trees \(n_{estimators}\) was 200 for RF, the number of neighbours K was 5 for KNN, and two hidden layers with 128 units and the ReLU activation function were adopted for MLP. For each method, the average results of ten runs are calculated and presented. It can be seen from the three tables that, for all three methods, the samples generated with the MDWGAN-GP model achieve the best classification ability in the vast majority of cases. Furthermore, with the classification methods RF and KNN, the samples generated with the MDWGAN-GP model even yield better classification performance than the real samples (denoted as “Real” in the three tables).
Furthermore, in order to intuitively reflect the clustering ability of the samples generated by MDWGAN-GP, we compare the clustering results on the real samples with those on the generated ones. As shown in Figs. 6 and 7, three datasets, namely Colon, Thyroid and Lung, were adopted. The dimensionality of each sample was reduced to two with t-SNE [46]. From Fig. 6 we can see that the generated samples almost overlap with the real ones. Moreover, the clustering results in Fig. 7 demonstrate that the generated samples present better linear separability than the real ones, indicating that it might be better to perform differential analysis between normal and cancer tissues using the generated datasets.
Ablation experiments
As mentioned before, the training samples were enriched with linear graph convolution in the MDWGAN-GP method. Here a series of ablation experiments were conducted on the GT dataset. The training set was constructed by randomly selecting 10% of the samples from each tissue, and the remaining 90% of the samples were used as the test set. Figure 8 compares the similarity between the real data and the generated data in terms of dist(\(\cdot\), \(\cdot\)). In this figure, MDWGAN-GP-C (resp. MDWGAN-GP-E) represents the model adopting only Cosine distance (resp. Euclidean distance). As can be seen from the figure, the MDWGAN-GP model has the highest dist(\(\cdot\), \(\cdot\)) among the four models. In the subsequent two subsections, experiments were conducted to further test the usability of the samples generated with the MDWGAN-GP method.
Comparison of the correlations among key genes
The ten most frequently mutated genes in human cancers [47] were chosen as key genes. The correlations among them are calculated and presented based on the generated and the real expression data, respectively. As can be seen in Fig. 9, a pair of \(10 \times 10\) symmetric matrices record the distance \(d_{jk}\) (j,k=1,2,...,10) among the ten key genes. Figure 9a represents the correlations based on the real samples, while Fig. 9b represents those based on the samples generated by MDWGAN-GP. It can be seen that the distances among genes calculated from the two kinds of samples are close, indicating that the correlations among genes in the generated data closely approximate those in the real data.
Comparison of differentially expressed genes (DEGs)
As analyzed above, compared with using the real datasets, it might be better to conduct differential analysis between normal and cancer tissues using the generated ones. In this section, the differentially expressed genes identified based on the generated datasets are further compared with those identified based on the real ones. Eighty percent of all pan-cancer samples were randomly selected as the training set, and the same number of samples were generated with MDWGAN-GP. The DESeq2 package of R was called to calculate the fold change and p-value for each gene using the denormalized generated expression data, and the genes with \(log2(fold \ change)\) greater than 3 and p-values less than 0.05 were selected as differentially expressed genes. For convenience of description, we use “real-DEGs” and “fake-DEGs” to denote the DEGs ascertained based on the real and the generated datasets, respectively.
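The selection rule above amounts to a simple filter over the per-gene statistics, sketched below in plain Python. The gene names and values are made-up placeholders, the function name is hypothetical, and applying the threshold to the absolute value of the log2 fold change (so that down-regulated genes are kept as well) is an assumption, since the text only says "greater than 3".

```python
def select_degs(results, lfc_threshold=3.0, p_threshold=0.05):
    """Return names of genes passing both the fold-change and p-value cut-offs.

    `results` maps a gene name to a (log2 fold change, p-value) pair, e.g. as
    exported from a differential expression tool.
    """
    return sorted(
        gene
        for gene, (log2_fc, p_value) in results.items()
        if abs(log2_fc) > lfc_threshold and p_value < p_threshold
    )

toy_results = {
    "GENE_A": (3.8, 0.001),    # up-regulated and significant: kept
    "GENE_B": (-4.2, 0.020),   # down-regulated and significant: kept
    "GENE_C": (3.5, 0.300),    # large change but not significant: dropped
    "GENE_D": (1.1, 0.001),    # significant but small change: dropped
}
degs = select_degs(toy_results)
```

Running the same filter on the statistics obtained from the real and from the generated data gives the two gene sets that are compared in the rest of this section.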
As shown in Table 6, for most cancer types, the number of fake-DEGs approximates that of real-DEGs. Additionally, breast cancer was taken as an example to analyze the association between DEGs and cancers. Firstly, among the top 286 real-DEGs (resp. fake-DEGs), 165 (resp. 177) breast cancer related genes were ascertained based on the DisGeNET database (v7.0) [48]. The number of breast cancer related DEGs obtained from the generated data is thus greater than that obtained from the real data.
Secondly, the clusterProfiler package of R [49] was called to conduct enrichment analysis for the DEGs based on the KEGG database [50]. As displayed in Fig. 10, both real-DEGs and fake-DEGs are enriched in nine biological pathways. The color of the bars indicates the degree of significance, and their length gives the number of DEGs enriched. Among the two groups of enriched biological pathways, seven breast cancer related pathways are enriched by both real-DEGs and fake-DEGs. The PPAR signaling pathway has been reported as a potential biomarker for the diagnosis of breast cancer [51,52,53]. Cytokine-cytokine receptor interaction plays an important role in the metastasis and development of breast cancer [54]. Aberrant AMPK signaling pathways may play a role in the regulation of growth, survival and the development of drug resistance in triple-negative breast cancer [55]. The IL-17 signaling pathway has been demonstrated to promote the proliferation, invasion and metastasis of breast cancer cells, and is significantly associated with poor prognosis in breast cancer patients [56]. The regulation of lipolysis in adipocytes pathway promotes the proliferation and migration of breast cancer cells [57]. The tyrosine metabolism pathway regulates the development of breast cancer [58]. The proximal tubule bicarbonate reclamation pathway indirectly regulates the proliferation of breast cancer cells through TASK-2 [59]. In addition, a pair of breast cancer related biological pathways, i.e., Viral protein interaction with cytokine and cytokine receptor and the Adipocytokine signaling pathway, are enriched only by the fake-DEGs. Viral protein interaction with cytokine and cytokine receptor has been reported to be significant for breast cancer [60]. The Adipocytokine signaling pathway can mediate the survival, growth, invasion, and metastasis of breast cancer cells through different cellular and molecular mechanisms, thus reducing survival time and contributing to malignancy [61].
Figure 11 (resp. Fig. 12) further illustrates the top five pathways enriched by real-DEGs (resp. fake-DEGs) in terms of adjusted p-values. The steel-blue nodes represent the pathways, and their size indicates the number of DEGs enriched. The other small colored nodes represent the DEGs, and their color indicates the value of \(log2(fold \ change)\).
Conclusions and future directions
Since it is both difficult and expensive to gather gene expression data with biological experiments, generating them through computational approaches has attracted great attention. In this study, a generative adversarial network model with multiple discriminators, MDWGAN-GP, is put forward. A novel method based on a linear graph convolutional network is designed for enriching training samples. Compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher-quality generated gene expression data in most cases. In addition, some critical biomarkers, enriched in significant biological pathways, are identified based on the generated data. All of these findings have been verified through extensive experiments performed on real biological data.
However, during the experiments, we found that GAN and its improved versions have the inherent defect of being difficult to train. It has been reported that the diffusion model can ensure sample diversity by adding and removing noise step by step [62]. It is expected to do well in generating high-quality and diverse gene expression data, which will be studied in the future.
Availability of data and materials
The datasets used in this paper and the source code of MDWGANGP are available at https://github.com/lryup/MDWGANGP.
References
Han F, Zhu S, Ling Q, Han H, Li H, Guo X, Cao J. Genecwgan: a data enhancement method for gene expression profile based on improved cwgangp. Neural Computing Appl. 2022;1–15:16325–39.
Viñas R, Andrés-Terré H, Liò P, Bryson K. Adversarial generation of gene expression data. Bioinformatics. 2022;38(3):730–7.
Lee M. Recent advances in generative adversarial networks for gene expression data: a comprehensive review. Mathematics. 2023;11(14):3055.
Buccitelli C, Selbach M. mrnas, proteins and the emerging principles of gene expression control. Nat Rev Genet. 2020;21(10):630–44.
Gordon LG, White NM, Elliott TM, Nones K, Beckhouse AG, Rodriguez-Acevedo AJ, Webb PM, Lee XJ, Graves N, Schofield DJ. Estimating the costs of genomic sequencing in cancer control. BMC Health Serv Res. 2020;20(1):1–11.
Harris RS, Cechova M, Makova KD. Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data. Bioinformatics. 2019;35(22):4809–11.
Zang C, Wang T, Deng K, Li B, Hu S, Qin Q, Xiao T, Zhang S, Meyer CA, He HH. High-dimensional genomic data bias correction and data integration using mancie. Nat Commun. 2016;7(1):1–8.
Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS. A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res. 2004;14(11):2347–56.
Eldar YC. Mean-squared error sampling and reconstruction in the presence of noise. IEEE Trans Signal Process. 2006;54(12):4619–33.
Park SW, Hao WD, Leung CS. Reconstruction of uniformly sampled sequence from nonuniformly sampled transient sequence using symmetric extension. IEEE Trans Signal Process. 2011;60(3):1498–501.
Blagus R, Lusa L. Smote for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):1–16.
Gu Q, Wang XM, Wu Z, Ning B, Xin CS. An improved smote algorithm based on genetic algorithm for imbalanced data classification. J Digital Infor Manag. 2016;14(2):92–103.
Li X, Zhang L. Unbalanced data processing using deep sparse learning technique. Futur Gener Comput Syst. 2021;125:480–4.
Huang DH, Liu D, Wen M, Dong XL, Wen M, Zhao XH. A clustering method of gas load based on fcm-smote. In: E3S Web of Conferences, vol. 257. EDP Sciences; 2021. p. 01032.
Van den Bulcke T, Van Leemput K, Naudts B, van Remortel P, Ma H, Verschoren A, De Moor B, Marchal K. Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006;7(1):1–12.
Schaffter T, Marbach D, Floreano D. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–70.
Maier R, Zimmer R, Küffner R. A turing test for artificial expression data. Bioinformatics. 2013;29(20):2603–9.
Chaudhari P, Agrawal H, Kotecha K. Data augmentation using mg-gan for improved cancer classification on gene expression data. Soft Comput. 2020;24(15):11381–91.
Kwon C, Park S, Ko S, Ahn J. Increasing prediction accuracy of pathogenic staging by sample augmentation with a gan. PLoS ONE. 2021;16(4):e0250458.
Mirza M, Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014.
Ahmed KT, Sun J, Cheng S, Yong J, Zhang W. Multi-omics data integration by generative adversarial network. Bioinformatics. 2022;38(1):179–86.
Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: International Conference on Machine Learning. PMLR; 2017. p. 214–23.
Marouf M, Machart P, Bansal V, Kilian C, Magruder DS, Krebs CF, Bonn S. Realistic in silico generation and augmentation of single-cell rna-seq data using generative adversarial networks. Nat Commun. 2020;11(1):1–12.
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of wasserstein gans. Adv Neural Inf Process Syst. 2017;30.
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. Adv Neural Inf Process Syst. 2014;27.
Zheng M, Li T, Zhu R, Tang Y, Tang M, Lin L, Ma Z. Conditional wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Inf Sci. 2020;512:1009–23.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. 2016.
Wu F, Souza A, Zhang T, Fifty C, Yu T, Weinberger K. Simplifying graph convolutional networks. In: International Conference on Machine Learning. PMLR; 2019. p. 6861–71.
Zhang S, Tong H, Xu J, Maciejewski R. Graph convolutional networks: a comprehensive review. Comput Social Netw. 2019;6(1):1–23.
Petzka H, Fischer A, Lukovnicov D. On the regularization of wasserstein gans. arXiv preprint arXiv:1709.08894. 2017.
Tian X, Ding CH, Chen S, Luo B, Wang X. Regularization graph convolutional networks with data augmentation. Neurocomputing. 2021;436:92–102.
Wang Y, Wang Y, Yang J, Lin Z. Dissecting the diffusion process in linear graph convolutional networks. Adv Neural Inf Process Syst. 2021;34:5758–69.
Tran NT, Tran VH, Nguyen NB, Nguyen TK, Cheung NM. On data augmentation for gan training. IEEE Trans Image Process. 2021;30:1882–97.
Grün D. Revealing dynamics of gene expression variability in cell state space. Nat Methods. 2020;17(1):45–9.
Wang J, Ma A, Chang Y, Gong J, Jiang Y, Qi R, Wang C, Fu H, Ma Q, Xu D. scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nat Commun. 2021;12(1):1–11.
Jin Q, Luo X, Shi Y, Kita K. Image generation method based on improved condition gan. In: 2019 6th International Conference on Systems and Informatics (ICSAI). IEEE; 2019. p. 1290–4.
GTEx Consortium. The gtex consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.
Hwang S, Kim CY, Yang S, Kim E, Hart T, Marcotte EM, Lee I. Humannet v2: human gene networks for disease research. Nucleic Acids Res. 2019;47(D1):D573–80.
Kim CY, Baek S, Cha J, Yang S, Kim E, Marcotte EM, Hart T, Lee I. Humannet v3: an improved database of human gene networks for disease research. Nucleic Acids Res. 2022;50(D1):D632–9.
Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, Minet T, Ochoa A, Gross BE, Iacobuzio-Donahue CA. Unifying cancer and normal rna sequencing data from different sources. Sci Data. 2018;5(1):1–8.
Tieleman T, Hinton G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012;4(2):26–31.
Li W, Xu L, Liang Z, Wang S, Cao J, Ma C, Cui X. Sketch-then-edit generative adversarial network. Knowl-Based Syst. 2020;203:106102.
Rigatti SJ. Random forest. J Insur Med. 2017;47(1):31–9.
Peterson LE. K-nearest neighbor. Scholarpedia. 2009;4(2):1883.
Karlik B, Olgac AV. Performance analysis of various activation functions in generalized mlp architectures of neural networks. Int J Artif Intell Expert Syst. 2011;1(4):111–22.
Van der Maaten L, Hinton G. Visualizing data using t-sne. J Mach Learn Res. 2008;9(11).
Mendiratta G, Ke E, Aziz M, Liarakos D, Tong M, Stites EC. Cancer gene mutation frequencies for the us population. Nat Commun. 2021;12(1):5961.
Piñero J, Saüch J, Sanz F, Furlong LI. The disgenet cytoscape app: exploring and visualizing disease genomics data. Comput Struct Biotechnol J. 2021;19:2960–7.
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L. clusterprofiler 4.0: a universal enrichment tool for interpreting omics data. The Innovation. 2021;2(3):100141.
Kanehisa M, Goto S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Baranova A. Ppar ligands as potential modifiers of breast carcinoma outcomes. PPAR Res. 2008;2008.
Xu Y, Shu D, Shen M, Wu Q, Peng Y, Liu L, Tang Z, Gao S, Wang Y, Liu S. Development and validation of a novel ppar signaling pathway-related predictive model to predict prognosis in breast cancer. J Immunol Res. 2022;2022.
Sultan G, Zubair S, Tayubi IA, Dahms HU, Madar IH. Towards the early detection of ductal carcinoma (a common type of breast cancer) using biomarkers linked to the ppar (\(\gamma\)) signaling pathway. Bioinformation. 2019;15(11):799.
Méndez-García LA, Nava-Castro KE, Ochoa-Mercado T, Palacios-Arreola MI, Ruiz-Manzano RA, Segovia-Mendoza M, Solleiro-Villavicencio H, Cázarez-Martínez C, Morales-Montor J. Breast cancer metastasis: are cytokines important players during its development and progression? J Interferon & Cytokine Res. 2019;39(1):39–55.
Cao W, Li J, Hao Q, Vadgama JV, Wu Y. Amp-activated protein kinase: a potential therapeutic target for triple-negative breast cancer. Breast Cancer Res. 2019;21(1):1–10.
Song X, Wei C, Li X. The potential role and status of il-17 family cytokines in breast cancer. Int Immunopharmacol. 2021;95:107544.
Balaban S, Shearer RF, Lee LS, van Geldermalsen M, Schreuder M, Shtein HC, Cairns R, Thomas KC, Fazakerley DJ, Grewal T. Adipocyte lipolysis links obesity to breast cancer growth: adipocyte-derived fatty acids drive breast cancer cell proliferation and migration. Cancer & Metabolism. 2017;5(1):1–14.
Acevedo DS, Fang WB, Rao V, Penmetcha V, Leyva H, Acosta G, Cote P, Brodine R, Swerdlow R, Tan L. Regulation of growth, invasion and metabolism of breast ductal carcinoma through ccl2/ccr2 signaling interactions with met receptor tyrosine kinases. Neoplasia. 2022;28:100791.
Cid LP, Roa-Rojas HA, Niemeyer MI, González W, Araki M, Araki K, Sepúlveda FV. Task2: a k2p k+ channel with complex regulation and diverse physiological functions. Front Physiol. 2013;4:198.
Ye Q, Han X, Wu Z. Bioinformatics analysis to screen key prognostic genes in the breast cancer tumor microenvironment. Bioengineered. 2020;11(1):1280–300.
Li J, Han X. Adipocytokines and breast cancer. Curr Probl Cancer. 2018;42(2):208–14.
Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis. Adv Neural Inf Process Syst. 2021;34:8780–94.
Acknowledgements
The authors are grateful to the anonymous referees for their helpful comments.
Funding
This research is supported by the National Natural Science Foundation of China under Grant Nos. 62366007 and 62302107, the Guangxi Natural Science Foundation under Grant No. 2022GXNSFAA035625, the “Bagui Scholar” Project Special Funds, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
Author information
Contributions
RL participated in the data collection, data preprocessing, model design, and draft writing. JW participated in the concept, design and critical revision of the manuscript. GL and JL participated in the language revision of this paper. JX and QZ analyzed the experiments. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, R., Wu, J., Li, G. et al. Mdwgangp: data augmentation for gene expression data based on multiple discriminator WGANGP. BMC Bioinformatics 24, 427 (2023). https://doi.org/10.1186/s12859-023-05558-9
DOI: https://doi.org/10.1186/s12859-023-05558-9