Classifying breast cancer subtypes on multiomics data via sparse canonical correlation analysis and deep learning
BMC Bioinformatics volume 25, Article number: 132 (2024)
Abstract
Background
Classifying breast cancer subtypes is crucial for clinical diagnosis and treatment. However, the early symptoms of breast cancer may not be apparent. Rapid advances in high-throughput sequencing technology have generated large amounts of multiomics biological data. Leveraging and integrating the available multiomics data can effectively enhance the accuracy of identifying breast cancer subtypes. However, few efforts have focused on identifying the associations among different omics data to predict breast cancer subtypes.
Results
In this paper, we propose a differential sparse canonical correlation analysis network (DSCCN) for classifying breast cancer subtypes. DSCCN performs differential analysis on multiomics expression data to identify differentially expressed (DE) genes and adopts sparse canonical correlation analysis (SCCA) to mine highly correlated features between multiomics DE-genes. Meanwhile, DSCCN uses a multi-task deep neural network to train on the correlated DE-genes of each omics type separately to predict breast cancer subtypes, which tackles the data heterogeneity problem in integrating multiomics data.
Conclusions
The experimental results show that by mining the associations among multiomics data, DSCCN is more capable of accurately classifying breast cancer subtypes than the existing methods.
Introduction
Breast cancer is the second leading cause of cancer death in women after lung cancer [1]. It is a highly heterogeneous disease, consisting of different biological subtypes. Each breast cancer subtype has different clinical, pathological and molecular features, and has different prognostic and therapeutic implications [2, 3]. Therefore, the study of breast cancer subtypes is of great significance for precision medicine and prognosis prediction of breast cancer [4, 5]. High-throughput technologies can be exploited to profile the heterogeneous genotype data related to breast cancer [6,7,8].
Driven by new high-throughput sequencing technologies, biological data in a variety of formats, sizes and structures are growing at an unprecedented rate [9,10,11]. Based on these omics data, there have been many studies on the classification of breast cancer subtypes, which can be divided into two categories. The first category is based on single-omics data. For example, Lehmann et al. [12] used gene expression data for clustering analysis to identify subtypes of triple-negative breast cancer. Rhee et al. [13] proposed a hybrid approach that integrates graph convolutional networks and relational networks to predict breast cancer subtypes using gene expression profiles. Yu et al. [14] performed differential expression analysis on biologically important genes in gene regulatory networks and constructed a machine learning-based binary classification model for each breast cancer subtype using the differentially expressed genes. Each type of omics data exhibits specific disease associations [15, 16]. However, the analysis of single-omics data does not capture the interrelationships between molecules at different levels, and may therefore fail to provide a comprehensive understanding of the biological processes of breast cancer [17].
To address these limitations, the second category utilizes multiomics data to perform breast cancer classification. Various studies have shown that combining multiple omics datasets yields more accurate predictions of clinical outcomes, thereby verifying the importance of integrating multiomics data over single-omics data [17,18,19,20,21]. According to the way the data are integrated, the multiomics data integration methods for predicting breast cancer subtypes can be classified as concatenation-based, ensemble-based and knowledge-driven methods [22].
The concatenation-based methods combine all omics data into a single dataset before training [15, 23]. For example, Tao et al. [24] presented an SVM model with multiple kernels to classify breast cancer subtypes using multiomics data. List et al. [25] constructed a random forest model to classify breast cancer subtypes using both gene expression and DNA methylation data. Concatenation-based methods are convenient because they integrate multiomics data into a single dataset before training, but they suffer from the increasing dimensionality of multiomics data and from the data heterogeneity issue in integrating multiomics data [26]. The ensemble-based methods separately train a model on each omics dataset and combine the prediction results based on an averaging or majority voting scheme [27]. For example, Lin et al. [28] proposed a deep neural network model, DeepMO, based on multiomics data for breast cancer subtype classification. DeepMO applies fully-connected layers to each omics dataset and concatenates these fully-connected layers for the final subtype prediction. Joung et al. [29] presented an interpretable deep learning-based framework, moBRCA-net, for classifying breast cancer subtypes. moBRCA-net applies a self-attention module to each omics dataset to mine the important features of the multiomics data and integrates the mined features into a deep neural network to identify breast cancer subtypes. The ensemble-based methods retain the unique data distribution of each source so that the omics data from different sources can be fully trained. However, the ensemble-based methods do not consider the biological interactions between multiomics data, and may therefore lose complementary information in multiomics data [30]. Knowledge-driven approaches consider the relationships between different omics data based on prior knowledge. For example, Singh et al. [22] proposed DIABLO to seek common information across different data modalities by selecting a subset of features and discriminating multiple subtypes simultaneously. SMSPL [31] is a robust multimodal approach for classifying breast cancer subtypes by analyzing integrative multiomics data. However, it should be noted that the prior knowledge may not be suitable for some biological research fields [31].
Although the above-mentioned methods have achieved great success in predicting breast cancer subtypes, some challenges remain when integrating multiomics data: (1) Biological data usually contain a large number of features p and a small number of samples n, which is called the large p and small n problem [32]. From a biological perspective, only a small fraction of features is highly correlated with the target disease, while most features are irrelevant. From a machine learning perspective, many irrelevant features make the classifier prone to overfitting and negatively influence its performance. (2) Data heterogeneity problem. Different types of biological data produced by different omics platforms contain heterogeneous information, which could result in different kinds and levels of uncertainty and imprecision [33]. (3) The complementary information presented in multiomics data is not fully utilized. In the classification of breast cancer subtypes, previous work has mainly focused on the associations between the disease and single-omics data rather than the associations among different types of omics data.
Motivated by these limitations, we propose a novel framework called DSCCN for classifying breast cancer subtypes by mining the associations among multiomics data. To solve the large p and small n problem in the integration of multiomics data, DSCCN first performs differential analysis on the multiomics expression data of breast cancer patients to identify differentially expressed genes. This step, specifically designed for breast cancer, effectively reduces the number of features while ensuring that the selected features are statistically significant and thus potentially related to the occurrence of breast cancer. To mine the associations among multiomics data, an SCCA model [34] is exploited to detect linear structural interaction information in the multiomics expression data and to uncover correlated multiomics features of the identified DE-genes. To the best of our knowledge, this is the first time an SCCA model has been used to identify associations in multiomics data for classifying breast cancer subtypes. Finally, DSCCN adopts an end-to-end multi-task deep neural network model (DNN) with an attention mechanism to train on the correlated multiomics features of the DE-genes to classify the breast cancer subtypes. Unlike traditional neural networks, which are usually trained only for a single specific task, our multi-task network utilizes a shared representation to perform multiple tasks simultaneously. Two independent tasks are performed separately to train our DNN model on the two omics datasets, and the attention mechanism is utilized to mine the important multiomics genes of high similarity within both tasks to produce classification probabilities for each task. This effectively addresses the problem of data heterogeneity and captures the information presented in multiomics data.
We demonstrate the capability of DSCCN by comparing it with state-of-the-art methods. In the comparative experiments, we evaluate the performance of all competing methods in the binary and multiclass classification of breast cancer subtypes. The results demonstrate that DSCCN performs competitively with the existing methods in classifying breast cancer subtypes. Our proposed DSCCN thus could be a promising method for the classification of breast cancer subtypes. The source code is available at https://github.com/hyr0771/DSCCN.
Materials and methods
In this section, we introduce our method DSCCN for classifying breast cancer subtypes. The overview of DSCCN is summarized in Fig. 1. As shown in Fig. 1, DSCCN mainly includes three steps:

Step 1: Performing differential analysis on the multiomics data (mRNA, DNA methylation) of breast cancer patients to detect DE-mRNAs and DE-DNAms.

Step 2: Utilizing Sparse Canonical Correlation Analysis to identify highly correlated mRNAs and DNAms of patients based on the DE-mRNAs and DE-DNAms detected in step 1. We call these correlated mRNAs and DNAms Corr-mRNAs and Corr-DNAms, respectively.

Step 3: Using the deep neural network model to classify the breast cancer subtypes based on the Corr-mRNAs and Corr-DNAms of patients.
Differential analysis of multiomics data
The breast cancer multiomics (mRNA, DNAm) data of patients are obtained from The Cancer Genome Atlas (TCGA) [35]. The multiomics data contain four subtypes of breast cancer: Basal-like (Basal), Her2-enriched (Her2), Luminal A (LumA) and Luminal B (LumB), which are publicly reported as the most replicated subtypes of human breast cancer [2]. The primary characteristics of the breast cancer subtypes are based on the expression levels of estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) and the proliferation indicator Ki67 [2, 36, 37]. The sample numbers of the breast cancer subtypes are given in Table 1.
Note that integrating omics data faces the challenge of the large p and small n problem. Appropriate dimensionality reduction is necessary for identifying relevant multiomics features of samples. We thus first carry out a dimensionality reduction process on the mRNA and DNAm datasets. Specifically, we divide the samples into two groups. For the mRNA dataset, the healthy group and the disease group with breast cancer contain 194 and 986 samples, respectively. For the DNAm dataset, the healthy group and the disease group with breast cancer contain 97 and 785 samples, respectively.
Then we perform differential analysis on the two omics datasets separately, utilizing the T-test and Fold Change methods to identify differentially expressed genes. Specifically, the genes with a p-value (T-test) less than 0.01 and a Fold Change less than 0.5 are defined as lowly expressed genes. Similarly, those with a p-value (T-test) less than 0.01 and a Fold Change greater than 1 are considered highly expressed genes. In total, we obtain 3692 DE-mRNA genes, with 3440 highly expressed genes and 252 lowly expressed genes, and 4679 DE-DNAm genes, with 3740 highly expressed genes and 939 lowly expressed genes. The results of differential analysis of the mRNA and DNAm data are shown in Table 1 and Fig. 2.
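The filtering rule described above can be illustrated with a minimal sketch. It assumes that the Fold Change is the ratio of the disease-group mean to the healthy-group mean, and it approximates the two-sided p-value of a Welch T-test with a normal distribution (reasonable for group sizes in the hundreds, but an assumption made here only to keep the example dependency-free). The function names and the toy expression values are hypothetical and are not taken from the DSCCN source code.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_p_value(a, b):
    """Two-sided p-value of a Welch t statistic (normal approximation, an assumption)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 1.0 if mean(a) == mean(b) else 0.0
    t = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def classify_gene(healthy, disease, p_cut=0.01, fc_low=0.5, fc_high=1.0):
    """Label one gene as 'low', 'high' or 'not DE' using the thresholds in the text."""
    p = welch_p_value(disease, healthy)
    fold_change = mean(disease) / mean(healthy)  # assumes positive expression values
    if p < p_cut and fold_change < fc_low:
        return "low"
    if p < p_cut and fold_change > fc_high:
        return "high"
    return "not DE"


# Toy expression values for one gene in the healthy group and three disease scenarios.
healthy = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 10.1]
disease_up = [30.0, 31.0, 29.0, 32.0, 28.5, 30.5, 29.5, 31.5]
disease_down = [3.0, 3.5, 2.5, 3.2, 2.8, 3.1, 2.9, 3.3]
disease_same = [10.1, 9.9, 10.4, 9.6, 10.0, 10.3, 9.7, 10.2]

labels = [classify_gene(healthy, d) for d in (disease_up, disease_down, disease_same)]
```

In practice an exact T-test from a statistics library would replace the normal approximation; the thresholding logic stays the same.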
Identifying correlated genes with SCCA model
A comprehensive analysis of mRNA and DNA methylation omics data can offer an encompassing overview of gene regulation, aiding in the comprehension of the molecular mechanisms of gene expression regulation. Detecting complex bi-multivariate associations between the mRNA and DNAm of patients is a critical task in identifying cancer subtypes. Recently, Sparse Canonical Correlation Analysis has received great attention in bi-multivariate association identification and feature selection [34]. Usually, there exists a chain association across mRNA and DNAm [38, 39]. Specifically, the effect of DNA methylation on mRNA is mainly manifested in its ability to regulate gene expression: changes in DNA methylation levels can affect the binding of transcription factors to DNA, leading to activation or silencing of genes, which in turn affects the production of mRNA. Inspired by this, we adopt an SCCA model called FGL-SCCA [34] with the fused pairwise group lasso (FGL) penalty and the graph guided pairwise group lasso (GGL) penalty to mine the bi-multivariate associations of mRNA and DNAm to classify breast cancer subtypes.
The matrix \(\mathbf{X}\in {\mathcal{R}}^{s\times m}\) represents the DE-mRNA data of patients, where s is the number of samples and m is the number of DE-mRNA features. The matrix \(\mathbf{Y}\in {\mathcal{R}}^{s\times n}\) represents the DE-DNAm data of patients, where n is the number of DE-DNAm features. With \(\mathbf{X}\) and \(\mathbf{Y}\) normalized and centered, the optimization problem can be defined as the following FGL-SCCA model [34]:
where the vectors \(\mathbf{u}\) and \(\mathbf{v}\) are the canonical weights for the mRNA features and DNAm features respectively, and \({\varphi }_{FGL}(\mathbf{u})\) and \({\varphi }_{GGL}(\mathbf{v})\) are the penalties that fit the adjacent smoothness and graphical smoothness, respectively. The FGL penalty \({\varphi }_{FGL}(\mathbf{u})\) is defined as \({\gamma }_{1}\sum_{k=1}^{m-1} {\omega }_{k,k+1}\sqrt{{u}_{k}^{2}+{u}_{k+1}^{2}}\), where \({\omega }_{k,k+1}\) is the weight of two adjacent features and \({\gamma }_{1}\) is a positive tuning hyperparameter. By mapping the feature space of \(\mathbf{v}\) into an undirected graph G, the GGL penalty \({\varphi }_{GGL}(\mathbf{v})\) is defined as \({\gamma }_{2}\sum_{\left(p,q\right)\in E} {\omega }_{p,q}\sqrt{{v}_{p}^{2}+{v}_{q}^{2}}\), where p and q are DNAm feature nodes of G, E is the edge set guided by the graph G, \({\omega }_{p,q}\) is the weight of the edge, and \({\gamma }_{2}\) is a hyperparameter that controls the amount of regularization. Both the FGL and GGL penalties can be used in a data-driven manner when no prior knowledge is given [34]. FGL assumes that the mRNA data is sequential, while GGL is usually adopted to bridge the gap between graph guided penalties and group lasso. DNA methylation plays different roles in different cell types or tissues, and the graphical relationship among these roles could be better captured by the graph guided penalty GGL. We thus impose the FGL penalty on the mRNA data and the GGL penalty on the DNA methylation data.
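For orientation, an SCCA objective that combines the two penalties defined above can be sketched as follows. This is an assumed generic form, consistent with the definitions of \(\mathbf{u}\), \(\mathbf{v}\), \({\varphi }_{FGL}\) and \({\varphi }_{GGL}\) given here; the exact formulation, including any additional sparsity terms, is the one given in [34].

\[
\min_{\mathbf{u},\mathbf{v}} \; -\mathbf{u}^{\top}\mathbf{X}^{\top}\mathbf{Y}\mathbf{v} + {\varphi }_{FGL}(\mathbf{u}) + {\varphi }_{GGL}(\mathbf{v}) \quad \text{s.t.} \quad \|\mathbf{X}\mathbf{u}\|_{2}^{2}\le 1,\; \|\mathbf{Y}\mathbf{v}\|_{2}^{2}\le 1
\]

Under this reading, minimizing the first term maximizes the covariance between the projections \(\mathbf{X}\mathbf{u}\) and \(\mathbf{Y}\mathbf{v}\), the two constraints bound their variances so that the covariance behaves like a correlation, and the penalty terms encourage sparse, smooth canonical weights.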
The FGL penalty encourages \({u}_{k}\) and \({u}_{k+1}\) in the vector \(\mathbf{u}\) to have similar values. During each iteration of solving Eq. (1), the FGL penalty sets \({\omega }_{k, k+1}\) to the value of \({u}_{k}^{2}\) in the previous iteration. This forms a smooth sequence of weights among adjacent elements of \(\mathbf{u}\), which is beneficial for handling data with an ordered structure.
For the GGL penalty imposed on \(\mathbf{Y}\), the undirected graph G represents the pattern of connections between the DNAm features, guiding the construction of the edge set E. Specifically, a matrix of \(n\times (n-1)\) rows and n columns is constructed for E, where each row represents a connection between two different nodes in G. For the connected DNAm feature nodes p and q in G and their canonical weights \({v}_{p}\) and \({v}_{q}\), the GGL penalty encourages \({v}_{p}\) and \({v}_{q}\) in \(\mathbf{v}\) to have similar values. Similar to the determination of \({\omega }_{k, k+1}\), the GGL penalty sets \({\omega }_{p,q}\) to the value of \({v}_{p}^{2}\) in the previous iteration of solving Eq. (1).
Based on the DE-mRNA features and DE-DNAm features derived from the differential analysis in the first step, we adopt standard quadratic programming [34, 40] to solve Eq. (1), and the solutions \(\mathbf{u}\) and \(\mathbf{v}\) are the canonical weights for the DE-mRNA features and DE-DNAm features, respectively. Then we compute the Pearson correlation coefficient \({\text{corr}}(\mathbf{X}\mathbf{u},\mathbf{Y}\mathbf{v})\) to measure the relevance of the DE-mRNA and DE-DNAm features. The larger the absolute value of the correlation coefficient, the stronger the correlation between the DE-mRNA features and DE-DNAm features. We choose suitable values of \({\gamma }_{1}\) and \({\gamma }_{2}\) based on the correlation coefficient.
Finally, we calculate the absolute values of the entries of \(\mathbf{u}\) and \(\mathbf{v}\) and sort the DE-mRNA features and DE-DNAm features by these absolute values in descending order. Then we select the top \({m}_{1}\) DE-mRNA and \({n}_{1}\) DE-DNAm features to construct the correlation matrices \({\mathbf{X}}_{corr}\in {\mathcal{R}}^{s\times {m}_{1}}\) and \({\mathbf{Y}}_{corr}{\in \mathcal{R}}^{s\times {n}_{1}}\).
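The scoring and selection steps above can be sketched in plain Python. The sketch assumes that canonical weight vectors are already available from some SCCA solver (no solver is implemented here); the helper names and the toy matrices are hypothetical.

```python
from math import sqrt


def project(matrix, weights):
    """Compute the canonical variable M w, one value per sample (row)."""
    return [sum(x * w for x, w in zip(row, weights)) for row in matrix]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = sqrt(sum((x - mean_a) ** 2 for x in a)) * sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm > 0 else 0.0


def top_features(weights, k):
    """Indices of the k features with the largest absolute canonical weight."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return order[:k]


def select_columns(matrix, indices):
    """Keep only the selected feature columns, in ranked order."""
    return [[row[i] for i in indices] for row in matrix]


# Toy data: 3 samples, 4 mRNA features and 2 DNAm features (hypothetical values).
X = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 5.0], [3.0, 5.0, 1.0, 6.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
u = [0.1, -0.9, 0.0, 0.5]  # assumed output of an SCCA solver
v = [0.6, 0.8]             # assumed output of an SCCA solver

r = pearson(project(X, u), project(Y, v))   # corr(Xu, Yv)
selected = top_features(u, 2)               # top m1 = 2 mRNA features
X_corr = select_columns(X, selected)
```

Ranking by absolute value matters because a large negative weight indicates a feature that contributes as strongly to the canonical variable as a large positive one.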
Predicting breast cancer subtypes using DNN model
The FGL-SCCA model is capable of extracting linear structured feature information from the mRNA and DNAm data. However, the nonlinear associations in the omics data are critical for cancer subtype classification as well. To mine these nonlinear associations, we utilize a multi-task deep neural network model, DNN [41], to predict breast cancer subtypes. We use \({\mathbf{X}}_{corr}\) and \({\mathbf{Y}}_{corr}\) as the input of the DNN model. As can be seen in Fig. 1, the DNN model consists of three main stages: (i) constructing modules for each dataset using a module encoder; (ii) identifying important modules across different omics data with a module attention mechanism; and (iii) implementing multi-task learning on a fully connected layer to comprehensively process each omics dataset.
Module encoder
The module encoder consists of a fully connected layer, which links the features of the omics data to each module. Let \({\mathcal{W}}_{\text{module }}^{j}\) denote the weights of the fully connected layer, which represent the association between modules and features of the j-th omics data. For a training sample \(({x}^{j}, y)\), \({x}^{j}\) denotes the sample from the j-th omics data and y is the classification label of \({x}^{j}\). Let \({\mathcal{F}}_{module}^{j}\) represent the module encoder for the j-th omics data. The module vectors \({M}^{j}\) for the j-th omics data can be defined as follows:
where \({\mathcal{W}}_{\text{module}}\) represents the weights of \({\mathcal{F}}_{\text{module}}\), \({N}^{j}\) indicates the number of modules of the j-th omics data, and \(D\) represents the dimension of the module vector.
Attention mechanism
DNN devises a module attention mechanism that focuses specifically on modules with high similarity across the omics datasets. Cosine similarity is used to assess the degree of correlation among these modules. Let \(Att\) denote the module attention matrix between the module vectors of two omics datasets, and let \({Att}_{lk}\) represent the element in row l and column k of \(Att\). \({Att}_{lk}\) contains the information on the potential dependencies between the l-th module of one omics dataset and the k-th module of the other omics dataset. Each element of the attention matrix is defined as follows [41]:
where \({M}^{j}\) abbreviates \({M}^{j}\left({x}^{j}\right)\), and \({M}_{l}^{i}\) and \({M}_{k}^{j}\) respectively represent the l-th module vector of the i-th omics data and the k-th module vector of the j-th omics data. To emphasize important modules, the module vectors are multiplied by the attention matrices and then concatenated with the other omics data. The updated module vector is defined as follows:
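A minimal sketch of the cosine-similarity attention described above is given here. The attention matrix follows directly from the description (cosine similarity between module vectors of two omics datasets). The update step, however, is only one possible reading of "multiplied by the attention matrices and then concatenated": it appends to each module of one omics dataset the attention-weighted sum of the modules of the other dataset. That update rule, the function names and the toy vectors are assumptions, not the definition used in [41].

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def attention_matrix(modules_i, modules_j):
    """Att[l][k]: similarity of module l (omics i) and module k (omics j)."""
    return [[cosine(m_l, m_k) for m_k in modules_j] for m_l in modules_i]


def update_modules(modules_i, modules_j):
    """Assumed update: concatenate each module of omics i with the
    attention-weighted sum of the modules of omics j."""
    att = attention_matrix(modules_i, modules_j)
    dim = len(modules_j[0])
    updated = []
    for l, m_l in enumerate(modules_i):
        context = [
            sum(att[l][k] * modules_j[k][d] for k in range(len(modules_j)))
            for d in range(dim)
        ]
        updated.append(list(m_l) + context)
    return updated


# Two modules per omics dataset, each of dimension D = 2 (toy values).
M_mrna = [[1.0, 0.0], [0.0, 1.0]]
M_dnam = [[1.0, 0.0], [1.0, 1.0]]
att = attention_matrix(M_mrna, M_dnam)
new_modules = update_modules(M_mrna, M_dnam)
```

Because cosine similarity is scale-invariant, modules are compared by direction only, so a module with large activations does not dominate the attention weights by magnitude alone.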
Training
The fully connected layers are then applied. In the model, the loss \(\mathcal{L}\) is set to the cross-entropy error between the true labels and the predicted outputs, and it is defined as follows:
where \(J\) denotes the number of omics datasets, \(C\) represents the total number of breast cancer subtypes, and \(y_{i}\) (\(\hat{y}_{i}\)) denotes the true (predicted) probability for each breast cancer subtype. Each layer takes the output of the previous layer as input and multiplies it with the trained weight matrix to obtain the input of the next layer. Finally, the classification layer flattens the multidimensional vectors and generates the final classification probabilities for each breast cancer subtype.
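The multi-task cross-entropy loss described above can be sketched in a few lines. This is a minimal illustration assuming a one-hot true label over the C subtypes and one predicted probability vector per omics task, with the per-task losses summed over the J tasks; the function names, the clipping constant and the toy probabilities are hypothetical.

```python
from math import log


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(t * log(max(p, eps)) for t, p in zip(y_true, y_prob))


def multitask_loss(y_true, task_probs):
    """Sum of the cross-entropy losses over all omics tasks (assumed aggregation)."""
    return sum(cross_entropy(y_true, probs) for probs in task_probs)


# One sample whose true subtype is the second of C = 4 classes, J = 2 tasks.
y = [0.0, 1.0, 0.0, 0.0]
probs_mrna = [0.1, 0.7, 0.1, 0.1]      # prediction of the mRNA task
probs_dnam = [0.25, 0.25, 0.25, 0.25]  # prediction of the DNAm task
loss = multitask_loss(y, [probs_mrna, probs_dnam])
```

Here the confident mRNA task contributes a smaller loss than the uninformative DNAm task, so gradient updates would mostly push the DNAm branch toward the correct class.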
Results
Evaluation metrics
In this section, we introduce the metrics for evaluating the performance of classifying breast cancer subtypes. The number of correctly predicted positive samples is denoted as TP (True Positive) and the number of negative samples that are identified as positive samples is denoted as FP (False Positive). Similarly, the number of correctly predicted negative samples is denoted as TN (True Negative), and the number of positive samples that are identified as negative samples is denoted as FN (False Negative). Then we can calculate Accuracy (ACC) = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN) and F1 = 2 × Precision × Recall/(Precision + Recall). Accuracy (ACC) indicates the proportion of all samples that are predicted correctly, whereas Precision indicates the proportion of true positive samples among the predicted positive samples. Recall indicates the proportion of positive samples that are correctly predicted. The ROC curve plots the True Positive Rate TPR = TP/(TP + FN) against the False Positive Rate FPR = FP/(TN + FP) over various rank thresholds. AUC is defined as the area under the ROC curve and is at most 1.
Traditional metrics such as Precision, Recall, and F1 score are originally defined for binary classification problems. In multiclassification problems, we use macro-averaged Precision (Precision-macro), macro-averaged Recall (Recall-macro), and macro-averaged F1 score (F1-macro) to comprehensively evaluate the performance of each method. Specifically, we first independently calculate the Precision, Recall, and F1 score for each class, and then take the arithmetic mean of the Precision, Recall and F1 score across all classes to obtain Precision-macro, Recall-macro and F1-macro, respectively.
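The macro-averaging procedure above can be made concrete with a short sketch. It treats each class in turn as the positive class, computes Precision, Recall and F1 from the resulting TP, FP and FN counts, and averages over classes. The function names and the six toy predictions are hypothetical; undefined ratios (zero denominators) are set to 0 here, which is a convention chosen for the sketch.

```python
def per_class_counts(y_true, y_pred, cls):
    """TP, FP and FN when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged Precision, Recall and F1."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, cls)
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(safe_div(2 * precision * recall, precision + recall))
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


y_true = ["Basal", "Basal", "LumA", "LumA", "LumB", "LumB"]
y_pred = ["Basal", "LumA", "LumA", "LumA", "LumB", "Basal"]
scores = macro_scores(y_true, y_pred)
```

Macro averaging gives every subtype the same weight regardless of its sample size, so errors on a small class such as Her2 affect the score as much as errors on a large class such as LumA.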
Comparison with other methods
To evaluate our proposed method DSCCN, we compare its performance with state-of-the-art methods. Specifically, we apply the logistic regression/multinomial model with Elastic Net (EN) regularization [42] and Random Forest (RF) [43] in the concatenation and ensemble frameworks to obtain two concatenation-based methods (Concate EN, Concate RF) and two ensemble-based methods (Ensemble EN, Ensemble RF) for comparison. Besides these four comparative methods, we also compare the performance of DSCCN with three other breast cancer classification methods based on multiomics data. These three multimodal methods are DIABLO [22], SMSPL [31] and DeepMO [28].
Among the comparative methods, DIABLO is dedicated to maximizing the shared or correlated information across multiple omics datasets, reducing the high dimensionality of features. SMSPL addresses the issue of data heterogeneity by interactively recommending high-confidence samples between different modalities and assigns varying weights to training samples through its soft weighting mechanism, which significantly mitigates the impact of high-dimensional noise on model performance. Meanwhile, DeepMO employs the SelectKBest [44] method from the scikit-learn Python library to select the top K features for training to alleviate the problem of data heterogeneity.
In the experiments, we use FGL-SCCA to detect highly correlated genes between DE-mRNAs and DE-DNAms. We randomly selected 70% of the samples in Table 1 as the training set and treated the remaining samples as the test set. By performing grid search on \({\gamma }_{1}\) and \({\gamma }_{2}\), we obtained the optimal correlation coefficient values of 0.969 on the training data and 0.896 on the test data, respectively. For DNN, the hyperparameters are tuned as follows: the 'number of modules' is selected from {16, 32, 64, 128}, the 'learning rate' is selected from {10^{-4}, 10^{-5}, 5 × 10^{-6}, 10^{-6}}, the 'weight decay' is selected from {10^{-3}, 10^{-4}, 10^{-5}} and the 'early stopping patience' is selected from {50, 100, 200, 300}. To ensure fairness in comparison, for each comparative method, including random forest, we used the default parameter values suggested in the corresponding literature.
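The DNN search space listed above can be enumerated as a plain grid. The sketch below only illustrates how such a grid is traversed; the dictionary keys, the `grid_search` helper and the dummy scoring function are hypothetical, and a real run would replace the dummy score with a validation metric obtained by training the model.

```python
from itertools import product

# Candidate values as listed in the text; key names are illustrative.
search_space = {
    "num_modules": [16, 32, 64, 128],
    "learning_rate": [1e-4, 1e-5, 5e-6, 1e-6],
    "weight_decay": [1e-3, 1e-4, 1e-5],
    "early_stopping_patience": [50, 100, 200, 300],
}


def iter_configs(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(space, score_fn):
    """Return the configuration with the highest score (first one wins ties)."""
    best_cfg, best_score = None, float("-inf")
    for cfg in iter_configs(space):
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


configs = list(iter_configs(search_space))


def dummy_score(cfg):
    """Placeholder score that prefers 64 modules and a small learning rate."""
    return -abs(cfg["num_modules"] - 64) - cfg["learning_rate"]


best_cfg, best_score = grid_search(search_space, dummy_score)
```

The full grid contains 4 × 4 × 3 × 4 = 192 configurations, which is small enough to evaluate exhaustively.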
In the following sections, we first verify the performance of DSCCN on the binary and multiple classification of breast cancer subtypes. Then we conduct ablation studies to assess the effectiveness of each step in DSCCN. Finally, we perform a comprehensive analysis of the selected genes to assess the ability of DSCCN to identify critical features for predicting breast cancer subtypes.
Performance of binary classification
To assess the performance of our method DSCCN in binary classification, we evaluate its effectiveness in distinguishing any two subtypes of breast cancer, including (1) Basal versus Her2, (2) Basal versus LumA, (3) Basal versus LumB, (4) Her2 versus LumA, (5) Her2 versus LumB, and (6) LumA versus LumB. The sample sizes of the breast cancer datasets in binary classification can be found in Table 2. To keep the results stable, we conduct stratified five-fold cross-validation on each classification dataset and repeat the experiments 30 times, reporting the average measurements. The Accuracy, AUC and F1 score on any two subtypes of breast cancer obtained by different methods are shown in Table 3.
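The repeated stratified cross-validation protocol can be sketched as follows. The sketch assumes that stratification means each fold receives roughly the same share of every subtype and that each of the 30 repetitions uses a different shuffle; the function names, the seed handling and the placeholder `evaluate` callback are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with balanced class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def repeated_cv(labels, evaluate, k=5, repeats=30):
    """Average evaluate(train_idx, test_idx) over repeats x k train/test splits."""
    scores = []
    for rep in range(repeats):
        folds = stratified_folds(labels, k, seed=rep)
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
            scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Toy binary task: 10 Basal and 5 Her2 samples.
labels = ["Basal"] * 10 + ["Her2"] * 5
folds = stratified_folds(labels, k=5, seed=0)
# Placeholder metric: fraction of samples held out in each split.
mean_score = repeated_cv(labels, lambda train, test: len(test) / (len(train) + len(test)))
```

Stratification is particularly relevant for pairs involving Her2, the smallest subtype, since an unstratified split could leave a fold with very few Her2 samples.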
Table 3 presents the performance comparison, demonstrating that DSCCN consistently outperforms other methods in terms of F1 score across all datasets. Notably, except for Her2 vs LumA, DSCCN attains the highest accuracy (ACC) on the remaining five datasets. Moreover, DSCCN attains the highest AUC value in four out of the six datasets. These results indicate that DSCCN is an effective method in performing binary classification for the subtypes of breast cancer.
Performance of multiclassification
In this section, we compare the average performance of DSCCN and seven other methods on the multiclassification of breast cancer subtypes. From Table 4, we can see that DSCCN outperforms the other methods across all metrics. Specifically, DSCCN achieves the highest accuracy of 0.906 and the highest F1-macro of 0.922. Overall, the results in Table 4 demonstrate that DSCCN is an effective method for classifying multiple breast cancer subtypes.
In Fig. 3, we plot the normalized confusion matrices to visualize the multiclassification performance of all methods for each breast cancer subtype. Figure 3 shows that DSCCN obtains competitive performance compared to the other methods on the breast cancer datasets. Specifically, for Basal, DSCCN classifies all samples correctly (error rate = 0). For Her2, which has the smallest sample size, DSCCN shows the strongest classification capability (error rate = 25%) compared to the other methods. For LumA, which has the largest sample size, DSCCN achieves the second best classification (error rate = 4%). DSCCN is slightly weaker on LumB (error rate = 26%). Compared to the other methods, DSCCN demonstrates robust overall performance in the classification of each subtype.
Ablation experiment
In this section, we evaluate the effectiveness of the different parts of DSCCN by conducting an ablation study on both binary classification and multiclassification. In DSCCN, two optimization techniques are employed for the classification of breast cancer subtypes, with a DNN model as the classifier. The first technique is to perform differential analysis on both omics datasets to reduce data dimensionality. The second technique is to detect the highly correlated genes between mRNA and DNAm using the FGL-SCCA algorithm.
As can be seen in Table 5, we construct five models for DSCCN. For DSCCN1, none of the optimization techniques is implemented. For DSCCN2, only the differential analysis is implemented. For DSCCN3, only the FGL-SCCA technique is implemented.
To investigate the efficacy of the DNN model in the classification of breast cancer subtypes, we construct two further models: DSCCN4 and DSCCN5. For DSCCN4, both optimization techniques are employed, and XGBoost [45] replaces the DNN model as the classifier in order to demonstrate the effectiveness of the DNN model. To further understand the role of the attention mechanism within DNN, we construct DSCCN5, which is identical to DSCCN except that the attention mechanism is deactivated. We then compare the performance of these five models to explore the effectiveness of each step of DSCCN.
Binary classification
In this section, we discuss the performance of different modes of DSCCN on the binary classification of breast cancer subtypes. Table 6 shows the performance of classifying any two subtypes of breast cancer using the different DSCCN modes described in Table 5. As shown in Table 6, the indicators of DSCCN2 are superior to those of DSCCN1 on most datasets. This implies that the differential analysis effectively filters out irrelevant features, resulting in enhanced classification performance. Moreover, DSCCN3 outperforms DSCCN1 on all datasets in terms of ACC and AUC. This demonstrates the benefit of using FGL-SCCA to identify highly correlated features for the binary classification of breast cancer subtypes.
Moreover, Table 6 shows that the performance of DSCCN surpasses that of DSCCN4 and DSCCN5. This result further confirms that the DNN model can achieve superior results in the binary classification of breast cancer subtypes. Additionally, it demonstrates the efficacy of the attention mechanism within the DNN model, which significantly enhances its performance. The ACC, AUC and F1 score of DSCCN are all superior to those of its variant models, indicating that DSCCN exhibits robust performance on the binary classification of breast cancer subtypes. Overall, the comparisons of the different DSCCN modes demonstrate the effectiveness of combining differential analysis and Sparse Canonical Correlation Analysis to perform binary classification of breast cancer subtypes.
Multiclassification
In this section, we discuss the performance of different modes of DSCCN on the multiclassification of the breast cancer subtypes. Table 7 shows the performance of the different DSCCN modes for classifying the multiple breast cancer subtypes in Table 1. As shown in Table 7, compared to DSCCN1, the optimized DSCCN2 and DSCCN3 both demonstrate superior performance, which validates the effectiveness of the two optimization techniques. Furthermore, the performance of DSCCN surpasses that of DSCCN4 and DSCCN5, further confirming the enhanced ability of attention mechanism-equipped DNN models in the multiclassification of breast cancer subtypes. These results suggest that a more accurate multiclassification of breast cancer subtypes can be achieved by integrating differential analysis and Sparse Canonical Correlation Analysis.
In Fig.Â 4, we generate normalized confusion matrices to visualize the multiclassification performance of each DSCCN mode on each subtype. As shown in the Fig.Â 4, DSCCN2, DSCCN3, DSCCN5 and DSCCN correctly classify Basal from other three breast cancer subtypes. For Her2, both DSCCN2 and DSCCN obtain the best accuracy of 75%. For LumA, DSCCN achieves the second best accuracy of 96%. For LumB, DSCCN reaches the accuracy of 74%. Overall, DSCCN consistently maintains a high classification accuracy across all subtypes, making its overall performance superior. These results highlight the significant enhancements achieved by incorporating differential analysis and FGLSCCA techniques into our model, ensuring more reliable and precise multiclassifications on breast cancer subtypes.
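The per-subtype accuracies quoted above are the diagonal entries of a row-normalized confusion matrix. The sketch below illustrates that normalization, assuming each row is divided by the number of true samples of that subtype. The function name and the toy labels are hypothetical and do not reproduce the values in Fig. 4.

```python
from typing import List, Sequence

SUBTYPES = ["Basal", "Her2", "LumA", "LumB"]


def normalized_confusion_matrix(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> List[List[float]]:
    """Return a matrix whose entry [i][j] is the fraction of samples of true
    class labels[i] that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no true samples gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy labels (hypothetical, for illustration only).
toy_true = ["Basal", "Basal", "Her2", "Her2", "Her2", "Her2", "LumA", "LumA", "LumB", "LumB"]
toy_pred = ["Basal", "Basal", "Her2", "Her2", "Her2", "LumB", "LumA", "LumA", "LumB", "LumA"]

for name, row in zip(SUBTYPES, normalized_confusion_matrix(toy_true, toy_pred, SUBTYPES)):
    print(name, row)
```

Because each row sums to one, the diagonal entry of a row can be read directly as the accuracy on that subtype, regardless of how many samples the subtype has.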
Analysis of the selected genes of DSCCN
To examine how the expression of the selected genes differs across subtypes, Fig. 5 shows heatmaps of the expression of the top 30 genes selected by DSCCN in the mRNA and DNAm data for the multiclassification of four breast cancer subtypes. Fig. 5 shows a marked difference in the expression of the identified genes between the Basal subtype and the other subtypes. Furthermore, to investigate whether the genes detected by DSCCN are highly correlated, we select the top 30 genes with the highest weights from each omics for Pearson correlation analysis. Figure 6 depicts the correlation coefficient matrix between the cross-omics gene pairs; a clear majority of gene pairs show some correlation. Further statistical analysis reveals that 65.3% (588 of 900) of these gene pairs have p-values below the critical threshold, suggesting that the correlations observed among them are unlikely to be due to random chance.
Interestingly, 13 of the 30 identified mRNAs in DSCCN (RNF145, CDKN2A, PLCG2, SOX10, TNFRSF11A, L3MBTL4, THRA, BBS10, ZFP36L2, SPNS2, RHOU, PER2, ANGPTL4) have recently been found to be associated with breast cancer. For example, the CDKN2A gene was found to be a potential addition to the small list of genes examined for associations with breast cancer histopathology and/or disease course [46]. SOX10 was recently reported to be highly expressed in triple-negative breast cancer, which could be helpful for diagnosing the origin of breast cancer [47]. ANGPTL4 has been identified as being associated with the malignant progression and poor prognosis of breast cancer, which implies that ANGPTL4 might serve as a novel therapeutic target for breast cancer [48].
Eighteen of the 30 identified DNAms in DSCCN (MED27, GNG7, ST6GLNAC4, RP11, DICER1, TCF12, ZNRF3, APOA5, CERS2, TRPM1, TATDN1, LSM2, ECI2, FBXW4, TRERF1, FRY, GPLD1, FLT1) have been confirmed to be associated with breast cancer. For instance, the expression level of MED27 is higher in breast cancer samples than in normal tissues, especially in triple-negative breast cancer. Additionally, its expression level rises as the pathological stage increases [49]. Compared with normal breast tissue, GNG7 exhibits lower expression in breast cancer tissue. Silencing GNG7 significantly enhances cell proliferation and inhibits apoptosis, whereas exogenous overexpression of GNG7 has the reverse effect on breast cancer cells [50].
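The pairwise analysis above involves 30 × 30 = 900 cross-omics gene pairs, and 588 / 900 ≈ 65.3% of them fall below the threshold. The following sketch shows one way such a count could be obtained. It is an illustration under stated assumptions rather than the authors' procedure: the significance level of 0.05 is assumed (the text only mentions a critical threshold), the p-value uses the Fisher z-transform normal approximation instead of an exact test, and all function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using the Fisher z-transform
    (a normal approximation that requires n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_significant_pairs(
    mrna: List[Sequence[float]], dnam: List[Sequence[float]], alpha: float = 0.05
) -> Tuple[int, int]:
    """Count cross-omics gene pairs with p < alpha. Each element of mrna and
    dnam is one gene's values over the same samples, in the same order."""
    n_samples = len(mrna[0])
    significant = total = 0
    for g in mrna:
        for h in dnam:
            if approx_p_value(pearson_r(g, h), n_samples) < alpha:
                significant += 1
            total += 1
    return significant, total


# Arithmetic reported in the text: 588 of 30 x 30 = 900 pairs.
print(round(100 * 588 / (30 * 30), 1))
```

When many pairs are tested at once, a multiple-testing correction would normally be considered. The text does not state whether one was applied, so none is assumed here.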
Conclusion
In this work, we present a method called DSCCN to classify breast cancer subtypes using multiomics data. To address the "large p, small n" issue and the data heterogeneity problem in multiomics data integration, we first perform differential analysis on the multiomics expression data of patients to identify differentially expressed genes and obtain DE-mRNA and DE-DNAm features. We then carry out sparse canonical correlation analysis to identify highly correlated DE-mRNA and DE-DNAm features. Finally, we adopt a neural network with an attention mechanism to identify genes with high cosine similarity and classify breast cancer subtypes. Through the use of sparse canonical correlation analysis and the attention mechanism, DSCCN is able to efficiently identify highly correlated genes between mRNA and DNAm data. The experimental results show that our proposed method is superior to the existing methods in both the binary classification and the multiclassification of breast cancer subtypes. The ablation study shows that each step of DSCCN contributes significantly to the classification performance. DSCCN could thus be a useful framework for classifying breast cancer subtypes.
Despite the effectiveness of DSCCN in classifying breast cancer subtypes, limitations remain. Biological intuition suggests that using more omics data could improve the performance of the classification model. It is known that mRNA and DNAm are typical coding genes. In the future, we intend to extend our analysis to noncoding genes, especially miRNAs and lncRNAs. This may enable us to improve the classification accuracy and robustness of our model and to understand breast cancer subtypes from a comprehensive perspective of coding and noncoding genes. Moreover, because of the data imbalance in the breast cancer dataset, it is difficult for our model to learn the features of each subtype thoroughly, which decreases accuracy. Considering that data augmentation techniques have proven effective in numerous fields, we intend to incorporate them into our future work so as to recognize the characteristics of each subtype accurately.
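To make the second step of the pipeline more concrete, the sketch below illustrates the general idea of sparse canonical correlation analysis on toy data. It is a generic rank-one scheme that alternates soft-thresholding of the cross-correlation matrix, and it is not the FGLSCCA model used by DSCCN. The identity within-omics covariance, the fixed thresholds `lam_u` and `lam_v`, the function names and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a: np.ndarray, lam: float) -> np.ndarray:
    """Shrink each entry toward zero by lam; entries with |a| <= lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit_norm(a: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm (an all-zero vector is left unchanged)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca_rank1(X, Y, lam_u=0.3, lam_v=0.3, n_iter=50):
    """Return sparse weight vectors (u, v) for the two feature sets.
    Rows of X and Y are the same samples; columns must not be constant."""
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    C = Xs.T @ Ys / n  # cross-correlation matrix between the two omics
    # Initialize v with the leading right singular vector of C.
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[0]
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = unit_norm(soft_threshold(C @ v, lam_u))
        v = unit_norm(soft_threshold(C.T @ u, lam_v))
    return u, v


def make_toy_data(seed: int = 0, n: int = 500):
    """Simulate two feature sets that share one latent signal z.
    Only X[:, 0], X[:, 1] and Y[:, 0] depend on z; the rest is noise."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    X = rng.standard_normal((n, 6))
    Y = rng.standard_normal((n, 5))
    X[:, 0] = z + 0.3 * rng.standard_normal(n)
    X[:, 1] = z + 0.3 * rng.standard_normal(n)
    Y[:, 0] = z + 0.3 * rng.standard_normal(n)
    return X, Y


X_toy, Y_toy = make_toy_data()
u_toy, v_toy = sparse_cca_rank1(X_toy, Y_toy)
print(np.flatnonzero(u_toy), np.flatnonzero(v_toy))
```

In this toy setting the nonzero weights fall on the features that share the latent signal, which mirrors how a sparse method retains only highly correlated features across omics before classification.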
Availability of data and materials
The source code and data are available at https://github.com/hyr0771/DSCCN.
References
Azamjah N, Soltan-Zadeh Y, Zayeri F. Global trend of breast cancer mortality rate: a 25-year study. Asian Pac J Cancer Prev APJCP. 2019;20(7):2015–20.
Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci. 2003;100(14):8418–23.
Huang Y, Wu Z, Lan W, Zhong C. Predicting disease-associated N7-methylguanosine (m7G) sites via random walk on heterogeneous network. IEEE/ACM Trans Comput Biol Bioinform. 2023;20:3173–81.
Waks AG, Winer EP. Breast cancer treatment: a review. JAMA. 2019;321(3):288–300.
Yersal O, Barutca S. Biological subtypes of breast cancer: prognostic and therapeutic implications. World J Clin Oncol. 2014;5(3):412–24.
Khan D, Shedole S. Leveraging deep learning techniques and integrated omics data for tailored treatment of breast cancer. J Personal Med. 2022;12:674.
Du L, Liu C, Wei R, Chen J. Uncertainty-aware dynamic integration for multiomics classification of tumors. J Cancer Res Clin Oncol. 2023;149(7):3301–12.
Zhang C, Li P, Sun D, Liu ZP. MOFNet: a deep learning framework of integrating multiomics data for breast cancer diagnosis. In: Advanced intelligent computing technology and applications: 2023. Singapore: Springer; 2023. pp. 727–738.
Bennett DA, Buchman AS, Boyle PA, Barnes LL, Wilson RS, Schneider JA. Religious orders study and rush memory and aging project. J Alzheimers Dis. 2018;64:S161–89.
Chen S, Liu Q, Cui X, Feng Z, Li C, Wang X, Zhang X, Wang Y, Jiang R. OpenAnnotate: a web server to annotate the chromatin accessibility of genomic regions. Nucleic Acids Res. 2021;49(W1):W483–90.
Huang Y, Bin Y, Zeng P, Lan W, Zhong C. NetPro: neighborhood interaction-based drug repositioning via label propagation. IEEE/ACM Trans Comput Biol Bioinform. 2023;20(3):2159–69.
Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J Clin Investig. 2011;121(7):2750–67.
Rhee S, Seo S, Kim S. Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. http://arxiv.org/abs/arXiv:1711.05859. (2018)
Yu Z, Wang Z, Yu X, Zhang Z. RNA-Seq-based breast cancer subtypes classification using machine learning approaches. Comput Intell Neurosci. 2020;2020:4737969.
Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep learning-based multiomics integration robustly predicts survival in liver cancer. Clin Cancer Res. 2018;24(6):1248–59.
Huang Y, Chen F, Sun H, Zhong C. Exploring gene-patient association to identify personalized cancer driver genes by linear neighborhood propagation. BMC Bioinform. 2024;25(1):34.
Huang S, Chaudhary K, Garmire LX. More is better: recent progress in multiomics data integration methods. Frontiers. 2017;8:268903.
Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, Buettner F, Huber W, Stegle O. Multiomics factor analysis - a framework for unsupervised integration of multiomics data sets. Mol Syst Biol. 2018;14(6):e8124.
Conesa A, Beck S. Making multiomics data accessible to researchers. Sci Data. 2019;6(1):251.
Peng YZ, Lin Y, Huang Y, Li Y, Luo G, Liao J. GEP-EpiSeeker: a gene expression programming-based method for epistatic interaction detection in genome-wide association studies. BMC Genomics. 2021;22(1):910.
Huang Y, Zhong C. Detecting list-colored graph motifs in biological networks using branch-and-bound strategy. Comput Biol Med. 2019;107:1–9.
Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, Lê Cao KA. DIABLO: an integrative approach for identifying key molecular drivers from multiomics assays. Bioinformatics. 2019;35(17):3055–62.
Liu Y, Devescovi V, Chen S, Nardini C. Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties. BMC Syst Biol. 2013;7(1):14.
Tao M, Song T, Du W, Han S, Zuo C, Li Y, Wang Y, Yang Z. Classifying breast cancer subtypes using multiple kernel learning based on omics data. Genes. 2019;10(3):200.
List M, Hauschild AC, Tan Q, Kruse TA, Baumbach J, Batra R. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. J Integr Bioinform. 2014;11(2):1–14.
Rappoport N, Shamir R. Multiomic and multiview clustering algorithms: review and cancer benchmark. Nucl Acids Res. 2018;46(20):10546–62.
Günther OP, Chen V, Freue GC, Balshaw RF, Tebbutt SJ, Hollander Z, Takhar M, McMaster WR, McManus BM, Keown PA, et al. A computational pipeline for the development of multimarker biosignature panels and ensemble classifiers. BMC Bioinform. 2012;13(1):326.
Lin Y, Zhang W, Cao H, Li G, Du W. Classifying breast cancer subtypes using deep neural networks based on multiomics data. Genes. 2020;11(8):888.
Choi JM, Chae H. moBRCAnet: a breast cancer subtype classification framework based on multiomics attention neural networks. BMC Bioinform. 2023;24(1):169.
Sharifi-Noghabi H, Zolotareva O, Collins CC, Ester M. MOLI: multiomics late integration with deep neural networks for drug response prediction. Bioinformatics. 2019;35(14):i501–9.
Yang Z, Wu N, Liang Y, Zhang H, Ren Y. SMSPL: robust multimodal approach to integrative analysis of multiomics data. IEEE Trans Cybern. 2022;52(4):2082–95.
Wang Y, Miller DJ, Clarke R. Approaches to working in high-dimensional data spaces: gene expression microarrays. Br J Cancer. 2008;98(6):1023–8.
Li Y, Wu FX, Ngom A. A review on machine learning principles for multiview biological data integration. Brief Bioinform. 2016;19(2):325–40.
Du L, Liu K, Yao X, Risacher SL, Han J, Saykin AJ, Guo L, Shen L. Detecting genetic associations with brain imaging phenotypes in Alzheimer's disease via a novel structured SCCA approach. Med Image Anal. 2020;61:101656.
Tomczak K, Czerwińska P, Wiznerowicz M. Review the cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol Współcz Onkol. 2015;2015:68–77.
Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–52.
Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci. 2001;98(19):10869–74.
Chhabra R. miRNA and methylation: a multifaceted liaison. 2015;16(2):195–203.
Xuan J, Jing Z, Yuanfang Z, Xiaoju H, Pei L, Guiyin J, Yu Z. Comprehensive analysis of DNA methylation and gene expression of placental tissue in preeclampsia patients. Hypertens Pregnancy. 2016;35(1):129–38.
Sequential Quadratic Programming. In: Nocedal J, Wright SJ, editors. Numerical optimization. New York, NY: Springer New York; 1999. pp. 526–573.
Moon S, Lee H. MOMA: a multitask attention learning algorithm for multiomics data interpretation and classification. Bioinformatics. 2022;38(8):2287–96.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.
Biau G, Scornet E. A random forest guided tour. TEST. 2016;25(2):197–227.
Senan EM, Abunadi I, Jadhav ME, Fati SM. Score and correlation coefficient-based feature selection for predicting heart failure diagnosis by using machine learning algorithms. Comput Math Methods Med. 2021;2021:8500314.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, California, USA. Association for Computing Machinery; 2016. pp. 785–794.
Dębniak T, Cybulski C, Górski B, Huzarski T, Byrski T, Gronwald J, Jakubowska A, Kowalska E, Oszurek O, Narod SA, et al. CDKN2A-positive breast cancers in young women from Poland. Breast Cancer Res Treat. 2007;103(3):355–9.
Aphivatanasiri C, Li J, Chan R, Jamidi SK, Tsang JY, Poon IK, Shao Y, Tong J, To KF, Chan SK, et al. Combined SOX10 GATA3 is most sensitive in detecting primary and metastatic breast cancers: a comparative study of breast markers in multiple tumors. Breast Cancer Res Treat. 2020;184(1):11–21.
Zhao J, Liu J, Wu N, Zhang H, Zhang S, Li L, Wang M. ANGPTL4 overexpression is associated with progression and poor prognosis in breast cancer. Oncol Lett. 2020;20(3):2499–505.
Wang R, Yu W, Zhu T, Lin F, Hua C, Ru L, Guo P, Wan X, Xue G, Guo Z, et al. MED27 plays a tumor-promoting role in breast cancer progression by targeting KLF4. Cancer Sci. 2023;114(6):2277–92.
Mei J, Wang T, Zhao S, Zhang Y. Osthole inhibits breast cancer progression through upregulating tumor suppressor GNG7. J Oncol. 2021;2021:6610511.
Funding
This work is supported by the National Natural Science Foundation of China (No. 62362004, No. 61862006 and No. 62261003) and the Natural Science Foundation of Guangxi Province (No. 2020GXNSFAA159074).
Author information
Contributions
YH and PZ conceived the presented idea. YH and PZ developed the theory and wrote the software package. CZ verified the analytical methods and performed the computations. YH and PZ wrote the original draft. CZ reviewed the draft. YH and CZ provided the funding. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Huang, Y., Zeng, P. & Zhong, C. Classifying breast cancer subtypes on multiomics data via sparse canonical correlation analysis and deep learning. BMC Bioinformatics 25, 132 (2024). https://doi.org/10.1186/s12859-024-05749-y
DOI: https://doi.org/10.1186/s12859-024-05749-y