A multimodal graph neural network framework for cancer molecular subtype classification

Background The recent development of high-throughput sequencing has created a large collection of multi-omics data, which enables researchers to better investigate cancer molecular profiles and cancer taxonomy based on molecular subtypes. Integrating multi-omics data has been proven to be effective for building more precise classification models. Most current multi-omics integrative models use either an early fusion in the form of concatenation or late fusion with a separate feature extractor for each omic, which are mainly based on deep neural networks. Due to the nature of biological systems, graphs are a better structural representation of bio-medical data. Although few graph neural network (GNN) based multi-omics integrative methods have been proposed, they suffer from three common disadvantages. One is most of them use only one type of connection, either inter-omics or intra-omic connection; second, they only consider one kind of GNN layer, either graph convolution network (GCN) or graph attention network (GAT); and third, most of these methods have not been tested on a more complex classification task, such as cancer molecular subtypes. Results In this study, we propose a novel end-to-end multi-omics GNN framework for accurate and robust cancer subtype classification. The proposed model utilizes multi-omics data in the form of heterogeneous multi-layer graphs, which combine both inter-omics and intra-omic connections from established biological knowledge. The proposed model incorporates learned graph features and global genome features for accurate classification. We tested the proposed model on the Cancer Genome Atlas (TCGA) Pan-cancer dataset and TCGA breast invasive carcinoma (BRCA) dataset for molecular subtype and cancer subtype classification, respectively. The proposed model shows superior performance compared to four current state-of-the-art baseline models in terms of accuracy, F1 score, precision, and recall. The comparative analysis of GAT-based models and GCN-based models reveals that GAT-based models are preferred for smaller graphs with less information and GCN-based models are preferred for larger graphs with extra information. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05622-4.


Background
The fast-growing high-throughput sequencing technology has made DNA and RNA sequencing more efficient and accessible, resulting in a large collection of multi-omics data which makes molecular profiling possible.Due to the heterogeneity in cancer and the complexity of the biological processes, employing multi-omics sequencing data are crucial to more accurate cancer classification and tumor profiling.Many researchers have proposed methods that incorporate multi-omics data for either cancer type classification or cell type clustering [1,2,3,4,5,6,7,8,9,10,11].These methods show that utilizing multi-omics data improves performance, and provides a better understanding of the key pathophysiological pathways across different molecular layers [12].A typical multi-omics data generated from DNA and RNA sequencing usually consists of mRNA expression, microRNA (miRNA) expression, copy number variation (CNV), and DNA methylation [13].The difference in data distributions across each omic, and the complex inter-omics and intra-omic connections (certain omic can act as a promotor or suppressor to genes) add more challenges to developing an integrative multi-omics classification method for cancer molecular subtypes.
Recent studies have shown that cancer taxonomy based on molecular subtypes can be crucial for precision oncology [13,14].An accurate cancer molecular subtype classifier is crucial for early-stage diagnosis, prognosis, and drug development.Traditional cancer taxonomy is based on its tissue origin.In 2014, The Cancer Genome Atlas (TCGA) Research Network proposed a new clustering method for cancers based on their integrated molecular subtypes that share mutations, copy-number alterations, pathway commonalities, and micro-environment characteristics instead of their tissue of origin [13].They found 11 subtypes from 12 cancer types.In 2018, they applied the new taxonomy method to 33 cancer types and found 28 molecular subtypes [15].The new cancer taxonomy provides a better insight into the heterogeneous nature of cancer.
With the recent development in deep learning models, data-driven models benefit from the powerful feature extraction capability of deep learning networks in many fields [16,17,18,19,20,21,22].Most multi-omics integrative models employ an early fusion approach that aggregates multi-omics data (mainly by concatenation) and then applies a deep neural network as a feature extractor; or a late fusion approach that first extracts features from each omic by deep neural networks and then aggregates extracted features as inputs to the classification network.For efficient implementation of multi-omics integrative models, convolutional neural networks (CNNs) are widely used [23].
Traditional deep neural networks are based on the assumption that the inner structure of the data is in Euclidean space [24].Because of the complex interactions across many biological processes, such data structure is not a proper representation of bio-medical data, and researchers proposed graph-based data structures to tackle this limitation.In 2016, a graph convolution network (GCN), ChebNet, was proposed [16].It uses the Chebyshev polynomial as the localized learning filter to extract the graph feature representation.In 2017, Petar Velickovic et al. proposed a graph attention network (GAT) that overcomes GCN's disadvantage of dependence on the Laplacian eigenbasis [25].GAT uses masked self-attention layers to enable nodes to attend over their neighborhoods' features [25].With the recent growing interest in the graph neural network, many graph-based classification methods have been proposed in the bio-medical field.
To utilize the power of graph-structured data, Ramirez et al. proposed a GCN method to use intra-omic connections, protein-protein interaction networks, and gene co-expression networks.The model achieves a 94.71% classification accuracy for 33 cancer types and normal tissue on TCGA data [26].To use the intra-omic connection across multiple omics, Wang et al. proposed MOGONET, a late-fusion GCN-based method that integrates multi-omics data for bio-medical data classification.And they achieve 80.61% accuracy on breast cancer subtype classification with BRCA dataset [5].To compensate for the limitation of GCN, that it only extracts local representation on the graph, Li et al. proposed a parallel-structured GCN-based method that utilizes a gene-based prior knowledge graph for cancer molecular subtype classification [1].There are also other ways to structure the graph.Wang et al. proposed a GCN-based method that uses a KNN-generated cell-cell similarity graph for single-cell sequencing data classification [27].
Since the introduction of GAT in 2017, it has gained more and more interest.Shanthamallu et al. proposed a GAT-based method, GrAMME, with two variations that use a supra-graph approach and late-fusion approach to extract features from a multi-layer graph with intra-omic connections only for classification in social science and political science datasets [28].On the other hand, Kaczmarek et al. proposed a multi-omics graph transformer to utilize an inter-omics connection only graph, the miRNA-gene target network, for cancer classification on 12 cancer types from the TCGA data [7].
There are three common disadvantages of these approaches.First, most of them consider only one kind of connections in their model, either inter-omics or intra-omic connections.They do not aim to utilize both inter-omics and intra-omic connections for more effective feature extraction.Second, they only consider one kind of GNN models, either GCN or GAT.We find that GAT and GCN have their strength in different scenarios as shown in our experiments.Different graph layers are preferred for different tasks even with datasets in a similar domain.Third, most of these methods have not been tested on a more complex classification task.They are used for classification based on the cell-of-origin taxonomy such as cancer type classification and have not been applied to a more complex classification task such as cancer molecular subtype classification, which is more useful for diagnosis, prognosis, and treatment.Inspired by our previous work on the cancer molecular subtype classification based solely on intra-omic connections, we aim to develop a multi-omics integrative framework that exploits the powerful data aggregation property of GCN or GAT models (depending on the situation) and utilizes both the intra-omic network and the inter-omics network for more precise classification.
Our goal is to build an accurate, robust, and efficient multi-omics integrative predictive model to classify these cancer molecular subtypes.In this work, we propose a general framework that can be used with any graph neural networks as the feature extractor, incorporate both gene-based and non-gene-based prior biological knowledge (primarily miRNA), and learn a knowledge graph consisting of both intra-omic and inter-omics connections.We apply the proposed model to classify cancer molecular subtypes and breast cancer molecular subtypes.We choose breast cancer as it is one of the most common and lethal cancers with a large number of samples in TCGA.It can be categorized into four major molecular subtypes based on the gene expression of the cancer cells, and breast cancer subtypes have significant impacts on the patient's survival rates [29].Our experimental results show the proposed method outperforms both the graph-based and CNN-based state-of-the-art methods.
Our contributions in this study are i) a novel generalized GNN-based multi-omics integrative framework for cancer molecular subtype classification, ii) a supra-graph approach that can incorporate both intra-omic and inter-omics prior biological knowledge in the form of graphs, iii) a representation of multi-omics data in the form of heterogeneous multi-layer graph, and iv) a comparative analysis of GCN and GAT based models at different combinations of omics and different graph structures.

Method and Materials
The overview of the proposed framework structure is shown in Figure 1.The input data for the proposed framework is shown as a graph structure on the leftmost side.The data consists of three omics, mRNA expression (orange boxes), copy number variation (CNV) (yellow boxes), and miRNA (green boxes).The details of the network structure are discussed in the following Network Section.The proposed framework consists of 4 major modules: Module 1) a linear dimension-increase neural network, Module 2) a graph neural network (GNN), Module 3) a decoder, and Module 4) a shallow parallel network.Any kind of graph neural network can be used in Module 2. In this study, we focus on graph convolutional network (GCN) and graph attention network (GAT), which are two major kinds of GNN.Experiments about the effect of the decoder and the shallow parallel network modules are discussed in our ablation study.

Network
We build a heterogeneous multi-layer graph based on the prior biological knowledge, i.e. gene-gene interaction (GGI) network from BioGrid and miRNA-gene target network from miRDB [30,31].Inspired by the meta-path and supra-graph approach for the multi-layered network models [32,28], we build a supra-graph with miRNA-miRNA meta-paths.A miRNA-miRNA meta-path is defined as if two miRNA nodes are connected to the same gene node from the GGI network and miRNA-gene network.An example of how we construct the supra-graph is shown in Figure 2. Meta-paths are shown as dotted lines in the figure.
The adjacency matrix of the supra-graph is an (N + M ) × (N + M ) matrix, where N is the number of genes and M is the number of miRNA.Every node in the graph is assumed to be self-connected, thus the diagonal elements of the adjacency matrix in the study are 1.The adjacency matrix of the supra-graph is shown in Equation (1). where We also construct four different kinds of graphs other than supra-graph in our ablation study and apply them to five input combinations of omics: mRNA, miRNA, mRNA + miRNA, mRNA+CNV, mRNA + MiRNA + CNV, to test the effect of the different graphs on the model performance.The four different graphs are defined as follows.Only Gene-based Nodes: When the input combination of omics is mRNA or mRNA+mRNA+CNV (M = 0), the graph is built with the GGI network, Only miRNA-based Nodes: When the input combination of omics is miRNA (N = 0), the graph is built with only miRNA meta-path network, Only Intra-class Edges: The graph only contains GGI network and miRNA meta-path network.
Only Inter-class Edges: The graph only contains miRNA-gene target network.
The input graph is denoted as a tuple G = (V, E, X V ), where V is the set of nodes, E is the set of edges, and x V is the node attributes.The prior knowledge is incorporated into the model through the supra-graph defined above.In the supragraph, nodes consist of both gene-based nodes and miRNA-based nodes, and edges are assigned by the adjacency matrix.Each gene-based node has a node attribute of a vector consisting of both gene expression and CNV data, x v∈Vgene ∈ R 2 .Each miRNA-based node has a node attribute as a scalar, x v∈VmiRNA ∈ R. The gene-based nodes and miRNA-based nodes are fed through a linear dimension-increase layer, denoted as Module 1 in Figure 1 to achieve the same node attribute dimension, X ′ V ∈ R (N +M )×F , where F is the increased node attribute dimension.

Graph Neural Network: Convolution-based
As mentioned before, any graph neural network can be used in the GNN module.We use ChebNet [16] to implement the GCN in this study.The supra-graph adjacency matrix introduced in the previous network section is first Laplacian normalized to L as expressed in Equation ( 4).
where I ∈ R (N +M )×(N +M ) is an identity matrix, and the degree matrix D ∈ R (N +M )×(N +M ) is a diagonal matrix.The eigen decomposition form of L can be obtained as where U = (u 1 , u 2 , . . ., u n ) is a matrix of n orthonormal eigenvectors of L, therefore UU T = I.And Λ = diag(λ 1 , λ 2 , . . ., λ n ) is the eigenvalue matrix [16].
After transforming the graph on the Fourier domain, the learning filter can be approximated by a K th -order Chebshev polynomial.The convolution on the graph by such localized learning filter, h(Λ) can be expressed in Equation (6).
where X j ∈ R (N +M )×F is the features of j-th sample, L = 2L/λ max − I, and K is a hyperparameter, where K = 5 in our study.A max-pooling layer with p = 8 is used to reduce the number of nodes and one layer of fully connected network is used to transform the learned local feature representation to a vector of length 64 for each sample, θ 1 ∈ R 64 .
Graph-Neural Network: Attention-based GAT aims to solve the problem of GCN's dependence on Laplacian eigenbasis of the graph adjacency matrix [25].The updated node attributes are first passed through a linear transformation by a learnable weight, denoted as W ∈ R F ′ ×F , where F is the updated node attribute dimension and F ′ is the intended output dimension for this GAT layer.Then, the self-attention coefficients for each node can be calculated as Equation (7).
where e ij represents the importance of node j to node i and x i , x j are the node attributes for node i, j.Such attention score is only calculated for j ∈ N B(i), where N B(i) is all the first-order neighbor nodes around node i.The method normalizes the attention score by a softmax layer of e ij and uses LeakyReLU as the activation function as express in Equation (8).
The output for each node can be expressed as Equation (9).
A multi-head attention mechanism is used to stabilize the attention score.In our study, the number of heads is 8. Similar to the GCN-based GNN module, the output is then passed through a max-pooling layer and a transformation layer to obtain the local graph representation, θ 1 ∈ R 64 .

Decoder & Shallow Parallel Network
As shown in Figure 1, the decoder is a two-layer fully connected network that is used to reconstruct the node attributes on the input graph.To compensate the localization property of either GCN or GAT layer in the GNN module, we use a parallel shallow fully connected network.Since the prior knowledge graphs have many limitation [1], we may neglect some global patterns in the data when extracting features based on the graph structure only.A shallow two-layer fully connected network is able to learn the global features of the data while ignoring the actual inner structure of the data.These two modules help the framework to better extract the overall sample feature representation.The effect of including vs. excluding these two modules is discussed in detail in the Ablation Study Section.The input of the parallel network is the updated node attributes, X ′ V ∈ R (N +M )×F and the output global representation of the sample, θ 1 is in the same dimension as the local feature representation from the GNN module, θ 2 ∈ R 64 .θ 1 and θ 2 are then concatenated and passed through a classification layer for prediction.

Loss Function
In the proposed framework, we define the loss function L as a linear combination of three loss functions in Equation (10).
where λ 1 , λ 2 and λ 3 are linear weights, L ent is the standard cross-entropy loss for the classification results, L recon is the mean squared error for the reconstruction loss when the decoder is included, and L reg is the squared l 2 norm of the model parameters to penalize the number of parameters to avoid overfitting.L recon is defined as where x j is the flattened feature vector of j-th sample and xj is the corresponding reconstructed vector.We denote W all as the vector consists of all parameters in the model and the L reg is defined as

Results and Discussion
We apply the proposed model to two different classification problems.The first is cancer molecular subtype classification on the TCGA Pan-cancer dataset and the second is breast cancer subtype classification on the TCGA breast invasive carcinoma (BRCA) dataset [15,33].

Data and Experiment Settings
The TCGA Pan-cancer RNA-seq data, CNV data, miRNA data, and molecular subtype labels are obtained from the University of California Santa Cruz's Xena website [34].We only keep samples that have all three omics data and molecular subtype labels, and collect 9,027 samples in total.We use 17,946 genes that are common in both the gene expression data and the CNV data, and 743 miRNAs.The total number of molecular subtypes is 27 and there is a clear imbalance among these 27 classes as shown in Figure 3.All samples from class 24 are excluded from the study due to the lack of miRNA data.For BRCA subtype classification, there are 981 samples in total with 4 subtypes as shown in Table 1.For the experiments on both datasets, 80% of the data is used for training, 10% is used for validation, and 10% is used for testing.All classes are present in the test set.All expression values are normalized within their own omics.We select the top 700 genes ranked by gene expression variances across the samples, and the top 100 miRNAs by miRNA expression variance.Results are averaged from five individual trials.The details of the model structure and hyperparameters are disclosed in the appendix.The model is implemented using Pytorch Geometric Library.

Baseline Models
We selected four state-of-the-art models [1,7,26,28] as baseline models to evaluate the performance of the proposed approach.These four baseline models are implemented within the proposed framework in two forms, one is with the original structure, and the other is with some modifications to accommodate the multiomics data.The details of all graph-based baseline implementation configurations are shown in Table 2.We also included a fully-connected neural network (FC-NN) as a Euclidean-based baseline model.Conventional machine learning methods, such as Random Forest and SVM are not included in the scope of this study because they do not scale well to the multi-omics data as mentioned in our previous work [1].

Fully-connected Neural Network (FC-NN)
The FC-NN is one of the widely used deep learning model for data in Euclidean space.The implemented structure is the same as the parallel structure.The input data is passed through a dimension-increase layer and then flattened.The flattened data is passed through three hidden layers and a softmax layer for classification.

GCN Models by Ramirez et. al.
The GCN model on cancer type classification is designed for gene expression data with intra-omic connections only [26].The implementation of the original structure and the modified structure is a GCN model with no regularization modules.

Multi-omics GCN Models by Li et al.
The multi-omics GCN model on cancer molecular subtype classification is designed for gene expression and CNV data with intra-omic connections only [1].The implementation of both structures is a GCN model with a decoder and a parallel structure as shown in Table 2.

GrAMME
Since GrAMME is not designed for cancer type classification [28], we modified the original structure for multi-omics data.GrAMME is designed for a GAT model with intra-omic connections only.The implementation is a GAT model with no regularization modules.
Multi-omics GAT by Kaczmarek et al.

Performance on Classification
For both classification tasks, the results of the proposed model and the baseline models are shown in Table 3.The proposed model with GAT layers outperforms all the baseline models for both tasks in all four metrics and the proposed model with GCN layers achieves third for the pan-cancer classification, and second for the breast cancer subtype classification.For the task of pan-cancer molecular subtype classification, the additional omic data in the modified structure improve the model performance in all three cases of the baseline model with the original structure vs. the baseline model with the modified structure.For the same task, the multi-omics GCN model with the decoder and parallel structure shows superior performance among all the baseline models that utilize GCN layers.And GrAMME, which utilizes intra-omic connections, performs better than GAT models that utilize interomics connections.GrAMME is the best-performing one among the baseline models for the pan-cancer task.Overall, we see the proposed model achieves the best performance for the classification task on the complex pan-cancer molecular subtype classification in all four metrics and we can conclude that more omics improve the performance of models, and the models with more restriction modules or GAT layers have better performance.
For breast cancer subtype classification, the overall trend is slightly different from that in the previous task.In most cases of including more omics, the performance of the models shows little or no improvement.We believe it is due to the nature of breast cancer taxonomy.The subtype is based on the expression level of multiple proteins.Thus, it makes the breast cancer subtype to be more closely related to the gene expression omic than the pan-cancer molecular subtype does.Such characteristic of the breast cancer subtype makes the model only using gene expression data perform very well such as the original GCN model.However, the proposed model still outperforms any baseline models by a large margin in all four metrics.

Ablation Study
We conduct an ablation study to evaluate the effects of different numbers of genes, different training set splits, different combinations of modules within the model, and different combination of omics and graphs on the performance of the proposed model.We trained the proposed model and all baseline models at the 300 and 500 genes for pan-cancer molecular subtype classification and 300, 500, 1000, 2000, and 5000 genes for breast cancer subtype classification.The limitation of the test scope on pan-cancer classification is due to the computation constraints caused by its large number of samples.As shown in Table 4, increasing the number of gene nodes improves the performance of all models.FC-NN model demonstrates great improvement in performance as the number of genes increases.And the proposed model with the GAT layer outperforms the baseline models at both numbers of genes.

Different Numbers of Genes
The accuracy and F1 scores of the proposed model and the baseline models for BRCA subtype classification are shown in Figure 4.The proposed model with GAT performs best when the number of genes is smaller than 1000 and the proposed model with GCN performs best when the number of genes is larger than 1000.The proposed GAT-based model yields the best result with an accuracy of 88.9% and an F1 score of 0.89 when using 700 genes; and the proposed GCN-based model yields the best result with an accuracy of 90.1% and an F1 score of 0.90 when using 5000 genes.The detailed results are shown in the supplementary file.The performance of the proposed model with GAT deteriorates beyond 1,000 genes, but the performance of the proposed model with GCN continues to rise as the number (b) The F1 scores of the proposed model with GAT (blue solid line) or GCN (orange solid line) and baseline models (dashed line) are plotted against different numbers of genes (300, 500, 700, 1000, 2000, and 5000) for BRCA subtype classification.
of genes grows beyond 1,000 genes.All GAT-based baseline models show similar deterioration around 1000 genes.We think the high computation cost of the GATbased model can cause it to perform worse on a large graph than on a small graph.Overall, we can conclude that the proposed model with GCN layers scales better than that with GAT layers at a large number of genes.
In the process of testing the models on a large graph, we also find that a GATbased model is more stable on a smaller learning rate compared to a GCN-based model.We believe it is caused by GAT's high computation costs since a high learning rate may cause the model to be stuck in a local optimum.
Overall, we see the proposed model achieves the best performance and scales well with a larger number of genes.We can also conclude that more genes and more omics mostly improve the performance of models, the models with more modules have better performance, and GAT-based models perform better with smaller graphs while GCN-based models scale better at larger graphs.

Different Training Set Split
To examine the performance of the proposed model on a complex dataset with a smaller training set, we tested the model on the Pan-cancer dataset using three different training set splits.This approach was taken to mimic situations where only a smaller labeled dataset is available in the real world.The training set splits were set at 70%, 60%, and 50%, with corresponding testing set splits of 20%, 30%, and 40%.Throughout these tests, the validation set split was consistently kept at 10%.As shown in Table 5, the proposed model with the GAT layer exhibits a slight performance deterioration at 70% and 60% training set splits.However, it displays a more pronounced decline in classification accuracy at 50%.In contrast, the proposed model with the GCN layer demonstrates consistent and robust performance across all three training-validation-testing splits.However, its classification accuracy is lower than that of the model with the GAT layer at 70% and 60% training set splits.Therefore, we can conclude that the proposed model with the GAT layer achieves superior performance compared to the model with the GCN layer when the training set is relatively small.However, the model with the GCN layer outperforms at a very small training set (50%).Overall, the proposed model with the GCN layer offers more robust classification performance with smaller training sets.

Different Combinations of Modules
To examine the effect of different modules within the proposed model, we test three different variants of the proposed model for the Pan-cancer molecular subtype classification.All variants of the proposed model are trained with all three omics data at 300, 500, and 700 genes.The proposed model without the decoder acts as a parallel structured GNN model, the proposed model without the parallel structure acts as a graph autoencoder model, and the proposed model without both the decoder and the parallel structure acts as a graph-classification GNN model.As shown in Table 6, models without the parallel structure perform poorly compared to those without the decoder at any number of genes in general.It shows that the parallel structure plays an important role in feature extraction, which also demonstrates the benefit of including both local features and global features.When the graph size is small (300 genes), the model without the decoder and the parallel structure performs more poorly compared to those with either component.However, when the graph size is large enough (500 genes and 700 genes), the model without the decoder and the parallel structure performs relatively the same compared to those with either of the component.We believe the extra information in the large graph compensates for the loss in performance caused by the exclusion of either the decoder or the parallel structure.To test the effect of different choices of omics and different graphs, we generate five different combinations of omics.The five combinations of omics are mRNA, miRNA, mRNA + CNV, mRNA + miRNA, and mRNA + CNV + miRNA.For mRNA + miRNA and mRNA + CNV + miRNA, two different variants of graphs are also tested.All models are conducted for Pan-cancer molecular subtype classification, and trained with 500 genes except for only miRNA omic, which contains only 100 miRNA nodes.As shown in Table 7, the best-performing setting is mRNA + CNV + miRNA with intra-omic edges for both GAT-based and GCN-based models.The worst-performing setting is miRNA, which has the smallest graph size and information.Models on mRNA + CNV perform better than those on mRNA + miRNA, but adding miRNA to mRNA + CNV (mRNA + CNV + miRNA setting) still improves the model performance.Models with intra-omic graph performs slightly better than models with inter-omics graph.The performance difference across different settings is the same for both GAT-based and GCN-based models.

Conclusion
In this study, we propose a novel end-to-end multi-omics GNN framework for accurate and robust cancer subtype classification.The proposed model utilizes multiomics data in the form of a heterogeneous multi-layer graph, which is the supragraph built from GGI network, miRNA-gene target network, and miRNA metapath.While GNNs have been previously employed for genomics data analysis, our model's novelty lies in the utilization of a heterogeneous multi-layer multiomics supra-graph.The supra-graph not only incorporates inter-omics and intra-omic connections from established biological knowledge but also integrates genomics, transcriptomics, and epigenomics data into a single graph, providing a novel advancement in cancer subtype classification.The proposed model outperforms all four baseline models for cancer molecular subtype classification.We do a thorough comparative analysis of GAT and GCN-based models at different numbers of gene settings, different combinations of omics, and different graphs.
Comparing the proposed model to the baseline models, it achieves the best performance for cancer molecular subtype classification and BRCA subtype classification.The proposed model with GAT layers performs better than that with GCN layers at smaller-size graphs (smaller than 1,000 genes).However, the performance of the GAT-based model deteriorates as the size of the graph grows beyond a certain threshold.On the other hand, the performance of the GCN-based model continues to improve as the size of the graph grows.Therefore, we can conclude that a GAT-based model is more suitable on a smaller graph, where it has a higher feature extraction ability and its computation cost isn't that high yet.
By studying the effect of different modules within the proposed model and different combinations of omics, we find the addition of a decoder and the parallel structure, and including other omics improves the performance of the proposed model.The benefit of using parallel structure outweighs that of decoder, especially on smaller-size graphs, and the benefit of adding CNV is higher than that of adding miRNA.We also find that using a graph with only intra-omic edges yields a better performance than using a graph with only inter-omics edges, which agrees with the results from the previous study [7].
The proposed model also has some limitations.We investigate only two wellestablished and widely adopted GNN models.New models are emerging with the recent blooming of studies in GNN models.As the size of the graph grows or more omics are added, GAT-based models become more sensitive to parameters and take a much longer time to train.It is our future research direction to overcome such limitations.The proposed model for cancer subtype classification depends on labeled data, which is costly to annotate and difficult to obtain in the real world.Exploring unsupervised learning for cancer subtype detection is also a direction we aim to pursue in our future research.
In summary, incorporating gene-based and non-gene-based omic data in the form of a supra-graph with inter-omics and intra-omic connections improves the cancer subtype classification.The GAT-based model is preferable on smaller graphs than the GCN-based model.GCN-based model is preferable when dealing with large and complex graphs.

Figure 1 :
Figure 1: The overall structure of the proposed model has four major modules shown as dotted grey rectangles.The input graph consists of inter-omics (red edges), intra-omic (blue edges) edges and miRNA-miRNA meta-path (black dashed edges), and three omics data, mRNA (orange boxes), CNV (yellow boxes), and miRNA (green boxes) is shown as the leftmost side.Module 1 consists of two parallel linear dimension-increase layers for gene-based nodes and miRNA-based nodes.The upgraded graph shown in the middle is obtained by feeding the node attributes from the input graph through module 1, where the dark orange boxes are the updated gene-based node attributes and the dark green boxes are the updated miRNA-based node attributes.Module 2 consists of two graph neural network layers, which can be any graph neural networks.The output of module 2 is then fed into a max pooling layer and then a transformation layer to obtain the learned graph representation (blue boxes).Module 3 consists of a decoder to reconstruct the graph representation back to the input graph node attributes.Module 4 consists of a shallow fully connected network that takes the updated node attributes as the input.The output of the parallel network (grey cubes) is then concatenated with the learned graph representation, and passes through a classification layer for the classification task.

Figure 2 :
Figure 2: The overall graph, supra-graph, is constructed from three different omic data on the left-hand side and two prior knowledge graphs on the right-hand side.mRNA (orangetable) and CNV (yellow table) data are considered gene-based, which have the same dimension.miRNA (green table) data has the same number of rows but different feature lengths for each sample.

Figure 3 :
Figure 3: The number of cases in each molecular subtypes is shown.All samples from class 24 are excluded due to lack of miRNA data.

Figure 4 :
Figure 4: Performance of the Proposed Models and Baseline Models with Different Numbers of Genes on BRCA Dataset

Table 1 :
Number of Cases in Each BRCA Subtype

Table 2 :
Configurations of Baseline Models on Omics, Graph Structure, GNN Layers, and Regularizaiton Modules

Table 3 :
Results of the Proposed and Baseline Models with 700 Genes for Molecular Subtype Classification on the TCGA Pan-cancer Dataset And Cancer Subtype Classificaiton on the TCGA BRCA Dataset

Table 4 :
Results of the Proposed Model and Baseline Models with 300 and 500 Genes for Molecular Subtype Classification Using the TCGA Pan-cancer Dataset The bold font indicates the highest values and the values after ± sign are the standard deviations.

Table 5 :
Proposed Model with Different Training-validation-testing Split

Table 6 :
Results of the Variants of the Proposed Model for Molecular Subtype Classification Using the TCGA Pan-cancer Dataset.The bold font indicates the highest values and the values after ± sign are the standard deviations.1 Accu.stands for Accuracy.

Table 7 :
Results of the Proposed Model on Different Combinations of Omics and Networks at 500 Genes Using the TCGA Pan-cancer Dataset.The bold font indicates the highest values and the values after ± sign are the standard deviations.1 Data contains no miRNA-based nodes, so only 500 gene nodes in the graph