Multiview clustering of multi-omics data integration by using a penalty model

Background: Methods for the multiview clustering and integration of multi-omics data have been developed recently to address problems caused by data noise and limited sample size and to integrate multi-omics data with consistent (common) and differential cluster patterns. However, the integration of such data still suffers from limited performance and low accuracy. Results: In this study, a computational framework for multiview clustering based on a penalty model is presented to overcome the challenges of low accuracy and limited performance when integrating multi-omics data with consistent (common) and differential cluster patterns. The performance of the proposed method was evaluated on synthetic data and four real multi-omics datasets and compared with approaches from the literature under different scenarios. The results imply that our method exhibits competitive performance compared with recently developed techniques when the underlying clusters in the synthetic data are consistent; when the clusters are differential, the proposed method performs even better. On real omics data, the developed method likewise exhibits superior performance, providing more detailed information within each data type and integrating multi-omics data with consistent (common) and differential cluster patterns more effectively. The proposed method also yields more significant differences in survival times across all studied cancer types. Conclusions: A new multiview clustering method based on a penalty model is proposed and validated on synthetic and real data. It outperforms techniques previously presented in the literature in integrating multi-omics data with consistent and differential cluster patterns and in determining the significance of differences in survival times.

Analyses that rely on a single type of omics data exhibit a causal reductionism, resulting in unsatisfactory results [1]. The rapid advancement of high-throughput technology has produced a massive amount of data from patients with different types of cancer, facilitating the collection of different genome-scale datasets to address clinical and scientific challenges. The Cancer Genome Atlas (TCGA) is one of the most prominent projects; it provides a considerable amount of omics data obtained from different platforms (e.g. DNA methylation and gene expression). Therefore, developing methods that integrate different types of omics data and create a comprehensive picture of a given disease or biological process is necessary to improve disease detection, treatment and prevention [2].
Clustering is a popular technique for exploratory data analysis [3], and it has been used as a fundamental step in the comprehensive analysis of omics data to identify cancer subtypes [4][5][6] and detect correlated gene expression patterns [7]. Traditional clustering methods for omics data analysis involve one data type, such as DNA methylation [8]. However, information from one data type may be inconsistent because of the small number of samples relative to the large number of measurements, scale differences, collection bias, noise in each dataset and the complementary nature of the information provided by different types of omics data. Several methods have therefore focused on multi-omics data clustering to integrate information from different types of omics data [9][10][11][12] or have used a concordant data structure to perform clustering [10, 12-16].
Although multiview clustering methods have significantly improved clustering performance, some methods assume that the underlying clusters for different data types are consistent [11,12,16], whilst others assume that such clusters are differential [17][18][19]. In reality, multi-omics data should exhibit simultaneous consistent and differential cluster patterns, in accordance with our intuition. A few integrative clustering methods have been proposed recently to integrate multiview data with consistent and differential cluster patterns [20][21][22][23]. The authors of [24] proposed a multiview clustering method that solves the multiview spectral clustering optimisation problem [17] by using a line search technique. However, all previous studies reported low performance in integrating different data types. The optimisation problem exhibits orthogonality constraints, and several methods have been proposed to solve optimisation problems with such constraints: gradient-based methods [24][25][26], conjugate gradient methods [27,28], projection-based methods [29], constraint-preserving updating schemes [30,31], a multiplier correction framework [32] and penalty function methods [33].
In the present study, we propose a new approach for multiview clustering that aims to integrate different types of data with consistent and differential cluster patterns. The current work develops the computational framework for multiview clustering presented in [23] to improve performance in integrating different types of omics data with consistent and differential cluster patterns. The procedure of the proposed method is based on the penalty model, and the first-order algorithm in [33] is used to solve the multiview spectral clustering optimisation problem with orthogonality constraints [21]. Moreover, we evaluated the performance of the proposed method on synthetic data; on the basis of the obtained high accuracy, we then applied the method to four different types of real multi-omics data. The experimental results demonstrated that the proposed method outperformed other methods presented in the literature.
The major contributions of this work are as follows. Firstly, a new multiview clustering method is provided for integrating multi-omics data with consistent and differential cluster patterns. Secondly, the proposed method is used to determine the significance of differences in survival times. Finally, comprehensive experiments against several state-of-the-art multiview clustering methods are performed to validate the proposed method. The remaining sections of this paper are organised as follows. The "Related work" section provides a review of related work. The "Proposed method" section presents the proposed method. The "Results and discussion" section describes the experiments performed on the datasets by using state-of-the-art algorithms for comparison. The "Conclusion" section concludes the study.

Related work
In this section, we review related work on multiview clustering methods for integrating different types of data.
Currently available approaches can be classified into three types. The first type is based on a concordant data structure when performing clustering [10, 12-16]. iCluster is an integrative clustering method based on a Gaussian latent variable model with lasso-type penalty terms that induce sparsity in the coefficient matrices for feature selection. This approach's significant computational complexity necessitates gene preselection; therefore, the clustering outcomes depend on this step [12,15]. To address the issue of gene preselection, a fused network method was proposed. This method uses sample networks as the foundation for integration and merges the similarity networks built for each data type into a single combined similarity network via an iterative approach based on message passing; the final clusters of the fused network are obtained using spectral clustering [13]. A multiple kernel learning method that aims to reduce the dimensions of data from different sources conserves the neighbour distances for all data types by extending spectral clustering to accept several affinity matrices as input and then fuses the matrices by using a linear combination with weight optimisation [12].
The second type is based on spectral clustering. The affinity aggregation spectral clustering method provides a framework for learning the spectral clustering similarity matrix to improve the robustness of spectral clustering by reducing the effect of unreliable and irrelevant features [34]. The authors of [21] and [22] aimed to maximise consistency between clusters from different views by using various cluster consistency measures. The formulation of the optimisation problem in [22] includes an alternating computation of the eigenvectors of the regularised Laplacian matrix; thus, the method is less stable and more likely to yield a local optimum.
For the third type, multiview clustering methods have demonstrated remarkably improved clustering performance, although some of these methods assume that the underlying clusters across different data types are consistent. For example, methods applied to cancer patients assume that the underlying subtypes are the same across different data types [11,12,16]. The flaw of these methods is that they do not check whether genes, microRNA (miRNA) and other small molecules exhibit the same cluster patterns across different subtypes. Several methods based on the assumption that the underlying clusters across various data types are different have also been proposed; they compare clusters identified in different data or merge information that demonstrates differences in single objects [17][18][19]. Meanwhile, a few methods have focused on the problem of integrating different datasets with consistent and differential cluster patterns. The method in [22] tends to reach a local optimum, whereas the methods in [21,35] overly relax the original multiview ratio cut model, likely losing the information in each data type. A multiview clustering method based on manifold optimisation was proposed to solve the multiview spectral clustering optimisation problem [23] by using a line search technique. All these methods exhibit limited performance in integrating different data types. Therefore, we propose a new multiview clustering method to obtain better performance.

Proposed method
In this section, we introduce the penalty model, the steps of the proposed method, the algorithm, and an example of our method.

Penalty model
Consider the following matrix optimisation problem with orthogonality constraints:

min_{X ∈ R^{n×p}} f(X)   subject to   X^T X = I_p,   (1)

where I_p is the p × p identity matrix and f : R^{n×p} → R is assumed to be smooth throughout this section. We start with the merit function

h(X) = f(X) − (1/2)⟨δ(X), X^T X − I_p⟩ + (β/4)‖X^T X − I_p‖_F^2,   (2)

with the multiplier δ(X) = ψ(X^T ∇f(X)), where ψ : R^{p×p} → R^{p×p}, ψ(X) = (X^T + X)/2, denotes the linear operator for symmetrisation. This merit function was first defined by Gao et al. [36], and it evaluates the function value reduction of the proximal linearised augmented Lagrangian algorithm (PLAM). The penalty model that minimises h(X) under a compact convex constraint (PenC) is

min_{X ∈ M} h(X),   (3)

where M is a compact convex set that contains St_{n,p}.
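As a concrete illustration, the symmetrisation operator ψ and the feasibility measure for the orthogonality constraint can be sketched in a few lines of NumPy (a minimal sketch; the function names are ours, not from [33] or [36]):

```python
import numpy as np

def sym(M):
    """Symmetrisation operator psi(M) = (M + M^T) / 2."""
    return (M + M.T) / 2

def orth_violation(X):
    """Frobenius-norm violation of the constraint X^T X = I_p."""
    p = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(p), 'fro')

# A QR factorisation yields a point on the Stiefel manifold St(6, 3),
# so its orthogonality violation is (numerically) zero.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 3)))
```

Any matrix in the ball M with zero orthogonality violation lies on St_{n,p}, which is why the penalty model can be made equivalent to problem (1).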
Xiao et al. [33] demonstrated that the original problem (1) and the penalty model (3) are equivalent, and they proposed a first-order method (PenCF) for solving problem (3). In our work, we used PenCF to solve our problem because our objective function is a smooth function minimised under orthogonality constraints.

Steps of the proposed method
The proposed method has five steps. Firstly, we normalise each feature of each data type to have a standard normal distribution [10]. Secondly, we convert each data type into a patient-patient k-nearest-neighbour (k-NN) similarity network on the basis of a spectral clustering method [37,38]. Thirdly, we obtain a multiview spectral clustering optimisation problem from the multiview network clustering model to integrate the multiple similarity networks [21,35]. Fourthly, we use the penalty model, i.e. the first-order algorithm from [33], to solve the multiview spectral clustering problem with orthogonality constraints, repeating the processing of the penalty model until convergence is achieved to obtain the matrix X that contains the label information for each patient. Finally, we use k-means to cluster the matrix X into C_1, C_2, C_3, ..., C_K. We present the details of these steps as follows.

Normalization
We start by normalising each feature across all data types to have a standard normal distribution. We use the following form:

ǧ = (g − E(g)) / √Var(g),

where ǧ denotes the feature after normalisation, g denotes a feature of any data type, E(g) represents the mean of the feature, and Var(g) represents its variance.
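This normalisation step can be sketched as follows (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def normalise(G):
    """Z-score each feature (column): (g - E(g)) / sqrt(Var(g))."""
    mu = G.mean(axis=0)
    sd = G.std(axis=0)
    sd[sd == 0] = 1.0               # leave constant features unchanged
    return (G - mu) / sd

# Toy data: 3 samples, 2 features.
G = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 30.0]])
G_norm = normalise(G)
```

After normalisation, every feature has zero mean and unit variance, so features measured on different scales contribute comparably to the similarity computation below.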

Construction of the k-NN similarity network
Let X_m denote the normalised data matrix of data type m, where g_m is the number of features of data type m. For each data type X_m, we construct a patient-to-patient similarity network G_m to model the local neighbourhood relationships between the samples. Let G_m = (V_m, E_m, A_m) denote the patient similarity network for data type m, where V_m = {v_1, v_2, ..., v_N} denotes the node set, E_m denotes the edge set, and A_m denotes the adjacency matrix. The nodes represent the N patients, and the edges represent the connections between patients.
The adjacency matrix A_m of the network G_m is a symmetric matrix whose entry A_m(i, j) represents the edge weight if there is an edge between nodes v_i and v_j; otherwise, A_m(i, j) = 0. To construct this similarity network, we firstly compute a similarity matrix measuring the pairwise similarity between each sample pair. Here, we use a common similarity measure, the Gaussian similarity, as the similarity metric [37,38]:

S_m(i, j) = exp(−‖x_m(i) − x_m(j)‖^2 / (2α^2)),

where ‖x_m(i) − x_m(j)‖ is the Euclidean distance between patients x_m(i) and x_m(j), and parameter α controls the width of the neighbourhoods. Several options are available for setting parameter α; here, we use the most common choice and set α as the standard deviation of the distances ‖x_m(i) − x_m(j)‖. Then, we construct a k-NN network from the similarity matrix S_m. We denote by N_i the set of the k nearest neighbours of node v_i. We then connect v_i and v_j with an undirected edge of weight S_m(i, j) if v_j ∈ N_i, as shown in Eq. (7).
Thus far, the k-NN similarity network G m for each data type is constructed.
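The two operations above, Gaussian similarity followed by k-NN sparsification, can be sketched as follows (a minimal sketch with a fixed α; the function names are ours):

```python
import numpy as np

def gaussian_similarity(X, alpha):
    """S(i, j) = exp(-||x_i - x_j||^2 / (2 * alpha^2)) for row vectors of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * alpha ** 2))

def knn_network(S, k):
    """Keep an undirected edge (i, j) with weight S(i, j) if j is among
    the k nearest neighbours of i (then symmetrise)."""
    n = S.shape[0]
    A = np.zeros_like(S)
    for i in range(n):
        nbrs = [j for j in np.argsort(-S[i]) if j != i][:k]
        A[i, nbrs] = S[i, nbrs]
    return np.maximum(A, A.T)       # undirected network

# Four 1-D "patients": two tight pairs far from each other.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
A = knn_network(gaussian_similarity(X, alpha=1.0), k=1)
```

With k = 1, each tight pair is connected while the two pairs remain unlinked, which is exactly the local-neighbourhood structure the k-NN network is meant to preserve.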

Multiview network clustering model for integrating the multiple similarity networks
To integrate all the k-NN similarity networks, we first compute the Laplacian matrix L_m = D_m − A_m for each k-NN similarity network G_m for all data types,
where A_m is the adjacency matrix of the k-NN similarity network G_m, and D_m is the diagonal degree matrix of G_m, whose entries are the degrees of the nodes (patients) in the m-th network. Then, we present the multiview network clustering model [21,35]. We denote the number of clusters across the M networks as K, and S_m as the matrix assigning the N nodes to the K clusters of network G_m.
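The Laplacian computation above is a one-liner per network (sketch):

```python
import numpy as np

def laplacian(A):
    """Unnormalised graph Laplacian L = D - A, where D is the diagonal
    degree matrix with D[i, i] = sum of row i of A."""
    D = np.diag(A.sum(axis=1))
    return D - A

# A 3-node path-like toy network: node 0 connected to nodes 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
L = laplacian(A)
```

The Laplacian of an undirected network is symmetric and its rows sum to zero, properties that spectral clustering relies on.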
We firstly align the clusters obtained for different networks to reveal consistent and differential cluster patterns across the multiple networks. Directly aligning the identified cluster patterns in each network is computationally complex. Thus, Zhang et al. proposed identifying the cluster patterns for each network and then aligning the clusters across the networks. The cluster patterns in each network are identified using spectral clustering, and for cluster alignment across networks, the similarity between the k-th cluster in network G_m and the k-th cluster in network G_h is defined accordingly. Zhang et al. [21] and Chen et al. [35] aimed to maximise the similarities of the corresponding clusters in all networks, and the objective function of the multiview network clustering model follows, where β is the parameter controlling the contributions of the intra- and inter-network connections. The optimisation problem is formulated as problem (11), where S_{m,k} represents the k-th column of S_m. The first term in the objective function clusters the nodes within each network, and the second term aligns the clusters across networks. The aligned clusters in all the views are obtained by solving optimisation problem (11).
The preceding optimisation problem is relaxed to a multiview spectral clustering optimisation problem as follows. To relax both constraints in optimisation problem (11), we set X_m = S_m / ‖S_m‖_2 and transform the constraints for each network into a single equation. In the first constraint of (11), the variables S_m are relaxed from binary to real values. To combine the information of all networks, the second constraint is relaxed to X^T X = I_K. However, this relaxation results in a loss of information from each network [23]. To preserve this information, we instead solve optimisation problem (14), in which the constraint X_m^T X_m = I_K is a direct relaxation of Σ_{k=1}^{K} S_{m,k} = 1 and retains more information than the relaxation in (11). Problem (14) is an optimisation problem with orthogonality constraints. Such constraints can be expressed via the Stiefel manifold St_{n,p} = {X ∈ R^{n×p} | X^T X = I_p}, with X ∈ St_{n,p}. Thus, problem (14) can be written as problem (15), which is an optimisation problem defined on a manifold.

Solving the optimization problem
In this step, we solve optimisation problem (15) by using the penalty model, i.e. the first-order algorithm from [33]. This algorithm iteratively implements the following two steps.
Step 1. As mentioned earlier, Xiao et al. [33] demonstrated that the original problem (1) and the penalty model (3) are equivalent. We begin by computing the gradient of the augmented Lagrangian [36] to approximate problem (15). The gradient of the augmented Lagrangian function with respect to X is formulated as

D_m = ∇f(X_m) − X_m δ(X_m) + β X_m (X_m^T X_m − I_K),   (16)

where D_m denotes the gradient of the augmented Lagrangian, ∇f(X_m) represents the gradient of the objective function of problem (15), and δ(X) is the Lagrangian multiplier. The PLAM method in [36] provides a closed-form expression for the Lagrangian multiplier δ(X), defined as

δ(X) = ψ(X^T ∇f(X)),   (18)

where the operator ψ is the symmetrisation defined in Eq. (2). From Eqs. (18) and (19), the Lagrangian multiplier has the form

δ(X) = (X^T ∇f(X) + ∇f(X)^T X) / 2.   (20)

We use Eq. (16) to calculate the gradient of the augmented Lagrangian.
Step 2. We perform the following iterations. We compute X̃_{m+1} = X_m − µ_m D_m and then check whether ‖X̃_{m+1}‖_F is greater than the parameter r. If it is, then X_{m+1} = (r / ‖X̃_{m+1}‖_F) X̃_{m+1}; otherwise, X_{m+1} = X̃_{m+1}. Here, µ_m is the stepsize and r is the radius; we explain how these two parameters are selected in the "Results and discussion" section. We repeat the processing of the aforementioned steps until convergence. Finally, we obtain the values of X.
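The update in Step 2 can be sketched as follows, using a toy gradient in place of the augmented Lagrangian gradient D_m that Step 1 would supply (a minimal sketch under that assumption; the names are ours):

```python
import numpy as np

def penalty_step(X, D, mu, r):
    """One Step-2 iteration: gradient step X - mu * D, then pull the
    iterate back into the Frobenius-norm ball of radius r if needed."""
    X_new = X - mu * D
    nrm = np.linalg.norm(X_new, 'fro')
    if nrm > r:
        X_new = (r / nrm) * X_new   # rescale onto the ball's boundary
    return X_new

# Toy illustration: minimise 0.5 * ||X||_F^2, whose gradient is X itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
r = 1.5 * np.sqrt(2)                # radius r > sqrt(p), here p = 2
for _ in range(50):
    X = penalty_step(X, X, mu=0.1, r=r)
```

Every iterate stays inside the compact convex set B_r, which is what makes the penalty model well posed even before convergence to the Stiefel manifold.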

k-means
We begin by setting X as N × M points in R K , and then use k-means clustering to obtain the clusters. The output is the class label of the patients (nodes) in each network. We call our method multiview clustering by using the penalty model (MVCPM).
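The final step can be sketched with a plain NumPy k-means (Lloyd's algorithm); in practice any standard k-means implementation applied to the rows of X would do:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centre assignment and
    centre recomputation until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centres[k] for k in range(K)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

# Rows of X play the role of the points to cluster; here, two clear blobs.
rows = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(rows, K=2)
```

The returned labels give the final cluster (e.g. cancer subtype) of each patient in each network.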

Algorithm
In this subsection, we present the algorithm framework of the proposed method as shown in Algorithm 1. For convenience, we call it MVCPM.

Results and discussion
We investigate the performance of our method (i.e. MVCPM) in this section. The performance of the proposed method is compared with several recently developed methods by using synthetic and real data. The comparison methods are the affinity aggregation for spectral clustering (AASC) algorithm [34], the multiview clustering based on manifold optimisation (MVCMO) method [23] and the multiview spectral clustering (MVSC) method [21].

Comparison using synthetic data
We begin by simulating network structures because all the methods mentioned earlier operate on networks. The networks are generated via the stochastic block model [39]. Assume that M views are present and that N is the number of nodes in each view. First, the N nodes are assigned to K clusters. Then, the connections within and between clusters are generated using a given probability matrix, whose diagonal entries specify the connection probabilities within clusters and whose off-diagonal entries specify the connection probabilities between the corresponding clusters. To examine the effect of changing the between-cluster connections, we use four connection probability matrices, in each of which the within-cluster connection probability is larger than the between-cluster probability. The entry (l, l) of each probability matrix represents the connection probability within cluster l, whilst the entry (l, s), l ≠ s, represents the connection probability between clusters l and s. To evaluate its performance, the proposed method is compared with existing methods under different scenarios. Firstly, performance is analysed for different numbers of views M and nodes N when the cluster sizes are the same in every view. We repeat each experiment 50 times per setting and calculate the identification accuracy of the clusters, measured by the Rand index, i.e. the fraction of node pairs on which the inferred and true partitions agree (pairs placed together in both partitions, or apart in both, divided by the total number of pairs). As shown in Tables 1 and 3, when cluster sizes are the same across all M views, all four methods cluster the nodes with a mean Rand index close to 1 and a low standard deviation.
Thus, all the methods exhibit comparable performance and demonstrate the significance of data integration, with AASC and our method performing best when the between-cluster connection probability is 0.01.
Tables 2 and 4 present the results of the four methods when cluster sizes differ. Our method outperforms all the other methods, primarily because it solves optimisation problem (15). Most of the other methods formulate their optimisation problems under the assumption that cluster sizes are the same across different networks and then focus on finding the common features of the clusters to perform clustering. In AASC, for example, the M vectors that minimise the ratio cut of the weighted combination of all the networks are computed; clustering these M vectors yields a common cluster assignment for all the considered networks. In Tables 2 and 4, the mean Rand index of the AASC method is lower than those of the other methods: the other methods can detect cluster differences between views, whereas AASC regards the views as the same. Thus, MVCPM, MVSC and MVCMO exhibit better performance than AASC. The MVCPM method outperforms MVSC and MVCMO and has the highest mean Rand index. MVCPM also performs more consistently when networks are fewer, with a considerably lower standard deviation of the Rand index.
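The synthetic-data generation and the accuracy measure used above can be sketched as follows (a minimal sketch; function names and the toy probability matrix are ours, not the paper's exact settings):

```python
import numpy as np

def sbm(sizes, P, seed=0):
    """Undirected stochastic block model: connect nodes i and j with
    probability P[c_i, c_j], where c_i is the cluster of node i."""
    rng = np.random.default_rng(seed)
    c = np.repeat(np.arange(len(sizes)), sizes)
    n = c.size
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < P[c[i], c[j]]:
                A[i, j] = A[j, i] = 1.0
    return A, c

def rand_index(a, b):
    """Fraction of node pairs on which partitions a and b agree."""
    a, b = np.asarray(a), np.asarray(b)
    n, agree = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / (n * (n - 1) / 2)

# Within-cluster probability 0.5, between-cluster probability 0.05.
P = np.array([[0.5, 0.05],
              [0.05, 0.5]])
A, truth = sbm([20, 20], P)
```

Note that the Rand index is invariant to label permutation, so a clustering that swaps the two labels still scores 1.0 against the truth.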

Selection of the parameters of the proposed method
In this section, we describe the method for choosing the values of the penalty parameter β , the stepsize µ k , and the radius r.
Penalty parameter β. By Theorem 4.1 of [33], β should be sufficiently large to guarantee convergence. Although we can estimate a suitable β that meets the requirements of the theorem, such a β is too large to be practically helpful. In our method, we set β to be less than t = ‖f(X_0)‖. Radius r. In the penalty model, we select the compact convex set to be the ball B_r with radius r > √p. Stepsize µ_k. In practice, the upper bound on µ_k used in Theorem 4.1 is too restrictive. Therefore, we recommend using the alternating Barzilai-Borwein (ABB) stepsize [15] and the Barzilai-Borwein (BB) stepsize [40]. Moreover, we compare the penalty model for solving problem (15) with n = 1000, p = 30 and different pairs of β and r: β = 10^{−6}t, 10^{−5}t, 10^{−4}t, 10^{−3}t, 10^{−2}t, 10^{−1}t, t, 10t and r = 1.01^m √p, for even m.
The results are presented in Fig. 2, from which we can conclude that our algorithm is insensitive to the choice of r when r ≤ 2√p. Furthermore, a small β leads to a fast convergence rate. Consequently, we suggest choosing β = 0.01t and r = 1.04√p. To guarantee a fair comparison with the literature, we set β = 1 when comparing the performance of our method with those of the other methods.
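The BB stepsize recommended above can be sketched as follows (BB1 variant; a minimal sketch, not the authors' exact implementation):

```python
import numpy as np

def bb_stepsize(X, X_prev, G, G_prev, fallback=1e-3):
    """Barzilai-Borwein (BB1) stepsize <S, S> / |<S, Y>|, where
    S = X - X_prev and Y = G - G_prev are the changes in the iterate
    and in the gradient between two consecutive iterations."""
    S, Y = X - X_prev, G - G_prev
    denom = abs(np.sum(S * Y))
    if denom < 1e-12:
        return fallback             # safeguard against division by ~0
    return np.sum(S * S) / denom

# For a quadratic f(x) = 0.5 * a * x^2 the gradient is a * x, and the
# BB1 step recovers exactly the inverse curvature 1 / a.
a = 2.0
x0, x1 = np.array([1.0]), np.array([0.5])
mu = bb_stepsize(x1, x0, a * x1, a * x0)
```

The ABB variant simply alternates this formula with its companion |<S, Y>| / <Y, Y> across iterations.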

Real data experiments
In this section, we use four different types of real cancer data to evaluate the performance of the proposed method and compare it with those of other methods. We select two commonly used measures, namely, the silhouette score to measure clustering performance [41] and the P-value in the Cox log-rank test to evaluate the significance of differences between survival times [42].

Dataset
This study uses real multi-omics data generated by TCGA. We use four of the cancer datasets processed in [16]: glioblastoma multiforme (GBM), lung squamous cell carcinoma (LSCC), kidney renal clear cell carcinoma (KRCCC) and colon adenocarcinoma (COAD). Each cancer dataset contains miRNA expression, DNA methylation and gene expression data.

Comparison using silhouette score
We begin by constructing the k-NN similarity networks for each data type, choosing the five nearest neighbours and setting the number of clusters K as in [16]. Then, we apply the proposed method to the k-NN similarity networks to obtain the final result. We use the silhouette score to measure the clustering performance of our method (MVCPM) compared with that of the other methods [41]. For the AASC method, we use the code developed in [34]; for the MVCMO and MVSC methods, we use the code from [23]. We compute the silhouette score of the common clusters and the average silhouette score over all data types to assess the cluster assignment in each dataset and the common clusters across the three datasets. Because the cluster assignments in the three data types are not the same, we also calculate the silhouette score of integrative clustering by using k-NN to assign patients previously placed in different clusters into one cluster. We denote the silhouette score of the common clusters, the average silhouette score and the silhouette score for integrative clustering as (MVCPMcom, MVCMOcom, MVSCcom), (MVCPMavg, MVCMOavg, MVSCavg) and (MVCPMint, MVCMOint, MVSCint), respectively. The results are provided in Table 5.
From Table 5, the common clusters identified using our method provide the highest silhouette score, indicating that the proposed method can efficiently capture common cluster structures across multiple views. The average silhouette scores are also higher than those of the other methods in three of the four datasets, showing that MVCPM maintains the clustering quality in each dataset. Because the cluster assignments in the three data types are not the same, we use k-NN to assign patients originally clustered into different clusters into one cluster and compute the silhouette scores MVCPMint, MVCMOint, AASCint and MVSCint. In this case, the common and average silhouette scores of all the methods are higher than those of AASC, implying that all the methods except AASC can identify the common clusters and capture the common cluster structures efficiently across multiview data. Moreover, all the methods except AASC have a high silhouette score for integrative clustering, indicating that they can perform integrative clustering well.
When we compare MVCPM, MVCMO and MVSC, MVCPM has the highest silhouette score for common clusters and the highest average silhouette score. MVCPM provides more detailed information within each data type and is better at integrating different types of omics data with consistent and differential cluster patterns.

Comparison by Cox survival P-values
In this subsection, we compare the performance of our proposed method with those of other methods by using the P-value in the Cox log-rank test to determine the significance of the difference in survival times [42]. The lowest P-values are used to determine the number of clusters for each cancer type in our proposed method. To ensure that the proposed method's results are comparable with those of the other methods, we set the number of clusters for each cancer type for the other methods to be the same as the number of clusters for our proposed method. The results are provided in Table 6.
As indicated in Table 6, our method provides more significant differences in survival times across all types of cancer studied compared with the other methods, with the P-values it produces being at least comparable with those of the other methods. The survival plots for all methods across the COAD, GBM, KRCCC and LSCC cancers are presented in Figs. 3, 4, 5 and 6.

Conclusion
This study proposed a new method for the multiview clustering of multi-omics data based on a penalty model. Synthetic data and four different types of real multi-omics data were used to test the performance of the proposed method. For the synthetic data, we adopted the stochastic block model to generate the networks and applied the proposed method and the other methods. To assess the performance of the proposed method, we compared it with some recently developed multiview clustering methods by using the synthetic data. The results showed that when the underlying clusters were consistent, the proposed method exhibited performance competitive with the recently developed methods; when the underlying clusters were differential, the proposed method performed better.
For the real multi-omics data, we downloaded the gene expression, miRNA expression and DNA methylation datasets for four different types of cancer, namely, GBM, COAD, LSCC and KRCCC, from TCGA. To test the clustering performance of all the methods, we used two commonly used measures: the silhouette score to measure clustering performance and the P-value in the Cox log-rank test to evaluate the significance of differences between survival times. For the silhouette score, the proposed method obtained the highest value and provided more detailed information within each data type. For the Cox survival P-values, the proposed method resulted in more significant differences in survival times across all studied data types. Thus, the proposed method can be considered the best of the compared approaches for integration and clustering. Although our method performed well in the synthetic and real omics data analyses, some issues still need to be addressed. For example, we set parameter β = 1 and the number of clusters to three throughout our study to guarantee a fair comparison with the conventional methods. However, the number of clusters cannot be chosen freely in practice because it must be selected on the basis of the eigengap, and no theoretical guarantee exists for determining the optimal number of clusters. Consequently, developing a sound statistical method for determining the optimal number of clusters in network-based clustering remains worthwhile research.

Abbreviations
MVCPM: Multiview clustering by using the penalty model (our method); AASC: Affinity aggregation for spectral clustering