The overall design of our method is illustrated in Fig. 1.
Datasets and preprocessing
In this paper, we selected four cancer datasets from TCGA for our experiments, each comprising gene expression data, miRNA expression data and DNA methylation data from cancer patient samples. The cancer datasets are glioblastoma multiforme (GBM) with 215 samples, breast invasive carcinoma (BIC) with 105 samples, skin cutaneous melanoma (SKCM) with 439 samples and acute myeloid leukemia (AML) with 96 samples.
First, if a patient is missing more than 20% of the values in any data type, the patient is removed. Second, if a feature is missing in more than 20% of patients, it is filtered out. Third, the K-nearest-neighbor method is used to impute the remaining missing values; k is chosen according to the sample size, and in our experiments we set k = 20.
Fourth, we log-transform the data to make it more stable. Finally, each feature is normalized so that it follows a standard normal distribution. We perform the following normalization for each data type:
$$\widehat{f}=\frac{f-E(f)}{\sqrt{Var(f)}} .$$
(1)
where f is a feature of the sample data, \(\widehat{f}\) is the corresponding feature after normalization, and E(f) and Var(f) denote the sample mean and sample variance of f, respectively.
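For concreteness, the preprocessing pipeline can be sketched in Python as follows. This is a minimal illustration, assuming each omic is stored as a features × patients NumPy array with missing values encoded as NaN; the function name `preprocess` and the use of scikit-learn's KNNImputer are our own choices for the sketch, not part of the original pipeline.

```python
import numpy as np
from sklearn.impute import KNNImputer

def preprocess(X, max_missing=0.2, k=20):
    """Filter, impute, log-transform and normalize one omic matrix X
    (features x patients), following the steps described above."""
    # Remove patients missing more than 20% of the features in this data type.
    X = X[:, np.isnan(X).mean(axis=0) <= max_missing]
    # Remove features missing in more than 20% of the remaining patients.
    X = X[np.isnan(X).mean(axis=1) <= max_missing, :]
    # Impute the remaining missing values with k nearest neighbors (k = 20).
    X = KNNImputer(n_neighbors=k).fit_transform(X.T).T
    # Log-transform for stability, then z-score each feature (Eq. 1).
    X = np.log1p(X)
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True, ddof=1)
    return X
```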
Construction of the patient-to-patient similarity graph
Denote by \({\{{\mathrm{X}}^{\mathrm{m}}\}}_{\mathrm{m}=1}^{\mathrm{M}}\) the multi-view data from N patient samples, where M is the number of data types. Each \({\mathrm{X}}^{\mathrm{m}}\) is a \({\mathrm{p}}_{\mathrm{m}}\times \mathrm{N}\) matrix, from which a similarity graph \({\mathrm{G}}^{\mathrm{m}}\) is constructed to reflect the neighborhood relationships among the samples.
For data type m, the similarity graph is \({\mathrm{G}}^{\mathrm{m}}=({\mathrm{V}}^{\mathrm{m}},{\mathrm{E}}^{\mathrm{m}},{\mathrm{W}}^{\mathrm{m}})\), where \({\mathrm{V}}^{\mathrm{m}}\) is the vertex set, \({\mathrm{E}}^{\mathrm{m}}\) is the edge set, and \({\mathrm{W}}^{\mathrm{m}}\) is the adjacency matrix. The adjacency matrix \({\mathrm{W}}^{\mathrm{m}}\) of graph \({\mathrm{G}}^{\mathrm{m}}\) is symmetric.
In this paper, the heat kernel is used to measure the similarity between samples [18]. Its basic form is a Gaussian function with bandwidth parameter t. It has linear complexity and is robust in the sense that it is insensitive to small perturbations.
$${\mathrm{S}}_{\mathrm{ij}}^{\mathrm{m}}=\mathrm{exp}\left(-\frac{{\Vert {\mathrm{x}}_{\mathrm{i}}^{\mathrm{m}}-{\mathrm{x}}_{\mathrm{j}}^{\mathrm{m}}\Vert }^{2}}{2{\mathrm{t}}^{2}}\right),\mathrm{i}=1\dots ,\mathrm{N},\mathrm{j}=1\dots ,\mathrm{N}.$$
(2)
Next, we construct the K-nearest-neighbor graph from the similarity matrix \({\mathrm{S}}^{\mathrm{m}}\). If there is an edge between vertices \({\mathrm{v}}_{\mathrm{i}}\) and \({\mathrm{v}}_{\mathrm{j}}\), then \({\mathrm{W}}_{\mathrm{ij}}^{\mathrm{m}}\) is the edge weight; otherwise it is 0.
$${\text{W}}_{{{\text{ij}}}}^{{\text{m}}} = \left\{ {\begin{array}{*{20}l} {{\text{S}}_{{{\text{ij}}}}^{{\text{m}}} ,} \hfill & {{\text{v}}_{{\text{j}}} \in {\text{N}}_{{\text{i }}} , } \hfill \\ {0,} \hfill & {otherwise.} \hfill \\ \end{array} } \right.$$
(3)
Here \({\mathrm{N}}_{\mathrm{i}}\) is the neighborhood of \({\mathrm{v}}_{\mathrm{i}}\) (including \({\mathrm{v}}_{\mathrm{i}}\) itself) of size k, where k is usually chosen according to the sample size. Essentially, we assume that local similarities are more reliable than distant ones. This is a mild assumption that is widely adopted by other manifold learning algorithms [18].
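The following sketch illustrates Eqs. (2)–(3): a heat-kernel similarity matrix followed by kNN sparsification. The bandwidth t and neighborhood size k are placeholders to be tuned to the data, and symmetrizing W at the end is our choice for keeping the adjacency matrix symmetric, as stated above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_graph(X, k=20, t=1.0):
    """Heat-kernel similarities (Eq. 2) and kNN adjacency matrix (Eq. 3)
    for one omic X (features x patients); k and t must be tuned to the data."""
    D2 = cdist(X.T, X.T, metric="sqeuclidean")   # pairwise squared distances between patients
    S = np.exp(-D2 / (2.0 * t ** 2))             # heat-kernel similarity matrix S^m
    # For each sample, keep only the k most similar samples (including itself).
    W = np.zeros_like(S)
    nn = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(S.shape[0]), k)
    W[rows, nn.ravel()] = S[rows, nn.ravel()]
    W = np.maximum(W, W.T)                       # keep the adjacency matrix symmetric
    return S, W
```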
Construction of the objective optimization problem
The objective optimization problem of spectral clustering is:
$$\begin{aligned} & \mathop {\min }\limits_{{{\text{U}}_{{\text{m}}} \in {\text{R}}^{{{\text{N}} \times {\text{k}}}} }} \,{\text{trace}}\,\left( {{\text{U}}_{{\text{m}}}^{{\text{T}}} {\text{L}}_{{\text{m}}} {\text{U}}_{{\text{m}}} } \right) \\ & {\text{s}}.{\text{t}}.\quad {\text{U}}_{{\text{m}}}^{{\text{T}}} {\text{U}}_{{\text{m}}} = {\text{I}}_{{\text{K}}} . \\ \end{aligned}$$
(4)
Here, \({L}_{m}={D}_{m}-{A}_{m}\), where \({A}_{m}\) is the adjacency matrix of the similarity graph \({\mathrm{G}}^{\mathrm{m}}\) and \({D}_{m}\) is the diagonal matrix of node degrees of the mth network.
Then, the eigenvectors corresponding to the k smallest eigenvalues of \({L}_{m}\) form \({U}_{m}\), and K-means is applied to the rows of \({U}_{m}\) to obtain the cluster labels.
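As a reference point, single-view spectral clustering of one network (Eq. 4) can be sketched as follows; this is a minimal illustration in which the eigenvectors of L with the k smallest eigenvalues form U, whose rows are then clustered by K-means.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, n_clusters):
    """Single-view spectral clustering (Eq. 4): the eigenvectors of L = D - A
    with the k smallest eigenvalues form U, whose rows are clustered by K-means."""
    L = np.diag(W.sum(axis=1)) - W                        # unnormalized Laplacian L = D - A
    _, U = eigh(L, subset_by_index=[0, n_clusters - 1])   # k smallest eigenvectors
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
    return labels, U
```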
Based on spectral clustering, [15] proposed a multi-view network clustering method. Its objective optimization problem is:
$$\begin{aligned} & \min \mathop \sum \limits_{\mathrm{m}=1}^{\mathrm{M}} \mathop \sum \limits_{\mathrm{k}=1}^{\mathrm{K}} \frac{{\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\right)}^{\mathrm{T}}\left({\mathrm{D}}_{\mathrm{m}}-{\mathrm{A}}_{\mathrm{m}}\right)\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\right)}{{\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\right)}^{\mathrm{T}}\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\right)}-\upbeta \mathop \sum \limits_{\mathrm{l}\ne \mathrm{m}} \mathop \sum \limits_{\mathrm{k}=1}^{\mathrm{K}} \frac{{\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\right)}^{\mathrm{T}}\left({\mathrm{S}}_{,\mathrm{k}}^{\mathrm{l}}\right)}{{\Vert {\mathrm{S}}_{,\mathrm{k}}^{\mathrm{m}}\Vert }_{2}{\Vert {\mathrm{S}}_{,\mathrm{k}}^{\mathrm{l}}\Vert }_{2}} \\ & \mathrm{s}.\mathrm{t}.\quad {\mathrm{S}}_{\mathrm{i},\mathrm{k}}^{\mathrm{m}}\in \left\{0,1\right\},\quad \mathrm{i}=1,\dots ,\mathrm{N};\ \mathrm{m}=1,\dots ,\mathrm{M};\ \mathrm{k}=1,\dots ,\mathrm{K}; \\ & \mathop \sum \limits_{\mathrm{k}=1}^{\mathrm{K}} {\mathrm{S}}_{\mathrm{i},\mathrm{k}}^{\mathrm{m}}=1,\quad \mathrm{for}\ \mathrm{m}=1,\dots ,\mathrm{M}. \end{aligned}$$
(5)
This binary optimization problem cannot be solved in polynomial time. Therefore, it is relaxed into the following objective function of multi-view spectral clustering:
$$\begin{aligned} & \mathop {\min }\limits_{{{\text{U}}_{{\text{m}}} \in {\text{R}}^{{{\text{N}} \times {\text{k}}}} }} \,{\text{trace}}\,\left( {{\text{U}}^{{\text{T}}} {\text{LU}}} \right) \\ & {\text{s}}.{\text{t}}.\quad {\text{U}}^{{\text{T}}} {\text{U}} = {\text{I}}_{{\text{K}}} \\ \end{aligned}$$
(6)
where \(\mathrm{L}=\left(\begin{array}{ccc}{\mathrm{L}}_{1}& \cdots & 0\\ \vdots & \ddots & \vdots \\ 0& \cdots & {\mathrm{L}}_{\mathrm{M}}\end{array}\right)-\upbeta \left(\begin{array}{ccc}0& \cdots & {\mathrm{I}}_{\mathrm{n}}\\ \vdots & \ddots & \vdots \\ {\mathrm{I}}_{\mathrm{n}}& \cdots & 0\end{array}\right),\quad \mathrm{U}=\left(\begin{array}{c}{\mathrm{U}}_{1}\\ {\mathrm{U}}_{2}\\ \vdots \\ {\mathrm{U}}_{\mathrm{M}}\end{array}\right)\).
\(\beta\) balances the weights between networks and within each network. If abundant prior knowledge is available, it can be set according to that prior information. Otherwise, the networks can be built at a comparable level (e.g. with similar connection densities) and \(\beta\) set directly to 1. In our experiments, we set \(\beta =1\).
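A sketch of how the joint matrix L in Eq. (6) can be assembled from the per-omic adjacency matrices, with β = 1 as in our experiments; the function name `joint_laplacian` is illustrative.

```python
import numpy as np

def joint_laplacian(W_list, beta=1.0):
    """Block matrix L of Eq. (6): per-omic Laplacians on the diagonal and
    -beta * I_n on the off-diagonal blocks (beta = 1 in our experiments)."""
    n = W_list[0].shape[0]
    M = len(W_list)
    L = np.zeros((M * n, M * n))
    for m, W in enumerate(W_list):
        L[m * n:(m + 1) * n, m * n:(m + 1) * n] = np.diag(W.sum(axis=1)) - W  # L_m = D_m - A_m
    for m in range(M):
        for l in range(M):
            if l != m:
                L[m * n:(m + 1) * n, l * n:(l + 1) * n] = -beta * np.eye(n)
    return L
```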
However, the optimization problem (6) merges the information of all networks together and may lose the information specific to each network. The proposed MVSM method still follows the original objective function of multi-view spectral clustering and the construction of the Laplacian matrix [17].
The objective optimization problem to be solved is as follows:
$$\begin{aligned} & \mathop {\min }\limits_{{{\text{U}}_{{\text{m}}} \in {\text{R}}^{{{\text{N}} \times {\text{k}}}} }} {\text{trace}}\left( {{\text{U}}^{{\text{T}}} {\text{LU}}} \right) \\ & {\text{s}}.{\text{t}}.\quad {\text{U}}_{{\text{m}}}^{{\text{T}}} {\text{U}}_{{\text{m}}} = {\text{I}} \\ \end{aligned}$$
(7)
When we set \({\mathrm{U}}_{\mathrm{m}}=\frac{{\mathrm{S}}^{\mathrm{m}}}{{\Vert {\mathrm{S}}^{\mathrm{m}}\Vert }_{2}}\), the constraint \({\mathrm{U}}_{\mathrm{m}}^{\mathrm{T}}{\mathrm{U}}_{\mathrm{m}}={\mathrm{I}}_{\mathrm{K}}\) can be substituted by \(\sum_{\mathrm{i},\mathrm{j}=1}^{\mathrm{N}}{\mathrm{s}}_{\mathrm{i},\mathrm{j}}=1\). This transforms the constraints of each network into a single equation.
Solution of the objective optimization problem
To solve the objective function (7), we project it onto the Stiefel manifold and solve it by backtracking line search. The process is roughly divided into three steps.
First, we project the objective function \(\mathrm{trace}\left({\mathrm{U}}^{\mathrm{T}}\mathrm{LU}\right)\) onto the tangent space of each Stiefel manifold.
The tangent space of \({\mathrm{M}}_{\mathrm{m}}\) is:
$${\mathrm{TM}}_{\mathrm{m}}=\left\{{\mathrm{U}}_{\mathrm{m}}\mathrm{B}+\left(\mathrm{I}-{{\mathrm{U}}_{\mathrm{m}}\mathrm{U}}_{\mathrm{m}}^{\mathrm{T}}\right)\mathrm{C}:\mathrm{B}=-{\mathrm{B}}^{\mathrm{T}},\mathrm{C}\in {\mathrm{R}}^{\mathrm{N}\times \mathrm{k}}\right\}.$$
(8)
Here each Stiefel manifold is \({\mathrm{M}}_{\mathrm{m}}=\left\{{\mathrm{U}}_{\mathrm{m}}\in {\mathrm{R}}^{\mathrm{N}\times \mathrm{k}}:{\mathrm{U}}_{\mathrm{m}}^{\mathrm{T}}{\mathrm{U}}_{\mathrm{m}}={\mathrm{I}}_{\mathrm{K}}\right\}\).
Thus, the negative gradient of the objective function \(\mathrm{trace}\left({\mathrm{U}}^{\mathrm{T}}\mathrm{LU}\right)\) can be expressed as:
$$\mathrm{Z}=-\nabla \mathrm{ trace}\left({\mathrm{U}}^{\mathrm{T}}\mathrm{LU}\right)=-\mathrm{LU}={\left({\mathrm{Z}}_{1}^{\mathrm{T}},{\mathrm{Z}}_{2}^{\mathrm{T}},\dots ,{\mathrm{Z}}_{\mathrm{m}}^{\mathrm{T}}\right)}^{\mathrm{T}}$$
(9)
where \(Z_{m}\) denotes the negative gradient of the objective function with respect to the mth omic.
Then, we search for the next point along the direction \({\upeta }_{\mathrm{m}}\) in the tangent space of each manifold, where
$${\upeta }_{\mathrm{m}}={\mathrm{Z}}_{\mathrm{m}}-\frac{1}{2}{\mathrm{U}}_{\mathrm{m}}\left({\mathrm{U}}_{\mathrm{m}}^{\mathrm{T}}{\mathrm{Z}}_{\mathrm{m}}+{\mathrm{Z}}_{\mathrm{m}}^{\mathrm{T}}{\mathrm{U}}_{\mathrm{m}}\right).$$
(10)
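A sketch of Eqs. (9)–(10): the negative gradient Z = −LU is split into per-omic blocks, and each block is projected onto the tangent space of its Stiefel manifold. The projection below assumes the standard Euclidean tangent-space projection \(Z_m - \frac{1}{2}U_m(U_m^T Z_m + Z_m^T U_m)\); the function name `search_directions` is illustrative.

```python
import numpy as np

def search_directions(L, U_blocks):
    """Negative gradient Z = -L U (Eq. 9), split per omic, and its projection
    onto the tangent space of each Stiefel manifold (Eq. 10)."""
    U = np.vstack(U_blocks)
    Z = -L @ U
    n = U_blocks[0].shape[0]
    etas = []
    for m, Um in enumerate(U_blocks):
        Zm = Z[m * n:(m + 1) * n, :]
        # eta_m = Z_m - 1/2 * U_m (U_m^T Z_m + Z_m^T U_m)
        etas.append(Zm - 0.5 * Um @ (Um.T @ Zm + Zm.T @ Um))
    return Z, etas
```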
Second, we perform backtracking line search on the Stiefel manifold for problem (9)–(10).
The purpose of line search is to find the minimum point of the objective function along the search direction. However, finding the exact minimum is time-consuming, and the search direction is itself only approximate, so we only need to find an approximate minimum at a lower cost. Backtracking line search (BLS) is such a line search algorithm. The idea of BLS is to set an initial step size \({\alpha }_{0}\) in the search direction and then, if the step size is too large, to reduce it until it is appropriate.
The backtracking line search condition in the negative gradient direction of the objective function is as follows:
$$\begin{aligned} {\text{f}}\left( {{\text{U}} + \upalpha \upeta } \right) & \le {\text{f}}\left( {\text{U}} \right) + \upalpha {\text{cm}} \\ {\text{m}} & =\upeta ^{T} {\text{Z}} \\ \end{aligned}$$
(11)
where \(\upeta\) is the current search direction, \(\mathrm{\alpha }\) is the step size, and c is the control parameter, which needs to be chosen manually according to the situation.
If the current U does not satisfy inequality (11), then a parameter \(\uptau\) is used to adjust the step size:
$$\mathrm{\alpha }=\mathrm{\tau \alpha }$$
(12)
where the parameter \(\uptau\) controls how much the search step size is reduced.
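The backtracking step of Eqs. (11)–(12) can be sketched as follows, with c = 6 and τ = 0.1 as in our experiments; `f` is assumed to evaluate the objective \(\mathrm{trace}(\mathrm{U}^{\mathrm{T}}\mathrm{LU})\), and the initial step size and the iteration cap are illustrative choices.

```python
def backtracking_step(f, U, eta, Z, alpha0=1.0, c=6.0, tau=0.1, max_iter=20):
    """Backtracking line search (Eqs. 11-12): shrink the step size by tau until
    the sufficient-decrease condition holds; f evaluates trace(U^T L U)."""
    alpha = alpha0
    m = (eta * Z).sum()                                  # m = eta^T Z as a trace inner product
    for _ in range(max_iter):
        if f(U + alpha * eta) <= f(U) + alpha * c * m:   # condition (11)
            break
        alpha *= tau                                     # otherwise reduce the step size (12)
    return alpha
```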
Third, we retract the points found back onto the manifold using the singular value decomposition:
$$\mathrm{U}=\mathrm{W\Sigma }{\mathrm{V}}^{\mathrm{T}},\quad \mathrm{U}\leftarrow {\mathrm{WV}}^{\mathrm{T}}.$$
(13)
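A sketch of the SVD retraction in Eq. (13): the updated point is mapped back onto the Stiefel manifold by resetting its singular values to one.

```python
import numpy as np

def retract(X):
    """SVD retraction (Eq. 13): map an updated point X = U + alpha * eta back
    onto the Stiefel manifold by resetting its singular values to one."""
    W, _, Vt = np.linalg.svd(X, full_matrices=False)
    return W @ Vt
```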
After the manifold optimization process, we get the values of U. The whole process of our proposed method is summarized in Algorithm 1.
Here, we obtain the solution of the objective function, and then we perform k-means to cluster U and obtain the cluster labels \({\mathrm{C}}_{1},{\mathrm{C}}_{2},\dots ,{\mathrm{C}}_{\mathrm{k}}\). Finally, we integrate the clustering results obtained from the three omics using the k-nearest-neighbor method.
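Putting the pieces together, the optimization loop of Algorithm 1 followed by per-omic k-means can be sketched as below, reusing the helper functions sketched earlier (`joint_laplacian`, `search_directions`, `backtracking_step`, `retract`); the random initialization, the fixed iteration count, and the omission of the final kNN-based integration are simplifications of ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def mvsm_cluster(W_list, n_clusters, n_iter=100, beta=1.0, c=6.0, tau=0.1, seed=0):
    """End-to-end sketch of the optimization loop and per-omic clustering,
    reusing joint_laplacian, search_directions, backtracking_step and retract."""
    L = joint_laplacian(W_list, beta)
    n = W_list[0].shape[0]
    rng = np.random.default_rng(seed)
    # Initialize each U_m with orthonormal columns.
    U_blocks = [np.linalg.qr(rng.standard_normal((n, n_clusters)))[0] for _ in W_list]
    f = lambda X: np.trace(X.T @ L @ X)          # objective of problem (7)
    for _ in range(n_iter):
        Z, etas = search_directions(L, U_blocks)
        U, eta = np.vstack(U_blocks), np.vstack(etas)
        alpha = backtracking_step(f, U, eta, Z, c=c, tau=tau)
        U_blocks = [retract(Um + alpha * em) for Um, em in zip(U_blocks, etas)]
    # K-means on each omic's U_m; the final kNN-based integration is not shown.
    return [KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Um) for Um in U_blocks]
```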
Remark: we set \(\mathrm{c}=6,\uptau =0.1\) in our experiments. The reasons are set out in Sect. 3.1.2.