 Research
 Open access
A truncated nuclear norm and graph-Laplacian regularized low-rank representation method for tumor clustering and gene selection
BMC Bioinformatics volume 22, Article number: 436 (2021)
Abstract
Background
Clustering and feature selection play major roles in many fields. As a matrix factorization technique, Low-Rank Representation (LRR) has attracted much attention for clustering and feature selection, but its performance degrades when the data samples are insufficient or heavily corrupted by noise.
Results
To address this drawback, a novel LRR model named TGLRR is proposed by integrating the truncated nuclear norm with graph-Laplacian regularization. Unlike the nuclear norm, which minimizes all singular values, the truncated nuclear norm minimizes only the smallest singular values, which avoids the harm of shrinking the leading singular values. Finally, an efficient algorithm based on the Linearized Alternating Direction method with Adaptive Penalty (LADMAP) is applied to solve the optimization problem.
Conclusions
The results show that the TGLRR method outperforms existing state-of-the-art methods in tumor clustering and gene selection on integrated gene expression data.
Background
In most countries, cancer is the first or second leading cause of death [1]. Thus, preventing and curing cancer effectively is a hot topic in the medical field. Genes regulate critical activities of organisms, including the emergence of cancer [2]. With the improvement of gene sequencing technology, plenty of genomic data are available, which is conducive to researching the pathogenesis of cancer [3]. However, the majority of genomic data are high-dimensional with small sample sizes, which hinders advances in medical studies [4, 5]. Evidently, data dimension reduction plays a momentous role in genomic data analysis.
Data dimension reduction aims to obtain a significant low-dimensional representation of high-dimensional data, remove redundant features and prevent overfitting. Therefore, it has achieved great success in many areas, such as characteristic gene selection [6], image analysis [7], and text documents [8]. Principal Component Analysis (PCA) is one of the most classic linear dimension reduction methods [9]. Owing to its high efficiency, PCA has been widely used on different kinds of data and developed in many fields [5, 10, 11]. To boost the robustness of PCA, Candès et al. developed a new PCA in [10], called Robust PCA (RPCA), which is exploited for background modeling from video and analyzing face images. Moreover, in [5], RPCA is exploited by Liu et al. for discovering differentially expressed genes. Although the above PCA methods have all obtained excellent results, their performance deteriorates on noisy observation data.
In [12], Wright et al. introduced a low-rank matrix recovery approach for removing noise from data; PCA is then applied to the recovered low-rank matrix. The robustness of data processing is enhanced significantly through this approach.
The experimental data X are usually obtained from a union of multiple subspaces \({\text{S}} = {\text{S}}_{1} \cup {\text{S}}_{2} \cup \cdots \cup {\text{S}}_{k}\) rather than a single space, where \({\text{S}}_{i}\) indicates a low-dimensional space hidden in the high-dimensional space [13,14,15]. Since the PCA-related methods above prefer data obtained from a single low-dimensional space, Liu et al. proposed a low-rank representation (LRR) model that can excavate the global distribution between data points to study X [16]. LRR strives to find the lowest-rank matrix representation of the original data and has achieved brilliant results in several applications [16, 17]. However, LRR still has a few shortcomings; for instance, it cannot reveal the local manifold structures of data obtained from a nonlinear low-dimensional manifold. Fortunately, various manifold learning models have been put forward, such as ISOMAP [18], Laplacian Eigenmap (LE) [19], Locally Linear Embedding (LLE) [20], and graph-Laplacian regularization [21].
A graph-Laplacian regularized LRR (LLRR) model [14] was developed, which introduces graph regularization into LRR. In the LLRR model, the useful rules hiding among the data points, including the global geometric structure and the internal similarity information, are all seized. However, LLRR exploits only one view of the data, i.e. the data manifold. In contrast to LLRR, the Latent LRR (LatLRR) model adds another view, i.e. the feature manifold, for image processing [22]. For solving the minimization problems of LRR, LLRR and LatLRR, the common approach is to use the nuclear norm to approximate the rank operator. Given a data matrix X, the nuclear norm is the sum of all singular values of X. Since the nuclear norm minimizes the sum of all singular values, all nonzero singular values are penalized even though they have different influences on the rank [23]. Thus, the nuclear norm may not be the best way to approximate the rank of a matrix. To better approximate the rank and handle the non-convex optimization problem, the truncated nuclear norm (TNN) was proposed in [24] and has attracted much attention [13, 23, 25, 26]. The TNN, which is the sum of the smallest singular values of a matrix, can dispel the harm of shrinking the leading singular values, so it may be a more robust regularization for approximating the rank of a matrix than the nuclear norm.
To strengthen the efficiency and robustness of the model, a novel LRR method is developed in this paper, named the Truncated nuclear norm and graph-Laplacian regularized Low-Rank Representation model (TGLRR). In the objective function of TGLRR, the nuclear norm is replaced by the TNN to reach a robust approximation of the rank function, a graph-Laplacian regularization is imposed to find the local manifold structure, and the L_{1}-norm is used to realize the sparsity constraint on outliers. The main contributions of this paper are as follows. Firstly, compared with the popular LRR model regularized by the nuclear norm, our TGLRR method can obtain better performance through the TNN and solve the non-convex and discontinuous issues. Secondly, the TGLRR method can seize the valuable information lying in the data manifold and the feature manifold simultaneously. Finally, the TGLRR method can capture the internal similarity information and some underlying affinity among data points by incorporating a graph regularization term and representing each data point as a linear combination of some bases.
The remainder of this article is organized as follows. In the Results section, TGLRR is exploited for clustering and feature selection on integrated gene expression data. In the Conclusions section, conclusions and future work are given. In the Methods section, our TGLRR method is put forward and the optimization problem is resolved through an efficient framework based on LADMAP [27].
Results
Integrative gene expression datasets
To validate the performance of the TGLRR model, six clustering experiments and a feature selection experiment are conducted. The experimental data are integrative cancer gene expression data instead of single-cancer data, to avoid the sample-imbalance problem. Seven different datasets are produced by integrating five different single gene expression datasets downloaded from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The pertinent information about the seven integrative datasets is listed in Table 1.
PAAD, ESCA, COAD, CHOL and HNSC are the abbreviations of Pancreatic Ductal Adenocarcinoma, Esophageal Carcinoma, Colorectal Adenocarcinoma, Cholangiocarcinoma, and Head and Neck Squamous Cell Carcinoma, respectively. Taking the PAAD-COAD dataset for example, it is composed only of the tumor samples of the PAAD and COAD data, in which the PAAD data are made up of 176 tumor samples and 4 normal samples, and the COAD data consist of 262 tumor samples and 19 normal samples. The HNSC-ESCA, CHOL-HNSC-ESCA, COAD-PAAD-ESCA, HNSC-PAAD-CHOL-ESCA and ESCA-COAD-CHOL-PAAD datasets are constructed in the same way as the PAAD-COAD dataset. In contrast, the PAAD-ESCA-HNSC dataset is composed of all the samples of PAAD, ESCA and HNSC.
To eliminate redundant features and avoid overfitting, the dimension of the data matrix X is reduced before the clustering and feature selection experiments, which also greatly reduces the computational cost. PCA is chosen for the dimension reduction experiments in this paper, and 2000-dimensional data X are obtained after dimension reduction.
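As a rough illustration of this preprocessing step, PCA can be sketched with a plain SVD (a generic sketch only; the paper's exact PCA implementation is not reproduced, and the toy shapes below are invented):

```python
import numpy as np

def pca_reduce(X, k):
    """Project samples onto the top-k principal components via SVD.

    X: (n_samples, n_features); returns (n_samples, k) scores.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # scores in the PC basis

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))                   # toy stand-in for expression data
X_low = pca_reduce(X, 10)
print(X_low.shape)                               # (40, 10)
```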
Parameters selection
There are three important parameters in the TGLRR model: the regularization parameters \(\lambda\) and \(\beta\), and the truncation parameter r of the TNN. Grid search is used to select the values of \(\lambda\), \(\beta\) and r. Figure 1 shows how the clustering results vary with the parameters \(\lambda\) and \(\beta\) on three distinct integrative tumor gene expression datasets.
In Fig. 1, the X-axis represents the range of \(\lambda\), the Y-axis represents the range of \(\beta\), and the Z-axis represents the clustering accuracy. It can be distinctly seen that \(\beta\) has a greater effect on clustering accuracy than \(\lambda\), especially on the HNSC-PAAD-CHOL-ESCA dataset. Finally, TGLRR performs well when \(\lambda = 10\) and \(\beta = 10^{4}\) on the PAAD-COAD dataset, \(\lambda = 10^{-2}\) and \(\beta = 10\) on the HNSC-ESCA dataset, \(\lambda = 10^{-2}\) and \(\beta = 10^{2}\) on the CHOL-HNSC-ESCA dataset, \(\lambda = 10^{-1}\) and \(\beta = 10^{4}\) on the COAD-PAAD-ESCA dataset, \(\lambda = 10^{-2}\) and \(\beta = 10^{11}\) on the HNSC-PAAD-CHOL-ESCA dataset, and \(\lambda = 10^{2}\) and \(\beta = 10^{10}\) on the ESCA-COAD-CHOL-PAAD dataset.
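The grid search itself can be sketched as follows; `toy_score` is an invented stand-in for the real pipeline of fitting TGLRR with one parameter pair and measuring clustering accuracy:

```python
import itertools

def grid_search(score_fn, lams, betas):
    """Exhaustive grid search over (lambda, beta); score_fn returns
    a clustering accuracy for one parameter pair."""
    best = (None, None, float("-inf"))
    for lam, beta in itertools.product(lams, betas):
        acc = score_fn(lam, beta)
        if acc > best[2]:
            best = (lam, beta, acc)
    return best

# Invented scorer peaking at lambda = 10, beta = 1e4
# (mimicking the PAAD-COAD optimum reported above).
toy_score = lambda lam, beta: -abs(lam - 10) - abs(beta - 1e4) / 1e4
lams = [10.0 ** p for p in range(-2, 3)]
betas = [10.0 ** p for p in range(0, 5)]
best = grid_search(toy_score, lams, betas)
print(best[:2])  # (10.0, 10000.0)
```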
Different from the method in [25], which tries all possible values to seek the optimal value of r, the method in [28] is used here: a curve of the singular values is drawn and inspected. Figure 2 shows the summary curve graph on the six datasets used in the clustering experiments.
In Fig. 2, the X-axis indicates the index of the singular value and the Y-axis denotes its magnitude. The value at the first inflection point of each curve is chosen as the value of r for the corresponding dataset; the principle is that the singular values before the inflection point are much larger than those after it. Therefore, the values of r on the PAAD-COAD, HNSC-ESCA, CHOL-HNSC-ESCA, COAD-PAAD-ESCA, HNSC-PAAD-CHOL-ESCA and ESCA-COAD-CHOL-PAAD datasets are set to 2, 3, 3, 2, 2 and 3, respectively.
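One mechanical reading of this inflection-point rule (the paper reads r off the plotted curves by eye) is to pick the largest gap between consecutive singular values:

```python
import numpy as np

def choose_r(singular_values):
    """Pick the truncation rank r at the first sharp drop in the
    singular-value curve: a simple proxy for the inflection-point
    rule described in the text."""
    s = np.asarray(singular_values, dtype=float)
    drops = s[:-1] - s[1:]            # gap between consecutive values
    return int(np.argmax(drops)) + 1  # keep everything before the gap

s = [90.0, 85.0, 20.0, 18.0, 17.0, 16.5]  # toy spectrum: drop after the 2nd value
print(choose_r(s))  # 2
```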
Convergence analysis
Since Algorithm 1 is a practical application of the LADMAP framework, whose convergence has been proved in [27], Algorithm 1 should also be convergent. Many approaches can demonstrate the convergence property of algorithms [23, 29]. In this paper, an efficient approach from [29] based on an auxiliary function is exploited to validate the convergence of the TGLRR method. The results are exhibited in Fig. 3.
In Fig. 3, the abscissa indicates the iteration number and the ordinate denotes the loss function value. As shown in Fig. 3, our model is convergent. The TGLRR method begins to converge after 30 iterations on the HNSC-ESCA and CHOL-HNSC-ESCA datasets, and within 40 iterations on the other four datasets; the first two datasets may simply be easier to address.
Clustering results
In this subsection, TGLRR is applied to clustering and compared with K-means, LLRR [14], LRR [30], RPCA [5], DGLRR [31], and LatLRR [22].
With respect to the dictionary matrix X, the optimal solution \({\mathbf{Z}}^{*}\) of TGLRR can symbolize "the minimum rank representation" of the data matrix X. Moreover, the ith column of Z can be regarded as a "better" reflection of the ith column of X, making the subspace structure more easily detectable [31]. Namely, the optimal solution \({\mathbf{Z}}^{*}\) can include almost all the sample information of the integrative gene expression data X. Therefore, \({\mathbf{Z}}^{*}\) can be used for clustering experiments via K-means.
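This clustering step can be illustrated with a bare-bones K-means on toy data standing in for the learned representation \({\mathbf{Z}}^{*}\) (the deterministic seeding below is an assumption made for the example, not the paper's initialization):

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Bare-bones Lloyd's k-means over rows of `points`. Centers are
    seeded at spread-out positions along the first coordinate so this
    toy run is deterministic."""
    order = np.argsort(points[:, 0])
    seeds = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[order[seeds]].copy()
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)          # assign to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# two well-separated toy blobs standing in for two tumor types
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(5, 0.1, (10, 3))])
labels = kmeans(pts, 2)
print(sorted(np.bincount(labels).tolist()))  # [10, 10]
```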
To measure the performance of our approach, three quantitative metrics are adopted in this paper: accuracy (ACC), normalized mutual information (NMI) and F-measure. As a widely used metric in the machine learning field, ACC is defined as follows:
where n indicates the total number of tumor samples in an integrated dataset, \(\updelta \left( {x,y} \right)\) is a delta function equal to 1 only when \(x = y\) and 0 otherwise, \(\hat{L}\left( i \right)\) denotes the true class label of the ith sample, and \(l_{i}\) represents the cluster label produced by the algorithm. \({\text{Map}}\left( {l_{i} } \right)\) is a mapping function permuting every \(l_{i}\) to best match the real sample labels.
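A minimal sketch of this ACC definition, searching the label permutation Map(·) exhaustively (adequate for a handful of tumor classes; the Hungarian algorithm scales better):

```python
import itertools
import numpy as np

def clustering_acc(true, pred):
    """ACC as defined above: try every permutation of predicted
    cluster labels and keep the best agreement with the true labels."""
    true, pred = np.asarray(true), np.asarray(pred)
    pred_labels = np.unique(pred)
    best = 0.0
    for perm in itertools.permutations(np.unique(true)):
        mapping = dict(zip(pred_labels, perm))   # one candidate Map(.)
        acc = np.mean([mapping[p] == t for p, t in zip(pred, true)])
        best = max(best, acc)
    return best

print(clustering_acc([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(clustering_acc([0, 0, 1, 1], [1, 0, 0, 0]))  # 0.75
```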
The second index, NMI, is defined by
where T and \(\hat{T}\) denote two different tumor index sets, and \(H(T)\) and \(H(\hat{T})\) represent the entropies of T and \(\hat{T}\), respectively. The mutual information is \(\mathrm{MI}(T,\hat{T}) = \sum\nolimits_{t \in T} \sum\nolimits_{\hat{t} \in \hat{T}} P(t,\hat{t}) \log \left( P(t,\hat{t}) / (P(t)P(\hat{t})) \right)\), where \(P(t)\) is the marginal probability that a tumor sample arbitrarily chosen from the integrated dataset belongs to cluster t, and \(P(t,\hat{t})\) is the joint probability that a sample belongs to clusters t and \(\hat{t}\) simultaneously.
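A small sketch of NMI computed from raw label lists; normalizing MI by the mean of the two entropies is one common convention, which the text does not pin down:

```python
import numpy as np
from collections import Counter

def nmi(a, b):
    """MI(T, T^) normalized by the mean of H(T) and H(T^); other
    normalizations (max, sqrt) are also in common use."""
    n = len(a)
    pa = np.array(list(Counter(a).values())) / n          # marginals of T
    pb = np.array(list(Counter(b).values())) / n          # marginals of T^
    pj = np.array(list(Counter(zip(a, b)).values())) / n  # joint distribution
    h = lambda p: -np.sum(p * np.log(p))                  # entropy
    mi = h(pa) + h(pb) - h(pj)   # identity MI = H(T) + H(T^) - H(T, T^)
    return mi / ((h(pa) + h(pb)) / 2)

print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))  # 1.0
```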
F-measure is a comprehensive evaluation index considering both precision and recall, written as:
where \({\text{recall}} = {\text{TP}}/({\text{TP}} + {\text{FN}})\) and \({\text{precision}} = {\text{TP}}/({\text{TP}} + {\text{FP}})\). TP, FP, TN and FN indicate the numbers of true positives, false positives, true negatives and false negatives, respectively.
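Given the TP/FP/FN tallies, the F-measure computation is then a one-liner:

```python
def f_measure(tp, fp, fn):
    """F-measure = 2 * precision * recall / (precision + recall),
    from the true/false positive and false negative counts above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(8, 2, 2), 6))  # 0.8
```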
To prove the effectiveness of TGLRR, the detailed clustering results of these methods on the integrative tumor gene expression data are listed in three tables. In the tables, the ACC, NMI and F-measure values are the averages of 100 clustering runs of each approach, and the values to the right of ± are the variances of those 100 runs.
Table 2 reports the clustering results on the PAAD-COAD and HNSC-ESCA datasets. Obviously, our TGLRR method exceeds the other six comparison methods on the PAAD-COAD dataset, and judging from the variance values, it is also more robust than them there. The HNSC-ESCA data are special and may be easy to address: the clustering results of all algorithms are good, and the evaluation indices of RPCA and TGLRR even reach 1.
The clustering results on the two integrated datasets containing three types of tumors are exhibited in Table 3. It can be seen that TGLRR outperforms the other models on the COAD-PAAD-ESCA dataset. On the CHOL-HNSC-ESCA dataset, TGLRR's ACC, NMI and F-measure values are higher than those obtained by the other five models except LRR. Consequently, it can still be said that the TGLRR method outstrips the other methods on the CHOL-HNSC-ESCA dataset.
From Table 4, our TGLRR method outmatches the other six methods on the HNSC-PAAD-CHOL-ESCA and ESCA-COAD-CHOL-PAAD datasets.
Feature selection
Cancers are commonly related to gene mutation or abnormal gene expression. Thus, in this subsection, the TGLRR method is used to identify co-feature genes of PAAD, ESCA and HNSC from the PAAD-ESCA-HNSC dataset.
From formula (14), a minimum solution \({\mathbf{G}}^{*}\) can be obtained from the integrative gene expression data X via the TGLRR scheme. \({\mathbf{G}}^{*}\) captures the feature manifold structure lying in the data, so it can be applied to feature gene extraction. From the view of cancer, its pathogenesis may be related to gene mutation [32], and it is extremely meaningful to identify the feature genes inducing cancers from gene expression data.
Similar to the Parameters selection subsection, \(\lambda\), \(\beta\) and r are assigned the values 10^{−2}, 10^{3} and 4, respectively.
Table 5 exhibits the top 10 co-feature genes with the highest mean relevance scores distinguished by the TGLRR method from the PAAD-ESCA-HNSC dataset. The related diseases, related pathways and encoded proteins of these genes are obtained from GeneCards (https://www.genecards.org/). These genes are most likely to lead to PAAD, ESCA and HNSC simultaneously.
From Table 5, clearly, the CDH1 gene with the highest relevance score can result in a host of cancers, which indicates that CDH1 may be a dangerous co-feature gene. Moreover, PAAD, ESCA and HNSC are all correlated with CDH1 and RHOA, which can be affirmed from [33,34,35,36,37,38]. It is a verifiable fact from the literature that TGFB1 and RELA both serve as predictors for PAAD and ESCA. Some data show that PTPN11 may induce HNSC and PAAD. From [39, 40], it can be seen that ESCA is relevant to IGF2R and RUNX1. In addition, the related pathways of RUNX1 and EWSR1 include transcriptional misregulation in cancer, so RUNX1 and EWSR1 may be co-characteristic genes of PAAD, ESCA and HNSC.
All in all, the TGLRR method is successful in identifying co-characteristic genes on the integrative gene expression datasets.
Discussions
The TGLRR method is applied to tumor clustering and gene selection and is superior to the other methods. Based on the above results, it can be affirmed that the TNN captures more valuable information from the data than the nuclear norm. By comparing with the results of DGLRR, a conclusion can be drawn that the graph-Laplacian regularization imposed on the feature manifold may cause adverse effects for clustering on our integrative datasets. The TGLRR method has some limitations. For example, on the HNSC-PAAD-CHOL-ESCA dataset, the variance values of TGLRR are larger than those of LRR. This may be caused by the integrated datasets, and the method's stability needs to be improved in future.
In a word, these improvements to the previous LRR model help TGLRR catch more useful information concealed in the low-dimensional manifold structure.
Conclusions
This paper proposes a Low-Rank Representation approach called TGLRR. It can capture the global and local geometric structures in the data manifold by using the raw data matrix as the dictionary matrix and introducing the graph-Laplacian regularization term. Furthermore, TGLRR gains a better approximation to the rank operator than approaches regularized by the nuclear norm. The objective function of TGLRR is solved through an iterative algorithm based on the LADMAP framework. The efficiency and robustness of our TGLRR method are verified by the encouraging experimental results.
Methods
Related LRR methods
Based on the assumption that the observation data X are sampled from a union of several low-dimensional subspaces \(S = S_{1} \cup S_{2} \cup \cdots \cup S_{k}\) located in a high-dimensional space, LRR was proposed in [16]. If the data are noiseless, the rank minimization problem of LRR is written as
where \({\mathbf{X}} = \left[ {{\mathbf{x}}_{1} ,{\mathbf{x}}_{2} , \cdots ,{\mathbf{x}}_{n} } \right] \in {\mathbf{R}}^{m \times n}\) is the original data matrix and \({\mathbf{Z}} \in {\mathbf{R}}^{n \times n}\) is a low-rank matrix recovered from X via LRR. \({\mathbf{D}} \in {\mathbf{R}}^{m \times n}\) is a basis matrix (also called a dictionary matrix), which linearly spans the whole data space. In real life, the observation data generally contain some noise, so the optimization problem (4) may be impracticable. The LRR model with noise is
where \({\mathbf{P}} \in {\mathbf{R}}^{m \times n}\) is the reconstruction error matrix (also called the noise matrix). \(\lambda\) is a penalty parameter balancing the sparsity of matrix P against the reconstruction fidelity of the data matrix X damaged by P. \(\left\| {\mathbf{P}} \right\|_{0}\) is the L_{0}-norm of matrix P, i.e. the number of nonzero elements in P.
Since the rank function is discrete, problem (5) may have multiple solutions, and the L_{0}-minimization is non-convex and intractable; solving problem (5) is in general NP-hard [41]. To better solve the above rank minimization problem, the nuclear norm is imposed on the low-rank matrix, and the L_{0}-norm is replaced with the L_{2,1}-norm [17]. The convex optimization problem of the LRR model is written as follows:
where \(\left\| {\mathbf{Z}} \right\|_{*} = \sum\nolimits_{i}^{\min(m,n)} \sigma_{i}({\mathbf{Z}})\) (\(\sigma_{i}({\mathbf{Z}})\) is the ith largest singular value of Z) denotes the nuclear norm of matrix Z, and \(\left\| {\mathbf{P}} \right\|_{2,1} = \sum\nolimits_{j = 1}^{n} \left( \sum\nolimits_{i = 1}^{m} p_{ij}^{2} \right)^{1/2}\) denotes the L_{2,1}-norm of matrix P. To get a self-expressive model, the observation data X are generally used as the dictionary matrix [13, 14, 22]. The final LRR model becomes
For the low-rank matrix \({\mathbf{Z}} = \left[ {{\mathbf{z}}_{1} ,{\mathbf{z}}_{2} , \cdots ,{\mathbf{z}}_{n} } \right]\), each element \({\mathbf{z}}_{ij}\) reflects the manifold information, i.e. the similarity between data points \({\mathbf{x}}_{i}\) and \({\mathbf{x}}_{j}\). Therefore, matrix Z can be seen as an affinity matrix [14]. LRR seeks the lowest-rank representation of the observation data. With the help of an appropriate dictionary matrix, the underlying row space can be recovered via the lowest-rank representation such that the true segmentation of the data is correctly revealed. Thus, the LRR method can handle data drawn from a union of multiple subspaces well [17].
Nevertheless, the LRR method faces two issues when the raw data X are used as the basis. First, it requires the basis to contain adequate data samples from the subspaces so as to be capable of representing the underlying subspaces. Second, it demands that the noise in X be small, i.e. only a part of X is corrupted. To remedy these two shortcomings of LRR, Liu et al. proposed the following convex optimization problem [22]:
where \(\left\| {\mathbf{P}} \right\|_{1} = \sum\nolimits_{i = 1}^{m} \sum\nolimits_{j = 1}^{n} \left| p_{ij} \right|\) is the L_{1}-norm of matrix P and G is the feature matrix separated from the original X. Equation (8) is a state-of-the-art LRR-based subspace learning model, named LatLRR. By means of the LatLRR model, the observed samples can be expressed effectively via many unobserved samples [42]. In practical applications, Z and G are applied to cluster analysis and feature selection, respectively.
Truncated nuclear norm (TNN)
The TNN is the sum of the smallest singular values, i.e. the nuclear norm minus the sum of the r largest singular values [24]. When approximating the rank operator, the r largest singular values carry only a minor amount of information, whereas the smallest \(\min(m,n) - r\) singular values play a crucial role [23]. Compared to the nuclear norm, the TNN may therefore be a better approximation to the rank operator. Its mathematical formula is
where \(\sigma_{i}({\mathbf{Z}})\) denotes the ith largest singular value of Z, and r is a nonnegative integer with \(r \le \min(m,n)\).
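A direct numerical reading of this definition (toy matrix invented for the example):

```python
import numpy as np

def truncated_nuclear_norm(Z, r):
    """||Z||_r = sum of the min(m, n) - r smallest singular values,
    i.e. the nuclear norm minus the r largest singular values."""
    s = np.linalg.svd(Z, compute_uv=False)  # sorted in descending order
    return s[r:].sum()

Z = np.diag([5.0, 3.0, 1.0])
print(truncated_nuclear_norm(Z, 1))  # 4.0
```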
Since the minimization of Eq. (9) is not convex, it cannot be directly solved by convex optimization approaches. To overcome this issue, Hu et al. proposed a theorem [25], through which the equivalent transformation of Eq. (9) is achieved.
GraphLaplacian regularization
Graph-Laplacian regularization is an outstanding manifold learning method, which can uncover the intrinsic geometric structure among data points. As a result, a number of LRR models regularized by graph-embedding manifold regularization have naturally appeared [13, 43].
Given a k-nearest-neighbor graph G with n vertices, each vertex denotes a data point lying on an underlying submanifold M [11]. Then, a symmetric weight matrix \({\mathbf{W}} \in {\mathbf{R}}^{n \times n}\) is constructed, where \({\mathbf{w}}_{ij}\) expresses the weight of the edge linking vertices i and j. The value of each \({\mathbf{w}}_{ij}\) can be calculated via
where \({\text{N}}_{k}({\mathbf{d}}_{j})\) indicates the k nearest neighbors of data point \({\mathbf{d}}_{j}\).
Next, a diagonal matrix O, termed the degree matrix, needs to be established. Its ith diagonal entry is the sum of all the similarities associated with vertex \({\mathbf{d}}_{i}\), i.e. \({\mathbf{o}}_{ii} = \sum\nolimits_{j} {{\mathbf{w}}_{ij} }\). The graph-Laplacian matrix L is obtained by
Finally, the graph embedding regularization term can be formulated by
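The whole construction (weight matrix W, degree matrix O, and L = O − W) can be sketched as follows; the 0/1 neighbor weights are a simplifying assumption made for the example, since the paper's weight formula is given separately:

```python
import numpy as np

def knn_graph_laplacian(D, k):
    """Build a symmetrized k-NN weight matrix W and L = O - W for the
    columns of D (each column is a sample)."""
    n = D.shape[1]
    dist = ((D[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)  # pairwise sq. distances
    W = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(dist[j])[1:k + 1]   # skip the point itself
        W[j, nbrs] = 1.0
    W = np.maximum(W, W.T)                    # enforce symmetry
    L = np.diag(W.sum(axis=1)) - W            # degree matrix minus W
    return W, L

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 8))                   # 5 features, 8 samples
W, L = knn_graph_laplacian(D, k=2)
Z = rng.normal(size=(8, 8))
reg = np.trace(Z @ L @ Z.T)                   # the Tr(Z L Z^T) regularizer
print(abs(L.sum()) < 1e-9, reg >= 0)          # True True
```

The regularizer Tr(Z L Z^T) is nonnegative because L is positive semidefinite, and it penalizes representations that differ across linked neighbors.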
Truncated nuclear norm and graph-Laplacian regularized low-rank representation method
Motivated by strengthening the robustness of LRR, our TGLRR method is put forward. Considering that some data may have nonlinear geometric structure [14], as well as the disadvantages of the nuclear norm, the TNN and graph-embedding manifold learning are introduced into our rank minimization problem to extract more essential information hidden in the data. The objective function of TGLRR is formulated as follows:
where \(\beta \ge 0\) and \(\lambda \ge 0\) are the regularization parameters for balancing the contribution of each term.
Essentially, TGLRR obtains a more precise approximation to the rank function through the TNN than through the nuclear norm, and the underlying low-dimensional structures of the data are captured with the aid of the graph-Laplacian regularization and the basis matrix X.
Optimization solution
To solve the optimization problem (14), an efficient iterative algorithm based on the LADMAP framework is designed. The algorithm (Algorithm 1) alternates two iterative procedures until Eq. (14) converges to the minimum.
The first step is to determine the matrices A and B.
Step 1: Given \({\mathbf{Z}}_{k}\) (k indicates the kth update), the SVD (Singular Value Decomposition) of \({\mathbf{Z}}_{k}\) is conducted: \(\left[ {{\mathbf{U}}_{k} ,{{\varvec{\Sigma}}}_{k} ,{\mathbf{V}}_{k} } \right] = {\text{SVD}}\left( {{\mathbf{Z}}_{k} } \right)\), where \({\mathbf{U}}_{k} = \left( {{\mathbf{u}}_{1} ,{\mathbf{u}}_{2} , \ldots ,{\mathbf{u}}_{m} } \right) \in {\mathbf{R}}^{m \times m}\), \({{\varvec{\Sigma}}}_{k} \in {\mathbf{R}}^{m \times n}\) and \({\mathbf{V}}_{k} = \left( {{\mathbf{v}}_{1} ,{\mathbf{v}}_{2} , \ldots ,{\mathbf{v}}_{n} } \right) \in {\mathbf{R}}^{n \times n}\). \({\mathbf{A}}_{k}\) and \({\mathbf{B}}_{k}\) are then calculated as \({\mathbf{A}}_{k} = \left( {{\mathbf{u}}_{1} ,{\mathbf{u}}_{2} , \ldots ,{\mathbf{u}}_{r} } \right)^{T}\) and \({\mathbf{B}}_{k} = \left( {{\mathbf{v}}_{1} ,{\mathbf{v}}_{2} , \ldots ,{\mathbf{v}}_{r} } \right)^{T}\).
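Step 1 can be sketched directly with an SVD (toy diagonal matrix invented for the example):

```python
import numpy as np

def truncation_matrices(Z, r):
    """A_k and B_k stack the first r left/right singular vectors of
    the current Z_k as rows, as in Step 1 above."""
    U, _, Vt = np.linalg.svd(Z, full_matrices=False)
    A = U[:, :r].T          # rows u_1 .. u_r
    B = Vt[:r]              # rows v_1 .. v_r
    return A, B

Z = np.diag([4.0, 2.0, 1.0])
A, B = truncation_matrices(Z, 2)
# Tr(A Z B^T) recovers the sum of the r largest singular values,
# the quantity subtracted in the truncated nuclear norm
print(round(float(np.trace(A @ Z @ B.T)), 6))  # 6.0
```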
The second step is to resolve the following convex optimization problem:
Step 2: To achieve the separation of the objective function (15), an auxiliary variable \({\mathbf{F}}\) is introduced, and Eq. (15) is rewritten as follows:
Equation (16) can be solved through the LADMAP method, which introduces two Lagrange multipliers \({\mathbf{Y}}^{1}\) and \({\mathbf{Y}}^{2}\). Thus, the augmented Lagrangian function can be defined as
where \(\mu\) is the penalty parameter and \(\left\| \cdot \right\|_{F}\) denotes the Frobenius norm of a matrix, i.e. \(\left\| {\mathbf{X}} \right\|_{F}^{2} = \sum\nolimits_{i = 1}^{m} \sum\nolimits_{j = 1}^{n} \left| {\mathbf{x}}_{ij} \right|^{2}\).
Next, the alternating minimization strategy is adopted to compute Z, F, G and P. In the iterative procedure, Z, F, G or P is updated when the other three variables are fixed, respectively.
Updating Z
To obtain the solution of Z, the following minimization objective w.r.t. Z needs to be solved.
Equation (18) has a closedform solution:
where \(\Theta \left( \cdot \right)\) indicates the Singular Value Thresholding (SVT) operator, \(\nabla_{z} q\left( {{\mathbf{Z}}_{k} } \right) = \frac{\beta }{2}\left( {{\mathbf{ZL}}^{T} + {\mathbf{ZL}}} \right) + \mu_{k} \left( {{\mathbf{Z}} - {\mathbf{F}}_{k} + {{\mathbf{Y}}_{k}^{2} }/{\mu_{k} }} \right) + \mu_{k} {\mathbf{X}}^{T} \left( {{\mathbf{XZ}} - {\mathbf{X}} + {\mathbf{G}}_{k} {\mathbf{X}} + {\mathbf{P}}_{k} - {{\mathbf{Y}}_{k}^{1} }/{\mu_{k} }} \right)\), and \(\eta_{1} = \beta \left\| {\mathbf{L}} \right\|_{2} + \mu_{k} \left( {1 + \left\| {\mathbf{X}} \right\|_{2}^{2} } \right)\).
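The SVT operator Θ(·) used in this update can be sketched as:

```python
import numpy as np

def svt(M, tau):
    """Singular Value Thresholding: shrink every singular value of M
    by tau (floored at zero) and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

M = np.diag([3.0, 1.0, 0.5])
shrunk = np.linalg.svd(svt(M, 1.0), compute_uv=False)  # spectrum after SVT
print(np.round(shrunk, 6).tolist())  # [2.0, 0.0, 0.0]
```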
Updating G
Similar to the solution of Z, the SVT operator is employed to compute G. The optimal solution \({\mathbf{G}}_{k + 1}^{*}\) is
where \(\nabla_{g} q\left( {{\mathbf{G}}_{k} } \right) = \mu_{k} \left( {{\mathbf{XZ}}_{k + 1} - {\mathbf{X}} + {\mathbf{G}}_{k} {\mathbf{X}} + {\mathbf{P}}_{k} - {{\mathbf{Y}}_{k}^{1} }/{\mu_{k} }} \right){\mathbf{X}}^{T}\) and \(\eta_{2} = \mu_{k} \left\| {\mathbf{X}} \right\|_{2}^{2}\).
Updating F
The subproblem w.r.t. F is
Equation (21) is a smooth convex programming problem. Different from the solving rules for Z and G, we can differentiate Eq. (21) and set the derivative to zero to obtain the solution for F. Its optimal solution is
Updating P
Computing P requires optimizing the following objective:
The optimal solution to the above subproblem w.r.t. P can be formulated by
where \(S_{\lambda /\mu_{k} } \left( \cdot \right)\) is the shrinkage (soft-thresholding) operator applied element-wise, \(\left[ S_{\tau } \left( {\mathbf{M}} \right) \right]_{ij} = {\text{sign}}\left( m_{ij} \right)\max \left( \left| m_{ij} \right| - \tau ,0 \right)\), which is the proximal operator of the L_{1}-norm.
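A sketch of this element-wise shrinkage (toy matrix invented for the example):

```python
import numpy as np

def soft_threshold(M, tau):
    """Element-wise shrinkage sign(m_ij) * max(|m_ij| - tau, 0),
    the proximal operator of the L1-norm used in the P-update."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

M = np.array([[3.0, -0.5], [-2.0, 0.2]])
out = soft_threshold(M, 1.0) + 0.0     # "+ 0.0" normalizes -0.0 entries
print(out.tolist())                    # [[2.0, 0.0], [-1.0, 0.0]]
```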
Updating \(\mu_{k}\), \({\mathbf{Y}}_{k}^{1}\) and \({\mathbf{Y}}_{k}^{2}\)
After computing the above variables, the two Lagrange multipliers \({\mathbf{Y}}_{k}^{1}\) and \({\mathbf{Y}}_{k}^{2}\) are updated by
The iteration rule for \(\mu_{k}\) is
The detailed algorithm of the TGLRR model is shown in Algorithm 1.
Time complexity
In this subsection, the time complexity of TGLRR is discussed. Clearly, the main running time of TGLRR is spent on calculating the matrices Z, F, G and P. The \(m \times n\) input data matrix X has m genes and n samples. The time complexity of the SVD with respect to Z is \({\rm O}\left( {r_{{\mathbf{Z}}} n^{2} } \right)\) (\(r_{{\mathbf{Z}}}\) is the lowest rank of Z determined by Algorithm 1). Likewise, the time complexity of the SVD of G is \({\rm O}\left( {r_{{\mathbf{G}}} m^{2} } \right)\). The optimal solution of F can be obtained in \({\rm O}\left( {rmn} \right)\). In the solving procedure for P, \({\mathbf{Y}}^{1}\) also needs to be updated; the computational costs of P and \({\mathbf{Y}}^{1}\) are \({\rm O}\left( {mn^{2} + mr_{{\mathbf{P}}} n} \right)\) and \({\rm O}\left( {nm^{2} } \right)\), respectively. Since \(m \gg n\) in our datasets, the total time cost of Algorithm 1 is \({\rm O}\left( {nm^{2} } \right)\).
Availability of data and materials
The TCGA datasets that support the findings of this study are available at https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga. The results from the free trial version of GeneCards (https://www.genecards.org/) were used only for scientific research, not for commercial purposes.
Abbreviations
LRR: Low-Rank Representation
LADMAP: Linearized Alternating Direction with Adaptive Penalty
PCA: Principal Component Analysis
RPCA: Robust PCA
LE: Laplacian Eigenmap
LLE: Locally Linear Embedding
LLRR: Graph-Laplacian regularized LRR
LatLRR: Latent LRR
TNN: Truncated Nuclear Norm
TGLRR: Truncated nuclear norm and Graph-Laplacian regularized Low-Rank Representation
TCGA: The Cancer Genome Atlas
PAAD: Pancreatic Ductal Adenocarcinoma
ESCA: Esophageal Carcinoma
COAD: Colorectal Adenocarcinoma
CHOL: Cholangiocarcinoma
HNSC: Head and Neck Squamous Cell Carcinoma
ACC: Accuracy
NMI: Normalized Mutual Information
SVT: Singular Value Thresholding
SVD: Singular Value Decomposition
References
Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424.
Lokody I. Cancer genomics: signature analysis suggests cancer origins. Nat Rev Genet. 2013;14(10):677.
Liu JX, Gao YL, Zheng CH, Xu Y, Yu J. Block-constraint robust principal component analysis and its application to integrated analysis of TCGA data. IEEE Trans Nanobiosci. 2016;15(6):510–6.
Yu N, Wu MJ, Liu JX, Zheng CH, Xu Y. Correntropy-based hypergraph regularized NMF for clustering and feature selection on multi-cancer integrated data. IEEE Trans Cybern. 2020.
Liu JX, Wang YT, Zheng CH, Sha W, Mi JX, Xu Y. Robust PCA based method for discovering differentially expressed genes. BMC Bioinform. 2013;14(S8).
Feng CM, Xu Y, Liu JX, Gao YL, Zheng CH. Supervised discriminative sparse PCA for com-characteristic gene selection and tumor classification on multi-view biological data. IEEE Trans Neural Netw Learn Syst. 2019;30(10):2926–37.
Liu W, Yang X, Tao D, Cheng J, Tang Y. Multi-view dimension reduction via Hessian multiset canonical correlations. Inf Fusion. 2018;41:119–28.
Abualigah LM, Khader AT, Al-Betar MA, Alomari OA. Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Syst Appl. 2017;84:24–36.
Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010;2(4):433–59.
Candes EJ, Li X, Ma Y, Wright J. Robust principal component analysis? J ACM. 2011;58(3):1–37.
Feng CM, Gao YL, Liu JX, Zheng CH, Yu J. PCA based on graph Laplacian regularization and P-norm for gene selection and clustering. IEEE Trans Nanobiosci. 2017;16(7):257–65.
Wright J, Ganesh A, Rao S, Ma Y. Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. arXiv preprint arXiv:0905.0233. 2009.
Xu XX, Gao YL, Liu JX, Wang YX, Dai LY, Kong XZ, Yuan SS. A novel low-rank representation method for identifying differentially expressed genes. Int J Data Min Bioinform. 2018;19(3):185–201.
Yin M, Gao J, Lin Z. Laplacian regularized low-rank representation and its applications. IEEE Trans Pattern Anal Mach Intell. 2016;38(3):504–17.
Jiao CN, Gao YL, Yu N, Liu JX, Qi LY. Hypergraph regularized constrained NMF for selecting differentially expressed genes and tumor classification. IEEE J Biomed Health Inform. 2020;24(10):3002–11.
Liu G, Lin Z, Yu Y. Robust subspace segmentation by low-rank representation. In: International conference on machine learning. 2010. p. 663–70.
Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y. Robust recovery of subspace structures by low-rank representation. IEEE Trans Pattern Anal Mach Intell. 2013;35(1):171–84.
Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–23.
Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003;15(6):1373–96.
Donoho DL, Grimes C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci. 2003;100(10):5591–6.
Gao S, Tsang IWH, Chia LT. Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications. IEEE Trans Pattern Anal Mach Intell. 2012;35(1):92–104.
Liu G, Yan S. Latent low-rank representation for subspace segmentation and feature extraction. In: 2011 international conference on computer vision. IEEE; 2011. p. 1615–22.
Cao F, Chen J, Ye H, Zhao J, Zhou Z. Recovering low-rank and sparse matrix based on the truncated nuclear norm. Neural Netw. 2017;85:10–20.
Zhang D, Hu Y, Ye J, Li X, He X. Matrix completion by truncated nuclear norm regularization. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE; 2012. p. 2192–9.
Yao H, Debing Z, Jieping Y, Xuelong L, Xiaofei H. Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE Trans Pattern Anal Mach Intell. 2013;35(9):2117–30.
Liu Q, Lai Z, Zhou Z, Kuang F, Jin Z. A truncated nuclear norm regularization method based on weighted residual error for matrix completion. IEEE Trans Image Process. 2015;25(1):316–30.
Lin Z, Liu R, Su Z. Linearized alternating direction method with adaptive penalty for low-rank representation. In: International conference on neural information processing systems. 2011. p. 612–20.
Wang YX, Gao YL, Liu JX, Kong XZ, Li HJ. Robust principal component analysis regularized by truncated nuclear norm for identifying differentially expressed genes. IEEE Trans Nanobiosci. 2017;16(6):447–54.
Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011;33(8):1548–60.
Cui Y, Zheng CH, Yang J. Identifying subspace gene clusters from microarray data using low-rank representation. PLoS ONE. 2013;8(3):e59377.
Yin M, Gao J, Lin Z, Shi Q, Guo Y. Dual graph regularized latent low-rank representation for subspace clustering. IEEE Trans Image Process. 2015;24(12):4918–33.
Ponder BA. Cancer genetics. Nature. 2001;411(6835):336–41.
Pei Y, Wang P, Liu H, He F, Ming L. FOXQ1 promotes esophageal cancer proliferation and metastasis by negatively modulating CDH1. Biomed Pharmacother. 2015;74:89–94.
Schutter HD, Geeraerts H, Verbeken E, Nuyts S. Promoter methylation of TIMP3 and CDH1 predicts better outcome in head and neck squamous cell carcinoma treated by radiotherapy only. Oncol Rep. 2009;21(2):507–13.
Zhao L, Wang YX, Xi M, Liu SL, Zhang P, Luo LL, Liu MZ. Association between E-cadherin (CDH1) polymorphisms and pancreatic cancer risk in Han Chinese population. Int J Clin Exp Pathol. 2015;8(5):5753.
Xu Z, Zheng X, Yang L, Liu F, Zhang E, Duan W, Bai S, Safdar J, Li Z, Sun C. Chemokine receptor 7 promotes tumor migration and invasiveness via the RhoA/ROCK pathway in metastatic squamous cell carcinoma of the head and neck. Oncol Rep. 2015;33(2):849–55.
Faried A, Nakajima M, Sohda M, Miyazaki T, Kato H, Kuwano H. Correlation between RhoA overexpression and tumour progression in esophageal squamous cell carcinoma. Eur J Surg Oncol. 2005;31(4):410–4.
Kusama T, Mukai M, Iwasaki T, Tatsuta M, Matsumoto Y, Akedo H, Nakamura H. Inhibition of epidermal growth factor-induced RhoA translocation and invasion of human pancreatic cancer cells by 3-hydroxy-3-methylglutaryl-coenzyme A reductase inhibitors. Cancer Res. 2001;61(12):4885–91.
Wu H, Zheng J, Deng J, Zhang L, Li N, Li W, Li F, Lu J, Zhou Y. LincRNA-uc002yug.2 involves in alternative splicing of RUNX1 and serves as a predictor for esophageal cancer and prognosis. Oncogene. 2015;34(36):4723–34.
Cathrine H, Schildkraut JM, Murphy SK, Wong-Ho C, Vaughan TL, Harvey R, Marks JR, Jirtle RL, Brian C. IGF2R polymorphisms and risk of esophageal and gastric adenocarcinomas. Int J Cancer. 2009;125(11):2673–8.
Natarajan BK. Sparse approximate solutions to linear systems. SIAM J Comput. 1995;24(2):227–34.
Fang X, Han N, Wu J, Xu Y, Yang J, Wong WK, Li X. Approximate low-rank projection learning for feature extraction. IEEE Trans Neural Netw Learn Syst. 2018;29(11):5228–41.
Wang J, Lu CH, Liu JX, Dai LY, Kong XZ. Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints. BMC Bioinform. 2019;20(1):718.
Acknowledgements
I am grateful to the anonymous reviewers whose suggestions and comments contributed to the significant improvement of this paper. I thank Shandong University of Science and Technology for providing the experimental site.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 12, 2021: Explainable AI methods in biomedical data science. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-12.
Funding
Publication costs are funded by the National Statistical Science Research Project under Grant No. 2019LY49. The funders did not play any role in the design of the study, the collection, analysis, and interpretation of data, or in writing of the manuscript.
Author information
Authors and Affiliations
Contributions
QL contributed to the design of the study, implemented TGLRR, performed the experiments, drafted the manuscript, and approved the final manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The author declares no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, Q. A truncated nuclear norm and graph-Laplacian regularized low-rank representation method for tumor clustering and gene selection. BMC Bioinformatics 22 (Suppl 12), 436 (2021). https://doi.org/10.1186/s12859-021-04333-y