A truncated nuclear norm and graph-Laplacian regularized low-rank representation method for tumor clustering and gene selection

Background: Clustering and feature selection play major roles in many communities. As a matrix factorization technique, Low-Rank Representation (LRR) has attracted much attention for clustering and feature selection, but its performance degrades when the data samples are insufficient or heavily contaminated by noise. Results: To address this drawback, a novel LRR model named TGLRR is proposed by integrating the truncated nuclear norm with graph-Laplacian regularization. Unlike the nuclear norm, which minimizes all singular values, the truncated nuclear norm minimizes only the smallest singular values, which avoids the harmful shrinkage of the leading singular values. Finally, an efficient algorithm based on the Linearized Alternating Direction method with Adaptive Penalty (LADMAP) is applied to solve the optimization problem. Conclusions: The results show that the TGLRR method exceeds existing state-of-the-art methods in tumor clustering and gene selection on integrated gene expression data.

hinders advances in medical studies [4,5]. Evidently, data dimension reduction plays a momentous role in genomic data analysis.
Data dimension reduction aims to obtain a meaningful low-dimensional representation of high-dimensional data, remove redundant features and prevent overfitting. It has therefore achieved great success in many areas, such as characteristic gene selection [6], image analysis [7], and text documents [8]. Principal Component Analysis (PCA) is one of the most classic linear dimension reduction methods [9]. Owing to its high efficiency, PCA has been widely applied to different kinds of data and developed in many fields [5,10,11]. To boost the robustness of PCA, Candès et al. developed a new variant in [10], called Robust PCA (RPCA), which has been exploited for background modeling from video and for analyzing face images. Moreover, in [5], Liu et al. exploited RPCA for discovering differentially expressed genes. Although the above PCA methods have all obtained excellent results, their performance deteriorates on noisy observation data.
In [12], Wright et al. introduced a low-rank matrix recovery approach for removing noise from data; PCA is then applied to the recovered low-rank matrix. This approach significantly enhances the robustness of data processing.
The experimental data X are usually obtained from a union of multiple subspaces S = S_1 ∪ S_2 ∪ ... ∪ S_k rather than a single space, where each S_i is a low-dimensional subspace hidden in the high-dimensional space [13-15]. Since PCA-related methods assume the data come from a single low-dimensional space, Liu et al. proposed the low-rank representation (LRR) model, which can uncover the global distribution among the data points of X [16]. LRR seeks the lowest-rank matrix representation of the original data and has achieved brilliant results in several applications [16,17]. However, LRR still has a few shortcomings; for instance, it cannot reveal the local manifold structure of data sampled from a non-linear low-dimensional manifold. Fortunately, various manifold learning models have been put forward, such as ISOMAP [18], Laplacian Eigenmap (LE) [19], Locally Linear Embedding (LLE) [20], and graph-Laplacian regularization [21].
A graph-Laplacian regularized LRR (LLRR) model [14] was developed, which introduces graph regularization into LRR. The LLRR model captures the useful rules hidden among the data points, including the global geometric structure and the internal similarity information. However, LLRR exploits only one view of the data, namely the data manifold. In contrast, the Latent LRR (LatLRR) model adds another view, the feature manifold, for image processing [22]. To solve the minimization problems of LRR, LLRR and LatLRR, the common approach is to use the nuclear norm to approximate the rank operator. For a data matrix X, the nuclear norm is the sum of all singular values of X. Since the nuclear norm minimizes the sum of all singular values, while the non-zero singular values have different influences on the rank [23], the nuclear norm may not be the best approximation to the rank of a matrix. To better approximate the rank and handle the resulting non-convex optimization problems, the truncated nuclear norm (TNN) was proposed in [24] and has attracted much attention [13,23,25,26]. The TNN, defined as the sum of the smallest singular values of a matrix, avoids the harmful shrinkage of the leading singular values, so it may be a more robust surrogate for the rank of a matrix than the nuclear norm.
To strengthen the efficiency and robustness of the model, a novel LRR method is developed in this paper, named the Truncated nuclear norm and graph-Laplacian regularized Low-Rank Representation model (TGLRR). In the objective function of TGLRR, the nuclear norm is replaced by the TNN to obtain a robust approximation of the rank function, a graph-Laplacian regularization is imposed to capture the local manifold structure, and the L1-norm is used to impose a sparsity constraint on the outliers. The main contributions of our paper are as follows. Firstly, compared with the popular LRR model regularized by the nuclear norm, our TGLRR method obtains better performance through the TNN and addresses the non-convex and discontinuous issues. Secondly, the TGLRR method can seize the valuable information lying in the data manifold and the feature manifold simultaneously. Finally, the TGLRR method can capture the internal similarity information and the underlying affinity among data points by incorporating a graph regularization term and representing each data point as a linear combination of a set of bases.
The remainder of this article is organized as follows. In the Results section, TGLRR is exploited for clustering and feature selection on integrated gene expression data. In the Conclusions section, conclusions and future work are given. In the Methods section, our TGLRR method is put forward and the optimization problem is solved through an efficient framework based on LADMAP [27].

Integrative gene expression datasets
To validate the performance of the TGLRR model, six clustering experiments and a feature selection experiment are conducted. The experimental data are integrative cancer gene expression data rather than single cancer data, in order to avoid the sample imbalance problem. Seven different datasets are produced by integrating five different single gene expression datasets downloaded from The Cancer Genome Atlas (TCGA, https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The pertinent information about the seven integrative datasets is listed in Table 1.
PAAD, ESCA, COAD, CHOL and HNSC are the abbreviations of Pancreatic Ductal Adenocarcinoma, Esophageal Carcinoma, Colorectal Adenocarcinoma, Cholangiocarcinoma, and Head and Neck Squamous Cell Carcinoma, respectively. Taking PAAD-COAD

Table 1 Description of the seven integrative gene expression datasets

To eliminate redundant features and avoid over-fitting, the dimension of the data matrix X is reduced before the clustering and feature selection experiments, which also greatly reduces the computational cost. PCA is chosen for the dimension reduction experiments in our paper. After dimension reduction, 2000-dimensional data X are obtained.
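As a rough illustration of this preprocessing step, PCA-based dimension reduction can be sketched with NumPy as follows. This is a minimal sketch on toy data; `pca_reduce` and its dimensions are illustrative, not the paper's exact pipeline.

```python
import numpy as np

def pca_reduce(X, n_components=2000):
    """Reduce a data matrix X (samples x genes) to n_components
    dimensions by projecting onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    # SVD of the centered matrix; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return Xc @ Vt[:k].T                         # project onto the top-k components

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))    # toy data: 60 samples, 500 genes
X_low = pca_reduce(X, n_components=40)
print(X_low.shape)                # (60, 40)
```

In the paper's experiments the same idea is applied with `n_components=2000` on the much larger integrated expression matrices.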
Unlike the method in [25], which tries all possible values to find the optimal value of r, the method in [28] is used here to choose the optimal value of r; it requires drawing a curve of the singular values. Figure 2 shows the singular-value curves for the six datasets used in the clustering experiments.
In Fig. 2, the X-axis indicates the index and the Y-axis denotes the singular values. The value at the first inflection point of each curve is chosen as the value of r for the corresponding dataset. The principle for selecting r is that the singular values before
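One simple heuristic for locating such a bend in the singular-value curve is the discrete second difference (the point where the slope of the curve changes most sharply). The sketch below is an illustrative assumption, not the exact rule used in [28]; `choose_r` is a hypothetical helper.

```python
import numpy as np

def choose_r(X, max_check=20):
    """Pick r at the first 'inflection point' of the singular-value curve,
    approximated here by the index where the discrete second difference
    (change in slope) of the leading singular values is largest."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    head = s[:max_check]
    curvature = np.diff(head, n=2)           # discrete second difference
    return int(np.argmax(curvature)) + 1     # 1-based index of the bend

# toy matrix with a known spectrum and a clear bend after the 3rd value
s_true = np.array([10.0, 9.5, 9.0, 0.5, 0.4, 0.3, 0.2, 0.1])
X = np.diag(s_true)
print(choose_r(X, max_check=8))              # -> 3
```

On real data the curve is noisier, which is why the paper inspects the plotted curves (Fig. 2) rather than relying on a fully automatic rule.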

Convergence analysis
Since Algorithm 1 is a practical application of the LADMAP framework, whose convergence has been proved in [27], Algorithm 1 should also be convergent. Many approaches can demonstrate the convergence property of algorithms [23,29]. In this paper, an efficient approach from [29] based on an auxiliary function is exploited to validate the convergence of the TGLRR method. The results are exhibited in Fig. 3.
The abscissa in Fig. 3 indicates the iteration number and the ordinate denotes the loss function value. As shown in Fig. 3, our model is convergent. The TGLRR method begins to converge after 30 iterations on two datasets, HNSC-ESCA and CHOL-HNSC-ESCA, and converges within 40 iterations on the other four datasets. The HNSC-ESCA dataset may be easier to handle, which explains why our method converges after 30 iterations on those two datasets while needing 40 iterations on the other four.
With respect to the dictionary matrix X, the optimal solution Z* of TGLRR can represent "the minimum rank representation" of the data matrix X. Moreover, the i-th column of Z can be regarded as a "better" reflection of the i-th column of X, making the subspace structure more easily detectable [31]. Namely, the optimal

To measure the performance of our approach, three metrics are adopted in this paper: accuracy (ACC), normalized mutual information (NMI) and F-measure. As a widely used metric in the machine learning field, ACC is defined as

ACC = ( Σ_{i=1}^{n} δ(L(i), Map(l_i)) ) / n,

where n is the total number of tumor samples in an integrated dataset, δ(x, y) is a delta function equal to 1 only when x = y and 0 otherwise, L(i) denotes the true class label of the i-th sample, and l_i is the cluster label produced by the algorithm. Map(l_i) is a mapping function that permutes every l_i to best match the real sample labels.
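The mapping Map(·) is typically found by maximum-weight bipartite matching between clusters and true classes. A minimal sketch of ACC using SciPy's Hungarian solver; `clustering_accuracy` is an illustrative helper, assuming `scipy` is available.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """ACC: fraction of samples whose cluster label, after the best
    one-to-one mapping Map(.) of clusters onto true classes, matches
    the true class label. The mapping is found by the Hungarian method."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # counts[i, j] = number of samples in cluster j whose true class is i
    counts = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, c in enumerate(classes):
        for j, k in enumerate(clusters):
            counts[i, j] = np.sum((true_labels == c) & (cluster_labels == k))
    row, col = linear_sum_assignment(-counts)    # maximize matched samples
    return counts[row, col].sum() / len(true_labels)

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 0, 0, 1]                # permuted cluster indices
print(clustering_accuracy(y_true, y_pred))       # 0.875
```

Note that ACC is invariant to the arbitrary numbering of clusters, since the best permutation is found before counting matches.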
The second index, NMI, is defined as

NMI(T, T') = MI(T, T') / max(H(T), H(T')),

where T and T' denote two different sets of tumor clusters, and H(T) and H(T') are the entropies of T and T', respectively. The mutual information is

MI(T, T') = Σ_{t∈T} Σ_{t'∈T'} p(t, t') log( p(t, t') / ( p(t) p(t') ) ),

where p(t) is the marginal probability that a tumor sample arbitrarily chosen from an integrated dataset belongs to cluster t, and p(t, t') is the joint probability that a tumor sample belongs to clusters t and t' simultaneously. F-measure is a comprehensive evaluation index considering both precision and recall:

F-measure = 2 · precision · recall / (precision + recall).

To prove the effectiveness of TGLRR, the detailed clustering results of these methods on the integrative tumor gene expression data are listed in three tables. In the tables, the ACC, NMI and F-measure values are the averages over 100 clustering runs of each approach, and the values to the right of ± are the variances of the 100 runs. Table 2 reports the clustering results on the PAAD-COAD and HNSC-ESCA datasets. Obviously, our TGLRR method exceeds the other six comparison methods on the PAAD-COAD dataset; judging from the variance values, TGLRR is also more robust than the other six methods on this dataset. The HNSC-ESCA data are exceptional and may be easy to handle: the clustering results of all algorithms are good, and in particular the evaluation indices of RPCA and TGLRR are 1.
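The NMI and F-measure defined above can be computed as follows (a sketch: scikit-learn's NMI with `average_method='max'` matches the max-entropy normalization given above, and `f_measure` is an illustrative helper).

```python
from sklearn.metrics import normalized_mutual_info_score

def f_measure(precision, recall):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 0, 0, 1]
# 'max' normalization divides MI(T, T') by max(H(T), H(T'))
nmi = normalized_mutual_info_score(y_true, y_pred, average_method='max')
print(0.0 <= nmi <= 1.0)          # True
print(f_measure(0.8, 0.5))
```

Like ACC, NMI is invariant to the arbitrary numbering of clusters, so no label mapping is required.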
The clustering results on two integrated datasets containing three types of tumors are exhibited in Table 3. It can be seen that the clustering performance of the TGLRR model outperforms the other models on the COAD-PAAD-ESCA dataset. On the CHOL-HNSC-ESCA dataset, TGLRR's ACC, NMI and F-measure values are higher than those obtained by all other models except LRR. Consequently, it can still be said that the TGLRR method outstrips the other methods on the CHOL-HNSC-ESCA dataset.

Feature selection
Cancers are commonly related to gene mutations or the abnormal expression of genes. Thus, in this subsection, the TGLRR method is used to identify co-feature genes of PAAD, ESCA and HNSC from the PAAD-ESCA-HNSC dataset. From formula (14), a minimizing solution G* can be obtained from an integrative gene expression data matrix X via the TGLRR scheme. G* captures the feature manifold structure lying in the data, so it can be applied to feature gene extraction. From the viewpoint of cancer, its pathogenesis may be related to gene mutation [32], so it is extremely meaningful to find the feature genes inducing cancers from gene expression data.
Similar to the Parameters Selection subsection, 10^{-2}, 10^{3} and 4 are assigned to λ, β and r, respectively. Table 5 exhibits the top 10 co-feature genes with the highest mean relevance scores identified by the TGLRR method from the PAAD-ESCA-HNSC dataset. The related diseases, related pathways and encoded proteins of these genes are obtained from GeneCards (https://www.genecards.org/). These genes are the most likely to lead to PAAD, ESCA and HNSC simultaneously.
From Table 5, CDH1, the gene with the highest relevance score, can clearly result in a host of cancers, which indicates that CDH1 may be a dangerous co-feature gene. Moreover, PAAD, ESCA and HNSC are all correlated with CDH1 and RHOA, which can be confirmed from [33-38]. It is a verifiable fact from the literature that TGFB1 and RELA both serve as predictors for PAAD and ESCA. Some data show that PTPN11 may induce HNSC and PAAD. From [39,40], it can be seen that ESCA is relevant to IGF2R and RUNX1. In addition, the related pathways of RUNX1 and EWSR1 include transcriptional misregulation in cancer. Therefore, RUNX1 and EWSR1 may be co-characteristic genes of PAAD, ESCA and HNSC.
All in all, the TGLRR method is successful in identifying co-characteristic genes on the integrative gene expression datasets.

Discussions
The TGLRR method is applied to tumor clustering and gene selection, and is superior to the other methods. Based on the above results, it can be affirmed that the TNN captures more valuable information from the data than the nuclear norm. By comparison with the results of DGLRR, a conclusion can be drawn that the graph-Laplacian regularization imposed on the feature manifold may adversely affect clustering on our integrative datasets.

Table 4 The clustering results on HNSC-PAAD-CHOL-ESCA and ESCA-COAD-CHOL-PAAD data

The TGLRR method has some limitations. For example, on the HNSC-PAAD-CHOL-ESCA dataset, the variance values of TGLRR are larger than those of LRR. This may be caused by the integrated datasets, and the stability of the method needs to be improved in the future. In a word, these improvements over the previous LRR model help TGLRR capture more useful information concealed in the low-dimensional manifold structure.

Conclusions
This paper proposes a Low-Rank Representation approach called TGLRR. It captures the global and local geometric structures of the data manifold by using the raw data matrix as the dictionary matrix and introducing a graph-Laplacian regularization term. Furthermore, TGLRR attains a better approximation to the rank operator than approaches regularized by the nuclear norm. The objective function of TGLRR is solved through an iterative algorithm based on the LADMAP framework. The efficiency and robustness of our TGLRR method are demonstrated by the encouraging experimental results.

Related LRR methods
Based on the assumption that the observation data X are sampled from a union of several low-dimensional subspaces S = S_1 ∪ S_2 ∪ ... ∪ S_k located in a high-dimensional space, LRR was proposed in [16]. If the data are noiseless, the rank minimization problem of LRR is written as

min_Z rank(Z), s.t. X = DZ, (4)

where X = [x_1, x_2, ..., x_n] ∈ R^{m×n} is the original data matrix and Z ∈ R^{n×n} is a low-rank matrix recovered from X via LRR. D ∈ R^{m×n} is a basis matrix (also called the dictionary matrix), which linearly spans the whole data space. Observation data in real life generally contain more or less noise, so the optimization problem (4) may be impracticable. The LRR model with noise is

min_{Z,P} rank(Z) + λ ||P||_0, s.t. X = DZ + P, (5)

where P ∈ R^{m×n} is the reconstruction error matrix (also called the noise matrix), and λ is a penalty parameter that balances the sparsity of matrix P against the reconstruction fidelity of the data matrix X damaged by the error matrix P. ||P||_0 is the L0-norm of matrix P, i.e. the number of non-zero elements in P.

Table 5 The top 10 genes selected via TGLRR on PAAD-ESCA-HNSC. Taking the contents in the second column of the second row as an example, the first, second and third numerals are the relevance scores of the CDH1 gene to PAAD, ESCA and HNSC, respectively, and the fourth is their mean
Since the rank function is discrete, problem (5) may have multiple solutions, and the L0-minimization is non-convex and intractable; solving problem (5) is in general NP-hard [41]. To make the rank minimization problem tractable, the nuclear norm is imposed on the low-rank matrix, and the L0-norm is replaced by the L2,1-norm [17].
The convex optimization problem of the LRR model is written as

min_{Z,P} ||Z||_* + λ ||P||_{2,1}, s.t. X = DZ + P. (6)

To obtain a self-expressive model, the observation data X are generally installed as the dictionary matrix [13,14,22]. The final LRR model becomes

min_{Z,P} ||Z||_* + λ ||P||_{2,1}, s.t. X = XZ + P. (7)

For the low-rank matrix Z = [z_1, z_2, ..., z_n], each element z_ij reflects manifold information, i.e. the similarity between data points x_i and x_j. Therefore, matrix Z can be seen as an affinity matrix [14]. LRR is devoted to seeking the lowest-rank representation of the observation data. With the help of an appropriate dictionary matrix, the underlying row space can be recovered via the lowest-rank representation, such that the true segmentation of the data can be correctly revealed. Thus, the LRR method can handle data extracted from a union of multiple subspaces well [17].
Nevertheless, the LRR method faces two issues because the raw data X are used as the basis. First, LRR requires the basis to contain sufficient data samples from the subspaces, so that it can represent the underlying subspaces. Second, LRR demands that the noise in X be small, i.e. only a part of X may be corrupted. To remedy these two shortcomings of LRR, Liu et al. proposed the following convex optimization problem [22]:

min_{Z,G,P} ||Z||_* + ||G||_* + λ ||P||_1, s.t. X = XZ + GX + P, (8)

where ||P||_1 = Σ_{i=1}^{n} Σ_{j=1}^{m} |p_ij| is the L1-norm of matrix P and G is the feature matrix separated from the original X. Equation (8) is a state-of-the-art LRR-based subspace learning model named LatLRR. By means of the LatLRR model, the observed samples can be expressed effectively via many unobserved samples [42]. In practical applications, Z and G are applied to cluster analysis and feature selection, respectively.

Truncated nuclear norm (TNN)
The TNN is the sum of a few smallest singular values, i.e. the nuclear norm minus the sum of the largest singular values [24]. As an approximation of the rank operator, the largest r singular values contribute little information about the rank, while the smallest min(m, n) − r singular values play the crucial role [23]. Compared to the nuclear norm, the TNN may therefore be a better approximation to the rank operator. Its mathematical formula is

||Z||_r = Σ_{i=r+1}^{min(m,n)} δ_i(Z), (9)

where δ_i(Z) denotes the i-th largest singular value of Z and r is a non-negative integer with r ≤ min(m, n).
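The definition in Eq. (9) can be checked numerically: the TNN equals the nuclear norm minus the r largest singular values. A minimal NumPy sketch; `truncated_nuclear_norm` is an illustrative helper.

```python
import numpy as np

def truncated_nuclear_norm(Z, r):
    """||Z||_r: sum of the min(m, n) - r smallest singular values of Z,
    i.e. the nuclear norm minus the r largest singular values."""
    s = np.linalg.svd(Z, compute_uv=False)   # singular values, descending
    return s[r:].sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 5))
s = np.linalg.svd(Z, compute_uv=False)
print(np.isclose(truncated_nuclear_norm(Z, 2), s.sum() - s[:2].sum()))  # True
```

Minimizing this quantity penalizes only the tail of the spectrum, leaving the leading singular values (which carry most of the matrix's energy) untouched.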
Since the minimization of Eq. (9) is non-convex, it cannot be solved directly by standard convex approaches. To overcome this issue, Hu et al. proposed a theorem [25], which states that ||Z||_r = ||Z||_* − max_{AA^T = I, BB^T = I} Tr(AZB^T), where A ∈ R^{r×m} and B ∈ R^{r×n}. According to this theorem, an equivalent transformation of Eq. (9) can be achieved.

Graph-Laplacian regularization
Graph-Laplacian regularization is an outstanding manifold learning method that can uncover the internal geometric structure among data points. As a result, a number of LRR models regularized by graph-embedding manifold regularization have naturally appeared [13,43].
Given a k-nearest-neighbor graph G with n vertices, each vertex denotes a data point lying on an underlying sub-manifold M [11]. A symmetric weight matrix W ∈ R^{n×n} is then constructed, where w_ij expresses the weight of the edge linking vertices i and j. The value of every w_ij can be calculated via

w_ij = 1 if d_i ∈ N_k(d_j) or d_j ∈ N_k(d_i), and w_ij = 0 otherwise,

where N_k(d_j) indicates the k-nearest neighbors of data point d_j. Next, a diagonal matrix O, termed the degree matrix, is established; its i-th diagonal entry is the sum of all the similarities associated with vertex d_i, i.e. o_ii = Σ_j w_ij. The graph-Laplacian matrix L can be obtained by

L = O − W.

Finally, the graph-embedding regularization term can be formulated as

(1/2) Σ_{i,j} ||z_i − z_j||^2 w_ij = Tr(ZLZ^T). (13)
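The construction above can be sketched as follows. This is a toy NumPy sketch assuming the 0-1 weighting scheme and data points stored as columns; `knn_graph_laplacian` is an illustrative helper.

```python
import numpy as np

def knn_graph_laplacian(X, k=3):
    """Build a symmetric 0-1 k-NN weight matrix W over the columns of X
    (one data point per column), the degree matrix O, and L = O - W."""
    n = X.shape[1]
    # squared Euclidean distances between all pairs of data points (columns)
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(d2[:, j])[1:k + 1]     # k nearest, excluding j itself
        W[nbrs, j] = 1.0
    W = np.maximum(W, W.T)           # symmetrize: i in N_k(j) or j in N_k(i)
    O = np.diag(W.sum(axis=1))       # degree matrix
    return O - W                     # graph-Laplacian L

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))         # 10 features, 8 data points
L = knn_graph_laplacian(X, k=2)
print(np.allclose(L.sum(axis=1), 0))  # rows of L sum to zero: True
```

The zero row sums and symmetry of L are the properties that make Tr(ZLZ^T) equal to the weighted sum of squared differences in Eq. (13).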

Truncated nuclear norm and graph-Laplacian regularized low-rank representation method
Motivated by the goal of strengthening the robustness of LRR, our TGLRR method is put forward.
Considering that some data may have nonlinear geometric structure [14], together with the disadvantages of the nuclear norm, the TNN and graph-embedding manifold learning are introduced into our rank minimization problem to extract more of the essential information hidden in the data. The objective function of TGLRR is formulated as follows:

min_{Z,G,P} ||Z||_r + ||G||_* + β Tr(ZLZ^T) + λ ||P||_1, s.t. X = XZ + GX + P, (14)

where β ≥ 0 and λ ≥ 0 are regularization parameters balancing the contribution of each term.
Essentially, TGLRR obtains a more precise approximation to the rank function through the TNN than through the nuclear norm, and the underlying low-dimensional structures of the data can be captured with the aid of the graph-Laplacian regularization and the basis matrix X.

Optimization solution
To solve the optimization problem (14), an efficient iterative algorithm based on the LADMAP framework is designed. The algorithm (Algorithm 1) alternates between two iterative procedures until Eq. (14) converges to its minimum.
The first step is to determine the matrices A and B.
The second step is to solve the following convex optimization problem:

min_{Z,G,P} ||Z||_* − Tr(AZB^T) + ||G||_* + β Tr(ZLZ^T) + λ ||P||_1, s.t. X = XZ + GX + P. (15)

To separate the terms of the objective function (15), an auxiliary variable F is introduced, and Eq. (15) is rewritten as

min_{Z,G,F,P} ||Z||_* − Tr(AFB^T) + ||G||_* + β Tr(ZLZ^T) + λ ||P||_1, s.t. X = XZ + GX + P, Z = F. (16)

Equation (16) can be solved by the LADMAP method, which introduces two Lagrange multipliers Y_1 and Y_2. The augmented Lagrangian function is then defined as

L(Z, G, F, P, μ, Y_1, Y_2) = ||Z||_* − Tr(AFB^T) + ||G||_* + β Tr(ZLZ^T) + λ ||P||_1 + ⟨Y_1, X − XZ − GX − P⟩ + ⟨Y_2, Z − F⟩ + (μ/2)( ||X − XZ − GX − P||_F^2 + ||Z − F||_F^2 ), (17)

where μ is the penalty parameter and ||·||_F denotes the Frobenius norm of a matrix. Next, an alternating minimization strategy is adopted to compute Z, F, G and P: in each iterative step, one of Z, F, G and P is updated while the other three variables are held fixed.
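For the first step, the theorem of Hu et al. implies a simple closed form: Tr(AZB^T) is maximized by taking A and B as the transposed leading r left and right singular vectors of the current Z. A minimal NumPy sketch; `update_AB` is an illustrative name for this step.

```python
import numpy as np

def update_AB(Z, r):
    """Step 1 of each outer iteration: A and B are the transposed leading
    r left/right singular vectors of the current Z, so that Tr(A Z B^T)
    equals the sum of the r largest singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    A = U[:, :r].T                 # r x m
    B = Vt[:r, :]                  # r x n (rows are leading right singular vectors)
    return A, B

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 5))
A, B = update_AB(Z, r=2)
s = np.linalg.svd(Z, compute_uv=False)
print(np.isclose(np.trace(A @ Z @ B.T), s[:2].sum()))   # True
```

With A and B fixed in this way, ||Z||_* − Tr(AZB^T) in Eq. (15) equals the TNN of the current iterate, which is what makes the two-step scheme a valid surrogate for minimizing ||Z||_r.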
Updating Z

To obtain the solution of Z, the following minimization sub-problem w.r.t. Z needs to be solved.

Updating F
The sub-problem w.r.t. F is

min_F − Tr(AFB^T) + ⟨Y_2, Z − F⟩ + (μ/2) ||Z − F||_F^2, (21)

which is a smooth convex problem. Unlike the update rules for Z and G, we can differentiate Eq. (21), set the derivative to zero, and obtain the optimal solution

F* = Z + (A^T B + Y_2)/μ.

Updating P

After the variables and multipliers are updated, the penalty parameter is adjusted by μ_{k+1} = min(μ_max, ρ μ_k).
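The closed-form F update, and the shrinkage operator commonly used to solve L1 sub-problems such as the update of P, can be sketched as follows. These are illustrative helpers under the derivation above; the exact P update of Algorithm 1 is not reproduced here.

```python
import numpy as np

def update_F(Z, A, B, Y2, mu):
    """Closed-form F update: zeroing the derivative of
    -Tr(A F B^T) + <Y2, Z - F> + (mu/2) ||Z - F||_F^2 w.r.t. F
    gives F = Z + (A^T B + Y2) / mu."""
    return Z + (A.T @ B + Y2) / mu

def soft_threshold(M, tau):
    """Element-wise shrinkage S_tau(m) = sign(m) * max(|m| - tau, 0),
    the standard closed-form solution of an L1-regularized
    least-squares sub-problem such as the update of P."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

out = soft_threshold(np.array([-2.0, 0.3, 1.5]), 0.5)
print(out.tolist())   # [-1.5, 0.0, 1.0]
```

The shrinkage threshold for the P sub-problem would be λ/μ, so entries of the residual with magnitude below that level are set exactly to zero, which is what enforces the sparsity of the noise matrix.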

Time complexity
In this subsection, the time complexity of TGLRR is discussed. Clearly, the main running time of TGLRR is spent on computing the matrices Z, F, G and P. For the m × n input data matrix X, there are m genes and n samples. The time complexity of the SVD with respect to Z is O(r_Z n^2), where r_Z is the lowest rank of Z determined by Algorithm 1. Similarly, the time complexity of the SVD of G is O(r_G m^2). The optimal solution of F can be obtained in O(rmn). In the update of P, Y_1 also needs to be updated; the computational costs of P and Y_1 are O(mn^2 + m r_P n) and O(nm^2), respectively. Since m ≫ n in our datasets, the total time cost of Algorithm 1 is O(nm^2).