### Mathematical definition

First, we define some notation that will be used frequently in the following sections. (1) The input data matrix is denoted by **X** = (*x*_{1}, ..., *x*_{n}) ∈ *ℝ*^{m × n}, where *n* is the number of samples and *m* is the number of variables, i.e., genes in the gene expression data. (2) The new subspace of projected data points is denoted by **H** ∈ *ℝ*^{n × k} and the principal direction is denoted by **U** = (**u**_{1}, ..., **u**_{k}) ∈ *ℝ*^{m × k}. (3) The Frobenius norm is denoted by ‖**X**‖_{F}. (4) The L_{2,1}-norm is defined as \( {\left\Vert \mathbf{X}\right\Vert}_{2,1}={\sum}_{i=1}^{n}\sqrt{\sum_{j=1}^m{\mathbf{x}}_{ij}^2}={\sum}_{i=1}^n{\left\Vert {\mathbf{x}}^i\right\Vert}_2 \), where **x**^{i} denotes the *i*-th row. (5) The trace of a matrix **Z** is denoted by Tr(**Z**).

### The classical PCA and Graph-Laplacian PCA

In this subsection, we briefly review the classical PCA and gLPCA. PCA finds the new subspace of projected data points **H** and principal direction **U** by solving the following optimization problem [7]:

$$ \underset{\mathbf{U},\mathbf{H}}{\min }{\left\Vert \mathbf{X}-\mathbf{U}{\mathbf{H}}^T\right\Vert}_F^2\kern1em s.t.\kern0.3em {\mathbf{H}}^T\mathbf{H}=\mathbf{I}. $$

(1)
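For intuition, Eq. (1) can be solved in closed form: with **U** = **XH**, the optimal **H** consists of the eigenvectors of **X**^{T}**X** associated with the *k* largest eigenvalues. A minimal numerical sketch (variable names and sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 10, 3                      # genes, samples, components
X = rng.standard_normal((m, n))

# Optimal H: top-k eigenvectors of X^T X (eigh returns ascending eigenvalues)
evals, evecs = np.linalg.eigh(X.T @ X)
H = evecs[:, -k:]                        # n x k, orthonormal columns (H^T H = I)
U = X @ H                                # corresponding principal directions

# Residual equals ||X||_F^2 minus the sum of the k largest eigenvalues of X^T X
recon_err = np.linalg.norm(X - U @ H.T, "fro") ** 2
```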

In gene expression data, each column *x*_{i} is the linearized expression vector of a sample. The basic PCA model cannot recover the non-linear structure of the data. gLPCA incorporates geometric manifold information to capture this non-linear structure [7]. Regarding **H** as the embedding matrix, gLPCA is formulated as follows:

$$ \underset{\mathbf{U},\mathbf{H}}{\min }{\left\Vert \mathbf{X}-\mathbf{U}{\mathbf{H}}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{LH}\right)\kern1.2em s.t.\kern0.4em {\mathbf{H}}^T\mathbf{H}=\mathbf{I}, $$

(2)

where **L** = **D** − **W** is the graph Laplacian matrix, and **D** = *diag* (*d*_{1}, ..., *d*_{n}) is a diagonal matrix whose elements are the row (or column) sums of the symmetric nonnegative weight matrix **W**, i.e., *d*_{i} = ∑_{j}**W**_{ij}. The weight matrix **W** is defined as follows:

$$ {\mathbf{W}}_{ij}=\begin{cases}1 & \mathrm{if}\ {\mathbf{x}}_i\in {\mathbf{N}}_k\left({\mathbf{x}}_j\right)\ \mathrm{or}\ {\mathbf{x}}_j\in {\mathbf{N}}_k\left({\mathbf{x}}_i\right),\\ 0 & \mathrm{otherwise},\end{cases} $$

(3)

where **N**_{k}(**x**_{i}) denotes the *k* nearest neighbours of **x**_{i} [24]. The authors also presented a robust version of gLPCA. Since our paper focuses on the sparsity of the gLPCA method, we do not elaborate on the robust version further.
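The weight matrix of Eq. (3) and the Laplacian **L** = **D** − **W** can be built directly; a dense sketch (function name and brute-force distance computation are our own):

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetric 0/1 kNN weight matrix W as in Eq. (3), and Laplacian L = D - W.

    X is m x n with samples as columns; k is the neighbourhood size.
    """
    n = X.shape[1]
    # Pairwise squared Euclidean distances between sample columns
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip index 0 (the sample itself)
        W[i, nbrs] = 1
    W = np.maximum(W, W.T)                  # symmetrize: x_i in N_k(x_j) OR x_j in N_k(x_i)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    return W, L
```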

### The proposed method: PCA via joint graph Laplacian and sparse regularization (gLSPCA)

Recently, sparse representation has been widely applied in the field of bioinformatics. It decomposes a set of high-dimensional data into linear combinations of low-dimensional codes, and encourages as many combination coefficients as possible to be zero. Classical PCA suffers from the fact that the PCs are typically dense; imposing a sparsity constraint can make the PCs easier to interpret. We therefore introduce an L_{2,1}-norm constraint on the PCs **H** to improve the interpretability of the PCA-based method. Since the L_{2,1}-norm induces row-wise sparsity, the PCs become sparser and easier to explain [25], which improves the quality of the decomposition. The proposed method (gLSPCA) solves the following minimization problem:

$$ \underset{\mathbf{U},\mathbf{H}}{\min }{\left\Vert \mathbf{X}-\mathbf{U}{\mathbf{H}}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{LH}\right)+\gamma {\left\Vert \mathbf{H}\right\Vert}_{2,1}\kern1.1em s.t.\kern0.4em {\mathbf{H}}^T\mathbf{H}=\mathbf{I}, $$

(4)

where *α* and *γ* are scalar parameters to balance the weights of graph Laplacian and sparse constraint respectively.

### Optimization

It is hard to obtain a closed-form solution of Eq. (4), so we solve the problem by iterative optimization. First, the solution of **U** is obtained by setting the partial derivative with respect to **U** to zero. Substituting this solution back into the objective merges the two variables into the single variable **H**, whose solution is then obtained by eigen-decomposition. After the iterations converge, we obtain PCs that capture both the internal geometry and the sparsity ignored in previous studies. Following an optimization technique for the L_{2,1}-norm [25, 26], the original problem can be approximated by the following problem:

$$ \underset{\mathbf{U},\mathbf{H}}{\min }{\left\Vert \mathbf{X}-\mathbf{U}{\mathbf{H}}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{LH}\right)+\gamma \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{DH}\kern0.1em \right)\kern0.9000001em s.t.\kern0.4em {\mathbf{H}}^T\mathbf{H}=\mathbf{I}, $$

(5)

where **D** is a diagonal matrix with elements:

$$ {\mathbf{D}}_{ii}=\frac{1}{2{\left\Vert {\mathbf{h}}_i\right\Vert}_2}. $$

(6)
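In practice, the update of Eq. (6) needs a small safeguard when a row **h**_{i} shrinks toward zero; a common sketch (the function name and the eps threshold are our own choices):

```python
import numpy as np

def update_D(H, eps=1e-10):
    """Diagonal matrix of Eq. (6): D_ii = 1 / (2 ||h_i||_2), h_i the i-th row of H.

    eps guards against division by zero for (near-)zero rows.
    """
    row_norms = np.linalg.norm(H, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```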

Then, to obtain the solution of **U**, we fix **H**; the derivative of *ℒ*(**U**, **H**, **D**) with respect to **U** is

$$ \frac{\partial \mathcal{L}\left(\mathbf{U},\mathbf{H},\mathbf{D}\right)}{\partial \mathbf{U}}=-2\mathbf{XH}+2\mathbf{U}, $$

(7)

Setting this derivative to zero, we have

$$ \mathbf{U}=\mathbf{XH}. $$

(8)

Substituting the solutions of **U** into Eq. (5), we have

$$ {\displaystyle \begin{array}{l}\mathrm{Tr}\left[\left(\mathbf{X}-\mathbf{XH}{\mathbf{H}}^T\right){\left(\mathbf{X}-\mathbf{XH}{\mathbf{H}}^T\right)}^T\right]+\alpha \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{LH}\right)+\gamma \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{DH}\right)\\ {}=-\mathrm{Tr}\left({\mathbf{H}}^T{\mathbf{X}}^T\mathbf{X}\mathbf{H}\right)+{\left\Vert \mathbf{X}\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{LH}\right)+\gamma \mathrm{Tr}\left({\mathbf{H}}^T\mathbf{DH}\right)\\ {}=\mathrm{Tr}\left({\mathbf{H}}^T\left(-{\mathbf{X}}^T\mathbf{X}+\alpha \mathbf{L}+\gamma \mathbf{D}\right)\mathbf{H}\right)+{\left\Vert \mathbf{X}\right\Vert}_F^2.\end{array}} $$

(9)

Therefore, Eq. (5) is equivalent to the following problem:

$$ \ell \left(\mathbf{H}\right)=\underset{{\mathbf{H}}^T\mathbf{H}=\mathbf{I}}{\min}\mathrm{Tr}\left({\mathbf{H}}^T\mathbf{AH}\kern0.1em \right), $$

(10)

where **A** = − **X**^{T}**X** + *α***L** + *γ***D**. Thus, the optimal **H** is the eigenvectors corresponding to the first *k* smallest eigenvalues of the matrix **A**.
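This eigen-step can be sketched in a few lines; with *α* = *γ* = 0 it reduces to classical PCA (function and variable names are our own):

```python
import numpy as np

def solve_H(X, L, D, alpha, gamma, k):
    """Eq. (10): columns of H are the eigenvectors of A = -X^T X + alpha*L + gamma*D
    associated with the k smallest eigenvalues."""
    A = -X.T @ X + alpha * L + gamma * D
    evals, evecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    return evecs[:, :k]

# Toy demo: alpha = gamma = 0 recovers the classical PCA subspace
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
H = solve_H(X, np.zeros((5, 5)), np.zeros((5, 5)), 0.0, 0.0, 2)
```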

In the following, for convenience of parameter setting, we transform **A** into an equivalent form. We use *η*_{k} to denote the largest eigenvalue of the matrix **X**^{T}**X** − *γ***D**, and *η*_{s} to denote the largest eigenvalue of the Laplacian matrix **L**. We then set

$$ \alpha =\frac{\beta }{1-\beta}\frac{\eta_k}{\eta_s}, $$

(11)

so that the tuning of *α* becomes the tuning of *β*. Thus, problem (10) can be rewritten as follows:

$$ \underset{\mathbf{H}}{\min}\mathrm{Tr}\ {\mathbf{H}}^T\left[\left(1-\beta \right)\left(\mathbf{I}-\frac{{\mathbf{X}}^T\mathbf{X}-\gamma \mathbf{D}}{\eta_k}\right)+\beta \frac{\mathbf{L}}{\eta_s}\right]\mathbf{H}\kern1.9em s.t.\kern0.6em {\mathbf{H}}^T\mathbf{H}=\mathbf{I}. $$

(12)

In this way, the solution of **H** can be obtained by computing the eigenvectors corresponding to the first *k* smallest eigenvalues of the matrix **A**_{1}:

$$ {\mathbf{A}}_1=\left(1-\beta \right)\left(\mathbf{I}-\frac{{\mathbf{X}}^T\mathbf{X}-\gamma \mathbf{D}}{\eta_k}\right)+\beta \frac{\mathbf{L}}{\eta_s}. $$

(13)
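Constructing **A**_{1} only requires the two largest-eigenvalue computations for the scaling constants; a sketch under our own naming:

```python
import numpy as np

def build_A1(X, L, D, beta, gamma):
    """A1 = (1 - beta) * (I - (X^T X - gamma*D) / eta_k) + beta * L / eta_s,
    with eta_k, eta_s the largest eigenvalues of X^T X - gamma*D and L."""
    n = X.shape[1]
    M = X.T @ X - gamma * D
    eta_k = np.linalg.eigvalsh(M).max()
    eta_s = np.linalg.eigvalsh(L).max()
    return (1 - beta) * (np.eye(n) - M / eta_k) + beta * L / eta_s
```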

The range of *β* is 0 ≤ *β* ≤ 1. In particular, when *β* = 0 and *γ* = 0, gLSPCA degenerates to classical PCA; when *β* = 1 and *γ* = 0, it reduces to Laplacian Embedding (LE). We summarize the proposed gLSPCA approach in Algorithm 1.

**Algorithm 1** gLSPCA

**Input**: Data matrix **X** = (*x*_{1}, ..., *x*_{n}) ∈ *ℝ*^{m × n}, parameters *γ* and *β*.

**Output**: Matrices **U** and **H**.

**1**: Initialize **D** = **I**_{n × n};

**2**: Construct the weight matrix **W**, the degree matrix, and the graph Laplacian **L**;

**3**: **repeat**

**4**: Compute **H** as the eigenvectors corresponding to the first *k* smallest eigenvalues of the matrix **A**_{1};

**5**: Compute the optimal **U** according to Eq. (8);

**6**: Update the diagonal matrix **D** according to Eq. (6);

**7**: **until** convergence
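Putting the pieces together, one possible reading of Algorithm 1 in code. This is a sketch under our own naming; the weight matrix **W** is taken as given and held fixed across iterations, and the eps safeguard in the **D** update is our own addition:

```python
import numpy as np

def glspca(X, W, beta=0.5, gamma=0.1, k=2, n_iter=30, eps=1e-10):
    """Sketch of the gLSPCA iteration: X is m x n (samples as columns),
    W is an n x n symmetric nonnegative weight matrix.
    Returns principal directions U (m x k) and embedding H (n x k)."""
    n = X.shape[1]
    L = np.diag(W.sum(axis=1)) - W             # graph Laplacian
    eta_s = np.linalg.eigvalsh(L).max()        # largest eigenvalue of L
    D = np.eye(n)                              # step 1: initialize D
    for _ in range(n_iter):
        M = X.T @ X - gamma * D
        eta_k = np.linalg.eigvalsh(M).max()
        A1 = (1 - beta) * (np.eye(n) - M / eta_k) + beta * L / eta_s
        evecs = np.linalg.eigh(A1)[1]
        H = evecs[:, :k]                       # k smallest eigenvalues of A1
        U = X @ H                              # Eq. (8)
        row_norms = np.maximum(np.linalg.norm(H, axis=1), eps)
        D = np.diag(1.0 / (2.0 * row_norms))   # Eq. (6)
    return U, H
```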

### Convergence analysis

We now show that the objective value does not increase in each iteration of the proposed gLSPCA algorithm. First, a simple lemma is provided [32].

### Lemma 1.

For any non-zero vectors **a,b** ∈ *ℝ*^{m}:

$$ {\left\Vert \mathbf{a}\right\Vert}_2-\frac{{\left\Vert \mathbf{a}\right\Vert}_2^2}{2{\left\Vert \mathbf{b}\right\Vert}_2}\le {\left\Vert \mathbf{b}\right\Vert}_2-\frac{{\left\Vert \mathbf{b}\right\Vert}_2^2}{2{\left\Vert \mathbf{b}\right\Vert}_2}. $$

(14)
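The lemma follows from (‖**a**‖_{2} − ‖**b**‖_{2})² ≥ 0 and is easy to verify numerically (a check of our own construction):

```python
import numpy as np

# Lemma 1: ||a|| - ||a||^2 / (2||b||) <= ||b|| - ||b||^2 / (2||b||) = ||b|| / 2
# for any non-zero vectors a, b. Verified on random draws.
rng = np.random.default_rng(1)
for _ in range(1000):
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    assert na - na ** 2 / (2 * nb) <= nb / 2 + 1e-12
```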

The convergence analysis of gLSPCA is summarized as Theorem 1.

### Theorem 1.

The optimization procedure of the proposed gLSPCA algorithm will monotonically decrease the objective function in each iteration.

### Proof.

Following the gLSPCA algorithm, when we fix **D**^{t} in the *t*-th iteration and optimize **U**^{t + 1} and **H**^{t + 1}, we have:

$$ {\displaystyle \begin{array}{l}{\left\Vert \mathbf{X}-{\mathbf{U}}^{t+1}{\left({\mathbf{H}}^{t+1}\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^{t+1}\right)}^T\mathbf{L}{\mathbf{H}}^{t+1}\right)+\gamma \mathrm{Tr}\left({\left({\mathbf{H}}^{t+1}\right)}^T{\mathbf{D}}^t{\mathbf{H}}^{t+1}\right)\\ {}\le {\left\Vert \mathbf{X}-{\mathbf{U}}^t{\left({\mathbf{H}}^t\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^t\right)}^T\mathbf{L}{\mathbf{H}}^t\right)+\gamma \mathrm{Tr}\left({\left({\mathbf{H}}^t\right)}^T{\mathbf{D}}^t{\mathbf{H}}^t\right).\end{array}} $$

(15)

Since \( {\left\Vert \mathbf{H}\right\Vert}_{2,1}={\sum}_{i=1}^n{\left\Vert {\mathbf{h}}_i\right\Vert}_2 \), this inequality indicates

$$ {\displaystyle \begin{array}{l}{\left\Vert \mathbf{X}-{\mathbf{U}}^{t+1}{\left({\mathbf{H}}^{t+1}\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^{t+1}\right)}^T\mathbf{L}{\mathbf{H}}^{t+1}\right)+\gamma \sum \limits_{i=1}^n\left(\frac{{\left\Vert {\mathbf{h}}_i^{t+1}\right\Vert}_2^2}{2{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2}-{\left\Vert {\mathbf{h}}_i^{t+1}\right\Vert}_2\right)\\ {}\le {\left\Vert \mathbf{X}-{\mathbf{U}}^t{\left({\mathbf{H}}^t\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^t\right)}^T\mathbf{L}{\mathbf{H}}^t\right)+\gamma \sum \limits_{i=1}^n\left(\frac{{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2^2}{2{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2}-{\left\Vert {\mathbf{h}}_i^{t+1}\right\Vert}_2\right).\end{array}} $$

(16)

According to Lemma 1, we know that

$$ \frac{{\left\Vert {\mathbf{h}}_i^{t+1}\right\Vert}_2^2}{2{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2}-{\left\Vert {\mathbf{h}}_i^{t+1}\right\Vert}_2\ge \frac{{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2^2}{2{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2}-{\left\Vert {\mathbf{h}}_i^t\right\Vert}_2. $$

(17)

Thus, we have the following result.

$$ {\displaystyle \begin{array}{l}{\left\Vert \mathbf{X}-{\mathbf{U}}^{t+1}{\left({\mathbf{H}}^{t+1}\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^{t+1}\right)}^T\mathbf{L}{\mathbf{H}}^{t+1}\right)+\gamma {\left\Vert {\mathbf{H}}^{t+1}\right\Vert}_{2,1}\\ {}\le {\left\Vert \mathbf{X}-{\mathbf{U}}^t{\left({\mathbf{H}}^t\right)}^T\right\Vert}_F^2+\alpha \mathrm{Tr}\left({\left({\mathbf{H}}^t\right)}^T\mathbf{L}{\mathbf{H}}^t\right)+\gamma {\left\Vert {\mathbf{H}}^t\right\Vert}_{2,1}.\end{array}} $$

(18)

This inequality shows that the objective function of Eq. (4) monotonically decreases in each iteration, which completes the proof of Theorem 1.
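Theorem 1 can also be confirmed empirically by tracking the objective of Eq. (4) across iterations; a self-contained sketch (the toy ring graph, parameters, and seed are our own choices):

```python
import numpy as np

# Track the objective of Eq. (4) over the alternating updates and check
# that it never increases (up to numerical tolerance).
rng = np.random.default_rng(2)
m, n, k = 12, 8, 2
X = rng.standard_normal((m, n))
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = 1                  # toy ring-graph weight matrix
W = np.maximum(W, W.T)
L = np.diag(W.sum(axis=1)) - W
alpha, gamma = 0.5, 0.1

D = np.eye(n)
objs = []
for _ in range(20):
    A = -X.T @ X + alpha * L + gamma * D   # Eq. (10), built with the current D
    H = np.linalg.eigh(A)[1][:, :k]        # k smallest eigenvalues of A
    U = X @ H                              # Eq. (8)
    D = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(H, axis=1), 1e-10)))
    objs.append(np.linalg.norm(X - U @ H.T, "fro") ** 2
                + alpha * np.trace(H.T @ L @ H)
                + gamma * np.linalg.norm(H, axis=1).sum())   # ||H||_{2,1}
```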