CeModule: an integrative framework for discovering regulatory patterns from genomic data in cancer

Background Non-coding RNAs (ncRNAs) are emerging as key regulators and play critical roles in a wide range of tumorigenesis. Recent studies have suggested that long non-coding RNAs (lncRNAs) could interact with microRNAs (miRNAs) and indirectly regulate miRNA targets through competing interactions. Therefore, uncovering the competing endogenous RNA (ceRNA) regulatory mechanism of lncRNAs, miRNAs and mRNAs in post-transcriptional level will aid in deciphering the underlying pathogenesis of human polygenic diseases and may unveil new diagnostic and therapeutic opportunities. However, the functional roles of vast majority of cancer specific ncRNAs and their combinational regulation patterns are still insufficiently understood. Results Here we develop an integrative framework called CeModule to discover lncRNA, miRNA and mRNA-associated regulatory modules. We fully utilize the matched expression profiles of lncRNAs, miRNAs and mRNAs and establish a model based on joint orthogonality non-negative matrix factorization for identifying modules. Meanwhile, we impose the experimentally verified miRNA-lncRNA interactions, the validated miRNA-mRNA interactions and the weighted gene-gene network into this framework to improve the module accuracy through the network-based penalties. The sparse regularizations are also used to help this model obtain modular sparse solutions. Finally, an iterative multiplicative updating algorithm is adopted to solve the optimization problem. Conclusions We applied CeModule to two cancer datasets including ovarian cancer (OV) and uterine corpus endometrial carcinoma (UCEC) obtained from TCGA. The modular analysis indicated that the identified modules involving lncRNAs, miRNAs and mRNAs are significantly associated and functionally enriched in cancer-related biological processes and pathways, which may provide new insights into the complex regulatory mechanism of human diseases at the system level. Electronic supplementary material The online version of this article (10.1186/s12859-019-2654-3) contains supplementary material, which is available to authorized users.

For the Standard NMF problem, although the objective function of NMF is convex in W only or H only, it is not convex in both variables together. Obviously, the above objective function is not convex on W, H1, H2 and H3, and it is unreasonable to find its global minimum. In the following, as the same way for the original NMF, we adopt an iterative update algorithm to find local minimum of the problem by updating W and Hi iteratively. Below we detail the multiplicative updating algorithm for CeModule to identify the local minimum of the objective function F: Based on the simple knowledge of linear algebra, the objective function F can be rewritten as follows: The partial derivatives of the above function with respect to W and Hi are: Using the Karush-Kuhn-Tucker (KKT) conditions φlkwlk=0, ψjkhjk (1) =0, ωpkhpk (2) =0, and θqkhpk (3) =0, we obtain the following equations for wlk, hjk (1) , hpk (2) , and hpk (3) :

Q.Xiao et al.
We have the following theorem to guarantee the convergence of the above updating rules to a local optimum. Theorem 1 The objective function F of the CeModule problem is non-increasing under the above updating rules in Eq. (7). The objective function is finite and invariant under these updates if and only if W , H1, H2, and H3 are at a stationary point.
The principle of convergence proof of NMF can be easily expanded to prove this theorem. A difference to NMF is that the objective function F here can be unbounded below. Only the objective is finite, the CeModule can get a stable local solution. Here, we show that F is non-increasing under the updating rules. In particular, here we prove that the F is nonincreasing under the updating rule for W, and same feature under the updating rules for H1, H2, and H3 can be similarly proved. We will adopt the same strategy used in (Lee and Seung, 2001) that introduced an auxiliary function in the Expectation-Maximization algorithm. The following is the definition of the auxiliary function.
Definition G(w,w') is an auxiliary function of F(w) if the following conditions are satisfied: The above auxiliary function is very important because of the following lemma.

Lemma 1: If G is an auxiliary function for F, then F is non-increasing
under the update ). Now, we show that the updating rule for W is exactly the update in this Lemma with a proper auxiliary function.
Considering any element (W)ab in W, we use the Fab to denote the part of F, which is only relevant to (W)ab. It is easy to check that is an auxiliary function for Fab. Proof: Since G(w, w) = Fab(w) is obvious, we only need show that G(w, wab (t) ) ≥ Fab(w). To do this, we first consider the Taylor series expansion of Fab(w)  G(w, wab (t) ) ≥ Fab(w). We can now demonstrate the convergence of Theorem 1.
Proof of Theorem 1: Replacing G(w, wab (t)) ) in Eq. (9) by Eq. (12) results in the update rules: Since Eq. (12) is an auxiliary function, Fab is non-increasing under this update rule.

S3. Parameter selection
In this section, we discuss how to solve the optimization problem of CeModule to determine the parameter values of our model. As aforementioned, the objective function is not convex on W, H1, H2 and H3, and we cannot analytically compute general solutions. Therefore, an iterative update algorithm is employed to find local minimum of the problem by updating W and Hi iteratively. This iterative procedure converges to the local minima because of the fact that the objective function is bounded below, and the sequence of function values is monotonically decreasing, and the gradients at the convergence are zeros. As the problem is non-convex, we perform the learning process several times with different parameter combinations. The optimum combination is obtained from the following values: α, λ1, λ2, and λ3 are chose from {0, 10 −4 , 10 −3 , 10 −2 , 10 −1 }, γ1 and γ2 are selected from {10 −2 , 10 −1 , 0, 5, 10, 20}. After repeating this process several times (500 iterations in our experiments), we select the parameter combination from the result that leads to the lowest value for our objective function. Finally, we set α, λ1, λ2, λ3 γ1 and γ2 to 0.0001, 0.01, 0.01, 0.01, 10, and 10, respectively. For the sub-space dimension K, we set K to 70 on the basis of a miRNA cluster analysis. Parameter T is a given threshold that is employed to determine the module members (lncRNAs/miRNAs/ mRNAs). In this way, the number of lncRNAs/miRNAs/mRNAs in each identified module strongly relies on the value of T. Too large or too small value of the threshold T may generate inappropriate modules. We expect that the pair-wise lncRNAs (miRNAs or mRNAs) within each module identified by CeModule are highly correlated, and then adopt the average absolute Pearson correlation coefficient (PCC) of all modules, , to determine the value of T for lncRNAs/miRNAs/mRNAs based on the expression profiles, where corr is a function that calculates PCC for the pair (x, y), and x and y represent a pair of lncRNAs (miRNAs or mRNAs) in module i, K and M i denote the number of modules and the number of lncRNAs (miRNAs or mRNAs) in module i, respectively. After conducting a series of tests based on our experiment scenario, we finally set T=5/3/4 (5/3/3) for selecting lncRNAs/miRNAs/mRNAs, and obtained 70 modules with an average of 68.2 (46.1) lncRNAs, 6.3 (5.5) miRNAs, and 55.5 (48.1) mRNAs per module for OV (UCEC) dataset

S4. Kaplan-Meier survival analysis to identify clinically related modules
Here, for instance, we also investigated whether the modules identified by CeModule based on OV dataset were associated with the survival of ovarian cancer patients, and a Kaplan-Meier (KM) survival analysis was performed using the R package survival. The clinical data are downloaded from TCGA, and 383 samples are retained after removing those not included in the expression data or those with unavailable survival time. For each module, we classified the patients into two groups ("low group" vs. "high group") according to the lower or higher module-averaged lncRNA expression levels than the sample means. The log-rank test was employed to evaluate the statistical significance for each module, and then the p-values are further adjusted by multiple test correction using Benjamini-Hochberg method (FDR<0.05).
The prediction of survival power of a module is better than that of an individual gene. As shown in Figure S5, we found that a representative prognostic modules (module 43) in ovarian cancer could significantly separate patients into two groups with different clinical outcomes. The module 43 was significantly associated with the survival of OV patients (FDR=2.35e-02), and the median survival times of the "high group" and "low group" are 1583 and 1264 days, respectively. Notably, the patients of the "low group" faced higher risks as shown in Figure S5. The observations imply that the proposed method has a potential ability to discover modules that could provide useful information for the prediction of cancer prognosis. FDR=2.35e-02 module_43 Figure S5. Kaplan-Meier survival curves for ovarian cancer patients classified into two groups using the module-averaged lncRNA expression levels.