Optimization and expansion of non-negative matrix factorization

Background: Non-negative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, existing algorithms and R packages cannot be applied to large matrices, due to slow convergence, or to matrices with missing entries. Moreover, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. Finally, the lack of well-validated methodology for choosing the rank hyperparameter also raises concerns about derived results.

Results: We adapt the idea of sequential coordinate-wise descent to NMF to increase the convergence rate. We demonstrate that NMF can handle missing values naturally and that this property leads to a novel method for determining the rank hyperparameter. Further, we demonstrate some novel applications of NMF and show how to use masking to inject prior knowledge and desirable properties to achieve a more meaningful decomposition.

Conclusions: We show through complexity analysis and experiments that our implementation converges faster than well-known methods. We also show that using NMF for tumour content deconvolution can achieve results similar to existing methods like ISOpure. Our proposed missing value imputation is more accurate than conventional methods like multiple imputation, and comparable to missForest while achieving significantly better computational efficiency. Finally, we argue that the suggested rank tuning method based on missing value imputation is theoretically superior to existing methods. All algorithms are implemented in the R package NNLM, which is freely available on CRAN and Github.


Background
Non-negative matrix factorization (NMF or NNMF) [1] has been widely used as a general method for dimension reduction and feature extraction on non-negative data. The main difference between NMF and other factorization methods, such as SVD, is the non-negativity constraint, which allows only additive combinations of intrinsic 'parts', i.e., the hidden features. This is demonstrated in [1], where NMF learns parts of faces, and a face is naturally represented as an additive linear combination of the different parts. Indeed, negative combinations are not as intuitive or natural as positive combinations.
In bioinformatics, NMF is sometimes used to find 'metagenes' in expression profiles, which may be related to biological pathways [2,3]. NMF has also been used to extract trinucleotide mutational signatures from mutations found in cancer genomic sequences, with the suggestion that the trinucleotide profile of each cancer type is a positive linear combination of these signatures [4].
Several different algorithms are available for NMF decomposition, including the multiplicative algorithms proposed in [1], gradient descent, and alternating non-negative least squares (ANLS). ANLS has been gaining attention because it is guaranteed to converge to a stationary point and because fast solvers exist for the non-negative least squares (NNLS) subproblem.
In this paper, we first unify various regularization forms on the resulting components, which encourage desired properties such as orthogonality and sparsity, and show how the conventional multiplicative algorithms [1] can be modified to accommodate these regularizations, inspired by [3]. We then adopt the ANLS approach, but solve the NNLS subproblem with the coordinate-wise algorithm proposed by [5], in which each unknown variable is solved sequentially and explicitly as a simple quadratic optimization problem. We demonstrate that this algorithm converges much faster than the traditional multiplicative algorithms. For NMF with Kullback-Leibler divergence loss, we extend this methodology by approximating the loss with a quadratic function.
NMF is a dimension reduction method, as the two factor matrices together have fewer entries than the original matrix. This means that one does not need all the entries of the original matrix to perform a decomposition; NMF should therefore be able to handle missing entries in the target matrix. Indeed, factorization can be carried out by dropping the loss terms related to the missing entries whenever the target loss function is a sum of per-entry losses, e.g., mean square error (MSE) or Kullback-Leibler (KL) divergence. Furthermore, the reconstructed matrix has values for entries that are missing in the original matrix, which reveals the capability of NMF for missing value imputation. Inspired by this observation and by the popular training-validation tuning strategy of supervised models, we introduce a novel method to optimize the only hyper-parameter, the rank k of the NMF decomposition.
NMF is essentially unsupervised: it performs a blind decomposition, which puts the meaning of the result in question. This can limit its use in areas where strong interpretability is critical, including most biomedical research. On the other hand, decomposition without utilizing known findings (prior knowledge) may not be effective, especially with a small sample size. To overcome these challenges, we apply a masking technique to the NMF decomposition during the iterative algorithms, to retain certain structures or patterns in one or both resulting matrices, which can be designed according to prior knowledge or research interest. This technique can be used to perform pathway- or sub-network-guided decompositions, or to separate different cell types in mixed tissue samples.
All of these algorithmic innovations are implemented in the popular R programming language, as an alternative to the widely-used NMF package [6], which was first translated from a MATLAB package and later optimized via C++ for some algorithms. The sparse alternating NNLS of [3] is anticipated to be fast in theory, but its implementation in R is slow in practice. Our NNLM package combines the efficient NNLS algorithm with Rcpp, which seamlessly integrates R and C++ [7], and is freely available and open-source.
In summary, the main contributions of this work include:
• unifying various types of regularization and deriving the corresponding multiplicative algorithms;
• developing a faster algorithm for NMF using sequential coordinate-wise descent;
• introducing a method to handle missing entries in the target matrix, which yields a novel method to determine the rank k and a new application of NMF to missing value imputation;
• introducing a masking technique to integrate prior knowledge and desirable properties, and demonstrating how it can be used for tumour content deconvolution.

Algorithm comparison
We carry out an experiment for illustration purposes, using a subset of microarray data from a group of Non-Small Cell Lung Cancer (NSCLC) samples ([8], available in the NNLM package) to compare SCD with Lee's multiplicative algorithms. Results are shown in Fig. 1.

Missing value imputation
A comparison of different imputation methods is shown in Fig. 2 and Table 2. A subset of the NSCLC dataset [8] is used, with 30% of entries randomly selected to be missing. One can see that NMF is almost as accurate as missForest [9] but much faster, and clearly better than MICE [10] and simple median imputation in this example.

Choice of k
We performed a simulation study to illustrate our proposed method for selecting the rank hyperparameter k. Entries of $W \in \mathbb{R}^{400\times 3}$ and $H \in \mathbb{R}^{3\times 50}$ are sampled independently and uniformly from the intervals (0, 1) and (0, 10) respectively. A was constructed as WH plus noise sampled independently for each entry from the standard normal distribution. All negative values in A are then set to 0 (such values are rare given the magnitudes involved). We choose MSE as the loss and run the proposed algorithm 5 times, each with a random 30% of entries deleted. The result is shown in Fig. 3: different runs (indicated by different colours) give consistent results, and the mean square error (MSE) of the reconstruction of the missing entries is minimized at k = 3 for all runs.
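The simulation above can be sketched in a few lines of numpy. This is a minimal illustration and not the NNLM implementation: masked multiplicative updates (the helper `nmf_mse_masked` is invented here) stand in for the package's SCD solver, and iteration counts are kept small.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf_mse_masked(A, mask, k, n_iter=200, eps=1e-9):
    """NMF with square loss that ignores entries where mask is False.
    Masked multiplicative updates are a simple stand-in for the SCD
    solver in NNLM (illustration only)."""
    n, m = A.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    M = mask.astype(float)
    Am = np.where(mask, A, 0.0)  # missing entries contribute nothing
    for _ in range(n_iter):
        H *= (W.T @ Am) / (W.T @ (M * (W @ H)) + eps)
        W *= (Am @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H

# Simulate as in the text: W ~ U(0,1) (400x3), H ~ U(0,10) (3x50),
# A = WH + standard normal noise, negative values clipped to 0.
W0 = rng.uniform(0, 1, (400, 3))
H0 = rng.uniform(0, 10, (3, 50))
A = np.clip(W0 @ H0 + rng.standard_normal((400, 50)), 0, None)

# Delete 30% of entries, impute with several ranks, and score the
# reconstruction on the held-out entries.
holdout = rng.random(A.shape) < 0.3
errors = {}
for k in range(1, 7):
    W, H = nmf_mse_masked(A, ~holdout, k)
    errors[k] = np.mean(((W @ H)[holdout] - A[holdout]) ** 2)
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

Under this setup the held-out reconstruction error drops sharply up to the true rank 3 and then flattens or rises, mirroring Fig. 3.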

Tumour content deconvolution
Expression deconvolution is of long-standing interest in bioinformatics and clinical research [11,12], and several NMF-related methods have been proposed [13]. Our approach of using mask matrices is more flexible and powerful, as one can guide the decomposition towards almost any biological process of interest by integrating prior knowledge into the initial and mask matrices. Compared to Bayesian methods like ISOpure [14], NMF-based methods are much faster.
We use part of Beer's lung adenocarcinoma data [15], which contains 30 tumours and 10 normal samples with 250 transcripts, available in the ISOpureR package [16]. A comparison to the result from ISOpure using the full dataset (80 tumours and 5151 transcripts) is shown in Fig. 4: our analysis based on a small part of the dataset produces a comparable result.

Discussion
We combine common regularizations of NMF into a general form to explore the power of mixing different types of regularization. The choice of regularization weights should depend on the problem. For example, independence or orthogonality (J2) may be favoured for content deconvolution, while sparsity (J3) might be more important for metagene or sub-network discovery. J1 can be used to reduce variance in the outcome matrices used for downstream analyses such as prognostic biomarker discovery. Another way to choose these hyperparameters is the approach introduced in the "Choice of k" section for tuning k, i.e., choosing the values that minimize reconstruction error or variation; this can be done together with the choice of k.

The choice of MSE or KL as the loss function depends on the nature and distribution of the entries. A general principle is to use MSE when the entries are centred around a certain region, i.e., the magnitudes are roughly the same (e.g., the simulation study in the "Choice of k" section). For very skewed distributions (e.g., count data) or data with outliers, the KL loss may fit better: with MSE the large entries would dominate the loss while small entries have little impact, resulting in a factorization with large variance. In the latter case, one can also perform the decomposition with MSE in log space if all entries are greater than 1, or in log(1 + A) space; however, the interpretation of the results changes accordingly.
Although NMF can be performed with missing entries, when the missingness process is correlated with the values themselves, i.e., the data are not missing completely at random (MCAR), the resulting reconstruction may be biased. Besides, when there are many missing entries, especially when a certain row or column is largely missing, the decomposition and the reconstruction can have large variance and thus be unreliable. The same caveats apply to the proposed method for choosing k.
The masking technique is simple yet useful for many applications. Here we demonstrate only its application to tumour content deconvolution, with an experiment on a small dataset to showcase its capability. The result, comparable to that of an established method like ISOpure, encourages future work in this direction, such as metagene and sub-network related analyses.
All methodologies described in this paper are implemented in the R package NNLM, available on CRAN and Github. All the code for the experiments in this paper can be found on the vignette of the package.

Conclusion
In this work, we generalize the regularization terms in NMF and extend the multiplicative algorithms to the general case. We develop a new solver based on sequential coordinate-wise descent for both KL and MSE losses and demonstrate its efficiency through complexity analysis and numerical experiments. Our method and implementation can also naturally handle missing entries, and can be used to impute missing values through reconstruction. We show that the NMF imputation method is more efficient and accurate than popular methods. Motivated by missing value imputation, we introduce a simple and intuitive method to determine the rank of NMF. Finally, by introducing the masking technique, we show that NMF can be applied to tumour content deconvolution and can achieve results similar to existing methods like ISOpure, with better computational efficiency.

Methods
In this section, we generalize the multiplicative algorithms [1] to incorporate the regularizations in (2) and briefly argue how they can be derived. We then introduce a new and faster algorithm for NMF with mean square loss ("Alternating non-negative least square (ANLS)" section) and KL-divergence loss ("Sequential quadratic approximation for Kullback-Leibler divergence loss" section) with regularizations, based on the alternating scheme. In the "Missing entries" section, we address, in all algorithms, the common problem that some entries of the target matrix may be unreliable or unobserved. The method introduced there naturally leads to an intuitive and logically robust way to determine the unknown rank parameter k ("Choice of k" section) and a novel approach to missing value imputation for array data ("Missing value imputation" section). We then re-examine NMF in the "Masking, content deconvolution and designable factorization" section and develop a method to integrate prior knowledge into NMF and guide the decomposition in a more biologically meaningful way, which can be powerful in applications.

Overview
NMF decomposes a matrix $A_{n\times m}$ into two matrices with non-negative entries and smaller rank, $A \approx W_{n\times k}H_{k\times m}$, $W \ge 0$, $H \ge 0$, by solving

$$\min_{W\ge 0,\, H\ge 0}\ \sum_{i,j} L\big(a_{ij}, (WH)_{ij}\big) + J_W(W) + J_H(H). \qquad (1)$$

$L(x, y)$ is a loss function, which is mostly chosen to be the square error $\frac{1}{2}(x-y)^2$ or the KL divergence $x\log(x/y) - x + y$. The latter can be interpreted as the deviance from a Poisson model. $J_W(W)$ and $J_H(H)$ are regularizations on $W$ and $H$ respectively, to encourage desired properties such as high sparsity, smaller magnitude or better orthogonality. Various regularization forms have been introduced [3,11], but most can be unified as

$$J(X) = \alpha_1 J_1(X) + \alpha_2 J_2(X) + \alpha_3 J_3(X), \qquad (2)$$

$$J_1(X) = \tfrac{1}{2}\operatorname{tr}(XX^T), \quad J_2(X) = \sum_{i<j} X_{\cdot i}^T X_{\cdot j} = \tfrac{1}{2}\operatorname{tr}\big(X(E-I)X^T\big), \quad J_3(X) = \sum_{i,j} |x_{ij}|,$$

where $I$ is an identity matrix, $E$ is a matrix of proper dimension with all entries equal to 1, and $X_{\cdot i}$ and $X_{i\cdot}$ are the $i$-th column and row respectively. $J_1$ is a ridge penalty to control the magnitudes and smoothness; it also helps stabilize the numerical algorithms. $J_2(X)$ is used to minimize correlations among columns, i.e., to maximize independence or the angle between $X_{\cdot i}$ and $X_{\cdot j}$ [11]. $J_3$ is a LASSO-like penalty that controls matrix-wise sparsity. [3] introduced a different type of regularization to favour sparsity,

$$\tilde J(X) = \tfrac{1}{2}\sum_i \|X_{i\cdot}\|_1^2 = \tfrac{1}{2}\operatorname{tr}(XEX^T). \qquad (3)$$

Obviously, $\tilde J = J_1 + J_2$, a special case of (2).
Conventionally, (1) is solved by an alternating algorithm, which solves W and H alternately and iteratively. Thanks to the non-negative constraint, the penalties do not bring additional complexity.

Adding regularization to Lee's multiplicative algorithm
Two multiplicative updating algorithms were proposed in [1], for square loss and KL divergence loss, and adapted by [11] to cases with sparsity regularization. Here we modify these algorithms to integrate all of the above regularizations, as follows.
With square loss,

$$h_{\bar kj} \leftarrow h_{\bar kj}\,\frac{(W^TA)_{\bar kj}}{\big(W^TWH + \beta_1 H + \beta_2(E-I)H + \beta_3 E\big)_{\bar kj}}. \qquad (5)$$

With Kullback-Leibler divergence,

$$h_{\bar kj} \leftarrow h_{\bar kj}\,\frac{\sum_l w_{l\bar k}\, a_{lj}\big/\sum_q w_{lq}h_{qj}}{\sum_l w_{l\bar k} + \beta_1 h_{\bar kj} + \beta_2\sum_{q\neq\bar k} h_{qj} + \beta_3}. \qquad (6)$$

When $\beta_i = 0$, $i = 1, 2, 3$, these are the original multiplicative algorithms in [1]. If $\beta_1 = 0$, these updates reduce to equations (10) and (23) in [11]. The proof for the general case can be done similarly as in [11]. The updating rules for $W$ are similar to (5) and (6).
These multiplicative algorithms are straightforward to implement, but they have the drawback that once an entry of W or H is zero it remains zero throughout the iterations, while a positive entry can never reach exactly zero. Therefore, all entries should be initialized to positive values, and as a consequence true sparsity cannot be achieved in general unless hard-thresholding is imposed; many entries do become small enough to be thresholded to zero.
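The regularized multiplicative update for H with square loss can be written down directly in numpy. The sketch below is our illustration of the update form in (5), not the NNLM code; the helper names (`update_H_mse`, `penalized_loss`) are invented here, and the penalized objective is spelled out so the non-increasing property of the update can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_H_mse(A, W, H, b1, b2, b3, eps=1e-12):
    """One regularized multiplicative update of H for square loss, in the
    form of Eq. (5): numerator W'A, denominator W'WH + b1*H + b2*(E-I)H + b3*E."""
    k = H.shape[0]
    EmI = np.ones((k, k)) - np.eye(k)  # E - I (k x k)
    return H * (W.T @ A) / (W.T @ W @ H + b1 * H + b2 * (EmI @ H) + b3 + eps)

def penalized_loss(A, W, H, b1, b2, b3):
    """0.5*||A-WH||_F^2 plus ridge, cross-correlation and L1 penalties on H."""
    k = H.shape[0]
    EmI = np.ones((k, k)) - np.eye(k)
    return (0.5 * np.sum((A - W @ H) ** 2)
            + 0.5 * b1 * np.sum(H * H)
            + 0.5 * b2 * np.sum(H * (EmI @ H))
            + b3 * np.sum(H))

A = rng.random((30, 20))
W = rng.random((30, 4))
H = rng.random((4, 20))        # positive initialization, as required
losses = [penalized_loss(A, W, H, 0.1, 0.05, 0.01)]
for _ in range(50):
    H = update_H_mse(A, W, H, 0.1, 0.05, 0.01)
    losses.append(penalized_loss(A, W, H, 0.1, 0.05, 0.01))
print(losses[0], losses[-1])   # penalized loss decreases monotonically
```

Note that H stays strictly positive throughout, illustrating why hard-thresholding is needed for true sparsity.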

Alternating non-negative least square (ANLS)
When $L$ is a square loss, the following sequential coordinate-wise descent (SCD) algorithm proposed by [5] is used to solve the penalized NNLS for $H$ while $W$ is fixed. Let

$$V = W^TW + \beta_1 I + \beta_2(E - I), \qquad U = W^TA - \beta_3 E. \qquad (8)$$

Up to a constant, the penalized square loss for the $j$-th column of $H$ is then

$$f(H_{\cdot j}) = \tfrac{1}{2} H_{\cdot j}^T V H_{\cdot j} - U_{\cdot j}^T H_{\cdot j}.$$

Since $E - I$ is negative semi-definite, to ensure the uniqueness and the convergence of the algorithm, we impose the constraint $\beta_1 > \beta_2$, in which case $V$ is positive definite and $v_{\bar k\bar k} > 0$ for all $\bar k$.

If all elements of $H$ are fixed except $h_{\bar kj}$, then $f$ is a quadratic function of $h_{\bar kj}$ with a non-negative constraint, which can be explicitly optimized by

$$h_{\bar kj} \leftarrow \max\left(0,\; h_{\bar kj} - \frac{(VH_{\cdot j} - U_{\cdot j})_{\bar k}}{v_{\bar k\bar k}}\right). \qquad (10)$$

Obviously, the columns $j = 1, \ldots, m$ can be updated independently and in parallel. We then have the following SCD algorithm for solving $H$ when $W$ is fixed.

1. Initialization. Set $H^{(0)} = 0$ and the initial gradient $U^{(0)}_{\cdot j} = -U_{\cdot j}$.
2. Repeat until convergence: for $\bar k = 1$ to $k$, update $h_{\bar kj}$ via (10) and refresh the gradient, simultaneously and in parallel for $j = 1, \ldots, m$, where $V_{\cdot \bar k}$ and $U^{(t)}_{\cdot j}$ denote the $\bar k$-th column of $V$ and the $j$-th column of the current gradient $U^{(t)}$.

From (10), one can see that each iteration is non-increasing and therefore the algorithm converges to some fixed point. Any entry of a fixed point is either on the boundary ($h_{\bar kj} = 0$) with its gradient pointing out of the feasible region, or at a stationary point. A formal proof of convergence can be found in [5].
The alternating algorithm fixes W and solves for H using NNLS, and then fixes H and solves for W using the same algorithm. This procedure is repeated until the change of A − WH is sufficiently small. Each update is non-increasing, thus the alternating algorithm converges.
Instead of initializing H(0) = 0 in every outer iteration, we use a warm start, i.e., we initialize H(0) with the result from the previous outer iteration.
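The SCD solver above can be sketched compactly in numpy. This is our illustration, not the NNLM C++ code: it forms V and U as in (8) and repeatedly applies the coordinate update (10) to every column at once; for clarity, the gradient is recomputed rather than maintained incrementally.

```python
import numpy as np

rng = np.random.default_rng(2)

def scd_nnls(W, A, b1=0.0, b2=0.0, b3=0.0, n_iter=300):
    """Sequential coordinate-wise descent for the penalized NNLS problem
    min_{H>=0} 0.5*||A - WH||_F^2 + penalties, with V and U as in Eq. (8)."""
    k = W.shape[1]
    V = W.T @ W + b1 * np.eye(k) + b2 * (np.ones((k, k)) - np.eye(k))
    U = W.T @ A - b3               # i.e. U = W'A - b3*E
    H = np.zeros((k, A.shape[1]))  # initialization H(0) = 0
    for _ in range(n_iter):
        for kk in range(k):        # one coordinate (row of H) at a time,
            grad = V[kk] @ H - U[kk]   # all columns j updated in parallel
            H[kk] = np.maximum(0.0, H[kk] - grad / V[kk, kk])
    return H

W = rng.random((50, 4))
H_true = np.maximum(0.0, rng.standard_normal((4, 30)))  # sparse, non-negative
A = W @ H_true                 # exact low-rank target
H = scd_nnls(W, A)
print(np.max(np.abs(H - H_true)))
```

Because the unconstrained least-squares solution here is itself non-negative, the SCD iterations recover `H_true` up to numerical precision; note also that exact zeros are reached, unlike with multiplicative updates.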

Sequential quadratic approximation for Kullback-Leibler divergence loss
When L is a KL divergence distance, we use a similar SCD algorithm, by approximating KL(A|WH) with a quadratic function.
Assume $W$ is known and $H$ is to be solved. Let

$$\hat a_{lj} = \sum_q w_{lq} h^{(t)}_{qj},$$

where $H^{(t)}$ is the current value of $H$ in the iterative procedure.

When fixing all other entries, the Taylor expansion of the penalized KL divergence up to the 2nd order at $h^{(t)}_{\bar kj}$ is a quadratic function of $h_{\bar kj}$, whose constrained minimum is given explicitly by

$$h_{\bar kj} \leftarrow \max\left(0,\; h^{(t)}_{\bar kj} - \frac{\sum_l w_{l\bar k}\big(1 - a_{lj}/\hat a_{lj}\big) + \beta_1 h^{(t)}_{\bar kj} + \beta_2\sum_{q\neq\bar k} h^{(t)}_{qj} + \beta_3}{\sum_l w^2_{l\bar k}\, a_{lj}/\hat a^2_{lj} + \beta_1}\right). \qquad (13)$$

A similar formula (14) for updating $w_{i\bar k}$ can be derived. Note that when an entry of $\hat A = WH$ is 0, the KL divergence is infinite. To avoid this, we add a small number to the denominators in both (13) and (14).
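The quadratic (Newton-like) coordinate update for the KL loss can be sketched as follows. This is our illustration of the update form in (13) with all beta_i = 0 for brevity, not the package code; `tau` plays the role of the small number added to the denominators.

```python
import numpy as np

rng = np.random.default_rng(3)

def kl_div(A, WH):
    """KL divergence distance sum(a*log(a/ahat) - a + ahat)."""
    WH = np.maximum(WH, 1e-12)
    return np.sum(A * np.log(A / WH) - A + WH)

def update_H_kl(A, W, H, n_sweeps=20, tau=1e-10):
    """Coordinate updates for KL loss via a second-order Taylor step,
    cf. Eq. (13) with all beta_i = 0."""
    k = H.shape[0]
    H = H.copy()
    for _ in range(n_sweeps):
        for kk in range(k):
            WH = W @ H + tau
            grad = W[:, kk] @ (1.0 - A / WH)              # first derivative
            hess = (W[:, kk] ** 2) @ (A / WH ** 2) + tau  # second derivative
            H[kk] = np.maximum(0.0, H[kk] - grad / hess)
    return H

W = rng.random((40, 3))
H0 = rng.random((3, 25))
A = W @ rng.random((3, 25)) + 0.01   # strictly positive target
kl_before = kl_div(A, W @ H0)
H1 = update_H_kl(A, W, H0)
kl_after = kl_div(A, W @ H1)
print(kl_before, kl_after)
```

Each inner sweep recomputes WH, which is why KL-based algorithms cannot amortize inner iterations the way the square-loss SCD can (see the complexity discussion below).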

Complexity and convergence speed
The first step of SCD (Eq. 11) has complexity $kmn$, due to $W^TA$. The second step (Eq. 12) costs $km \times k \times N_i$, where the second $k$ is due to the update of $U^{(t+1)}_{\cdot j}$ and $N_i$ is the number of inner iterations used to solve the non-negative linear model. In total, $kmn + k^2 m N_i$ multiplications are needed to solve $H$ with $W$ fixed. Accounting for the similar computation for $W$, the total complexity of SCD is $O\big(((m+n)k^2 N_i + 2nmk)\,N_o\big)$, where $N_o$ is the number of outer iterations alternating over $W$ and $H$. $N_iN_o$ is the total number of epochs, i.e., complete scans over all entries of $W$ and $H$. For Lee's multiplicative algorithm with MSE, when $W$ is fixed, the complexity of solving $H$ is $knm$ for $W^TA$ in the numerator, $k^2n$ for $W^TW$ in the denominator, and $k^2mN_i$ for multiplying $H$ in the denominator $N_i$ times, which adds up to $knm + k^2n + k^2mN_i$. Accounting for $W$ and the alternation, Lee's algorithm with MSE loss has the same complexity as SCD. The same analysis can be done for their KL counterparts, for which both algorithms have the same complexity of $O(nmk^2N_iN_o)$.
Obviously, algorithms with square error loss are faster than the KL-based ones in terms of complexity (by a factor of k), and, since typically $k \ll m, n$, they can benefit from multiple inner iterations $N_i$ (which amortize the expensive computation of $W^TA$ and $AH^T$), generally reducing $N_o$. In contrast, algorithms with KL loss cannot benefit from inner iterations, due to the re-calculation of $WH$ in each inner iteration. Though SCD and Lee's algorithm are similar in terms of complexity, one can expect much faster convergence from SCD. This is because Lee's algorithm is essentially gradient descent with a special step size [1], a first-order method, while SCD is a Newton-Raphson-like second-order approach.

Missing entries
Due to various reasons, not all entries of A may be present. In some cases, even an observed entry may not be reliable, and it may be better to treat it as missing. Since matrix A is assumed to have a low rank k, the information in A is redundant for such a decomposition. Hence, factorization can be carried out in the presence of missing entries in A, using only the observed ones.
In fact, as the loss function is usually a sum of per-entry losses, it is natural to simply drop the terms related to the missing entries. For any $j$, let $I_j = \{i : a_{ij} \text{ is not missing}\}$ and $\bar I_j = \{i : a_{ij} \text{ is missing}\}$. When updating the $j$-th column of $H$, the rows of $W$ in $\bar I_j$ should be removed, i.e., $U$ and $V$ in (8) are modified as

$$V^{(j)} = W_{I_j\cdot}^T W_{I_j\cdot} + \beta_1 I + \beta_2(E - I), \qquad U_{\cdot j} = W_{I_j\cdot}^T A_{I_j j} - \beta_3 E_{\cdot j}, \qquad (15)$$

where $W_{I_j\cdot}$ and $A_{I_j\cdot}$ denote the submatrices of $W$ and $A$ with row indices in $I_j$. Unlike the non-missing case, $V$ now depends on $j$. A similar modification can be applied to the KL counterpart (14) and to Lee's multiplicative algorithms (5, 6), by replacing $W^TW$ and $W^TA$ in the same way as in (15). Note that the re-calculation of $V$ increases the complexity of the MSE-based method but not of the KL-based one, where it has to be re-computed anyway. The ability to handle missing values is crucial in applications, and turns out to induce a novel missing value imputation method (described in the "Missing value imputation" section) and a novel method for choosing k (described in the "Choice of k" section).
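The per-column modification (15) amounts to building V and U from the observed rows only. A small numpy sketch (our illustration, with the penalty terms omitted for clarity and the helper name `nnls_column_masked` invented here):

```python
import numpy as np

rng = np.random.default_rng(4)

def nnls_column_masked(W, a, observed, n_iter=200):
    """Solve one column of H by coordinate descent when some entries of
    the column a are missing: as in Eq. (15), V and U are built from the
    observed rows of W only (penalties omitted)."""
    Wo = W[observed]          # W restricted to rows I_j
    V = Wo.T @ Wo             # V now depends on the column's missing pattern
    U = Wo.T @ a[observed]
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        for kk in range(len(h)):
            grad = V[kk] @ h - U[kk]
            h[kk] = max(0.0, h[kk] - grad / V[kk, kk])
    return h

W = rng.random((60, 3))
h_true = rng.random(3)
a = W @ h_true
observed = rng.random(60) < 0.7   # ~30% of this column is missing
h = nnls_column_masked(W, a, observed)
print(np.max(np.abs(h - h_true)))
```

With far more observed rows than factors, the column coefficients are recovered from the observed entries alone, which is exactly why the reconstruction WH yields sensible values at the missing positions.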

Missing value imputation
As discussed in the "Missing entries" section, the information in A is largely redundant for factorization purposes. Hence reasonable results can still be achieved with missing entries present in A, and the reconstruction Â = WH provides reasonable predictions for the missing values.
The advantage of NMF imputation is that it takes all observed entries into account when imputing a single missing entry, so NMF can capture complex dependencies among entries. By contrast, a conventional statistical missing value imputation algorithm, e.g., missForest [9] or MICE [10], usually models missing entries in a feature-by-feature (column-by-column or row-by-row) manner and must iterate over all features multiple times to capture complex dependencies.

Choice of k
The selection of hyper-parameters is a typical challenge for all unsupervised learning algorithms. The rank k is the only, but critical, hyper-parameter of NMF, and it is a priori unknown. Brunet et al. [2] suggest trying multiple runs for each k and using a consensus matrix to determine k. This idea assumes that cluster assignment is stable from run to run if a clustering into k classes is strong. However, this assumption needs to be verified, and the purpose of NMF is not always clustering. Besides, the consensus approach chooses the k with the lowest variation in clustering, which is not necessarily the right measure for choosing k. We argue that a reasonable k should be able to remove noise and recover the signal. One idea, borrowed from the denoising autoencoder [17], is to add noise to the matrix A, factorize the noisy version, and compare the reconstructed matrix to the original A; one can expect the "correct" k to give the smallest error. This could be a general approach for many unsupervised learning algorithms. When it comes to NMF, however, the choice of noise is not obvious, as the noisy version of A must remain non-negative, which suggests that injected noise may also introduce bias. In addition, the noise distribution is yet another hyper-parameter that is not obvious to pick.
Given the ability of NMF to handle missing entries described in the section above, and its powerful missing value imputation demonstrated in the "Missing value imputation" section, we propose a novel approach, akin to the well-known training-validation split in supervised learning.
1. Some portion (e.g., 30%) of the entries are randomly deleted (selected to be missing) from A.
2. The deleted entries are imputed by NMF with a set of different k's.
3. The imputed entries are compared to their observed values, and the k that gives the smallest error is selected.
The above approach rests on the assumption that only the correct k, if it exists, admits a decomposition that recovers the missing entries. In contrast to the training-validation split in supervised learning, the typically large number of entries in A means we generally have a very large 'sample size'. Bootstrapping and cross-validation can also be easily incorporated into this approach, and the idea should apply to any unsupervised learning method that can handle missing values.

Masking, content deconvolution and designable factorization
Microarrays are a popular technique for measuring mRNA expression. Strictly speaking, the mRNA profile of a tumour sample is typically a mixture of cancerous and healthy profiles, as the collected tissue is 'contaminated' by healthy cells. A pure cancer profile is usually more suitable for downstream analysis [14].
One can utilize NMF for such a purpose by formulating it as

$$A \approx \big(W_0\ \ W\big)\begin{pmatrix}H_1\\ H\end{pmatrix} = W_0H_1 + WH, \qquad (16)$$

where matrix $W$ is an unknown cancer profile and matrix $W_0$ is a known healthy profile. Rows of $A$ represent probes or genes, while columns represent patients or samples. The task here is to solve for $W$, $H$ and $H_1$ given $A$ and $W_0$, which can be thought of as a 'guided' NMF. In this decomposition, the number of columns of $W$ can be interpreted as the number of unknown cancerous profiles or cell types. The corresponding tumour percentage of sample $j$ can be estimated as

$$\hat\pi_j = \frac{\sum_i (WH)_{ij}}{\sum_i (W_0H_1 + WH)_{ij}}. \qquad (17)$$

A more general implementation is to use mask matrices for $W$ and $H$, where the masked entries are fixed at their initial values, or at 0 if not initialized. Indeed, one can treat this as a form of hard regularization. It can be seen immediately that the above deconvolution is a special case of this masking technique, in which the masked entries are initialized to the known profile and fixed. This technique is designed to integrate domain knowledge, such as gene sub-networks or pathways, to guide the NMF towards a more biologically meaningful decomposition.
For example, assume $S = \{S_1, \ldots, S_L\}$, where each $S_l$, $l = 1, \ldots, L$, is the set of genes in a certain sub-network or pathway. One can design $W$ as a matrix of $K$ columns ($K \ge L$), with $w_{il} = 0$ fixed for $i \notin S_l$. The NMF factorization will then learn from data the weight, or contribution, $w_{il}$ of gene $i$ in sub-network or pathway $l$, and one can interpret $h_{lj}\sum_i w_{il}$ as an expression level of sub-network or pathway $l$ in patient $j$. The remaining columns $W_{\cdot j}$, $j = L+1, \ldots, K$, capture unknown sub-networks or pathways. Note that $K$ is unknown beforehand, but can be determined by the method introduced in the "Choice of k" section.
Similarly, if $S_l$ is a set of marker genes (those known to be expressed only in a specific cell type or tissue) for tissue $l$, then by fixing $w_{il} = 0$ for $i \in \bigcup_{q\neq l} S_q$, one can find the relative abundance of each tissue type in a sample. A formula similar to Eq. (17) can be used to compute the proportions of known ($j = 1, \ldots, L$) and unknown ($j = L+1, \ldots, K$) cell types or tissues.
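A toy numpy sketch of the masking idea for deconvolution (our illustration on simulated data, not the NNLM API): the first column of W is frozen at a known 'healthy' profile while the remaining columns and H are updated multiplicatively, and tumour content is then estimated as in (17).

```python
import numpy as np

rng = np.random.default_rng(5)

def masked_nmf(A, W_init, fixed_cols, n_iter=500, eps=1e-9):
    """Multiplicative-update NMF in which the columns of W listed in
    fixed_cols are frozen at their initial values -- the masking idea,
    with the known profile occupying the fixed columns."""
    k = W_init.shape[1]
    W = W_init.copy()
    H = rng.random((k, A.shape[1]))
    free = [c for c in range(k) if c not in fixed_cols]
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        num, den = A @ H.T, W @ H @ H.T + eps
        W[:, free] *= num[:, free] / den[:, free]  # only unknown profiles move
    return W, H

n, m = 200, 15                      # genes x samples (simulated)
W0 = rng.random((n, 1))             # known healthy profile (fixed)
Wc = rng.random((n, 2))             # unknown cancer profiles
A = np.hstack([W0, Wc]) @ rng.random((3, m))

W, H = masked_nmf(A, np.hstack([W0, rng.random((n, 2))]), fixed_cols=[0])
# Tumour content of sample j, as in Eq. (17): share of the reconstruction
# explained by the unknown (cancer) columns.
tumour_frac = (W[:, 1:] @ H[1:]).sum(axis=0) / (W @ H).sum(axis=0)
print(tumour_frac.round(2))
```

Because the masked column never moves, the fitted model is forced to explain each sample as healthy profile plus learned cancer profiles, and the per-sample fractions fall in [0, 1] by construction.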
Another possible application of masking is meta-analysis of different cancer types, which finds metagenes shared among cancers. For instance, assume $A_1$ and $A_2$ are expression matrices of lung cancer and prostate cancer microarrays. By setting certain parts of the coefficient matrix to 0, for example

$$\big(A_1\ \ A_2\big) \approx \big(W_0\ \ W_1\ \ W_2\big)\begin{pmatrix} H_{01} & H_{02} \\ H_1 & 0 \\ 0 & H_2 \end{pmatrix},$$

we can expect that $W_1$ and $W_2$ are lung and prostate cancer specific profiles, while $W_0$ is a profile shared between the two cancers.