Fast Nonnegative Matrix Factorization and Applications to Pattern Extraction, Deconvolution and Imputation

Nonnegative matrix factorization (NMF) is a technique widely used in various fields, including artificial intelligence (AI), signal processing and bioinformatics. However, existing algorithms and R packages cannot be applied to large matrices due to their slow convergence, and cannot handle missing values. In addition, most NMF research focuses only on blind decompositions: decomposition without utilizing prior knowledge. We adapt the idea of sequential coordinate-wise descent to NMF to increase the convergence rate. Our NMF algorithm also handles missing values naturally and integrates prior knowledge to guide NMF towards a more meaningful decomposition. To support its use, we describe a novel imputation-based method to determine the rank of the decomposition. All of our algorithms are implemented in the R package NNLM, which is freely available on CRAN.

1 Introduction

Nonnegative matrix factorization (NMF or NNMF) ([Lee and Seung, 1999]) has been widely used as a general method for dimension reduction and feature extraction on nonnegative data. The main difference between NMF and other factorization methods, such as SVD, is the nonnegativity constraint, which allows only additive combinations of intrinsic 'parts', i.e. the hidden features. This is demonstrated in [Lee and Seung, 1999], where NMF is shown to learn parts-based representations such as facial parts from images. In bioinformatics, NMF is sometimes used to find 'meta-genes' from expression profiles, which may be related to biological pathways ([Brunet et al., 2007], [Kim and Park, 2007]). NMF has also been used to extract trinucleotide mutational signatures from mutations found in cancer genomic sequences, with the suggestion that the trinucleotide profile of each cancer type is a positive linear combination of these signatures ([Alexandrov et al., 2013]).

There are several different algorithms available for NMF decomposition, including the multiplicative algorithms proposed in [Lee and Seung, 1999], gradient descent and alternating nonnegative least squares (ANLS). ANLS is gaining attention due to its guarantee of converging to a stationary point and the availability of fast solvers for nonnegative least squares (NNLS).
In this paper, we adapt the ANLS approach, but solve the NNLS subproblems using the coordinate-wise algorithm proposed by [Franc et al., 2005].

All of these algorithmic innovations are implemented in the popular R programming language. They serve as an alternative to the widely-used NMF package ([Gaujoux and Seoighe, 2010]), which was first translated from a MATLAB package and later optimized via C++ for some algorithms. The sparse alternating NNLS (ANLS) algorithm of [Kim and Park, 2007] is expected to be algorithmically fast, but is implemented in pure R, leading to slow performance in practice. Our NNLM package combines the efficient NNLS algorithm with Rcpp, which seamlessly integrates R and C++ ([Eddelbuettel and Francois, 2011]), and is freely available and open-source.

Here A ∈ R^{n×m}, W ∈ R^{n×K}, H ∈ R^{K×m}, I is an identity matrix, E is a matrix of proper dimension with all entries equal to 1, and X_{·i} and X_{i·} are the i-th column and row of X respectively. Obviously, J_4 = J_1 + J_2. L(x, y) is a loss function, usually chosen to be either the square error ½(x − y)^2 or the KL divergence x log(x/y) − x + y; the latter can be interpreted as the deviance of a Poisson model.

The above four types of regularization can be used for different purposes. J_1 is a ridge penalty that controls the magnitudes and smoothness of the entries. J_2(X) is used to minimize correlations among columns, i.e., to maximize the independence of (or the angle between) X_{·i} and X_{·j} ([Zhange et al., 2008]). J_3 and J_4 ([Kim and Park, 2007]) are LASSO-like penalties, which control both magnitude and sparsity. J_3(X) tends to control matrix-wise sparsity, which may result in some columns of X having all entries equal to 0, while J_4(X) forces sparsity in a column-wise manner.
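The regularization terms themselves (Equation (2)) are not reproduced here; one formulation consistent with the stated identity J_4 = J_1 + J_2 (an assumption on our part, not copied from the original) is:

\[
J_1(X) = \tfrac{1}{2}\operatorname{tr}\!\left(X X^{\top}\right), \qquad
J_2(X) = \tfrac{1}{2}\operatorname{tr}\!\left(X (E - I) X^{\top}\right) = \sum_{k<l} X_{\cdot k}^{\top} X_{\cdot l},
\]
\[
J_3(X) = \sum_{i,j} |x_{ij}|, \qquad
J_4(X) = \tfrac{1}{2}\operatorname{tr}\!\left(X E X^{\top}\right) = \tfrac{1}{2}\sum_i \lVert X_{i\cdot}\rVert_1^2 .
\]

Since E = I + (E − I), the identity J_4 = J_1 + J_2 follows immediately.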

An alternating algorithm solves for W and H alternately and iteratively. Due to the nonnegativity constraint, the penalties do not add complexity to the subproblems. In addition, convergence is guaranteed, as the loss is monotonically non-increasing at each step. The algorithm converges to a local minimum: if it did not, there would exist a coordinate with a non-zero derivative, and the corresponding update would further decrease the loss.

In this section, we introduce a new and fast algorithm for NMF with mean square loss (section 3.1) and KL divergence loss (section 3.2), with regularization, based on the alternating scheme. In section 3.3, we generalize the multiplicative algorithm proposed by [Lee and Seung, 1999] to incorporate all the regularizations in Equation (2). We then re-examine NMF in section 3.4 and develop a novel method to integrate prior knowledge into NMF and guide the decomposition in a more biologically meaningful way, which can be powerful in applications. In section 3.5, we show how one can utilize NMF to impute missing values in array data, and this capability leads to a very intuitive and logically robust tuning method to determine the unknown rank k in section 3.6.

When L is the square loss, the following sequential coordinate-wise descent (SCD) algorithm is used to solve the penalized NNLS problem for H while W is fixed.

Since E − I is not positive semi-definite, to ensure the uniqueness and convergence of the algorithm, we impose the constraint β_1 ≥ β_2.
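The update formulas themselves are not shown above. As an illustration only, here is a minimal R sketch of the unpenalized sequential coordinate-wise descent of [Franc et al., 2005] for a single NNLS problem min_{h ≥ 0} ||a − W h||^2; the penalty terms merely add constants to the gradient and to the diagonal of the Hessian, and all function and variable names below are ours, not part of the NNLM API:

# Sequential coordinate-wise descent for min_{h >= 0} 0.5 * ||a - W h||^2.
# W: n x K nonnegative matrix, a: length-n vector, h0: optional warm start.
scd_nnls <- function(W, a, h0 = NULL, n_iter = 100L, tol = 1e-8) {
  K   <- ncol(W)
  WtW <- crossprod(W)        # K x K Gram matrix, computed once
  Wta <- crossprod(W, a)     # K x 1
  h   <- if (is.null(h0)) rep(0, K) else h0
  for (iter in seq_len(n_iter)) {
    max_change <- 0
    for (k in seq_len(K)) {
      grad  <- sum(WtW[k, ] * h) - Wta[k]          # partial derivative w.r.t. h[k]
      h_new <- max(0, h[k] - grad / WtW[k, k])     # exact 1-D minimizer, projected to >= 0
      max_change <- max(max_change, abs(h_new - h[k]))
      h[k]  <- h_new
    }
    if (max_change < tol) break
  }
  h
}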

The alternating algorithm fixes W and solves for H using NNLS, and then fixes H and solves for W using the same algorithm. This procedure is repeated until the change in A − WH is sufficiently small.
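A compact sketch of this alternating scheme, built on the scd_nnls helper above (illustrative and unpenalized; the warm start discussed in the next paragraph corresponds to passing the previous H and W back into the solver):

# Alternating NNLS for A ~ W H with A (n x m), W (n x k), H (k x m), all nonnegative.
nmf_als <- function(A, k, n_outer = 200L, tol = 1e-6) {
  n <- nrow(A); m <- ncol(A)
  W <- matrix(runif(n * k), n, k)
  H <- matrix(runif(k * m), k, m)
  loss_old <- Inf
  for (it in seq_len(n_outer)) {
    # solve each column of H given W, warm-started at its previous value
    for (j in seq_len(m)) H[, j] <- scd_nnls(W, A[, j], h0 = H[, j])
    # solve each row of W given H (transposed problem), warm-started likewise
    for (i in seq_len(n)) W[i, ] <- scd_nnls(t(H), A[i, ], h0 = W[i, ])
    loss <- mean((A - W %*% H)^2)
    if (abs(loss_old - loss) < tol * loss_old) break
    loss_old <- loss
  }
  list(W = W, H = H, mse = loss)
}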

Instead of initializing H^(0) = 0 at every outer iteration, we use a warm start, i.e., each NNLS solve is initialized with the value of H (or W) from the previous outer iteration.

When L is the KL divergence, we use a similar SCD algorithm, obtained by approximating KL(A | WH) with a quadratic function.

Assume W is known and H is to be solved. Let Â = W H^(t), where H^(t) is the current value of H in the iterative procedure.

When expanding the penalized KL divergence up to second order with respect to h_{kj}, while fixing all other entries, the resulting quadratic can be solved explicitly; a similar formula for updating w_{ik} can be derived. Note that when an entry of Â is 0, the KL divergence is infinite. To avoid this, we add a very small number to both A and Â.
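The explicit coordinate update is not reproduced above; the following is our reconstruction of its unpenalized form, derived from the Poisson-deviance KL loss (penalty terms would add constants to the numerator and denominator):

\[
h_{kj} \;\leftarrow\; \max\!\left(0,\; h_{kj} \;-\; \frac{\sum_i w_{ik}\,\bigl(1 - a_{ij}/\hat a_{ij}\bigr)}{\sum_i w_{ik}^{2}\, a_{ij}/\hat a_{ij}^{2}}\right),
\qquad \hat a_{ij} = \bigl(W H^{(t)}\bigr)_{ij},
\]

which is a projected Newton step for the one-dimensional quadratic approximation described above.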

Two multiplicative updating algorithms were proposed in [Lee and Seung, 1999], one for the square loss and one for the KL divergence loss. We modify these algorithms to integrate all of the above regularizations as follows.
This multiplicative algorithm is straightforward to implement, but it has the drawback that when an entry of W or H is initialized to zero or to a positive value, it remains zero or positive throughout the iterations. Therefore, true sparsity generally cannot be achieved unless hard-thresholding is imposed, although many entries may become small enough to be thresholded to zero.

One can utilize NMF for content deconvolution by formulating the decomposition as A ≈ W H + W_0 H_1, where W is an unknown cancer profile and W_0 is a known healthy profile. Rows of A represent probes or genes, while columns represent patients or samples. The task here is to solve for W, H and H_1 given A and W_0, which can be thought of as a 'guided' NMF. In this decomposition, the number of columns of W can be interpreted as the number of unknown cancerous profiles or cell types, and the corresponding tumour content of sample j can be estimated from the relative contributions of the two coefficient blocks for that sample.

A more general implementation is to use mask matrices for W and H, where the masked entries are fixed to their initial values, or to 0 if not initialized. Indeed, one can treat this as a form of hard regularization. It can be seen immediately that the above deconvolution is a special case of this mask technique, in which the masked entries are initialized to the known profile and fixed. This feature is designed to integrate domain knowledge, such as gene subnetworks and pathways, to guide the NMF towards a more biologically meaningful decomposition.
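As an illustration of the guided decomposition A ≈ W H + W_0 H_1 (our notation, not the NNLM interface), the sketch below fixes the known block W_0 and alternates between solving the stacked coefficients and the unknown profiles, reusing scd_nnls from above:

# Guided NMF: A (n x m) ~ W H + W0 H1, with W0 (n x L) known and kept fixed.
guided_nmf <- function(A, W0, k, n_outer = 100L) {
  n <- nrow(A); m <- ncol(A); L <- ncol(W0)
  W <- matrix(runif(n * k), n, k)
  for (it in seq_len(n_outer)) {
    Wfull <- cbind(W0, W)                      # n x (L + k); the W0 block is never updated
    Hfull <- matrix(0, L + k, m)
    for (j in seq_len(m)) Hfull[, j] <- scd_nnls(Wfull, A[, j])
    H1 <- Hfull[seq_len(L), , drop = FALSE]    # coefficients of the known profiles
    H  <- Hfull[-seq_len(L), , drop = FALSE]   # coefficients of the unknown profiles
    R  <- A - W0 %*% H1                        # part of A left to the unknown block
    for (i in seq_len(n)) W[i, ] <- scd_nnls(t(H), R[i, ])
  }
  list(W = W, H = H, H1 = H1)
}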

For example, assume S = {S_1, ..., S_L}, where each S_l, l = 1, ..., L, is the set of genes in a certain subnetwork or pathway l. One can design W as a matrix of K columns (K ≥ L), with w_il = 0 whenever gene i is not in S_l. The NMF factorization will then learn the weight, or contribution, w_il of each gene i in sub-network or pathway l from the data. One can also interpret h_lj (Σ_i w_il) as an expression level of sub-network or pathway l in patient j. Besides, the columns W_{·j} for j = L + 1, ..., K are unknown sub-networks or pathways. Note that K is unknown beforehand, but later we will see that it can be determined by NMF imputation.
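A small sketch of how such a pathway-guided mask could be constructed (the gene-set inputs are hypothetical; the mask marks entries of W that are to be fixed at 0):

# genes:     character vector of all n gene names (rows of A)
# gene_sets: list of L character vectors, one per known pathway / subnetwork
# k:         total number of columns of W (k >= length(gene_sets))
make_w_mask <- function(genes, gene_sets, k) {
  n <- length(genes); L <- length(gene_sets)
  mask <- matrix(FALSE, n, k)                       # FALSE = entry is free to be learned
  for (l in seq_len(L)) {
    mask[!(genes %in% gene_sets[[l]]), l] <- TRUE   # fix w_il = 0 for genes outside S_l
  }
  mask                                              # columns L+1, ..., k stay unconstrained
}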

Similarly, if S_l is a set of marker genes (those known to be expressed only in a specific cell type or tissue) for tissue l, then by setting w_il = 0 for every gene i ∈ ∪_{q≠l} S_q, one can find the relative abundance of each tissue type in a sample. A formula similar to Equation (6) can be used to compute the proportions of the known (j = 1, ..., L) and unknown (j = L + 1, ..., K) cell types or tissues.

Another possible application is meta-analysis of different cancer types, to find meta-genes that are shared among cancers. For instance, assume A_1 and A_2 are expression matrices from lung cancer and prostate cancer microarrays. By setting certain blocks of the coefficient matrix H to 0, we can expect that W_1 and W_2 are lung-specific and prostate-specific profiles, while W_0 is a shared profile.
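One plausible block arrangement realizing this (a reconstruction on our part, as the original equation is not shown above) is

\[
[\,A_1 \;\; A_2\,] \;\approx\; [\,W_0 \;\; W_1 \;\; W_2\,]
\begin{bmatrix}
H_{0,1} & H_{0,2}\\
H_{1}   & 0\\
0       & H_{2}
\end{bmatrix},
\]

where the zero blocks of H are enforced via the mask, so that W_1 contributes only to A_1, W_2 only to A_2, and W_0 to both.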

Missing value imputation
Since matrix A is assumed to have a low rank k, the information in A is redundant with respect to such a decomposition. Hence it is possible to allow some entries of A to be missing. This observation reveals the capability of NMF to impute the missing entries of A, which is a typical task in recommendation systems.
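A minimal sketch of this use, assuming the nnmf() interface of the CRAN NNLM package, in which missing entries of A are simply skipped in the loss and the fit contains the factor matrices W and H (my_expression_data is a placeholder for any nonnegative matrix with NA entries):

library(NNLM)

A_missing <- as.matrix(my_expression_data)   # hypothetical nonnegative data matrix with NAs
holes     <- is.na(A_missing)                # entries to impute

fit       <- nnmf(A_missing, k = 3)          # NA entries are ignored during fitting
A_hat     <- fit$W %*% fit$H                 # low-rank reconstruction
A_imputed <- A_missing
A_imputed[holes] <- A_hat[holes]             # fill in only the missing entries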

The advantage of NMF imputation over traditional statistical imputation methods is that it takes all of the observed entries into account when imputing a single missing entry, which means NMF can capture complex dependencies among entries. A typical missing-value imputation algorithm, in contrast, models missing entries in a feature-by-feature (column-by-column or row-by-row) manner and has to iterate over all features multiple times to capture such dependencies. A comparison of different imputation methods is shown in Figure 1 (B), using a subset of the NSCLC dataset with 30% of entries set to missing at random. One can see that NMF is almost as good as missForest ([Stekhoven and Buehlmann, 2012]) but much faster, and clearly better than MICE and simple median imputation in this example.

Tuning hyper-parameters is a typical challenge for all unsupervised learning algorithms. The rank k is the only, but very crucial, parameter, and it is a priori unknown. [Brunet et al., 2007] suggest running NMF multiple times for each k and using a consensus matrix to determine k. This idea assumes that cluster assignment is stable from run to run when a clustering into k classes is strong. However, this assumption needs to be validated, and the purpose of NMF is not always clustering. Another idea, borrowed from the denoising auto-encoder ([Vincent et al., 2008]), is to add noise to the matrix A; one can then expect that the desired k should give the smallest error when the noisy matrix is reconstructed and compared to the original. This could be a general approach for many unsupervised learning algorithms. However, when it comes to NMF, the choice of 'noise' is not straightforward, since the noisy version of A has to be nonnegative as well, which suggests that the 'noise' may also introduce bias.

Given the powerful missing-value imputation of NMF, we propose a novel approach, akin to the well-known training-validation split in supervised learning. Some entries are randomly deleted from A and then imputed by NMF over a set of candidate values of k. The imputed entries are then compared to their observed values, and the k that gives the smallest error is chosen, since only the correct k, if it exists, yields the decomposition that can recover the missing entries. In contrast to the training-validation split in supervised learning, the typically large number of entries in A gives us a very large 'sample size'. One can also easily adapt the idea of cross-validation to this approach. This idea should be applicable to any unsupervised learning method that supports missing-value imputation.
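A sketch of this tuning procedure, reusing the hedged nnmf() interface from above (the 30% hold-out fraction and the candidate ranks are arbitrary choices for illustration):

library(NNLM)

choose_rank <- function(A, ks = 1:8, holdout = 0.3) {
  idx           <- which(!is.na(A))                  # only hold out observed entries
  held          <- sample(idx, floor(holdout * length(idx)))
  A_train       <- A
  A_train[held] <- NA                                # delete entries, to be imputed by NMF
  errs <- sapply(ks, function(k) {
    fit   <- nnmf(A_train, k = k)
    A_hat <- fit$W %*% fit$H
    mean((A_hat[held] - A[held])^2)                  # MSE on the held-out entries
  })
  ks[which.min(errs)]                                # rank with the best imputation error
}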

To illustrate, we performed a simulation study. W and H were generated with k = 3, and A was constructed as WH plus noise. The result is shown in Figure 1 (D). As one can see, different runs give consistent results. The mean square errors (MSEs) of the reconstruction decrease as the rank k increases, but the rate of decrease slows down after k = 3 (the true rank). Meanwhile, the MSEs of the imputed values are minimized at k = 3 in all runs.

We also observed faster convergence for SCD than for the multiplicative algorithm. This is because Lee's algorithm is essentially gradient descent with a special step size ([Lee and Seung, 1999]), i.e., a first-order method, while SCD is a Newton-Raphson-like second-order approach.

Expression deconvolution is of constant interest in bioinformatics and clinical research ([Zhange et al., 2008], [Abbas et al., 2009]), and several NMF-related methods have been proposed ([Gaujoux and Seoighe, 2011]). However, our method of using mask matrices is more flexible and powerful, as one can guide the decomposition towards almost any biological process of interest by integrating prior knowledge into the initial and mask matrices. Compared to Bayesian methods such as ISOpure ([Quon et al., 2013]), NMF-based methods are much faster. An application to an NSCLC dataset was performed and compared to the existing ISOpure method ([Anghel et al., 2015]) in Figure 1 (C).

All methodologies described in this paper are implemented in the R package NNLM, which is available on CRAN.
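For reference, the package can be installed and loaded in the usual way (nnmf is, to our understanding, the main factorization routine):

install.packages("NNLM")   # install from CRAN
library(NNLM)              # then see ?nnmf for the factorization interface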

We extended sequential coordinate-wise descent (SCD) to the KL divergence loss and applied SCD to NMF within the alternating scheme. We introduced mask matrices into the NMF algorithms to integrate prior knowledge into the decomposition, in order to induce more meaningful results. Due to its dimension-reduction nature, we found NMF to be an outstanding method for missing-value imputation, which allowed us to introduce a novel method for tuning hyper-parameters that is also applicable to any other unsupervised method that can handle missing values.