### Non-negative matrix factorization

NMF is a matrix factorization algorithm originally introduced by Lee *et al*. for the analysis of facial images [39]. The technique can be applied to multidimensional datasets in order to reduce their dimensionality, discover latent patterns and, more importantly, aid in the interpretation of the data.

The main difference between NMF and other classical factorization techniques that have been applied to gene expression data analysis, such as principal component analysis (PCA), singular value decomposition (SVD) or independent component analysis (ICA), lies in the non-negativity constraints imposed on both the basis vectors (**W**) and the encoding vectors (**H**). These constraints tend to lead to a parts-based representation of the data because they allow only additive, not subtractive, combinations. Factors can therefore be interpreted as parts of the data or, in other words, as subsets of elements that tend to occur together in sub-portions of the dataset. By contrast, the other factorization techniques mentioned above allow the entries of **W** and **H** to be of arbitrary sign, which involves complex cancellations of positive and negative elements to reconstruct the original dataset. In short, NMF tends to produce factors that lend themselves to a relatively easy contextual interpretation, whereas the factors obtained by the other approaches have no obvious contextual "meaning" in themselves.

Formally, the non-negative matrix decomposition can be described as follows:

**V** ≈ **WH** Eq. 1

where **V** ∈ ℝ^{m×n} is a non-negative data matrix with *m* variables and *n* objects, **W** ∈ ℝ^{m×k} contains the *k* reduced basis vectors or factors, and **H** ∈ ℝ^{k×n} contains the coefficients of the linear combinations of the basis vectors needed to reconstruct the original data (also known as encoding vectors). Additionally, the following conditions hold: *k* ≤ *m*; all matrices **V**, **W** and **H** are non-negative; and the columns of **W** (the basis vectors) are normalized (they sum up to 1). As mentioned above, the main difference between NMF and other classical factorization models lies in the non-negativity constraints imposed on both the basis **W** and the encoding vectors **H**. In this way, only additive combinations are possible:

*V*_{ij} ≈ (**WH**)_{ij} = Σ_{a=1}^{k} *W*_{ia}*H*_{aj} Eq. 2

The objective function, based on the Poisson likelihood, can be defined using the following divergence, which we need to minimize:

D(**V**‖**WH**) = Σ_{i=1}^{m} Σ_{j=1}^{n} ( *V*_{ij} log( *V*_{ij} / (**WH**)_{ij} ) − *V*_{ij} + (**WH**)_{ij} ) Eq. 3

Solving the minimization problem in equation 3 leads to the following algorithm:

1. Initialize **W** and **H** with positive random numbers.

2. For each basis vector **W**_{a} ∈ ℝ^{m×1}, update the corresponding encoding vector **H**_{a} ∈ ℝ^{1×n}, then update and normalize the basis vector **W**_{a}. Repeat this process until convergence.

Iteration of the rules described above converges to a local minimum of the objective function described in equation 3. Formally, the detailed algorithm follows:

Repeat until convergence:

For a = 1 ... *k* do begin

For b = 1 ... *n* do

*H*_{ab} ← *H*_{ab} Σ_{i=1}^{m} *W*_{ia}*V*_{ib} / (**WH**)_{ib} Eq. 4

For c = 1 ... *m* do begin

*W*_{ca} ← *W*_{ca} [ Σ_{j=1}^{n} *H*_{aj}*V*_{cj} / (**WH**)_{cj} ] / Σ_{j=1}^{n} *H*_{aj} Eq. 5

*W*_{ca} ← *W*_{ca} / Σ_{i=1}^{m} *W*_{ia} Eq. 6

End

End
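As an illustration, the multiplicative updates above can be sketched in matrix form. This is a minimal NumPy sketch, not the authors' implementation; the small constants guarding against division by zero and the fixed iteration count are assumptions:

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0):
    """Basic NMF for the divergence objective (Eq. 3) via the
    multiplicative updates (Eqs. 4-6), written in matrix form."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Step 1: initialize W and H with positive random numbers.
    W = rng.random((m, k)) + 1e-4
    H = rng.random((k, n)) + 1e-4
    W /= W.sum(axis=0)            # normalize basis vectors (columns sum to 1)
    for _ in range(n_iter):
        # Eq. 4: since columns of W sum to 1, the usual denominator is 1.
        WH = W @ H + 1e-9
        H *= W.T @ (V / WH)
        # Eq. 5: update W using the new H.
        WH = W @ H + 1e-9
        W *= (V / WH) @ H.T / H.sum(axis=1)
        # Eq. 6: renormalize each basis vector.
        W /= W.sum(axis=0)
    return W, H
```

Each update keeps all entries non-negative, and iterating them drives the divergence of Eq. 3 down to a local minimum.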

### Non-smooth Non-Negative Matrix Factorization (*ns*NMF)

Even though NMF has been presented and used as a method capable of finding the underlying component-based structure of complex data, there is no explicit guarantee in the method to support this property other than the non-negativity constraints. In fact, taking a closer look at the basis and encoding vectors produced by the original NMF model [39], it is noticeable that there is a high degree of overlap among basis vectors, which contradicts the intuitive nature of the "parts" [40]. As a consequence, a further evolution of NMF capable of producing more localized feature representations of both genes and experiments is highly desirable in this type of application.

In this direction, there are several reported attempts to solve this problem by modifying the original NMF functional to enforce sparseness on the basis vectors, the encoding vectors, or both [41–43].

In this work we decided to use a recent sparse non-negative factorization technique whose cost function is derived by modifying the original NMF model (equation 1) in order to enforce sparseness on both the basis and the encoding vectors. The new method, here referred to as Non-smooth Non-Negative Matrix Factorization (*ns*NMF) [23], differs from the original in the use of an extra smoothness matrix to impose sparseness. A full comparison of this method with the other reported sparse versions of NMF [41–44] can be found in [23], whose results reflect the superiority of *ns*NMF in finding sparse factors without drastically affecting the quality of the factorization process.

The goal of *ns*NMF is to find sparse structures in the basis functions that better explain the data set. The interpretation of the new factorization is then twofold: data can be faithfully reconstructed using additive combinations of a reduced set of factors and, at the same time, interpretation of the factors is easier due to the intuitive sparse, non-overlapping, parts-based representation of the data.

In order to get sparseness, the *ns*NMF model demands a smooth distribution of the factors. This is achieved by changing the model of equation 1 into:

**V** ≈ **WSH** Eq. 7

where **S** ∈ ℝ^{k×k} is a positive smoothness matrix defined as:

**S** = (1 − *θ*)**I** + (*θ*/*k*)**11**^{t} Eq. 8

where **I** is the identity matrix, **1** ∈ ℝ^{k×1} is a column vector of ones, the superscript ^{t} indicates vector transpose and the parameter *θ*, satisfying 0 ≤ *θ* ≤ 1, controls the sparseness of the model.
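The two limiting cases of the smoothness matrix can be checked numerically. A short NumPy sketch of the definition above:

```python
import numpy as np

def smoothing_matrix(k, theta):
    """S = (1 - theta) * I + (theta / k) * 1 1^t (the nsNMF smoothness matrix)."""
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
```

With *θ* = 0, **S** is the identity and **SX** = **X**; with *θ* = 1, **SX** is the constant vector holding the average of the entries of **X**, the smoothest (least sparse) possible vector.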

Finally the new objective function for the *ns*NMF model can be formally described as:

The interpretation of **S** as a smoothing matrix can be explained as follows. Let **X** be a positive, non-zero, vector. Consider the transformed vector **Y** = **SX**. If *θ* = 0, then **Y** = **X**, and no smoothing on **X** has occurred. However, as *θ* → 1, the vector **Y** tends to the constant vector with all elements almost equal to the average of the elements of **X**. This is the smoothest possible vector, in the sense of "non-sparseness", because all entries are equal to the same non-zero value, instead of having some values close to zero and others clearly non-zero. Note that when *θ* = 0, the model corresponds to the basic NMF.

Further insight into the nature of the new *ns*NMF model can be obtained from the dual interpretation of Eq. 7, which can be equivalently written as:

**V** ≈ (**WS**)**H** = **W**(**SH**)

Non-sparseness in the basis **W** will force sparseness in the encoding **H**. At the same time, non-sparseness in the encoding **H** will force sparseness in the basis **W**. Due precisely to the simultaneity of both conditions, sparseness is enforced on both the basis and the encoding parts. This property of *ns*NMF, its ability to extract local (sparse) patterns from the data, is the main motivation for using this algorithm for biclustering.

The new algorithm is straightforward to derive by taking partial derivatives of the functional in equation 9 with respect to **H** and **W** and setting them to zero, respectively [23]. As expected, for a given sparseness parameter value 0 ≤ *θ* ≤ 1, the final algorithm is a simple modification of the original, basic NMF algorithm given by Eqs. 4–6:

1. In Eq. 4 (update for **H**), substitute **W** by **WS**.

2. In Eq. 5 (update for **W**), substitute **H** by **SH**.

Intuitively, the algorithm obtained after minimizing the functional in equation 9 reflects the nature of *ns*NMF: to estimate the encoding vectors **H**, the algorithm takes into account the smoothed version of **W** (given by **WS**). In a similar manner, to estimate the values of the basis vectors **W**, the algorithm uses the smoothed version of **H** (given by **SH**). When these two conditions are applied simultaneously, sparse versions of **W** and **H** are obtained.
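The two substitutions can be sketched as a self-contained NumPy routine, mirroring the basic multiplicative updates with **W** → **WS** in the **H** step and **H** → **SH** in the **W** step. The update order, iteration count and small stabilizing constants are assumptions of this sketch, not prescriptions from the paper:

```python
import numpy as np

def nsnmf(V, k, theta=0.5, n_iter=200, seed=0):
    """nsNMF sketch: basic divergence-NMF updates with W replaced by WS
    when updating H, and H replaced by SH when updating W."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Smoothness matrix of Eq. 8; its columns sum to 1.
    S = (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
    W = rng.random((m, k)) + 1e-4
    H = rng.random((k, n)) + 1e-4
    W /= W.sum(axis=0)            # basis vectors normalized
    for _ in range(n_iter):
        # H step uses the smoothed basis WS (columns of WS also sum to 1).
        WS = W @ S
        R = WS @ H + 1e-9
        H *= WS.T @ (V / R)
        # W step uses the smoothed encoding SH.
        SH = S @ H
        R = W @ SH + 1e-9
        W *= (V / R) @ SH.T / SH.sum(axis=1)
        W /= W.sum(axis=0)
    return W, H
```

Setting `theta=0` recovers the basic NMF updates, since **S** reduces to the identity.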

It is important to mention that the parameter *θ*, which controls the sparseness of the model, cannot be estimated from the functional: it is a hyper-parameter. In this work we performed numerous empirical tests and found that a value of 0.5 leads to reasonable results without greatly affecting the explained variance of the model.

### Selection of the factorization rank

An important consideration in the application of *ns*NMF, as in the classical NMF model, is the selection of the number of factors needed to best represent the data. As a rule of thumb, this value is generally chosen so that (*n* + *m*)*k* < *nm*, and thus the product **WH** can be regarded as a compressed form of the data in **V** [39]. Nevertheless, this estimation is not informative enough to make a proper decision. Finding an appropriate value of *k* depends on the application and is mostly influenced by the nature of the dataset itself. It is intuitively evident that the more factors we use, the more detailed information we get. However, since the main goal of this application is to automatically extract a number of significant block-structures related to the main biological patterns in the dataset, it is important to use only a reduced set of factors that sufficiently explain the data without obscuring the bicluster information with too many details.

In addition, another important aspect to take into account when selecting the number of factors in *ns*NMF is that the sparseness of the model is less evident if only a few factors are used. This is a natural consequence of any factorization method: the fewer factors we use, the more informative each factor must be in order to explain the original data as well as possible.

In this work, we used the model selection method proposed by Brunet *et al*. [21] to estimate the value of *k*. They used the cophenetic correlation coefficient as a measure of the stability of the model for different values of *k* with respect to different random initial conditions. The values of *k* at which the cophenetic coefficient reaches its highest value, or begins to fall, reflect stability of the results with respect to differences in the initial conditions [21].
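The stability measure can be sketched as follows: each run yields a 0/1 connectivity matrix over samples (two samples are connected when the same factor dominates both columns of **H**), the matrices are averaged into a consensus matrix, and the cophenetic coefficient measures how well hierarchical clustering preserves the consensus distances. This is a simplified sketch of Brunet's procedure, assuming SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def connectivity(H):
    """Samples i and j are connected (entry 1) when the same factor
    has the largest coefficient in both columns of H."""
    labels = np.argmax(H, axis=0)
    return (labels[:, None] == labels[None, :]).astype(float)

def cophenetic_coefficient(connectivity_matrices):
    """Average the connectivity matrices from several runs into a
    consensus matrix and compute the cophenetic correlation of its
    induced distances under average-linkage clustering."""
    consensus = np.mean(connectivity_matrices, axis=0)
    dist = squareform(1.0 - consensus, checks=False)
    Z = linkage(dist, method="average")
    c, _ = cophenet(Z, dist)
    return c
```

A coefficient close to 1 indicates that sample assignments are stable across random initializations for that value of *k*.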

### Gene selection in basis experiments

Sorting the matrix by basis genes and basis experiments creates a natural ordination in which genes and samples are arranged based on their association with a given local pattern. The challenge is then to determine the partition corresponding to the set of genes and experiments that best define the local feature captured by the algorithm. One of the advantages of the factorization model we are using is that the sparse nature of the *ns*NMF algorithm reinforces those genes and experiments that significantly sustain a factor while masking those that do not add any value to it. However, if a small number of factors is used, the *ns*NMF algorithm will try to explain the data in the best possible way, consequently producing a large set of genes and conditions in the resulting biclusters. This situation is especially evident in the case of genes, due to the high-dimensional gene space of this type of matrices. Therefore, an additional selection of the most representative genes, out of the sorted list produced by the algorithm, is still needed.

In related works, different criteria have been applied to select the most representative genes in each factor. For example, a threshold can be imposed on the gene coefficients to obtain a reduced number of genes in each factor [22], or a fixed number of genes can simply be selected in each factor [45]. In this work, we defined as factor-specific genes those genes that show high coefficients for a given factor and, at the same time, low coefficients for the rest. Operationally, this was achieved by sorting the genes in descending order by their coefficients in a given column of **W** (column *j*) and selecting only the first consecutive genes from the sorted list whose highest entry in **W** was the coefficient in column *j*. This procedure was repeated for each column of **W**, and the set of genes selected in each case defines a gene expression module.
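The selection rule can be sketched in a few lines of NumPy (gene indices stand in for gene identifiers):

```python
import numpy as np

def gene_modules(W):
    """For each column j of W, walk the genes in descending order of
    their coefficient in column j and keep the leading run of genes
    whose overall maximum coefficient lies in column j."""
    m, k = W.shape
    owner = W.argmax(axis=1)          # factor with the highest coefficient per gene
    modules = []
    for j in range(k):
        order = np.argsort(-W[:, j])  # genes sorted by descending coefficient in column j
        module = []
        for g in order:
            if owner[g] != j:         # first gene owned by another factor ends the module
                break
            module.append(g)
        modules.append(module)
    return modules
```

Each returned list is one gene expression module, one per factor.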

### Consistency of gene modules

The set of genes contained in each module can show variations across different runs of the algorithm because of differences in the initial conditions. Furthermore, although the rank used in the factorization is highly related to the cluster structure of the conditions, some groups of conditions may not be characterized by a well-defined gene expression signature, and the set of genes belonging to the corresponding module can vary among different runs of the algorithm. We can assume that a gene module is consistent, and represents a coherent structure, when it can be recovered independently of the initial random conditions. To identify consistent modules, we ran the algorithm 100 times and selected for further analysis the set of results obtained in the factorization with the largest explained variance. We then evaluated the repetition of the *k* gene modules obtained in this factorization across the remaining 99 factorizations by comparing the genes contained in each module. We considered two modules similar, and therefore repeated, when they shared 75% of their genes (90% in the case of synthetic data). A module was then considered consistent when it was found in more than 80% of the factorizations.
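The consistency check can be sketched as follows. The text does not state against which module size the 75% overlap is measured, so this sketch measures it against the smaller of the two modules (an assumption):

```python
def modules_overlap(a, b, frac=0.75):
    """Two modules are 'similar' when they share at least `frac` of their
    genes; overlap is measured against the smaller module (assumption)."""
    a, b = set(a), set(b)
    if not a or not b:
        return False
    return len(a & b) >= frac * min(len(a), len(b))

def consistency(reference_modules, other_runs, frac=0.75, min_rate=0.8):
    """A reference module is consistent when a similar module appears in
    at least `min_rate` of the other factorizations."""
    consistent = []
    for mod in reference_modules:
        hits = sum(any(modules_overlap(mod, m, frac) for m in run)
                   for run in other_runs)
        consistent.append(hits >= min_rate * len(other_runs))
    return consistent
```

Here `reference_modules` comes from the best factorization and `other_runs` is the list of module sets from the remaining runs.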

### Datasets and data preprocessing

#### Simulated data

Synthetic datasets containing different overlapping and non-overlapping structures were generated to assess the performance of our method. Homogeneous block-structures were generated from random numbers with an N(3,1) distribution. Background noise consisted of a 100 × 20 random matrix with an N(0,1) distribution. The first dataset (A) contains two non-overlapping patterns of sizes 20 × 5 and 25 × 8, respectively. The second dataset (B) contains three overlapping patterns of sizes 10 × 8, 15 × 9 and 20 × 5, respectively, and one non-overlapping pattern of size 10 × 3.
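A sketch of how dataset A can be generated. The text fixes only the block sizes and distributions, so the block positions used here are assumptions:

```python
import numpy as np

def make_dataset_a(seed=0):
    """Synthetic dataset A: 100 x 20 N(0,1) background with two
    non-overlapping N(3,1) blocks of sizes 20 x 5 and 25 x 8.
    Block positions are illustrative, not taken from the paper."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, 1.0, size=(100, 20))
    V[0:20, 0:5] = rng.normal(3.0, 1.0, size=(20, 5))     # first pattern
    V[30:55, 8:16] = rng.normal(3.0, 1.0, size=(25, 8))   # second pattern
    return V
```

Dataset B would be built the same way, with three block assignments chosen to overlap in rows and/or columns.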

#### Human tissue dataset

Su *et al*. used human Affymetrix high-density oligonucleotide arrays to determine the gene expression profiles of 79 different human tissue samples and cell lines [6]. Data containing gene expression levels monitored with the Human U133 Affymetrix GeneChip across human tissues were obtained from the Gene Expression Omnibus repository [46]. Samples and replicates representing the same tissue or cell line were averaged. To exclude genes with minimal variation across samples, we removed those probe sets that did not satisfy the max − min > 1000 and max/min > 10 thresholds (max and min represent the maximum and minimum expression values of each gene, respectively). Additionally, we also eliminated those genes that did not show an expression value greater than or equal to 1000 in at least 5 conditions.
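The variation filter can be sketched as follows (a NumPy sketch; the strictness of the comparisons and the guard against a zero minimum are assumptions):

```python
import numpy as np

def filter_genes(X, diff=1000.0, ratio=10.0, level=1000.0, min_cond=5):
    """Keep rows (probe sets) whose expression satisfies
    max - min > diff AND max / min > ratio, and that reach at least
    `level` in at least `min_cond` conditions."""
    mx = X.max(axis=1)
    mn = X.min(axis=1)
    # guard against division by zero when a gene has a zero minimum
    varies = (mx - mn > diff) & (mx / np.maximum(mn, 1e-9) > ratio)
    expressed = (X >= level).sum(axis=1) >= min_cond
    return X[varies & expressed]
```

Rows here are probe sets and columns are the averaged tissue conditions.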

#### Soft-tissue tumor dataset

This dataset was generated by Nielsen *et al*. [24] and contains expression profiles of different soft-tissue tumor types, including 8 gastrointestinal stromal tumors, 8 monophasic synovial sarcomas, 4 liposarcomas, 11 leiomyosarcomas, 8 malignant fibrous histiocytomas and 2 benign peripheral nerve-sheath tumors (schwannomas). They analyzed a gene expression dataset containing expression levels of 5520 well-defined genes across 46 samples (five tumor samples were hybridized in duplicate). We applied our method to the analysis of this tumor dataset, which is available at [47]. Genes with absent values in more than 6 samples were filtered out, and the remaining missing values were filled in using the *k*-nearest neighbours approach (*k* = 10) [48].
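The imputation step can be sketched as a simplified version of the *k*-nearest neighbours idea of [48] (not the original implementation): each missing value is replaced by the average of that column over the *k* genes closest to the incomplete gene, with distances computed over co-observed columns.

```python
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs row by row: for a missing entry (i, j), average column j
    over the k rows nearest to row i (RMS distance on shared columns)."""
    X = X.copy()
    n_rows = len(X)
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        for j in np.where(np.isnan(X[i]))[0]:
            # candidate neighbours: other genes observed in column j
            cand = np.where(~np.isnan(X[:, j]) & (np.arange(n_rows) != i))[0]
            d = []
            for c in cand:
                both = ~np.isnan(X[i]) & ~np.isnan(X[c])
                d.append(np.sqrt(np.mean((X[i, both] - X[c, both]) ** 2)))
            nearest = cand[np.argsort(d)[:k]]
            X[i, j] = np.nanmean(X[nearest, j])
    return X
```

The prior filtering step (dropping genes missing in more than 6 samples) guarantees that enough neighbours remain for each imputation.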

After pre-processing, each gene expression dataset was normalized as described by Getz *et al*. [16] and exponentially scaled to fit the positivity constraints of the model.

### Analysis of biological annotation enrichment

To evaluate the enrichment of functional annotations in the set of genes contained in each module, we used the Onto-Express tool [49]. This tool uses the hypergeometric test to estimate the statistical significance of the enrichment of a given GO term in a gene list with respect to a reference list. We used the "biological process" category of the GO ontology to assign a biological meaning to each module, taking the full set of genes in the array as the reference list. Categories with *p*-values below 0.01 were considered statistically significant.
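The underlying hypergeometric test can be sketched with SciPy (a sketch of the statistic only; Onto-Express also handles the GO annotation mapping, which is not reproduced here):

```python
from scipy.stats import hypergeom

def go_enrichment_pvalue(n_annotated_in_module, module_size,
                         n_annotated_total, n_genes_total):
    """P-value of observing at least `n_annotated_in_module` genes carrying
    a GO term inside a module, given the term's frequency in the reference
    list (upper-tail hypergeometric test)."""
    # survival function at k - 1 gives P(X >= k)
    return hypergeom.sf(n_annotated_in_module - 1, n_genes_total,
                        n_annotated_total, module_size)
```

For example, a module of 5 genes all annotated with a term carried by only 5 of 10 reference genes yields a p-value of 1/252, well below the 0.01 cut-off.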