 Research
 Open Access
A Bregmanproximal point algorithm for robust nonnegative matrix factorization with possible missing values and outliers  application to gene expression analysis
 Stéphane Chrétien†^{1}Email author,
 Christophe Guyeux†^{2},
 Bastien Conesa^{3},
 Régis DelageMouroux^{4},
 Michèle Jouvenot^{4},
 Philippe Huetz^{5} and
 Françoise Descôtes^{6}
https://doi.org/10.1186/s1285901611208
© The Author(s) 2016
 Published: 31 August 2016
Abstract
Background
NonNegative Matrix factorization has become an essential tool for feature extraction in a wide spectrum of applications. In the present work, our objective is to extend the applicability of the method to the case of missing and/or corrupted data due to outliers.
Results
An essential property for missing data imputation and detection of outliers is that the uncorrupted data matrix is low rank, i.e. has only a small number of degrees of freedom. We devise a new version of the Bregman proximal idea which preserves nonnegativity and mix it with the Augmented Lagrangian approach for simultaneous reconstruction of the features of interest and detection of the outliers using a sparsity promoting ℓ _{1} penality.
Conclusions
An application to the analysis of gene expression data of patients with bladder cancer is finally proposed.
Keywords
 Feature extraction
 Nonnegative matrix factorization
 Gene expression analysis
 Outliers and missing data
Background
The method was first explored by Lee and Seung [8] in the late 90’s and it then enjoyed a significant growth of interest in many application fields and especially machine learning. There exists a wide variety of methods for computing the NMF. One most employed strategy is the famous alternating minimization scheme, which consists in successively minimizing in U and then in V. Notice that minimization in U (resp. V) is a convex and easy optimization problem. Furthermore, it has been observed as quite efficient in practice. The main drawback of this approach however, is that no convergence garantee towards a global minimizer has been proved so far. Moreover, handling the nonnegativity constraints appears to be cumbersome is certain settings and the convergence speed of the method depends on the way these constraints are incorporated into the iterations. The work described in is [9] is a very interesting contribution to the study of potential convexifications for the NMF problem. It uses certain separability assumptions. Separability is the property that the features are some data vectors already belonging to the sample. Following shortly after, [10] proposed an efficient approach based on linear programming which also relied on separability. Recently, under similar assumptions, [11] proposed a very simple approach based on successive projections. When separability holds the above algorithms are the methods of choice for NMF. Unfortunately, separability does not hold in very important cases, and there is still a lot of work to do in order to understand the performance guarantees of the existing algorithms for NMF. Back to the not necessarily separable case, [12, 13] proposed Bregman divergence based iterative methods for NMF. Bregmandivergence based proximal approaches have been the subject of great interest recently due to good practical performances and connection with mirror descent type algorithms (see for instance the survey [14]).
In the next section, the Bregman proximal scheme is presented and in the subsequent Section, a version taking into account potential outliers and/or missing data is described in full details. The choice of the relaxation parameter is also addressed. An application to the analysis of gene expression data of patients with bladder cancer is proposed in the last Section.
Method
A Bregman proximal scheme for nonnegative matrix factorization
The space alternating Bregmanproximal scheme
Since no explicit solution to this decoupled system of equations, we will use a fixed point approach defined as follows.
 1.
Take U ^{(k+1,0)}=U ^{(k)}.
 2.\(\forall \: l \in \mathbb {N}^{*}\), define$$ \begin{aligned} U_{ij}^{(k+1,l+1)} & = \exp\left(\frac{1}{\rho}\left[\left(MU^{(k+1,l)}V^{t}\right)V\right]_{i,j}\right.\\ &\left.\quad+\ln{U_{ij}^{k}}\right). \end{aligned} $$(0.6)
 3.
Stop when the difference between two successive iterates is sufficiently small, e.g., less that 1e3. Denote by l ^{∗} the iteration number when this occurs and output \(U^{(k+1)}=U^{(k+1,l^{*})}\phantom {\dot {i}\!}\).
A toy numerical experiment
At this point, we can see on a toy example that the method converges, but it is obviously not proven in the general case. However, the present state of knowledge in the analysis of numerical algorithms for NMF is still lacunary and the case of outliers is therefore even more out of reach for the moment. The only case where a method has been proposed that finds an optimal solution in polynomial time is the case of separable data (see [18]). We were not able to prove that the assumptions are satisfied by our data set. Proving convergence of the method to a stationary point is not convincing either for practical purposes and is out of scope for the present study. To our opinion, and in view of the state of the art, showing a nice behavior of the algorithm in a toy example can be a sufficient practical evidence that the method has a stable behavior, which is what the practitioners want to know before going into more details. The convergence analysis of the algorithm will be studied in a later project and we hope to obtain a more precise theoretical understanding in a near future.
The case of outliers and missing data
with \(\left \S\right \_{1} = \sum _{i,j} \left S_{ij}\right \), and λ is a relaxation parameter whose value is discussed in Section 6.
The augmented Lagrange function
We now introduce an Alternating Direction Method of Multipliers. This method consists of solving iteratively in all the variables one after the other, and then updating the dual variable. For this purpose, we compute in the next subsections the optimum value for the problem of minimizing the augmented Lagrange function with respect to each variable.
Individual minimization subproblems in Y, S, U, and V
Minimization with respect to Y.
Minimization with respect to S.
This can be performed by optimizing each component of S independently of the other. As for the case of minimizing with respect to Y, we distinguish between two cases, while if (i,j)∉Ω one easily checks that \(S_{ij}^{*}=0\). If (i,j)∈Ω, we will use the following result.
Theorem 0.1.
Minimization with respect to U and V.
We just have to use the fixed point subroutine given by (0.6).
Our Bregman proximaltype ADMM
Note that mean imputation is a widely used approach for dealing with missing data. This is also the most basic one. One of the most efficient method for missing data is the proposal of [19]. However, this latter is based on standard multivariate analysis. It does not take into account the nonnegativity of the data, and it does not address the joint problem of extracting relevant features.
 1.
Set S=S ^{(k)},U=U ^{(k)},V=V ^{(k)},Λ=Λ ^{(k)} and obtain Y ^{(k+1)}=Y ^{∗} given by (0.11),
 2.
Set Y=Y ^{(k+1)},U=U ^{(k)},V=V ^{(k)},Λ=Λ ^{(k)} and obtain S ^{(k+1)}=S ^{∗} given by (0.15),
 3.
Set Y=Y ^{(k+1)},S=S ^{(k+1)},U=U ^{(k)},Λ=λ ^{(k)} and obtain U ^{(k+1)} using the fixed point method (0.6)
 4.
Set Y=Y ^{(k+1)},S=S ^{(k+1)},U=U ^{(k+1)},Λ=Λ ^{(k)} and obtain U ^{(k+1)} using the fixed point method (0.6).
 5.
Set Λ ^{(k+1)}=Λ ^{(k)}+M−Y−S
Choosing the value of λ
The choice of the parameter λ is crucial for the good performances of the proposed method. We performed a selection of λ using the following approach.
 1.
Propose an a priori range of values for λ such that its maximum values leads to S equals the null matrix at optimality.
 2.For each value of λ, select a set \(\mathcal S\) of s entries chosen uniformly at random in M and consider them as missing data temporarily.
 (a)
Find the solution of (0.7), where the missing data incorporate the set of entries which were artificially declared as missing in the previous step.
 (b)
Compute the average squared error on the data artificially declared as missing. Denote this quantity by e r r _{ λ }.
 (a)
 3.
Select λ in the prescribed range as the one which minimizes the average squared error e r r _{ λ }.
Results
Description of the data
In this section, we apply our method to bladder cancer expression data. The number of new patients affected by bladder cancer in 2013 attained 10,000 in France, thus improving the diagnosis of bladder cancer is a Public Health priority. Determining the genes responsible for bladder cancer would undoubtedly permit to design an efficient and adapted set of medical treatments.
First of all the treatment should depend on the advancement of the tumor. For this purpose, researchers have gathered important gene expression data and the corresponding state of the malignant tumor in the bladder of 100 patients in the Lyon region (France), as described in [20]. The prospective multicentre study has been performed between September 2007 and May 2008, it formerly included 108 bladder tumours (45 pTa, 35 pT1 and 28 >pT1). In this study, 34 genes have been selected from the lists provided by the Biometric Research Branch class comparison analyses. For this purpose, researchers have gathered important gene expression data and the corresponding state of the malignant tumor in the bladder of 100 patients in the Lyon region (France), as described in [20]. From the lists of genes provided by the Biometric Research Branch class comparison analyses [19], the microarray results of 34 selected differentially expressed genes were analyzed for validation using realtime quantitative PCR in other bladder tumour cohort.

TVNIM : noninfiltrating tumors;

TVIM : infiltrating tumors.
The tumor states have been classified into the following groups:

Ta : noninfiltrating tumor in Urothelium;

T1a : noninfiltrating tumor in Urothelium and parts of the chorion;

T1b : noninfiltrating tumor in Urothelium and the full chorion;

>T1 : infiltrating tumor.
In the standard classification, the last group of the list incorporates states T2 to T4b.
Experimental results
The first subplot of Fig. 4 represents the matrix S after convergence, while the second subplot is the matrix V ^{ t }. The third subplot, for its part, represents the distance in Frobenius norm between two successive iterates of Λ. Finally, the fourth subplot represents the evolution of the relative error between M and its NMF \(U^{(k)}V^{(k)^{t}}\phantom {\dot {i}\!}\).

pTa mostly consists of 3 subgroups indexed by {1,4,6};

pT1a mostly consists of 3 subgroups indexed by {4,5};

pT1b mostly consists of only one group, which is {5};

finally, >pT1 mostly consists of 5 subgroups indexed by {2,3,5,7,8}.
The intersection of the index subsets between two adjacent states is always a singleton, up to a discarded minority of individuals. The cluster indexed by 5 appears at medium to serious levels. The lowest level is characterized by cluster 6 while the most serious level is characterized by the more significant appearance of clusters 2,3,7 and 8.
Comparison using a Gaussian mixture model selection
The problem of choosing the number of clusters K a priori is a difficult one. This is usually done by comparing the penalized maximum likelihood values for different values of K and choosing the maximum one. Model selection can be performed too using the Bayesian Information Criterion (BIC). This criterion is the opposite of the maximum likelihood value penalized with log(n)× the number of real parameters to estimate.
Some more precise investigations should now be performed in order to understand the biological meaning of these clusters, i.e., to understand the factors of gravity in this cancer.
Conclusion
In this article, a new way to find a relevant dictionary for extracting the relevant features in a given dataset has been presented, in an original context of missing values and outliers. The wellknown Nonnegative Matrix Factorization (NMF) method has been extended on denoised data, where missing values have been guessed and outliers have been detected, leading to a mixture of Bregman proximal methods and the of Augmented Lagrangian scheme. Finally, an application to the analysis of gene expression data of patients with bladder cancer has been provided for illustration purpose.
Declarations
Acknowledgements
The first and second authors were funded by the grant Eléments Transposables from the Région FrancheComté. The first author would like to acknowledge the help of Haokun Li for the preparation of the manuscript. The work of Regis Delage Mouroux and Michele Jouvenot was supported by the Région FrancheComté and by the Ligue Contre le Cancer (CCIRGE). P. Huetz acknowledges the Montbéliard and Besançon Leagues against Cancer for financial support.
Declarations
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 8, 2016. Selected articles from the 11th International Symposium on Bioinformatics Research and Applications (ISBRA ’15): bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume17supplement8.
Funding
Publication costs for this work were funded by FEMTOST, Département DISC and the AND team, the Laboratoire de Mathematiques de Besancon, the National Physical Laboratory, the Région FrancheComtéx and by the Ligue Contre le Cancer (CCIRGE).
Availability of data and materials
The code is available by email request to stephane.chretien@npl.co.uk. The data sets are available by email request to francoise.descotes@chulyon.fr.
Authors’ contributions
SC proposed the Bregman proximal method, wrote the initial Matlab code and performed the preliminary experiments. CG did the experimental study, performed the comparison with Sparse PCA and wrote a large part of the paper. BC and PH perfomed large scale numerical experiments and verifications. RDM and MJ contributed the biological interpretation. FD produced the dataset and supervised methodology for the biological part of the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Xu W, Liu X, Gong Y. Document clustering based on nonnegative matrix factorization. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM: 2003. p. 267–73.Google Scholar
 Berry MW, Browne M. Email surveillance using nonnegative matrix factorization. Comput Math Organ Theory. 2005; 11(3):249–64.View ArticleGoogle Scholar
 Jia S, Qian Y. Constrained nonnegative matrix factorization for hyperspectral unmixing. Geosci Remote Sens IEEE Trans. 2009; 47(1):161–73.View ArticleGoogle Scholar
 Guillamet D, Vitria J. Nonnegative matrix factorization for face recognition. In: Topics in Artificial Intelligence. Springer: 2002. p. 336–44.Google Scholar
 Chan TH, Ma WK, Chi CY, Wang Y. A convex analysis framework for blind separation of nonnegative sources. Signal Process IEEE Trans. 2008; 56(10):5120–34.View ArticleGoogle Scholar
 Kim H, Park H. Sparse nonnegative matrix factorizations via alternating nonnegativityconstrained least squares for microarray data analysis. Bioinformatics. 2007; 23(12):1495–502.View ArticlePubMedGoogle Scholar
 Li Y, Sima DM, Cauter SV, Sava C, Anca R, Himmelreich U, Pi Y, Van Huffel S. Hierarchical nonnegative matrix factorization (hnmf): a tissue pattern differentiation method for glioblastoma multiforme diagnosis using mrsi. NMR Biomed. 2013; 26(3):307–19.View ArticlePubMedGoogle Scholar
 Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999; 401(6755):788–91.View ArticlePubMedGoogle Scholar
 Esser E, Möller M, Osher S, Sapiro G, Xin J. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. Image Process IEEE Trans. 2012; 21(7):3239–52.View ArticleGoogle Scholar
 Recht B, Re C, Tropp J, Bittorf V. Factoring nonnegative matrices with linear programs. In: Advances in Neural Information Processing Systems: 2012. p. 1214–22.Google Scholar
 Gillis N, Vavasis SA. Fast and robust recursive algorithmsfor separable nonnegative matrix factorization. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2014; 36(4):698–714.View ArticleGoogle Scholar
 Sra S, Dhillon IS. Generalized nonnegative matrix approximations with bregman divergences. In: Advances in Neural Information Processing Systems: 2005. p. 283–90.Google Scholar
 Li L, Lebanon G, Park H. Fast bregman divergence nmf using taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: 2012. p. 307–15.Google Scholar
 Bubeck S. Convex optimization: Algorithms and complexity. Foundations and Trends®; in Machine Learning. 2015; 8(34):231–357.View ArticleGoogle Scholar
 Candès EJ, Li X, Ma Y, Wright J. Robust principal component analysis?J ACM (JACM). 2011; 58(3):11.View ArticleGoogle Scholar
 Zhou T, Tao D. Godec: Randomized lowrank & sparse matrix decomposition in noisy case. In: International Conference on Machine Learning. Omnipress: 2011.Google Scholar
 Parikh N, Boyd SP. Proximal algorithms. Foundations Trends Optim. 2014; 1(3):127–239.View ArticleGoogle Scholar
 Gillis N. The why and how of nonnegative matrix factorization. Regularization Optim Kernels Support Vector Mach. 2014; 12:257.Google Scholar
 Husson F, Josse J. missmda: Handling missing values with/in multivariate data analysis (principal component methods). R package version. 2010; 1(2).Google Scholar
 Descotes F, Dessen P, Bringuier PP, Decaussin M, Martin PM, Adams M, Villers A, Lechevallier E, Rebillard X, RodriguezLafrasse C, et al.Microarray gene expression profiling and analysis of bladder cancer supports the subclassification of t1 tumours into t1a and t1b stages. BJU Intl. 2014; 113(2):333–42.View ArticleGoogle Scholar