Nonnegative matrix factorization by maximizing correntropy for cancer clustering
 Jim JingYan Wang^{1},
 Xiaolei Wang^{1} and
 Xin Gao^{1, 2}
https://doi.org/10.1186/1471-2105-14-107
© Wang et al.; licensee BioMed Central Ltd. 2013
Received: 16 February 2012
Accepted: 8 March 2013
Published: 24 March 2013
Abstract
Background
Nonnegative matrix factorization (NMF) has been shown to be a powerful tool for clustering gene expression data, which are widely used to classify cancers. NMF aims to find two nonnegative matrices whose product closely approximates the original matrix. Traditional NMF methods minimize either the l_{2} norm or the Kullback-Leibler distance between the product of the two matrices and the original matrix. Correntropy was recently shown to be an effective similarity measurement due to its stability to outliers or noise.
Results
We propose a maximum correntropy criterion (MCC)-based NMF method (NMF-MCC) for gene expression data-based cancer clustering. Instead of minimizing the l_{2} norm or the Kullback-Leibler distance, NMF-MCC maximizes the correntropy between the product of the two matrices and the original matrix. The optimization problem can be solved by an expectation conditional maximization algorithm.
Conclusions
Extensive experiments on six cancer benchmark sets demonstrate that the proposed method is significantly more accurate than the state-of-the-art methods in cancer clustering.
Background
Because cancer has been a leading cause of death in the world for several decades, the classification of cancers is becoming more and more important to cancer treatment and prognosis [1, 2]. With advances in DNA microarray technology, it is now possible to monitor the expression levels of a large number of genes at the same time. There have been a variety of studies on analyzing DNA microarray data for cancer class discovery [3-5]. Such methods have been demonstrated to outperform the traditional, morphological appearance-based cancer classification methods. In such studies, different cancer classes are discriminated by their corresponding gene expression profiles [1].
Several clustering algorithms have been used to identify groups of similarly expressed genes. Nonnegative matrix factorization (NMF) was recently introduced to analyze gene expression data, and this method demonstrated superior performance in terms of both accuracy and stability [6-8]. Gao and Church [3] reported an effective unsupervised method for cancer clustering with gene expression profiles via sparse NMF (SNMF). Carmona et al. [9] presented a methodology based on nonsmooth nonnegative matrix factorization (nsNMF) that was able to cluster closely related genes and conditions in sub-portions of the data and to identify localized patterns in large datasets. Zheng et al. [5, 7] applied penalized matrix decomposition (PMD) to extract metasamples from gene expression data, which could capture the inherent structures of samples that belonged to the same class.
NMF approximates a given gene data matrix, X, as a product of two low-rank nonnegative matrices, H and W, as X≈HW. This is usually formulated as an optimization problem, where the objective is to minimize either the l_{2} norm or the Kullback-Leibler (KL) distance [10] between X and HW. Most of the improved NMF algorithms are also based on the minimization of these two distances while adding a sparseness term [3], a graph regularization term [11], etc. Sandler and Lindenbaum [12] argued that measuring the dissimilarity of X and HW by either the l_{2} norm or the KL distance, even with additional bias terms, was inappropriate in computer vision applications due to the nature of errors in images. They proposed a novel NMF with the earth mover's distance (EMD) metric by minimizing the EMD error between X and HW. The proposed NMF-EMD algorithm demonstrated significantly improved performance in two challenging computer vision tasks, i.e., texture classification and face recognition. Liu et al. [4] tested a family of NMF algorithms using α-divergence with different α values as dissimilarities between X and HW for clustering cancer gene expression data.
It is widely acknowledged that DNA microarray data contain many types of noise, especially experimental noise. Recently, correntropy was shown to be an effective similarity measurement in information theory due to its stability to outliers or noise [13]. However, it has not been used in the analysis of microarray data. In this paper, we propose a novel form of NMF that maximizes the correntropy. We introduce a new NMF algorithm with a maximum correntropy criterion (MCC) [13] for the gene expression data-based cancer clustering problem. We call it NMF-MCC. The goal of NMF-MCC is to find a metasample matrix, H, and a coding matrix, W, such that the gene expression data matrix, X, is as correlative to the product of H and W as possible under MCC.
Related works
He et al. [13] recently developed a face recognition algorithm, correntropy-based sparse representation (CESR), based on MCC. CESR tries to find a group of sparse combination coefficients to maximize the correntropy between the facial image vector and the linear combination of faces in the database. He et al. [13] demonstrated that CESR was much more effective in dealing with the occlusion and corruption problems of face recognition than the state-of-the-art methods. However, CESR learns only the combination coefficients while the basis faces (the faces in the database) are fixed. Compared to CESR, NMF-MCC can learn both the combination coefficients and the basis vectors jointly, which allows the algorithm to obtain more basis vectors for better representation of the data points. Zafeiriou and Petrou [14] addressed the problem of NMF with kernel functions instead of inner products and proposed the projected gradient kernel nonnegative matrix factorization (PGKNMF) algorithm. Both NMF-MCC and PGKNMF employ kernel functions to map the linear data space to a nonlinear space. However, as we show later, NMF-MCC computes different kernels for different features, while PGKNMF computes a single kernel for the whole feature vector. Thus, NMF-MCC can assign different weights to different features and emphasize the discriminant features with high weights, thereby achieving feature selection. In contrast, like most kernel-based methods, PGKNMF simply replaces the inner product by the kernel function and treats the features equally, so it performs no feature selection.
Methods
In this section, we first briefly introduce the traditional NMF method. We then propose our novel NMF-MCC algorithm by maximizing the correntropy in NMF. We further propose an expectation conditional maximization-based approach to solve the optimization problem.
Nonnegative matrix factorization
The factorization is quantified by an objective function that minimizes some distance measure, such as:

l_{2}-norm distance: One simple measure is the square of the l_{2} norm distance (also known as the Frobenius norm or the Euclidean distance) between two matrices, which is defined as: $F^{l_{2}}=\sum_{d=1}^{D}\sum_{n=1}^{N}\left(X_{dn}-\sum_{k=1}^{K}H_{dk}W_{kn}\right)^{2}.$ (2)

Kullback-Leibler (KL) divergence: The second one is the divergence between two matrices [10], which is defined as: $F^{KL}=\sum_{d=1}^{D}\sum_{n=1}^{N}\left(X_{dn}\ln\frac{X_{dn}}{(HW)_{dn}}-X_{dn}+(HW)_{dn}\right).$ (3)
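As an illustration of the two objectives in (2) and (3), the following is a minimal NumPy sketch that evaluates both for a given factorization. The function names `l2_objective` and `kl_objective` and the small constant `eps` (added to avoid taking the logarithm of zero) are assumptions made for this example and are not part of the original method description.

```python
import numpy as np

def l2_objective(X, H, W):
    """Squared l2 (Frobenius) distance between X and HW, as in (2)."""
    R = X - H @ W
    return float(np.sum(R ** 2))

def kl_objective(X, H, W, eps=1e-12):
    """KL-type divergence between X and HW, as in (3).

    eps is an illustrative safeguard against log(0) and division by zero.
    """
    P = H @ W
    return float(np.sum(X * np.log((X + eps) / (P + eps)) - X + P))

# Toy data: X is exactly HW, so both objectives vanish.
rng = np.random.default_rng(0)
H = rng.random((6, 2))   # D x K metasample matrix
W = rng.random((2, 5))   # K x N coding matrix
X = H @ W                # D x N data matrix

# Shifting every entry of X by 0.1 makes both objectives positive.
X_shifted = X + 0.1
```

Both functions return zero when X equals HW exactly and grow as the approximation degrades, which is the behaviour that the minimization in traditional NMF relies on.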
Maximum correntropy criterion for NMF
The correntropy between two random variables x and y is defined as $V_{\sigma}(x,y)=E\left[k_{\sigma}(x-y)\right]$, where k_{σ} is a kernel that satisfies Mercer's theorem and E[·] is the expectation. This definition does not require the kernel to be Gaussian; one of the common choices of k_{σ} is the Gaussian kernel given as $k_{\sigma}(x-y)=\exp\left(-\frac{(x-y)^{2}}{2\sigma^{2}}\right)$.
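To make the robustness claim concrete, here is a small sketch that estimates correntropy with the Gaussian kernel by replacing the expectation with a sample mean, and compares it with the mean squared error when a single outlier is present. The sample-mean estimator, the function names, and the toy vectors are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel k_sigma evaluated at the difference e = x - y."""
    return np.exp(-(e ** 2) / (2.0 * sigma ** 2))

def correntropy(x, y, sigma=1.0):
    """Sample-mean estimate of E[k_sigma(x - y)]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(x - y, sigma)))

def mean_squared_error(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))

# Two identical vectors, then one entry of y is corrupted by a large outlier.
x = np.zeros(100)
y = np.zeros(100)
y[0] = 1000.0

mse = mean_squared_error(x, y)   # dominated by the single outlier
corr = correntropy(x, y)         # the outlier contributes almost nothing
```

In this toy case the single corrupted entry drives the mean squared error to 10000, whereas the correntropy only drops from 1 to 0.99, because the kernel value of a very large difference is essentially zero.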
We can see that the kernel is applied to the entire feature vector, x, and each feature x_{d}, d=1,⋯,D is treated equally with the same kernel parameter. However, in (7), kernel functions are applied to different features. This allows the algorithm to learn different kernel parameters, as we will introduce later. In this way, we can assign different weights to different features and thus implement feature selection.
We should note the significant difference between NMF-MCC and CESR. As a supervised learning algorithm, CESR represents a test data point, x_{t}, as a linear combination of all the training data points as $x_{t}\approx\sum_{n=1}^{N}x_{n}w_{nt}=Xw_{t}$, where w_{t}=[w_{1t},⋯,w_{Nt}]^{⊤} is the combination coefficient vector. CESR aims to find the optimal w_{t} to maximize the correntropy between x_{t} and Xw_{t}. Similarly, NMF-MCC also tries to represent a data point x_{n} as a linear combination of some basis vectors as $x_{n}\approx\sum_{k=1}^{K}h_{k}w_{kn}=Hw_{n}$, where w_{n}=[w_{1n},⋯,w_{Kn}]^{⊤} is the combination coefficient vector. Unlike CESR, NMF-MCC aims to find not only the optimal w_{n} but also the basis vectors in H to maximize the correntropy between x_{n} and Hw_{n}, n=1,⋯,N. The essential difference between NMF-MCC and CESR lies in whether the basis vectors are learned or not.
In order to solve the optimization problem, we recognize that the expectation conditional maximization (ECM) method [19] can be applied. Based on the theory of convex conjugate functions [20], we can derive the following proposition that forms the basis to solve the optimization problem in (9):
Proposition 1
and for a fixed z, the supremum is reached at ϱ=−g(z,σ).
where φ is the convex conjugate function of g(z) defined in Proposition 1, and ρ=[ρ_{1},⋯,ρ_{D}]^{⊤} are the auxiliary variables.
That is, maximizing F(H,W) is equivalent to maximizing the augmented function $\hat{F}(H,W,\mathit{\rho})$.
The NMF-MCC Algorithm
 1. E-step: Compute ρ given the current estimates of the metasample matrix H and the coding matrix W as: $\rho_{d}^{t}=-g\left(\sqrt{\sum_{n=1}^{N}\left(x_{dn}-\sum_{k=1}^{K}h_{dk}^{t}w_{kn}^{t}\right)^{2}},\,\sigma^{t}\right),$ (14) where t denotes the t-th iteration. In this study, the kernel size (bandwidth) ${\sigma^{2}}^{t}$ is computed by $ {\sigma^{2}}^{t}=\frac{\theta}{2D}\sum_{d=1}^{D}\sum_{n=1}^{N}\left(x_{dn}-\sum_{k=1}^{K}h_{dk}^{t}w_{kn}^{t}\right)^{2},$ (15)
where θ is a parameter to control the sparseness of $\rho_{d}^{t}$.
 2. CM-steps: In the CM-steps, given $\rho_{d}^{t}$, we optimize the following function with respect to H and W: $\begin{array}{rl}(H^{t+1},W^{t+1})=&\underset{H,W}{\arg\max}\ \sum_{d=1}^{D}\left(\rho_{d}^{t}\sum_{n=1}^{N}\left(x_{dn}-\sum_{k=1}^{K}h_{dk}w_{kn}\right)^{2}\right)\\ =&\underset{H,W}{\arg\max}\ \mathrm{Tr}\left[(X-HW)^{\top}\mathrm{diag}(\rho^{t})(X-HW)\right]\\ &\mathrm{s.t.}\quad H\ge 0,\quad W\ge 0,\end{array}$ (16)
where diag(·) is an operator that converts the vector ρ to a diagonal matrix.
By introducing a dual objective function, $\begin{array}{rl}\mathcal{O}(H,W)=&\mathrm{Tr}\left[(X-HW)^{\top}\mathrm{diag}(-\rho^{t})(X-HW)\right]\\ =&\mathrm{Tr}\left[X^{\top}\mathrm{diag}(-\rho^{t})X\right]-2\,\mathrm{Tr}\left[X^{\top}\mathrm{diag}(-\rho^{t})HW\right]\\ &+\mathrm{Tr}\left[W^{\top}H^{\top}\mathrm{diag}(-\rho^{t})HW\right],\end{array}$ (17) the optimization problem in (16) can be reformulated as the following dual problem: $\begin{array}{rl}(H^{t+1},W^{t+1})=&\underset{H,W}{\arg\min}\ \mathcal{O}(H,W)\\ &\mathrm{s.t.}\quad H\ge 0,\quad W\ge 0.\end{array}$ (18) Because ρ=−g(·) by Proposition 1, the weights −ρ^{t} are nonnegative. Let ϕ_{dk} and ψ_{kn} be the Lagrange multipliers for the constraints h_{dk}≥0 and w_{kn}≥0, respectively, and Φ=[ϕ_{dk}] and Ψ=[ψ_{kn}].
The Lagrangian $\mathcal{L}$ is $\begin{array}{rl}\mathcal{L}=&\mathrm{Tr}\left[X^{\top}\mathrm{diag}(-\rho^{t})X\right]-2\,\mathrm{Tr}\left[X^{\top}\mathrm{diag}(-\rho^{t})HW\right]\\ &+\mathrm{Tr}\left[W^{\top}H^{\top}\mathrm{diag}(-\rho^{t})HW\right]+\mathrm{Tr}\left[\Phi H^{\top}\right]+\mathrm{Tr}\left[\Psi W^{\top}\right].\end{array}$ (19)
The partial derivatives of $\mathcal{L}$ with respect to H and W are $\frac{\partial\mathcal{L}}{\partial H}=-2\,\mathrm{diag}(-\rho^{t})XW^{\top}+2\,\mathrm{diag}(-\rho^{t})HWW^{\top}+\Phi$ (20) and $\frac{\partial\mathcal{L}}{\partial W}=-2H^{\top}\mathrm{diag}(-\rho^{t})X+2H^{\top}\mathrm{diag}(-\rho^{t})HW+\Psi.$ (21)
Using the Karush-Kuhn-Tucker optimality conditions, i.e., ϕ_{dk}h_{dk}=0 and ψ_{kn}w_{kn}=0, we get the following equations for h_{dk} and w_{kn}: $-2\left(\mathrm{diag}(-\rho^{t})XW^{\top}\right)_{dk}h_{dk}+2\left(\mathrm{diag}(-\rho^{t})HWW^{\top}\right)_{dk}h_{dk}=0$ (22) and $-2\left(H^{\top}\mathrm{diag}(-\rho^{t})X\right)_{kn}w_{kn}+2\left(H^{\top}\mathrm{diag}(-\rho^{t})HW\right)_{kn}w_{kn}=0.$ (23)
These equations lead to the following updating rules to maximize the expectation in (13).
The metasample matrix H, conditioned on the coding matrix W: $h_{dk}^{t+1}\leftarrow h_{dk}^{t}\frac{\left(\mathrm{diag}(-\rho^{t})X{W^{t}}^{\top}\right)_{dk}}{\left(\mathrm{diag}(-\rho^{t})H^{t}W^{t}{W^{t}}^{\top}\right)_{dk}}$ (24)

The coding matrix W, conditioned on the newly estimated metasample matrix H^{t+1}: $w_{kn}^{t+1}\leftarrow w_{kn}^{t}\frac{\left({H^{t+1}}^{\top}\mathrm{diag}(-\rho^{t})X\right)_{kn}}{\left({H^{t+1}}^{\top}\mathrm{diag}(-\rho^{t})H^{t+1}W^{t}\right)_{kn}}$ (25)
We should note that if we exchange the numerator and denominator in (24) and (25), new update formulas are obtained. The new update rules are dual to (24) and (25), and our experimental results show that the dual update rules achieve clustering performance similar to that of (24) and (25).

Algorithm 1 summarizes the optimization procedure.
Algorithm 1 NMF-MCC Algorithm
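Since the steps of Algorithm 1 are not listed in this text, the following is a minimal NumPy sketch of how the E-step in (14) and (15) and the CM-steps in (24) and (25) could be iterated. It is not the authors' implementation. Several points are assumptions made only for illustration: g(z,σ) is taken to be the Gaussian kernel exp(−z²/(2σ²)), so that −ρ_d acts as a positive per-gene weight; the function name `nmf_mcc`, the random initialization, the fixed iteration count, the constant `eps` guarding against division by zero, and the argmax rule used to read cluster labels from W are all hypothetical choices.

```python
import numpy as np

def nmf_mcc(X, K, n_iter=200, theta=1.0, eps=1e-12, seed=0):
    """Illustrative NMF-MCC iteration on a nonnegative D x N matrix X."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    H = rng.random((D, K)) + eps   # metasample matrix, D x K
    W = rng.random((K, N)) + eps   # coding matrix, K x N
    for _ in range(n_iter):
        # E-step: per-gene squared error, bandwidth (15), weights -rho_d (14).
        err2 = np.sum((X - H @ W) ** 2, axis=1)          # shape (D,)
        sigma2 = theta / (2.0 * D) * err2.sum() + eps
        weights = np.exp(-err2 / (2.0 * sigma2))         # assumed Gaussian g
        # CM-step for H, rule (24); weights[:, None] * A equals diag(weights) @ A.
        num_H = (weights[:, None] * X) @ W.T
        den_H = (weights[:, None] * (H @ W)) @ W.T + eps
        H *= num_H / den_H
        # CM-step for W, rule (25), using the newly updated H.
        Hw = weights[:, None] * H
        W *= (Hw.T @ X) / (Hw.T @ H @ W + eps)
    return H, W, weights

# Toy usage: 30 genes, 12 samples drawn from two synthetic groups.
rng = np.random.default_rng(0)
H_true = rng.random((30, 2))
W_true = np.zeros((2, 12))
W_true[0, :6] = 1.0
W_true[1, 6:] = 1.0
X_demo = H_true @ W_true + 0.01 * rng.random((30, 12))

H_est, W_est, gene_weights = nmf_mcc(X_demo, K=2)
labels = W_est.argmax(axis=0)   # assumed rule: largest coding coefficient
```

The returned `gene_weights` vector plays the role of the learned per-gene weights discussed in the text: genes that are poorly reconstructed receive weights close to zero and therefore contribute little to the next CM-steps.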
Proof of convergence
In this section, we will prove that the objective function in (16) is nonincreasing under the updating rules in (24) and (25).
Theorem 1
The objective function in (16) is nonincreasing under the update rules (24) and (25).
To prove the above theorem, we first define an auxiliary function.
Definition 1
are satisfied.
The auxiliary function is quite useful because of the following lemma:
Lemma 1
Since the updating rule is essentially based on elements, it is sufficient to show that each F_{ k n } is nonincreasing under the update step of (25).
Lemma 2
is an auxiliary function for F_{ k n }, which is relevant only to w_{ k n }.
Proof
Thus, (32) holds and $G(w,{w}_{\mathit{\text{kn}}}^{t})\ge {F}_{\mathit{\text{kn}}}(w)$. □
We can now demonstrate the convergence of Theorem 1.
Proof of Theorem 1
Since (30) is an auxiliary function, F_{ k n } is nonincreasing under this update rule as in (25).
Similarly, we can also show that O is nonincreasing under the updating steps in (24).
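The monotonicity stated in Theorem 1 can also be checked numerically on a toy problem. The sketch below holds a set of positive weights fixed (standing in for −ρ^{t} within one CM-step), applies the multiplicative rules (24) and (25) repeatedly, and records the weighted reconstruction error after each pass. The random data, the weights, and all variable names are assumptions for illustration; such a check complements the proof but does not replace it.

```python
import numpy as np

def weighted_objective(X, H, W, w):
    """Weighted squared error sum_d w_d * sum_n (X - HW)_dn^2."""
    R = X - H @ W
    return float(np.sum(w[:, None] * R ** 2))

rng = np.random.default_rng(1)
D, N, K = 20, 15, 3
X = rng.random((D, N))
H = rng.random((D, K)) + 0.1
W = rng.random((K, N)) + 0.1
w = rng.random(D) + 0.1          # fixed positive weights, playing the role of -rho^t

values = [weighted_objective(X, H, W, w)]
for _ in range(30):
    # Rule (24): update H with W fixed.
    H *= ((w[:, None] * X) @ W.T) / ((w[:, None] * (H @ W)) @ W.T)
    # Rule (25): update W with the new H.
    W *= ((w[:, None] * H).T @ X) / ((w[:, None] * H).T @ H @ W)
    values.append(weighted_objective(X, H, W, w))

# Allow a tiny tolerance for floating-point rounding.
is_nonincreasing = all(b <= a + 1e-9 for a, b in zip(values, values[1:]))
```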
Experiments
Datasets
Summary of the six cancer gene expression datasets used to test the NMF-MCC algorithm

| Dataset name | Diagnostic task | Samples (N) | Genes (D) | Cancer classes (K) | Ref |
|---|---|---|---|---|---|
| Leukemia | Acute myelogenous leukemia | 72 | 5327 | 3 | [25] |
| Brain Tumor | 5 human brain tumor types | 90 | 5920 | 5 | [26] |
| Lung Cancer | 4 lung cancer types and normal tissues | 203 | 12600 | 5 | [27] |
| 9 Tumors | 9 various human tumor types | 60 | 5726 | 9 | [28] |
| SRBCT | Small, round blue cell tumors | 83 | 2308 | 4 | [29] |
| DLBCL | Diffuse large B-cell lymphomas | 77 | 5469 | 2 | [24] |
Performance metric
where I(A,B) returns 1 if A=B and 0 otherwise.
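The formula that this sentence refers to is not reproduced in this text. As a hedged illustration, one common way to use such an indicator is to define clustering accuracy as the fraction of samples whose cluster label, after the best one-to-one mapping of clusters to classes, equals the true class label. The sketch below follows that common definition, which is an assumption and may differ from the metric actually used in the experiments; the function names and the brute-force search over mappings are likewise illustrative and only practical for a small number of classes.

```python
from itertools import permutations

def indicator(a, b):
    """I(A, B): 1 if A equals B, 0 otherwise."""
    return 1 if a == b else 0

def clustering_accuracy(true_labels, cluster_labels):
    """Best-mapping accuracy (assumed definition, brute force over mappings)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes is not handled in this sketch")
    best_hits = 0
    # Try every one-to-one assignment of clusters to classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(indicator(t, mapping[c])
                   for t, c in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

true_labels = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 0]   # cluster ids are arbitrary; one sample is misplaced
accuracy = clustering_accuracy(true_labels, cluster_labels)
```

Here five of the six samples agree with their class under the best mapping, so the accuracy is 5/6.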
Tested methods
We first compared the MCC with other loss functions between X and HW for the NMF algorithm on the cancer clustering problem, including the l_{2} norm distance, the KL distance [10], α-divergence [4], and the earth mover's distance (EMD) [12]. We further compared the proposed NMF-MCC algorithm with other NMF-based algorithms, including the penalized matrix decomposition (PMD) algorithm [7], the original NMF algorithm [22], the sparse nonnegative matrix factorization (SNMF) algorithm [3], the nonsmooth nonnegative matrix factorization (nsNMF) algorithm [9] and the projected gradient kernel nonnegative matrix factorization (PGKNMF).
Results
Discussion
Traditional unsupervised learning techniques select features with feature selection algorithms and then do clustering using the selected features. The NMF-MCC algorithm proposed here achieves both goals simultaneously. The learned gene weight vector reflects the importance of the genes in the gene clustering task, and the coding matrix encodes the clustering results for the samples.
Our experimental results demonstrate that the improvement of NMF-MCC over the other methods increases when the number of genes increases. This shows the ability of the proposed algorithm to effectively select the important genes and cluster samples. This is an important property because high-dimensional data analysis has become increasingly frequent and important in diverse fields of science, engineering, and social science, ranging from genomics and health sciences to economics, finance and machine learning. For instance, in genome-wide association studies, hundreds of thousands of SNPs are potential covariates for phenotypes such as cholesterol level or height. The large number of features presents an intrinsic challenge to many classical problems, where the usual low-dimensional methods no longer apply. The NMF-MCC algorithm has been demonstrated to work well on datasets with small numbers of samples but large numbers of features. It can therefore provide a powerful tool to study high-dimensional problems, such as genome-wide association studies.
Conclusion
We have proposed a novel NMF-MCC algorithm for gene expression data-based cancer clustering. Experiments demonstrate that correntropy is a better measure than the traditional l_{2} norm and KL distances for this task, and the proposed algorithm significantly outperforms the existing methods.
Declarations
Acknowledgements
The study was supported by a grant from King Abdullah University of Science and Technology, Saudi Arabia. We would like to thank Dr. Ran He for the discussion about the maximum correntropy criterion at the ICPR 2012 conference.
Authors’ Affiliations
References
 1. Shi F, Leckie C, MacIntyre G, Haviv I, Boussioutas A, Kowalczyk A: A bi-ordering approach to linking gene expression with clinical annotations in gastric cancer. BMC Bioinformatics. 2010, 11: 477. 10.1186/1471-2105-11-477.
 2. de Souto MCP, Costa IG, de Araujo DSA, Ludermir TB, Schliep A: Clustering cancer gene expression data: a comparative study. BMC Bioinformatics. 2008, 9: 497. 10.1186/1471-2105-9-497.
 3. Gao Y, Church G: Improving molecular cancer class discovery through sparse nonnegative matrix factorization. Bioinformatics. 2005, 21 (21): 3970-3975.
 4. Liu W, Yuan K, Ye D: On alpha-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif Intell Med. 2008, 44 (1): 1-5. 10.1016/j.artmed.2008.05.001.
 5. Zheng CH, Ng TY, Zhang L, Shiu CK, Wang HQ: Tumor classification based on nonnegative matrix factorization using gene expression data. IEEE Trans Nanobioscience. 2011, 10 (2): 86-93.
 6. Kim MH, Seo HJ, Joung JG, Kim JH: Comprehensive evaluation of matrix factorization methods for the analysis of DNA microarray gene expression data. BMC Bioinformatics. 2011, 12 (Suppl 13): S8. 10.1186/1471-2105-12-S13-S8.
 7. Zheng CH, Zhang L, Ng VTY, Shiu SCK, Huang DS: Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM Trans Comput Biol Bioinformcs. 2011, 8 (6): 1592-1603.
 8. Tjioe E, Berry M, Homayouni R, Heinrich K: Using a literature-based NMF model for discovering gene functional relationships. BMC Bioinformatics. 2008, 9 (7): P1.
 9. Carmona-Saez P, Pascual-Marqui R, Tirado F, Carazo J, Pascual-Montano A: Biclustering of gene expression data by nonsmooth nonnegative matrix factorization. BMC Bioinformatics. 2006, 7: 78. 10.1186/1471-2105-7-78.
 10. Venkatesan R, Plastino A: Deformed statistics Kullback-Leibler divergence minimization within a scaled Bregman framework. Phys Lett A. 2011, 375 (48): 4237-4243. 10.1016/j.physleta.2011.09.021.
 11. Cai D, He X, Han J, Huang TS: Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011, 33 (8): 1548-1560.
 12. Sandler R, Lindenbaum M: Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Trans Pattern Anal Mach Intell. 2011, 33 (8): 1590-1602.
 13. He R, Zheng WS, Hu BG: Maximum correntropy criterion for robust face recognition. IEEE Trans Pattern Anal Mach Intell. 2011, 33 (8): 1561-1576.
 14. Zafeiriou S, Petrou M: Nonlinear nonnegative component analysis. CVPR: 2009 IEEE Conference on Computer Vision and Pattern Recognition, Vols 1-4. 2010, Miami: IEEE Conference on Computer Vision and Pattern Recognition, 2852-2857.
 15. Yan H, Yuan X, Yan S, Yang J: Correntropy based feature selection using binary projection. Pattern Recognit. 2011, 44 (12): 2834-2842. 10.1016/j.patcog.2011.04.014.
 16. He R, Hu BG, Zheng WS, Kong XW: Robust principal component analysis based on maximum correntropy criterion. IEEE Trans Image Process. 2011, 20 (6): 1485-1494.
 17. Chalasani R, Principe JC: Self organizing maps with the correntropy induced metric. Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN2010). 2010, Barcelona, Spain, 1-6.
 18. Liu W, Pokharel PP, Principe JC: Correntropy: properties and applications in non-Gaussian signal processing. IEEE Trans Signal Process. 2007, 55 (11): 5286-5298.
 19. Horaud R, Forbes F, Yguel M, Dewaele G, Zhang J: Rigid and articulated point registration with expectation conditional maximization. IEEE Trans Pattern Anal Mach Intell. 2011, 33 (3): 587-602.
 20. Beer G: Conjugate convex functions and the epi-distance topology. Proc Am Math Soc. 1990, 108 (1): 117-126. 10.1090/S0002-9939-1990-0982400-8.
 21. Qi Y, Ye P, Bader J: Genetic interaction motif finding by expectation maximization: a novel statistical model for inferring gene modules from synthetic lethality. BMC Bioinformatics. 2005, 6: 288. 10.1186/1471-2105-6-288.
 22. Lee DD, Seung HS: Algorithms for nonnegative matrix factorization. Adv Neural Inf Process Syst. 2001, 13: 556-562.
 23. Statnikov A, Aliferis C, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21 (5): 631-643. 10.1093/bioinformatics/bti033.
 24. Shipp M, Ross K, Tamayo P, Weng A, Kutok J, Aguiar R, Gaasenbeek M, Angelo M, Reich M, Pinkus G, Ray T, Koval M, Last K, Norton A, Lister T, Mesirov J, Neuberg D, Lander E, Aster J, Golub T: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002, 8 (1): 68-74. 10.1038/nm0102-68.
 25. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, Bloomfield C, Lander E: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
 26. Pomeroy S, Tamayo P, Gaasenbeek M, Sturla L, Angelo M, McLaughlin M, Kim J, Goumnerova L, Black P, Lau C, Allen J, Zagzag D, Olson J, Curran T, Wetmore C, Biegel J, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis D, Mesirov J, Lander E, Golub T: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002, 415 (6870): 436-442. 10.1038/415436a.
 27. Bhattacharjee A, Richards W, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark E, Lander E, Wong W, Johnson B, Golub T, Sugarbaker D, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci. 2001, 98 (24): 13790-13795. 10.1073/pnas.191502998.
 28. Staunton J, Slonim D, Coller H, Tamayo P, Angelo M, Park J, Scherf U, Lee J, Reinhold W, Weinstein J, Mesirov J, Lander E, Golub T: Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci. 2001, 98 (19): 10787-10792. 10.1073/pnas.191368598.
 29. Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, Meltzer P: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001, 7 (6): 673-679. 10.1038/89044.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.