Unsupervised Bayesian linear unmixing of gene expression microarrays
 Cécile Bazot^{1},
 Nicolas Dobigeon^{1},
 Jean-Yves Tourneret^{1},
 Aimee K Zaas^{2},
 Geoffrey S Ginsburg^{2} and
 Alfred O Hero III^{3}
DOI: 10.1186/1471-2105-14-99
© Bazot et al.; licensee BioMed Central Ltd. 2013
Received: 5 December 2012
Accepted: 24 January 2013
Published: 19 March 2013
Abstract
Background
This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is that uBLU constrains the factor loadings to be nonnegative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters.
Results
Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), nonnegative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GBGMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals have been inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real datasets considered here.
Conclusions
The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GBGMF). The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.
Background
Factor analysis methods such as principal component analysis (PCA) have been widely studied and can be used for discovering the patterns of differential expression in time course and/or multiple treatment biological experiments using gene or protein microarray samples. These methods aim at finding a decomposition of the observation matrix $\mathbf{Y}\in {\mathbb{R}}^{G\times N}$ whose rows (respectively columns) are indexed by gene index (respectively sample index). Typically, in gene expression analysis, the number N of samples is much less than the number G of genes. For example, in an Affymetrix HU133 gene chip, the number G of genes may range from ten to twenty thousand depending on the type of chip description file (CDF) processing used to translate the oligonucleotide fragments to gene labels, whereas only about a hundred samples are typically analyzed.
where m_{ r } is the rth column of M, a_{r, i} denotes the (r, i)th element of the matrix A, a_{ i } and n_{ i } are the ith column of A and N respectively. The number of factors R in the decomposition is usually much less than the number of samples N. Traditional factor analysis methods such as PCA require the experimenter to specify the desired number of factors to be estimated. However, some recent Bayesian factor analysis methods are totally unsupervised in the sense that the number of factors is directly estimated from the data [13].
The model (1) is identical to the standard factor analysis model [4] for which the columns of M are called factors and should correspond to biological signatures (or pathways). Note that the elements of the matrix M are referred to as factor loadings, and the columns of A are the factor scores. Approaches to fitting the model (1) to data include principal component analysis [5, 6], least squares matrix factorization [7, 8], finite mixture modeling [9, 10], and Bayesian factor analysis [4, 11, 12].
This paper presents a new Bayesian factor analysis method called unsupervised Bayesian linear unmixing (uBLU), that estimates the number of factors and incorporates nonnegativity constraints on the factors and factor scores, as well as a sum-to-one constraint for the factor scores. The uBLU method presented here differs from the BLU method, developed in [13] for hyperspectral imaging and applied to gene microarray expression analysis in [14]. Note that BLU requires user specification of the number of factors while uBLU determines the number of factors using a Bayesian birth-death model. The positivity and sum-to-one constraints are natural in gene microarray analysis when the entries of the observation matrix are nonnegative and when a proportional decomposition is desired. Thus each factor score corresponds to the concentration (or proportion) of a particular factor in a given sample. The advantage of this representation for gene expression analysis is twofold: i) the factor loadings correspond to the strengths of each gene's contribution to a factor; ii) for each gene chip the factor scores give the relative abundance of each factor present in the chip. For example, a gene having a large loading level (close to one) for a particular factor should have a small loading (close to zero) for all other factors. In this way, as opposed to other factor analysis methods, there is less multiplexing, making it easier to associate specific genes with specific factors and vice versa.
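To make the constrained decomposition concrete, the following sketch (in Python with NumPy) generates data according to model (1) with nonnegative loadings and simplex-constrained scores. The sizes and the gamma/Dirichlet distributions are illustrative choices of ours, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
G, N, R = 1000, 100, 3          # genes, samples, factors (illustrative sizes)

# Nonnegative factor loadings M (G x R): each column is a gene signature.
M = rng.gamma(shape=2.0, scale=1.0, size=(G, R))

# Factor scores A (R x N): each column is nonnegative and sums to one,
# i.e. lies on the unit simplex (Dirichlet draws).
A = rng.dirichlet(alpha=np.ones(R), size=N).T

# Observed samples: additive mixture plus Gaussian noise, as in model (1).
sigma = 0.1
Y = M @ A + sigma * rng.standard_normal((G, N))

assert np.allclose(A.sum(axis=0), 1.0)    # sum-to-one constraint
assert (M >= 0).all() and (A >= 0).all()  # positivity constraints
```

Each column of A is then directly interpretable as the proportions of the R signatures in the corresponding sample.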
A similar approach, based on NMR spectral imaging and called Bayesian decomposition (BD), was previously developed by Moloshok et al. and applied to gene expression data [11]. More recently, the coordinated gene activity in pattern sets method (CoGAPS), available as an open-source R package [12], combines the GAPS-MCMC matrix factorization algorithm with a threshold-independent statistic to infer activity in specific gene sets. However, these approaches require cold restarts of the algorithm with different numbers of factors and different random seeds to avoid the large number of local minima encountered in the process of fitting the matrix factorization model MA to the data Y. In contrast, the proposed uBLU algorithm uses a judicious model to reduce sensitivity to local minima rather than using cold restarts. The novelty of the uBLU model is that it consists of: (1) a birth-death process to infer the number of factors; (2) a positivity constraint on the loading and score matrices M, A to restrict the space of solutions; (3) a sum-to-one constraint on the columns of A to further restrict the solution space. The uBLU model is justified for nonnegative data problems like gene expression analysis and produces an estimate of the nonnegative factors in addition to their proportional representation in each sample.
Bayesian linear unmixing, traditionally used for hyperspectral image analysis (see [13] for example), is one of many possible factor analysis methods that could be applied to gene expression analysis. These methods include nonnegative matrix factorization (NMF) [7, 8], independent component analysis (ICA) [15], Bayesian decomposition (BD) [11], PCA [5], biclustering [16], penalized matrix decomposition (PMD) [2], Bayesian factor regression modeling (BFRM) [1], and more recently the gradient-based algorithm of Nikulin et al. for general matrix factorization (GBGMF) [17]. Contrary to uBLU, the PCA, ICA, BFRM, GBGMF, biclustering and PMD methods do not account for nonnegativity of the factor loadings and factor scores. On the other hand, NMF does not account for sum-to-one constraints on the columns of the factor score matrix. Contrary to PCA and ICA, uBLU, like the GBGMF algorithm, does not impose orthogonality or independence on the factors. These relaxed assumptions might better represent what is known about the preponderance of overlap and dependency in biological pathways. Finally, uBLU naturally accommodates Bayesian estimation of the number of factors, like BFRM. Note that BFRM has been specifically developed for gene expression data [1].
In this paper we provide comparative studies that establish quantitative performance advantages of the proposed constrained model and its corresponding uBLU algorithm with respect to PCA, NMF, BFRM and GBGMF for time-varying gene expression analysis, using synthetic data with known ground truth. We also illustrate the application of uBLU to the analysis of a real gene expression dataset from a recent viral challenge study [18] in which several subjects were administered viral inoculum and gene expression time course data were collected over a period of several days. Using these data, we may infer relationships between genes and symptoms and examine how the human host response to viral infection evolves with time.
Methods
Mathematical constrained model
where m_{g, r} denotes the (g, r)th element of matrix M. The constraints (5) arise naturally when dealing with positive data for which one is seeking the relative contribution of positive factors that have the same numerical characteristics as the data, i.e., the signature m_{ r } is itself interpretable as a vector of hybridization abundances.
The objective of linear unmixing is to simultaneously estimate the factor matrix M and the factor score matrix A from the available N data samples. The representation (1) is rank deficient since A has rank N − 1. This presents algorithmic challenges for solving the unmixing problem. Several algorithms have been proposed in the context of hyperspectral imaging to solve similar problems [6, 19]. Most of these algorithms perform unmixing in a two-step procedure where M is estimated first using an endmember extraction algorithm (EEA) followed by a constrained linear least squares step to estimate A. A common (but restrictive) assumption in these algorithms is that some samples in the dataset are “pure” in the sense that the linear combination of (2) contains a unique factor, say m_{ r }, with factor score a_{r, i}. Recently, this assumption has been relaxed by applying a hierarchical Bayesian approach, called Bayesian linear unmixing (BLU). The resulting algorithm requires the number R of factors to be specified (see [13] for details). Here we extend BLU to a fully unsupervised algorithm, called unsupervised BLU (uBLU), that estimates R using a birth-death model and a Gibbs sampler. The Gibbs sampler produces an estimate of the entire joint posterior distribution of the model parameters, resulting in a fully Bayesian estimator of the number of factors R, the factor loadings M, and the factor scores A. The uBLU model is described in the next subsection and the Gibbs sampling algorithm is given in the Appendix. In the Results and discussion section below we demonstrate the performance advantages of uBLU as a factor analysis model for simulated and real gene expression data.
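As an illustration of the second (constrained least squares) step of the two-step procedure, the sketch below estimates the scores for a known M by projected gradient descent onto the probability simplex. This is a generic sketch of ours, not the estimator used by BLU/uBLU:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex
    (standard sorting-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def constrained_scores(M, Y, n_iter=5000):
    """Estimate factor scores A under positivity and sum-to-one constraints
    by projected gradient descent on (1/2)||Y - MA||_F^2, assuming M has
    already been estimated (e.g. by an EEA)."""
    R, N = M.shape[1], Y.shape[1]
    A = np.full((R, N), 1.0 / R)                 # start at the simplex center
    lr = 1.0 / np.linalg.norm(M.T @ M, 2)        # step = 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = M.T @ (M @ A - Y)                 # gradient of the quadratic loss
        A = np.apply_along_axis(project_simplex, 0, A - lr * grad)
    return A
```

Each column is projected back onto the simplex after every gradient step, so the positivity and sum-to-one constraints hold at every iteration.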
Unsupervised Bayesian linear unmixing algorithm
The BLU algorithm studied in [13] generates samples distributed according to the posterior distribution of M and A given the number R of factors for appropriate prior distributions assigned to the mixing parameters in (2). First, the residual errors n_{ i } in (2) are assumed to be independent identically distributed (i.i.d.) according to zero-mean Gaussian distributions: ${\mathbf{n}}_{i}\sim \mathcal{N}\left({\mathbf{0}}_{G},{\sigma}^{2}{\mathbf{I}}_{G}\right)$ for i = 1, …, N, where I_{ G } denotes the identity matrix of dimension G × G.
where R_{max} is the maximal number of factors present in the mixture.
where ∝ stands for “proportional to”, ∥·∥ is the standard ℓ_{2}-norm, ${\mathbf{1}}_{\mathcal{X}}(\cdot)$ denotes the indicator function on the set $\mathcal{X}$, E = [e_{1}, …, e_{ R }] and ${\mathbf{s}}^{2}=\left[{s}_{1}^{2},\dots ,{s}_{R}^{2}\right]$.
The resulting hierarchical structure of the proposed uBLU model is summarized in the directed acyclic graph (DAG) presented in Additional file 1: Figure S1.
where f(A|R), f(T|E, s^{2}, R) and f(σ^{2}|ν, γ) are respectively the prior distributions of the factor score matrix A, the projected factor matrix T and the noise variance σ^{2} previously defined.
Due to the constraints enforced on the data, the posterior distribution f(M, A, R|Y) obtained from the proposed hierarchical structure is too complex to derive analytical expressions of the Bayesian estimators, e.g., the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators. In such a case, it is natural to use Markov chain Monte Carlo (MCMC) methods [20] to generate samples M^{(t)}, A^{(t)} and R^{(t)} asymptotically distributed according to f(M, A, R|Y). However, the dimensions of the factor loading matrix M and the factor score matrix A depend on the unknown number R of signatures to be identified. As a consequence, sampling from f(M, A, R|Y) requires exploring parameter spaces of different dimensions. To solve this dimension matching problem, we include a birth/death process within the MCMC procedure. Specifically, a birth, death or switch move is chosen at each iteration of the algorithm (see the Appendix and [21]). This birth-death model differs from the classical reversible-jump MCMC (RJMCMC) (as defined in [21]) in the sense that, for the birth-death model, each move is accepted or rejected at each iteration using the likelihood ratio between the current state and the new state proposed by the algorithm. The factor matrix M, the factor score matrix A and the noise variance σ^{2} are then updated, conditionally upon the number of factors R, using Gibbs moves.
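The move mechanism just described can be sketched as follows. The concrete move probabilities and the simple likelihood-ratio acceptance rule are illustrative stand-ins for the quantities $b_R$, $d_R$, $s_R$ and the acceptance procedure detailed in the Appendix:

```python
import numpy as np

def choose_move(R, R_max, rng, p=0.25):
    """Randomly pick a birth, death or switch move; a birth is impossible
    when R = R_max and a death when R = 1 (probabilities are illustrative
    stand-ins for b_R, d_R, s_R)."""
    b = 0.0 if R >= R_max else p
    d = 0.0 if R <= 1 else p
    return rng.choice(["birth", "death", "switch"], p=[b, d, 1.0 - b - d])

def accept(loglik_proposed, loglik_current, rng):
    """Accept or reject the proposed state based on the likelihood ratio
    between the proposed and current states."""
    return np.log(rng.random()) < (loglik_proposed - loglik_current)
```

A proposed dimension change is thus kept only when the likelihood ratio survives the random threshold, after which M, A and σ^{2} are refreshed by the Gibbs moves.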
Results and discussion
The proposed method consists of estimating simultaneously the matrices M and A defined in (1), under the positivity and sum-to-one constraints mentioned previously, in a fully unsupervised framework, i.e., the number of factors R is also estimated from the data. A Gibbs sampler algorithm is designed that generates samples distributed according to the posterior distribution associated with the proposed uBLU model. For more details about the Gibbs sampling strategy, see the Appendix.
Simulations on synthetic data
To illustrate the performance of the proposed Bayesian factor decomposition, we first present simulations conducted on synthetic data. More extensive simulation results are reported in the Additional file 1.
Simulation scenario
Synthetic datasets ${\mathcal{D}}_{1},\dots ,{\mathcal{D}}_{4}$
${\mathcal{D}}_{1}$: Peaky factors
${\mathcal{D}}_{2}$: Realistic factors
${\mathcal{D}}_{3}$: Orthogonal factors
${\mathcal{D}}_{4}$: Orthogonal and positive factors
In each case, the R = 3 factors were mixed in random proportions (factor scores), with positivity and sum-to-one constraints. All synthetic datasets were corrupted by an i.i.d. Gaussian noise sequence. The signal-to-noise ratio is SNR_{ i } = 20 dB, where ${\text{SNR}}_{i}={G}^{-1}{\sigma}^{-2}{\left\Vert \sum _{r=1}^{R}{\mathbf{m}}_{r}{a}_{r,i}\right\Vert}^{2}$ for each sample i (i = 1, …, N).
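For instance, the noise standard deviation needed to reach a target per-sample SNR follows directly from the definition above (a small helper sketch; the function name is ours):

```python
import numpy as np

def noise_sigma_for_snr(M, A, snr_db=20.0):
    """Per-sample noise standard deviations sigma_i such that
    SNR_i = ||M a_i||^2 / (G sigma_i^2) matches the target value in dB."""
    G = M.shape[0]
    signal_power = np.sum((M @ A) ** 2, axis=0)   # ||M a_i||^2 for each sample i
    snr_lin = 10.0 ** (snr_db / 10.0)             # dB -> linear scale
    return np.sqrt(signal_power / (G * snr_lin))
```

With these sigma values, every noisy sample in the synthetic dataset has exactly the prescribed 20 dB SNR.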
Proposed method (uBLU)
The first step of the algorithm consists of estimating the number of factors R involved in the mixture, and hence determining the dimensions of the matrices M and A, using the maximum a posteriori (MAP) estimator $\hat{R}_{\text{MAP}}$. The second step of the algorithm consists of estimating the unknown model parameters (M, A and σ^{2}) given $\hat{R}_{\text{MAP}}$. The estimated posterior distributions of the unknown model parameters are given in Additional file 1: Figure S5 and validate the proposed Bayesian model.
The burnin period and number of Gibbs samples were determined using quantitative methods described in the Additional file 1: Section “Convergence diagnosis”.
Comparison to other methods
The performance of the proposed uBLU algorithm is compared with other existing factor decomposition methods including PCA, NMF, BFRM and GBGMF by using the following criteria, which are common measures used to compare factor analysis algorithms:

- the factor mean square errors (MSE) ${\text{MSE}}_{r}^{2}=\frac{1}{G}{\left\Vert \hat{\mathbf{m}}_{r}-{\mathbf{m}}_{r}\right\Vert}^{2},\phantom{\rule{1em}{0ex}}r=1,\dots ,R$, where $\hat{\mathbf{m}}_{r}$ is the estimated rth factor loading vector,
- the global MSE of factor scores ${\text{GMSE}}_{r}^{2}=\frac{1}{N}\sum _{i=1}^{N}{\left({\hat{a}}_{r,i}-{a}_{r,i}\right)}^{2},\phantom{\rule{1em}{0ex}}r=1,\dots ,R$, where ${\hat{a}}_{r,i}$ is the estimated proportion of the rth factor in the ith sample,
- the reconstruction error (RE) $\text{RE}=\frac{1}{\mathit{\text{NG}}}\sum _{i=1}^{N}{\left\Vert {\mathbf{y}}_{i}-{\hat{\mathbf{y}}}_{i}\right\Vert}^{2}$ (21), where ${\hat{\mathbf{y}}}_{i}=\sum _{r=1}^{R}{\hat{\mathbf{m}}}_{r}{\hat{a}}_{r,i}$ is the estimate of y_{ i },
- the spectral angle distance (SAD) between m_{ r } and its estimate ${\hat{\mathbf{m}}}_{r}$ for each factor r = 1, …, R, ${\text{SAD}}_{r}=\arccos\left(\frac{{\hat{\mathbf{m}}}_{r}^{T}{\mathbf{m}}_{r}}{\Vert {\hat{\mathbf{m}}}_{r}\Vert \,\Vert {\mathbf{m}}_{r}\Vert}\right)$, where arccos(·) is the inverse cosine function,
- the global spectral angle distance (GSAD) between y_{ i } (the ith observation vector) and ${\hat{\mathbf{y}}}_{i}$ (its estimate), $\text{GSAD}=\frac{1}{N}\sum _{i=1}^{N}\arccos\left(\frac{{\hat{\mathbf{y}}}_{i}^{T}{\mathbf{y}}_{i}}{\Vert {\hat{\mathbf{y}}}_{i}\Vert \,\Vert {\mathbf{y}}_{i}\Vert}\right)$,
- the computational time.
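The criteria above can be computed as a direct transcription of their definitions. In this sketch the matching between estimated and ground-truth factors is assumed to have been resolved beforehand (the function name is ours):

```python
import numpy as np

def unmixing_metrics(M, A, M_hat, A_hat):
    """Compute MSE_r^2, GMSE_r^2, RE, SAD_r and GSAD for ground-truth
    (M, A) and estimates (M_hat, A_hat), with factors already aligned."""
    G, R = M.shape
    N = A.shape[1]
    Y, Y_hat = M @ A, M_hat @ A_hat

    def angle(u, v):
        # Spectral angle between two vectors; clip guards against rounding.
        c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(c, -1.0, 1.0))

    return {
        "MSE":  [np.sum((M_hat[:, r] - M[:, r]) ** 2) / G for r in range(R)],
        "GMSE": [np.sum((A_hat[r] - A[r]) ** 2) / N for r in range(R)],
        "RE":   np.sum((Y - Y_hat) ** 2) / (N * G),
        "SAD":  [angle(M_hat[:, r], M[:, r]) for r in range(R)],
        "GSAD": float(np.mean([angle(Y_hat[:, i], Y[:, i]) for i in range(N)])),
    }
```

In practice an alignment step (e.g. matching each estimated factor to its closest ground-truth factor) must precede this computation, since factorization methods recover factors in arbitrary order.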
The proposed uBLU algorithm and the PCA, NMF and GBGMF methods were implemented in Matlab 7.8.0 (R2009a). The BFRM software (version 2.0) was downloaded from [22] and run with default values for the parameters. All methods were run on an Intel(R) Core(TM)2 Duo processor.
Simulation results for dataset ${\mathcal{D}}_{\mathbf{1}}$
(a) R = 2  

uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.39  N/A  N/A  205.99  267.42 
0.60  6.04  61.12  N/A  N/A  
0.54  0.97  9.78  325.58  67.14  
${\text{GMSE}}_{r}^{2}(\times 10^{-3})$  0.04  N/A  N/A  64.39  226.58 
0.04  2.00  2.00  N/A  N/A  
0.05  0.30  0.28  75.87  41.33  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.46  N/A  N/A  21.69  12.48 
0.29  3.49  3.50  N/A  N/A  
0.28  1.49  1.50  23.24  27.43  
GSAD (×10^{ −2 })  3.39  20.38  20.38  24.04  37.35 
RE  0.18  9.12  9.12  1.94  9.16 
Time (s)  1.24×10^{ 3 }  0.03  0.71  47.15  0.39×10^{ 3 } 
(b) R = 3  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.39  6.01  0.48  212.30  40.27 
0.60  6.53  0.45  681.42  147.74  
0.54  5.86  0.28  137.22  94.90  
${\text{GMSE}}_{r}^{2}(\times 10^{-3})$  0.04  6.62  0.19  76.09  45.29 
0.04  2.40  0.01  142.72  17.37  
0.05  0.84  0.05  76.22  33.78  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.46  1.86  0.53  10.68  11.86 
0.29  1.18  0.31  15.18  12.50  
0.28  1.36  0.26  5.33  13.96  
GSAD (×10^{ −2 })  3.37  3.39  3.38  24.23  33.38 
RE  0.18  0.18  0.18  1.84  0.18 
Time (s)  1.24×10^{ 3 }  0.10  0.95  53.60  0.56×10^{ 3 } 
(c) R = 4  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.39  6.02  87.78  205.66  195.89 
0.60  6.53  0.45  247.96  101.34  
0.54  8.03  0.26  330.01  68.69  
${\text{GMSE}}_{r}^{2}(\times 10^{-3})$  0.04  23.82  26.56  64.59  57.58 
0.04  11.70  0.23  114.02  3.10  
0.05  6.37  18.04  75.47  27.72  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.46  1.86  6.14  9.74  8.84 
0.29  1.18  0.31  22.15  26.80  
0.28  1.36  0.26  8.17  27.32  
GSAD (×10^{ −2 })  3.39  3.34  3.36  28.62  29.23 
RE  0.18  0.18  0.18  2.08  0.18 
Time (s)  1.24×10^{ 3 }  0.11  0.96  63.88  0.70×10^{ 3 } 
Simulation results for dataset ${\mathcal{D}}_{\mathbf{2}}$
(a) R = 2  

uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}$  0.09  1.97  N/A  N/A  N/A 
0.14  N/A  1.06  37.67  58.75  
0.14  0.12  26.68  52.09  150.09  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.34  0.01  N/A  N/A  N/A 
0.15  N/A  1.12  1.17  22.37  
0.09  0.94  6.24  0.62  1.18  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.39  0.44  N/A  N/A  N/A 
0.48  N/A  1.32  16.53  13.34  
0.47  0.44  3.72  15.21  18.14  
GSAD (×10^{ −2 })  1.51  1.02  1.53  37.99  129.40 
RE (×10^{ −2 })  0.64  1.62  1.65  0.65  5.47 
Time (s)  22.06×10^{ 3 }  0.29  32.02  4.07×10^{ 3 }  9.24×10^{ 3 } 
(b) R = 3  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}$  0.09  1.97  14.87  24.41  61.00 
0.14  0.01  20.53  50.59  58.31  
0.14  0.09  14.02  35.89  65.11  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.34  0.03  0.34  1.41  4.80 
0.15  0.02  2.44  0.65  9.40  
0.09  0.05  0.92  1.19  5.40  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.39  0.44  2.84  14.35  13.72 
0.48  0.12  4.75  15.47  13.62  
0.47  0.37  4.00  17.50  15.82  
GSAD (×10^{ −2 })  1.02  1.02  1.49  29.29  129.29 
RE (×10^{ −2 })  0.64  0.63  1.55  0.75  1.62 
Time (s)  22.06×10^{ 3 }  0.28  45.91  5.37×10^{ 3 }  16.59×10^{ 3 } 
(c) R = 4  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}$  0.09  1.97  13.13  24.25  64.90 
0.14  0.01  20.53  50.52  64.09  
0.14  0.09  14.02  28.32  69.99  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.34  0.09  0.20  1.42  15.12 
0.15  0.48  1.00  0.65  9.55  
0.09  0.05  0.44  1.31  7.73  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.39  0.44  2.54  14.74  14.53 
0.48  0.13  5.52  15.45  14.55  
0.47  0.37  4.79  16.45  16.17  
GSAD (×10^{ −2 })  1.02  1.01  1.06  40.36  129.29 
RE (×10^{ −2 })  0.64  0.63  0.69  0.86  1.50 
Time (s)  22.06×10^{ 3 }  0.54  55.86  5.59×10^{ 3 }  16.59×10^{ 3 } 
Simulation results for dataset ${\mathcal{D}}_{\mathbf{3}}$
(a) R = 2  

uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-3})$  0.01  0.83  0.82  N/A  1.14 
0.85  0.80  0.92  1.34  2.30  
1.15  N/A  N/A  1.36  N/A  
${\text{GMSE}}_{r}^{2}(\times 10^{-2})$  7.75  7.29  7.72  N/A  8.94 
7.76  0.47  0.48  12.30  11.86  
9.84  N/A  N/A  11.05  N/A  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.59  7.09  7.04  N/A  15.55 
7.13  6.71  7.19  8.41  16.43  
8.71  N/A  N/A  8.54  N/A  
GSAD (×10^{ −1 })  3.23  2.58  2.59  6.59  15.26 
RE (×10^{ −4 })  3.11  0.70  0.70  0.47  2.50 
Time (s)  1.59×10^{ 3 }  0.01  0.70  42.02  0.40×10^{ 3 } 
(b) R = 3  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-3})$  0.01  0.15  0.15  1.74  1.20 
0.85  1.02  0.76  1.76  2.26  
1.15  1.57  1.03  1.55  2.40  
${\text{GMSE}}_{r}^{2}(\times 10^{-2})$  7.75  14.89  2.80  11.40  14.09 
7.76  0.11  0.40  12.11  12.33  
9.84  0.11  0.30  10.94  12.76  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.59  2.60  2.47  11.34  15.76 
7.13  7.16  6.59  9.45  16.40  
8.71  8.80  7.67  9.06  15.66  
GSAD (×10^{ −1 })  3.23  2.58  1.71  6.88  15.20 
RE (×10^{ −4 })  3.11  0.27  0.29  0.49  2.44 
Time (s)  1.59×10^{ 3 }  0.10  1.24  59.72  0.54×10^{ 3 } 
(c) R = 4  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-3})$  0.01  0.02  1.43  1.43  1.19 
0.85  1.48  5.49  3.92  2.06  
1.15  1.68  0.90  1.88  2.33  
${\text{GMSE}}_{r}^{2}(\times 10^{-2})$  7.75  13.78  20.56  16.66  13.15 
7.76  4.35  12.36  15.34  11.75  
9.84  3.99  2.67  11.25  13.29  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.59  0.97  10.27  10.24  15.97 
7.13  7.93  15.78  16.45  14.92  
8.71  8.66  6.93  10.98  15.89  
GSAD (×10^{ −1 })  3.23  1.17  1.20  5.51  15.98 
RE (×10^{ −4 })  3.11  0.16  0.16  0.41  2.45 
Time (s)  1.59×10^{ 3 }  0.13  1.15  67.71  0.69×10^{ 3 } 
Simulation results for dataset ${\mathcal{D}}_{\mathbf{4}}$
(a) R = 2  

uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.02  N/A  5.12  N/A  N/A 
1.61  0.01  3.59  15.35  18.69  
0.05  0.44  N/A  14.42  19.20  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.28  N/A  3.23  N/A  N/A 
0.87  0.02  2.65  0.33  1.62  
0.69  0.76  N/A  0.50  1.30  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.34  N/A  4.25  N/A  N/A 
3.08  0.17  3.71  14.90  14.89  
0.51  0.68  N/A  15.59  15.70  
GSAD (×10^{ −2 })  4.97  5.24  5.25  157.09  156.19 
RE (×10^{ −4 })  4.49  4.88  4.89  19.34  8.48 
Time (s)  1.61×10^{ 3 }  0.02  1.36  35.29  0.40×10^{ 3 } 
(b) R = 3  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.02  0.01  6.18  18.38  21.63 
1.61  0.01  4.79  16.10  19.55  
0.05  0.09  4.21  15.04  19.85  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.28  0.05  1.67  1.44  1.29 
0.87  0.05  1.01  0.37  1.75  
0.69  0.05  0.94  0.26  1.17  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.34  0.27  4.12  15.21  15.65 
3.08  0.17  4.09  15.26  15.90  
0.51  0.32  4.16  16.07  15.36  
GSAD (×10^{ −2 })  4.97  4.95  4.99  157.08  154.80 
RE (×10^{ −4 })  4.49  4.34  4.36  25.00  8.48 
Time (s)  1.61×10^{ 3 }  0.10  1.78  41.05  0.55×10^{ 3 } 
(c) R = 4  
uBLU  PCA  NMF  BFRM  GBGMF  
${\text{MSE}}_{r}^{2}(\times 10^{-2})$  0.02  0.01  6.98  17.51  21.60 
1.61  0.01  7.30  15.07  19.03  
0.05  0.07  4.27  14.55  19.14  
${\text{GMSE}}_{r}^{2}(\times 10^{-1})$  0.28  0.22  0.65  0.75  1.29 
0.87  0.51  0.91  0.77  1.18  
0.69  0.05  0.56  0.56  1.33  
${\text{SAD}}_{r}^{2}(\times 10^{-1})$  0.34  0.27  4.41  15.61  15.51 
3.08  0.19  4.81  16.31  14.77  
0.51  0.33  4.00  15.84  15.26  
GSAD (×10^{ −2 })  4.97  4.91  4.94  156.76  162.63 
RE (×10^{ −4 })  4.49  4.30  4.33  13.48  8.29 
Time (s)  1.61×10^{ 3 }  0.16  1.56  48.22  0.70×10^{ 3 } 
These results show that the uBLU method is more flexible, providing better unmixing performance across all of the considered synthetic datasets ${\mathcal{D}}_{1},\dots ,{\mathcal{D}}_{4}$ than the other existing factorization methods (PCA, NMF, BFRM and GBGMF). Moreover, uBLU has the following advantages: i) it is fully unsupervised and does not require the number of factors to be specified as prior knowledge; ii) due to the constraints, the factors and factor scores are estimated without scale ambiguity. The disadvantage is the execution time: uBLU requires more computation due to the Gibbs sampling.
Evaluation on gene expression data
Here the proposed algorithm is illustrated on real time-evolving gene expression data from a recent viral challenge study on influenza A/H3N2/Wisconsin. The data are available at GEO, accession number GSE30550.
Details on data collection
Application of the proposed uBLU algorithm
The uBLU algorithm was run with N_{mc} = 10000 Monte Carlo iterations, including a burn-in period of N_{bi} = 1000 iterations. uBLU allows the posterior distribution of the number of factors R, depicted in Figure 1a, to be estimated. The results show that the MAP estimate of the number of factors is $\hat{R}_{\text{MAP}}=4$ (more than 90% of the generated Gibbs samples of the number of factors were equal to 4).
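The MAP estimate of R reported here is simply the most frequent value among the post-burn-in Gibbs samples; a minimal sketch (the function name is ours):

```python
from collections import Counter

def map_number_of_factors(R_samples, burn_in=1000):
    """MAP estimate of R: the most frequent value among the Gibbs samples
    of the number of factors, after discarding the burn-in period."""
    counts = Counter(R_samples[burn_in:])
    return counts.most_common(1)[0][0]
```

With more than 90% of the post-burn-in samples equal to 4, this mode is a stable estimate regardless of the exact burn-in length.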
The different factors are depicted in Figure 1b, where the G genes have been reordered so that the dominant genes are grouped together in each factor. Factors are then ordered with respect to their maximum loading. Specifically, the kth sharp peak in the figure occurs at the gene index that has maximal loading in factor k. Genes to the right of this dominant gene, up to the (k + 1)st peak, also dominate in this kth factor, though to a lesser degree. uBLU identifies a strong factor (the first factor, in red) by virtue of its significantly larger proportion of highly dominant genes. Many of the genes in this strong factor are recognizable as immune response genes that regulate pattern recognition, interferon, and inflammation pathways in respiratory viral response. A very similar factor was found in a different analysis [14, 18] of this dataset, and here we call it the “inflammatory component”.
The factor scores corresponding to this inflammatory component are shown in Figure 1c, where they are rendered as an image whose columns (respectively rows) index the subjects (respectively the different time sampling instants). Figure 1c shows that uBLU, with the estimated number of factors $\hat{R}=4$, clearly separates the samples of subjects exhibiting symptoms (associated with the last 9 columns) from those who remain asymptomatic (associated with the first 8 columns). Moreover, this symptom factor can be used to segment the data matrix into 3 states: pre-onset symptomatic (before significant symptoms occur), post-onset symptomatic, and asymptomatic.
NCI-curated pathway associations of groups of genes contributing to the uBLU inflammatory component
Pathway name  Genes  Pvalue 

IFN-gamma pathway  CASP1, CEBPB, IL1B, IRF1, IRF9, PRKCD, SOCS1, STAT1, STAT3  1.34e−09 
PDGFR-beta signaling pathway  DOCK4, EIF2AK2, FYN, HCK, LYN, PRKCD, SLA, SRC, STAT1, STAT3, STAT5A, STAT5B  3.26e−08 
IL23-mediated signaling events  CCL2, CXCL1, CXCL9, IL1B, STAT1, STAT3, STAT5A  2.18e−07 
Signaling events mediated by TC-PTP  EIF2AK2, SRC, STAT1, STAT3, STAT5A, STAT5B, STAT6  6.38e−07 
Signaling events mediated by PTP1B  FYN, HCK, LYN, SRC, STAT3, STAT5A, STAT5B  2.40e−06 
GM-CSF-mediated signaling events  CCL2, LYN, STAT1, STAT3, STAT5A, STAT5B  3.70e−06 
IL12-mediated signaling events  HLA-A, IL1B, SOCS1, STAT1, STAT3, STAT5A, STAT6  1.32e−05 
IL6-mediated signaling events  CEBPB, HCK, IRF1, PRKCD, STAT1, STAT3  1.80e−05 
For comparison we applied a supervised version of the proposed uBLU algorithm to the H3N2 dataset. This was implemented by setting the number of factors to R = 4 and using the algorithm of [13] to jointly estimate M and A. The inflammatory component found by the supervised algorithm was virtually identical to the one found by the proposed algorithm (uBLU) that automatically selects R = 4.
Comparison to other methods
The uBLU algorithm is compared with four matrix factorization algorithms, i.e., the PCA, NMF, BFRM and GBGMF methods.
Simulation results for real H3N2 dataset
uBLU  PCA  NMF  BFRM  GBGMF  

Fisher criteria (× 10^{−2}) (22)  6.20  2.03  6.17  4.68  2.30 
RE  6.48×10^{ −2 }  4.89  7.31×10^{ −2 }  4.82  9.51×10^{ −2 } 
Time  ≈ 12 h  1.5 s  116 s  ≈ 47 min  ≈ 10 h 
Number of iterations  10 000  N/A  5 000  10 000  500 
NCI-curated pathway associations of groups of genes contributing to the NMF inflammatory component
Pathway name  Genes  Pvalue 

IL23-mediated signaling events  CCL2, CXCL1, CXCL9, IL1B, JAK2, STAT1, STAT5A  2.18e−07 
IL12-mediated signaling events  GADD45B, IL1B, JAK2, MAP2K6, SOCS1, STAT1, STAT5A, STAT6  1.10e−06 
IFN-gamma pathway  CASP1, IL1B, IRF9, JAK2, SOCS1, STAT1  1.07e−05 
Signaling events mediated by TC-PTP  EIF2AK2, PIK3R2, STAT1, STAT5A, STAT5B, STAT6  1.07e−05 
IL27-mediated signaling events  IL1B, JAK2, STAT1, STAT2, STAT5A  1.22e−05 
CXCR3-mediated signaling events  CXCL10, CXCL11, CXCL13, CXCL9, MAP2K6, PIK3R2  1.23e−05 
GM-CSF-mediated signaling events  CCL2, JAK2, STAT1, STAT5A, STAT5B  6.24e−05 
PDGFR-beta signaling pathway  EIF2AK2, JAK2, PIK3R2, ARAP1, DOCK4, STAT1, STAT5A, STAT5B  1.38e−04 
One can conclude from these comparisons that, when applied to the H3N2 dataset, the proposed uBLU algorithm outperforms the PCA, NMF, BFRM, and GBGMF algorithms in terms of finding genes with higher pathway enrichment and achieving higher contrast of the acute symptom states.
The computational times required by the five considered matrix factorization methods, including the proposed uBLU algorithm, when applied to this real dataset, are reported in Table 7. The GBGMF algorithm is slightly faster than the proposed algorithm but does not identify the inflammatory component or achieve good contrast of the acute symptom states in the H3N2 challenge study.
Conclusions
This paper proposes a new Bayesian unmixing algorithm for discovering signatures in high dimensional biological data, and specifically in gene expression microarrays. An interesting property of the proposed algorithm is that it enforces positivity constraints on the factor loadings as well as positivity and sum-to-one constraints on the factor scores. The advantages of these constraints are that they lead to better discrimination between sick and healthy individuals, and that they recover the inflammatory genes in a unique factor, the inflammatory component. The proposed algorithm is fully unsupervised in the sense that it does not depend on any labeling of the samples and that it can infer the number of factors directly from the observation data matrix. Finally, as with any Bayesian algorithm, the Monte Carlo-based procedure investigated in this study provides point estimates as well as confidence intervals for the unknown parameters, contrary to many existing factor decomposition methods such as PCA or NMF.
Simulation results performed on synthetic and real data demonstrated significant improvements. Indeed, when applied to real time-evolving gene expression datasets, the uBLU algorithm revealed an inflammatory factor with higher contrast between subjects who would become symptomatic and those who would remain asymptomatic (as determined by comparison with ground truth clinical labels).
In this study, the time samples were modeled as independent. Future work includes extending the proposed model to account for time dependency between samples.
Appendix A: Gibbs sampler
The Gibbs sampler generates samples according to the joint posterior distribution of the unknown parameters, where the dimensions of the matrices M, T, and A depend on the unknown number of factors R, and the priors are those defined in the Section “Methods”.
The different steps of the Gibbs sampler are detailed below.
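Before detailing the individual steps, it may help to recall the mechanism a Gibbs sampler relies on: each unknown is drawn in turn from its full conditional distribution, given the current values of all the others. The sketch below illustrates this alternating scheme on a deliberately simple toy model (a Gaussian with unknown mean and variance under standard noninformative priors), not on the factor model of this paper; it is only meant to make the structure of the updates concrete.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y_n ~ N(mu, sigma2) with unknown mu and sigma2.
# This is NOT the paper's factor model, just the Gibbs mechanism.
y = rng.normal(3.0, 1.5, size=200)
N = y.size

mu, sigma2 = 0.0, 1.0          # arbitrary initial state
mu_draws, s2_draws = [], []
for it in range(2000):
    # mu | sigma2, y ~ N(ybar, sigma2 / N)  (flat prior on mu)
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / N))
    # sigma2 | mu, y ~ Inv-Gamma(N/2, S/2) with S = sum (y - mu)^2
    # (Jeffreys prior p(sigma2) ∝ 1/sigma2); sampled as 1/Gamma draw
    S = np.sum((y - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(N / 2.0, 2.0 / S)
    if it >= 500:               # discard burn-in samples
        mu_draws.append(mu)
        s2_draws.append(sigma2)

# Posterior means should be close to the true values 3.0 and 1.5**2
print(np.mean(mu_draws), np.mean(s2_draws))
```

In the uBLU sampler, the same principle applies, with the full conditionals of T, A, and σ² given in the following subsections, and with an additional dimension-changing step for R.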
Inference of the number of factors
The proposed unsupervised algorithm includes a birth/death process for inferring the number of factors R, i.e., it generates samples of R in addition to M and A. More precisely, at iteration t of the algorithm, a birth, death, or switch move is randomly chosen with probabilities ${b}_{{R}^{\left(t\right)}}$, ${d}_{{R}^{\left(t\right)}}$, and ${s}_{{R}^{\left(t\right)}}$, respectively. The birth and death moves increase or decrease the number of factors R by 1 using a reversible jump step (see [21] for more details), whereas the switch move does not change R and requires a Metropolis-Hastings acceptance procedure. Consider a move, at iteration t, from the state {M^{(t)}, A^{(t)}, R^{(t)}} to the new state {M^{ ⋆ }, A^{ ⋆ }, R^{ ⋆ }}. The birth, death, and switch moves, similar to those used in [26], are defined as follows (Algorithms 3, 4 and 5).

Birth move: When a birth move is proposed, a new signature m^{ ⋆ } is randomly generated to build M^{ ⋆ } = [M^{(t)}, m^{ ⋆ }]. The augmented set of signatures is checked to ensure that the signatures remain sufficiently distinct from one another. Then, for each vector a_{ i } (i = 1, …, N), a new factor score coefficient is drawn from a Beta distribution $\mathcal{\mathcal{B}}\left(1,{R}^{\left(t\right)}\right)$, and the new factor score matrix, denoted A^{ ⋆ }, is rescaled to sum to one.

Death move: When a death move is proposed, one of the factors of M^{(t)}, and its corresponding factor score coefficients, are randomly removed. The remaining factor scores are rescaled to ensure the sumtoone constraint.

Switch move: When a switch move is proposed, a signature m^{ ⋆ } is randomly chosen and replaced with another randomly generated signature. If the new signature is too close to an existing one, its corresponding factor scores are distributed proportionally among its closest factors. The switch move thus amounts to creating a new signature (birth move) and deleting another (death move) in a single, faster step.
Each move is then accepted or rejected according to an empirical acceptance probability, namely the likelihood ratio between the current state and the proposed new state. The factor matrix M, the factor score matrix A, and the noise variance σ^{2} are then updated, conditionally upon the number of factors R, using the following Gibbs steps.
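As a minimal illustration of how the birth and death moves can preserve the sum-to-one constraint on the factor scores, the sketch below manipulates a factor score matrix A of size R × N whose columns sum to one. It is a toy sketch under assumed conventions, not the authors' implementation; in particular, shrinking the existing scores by (1 − w_i) is one simple way to realize the rescaling described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def birth_move(A):
    """Birth: add one factor. For each sample i, draw a new score
    w_i ~ Beta(1, R) and shrink the existing scores by (1 - w_i),
    so each column of the enlarged matrix still sums to one."""
    R, N = A.shape
    w = rng.beta(1.0, R, size=N)          # new coefficients, one per sample
    return np.vstack([A * (1.0 - w), w])  # shape (R + 1, N)

def death_move(A):
    """Death: remove one factor at random and renormalize the
    remaining scores so each column again sums to one."""
    R, N = A.shape
    k = rng.integers(R)
    A_star = np.delete(A, k, axis=0)
    return A_star / A_star.sum(axis=0, keepdims=True)

# Toy factor score matrix: R = 3 factors, N = 4 samples, columns sum to one
A = rng.dirichlet(np.ones(3), size=4).T
A_birth = birth_move(A)
A_death = death_move(A)
print(A_birth.shape, np.allclose(A_birth.sum(axis=0), 1.0))  # (4, 4) True
print(A_death.shape, np.allclose(A_death.sum(axis=0), 1.0))  # (2, 4) True
```

In the actual sampler, each proposed A^{ ⋆ } would additionally be accepted or rejected via the likelihood ratio described above.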
Generation of samples according to f(T | A, σ^{ 2 }, R, Y)
For more details on how we generate realizations from this truncated distribution, see [13].
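The conditional distributions involved here are Gaussians truncated to satisfy the positivity constraints. As a generic illustration (not necessarily the sampling scheme of [13]), a Gaussian restricted to the nonnegative half-line can be sampled with `scipy.stats.truncnorm`, whose bounds are expressed in standard-normal units:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_positive_gaussian(mu, sigma, size=None, random_state=None):
    """Draw from N(mu, sigma^2) truncated to [0, +inf).
    truncnorm takes its bounds standardized: a = (lower - mu) / sigma."""
    a = (0.0 - mu) / sigma   # lower truncation bound, standardized
    b = np.inf               # no upper bound
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size,
                         random_state=random_state)

# Even with a negative mean, every draw respects the positivity constraint
x = sample_positive_gaussian(mu=-1.0, sigma=0.5, size=10_000, random_state=123)
print(x.min() >= 0.0)  # True
```

For the multivariate truncated Gaussians arising in the uBLU conditionals, more elaborate schemes are required; see [13].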
Generation of samples according to f(a_{1:R − 1, i} | T, σ^{ 2 }, R, Y)
Generation of samples according to f(σ^{ 2 } | T, A, R, Y)
Appendix B: Contribution of each of uBLU’s constraints
To illustrate the advantage of enforcing the nonnegativity and sum-to-one constraints on the factors and the factor scores, as detailed in the Methods section, we evaluated the effect of successively stripping these constraints from uBLU. In particular, we implemented uBLU under the following conditions: (i) without any constraints, (ii) with only the positivity constraints on the factors and the scores, (iii) with only the sum-to-one constraint on the scores, and (iv) with both the positivity and sum-to-one constraints on factors and scores, as proposed in (5).
Contribution of each of uBLU’s constraints

  Without constraints  Positivity  Sum-to-one  Positivity and sum-to-one
P-value of the “IFN-gamma pathway”  6.00×10^{−2}  2.05×10^{−2}  2.17×10^{−1}  1.34×10^{−9}
P-value of the “IL23-mediated signaling events”  2.60×10^{−1}  8.37×10^{−2}  2.28×10^{−2}  2.18×10^{−7}
Declarations
Acknowledgements
This work was supported in part by DARPA under the PHD program (DARPAN6600109C2082). The views, opinions, and findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. Approved for Public Release, Distribution Unlimited.
Authors’ Affiliations
References
1. Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M: High-dimensional sparse factor modelling: applications in gene expression genomics. J Am Stat Assoc 2008, 103(484):1438-1456. doi:10.1198/016214508000000869
2. Paisley J, Carin L: Nonparametric factor analysis with beta process priors. In Proc 26th Annual Int Conf on Machine Learning (ICML 2009). Montreal, Quebec, Canada; 2009:777-784.
3. Chen B, Chen M, Paisley J, Zaas A, Woods C, Ginsburg GS, Hero AO, Lucas J, Dunson D, Carin L: Bayesian inference of the number of factors in gene-expression analysis: application to human virus challenge studies. BMC Bioinformatics 2010, 11:552. doi:10.1186/1471-2105-11-552
4. West M: Bayesian factor regression models in the “Large p, Small n” paradigm. In Bayesian Statistics 7. Oxford University Press; 2003:723-732.
5. Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17(9):763-774. doi:10.1093/bioinformatics/17.9.763
6. Nascimento JM, Bioucas-Dias JM: Vertex component analysis: a fast algorithm to unmix hyperspectral data. IEEE Trans Geosci Remote Sensing 2005, 43(4):898-910.
7. Lee DD, Seung HS: Algorithms for non-negative matrix factorization. Proc Neural Info Process Syst 2000, 13:556-562.
8. Fogel P, Young SS, Hawkins DM, Ledirac N: Inferential, robust non-negative matrix factorization analysis of microarray data. Bioinformatics 2007, 23:44-49. doi:10.1093/bioinformatics/btl550
9. McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413-422. doi:10.1093/bioinformatics/18.3.413
10. Baek J, McLachlan GJ: Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 2011, 27:1269-1276. doi:10.1093/bioinformatics/btr112
11. Moloshok TD, Klevecz RR, Grant JD, Manion FJ, Speier WF, Ochs MF: Application of Bayesian decomposition for analysing microarray data. Bioinformatics 2002, 18:566-575. doi:10.1093/bioinformatics/18.4.566
12. Fertig EJ, Ding J, Favorov AV, Parmigiani G, Ochs MF: CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 2010, 26:2792-2793. doi:10.1093/bioinformatics/btq503
13. Dobigeon N, Moussaoui S, Coulon M, Tourneret JY, Hero AO: Joint Bayesian endmember extraction and linear unmixing for hyperspectral imagery. IEEE Trans Signal Process 2009, 57(11):4355-4368.
14. Huang Y, Zaas AK, Rao A, Dobigeon N, Woolf PJ, Veldman T, Oien NC, McClain MT, Varkey JB, Nicholson B, Carin L, Kingsmore S, Woods CW, Ginsburg GS, Hero A: Temporal dynamics of host molecular responses differentiate symptomatic and asymptomatic influenza A infection. PLoS Genet 2011, 8(7):e1002234.
15. Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. New York: John Wiley; 2001.
16. Dueck D, Morris QD, Frey BJ: Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics 2005, 21:144-151. doi:10.1093/bioinformatics/bth498
17. Nikulin V, Huang TH, Ng SK, Rathnayake S, McLachlan GJ: A very fast algorithm for matrix factorization. Stat Probability Lett 2011, 81(7):773-782. doi:10.1016/j.spl.2011.02.001
18. Zaas AK, Chen M, Varkey J, Veldman T, Hero AO, Lucas J, Huang Y, Turner R, Gilbert A, Lambkin-Williams R, Øien NC, Nicholson B, Kingsmore S, Carin L, Woods CW, Ginsburg GS: Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host Microbe 2009, 6(3):207-217. [http://www.ncbi.nlm.nih.gov/pubmed/19664979] doi:10.1016/j.chom.2009.07.006
19. Winter ME: N-FINDR: an algorithm for fast autonomous spectral endmember determination in hyperspectral data. In Imaging Spectrometry V, Proc SPIE 3753; 1999:266-275.
20. Gilks WR, Richardson S, Spiegelhalter DJ: Markov Chain Monte Carlo in Practice. London: Chapman and Hall; 1996. (ISBN: 0412055511)
21. Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 82(4):711-732. doi:10.1093/biomet/82.4.711
22. BFRM software: Bayesian factor regression modelling. [http://www.stat.duke.edu/research/software/west/bfrm/download.html]
23. Duda RO, Hart PE, Stork DG: Pattern Classification. New York: Wiley-Interscience; 2000.
24. Cox TF, Cox MAA: Multidimensional Scaling. London: Chapman and Hall; 1994.
25. Robert CP, Casella G: Monte Carlo Statistical Methods. New York: Springer-Verlag; 1999.
26. Dobigeon N, Tourneret JY, Chang CI: Semi-supervised linear spectral unmixing using a hierarchical Bayesian model for hyperspectral imagery. IEEE Trans Signal Process 2008, 56(7):2684-2695.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.