### Mathematical constrained model

Let **y**_{i} represent a gene microarray vector of *G* gene expression levels. The elements of **y**_{i} have units of hybridization abundance levels with non-negative values. In the context of gene expression data, the starting point for Bayesian linear unmixing is the linear mixing model (LMM)

$$\mathbf{y}_i = \sum_{r=1}^{R} a_{r,i}\, \mathbf{m}_r + \mathbf{n}_i,$$

where *R* is the number of distinct gene signatures that can be present in the chip, **m**_{r} = [*m*_{1, r}, …, *m*_{G, r}]^{T} is the *r*-th gene signature vector, *m*_{g, r} ≥ 0 is the strength of the *g*-th gene (*g* = 1, …, *G*) in the *r*-th signature (*r* = 1, …, *R*), and *a*_{r, i} is the relative contribution of the *r*-th signature vector to the *i*-th sample **y**_{i}, where *a*_{r, i} ∈ [0, 1] and $\sum_{r=1}^{R} a_{r,i} = 1$. Here **n**_{i} denotes the residual error of the LMM representation. For a matrix of *N* data samples **Y** = [**y**_{1}, …, **y**_{N}], the LMM can be rewritten with matrix notations

$$\mathbf{Y} = \mathbf{M}\mathbf{A} + \mathbf{N},$$

where **A** = [**a**_{1}, …, **a**_{N}], **M** = [**m**_{1}, …, **m**_{R}] and **N** = [**n**_{1}, …, **n**_{N}] represent the factor score matrix, the factor loading matrix and the noise matrix, respectively. The matrices **M**, **A** satisfy positivity and sum-to-one constraints defined by

$$m_{g,r} \ge 0, \qquad a_{r,i} \ge 0, \qquad \sum_{r=1}^{R} a_{r,i} = 1, \tag{5}$$

where *m*_{g, r} denotes the (*g*, *r*)-th element of matrix **M**. The constraints (5) arise naturally when dealing with positive data for which one is seeking the relative contribution of positive factors that have the same numerical characteristics as the data, i.e., the signature **m**_{r} is itself interpretable as a vector of hybridization abundances.
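The LMM and the constraints (5) can be illustrated with a small synthetic dataset. The sketch below is our own illustration (sizes and noise level are arbitrary, not from the paper); scores on the simplex are drawn from a flat Dirichlet distribution for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
G, R, N = 100, 3, 20          # genes, signatures, samples (arbitrary sizes)

# Non-negative signatures M (G x R): each column is a gene signature m_r.
M = rng.uniform(0.0, 10.0, size=(G, R))

# Factor scores A (R x N): each column a_i is non-negative and sums to one,
# drawn here from a flat Dirichlet distribution for illustration.
A = rng.dirichlet(np.ones(R), size=N).T

# Linear mixing model: Y = M A + N with Gaussian residuals.
noise = rng.normal(0.0, 0.1, size=(G, N))
Y = M @ A + noise

assert np.all(M >= 0) and np.all(A >= 0)
assert np.allclose(A.sum(axis=0), 1.0)    # sum-to-one constraint on each a_i
```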

The objective of linear unmixing is to simultaneously estimate the factor matrix **M** and the factor score matrix **A** from the available *N* data samples. The representation (1) is rank deficient since **A** has rank *N* − 1. This presents algorithmic challenges for solving the unmixing problem. Several algorithms have been proposed in the context of hyperspectral imaging to solve similar problems [6, 19]. Most of these algorithms perform unmixing in a two-step procedure where **M** is estimated first using an *endmember extraction algorithm* (EEA), followed by a constrained linear least squares step to estimate **A**. A common (but restrictive) assumption in these algorithms is that some samples in the dataset are “pure” in the sense that the linear combination of (2) contains a unique factor, say **m**_{r}, with factor score *a*_{r, i} = 1. Recently, this assumption has been relaxed by applying a hierarchical Bayesian approach, called Bayesian linear unmixing (BLU). The resulting algorithm requires the number *R* of factors to be specified (see [13] for details). Here we extend BLU to a fully unsupervised algorithm, called unsupervised BLU (uBLU), that estimates *R* using a birth-death model and a Gibbs sampler. The Gibbs sampler produces an estimate of the entire joint posterior distribution of the model parameters, resulting in a fully Bayesian estimator of the number of factors *R*, the factor loadings **M**, and the factor scores **A**. The uBLU model is described in the next subsection and the Gibbs sampling algorithm is given in the Appendix. In the Results and discussion section below we demonstrate the performance advantages of uBLU as a factor analysis model for simulated and real gene expression data.
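The constrained least squares step of the two-step procedure can be sketched with a generic projected-gradient solver. This is our own illustration, not the paper's implementation; `project_simplex` (the standard Euclidean simplex projection) and all other names are ours.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def unmix_scores(y, M, n_iter=5000):
    """Estimate a_i by minimising ||y - M a||^2 over the simplex (projected gradient)."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    a = np.full(M.shape[1], 1.0 / M.shape[1])
    for _ in range(n_iter):
        a = project_simplex(a - step * (M.T @ (M @ a - y)))
    return a

# Noiseless toy problem: the true scores should be recovered.
rng = np.random.default_rng(1)
M = rng.uniform(0.0, 1.0, size=(6, 3))
a_true = np.array([0.5, 0.3, 0.2])
a_hat = unmix_scores(M @ a_true, M)
assert np.allclose(a_hat.sum(), 1.0)
assert np.linalg.norm(a_hat - a_true) < 1e-3
```

Projected gradient is one of several possible choices here; any solver that handles the positivity and sum-to-one constraints of (5) would do.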

### Unsupervised Bayesian linear unmixing algorithm

The BLU algorithm studied in [13] generates samples distributed according to the posterior distribution of **M** and **A** given the number *R* of factors, for appropriate prior distributions assigned to the mixing parameters in (2). First, the residual errors **n**_{i} in (2) are assumed to be independent and identically distributed (i.i.d.) according to zero-mean Gaussian distributions:

$$\mathbf{n}_i \sim \mathcal{N}\!\left(\mathbf{0}_G,\ \sigma^2 \mathbf{I}_G\right)$$

for *i* = 1, …, *N*, where **I**_{G} denotes the identity matrix of dimension *G* × *G*.

The number of factors *R* to be estimated by the proposed uBLU algorithm is assigned a discrete uniform prior distribution on [2, …, *R*_{max}], where *R*_{max} is the maximal number of factors present in the mixture.

Because of the constraints in (5), the data samples **y**_{i} (*i* = 1, …, *N*) live in a lower-dimensional subspace of $\mathbb{R}^{G}$ (whose dimension is upper-bounded by *K* − 1), denoted $\mathcal{V}_K$ (*R*_{max} − 1 ≤ *K* ≤ *G*). This subspace can be identified by a standard dimension reduction procedure, such as PCA. Hence, instead of estimating the factor loadings **m**_{r} (*r* = 1, …, *R*), we propose to estimate their corresponding projections **t**_{r} onto this subspace. Specifically, these projections can be represented as

$$\mathbf{t}_r = \mathbf{P}\left(\mathbf{m}_r - \bar{\mathbf{y}}\right),$$

where $\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{y}_i$ is the empirical mean of the data matrix **Y** and **P** is the (*K* − 1) × *G* appropriate projection matrix that projects onto $\mathcal{V}_K$, which can be constructed from the principal eigenvectors of the empirical covariance matrix of **Y**. This dimension reduction procedure allows us to work in a lower-dimensional subspace without loss of information, and significantly reduces the computational complexity of the Bayes estimator of the factor loadings. A multivariate Gaussian distribution (MGD) truncated on a subset $\mathcal{T}_r$ is chosen as prior distribution for the projected factors **t**_{r}. The subset $\mathcal{T}_r$ is defined in order to ensure the non-negativity constraint on **m**_{r} (see [13]).

More precisely, $\mathcal{T}_r$ is obtained by noting that $\mathbf{m}_r = \mathbf{P}^{T}\mathbf{t}_r + \bar{\mathbf{y}}$ and by looking for the vectors **t**_{r} such that all components of $\mathbf{P}^{T}\mathbf{t}_r + \bar{\mathbf{y}}$ are non-negative. To estimate the mean vectors **e**_{r} of these truncated MGDs, one can use a standard endmember extraction algorithm (EEA) common in hyperspectral imaging, e.g., N-FINDR [19]. To summarize, the prior distribution for the projected factor **t**_{r} is

$$\mathbf{t}_r \sim \mathcal{N}_{\mathcal{T}_r}\!\left(\mathbf{e}_r,\ s^2 \mathbf{I}_{K-1}\right),$$

where $\mathcal{N}_{\mathcal{T}_r}(\cdot,\cdot)$ denotes the truncated MGD with mean vector **e**_{r} and covariance matrix $s^2 \mathbf{I}_{K-1}$, with *s*^{2} a fixed hyperparameter. Assuming the vectors **t**_{r}, for *r* = 1, …, *R*, are *a priori* independent, the prior distribution for the projected factor matrix **T** = [**t**_{1}, …, **t**_{R}] is

$$f\!\left(\mathbf{T} \mid \mathbf{E}, s^2, R\right) \propto \exp\!\left(-\frac{\left\|\mathbf{T} - \mathbf{E}\right\|^2}{2 s^2}\right) \prod_{r=1}^{R} \mathbf{1}_{\mathcal{T}_r}\!\left(\mathbf{t}_r\right),$$

where ∝ stands for “proportional to”, ∥·∥ is the standard *l*_{2}-norm, $\mathbf{1}_{\mathcal{T}_r}(\cdot)$ denotes the indicator function on the set $\mathcal{T}_r$, **E** = [**e**_{1}, …, **e**_{R}] and $\|\mathbf{T} - \mathbf{E}\|^2 = \sum_{r=1}^{R} \|\mathbf{t}_r - \mathbf{e}_r\|^2$.
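A simple way to draw from such a truncated MGD prior is naive rejection sampling: draw from the untruncated Gaussian and keep draws whose implied signature is non-negative. This is an illustrative sketch under our own toy truncation set, not the sampler used in the paper.

```python
import numpy as np

def sample_truncated_mgd(e, s2, accept, rng, max_tries=100000):
    """Draw t ~ N(e, s2*I) conditioned on accept(t) being True (rejection sampling)."""
    for _ in range(max_tries):
        t = rng.normal(e, np.sqrt(s2))
        if accept(t):
            return t
    raise RuntimeError("acceptance region too small for naive rejection sampling")

rng = np.random.default_rng(3)
# Toy truncation: keep t whose implied signature P^T t + y_bar is non-negative.
P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # orthonormal rows (K-1 = 2, G = 3)
y_bar = np.array([0.5, 0.5, 0.5])
accept = lambda t: np.all(P.T @ t + y_bar >= 0.0)

t = sample_truncated_mgd(np.zeros(2), 0.2, accept, rng)
assert np.all(P.T @ t + y_bar >= 0.0)
```

Naive rejection becomes inefficient when the truncation region has small prior mass; dedicated truncated-Gaussian samplers are then preferable.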

The sum-to-one constraint for the factor scores **a**_{i}, for each observed sample *i* (*i* = 1, …, *N*), allows this vector **a**_{i} to be rewritten as

$$\mathbf{a}_i = \left[\mathbf{a}_{1:R-1,i}^{T},\ a_{R,i}\right]^{T} \quad \text{with} \quad a_{R,i} = 1 - \sum_{r=1}^{R-1} a_{r,i}$$

and $\mathbf{a}_{1:R-1,i} = [a_{1,i}, \ldots, a_{R-1,i}]^{T}$. Note here that any component of **a**_{i} could be expressed as a function of the others, i.e., $a_{r,i} = 1 - \sum_{j \neq r} a_{j,i}$. The last component *a*_{R, i} has been chosen here for notation simplicity. To ensure the positivity constraint, the subvectors **a**_{1:R − 1, i} must belong to the simplex

$$\mathcal{S} = \left\{\mathbf{a} : \left\|\mathbf{a}\right\|_1 \le 1 \ \text{and} \ \mathbf{a} \succeq \mathbf{0}\right\},$$

where ∥·∥_{1} is the *l*_{1} norm ($\|\mathbf{a}\|_1 = \sum_r |a_r|$) and **a**_{i} ≽ **0** stands for the set of inequalities {*a*_{r, i} ≥ 0}_{r = 1, …, R}. Following the model in [13], we propose to assign uniform distributions over the simplex $\mathcal{S}$ as priors for the subvectors **a**_{1:R − 1, i} (*i* = 1, …, *N*), i.e.,

$$f\!\left(\mathbf{a}_{1:R-1,i} \mid R\right) \propto \mathbf{1}_{\mathcal{S}}\!\left(\mathbf{a}_{1:R-1,i}\right).$$
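Sampling from this uniform prior is straightforward: a flat Dirichlet draw of dimension *R*, with the last coordinate dropped, is uniform over the simplex of subvectors. This is a standard construction, sketched here with our own variable names.

```python
import numpy as np

rng = np.random.default_rng(4)
R, N = 4, 1000

# Flat Dirichlet on the full simplex; dropping a_R gives a uniform draw
# of the subvector a_{1:R-1} over the set {a >= 0, ||a||_1 <= 1}.
full = rng.dirichlet(np.ones(R), size=N)     # N x R, rows sum to one
sub = full[:, :R - 1]                        # the subvectors a_{1:R-1,i}

assert np.all(sub >= 0.0)
assert np.all(sub.sum(axis=1) <= 1.0 + 1e-12)
# The implied last component completes each row to the sum-to-one constraint.
assert np.allclose(1.0 - sub.sum(axis=1), full[:, -1])
```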

For the prior distribution on the variance *σ*^{2} of the residual errors we chose a conjugate inverse-Gamma distribution with parameters *ν* / 2 and *γ* / 2:

$$\sigma^2 \mid \nu, \gamma \sim \mathcal{IG}\!\left(\frac{\nu}{2},\ \frac{\gamma}{2}\right).$$

The shape parameter *ν* is a fixed hyperparameter whereas the scale parameter *γ* will be adjustable, as in [13]. A non-informative Jeffreys’ prior is chosen as prior distribution for the hyperparameter *γ*, i.e.,

$$f(\gamma) \propto \frac{1}{\gamma}\, \mathbf{1}_{\mathbb{R}^{+}}(\gamma).$$
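NumPy has no direct inverse-Gamma sampler, but a draw from IG(*ν*/2, *γ*/2) is the reciprocal of a Gamma variate. The sketch below is our own (the *ν*, *γ* values are arbitrary); it checks the draws against the known inverse-Gamma mean.

```python
import numpy as np

rng = np.random.default_rng(5)
nu, gamma_ = 6.0, 8.0                 # IG(3, 4): mean = 4 / (3 - 1) = 2

# If x ~ Gamma(shape=a, scale=1/b), then 1/x ~ InverseGamma(a, b).
sigma2 = 1.0 / rng.gamma(nu / 2.0, scale=2.0 / gamma_, size=200000)

assert np.all(sigma2 > 0)
assert abs(sigma2.mean() - 2.0) < 0.05
```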

The resulting hierarchical structure of the proposed uBLU model is summarized in the directed acyclic graph (DAG) presented in Additional file 1: Figure S1.

The model defined in (1) and the Gaussian assumption for the noise vectors **n**_{1}, …, **n**_{N} allow the likelihood of **y**_{1}, …, **y**_{N} to be determined:

$$f\!\left(\mathbf{Y} \mid \mathbf{M}, \mathbf{A}, \sigma^2, R\right) = \left(\frac{1}{2 \pi \sigma^2}\right)^{GN/2} \exp\!\left(-\frac{\left\|\mathbf{Y} - \mathbf{M}\mathbf{A}\right\|_F^2}{2 \sigma^2}\right),$$

where ∥·∥_{F} denotes the Frobenius norm. Multiplying this likelihood by the parameter priors defined in (10), (13), (14) and (6), and integrating out the nuisance parameter *γ*, the posterior distribution of the unknown parameter vector **Θ** = {**M**, **A**, *σ*^{2}, *R*} can be expressed as

$$f\!\left(\boldsymbol{\Theta} \mid \mathbf{Y}\right) \propto f\!\left(\mathbf{Y} \mid \boldsymbol{\Theta}\right) f\!\left(\boldsymbol{\Theta}\right).$$

Considering the parameters to be *a priori* independent, the following result can be obtained:

$$f\!\left(\boldsymbol{\Theta} \mid \mathbf{Y}\right) \propto f\!\left(\mathbf{Y} \mid \mathbf{M}, \mathbf{A}, \sigma^2, R\right) f\!\left(\mathbf{A} \mid R\right) f\!\left(\mathbf{T} \mid \mathbf{E}, s^2, R\right) f\!\left(\sigma^2 \mid \nu, \gamma\right) f(R),$$

where *f*(**A**|*R*), *f*(**T**|**E**, *s*^{2}, *R*) and *f*(*σ*^{2}|*ν*, *γ*) are respectively the prior distributions of the factor score matrix **A**, the projected factor matrix **T** and the noise variance *σ*^{2} previously defined.

Due to the constraints enforced on the data, the posterior distribution *f*(**M**, **A**, *R*|**Y**) obtained from the proposed hierarchical structure is too complex to derive analytical expressions of the Bayesian estimators, e.g., the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimators. In such cases, it is natural to use Markov chain Monte Carlo (MCMC) methods [20] to generate samples **M**^{(t)}, **A**^{(t)} and *R*^{(t)} asymptotically distributed according to *f*(**M**, **A**, *R*|**Y**). However, the dimensions of the factor loading matrix **M** and the factor score matrix **A** depend on the unknown number *R* of signatures to be identified. As a consequence, sampling from *f*(**M**, **A**, *R*|**Y**) requires exploring parameter spaces of different dimensions. To solve this dimension matching problem, we include a birth/death process within the MCMC procedure. Specifically, a birth, death or switch move is chosen at each iteration of the algorithm (see the Appendix and [21]). This birth-death model differs from the classical reversible-jump MCMC (RJ-MCMC) (as defined in [21]) in the sense that, for the birth-death model, each move is accepted or rejected at each iteration using the likelihood ratio between the current state and the new state proposed by the algorithm. The factor matrix **M**, the factor score matrix **A** and the noise variance *σ*^{2} are then updated, conditionally upon the number of factors *R*, using Gibbs moves.
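The move structure can be caricatured as follows. This is a drastically simplified, hypothetical sketch: the switch move, the priors and the Gibbs updates from the Appendix are omitted, the proposals are our own, and acceptance uses only the likelihood ratio described above.

```python
import numpy as np

def log_lik(Y, M, A, sigma2):
    """Gaussian log-likelihood of the LMM residuals Y - M A."""
    res = Y - M @ A
    return -0.5 * res.size * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(res ** 2) / sigma2

rng = np.random.default_rng(6)
G, N, R_max = 30, 40, 6
Y = rng.uniform(0.0, 1.0, size=(G, N))
M = rng.uniform(0.0, 1.0, size=(G, 2))            # start with R = 2 factors
A = rng.dirichlet(np.ones(2), size=N).T

for _ in range(200):
    R = M.shape[1]
    birth = (R < R_max) and (R == 2 or rng.random() < 0.5)
    if birth:
        # Birth: adopt a random data sample as a new signature and
        # rescale the score columns so they stay on the simplex.
        w = rng.uniform(0.0, 0.3, size=N)
        M_new = np.column_stack([M, Y[:, rng.integers(N)]])
        A_new = np.vstack([A * (1.0 - w), w])
    else:
        # Death: drop a factor and renormalise the remaining scores.
        r = rng.integers(R)
        M_new = np.delete(M, r, axis=1)
        A_new = np.delete(A, r, axis=0)
        A_new = A_new / A_new.sum(axis=0, keepdims=True)
    # Accept/reject using the likelihood ratio between proposed and current state.
    if np.log(rng.random()) < log_lik(Y, M_new, A_new, 0.01) - log_lik(Y, M, A, 0.01):
        M, A = M_new, A_new

assert 2 <= M.shape[1] <= R_max                   # R stays in its admissible range
assert np.allclose(A.sum(axis=0), 1.0)            # scores remain on the simplex
```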

After a sufficient number of iterations (N_{mc} iterations, including a burn-in period of N_{bi} iterations), the traditional Bayesian estimators (e.g., MMSE and MAP) can be approximated using the generated samples **M**^{(t)}, **A**^{(t)} and *R*^{(t)}. First, the generated samples are used to approximate the MAP estimator of the number of factors:

$$\hat{R}_{\mathrm{MAP}} = \underset{k \in \{2, \ldots, R_{\max}\}}{\operatorname{arg\,max}}\ N_k,$$

where *N*_{k} is the number of generated samples *R*^{(t)} satisfying *R*^{(t)} = *k*, and $\sum_{k} N_k = N_{\mathrm{mc}} - N_{\mathrm{bi}}$. Then, conditioned on $\hat{R}_{\mathrm{MAP}}$, the joint MAP estimator $(\hat{\mathbf{M}}, \hat{\mathbf{A}})$ of the factor and factor score matrices is determined as follows

$$\left(\hat{\mathbf{M}}, \hat{\mathbf{A}}\right) = \underset{\left(\mathbf{M}, \mathbf{A}\right)}{\operatorname{arg\,max}}\ f\!\left(\mathbf{M}, \mathbf{A} \mid \hat{R}_{\mathrm{MAP}}, \mathbf{Y}\right).$$
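The MAP estimate of *R* is simply the most frequent value among the post-burn-in draws *R*^{(t)}, as the minimal sketch below shows (the chain of draws is made up for illustration).

```python
import numpy as np

# Hypothetical chain of sampled factor counts after burn-in removal.
R_samples = np.array([3, 4, 3, 3, 2, 3, 4, 3, 3, 4])

counts = np.bincount(R_samples)   # counts[k] = N_k, the number of draws with R^(t) = k
R_map = int(np.argmax(counts))
assert R_map == 3
assert counts[2:].sum() == R_samples.size   # the N_k sum to the number of retained draws
```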