Probabilistic PCA (PPCA) is a probabilistic formulation of PCA based on a Gaussian latent variable model and was first introduced by Tipping and Bishop in 1999 [12]. The PPCA model reduces the dimension of high-dimensional data by relating a *p*-dimensional observed data point to a corresponding *q*-dimensional latent variable through a linear transformation function, where *q* ≪ *p*. Given the statistical model underpinning PPCA, extensions of the model are possible, and a wealth of statistical tools can be utilized. Such extensions and tools are detailed in what follows.

### Probabilistic Principal Components Analysis

Let x_{i} = (*x*_{i1}, . . ., *x*_{ip})^{T} be an observed set of variables (e.g. an NMR spectrum) for observation *i* and u_{i} = (*u*_{i1}, . . ., *u*_{iq})^{T} be a latent variable corresponding to observation *i* in the latent, reduced dimension space. In terms of traditional PCA, u_{i} can be viewed as the principal score of subject *i*. The PPCA model can be expressed as

x_{i} = **W**u_{i} + μ + ε_{i}     (1)

where **W** is a *p* × *q* loadings matrix, μ is a mean vector and ε_{i} is multivariate Gaussian noise for observation *i*, i.e. *p*(ε_{i}) = *MVN*_{p}(0, *σ*^{2}**I**), where **I** denotes the identity matrix. The latent variable u_{i} is also assumed to be multivariate Gaussian distributed, *p*(u_{i}) = *MVN*_{q}(0, **I**). The conditional distribution of the observed data given the latent variable can then be expressed as

*p*(x_{i}|u_{i}) = *MVN*_{p}(**W**u_{i} + μ, *σ*^{2}**I**).

The distribution of the observed data, *p*(x_{i}), also known as the predictive distribution, can be derived from the convolution of *p*(u_{i}) and *p*(x_{i}|u_{i}), giving

*p*(x_{i}) = *MVN*_{p}(μ, **WW**^{T} + *σ*^{2}**I**).

In contrast to the more conventional view of PCA, which is a mapping from the high dimensional data space to a low dimensional latent space, the PPCA framework is based on a mapping from a latent space to the data space. The observed data x_{i} are generated by first drawing a value for the latent variable u_{i} from its unit variance multivariate Gaussian distribution, *p*(u_{i}). The observed variable x_{i} is then sampled, conditional on the generated value for u_{i}, from the isotropic distribution defined in (1).

Any observed data point x_{i} can be represented in the latent space by its corresponding *q*-dimensional latent variable u_{i}. The distribution of the latent variable given the observed data can be derived using Bayes' Theorem to give

*p*(u_{i}|x_{i}) = *MVN*_{q}(**M**^{-1}**W**^{T}(x_{i} - μ), *σ*^{2}**M**^{-1})     (2)

where **M** is a *q* × *q* matrix defined as **M** = **W**^{T}**W** + *σ*^{2}**I**. A key benefit of the PPCA approach is that not only is an estimate of the location of each observation in the lower dimensional space available through its expected value E(u_{i}) = **M**^{-1}**W**^{T}(x_{i} - μ), but an estimate of its associated uncertainty is also available through the covariance matrix *σ*^{2}**M**^{-1} in (2). This is in contrast to conventional PCA, where the lower dimensional location (i.e. the score) of an observation is available but the uncertainty associated with it is not. The parameters (**W**, μ and *σ*^{2}) of the PPCA model can be estimated using maximum likelihood. Maximizing the (log) likelihood function with respect to the model parameters is non-trivial; in [12] it is demonstrated that the estimates do, however, have closed form solutions. Crucially, the log likelihood of the PPCA model is maximized when the columns of **W** span the principal subspace of conventional PCA [12]. Thus the maximum likelihood estimate (MLE) of the loadings matrix **Ŵ** in PPCA corresponds exactly to the loadings matrix in conventional PCA. Hence the model output in PPCA is exactly that obtained in conventional PCA, but with the additional advantages of uncertainty assessment and potential model extensions.
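To make the closed form solution concrete, the following sketch (in Python with NumPy, rather than the **R** implementation in MetabolAnalyze; the simulated data and all dimensions are illustrative assumptions, not from the article) computes the eigen-based MLEs of **W** and *σ*^{2}, the posterior mean E(u_{i}) and the shared posterior covariance *σ*^{2}**M**^{-1} in (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a spectral data matrix: n observations, p bins.
# All names and sizes here are illustrative, not taken from the article.
n, p, q = 200, 10, 2
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))

mu = X.mean(axis=0)                         # MLE of the mean
S = np.cov(X, rowvar=False, bias=True)      # sample covariance
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]              # sort eigenvalues, descending
vals, vecs = vals[order], vecs[:, order]

# Closed-form MLEs (Tipping & Bishop): sigma^2 is the average of the
# p - q discarded eigenvalues; W spans the leading principal subspace.
sigma2 = vals[q:].mean()
W = vecs[:, :q] * np.sqrt(vals[:q] - sigma2)

# Posterior quantities for each observation, as in (2):
M = W.T @ W + sigma2 * np.eye(q)
scores = (X - mu) @ W @ np.linalg.inv(M).T  # E(u_i) = M^{-1} W^T (x_i - mu)
post_cov = sigma2 * np.linalg.inv(M)        # shared posterior covariance
```

Note that `scores` recovers the principal scores of conventional PCA up to the shrinkage induced by **M**^{-1}, while `post_cov` is the uncertainty estimate that conventional PCA does not provide.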

In this article maximum likelihood estimates of the model parameters are derived via the EM algorithm [13] because of its stability and widespread applicability. The EM algorithm is typically used to compute MLEs in probabilistic models when the model depends on unobserved variables or when some data are missing. The algorithm alternates between two steps until convergence: the expectation (E) step and the maximization (M) step. In the E-step, the expected values of the latent variables are estimated given the observed data and the current estimates of the model parameters. In the M-step, the model parameters are re-estimated by maximizing the log likelihood function using the expected values of the latent variables derived in the previous E-step. The two steps are repeated until convergence. Many convergence assessment criteria are available; some criteria are based on log likelihood gain between iterations while others use an estimate of the converged log likelihood value as a basis for stopping. Here Aitken's acceleration procedure [22] is used for convergence assessment. Specific details of the EM algorithm for the PPCA model are given in Additional File 1. An implementation of the algorithm is available in the package MetabolAnalyze through the **R** statistical software [21].

### Probabilistic Principal Components and Covariates Analysis

With its basis in a statistical model, the PPCA model can be extended in several ways. Given the availability and relevance of subject covariates in metabolomic studies, here the PPCA model is extended to facilitate joint modeling of covariates and metabolomic data, giving the probabilistic principal components and covariates analysis (PPCCA) model. This novel model extension is achieved by assuming that the latent variable distribution for observation *i* follows a multivariate Gaussian distribution centered at δ_{i} rather than at the origin, i.e. *p*(u_{i}) = *MVN*_{q}(δ_{i}, **I**), where

δ_{i} = **α**C_{i}.     (3)

Here **α** is a *q* × (*L* + 1) matrix of parameters which capture the relationship between the latent variable and the covariates, and C_{i} is an (*L* + 1)-dimensional vector consisting of an intercept term and the *L* covariates of observation *i*. The motivation behind this model extension is that a subject's covariates may influence their location in the principal subspace. Conditional on this location, the observed data point is then generated as in (1). Note that through the model definition (3) the covariates may have different effects on each of the dimensions of the principal subspace, through the parameter vectors α_{1}, . . ., α_{q}.

Under the PPCCA model, the conditional distribution of x_{i} given u_{i} is the same as that of the PPCA model given in (1). The predictive distribution *p*(x_{i}) differs from that of the PPCA model and is now defined as

*p*(x_{i}) = *MVN*_{p}(**W**δ_{i} + μ, **WW**^{T} + *σ*^{2}**I**).

The posterior distribution of the latent variable u_{i} given the observed data x_{i} is also affected by the inclusion of covariates and is defined to be

*p*(u_{i}|x_{i}) = *MVN*_{q}(**M**^{-1}[**W**^{T}(x_{i} - μ) + *σ*^{2}δ_{i}], *σ*^{2}**M**^{-1}).

The location (or score) of observation *i* in the latent space is given by E(u_{i}) = **M**^{-1}[**W**^{T}(x_{i} - μ) + *σ*^{2}δ_{i}], which depends on both the data point x_{i} and the covariates of observation *i* through δ_{i}. Thus when representing an observation in a reduced dimensional space the PPCCA model takes account of both the spectral data and the associated covariates. Deeper insight into the true underlying structure of the data is then feasible, as possibly influential external factors are explicitly modeled.
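The assembly of a PPCCA score from a data point and its covariates can be sketched as follows (Python/NumPy; the parameter values below are arbitrary placeholders standing in for fitted estimates, not output of the model).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical fitted PPCCA parameters (illustrative values only).
p, q, L, n = 6, 2, 2, 5
W = rng.standard_normal((p, q))          # p x q loadings matrix
mu = rng.standard_normal(p)              # mean vector
sigma2 = 0.5                             # noise variance
alpha = rng.standard_normal((q, L + 1))  # q x (L+1) regression parameters

X = rng.standard_normal((n, p))          # observed data, one row per subject
# Covariate matrix: an intercept column followed by L covariates.
C = np.column_stack([np.ones(n), rng.standard_normal((n, L))])

delta = C @ alpha.T                      # delta_i = alpha C_i, one row per subject
M = W.T @ W + sigma2 * np.eye(q)
# E(u_i) = M^{-1} [ W^T (x_i - mu) + sigma^2 delta_i ], computed for all rows:
scores = ((X - mu) @ W + sigma2 * delta) @ np.linalg.inv(M).T
```

Setting `delta` to zero recovers the PPCA score, which makes explicit how the covariates shift each observation's location in the latent space.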

The effect of covariates on the *q*th latent dimension can be explored through examination of the estimated regression parameter vector α_{q} = (*α*_{q0}, . . ., *α*_{qL})^{T}; these parameters can be interpreted within the context of the problem to provide insight into the type and strength of influence some covariates may have on the (often interpretable) latent dimensions.

Parameter estimation for the PPCCA model can be achieved via an efficient EM algorithm; specific details of the EM algorithm for the PPCCA model are given in Additional File 1. An implementation of the algorithm is available in the package MetabolAnalyze through the **R** statistical software [21].

### Mixtures of Probabilistic Principal Components Analysis Models

The models discussed so far assume that the association between the observed data and the latent variable is linear. This assumption can be inadequate in a situation where the observations in the data set have an underlying group structure. In such cases the linearity assumption may not reveal all of the internal structures of the data [23]. Standard PCA also suffers from this phenomenon.

In many high throughput technologies which result in high dimensional data, interest often lies in identifying underlying sub-groups within a set of observations. Exploring high dimensional data with underlying nonlinear structures therefore requires modeling attention. Employing a single PPCA model for such data is not adequate, since PPCA provides a globally linear projection of the data. A collection of single PPCA models can be combined to obtain a mixture of probabilistic principal components analysis (MPPCA) models [18], which clusters observations into groups and reduces data dimension.

Under an MPPCA model, with probability *π*_{g}, observation *i* is modeled as

x_{i} = **W**_{g}u_{ig} + μ_{g} + ε_{ig}.

Here **W**_{g} and μ_{g} are a *p* × *q* loadings matrix and the mean vector respectively for group *g*, and ε_{ig} is a multivariate Gaussian noise process for observation *i*, given that *i* is a member of group *g*. The latent location for observation *i*, given that *i* is a member of group *g*, is denoted u_{ig}. That is, with probability *π*_{g}, observation *i* is modeled using a PPCA model with group-specific parameters.

Observation *i* is assumed to have been drawn from a finite mixture distribution with *G* components (or groups), i.e.

*p*(x_{i}) = ∑_{g=1}^{G} *π*_{g} *p*(x_{i}|μ_{g}, **Σ**_{g})

where *p*(x_{i}|μ_{g}, **Σ**_{g}) is a PPCA model for group *g* with mean parameter μ_{g} and covariance matrix **Σ**_{g} = **W**_{g}**W**_{g}^{T} + *σ*^{2}**I**. The mixing proportion *π*_{g} denotes the probability of an observation belonging to group *g*, where ∑_{g=1}^{G} *π*_{g} = 1. Note that for reasons of parsimony the error covariance *σ*^{2}**I** has been constrained to be equal for all groups [24].

The MPPCA model can be fitted using a two stage EM algorithm called the Alternating Expectation Conditional Maximization (AECM) algorithm [25]. For clarity, the details of the AECM algorithm for the MPPCA model are deferred to Additional File 1. Under the MPPCA model, in addition to the latent location variable, the unobserved group membership of each observation is also viewed as a latent variable. Specifically, for each observation a latent binary vector z_{i} = (*z*_{i1}, . . ., *z*_{iG})^{T} is imputed, where *z*_{ig} = 1 if observation *i* belongs to group *g* and 0 otherwise. At convergence of the AECM algorithm the estimate *ẑ*_{ig} is the posterior probability of observation *i* belonging to group *g*. The MPPCA model clusters observations into groups by assigning each observation to the group for which it has the highest posterior probability of membership. Thus clustering of observations and dimension reduction, through the use of principal components, are achieved simultaneously.
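The posterior membership probabilities at the heart of this clustering step can be sketched as follows (Python/NumPy; the parameter values are arbitrary placeholders for fitted estimates, and the log-sum-exp normalization is a standard numerical device rather than something prescribed in the text).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fitted MPPCA parameters for G groups (illustrative values).
p, q, G, n = 5, 2, 2, 8
pi = np.array([0.4, 0.6])               # mixing proportions
mus = rng.standard_normal((G, p))       # group means
Ws = rng.standard_normal((G, p, q))     # group loadings matrices
sigma2 = 0.3                            # shared noise variance

X = rng.standard_normal((n, p))         # observed data

def log_mvn(X, mean, cov):
    """Log-density of MVN_p(mean, cov) evaluated at each row of X."""
    d = X - mean
    L = np.linalg.cholesky(cov)
    sol = np.linalg.solve(L, d.T)                 # L^{-1} d^T
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (X.shape[1] * np.log(2 * np.pi)
                   + logdet + np.sum(sol**2, axis=0))

# log pi_g + log N(x_i; mu_g, W_g W_g^T + sigma^2 I), one column per group.
logp = np.column_stack([
    np.log(pi[g]) + log_mvn(X, mus[g], Ws[g] @ Ws[g].T + sigma2 * np.eye(p))
    for g in range(G)
])
# Normalize each row (log-sum-exp) to obtain posterior probabilities z_hat.
logp -= logp.max(axis=1, keepdims=True)
z_hat = np.exp(logp)
z_hat /= z_hat.sum(axis=1, keepdims=True)
labels = z_hat.argmax(axis=1)   # assign each observation to its MAP group
```

Each row of `z_hat` corresponds to the estimates *ẑ*_{i1}, . . ., *ẑ*_{iG} for one observation, and `labels` gives the resulting cluster assignment.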

### Model Selection

A crucial advantage of working within a probabilistic framework is that a wealth of statistically based model selection tools can be utilized. This allows the determination of the "best" statistical model for the data, i.e. the optimal number of principal components *q* to retain and, in the case of the MPPCA model, the optimal value of *G*. Such choices are made on the basis of statistical principles instead of using traditional ad-hoc approaches, such as a scree plot.

The Bayesian Information Criterion (BIC) [26] is a popular model selection tool. The BIC is defined to be

BIC = 2*l* - *K* log(*n*)     (4)

where *l* is the maximum log likelihood value, *K* is the number of free parameters in the model and *n* is the number of observations. The model which yields the highest BIC value is deemed the optimal model.

The BIC can be viewed as a criterion that rewards model fit (through the first term in (4)) but penalizes model complexity (through the second term in (4)). The penalization in the BIC is much stronger than that of the widely used Akaike Information Criterion [27], so the BIC typically selects more parsimonious models. Within the context of mixture models the BIC has been widely employed; see [10, 28, 29].
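A minimal sketch of BIC-based choice of *q* follows (Python; the log likelihood values and the free-parameter count below are hypothetical illustrations, not results for the datasets analysed here).

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    # BIC = 2l - K log(n); the model with the highest BIC is preferred.
    return 2.0 * loglik - n_params * np.log(n_obs)

# Hypothetical maximized log likelihoods for candidate values of q
# (illustrative numbers only, not from the article's datasets).
n, p = 18, 189
logliks = {1: -4100.0, 2: -3300.0, 3: -3290.0}

# One simple choice of free-parameter count for a PPCA model:
# p*q loadings + p means + 1 noise variance (rotational invariance
# of W is ignored here for simplicity).
n_params = {q: p * q + p + 1 for q in logliks}

scores = {q: bic(logliks[q], n_params[q], n) for q in logliks}
best_q = max(scores, key=scores.get)
```

In this toy comparison the large likelihood gain from *q* = 1 to *q* = 2 outweighs the penalty, while the marginal gain at *q* = 3 does not, so *q* = 2 is selected.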

Despite the tendency of the BIC to select parsimonious models, in high dimensional data settings it often exhibits noisy behaviour: the BIC can be undefined or perform poorly due to the occurrence of singularities for some starting values of the EM algorithm, for some models (i.e. different values of *q*), or for some numbers of groups (in the case of the MPPCA model). Additionally, diagnosing convergence of the EM algorithm in highly parameterized models can be difficult, leading to noisy BIC values.

To address this issue, a regularized version of the BIC [30, 31] is employed here as a model selection tool. This modified version of the BIC evaluates the likelihood at the maximum a posteriori (MAP) estimate instead of the MLE. The MAP estimate is derived within the EM algorithm framework, where a conjugate prior is included and the product of the likelihood and the prior is maximized in the M-step. Here a conjugate inverse gamma prior on *σ*^{2} is employed throughout: the reported BIC values are based on the MAP estimate for *σ*^{2} rather than on the MLE. Further details are provided in Additional File 1. This approach avoids singularities and performs similarly to the BIC when such issues are absent. It also has the effect of smoothing the noisy behaviour of the BIC which is often observed when parameter estimation is unstable.

### Jackknife resampling

The loadings of any probabilistic principal components based model can be used to identify variables responsible for the structure in the data. Rather than examining the (typically large number of) point estimates of the loadings alone, a gauge of the uncertainty associated with the loadings can be obtained through estimation of their standard errors.

Here the jackknife resampling method [32, 33] is implemented. Standard errors are estimated by recomputing the loadings (i.e. refitting the relevant PPCA model) with the *i*th observation removed from the dataset, giving the loadings matrix **W**^{-i}, for *i* = 1, . . ., *n*. The jackknife standard error for the *j*th loading on the *k*th principal component is then estimated as

SE(*ŵ*_{jk}) = √{[(*n* - 1)/*n*] ∑_{i=1}^{n} (*ŵ*_{jk}^{-i} - *w̄*_{jk})^{2}}

where *w̄*_{jk} = (1/*n*) ∑_{i=1}^{n} *ŵ*_{jk}^{-i}.

The standard errors can then be used to compute 95% confidence intervals (CIs) for the individual loadings. Such CIs can be used to identify loadings which differ significantly from zero on a selected principal component in the optimal model. The variables whose loadings are significantly different from zero are those responsible for the structure within the data. This approach therefore provides a sparse list of relevant variables.

Computation time is often an issue when using the EM algorithm to fit statistical models, and employing the jackknife technique to obtain standard errors would clearly exacerbate this problem. In practice computation times are considerably reduced by choosing good starting values for the algorithm for each of the *n* runs when using the jackknife. Here the maximum likelihood estimates of the model parameters when fitted to the entire data set are employed as starting values for each of the jackknife runs, considerably reducing computational costs.
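The jackknife computation can be sketched as follows (Python/NumPy on simulated data; for brevity the sketch refits each leave-one-out model via the closed-form eigen solution rather than the warm-started EM algorithm described above, and the sign alignment step is a standard device to handle the sign indeterminacy of loadings across refits).

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data with a strong q-dimensional structure (illustrative sizes).
n, p, q = 40, 6, 2
X = rng.standard_normal((n, q)) @ rng.standard_normal((q, p)) \
    + 0.2 * rng.standard_normal((n, p))

def ppca_loadings(X, q):
    """Closed-form PPCA loadings (eigen solution), used here in place of EM."""
    S = np.cov(X, rowvar=False, bias=True)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    sigma2 = vals[q:].mean()
    return vecs[:, :q] * np.sqrt(vals[:q] - sigma2)

W_full = ppca_loadings(X, q)

# Leave-one-out loadings W^{-i}; each refit's columns are sign-aligned
# with the full-data fit, since eigenvectors are only defined up to sign.
W_jack = np.empty((n, p, q))
for i in range(n):
    Wi = ppca_loadings(np.delete(X, i, axis=0), q)
    Wi *= np.sign(np.sum(Wi * W_full, axis=0))
    W_jack[i] = Wi

# Jackknife standard error and elementwise 95% confidence intervals.
W_bar = W_jack.mean(axis=0)
se = np.sqrt((n - 1) / n * np.sum((W_jack - W_bar) ** 2, axis=0))
lower, upper = W_full - 1.96 * se, W_full + 1.96 * se
significant = (lower > 0) | (upper < 0)   # CIs that exclude zero
```

The Boolean matrix `significant` then flags the loadings, and hence the spectral regions, whose CIs exclude zero.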

### Metabolomic datasets

The datasets used here were derived from a previously reported study [34]. Briefly, animals were randomly assigned to two treatment groups and treated with pentylenetetrazole (PTZ, treated group) or saline (control group) for a period of 4 weeks. A third treatment group consisted of animals who received one injection only; these data are not used within this paper. Throughout the treatment period urine was collected from the animals in collection tubes containing 1% sodium azide, surrounded by ice. The animals had no access to food during this time but had free access to water. At the end of the treatment period brain regions were isolated and metabolites extracted as previously described [34]. NMR spectra were acquired and the spectra were integrated into bin regions of 0.04 ppm using AMIX (Bruker), excluding the water regions (4.0-6.0 ppm).

The urine dataset used herein was constructed from NMR spectra acquired from urine collected on day ten of the study; it consists of 18 spectral profiles (from 9 treated and 9 control animals) over 189 spectral bin regions.

The brain dataset comes from animals in the control group only with spectra acquired from tissues from four brain regions: the pre-frontal cortex, hippocampus, cerebellum and brainstem. In total, there are 33 spectral profiles over 164 spectral bin regions.