- Methodology article
- Open Access
A Bayesian model for classifying all differentially expressed proteins simultaneously in 2D PAGE gels
- Steven H Wu^{1, 2, 5}Email author,
- Michael A Black^{3},
- Robyn A North^{4} and
- Allen G Rodrigo^{1, 2, 6}
https://doi.org/10.1186/1471-2105-13-137
© Wu et al.; licensee BioMed Central Ltd. 2012
- Received: 14 December 2011
- Accepted: 30 May 2012
- Published: 19 June 2012
Abstract
Background
Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) is commonly used to identify differentially expressed proteins under two or more experimental or observational conditions. Wu et al (2009) developed a univariate probabilistic model which was used to identify differential expression between Case and Control groups, by applying a Likelihood Ratio Test (LRT) to each protein on a 2D PAGE. In contrast to commonly used statistical approaches, this model takes into account the two possible causes of missing values in 2D PAGE: either (1) the non-expression of a protein; or (2) a level of expression that falls below the limit of detection.
Results
We develop a global Bayesian model which extends the previously described model. Unlike the univariate approach, the model reported here is able treat all differentially expressed proteins simultaneously. Whereas each protein is modelled by the univariate likelihood function previously described, several global distributions are used to model the underlying relationship between the parameters associated with individual proteins. These global distributions are able to combine information from each protein to give more accurate estimates of the true parameters. In our implementation of the procedure, all parameters are recovered by Markov chain Monte Carlo (MCMC) integration. The 95% highest posterior density (HPD) intervals for the marginal posterior distributions are used to determine whether differences in protein expression are due to differences in mean expression intensities, and/or differences in the probabilities of expression.
Conclusions
Simulation analyses showed that the global model is able to accurately recover the underlying global distributions, and identify more differentially expressed proteins than the simple application of a LRT. Additionally, simulations also indicate that the probability of incorrectly identifying a protein as differentially expressed (i.e., the False Discovery Rate) is very low. The source code is available at https://github.com/stevenhwu/BIDE-2D.
Keywords
- Two-dimensional polyacrylamide gel electrophoresis (2D PAGE)
- Global Bayesian model
- Differentially expressed protein
- Markov chain Monte Carlo (MCMC)
Background
Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) separates hundreds or thousands of proteins simultaneously by their isoelectric point and molecular weight [1]. There are two main approaches to analyse 2D PAGE: (1) an image-based approach, which analyses the raw or preprocessed gel images [2, 3], and (2) a spot-based approach, whereby a standard analytical pipeline is used to identify up- or down-regulated proteins by gel scanning, spot-detection and spot-matching using appropriate software [4, 5]. Data obtained are expressed as absolute or relative protein intensities, typically transformed into log-values. By detecting statistically significant differences in the spot intensities under different experimental or sampling conditions, 2D PAGE is a useful technique for exploring potentially differentially expressed proteins.
Most of the commercial packages for 2D PAGE analysis include several standard statistical analysis methods, for example, two-sample Student's t-tests, Analysis of Variance, and Principal Component Analysis [6, 7]. Nonetheless, a significant challenge with most 2D PAGE analyses is the problem of missing values, whereby spots on one gel are not identified, or matched with, spots on another gel [8]. This should not come as a surprise: the expression of proteins varies from individual to individual from one experimental condition to the next, along with technical variation between gels. Previously, we proposed a likelihood-based model that identified differentially expressed proteins, and which accounted for missing values by positing a class of proteins where the probability of non-expression is greater than zero [9]. In particular, we divided missing values into two categories, due either to the non-expression of a protein, or a level of expression that fell below the limit of detection [3, 10]. The likelihood function utilized a mixture of the two probabilistic models, thus allowing both possible causes of missing values. By applying a Likelihood Ratio Test (LRT), we classified a protein as “differentially expressed” if there was statistically significant support for either a difference in mean expression intensities or a difference in the probabilities of expression across the two categories.
In this paper, we extend our univariate likelihood model to a global model. The aim of a global model is to utilize the relationship between spots so that information about expression probabilities and differences in mean expression intensities can be modeled coherently across all spots. The global likelihood model proposed in this paper maintains all the advantages of the local model proposed previously, that is, the incorporation in the model of probabilities of expression and a limit of detection. Additionally, the global model includes several parametric probability functions that deliver the expected probability of expression and mean expression intensities for individual spots. In other words, the probability of expression and the mean expression intensity for any given spots are random variables drawn from global distributions of these variables, and the parameters of these global distributions are estimated from all expression data. While the characterization and use of global distributions of expression frequencies and intensities is not novel [11, 12], this is the first time that this type of approach has been applied to the problem of modeling protein abundance in 2D PAGE. The empirical distributions of these data sets lend themselves to approximations by well-studied statistical distributions, and their use in statistical inference delivers greater power to detect differentially expressed spots. We illustrate the properties of the global model using simulated data, where the true parameters of the probabilities of expression, and the mean expression intensities are known.
Methods
The Global Bayesian Model
In our paper, the global model is applied to a case–control experimental design, where subjects belong to either a Case (disease) or Control group. Under the simplest experimental design, individuals are assigned to either the Case or Control group, and each subject has a sample that is processed using 2D-PAGE. This approach produces as many 2D-PAGE gels as there are subjects, and after application of the appropriate software algorithms, a list of “spots” is produced (corresponding to proteins that were expressed on at least one gel), along with the intensities of these spots for each gel. Before any analysis is carried out, we calculate the relative intensities by dividing the intensity of individual spots by the sum of all intensities on the corresponding gel, followed by log_{ 2 } transformation. In many instances, there will be no intensity value for a given protein, indicating (as previously noted), that the spot was not expressed or not detected. These spots are indicated by “NA” in the dataset.
The global model proposed here is a hierarchical model with two layers. The first layer is referred to as the local layer. This layer calculates the likelihood for an individual protein, with each protein having its own parameters. The second or “global” layer connects all parameters from the local layers together. Parameters associated with this layer are referred to as global parameters. Since the model attempts to recover a large number of parameters, it is analytically and computationally cumbersome to obtain estimates within a likelihood-based framework. Instead, we have chosen to use Bayesian Markov chain Monte Carlo (MCMC) integration (described below), which is a computationally tractable approach. More importantly, Bayesian MCMC integration allows us to specify prior probability distributions that capture what we expect our parameters to look like when there is no difference between Case and Control. Since the point of Bayesian inference is to recover the posterior distribution (i.e., the distribution of the model parameters, after the incorporation of new data), any significant deviation between the posterior and the prior distributions is a signal that there are statistical differences between Cases and Controls.
The local layer
The local layer focuses on the expression of an individual spot and can be described by four parameters. These four parameters are: 1) the mean for control group expression intensity μ, 2) the difference between case and control mean expression intensities δ, (i.e., the mean for the case group is calculated by μ_{ 1 } = μ_{ 0 } + δ), 3) the probability of expression for the control group p_{ 0 }, which can be expressed as a function of κ and 4) the difference between probabilities of expression between the two groups, τ. The probabilities of expression for the Control and Case groups are calculated by ${p}_{0}=\frac{exp\left(\kappa \right)}{1+exp\left(\kappa \right)}$ and ${p}_{1}=\frac{exp\left(\kappa +\tau \right)}{1+exp\left(\kappa +\tau \right)}$ respectively. Both groups are assumed to have the same standard deviation for expression intensities, σ_{s}, the details of which will be discussed later.
where d is the limit of detection and ν is the maximum expression value.
- (1)
When the intensity, C _{ x,s,i } is less than the level of detection, the the likelihood function reflects a mixture of the possibilities that either the protein was not expressed (i.e., 1 – ρ _{ x }, where ρ _{ x } is the probability of expression), or that the protein was expressed but fell below the level of detection (the second term on the right hand side of the first row, in Equation 2).
- (2)
When the intensity is greater than the level of detection, the likelihood function is given by a truncated normal distribution, with the lower tail truncated at d, the level of detection (second row of Equation 2).
where L(Θ_{ L }) is the likelihood for all proteins at the local layer and S is the total number of proteins in the 2D PAGE experiment.
The global layer
All proteins are assumed to have the same standard deviation of expression intensities (measured on the log scale), which is calculated by multiplying σ_{ g } by the spot standard deviation scalar parameter ψ. Therefore the spot standard deviation σ_{ s } = ψσ_{ g } is used to calculate the likelihood for each spot in the local layer. This allows the model to efficiently estimate the spot standard deviation and explore the potential relationship between σ_{ s } and σ_{ g }.
Markov Chain Monte Carlo (MCMC)
Bayesian inference and the Metropolis-Hastings algorithm
Here, p(D|θ) denotes the likelihood function, and p(θ) is the prior distribution of the parameter set θ. The posterior distribution p(θ|D) summarizes the degree of belief in θ, based on the observed data, D, and prior knowledge of the parameter set.
For complex analyses, including the estimation of parameters in many mixture models, it is often difficult to obtain the posterior distribution directly. Markov chain Monte Carlo (MCMC) integration is a computationally tractable and commonly used solution to the problem. It is an iterative procedure which attempts to recover the posterior distribution by sampling the permissible parameter space. One common implementation of MCMC uses the Metropolis-Hasting algorithm [13, 14], which can be described by the following steps.
Step 1: Begin with initial state Θ.
Step 2: Make a small change to the parameter θ^{ i } to θ* according to a proposal distribution q(θ*|θ^{ i }).
Step 3: Calculate the acceptance ratio α, using the following formula:
Generate μ from U(0, 1) and accept θ^{ i+1 } = θ^{ * } if μ < α.
Otherwise θ^{ i+1 } = θ^{ i }.
Step 4: Set i = i + 1 and repeat Step 1.
The algorithm is repeated until the Markov chain is sampling from the target distribution, typically the (joint) posterior distribution of the parameter(s).
When the Markov chain reaches the stationary or equilibrium distribution, the 95% highest posterior density (HPD) region for the marginal posterior distribution for each parameter can be calculated. The 95% HPD region consists of the smallest collection of potential parameter values such that the marginal posterior probability of the parameter falling into this region is at least 95%.
Prior and proposal distributions
List of prior distributions used in the global model
Global parameterθ^{ i } | Prior distributionp(θ^{ i }) |
---|---|
μ _{ g } | Normal ∼ (μ = −3,σ = 5) |
σ _{ g } | Γ ^{ -1 } (shape = 0.001, rate = 0.001) |
ψ | Uniform ∼ (0.001, 2) |
λ _{ δ } | Exponential ∼ (λ = 1) |
ϕ _{ δ } | Beta ∼ (alpha = 2,beta = 2) |
μ _{ κ } | Normal(μ = 0,σ = 3) |
σ _{ κ } ^{ 2 } | Γ^{ -1 } ∼ (shape = 0.001,rate = 0.001) |
μ _{ τ } | Normal ∼ (μ = 0,σ = 3) |
σ _{ τ } ^{ 2 } | Γ^{ -1 } ∼ (shape = 0.001,rate = 0.001) |
For the global mean expression intensity μ_{ g }, we used a normal distribution centered at −3 with a standard deviation of 5 as the prior. The prior is centered at −3 as the data are log-transformed relative protein expression intensities. If a gel has 1000 proteins with identical expression intensities, then the mean relative percentage volume expression will be 0.1 for each protein, which is ∼ −3.3 when log_{2}-transformed. However, since we do not know the true mean volume, a relatively large standard deviation was assigned to the prior distribution of relative expression intensities. There was insufficient information to provide a good estimate of the prior distribution for the global standard deviation σ_{ g }, therefore a relatively flat inverse-gamma prior σ_{ g } ∼ Γ^{ -1 }(0.001,0.001) was used [16].
The modified Laplace distribution is used to model the difference between two mean expression intensities. This distribution has two parameters: λ_{ δ } is the rate for the exponential distribution component, and ϕ_{ δ } is the proportion of up-regulated proteins. The rate parameter has an exponential prior of λ_{ δ } ∼ Exp (1). The proportion of up-regulated proteins ϕ_{ δ } is bounded between 0 and 1. If there is approximately an equal number of up- and down- regulated proteins then the value of ϕ_{ δ } will be close to 0.5. Therefore the density function for the prior should peak around 0.5 and decrease as ϕ_{ δ } moves toward 0 or 1, thus, a Beta (2,2) distribution was used as the prior for ϕ_{ δ }.
The means for both the probability of expression in the Control group, μ_{ κ }, and the difference between probabilities of expression between the two groups, μ_{ τ }, have more stringent priors. A normal distribution centered at 0 with a standard deviation of 3 is used for both parameters. Under the reparameterisation procedures described earlier, ${p}_{0}=\frac{exp\left(\kappa \right)}{1+exp\left(\kappa \right)}$ and ${p}_{1}=\frac{exp\left(\kappa +\lambda \right)}{1+exp\left(\kappa +\lambda \right)}$, if the probabilities of expression for the control group are given by ρ_{ 0 } = 0.95, this would correspond to κ ∼2.94. We believe that it is unnecessary to distinguish the probability of expression between 0.95 and 1 because the difference is unlikely to be biologically significant. Therefore a relatively small standard deviation was assigned to the prior distribution to avoid κ_{ s } or τ_{ s } moving towards very large values. Consequently, this also prevents false positive results which may occur when the model attempts to distinguish the difference between probabilities of expression beyond 0.95.
List of proposal distributions for both global and local parameters
Global parameterθ^{ i } | Proposal distributionq(θ*|θ^{ i }) |
---|---|
μ _{ g } | Truncated-Normal ∼ (μ = μ _{ g } , lower = d, upper = log _{ 2 } (100)) |
σ _{ g } | Truncated-Normal ∼ (μ = σ _{ g } , lower = 0.01) |
ψ | Truncated-Normal ∼ (μ = ψ, lower = 0.001, upper = 2) |
λ _{ δ } | Truncated-Normal ∼ (μ = λ _{ δ } , lower = 0.01) |
ϕ _{ δ } | ϕ _{ δ } ' = Normal ∼ (μ = ln[ϕ _{ δ } /(1-ϕ _{ δ } )]), ϕ _{ δ } * = exp(ϕ _{ δ } ')/[1 + exp(ϕ _{ δ } ')] |
μ _{ κ } | Normal ∼ (μ = μ _{ κ } ) |
σ _{ κ } | Truncated-Normal ∼ (μ = σ _{ κ } , lower = 0.01) |
μ _{ τ } | Normal ∼ (μ = μ _{ τ } ) |
σ _{ τ } | Truncated-Normal ∼ (μ = σ _{ τ } , lower = 0.01) |
Local parameter θ ^{ i } | Proposal distribution q(θ*|θ ^{ i } ) |
μ _{ s } | Normal ∼ (μ = μ _{ s } ) |
δ _{ s } | Normal ∼ (μ = δ _{ s } ) |
κ _{ s } | Normal ∼ (μ = κ _{ s } ) |
τ _{ s } | Normal ∼ (μ = τ _{ s } ) |
The proportion of up-regulated proteins ϕ_{ δ } was bounded between 0 and 1. Therefore a logit transformation was applied to ϕ_{ δ } to obtain a value without boundaries logit(ϕ_{ δ }) = ϕ_{ δ }/(1-ϕ_{ δ }). A normal distribution with mean set to logit(ϕ_{ δ }) was then used to propose a new value ϕ_{ δ }'. Finally, an inverse-logit transformation was applied to ϕ_{ δ }' to obtain the candidate value ϕ_{ δ }* which is always between 0 and 1.
The global standard deviation, σ_{ g }, the rate parameter for the exponential distribution, λ_{ δ }, the standard deviation for the probabilities of expression, σ_{ κ }, and the standard deviation for the difference in the probabilities of expression, σ_{ τ }, all have the same proposal distributions, a truncated normal distribution with lower bound set to 0.01 and no upper bound. The theoretical lower limit for these values is 0, but 0.01 was used for two reasons. The first was that these values were extremely unlikely to be less than 0.01 for any 2D PAGE experiments. Hundreds of different proteins were separated in each 2D PAGE experiment and it is unlikely for all the proteins to have very similar means and probabilities of expression. The mean of the exponential distribution is 1/λ, and the theoretical maximum intensity for a protein on 2D PAGE is log_{ 2 }(100) ≈ 6.64. Therefore we expect λ_{ δ } to be greater than 0.01 because the mean value for δ_{ s } (the difference between two mean expression intensities) is unlikely to be greater than 100. The second reason was to prevent floating point underflow when computing extremely small likelihood values when the standard deviation approaches 0.
Adaptive MCMC
Since MCMC is a technique that relies on a stochastic perturbation to the current state to generate the next state in a chain, the states are autocorrelated. Depending on the proposal distributions used, there is a possibility for states to persist in a part of parameter space, and mix poorly. We used three different techniques to improve the mixing of the Markov chain: tuning parameters, block updating and parameter expansion.
where σ_{ new } is the standard deviation of the new proposal distribution and σ_{ cur } is the standard deviation of the current proposal distribution. ρ_{ opt } is the optimal acceptance ratio, ρ_{ cur } the current acceptance ratio, and Φ^{ -1 } is the inverse CDF of a standard normal distribution. If the acceptance ratio is higher than the optimal acceptance ratio, then the standard deviation for the proposal distribution is increased to lower the acceptance ratio and vice versa [18]. The standard deviation σ_{ new } is updated once every 500 iterations and the current acceptance ratio ρ_{ cur } is averaged over 3000 iterations.
At the local layer, we paired μ_{ s } and δ_{ s } together and κ_{ s } and τ_{ s } together. At the global level, we paired μ_{ g } and σ_{ g } together, λ_{ δ } and ϕ_{ δ } together, μ_{ κ } and σ_{ κ } together and, μ_{ τ } and σ_{ τ } together. Sometimes the variance parameter was not able to move freely, especially when it approached zero, resulting in poor mixing. The introduction of an additional parameter which links mean and variance together can potentially reduce this issue [20]. This is termed “parameter expansion” and it was implemented here to reduce this problem.
Prior and proposal distributions used for the parameters introduced in the parameter expansions
Global parameterθ^{ i } | Prior distributionp(θ^{ i }) | Proposal distributionq(θ*|θ^{ i }) |
---|---|---|
α _{ μ } | Uniform ∼ (0.01,10) | Truncated-Normal ∼ (μ = α _{ μ } , lower = 0.01) |
α _{ κ } | Uniform ∼ (0.01,10) | Truncated-Normal ∼ (μ = α _{ κ } , lower = 0.01) |
α _{ τ } | Uniform ∼ (0.01,10) | Truncated-Normal ∼ (μ = α _{ τ } , lower = 0.01) |
With the combination of block updating and parameter expansion, there were twelve parameters, including nine parameters from the likelihood model and three tuning parameters (α) described above. These parameters were grouped and updated in eight different blocks.
Simulation analysis
Once we were confident that the Markov chain was sampling the target distribution, the 95% highest HPD for δ_{ s } and τ_{ s } was calculated. The local parameter δ_{ s } and τ_{ s } represent the differences in mean expression intensities between Case and Control groups and the probability of expression, respectively. There are three scenarios whereby a protein may be classified as statistically differentially expressed: 1) If the 95% HPD for δ_{ s } does not include zero, 2) if the 95% HPD for τ_{ s } does not include zero, or 3) if the 95% HPDs for both parameters do not include zero.
Simulation 1. Simulation based on a real experiment
100 differentially expressed proteins, with each protein having different parameter values, were drawn from a global distribution with the following parameters: the mean expression intensities for the control group followed a normal distribution with a mean of −5 and a standard deviation of 1. The standard deviation for each individual protein was 0.7. The difference between mean expression intensities was drawn from a modified Laplace distribution (described in the global layer section) with λ _{ δ } = 0.5 and ϕ _{ δ } = 0.5. The parameter associated with the probability of expression, κ _{ s } was drawn from a normal distribution with a mean of 1 and a standard deviation of 1, and τ _{ s } was drawn from a normal distribution with a mean of 0 and a standard deviation of 2.
Simulation 2. Varying the global distribution of the probabilities of expression
The second simulation was similar to Simulation 1, except that the values of κ_{ s } were no longer assumed to follow a normal distribution. Instead, for each protein, κ_{ s } was drawn from a uniform distribution between −1 and 3, and τ_{ s } was drawn from a uniform distribution between −2 and 2. All other global parameters were identical to those specified in Simulation 1.
Simulation 3. A smaller gap between mean expression intensities and different distributions for the probabilities of expression
In the previous two simulations, λ_{ δ } for the modified Laplace distribution was set to 0.5, which corresponds to a difference between two mean expression intensities of 2. In Simulation 3, the difference between two mean expression intensities was set to 1.5 times the protein standard deviation, which corresponds to λ_{ δ } ≈ 0.66. This was done because results from our previous study showed that LRT had a reasonable performance when the difference between the two mean expression intensities was approximately 1.5 times the standard deviation or higher. This simulation also tested the difference between two probabilities of expression when drawn from two different distributions. For each individual protein, κ_{ s } was still drawn from a normal distribution with mean 1 and a standard deviation of 0.25, but τ_{ s } was divided into two groups. Half of the proteins were simulated from a normal distribution with mean −3 and standard deviation of 0.25; the other half were simulated from a normal distribution with mean 2 and standard deviation of 0.25. Note that we assigned a relatively small standard deviation to these distributions to obtain two non-overlapping normal distributions. This extreme scenario is used to test the flexibility of the Bayesian model. All other global parameters were identical to Simulation 1.
Simulation 4. Estimating the false positive rate
This simulation attempted to investigate the number of proteins falsely classified as differentially expressed when there was no difference between two groups. The difference between local mean expression intensities δ_{ s } and the difference between local probabilities of expression τ_{ s } were fixed at 0 for all proteins. All other global parameters were identical to Simulation 1. This setting makes two groups identical and allows us to estimate the false positive rate of this model.
Application of model to 2D PAGE data
We also applied the global model to a 2D PAGE experiment reported previously by Wu et al [9] in which we selected differentially expressed spots based on a likelihood ratio test This experiment contained 24 individuals, with one gel per individual. Eight hundred and three spots were detected and matched using commercial software.
Results and discussions
Both the global model and the LRT previously defined in Wu et al (2009) were applied to the three simulations.
Simulation 1. Simulation based on a real experiment
Summary of the global parameters for simulation 1, which is based on a real 2D PAGE experiment
Global parameter | Mean from MCMC | Lower 95% HPD | Upper 95% HPD | True value |
---|---|---|---|---|
μ _{ g } | −4.8 | −5.02 | −4.59 | −5 |
σ _{ g } | 1.06 | 0.92 | 1.22 | 1 |
ψ | 0.65 | 0.56 | 0.75 | 0.7 |
λ _{ δ } | 0.57 | 0.46 | 0.69 | 0.5 |
ϕ _{ δ } | 0.57 | 0.45 | 0.67 | 0.5 |
μ _{ κ } | 0.89 | 0.66 | 1.15 | 1 |
σ _{ κ } | 1.00 | 0.78 | 1.24 | 1 |
μ _{ τ } | −0.22 | −0.75 | 0.38 | 0 |
σ _{ τ } | 2.48 | 1.93 | 3.08 | 2 |
The recovered mean for the proportion of up-regulated proteins ϕ_{ δ } was 0.57 with the 95% HPD between 0.45 and 0.67 (the true value is 0.5). This implied that 57% of the spots were considered as up-regulated, that is, the mean expression intensity for the case group was higher than the control group. Nevertheless, this does not represent the proportion of statistically classified differentially expressed proteins because the statistical classification of up- or down regulation depends on whether the 95% HPD of δ_{ s } for each protein includes zero. Under this criterion, there were 38 spots that were (statistically) classified as up-regulated and 30 spots that were (statistically) classified as down-regulated.
Simulation 2. The effect of the underlying global distribution on the probabilities of expression
Summary of the global parameters for simulation 2 where the probabilities of expression were drawn from uniform distributions
Global parameter | Mean from MCMC | Lower 95% HPD | Upper 95% HPD | True value |
---|---|---|---|---|
μ _{ g } | −5.00 | −5.23 | −4.79 | −5 |
σ _{ g } | 1.08 | 0.93 | 1.23 | 1 |
ψ | 0.69 | 0.59 | 0.79 | 0.7 |
λ _{ δ } | 0.62 | 0.50 | 0.75 | 0.5 |
ϕ _{ δ } | 0.54 | 0.43 | 0.65 | 0.5 |
μ _{ κ } | 0.99 | 0.70 | 1.3 | κ ∼ Uniform(−1,3) |
σ _{ κ } | 1.29 | 1.03 | 1.56 | κ ∼ Uniform(−1,3) |
μ _{ τ } | −0.23 | −0.67 | 0.22 | τ ∼ Uniform(−2,2) |
σ _{ τ } | 1.84 | 1.41 | 2.29 | τ ∼ Uniform(−2,2) |
Simulation 3. Smaller difference between mean expression intensities and alternative distributions for the probabilities of expression
Summary of the global parameters for simulation 3 where the difference between two probability of expressions were drawn from two normal distributions
Global parameter | Mean from MCMC | Lower 95% HPD | Upper 95% HPD | True value |
---|---|---|---|---|
μ _{ g } | −5.18 | −5.39 | −4.95 | −5 |
σ _{ g } | 1.13 | 0.98 | 1.29 | 1 |
ψ | 0.64 | 0.55 | 0.73 | 0.7 |
λ _{ δ } | 0.73 | 0.57 | 0.89 | 0.7 |
ϕ _{ δ } | 0.47 | 0.35 | 0.60 | 0.5 |
μ _{ κ } | 0.98 | 0.83 | 1.12 | 1 |
σ _{ κ } | 0.28 | 0.11 | 0.46 | 0.25 |
μ _{ τ } | −0.93 | −1.76 | −0.16 | * |
σ _{ τ } | 3.65 | 2.89 | 4.46 | * |
Simulation 4. Estimating the false positive rate
The 95% HPDs were calculated for all the local parameters δ_{ s } and τ_{ s }, and all the HPD intervals contained zero. This implied that none of the proteins were classified as differentially expressed. The simulations were repeated with 18 and 24 gels in each group while all other parameters remained the same. Once again, in these further simulations none of the proteins was classified as differentially expressed. This demonstrates that the model we propose here has a very low false positive rate.
2D PAGE Example
Of course, because the global model uses a common variance for expression intensities, spots where the variances are significantly different from the common variance will not necessarily be identified as differentially expressed. This appears to account for those spots that are identified by the LRT and not the global Bayesian analysis.
Conclusions
We have demonstrated with simulated data that a global Bayesian model is able to correctly identify more differentially expressed proteins than the use of the LRT proposed in the previous study. In all three simulation analyses, the LRT classified approximately 60% of the proteins as statistically differentially expressed, and the global model classified between 75% and 89% of the proteins. Additionally, with our simulated data, the global model identified correctly identified almost all of the proteins also identified by the LRT. The global model accurately recovered the underlying global distributions in all simulations. The 95% HPD for the five global parameters, μ_{ g }, σ_{ g }, ψ, λ_{ δ } and ϕ_{ δ }, always included the true values used to simulate the dataset. The global distributions used in the model were fixed, but the results from the simulation analyses showed that it can be adapted to a wide range of different underlying distributions. In simulation analysis 2, the model recovered a wide normal distribution to overcome the fact that the underlying distribution was a uniform distribution. In simulation analysis 3, a very wide normal distribution with standard deviation 3.65 was obtained when two non-overlapping normal distributions were used as the true distributions from which data were sampled. Finally, simulations also demonstrated that the False Discovery Rate was very low.
When we applied the global Bayesian analysis and the LRT to real data, we uncovered some interesting disparities that appear to be related to how these methods apply variance estimates. In particular, the global Bayesian model estimates a common variance by combining data available from all spots. This allows the model to estimate the standard deviation more accurately if there is, indeed, a common variance of expression intensities. By using the 95% HPD to identify differentially expressed proteins, additional information is provided on whether a protein is differentially expressed due to the expression intensities, probabilities of expression or possibly both. The proportion of up- or down-regulated proteins can be accurately estimated from the model by the global parameter ϕ_{ δ }. In contrast, the LRT uses only the variance of expression intensities identified for each spot. If the number of expressed spots is low in both Case and Control groups, the power to detect differences is compromised. This is an advantage of the global model when the assumption of a common variance is appropriate. However, when this assumption is violated, the global model does not identify the same spots as being up- or down-regulated as the LRT. It may be possible to apply a mixture of distributions allowing different variances, to overcome this discrepancy. However, it is a common to find with MCMC procedures that adding more parameters, and integrating over these, affects mixing and convergence to the stationary distribution.
It is, of course, true that a realistic biological system involves several different groups of proteins, with each group associated with different biological pathways that are frequently interconnected. In order to capture this complex relationship, it is likely that the expressions of different clusters of proteins will be best explained by different underlying distributions. This will allow the model to separate proteins into several different categories, with each category being represented by a unique global distribution. Whereas the use of multiple global distributions may result in a more accurate estimate of these true global parameters, there is also the danger that introducing new distributions (and new parameters) will lead to overfitting and inflated variance estimates. Several global statistical models developed for other high throughput technologies such as microarrays, often attempt to incorporate biological pathways [22]. The challenge with 2D PAGE is that the true identity of each protein is usually unknown until differentially expressed proteins are determined and then subjected to mass spectroscopy for identification. Without this information, it is very challenging to develop a global model based on biological pathways.
Finally, one other assumption that our global Bayesian model makes is that the variances of expression intensities for the Case and Control groups are equal. We are aware that this may be an unrealistic assumption; however, if we assume the alternative (i.e., unequal variances for Case and Control), our implementation of the MCMC has difficulty converging when the probability of expression is low.
Any MCMC Bayesian analysis requires a choice of prior distributions. Although we have designed priors that appear to be a reasonable characterization of the uncertainty in our parameter values, the model is general enough to allow other priors to be substituted for the ones we use. In this paper, we have not tried different prior distributions, because we are demonstrating how the Bayesian MCMC scheme may be implemented, and we have applied our methods largely to simulated data. With real-world data, it is standard practice when applying Bayesian analyses to real data to test for the sensitivity to different prior distributions.
One drawback of the MCMC approach is the amount of time required for the Markov chain to converge. Multiple runs of Markov chains can be used to assess the convergence and accuracy of the results. An example of this is the Metropolis-coupled Markov chain Monte Carlo (MC^{3}) approach [23]. A typical 2D PAGE experiments may have between 800 to 1200 expressed proteins. With the current implementation, it took around 1.7 hours per million iterations for an experiment with 800 spots on a Intel i5 2.67 GHz CPU. As the number of spots increases, the number of iterations and the time required for the Markov chain to converge may also increase. To improve the usability of this model, a more efficient implementation, such as parallel MCMC, should be used [24]. The source code and jar file are available for download at https://github.com/stevenhwu/BIDE-2D.
Declarations
Acknowledgments
We would like to thank Professor Lesley McCowan (Principal investigator on the SCOPE project in Auckland, New Zealand), Rennae Taylor (project manager), research midwives and the pregnant women who participated in the SCOPE study in Auckland New Zealand. We would like to thank Kelly LeFevre Atkinson for performing the 2D-gel experiment which generated the observed data used in the study. This project was funded by NERF grant UOAX0407, Foundation for Research, Science and Technology, New Zealand. Health Research Council, New Zealand. SHW was supported by a Doctoral Scholarship from the University of Auckland, New Zealand. The comments of two anonymous reviewers helped to improve this manuscript.
Authors’ Affiliations
References
- O'Farrell PH: High resolution two-dimensional electrophoresis of proteins. J Biol Chem 1975, 250(10):4007–4021.PubMed CentralPubMedGoogle Scholar
- Morris JS, Baladandayuthapani V, Herrick RC, Sanna P, Gutstein H: Automated analysis of quantitative image data using isomorphic functional mixed models, with application to proteomics data. The Annals of Applied Statistics 2011, 5: 894–923. 10.1214/10-AOAS407PubMed CentralView ArticlePubMedGoogle Scholar
- Dowsey AW, Dunn MJ, Yang G-Z: The role of bioinformatics in two-dimensional gel electrophoresis. Proteomics 2003, 3(8):1567–1596. 10.1002/pmic.200300459View ArticlePubMedGoogle Scholar
- Berth M, Moser FM, Kolbe M, Bernhardt J: The state of the art in the analysis of two-dimensional gel electrophoresis images. Appl Microbiol Biotechnol 2007, 76(6):1223–1243. 10.1007/s00253-007-1128-0PubMed CentralView ArticlePubMedGoogle Scholar
- Chang J, Van Remmen H, Ward WF, Regnier FE, Richardson A, Cornell J: Processing of data generated by 2-dimensional gel electrophoresis for statistical analysis: missing data, normalization, and statistics. J Proteome Res 2004, 3(6):1210–1218. 10.1021/pr049886mView ArticlePubMedGoogle Scholar
- Biron DG, Brun C, Lefevre T, Lebarbenchon C, Loxdale HD, Chevenet F, Brizard JP, Thomas F: The pitfalls of proteomics experiments without the correct use of bioinformatics tools. Proteomics 2006, 6(20):5577–5596. 10.1002/pmic.200600223View ArticlePubMedGoogle Scholar
- Jacobsen S, Grove H, Nedenskov Jensen K, Sørensen HA, Jessen F, Hollung K, Uhlen AK, Jørgensen BM, Færgestad EM, Søndergaard I: Multivariate analysis of 2-DE protein patterns - Practical approaches. Electrophoresis 2007, 28(8):1289–1299. 10.1002/elps.200600414View ArticlePubMedGoogle Scholar
- Grove H, Hollung K, Uhlen AK, Martens H, Faergestad EM: Challenges related to analysis of protein spot volumes from two-dimensional gel electrophoresis as revealed by replicate gels. J Proteome Res 2006, 5(12):3399–3410. 10.1021/pr0603250View ArticlePubMedGoogle Scholar
- Wu SH, Black MA, North RA, Atkinson KR, Rodrigo AG: A statistical model to identify differentially expressed proteins in 2D PAGE gels. PLoS Comput Biol 2009, 5(9):e1000509. 10.1371/journal.pcbi.1000509PubMed CentralView ArticlePubMedGoogle Scholar
- Wheelock ÅM, Buckpitt AR: Software-induced variance in two-dimensional gel electrophoresis image analysis. Electrophoresis 2005, 26(23):4508–4520. 10.1002/elps.200500253View ArticlePubMedGoogle Scholar
- Albrecht D, Kniemeyer O, Brakhage AA, Guthke R: Missing values in gel-based proteomics. Proteomics 2010, 10(6):1202–1211. 10.1002/pmic.200800576View ArticlePubMedGoogle Scholar
- Krogh M, Fernandez C, Teilum M, Bengtsson S, James P: A probabilistic treatment of the missing spot problem in 2D gel electrophoresis experiments. J Proteome Res 2007, 6(8):3335–3343. 10.1021/pr070137pView ArticlePubMedGoogle Scholar
- Hastings WK: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57(1):97–109. 10.1093/biomet/57.1.97View ArticleGoogle Scholar
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation of State Calculations by Fast Computing Machines. J Chem Phys 1953, 21(6):1087–1092. 10.1063/1.1699114View ArticleGoogle Scholar
- Atkinson K: Proteomic biomarker discovery for preeclampsia.PhD thesis. Auckland: University of Auckland; 2008.Google Scholar
- Gelman A: Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 2006, 1: 515–533.View ArticleGoogle Scholar
- Roberts GO, Gelman A, Gilks WR: Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann Appl Probab 1997, 7(1):110–120.View ArticleGoogle Scholar
- Roberts GO, Rosenthal JS: Optimal Scaling for Various Metropolis-Hastings Algorithms. Stat Sci 2001, 16(4):351–367. 10.1214/ss/1015346320View ArticleGoogle Scholar
- Roberts GO, Sahu SK: Updating Schemes, Correlation Structure, Blocking and Parameterization for the Gibbs Sampler. Journal of the Royal Statistical Society Series B (Methodological) 1997, 59(2):291–317. 10.1111/1467-9868.00070View ArticleGoogle Scholar
- Liu C, Rubin DB, Wu YN: Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 1998, 85(4):755–770. 10.1093/biomet/85.4.755View ArticleGoogle Scholar
- Rambaut A, Drummond A: Tracer v1.4.1. 2007. Available from http://beast.bio.ed.ac.uk/Tracer Available fromGoogle Scholar
- Binder H, Schumacher M: Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinforma 2009, 10(1):18. 10.1186/1471-2105-10-18View ArticleGoogle Scholar
- Huelsenbeck JP, Ronquist F: MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 2001, 17(8):754–755. 10.1093/bioinformatics/17.8.754View ArticlePubMedGoogle Scholar
- Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F: Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 2004, 20(3):407–415. 10.1093/bioinformatics/btg427View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.