 Methodology article
 Open Access
 Published:
An eScienceBayes strategy for analyzing omics data
BMC Bioinformaticsvolume 11, Article number: 282 (2010)
Abstract
Background
The omics fields promise to revolutionize our understanding of biology and biomedicine. However, their potential is compromised by the challenge to analyze the huge datasets produced. Analysis of omics data is plagued by the curse of dimensionality, resulting in imprecise estimates of model parameters and performance. Moreover, the integration of omics data with other data sources is difficult to shoehorn into classical statistical models. This has resulted in ad hoc approaches to address specific problems.
Results
We present a general approach to omics data analysis that alleviates these problems. By combining eScience and Bayesian methods, we retrieve scientific information and data from multiple sources and coherently incorporate them into large models. These models improve the accuracy of predictions and offer new insights into the underlying mechanisms. This "eScienceBayes" approach is demonstrated in two proofofprinciple applications, one for breast cancer prognosis prediction from transcriptomic data and one for proteinprotein interaction studies based on proteomic data.
Conclusions
Bayesian statistics provide the flexibility to tailor statistical models to the complex data structures in omics biology as well as permitting coherent integration of multiple data sources. However, Bayesian methods are in general computationally demanding and require specification of possibly thousands of prior distributions. eScience can help us overcome these difficulties. The eScienceBayes thus approach permits us to fully leverage on the advantages of Bayesian methods, resulting in models with improved predictive performance that gives more information about the underlying biological system.
Background
Highthroughput experimental methods, including DNA and protein microarrays and other omics techniques, have become ubiquitous, indispensable tools in biology and biomedicine. The number of highthroughput technologies is constantly increasing. They provide the power to measure thousands of features of a biological system in a single experiment, and they have the potential to revolutionize our understanding of biology and medicine. However, the high expectations for omics methods have fallen short of realization, due to the challenges the data present for statistical modeling. Thus, the wealth of data produced is difficult to translate into concrete biological knowledge, new drugs, and clinical practices [1, 2]. A recurring problem is that few experimental samples are generated relative to the number of model parameters [1]. This leads to imprecise parameter and performance estimates, and models prone to overfitting (the 'curse of dimensionality'). In fact, it is difficult to even obtain a trustworthy measure of how imprecise performance estimates are with standard techniques like confidence intervals based on holdout estimates in crossvalidation or bootstrapping [3, 4]. Another problem is that classical statistical models are too restrictive to account for the complexities of integrating omics data with previously acquired data. This has forced us to resort to ad hoc approaches for solving very specific problems [5–7]. Statistical approaches that use prior distributions over the model parameters are known as Bayesian methods [8, 9]. The prior distributions summarize one's belief about the parameters before seeing the data. After performing an experiment, the parameters are updated to a posterior distribution according to Bayes' theorem:
The posterior distribution reflects the contribution of both our prior beliefs and the experimental data (through the likelihood function).
Bayesian methods have a number of advantages that allow us to address the problems inherent in omics data analysis. Firstly, they afford formal and coherent incorporation of prior information and integration of data sources. Apart from the philosophical appeal of being able to use all available information when analyzing a problem, this allows us to tackle the curse of dimensionality in two ways: (1) Prior information can be used to reduce model dimensionality with Bayesian variable selection and regularization, and (2) data from different research studies can be coherently combined to increase the number of available observations. Bayesian modeling thus allows us to "borrow information" across studies [8]. Secondly, Bayesian methods correctly summarize the model's predictive distribution [10]. Thus, we can obtain reliable Bayesian confidence intervals (often called credible intervals) of the estimates of model performance in smallsample problems; this avoids the problem of having to rely on imprecise confidence intervals based on holdout estimates from crossvalidation or bootstrapping [4]. Thirdly, Bayesian methods may be applied to problems with structures too complex for classical statistical methods [11]. Bayesian models can be made increasingly elaborate to accommodate for the integration of prior information and omics data from multiple data sources.
However, the advantages of Bayesian modeling come at a price. We have to specify prior distributions for all parameters in a model. For large models, like those for analyzing omics data, it is unrealistic to expect practitioners to manually collect prior information for all individual parameters. In practice, this limits the powerful advantage of using available prior knowledge. In addition, Bayesian methods require numerical handling of analytically intractable integrals. This is typically performed with Markov chain Monte Carlo (MCMC) methods, which are computationally very expensive and therefore restricts the size of the modeling problems that we can address.
Recent developments in distributed information and computational resources have given rise to the notion of eScience, i.e. computationally intensive science that is carried out in highly distributed network environments [12]. Advances in eScience can be roughly grouped into: (1) developments in semantic technologies (e.g. data standardization and ontologization) and methods for retrieving standardized data (e.g. Web services); and (2) construction of highperformance computing (HPC) facilities and development of middleware for using the HPCs. These two aspects of eScience equip us for efficient use of the Bayesian approach to omics data analysis. Standardized data and Web services allow us to use machines to harvest the Internet for information to use in prior distribution specifications, and HPC resources provide the computational power required to fit large Bayesian models to highthroughput data (Fig. 1). In this paper we demonstrate how eScience permit us to leverage on the advantages of Bayesian analysis when modeling omics data. We show by two examples that the approach can improve predictive performance, and that it permits a deeper analysis of the data by accommodating more complex model structures.
Results
Transcriptomics data: Predicting distant metastasis development in breast cancer patients
Microarray gene expression profiling has shown promise for supporting the prognostication of breast cancer patients based on the expression pattern of specific gene sets ('signatures'). A number of studies have reported different signatures [13–16]. However, concerns have been raised against the prognostic capacity of these signatures when applied to new data [17, 18]. These concerns include the following:

1.
Signatures from different studies share almost no genes, which indicates that the signatures depend to a large degree on the datasets rather than being prognostic for breast cancer [17]. In fact, Michiels et al. [18] showed that, even within a study, the subset of patients used for signature derivation strongly influenced which genes were selected.

2.
The reported performance estimates of the signatures have been questioned [18]; this suggests that worse performance is likely to be obtained when applied to new data. Also, the confidence intervals of the performance estimates (if reported at all) have been disputed [4, 18]. This makes it difficult to assess the level of accuracy in the reported results.
In fact, both these concerns arise due to the small sample sizes typically used in microarray studies relative to the number of profiled genes (i.e. the curse of dimensionality). We employed the eScienceBayes strategy described in Fig. 1 to address this problem. In accordance with the practice in most previous articles, we modeled the development of distant metastases within five years. To increase the sample size, we used the Gene Expression Omnibus (GEO) Web service [19] to collect data from five previously published breast cancer studies [13–16, 20]. To reduce the model dimensionality we employed a Bayesian variable selection procedure, giving higher prior probabilities of including a variable (gene) the higher our prior belief that the gene affects development of distant metastases. The prior belief for each gene to affect development of distant metastases was based on information retrieved by connecting three Web services: NetPath http://www.netpath.org/, DictService http://services.aonaware.com/ and Entrez Utilities http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/DOC/esoap_help.html. NetPath is a curated resource of genes reported to be transcriptionally regulated by cancersignalling pathways. All genes on the HGU133A array listed in NetPath were used to text mine PubMed for all free fulltext articles where the gene names were mentioned in combination with breast cancer. To reduce the number of spurious hits, we discarded genes with names that represent English words by using the dictionary definition Web service DictService. This gave us a list of integers, associated with the genes on the HGU133A array representing the number of times a cancerpathway regulated gene is mentioned together with breast cancer in the literature (all genes not present in NetPath were assigned the number 0). These informative prior distributions were thus based on the assumption that the probability of a gene being related to breast cancer was reflected in the number of times a gene reported in NetPath was mentioned in combination with breast cancer in PubMed articles. A work flow of the procedure is shown in Fig. 2 (further details are given in Methods). The five datasets and the prior distributions were incorporated in a multilevel Bayesian probit model (Model I, see Methods). To assess the discriminative power of Model I, we fitted it five times; each time using three of the datasets as training sets and the remaining two datasets as independent testsets (Fig. 3a, red curve, and b; Additional file 1, Fig. S1 and S2; Table S1). It may be noted that Model I achieved high prediction accuracy on independent test data produced in different research groups. To the best of our knowledge, this is the first time that a modeling approach has demonstrated consistently good predictive ability across multiple independent testsets for predicting breast cancer metastasis development from gene expression data.
To validate the value of using prior information, we tested Model I with noninformative prior distributions (i.e. we assigned equal prior probability for all variables to be included in the variable selection; see Methods for further details) and compared the results to those obtained with informative priors (Fig. 3a, green curve; Additional file 1, Fig. S1, green curves; Table S1). This clearly showed that the use of relevant prior information significantly improves the classification accuracy of Model I (Bayesian Pvalue < 0.001, related to the curves in Fig. 3a). The dramatic improvement was due to the fact that the prior information focused the model to include relevant genes in the signature, thus reducing the risk of detecting spurious relationships and overfitting the model. Analogously, the use of prior information enforced consistency in signature selection across disparate datasets: 15.2% of the genes were selected all five times Model I with informative priors was fit, which may be compared to 1% when noninformative priors were used, and 0% among the original signatures. The median of the pairwise overlap between two different model fits was 27.5%, compared to 3% among the original signatures and 12% reported in Chuang et al. [6].
We also assessed the advantage of using multiple datasets by using only single datasets for training Model I (with informative priors); results are shown in Fig. 3a (blue curve), Additional file 1, Fig. S1 (blue curves) and Table S1. A significant improvement was found when multiple datasets were used for model training compared to when only a single dataset was used (Bayesian Pvalue < 0.001, related to the curves in Fig. 3a). In a final comparison, we modified Model I to use only the expression data that corresponded to the respective gene signatures derived in Wang et al. [13], Miller et al. [14], Sotiriou et al. [15], and Pawitan et al. [16], together with noninformative priors (only the genes printed on the HGU133A were used for the Miller et al. [14] and Pawitan et al. [16] signatures). Again, we found that Model I (with informative priors) performed significantly better than when the signatures from Wang et al. [13], Miller et al. [14], Sotiriou et al. [15], and Pawitan et al. [16] were used. Results are displayed in Additional file 1, Fig. S4 and Table S1. To derive a "final" gene signature, we fitted Model I using all five datasets and informative priors. Some genes were included in this signature that a priori had been regarded as nonrelevant to breast cancer; these genes were strongly supported by the data and may be of interest for future experimental investigations. The genes included in the signature are shown in Additional file 1, Table S2, stratified into genes that were relevant a priori as well as a posteriori, and genes that were regarded nonrelevant a priori but relevant a posteriori.
In order to further demonstrate the eScienceBayes approach, we used it to create a model that predicted the probability of distant metastasis development as a function of time. This model would provide finer granularity in the prognosis predictions of future patients compared to the current practice of stratifying patients into only two classes (metastases or no metastases within five years). Patient stratification into two classes discards information by discretizing a continuous variable (time to development of distant metastasis) into an arbitrarily defined binary variable (yes/no development of distant metastasis within five years). Because not all patients had developed distant metastases before the last followup in the studies, this required a consideration of the censoring of the data (Methods). We created a multilevel Bayesian accelerated failure time (AFT) (Methods; Sha et al. [21]) model with the expression data as explanatory variables and the informative prior distributions (Model II, see Methods for details). Analogously to the assessment of Models I, we assessed Model II by fitting it five times, each time using three of the datasets as a training set and the remaining two datasets as test sets to assess the performance of Model II. The results implied that we indeed could provide more finegrained prognoses predictions, compared to the binary classification afforded with Model I (Fig. 3c).
In this example, we provided evidence for that adopting the eScienceBayes approach can significantly improve consistency in gene signature selection and in the predictive performance of the constructed models. In contrast to previous studies, our results were presented together with Bayesian confidence intervals, which permit assessment of the level of accuracy in the results [4].
Proteomics data: Analyzing PDZ domainpeptide interactions
PDZ domains mediate proteinprotein interactions and have been extensively studied over the last fifteen years. Recently, a largescale PDZ domainpeptide interaction dataset was published in Stiffler et al. [22] and followup studies were later described in Chen et al. [23]. The combined data from these two publications comprise 2306 observations of interactions between peptides and PDZ domains from mouse, C. elegans, and D. melanogaster. We reanalyzed these data using the eScienceBayes approach. To this end, we modeled the PDZ domainpeptide interactions using a multilevel Bayesian probit model with physicochemical properties of the PDZ domains and peptides as explanatory variables and a binary variable as response, indicating whether or not a PDZ domain and a peptide bound to each other (Model III). We derived informative prior distributions from 3D structural information using the Sequence Annotated by Structure [24] (SAS), a tool for annotating a protein sequence with structural information based on all solved 3D structures of the proteins in the Protein Data Bank [25] (PDB). Using the SAS Web service [26] (WSsas) we estiamted the number of contacts made between residues in each PDZ domain and in the peptide ligands. Similarly to the breast cancer demonstration, we used the informative prior distributions to "guide" the variable selection in order to reduce the dimensionality of the model based on the assumption that the importance of a PDZ domain residue for the interaction is reflected in the number of contacts it makes with the ligand (see Methods for details). Fig. 4 shows the work flow for the prior elicitation and modeling. Fig. 5a summarizes the predictive performance of Model III in a receiver operating characteristics (ROC, see Methods) curve based on a 10fold crossvalidation, and shows a comparison with the predictive performance when noninformative prior distributions were used (i.e. equal prior probability for all variables to be included in the variable selection; see Methods). We observed a nonsignificant (Bayesian Pvalue = 0.11) improvement in the predictive performance with informative compared to noninformative priors (Fig. 5a).
In this example, the main advantage of adopting an eScienceBayes approach was the increase in information that we obtained. For example, Model III enabled an analysis of the differences between PDZ domains. By allowing the regression coefficients to vary between PDZ domains (Methods), we found differences in the estimated nonzero coefficients for different PDZ domains (Fig. 5b). This suggests that there may be variations among the PDZ domains in the networks of residues that are important for interacting with ligands [27]. Interestingly, some of these residues are located far away from the binding site. Although the tertiary structures of PDZ domains are very similar, their primary structures vary substantially [28]. This may cause differences in the intramolecular interactions within different PDZ domains. Networks of such intramolecular interaction have previously been reported for PDZ domains [28, 29]. The results from studies by Gianni et al. [29] and Chi et al. [30] support the notion that the networks may differ among PDZ domains; for example, a network present in a PDZ domain from mouse tyrosine phosphatase BL was not found in the human PSD95 PDZ3. The eScienceBayes approach allowed analysis of these differences across a large set of PDZ domains.
Model III further indicated that the PDZ domains from C. elegans and D. melanogaster may form two clusters (Fig. 5b). Given the small datasets from C. elegans and D. melanogaster, which contain only seven PDZ domains each (Fig. 4), the clustering might be a sampling or an experimental artifact. Nevertheless, it is tantalizing to hypothesize that there are general trends in the intramolecular interaction networks in PDZ domains that differ between species. These observed differences are further illustrated in Fig 5c, where five residues, estimated to be allosterically linked to residues in the binding peptide, are shown in one representative PDZ domain from mouse, C. elegans, and D. melanogaster. As shown in the figure, different amino acids, with different positions and physicochemical properties, are estimated to be allosterically coupled to the binding peptide in the three different PDZ domains.
This example shows that an eScienceBayes approach to modeling omics data can straightforwardly accommodate grouping structures in the data. This enables the analysis to reveal tentative differences among PDZ domains that allowed the inference of putative allosteric networks.
Discussion and Conclusions
In a recent Nature Horizons article [12], Cambridge chemistry professor Peter MurrayRust sketched an eScience world where the answer to any question is at the fingertips of every person; the complexity of the question and the size of the data to analyze do not matter. Computers perform all trivial and timeconsuming tasks, like searching through millions of research articles or performing massive algorithmic calculations. In this proofofprinciple paper we have shown that some aspects of this vision in fact have been realized: machines can via Web services be used to retrieve background information that together with multiple data sources (produced locally or retrieved from the Internet) can be coherently exploited in a Bayesian framework to create complex models by employment of HPC resources, thus allowing us to provide answers to biologically relevant questions. This modus operandi contrasts to nonBayesian approaches for merging multiple omics datasets and for integrating omics data with prior information, which to a large degree have relied on ad hoc methods for solving specific problems (see e.g. Kutalik et al. [5], Chuang et al. [6], and Xu et al. [7]). Bayesian statistics provide the flexibility to tailor statistical models to complex data structures, and can simultaneously accommodate prior information. The rather complex models described here would be diffcult to fit using classical statistics. We demonstrated the value of the Bayesian approach with the quite drastic improvements in prediction accuracy in the breast cancer example, and the enhanced depth of the data analysis in the PDZ domainpeptide interaction example. However, presently the eScienceBayes approach is not easy to apply because it requires manual procurement of suitable information resources and Web services, and it requires programming skills to implement the Web service clients, parse the output, and connect them together. Moreover, the eScienceBayes approach requires implementation of MCMCalgorithms to fit Bayesian models and knowledge of middleware and nonuserfriendly interfaces for deployment of the MCMCalgorithms on HPC facilities. However, the most crucial difficulty is the prior elicitation process. Since the prior distributions are central in the Bayesian paradigm, a key aspect of applying the eScienceBayes approach is to summarize retrieved prior knowledge as distributions. In both the breast cancer and PDZ domain demonstrations we used prior information quantified as integers to represent our a priori belief of the independent variables relevance for modeling the response. Although care was taken to design the prior distributions in a sensible way, the prior specifications reflects a somewhat arbitrary step in the work flow, and there is presently no clear way to standardize prior elicitation and make the process transparent, exchangeable, and completely general. Related to this difficulty, another problem is that useful data and prior information in today's scientific practice are lost in objects that are inaccessible to computers, such as figures, tables, and summary statistics.
Thus, there is a distinct need for a standardized system of technologies and tools for working efficiently with computer representations of biological entities, data, information, probabilities, and statistical models  both locally and on the network. Great efforts are indeed being made in this direction. These include the increasing degrees of semantically annotated data that are deposited in public repositories in machine readable formats [31], new Web service protocols that allow for service discovery [32], semantic web technologies to adhere information with probability distributions that can be directly used as priors in Bayesian statistical models [33], and highlevel languages for specifying fast implementations of MCMC algorithms [34]. Moreover, graphical workbenches, like Bioclipse [35], aim to handle all these tasks from a single point of entry. When these technologies and software have matured, it will be straightforward to adopt an eScienceBayes approach to virtually any biological or clinical question. It would then be easy to obtain prior information and data by performing semantic queries, for example, "give me all breast cancer susceptibility genes and the probability distributions describing their association with increasing risk of distant metastasis development, together with all breast cancer Affymetrix HGU133A datasets where distant metastasis development was the clinical endpoint". The retrieved priors and data could then be used directly in a Bayesian statistical model specified in a highlevel language and converted to an algorithmic fitting process deployed on an HPC facility. Finally, the results could be summarized graphically to increase the likelihood of making correct clinical prognostic and treatment decisions.
Methods
Transcriptomics data: Predicting distant metastasis development in breast cancer patients
The datasets with accession numbers GSE2034, GSE7390, GSE4922, GSE2990, and GSE1456 were retrieved with the GEO Web service [19]. The datasets were originally published in Wang et al. [13], Desmedt et al. [20], Miller et al. [14], Sotiriou et al. [15], and Pawitan et al. [16], and we hence denote them by (X, t, γ)_{ ρ }= (X_{ ρ }, t_{ ρ }, γ_{ ρ }), ρ ∈ {W, D, M, S, P}. X represents gene expression measurements, t the time to development of distant metastses, and γ is a binary variable indicating censoring. All five datasets were generated with the Affymetrix HGU133A platform and contained measurements of 22283 probes in 286, 198, 166, 189, and 159 patients, respectively (after removal of 85 replicate patients in the (X, t, γ)_{ M }and (X, t, γ)_{ S }datasets). The patients in the Miller et al. [14] and the Pawitan et al. [16] studies were profiled using both the Affymetrix HGU133A and HGU133B arrays; we here used only the data from the HGU133A array. The columns in each X_{ ρ }were mean centered and scaled to unit variance.
We define g_{ j }, j = 1,...,22283, to be the gene names of the probes on the Affymetrix HGU133A array, and to be the set of gene names reported in NetPath as being transcriptionally regulated by cancer pathways, and to be the set of words in the English language (Fig. 2). Further, set h_{ j }, j = 1,....,22283, to be the number of free, fulltext articles in PubMed where the gene name g_{ j }was mentioned. Finally, let k_{ j }, j = 1,...,22283, be defined according to:
Model I
We modeled whether development of distant metastases occurred within five years using a multilevel Bayesian probit model according to:
where Pr denotes probability, and Φ is the normal cumulative distribution function, N the multivariate normal distribution, δ the onepoint distribution, and W^{1} the inverse Wishart distribution. n_{ ρ }denotes the number of patients in dataset ρ. J^{κ}is the diagonal matrix with the nonzero elements in the vector as diagonal elements. Similarly, and are the elements of x_{ ρi }and β_{ ρ }, respectively, that correspond to the nonzero elements in (x_{ ρi }relates to patient i in dataset ρ and β_{ ρ }are the regression coefficients related to dataset ρ). The index ρ indicates that the coefficients varied by breast cancer study (W, D, M, S, P).
We assumed a priori that on average 100 genes should be included in the signature (i.e. that = 100, where E denotes the expectation operator). This assumptions was based on the number of genes used in previously reported signatures [13–16]. Further, we assumed that the greater the number of times a cancer pathway regulated gene's name has been mentioned together with breast cancer in free, fulltext PubMed articles, the greater is the a priori scientific support that the gene's regulation is correlated with breast cancer relapse. Thus, the larger the kvalue (as defined in (1)) associated with a given gene, the greater the a priori probability of the gene to be included in the signature. In order to give all genes a chance of being selected, we assign the kvalue 1 to a gene not reported in NetPath. Genes reported in NetPath but not mentioned in any free PubMed article get the kvalue 2. The prior distribution for is founded on the idea that the probability of selecting one variable j of the 22283 variables is proportional to the variable j's kvalue, i.e. Pr(select j in one draw) = . Thus, the probability of not selecting the variable j is and the probability of selecting variable j in m independent draws is . All genes thus have a chance to get selected and all genes have a chance of not being selected.
Apart from the prior distribution of , no prior knowledge was assumed for the remaining model parameters (i.e. noninformative priors was used for , Σ^{κ}and m).
Model I was validated by fitting it five times, each time using three of the five datasets (X, r, γ)_{ ρ }, ρ ∈ {W, D, M, S, P}, as training sets and the remaining two as independent test sets. The benefit of using multiple datasets for fitting Model I was assessed by using the same partitioning of the five datasets; however this time only one of the three training sets was used for actually fitting the model (see Additional file 1, Table S1 for the partitions and results). The gain of exploiting the prior information was assessed by modifying Model I so that p = 100/22283 in (2), in which case all genes get an equal a priori probability to be included in the signature.
To afford comparisons with the gene signatures published in Wang et al. [13], Miller et al. [14], Sotiriou et al. [15], and Pawitan et al. [16], we modified Model I according to:
where denotes the subset of x_{ ρi }that corresponds to the gene signature ρ_{ sig }derived in publication ρ, is the number of genes in the signature ρ_{ sig }, and the identity matrix of rank .
To derive a gene signature using all the data, we fitted Model I as described in (2). We counted the number of times that each gene appeared in the 200,000 draws from the posterior distribution in the Markov chain Monte Carlo algorithm (after 200,000 draws burnin, see below). We defined the appearance frequency of a gene g_{ j }as the number of appearances of g_{ j }divided by the total number of iterations (i.e., 200,000 here). We reasoned that the genes with the highest appearance frequencies play the strongest role in predicting the response. The top 100 genes, according to appearance frequency, were selected and are found in Additional file 1, Table S2.
Model II
We modeled the time to development of distant metastases according to:
where the denotations are as in Model I, and ε_{ ρi }is the error term for patient i in dataset ρ. Model II was validated by fitting it five times, each time using three of the five datasets (X, r, γ)_{ ρ }, ρ ∈ {W, D, M, S, P}, as training sets and the remaining two as independent test sets.
Proteomics data: Analyzing PDZ domainpeptide interactions
The peptides and the aligned PDZ sequences (Fig. 4) were, after removal of alignment positions containing gaps, converted to numerical vectors capturing the physicochemical properties of each sequence by using the tciz scales of Muthas et al[36]. The tciz scales are the amino acids' component scores in the two first principal components derived from a principal component analysis of physicochemical properties of 113 natural and nonnatural amino acids. The two tciz scales thus code each sequence residue into two realvalued numbers, which are strongly correlated with the amino acids size and polarity [36]. The tciz scales thus allow us to numerically characterize peptides by concatenating the tciz numbers for all residues in the peptide. Similarly, we could numerically characterize an interaction between a PDZ domain and a peptide as the concatenated tciz description vectors of the PDZ domain and the peptide. Characterizing all interactions in the Stiffler et al. [22] and Chen et al. [23] datasets gave us a matrix, X, with 2306 observations (2019 from mouse, 147 from C. elegans, and 140 from D. melanogaster PDZ domains) and 154 variables (67 residue positions, r_{1},...,r_{67}, in the alignment of the PDZ domains were "gapfree", and each position was coded with two tciz numbers: 67·2 = 134 variables; the peptides contained 10 residues each: 10·2 = 20 variables).
Let H = (X, H(X)) be mean centered and scaled to unit variance, where H(X) denotes the second order interaction terms of the variables in X. H thus has 2306 rows and 11935 columns. Let the column vector y = (y_{1},...,y_{2306}) be the response variable, where y_{ i }= 1 represent that the PDZ domain and the ligand interact and y_{ i }= 0 represents that they do not interact. We assumed, with support from the literature [37] and the data from Stiffler et al. [22], that some PDZ domains are more promiscuous binders than others. This resulted in a correlated error structure when modeling the data. This is an assumption violation in standard generalized linear models, but could be accommodated for in our Bayesian setting by allowing parameter estimates to vary among PDZ domains. Thus, let ϕ = 1,...,96 denote the 96 PDZ domains in the data from Stiffler et al. [22] and Chen et al. [23] (82 from mouse, 7 from C. elegans, and 7 from D. melanogaster).
The PDZ domain residues r_{1},...,r_{67} and the peptide residues p_{1},...,r_{10} are associated with the integers c_{1},...,c_{77}, where the cvalues represent the number of contacts its corresponding residue makes with a ligand, as estimated by SAS by analyzing all solved 3D structures of PDZ domainpeptide complexes available in the Protein Data Bank [25] (PDB) (Fig. 4). We construct the row vectors
where H(c_{ X }) denotes the second order interaction terms of c_{ X }. c_{ X }thus represents the cvalues of the tcizcoded residues r_{1},...,r_{67} and p_{1},...,p_{10} incremented by 1. The cvalues were incremented by 1 to ensure that all variables in H have a chance of being selected (i.e. including those with cvalues equal to zero).
Model III
We modeled the interaction between PDZ domains and peptides using a Bayesian hierarchical probit model according to:
where is the i th row in (the rows of H corresponding to PDZ domain ϕ). d_{ j }is the j th value in the row vector c_{X, H(X)}. Hence, the a priori probability of a variable to be selected is larger, the larger the number of contacts the PDZ residue it is derived from makes with the ligand (the assumption being that the more contact a PDZ residue makes with the ligand, the greater our a priori belief that the residue is important for the interaction). The other denotations are analogous to the ones used in Model I. The prior for m, i.e. the parameter controlling the number of selected variables, was chosen so that roughly one twentieth of the total number of variables was selected (i.e. so that = 11935/20, where E denotes the expectation operator). The mean of the prior distribution for m was estimated using crossvalidation on the training data. Model III was also modified to assess the use of the prior information about which residues that are important for governing the interaction between PDZ domains and peptides. We then set p = 1/11935.
Residue coupling depicted in Fig. 5c
The coupling was estimated by studying the interaction terms in H formed between tciz scales from one residue in the PDZ domain and one in the peptide as outlined in Prusis et al. [38]. The figure displays the five such interaction terms with the largest (absolutevalued) modes of the regression coefficient's distributions. The network residues are color coded according to the physicochemical property that is estimated to be important for the network; residue size is shown in blue and residue polarity in red. Note that the same PDZ domain 3D structure (α 1syntrophin PDZ, PDB accession number 2PDZ) has been used in all three pictures since the structures of Lap4(2/4) and Lin7(1/1) from C. elegans and D. melanogaster have not been solved. The residue labels refer to the amino acid at the given position in Lap4(2/4) and Lin7(1/1) and do thus not correspond to the structure of the depicted amino acids. Residue numbering is according to 2PDZ.pdb.
Bayesian Pvalues
To explain the Bayesian Pvalues used in the article, we give the following example: Let be the two datasets that were removed when fitting Model I and the remaining three datasets. Further, let β and β' be the regression coefficients when Model I was fitted with informative and noninformative priors, respectively. By sampling from the posterior distributions of β and β', we could estimate the distributions of and . The Bayesian Pvalue could then be computed as . This idea is closely related to the method suggested in XiaoLi [39].
ROC curves
ROC curves are plots of the truepositive fraction (TPF) plotted as a function of the falsepositive fraction (FPF) for a binary classifier system as its discrimination threshold is varied. The TPF and FPF are calculated as follows: TPF = TP/(TP + FN) and FPF = FP/(FP + TN), where TP = true positives, FP = false positives, FN = false negatives, and TN = true negatives. The larger the area under the curve (the AUC), the better the classifier. The AUC is equal to the MannWhitney U statistic [40], a nonparametric test for assessing whether two independent samples of observations come from the same distribution.
Model fitting
The posterior distributions for Model I, II, and III are not possible to derive in closed form. The models were thus fitted using Markov chain Monte Carlo (MCMC) methods (see Robert and Casella [41] for an excellent account on MCMC methods). The MCMC algorithms were implemented in the statistical programming language R [42]. Three chains were run for each model and after a burnin period of 200,000 samples, we collected the following 200,000 samples from the posterior distributions. Convergence was checked using the potential scale reduction factor [43]. The computations were performed on UPPMAX computational resources http://www.uppmax.uu.se/ under the project p2006026.
References
 1.
Butcher EC, Berg EL, Kunkel EJ: Systems biology in drug discovery. Nat Biotechnol 2004, 22(10):1253–1259. 10.1038/nbt1017
 2.
Ho RL, Lieu CA: Systems Biology: An Evolving Approach in Drug Discovery and Development. Drugs in R&D 2008, 9(4):203–216.
 3.
WickenbergBolin U, Göransson H, Fryknäs M, Gustafsson MG, Isaksson A: Improved variance estimation of classiffication performance via reduction of bias caused by small sample size. BMC Bioinformatics 2006, 7: 127. 10.1186/147121057127
 4.
Isaksson A, Wallman M, Göoransson H, Gustafsson MG: Crossvalidation and bootstrapping are unreliable in small sample classification. Pattern Recogn Lett 2008, 29(14):1960–1965. 10.1016/j.patrec.2008.06.018
 5.
Kutalik Z, Beckmann JS, Bergmann S: A modular approach for integrative analysis of largescale geneexpression and drugresponse data. Nat Biotechnol 2008, 26(5):531–539. 10.1038/nbt1397
 6.
Chuang H, Lee E, Liu Y, Lee D, Ideker T: Networkbased classiffication of breast cancer metastasis. Mol Syst Biol 2007, 3: 140. 10.1038/msb4100180
 7.
Xu L, Tan AC, Winslow RL, Geman D: Merging microarray data from separate breast cancer studies provides a robust prognostic test. BMC Bioinformatics 2008, 9: 125. 10.1186/147121059125
 8.
Carlin BP, Louis TA: Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, New York; 2000.
 9.
Eddy SR: What is Bayesian statistics? Nat Biotechnol 2004, 22(9):1177–1178. 10.1038/nbt09041177
 10.
Gelman A, Hill J: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York; 2007.
 11.
Berger JO: Statistical Decision Theory and Bayesian Analysis. SpringerVerlag, Berlin; 1985.
 12.
MurrayRust P: Chemistry for everyone. Nature 2008, 451(7179):648–651. 10.1038/451648a
 13.
Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J, Jatkoe T, Berns EMJJ, Atkins D, Foekens JA: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer. Lancet 2005, 365(9460):671–679.
 14.
Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 2005, 102(38):13550–13555. 10.1073/pnas.0506230102
 15.
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst 2006, 98(4):262–272. 10.1093/jnci/djj052
 16.
Pawitan Y, Bjöohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, Liu ET, Miller L, Nordgren H, Ploner A, Sandelin K, Shaw PM, Smeds J, Skoog L, Wedren S, Bergh J: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts. Breast Cancer Res 2005, 7(6):R953–64. 10.1186/bcr1325
 17.
EinDor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21(2):171–178. 10.1093/bioinformatics/bth469
 18.
Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 2005, 365(9458):488–492. 10.1016/S01406736(05)178660
 19.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. 10.1093/nar/30.1.207
 20.
Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, HaibeKains B, Viale G, Delorenzi M, Zhang Y, d'Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C: Strong time dependence of the 76gene prognostic signature for nodenegative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007, 13(11):3207–3214. 10.1158/10780432.CCR062765
 21.
Sha N, Tadesse MG, Vannucci M: Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 2006, 22(18):2262–2268. 10.1093/bioinformatics/btl362
 22.
Stiffler MA, Chen JR, Grantcharova VP, Lei Y, Fuchs D, Allen JE, Zaslavskaia LA, MacBeath G: PDZ domain binding selectivity is optimized across the mouse proteome. Science 2007, 317(5836):364–369. 10.1126/science.1144592
 23.
Chen JR, Chang BH, Allen JE, Stiffler MA, MacBeath G: Predicting PDZ domainpeptide interactions from primary sequences. Nat Biotechnol 2008, 26(9):1041–1045. 10.1038/nbt.1489
 24.
Milburn D, Laskowski RA, Thornton JM: Sequences annotated by structure: a tool to facilitate the use of structural information in sequence analysis. Protein Eng 1998, 11(10):855–859. 10.1093/protein/11.10.855
 25.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
 26.
Talavera D, Laskowski RA, Thornton JM: WSsas: a web service for the annotation of functional residues through structural homologues. Bioinformatics 2009, 25(9):1192–1194. 10.1093/bioinformatics/btp116
 27.
Smock RG, Gierasch LM: Sending signals dynamically. Science 2009, 324(5924):198–203. 10.1126/science.1169377
 28.
Lockless S, Ranganathan R: Evolutionarily conserved pathways of energetic connectivity in protein families. Science 1999, 286(5438):295–299. 10.1126/science.286.5438.295
 29.
Gianni S, Walma T, Arcovito A, Calosci N, Bellelli A, Engström A, TravagliniAllocatelli C, Brunori M, Jemth P, Vuister GW: Demonstration of longrange interactions in a PDZ domain by NMR, kinetics, and protein engineering. Structure 2006, 14(12):1801–1809. 10.1016/j.str.2006.10.010
 30.
Chi CN, Elfström L, Shi Y, Snäll T, Engstörm Å, Jemth P: Reassessing a sparse energetic network within a single protein domain. Proc Natl Acad Sci USA 2008, 105(12):4679–4684. 10.1073/pnas.0711732105
 31.
Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RKJ, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novere N, LeebensMack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S, Quackenbush J, Reecy JM, Robertson DG, RoccaSerra P, Rodriguez H, Rosenfelder H, SantoyoLopez J, Scheuermann RH, Schober D, Smith B, Snape J, Stoeckert CJJ, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 2008, 26(8):889–896. 10.1038/nbt.1411
 32.
Wagener J, Spjuth O, Willighagen EL, S WJE: XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous Web services. BMC Bioinformatics 2009., 10(279):
 33.
da Costa PCG, Laskey KB, Laskey KJ: PROWL: A Bayesian ontology language for the semantic web. Berlin, Heidelberg: SpringerVerlag; 2008.
 34.
Daumé H III: HBC: Hierarchical Bayes Compiler. 2007.
 35.
Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, MurrayRust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo and bioinformatics. BMC Bioinformatics 2007, 8: 59. 10.1186/14712105859
 36.
Muthas D, Lek PM, Nurbo J, Karlén A, Lundstedt T: Focused hierarchical design of peptide libraries follow the lead. J Chemometrics 2007, 21(10–11):486–495. 10.1002/cem.1069
 37.
Tonikian R, Zhang Y, Sazinsky SL, Currell B, Yeh JH, Reva B, Held HA, Appleton BA, Evangelista M, Wu Y, Xin X, Chan AC, Seshagiri S, Lasky LA, Sander C, Boone C, Bader GD, Sidhu SS: A specificity map for the PDZ domain family. PLoS Biol 2008, 6(9):e239. 10.1371/journal.pbio.0060239
 38.
Prusis P, Uhlén S, Petrovska R, Lapinsh M, Wikberg JES: Prediction of indirect interactions in proteins. BMC Bioinformatics 2006, 7: 167. 10.1186/147121057167
 39.
XiaoLi M: Posterior Predictive pvalues. The Annals of Statistics 1994, 22(3):1142–1160. 10.1214/aos/1176325622
 40.
Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 1947, 18: 50–60. 10.1214/aoms/1177730491
 41.
Robert CP, Casella G: Monte Carlo statistical methods. 2nd edition. SpringerVerlag, New York; 2004.
 42.
R Development Core Team: R: A Language and Environment for Statistical Computing.R Foundation for Statistical Computing, Vienna, Austria; 2009. [http://www.Rproject.org] [ISBN 3900051070]
 43.
Gelman A, Rubin DB: Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 1992, 7(4):457–472. 10.1214/ss/1177011136
 44.
Kaplan EL, Maier P: Nonparametric estimation of incomplete observations. J Am Stat Assoc 1958, 53: 457–81. 10.2307/2281868
Acknowledgements
Financial support was obtained frm Swedish VR (04X05957) and Uppsala University (KoF07).
Author information
Additional information
Authors' contributions
ME devised and implemented the proposed eScienceBayes framework. OS was involved in program design and aided with the implementation. JESW supervised the project. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Omics Data
 Markov Chain Monte Carlo Algorithm
 Bayesian Variable Selection
 Informative Prior Distribution
 Noninformative Prior