Data
The human protein interaction network
We used the data of Stelzl et al., generated by automated yeast two-hybrid (Y2H) interaction mating [36]. Of the 3186 interactions identified among 1705 proteins in total, the authors selected 911 interactions among 401 proteins for the high-confidence protein interaction network. They developed a scoring system with 6 criteria, including the existence of orthologous protein interactions in D. melanogaster, C. elegans and S. cerevisiae, shared GO annotations between the proteins, etc. Only interactions that scored 3 or more points under the 6 criteria were selected. We use this high-scoring protein interaction network for our analysis. Additional File 1 contains the list of the 401 underlying proteins.
Protein sequences
The primary sequences of the proteins, used to predict whether each protein is ordered or disordered, were downloaded from NCBI [37].
Prediction of protein disorder
To predict the fold group (ordered or disordered) of the proteins, we used POODLE, freely available at [38]. POODLE is a set of programs for predicting protein disorder from amino acid sequences using machine learning approaches. POODLE provides three types of disorder prediction, according to the length of the target disorder. The POODLE-S version puts emphasis on predicting short disordered regions [39]. The POODLE-L version focuses on predicting long disordered regions, mainly longer than 40 consecutive amino acids [40]. The POODLE-W version predicts whether a protein is mostly disordered [41]. We used POODLE-W to classify our 401 proteins as disordered or ordered. Additional File 2 contains the disorder probability output by POODLE-W for each protein. Based on these probabilities, using the threshold 0.7, we obtained 130 disordered proteins (see the list of these proteins in Additional File 3).
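This thresholding step can be sketched as follows (illustrative Python, not the authors' pipeline; the protein identifiers and probabilities are invented, and we assume the threshold is inclusive):

```python
# Illustrative sketch: classifying proteins as disordered/ordered from
# POODLE-W disorder probabilities using the 0.7 threshold. The dictionary
# below is made-up example data, not the real Additional File 2.
disorder_prob = {
    "P1": 0.92,   # hypothetical protein identifiers and probabilities
    "P2": 0.15,
    "P3": 0.71,
    "P4": 0.69,
}

THRESHOLD = 0.7  # threshold used in the paper

def classify(probs, threshold=THRESHOLD):
    """Return the set of proteins whose disorder probability meets the threshold."""
    return {p for p, prob in probs.items() if prob >= threshold}

disordered = classify(disorder_prob)
```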
Bayesian statistical analysis of the network
Network model
In the statistical modelling approach, the network with n nodes is represented by a random variable X; a realization of X is denoted by X = x. Let the binary outcome X_ij = 1 if node i has a directed edge to node j, and X_ij = 0 otherwise; then x is a binary data matrix (X_ij). The exponential random graph model has the following general form:

P(X = x) = exp(Σ_p θ_p z_p(x)) / κ(θ),

where z_p(x) is the network statistic of type p, θ_p is the parameter associated with z_p(x), and κ(θ) is a normalizing constant that ensures that the probabilities sum to unity [42]. The parameters θ are the unknown 'regression coefficients' which need to be estimated.
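The general form can be made concrete with a toy example. The sketch below (illustrative Python; the single statistic, the 3-node size and the parameter value are our own choices) evaluates an ERGM whose only statistic is the edge count, computing κ(θ) by brute-force enumeration of all 2^(n(n-1)) directed graphs:

```python
# Sketch of the ERGM form P(X = x) = exp(sum_p theta_p z_p(x)) / kappa(theta),
# with a single statistic z(x) = number of edges, on a 3-node directed graph.
from itertools import product
from math import exp

n = 3
pairs = [(i, j) for i in range(n) for j in range(n) if i != j]  # n(n-1) ordered pairs
theta = -0.5  # arbitrary example density parameter

def weight(x):
    """Unnormalized probability exp(theta * edge_count) of a graph x (dict pair -> 0/1)."""
    return exp(theta * sum(x.values()))

# Enumerate every directed graph to obtain the normalizing constant kappa(theta).
graphs = [dict(zip(pairs, bits)) for bits in product([0, 1], repeat=len(pairs))]
kappa = sum(weight(x) for x in graphs)

def prob(x):
    return weight(x) / kappa

total = sum(prob(x) for x in graphs)  # sums to 1 by construction
```

The brute-force enumeration is exactly what is infeasible for real networks, which motivates the approximate estimation techniques discussed next.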
It is this normalizing constant which makes maximum likelihood estimation of random graph models intractable, since it is defined over the entire graph space with 2^(n(n-1)) possible directed graphs. Thus, for the estimation of p*-models, alternative model formulations and approximate estimation techniques have been developed. The maximum pseudolikelihood estimation (MPLE) approach uses an approximate likelihood function [43]. MC-MLE applies the Markov chain Monte Carlo stochastic simulation technique to maximum likelihood estimation [44].
In the present paper, we consider a special class of ERGMs, the p2-model. This model allows Bayesian inference to be carried out using Gibbs sampling, as proposed in [45].
Firstly, we introduce the p1-model for directed graphs and then show how it can be modified for the undirected case. The p2-model is presented in the next section.
The p1-model considers the relationships between two nodes; three relations are possible: no edge, an asymmetric edge, or a reciprocated edge. The p1-model seeks to predict the probability of each of these kinds of relations for each pair of nodes. The unit of analysis for the model is the dyad D_ij = (X_ij, X_ji), i.e. the pair of edges between two nodes i and j, for 1 ≤ i < j ≤ n. Each dyad has four possible outcomes: (0, 0), (1, 0), (0, 1), (1, 1). Let

m_ij = P(X_ij = 1, X_ji = 1), a_ij = P(X_ij = 1, X_ji = 0), n_ij = P(X_ij = 0, X_ji = 0),

with m_ij + a_ij + a_ji + n_ij = 1. Here, m_ij stands for the probability of a reciprocated (mutual) relation, a_ij for an asymmetric relation and n_ij for no relation between nodes i and j. The p1-model assumes that the dyads are mutually independent; the joint probability for the data matrix (X_ij) is then written as:

P(X = x) = Π_{i<j} m_ij^(x_ij x_ji) a_ij^(x_ij(1-x_ji)) a_ji^(x_ji(1-x_ij)) n_ij^((1-x_ij)(1-x_ji)).
By taking logarithms and exponentiating, one obtains

P(X = x) = Π_{i<j} n_ij exp(θ_ij x_ij + θ_ji x_ji + ϕ_ij x_ij x_ji),

where

θ_ij = log(a_ij / n_ij)

and

ϕ_ij = log(m_ij n_ij / (a_ij a_ji)).

The log-odds ratio θ_ij divides the sample cases where there is an edge from i to j by the cases where there is no edge and, therefore, measures the likelihood of an edge. The log-odds ratio ϕ_ij divides the symmetric cases by the asymmetric cases and, intuitively, measures the tendency for reciprocation of the edge (i, j). To overcome the problem of too many parameters, it was postulated that

ϕ_ij = ϕ and θ_ij = θ + α_i + β_j.
Hence, ϕ is a global parameter that measures the degree of reciprocity in the entire network, and θ_ij is split into global and local node-specific effects. The global parameter θ (called density) measures the overall tendency to form edges in the population. Each local parameter α_i represents the propensity of node i to send edges (expansiveness). The local parameter β_j represents the ability of node j to attract edges (attractiveness). A positive value of α_i or β_i indicates that the corresponding node i is more expansive or attractive, respectively, than a typical member of the population, i.e. it contributes to the probability of an edge beyond the overall density of the network.
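Under this parametrization, the four outcome probabilities of a dyad can be computed directly by normalizing the corresponding exponential weights. The sketch below (illustrative Python; the parameter values are arbitrary) does this so that m_ij + a_ij + a_ji + n_ij = 1:

```python
# Sketch: dyad outcome probabilities under the p1 parametrization
# theta_ij = theta + alpha_i + beta_j, phi_ij = phi.
from math import exp

def dyad_probs(theta, phi, alpha_i, beta_i, alpha_j, beta_j):
    """Return {(x_ij, x_ji): probability} for the four dyad outcomes,
    normalized so the probabilities sum to one."""
    t_ij = theta + alpha_i + beta_j   # log-odds of an edge i -> j
    t_ji = theta + alpha_j + beta_i   # log-odds of an edge j -> i
    w = {
        (0, 0): 1.0,                          # null relation
        (1, 0): exp(t_ij),                    # asymmetric i -> j
        (0, 1): exp(t_ji),                    # asymmetric j -> i
        (1, 1): exp(t_ij + t_ji + phi),       # mutual relation
    }
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

# Arbitrary illustrative values: node i is more expansive (alpha_i > 0),
# and the network has a strong reciprocity effect (phi > 0).
p = dyad_probs(theta=-1.0, phi=2.0, alpha_i=0.5, beta_i=0.0, alpha_j=0.0, beta_j=0.0)
```

A positive ϕ inflates the probability of the mutual outcome relative to what the two edges would contribute independently.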
With this transformation of parameters, the distribution is then written as:

P(X = x) ∝ exp(θE + ϕM + Σ_i α_i Δ_out(i) + Σ_i β_i Δ_in(i)).

In this form the distribution belongs to the exponential family with the following statistics: M, the number of mutual (reciprocated) edges; E, the total number of edges; and Δ_in(i) and Δ_out(i), the in- and out-degrees of node i. The equation shows that the coefficient θ refers to the effect of the global density of the network on the probability of edges between two nodes, ϕ refers to the effect of the overall amount of reciprocity in the network on the probability of a reciprocated edge, and α and β refer to the effects of each node's out-degree and in-degree on the probability that the node will have connections to other nodes. An equivalent log-linear formulation of the p1-model was suggested by Fienberg and Wasserman [46]. In this formulation, a dyad (X_ij, X_ji) is represented by four Bernoulli variables Y_ij^00, Y_ij^10, Y_ij^01 and Y_ij^11 as follows:

Y_ij^kl = 1 if (X_ij, X_ji) = (k, l), and Y_ij^kl = 0 otherwise.
The p1-model is then described by the four log-linear equations:

log P(Y_ij^10 = 1) = λ_ij + θ + α_i + β_j,
log P(Y_ij^01 = 1) = λ_ij + θ + α_j + β_i,
log P(Y_ij^11 = 1) = λ_ij + 2θ + α_i + α_j + β_i + β_j + ϕ,
log P(Y_ij^00 = 1) = λ_ij,

for i < j. The first and second equations describe the probability that two nodes are connected by an asymmetric relation. The third predicts the probability of a reciprocated relation, and the last describes the probability of a null relation between the nodes. In this formulation, the scaling parameters λ_ij = log(n_ij) play the role of a 'residual': if an edge is not mutual or asymmetric, it must be absent. The λ's are fixed by the constraint Σ_{k,l} P(Y_ij^kl = 1) = 1.
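The indicator variables of this formulation are exactly the data matrices that a fitting routine must construct from the observed adjacency matrix. A minimal sketch (illustrative Python; the 3-node adjacency matrix is invented):

```python
# Sketch: building the four Bernoulli indicator matrices Y^kl from an
# observed adjacency matrix X, as in the Fienberg-Wasserman formulation.
X = [
    [0, 1, 1],   # made-up 3-node directed adjacency matrix
    [1, 0, 0],
    [0, 0, 0],
]
n = len(X)

# Y[(k, l)][i][j] = 1 iff the dyad (X_ij, X_ji) equals (k, l), for i < j.
Y = {(k, l): [[0] * n for _ in range(n)] for k in (0, 1) for l in (0, 1)}
for i in range(n):
    for j in range(i + 1, n):
        Y[(X[i][j], X[j][i])][i][j] = 1
```

Each dyad activates exactly one of the four indicators, which is the multinomial structure exploited by the log-linear equations above.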
In the case of an undirected graph, the model described above simplifies as follows. Instead of four, only two Bernoulli variables Y_ij^0 and Y_ij^1 are considered:

Y_ij^1 = 1 if X_ij = 1 and Y_ij^1 = 0 otherwise; Y_ij^0 = 1 - Y_ij^1.

Since there are no directed edges, the reciprocity parameter ϕ equals 0. The expansiveness and attractiveness parameters reduce to a single parameter type, called sociality, which reflects the propensity of a node to be connected in an undirected graph. We denote it by α. The model is then defined by the following two equations, predicting the probability of an edge between nodes i and j being present or absent in the graph:

log P(Y_ij^1 = 1) = λ_ij + θ + α_i + α_j,
log P(Y_ij^0 = 1) = λ_ij,     (1)
for i < j, where θ is the global density parameter. Positive values of α_i or α_j increase the probability that nodes i and j are connected. Hence, the p1-model examines how edges between pairs of nodes relate to important attributes of each node and of the network as a whole, i.e. the model includes the structural features of the network explicitly.
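Eliminating λ_ij via the sum-to-one constraint shows that the undirected model assigns each edge a logistic probability in θ + α_i + α_j, which can be sketched as (illustrative Python):

```python
# Sketch: in the undirected p1-model, eliminating the scaling parameter
# lambda_ij via the constraint P(Y^1) + P(Y^0) = 1 yields a logistic
# edge probability in theta + alpha_i + alpha_j.
from math import exp

def edge_prob(theta, alpha_i, alpha_j):
    """P(X_ij = 1) for an undirected edge between nodes i and j."""
    eta = theta + alpha_i + alpha_j
    return 1.0 / (1.0 + exp(-eta))
```

So nodes with larger socialities are more likely to be connected, over and above the global density θ.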
Extension of the network model (p2-model)
The concepts of network density and reciprocity, as well as node-specific attractiveness and expansiveness, are defined endogenously, i.e. based on the relations within the network. However, they can be linked to exogenous concepts using nodal (as well as edge-specific) attributes. This constitutes one of the important advantages of the statistical modelling of networks: the binary network data can be related to nodal attributes while taking into account the specific network structure.
In particular, the network model can be extended by modelling the sociality parameters α as a linear regression:

α_i = γᵀ X_i + A_i,

where X_i is the vector of size m ≥ 1 of the attributes (i.e. explanatory variables or covariates) of node i, γ is the vector of the corresponding coefficients, and the A_i are so-called random effects [18]. This formulation expresses the simple idea that socialities depend on the nodal attributes. Naturally, the attributes do not explain all variation in the sociality parameters; this is represented by the terms A_i. The components of A are modelled as normally distributed random variables with expectation 0 and variance σ_A². This variance can be interpreted as the variance of the α's remaining after taking into account the effect of the covariates X.
In this paper, we consider only one covariate - the disorder property of each node in the network, thus studying the effect of disorder on the nodal sociality. Generally, multiple attributes can be considered in the extended model. Also, in a similar manner, density and reciprocity parameters can be related to edge-specific attributes.
The specification of the α's described here turns the original p1-model with fixed-effects parameters into the random-effects p2-model [18]. The original p1-model assumes that the dyads are independent, i.e. the probability of edges between nodes i and j is not affected by the presence (or absence) of edges involving any other pair of nodes. Obviously, this assumption is very constraining. The generalization of the p1-model to include node-specific random effects relaxes the assumption of dyadic independence: conditionally on the random effects, the dyads are assumed to be mutually independent.
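The generative mechanism of the random-effects p2-model can be sketched by simulation (illustrative Python; the network size, parameter values and group labels are invented, with the binary attribute standing in for the disorder covariate used later):

```python
# Sketch: simulating an undirected network from the random-effects p2-model,
# with sociality alpha_i = gamma * g_i + a_i and a_i ~ N(0, sigma_a^2).
import random
from math import exp

random.seed(0)
n = 40
theta = -2.0          # global density parameter
gamma = 1.5           # effect of the binary attribute on sociality
sigma_a = 0.5         # residual sociality standard deviation
g = [1 if i < n // 2 else 0 for i in range(n)]   # first half carries the attribute

# Node socialities: covariate effect plus normal random effect.
alpha = [gamma * g[i] + random.gauss(0.0, sigma_a) for i in range(n)]

def edge_prob(i, j):
    eta = theta + alpha[i] + alpha[j]
    return 1.0 / (1.0 + exp(-eta))

# Conditionally on the random effects, dyads are independent Bernoulli draws.
adj = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < edge_prob(i, j):
            adj[i][j] = adj[j][i] = 1

deg = [sum(row) for row in adj]
mean_deg_g1 = sum(d for d, gi in zip(deg, g) if gi) / (n // 2)
mean_deg_g0 = sum(d for d, gi in zip(deg, g) if not gi) / (n // 2)
```

With a positive γ, nodes carrying the attribute end up with systematically higher degrees, which is exactly the kind of effect the model is designed to detect.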
Bayesian inference of the network model
We employ a fully Bayesian approach to network modelling. In contrast to maximum likelihood estimation, the Bayesian approach addresses the problem of learning the model from data by calculating the posterior probability of a model given the data. Suppose that the data D have been generated by a model m. If p(m) is the prior probability of model m, then by Bayes' rule the posterior model probability is

p(m | D) = p(D | m) p(m) / p(D).
The marginal likelihood is defined as

p(D | m) = ∫ p(D | θ_m, m) p(θ_m | m) dθ_m,

where p(θ_m | m) is the prior distribution of the model parameters θ_m for model m. The calculation of the marginal likelihood for the model considered here is intractable. Hence, Markov chain Monte Carlo (MCMC) stochastic simulation techniques must be employed to facilitate Bayesian model inference. MCMC simulation generates samples from the joint posterior distribution p(m, θ_m | D), allowing the posterior parameter probabilities to be estimated. Here, the MCMC technique of Gibbs sampling is applied. Gibbs sampling reduces the problem of dealing simultaneously with a large number of unknown parameters in a joint distribution to the simpler problem of dealing with one variable at a time, iteratively sampling each from its full conditional distribution given the current values of all other variables in the model. A Bayesian approach to learning the network model using Gibbs sampling was described previously in [45].
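The one-variable-at-a-time principle of Gibbs sampling can be illustrated outside the network setting. The sketch below (illustrative Python, not the authors' sampler) alternates between the two full conditionals of a conjugate normal model with unknown mean and precision:

```python
# Sketch of Gibbs sampling: y ~ N(mu, 1/tau), with vague conjugate priors
# mu ~ N(0, 1/tau0) and tau ~ Gamma(a0, b0). Each iteration samples each
# parameter from its full conditional given the current value of the other.
import random
from math import sqrt

random.seed(1)
y = [2.1, 1.8, 2.4, 2.0, 1.9, 2.2]        # made-up data
n = len(y)
ybar = sum(y) / n
a0, b0, tau0 = 0.001, 0.001, 1e-6         # vague hyperparameters

mu, tau = 0.0, 1.0                        # arbitrary starting values
draws = []
for it in range(2000):
    # Full conditional of mu given tau: normal.
    prec = tau0 + n * tau
    mean = n * tau * ybar / prec
    mu = random.gauss(mean, 1.0 / sqrt(prec))
    # Full conditional of tau given mu: gamma (by conjugacy).
    rate = b0 + 0.5 * sum((yi - mu) ** 2 for yi in y)
    tau = random.gammavariate(a0 + 0.5 * n, 1.0 / rate)
    if it >= 500:                         # discard burn-in
        draws.append(mu)

post_mean = sum(draws) / len(draws)       # posterior mean of mu, close to ybar
```

OpenBUGS automates exactly this construction of full conditionals from a declarative model specification.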
We utilize the Linux version of OpenBUGS (BUGS stands for Bayesian inference Using Gibbs Sampling), a general-purpose software package for Gibbs sampling on graphical models [47]. OpenBUGS provides a declarative language for specifying a graphical model. The output of the Markov chain simulation is used to summarize the posterior distributions of the variables of interest.
The Bayesian approach requires the specification of the model likelihood and of prior probability distributions for the model parameters. Equations (1) define the model likelihood, where the Y_k are data matrices computed beforehand from the observed data X. In a Bayesian approach, we can specify the prior distributions for the model parameters hierarchically, i.e. dependent on hyperparameters. We define the prior probability distribution for the density parameter θ as a normal distribution with mean 0 and standard deviation σ_θ (the operator ~ stands for 'is distributed as'):

θ ~ N(0, σ_θ²).
Now we need to specify the prior distribution for the hyperparameter σ_θ. Note that in OpenBUGS the normal distribution is parameterized by its precision τ rather than its standard deviation σ; the two are connected by τ = σ^(-2). We prescribe a gamma distribution for the precision parameter, which is good practice since the gamma distribution is a conjugate prior for the precision of a normal distribution. The specification is then written as:

θ ~ N(0, τ_θ^(-1)), τ_θ ~ Gamma(a0, b0).
To make the prior for θ noninformative, its standard deviation should be large, thus expressing large uncertainty. This is achieved by setting the hyperparameters in the gamma distribution as a0 = 0.001 and b0 = 0.001.
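The effect of these hyperparameter choices can be checked by sampling from the prior itself (illustrative Python; the guard against floating-point underflow is our own addition). Most draws of the precision are tiny, i.e. the implied standard deviation of θ is very large:

```python
# Sketch: drawing from the hierarchical prior tau_theta ~ Gamma(a0, b0) with
# a0 = b0 = 0.001 and inspecting the implied standard deviation of theta.
import random

random.seed(2)
a0 = b0 = 0.001

# gammavariate takes (shape, scale); scale = 1 / rate.
taus = [random.gammavariate(a0, 1.0 / b0) for _ in range(1000)]

# Guard against floating-point underflow of tau to 0 for this tiny shape.
sigmas = [max(t, 1e-300) ** -0.5 for t in taus]

# Fraction of prior draws implying a standard deviation above 10,
# i.e. an essentially noninformative prior for theta.
frac_large = sum(s > 10 for s in sigmas) / len(sigmas)
```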
Within the statistical approach to network analysis proposed here, it is possible to include node-level information in the network model and study the effects of this information on the model parameters. We incorporate into the model the information on protein disorder for each protein in the network. The node-specific sociality parameters α_i, i = 1,..., n, are then defined as follows:

α_i = γ g_i + a_i,

where g_i is a binary variable, with g_i = 1 if node i belongs to the group of disordered proteins and g_i = 0 otherwise; γ represents the effect of disorder on the sociality parameter, and a_i is the random effect. Note that although the disorder attribute is represented here by a binary variable, continuous or ordinal variables can readily be inserted as covariates into the model.
The parameter γ is modelled analogously to θ:

γ ~ N(0, τ_γ^(-1)), τ_γ ~ Gamma(a0, b0).
The prior specification for the parameters a_i is:

a_i ~ N(0, τ_a^(-1)), τ_a ~ Gamma(a0, b0).

Again, the hyperparameters a0 and b0 are equal to 0.001.
Note that, conditionally on the precision hyperparameters τ_θ and τ_a, the parameters θ and α_i are independent.
Monitoring convergence of the Markov chain
For summarizing the output and monitoring convergence of the Markov chain, the R package CODA was used [48]. The Markov chain must be monitored to diagnose a lack of convergence. We used Geweke's, Raftery and Lewis's, and Gelman and Rubin's diagnostics. As proposed by Gelman and Rubin, a number of parallel Markov chains should be run from different starting points; convergence is diagnosed when the output from the different chains is indistinguishable. For the parallel runs, we used different initial values of the parameters.
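Gelman and Rubin's diagnostic compares within-chain and between-chain variability via the potential scale reduction factor. A minimal sketch of the computation (illustrative Python; CODA's implementation includes further corrections, and the chains below are invented independent draws):

```python
# Sketch of the Gelman-Rubin potential scale reduction factor (R-hat):
# values close to 1 indicate that parallel chains are indistinguishable.
import random

random.seed(3)
m_chains, n_draws = 4, 500
chains = [[random.gauss(0.0, 1.0) for _ in range(n_draws)] for _ in range(m_chains)]

def gelman_rubin(chains):
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n     # pooled posterior variance estimate
    return (var_hat / W) ** 0.5

rhat = gelman_rubin(chains)
```

Since the four "chains" here are drawn from the same distribution, R-hat comes out close to 1, which is the pattern expected at convergence.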
Checking goodness-of-fit of the network model
Checking the goodness-of-fit of the network model requires assessing how well the model represents the structural properties of the observed network. Therefore, new networks are generated from the fitted model, and structural statistics of the observed network are compared to the statistics calculated from the generated networks [49]. Here, we check the degree statistic, i.e. the distribution of node degrees in a network. For this, we use the R package statnet, which facilitates the simulation of networks from ERGMs with MCMC [50].
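The degree-based comparison can be sketched as follows (illustrative Python; a simple density-only Bernoulli model with invented parameters stands in for the fitted p2-model, and the 'observed' network is itself simulated):

```python
# Sketch of degree-based goodness-of-fit checking: simulate networks from a
# fitted model and compare the observed degree statistic with the simulated
# ones via a simulation envelope.
import random

random.seed(4)
n, p_edge = 30, 0.1   # invented network size and fitted edge probability

def simulate_degrees(n, p):
    """Simulate an undirected Bernoulli graph and return its degree sequence."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i][j] = adj[j][i] = 1
    return [sum(row) for row in adj]

observed = simulate_degrees(n, p_edge)            # stand-in for the observed network
sims = [simulate_degrees(n, p_edge) for _ in range(200)]

# Compare the observed mean degree with a ~95% envelope of simulated values.
mean_obs = sum(observed) / n
sim_means = sorted(sum(d) / n for d in sims)
lo, hi = sim_means[4], sim_means[195]
in_envelope = lo <= mean_obs <= hi
```

If the observed statistic falls outside the simulation envelope, the model fails to reproduce that structural property of the data.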