Protein data sets
Three protein data sets were used throughout this work: one for training of the form factors, another for validating the model, and finally a set of native structures and corresponding artificial decoys for the Z-score calculations. The training data set consists of 297 structures with chain lengths between 50 and 400 residues from the Top500 data set of high-quality protein structures [34]. To ensure that CRYSOL and our program processed these structures in the same manner, structures with conflicting atoms or non-standard amino acids were excluded (the selected structures are listed in Additional file 1).
The estimated form factors were validated by calculating scattering curves for 50 proteins (see Table 1) extracted using the PISCES server [35]. These were randomly selected from an initial group of 81 proteins with low sequence similarity (below 25%) to those in the training set, a resolution better than 3 Å, and an R-factor below 30%.
The decoy set used in the Z-score evaluation was generated by the structure prediction program TASSER [31]. This decoy set consists of 47 proteins, each with 1040 protein-like decoy structures with varying similarity to the native structure. The decoys are constructed from energy-minimized snapshots from molecular dynamics simulations using the AMBER force field [36]. A single protein from the set, [1CBP], was excluded from our Z-score evaluation, as this structure resulted in an input error when evaluated by the CRYSOL program. Six proteins were left out of the evaluation as they were also present in our training data set.
In all cases, the backbone and side-chain centroids were calculated using all non-hydrogen atoms. The Cβ atom was included only in the calculation of the backbone centroid.
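As an illustration, the following sketch (a hypothetical helper, not the code used in this work) partitions the non-hydrogen atoms of a residue into backbone and side-chain groups, assigning Cβ to the backbone group as described above, and returns the two centroids.

```python
import numpy as np

# Atoms counted toward the backbone centroid; CB is included here
# and therefore excluded from the side-chain centroid, as described above.
BACKBONE_ATOMS = {"N", "CA", "C", "O", "CB"}

def residue_centroids(atoms):
    """atoms: list of (atom_name, xyz) for the non-hydrogen atoms of one residue.
    Returns (backbone_centroid, sidechain_centroid); the latter is None if no
    side-chain atoms remain (e.g. glycine or alanine)."""
    backbone = [xyz for name, xyz in atoms if name in BACKBONE_ATOMS]
    sidechain = [xyz for name, xyz in atoms if name not in BACKBONE_ATOMS]
    bb = np.mean(backbone, axis=0)
    sc = np.mean(sidechain, axis=0) if sidechain else None
    return bb, sc
```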
SAXS training data
Due to the lack of publicly available high-quality experimental data needed for the estimation of the form factors, artificial scattering curves were generated for high-resolution protein structures using the state-of-the-art program CRYSOL [24]. This program calculates the theoretical scattering curve of a given full-atom resolution structure using spherical harmonics expansions. CRYSOL was used to compute a simulated scattering curve including the vacuum and excluded volume scattering components, but without the hydration layer contribution; the electron density of the solvent layer was set equal to that of the bulk solvent (i.e. CRYSOL was run with the command line "crysol /dro 0.0 inputfile.pdb"). For the scattering curve evaluations, an upper q-limit of 0.75 Å⁻¹ was chosen in order to be well within the expected valid resolution range of CRYSOL.
Error model and q-binning
The scattering momentum transfer q ranges from 0 to 0.75 Å⁻¹ and is divided into 51 discrete bins. For each bin we sampled form factors from the posterior distribution to obtain the 21-dimensional form factor distributions. The intensity in a given bin was evaluated at the bin's left-hand q-value, starting at q = 0 for the first bin.
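A minimal sketch of this binning, assuming the left-edge evaluation described above (names are illustrative):

```python
import numpy as np

Q_MAX = 0.75   # upper q-limit in inverse angstroms
N_BINS = 51    # number of discretized q-bins

# Left-hand edges of the bins: the first bin is evaluated at q = 0.
bin_width = Q_MAX / N_BINS
q_values = np.arange(N_BINS) * bin_width
```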
To account for the "experimental" error of the SAXS curves, a standard deviation σ_q = I_q β (with β = 0.3) has previously been used in the literature as a realistic estimate [14]. Aiming to be more precise in the portion of the curve of primary interest in a structure prediction application - approximately between q = 0.1 and q = 0.5 Å⁻¹ [37] - we introduced a scaling factor (q + α):

σ_q = (q + α) I_q β,

with α = 0.15 and β = 0.3. This is significantly stricter at mid q-range than the reference parameters.
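In code, this error model is a one-liner; the sketch below assumes the reconstructed form σ_q = (q + α) I_q β:

```python
import numpy as np

ALPHA = 0.15
BETA = 0.3

def sigma(q, intensity):
    """Simulated experimental error for a scattering curve I(q)."""
    return (q + ALPHA) * BETA * intensity
```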
Posterior sampling
In order to explore the large parameter space efficiently, we used an optimized, maximum-likelihood-based MCMC method implemented in the Muninn program [26]. In the MCMC method, we used the negative logarithm of the posterior probability as an energy. We employed a sampling scheme where the density of states is weighted according to the inverse cumulative density of states (1/k ensemble) [38]. Compared to standard MCMC methods, this ensures a more frequent generation of samples in the lower energy regions. Avoiding slow relaxation in the Markov chain also minimizes the risk of the chain getting trapped in a local minimum [26], a problem often encountered in rough energy landscapes when using Metropolis techniques, as for example in simulated annealing.
An initial value for the form factor vector f was obtained by uniformly sampling its components from the interval [0, f_max]. The value of f_max was set to 40, well beyond the limit of the form factor components described in the literature [12]. The MCMC procedure proposed new form factor vectors f′ through a transition in which a randomly chosen component f_i of the vector was re-sampled uniformly from the interval

[f_i − m/2, f_i + m/2],

where the width m was equal to 1.5. Border effects were taken into account in order to respect the detailed balance condition,

π(f) T(f → f′) = π(f′) T(f′ → f),

where π denotes the posterior distribution over form factor vectors.
The transition probability T(f → f′) can be expressed as

T(f → f′) = P_sel(f′ | f) P_acc(f → f′),

where P_sel(f′ | f) represents the probability of selecting f′ given f and P_acc(f → f′) is the acceptance probability of this transition. The selection probability implied by the local uniform proposal scheme is constant (1/m) within the proposal interval, and according to the Metropolis-Hastings criterion [39], the following acceptance probability satisfies the detailed balance condition:

P_acc(f → f′) = min(1, [π(f′) P_sel(f | f′)] / [π(f) P_sel(f′ | f)]),

where the posterior π is given below by Equations 7 and 8 for the TorusDBN independent and dependent cases, respectively.
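For illustration, the sketch below shows a plain Metropolis-Hastings step with the component-wise uniform proposal; it is not the Muninn-based 1/k-ensemble sampler actually used, and border effects are handled in a simplified way by rejecting proposals outside [0, f_max].

```python
import numpy as np

F_MAX = 40.0   # upper bound for form factor components
M = 1.5        # width of the local uniform proposal

def mh_step(f, log_posterior, rng):
    """One Metropolis-Hastings step with a component-wise uniform proposal.
    Simplified sketch: proposals falling outside [0, F_MAX] are rejected,
    which amounts to a posterior that is zero outside the allowed range."""
    i = rng.integers(len(f))
    f_new = f.copy()
    f_new[i] = rng.uniform(f[i] - M / 2, f[i] + M / 2)
    if not 0.0 <= f_new[i] <= F_MAX:
        return f                     # reject out-of-range proposals
    # Symmetric proposal: the acceptance ratio reduces to the posterior ratio.
    log_ratio = log_posterior(f_new) - log_posterior(f)
    if np.log(rng.uniform()) < log_ratio:
        return f_new
    return f
```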
Point estimation of form factors
The MCMC procedure results in a set of samples from the posterior distribution. In order to obtain a point estimate, we calculated the centroid vector of the set of sampled form factor vectors {f_1, ..., f_T}. The centroid vector was defined as the one with the lowest summed Euclidean distance to all the other vectors in the set:

f_c = argmin_{f_i} Σ_{j=1}^{T} ||f_i − f_j||,

where T is the number of samples.
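A direct translation of this definition (illustrative only) computes all pairwise distances and picks the sample minimizing their sum:

```python
import numpy as np

def centroid_vector(samples):
    """samples: array of shape (T, 21) with the sampled form factor vectors.
    Returns the sample with the smallest summed Euclidean distance to all others."""
    diffs = samples[:, None, :] - samples[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # pairwise distance matrix (T, T)
    return samples[np.argmin(dists.sum(axis=1))]
```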
SAXS curve distance measures
A χ² measure was used to quantify the difference between two scattering curves. The difference was scaled by the "experimental" error that was used in the estimation of the form factors, thus evaluating the statistical quality of our reconstruction:

χ² = (1/Q) Σ_q (I_q − I′_q)² / σ_q²,

where Q is the number of q-bins and σ_q is the experimental error as described previously.
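For reference, a sketch of this measure under the reconstructed 1/Q normalization:

```python
import numpy as np

def chi_square(I_ref, I_model, sigma):
    """Error-weighted chi-square between two scattering curves of equal length Q."""
    Q = len(I_ref)
    return np.sum(((I_ref - I_model) / sigma) ** 2) / Q
```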
Decoy recognition
The first decoy recognition experiment - which only uses the likelihood - is performed as follows. First, a SAXS curve I is calculated from the native structure using CRYSOL. For each decoy structure, a corresponding SAXS curve I' is calculated with the Debye formula using the two-body model. In this case, the posterior - which is simply equal to the likelihood P(I | X) - of the decoy X is calculated as:
P(I | X) = ∏_q (2πσ_q²)^(−1/2) exp(−(I_q − I′_q)² / (2σ_q²)),    (7)
where the product runs over all q-bins. For the comparison with CRYSOL, scattering curves were calculated from the decoys using CRYSOL instead of the two-body model.
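A sketch of this scoring, assuming the Gaussian per-bin likelihood of Equation 7 and using the negative log-likelihood as the energy (function and variable names are illustrative):

```python
import numpy as np

def log_likelihood(I_native, I_decoy, sigma):
    """log P(I | X): product of per-bin Gaussians, evaluated in log space."""
    residual = (I_native - I_decoy) / sigma
    return -0.5 * np.sum(residual ** 2 + np.log(2 * np.pi * sigma ** 2))

def energy(I_native, I_decoy, sigma):
    """Energy used for decoy recognition: negative log posterior (here, likelihood only)."""
    return -log_likelihood(I_native, I_decoy, sigma)
```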
In the second decoy recognition experiment, TorusDBN is included as a prior distribution, and amino acid information is explicitly included. In this case, the posterior becomes:
P(X | I, A) ∝ P(I | X) P(Φ, Ψ | A),    (8)

where Φ and Ψ are the backbone dihedral angle sequences of the decoy and A is the amino acid sequence.
The Z-score for a given decoy set was calculated as the difference between the energy (defined as the negative logarithm of the posterior) of the native structure, E(X_N), and the average energy ⟨E⟩ of all structures in the decoy set, divided by the standard deviation σ_E of the energies:

Z = (E(X_N) − ⟨E⟩) / σ_E.
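The Z-score computation follows directly from this definition; a minimal sketch:

```python
import numpy as np

def z_score(e_native, decoy_energies):
    """Z-score of the native energy relative to the decoy energy distribution."""
    decoy_energies = np.asarray(decoy_energies)
    return (e_native - decoy_energies.mean()) / decoy_energies.std()
```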
TorusDBN prior
TorusDBN is a probabilistic model of the local structure of proteins, formulated as a dynamic Bayesian network. It can be considered a probabilistic alternative to the well-known fragment libraries: it allows sampling of plausible protein conformations in continuous space, and it can assign a probability density value to a given sequence of ϕ and ψ angles. For the prior, we used the TorusDBN probability P(Φ, Ψ | A), where Φ and Ψ are the ϕ and ψ angle sequences, respectively. The calculation of P(Φ, Ψ | A) from TorusDBN is straightforward using the forward algorithm [21]. We used the model that is described in [21] with default parameters.
Computational efficiency
A naive implementation of the Debye formula (Equation 2) leads to a computational complexity of O(M²), where M is the number of scatterers in the structure under evaluation. Our coarse-grained approach reduces M by representing several atoms by one scattering body (a dummy atom). Each of the dummy atoms contains an average of k atoms, thus lowering the execution time by a constant factor of k², replacing O(M²) with O(M²/k²).
The exact value of k obviously depends on the primary sequence of the protein. For both the training and validation sets, employing one dummy atom for the backbone and one for the side chain leads to an average k of 4.24. This means that each dummy atom contains on average 4.24 non-hydrogen atoms, leading to a speed-up of k² ≃ 18. Using only one dummy atom to describe a complete amino acid results in k ≃ 7.8 atoms, allowing for a k² ≃ 60 times faster execution. The absolute running time for a single scattering evaluation is approximately 30 ms for a 129-residue protein ([6LYZ]), using the two-body model on a standard desktop computer (AMD Athlon X2 5200+). Absolute running time comparisons with other programs are obviously unfair, since for a single evaluation the overhead introduced by the file system and the operating system is considerable. That said, our approach is significantly faster than CRYSOL [24] (~786 ms).
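To make the pairwise structure of Equation 2 concrete, here is a sketch of the Debye sum over coarse-grained scattering bodies (an illustration, not our optimized implementation); the form factor value of each dummy atom at the given q is assumed to come from the estimated tables:

```python
import numpy as np

def debye_intensity(q, positions, form_factors):
    """Debye formula for one q-value.
    positions:    (M, 3) coordinates of the dummy atoms (scattering bodies)
    form_factors: (M,) form factor value of each body at this q
    The double sum over all pairs gives the O(M^2) complexity discussed above."""
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)               # pairwise distances (M, M)
    qr = q * r
    # sin(qr)/(qr) with the diagonal (qr = 0) handled as the limit value 1.
    sinc = np.where(qr != 0.0, np.sin(qr) / np.where(qr != 0.0, qr, 1.0), 1.0)
    return float(form_factors @ sinc @ form_factors)
```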