Volume 12 Supplement 12

## Selected articles from the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD)

- Proceedings
- Open Access

# Planning combinatorial disulfide cross-links for protein fold determination

- Fei Xiong^{1}
- Alan M Friedman^{2}
- Chris Bailey-Kellogg^{1}

*BMC Bioinformatics* 2011, **12 (Suppl 12)**:S5

https://doi.org/10.1186/1471-2105-12-S12-S5

© Xiong et al; licensee BioMed Central Ltd. 2011

**Published:** 24 November 2011

## Abstract

### Background

Fold recognition techniques take advantage of the limited number of overall structural organizations, and have become increasingly effective at identifying the fold of a given target sequence. However, in the absence of sufficient sequence identity, it remains difficult for fold recognition methods to always select the correct model. While a native-like model is often among a pool of highly ranked models, it is not necessarily the highest-ranked one, and the model rankings depend sensitively on the scoring function used. *Structure elucidation* methods can then be employed to decide among the models based on relatively rapid biochemical/biophysical experiments.

### Results

This paper presents an integrated computational-experimental method to determine the fold of a target protein by probing it with a set of planned disulfide cross-links. We start with predicted structural models obtained by standard fold recognition techniques. In a first stage, we characterize the fold-level differences between the models in terms of topological (contact) patterns of secondary structure elements (SSEs), and select a small set of SSE pairs that differentiate the folds. In a second stage, we determine a set of residue-level cross-links to probe the selected SSE pairs. Each stage employs an information-theoretic planning algorithm to maximize information gain while minimizing experimental complexity, along with a Bayes error plan assessment framework to characterize the probability of making a correct decision once data for the plan are collected. By focusing on overall topological differences and planning cross-linking experiments to probe them, our *fold determination* approach is robust to noise and uncertainty in the models (e.g., threading misalignment) and in the actual structure (e.g., flexibility). We demonstrate the effectiveness of our approach in case studies for a number of CASP targets, showing that the optimized plans have low risk of error while testing only a small portion of the quadratic number of possible cross-link candidates. Simulation studies with these plans further show that they do a very good job of selecting the correct model, according to cross-links simulated from the actual crystal structures.

### Conclusions

Fold determination can overcome scoring limitations in purely computational fold recognition methods, while requiring less experimental effort than traditional protein structure determination approaches.

## Keywords

- Residue Pair
- Fold Recognition
- Feature Subset Selection
- Binary Random Variable
- Minimum Redundancy Maximum Relevance

## Introduction

Seeking to close the gap between computational structure prediction and experimental structural determination, we [7, 8] and others [9–11] have developed methods (which we call *structure elucidation*) to select structural models based on relatively rapid biochemical/biophysical experiments. One type of experiment particularly suitable for this purpose is *cross-linking*, which essentially provides distance restraints between specific pairs of residues, based on the formation (or not) of chemical cross-links. While residue-specific (e.g., lysine-specific) cross-linking has been effectively used for this task [10, 12, 13], we previously showed that planned *disulfide* cross-linking has a number of advantages, in terms of the ease and reliability of experiment and the quality of the resulting information content [7]. In disulfide cross-linking (or “trapping”) [14–16], a pair of cysteine substitutions is made and the formation of a disulfide bond after oxidation is evaluated, e.g., by alteration in electrophoretic mobility [7, 14, 16]. An important point for our purposes here is that disulfide cross-links are *plannable*—we control exactly which pair of residues is probed in a particular experiment.

While earlier methods have focused on probing geometry and selecting a model, we target here a more defined characterization of protein structure, ascertaining the overall protein fold. We call this approach *fold determination*, named in contrast to purely computational *fold recognition* and our less defined structure elucidation approach. We first characterize the topological / fold-level differences in a set of models in terms of contact patterns of secondary structure elements (SSEs); see the middle panel of Fig. 1. The topological representation allows for a robust experimental characterization of the structure, less sensitive to noise and uncertainty in both the models (e.g., threading misalignment) and the actual structure (e.g., flexibility). As a representation with fewer degrees of freedom than the complete threading models, the topological representation also enables us to explicitly consider all possibilities and handle the case when none of the models is correct. Once we have identified a subset of SSE pairs that are most informative for fold determination, we plan disulfide cross-links to evaluate these SSE pairs; see the right panel of Fig. 1. By specifically planning for each such SSE pair, we can account for the dependence among the cross-links and select a set that will be robust to, and even help characterize, model misalignment and protein flexibility.

The method presented here strikes a balance between very limited cross-linking (e.g., six disulfide pairs in our earlier work [7]) and testing all residue pairs. We assume that robotic genetic manipulation methods (e.g., based on SPLISO [17] and RoboMix [18]) can construct a combinatorial set of dicysteine mutants, but that we still should test a much smaller set than all residue pairs. (Our plans require tens to around a hundred cross-links, depending on error requirements.) Thus we must optimize a plan so as to maximize information gain while minimizing experimental complexity. This is analogous to feature subset selection, where the goal is to choose a subset of features from a dataset such that the reduced set still keeps the most “distinguishing” characteristics of the original [19, 20]. At the topological level (Fig. 1, middle) the features are SSE pairs, and the objective is to select those that will correctly classify the real structure to a model. At the cross-link level (Fig. 1, right panel) the features are potential disulfide pairs and the objective is to select those that will correctly classify contact/not for the SSE pair. For each level, we optimize a plan by employing an information-theoretic planning algorithm derived from the minimum redundancy maximum relevance approach [21]. We then evaluate a plan with a Bayes error framework that characterizes the probability of making a correct decision from the experimental data.

## Methods

We are given a set *M* of models. They may be redundant (i.e., some may have the same fold), and they may be incomplete (i.e., a representative of the correct fold may not be included). Our goal is to plan a set of disulfide cross-linking experiments (i.e., identify residue pairs to be individually tested) in order to select among them. As discussed in the introduction, we do this in two stages (Fig. 1), first selecting a “topological fingerprint” of SSE pairs to distinguish the folds, and then selecting cross-links to assess these SSE pairs.

### Topological fingerprint selection

In order to compare SSE topologies, we need a common set of SSEs across the models. Since secondary structure prediction techniques are fairly stable [22, 23], it is generally the case that models have more-or-less the same set of SSEs, covering more-or-less the same residues (> 50% overlapping as observed in our test data). Our approach starts with a set *S* of SSEs that are common to at least a specified fraction (default 50%) of the given models. For example, both models in Fig. 1 have 5 *α*-helices, as do 63 other models for the same target. The later cross-link planning stage will account for the fact that the common SSEs may in fact extend over slightly different residues in the different models.

Given the SSE identities, we form for each model *m*_{i} ∈ *M* an *SSE contact graph* *G*_{SSE,i} = (*S*, *C*_{i}) in which the nodes *S* are the SSEs (common to the specified fraction of models, as described in the preceding paragraph) and the edges *C*_{i} ⊂ *S* × *S* are between contacting SSEs (specific to each model). We determine SSE contacts from residue contacts, deeming an SSE pair to be in contact if a sufficient set of residues are. Our current implementation requires at least 5 contacts (at < 9 Å C^{β}–C^{β} distance), and at least 20% of each SSE’s residues to have a contact partner in the other SSE.
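The contact rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sse_in_contact`, its parameters, and the input layout (one C^{β} coordinate triple per residue) are assumptions made for the example.

```python
from itertools import product
from math import dist


def sse_in_contact(coords_a, coords_b, cutoff=9.0, min_contacts=5, min_fraction=0.2):
    """Decide whether two SSEs are in contact.

    coords_a, coords_b: lists of (x, y, z) C-beta coordinates, one per residue
    (hypothetical input layout). Thresholds default to the values in the text.
    """
    # All residue pairs closer than the distance cutoff.
    pairs = [(i, j)
             for (i, a), (j, b) in product(enumerate(coords_a), enumerate(coords_b))
             if dist(a, b) < cutoff]
    if len(pairs) < min_contacts:
        return False
    # Fraction of each SSE's residues that have a partner in the other SSE.
    frac_a = len({i for i, _ in pairs}) / len(coords_a)
    frac_b = len({j for _, j in pairs}) / len(coords_b)
    return frac_a >= min_fraction and frac_b >= min_fraction
```

Applying this test to every SSE pair of a model would yield that model's edge set *C*_{i}.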

Our goal then is to find a minimum subset *F* ⊂ *S* × *S* of SSE pairs providing the maximum information content to differentiate the models. As discussed in the introduction, this is much like feature subset selection; in particular, the *max-dependency* feature selection problem seeks to find a set of features with the largest dependency (in terms of mutual information) on the target class (here, the predicted structural model) [21]. While max-dependency leads to the minimum classification error, there is unfortunately a combinatorial explosion in the number of possible feature subsets that must be considered. To deal with the combinatorial explosion, we develop here an approach based on the minimum Redundancy Maximum Relevance (mRMR) method [21].

#### Probabilistic model

Let *c* be a binary random variable for an SSE pair (a potential contact edge), and denote by Pr(*c*) the probability of being in contact (*c* = 1) or not (*c* = 0). We estimate Pr(*c*) by counting occurrence frequencies over the contact edge sets *C*_{i} for the models:

$$\Pr(c = x) = \sum_{y \in \{0,1\}} q(x, y) \cdot \frac{\left|\{C_i : \mathbb{1}\{c \in C_i\} = y\}\right|}{|M|} \tag{1}$$

where the indicator function $\mathbb{1}$ gives 1 if *c* ∈ *C*_{i} and 0 otherwise, and thus the set includes those SSE contact graphs for which the contact state of *c* agrees with *y*. To allow for noise, when evaluating *x* = 1 we include a contribution from *y* = 0 (false negative) along with that for *y* = 1 (true positive), and similarly when evaluating *x* = 0 we consider both *y* = 1 (false positive) and *y* = 0 (true negative). The *q* function weights the contributions for the agreeing and disagreeing cases. We currently employ a uniform weighting independent of edge, since we observed in cross-link planning (below) that the expected error rate in evaluating any SSE contact was well below 10% when using a reasonable number of cross-links:

The approach readily extends to be less conservative and to allow different weights for different SSE pairs, e.g., according to cross-link planning (discussed in the next section).
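The noise-weighted counting estimate of Pr(*c* = *x*) described above can be sketched as follows. The function name `contact_probability` and the parameter `q_agree` are hypothetical, and the default agreement weight of 0.9 is an assumption suggested by the "well below 10%" remark and the 0.7 to 0.9 range used in the sensitivity analysis, not a value confirmed by the text.

```python
def contact_probability(edge, contact_sets, x, q_agree=0.9):
    """Estimate Pr(c = x) for one SSE pair by weighted counting over models.

    edge: an SSE pair, e.g. ("H1", "H2").
    contact_sets: one set of contacting SSE pairs per model (the C_i).
    q_agree: assumed weight when the evaluated state x agrees with the model
             state y; 1 - q_agree is used when they disagree.
    """
    n_models = len(contact_sets)
    # Fraction of models in which the edge is present (y = 1) or absent (y = 0).
    frac_present = sum(edge in cs for cs in contact_sets) / n_models
    fraction = {1: frac_present, 0: 1.0 - frac_present}
    return sum((q_agree if x == y else 1.0 - q_agree) * fraction[y] for y in (0, 1))
```

With this weighting, an edge present in every model still gets Pr(*c* = 1) = 0.9 rather than 1, which is how the estimate leaves room for experimental noise.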

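The joint probability Pr(*c* = *x*, *c*′ = *x*′) of a pair of edges is estimated analogously, counting the models whose contact states for both edges agree with the respective *y* and *y*′ and weighting by both *q* terms, where again the sums are over {0, 1} and the indicator function is as described above.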

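We can then evaluate the *relevance* of each SSE contact edge *c* in terms of its entropy *H*(*c*); a high-entropy edge will help differentiate models while a low-entropy one won’t. We can also evaluate the *redundancy* of a pair (*c*, *c*′) of edges in terms of their mutual information *I*(*c*, *c*′); a high mutual-information pair contains redundant information:

$$H(c) = -\sum_{x \in \{0,1\}} \Pr(c = x) \log \Pr(c = x)$$

$$I(c, c') = \sum_{x \in \{0,1\}} \sum_{x' \in \{0,1\}} \Pr(c = x, c' = x') \log \frac{\Pr(c = x, c' = x')}{\Pr(c = x)\,\Pr(c' = x')}$$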

#### Experiment planning

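We use a first-order incremental search, building the fingerprint *F* starting from the empty set and at each step adding to the current *F* the edge *c*_{*} that maximizes:

$$H(c) - \frac{1}{|F|} \sum_{c' \in F} I(c, c'), \qquad c \in (S \times S) \setminus F$$

with the redundancy term omitted while *F* is empty.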

The search algorithm stops when the score for *c*_{*} drops below a threshold (we use 0.01 for the results shown below).

The original mRMR formulation with first-order incremental search was proved to be equivalent to max-dependency (i.e., to provide the most information about the target classification) [21]. The proof carries over to our version upon substituting our formulations of redundancy and relevance (discrete, with choices of SSE pairs providing information about models) in place of the original ones (continuous, with gene profiles representing different types of cancer or lymphoma). Essentially, it can be proved that the optimal max-dependency value is achieved when each feature variable is maximally dependent on the class of samples, while the pairwise dependency of the variables is minimized. Furthermore, this objective can be obtained by pursuing the mRMR criterion in the “first-order” incremental search (i.e., greedy) where one feature is selected at a time. Therefore we don’t need to explicitly compute the complicated multivariate joint probability, but can instead compute just the pair-wise joint probabilities. We thus have an efficient algorithm for finding an optimal set of SSE pairs to differentiate models.
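The greedy selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each candidate edge is represented by a list of 0/1 contact states (one per model), the names `select_fingerprint`, `columns`, and `q_agree` are hypothetical, and the agreement weight 0.9 is an assumed value.

```python
from math import log2


def q(x, y, q_agree=0.9):
    # Assumed uniform agreement weight (0.9 is not confirmed by the text).
    return q_agree if x == y else 1.0 - q_agree


def marginal(col, x):
    # Pr(c = x): q-weighted frequency over the models' contact states.
    return sum(q(x, y) for y in col) / len(col)


def joint(col_a, col_b, xa, xb):
    # Pr(c = xa, c' = xb): q-weighted joint frequency over the models.
    return sum(q(xa, ya) * q(xb, yb) for ya, yb in zip(col_a, col_b)) / len(col_a)


def entropy(col):
    probs = (marginal(col, x) for x in (0, 1))
    return -sum(p * log2(p) for p in probs if p > 0)


def mutual_information(col_a, col_b):
    total = 0.0
    for xa in (0, 1):
        for xb in (0, 1):
            pab = joint(col_a, col_b, xa, xb)
            if pab > 0:
                total += pab * log2(pab / (marginal(col_a, xa) * marginal(col_b, xb)))
    return total


def select_fingerprint(columns, threshold=0.01):
    """Greedy mRMR-style selection.

    columns: dict mapping each candidate edge to its list of 0/1 states per model.
    Stops when the best remaining score drops below the threshold.
    """
    chosen = []

    def score(edge):
        # Relevance (entropy) minus average redundancy with already-chosen edges.
        redundancy = sum(mutual_information(columns[edge], columns[f]) for f in chosen)
        return entropy(columns[edge]) - (redundancy / len(chosen) if chosen else 0.0)

    while len(chosen) < len(columns):
        best = max((e for e in columns if e not in chosen), key=score)
        if score(best) < threshold:
            break
        chosen.append(best)
    return chosen
```

In a toy input where two edges have identical contact patterns across the models, the second copy is penalized by its mutual information with the first, so an edge carrying independent information is preferred over it.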

#### Data interpretation

The experimental data *X* regarding a fingerprint *F* is a binary vector indicating for each edge whether or not the SSE pair was found to be in contact. Let us denote by $\mathcal{X} = \{0,1\}^{|F|}$ the set of possible binary vector values for *X*. Then the likelihood takes the joint probability over the edges, testing agreement between the observed contact state and that expected under the model:

$$\Pr(X \mid m) = \prod_{i=1}^{|F|} q\big(X_i,\ \mathbb{1}\{F_i \in C_m\}\big) \tag{8}$$

where *C*_{m} is the contact edge set of model *m* and we use the subscript to get the *i*^{th} element of the set. The naive conditional independence assumption here is reasonable, since the elements *F*_{i} (SSE contact states) depend directly on the model, and are thus conditionally independent given the model. We then select the model with the highest likelihood. (If we have informative priors, evaluating model quality, we could instead select based on posterior probabilities.)
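Maximum-likelihood model selection from fingerprint data can be sketched as follows. It assumes the per-edge agreement weight is the same *q* used in planning, with a hypothetical default of 0.9; the names `likelihood`, `best_models`, and `model_states` are illustrative only.

```python
from math import isclose


def likelihood(data, expected, q_agree=0.9):
    """Pr(X | m) as a product of per-edge agreement weights (assumed form).

    data: observed 0/1 contact state for each fingerprint edge.
    expected: the 0/1 contact state of each fingerprint edge in the model.
    """
    p = 1.0
    for x, y in zip(data, expected):
        p *= q_agree if x == y else 1.0 - q_agree
    return p


def best_models(data, model_states, q_agree=0.9):
    """Return the names of the highest-likelihood models (ties kept) and all scores."""
    scores = {name: likelihood(data, states, q_agree)
              for name, states in model_states.items()}
    top = max(scores.values())
    winners = [name for name, s in scores.items() if isclose(s, top)]
    return winners, scores
```

Returning every model that attains the top likelihood, rather than a single one, keeps ties between models of the same fold visible to the caller.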

#### Plan evaluation

We assess a plan by its *Bayes error*, *∊*. If we knew which model *m* were correct and which dataset *X* we would get, we could evaluate whether or not we would make the wrong decision, choosing a wrong model *m*′ due to its having a higher likelihood for *X* than the correct model *m*. The Bayes error considers separately each case where one particular model is correct and one particular dataset results, and sums over all the possibilities. It weights each possibility by its probability: is the model likely to be correct, and if it is, are we likely to get that dataset? Thus:

$$\epsilon = \sum_{m \in M} \Pr(m) \sum_{X \in \{0,1\}^{|F|}} \Pr(X \mid m) \cdot \mathbb{1}\{\exists\, m' \in M : \Pr(X \mid m') > \Pr(X \mid m)\} \tag{9}$$

where Pr(*m*) is the prior probability of a model, which we currently take as uniform, but could instead be based on fold recognition scores. Here and in the following formulas we use an indicator function 1 that gives 1 if the predicate is true and 0 if it is false. So we assume each different model is correct (at its prior probability), and assess whether or not it would be beaten for each different data set (at probability conditioned on the assumed correct model). This framework thereby gives a probabilistic evaluation of how likely it is that we will make an error, in place of the usual empirical cross-validation that is performed to assess a feature subset selected for classification.

Since a tie between models is not counted as an error, we separately track the *expected tie ratio*, *τ*:

The formula mirrors that for *∊*, but instead of counting the number of incorrect decisions, it counts the fraction of ties. Evaluating *τ* as we build up a topological fingerprint allows us to track the incremental power to differentiate folds, up to the point where we find that a set of models has the same fold and *τ* has flat-lined. The metric can readily be extended to account for sets of models whose likelihood is within some threshold of the best.
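Both metrics can be computed by enumerating every possible dataset, as in the following sketch. Several pieces are assumptions rather than the paper's definitions: the per-edge agreement weight `q_agree` (0.9), the uniform model prior, and in particular the tie measure, which is taken here as the fraction of other models whose likelihood equals that of the assumed-correct model when it is not beaten.

```python
from itertools import product
from math import isclose


def lik(data, expected, q_agree=0.9):
    # Pr(X | m) as a product of agreement weights (assumed form).
    p = 1.0
    for x, y in zip(data, expected):
        p *= q_agree if x == y else 1.0 - q_agree
    return p


def plan_metrics(model_states, q_agree=0.9):
    """Return (bayes_error, tie_ratio) for a fingerprint.

    model_states: list of tuples, each giving a model's 0/1 contact state for
    every fingerprint edge. Uniform prior over models.
    """
    n_models, n_edges = len(model_states), len(model_states[0])
    prior = 1.0 / n_models
    error = tie = 0.0
    for data in product((0, 1), repeat=n_edges):      # every possible dataset X
        liks = [lik(data, s, q_agree) for s in model_states]
        for i, li in enumerate(liks):                  # assume model i is correct
            others = liks[:i] + liks[i + 1:]
            if any(lo > li and not isclose(lo, li) for lo in others):
                error += prior * li                    # strictly beaten: wrong decision
            elif others:
                tied = sum(isclose(lo, li) for lo in others)
                tie += prior * li * tied / len(others)  # assumed tie measure
    return error, tie
```

The enumeration is exponential in the fingerprint size, which is workable only because a fingerprint typically contains a small number of SSE pairs.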

Finally, the topological fingerprint approach allows us to handle the “none-of-the-above” scenario, when we decide that no model is sufficiently good; i.e., the correct fold isn’t represented by a predicted model. While in other contexts that would be done by comparing the likelihood to some threshold (is the selected model “good enough”?), here we can actually explicitly consider the chance of not considering the correct fold. Note that since a fingerprint typically has a small number of SSE pairs, we can enumerate the space $\mathcal{F}$ of its possible values (indicating whether or not each SSE pair in the fingerprint is in contact). Some of those values, $\mathcal{F}_M$, correspond to models in *M*, while the rest, $\mathcal{F} \setminus \mathcal{F}_M$, are “uncovered”. We want to decide if an uncovered fold *f*′ is better than the fold *f* for the selected model. Moving from models to folds, we can evaluate Pr(*X* | *f*) by a formula like Eq. 8, simply testing whether each *X*_{i} has the value specified in *f*. Then we can decide that it is “none of the above” (models) if ∃ *f*′ ∈ $\mathcal{F} \setminus \mathcal{F}_M$ such that Pr(*X* | *f*′) > Pr(*X* | *f*).

We assess the risk of this scenario for a plan by *ν*, the *expected none-of-the-above ratio*:

Thus *ν* is the fraction of experimental datasets for which an uncovered fold will be better than the best covered fold. We currently do not include a prior on *X*, in order to provide a direct assessment of how many experiments could lead to a none-of-the-above decision. However, we could obtain a weighted value by estimating Pr(*X*), e.g., from the priors on the individual SSE pairs (from Eq. 1). For the same reason, we treat Pr(*f*′) as uniform over the uncovered folds *f*′, rather than evaluating it by priors on SSE pairs. Note that the formula does not include SSE pairs in (*S* × *S*) \ *F*; i.e., pairs not in the fingerprint. This is as if they contribute equally to covered and uncovered folds, and thus do not affect the outcome. In the absence of other information or assumptions about the uncovered folds, this is a reasonable (and conservative) assumption, and yields an interpretable metric.

### Cross-link selection

Once a topological fingerprint *F* has been identified, the next task is to optimize a disulfide cross-linking plan to experimentally evaluate the SSE pairs in the fingerprint. We separately plan for each SSE pair (their conditional independence was discussed in the previous section), optimizing a set of disulfide cross-link experiments (a single cross-link per experiment), such that, taken together, these cross-links will reveal whether or not the SSE pair is in contact. The overall plan is then the union of these SSE-pair plans. Thus we focus here on planning for a single SSE pair. We must account for noise and uncertainty in both the model and the actual protein, as well as for dependency among cross-links. This paper represents the first to address these issues.

Different models may place an SSE at somewhat different residues, so when planning cross-links to probe that SSE’s contacts, it is advantageous to focus on residues common to many models (and thus able to provide information about cross-linkability in those models). We define for each SSE a set of common residues that may be used in a disulfide plan. Our current implementation includes all residues that appear in at least half of the models that have that SSE. In the following, let *R* denote the common residues for a target SSE pair.

For each model *m*_{i} we construct a *residue cross-link graph* *G*_{xlink,i} = (*R*, *D*_{i}), in which the nodes are common residues *R* and there are edges *D*_{i} ⊂ *R* × *R* between possible disulfide pairs (specific to each model). We compute the *cross-linking distance* for a residue pair as the C^{β}–C^{β} distance, and take as edges those with distance at most 19 Å, based on an analysis of rates of disulfide formation [7, 14]. Our method could be generalized to include a more detailed geometric evaluation of the likelihood of cross-linking.

#### Probabilistic model

To account for misalignment, we place a distribution Pr(*δ*) over possible offsets by which an SSE could be misaligned in a model. That is, residue number *r* in the model is really residue *r* + *δ* in the protein, and thus a cross-link involving residue *r* + *δ* is really testing proximity to residue *r*. We use a distribution with 0.5 probability at 0 offset, decaying exponentially on both sides up to a maximum offset. Analysis of a model or the secondary structure prediction could provide a more problem-specific distribution. We currently consider each SSE separately; a future extension could model correlated misalignments resulting from threading.

To account for flexibility, we sample a set of alternative backbones for a model, and place a distribution Pr(*b*) over the identities of these alternatives. While there are many ways to sample alternative structures, we currently use Elastic Normal Modes (ENMs) as implemented by *elNémo* [24], sampling along the lowest non-trivial normal mode. We set Pr(*b*) according to the amplitude of the perturbation, using a Hookean potential function derived from ENMs. Future extensions could model different aspects of flexibility, such as local unfolding events during which a cross-link may be captured.
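The misalignment offset distribution described above can be sketched as follows. The text fixes only the 0.5 mass at zero offset and the exponential decay; the decay rate, the default maximum offset, and the names `offset_distribution` and `expected_linkability` are assumptions for illustration. The second function marginalizes over independent offsets of the two SSEs to estimate how likely a tested residue pair is to correspond to a cross-linkable pair in a model.

```python
def offset_distribution(max_offset=2, p_zero=0.5, decay=0.5):
    """Pr(delta): p_zero at offset 0, exponentially decaying on both sides.

    decay and max_offset are assumed values; the remaining probability mass
    is shared among the nonzero offsets in proportion to decay ** |delta|.
    """
    if max_offset == 0:
        return {0: 1.0}
    weights = {d: decay ** abs(d)
               for d in range(-max_offset, max_offset + 1) if d != 0}
    scale = (1.0 - p_zero) / sum(weights.values())
    probs = {d: w * scale for d, w in weights.items()}
    probs[0] = p_zero
    return dict(sorted(probs.items()))


def expected_linkability(edges, ra, rb, probs_a, probs_b):
    """Probability that a cross-link at protein residues (ra, rb) probes a model edge.

    edges: set of (residue, residue) pairs within cross-linking distance in the model.
    A protein residue r + delta corresponds to model residue r, so the model
    residue probed is the tested residue minus the offset.
    """
    return sum(pa * pb
               for da, pa in probs_a.items()
               for db, pb in probs_b.items()
               if (ra - da, rb - db) in edges)
```

Treating the two SSEs' offsets as independent follows the statement that each SSE is currently considered separately.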

*ℓ*, *ℓ*′:

and similarly for backbone flexibility. Furthermore, misalignment and flexibility are independent.

#### Experiment planning

We seek a set *L* ⊂ *R* × *R* of residue pairs to experimentally cross-link, in order to assess whether or not the SSE pair is in contact. This is another feature subset selection problem, and we again employ an mRMR-type incremental algorithm. Here a possible cross-link *ℓ*’s relevance is evaluated in terms of the information it provides about whether or not the SSE pair is in contact: *I*(*ℓ*, *c*), where *c* is the binary random variable for contact of a target SSE pair. Redundancy is again evaluated in terms of mutual information. Thus the objective is:

$$\ell_* = \arg\max_{\ell \in (R \times R) \setminus L} \left[ I(\ell, c) - \frac{1}{|L|} \sum_{\ell' \in L} I(\ell, \ell') \right]$$

and we incrementally select cross-links to maximize the difference in relevance regarding contact and average redundancy with already-selected cross-links.

#### Data interpretation

Let *Y* be the set of cross-linking data, indicating for each residue pair in *L* whether or not a disulfide was detected. To decide whether or not the SSE pair *c* is in contact, we will compare Pr(*Y* | *c* = 1) and Pr(*Y* | *c* = 0), and take the one with higher likelihood. Intuitively, the more cross-links that are detected, the more confident we are that the SSE pair is in contact. Thus we currently employ a sigmoidal function to evaluate the likelihood:

Here *k* is the number of detected cross-links in *Y*, and *k*_{0} is the minimum number of positive cross-links for us to start believing c is in contact. For example, for *c* = 1, given a default number of 10 experiments, we set *k*_{0} = 3 and the likelihoods of *c* = 1 for *k* = 0, 3, 6 are then approximately 0.05, 0.5, and 0.95, respectively. The metric could be extended to reward the broader distribution of cross-links throughout each SSE. However, in our current framework, we find that having a sufficient number of cross-links without regard to location tends to achieve that goal.
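One sigmoid consistent with the quoted values is the standard logistic function centered at *k*_{0} with unit steepness, sketched below. The exact functional form is not recoverable from the text, so the logistic form, the `steepness` parameter, and the function name are assumptions.

```python
from math import exp


def contact_likelihood(k, k0=3, steepness=1.0):
    """Likelihood of c = 1 given k detected cross-links (assumed logistic form).

    k0: number of positive cross-links at which contact becomes as likely as not.
    steepness: assumed slope; 1.0 reproduces the values quoted in the text
    after rounding to two decimals.
    """
    return 1.0 / (1.0 + exp(-steepness * (k - k0)))
```

With these defaults the function gives about 0.047, 0.5, and 0.953 for *k* = 0, 3, and 6, matching the approximate values 0.05, 0.5, and 0.95 given above.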

#### Plan evaluation

As in the previous section, we sum over the possible outcomes (here, in contact or not) and the possible experimental results ($\{0,1\}^{|L|}$, all binary choices for cross-links in plan *L*), weighted by their probabilities, and see which yield the wrong decision. In the absence of an informative prior for *c* (and one that we want to use in interpreting the data), we simply use Pr(*c* = 1) = Pr(*c* = 0) = 0.5. Note that, if desired, we could use the cross-linking Bayes error as a replacement for *q* (as 1 – *∊*) in evaluating Pr(*c* = *x*). These values could be precomputed for all candidate SSE pairs, or a fingerprint could be reevaluated and perhaps modified upon evaluating its possible cross-link plan.

## Results and discussion

We tested our approach on nine CASP7 targets (Tab. 1): some that are all-*α*, some that are all-*β*, and some that are mixed *α* and *β*. For each target, a number of high-quality models have been produced by different groups; we evaluate those of common SSE content, as described in the methods. The models vary in similarity to the crystal structure (the PDB ID indicated), which is unknown at the time of modeling and furthermore not used for experiment planning, as well as to each other (the average root mean squared deviation in atomic coordinates, RMSD, between pairs of models is indicated). Our goal is to select for each target an experiment plan to robustly determine the model(s) of the same fold as the crystal structure.

Table 1. Test data sets (from CASP7)

| CASP ID | PDB ID | 2° | AAs | Models | Av. RMSD (Å) |
|---|---|---|---|---|---|
| T0283_D1 | 2hh6 | 5 | 97 | 162 | 17.26 |
| T0289_D2 | 2gu2 | 5 | 74 | 34 | 13.45 |
| T0299_D1 | 2hiy | 3 | 91 | 30 | 15.23 |
| T0304_D1 | 2h28 | 2 | 101 | 26 | 15.76 |
| T0306 | 2hd3 | 7 | 95 | 45 | 14.22 |
| T0312_D1 | 2h6l | 2 | 132 | 55 | 16.13 |
| T0351 | 2hq7 | 5 | 117 | 65 | 15.42 |
| T0382_D1 | 2i9c | 6 | 119 | 196 | 12.79 |
| T0383 | 2hnq | 2 | 127 | 59 | 11.61 |

### Topological fingerprint selection

Fig. 3 shows the Bayes error (*∊*), expected tie ratio (*τ*), and expected none-of-the-above ratio (*ν*) as more SSE pairs are included in the topological fingerprint. It may seem counterintuitive that *∊* initially increases with the addition of SSE pairs. However, this is because we define the Bayes error of a tie as zero (Eq. 9), and separate out the tie ratio. With few SSE pairs in the fingerprint, *τ* is generally high: few decisions will be made, as many models look equally good, and the Bayes error is small. Then as SSE pairs are added, *τ* drops sharply: the fold is more specifically determined, decisions will be made, and the potential for error (as reflected in the Bayes error) increases. Once a sufficient number of SSE pairs has been selected, the specifically determined fold is distinct, the decisions are likely to be right, and *∊* will decrease. Thus it is both appropriate and helpful to consider *∊* and *τ* together, as they provide complementary information in the progress toward obtaining a unique and correct fold.

On the other hand, we observe that the *ν* value is usually 0 in the first few steps, because at that point distinct folds have not yet been separated, and it is easy for the SSE graphs from the predicted models to “cover” all the possible folds. *ν* becomes non-zero when there are uncovered folds. Its value first decreases because the number of covered folds and the number of uncovered folds are both increasing as more SSE pairs are included, and *ν* only gets contributions from an uncovered fold with *greater* (not equal) likelihood than the best covered fold. At some point the number of covered folds stops increasing (due to the limited set of predicted fold types), while the number of uncovered folds is still growing. Then the additional fold possibilities in the uncovered space result in a higher risk of “none-of-the-above”, and thus the *ν* value starts increasing again. This trend is particularly obvious for targets T0289_D2 and T0304_D1; in fact, we return to T0304_D1 below as a real example of “none-of-the-above”.

Fingerprint selection depends on the *q* function (Eq. 2), essentially indicating the confidence we expect to have in the experimental evaluation of an SSE pair. We performed a sensitivity analysis for three values of *q*, from 0.7 (fairly ambiguous) to 0.9 (fairly confident). Fig. 4 shows that for one target the trends are very similar for all three values; our algorithm is insensitive to the choice. Other targets display similar insensitivity (not shown).

### End-to-end simulation study

Once we have selected a topological fingerprint, we next design a disulfide cross-linking plan to determine the contact state of the selected SSE pairs. To validate the overall process (fingerprint + disulfides), we perform a simulation study. Given a selected set of residue pairs for cross-linking, we use the crystal structure (PDB entry in Tab. 1) to determine whether or not they should form disulfides (C^{β}–C^{β} distance < 9 Å), and treat those evaluations as the data. We also use the set of all SSE pairs to directly compare the fold of each model with that of the crystal structure, and thereby label each model as being the “correct” fold or not depending on whether or not it has the same SSE contacts for the same SSE pairs. We then evaluate whether or not the simulated data for the selected cross-linking plans result in the same conclusions as the direct comparisons of folds.

The results depend on the fraction *r* of residues that must be in contact to declare that the SSE pair is in contact in the structure or model. A high *r* value results in very few SSE pairs deemed to be in contact (we found that to happen with *r* = 0.3), while a low one yields some fairly weak contacts. As the figure shows, a moderate *r* value of around 0.2 generally results in quite good fold determination results.

### Robustness

One of the merits of the fold determination approach is that it is robust to errors in models, and can even account for the case when none of the models is correct. The selected targets provide examples requiring such robustness; we summarize here just a couple.

*Misalignment*. In Eq. 12 we account for being off by up to *δ* residues in the SSE locations. In the case of T0312_D1, there are 23 models of the correct fold, but with *δ* = 0, only 7 of them agree with the crystal structure regarding all the cross-links in the experimental plan, while with *δ* = 1 there are 14 that agree, and with *δ* = 2 there are 16. The remaining unmatched models are looser in structure, and the match is sensitive to the threshold we use to measure SSE contacts.

*None-of-the-above*. For target T0304_D1, none of the models has the same SSE contact graph as the crystal structure. The GDT [6] scores of predicted models are in the low 30s, which indicates relatively poor agreement with the crystal structure. As shown in Fig. 3, the *ν* value is relatively high, indicating a potential risk of missing the right fold. Indeed, once we evaluate the models under the simulated data, we find that the likelihoods are low (< 2 × 10^{–3}), compared to that (≈ 0.66) of the uncovered but correct fold, which is found by enumeration.

## Conclusions

This paper presents a computational-experimental mechanism to rapidly determine the overall organization of secondary structure elements of a target protein by probing it with a planned set of disulfide cross-links. By casting the experiment planning process as two stages of feature selection—SSE pairs characterizing overall fold and residue pairs characterizing SSE pair contact states—we are able to develop efficient information-theoretic planning algorithms and rigorous Bayes error plan assessment frameworks. Focusing on fold-level analysis results in a novel approach to elucidating three-dimensional protein structure, robust to common forms of noise and uncertainty. At the same time, the approach remains experimentally viable by finding a greatly reduced set of residue pairs (tens to around a hundred, out of hundreds to thousands) that provide sufficient information to determine fold.

## Declarations

### Acknowledgements

This work was inspired by conversations with and related work done by Michal Gajda and Janusz Bujnicki, International Institute of Molecular and Cell Biology, Poland. It was supported in part by US NSF grant CCF-0915388 to CBK.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 12, 2011: Selected articles from the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/12?issue=S12.



## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.