Predictors of natively unfolded proteins: unanimous consensus score to detect a twilight zone between order and disorder in generic datasets

Background Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. Results In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Conclusions Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated.


Background
For long time it has been thought that the existence of a stable three dimensional structure is necessary for a protein molecule to be functional. Nevertheless, the evidence that many proteins are unfolded in their functional state has induced, in the last decade, a re-assignment of the structure-function paradigm [1]. Natively unfolded proteins, also known as "intrinsically unstructured" or "intrinsically disordered" [1][2][3] lack a well defined tertiary structure, being functional in states made of an ensemble of flexible conformations. It is known that these proteins are involved in important cellular processes like signalling, targeting and DNA binding [1][2][3][4][5][6][7][8]. It has been suggested that they may play critical roles in the development of cancer [9,10] and in some amyloidotic diseases [10][11][12]; moreover, the absence of a rigid struc-ture allows them to bind different targets with high specificity and low affinity, suggesting that they may be hubs in protein interaction networks [13][14][15][16]. The current release 5.0 of the Disprot database (http://www.disprot.org/; see also [17,18]) contains 517 proteins with different "flavours" of disorder [19].
A great effort has been made to predict natively unfolded proteins through an ab-initio analysis of their amino acidic sequences. Many methods have been proposed (for reviews see [20,21]), both to predict disordered amino acids of a protein, i.e. amino acids whose atoms are hard to localize by X-ray crystallography or NMR spectroscopy [22,23], and to define synthetic binary scalar indexes that express the probability that a protein has a global tendency to fold or to remain unfolded. It has been observed that predictors of natively unfolded proteins are, basically, functionals of the amino acidic composition [24]. Szilagyi et al. reported that hydrophobicity and charge distributions of folded and natively unfolded proteins overlap, and, on a hydrophobicity/charge plane, there is an area occupied by proteins from both groups [24]. This suggests that, in the vector space of amino acidic compositions, it is not possible to find a hyperplane separating the points corresponding to sequences of folded proteins and those corresponding to sequences of unfolded proteins, but there exists an overlap volume, that is reasonable to identify with a "twilight zone" between order and disorder [24,25].
In this work we are interested in finding an operational method to identify proteins in this twilight zone.
In the following paragraphs of this section we outline the perspective of our work and anticipate the main results we shall present and discuss in the next sections.
In the first part of this study we revisit several methods to predict natively unfolded proteins through binary scalar indexes of folding. Uversky et al. propose to analyse the mean hydrophobicity and mean net charge of the protein sequences [26]; following their method, Prilusky et al. developed a web-based server named FoldIndex [27]. Galzitskaya et al. propose to use the mean packing of a protein sequence as an indicator of its folding status [28][29][30]. Recently, mean pairwise energy was used to effectively discriminate folded proteins from natively unfolded ones in a peculiar set of 39 protein complexes [31,32]. A more refined approach is Poodle-W by Shimizu et al. [33]; they analyse the amino acid composition of the protein sequences with a spectral graph transducer [34].
We compared the performances of these predictors on several datasets. We also defined a global binary folding index, named here gVSL2, using VSL2 [35,36], a predictor of disordered amino acids that excellently performed in the CASP7 experiment [37] and which is an evolution of the predictors that evidenced different flavours of disorder [19].
We observed that in many cases a number of sequences were differently classified by different indexes (figure 1). To identify these proteins we introduced, by combining several indexes, a strictly unanimous consensus score S SU that leaves unclassified a protein if at least two indexes disagree in classifying it. We verified that proteins unclassified by S SU span, on a hydrophobicity/charge plane, an overlap area occupied both by folded and natively unfolded proteins (figures 2 and 3). We checked also that these proteins have amino acidic frequencies intermediate between those of proteins that S SU predicts either as folded or natively unfolded (figures 4, 5), so we have concluded that they belong to a twilight zone in the space of amino acidic composition. Proteins in this twilight zone have a flexibility intermediate between that of folded and that of natively unfolded proteins and have a dependence of chain length similar to that reported by Szilagyi et al. [24].
We used S SU to scan several genomes from Archaea, Bacteria and Eukarya looking for natively unfolded proteins; the obtained percentages are similar to those already reported in the literature [38,39]; through S SU we found robust scaling laws both in the classified and unclassified sets of proteins, of possible significance for studies in molecular evolution [40,41].

Figure 1 Overlaps between the predictions of single indexes.
Comparison of the predictions by mean pairwise energy, gVSL2 and Poodle-W, on set C. The three indexes agree on 1785 proteins predicting them in the same class; 224, 210 and 231 proteins are singly classified by each of the indexes respectively, on each one of these proteins the prediction of one index is at variance with those of the others; the remaining figures refer to the pairwise cross-predictions.

Performances of single indexes of folding
To compare the performance of the synthetic predictors we computed them on set A, that Prilusky et al. selected to test FoldIndex [27]. The results are reported in table 1. The index HQ is our implementation of the algorithm described by Uversky and co-workers [26]. Mean packing and mean pairwise energy were computed following the original protocols [29,31] (see Methods). We classified a protein as natively unfolded if it had a mean packing below 20.55 and a mean pairwise energy above -0.37 arbitrary energy unit (a.e.u.); these threshold values were optimized to reach a sensitivity of at least 0.80 and a level of false predictions as low as possible. gVSL2 is the arithmetic mean of the disorder scores obtained by the predictor VSL2 [35,36] (see Methods for details).
From the AUC values shown in table 1 it is possible to rank the single indexes. We observe that mean pairwise energy, gVSL2 and Poodle-W have about the same performance and are the best three.
To test the stability of these results over different datasets, we repeated the experiment using two other groups of proteins: set B, used by Shimizu et al. to test Poodle-W [33] and set C, our own selection of proteins. Set B contains 526 folded proteins with a level of disorder below 5% and 81 natively unfolded proteins with a level of disor-der above 70%; it has been built as a very discriminative set, aiming at including either fully ordered or fully disordered proteins. Set C was compiled to test the ability of previously proposed methods to separate, on one side, fully ordered folded proteins but also proteins containing a percentage of disordered amino acids above 5%, from, on the other side, fully disordered proteins containing more than 70% of disordered residues. It contains 2369 folded proteins, 1573 fully ordered with a level of disorder below 5% and the remaining 796 proteins with higher percentage of disordered amino acids. Set C contains also 81 natively unfolded sequences with a level of disorder above 70% (see Methods). The performances of the single folding indexes on set B and C are reported in tables 2 and 3, respectively. The AUC values confirm that Poodle-W, gVSL2 and mean pairwise energy <E c >are the best performing on both sets. Note also that Poodle-W and <E c >have a low level of false positives f p , whereas gVSL2 tends to have a higher sensitivity S n .
It is evident the lower performance of all the indexes on set C, with respect to set B. As said above, the protocol followed by Shimizu et al. [33] is more restrictive; their set contains folded proteins whose average composition is different from that of unfolded ones (as shown in the hydrophobicity/charge plot of figure 3) whereas set C, Figure 2 Hydrophobicity/charge plot of folded, unfolded and unclassified proteins in set B. Hydrophobicity/charge plot of proteins in set B, experimentally identified as folded (red) and unfolded (blue). Upper green triangles refer to folded proteins unclassified by S SU , lower black triangles refer to unfolded proteins unclassified by S SU . This plot is a projection of the vector space of amino acidic compositions. Hydrophobicity and charge have been computed following ref. [25]. being more generic, contains a large number of untypical folded sequences. Poodle-W confirms itself as the relatively best single index, both on the optimized dataset B and on the more generic dataset C.
We checked that set C includes 506 proteins that are complexed with peptides or other macromolecules. These proteins may be partially disordered, particularly if they are not linked with their targets. We controlled however that the changes in the performance of the indexes are negligible if we exclude these proteins. The performance of the folding indexes computed on set C purged from these 506 complexed proteins are included in the [Additional file 1: Table ST1].

A consensus score to detect a twilight zone of amino acidic composition
In figure 1 we compare, in a Venn diagram, the results of the predictions obtained by mean pairwise energy <E c >, gVSL2 and Poodle-W for the 2450 proteins in set C. We see that the three indexes predict in the same folding class 1785 proteins, and that the indexes partly overlap, i.e. <E c >and gVSL2 predict in the same folding class 231 proteins, and there are other 434 proteins on which at least two indexes of folding agree. On the other hand, there are proteins that are uniquely predicted by each single folding index: 224 by <E c >, 210 by gVSL2 and 231 by Poodle-W. So, globally, there are 665 proteins on which at least two indexes disagree.
It is reasonable to think that proteins differently classified by two single folding indexes have amino acidic frequencies different from those typical of folded and natively unfolded proteins, then they are natural candidates to belong to a twilight zone in the vector space of amino acidic compositions. To identify these proteins we used a consensus score. The use of combination scores is not new [39,42] and a meta-server is also available [43]. We introduced a combination rule that we called strictly unanimous score, S SU . We decided to combine mean pairwise energy, gVSL2 and Poodle-W because, among the global indexes here considered, they are the best performing ones (see tables 1, 2 and 3). S SU classifies a protein as folded if all the indexes agree in predicting it as folded; conversely, it classifies a protein as natively unfolded if all the indexes agree in predicting it as Note that, due to the clause of unanimity, the performance of the single indexes on sets B and C purged from the unclassified proteins coincides with that of S SU and it is improved (compare tables 2 and 3 with table 4). This improvement after filtering can be obtained only if one combines in S SU a set of indexes of comparable performances, as discussed in the second subsection of the Discussion. We observe, in table 4, a relatively high sensitivity, and a low level of false positives, suggesting that removing the unclassified proteins is a filter of false predictions. We checked that 68% of the false predictions of <E c >, 74% of those of gVSL2 and 54% of those of Poodle-W are unclassified by S SU . Note also the small fraction of unclassified unfolded proteins; that is essentially due to the fact that unfolded proteins are less numerous in the datasets.
Since unclassified proteins by S SU are often false predictions of the single indexes, they should have an untypical amino acidic composition. We checked that on the hydrophobicity/charge plane, which is a projection of the space of amino acidic compositions (figures 2 and 3). Figure 2 refers to set B, figure 3 to set C. Points in blue and in red correspond to experimentally determined unfolded and folded proteins, respectively. Points in green and in black correspond to the proteins unclassified by S SU . To give all the information we have distinguished with green points folded unclassified proteins, and with black points unfolded unclassified proteins. It is evident that in set C folded and unfolded proteins are largely mixed up. This is due to the inclusion in the dataset of a number of protein complexes (about 22%), as well as of binding, signalling, nucleic acids associated and membrane proteins (about 40%). Many of them could have a composition typical of unfolded proteins, but nevertheless are folded and crystallizable after binding with a ligand. We checked that the overlap decreases if we exclude these proteins from the dataset and the hydrophobicity and charge distributions tend to those of set B [Additional file 2: Figure S1 (A)]. Note, however, that in both figures 2 and 3 (see insets) the centroids of the areas spanned by the three groups of pro- teins are well separated and aligned; those referring to unclassified proteins (green) are intermediate. Interestingly, in the more curated set B, the proteins unclassified by S SU fall in a more restricted and well defined overlap area. A similar result is obtained from set C purged from complexes and ligand binding proteins [Additional file 2: Figure S1 (A)]. In the supplemental figure S1(B) we plot, on the hydrophobicity/charge plane, folded proteins in set C that are protein complexes or ligand binding proteins. Therefore, supplemental figure S1(B) reports those folded proteins that have been excluded in supplemental figure S1(A); the unfolded proteins in set C are reported both in the supplemental figures S1 (A) and (B). Let us note that unclassified proteins are not all or mostly membrane proteins, signalling and DNA binding proteins. It is also clear that the centroids of the unclassified proteins are always intermediate both in the whole set C (figure 3), in the set purged by untypical proteins (supplemental figure S1 (A)) and in its complement (supplemental figure S1 (B)). These observations confirm that proteins we assign to the twilight zone belong to an overlap volume in the amino acidic composition space.

Figure 5
Distribution of the log-odds ratio S in folded, unfolded and unclassified proteins (twilight zone). Distribution of log-odds ratios in predicted folded (red bars), unfolded (blue bars) and unclassified proteins (green bars), as evaluated by S SU on set C. From this graph the twilight zone can be defined as the set of proteins whose S-scores are sufficiently close to zero.

Amino acidic composition of proteins in the twilight zone
It is interesting to check directly whether proteins which are unclassified by S SU have amino acidic frequencies intermediate between those predicted as folded and those predicted as natively unfolded. We computed the frequencies of the amino acids for the three groups of proteins in set C. The results are reported in figure 4. We see that W, C, F, I, Y, V, L, H and N are more present in folded than in unfolded proteins, whereas A, R, Q, S, P, E, K are more frequent in unfolded than in folded proteins; this result is consistent with that reported in the literature [19,44]. Following Romero et al. [44], we indicate the first group of amino acids as order-promoting and the second group as disorder-promoting. We observe also that the unclassified proteins by S SU have a peculiar composition with respect to folded and natively unfolded ones; more precisely, the frequencies of W, F, I, Y, V, L, Q, S, P, E and K are comprised between those of proteins predicted either ad folded or natively unfolded. M, T, A, N and D have frequencies similar in the three groups of sequences. By looking at figure 4 one would conclude that the amino acidic composition of the unclassified proteins is substantially intermediate between those of proteins predicted as folded and those of natively unfolded. However, this could only signals that the set of unclassified proteins is a balanced mixture of folded and unfolded proteins. To exclude this hypothesis we checked that only a small fraction of unclassified proteins belong to the set of natively unfolded sequences (see previous section and table 4). We further characterized the amino acidic composition of the proteins in set C by computing the log-odds ratio S of the likelihoods, for a sequence, of being composed mainly by order-promoting or disorder-promoting amino acids (see Methods for details). Positive (negative) values of S indicate that the sequence is composed mainly by order-promoting (disorder-promoting) amino acids [45]. The distributions of log-odds ratios are reported in figure  5 for sequences that are either classified by S SU or left   figure 4, the frequencies of C, H, R and G in the unclassified proteins, are not comprised between those in the other two classes but are even higher than those. This means that the amino acidic composition of the unclassified proteins is not strictly in between those of folded and natively unfolded proteins. The important point to make here is: the composition in the twilight zone is different, possibly corresponding to different folding propensities.

General remarks
Most global predictors of disorder rely on the amino acidic composition of protein sequences [24]. The distribution of the amino acids in folded proteins partly overlaps with that in natively unfolded proteins, moreover, as also noted in [24], the existence of two-state homodimers, that are disordered as monomers but fold on homodimerization, indicates that composition alone does not determine the folding status of a sequence. In this paper we have presented the unanimous consensus score S SU ; this score effectively selects proteins that, in the vector space of amino acidic composition, belong to an overlap volume occupied both by folded and natively unfolded proteins. This overlap volume has been theoretically investigated by Szilagyi et al. [24] as a twilight zone between order and disorder. Proteins in the twilight zone, as identified by S SU , have a distribution of the log-odds ratios S concentrated around zero, statistically different and intermediate between the S distributions of folded and unfolded proteins. This suggests that proteins in the twilight zone are those with approximately the same number of order-promoting and disorder-promoting amino acids.
It is important to point out that S SU assigns proteins to the class of folded, unfolded or unclassified considering only their amino acidic composition and not the percentage of disordered amino acids they contain. So, proteins unclassified by S SU generally do not have a higher fraction of disordered amino acids than proteins classified as folded. In figure 6 we report the distributions of disordered amino acids in proteins assigned by S SU to the class of folded and unclassified proteins; it is evident that both classes have the same distribution. We checked that, in set C, of the 796 proteins with a percentage of disordered amino acids higher than 5%, 541 are classified by S SU as folded, 45 as unfolded and 210 are unclassified. We also checked that of the 506 complexed proteins alluded to in the results section and in the supplemental table ST1 [Additional file 1: Table ST1], 361 are classified as folded, 30 as unfolded and 115 are unclassified. So, proteins in the twilight zone cannot directly be identified with proteins with long flexible segments or loops. In figure 7 we report the distributions of mean flexibility in predicted folded, unfolded and unclassified proteins (see Methods for definitions). Interestingly, unclassified proteins exhibit a mean flexibility (-0.465 ± 0.001) which is intermediate between that of proteins predicted folded (-0.486 ± 0.001) and natively unfolded (-0.431 ± 0.002). This suggests that the mean flexibility of proteins in the twilight zone is not directly related to the number of disordered amino acids. It will be interesting, on a case-by-case study, to investigate how the distribution of disordered amino acids correlates with the localization of flexible tracts in the proteins of the twilight zone.
An interesting point is raised by Szilagyi et al. [24], who underscore the relevance of the chain length for the tendency of a protein to be folded or unfolded. This suggested us to look at the distribution of lengths in the Table 4 proteins of the three classes: folded, unfolded and unclassified (twilight zone, see figure 8). It is interesting to note that in all cases there is a scaling power law, but the scaling exponents are quite different in the three classes; remarkably the twilight zone has the more negative scaling exponent (-3.3 ± 0.2), indicating that proteins in this class tend to be shorter than folded and unfolded ones. We subdivided proteins of set B into different bins of length, corresponding to sequences of 50-99, 100-199, 200-299 and up to 300 amino acids, respectively. Then, we estimated the extent of the projection of the twilight zone in the hydrophobicity/charge plane for each bin considered. The results are reported in graphs of supplemental figure S2 [Additional file 3: Figure S2], where dashed lines refer to a twilight zone determined by a logistic regression similar to that in ref. [24], whereas solid lines delimit the twilight zone as detailed in the Methods section; our estimates of the projection of the twilight zone are given by areas delimited by parallel solid lines drawn to include, for each bin of length, not less than 75% of the dots referring to proteins left unclassified by S SU . We observe a narrowing of the extent with the increase of chain lengths, consistently with the observations in Szilagyi et al. (see figure 2 in ref [24]) based on a logistic regression. Moreover, we have observed that the narrowing is quite independent from the choice of the threshold of 75% we use to delimit the zone. It is important to observe that the projection of the twilight zone, as selected by S SU (the area between solid lines in supplemental figure S2), does not coincide with the twilight zone determined by the logistic regression in the hydrophobicity/charge plane, as done in [24] (the area between dashed lines in the same supplemental figure S2). It is out of scope here to study how many and which of the proteins left unclassified by S SU are also recognized by the logistic regression in a twilight zone of the hydrophobicity/charge plane; let us only remark again that we have shown here that unclassified proteins are characterized by an S score close to zero, related to a mean flexibility which is intermediate between that of folded and unfolded proteins (figures 5 and 7). We believe that S SU can be useful as a starting point for further studies of the proteins lying in the region between order and disorder. We already used S SU in a previous work [41], scanning several genomes from different kingdoms; we obtained percentages of natively unfolded proteins of about 0.8% for Archaea, 3.7% for Bacteria and 23.4% for Eukarya, consistent with those previously reported [38,39]. The percentage of unclassified proteins is of 5.1% for Archaea, 7.4% for Bacteria and 15.8% for Eukarya. We also found scaling laws: the scaling exponent Figure 6 Fraction of disordered amino acids in folded, unfolded and unclassified proteins (twilight zone). Fraction of disordered amino acids in predicted folded (red bars) and unclassified (green bars) proteins, as evaluated by S SU from set C. A residue is disordered if it is present in the SEQRES but not in the ATOM field in the PDB file of the protein [33].

: Performance of each of the indexes combined into S SU on the sets B and C purged by the unclassified proteins
of the percentage of disordered proteins as a function of the number of proteins in the genome is 1.81 ± 0.10, whereas the percentage of unclassified proteins scales with an exponent 1.29 ± 0.05 [41]. In that analysis we combined mean packing, mean pairwise energy and gVSL2, but we did not use Poodle-W, since it was then unavailable to us. We are planning to extend that research using the operational combination of indexes proposed in the present study.
In the next section, at last, we propose a generic protocol on how to combine indexes into a strictly unanimous score.

A protocol for the combination of indexes able to select a reliable subset of proteins in the twilight zone
The extent of the twilight zone made of the unclassified proteins depends on the particular choice of the combined indexes. We checked that if one combines indexes with relatively low performance with more performing ones then only a small fraction of unclassified proteins corresponds to the false predictions of most single indexes. In this case many mistakenly classified proteins by the poor performing indexes are just unrecognized true positives and true negatives of the best performing ones, and then the twilight zone is poorly characterized. To avoid that we propose the following simple protocol: the folding indexes are initially ordered by decreasing values of the AUC; then one or more indexes are subsequently used to form consensus scores, monitoring the parallel decrease of f p . S SU (1) is just the best performing single index; S SU (2) is the combination of the first with the second best index; S SU (3) the combination of the first three indexes, and so on. An enhancement of the performance of the indexes on the more and more polarized subsets is expected at each step, together with a parallel increase in the percentage of unclassified proteins. In fact, these trends are shown in table 5 for the indexes considered in this work; by looking at this table it is clear that, from step 3 on, f p tends to saturate, whereas the fraction of unclassified proteins increases. As a rule of thumb we propose to add indexes until the f p starts to saturate and we adopted this criterion in selecting Poodle-W, gVSL2 and mean pairwise energy <E c >in the present study.

Conclusions
It has been pointed out that the separation between order and disorder in proteins is not sharp [24,25], but there exists a twilight zone between them. In Szilagyi et al. [24] the twilight zone is identified as an overlap volume in the space of amino acidic compositions, occupied both by folded and natively unfolded proteins. In this work we used the consensus score S SU to select, operationally, proteins belonging to the twilight zone, out of a generic dataset. The method can be easily adopted for large database screening, since it is easy to implement and computationally efficient. Our results show that proteins unclassified by S SU : i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded  SU (4) is the combination of Poodle-W, gVSL2, <E c >and <P>; finally S SU (5) is the combination of Poodle-W, gVSL2, <E c >, <P>and HQ; nc is the fraction of unclassified proteins.

Figure 8 Distribution of lengths in folded, unfolded and unclassified proteins (twilight zone).
Log-log plots of the distribution of lengths in the three classes of proteins extracted by S SU from set C. The scaling exponents, evaluated from a regression of the power law region in each graph are: -2.7 ± 0.2 (folded, red data points); -1.2 ± 0.3 (unfolded, blue); -3.3 ± 0.2 (unclassified, green). and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of predicted folded proteins and that of predicted unfolded proteins. These last remarks indicate these unclassified proteins belong to a class that possibly has physical properties intermediate between those of folded and those of natively unfolded proteins.

Datasets
Set A was selected by Prilusky et al. to test FoldIndex [27]. It is composed by 151 folded proteins and 39 natively unfolded proteins. Folded proteins are extracted from OCA Protein Data Bank http://bioportal.weizmann.ac.il/oca, selecting those with less than 10% of sequence identity and lengths comprised between 50 and 200 residues. Proteins with disulfide bridges, heterogroups or other non-protein elements are excluded; moreover, only X-ray structures with a percentage of disordered amino acids below 5% are considered. Unfolded proteins are selected by searching the literature for experimentally certified cases. Set B was selected by Shimizu et al. to train and test Poodle-W [33]. It is composed by 526 folded proteins selected from the PDB, and 81 natively unfolded proteins selected from the DisProt database [17,18]. Folded proteins are extracted from the PDB, among those with: i) resolution better than 2 Å; ii) R-factor below 20% and refined through Refmac5, SHELXL97 or CNS; iii) a percentage of disorder below 5% and no disordered amino acids in the central area (i.e. between the 10th residue from the N-terminal and the 10th residue from the C-terminal). Natively unfolded proteins are extracted from DisProt version 3.3 among those with a percentage of disordered amino acids above 70%. Both folded proteins and natively unfolded ones have a sequence identity below 30%, determined using BLAST-Clust.
Set C was selected for the present work. Folded proteins are extracted from PDBSelect25 [46,47], version October 2007, which contains 3694 proteins with sequence identity lower than 25%. Structures with a resolution above 2 Å and an R-factor above 20% are excluded. A restricted list of 2369 is obtained, 1573 of which with a percentage of disordered amino acids below 5%. Operationally, a residue is considered as disordered if it is present in the SEQRES but not in the ATOM field of the PDB file [37,38]. A list of 81 natively unfolded proteins, with at least 70% of disordered amino acids and sequence identity below 25%, are extracted from the DisProt database [17,18], version 3.6. Note that the 81 unfolded proteins in this set, coincide only in part with the 81 unfolded proteins selected, with a more refined procedure, in [33]. The list of PDB and DisProt entries of the proteins in set C are available in Additional files 4 and 5.

HQ index
We followed the method by Uversky et al. [26], revisited by Prilusky et al. [27]. From mean hydrophobicity <H>and mean net charge <Q>of the protein sequences, the following index is evaluated: We considered a protein as natively unfolded if HQ is negative or zero, otherwise we considered it as folded.

Mean packing
The mean packing of a protein sequence is the arithmetic mean of the packing values of each amino acid. The packing index has been introduced by Galzitskaya et al. [29], it is evaluated by counting, for each type of residue, the residues located within a distance of 8 Å, and averaging over a large dataset of structures. The packing index is here assigned to the central residue as the average over a sliding window of length 11. We considered a protein as natively unfolded if its mean packing is below 20.55, otherwise we considered it as folded. We did not observe a significant improvement of the performance changing the length of the sliding window.

Mean pairwise energy
We followed the method by Dosztanyi et al. [31], as implemented in IUPred [48]. The pairwise energy of an amino acid expresses its "contact interaction" with the amino acids located, downward and upward, from 2 to 100 position apart, along a given sequence. The pairwise energy of amino acid i at position p is given by: We considered a protein as natively unfolded if its mean pairwise energy is above -0.37 a.e.u., otherwise we considered it as folded.

Index gVSL2 derived from VSL2
VSL2 [35,36] is a disorder predictor that assigns to each amino acid of a protein its propensity to be disordered, estimated using a combination of support vector machines. The score from VSL2 is normalized between 0 and 1 and an amino acid is considered disordered if its value is above 0.5.
We evaluated the global index gVSL2 by assigning to the central residue the average of VSL2 scores (variant VSL2B) over sliding windows of 11 residues, and then taking the average value over all the residues. We classified a protein as natively unfolded if its gVSL2 value is above 0.5.

Poodle-W
Poodle-W [33] is a predictor of natively unfolded proteins based on a Spectral Graph Transducer semi-supervised learning machine [34]. It analyses the amino acid compositions of both proteins with known structural properties and with unknown structural properties, and it constructs a k-nearest neighbour graph with sequences as vertices and similarity among sequences as edges; then it cuts the graph into two sub-graphs so as to minimize the misclassification of the sequences with known structure and the sum of the edges weights of the graph; the resulting sub-graphs identify the predicted folded and natively unfolded proteins.

Statistics for the amino acid composition of protein sequences
We followed the approach of Romero et al. [44]. In a dataset containing N sequ sequences, let n j be the number of the amino acids of the j-th sequence and n ij the number of occurrence of the i-th amino acid in the j-th sequence. The frequency of the i-th amino acid in the j-th sequence is given by: The frequency of the i-th amino acid in the dataset analysed is computed as: where N aa is the number of the amino acids in the dataset. The corresponding variance is given by:

Log-odds ratio of the likelihoods that a sequence has amino acidic composition typical of folded and unfolded proteins
To compute this score, one usually refers to a probabilistic model. One assumes to have reliable estimates of the probability of occurrence of each amino acid a in folded and unfolded proteins {π a (F) } and {π a (U) }, respectively. A folded protein can be thought of as if its sequence is sampled from {π a (F) } through a sequence of independent extractions. The likelihood that a sequence has amino acidic composition typical of a folded protein is: where is the probability of amino acid a (estimated from the occurrences of a in a convenient sample of folded proteins) and n a is the occurrence of amino acid a in the sequence. The probabilistic model implicit in the above definition is a 0-order Markov chain. Similarly we can define L U by using π a (U) , with a simple extension of the notation. Then, L F /L U is the likelihood ratio associated to a given sequence, through its amino acidic composition {n a }, where a runs over all amino acid labels.
The log-odds ratio of a given sequence is then: Order-promoting amino (disorder-promoting) acids contribute with positive (negative) terms to S, since their ratios π a (F) /π a (U) are bigger (lower) than one. Therefore, S is positive (negative) if the protein is composed mainly by order-promoting (disorder-promoting) amino acids. When a protein is composed by approximately the same number of order-promoting and disorder-promoting amino acids, its S score has a value close to zero. Likelihood ratio tests are treated in different textbooks [45,49]

Mean flexibility
The mean flexibility of a protein is the arithmetic mean of the flexibility values of its amino acids. We used the flexibility scale by Smith et al. [50]; they derive their flexibility scale through an analysis of B-factors of 290 proteins selected from PDB Select 25 version August 1998. The flexibility index is here assigned to the central residue as the average over a sliding window of length 11. We did not observe a significant difference in the behaviour of figure 7 changing the length of the sliding window.

Identification of the twilight zone in the hydrophobicity/ charge plane
To characterize the extension of the twilight zone found by S SU we used the following procedure. First, we identified two discriminative lines to separate predicted folded and predicted unfolded proteins from those unclassified by S SU . To this end we considered two hydrophobicity/ charge planes: one for predicted folded and unclassified proteins and the other one for predicted unfolded and unclassified proteins. For all pairs of points, in each plane we traced a line and evaluated its performance in separating the two groups of proteins; among all the discriminative lines which separate unclassified proteins with sensitivity above 0.80, we chose, in each plane, the one with lowest fraction of false positives. Of the discriminative lines found in the two planes, we chose, in turn, the one with lower fraction of false positives. Second, to delimit the twilight zone, we plotted a second line by translating the first one so to include, between the two lines, not less than 75% of unclassified proteins. In supplemental figure S2 [Additional file 3: Figure S2] we report, for different bins of protein length, the twilight zones determined by the procedure above described. The projections of the twilight zone, in each bin, are represented by the area of the plane delimited by solid lines.

Indicators of performance
We evaluated the performance of the predictors by using very common indicators [37,51]: • Sensitivity: • Specificity: • False predictions: • Area Under Curve (AUC) of a Receiver Operating Characteristic (ROC) curve where TP stands for the true positives, TN the true negatives, FP the false positives and FN for the false negatives.
S n and S p express the fraction of correctly predicted proteins in a dataset; more precisely, S n is the fraction of correctly identified natively unfolded proteins, whereas S p is the percentage of correctly identified folded proteins. f p expresses the fraction of folded proteins that are classified by the index as natively unfolded. Receiver operating characteristic (ROC) curves are widely used to evaluate the performance of binary folding indexes and are obtained by plotting the fraction of true positive predictions versus the fraction of false negative predictions for each threshold value of the probability to be disordered. The performance of the predictor is evaluated through the area under the curve (AUC), generally computed by the trapezoid rule; the AUC is independent from discriminative thresholds. The AUC values were computed using ROCR [52].  table ST1. This file contains supplemental Table ST1, with caption. Additional file 2 Supplemental figure S1. This file contains supplemental figure S1, with caption.