Intrinsically disordered domains deviate significantly from random sequences in mammalian proteins

Background: In order to characterize mammalian intrinsically disordered domains (IDDs), we examined the patterns in their amino acid abundance as well as overrepresented local sequence motifs. We considered IDDs from mouse proteins associated with innate immune responses as well as a set of generic human genes. These sets were compared with artificially generated random sequences with the same overall amino acid abundance and length distributions. IDDs were then clustered by amino acid abundance, and further analyzed in terms of co-occurrence of clusters with functionally characterized Pfam domains.
Results: Overall, IDDs were very different from randomly generated sequences. The deviation from random distributions was at least as great as that for ordered domains, for which the deviation can be rationalized in terms of strong evolutionary pressure for structure and function. The co-occurrence of certain Pfam domains with specific IDD clusters was found to be significant (p-value < 0.01). Local sequence motifs that were over-represented in the innate immune set consisted mostly of low complexity fragments, primarily characterized by amino acid repeats, and could not be assigned an obvious functional role.
Conclusions: Our results suggest that IDDs are constrained within a narrow subset of possible sequences. This is most likely a result of biophysical restraints that have yet to be elucidated. More detailed examination of the functional relationship between the IDDs and associated Pfam domains is one possible avenue of investigation.


Background
Intrinsically disordered domains (IDDs) are abundant in eukaryotic proteomes [1], especially in cell signaling proteins [2]. It has been shown that many IDDs become ordered upon binding other macromolecules [3], and that their binding modes can be diverse [4] with the length of the IDD modulating the binding affinity [5]. This allows them to function as hubs in protein-protein interaction networks [6][7][8].
IDDs can be recognized by their amino acid composition, which is biased toward hydrophilic residues [9]. There is also a bias toward low-complexity regions, characterized by the abundance of one or a few amino acids, in IDDs [10]. IDDs have been shown to evolve more rapidly than ordered domains [11,12], while maintaining their length and location in the protein [8]. Despite their functional importance, IDDs are generally filtered out when performing structure-based protein function prediction in order to focus attention on better-characterized ordered domains.
In this study, we determined the levels of similarity between IDDs in a set of mammalian immune and nonimmune proteins and compared them to random and ordered sequences. IDDs were then clustered into similar sequence groups for proteins in the immune and non-immune sets and their associated ordered domains were studied. We further extracted common sequence motifs in the IDDs in an attempt to identify common sequence patterns.

Overview
We examined two sets of eukaryotic IDDs to study their sequence similarity and their association with specific functional domains. The first was taken from a set of 1580 mouse genes relevant to macrophage response to microbial stimulation (Innate Immune Set). The second is from a set of 1663 human proteins selected at random, excluding those in the Innate Immune Set. The use of two sets afforded the opportunity to examine whether, after filtering obvious homologs, there was greater similarity within a set of functionally related proteins than within a set of proteins picked at random. For the purposes of our analysis we here define IDDs as any predicted disordered segment of 30 residues or more, using the program Disopred2 [1] at a false positive threshold of 5%.
Since sequence repetitions are abundant in IDDs, similarity measures that do not require pair-wise alignment are convenient. To this end, frequency distributions were computed for single and multiple amino acid occurrences within an IDD, as described in Methods. Since dissimilarity measures, such as Kullback-Leibler divergence or the Student's t-test, are sensitive to the sample size, we chose instead to measure the similarity between two frequency distributions directly, using two methods. The first is a variation of a Gaussian-based score that has proven useful in structure comparison (eqn. 2 in Methods). The second is explicit enumeration of all possible amino acid sequence motifs of length 2-5. Using the latter method, we examined the ratio of observed to expected frequencies for each possible motif and discuss those that deviate significantly from their expected values.

Disordered sequences are not random
Using the Gaussian-based similarity score, we carried out an all-against-all comparison of IDDs in the immune and non-immune sets. For each of the IDD sets, we also constructed a randomized sequence set with identical overall amino acid composition, number of sequences, and domain length distribution by shuffling the residues in the original native sequence set, as described in Methods. We then constructed a histogram of the similarities within each of the resulting 4 sets by binning the calculated similarity scores into 50 equal-sized windows. As figure 1A illustrates, the random distributions are skewed toward the high end of the similarity spectrum, while native IDDs are much more diverse. Thus, the similarity within either of the IDD sets is much lower than the similarity within the random-immune and random-non-immune sets. This shows clearly that IDDs are not constructed randomly from a pool of disorder-promoting amino acids. As a comparison, we performed the same calculations on a set of ordered protein sequences extracted from a representative set of structured domains. As figure 1B illustrates, the ordered domains also differ from each other much more than random sequences do, even when the length distribution and overall composition are held constant. However, the overall similarity (as indicated by the peak in the distribution) is much lower in the disordered set than in the ordered set.
As figure 2 illustrates, the frequency distributions of individual amino acids within IDDs are in general nonsymmetric. That is, there are long tails to the right indicating that some sequences are rich in a particular amino acid type. As has been well documented by others, these particular amino acids include Glu, Gly, Pro, and Ser [9]. When examined closely, we observe broad peaks for Glu, Gly, and Pro at large abundances (25-30% of the IDD length). This non-uniform distribution of amino acids in IDDs is consistent with the nonrandom similarity distributions described above.

Randomizing IDDs predicted to reduce IDD content
Disorder prediction was performed on the randomized IDD sequences from the non-immune set. The percentage of disordered residues was predicted to decrease by 45% overall (data not shown) upon randomization. This result also suggests that IDDs are not merely random arrangements of residues but are subject to specific biological constraints. This conclusion must be drawn with caution, however, since it may be an artifact of the prediction algorithm: the Disopred2 program was trained on native sequences, so it is not clear whether its results can be meaningfully interpreted when it is applied to randomized sequences. At a minimum, we can say that the Disopred2 program distinguishes the random sequences from the real IDDs. Based on this result, we speculate that true IDDs have requirements beyond mere disorder; namely, the ability to fold upon encountering a target protein or the need to contain specific local sequence patterns necessary for biochemical function.

Sequence motifs in IDDs
In order to identify motifs in IDDs, we enumerated all possible sequence fragments from 2-5 amino acids in length and examined their frequency distributions in the IDD and random sets. The expectation value for each fragment was derived from the frequencies of the individual amino acids in the dataset (see Methods), and we computed the ratio of each observed frequency to its expectation value.
Since fragments with very rare occurrence could not be interpreted statistically, we discarded any motifs with fewer than 10 counts. Figure 3 displays the histogram of the distribution of the natural log of the ratios. It shows that the deviations of observed frequencies from the expected values are larger for IDDs than for randomly generated sets, indicating that IDDs tend to have some particular motifs. In principle, the center of the distribution is 0, where the observed frequency equals the expectation value. This is actually the case for doublet and triplet fragments, where almost every sequence motif is observed a number of times. However, for the quartet and quintet, the center is shifted to the right, most likely due to the fact that motifs with fewer than 10 counts were discarded. For all the random sets, there was no quintet with an observed frequency greater than 10. (Note that this simple model does not distinguish between multiple motifs found in the same protein sequence and those found in different sequences.)
We next examined the specific motifs that were overrepresented in the innate immune set. From Table 1, we can see that the over-represented motifs consist almost entirely of low complexity fragments that contain at least 3 occurrences of the same amino acid. We could not easily assign a functional role to such motifs. For example, there are no obvious SH3, SH2, or kinase binding sites on this list. However, a large number of amino acid repeats were found in these motifs.
Figure 1 Observed frequency of amino acid histogram similarity scores. The similarity score is scaled from 0 to 20 for convenience (i.e., 100% identical histograms would have a score of 20). Native refers to actual protein sequences and random to artificially generated sequences with the same overall amino acid composition and length distribution as native sequences. A) Data are shown for non-immune random, immune random, non-immune IDD, and immune IDD sets. B) Data are shown for random ordered and native ordered sequences.
Teraguchi et al. BMC Bioinformatics 2010, 11(Suppl 7):S7 http://www.biomedcentral.com/1471-2105/11/S7/S7

Clustering IDDs by amino acid composition
In order to characterize IDDs with similar amino acid compositions, we carried out hierarchical clustering by average linkage with respect to a Gaussian-based similarity score described in Methods (eqn. 2). The number of clusters was monitored as a function of the similarity cutoff threshold (data not shown) and all 4 sets showed a similar sensitivity to the cutoff value. A clustering cutoff value of 0.15 was chosen as few new clusters were formed for moderately larger values. Note that the clustering cutoff value represents the average value of eqn. 3 over the entire cluster. These clusters are used in the following comparison between innate immune and non-immune sets.
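As a rough illustration of the clustering step, the following plain-Python sketch performs average-linkage agglomerative clustering on a precomputed matrix of pairwise pseudo-distances. The function name `average_linkage`, the toy matrix, and the stopping rule (stop merging once the closest pair of clusters is farther apart than the cutoff) are assumptions made for illustration; the text does not specify exactly how the cutoff was applied.

```python
def average_linkage(dist, cutoff):
    """Agglomerative clustering that merges while the closest pair is within cutoff.

    dist is a symmetric matrix (list of lists) of pairwise pseudo-distances.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        d, x, y = min(
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d > cutoff:
            break  # assumed stopping rule: no pair is close enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy distance matrix: items 0 and 1 are close, items 2 and 3 are close.
toy = [
    [0.00, 0.05, 0.90, 0.90],
    [0.05, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.00, 0.10],
    [0.90, 0.90, 0.10, 0.00],
]
result = average_linkage(toy, cutoff=0.15)
```

The quadratic search for the closest pair at every step keeps the sketch short; for thousands of IDDs a library implementation would be a more practical choice.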

Innate immune and non-immune sets contain similar IDD clusters
In order to identify IDD clusters that appeared in both the immune and non-immune sets the Gaussian-based similarity score was generalized to compare IDD clusters, rather than pairs of IDD sequences as performed in the previous section. We defined the similarity between two amino acid frequencies in two clusters as the similarity between their average values and we set the square of the width of the Gaussian w 2 to the sum of the square of the standard deviations (eqn. 4). With this formulation of similarity, clusters from the immune set were paired with clusters from the non-immune set by maximizing the Gaussian similarity score. An example of two similar clusters in comparison with the background distribution of amino acid frequencies is shown in figure 4. The amino acid frequencies for the IDD clusters identified are listed in Table 2 showing differences in the abundance of amino acids across clusters.

Similar IDD clusters share similar Pfam domains
In order to assess whether similar IDD clusters are associated with similar ordered domains, we examined the co-occurrence of Pfam domains [13] in each pair of similar clusters as defined in the previous sub-section. A co-occurrence was considered significant if the Pfam domain occurred in both the immune and non-immune clusters in a given pair. We also restricted our analysis to Pfam domains that occurred more than once in either the immune or non-immune sets. The frequency of each of the Pfam domains in its respective cluster was much higher than that expected by chance (see Methods). A total of 51 Pfam domains satisfied the above criteria (Additional File 1), and the most significant results are listed in Table 3.
In order to estimate the significance of the observed number of shared Pfam domains identified by the histogram similarity score, we replaced the maximization of the Gaussian similarity score (used to match clusters in the innate immune set with clusters in the non-immune set) with a random pairing of clusters. There are many possible combinations of pairs, so we repeated the random pairing a total of 9000 times and obtained a background distribution of Pfam domain co-occurrence (figure 5). The maximum value in this exercise was 47, corresponding to a p-value of 0.01, based on direct integration of the frequency distribution. Therefore, we can say with a high degree of confidence that the co-occurrence of 51 Pfam domains is not due to chance (with a p-value << 0.01), and thus there is a bias for similar IDDs to be associated with specific Pfam domains. The majority of the Pfam domains listed in Additional File 1 are involved in transcription, signal transduction, or both. The high IDD content in mammalian transcription factors has been examined before [14]. Thus it is difficult to make a simple functional interpretation of the IDD clusters. However, the co-occurrence of Pfam domains indicates that a frequency distribution-based analysis is practically useful for suggesting possible biological or biochemical roles of un-annotated IDDs.
Although the number of co-occurring Pfam domains is higher than that expected by chance, there are, nevertheless, many Pfam domains that do not co-occur with a specific IDD cluster with high statistical significance.
Table 1 note: The criterion for including motifs in the table was that the ratio of observed over expected was higher in the immune set than in the non-immune set by a factor of at least 1.25.
Table 2 Amino acid frequency distributions of similar IDD clusters
Cluster            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Immune (0)        11   6   2   5   1   5  14   4   2   2  12   7   2   1   5  10   4   0   1   4
Non-Immune (99)    7   5   4   5   1   8  14   3   2   3  13  10   2   1   2   7   5   0   2   5
Immune (1)
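The random-pairing significance estimate described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy cluster identifiers and domain names are hypothetical, each immune cluster is assumed to draw its random partner independently, and the additional filters on Pfam domains described above are omitted.

```python
import random


def count_shared(pairs, immune_pfam, nonimmune_pfam):
    """Total number of Pfam domains found in both clusters of each pair."""
    return sum(len(immune_pfam[i] & nonimmune_pfam[j]) for i, j in pairs)


def random_pairing_background(immune_pfam, nonimmune_pfam, n_trials=9000, seed=0):
    """Shared-domain counts obtained when clusters are paired at random."""
    rng = random.Random(seed)
    nonimmune_ids = list(nonimmune_pfam)
    background = []
    for _ in range(n_trials):
        # Assumption: each immune cluster independently draws one partner.
        pairs = [(i, rng.choice(nonimmune_ids)) for i in immune_pfam]
        background.append(count_shared(pairs, immune_pfam, nonimmune_pfam))
    return background


def empirical_p_value(observed, background):
    """Fraction of random pairings that share at least `observed` domains."""
    return sum(c >= observed for c in background) / len(background)


# Toy data with hypothetical cluster ids and domain names.
immune_pfam = {0: {"domA", "domB"}, 1: {"domC"}}
nonimmune_pfam = {10: {"domA", "domB"}, 11: {"domC"}, 12: set()}
observed = count_shared([(0, 10), (1, 11)], immune_pfam, nonimmune_pfam)
background = random_pairing_background(immune_pfam, nonimmune_pfam)
p_value = empirical_p_value(observed, background)
```

When no random pairing reaches the observed count, this estimator returns 0, which should be read as an upper bound of roughly one over the number of trials rather than as an exact p-value.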

Discussion
In this study we carried out a comparison of IDDs at both the overall amino acid composition level and at the local sequence motif level. These two levels of comparison span a wide range and yet we observe similar trends at both extremes. Namely, individual IDD sequences are very different from artificially constructed sequences picked naively. This, in turn, might imply that there is strong selective pressure on IDDs, just as there is strong pressure on ordered domains; however, direct evidence for this interpretation is beyond the scope of the current study. In the case of ordered domains we can understand such pressure in terms of the structural and functional requirements. The resulting distribution of ordered protein sequences is a tradeoff between genetic drift, which tends toward randomization, and biochemical function, which tends to limit the observed amino acid sequences to a small subset of the possible random combinations. If we examine the distribution of sequence identities within a given fold, for example, we usually see two peaks (figure 6). One small peak is near 100% and contains the close family members. The other peak is broader and covers the "twilight zone" region from 0-30%. It is thus not unreasonable to hypothesize that a similar trade-off occurs for IDDs, and that the pressure in this case is due to the need for IDDs to be metastable, only becoming ordered upon binding a target protein. Understanding the exact role of specific IDDs will help to refine the interpretation of their compositional diversity.
Figure 5 Background distribution of Pfam domain co-occurrence. Instead of using the Gaussian similarity score to match clusters in the innate immune set and the generic disordered set, we inserted a random matching function. The resulting distribution clearly indicates that the number of co-occurring Pfam domains identified by Gaussian similarity is highly significant.
In the case of local sequence motifs, we observed a strong bias toward low complexity patterns. These patterns did not include obvious binding sites for kinases, SH2, or SH3. Therefore, it was not straightforward to assign a functional role to the motifs. However, the abundance of amino acid repeats in local sequence motifs in IDDs and the non-random nature of the IDDs together raise the possibility that the maintenance of these repeats may provide an additional restraint during the evolution of IDDs. Though the prevalence of amino acid repeats in disordered regions has been studied [15], the combined evolution of amino acid repeats and disordered regions needs to be investigated further. Moreover, our analysis did not reveal a systematic difference between the IDDs in the innate immune set and those in the generic IDD set.
Further, we also find that similar IDD clusters are associated with certain Pfam domains indicating possible functional roles for, and limitations on, the IDDs.

Conclusions
There were two motivations for the current study. The first was that IDDs could be clustered into sub-groups in order to allow a more fine-tuned assignment than merely "IDD" when assigning domains to uncharacterized sequences. The second was to associate these clusters with certain ordered domains to facilitate future annotation of IDD-containing protein sequences. In terms of these goals, we were modestly successful as the histogram-based method is efficient and appropriate to classify IDDs according to their most abundant amino acids. With regard to the association with ordered domains, we were also successful, as judged by the statistical significance of Pfam domains associated with the innate immune and generic IDD data sets. In terms of practical importance, the IDD clusters identified here should be of use in characterizing orphan sequences containing IDDs. The classification of IDDs remains an open problem. One interesting avenue of future work will be to examine the predicted structural and functional constraints in IDD evolution.
Figure 6 Sequence identity within ordered folds. The figure was constructed by picking 10 query domains at random, and calculating the sequence identity of all similar folds to the query as returned by the SeSAW structural alignment server [18].

Innate immune IDD set
Amino acid sequences corresponding to 1580 mouse genes potentially relevant to macrophage response to microbial stimulation were downloaded from the Innate Immune Database [16]. This list combines genes with significant expression changes under stimulation with lipopolysaccharides (LPS), and genes coding for proteins known to interact in the TNFα/NF-κB signaling pathway. A representative set of 1237 protein sequences was prepared using the cd-hit program [17] at 40% sequence identity. For each sequence, disordered regions were predicted using the Disopred2 program [1] and 1464 predicted disordered regions of length 30 or more were retained for analysis.
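The length filter on predicted disordered regions (30 residues or more) can be illustrated with a short sketch. The per-residue list of booleans used as input here is a hypothetical representation chosen for the example; it is not the Disopred2 output format, and the function name is likewise illustrative.

```python
from itertools import groupby


def disordered_segments(flags, min_length=30):
    """Return (start, end) half-open index pairs for long disordered runs.

    flags holds one boolean per residue, True where the residue is
    predicted to be disordered.
    """
    segments = []
    pos = 0
    for is_disordered, run in groupby(flags):
        length = len(list(run))
        # Keep only runs of disordered residues that meet the length cutoff.
        if is_disordered and length >= min_length:
            segments.append((pos, pos + length))
        pos += length
    return segments


# Toy prediction: one 40-residue disordered run and one 12-residue run.
flags = [False] * 5 + [True] * 40 + [False] * 10 + [True] * 12
segments = disordered_segments(flags)
```

In this toy case only the 40-residue run is retained; the 12-residue run falls below the cutoff and is dropped.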

Non innate immune IDD set
Amino acid sequences of 1663 human proteins were selected at random, excluding those in the innate immune set, and a representative set of 2171 disordered regions of length 30 or more was prepared as above.

Ordered set
The amino acid sequences for a set of 1999 representative structural domains were taken from Protein Data Bank (PDB) atom records.

Random sequence sets
Randomized versions of each of the above 3 sequence sets were prepared as follows. The length of each of the original sequences was stored, and the entire sequence set was concatenated into a single string and shuffled. Then, for each of the original sequence lengths, a random sequence was constructed by repeatedly picking an amino acid at random from the concatenated sequence and transferring it to the random sequence. In this way, the resulting set of randomized sequences has the same length distribution and the same overall amino acid composition as the original sequence set.
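A minimal sketch of this randomization is given below. It assumes that "transferring" a residue means drawing without replacement, so shuffling the pooled residues once and cutting the result into pieces of the original lengths gives an equivalent outcome. The function name and the toy sequences are illustrative only.

```python
import random


def randomize_sequence_set(sequences, seed=0):
    """Return shuffled sequences with the original lengths and pooled composition."""
    rng = random.Random(seed)
    pool = list("".join(sequences))  # concatenate every residue in the set
    rng.shuffle(pool)                # shuffle the pooled residues in place
    randomized = []
    start = 0
    for seq in sequences:
        # Cut a piece matching each original length; every residue is used
        # exactly once, so the overall composition is preserved.
        randomized.append("".join(pool[start:start + len(seq)]))
        start += len(seq)
    return randomized


# Toy sequences (hypothetical, for illustration only).
native = ["EEEEKKPS", "GGSPGG", "PPPSQQQE"]
shuffled = randomize_sequence_set(native)
```

Because residues are pooled across the whole set, the composition of any individual randomized sequence will generally differ from that of its native counterpart; only the set-wide composition and the lengths are held fixed.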

Similarity score for amino acid compositions
The frequency of a particular amino acid type a in an individual sequence i was given by the ratio of the number of occurrences of a in the sequence, $N_i(a)$, to the length of the sequence:

$$f_i(a) = \frac{N_i(a)}{\mathrm{len}(i)} \qquad (1)$$

The Gaussian similarity score for a pair of frequencies was given by eqn. 2.

[Equation 2 is not recoverable from the source text.]

The Gaussian term is always non-negative and evaluates to 1 for a perfect match between sequences i and j. The width of the Gaussian, $w$, is an adjustable parameter empirically set to 0.1 for the purpose of distributing most of the similarity values over the range 0.5-1. The exponential terms are weighted by the average frequency in order to give more emphasis to abundant amino acids. The denominator evaluates to 2 in every case so it is not actually necessary but we include it for completeness. Without this weighting term we found that zero counts were dominating the similarity score. Note that pseudo counts are not needed for the above score (i.e., for zero values in the histograms of short IDDs).

For clustering, it is convenient to convert the similarity to a pseudo distance (eqn. 3).

[Equation 3 is not recoverable from the source text.]

We also generalized eqn. 2 for comparing clusters of sequences. We defined the distance between two amino acid frequencies in two clusters as the distance between their average values, $|\langle f_i(a) \rangle - \langle f_j(a) \rangle|$, for clusters i and j, and we set the square of the width of the Gaussian, $w^2$, to the sum of the squares of the standard deviations, $w^2 = \sigma_i(a)^2 + \sigma_j(a)^2$. With these modifications, the Gaussian score becomes eqn. 4.

[Equation 4, and the equation and introductory text that precede the following definitions, are not recoverable from the source text.]

where $N^{p}_{\mathrm{tot}}$ is the total number of occurrences of Pfam domain p in the entire set of sequences, $N_c$ is the number of Pfam domains in cluster c, and the denominator is the sum of $N_c$ over all clusters (i.e., the total number of Pfam domains).
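Because the exact expression for the similarity score is not reproduced in this text, the sketch below implements one form that is consistent with the verbal description: each Gaussian term is weighted by $f_i(a) + f_j(a)$ and the sum is divided by the total weight, which equals 2 when both frequency vectors sum to 1. The exponent $-((f_i(a) - f_j(a))/w)^2$ and the function names are assumptions; the published score may differ, for example by a factor of 2 in the exponent.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq (eqn. 1)."""
    return {a: seq.count(a) / len(seq) for a in AMINO_ACIDS}


def gaussian_similarity(seq_i, seq_j, w=0.1):
    """Frequency-weighted Gaussian similarity; the functional form is assumed."""
    f_i = composition(seq_i)
    f_j = composition(seq_j)
    numerator = 0.0
    denominator = 0.0
    for a in AMINO_ACIDS:
        # Weight by the summed frequency so abundant residues dominate and
        # residues absent from both sequences contribute nothing.
        weight = f_i[a] + f_j[a]
        numerator += weight * math.exp(-((f_i[a] - f_j[a]) / w) ** 2)
        denominator += weight
    return numerator / denominator


# Toy sequences (hypothetical, for illustration only).
same = gaussian_similarity("EEKKPS", "EEKKPS")
different = gaussian_similarity("EEEE", "GGGG")
```

With this weighting, amino acids that are absent from both sequences have zero weight, which matches the remark that pseudo counts are not needed.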

Enumeration of short sequence motifs
Sequence motifs of length n were generated by explicitly enumerating all possible fragments. Given background amino acid frequencies p(a) the expectation value for a given motif was given by the product of the individual background frequencies and the total number of possible fragments in the dataset with given length N(n).
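The enumeration and expectation calculation can be sketched as follows. The sketch assumes that the background frequencies p(a) are estimated from the pooled residues of the same sequence set and that N(n) is the total number of length-n windows in that set; the function name and toy input are illustrative.

```python
import math
from collections import Counter


def motif_log_ratios(sequences, n, min_count=10):
    """ln(observed / expected) for each length-n fragment seen >= min_count times."""
    residues = Counter("".join(sequences))
    total = sum(residues.values())
    background = {a: c / total for a, c in residues.items()}  # p(a)

    # Count every overlapping window of length n in every sequence.
    observed = Counter()
    for seq in sequences:
        for k in range(len(seq) - n + 1):
            observed[seq[k:k + n]] += 1
    n_fragments = sum(observed.values())  # N(n)

    ratios = {}
    for motif, count in observed.items():
        if count < min_count:
            continue  # rare fragments are discarded, as in the text
        expected = n_fragments * math.prod(background[a] for a in motif)
        ratios[motif] = math.log(count / expected)
    return ratios


# Toy input: a Glu repeat followed by four other residues.
toy_sequences = ["E" * 12 + "KPSG"]
ratios = motif_log_ratios(toy_sequences, n=2)
```

In the toy input, "EE" is observed 11 times against an expectation of about 8.4, giving a positive log ratio, while the four fragments seen only once are filtered out.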