Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila

Background Compositionally biased (CB) regions are stretches in protein sequences made from mainly a distinct subset of amino acid residues; such regions are frequently associated with a structural role in the cell, or with protein disorder. Results We derived a procedure for the exhaustive assignment and classification of CB regions, and have applied it to thirteen metazoan proteomes. Sequences are initially scanned for the lowest-probability subsequences (LPSs) for single amino-acid types; subsequently, an exhaustive search for lowest probability subsequences (LPSs) for multiple residue types is performed iteratively until convergence, to define CB region boundaries. We analysed > 40,000 CB regions with > 20 million residues; strikingly, nine single-/double- residue biases are universally abundant, and are consistently highly ranked across both vertebrates and invertebrates. To home in subpopulations of CB regions of interest in human and D. melanogaster, we analysed CB region lengths, conservation, inferred functional categories and predicted protein disorder, and filtered for coiled coils and protein structures. In particular, we found that some of the universally abundant CB regions have significant associations to transcription and nuclear localization in Human and Drosophila, and are also predicted to be moderately or highly disordered. Focussing on Q-based biased regions, we found that these regions are typically only well conserved within mammals (appearing in 60–80% of orthologs), with shorter human transcription-related CB regions being unconserved outside of mammals; they are also preferentially linked to protein domains such as the homeodomain and glucocorticoid-receptor DNA-binding domain. In general, only ~40–50% of residues in these human and Drosophila CB regions have predicted protein disorder. Conclusion This data is of use for the further functional characterization of genes, and for structural genomics initiatives.


Background
Compositional bias for a subset of residues is a widespread phenomenon in protein sequences; it has historically been linked to proteins having a structural role, or displaying some intrinsic protein disorder [1][2][3]. Many types of compositionally-biased (CB) region are masked as low-complexity sequence during protein sequence alignment, as a matter of course [4][5][6][7][8], since failure to mask such sequences can lead to a false assumption of evolutionary relatedness. The most commonly used of these masking programs, SEG [7], assesses sequence entropy using user-defined input parameters determining the granularity of the sequence masking.
Previous analysis of compositional bias has focused on single-residue biases, and homopolymeric runs [9][10][11]. Algorithms that can derive CB regions for multiple residue types have also been developed [6,8]. Here, for the first time, we have derived an exhaustive assignment of CB regions made from multiple residues types, in complete proteomes, substantially developing and expanding the scope of our bias analysis algorithm [6]. The present concept of compositional bias has been developed to enable the assignment and exhaustive analysis of biases for multiple residue types, built up from an initial detection of single-residue biases, in a way that is independent of window-lengths, or similar user-defined parameters. We find that a short list of biases is universally abundant in the metazoan proteomes examined, along with some notable relative species-specific abundances. For human and fruitfly, CB regions are analysed for conservation, length, functional linkages, and predicted protein disorder content. Some of the universally abundant biases are linked to nuclear localization and transcription in Human and/or Drosophila.

Some biases are universally abundant in metazoans
Over 40,000 CB regions in thirteen metazoan proteomes were assigned using the procedures described in Methods. Briefly, protein sequences are initially scanned for the lowest-probability subsequences (LPSs) for single aminoacid types; subsequently, an exhaustive search for lowest probability subsequences (LPSs) for multiple residue types is performed iteratively until convergence, to define CB region boundaries. A CB region is labelled with a CB signature (denoted {abc...} where a, b, c, ... are the residue types that it comprises, in decreasing order of significance). Each CB region has an associated P min value. Any region with an initial strong bias for residue type a, and any number of other subsidiary biases is denoted {a(X) n }. It is important to note that these P-values are only meaningful in a relative sense; the process of probability minimization provides a way to define boundaries for regions comprising complex compositional biases, that are distributed or mingled over the length of a particular subsequence.
What are the most consistently abundant biases across all of the metazoan proteomes? To answer this question, for each proteome, each bias type was ranked in decreasing order of abundance. Then, across all of the proteomes, the mean of this ranking was calculated, as well as the number of times the bias types occurred in the top ten of rankings. The twenty-five bias types with the smallest mean ranking values are listed in Table 1. Strikingly, nine single-and  double-residue biases are consistently highly ranked in  these proteomes: {C}, {P}, {GP}, {Q}, {ED}, {G}, {E},  {S}, {H} and {T} occur in the top ten of at least six species, both vertebrate and invertebrate (Tables 1 and 2). Some abundant species-specific biases stand out, e.g., {Q} regions are most abundant in the fruitfly (Table 2), when compared to all the other proteomes, and, in combination with {QH} regions (the second most prevalent bias in fruitfly) and {QPH} regions, comprise 13% of all the CB regions in that organism. These CB regions will be discussed in more detail below.
Other examples of abundant species-specific biases may be indicative of spurious gene predictions. Examination of examples of the many {HT} and {CV} regions found in the two puffer-fish proteomes ( Table 2), indicates that they arise from genome regions with simple repeats, and typically have poorly predicted introns; these thus may arise from systematic errors in gene prediction. * Mean rank is simply calculated from averaging over rankings (in decreasing order of abundance) for all thirteen proteomes. ** Number of times the bias appears in the top 10 (with the number of proteomes this bias occurs in, in brackets). *** The types of CB region have been ranked in increasing order of mean rank for the human proteome.
Although many of the most abundant biases across the metazoans are made from either one or two residue types, most biased regions are comprised of a larger number of residues, with a broad mode from about 3 to 5 residue types. This is illustrated for the human proteome ( Figure  1). More than a quarter (~27%) of the human CB regions have signatures of ≥ 6 residue types; this is because the bias assignment algorithm can detect CB regions that are composed of multiple milder single-residue biases. (An example of such a region is given in Figure 7(C) below.)

Functional biases and predicted protein disorder content of the top ten biases in human and Drosophila
Obviously, these bias prevalences represent many diverse types of protein subsequence; therefore, to pick out specific subpopulations that are of interest, we need to perform some further characterizations. To this end, for the CB regions in both the human and Drosophila proteomes, after filtering for coiled coils and known protein struc-tures, we examined: (i) significant functional associations based on Gene Ontology (GO) categories and terms; (ii) predicted protein disorder content (using the program DISOPRED [12]); (iii) CB region length; (iv) CB region conservation. We focus specifically on Q-based and Ebased biases, as specific examples.
Tables 3 and 4 show that most of the top ten biases (6/10 for both human and Drosophila) come from the 'universally prevalent' list; some of these have significant associations with transcriptional functional categories and with nuclear localization. These CB regions also have moderate to high predicted protein disorder contents (D valuẽ 0.4-0.8) (Tables 3 and 4). The D value is the fraction of the CB region that is predicted to be disordered by the program DISOPRED [12].
For example, {ED} regions in human have significant associations to 'nucleus' and 'DNA-dependent regulation of transcription', and are on average predicted to be moderately disordered (mean D values of 0.56) ( Table 3). {Q} regions (in both Drosophila and human) and {QH} regions (in Drosophila only) have similar functional associations, and are predicted to be moderately to highly disordered (D ~0.4-0.8) (Tables 3 and 4).
Additionally, we separated GO terms into those that are transcription-associated and those that are not (see Methods for details). Then, using these two 'supercategories', we tested for significant association with the transcription supercategory for each CB region type. For both human and Drosophila, the CB regions that demonstrate such a significant association with the transcription supercategory, also have significant association to individual GO terms linked to transcription (Tables 3 and 4).

Further analysis of nuclear-/transcription-related biases GO and protein domain associations for the largest CB region grouping, {Q(X) n }
Since {Q} regions, and {Q(X) n } in general, represent the most numerous CB region grouping in either human or Drosophila, we examined the top twenty significant GO assignments for {Q(X) n } regions in more detail for Drosophila and Human, as well as for Rat and Mouse (Table  5). Noticeably, across Drosophila and the three mammals, 'DNA-dependent regulation of transcription', 'transcription factor activity' and 'nucleus' are all highly-ranked functional associations. Similar prevalences are observed for abundant GO terms, if all {Q}+{QH}+{QPH} regions are analyzed in the same way (not shown).
The {Q(X) n } grouping is also sufficiently numerous that we can count up the most frequently associated globular domains (i.e., domains that are in the same sequences) ( Table 6). The most commonly associated domain in both Human and Drosophila is the 'DNA/RNA-binding three-helical bundle', chiefly arising from the 'Homeodomain-like' superfamily. This domain was first found in Drosophila homeotic genes, and occurs widely in transcription factors; related domains are also used in other DNAbinding proteins, such as telomeric proteins, recombinases, etc.
Number of bias residue types per CB region in the human proteome Figure 1 Number of bias residue types per CB region in the human proteome. The number of bias residue types per CB region is binned in a bar chart (x-axis). The total occurrences for each 'number of bias residue types' is on the y-axis.

CB region length
In general, the nuclear-/transcription-related biases show a mode in region length at 20-40 residues. This is shown specifically for {QH} regions in Figure 2. A similar fall-off is observed for the distribution for the subset of {QH} regions that are labelled in the GO classification as associated with 'transcription' or localization in the 'nucleus'. A 'blow-up' of the overall {QH} histogram ( Figure 3) demonstrates that these regions are not adequately analysed simply as homopolymeric tracts. The subsidiary nature of the H component of the bias is evident, as it is interspersed with longer homopolymeric runs of Q.

Conservation
As case studies, we examined the conservation of {Q(X) n } and {E(X) n } regions in other metazoans, relative to human. Orthologs of proteins were determined with the bi-directional best hits approach, using BLASTP [13] (evalue ≤ 0.0001 with alignment over 0.6 of the length of both sequence, both with and without masking compositionally biased parts). We analysed the fraction of orthologs that maintain a biased region of the same character ({Q(X) n } or {E(X) n }) (Table 7). Generally, these regions (filtered for coiled coils), show high conservation in orthologs from other mammals (60-80% depending on criteria), and low conservation in invertebrates (0-50%) (Table 7). Obviously, these numbers broadly cover a diverse set of CB regions; visual curation reveals that shorter {Q(X) n } and {E(X) n } CB regions consisting of short homopolymeric runs of {Q} are not conserved from human to invertebrates, and that all of the regions that are conserved are longer (> ~90 residues). Indeed, this lack of conservation in invertebrates is also evident when one examines specifically the {Q}+{Q}+{QPH} and {ED}+{E} subsets (Table 7). A multiple alignment of FOXP2, a gene important in language in humans, is illustrated as an example of conservation of a {Q} region defined in vertebrate proteomes ( Figure 4).

Predicted protein disorder -general observations
Prediction of protein disorder has recently been the focus of much research activity [1,12,14]. Such regions present a challenge for further proteome-scale experimental characterization. We analyzed the predicted protein disorder content of the human and Drosophila CB regions, using the program DISOPRED [12]. In summed total (simply adding up the total amounts of residues), the human CB region data is predicted to be ~42% disordered, with a similar value observed for the fruitfly (45%). This compares to 17% (human) and 15% (fruitfly) for the whole proteomes of these organisms, indicating a strong relationship between the defined CB regions and predicted protein disorder. However, most predicted protein-disorder is not defined as compositionally biased (67% of pre- dicted protein disorder regions ≥ 20 residues in human, and 72% in fruitfly). Figure 5 shows that distribution of the fraction of disorder (denoted D) predicted for each CB region for human and fruitfly, is approximately uniform; a wide diversity of predicted protein disorder contents is also illustrated by plots of D versus CB region length (shown for human in Figure 6).
We examined the inferred cellular compartment for the CB regions, divided into four different groupings according to their D values, and then calculated propensities to have these compartments for each disorder grouping (   [14]; regulation of transcription from RNA polymerase II promoter (3 × 10 -6 ) GO:0006333 [11]; chromatin (dis)assembly (4 × 10 -10 ) GO:0003700 [10]; transcription factor activity (2 × 10 -2 ) (*) -The CB regions are sorted in decreasing order of abundance. They are denoted by their CB signatures in column #1. Column #2 contains the total number of members in a particular cluster; column #3 is the mean value of D, the disorder fraction of each member; column #4 lists the top five significantly-associated (P' ≤ 0.05) GO (Gene Ontology terms for the cluster in the format: GO term name [count for GO term]; description of GO term in words. In addition, bias types that are significantly associated with transcription (where we reduced GO categories to just two categories, 'transcription-related' and 'non-transcription-related'), are labelled with a † sign. coiled coils and known protein structures, and examined significant functional associations, predicted protein disorder content (using the program DISOPRED [12]), CB region length, and conservation in Human and Drosophila. We found that some of the universally prevalent biases in metazoans are significantly associated with transcription regulation and nuclear localization in human and/ or Drosophila. Furthermore, the CB regions identified are not necessarily contiguous with predicted disordered domains (only 40-50% of the residues in these regions are also in predicted disordered regions).
The CB assignment data presented here will be of further use to home in on functional associations. Furthermore, this classification will also help to delineate systematic errors in genome annotation, such as likely false-positive protein motif matches, or subsets of spurious gene predictions (as noted above for the two puffer fish genomes).
The CB data can also be used for further characterization of subtypes of protein disorder [15]. It is also useful for informing strategies in structural genomics projects, since such projects rely on the correct parsing of domains and subsequences. Further data relating to the analysis in this paper is available from the author. ). The total combined amino-acid composition of all of these proteomes (*) For each bias grouping, the total number of regions is listed, followed by the (total number conserved with the bias region/total number conserved) (percentage in brackets) for each of the proteomes: mouse, rat, chicken, C. elegans and Drosophila (fruitfly), relative to Human. For Drosophila, the 'reverse' conservation is also listed. A 'blow-up' of the overall distribution of {QH} region lengths was calculated, and used as the standard for all subsequent calculations. CB assignment was performed using a development of the algorithm previously described for classification of regions with single-residue biases (Harrison and Gerstein, 2003). The assignment of CB regions comprises two steps: (i) initial search for single-residue LPSs, and (ii) iterative build-up of multiple-residue biases until convergence, i.e., until no lower probability subsequence for a given set of bias residues can be found.

Exhaustive assignment of CB regions
Example of conservation of {Q} region in vertebrates: FOXP2 and its orthologs Figure 4 Example of conservation of {Q} region in vertebrates: FOXP2 and its orthologs. A multiple alignment is shown for FOXP2 and its orthologs on other vertebrates, made using the MUSCLE program [21]; the {Q} region is highlighted in red if its P-value was high enough to be included in the present analysis; otherwise, it is highlighted in green.
The fraction of predicted disorder (denoted D in the text) is binned as a bar chart for both the human and fruitfly proteomes Figure 5 The fraction of predicted disorder (denoted D in the text) is binned as a bar chart for both the human and fruitfly proteomes. The bin p-q contains all values D, such that p ≤ D <q. The proportion of occurrences in each bin is given on the y-axis.
Plot of the D value versus the length of a CB region for the human proteome Figure 6 Plot of the D value versus the length of a CB region for the human proteome.

(i) Initial search for single-residue lowest probability subsequences (LPSs)
We searched for biased regions for each of the 20 aminoacid types as described previously (Harrison and Gerstein, 2003). For each amino-acid type x, and for the range of window sizes (20 ≤ w ≤ 2,500 residues), we search each protein sequence for stretches that have compositional bias of the lowest probability (P min ): where i is each possible start position for a window w in the sequence. The probability P bias (i,w) in equation (1) is given by a binomial distribution: Examples of assigned CB regions Figure 7 Examples of assigned CB regions. In each case, the name of the protein, its current Ensembl identifier, its CB signature and P min value are indicated. The CB region is in bold and underlined; the rest of the sequence is in plain text. The proteins are as follows: (A) leukosialin from the human protein, (B) and unnamed fruitfly protein and (C) an unnamed chicken protein.
where f x is the proportion of amino-acid type x as given by the total combined composition of all of the proteomes. The count for x is denoted n in the window w starting at position i. Sequence stretches with P min are termed LPSs (Lowest Probability Subsequences), as they have the smallest P bias values for a given residue type and protein sequence.
(ii) Iterative build-up of multiple-residue biases The procedure described in (i) was generalized to calculate biases derived from any number of residue types exhaustively for a given protein sequence, as follows. P min values are calculated for any set of amino acids {xyz...}, by summing up the number of residues over the whole residue-type set; however, they only picked in preference over a previously-calculated bias made by a smaller number of residue types, if their P min values are smaller. The set of residue types contributing to the bias (sorted in decreasing order of their original P min values), is defined as the CB signature.
The build-up of multiple-residue biases is performed as follows. For each protein sequence, all single-residue LPSs are sorted in decreasing order of P min . These initial sorted single-residue LPSs thus have a single-letter CB signature. Then, iteratively until convergence, for each LPS, the list of LPSs of higher P min value is searched to check for mutual overlap > 10 residues between the two regions. For all such overlapping pairs, the LPS for the combined residuetype set is calculated, and a new CB signature is derived if the combined P min is smaller. This procedure is performed iteratively until convergence. Using this procedure, regions that comprise mild bias for multiple residue types can be detected as significantly biased. Three examples of CB regions defined using the above procedure are shown in Figure 7; the first example (A) is a {TPSM} region in leukosialin from the human proteome, the second (B) is a {QPH} region from an un-named protein in the fruitfly, and the third (C) is an un-named protein from chicken which has a {AQTVISLPN} region N-terminal to a POU transcription factor domain. This last example demonstrates how the algorithm can detect a biased region that is composed of many mild, single-residue biases.

Classification of CB regions
To classify CB regions across a whole proteome, suitable thresholds for P min must be derived for deciding on inclusion in the analysis. P min thresholds were derived as follows. Longer protein sequences can have more significantly biased subsequences. To allow for this sequence length -dependent effect, we calculated a sequence length -dependent P min threshold. For a random sample of 10,000 protein sequences, P min for the most biased subsequence was plotted against sequence length on a log-log scale. To extract the relationship of sequence length with P min for this data, a line was fitted (significant r 2 value = 0.1, P < 0.001). Then, the intercept of this line was decreased until just 10% of protein sequences had CB regions picked for inclusion in the data set.
So that the smallest sequences do not have unreasonably high threshold values, the P min value was calculated at which 10% of all of the protein sequences in a proteome would have a CB region assigned to them. This second sequence-length-independent threshold P min value was used, where it was smaller than the sequence-lengthdependent value. Using percentages of sequences in the range 5% to 15% to calculate these threshold P min values does not qualitatively change the main observations reported in the paper.

CB signatures
All regions that have the same CB signature were grouped together. To allow for small differences in the order of recruitment to longer CB signatures, in some cases, we also analysed permutations of CB signatures (e.g., {xzy} and {xyz} are such permutations).

Sequence annotations
Annotation of protein disorder was performed using DIS-OPRED [12], using default parameters trained to give a 5% false positive rate. The total fraction of predicted protein disorder in a CB region is given by the D value. Coiled coils were identified with the program MULTICOIL [17], using default parameters. Known protein domains were assigned using the ASTRAL 40% identity protein domain sequence set, and BLAST using e-value ≤ 0.01 [13,18]. Types of biased region that map to repetitive Zinc-fingercontaining proteins (> 0.5 of the length of the protein) were numerous and were additionally filtered out.
GO (Gene Ontology; [19]) functional categories were taken from the annotation files provided on the Ensembl [16] and Gene Ontology [20] websites. Further GO term annotations were derived by mapping functional GO annotations for the PDB (downloaded from [20]) onto Ensembl protein annotations, using 50% sequence identity and 0.8 fractional sequence coverage (for the protein domain) as thresholds, using alignment made by the program BLASTP (e-value ≤ 0.0001) [13]. These thresholds were benchmarked on the complete SCOP protein domain sequence database [18], to give a 2% false positive rate for GO term transfer. Significant associations between GO terms and lists of protein sequences we calculated using binomial statistics, and a P'-value threshold of 0.05, where P' has been adjusted to account for multiple hypothesis testing, using the Bonferroni correction. In addition we used two functional supercategories, wherein all transcription-associated and non-transcription-associated GO terms were pooled together. The transcription-