  • Research article
  • Open access

Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila



Compositionally biased (CB) regions are stretches in protein sequences composed mainly of a distinct subset of amino acid residues; such regions are frequently associated with a structural role in the cell, or with protein disorder.


We derived a procedure for the exhaustive assignment and classification of CB regions, and have applied it to thirteen metazoan proteomes. Sequences are initially scanned for the lowest-probability subsequences (LPSs) for single amino-acid types; subsequently, an exhaustive search for LPSs for multiple residue types is performed iteratively until convergence, to define CB region boundaries. We analysed > 40,000 CB regions with > 20 million residues; strikingly, nine single-/double-residue biases are universally abundant, and are consistently highly ranked across both vertebrates and invertebrates. To home in on subpopulations of CB regions of interest in human and D. melanogaster, we analysed CB region lengths, conservation, inferred functional categories and predicted protein disorder, and filtered for coiled coils and protein structures. In particular, we found that some of the universally abundant CB regions have significant associations with transcription and nuclear localization in human and Drosophila, and are also predicted to be moderately or highly disordered. Focussing on Q-based biased regions, we found that these regions are typically only well conserved within mammals (appearing in 60–80% of orthologs), with shorter human transcription-related CB regions being unconserved outside of mammals; they are also preferentially linked to protein domains such as the homeodomain and glucocorticoid-receptor DNA-binding domain. In general, only ~40–50% of residues in these human and Drosophila CB regions have predicted protein disorder.


This data is of use for the further functional characterization of genes, and for structural genomics initiatives.


Compositional bias for a subset of residues is a widespread phenomenon in protein sequences; it has historically been linked to proteins having a structural role, or displaying some intrinsic protein disorder [1–3]. Many types of compositionally-biased (CB) region are masked as low-complexity sequence during protein sequence alignment, as a matter of course [4–8], since failure to mask such sequences can lead to a false assumption of evolutionary relatedness. The most commonly used of these masking programs, SEG [7], assesses sequence entropy using user-defined input parameters determining the granularity of the sequence masking.

Previous analysis of compositional bias has focused on single-residue biases, and homopolymeric runs [9–11]. Algorithms that can derive CB regions for multiple residue types have also been developed [6, 8]. Here, for the first time, we have derived an exhaustive assignment of CB regions made from multiple residue types, in complete proteomes, substantially developing and expanding the scope of our bias analysis algorithm [6]. The present concept of compositional bias has been developed to enable the assignment and exhaustive analysis of biases for multiple residue types, built up from an initial detection of single-residue biases, in a way that is independent of window-lengths, or similar user-defined parameters. We find that a short list of biases is universally abundant in the metazoan proteomes examined, along with some notable relative species-specific abundances. For human and fruitfly, CB regions are analysed for conservation, length, functional linkages, and predicted protein disorder content. Some of the universally abundant biases are linked to nuclear localization and transcription in human and/or Drosophila.

Results & discussion

Some biases are universally abundant in metazoans

Over 40,000 CB regions in thirteen metazoan proteomes were assigned using the procedures described in Methods. Briefly, protein sequences are initially scanned for the lowest-probability subsequences (LPSs) for single amino-acid types; subsequently, an exhaustive search for LPSs for multiple residue types is performed iteratively until convergence, to define CB region boundaries. A CB region is labelled with a CB signature (denoted {abc...}, where a, b, c, ... are the residue types that it comprises, in decreasing order of significance). Each CB region has an associated Pmin value. Any region with an initial strong bias for residue type a, and any number of other subsidiary biases, is denoted {a(X)n}. It is important to note that these P-values are only meaningful in a relative sense; the process of probability minimization provides a way to define boundaries for regions comprising complex compositional biases that are distributed or mingled over the length of a particular subsequence.

What are the most consistently abundant biases across all of the metazoan proteomes? To answer this question, for each proteome, each bias type was ranked in decreasing order of abundance. Then, across all of the proteomes, the mean of this ranking was calculated, as well as the number of times the bias types occurred in the top ten of rankings. The twenty-five bias types with the smallest mean ranking values are listed in Table 1. Strikingly, nine single- and double-residue biases are consistently highly ranked in these proteomes: {C}, {P}, {GP}, {Q}, {ED}, {G}, {E}, {S}, {H} and {T} occur in the top ten of at least six species, both vertebrate and invertebrate (Tables 1 and 2).
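The ranking step described above can be sketched as follows. This is a minimal illustration, not the code used for Table 1: the function name `rank_biases`, the input layout, the alphabetical tie-breaking, and the choice to average only over proteomes in which a bias occurs are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def rank_biases(counts_by_proteome, top_n=10):
    """counts_by_proteome: {proteome: {signature: number_of_CB_regions}}.
    Returns {signature: (mean_rank, times_in_top_n)}."""
    ranks = defaultdict(list)
    top_hits = Counter()
    for counts in counts_by_proteome.values():
        # Rank 1 is the most abundant bias; ties are broken alphabetically
        # (an assumption, the text does not say how ties were handled).
        ordered = sorted(counts, key=lambda sig: (-counts[sig], sig))
        for rank, sig in enumerate(ordered, start=1):
            ranks[sig].append(rank)
            if rank <= top_n:
                top_hits[sig] += 1
    # Mean rank is taken over the proteomes where the bias occurs (assumption).
    return {sig: (sum(r) / len(r), top_hits[sig]) for sig, r in ranks.items()}

# Toy counts for illustration only (not data from the paper).
toy = {"human": {"Q": 5, "P": 9, "C": 1}, "fly": {"Q": 12, "P": 3, "C": 2}}
summary = rank_biases(toy, top_n=1)
```

Sorting the returned dictionary by mean rank would give a list in the spirit of Table 1.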

Table 1 Universally abundant compositional biases ***
Table 2 Top biases for the thirteen metazoan proteomes (*)

Some abundant species-specific biases stand out, e.g., {Q} regions are most abundant in the fruitfly (Table 2), when compared to all the other proteomes, and, in combination with {QH} regions (the second most prevalent bias in fruitfly) and {QPH} regions, comprise 13% of all the CB regions in that organism. These CB regions will be discussed in more detail below.

Other examples of abundant species-specific biases may be indicative of spurious gene predictions. Examination of examples of the many {HT} and {CV} regions found in the two puffer-fish proteomes (Table 2) indicates that they arise from genome regions with simple repeats, and typically have poorly predicted introns; these thus may arise from systematic errors in gene prediction.

Although many of the most abundant biases across the metazoans are made from either one or two residue types, most biased regions are comprised of a larger number of residues, with a broad mode from about 3 to 5 residue types. This is illustrated for the human proteome (Figure 1). More than a quarter (~27%) of the human CB regions have signatures of ≥ 6 residue types; this is because the bias assignment algorithm can detect CB regions that are composed of multiple milder single-residue biases. (An example of such a region is given in Figure 7(C) below.)

Figure 1

Number of bias residue types per CB region in the human proteome. The number of bias residue types per CB region is binned in a bar chart (x-axis). The total occurrences for each 'number of bias residue types' is on the y-axis.

Functional biases and predicted protein disorder content of the top ten biases in human and Drosophila

Obviously, these bias prevalences represent many diverse types of protein subsequence; therefore, to pick out specific subpopulations that are of interest, we need to perform some further characterizations. To this end, for the CB regions in both the human and Drosophila proteomes, after filtering for coiled coils and known protein structures, we examined: (i) significant functional associations based on Gene Ontology (GO) categories and terms; (ii) predicted protein disorder content (using the program DISOPRED [12]); (iii) CB region length; (iv) CB region conservation. We focus specifically on Q-based and E-based biases, as specific examples.

Tables 3 and 4 show that most of the top ten biases (6/10 for both human and Drosophila) come from the 'universally prevalent' list; some of these have significant associations with transcriptional functional categories and with nuclear localization. These CB regions also have moderate to high predicted protein disorder contents (D value ~0.4–0.8) (Tables 3 and 4). The D value is the fraction of the CB region that is predicted to be disordered by the program DISOPRED [12].

Table 3 Most abundant CB regions in Human and their significant functional associations and predicted protein disorder (*)
Table 4 Top Ten Biases for Fruitfly, and their significant functional associations and protein disorder values (*)

For example, {ED} regions in human have significant associations to 'nucleus' and 'DNA-dependent regulation of transcription', and are on average predicted to be moderately disordered (mean D values of 0.56) (Table 3). {Q} regions (in both Drosophila and human) and {QH} regions (in Drosophila only) have similar functional associations, and are predicted to be moderately to highly disordered (D~0.4–0.8) (Tables 3 and 4).

Additionally, we separated GO terms into those that are transcription-associated and those that are not (see Methods for details). Then, using these two 'supercategories', we tested for significant association with the transcription supercategory for each CB region type. For both human and Drosophila, the CB regions that demonstrate such a significant association with the transcription supercategory, also have significant association to individual GO terms linked to transcription (Tables 3 and 4).

Further analysis of nuclear-/transcription-related biases

GO and protein domain associations for the largest CB region grouping, {Q(X)n}

Since {Q} regions, and {Q(X)n} in general, represent the most numerous CB region grouping in either human or Drosophila, we examined the top twenty significant GO assignments for {Q(X)n} regions in more detail for Drosophila and Human, as well as for Rat and Mouse (Table 5). Noticeably, across Drosophila and the three mammals, 'DNA-dependent regulation of transcription', 'transcription factor activity' and 'nucleus' are all highly-ranked functional associations. Similar prevalences are observed for abundant GO terms, if all {Q}+{QH}+{QPH} regions are analyzed in the same way (not shown).

Table 5 Most abundant GO terms for {Q(X)n} CB regions in the fruitfly, mouse, rat and human proteomes *

The {Q(X)n} grouping is also sufficiently numerous that we can count up the most frequently associated globular domains (i.e., domains that are in the same sequences) (Table 6). The most commonly associated domain in both Human and Drosophila is the 'DNA/RNA-binding three-helical bundle', chiefly arising from the 'Homeodomain-like' superfamily. This domain was first found in Drosophila homeotic genes, and occurs widely in transcription factors; related domains are also used in other DNA-binding proteins, such as telomeric proteins, recombinases, etc.

Table 6 Associated SCOP domains for {Q(X)n} regions in Human and Fruitfly (*)

CB region length

In general, the nuclear-/transcription-related biases show a mode in region length at 20–40 residues. This is shown specifically for {QH} regions in Figure 2. A similar fall-off is observed for the distribution for the subset of {QH} regions that are labelled in the GO classification as associated with 'transcription' or localization in the 'nucleus'. A 'blow-up' of the overall {QH} histogram (Figure 3) demonstrates that these regions are not adequately analysed simply as homopolymeric tracts. The subsidiary nature of the H component of the bias is evident, as it is interspersed with longer homopolymeric runs of Q.

Figure 2

Distribution of lengths of {QH} regions in D. melanogaster. There are two histograms: the overall distribution (red bars), and the nuclear- or transcription-related proteins (blue bars). The nuclear- and transcription-related proteins have been compiled by grouping together all proteins that have been assigned one of the GO terms that has been adjudged transcription-related (See main text for details).

Figure 3

A 'blow-up' of the overall distribution of {QH} region lengths. The {QH} regions are listed horizontally in order of increasing length; Q residues are coloured red and H residues green, with other residues in black.


As case studies, we examined the conservation of {Q(X)n} and {E(X)n} regions in other metazoans, relative to human. Orthologs of proteins were determined with the bi-directional best hits approach, using BLASTP [13] (e-value ≤ 0.0001 with alignment over 0.6 of the length of both sequences, both with and without masking compositionally biased parts). We analysed the fraction of orthologs that maintain a biased region of the same character ({Q(X)n} or {E(X)n}) (Table 7). Generally, these regions (filtered for coiled coils) show high conservation in orthologs from other mammals (60–80% depending on criteria), and low conservation in invertebrates (0–50%) (Table 7). Obviously, these numbers broadly cover a diverse set of CB regions; visual curation reveals that shorter {Q(X)n} and {E(X)n} CB regions consisting of short homopolymeric runs of {Q} are not conserved from human to invertebrates, and that all of the regions that are conserved are longer (> ~90 residues). Indeed, this lack of conservation in invertebrates is also evident when one examines specifically the {Q}+{QH}+{QPH} and {ED}+{E} subsets (Table 7). A multiple alignment of FOXP2, a gene important in language in humans, is illustrated as an example of conservation of a {Q} region defined in vertebrate proteomes (Figure 4).

Table 7 Conservation of {Q(X)n} and {E(X)n} biased regions (*)
Figure 4

Example of conservation of {Q} region in vertebrates: FOXP2 and its orthologs. A multiple alignment is shown for FOXP2 and its orthologs in other vertebrates, made using the MUSCLE program [21]; the {Q} region is highlighted in red if its P-value was significant enough to be included in the present analysis; otherwise, it is highlighted in green.

Predicted protein disorder – general observations

Prediction of protein disorder has recently been the focus of much research activity [1, 12, 14]. Such regions present a challenge for further proteome-scale experimental characterization. We analyzed the predicted protein disorder content of the human and Drosophila CB regions, using the program DISOPRED [12]. In summed total (simply adding up the total amounts of residues), the human CB region data is predicted to be ~42% disordered, with a similar value observed for the fruitfly (45%). This compares to 17% (human) and 15% (fruitfly) for the whole proteomes of these organisms, indicating a strong relationship between the defined CB regions and predicted protein disorder. However, most predicted protein disorder is not defined as compositionally biased (67% of predicted protein disorder regions ≥ 20 residues in human, and 72% in fruitfly). Figure 5 shows that the distribution of the fraction of disorder (denoted D) predicted for each CB region, for human and fruitfly, is approximately uniform; a wide diversity of predicted protein disorder contents is also illustrated by plots of D versus CB region length (shown for human in Figure 6).

Figure 5

The fraction of predicted disorder (denoted D in the text) is binned as a bar chart for both the human and fruitfly proteomes. The bin p-q contains all values D, such that p ≤ D < q. The proportion of occurrences in each bin is given on the y-axis.

Figure 6

Plot of the D value versus the length of a CB region for the human proteome.

Figure 7

Examples of assigned CB regions. In each case, the name of the protein, its current Ensembl identifier, its CB signature and Pmin value are indicated. The CB region is in bold and underlined; the rest of the sequence is in plain text. The proteins are as follows: (A) leukosialin from the human proteome, (B) an unnamed fruitfly protein, and (C) an unnamed chicken protein.

We examined the inferred cellular compartment for the CB regions, divided into four different groupings according to their D values, and then calculated propensities to have these compartments for each disorder grouping (Table 8). For human, biased regions have a propensity to be nuclear if D> 0.25, and to be nuclear regardless of D value for the fruitfly. Also, for very high disorder values (D> 0.75), there is significant linkage to both nuclear and cytoplasmic compartments for both human and fruitfly.

Table 8 Cellular compartments for proteins with CB regions with different D values (*)


We have derived a method for assignment of compositionally-biased regions and have applied it consistently to the proteomes of thirteen metazoans. We found that a number of biases are universally abundant in metazoans ({P}, {Q}, {GP}, {C} and {ED}), but that there are also some interesting species-specific tendencies, such as the large proportion of {Q}, {QH}, {QHP} and {QPH} regions in the fruitfly proteome. To delineate subpopulations of CB regions of particular interest, we filtered for coiled coils and known protein structures, and examined significant functional associations, predicted protein disorder content (using the program DISOPRED [12]), CB region length, and conservation in Human and Drosophila. We found that some of the universally prevalent biases in metazoans are significantly associated with transcription regulation and nuclear localization in human and/or Drosophila. Furthermore, the CB regions identified are not necessarily contiguous with predicted disordered domains (only 40–50% of the residues in these regions are also in predicted disordered regions).

The CB assignment data presented here will be of further use to home in on functional associations. Furthermore, this classification will also help to delineate systematic errors in genome annotation, such as likely false-positive protein motif matches, or subsets of spurious gene predictions (as noted above for the two puffer fish genomes). The CB data can also be used for further characterization of subtypes of protein disorder [15]. It is also useful for informing strategies in structural genomics projects, since such projects rely on the correct parsing of domains and subsequences. Further data relating to the analysis in this paper is available from the author.


Exhaustive assignment of CB regions

The proteomes of thirteen higher eukaryotes were downloaded from the Ensembl website [16], in November 2004. They are [versions in square brackets]: human [build 34], chimpanzee [CHIMP1], mouse [NCBIM33], rat [RGSC3.1], fruit fly [version 3], mosquito (A. gambiae) [MOZ2a], honey bee [1st assembly], zebra fish [ZFISH4], and two puffer fish species (Fugu rubripes [FUGU2], Tetraodon nigroviridis [TETRAODON7]). The total combined amino-acid composition of all of these proteomes was calculated, and used as the standard for all subsequent calculations. CB assignment was performed using a development of the algorithm previously described for classification of regions with single-residue biases (Harrison and Gerstein, 2003). The assignment of CB regions comprises two steps: (i) initial search for single-residue LPSs, and (ii) iterative build-up of multiple-residue biases until convergence, i.e., until no lower probability subsequence for a given set of bias residues can be found.

(i) Initial search for single-residue lowest probability subsequences (LPSs)

We searched for biased regions for each of the 20 amino-acid types as described previously (Harrison and Gerstein, 2003). For each amino-acid type x, and for the range of window sizes (20 ≤ w ≤ 2,500 residues), we search each protein sequence for stretches that have compositional bias of the lowest probability (Pmin):

$$P_{\min} = \min_{i,\,w}\left[P_{\mathrm{bias}}(i, w)\right] \quad \text{for each residue type } x \qquad (1)$$

where i is each possible start position for a window w in the sequence. The probability Pbias(i, w) in equation (1) is given by a binomial distribution:

$$P_{\mathrm{bias}}(i, w) = \left[\frac{w!}{n!\,(w-n)!}\right] \cdot (f_x)^{n} \cdot (1 - f_x)^{w-n} \qquad (2)$$

where f_x is the proportion of amino-acid type x as given by the total combined composition of all of the proteomes. The count for x is denoted n in the window w starting at position i. Sequence stretches with Pmin are termed LPSs (Lowest Probability Subsequences), as they have the smallest Pbias values for a given residue type and protein sequence.
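As an illustration of the single-residue search, the sketch below evaluates equation (2) in log space (to avoid overflow of the factorials for long windows) and scans all start positions and window lengths for the minimum. The function names are hypothetical. The restriction to windows enriched in x (n > w·f_x) is an assumption added here, because the binomial term is also small for windows depleted in x; the text does not state how such windows were treated.

```python
from math import lgamma, log

def log_p_bias(n, w, f_x):
    """Natural log of equation (2): binomial probability of exactly n
    residues of type x in a window of length w (requires 0 < f_x < 1)."""
    log_choose = lgamma(w + 1) - lgamma(n + 1) - lgamma(w - n + 1)
    return log_choose + n * log(f_x) + (w - n) * log(1.0 - f_x)

def find_lps(seq, residue, f_x, w_min=20, w_max=2500):
    """Return (log_p_min, start, window_length) for the lowest-probability
    subsequence of `residue` in `seq`, or None if no window qualifies."""
    counts = [0]  # counts[k] = occurrences of residue in seq[:k]
    for aa in seq:
        counts.append(counts[-1] + (aa == residue))
    best = None
    for w in range(w_min, min(w_max, len(seq)) + 1):
        for i in range(len(seq) - w + 1):
            n = counts[i + w] - counts[i]
            if n <= w * f_x:  # assumption: only enriched windows are candidates
                continue
            lp = log_p_bias(n, w, f_x)
            if best is None or lp < best[0]:
                best = (lp, i, w)
    return best

# Toy sequence: a 25-residue Q run flanked by alanines.
toy_seq = "A" * 30 + "Q" * 25 + "A" * 30
toy_lps = find_lps(toy_seq, "Q", 0.05)
```

The exhaustive double loop is quadratic in sequence length, which is adequate for a sketch but would need optimisation at proteome scale.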

(ii) Iterative build-up of multiple-residue biases

The procedure described in (i) was generalized to calculate biases derived from any number of residue types exhaustively for a given protein sequence, as follows. Pmin values are calculated for any set of amino acids {xyz...}, by summing up the number of residues over the whole residue-type set; however, they are only picked in preference over a previously-calculated bias made by a smaller number of residue types if their Pmin values are smaller. The set of residue types contributing to the bias (sorted in decreasing order of their original Pmin values) is defined as the CB signature.

The build-up of multiple-residue biases is performed as follows. For each protein sequence, all single-residue LPSs are sorted in decreasing order of Pmin. These initial sorted single-residue LPSs thus have a single-letter CB signature. Then, for each LPS, the list of LPSs of higher Pmin value is searched to check for mutual overlap > 10 residues between the two regions. For all such overlapping pairs, the LPS for the combined residue-type set is calculated, and a new CB signature is derived if the combined Pmin is smaller. This procedure is repeated until convergence. Using this procedure, regions that comprise mild bias for multiple residue types can be detected as significantly biased. Three examples of CB regions defined using the above procedure are shown in Figure 7; the first example (A) is a {TPSM} region in leukosialin from the human proteome, the second (B) is a {QPH} region from an unnamed protein in the fruitfly, and the third (C) is an unnamed protein from chicken which has an {AQTVISLPN} region N-terminal to a POU transcription factor domain. This last example demonstrates how the algorithm can detect a biased region that is composed of many mild, single-residue biases.
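The build-up can be sketched roughly as below. This is one possible reading of the description, not the author's implementation. The assumptions are: regions are processed most significant first (smallest Pmin first), the background frequency of a residue set is the sum of the single-residue frequencies, only windows enriched in the set are considered, a merged region replaces the more significant member of the pair while the other is kept, and the names `lps` and `build_up` are hypothetical.

```python
from math import lgamma, log

def lps(seq, residues, f, w_min=20):
    """Lowest-probability window for a residue set.
    Returns (log_p, start, end) with end exclusive, or None."""
    hits = [0]
    for aa in seq:
        hits.append(hits[-1] + (aa in residues))
    best = None
    for w in range(w_min, len(seq) + 1):
        for i in range(len(seq) - w + 1):
            n = hits[i + w] - hits[i]
            if n <= w * f:  # assumption: enriched windows only
                continue
            lp = (lgamma(w + 1) - lgamma(n + 1) - lgamma(w - n + 1)
                  + n * log(f) + (w - n) * log(1.0 - f))
            if best is None or lp < best[0]:
                best = (lp, i, i + w)
    return best

def build_up(seq, freqs, min_overlap=10):
    """freqs: {residue: background frequency}. Returns a list of regions
    (dicts with sig, logp, start, end), most significant first."""
    regions = []
    for aa, f in freqs.items():
        hit = lps(seq, {aa}, f)
        if hit is not None:
            regions.append({"sig": aa, "logp": hit[0], "start": hit[1], "end": hit[2]})
    changed = True
    while changed:  # repeat until no signature can be extended
        changed = False
        regions.sort(key=lambda r: r["logp"])
        for idx, a in enumerate(regions):
            for b in regions[idx + 1:]:  # less significant regions
                if set(b["sig"]) <= set(a["sig"]):
                    continue  # nothing new to recruit
                shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
                if shared <= min_overlap:
                    continue
                sig = a["sig"] + "".join(c for c in b["sig"] if c not in a["sig"])
                hit = lps(seq, set(sig), sum(freqs[c] for c in sig))
                if hit is not None and hit[0] < a["logp"]:
                    a.update(sig=sig, logp=hit[0], start=hit[1], end=hit[2])
                    changed = True
    return regions

# Toy example: a Q-rich tract with interspersed H, flanked by unbiased filler.
filler = "ACDEFGIKLMNPRSTVWY"
toy_seq = filler * 3 + "QQQQHQQQHHQQQQHQQQHQQQQH" + filler * 3
toy_regions = build_up(toy_seq, {"Q": 0.05, "H": 0.025})
```

In the toy example the Q and H single-residue regions overlap by more than 10 residues, and the combined set gives a lower probability than Q alone, so the top region acquires the signature QH. The loop terminates because every accepted merge strictly lengthens a signature.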

Classification of CB regions

To classify CB regions across a whole proteome, suitable thresholds for Pmin must be derived for deciding on inclusion in the analysis. Pmin thresholds were derived as follows. Longer protein sequences can have more significantly biased subsequences. To allow for this sequence-length-dependent effect, we calculated a sequence-length-dependent Pmin threshold. For a random sample of 10,000 protein sequences, Pmin for the most biased subsequence was plotted against sequence length on a log-log scale. To extract the relationship of sequence length with Pmin for this data, a line was fitted (significant r² value = 0.1, P < 0.001). Then, the intercept of this line was decreased until just 10% of protein sequences had CB regions picked for inclusion in the data set.

So that the smallest sequences do not have unreasonably high threshold values, the Pmin value was calculated at which 10% of all of the protein sequences in a proteome would have a CB region assigned to them. This second sequence-length-independent threshold Pmin value was used, where it was smaller than the sequence-length-dependent value. Using percentages of sequences in the range 5% to 15% to calculate these threshold Pmin values does not qualitatively change the main observations reported in the paper.
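The two thresholds can be illustrated with the sketch below. It assumes an ordinary least-squares fit on log10 values, and it implements "decreasing the intercept until 10% of sequences pass" as taking the 10th-percentile offset from the fitted slope, which is equivalent for a fixed slope. The function names and the toy data are hypothetical.

```python
from math import log10

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def length_dependent_line(lengths, log_pmins, fraction=0.10):
    """Fit log10(Pmin) against log10(length), then lower the intercept
    until only `fraction` of the sequences lie on or below the line."""
    xs = [log10(length) for length in lengths]
    slope = fit_slope(xs, log_pmins)
    offsets = sorted(y - slope * x for x, y in zip(xs, log_pmins))
    k = max(1, round(fraction * len(offsets)))
    return slope, offsets[k - 1]

def flat_cutoff(log_pmins, fraction=0.10):
    """Length-independent log10(Pmin) passed by `fraction` of sequences."""
    ordered = sorted(log_pmins)
    k = max(1, round(fraction * len(ordered)))
    return ordered[k - 1]

def threshold(length, slope, intercept, cutoff):
    """Use the flat cutoff wherever it is smaller than the line."""
    return min(slope * log10(length) + intercept, cutoff)

# Toy data (not from the paper): log10(Pmin) for ten sequences.
toy_lengths = [10] * 5 + [1000] * 5
toy_log_pmins = [-2, -3, -4, -5, -6, -6, -7, -8, -9, -10]
toy_slope, toy_intercept = length_dependent_line(toy_lengths, toy_log_pmins, 0.2)
toy_cutoff = flat_cutoff(toy_log_pmins, 0.2)
```

A region would then be included when its log10(Pmin) is at or below `threshold(...)` for its parent sequence length.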

CB signatures

All regions that have the same CB signature were grouped together. To allow for small differences in the order of recruitment to longer CB signatures, in some cases, we also analysed permutations of CB signatures (e.g., {xzy} and {xyz} are such permutations).
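A small sketch of this grouping is given below. Treating any reordering of the same residue types as one permutation class is an assumption; the text only gives {xzy} and {xyz} as an example. The function name and region identifiers are hypothetical.

```python
from collections import defaultdict

def group_signatures(regions, ignore_order=False):
    """regions: iterable of (region_id, signature) pairs, e.g. ("r1", "QPH").
    With ignore_order=True, permutations of a signature share one key."""
    groups = defaultdict(list)
    for region_id, signature in regions:
        key = "".join(sorted(signature)) if ignore_order else signature
        groups[key].append(region_id)
    return dict(groups)

toy_regions = [("r1", "QPH"), ("r2", "QHP"), ("r3", "QH")]
exact = group_signatures(toy_regions)
pooled = group_signatures(toy_regions, ignore_order=True)
```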

Sequence annotations

Annotation of protein disorder was performed using DISOPRED [12], using default parameters trained to give a 5% false positive rate. The total fraction of predicted protein disorder in a CB region is given by the D value. Coiled coils were identified with the program MULTICOIL [17], using default parameters. Known protein domains were assigned using the ASTRAL 40% identity protein domain sequence set, and BLAST using e-value ≤ 0.01 [13, 18]. Types of biased region that map to repetitive Zinc-finger-containing proteins (> 0.5 of the length of the protein) were numerous and were additionally filtered out.
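For illustration, the D value and the summed-total disorder fraction could be computed from per-residue disorder calls as sketched below. The input format (one boolean per residue, region end exclusive) and the function names are assumptions; parsing of DISOPRED output is not shown.

```python
def d_value(disordered, start, end):
    """Fraction of residues in [start, end) that are predicted disordered.
    `disordered` holds one boolean per residue of the protein."""
    if end <= start:
        raise ValueError("empty region")
    return sum(bool(flag) for flag in disordered[start:end]) / (end - start)

def pooled_disorder(regions):
    """Summed-total disorder: disordered residues over all residues,
    pooled across regions given as (disordered, start, end) tuples."""
    n_disordered = n_total = 0
    for disordered, start, end in regions:
        n_disordered += sum(bool(flag) for flag in disordered[start:end])
        n_total += end - start
    return n_disordered / n_total

# Toy protein of 50 residues with a disordered stretch at positions 10-39.
toy_flags = [False] * 10 + [True] * 30 + [False] * 10
toy_d = d_value(toy_flags, 5, 25)
```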

GO (Gene Ontology; [19]) functional categories were taken from the annotation files provided on the Ensembl [16] and Gene Ontology [20] websites. Further GO term annotations were derived by mapping functional GO annotations for the PDB (downloaded from [20]) onto Ensembl protein annotations, using 50% sequence identity and 0.8 fractional sequence coverage (for the protein domain) as thresholds, using alignments made by the program BLASTP (e-value ≤ 0.0001) [13]. These thresholds were benchmarked on the complete SCOP protein domain sequence database [18], to give a 2% false positive rate for GO term transfer. Significant associations between GO terms and lists of protein sequences were calculated using binomial statistics, and a P'-value threshold of 0.05, where P' has been adjusted to account for multiple hypothesis testing, using the Bonferroni correction. In addition, we used two functional supercategories, wherein all transcription-associated and non-transcription-associated GO terms were pooled together. The transcription-associated GO terms are: GO:0006355; GO:0006357; GO:0006366; GO:0006367; GO:0016563; GO:0003676; GO:0003677; GO:0003700; GO:0003702; GO:0003704; GO:0003713; GO:0030374; GO:0030528.
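The significance test can be sketched as a one-sided binomial test with a Bonferroni adjustment. The one-sided (over-representation) form, the use of the number of tested terms as the Bonferroni factor, and the function names are assumptions; the text specifies only binomial statistics, the Bonferroni correction and the 0.05 threshold.

```python
from math import exp, lgamma, log

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.
    Terms are evaluated in log space to avoid overflow for large n."""
    total = 0.0
    for j in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                    + j * log(p) + (n - j) * log(1.0 - p))
        total += exp(log_term)
    return min(1.0, total)

def enriched_terms(term_counts, n_proteins, background_freq, alpha=0.05):
    """term_counts: {GO term: number of listed proteins carrying it}.
    background_freq: {GO term: fraction of the proteome carrying it}.
    Returns {GO term: Bonferroni-adjusted P'} for terms with P' <= alpha."""
    n_tests = len(term_counts)
    significant = {}
    for term, k in term_counts.items():
        p_raw = binomial_tail(k, n_proteins, background_freq[term])
        p_adj = min(1.0, p_raw * n_tests)  # Bonferroni adjustment
        if p_adj <= alpha:
            significant[term] = p_adj
    return significant

# Toy counts for illustration only (not data from the paper).
toy_hits = enriched_terms(
    {"GO:0003700": 10, "GO:0006355": 3},
    n_proteins=10,
    background_freq={"GO:0003700": 0.1, "GO:0006355": 0.3},
)
```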

Orthologs for conservation

Orthologs were calculated using the bidirectional best hits method and a BLASTP threshold of e-value ≤ 0.0001 [13], with the additional requirement for both of the potential orthologs to match each other over 0.6 of their sequence lengths. Potential orthologs were calculated both with and without the CB region masked, to give 'upper' and 'lower' bounds for ortholog detection.
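The ortholog criterion can be sketched as follows, assuming the BLASTP results have already been parsed into tuples. Choosing the best hit by lowest e-value (rather than, for example, bit score) is an assumption, as are the tuple layout and function names.

```python
def best_hits(hits, max_evalue=1e-4, min_coverage=0.6):
    """hits: iterable of (query, subject, evalue, query_cov, subject_cov),
    with coverages as fractions of each sequence length.
    Returns {query: subject} for the best hit passing both filters."""
    best = {}
    for query, subject, evalue, query_cov, subject_cov in hits:
        if evalue > max_evalue:
            continue
        if query_cov < min_coverage or subject_cov < min_coverage:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a, **filters):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b, **filters)
    reverse = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Toy hits with hypothetical identifiers.
human_vs_mouse = [("h1", "m1", 1e-50, 0.9, 0.9),
                  ("h1", "m2", 1e-20, 0.8, 0.8),
                  ("h2", "m2", 1e-30, 0.5, 0.9)]
mouse_vs_human = [("m1", "h1", 1e-48, 0.9, 0.9),
                  ("m2", "h1", 1e-20, 0.8, 0.8)]
toy_orthologs = bidirectional_best_hits(human_vs_mouse, mouse_vs_human)
```

Running the same function on hit lists produced with and without CB-region masking would give the two sets used as 'upper' and 'lower' bounds.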



LPS: Lowest Probability Subsequence

CB: compositional bias or compositionally-biased

GO: Gene Ontology.


  1. Bracken C, Iakoucheva LM, Romero PR, Dunker AK: Combining prediction, computation and experiment for the characterization of protein disorder. Curr Opin Struct Biol 2004, 14: 570–576. 10.1016/

  2. Dyson HJ, Wright PE: Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 2005, 6: 197–208. 10.1038/nrm1589

  3. Fink AL: Natively unfolded proteins. Curr Opin Struct Biol 2005, 15: 35–41. 10.1016/

  4. Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy DP, Hamodrakas S, Sander C, Ouzounis CA: CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 2000, 16: 915–922. 10.1093/bioinformatics/16.10.915

  5. Wise MJ: a software tool for low complexity proteins and protein domains. Bioinformatics 2001, 17 (Suppl): S288-S295.

  6. Harrison PM, Gerstein M: A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes. Genome Biol 2003, 4: R40-R46. 10.1186/gb-2003-4-6-r40

  7. Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol 1996, 266: 554–571.

  8. Kuznetsov I, Hwang S: A novel sensitive method for the detection of user-defined compositional bias in biological sequence. Bioinformatics 2006, 22: 1055–1063. 10.1093/bioinformatics/btl049

  9. Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ: Amino acid runs in eukaryotic proteomes and disease associations. PNAS 2002, 99: 333–338. 10.1073/pnas.012608599

  10. Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC: Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 2005, 15: 537–551. 10.1101/gr.3096505

  11. Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res 2004, 14: 549–554. 10.1101/gr.1925704

  12. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 2004, 337: 635–645. 10.1016/j.jmb.2004.02.002

  13. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389

  14. Linding R, Russell RB, Neduva V, Gibson TJ: GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res 2003, 31: 3701–3708. 10.1093/nar/gkg519

  15. Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder. Proteins 2003, 52: 573–584. 10.1002/prot.10437


  17. Wolf E, Kim PS, Berger B: MultiCoil: a program for predicting two- and three-stranded coiled coils. Protein Sci 1997, 6: 1179–1189.

  18. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32: D189-D192. 10.1093/nar/gkh034

  19. Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: D258-D261. 10.1093/nar/gkh036


  21. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340



This work was supported by grants to P.H. from the National Science and Engineering Research Council of Canada, and from McGill University.

Author information


Corresponding author

Correspondence to Paul M Harrison.

Additional information

Authors' contributions

P.H. performed this work and wrote the paper.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Harrison, P.M. Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics 7, 441 (2006).
