COPASAAR – A database for proteomic analysis of single amino acid repeats
© Depledge and Dalby; licensee BioMed Central Ltd. 2005
Received: 18 February 2005
Accepted: 03 August 2005
Published: 03 August 2005
Single amino acid repeats make up a significant proportion in all of the proteomes that have currently been determined. They have been shown to be functionally and medically significant, and are associated with cancers and neuro-degenerative diseases such as Huntington's Chorea, where a poly-glutamine repeat is responsible for causing the disease. The COPASAAR database is a new tool to facilitate the rapid analysis of single amino acid repeats at a proteome level. The database aims to simplify the comparison of repeat distributions between proteomes in order to provide a better understanding of their function and evolution.
A comparative analysis of all proteomes in the database (currently 244) shows that single amino acid repeats account for about 12–14% of the proteome of any given species. They are more common in eukaryotes (14%) than in either archaea or bacteria (both 13%). Individual analyses of proteomes show that long single amino acid repeats (6+ residues) are much more common in the Eukaryotes and that longer repeats are usually made up of hydrophilic amino acids such as glutamine, glutamic acid, asparagine, aspartic acid and serine.
COPASAAR is a useful tool for comparative proteomics that provides rapid access to amino acid repeat data that can be readily data-mined. The COPASAAR database can be queried at the kingdom, proteome or individual protein level. As the amount of available proteome data increases this will be increasingly important in order to automate proteome comparison. The insights gained from these studies will give a better insight into the evolution of protein sequence and function.
Single amino acid repeats (SAARs) are uninterrupted runs of identical amino acids that exist in many proteins and are currently a major focus of research. These are an example of a simple sequence repeat (SSR), which occurs when a simple sequence motif is repeated in the DNA sequence. These repeats are found in the proteome and can eventually dictate the structure and function of proteins. Repeats within the amino acid sequence are usually dependent on repetitive elements in the genome. They originate from unequal crossing-over or replication errors resulting from the formation of unusual DNA secondary structures such as hairpins or slipped strands [1–3]. Amongst the various DNA duplication events, SSRs are abundant in eukaryotic genomes and may be a major source of quantitative genetic variation [4–6]. SSRs in the codingregions of proteins can give rise to a variety of repeats including SAARs, short tandem repeats, and the repetition of homologous domains of 100 or more residues. However the focus of this work is solely on SAARs.
There has been some suggestion that these repeated sequence patterns may be a mechanism that provides regular arrays of spatial and functional groups, useful for structural packing or for one to one interactions with target molecules . This suggests that error-prone SAAR expansion allows the rapid evolution of proteins with repetitive structure, which can lead to rapidly changing phenotypes .
Marcotte et al., suggested that eukaryotic proteomes have a significantly higher incidence of SAARs than either bacterial or archaeal proteomes . They showed that most SAARs occur in protein classes associated only with eukaryotes so protein classes associated with both eukaryotes and prokaryotes are much less likely to contain repeats. This would imply that the formation of SAARs is a relatively recent evolutionary event.
Dominantly inherited neurodegenerative diseases are associated with abnormally expanded tracts of glutamine residues.
Other notable repeats/comments
Huntington's disease protein
1 SAAR (23-residues)
2x Pro repeats (11-and 10-residues)
2x Glu repeats (6-and 5-residues)
Spinocerebellar ataxin type 1
2 SAARs (15-and 12-residues)
The two Gln repeats are separated by 4 residues
Androgen receptor (Kennedy's disease)
3 SAARs (21-, 6-and 5-residues)
1x Pro repeat (8-residues)
1x Ala repeat (5-residues)
1x Gly repeat (24-residues)
Examples of functional SAARs can be seen in proteins which are associated with development and transcriptional regulatory capacities, with the majority of them active in central or peripheral nervous system function and development . This has been extensively studied in Drosophila melanogaster but there are also examples in other eukaryotes, for example, the case of the transcription factor II (TFII) in humans  which contains a 34 residue glutamine run. This SAAR is absent from all related proteins, and yet appears to be functionally important. Extended runs can also provide substrates for caspase cleavage, yielding tangles, plaques, dead neurons and triggering apoptosis . They also provide binding sites for protein-protein interactions .
SAARs are generally less than 20 residues long and are primarily composed of the residues of the amino acids glutamine, asparagine, serine, threonine, proline, histidine, glycine, alanine, aspartic acid and glutamic acid [17, 18]. It is curious that glutamine followed by asparagine and serine are the most common SAARs found, especially when considering that the occurrence of leucine, isoleucine, alanine and valine in proteins is much greater. This is particularly interesting when considering that long SAARs of these 4 amino acids are rarely found.
The greatest challenge facing scientists who wish to study SAARs is the lack of tools for analysing SAARs and mining the data collected. While some software exists  for detecting and analysing SAARs, it is limited in its application in that it is only designed for analysing single proteins rather than whole proteomes. The aim of this paper is to describe a new web application dedicated to the analysis of SAARs in whole proteomes.
Construction and content
The COPASAAR (COmparative Proteome Analysis of Single Amino Acid Repeats) database was developed in MySQL 4.0.18 running on Mandrake Linux version 10.0. Access to the database is through a web interface written in Perl:CGI and uses the Perl ChartDirector  and Descriptive::Statistics modules to generate histograms and statistical analysis of the data. Currently the database contains 244 proteomes, which are made up of 862,886 proteins with a storage requirement of 1.2 Gbytes.
Repeat analysis software
Proteome data files were obtained from the integr8 database at the EBI  in Fasta format. These files were analysed for repeats using a series of scripts written in Perl. The database itself was written in SQL and the data was imported into the database as tab delimited text files using the mysqlimport client. This process was automated by the use of shell scripts.
The algorithm used for detecting and measuring a repeat compares each residue with the next one. If it finds two identical residues side-by-side then it continues the comparison to the next residue until it encounters a different amino acid. If a different residue is detected the programme records the repeat in an array of amino acid type and repeat length.
Expected repeat lengths
As a reference to the actual occurrence of SAARs a statistical model was created where the amino acids are assumed to be distributed randomly based on their occurrence in a specific protein . The probability of a SAAR of length n occurring will then be;
P(SAAR of length n) = f n (1 - f)2
Where f is the frequency of the particular amino acid in the protein. The (1-f)2. term accounts for there being a different amino acid at each end of the SAAR.
To find the expected number of repeats of a given amino acid within a protein this probability is multiplied by the number of potential starting points for the repeat. This will be equal to the sequence length minus the length of the repeat plus one.
Expected number of repeats of length n = f n (1 - f)2 (l - (n - 1))
Where l is the length of the protein.
Running times for the software
For all of the currently available proteomes the running time to extract all of the repeats and to generate the expected repeat tables at the protein and proteome level is about 3 hours on a Pentium 4 2.0 GHz. Import of the tables into MySQL is very rapid and takes less than 30 minutes. All of the scripts used to create the database can be downloaded from the COPASAAR website.
The database schema
The COPASAAR website houses the user interface to the main programmes, a documentation page featuring software documentation, and a download section so that users can download the database for use on local machines. The user-interface provides menu driven query access to the database. The user simply selects the species they wish to analyse and uses the 'post' method to send the request. Results are displayed either in tabulated or graphical form as bar charts. The website is hosted on an Apache (version 2.0.44) webserver.
COPASAAR proteome data
The current database consists of 244 proteomes; 19 eukaryotic species, 205 prokaryotic species and 20 archaeal species. A full list of the species can be found in additional file 1.
There have been previous systematic studies of simple amino acid repeat distributions in proteomes [7, 14, 23, 24] but what COPASAAR aims to do is to provide a comprehensive and simple to update resource that means that makes comparative studies much easier to carry out and which also increases the number of biological questions that can be asked.
Adding new proteomes to the database is simple using the repeat analysis scripts and this procedure will be made even easier by the new naming convention for proteome data files that will use the organism name rather than using taxonomic identifiers that can change for the same proteome between database releases.
To illustrate some of the capabilities of the database using the current web interface functionality we have made a high level comparison of occurrence of SAARs across the three super kingdoms.
Comparison of archaeal, eukaryotic and bacterial kingdoms
The proportion of a proteome composed of SAARs and the percentage of proteomes in each kingdom with a greater number of SAARs than the mean. *The overall mean is 13.18%
SAARs (as a percentage of the whole proteome)
Proteomes (with a greater % of repeats than the overall mean*)
P. falciparum, has a very unusual repeat distribution that is different to all other proteomes, prokaryotes and archaea included. Nearly 20% of the P. falciparum proteome is made up of repeats. The distribution of asparagine repeats is particularly significant. There are 137 repeats of over 20 asparagines in length which is highly unusual as long asparagine repeats are associated with prion domains and fibril formation [28, 29]
SAARs composition by amino acid.
The prediction model shows a close correlation to the actual repeat distribution in many cases and in particular for short SAARs although there is a consistent slight under-estimation of the number of expected repeats. This would suggest that shorter repeats are mostly randomly distributed and that few of them are likely to be functionally significant. Short repeats are therefore likely to form part of the neutral drift of protein sequence evolution.
COPASAAR provides an essential tool for the study of repeats in comparative proteomics. The ability to quickly analyse proteomes (and individual proteins) and to map the distribution and size of SAARs will hopefully benefit scientists from many different fields. COPASAAR will provide a useful resource for finding new protein families that can be used as species specific markers. Data on the evolution of repeats between species will also allow us to develop models of adaptive traits in proteomes. This will be particularly important in understanding the evolution of amyloid associated diseases.
Availability and requirements
Online access to COPASAAR can be found at;
All of the source code for the project and the database files are also available from this site and are available under the GPL. Software requirements have been described above and non-academics should be aware of licensing restrictions regarding the use of the commercial software Perl ChartDirector.
DPD would like to acknowledge an EPSRC studentship that funded this work. ARD would like to acknowledge Simon Lofting for his comments on the statistical analysis.
- Pearson CE, Sinden RR: Trinucleotide repeat DNA structures: dynamic mutations from dynamic DNA. Curr Opin Struct Biol 1998, 8: 321–330. 10.1016/S0959-440X(98)80065-1View ArticlePubMedGoogle Scholar
- Kruglyak S, Durrett R, Schug MD, Aquadro CF: Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol Biol Evol 2000, 17: 1210–1219.View ArticlePubMedGoogle Scholar
- LeProust EM, Pearso CE, Sinden RR, Gao XL: Unexpected formation of parallel duplex in GAA and TTC trinucleotide repeats of Friedreich's ataxia. J Mol Biol 2000, 302: 1063–1080. 10.1006/jmbi.2000.4073View ArticlePubMedGoogle Scholar
- Kashi Y, King D, Soller M: Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 1997, 13: 74–78. 10.1016/S0168-9525(97)01008-1View ArticlePubMedGoogle Scholar
- Alba MM, Santibanez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol 1999, 16: 1641–1644.View ArticlePubMedGoogle Scholar
- Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res 2004, 14: 549–554. 10.1101/gr.1925704PubMed CentralView ArticlePubMedGoogle Scholar
- Katti MV, Sami-Subbu R, Ranjekar PK, Gupta VS: Amino acid repeat patterns in protein sequences: Their diversity and structural-functional implications. Protein Sci 2000, 9: 1203–1209.PubMed CentralView ArticlePubMedGoogle Scholar
- Fondon JW, Garner HR: Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci U S A 2004, 101: 18058–18063. 10.1073/pnas.0408118101PubMed CentralView ArticlePubMedGoogle Scholar
- Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D: A census of protein repeats. J Mol Biol 1999, 293: 151–160. 10.1006/jmbi.1999.3136View ArticlePubMedGoogle Scholar
- Djian P: Evolution of simple repeats in DNA and their relation to human disease. Cell 1998, 94: 155–160. 10.1016/S0092-8674(00)81415-4View ArticlePubMedGoogle Scholar
- Sutherland GR, Richards RI: The Molecular-Basis of Fragile Sites in Human-Chromosomes. Curr Opin Genet Dev 1995, 5: 323–327. 10.1016/0959-437X(95)80046-8View ArticlePubMedGoogle Scholar
- Karlin S, Burge C: Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci U S A 1996, 93: 1560–1565. 10.1073/pnas.93.4.1560PubMed CentralView ArticlePubMedGoogle Scholar
- Bence NF, Sampat RM, Kopito RR: Impairment of the ubiquitin-proteasome system by protein aggregation. Science 2001, 292: 1552–1555. 10.1126/science.292.5521.1552View ArticlePubMedGoogle Scholar
- Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ: Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci U S A 2002, 99: 333–338. 10.1073/pnas.012608599PubMed CentralView ArticlePubMedGoogle Scholar
- Hoffmann A, Sinn E, Yamamoto T, Wang J, Roy A, Horikoshi M, Roeder RG: Highly Conserved Core Domain and Unique N-Terminus with Presumptive Regulatory Motifs in a Human Tata Factor (Tfiid). Nature 1990, 346: 387–390. 10.1038/346387a0View ArticleGoogle Scholar
- Sun B, Fan W, Balciunas A, Cooper JK, Bitan G, Steavenson S, Denis PE, Young Y, Adler B, Daugherty L, Manoukian R, Elliott G, Shen WY, Talvenheimo J, Teplow DB, Haniu M, Haldankar R, Wypych J, Ross CA, Citron M, Richards WG: Polyglutamine repeat length-dependent proteolysis of huntingtin. Neurobiol Dis 2002, 11: 111–122. 10.1006/nbdi.2002.0539View ArticlePubMedGoogle Scholar
- Huntley MA, Golding GB: Simple sequences are rare in the protein data bank. Proteins 2002, 48: 134–140. 10.1002/prot.10150View ArticlePubMedGoogle Scholar
- Nance MA: Clinical aspects of CAG repeat diseases. Brain Pathol 1997, 7: 881–900.View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Yeates TO: A fast algorithm for genome-wide analysis of proteins with repeated sequences. Proteins 1999, 35: 440–446. 10.1002/(SICI)1097-0134(19990601)35:4<440::AID-PROT7>3.0.CO;2-YView ArticlePubMedGoogle Scholar
- Advanced Software Engineering: .[http://www.advsofteng.com/]
- integr8: .[http://www.ebi.ac.uk/integr8]
- Brendel V, Bucher P, Nourbakhsh IR, Blaisdell BE, Karlin S: Methods and Algorithms for Statistical-Analysis of Protein Sequences. Proc Natl Acad Sci U S A 1992, 89: 2002–2006.PubMed CentralView ArticlePubMedGoogle Scholar
- Sim KL, Creamer TP: Abundance and distributions of eukaryote protein simple sequences. Mol Cell Proteomics 2002, 1: 983–995. 10.1074/mcp.M200032-MCP200View ArticlePubMedGoogle Scholar
- Sim KL, Creamer TP: Protein simple sequence conservation. Proteins 2004, 54: 629–638. 10.1002/prot.10623View ArticlePubMedGoogle Scholar
- Ross CA, Margolis RL: Huntington's disease. Clin Neurosci Res 2001, 1: 142–152. 10.1016/S1566-2772(00)00014-1View ArticleGoogle Scholar
- Cervantes-Kardasch VH, Garcia-Martinez E: Molecular physiopathology of the spinocerebellar ataxia type 6 (SCA6). Rev Invest Clin 2004, 56: 368–374.PubMedGoogle Scholar
- Poirier MA, Jiang H, Ross CA: A structure-based analysis of huntingtin mutant polyglutamine aggregation and toxicity: evidence for a compact beta-sheet structure. Hum Mol Genet 2005, 14: 765–774. 10.1093/hmg/ddi071View ArticlePubMedGoogle Scholar
- Singh GP, Chandra BR, Bhattacharya A, Akhouri RR, Singh SK, Sharma A: Hyper-expansion of asparagines correlates with an abundance of proteins with prion-like domains in Plasmodium falciparum. Mol Biochem Parasitol 2004, 137: 307–319. 10.1016/j.molbiopara.2004.05.016View ArticlePubMedGoogle Scholar
- Kreil DP, Kreil G: Asparagine repeats are rare in mammalian proteins. Trends Biochem Sci 2000, 25: 270–1. 10.1016/S0968-0004(00)01594-2View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.