COPASAAR – A database for proteomic analysis of single amino acid repeats

Background: Single amino acid repeats make up a significant proportion of every proteome determined to date. They have been shown to be functionally and medically significant, and are associated with cancers and neurodegenerative diseases such as Huntington's chorea, where a poly-glutamine repeat is responsible for causing the disease. The COPASAAR database is a new tool to facilitate the rapid analysis of single amino acid repeats at a proteome level. The database aims to simplify the comparison of repeat distributions between proteomes in order to provide a better understanding of their function and evolution.
Results: A comparative analysis of all proteomes in the database (currently 244) shows that single amino acid repeats account for about 12–14% of the proteome of any given species. They are more common in eukaryotes (14%) than in either archaea or bacteria (both 13%). Individual analyses of proteomes show that long single amino acid repeats (6+ residues) are much more common in the eukaryotes and that longer repeats are usually made up of hydrophilic amino acids such as glutamine, glutamic acid, asparagine, aspartic acid and serine.
Conclusion: COPASAAR is a useful tool for comparative proteomics that provides rapid access to amino acid repeat data that can be readily data-mined. The COPASAAR database can be queried at the kingdom, proteome or individual protein level. As the amount of available proteome data increases, this will become increasingly important for automating proteome comparison. These studies should give a better insight into the evolution of protein sequence and function.


Background
Single amino acid repeats (SAARs) are uninterrupted runs of identical amino acids that exist in many proteins and are currently a major focus of research. These are an example of a simple sequence repeat (SSR), which occurs when a simple sequence motif is repeated in the DNA sequence. These repeats are found in the proteome and can eventually dictate the structure and function of proteins. Repeats within the amino acid sequence are usually dependent on repetitive elements in the genome. They originate from unequal crossing-over or replication errors resulting from the formation of unusual DNA secondary structures such as hairpins or slipped strands [1][2][3]. Amongst the various DNA duplication events, SSRs are abundant in eukaryotic genomes and may be a major source of quantitative genetic variation [4][5][6]. SSRs in the coding regions of proteins can give rise to a variety of repeats including SAARs, short tandem repeats, and the repetition of homologous domains of 100 or more residues. However, the focus of this work is solely on SAARs.
There has been some suggestion that these repeated sequence patterns may be a mechanism that provides regular arrays of spatial and functional groups, useful for structural packing or for one-to-one interactions with target molecules [7]. This suggests that error-prone SAAR expansion allows the rapid evolution of proteins with repetitive structure, which can lead to rapidly changing phenotypes [8].
Marcotte et al. suggested that eukaryotic proteomes have a significantly higher incidence of SAARs than either bacterial or archaeal proteomes [9]. They showed that most SAARs occur in protein classes associated only with eukaryotes, so that protein classes associated with both eukaryotes and prokaryotes are much less likely to contain repeats. This would imply that the formation of SAARs is a relatively recent evolutionary event.
What is interesting is that SAARs can be either functionally significant or extremely pathogenic depending on the proteins involved. Several human inherited neurodegenerative diseases are triplet-repeat diseases associated with proteins containing long runs of glutamine (long CAG codon iterations which result from mutations of SAARs) as shown in Table 1 [10]. The severity of these diseases seems to be correlated with the extent of iterations of the CAG codon above a certain threshold [11]. Also notable is that most of these proteins contain two or more additional long runs of amino acids other than glutamine [12]. Pathogenicity is due to inflammatory brain responses, oxidative damage and protein aggregations that clog the proteasome [13].
Examples of functional SAARs can be seen in proteins which are associated with development and transcriptional regulatory capacities, with the majority of them active in central or peripheral nervous system function and development [14]. This has been extensively studied in Drosophila melanogaster but there are also examples in other eukaryotes, for example, the case of the transcription factor II (TFII) in humans [15], which contains a 34 residue glutamine run. This SAAR is absent from all related proteins, and yet appears to be functionally important. Extended runs can also provide substrates for caspase cleavage, yielding tangles, plaques, dead neurons and triggering apoptosis [16]. They also provide binding sites for protein-protein interactions [14].
SAARs are generally less than 20 residues long and are primarily composed of glutamine, asparagine, serine, threonine, proline, histidine, glycine, alanine, aspartic acid and glutamic acid [17,18]. It is curious that glutamine, followed by asparagine and serine, forms the most common SAARs, especially given that leucine, isoleucine, alanine and valine occur much more frequently in proteins. This is particularly interesting because long SAARs of these four amino acids are rarely found.
The greatest challenge facing scientists who wish to study SAARs is the lack of tools for analysing SAARs and mining the data collected. While some software exists [19] for detecting and analysing SAARs, it is limited in its application in that it is only designed for analysing single proteins rather than whole proteomes. The aim of this paper is to describe a new web application dedicated to the analysis of SAARs in whole proteomes.

Construction and content
The COPASAAR (COmparative Proteome Analysis of Single Amino Acid Repeats) database was developed in MySQL 4.0.18 running on Mandrake Linux version 10.0.
Access to the database is through a web interface written in Perl:CGI, which uses the Perl ChartDirector [20] and Descriptive::Statistics modules to generate histograms and statistical analysis of the data. Currently the database contains 244 proteomes, which are made up of 862,886 proteins with a storage requirement of 1.2 Gbytes.

Repeat analysis software
Proteome data files were obtained from the integr8 database at the EBI [21] in Fasta format. These files were analysed for repeats using a series of scripts written in Perl.
The database itself was written in SQL and the data was imported into the database as tab delimited text files using the mysqlimport client. This process was automated by the use of shell scripts.
The algorithm used for detecting and measuring a repeat compares each residue with the next one. If it finds two identical residues side-by-side then it continues the comparison to the next residue until it encounters a different amino acid. If a different residue is detected the programme records the repeat in an array of amino acid type and repeat length.
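The scanning procedure described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl scripts; the function name `find_repeats`, the `min_length` parameter and the recorded start position are assumptions added for the example.

```python
def find_repeats(sequence, min_length=2):
    """Scan a protein sequence and record each run of identical residues.

    Returns a list of (residue, run_length, start_index) tuples for runs of
    at least `min_length` residues (two, by default, as in the text).
    """
    repeats = []
    start = 0  # index where the current run began
    for i in range(1, len(sequence) + 1):
        # A run ends at the end of the sequence or when the residue changes.
        if i == len(sequence) or sequence[i] != sequence[start]:
            length = i - start
            if length >= min_length:
                repeats.append((sequence[start], length, start))
            start = i
    return repeats


# Made-up example sequence, for illustration only.
print(find_repeats("MAAAQQQQQQLS"))  # [('A', 3, 1), ('Q', 6, 4)]
```

Each residue is visited once, so the scan is linear in the sequence length, which matters when the same pass is repeated over every protein in a proteome.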

Expected repeat lengths
As a reference for the actual occurrence of SAARs, a statistical model was created in which the amino acids are assumed to be distributed randomly based on their occurrence in a specific protein [22]. The probability of a SAAR of length n occurring will then be:
$$P(n) = f^{n}(1-f)^{2}$$
where f is the frequency of the particular amino acid in the protein. The $(1-f)^{2}$ term accounts for there being a different amino acid at each end of the SAAR.
To find the expected number of repeats of a given amino acid within a protein, this probability is multiplied by the number of potential starting points for the repeat. This will be equal to the sequence length minus the length of the repeat plus one:
$$\text{Expected number of repeats of length } n = f^{n}(1-f)^{2}(l-n+1)$$
where l is the length of the protein.
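As a worked illustration of this model, the sketch below evaluates the expected number of repeats as f^n (1-f)^2 (l-n+1), which is the form implied by the description above. It is a Python sketch rather than the authors' code; the function name `expected_repeats` and the test sequence are assumptions.

```python
from collections import Counter


def expected_repeats(sequence, residue, n):
    """Expected number of runs of exactly n copies of `residue` in `sequence`,
    assuming residues are placed at random with the protein's own frequencies."""
    l = len(sequence)
    if l == 0 or n > l:
        return 0.0
    f = Counter(sequence)[residue] / l      # frequency of the residue in this protein
    probability = (f ** n) * (1 - f) ** 2   # n copies flanked by a different residue on each side
    return probability * (l - n + 1)        # multiplied by the number of possible start positions


# Made-up 100-residue sequence in which glutamine (Q) has frequency 0.5.
print(expected_repeats("AAQQ" * 25, "Q", 2))  # 0.25 * 0.25 * 99 = 6.1875
```

Summing these protein-level values over every protein in a proteome gives the proteome-level expectation.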

Running times for the software
For all of the currently available proteomes the running time to extract all of the repeats and to generate the expected repeat tables at the protein and proteome level is about 3 hours on a Pentium 4 2.0 GHz. Import of the tables into MySQL is very rapid and takes less than 30 minutes. All of the scripts used to create the database can be downloaded from the COPASAAR website.

The database schema
The database contains 83 tables, most of which contain amino acid specific data. The database schema and table structure are shown in Figure 1. The data is stored at three different levels: the individual protein, proteome and kingdom levels. While this means there is some redundancy of data, it is required to speed up searches by reducing the amount of analysis that a query needs to perform. For example, the expected frequencies of repeats could be calculated during a query from the amino acid occurrences in the proteins, but if the query is at the proteome level this would have to be done for all of the proteins within that proteome, and the results would then have to be summed to give a proteome level expectation. This would be less efficient and would result in slow querying of the database.

COPASAAR website
The COPASAAR website houses the user interface to the main programmes, a documentation page featuring software documentation, and a download section so that users can download the database for use on local machines. The user interface provides menu driven query access to the database. The user simply selects the species they wish to analyse and uses the 'post' method to send the request. Results are displayed either in tabulated or graphical form as bar charts. The website is hosted on an Apache (version 2.0.44) webserver.

COPASAAR proteome data
The current database consists of 244 proteomes: 19 eukaryotic species, 205 bacterial species and 20 archaeal species. A full list of the species can be found in additional file 1.

Utility
There have been previous systematic studies of simple amino acid repeat distributions in proteomes [7,14,23,24], but COPASAAR aims to provide a comprehensive and simple-to-update resource that makes comparative studies much easier to carry out and increases the number of biological questions that can be asked.
Adding new proteomes to the database is simple using the repeat analysis scripts, and this procedure will be made even easier by the new naming convention for proteome data files, which will use the organism name rather than taxonomic identifiers that can change for the same proteome between database releases.
Access to the database can be either through the web interface or, for more experienced users, by querying the database directly using SQL. Accessing the MySQL database directly using SQL allows almost any query to be performed. Figure 2 shows the query to find all proteins with a repeat of 6 alanine residues from the human proteome. The problem with accessing the database this way is that it requires a working knowledge of SQL and also of the structure of the database. For this reason the web interface will remain the preferred mode of interaction for novice or infrequent users. The web interface currently contains a set of simple queries that can be rapidly expanded depending on requirements. It is expected that users who download the database will want to implement queries specific to their own research, which can be done by customising existing template scripts that generate SQL queries and that format the output either as tables or graphically to be displayed as webpages.
Figure 1: Database schema for COPASAAR. Note that each of the species_repeats, species_expected, protein_repeats and protein_expected tables will be repeated 20 times, once for each amino acid.
Figure 2: Example SQL script used to query the database for all proteins in humans with an alanine repeat of 6 amino acids.
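The query in Figure 2 is not reproduced in the text, so the following is only a hedged sketch of what a direct SQL query of this kind might look like. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL. The table name `protein_repeats_ala`, its columns and all protein identifiers are hypothetical and are not taken from the actual COPASAAR schema.

```python
import sqlite3

# In-memory SQLite database standing in for the COPASAAR MySQL database.
conn = sqlite3.connect(":memory:")

# Hypothetical per-amino-acid table (alanine); the real column names may differ.
conn.execute(
    "CREATE TABLE protein_repeats_ala ("
    "protein_id TEXT, species TEXT, repeat_length INTEGER, repeat_count INTEGER)"
)

# Made-up rows for illustration only.
rows = [
    ("HYPOTHETICAL_1", "Homo sapiens", 6, 1),
    ("HYPOTHETICAL_2", "Homo sapiens", 4, 2),
    ("HYPOTHETICAL_3", "Homo sapiens", 6, 3),
    ("HYPOTHETICAL_4", "Mus musculus", 6, 1),
]
conn.executemany("INSERT INTO protein_repeats_ala VALUES (?, ?, ?, ?)", rows)

# All human proteins containing an alanine repeat of exactly 6 residues.
query = (
    "SELECT DISTINCT protein_id FROM protein_repeats_ala "
    "WHERE species = ? AND repeat_length = ? ORDER BY protein_id"
)
hits = [row[0] for row in conn.execute(query, ("Homo sapiens", 6))]
print(hits)  # ['HYPOTHETICAL_1', 'HYPOTHETICAL_3']
```

Passing the species and repeat length as parameters, rather than formatting them into the SQL string, is the kind of choice a template script could make so that one query serves many searches.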
To illustrate some of the capabilities of the database using the current web interface functionality we have made a high level comparison of occurrence of SAARs across the three super kingdoms.

Comparison of archaeal, eukaryotic and bacterial kingdoms
The mean number of amino acids within SAARs (as a percentage of the proteome) within the three super kingdoms is greatest in the eukaryotes at 14.34%. The archaea mean is 13.34% while bacterial proteomes are the lowest with a mean of 13.05% (Table 2a). The overall mean is 13.18%. Of the 19 eukaryotic species, 18 of these proteomes (95%) contain a greater percentage of SAARs than the overall mean compared to 8 out of 20 of the archaea (42%) and 65 out of 205 of the bacteria (32%) (Table 2). If, however, the maximum length of the repeats is compared between the three kingdoms, the distinction between the eukaryotes and the archaea and bacteria is much clearer. In eukaryotes repeats over 20 amino acids occur in most species so far sequenced, although they are less common in the yeasts, whereas for bacteria repeats over 20 residues in length only occur exceptionally in certain Vibrio species (the seafood associated pathogens) and in Lactobacillus plantarum. In archaea a glycine repeat over 20 residues only occurs in Haloarcula marismortui. These results support the finding of Marcotte [9] and also suggest that the differences between the kingdoms can be specified in terms of repeats for a few amino acids. Glutamine repeats are particularly characteristic of eukaryotes where they have a long tailed distribution, which in the mammals and plants extends beyond 20 residues. It is of particular clinical interest that glutamine repeats play a significant role in eukaryotic proteomes because they are associated with amyloid plaque formation in diseases such as Huntington's chorea and spinocerebellar ataxia [25][26][27]. A functional explanation for the occurrence of glutamine repeats in transcription factor genes has been suggested by Fondon et al. and this could be the main contributing factor to the occurrence of these repeats in eukaryotes [8].
The only eukaryotes where glutamine does not form the longest repeat are Plasmodium falciparum, Arabidopsis thaliana and Caenorhabditis elegans. A. thaliana contains a very characteristic long lysine repeat (over 100 amino acids), while in C. elegans the longest repeat is serine.
P. falciparum has a very unusual repeat distribution that is different to all other proteomes, prokaryotes and archaea included. Nearly 20% of the P. falciparum proteome is made up of repeats. The distribution of asparagine repeats is particularly significant. There are 137 repeats of over 20 asparagines in length, which is highly unusual as long asparagine repeats are associated with prion domains and fibril formation [28,29].
The amino acid compositions of SAARs across the kingdoms are shown in Table 3. The eukaryotes feature leucine, serine and glutamic acid as the top three constituents. Archaea feature leucine and glutamic acid as their top two constituents, while bacteria feature leucine and alanine as the top two constituents. These results agree with the overall distributions of amino acids in the three kingdoms, but although leucine appears in many short repeats and so makes a large contribution to the number of amino acids in SAARs, it very rarely has long repeats in any proteome and it is the longest repeat in only a few bacterial species.
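The headline statistic used in this comparison, the share of a proteome's residues that lie within SAARs, could be computed along the following lines. This is a minimal Python sketch under the assumption that any run of two or more identical residues counts as a SAAR; the function name `saar_percentage` and the toy sequences are illustrative and do not come from the COPASAAR scripts.

```python
from itertools import groupby


def saar_percentage(proteins, min_length=2):
    """Percentage of all residues in `proteins` that fall inside runs of
    at least `min_length` identical residues."""
    total = sum(len(seq) for seq in proteins)
    if total == 0:
        return 0.0
    in_repeats = 0
    for seq in proteins:
        # groupby yields one group per run of consecutive identical residues.
        for _, run in groupby(seq):
            length = sum(1 for _ in run)
            if length >= min_length:
                in_repeats += length
    return 100.0 * in_repeats / total


# Toy "proteome" of two made-up sequences: 9 of 16 residues lie in repeats.
print(saar_percentage(["MAAAQQQQQQLS", "MKLV"]))  # 56.25
```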

Prediction model
The prediction model shows a close correlation to the actual repeat distribution in many cases, in particular for short SAARs, although there is a consistent slight under-estimation of the number of expected repeats. This would suggest that shorter repeats are mostly randomly distributed and that few of them are likely to be functionally significant. Short repeats are therefore likely to form part of the neutral drift of protein sequence evolution.

Conclusion
COPASAAR provides an essential tool for the study of repeats in comparative proteomics. The ability to quickly analyse proteomes (and individual proteins) and to map the distribution and size of SAARs will hopefully benefit scientists from many different fields. COPASAAR will provide a useful resource for finding new protein families that can be used as species specific markers. Data on the evolution of repeats between species will also allow us to develop models of adaptive traits in proteomes. This will be particularly important in understanding the evolution of amyloid associated diseases.

Table 2 : The proportion of a proteome composed of SAARs and the percentage of proteomes in each kingdom with a greater number of SAARs than the mean. *The overall mean is 13.18%
BMC Bioinformatics 2005, 6:196 http://www.biomedcentral.com/1471-2105/6/196