RepSeq – A database of amino acid repeats present in lower eukaryotic pathogens
BMC Bioinformatics volume 8, Article number: 122 (2007)
Amino acid repeat-containing proteins have a broad range of functions and their identification is of relevance to many experimental biologists. In human-infective protozoan parasites (such as the Kinetoplastid and Plasmodium species), they are implicated in immune evasion and have been shown to influence virulence and pathogenicity. RepSeq (http://repseq.gugbe.com) is a new database of amino acid repeat-containing proteins found in lower eukaryotic pathogens. The RepSeq database is accessed via a web-based application which also provides links to related online tools and databases for further analyses.
The RepSeq algorithm typically identifies more than 98% of repeat-containing proteins and is capable of identifying both perfect and mismatch repeats. The proportion of proteins that contain repeat elements varies greatly between different families and even species (3–35% of the total protein content). The most common motif type is the Sequence Repeat Region (SRR) – a repeated motif containing multiple different amino acid types. Proteins containing Single Amino Acid Repeats (SAARs) and Di-Peptide Repeats (DPRs) typically account for 0.5–1.0% of the total protein number. Notable exceptions are P. falciparum and D. discoideum, in which 33.67% and 34.28% respectively of the predicted proteomes consist of repeat-containing proteins. These numbers are due to large insertions of low complexity single and multi-codon repeat regions.
The RepSeq database provides a repository for repeat-containing proteins found in parasitic protozoa. The database allows for both individual and cross-species proteome analyses and also allows users to upload sequences of interest for analysis by the RepSeq algorithm. Identification of repeat-containing proteins provides researchers with a defined subset of proteins which can be analysed by expression profiling and functional characterisation, thereby facilitating study of pathogenicity and virulence factors in the parasitic protozoa. While primarily designed for kinetoplastid work, the RepSeq algorithm and database retain full functionality when used to analyse other species.
All characterised eukaryotic proteomes contain proteins possessing repeated amino acid motifs within their sequence. These repeats can arise from simple sequence repeats (termed SSRs) occurring in the coding regions of genomes. SSRs typically originate from unequal crossing-over and replication errors which result from the formation of unusual DNA structures such as slipped strands and hairpins [2, 3]. SSRs range from single nucleotide repeats to large multi-codon repeats and are substantially more numerous in non-coding regions of the genome [4, 5]. SSRs are also considered a major source of quantitative genetic variation [5–7].
A range of functions has been ascribed to amino acid repeats, the most common being that repeats are a mechanism for providing regular arrays of spatial and functional groups. Error-prone SSR expansion allows for rapid evolution of proteins with repetitive structure, which can lead to rapidly changing phenotypes. In Saccharomyces cerevisiae, amino acid reiterations of different types are concentrated in different classes of proteins, including transcription factors, protein kinases and membrane transporters.
Proteins containing repeats are particularly widespread within several parasitic taxa including protozoan organisms that are the causative agents of malaria (Plasmodium species) and kinetoplastid parasites (Trypanosoma and Leishmania species) that cause a range of debilitating human diseases (African Sleeping Sickness, Chagas' Disease and the Leishmaniases). Known functions of intra-protein repeats include roles in intracellular protein-protein interactions, binding to host-cell receptors and polymerisation of their associated, non-repeated domains [12, 13]. Protein repeats are also implicated in antigenic recognition and evasion of the host immune response to infection.
Proteins containing amino acid repeats are split into distinct categories based on different types of motif (Table 1).
Single amino acid repeats (SAARs) are uninterrupted runs of a single amino acid. Certain amino acids are more prevalent (usually alanine and glutamine), although this feature is usually species-specific.
Di-peptide Repeats (DPRs) occur when a pair of non-identical amino acids is tandemly repeated in a linear sequence. These are often referred to as tandem repeats.
Sequence Repeat Regions (SRRs) occur when a given amino acid motif is repeated several times throughout a protein sequence. The length, number of repeats and amino acid content of the motif vary, and the motif can also include elements of SAARs and DPRs.
Consideration must also be given to so-called 'mismatch' repeats. These are characterised by repeat sequences in which mutations (insertions, deletions and substitutions) have rendered the sequences non-identical. While these mutations may not affect the function of the repeat region, they do make the repeats harder to identify. It has also been shown that mismatch repeats can be functionally important.
It is important to distinguish between repeats at the DNA level and repeats at the proteome level as only a small fraction of the repeats found in nucleotide sequences are translated into individual proteins. It is therefore much faster and simpler to analyse amino acid repeats at the protein level as search algorithms require less complexity and search a smaller amount of data.
While a large number of applications have been developed for detecting and analysing genome-level repeats (e.g. Repeat Masker, Repeat Scout and Tandem Repeats Finder), there are very few applications/databases which deal with proteome-level repeats. The available databases, including COPASAAR, ProtRepeatsDB and TRIPS, mostly focus on prokaryotic and higher eukaryotic analyses. While COPASAAR and TRIPS deal with specific repeat-types (SAARs and tandemly repeated sequences, respectively), ProtRepeatsDB attempts to aggregate all repeat types into its database. Unfortunately the scope of the database and the complexity of the user interface are not suited to experimental biologists looking for less complex methods of quickly identifying repeat-containing proteins for in vitro/in vivo studies.
Of current importance is the creation of a database of repeats that clearly differentiates between the different repeat-types and provides enough options for both "quick and dirty" proteome analyses as well as more comprehensive proteome studies (such as inter-species analyses). This paper presents RepSeq, an online database of amino acid repeats found in lower eukaryotic pathogens.
Construction and content
RepSeq (Database of Repeat Sequences) is a web-based database/application which allows the identification of all amino acid repeats within a given proteome. While primarily designed to work with lower eukaryotic pathogens, RepSeq can be used to study proteomes from any given organism. The RepSeq website houses an interactive database, an upload facility which uses the RepSeq algorithm to analyse user-provided sequences and all relevant documentation, methodology and glossary pages. RepSeq was devised by D. Depledge and implemented by R.J. Lower and D. Depledge.
The RepSeq algorithm was written in Perl and identifies both perfect and mismatch repeats by searching for small regions with perfect identity within a protein sequence. The algorithm takes a FASTA formatted proteome file and uses a sliding window (6 residues in length) to search for repeated sequences. Each protein within the proteome is examined individually and all information collected is stored in conjunction with any information included in the sequence header (protein id, accession number, function etc). These data are then passed to the RepSeq database. The algorithm functions by counting every 6-residue amino acid motif (termed a 'chunk') that appears within the protein sequence. The number of times each motif is repeated is recorded, together with its position within the protein sequence. The amino acid sequence of each chunk is also examined and a note made when a chunk contains a sequence of identical amino acids (e.g. AAAAAA) or a 2-residue tandem repeat (e.g. ARARAR). In the case of SAARs and DPRs, the algorithm then deduces which chunks are part of the same repeat (based on the location and sequence of the chunks). For example, three identical chunks (e.g. AAAAAA) starting at consecutive positions, and therefore overlapping, will be identified as a single SAAR 8 residues in length.
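The chunk-counting step described above can be illustrated with a short Python sketch. This is not the authors' Perl code: the function names (chunk_positions, chunk_kind, merge_saar_runs) and the way overlapping chunks are merged are assumptions made for illustration, and only the 6-residue window and the SAAR/DPR definitions come from the text.

```python
from collections import defaultdict

WINDOW = 6  # chunk length stated in the text


def chunk_positions(sequence, window=WINDOW):
    """Map every window-length chunk to the 0-based positions where it starts."""
    positions = defaultdict(list)
    for start in range(len(sequence) - window + 1):
        positions[sequence[start:start + window]].append(start)
    return positions


def chunk_kind(chunk):
    """Label a chunk as 'SAAR', 'DPR' or 'other' from its composition alone."""
    if len(set(chunk)) == 1:
        return "SAAR"
    # A di-peptide chunk alternates two different residues, e.g. ARARAR.
    if chunk[0] != chunk[1] and chunk == chunk[:2] * (len(chunk) // 2):
        return "DPR"
    return "other"


def merge_saar_runs(sequence, window=WINDOW):
    """Merge overlapping single-residue chunks into (start, length, residue) runs.

    Assumption: chunks of the same residue that start at consecutive
    positions belong to the same repeat.
    """
    runs = []
    for chunk, starts in chunk_positions(sequence, window).items():
        if chunk_kind(chunk) != "SAAR":
            continue
        run_start = prev = starts[0]
        for start in starts[1:]:
            if start != prev + 1:  # gap found: close the current run
                runs.append((run_start, prev - run_start + window, chunk[0]))
                run_start = start
            prev = start
        runs.append((run_start, prev - run_start + window, chunk[0]))
    return sorted(runs)


if __name__ == "__main__":
    # Three overlapping AAAAAA chunks (starts 2, 3, 4) merge into one 8-residue run.
    print(merge_saar_runs("MK" + "A" * 8 + "LRT"))
```

A chunk that occurs at more than one position in chunk_positions is a candidate repeat; the length arithmetic in merge_saar_runs reproduces the example in the text, where three consecutive chunks span 8 residues.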
All repeated chunks not classified as SAARs or DPRs are counted as SRRs and stored as such in the database. In addition, each SRR is given a score, based on the number of repeats of the chunks and their relative positions. This score can be used as an indicator of how strong a repeat motif is (i.e. a higher score indicates a strongly conserved motif repeated many times, whereas a low score indicates a less conserved motif repeated only a few times). The algorithm sensitivity is increased by including a function that allows for similar chunks (i.e. those in which 5 out of 6 residues are conserved) to be grouped together. This increases the chances of identifying lower identity repeat sequences but can also increase the number of false positives identified. Currently this is only implemented when identifying SRRs, because applying it to SAARs and DPRs produces very large numbers of false positives. The RepSeq algorithm requires approximately 2–3 minutes to analyse a proteome of 10000 proteins on a 2.80 GHz Pentium 4 machine with 512 MB RAM running Windows XP (SP2).
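The grouping of similar chunks (5 out of 6 residues conserved) can be sketched as a comparison of equal-length chunks by Hamming distance. The greedy grouping strategy and the function names below are assumptions chosen for brevity; the text does not specify how RepSeq performs the grouping or how the SRR score is calculated, so no scoring is attempted here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length chunks differ."""
    return sum(x != y for x, y in zip(a, b))


def group_similar_chunks(chunks, max_mismatch=1):
    """Greedily group chunks that differ from a group's first member
    by at most max_mismatch residues (1 mismatch = 5 of 6 conserved)."""
    groups = []
    for chunk in chunks:
        for group in groups:
            if hamming(group[0], chunk) <= max_mismatch:
                group.append(chunk)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([chunk])
    return groups


if __name__ == "__main__":
    print(group_similar_chunks(["PEPTID", "PEPTLD", "KKLMNQ"]))
```

Relaxing the match in this way means that two unrelated chunks need to agree at only five positions to be grouped, which is consistent with the rise in false positives noted above.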
The RepSeq algorithm was evaluated using a number of test data sets, created to resemble lower eukaryotic proteomes in terms of size and proportion of repeat-containing proteins. The data sets generated comprised 5000 and 10000 proteins, of which 5% or 25% were repeat-containing proteins (containing SRRs of varying length and number of repetitions). The results are shown in Table 2 and discussed below. The test data sets were generated using a small script written in Ruby.
The database was written using MySQL v5.0 and consists of 3 tables, the schema and structure of which are shown in Figure 1. Given that each organism could house many repeat-containing proteins and that each gene could contain several distinct repeat-regions, it was necessary to structure the database so as to reduce data redundancy. Each table contains a unique and auto-incrementing column (organismID, geneID and repeatID) which allows data linking from child tables via foreign keys.
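A three-table layout of this kind can be sketched as follows. The sketch uses Python's built-in sqlite3 module rather than MySQL so that it runs without a server, and only the key columns organismID, geneID and repeatID are taken from the text; the table names, the remaining columns and the demo record (accession DEMO0001) are hypothetical and do not reproduce the schema in Figure 1.

```python
import sqlite3

# Hypothetical schema: only organismID, geneID and repeatID are named in the text.
SCHEMA = """
CREATE TABLE organism (
    organismID INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL
);
CREATE TABLE gene (
    geneID      INTEGER PRIMARY KEY AUTOINCREMENT,
    organismID  INTEGER NOT NULL REFERENCES organism(organismID),
    accession   TEXT,
    description TEXT,
    sequence    TEXT
);
CREATE TABLE repeat_region (
    repeatID   INTEGER PRIMARY KEY AUTOINCREMENT,
    geneID     INTEGER NOT NULL REFERENCES gene(geneID),
    repeatType TEXT CHECK (repeatType IN ('SAAR', 'DPR', 'SRR')),
    motif      TEXT,
    startPos   INTEGER,
    length     INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up organism/gene/repeat chain."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    cur = conn.execute("INSERT INTO organism (name) VALUES (?)",
                       ("Leishmania infantum",))
    organism_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO gene (organismID, accession, description, sequence) "
        "VALUES (?, ?, ?, ?)",
        (organism_id, "DEMO0001", "hypothetical protein", "MKAAAAAAAALRT"),
    )
    gene_id = cur.lastrowid
    conn.execute(
        "INSERT INTO repeat_region (geneID, repeatType, motif, startPos, length) "
        "VALUES (?, ?, ?, ?, ?)",
        (gene_id, "SAAR", "A", 2, 8),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    query = (
        "SELECT o.name, g.accession, r.repeatType, r.length "
        "FROM repeat_region r "
        "JOIN gene g ON g.geneID = r.geneID "
        "JOIN organism o ON o.organismID = g.organismID"
    )
    for row in db.execute(query):
        print(row)
```

Storing each organism and gene once and referring to them by key is what keeps redundancy low when a single gene carries several repeat regions.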
Access to the database is via the RepSeq web-interface (Figure 2). A variety of search criteria are present for examining the database. Users are able to specify which types of repeat they wish to search for, as well as assign values for the minimum length or number of repeat units. A repeat strength function also allows users to choose between loose, standard and strict searches. This function affects the number of repeat-containing proteins that are returned. Strict searches will only return proteins which possess strong repeats (based on the scoring system described previously) while loose searches will return all proteins containing any form of repeat. The relative strengths of these options are discussed below. Users are also able to search for specific proteins by their accession number or functional description. Once all options are specified, the data are retrieved from the database and displayed in an output table (Figure 3). Selecting proteins from the table will show the full protein sequence and highlight the repeated regions, allowing users to determine the full repeat motif. This method is faster and more reliable than any computational method so far developed.
Proteomes from species not hosted at this site can be uploaded and analysed using the website's upload facility. Here, FASTA formatted files can be submitted for analysis. Results will be stored in the database for a minimum of seven days and a personalised link to a query page (specific to the uploaded proteome) will be provided. The dataset will also be available for download. The upload and processing time varies from 1–5 minutes for proteomes of 20000 proteins.
RepSeq proteome data
The current database contains thirteen proteomes (Table 3). These were obtained from a variety of sources including GeneDB and PlasmoDB and are available for analysis using RepSeq. While primarily designed to aid the functional analysis of parasitic proteomes, the RepSeq database has been expanded to include a range of lower eukaryotic pathogens. Further proteomes will be added as they are sequenced and updated proteome releases will be incorporated as they become available.
Utility and discussion
Analysis of the test data sets (Table 2) indicates that the RepSeq algorithm functions properly and is able to identify all major repeat types. SAAR and DPR sequences are identified 100% of the time provided that they are 6 residues or longer. In all test data sets, RepSeq identified 100% of SRRs when set to identify 2+ repeats on the "loose" repeat strength threshold setting. This was counterbalanced by the identification of a large number of false positives. When the SRR repeat threshold was increased to 3+, all false positives were removed while over 99.8% of repeat-containing proteins were still identified. The "standard" setting (searching for 2+ repeats) identified 99.8% of repeat-containing proteins and registered a significantly smaller proportion of false positives (typically fewer than 10 in total). The "strict" setting reduced the proportion of repeat-containing proteins identified to 97% but did not detect any false positives. As mentioned earlier, RepSeq will also identify mismatch repeats (allowing for 1 amino acid substitution) provided that two identical 6-residue sequences are conserved in the repeat. All the false positives identified are proteins containing one repeat of a 5/6 residue sequence and thus can easily be identified and removed from further analysis.
When analysing proteomes, consideration must be given to the fact that in any given amino acid sequence, a proportion of repeats will occur purely by chance (i.e. where repeated motifs are not true repeats but just random sequence). While statistical models are available for predicting the extent to which this may occur, these do not help in determining which repeats can be classified as true repeats (i.e. those of structural/functional importance). There are no established methods for attempting to identify which repeats are functionally significant (that might reduce the need for robust experimental validation of each candidate). There are, however, methods for looking at each repeat type (within the context of an individual proteome) that allow users to determine their own cut-off points (for instance, repeat length, or number of repeats). A closer look at the proteome of L. infantum shows that there are 974 proteins containing SAARs which are 6 residues or longer. If the minimum SAAR length is raised to 10 residues, then the number of repeat-containing proteins falls to 60 – a significant reduction. Table 4 shows the relationship between SAAR length and number of repeat-containing proteins as well as the proportion of DPRs and SRRs encountered when varying the number of those repeats searched for.
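The effect of raising the minimum SAAR length can be explored with a few lines of Python. This is an illustrative sketch only: the function names and the toy proteome are invented, and the counts it prints say nothing about the L. infantum figures quoted above.

```python
import re

# One residue followed by one or more copies of itself.
_RUN = re.compile(r"(.)\1+")


def longest_saar(sequence):
    """Length of the longest uninterrupted single-residue run
    (1 if no residue is repeated in tandem, 0 for an empty sequence)."""
    if not sequence:
        return 0
    return max((len(m.group(0)) for m in _RUN.finditer(sequence)), default=1)


def proteins_by_min_length(proteome, thresholds):
    """For each threshold, count proteins whose longest SAAR is at least that long."""
    longest = [longest_saar(seq) for seq in proteome.values()]
    return {t: sum(length >= t for length in longest) for t in thresholds}


if __name__ == "__main__":
    # Toy proteome with made-up identifiers and sequences.
    toy = {
        "p1": "MK" + "A" * 6 + "LRT",
        "p2": "MK" + "Q" * 12 + "LRT",
        "p3": "MKLRTWYV",
    }
    print(proteins_by_min_length(toy, [6, 10]))
```

Tabulating the counts over a range of thresholds gives the kind of length-versus-number relationship summarised in Table 4, from which a user can pick a cut-off.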
While comparing repeat size against the number of proteins identified is a good method for identifying SAARs and DPRs, a different approach is required for determining the cut-offs for SRRs. Consider three proteins with different sequence lengths (100, 1000 and 10000 residues). A 10-residue motif repeated once is probably significant in the small protein, yet could have arisen by chance in the two larger proteins. This would suggest that when identifying real SRRs, the percentage of the protein which consists of the repeat region should be used. By contrast the same motif repeated 10 times in the largest protein would account for only 1% of the whole protein, yet could be structurally or functionally important. From this simple example, it is clear that defining a significant SRR requires a consideration of both repeat number and repeat size. Closer examination of the proteomes found in RepSeq suggests that sequences repeated at least three times typically account for large proportions (> 5%) of the whole protein (data not shown). For sequences repeated twice, only those which exceed 2–10% of the whole protein can be classified as non-random. Randomly occurring repeats typically account for < 1% of the total protein.
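The proportion-based reasoning in this paragraph amounts to a simple calculation, sketched below in Python. The function name and the assumption that repeat copies do not overlap are illustrative; the motif length and protein lengths are those of the worked example in the text.

```python
def repeat_fraction(motif_length, copies, protein_length):
    """Fraction of a protein occupied by `copies` non-overlapping copies of a motif."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return motif_length * copies / protein_length


if __name__ == "__main__":
    # A 10-residue motif present in 10 copies, in proteins of three sizes.
    for length in (100, 1000, 10000):
        share = repeat_fraction(10, 10, length)
        print(f"{length:>6} residues: {share:.1%} of the protein")
```

For the 10000-residue protein this gives 1%, matching the figure in the text, which is why the proportion has to be read alongside the number of copies rather than used as the only criterion.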
As discussed above, there is considerable flexibility in determining which cut-off values should be used when identifying repeat-containing proteins. For the purpose of the analyses presented here, the cut-offs established were SAARs of 10 residues or longer, DPRs of four repeats or more and SRRs repeated at least 3 times, using the standard repeat strength threshold setting. Analysis of the Leishmania and Trypanosoma proteomes found that repeat-containing proteins typically account for about 3–4% of the total protein number (Table 5). Interestingly, there is a large variation in the proportion of repeat-containing proteins in the Plasmodium spp. Those species that are pathogens of humans (P. falciparum and P. vivax) contain large numbers of repeat-containing proteins within their proteomes (33.49% and 21.62% respectively), while the rodent malaria species (P. chabaudi, P. berghei and P. yoelii) possess relatively few proteins of this type (2.19%, 4.05% and 13.49% respectively). There is also a large difference in the proteome sizes of Plasmodium species maintained in different hosts: the human infective species contain half the total number of predicted proteins as compared to the rodent pathogen species (although these numbers may change as the data sets undergo further annotation and refinement). It has previously been noted that P. falciparum contains a large number of low complexity repeat regions (predominantly coding for asparagine residues) due to single and multi-codon insertions within the coding region of the corresponding genes [19, 20]. These repeat regions are believed to form non-globular segments of unknown function that extend from protein domains [23–26]. Further functional analysis is required to confirm these predictions. Interestingly, such low complexity repeat regions do not appear in the P. vivax and P. yoelii proteomes.
The other species analysed in this study were the parasitic amoeba, Entamoeba histolytica, in which only 2.79% of the proteome can be classed as repeat-containing, and the soil amoeba, Dictyostelium discoideum, which contains the largest proportion of repeat-containing proteins (34.28%) of any of the protozoan proteomes analysed in this study. While SRRs typically account for the largest proportion of repeat-containing proteins in the species analysed, there are two notable exceptions: T. cruzi and D. discoideum, which both contain a larger proportion of SAARs. In the case of T. cruzi, this may be due to the large number of tyrosine, glutamine and glutamate SAARs that appear throughout the proteome. D. discoideum, like P. falciparum, contains very large numbers of low complexity repeat regions featuring asparagine.
RepSeq has primarily been designed with experimental parasitologists in mind. The ability to rapidly identify repeat-containing proteins (according to whatever criteria are set during the study) allows users to quickly generate lists of proteins for expression-profiling and functional analysis. An example of this is in the comparative proteomic analyses of different Leishmania species that cause diverse disease phenotypes. In this project, the data provided by RepSeq have been further analysed to show that ~70% of repeat-containing proteins are conserved amongst all three species analysed. Furthermore, in nearly all cases, the repeat regions are particularly well conserved during speciation. A small number of the repeat-containing proteins are species-specific. Some of these are already targets for Leishmania researchers attempting to define virulence and pathogenicity factors, while others could provide interesting candidates for vaccine development.
RepSeq provides an essential tool for the study of amino acid repeat-containing proteins. RepSeq compares favourably with other databases such as COPASAAR and ProtRepeatsDB due to its ability to quickly read through proteomes and present a comprehensive analysis which can be tailored to a wide variety of studies. Particular advantages are the ability to differentiate between the different repeat types and the ability to search for both very strict and very weak repeats. Furthermore, the sliding window employed by the algorithm is capable of identifying both perfect and mismatch repeats as long as a small part of the repeat is well conserved. While primarily designed for analysing lower eukaryotic organisms, RepSeq is capable of analysing proteomes from all species. The identification of amino acid repeat-containing proteins provides scientists with a new and complete subset of proteins which can be used in a range of studies from expression profiling to functional characterisation.
This may be of particular importance when studying pathogenicity and virulence factors in protozoan parasites and also has applications to the study of neurodegenerative diseases such as Huntington's chorea.
Availability and requirements
RepSeq is freely accessible on the Internet at http://repseq.gugbe.com. The web-interface comprises many integrated sections for easy browsing and data retrieval and is supported by Perl and PHP scripts which enable formulation of queries against the database. All results are displayed in either tabulated or graphical form.
Depledge DP, Dalby AR: COPASAAR- A database for proteomic analysis of single amino acid repeats. BMC Bioinformatics 2005, 6: 196. 10.1186/1471-2105-6-196
Kruglyak S, Durrett R, Schug MD, Aquadro CF: Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol Biol Evol 2000, 17: 1210–1219.
LeProust EM, Pearson CE, Sinden RR, Gao XL: Unexpected formation of parallel duplex in GAA and TTC trinucleotide repeats of Friedreich's ataxia. J Mol Biol 2000, 302: 1063–1080. 10.1006/jmbi.2000.4073
Kashi Y, King D, Soller M: Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 1997, 13: 74–78. 10.1016/S0168-9525(97)01008-1
Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D: A census of protein repeats. J Mol Biol 1999, 293: 151–160. 10.1006/jmbi.1999.3136
Alba MM, Santibanez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol 1999, 16: 1641–1644.
Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res 2004, 14: 549–554. 10.1101/gr.1925704
Katti MV, Sami-Subbu R, Ranjekar PK, Gupta VS: Amino acid repeat patterns in protein sequences: Their diversity and structural-functional implications. Protein Sci 2000, 9: 1203–1209.
Alba MM, Santibanez-Koref MF, Hancock JM: Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol 1999, 49: 789–797. 10.1007/PL00006601
Hughes AL: The evolution of amino acid repeat arrays in Plasmodium and other organisms. J Mol Evol 2004, 59(4):528–35. 10.1007/s00239-004-2645-4
World Health Organisation [http://www.who.int/tdr]
Rosenthal PJ: Cysteine proteases of malaria parasites. Int J Parasitol 2004, 34(13–14):1489–99. 10.1016/j.ijpara.2004.10.003
Ilg T, Montgomery J, Stierhof YD, Handman E: Molecular cloning and characterization of a novel repeat-containing Leishmania major gene, ppg1, that encodes a membrane-associated form of proteophosphoglycan with a putative glycosylphosphatidylinositol anchor. J Biol Chem 1999, 274(44):31410–20. 10.1074/jbc.274.44.31410
Tetteh KK, Cavanagh DR, Corran P, Musonda R, McBride JS, Conway DJ: Extensive antigenic polymorphism within the repeat sequence of the Plasmodium falciparum merozoite surface protein 1 block 2 is incorporated in a minimal polyvalent immunogen. Infect Immun 2005, 73(9):5928–35. 10.1128/IAI.73.9.5928-5935.2005
Clarke JL, Sodeinde O, Mason PJ: A unique insertion in Plasmodium berghei glucose-6-phosphate dehydrogenase-6-phosphogluconolactonase: evolutionary and functional studies. Mol Biochem Parasitol 2003, 127: 1–8. 10.1016/S0166-6851(02)00298-0
Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(1):i351–8. 10.1093/bioinformatics/bti1018
Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 1999, 27(2):573–580. 10.1093/nar/27.2.573
Kalita MK, Ramasamy G, Duraisamy S, Chauhan VS, Gupta D: ProtRepeatsDB: A database of amino acid repeats in genomes. BMC Bioinformatics 2006, 7(1):336. 10.1186/1471-2105-7-336
Subirana JA, Palau J: Structural features of single amino acid repeats in proteins. FEBS Lett 1999, 448(1):1–3. 10.1016/S0014-5793(99)00310-5
Singh GP, Chandra BR, Bhattacharya A, Akhouri RR, Singh SK, Sharma A: Hyper-expansion of asparagines correlates with an abundance of proteins with prion-like domains in Plasmodium falciparum. Mol Biochem Parasitol 2004, 137(2):307–19. 10.1016/j.molbiopara.2004.05.016
Gardner MJ, Hall N, Fung E, et al.: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 2002, 419: 498–511. 10.1038/nature01097
Pizzi E, Frontali C: Molecular evolution of coding and non-coding regions in Plasmodium. Parassitologia 1999, 41: 89–91.
Pizzi E, Frontali C: Low-complexity regions in Plasmodium falciparum proteins. Genome Res 2001, 11: 218–229. 10.1101/gr.GR-1522R
Peacock C, Seeger K, Harris D, Murphy L, Ruiz J, Quail MA, et al.: Conservation of genome architecture and content between three Leishmania species causing diverse human disease. Nature Genetics, in press.
Zhang WW, Mendez S, Ghosh A, Myler P, Ivens A, Clos J, Sacks DL, Matlashewski G: Comparison of the A2 gene locus in Leishmania donovani and Leishmania major and its control over cutaneous infection. J Biol Chem 2003, 12; 278(37):35508–15. 10.1074/jbc.M305030200
We thank Chris Peacock and Matt Berriman (Wellcome Trust Sanger Institute) for help with access to Leishmania and other protozoan sequences via GeneDB and PlasmoDB and Peter Ashton (University of York) for his comments on the manuscript. DD is supported by a postgraduate studentship from the BBSRC; RPJL completed his contribution to this work while an MRes Bioinformatics student.
RPJL and DPD implemented the RepSeq algorithm and created the RepSeq website; RPJL designed and implemented the RepSeq database. DPD and DFS conceived the study, participated in its design and coordination and drafted the manuscript. All authors read and approved the final manuscript.
Daniel P Depledge, Ryan PJ Lower and Deborah F Smith contributed equally to this work.
Depledge, D.P., Lower, R.P. & Smith, D.F. RepSeq – A database of amino acid repeats present in lower eukaryotic pathogens. BMC Bioinformatics 8, 122 (2007). https://doi.org/10.1186/1471-2105-8-122