RepSeq – A database of amino acid repeats present in lower eukaryotic pathogens
© Depledge et al; licensee BioMed Central Ltd. 2007
Received: 21 December 2006
Accepted: 11 April 2007
Published: 11 April 2007
Amino acid repeat-containing proteins have a broad range of functions and their identification is of relevance to many experimental biologists. In human-infective protozoan parasites (such as the Kinetoplastid and Plasmodium species), they are implicated in immune evasion and have been shown to influence virulence and pathogenicity. RepSeq (http://repseq.gugbe.com) is a new database of amino acid repeat-containing proteins found in lower eukaryotic pathogens. The RepSeq database is accessed via a web-based application which also provides links to related online tools and databases for further analyses.
The RepSeq algorithm typically identifies more than 98% of repeat-containing proteins and is capable of identifying both perfect and mismatch repeats. The proportion of proteins that contain repeat elements varies greatly between different families and even species (3–35% of the total protein content). The most common motif type is the Sequence Repeat Region (SRR) – a repeated motif containing multiple different amino acid types. Proteins containing Single Amino Acid Repeats (SAARs) and Di-Peptide Repeats (DPRs) typically account for 0.5–1.0% of the total protein number. Notable exceptions are P. falciparum and D. discoideum, in which 33.67% and 34.28% respectively of the predicted proteomes consist of repeat-containing proteins. These numbers are due to large insertions of low complexity single and multi-codon repeat regions.
The RepSeq database provides a repository for repeat-containing proteins found in parasitic protozoa. The database allows for both individual and cross-species proteome analyses and also allows users to upload sequences of interest for analysis by the RepSeq algorithm. Identification of repeat-containing proteins provides researchers with a defined subset of proteins which can be analysed by expression profiling and functional characterisation, thereby facilitating study of pathogenicity and virulence factors in the parasitic protozoa. While primarily designed for kinetoplastid work, the RepSeq algorithm and database retain full functionality when used to analyse other species.
All characterised eukaryotic proteomes contain proteins possessing repeated amino acid motifs within their sequence. These repeats can arise from simple sequence repeats (termed SSRs) occurring in the coding regions of genomes. SSRs typically originate from unequal crossing-over and replication errors which result from the formation of unusual DNA structures such as slipped strands and hairpins [2, 3]. SSRs range from single nucleotide repeats to large multi-codon repeats and are substantially more numerous in non-coding regions of the genome [4, 5]. SSRs are also considered a major source of quantitative genetic variation [5–7].
A range of functions has been ascribed to amino acid repeats, the most common being that repeats are a mechanism for providing regular arrays of spatial and functional groups. Error-prone SSR expansion allows for rapid evolution of proteins with repetitive structure, which can lead to rapidly changing phenotypes. In Saccharomyces cerevisiae, amino acid reiterations of different types are concentrated in different classes of proteins, including transcription factors, protein kinases and membrane transporters.
Proteins containing repeats are particularly widespread within several parasitic taxa, including the protozoan organisms that are the causative agents of malaria (Plasmodium species) and the kinetoplastid parasites (Trypanosoma and Leishmania species) that cause a range of debilitating human diseases (African Sleeping Sickness, Chagas' Disease and the Leishmaniases). Known functions of intra-protein repeats include roles in intracellular protein-protein interactions, binding to host-cell receptors and polymerisation of their associated, non-repeated domains [12, 13]. Protein repeats are also implicated in antigenic recognition and evasion of the host immune response to infection.
Examples of amino acid repeats: Single Amino Acid Repeat (SAAR), Di-peptide Repeat (DPR) and Sequence Repeat Region (SRR).
Single amino acid repeats (SAARs) are uninterrupted runs of a single amino acid. Certain amino acids are more prevalent (usually alanine and glutamine), although this feature is usually species-specific.
Di-peptide repeat (DPR) motifs occur when a pair of non-identical amino acids is tandemly repeated in a linear sequence. These are often referred to as tandem repeats.
Sequence Repeat Regions (SRRs) occur when a given amino acid motif is repeated several times throughout a protein sequence. The length, number of repeats and amino acid content of the motif varies and can also include elements of SAARs and DPRs.
Consideration must also be given to so-called 'mismatch' repeats. These are characterised by repeat sequences in which mutations (insertions, deletions and substitutions) have rendered the sequences non-identical. While these mutations may not affect the function of the repeat region, they do make the repeats harder to identify. It has also been shown that mismatch repeats can be functionally important.
It is important to distinguish between repeats at the DNA level and repeats at the proteome level as only a small fraction of the repeats found in nucleotide sequences are translated into individual proteins. It is therefore much faster and simpler to analyse amino acid repeats at the protein level as search algorithms require less complexity and search a smaller amount of data.
While a large number of applications have been developed for detecting and analysing genome-level repeats (e.g. Repeat Masker, Repeat Scout and Tandem Repeats Finder), there are very few applications/databases which deal with proteome-level repeats. The available databases, including COPASAAR, ProtRepeatsDB and TRIPS, mostly focus on prokaryotic and higher eukaryotic analyses. While COPASAAR and TRIPS deal with specific repeat-types (SAARs and tandemly repeated sequences, respectively), ProtRepeatsDB attempts to aggregate all repeat types into its database. Unfortunately the scope of the database and the complexity of the user interface are not suited to experimental biologists looking for less complex methods of quickly identifying repeat-containing proteins for in vitro/in vivo studies.
Of current importance is the creation of a database of repeats that clearly differentiates between the different repeat-types and provides enough options for both "quick and dirty" proteome analyses as well as more comprehensive proteome studies (such as inter-species analyses). This paper presents RepSeq, an online database of amino acid repeats found in lower eukaryotic pathogens.
Construction and content
RepSeq (Database of Repeat Sequences) is a web-based database/application which allows the identification of all amino acid repeats within a given proteome. While primarily designed to work with lower eukaryotic pathogens, RepSeq can be used to study proteomes from any given organism. The RepSeq website houses an interactive database, an upload facility which uses the RepSeq algorithm to analyse user-provided sequences and all relevant documentation, methodology and glossary pages. RepSeq was devised by D. Depledge and implemented by R.J. Lower and D. Depledge.
The RepSeq algorithm was written using PERL and identifies both perfect and mismatch repeats by searching for small regions with perfect identity within a protein sequence. The algorithm takes a FASTA formatted proteome file and uses a sliding window (6 residues in length) to search for repeated sequences. Each protein within the proteome is examined individually and all information collected is stored in conjunction with any information included in the sequence header (protein id, accession number, function etc). These data are then passed to the RepSeq database. The algorithm functions by counting every 6-residue amino acid motif (termed a 'chunk') that appears within the protein sequence. The number of times each motif is repeated is recorded, together with its position within the protein sequence. The amino acid sequence of each chunk is also examined and a note made when a chunk contains a sequence of identical amino acids (e.g. AAAAAA) or a 2-residue tandem repeat (e.g. ARARAR). In the case of SAARs and DPRs, the algorithm then deduces which chunks are part of the same repeat (based on the location and sequence of the chunks). For example, three identical chunks (e.g. AAAAAA) starting at consecutive positions overlap to span 8 residues and will be identified as a single SAAR, 8 residues in length.
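The chunk-counting step described above can be illustrated with a minimal Python sketch. This is not the RepSeq PERL code: the function names (count_chunks, chunk_type, saar_runs) and the example sequence are hypothetical, and only the 6-residue window, the SAAR/DPR chunk flags and the merging of overlapping SAAR chunks are taken from the text.

```python
from collections import defaultdict

WINDOW = 6  # chunk length stated in the text


def count_chunks(seq, window=WINDOW):
    """Map every window-length chunk to the list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - window + 1):
        positions[seq[i:i + window]].append(i)
    return positions


def chunk_type(chunk):
    """Label a chunk 'SAAR' (one residue type), 'DPR' (2-residue tandem) or 'other'."""
    if len(set(chunk)) == 1:
        return "SAAR"
    if chunk[0] != chunk[1] and chunk == chunk[:2] * (len(chunk) // 2):
        return "DPR"
    return "other"


def saar_runs(seq, window=WINDOW):
    """Merge overlapping SAAR chunks into (start, length, residue) runs."""
    runs = []
    i = 0
    while i <= len(seq) - window:
        if chunk_type(seq[i:i + window]) == "SAAR":
            j = i + window
            # Extend the run while the same residue continues.
            while j < len(seq) and seq[j] == seq[i]:
                j += 1
            runs.append((i, j - i, seq[i]))
            i = j
        else:
            i += 1
    return runs


# Hypothetical test sequence: an 8-residue alanine run and an RT di-peptide repeat.
example = "MKAAAAAAAALRTRTRTRTK"
chunks = count_chunks(example)
print(chunks["AAAAAA"])      # start positions of the three overlapping AAAAAA chunks
print(saar_runs(example))    # one merged SAAR run, 8 residues long
print(chunk_type("RTRTRT"))  # DPR
```

In this example the three AAAAAA chunks start at consecutive positions, so they collapse into a single 8-residue SAAR, matching the worked example in the text. DPR chunks could be merged in the same way by extending in steps of two residues.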
All repeated chunks not classified as SAARs or DPRs are counted as SRRs and stored as such in the database. In addition, each SRR is given a score, based on the number of repeats of the chunks and their relative positions. This score can be used as an indicator of how strong a repeat motif is (i.e. a higher score indicates a strongly conserved motif repeated many times, whereas a low score indicates a less conserved motif repeated only a few times). The algorithm sensitivity is increased by including a function that allows for similar chunks (i.e. those in which 5 out of 6 residues are conserved) to be grouped together. This increases the chances of identifying lower identity repeat sequences but can also increase the number of false positives identified. Currently this is only implemented when identifying SRRs due to huge numbers of false positive DPRs and SAARs being identified. The RepSeq algorithm requires approximately 2–3 minutes to analyse a proteome of 10000 proteins on a P4 2.80GHz, 512MB RAM running Windows XP (SP2).
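The grouping of similar chunks (5 out of 6 residues conserved) can be sketched as follows. The text does not say how RepSeq performs this grouping, so the greedy seed-based clustering below, the function names and the example sequence are all assumptions made for illustration. The SRR score is not reproduced because its formula is not given.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def group_similar_chunks(seq, window=6, max_mismatch=1):
    """Greedily cluster chunks that differ from a seed chunk by at most
    max_mismatch residues; returns {seed chunk: total occurrences}."""
    counts = Counter(seq[i:i + window] for i in range(len(seq) - window + 1))
    groups = {}
    # Visit the most frequent chunks first so that they become the group seeds.
    for chunk, n in counts.most_common():
        for seed in groups:
            if hamming(seed, chunk) <= max_mismatch:
                groups[seed] += n
                break
        else:
            groups[chunk] = n
    return groups


# Hypothetical sequence: GSEDKP followed by a one-substitution copy, GSEDRP.
example = "GSEDKPGSEDRP"
print(group_similar_chunks(example)["GSEDKP"])                  # both copies counted together
print(group_similar_chunks(example, max_mismatch=0)["GSEDKP"])  # exact matching only
```

With exact matching the two copies are counted separately and neither registers as repeated. Allowing one mismatch groups them, which illustrates both the gain in sensitivity and why looser matching can admit more false positives.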
Table 2: Test data set analysis (Loose, Normal and Strict Repeat Thresholds; surviving entry: 499 (99.8 %)).
Proteomes from species not hosted at this site can be uploaded and analysed using the website's upload facility. Here, FASTA formatted files can be submitted for analysis. Results will be stored in the database for a minimum of seven days and a personalised link to a query page (specific to the uploaded proteome) will be provided. The dataset will also be available for download. The upload and processing time varies from 1–5 minutes for proteomes of 20000 proteins.
RepSeq proteome data
Protozoan parasite species currently available in RepSeq (with the number of Predicted Protein-Coding Genes for each).
Utility and discussion
Analysis of the test data sets (Table 2) indicates that the RepSeq algorithm functions properly and is able to identify all major repeat types. SAAR and DPR sequences are identified 100% of the time provided that they are 6 residues or longer. In all test data sets, RepSeq identified 100% of SRRs when set to identify 2+ repeats on the 'loose' repeat strength threshold setting. This was counterbalanced by the identification of a large number of false positives. When the minimum number of SRR repeats was increased to 3+, all false positives were removed while over 99.8% of repeat-containing proteins were still identified. The 'standard' setting (searching for 2+ repeats) identified 99.8% of repeat-containing proteins and registered a significantly smaller proportion of false positives (typically fewer than 10 in total). The 'strict' setting reduced the proportion of repeat-containing proteins identified to 97% but did not detect any false positives. As mentioned earlier, RepSeq will also identify mismatch repeats (allowing for 1 amino acid substitution within a 6-residue chunk) provided that two identical 6-residue sequences are conserved in the repeat. All the false positives identified are proteins containing one repeat of a 5/6 residue sequence and thus can easily be identified and removed from further analysis.
Amino acid repeat distribution of selected species: Single Amino Acid Repeats (SAARs) and Di-peptide repeats (DPRs).
While comparing repeat size against the number of proteins identified is a good method for identifying SAARs and DPRs, a different approach is required for determining the cut-offs for SRRs. Consider three proteins with different sequence lengths (100, 1000 and 10000 residues). A 10-residue motif repeated once is probably significant in the small protein, yet could have arisen by chance in the two larger proteins. This would suggest that when identifying real SRRs, the percentage of the protein which consists of the repeat region should be used. By contrast the same motif repeated 10 times in the largest protein would account for only 1% of the whole protein, yet could be structurally or functionally important. From this simple example, it is clear that defining a significant SRR requires a consideration of both repeat number and repeat size. Closer examination of the proteomes found in RepSeq suggests that sequences repeated at least three times typically account for large proportions (> 5%) of the whole protein (data not shown). For sequences repeated twice, only those which exceed 2–10% of the whole protein can be classified as non-random. Randomly occurring repeats typically account for < 1% of the total protein.
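The arithmetic behind this argument can be made explicit with a short Python sketch. The function names are hypothetical. The 2% cut-off used for motifs present in two copies is an assumption taken from the lower end of the 2–10% range quoted above, and accepting any motif with three or more copies regardless of coverage is a simplification rather than a documented RepSeq rule.

```python
def repeat_coverage(motif_len, copies, protein_len):
    """Percentage of the protein occupied by all copies of a motif."""
    return 100.0 * motif_len * copies / protein_len


def is_candidate_srr(motif_len, copies, protein_len, min_pct_two_copies=2.0):
    """Rough significance filter (assumed rule, see lead-in):
    3+ copies always pass; exactly 2 copies must cover at least
    min_pct_two_copies percent of the protein."""
    if copies >= 3:
        return True
    if copies == 2:
        return repeat_coverage(motif_len, copies, protein_len) >= min_pct_two_copies
    return False


# A 10-residue motif present in two copies, in proteins of increasing length.
for length in (100, 1000, 10000):
    print(length, repeat_coverage(10, 2, length), is_candidate_srr(10, 2, length))

# The same motif present in 10 copies in the 10000-residue protein (1% coverage).
print(repeat_coverage(10, 10, 10000), is_candidate_srr(10, 10, 10000))
```

The two-copy motif covers 20% of the 100-residue protein but only 0.2% of the 10000-residue protein, whereas ten copies in the largest protein cover just 1% yet still pass the filter on repeat number alone. This is why both repeat number and repeat size have to be considered.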
Amino acid repeat frequency in protozoan parasitic proteomes. Columns: Total Predicted Coding Sequences; SRR (3+ Repeats); Total amino acid repeat containing proteins *; Total % repeat containing proteins.
The other species analysed in this study were the parasitic amoeba, Entamoeba histolytica, in which only 2.79% of the proteome can be classed as repeat-containing, and the soil amoeba, Dictyostelium discoideum, which contains the largest proportion of repeat-containing proteins (34.28%) of any of the protozoan proteomes analysed in this study. While SRRs typically account for the largest proportion of repeat-containing proteins in the species analysed, there are two notable exceptions: T. cruzi and D. discoideum, which both contain a larger proportion of SAARs. In the case of T. cruzi, this may be due to the large number of tyrosine, glutamine and glutamate SAARs that appear throughout the proteome. D. discoideum, like P. falciparum, contains very large numbers of low complexity repeat regions featuring asparagines.
RepSeq has primarily been designed with experimental parasitologists in mind. The ability to rapidly identify repeat-containing proteins (according to whatever criteria are set during the study) allows users to quickly generate lists of proteins for expression-profiling and functional analysis. An example of this is the comparative proteomic analysis of different Leishmania species that cause diverse disease phenotypes. In this project, the data provided by RepSeq have been further analysed to show that ~70% of repeat-containing proteins are conserved amongst all three species analysed. Furthermore, in nearly all cases, the repeat regions are particularly well conserved during speciation. A small number of the repeat-containing proteins are species-specific. Some of these are already targets for Leishmania researchers attempting to define virulence and pathogenicity factors, while others could provide interesting candidates for vaccine development.
RepSeq provides an essential tool for the study of amino acid repeat-containing proteins. RepSeq compares favourably with other databases such as COPASAAR and ProtRepeatsDB due to its ability to quickly read through proteomes and present a comprehensive analysis which can be tailored to a wide variety of studies. Particular advantages are the ability to differentiate between the different repeat types and the ability to search for both very strict and very weak repeats. Furthermore, the sliding window employed by the algorithm is capable of identifying both perfect and mismatch repeats as long as a small part of the repeat is well conserved. While primarily designed for analysing lower eukaryotic organisms, RepSeq is capable of analysing proteomes from all species. The identification of amino acid repeat-containing proteins provides scientists with a new and complete subset of proteins which can be used in a range of studies from expression profiling to functional characterisation.
This may be of particular importance when studying pathogenicity and virulence factors in protozoan parasites and also has applications to the study of neurodegenerative diseases such as Huntington's chorea.
Availability and requirements
RepSeq is freely accessible on the Internet at http://repseq.gugbe.com. The web-interface comprises many integrated sections for easy browsing and data retrieval and is supported with PERL and PHP scripts which enable formulation of queries against the database. All results are displayed either in tabulated or graphical forms.
We thank Chris Peacock and Matt Berriman (Wellcome Trust Sanger Institute) for help with access to Leishmania and other protozoan sequences via GeneDB and PlasmoDB and Peter Ashton (University of York) for his comments on the manuscript. DD is supported by a postgraduate studentship from the BBSRC; RPJL completed his contribution to this work while an MRes Bioinformatics student.
- Depledge DP, Dalby AR: COPASAAR - A database for proteomic analysis of single amino acid repeats. BMC Bioinformatics 2005, 6: 196. doi:10.1186/1471-2105-6-196
- Kruglyak S, Durrett R, Schug MD, Aquadro CF: Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol Biol Evol 2000, 17: 1210–1219.
- LeProust EM, Pearson CE, Sinden RR, Gao XL: Unexpected formation of parallel duplex in GAA and TTC trinucleotide repeats of Friedreich's ataxia. J Mol Biol 2000, 302: 1063–1080. doi:10.1006/jmbi.2000.4073
- Kashi Y, King D, Soller M: Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 1997, 13: 74–78. doi:10.1016/S0168-9525(97)01008-1
- Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D: A census of protein repeats. J Mol Biol 1999, 293: 151–160. doi:10.1006/jmbi.1999.3136
- Alba MM, Santibanez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol 1999, 16: 1641–1644.
- Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res 2004, 14: 549–554. doi:10.1101/gr.1925704
- Katti MV, Sami-Subbu R, Ranjekar PK, Gupta VS: Amino acid repeat patterns in protein sequences: Their diversity and structural-functional implications. Protein Sci 2000, 9: 1203–1209.
- Alba MM, Santibanez-Koref MF, Hancock JM: Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol 1999, 49: 789–797. doi:10.1007/PL00006601
- Hughes AL: The evolution of amino acid repeat arrays in Plasmodium and other organisms. J Mol Evol 2004, 59(4):528–35. doi:10.1007/s00239-004-2645-4
- World Health Organisation [http://www.who.int/tdr]
- Rosenthal PJ: Cysteine proteases of malaria parasites. Int J Parasitol 2004, 34(13–14):1489–99. doi:10.1016/j.ijpara.2004.10.003
- Ilg T, Montgomery J, Stierhof YD, Handman E: Molecular cloning and characterization of a novel repeat-containing Leishmania major gene, ppg1, that encodes a membrane-associated form of proteophosphoglycan with a putative glycosylphosphatidylinositol anchor. J Biol Chem 1999, 274(44):31410–20. doi:10.1074/jbc.274.44.31410
- Tetteh KK, Cavanagh DR, Corran P, Musonda R, McBride JS, Conway DJ: Extensive antigenic polymorphism within the repeat sequence of the Plasmodium falciparum merozoite surface protein 1 block 2 is incorporated in a minimal polyvalent immunogen. Infect Immun 2005, 73(9):5928–35. doi:10.1128/IAI.73.9.5928-5935.2005
- Clarke JL, Sodeinde O, Mason PJ: A unique insertion in Plasmodium berghei glucose-6-phosphate dehydrogenase-6-phosphogluconolactonase: evolutionary and functional studies. Mol Biochem Parasitol 2003, 127: 1–8. doi:10.1016/S0166-6851(02)00298-0
- RepeatMasker Open-3.0 [http://www.repeatmasker.org]
- Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(1):i351–8. doi:10.1093/bioinformatics/bti1018
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 1999, 27(2):573–580. doi:10.1093/nar/27.2.573
- Kalita MK, Ramasamy G, Duraisamy S, Chauhan VS, Gupta D: ProtRepeatsDB: A database of amino acid repeats in genomes. BMC Bioinformatics 2006, 7(1):336. doi:10.1186/1471-2105-7-336
- Subirana JA, Palau J: Structural features of single amino acid repeats in proteins. FEBS Lett 1999, 448(1):1–3. doi:10.1016/S0014-5793(99)00310-5
- Singh GP, Chandra BR, Bhattacharya A, Akhouri RR, Singh SK, Sharma A: Hyper-expansion of asparagines correlates with an abundance of proteins with prion-like domains in Plasmodium falciparum. Mol Biochem Parasitol 2004, 137(2):307–19. doi:10.1016/j.molbiopara.2004.05.016
- Gardner MJ, Hall N, Fung E, et al.: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 2002, 419: 498–511. doi:10.1038/nature01097
- Pizzi E, Frontali C: Molecular evolution of coding and non-coding regions in Plasmodium. Parassitologia 1999, 41: 89–91.
- Pizzi E, Frontali C: Low-complexity regions in Plasmodium falciparum proteins. Genome Res 2001, 11: 218–229. doi:10.1101/gr.GR-1522R
- Peacock C, Seeger K, Harris D, Murphy L, Ruiz J, Quail MA, et al.: Conservation of genome architecture and content between three Leishmania species causing diverse human disease. Nature Genetics, in press.
- Zhang WW, Mendez S, Ghosh A, Myler P, Ivens A, Clos J, Sacks DL, Matlashewski G: Comparison of the A2 gene locus in Leishmania donovani and Leishmania major and its control over cutaneous infection. J Biol Chem 2003, 278(37):35508–15. doi:10.1074/jbc.M305030200
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.