An analysis of single amino acid repeats as use case for application specific background models
© Łabaj et al; licensee BioMed Central Ltd. 2011
Received: 30 November 2010
Accepted: 19 May 2011
Published: 19 May 2011
Sequence analysis aims to identify biologically relevant signals against a backdrop of functionally meaningless variation. Increasingly, it is recognized that the quality of the background model directly affects the performance of analyses. State-of-the-art approaches rely on classical sequence models that are adapted to the studied dataset. Although performing well in the analysis of globular protein domains, these models break down in regions of stronger compositional bias or low complexity. While these regions are typically filtered, there is increasing anecdotal evidence of functional roles. This motivates an exploration of more complex sequence models and application-specific approaches for the investigation of biased regions.
Traditional Markov-chains and application-specific regression models are compared using the example of predicting runs of single amino acids, a particularly simple class of biased regions. Cross-fold validation experiments reveal that the alternative regression models capture the multi-variate trends well, despite their low dimensionality and in contrast even to higher-order Markov-predictors. We show how the significance of unusual observations can be computed for such empirical models. The power of a dedicated model in the detection of biologically interesting signals is then demonstrated in an analysis identifying the unexpected enrichment of contiguous leucine-repeats in signal-peptides. Considering different reference sets, we show how the question examined actually defines what constitutes the 'background'. Results can thus be highly sensitive to the choice of appropriate model training sets. Conversely, the choice of reference data determines the questions that can be investigated in an analysis.
Using a specific case of studying biased regions as an example, we have demonstrated that the construction of application-specific background models is both necessary and feasible in a challenging sequence analysis situation.
In the post-genomic era, with the abundance of sequencing data, the functional interpretation of these sequences constitutes a key challenge. In particular, the identification of biologically relevant differences or shared patterns against a backdrop of functionally meaningless variation is of interest. In computational sequence analysis, this corresponds to detecting unusual patterns relative to a 'background' model [1–3]. The quality of this model directly affects analysis performance. Tools for homology detection by sequence similarity, like FASTA or BLAST [5–7], or the identification of functional sites by pattern conservation in multiple sequence alignments [8, 9] traditionally assume positional independence of residues. More recent approaches, including Hidden-Markov-Models (HMMs) [10, 11], allow a local dependency structure. In either case, proteins are considered as 'slightly edited random sequences'. Indeed, the complexity of protein sequences reaches 99% of the maximum possible complexity (complete randomness). Amino acids can apparently often be exchanged for alternatives with similar physicochemical properties [14, 15], with the exception of key residues, such as those contributing to an active centre.
Interestingly, the same models are typically used for different types of analysis and for all the sequences studied. Increasingly, however, the power and advantages of adjusting background models so that they explicitly take the nature of the studied data sets and/or the question at hand into account are being recognized. One of the first such applications of background-HMMs tailored for the detection of selected functional domains was introduced by PFAM. Recent statistics assessing sequence similarity can now adjust the expected frequencies for each protein. On the other hand, problem-specific background models have been developed, where, e.g., advanced secondary-structure background models are used for scoring multiple alignments.
While standard background models in general perform well in the analysis of globular protein domains, they are known to break down in regions of stronger compositional bias or low complexity, giving nonspecific false positives. In general, these regions are not conserved, and are therefore assumed to be tolerated as neutral and consequently filtered and excluded [19, 20]. In contrast, accumulating evidence suggests functional roles of biased regions [21–23], as is for example reflected in the involvement of single amino acid repeats (SAARs) in a number of diseases [24–26]. Furthermore, SAARs can also be an important factor in transcriptional regulation [27, 28] and protein interaction networks, affect morphological changes, and may facilitate adaptive processes [31, 32]. As a result, the study of these particularly simple amino acid repeats is attracting increasing interest, despite the difficulty of detecting biased regions of potential biological relevance [22, 23].
With the unexpected abundance of single amino acid repeats, detection of those with potential biological function is particularly challenging. As the observed frequency of SAARs cannot be captured by standard models [33–35], an exploration of more complex approaches becomes necessary. We here introduce and validate an empirical background model constructed specifically for this application, which adapts to the characteristics of the studied data set. Our approach is then demonstrated on a practical use case. We thus demonstrate how alternative, appropriate background models can be constructed successfully also in challenging cases. These methods are directly applicable to more general related questions regarding biased or low-complexity regions, while similar empirical constructions will be helpful in other situations in which the established standard models break down.
Results and Discussion
In a comparison of the three kingdoms, bacteria and archaea typically had far fewer single amino acid repeats (SAARs) than eukaryotes. Consequently, our analysis focused on eukaryotes.
As the standard sequence model of positional independence is, in fact, a Markov model of order zero, higher order Markov models were obvious candidates for an attempt to capture more complex structures in protein sequences. Prediction performance was assessed for SAARs of length five in a comprehensive set of non-fragmented eukaryotic proteins. Statistically, under the standard model, repeats shorter than five are already expected by random chance for typical protein lengths. Incidentally, five residues is also the shortest repeat-length that has been implicated in diseases. Markov models of orders zero to three were considered, including the most complex Markov model that is meaningful for this pattern length (Suppl. Table S1 in Additional File 1).
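To make the order-zero baseline concrete, the following minimal sketch computes the expected number of maximal runs of at least n identical residues under positional independence. It uses the textbook result for independent, identically distributed sequences and is not the SPatt computation used in this study; the function name and the example values are illustrative assumptions.

```python
def expected_runs_iid(length, freq, n):
    """Expected number of maximal runs of >= n copies of one residue.

    Assumes positional independence (Markov order zero) with a fixed
    residue frequency `freq` in a protein of `length` residues.
    """
    if length < n:
        return 0.0
    # A run of >= n starts at the first position with probability freq**n,
    # and at each later start position if the preceding residue differs
    # (1 - freq) and the following n residues all match (freq**n).
    return freq**n + (length - n) * (1.0 - freq) * freq**n


# Hypothetical example: a residue at 10% frequency in a 500-residue protein.
expected = expected_runs_iid(length=500, freq=0.10, n=5)
```

Because each maximal run is counted once at its start, this expectation is directly comparable to counts of non-overlapping tracts of length n or longer.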
Non-local models of repeats are apparently required. The parameters of such models have to be learned from multiple sequences, which will differ in amino acid composition and length. We know that both of these affect the expected number of repeats. Protein sequences, however, do not uniformly occupy the corresponding high dimensional parameter space. Moreover, while repeats in general are unexpectedly frequent, they still constitute relatively rare events, with many proteins not containing repeats at all. For investigations of general functional associations of repeats, one wants a model that has balanced positive and negative errors for realistic sets of proteins. To this end, models can directly be tuned for the correct prediction of repeats observed in the tested protein sets. Optimizing average prediction performance for groups of proteins may moreover allow simpler models of satisfactory quality.
Similarly, when considering, for each amino acid, the root mean square prediction error (RMSE) over all repeat lengths, both application-specific models did much better than the Markov model (Suppl. Fig. S5A). It is noteworthy that even the simple application-specific ZIPol model easily outperformed the more complex Markov model. Examining predictions for individual amino acids and repeat lengths, statistically significant deviations from the model were observed for 81% and 60% of the Markov and ZIPol predictions, respectively. Both models performed worse in the apparently harder prediction of frequent longer repeats. In contrast, no significant deviations were observed for the ZIRVM model at all (Suppl. Fig. S5B). Consequently, protein sets with typical repeat frequencies can be used as reference sets with the application-specific models. See Additional File 1 for Suppl. Figs.
Use case: SAARs in signal peptides
In order to demonstrate the benefits of a dedicated background model, we apply the introduced application specific model in an analysis of the over-representation of single amino acid repeats in signal peptides. The amino-end of the growing polypeptide chain of secreted and many membrane proteins contains a signal peptide, with a central part rich in hydrophobic amino acids. It has been observed that the location of many SAARs shows a positional bias towards the termini of polypeptides [34, 38–40]. Although a possible association of leucine repeats with signal peptides has been suggested earlier, there has been no systematic study of repeat enrichment in signal peptides. In particular, a quantitative test is not possible without an appropriate background model to account for the effects of varying amino acid composition and sequence lengths on the observed repeat counts.
We now consider how application specific background models enable quantitative studies, and how the models can adapt to address specific questions, also adjusting to interim results as an analysis progresses. In the first investigation phase of this use case, a ZIRVM background model was trained on a comprehensive set of proteins without signal peptides and signal anchors. This allows an examination of
Hypothesis (1) Mature sequences of secreted and type I membrane proteins do not differ regarding the distribution of repeats from proteins with no signal sequence.
A model of the repeat distribution in proteins without a signal peptide would thus also capture the observed frequencies of repeats in the mature parts of proteins that had that transient peptide cleaved off. If this hypothesis holds, i.e., a common model can be learned from the comprehensive set of proteins with no signal sequences, then we can formulate the central question of the use case as follows:
Hypothesis (2) Signal peptides show an unusual enrichment of certain repeats (relative to a common background distribution).
The 'Null Hypothesis' expectation for the first test was that the ratio of observed and predicted repeat counts in the mature sequences would not be significantly different from the distribution of this ratio for proteins without signal peptides. Although this was indeed the case for longer repeats of the three strongly hydrophobic amino acids F, I, and V, surprisingly, highly significant differences were observed otherwise, affecting more than 80% of all amino acid/repeat length combinations (Suppl. Tab. S2 in Additional File 1). As a result we can strongly reject hypothesis (1).
The observation that the mature parts of secreted and type I membrane proteins have a distinct repeat distribution is actually interesting in its own right. Furthermore, this result changes the kind of related questions that we can ask and how we can test them, highlighting the need to first explicitly validate the (sometimes implicit) underlying assumptions of hypotheses. In particular, we now know that there is no common background distribution, making the original, simple formulation of the second hypothesis void. Background models fitted on proteins without signal peptides are in general not appropriate for examining a potential enrichment of repeats in signal peptides relative to the mature parts of the protein.
Such an analysis becomes possible after adapting the background model. By training on mature sequences of secreted and type I membrane proteins we can test the use case question as
Hypothesis (2b) Signal peptides show an unusual enrichment of certain repeats relative to the mature parts.
While the figure examines the hydrophobic residues most frequent in signal peptides (A, F, I, L, and V, as indicated by the legend), we here actually report results for all amino acids (Suppl. Tab. S3 of the Additional File 1). It is noteworthy that in a complete survey of all eukaryotic proteins for SAARs of all amino acids there was no trend for enrichment of repeats in signal peptides other than the reported general significant overrepresentation of leucine repeats. This was unexpected, considering that alanine, which is also hydrophobic, is similarly abundant in the core region of signal peptides and that the related hydrophobic amino acids isoleucine and valine are also frequent there. While it is understood that this hydrophobic region is required for interaction with the signal recognition particle, recent work has uncovered a surprising complexity of signal sequences and suggested additional functions such as in the modulation of protein biogenesis. The unusual enrichment of leucine repeats in signal peptides of eukaryotes may thus serve another purpose, although their exact role remains to be investigated by experiments.
Recent developments in sequence analysis have made it increasingly apparent that empirical adjustments or novel application-specific approaches are required to define a suitable baseline for each study. In particular, the identification of biologically relevant differences or shared patterns, against a backdrop of functionally meaningless variation, corresponds to identifying unusual observations relative to an appropriate 'background' model. The quality of the background model directly affects the performance of an analysis. For example, adjustments of statistics for protein amino acid composition have considerably improved existing sequence analysis tools for homology detection. Still, regions of stronger compositional bias are traditionally filtered because they lead to a localised breakdown of the classic background model, making them more difficult to study quantitatively [19, 20].
In this manuscript, we have explored the suitability of more complex sequence models and application-specific approaches for the investigation of biased regions, using specific sequence repeats as example. Interestingly, even the most complex local sequence models could not predict the high frequency of the observed repeat regions. In contrast, application-specific zero-inflated models consistently performed better, despite their much lower dimensionality. In particular, we could show that a zero-inflated RVM model (ZIRVM) captured the multi-variate dependencies well. It was also flexible enough for application in different scenarios, adapting to the reference data and question at hand.
These observations are, moreover, of wider relevance: Biased regions are abundant in most organisms. Although not conserved in general, certain biased regions are increasingly implicated in functional roles [21–23]. For traditional sequence similarity based methods of homology detection in the study of protein structure and function, however, biased regions have to be filtered. In contrast, an observed significant enrichment of biased regions in certain protein classes or selected sequence parts can aid functional comparisons at the feature level. We have here chosen to study single amino acid repeats (SAARs), which form a particularly simple class of biased regions. Nevertheless, their high frequency and their potential functions are not fully understood. SAARs have, however, anecdotally been identified as causing a number of diseases [24–26], constituting an important factor in transcriptional regulation [27, 28] and protein interaction networks, affecting morphological changes, and contributing to the facilitation of adaptive processes [31, 32].
In our use case for the application specific model, we identified general differences in the repeat distribution between proteins without a signal peptide and the mature parts of proteins remaining after cleavage of these transient regions. Considering this evidence of the heterogeneous nature of protein space, we next focused on proteins with signal peptides, adapting the background model accordingly. Relative to the mature sequences, further investigation identified leucine repeats as highly enriched in eukaryotic signal peptides, in contrast to repeats of any other amino acid. This is remarkable because it sets leucine apart from the remaining hydrophobic residues frequently found in signal peptides. As we have shown in a study of smaller scope elsewhere, these repeats are actually better conserved than their surrounding host sequence, suggesting a yet unknown function. We have shown here that this is unique to leucine and that no other amino acids exhibit a similar trend for significant enrichment in signal peptides.
To summarize, in challenging sequence analysis situations the construction and validation of appropriate background models can become necessary. Using a specific case of studying biased regions as an example, we have shown the breakdown of both the standard sequence model of positional independence and higher-order local predictors. We have then illustrated the usefulness of dedicated, application-specific background models in the detection of biologically interesting signals. Besides highlighting the importance of selecting a suitable background model, our use case also shows how the question examined actually defines what constitutes the 'background', and thus determines an appropriate reference training set.
Data and feature extraction
Primary sequence data and analysis results were managed in a customized InterMine data warehouse [44, 45]. Protein sequences were obtained from UniProt release 15.13. To minimize artefacts, proteins annotated in UniProt as fragments were filtered because they could particularly affect signal peptide prediction, where knowing the start of the protein sequence is important. As SAARs are rare in Bacteria and Archaea, we concentrated on eukaryotic proteins, yielding a total of 1.9 million sequences. To facilitate the model fit, we constructed a function from the counts of non-overlapping single amino acid tracts. The cumulative distribution of tract counts, in particular, provides an exact measure of repeat abundance while being easier to model than the tract counts themselves as it has a smaller number of jumps. Consequently, this is the function that is being modelled in this paper, and when we report 'repeat counts', we are referring to these cumulative counts. For example, in the sequence ACDFLLLLLGWSLLV there is one non-overlapping leucine tract of length five and one non-overlapping leucine tract of length two. Constructing the cumulative count distribution for this example sequence, we observe one SAAR of length five (or longer), one SAAR of length four or longer, one of length three or longer, and two SAARs of length two or longer. It is these cumulative counts that we report, often dropping the implied 'or longer' in the manuscript text.
SAARs were identified and counted by custom Perl scripts. Feature statistics were computed in the 'R' statistical environment. Considering that frequencies were clearly residue specific, we model repeat frequencies independently for each amino acid.
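The cumulative counting described above can be sketched in a few lines. This is an illustrative reimplementation in Python, not the custom Perl scripts used in the study; the function names and the length range are assumptions.

```python
from itertools import groupby


def tract_lengths(sequence, residue):
    """Lengths of the maximal (hence non-overlapping) runs of `residue`."""
    return [len(list(run)) for aa, run in groupby(sequence) if aa == residue]


def cumulative_repeat_counts(sequence, residue, min_length=2, max_length=10):
    """For each n in the range, count tracts of length n or longer."""
    lengths = tract_lengths(sequence, residue)
    return {n: sum(1 for tract in lengths if tract >= n)
            for n in range(min_length, max_length + 1)}


# The example sequence from the text: leucine tracts of length five and two.
counts = cumulative_repeat_counts("ACDFLLLLLGWSLLV", "L", max_length=6)
# counts == {2: 2, 3: 1, 4: 1, 5: 1, 6: 0}
```

The result reproduces the worked example: two SAARs of length two or longer, and one each of length three, four, and five or longer.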
Traditional sequence models
The challenge of identifying biologically relevant patterns is central to sequence analysis in bioinformatics and, consequently, a variety of complementary tools exist to detect unusual sequence regions. An established generic and powerful class of models captures local dependencies by way of Markov models, as supported, for example, by RSAT [3, 48], R'MES, QuickScore, and SPatt. Different implementations have their own respective strengths and features, including providing a service online with user support, optimized run-time performance, and different statistical approximation options for the assessment of significance. SPatt, in particular, offers statistics that are suitable for a direct comparison to traditional, non-overlapping pattern counts, and was therefore employed for our analyses. SPatt 2.0 was run with default parameters for Gaussian approximations, testing Markov models of orders zero to three. This includes both the highest meaningful model order for patterns of length five as well as the standard model of positional independence, which is a Markov model of order zero (also see Figure 1).
Parameters were obtained from the respective k-tuple frequencies (k = 1 ... 4) of the comprehensive set of non-fragmented eukaryotic proteins from UniProt (about 10^9 residues). Model complexity thus ranged from 20 parameters for the positional independence model to 20^4 = 160,000 parameters for the third order Markov model.
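The growth in model complexity follows directly from the tuple length: a Markov model of a given order is parameterised here by the frequencies of tuples one residue longer than the order. A minimal sketch of that arithmetic, with an assumed helper name:

```python
ALPHABET_SIZE = 20  # the standard amino acids


def markov_parameter_count(order, alphabet_size=ALPHABET_SIZE):
    """Number of (order + 1)-tuple frequencies parameterising the model."""
    return alphabet_size ** (order + 1)


# Orders zero to three, as tested in this study.
parameter_counts = {order: markov_parameter_count(order) for order in range(4)}
# parameter_counts == {0: 20, 1: 400, 2: 8000, 3: 160000}
```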
Application specific models
where c is the modelled count of non-overlapping repeats of n amino acids of residue type AA, conditional on the predicted existence of any such repeats in a protein of length l with a given amino acid composition f_AA. Coefficients x_i were obtained by standard least-squares fit using the nls function. This zero-inflated polynomial (ZIPol) model had a total of 304 parameters.
For each protein, amino acid residue, and repeat length, the number of SAARs was recorded, giving a total of about 350 million observations. Besides the repeat length, the application specific models also considered protein length and amino acid frequencies as covariates to the SAAR counts.
For each residue type, a separate predictor was trained in a two step process: First, about three million data points were subsampled for logistic regression. Classification thresholds were empirically chosen to strike an equal balance between positive and negative prediction errors for each repeat length.
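One simple way to choose such a threshold empirically is to scan candidate cut-offs and keep the one at which false positives and false negatives are closest in number. The sketch below assumes this reading of 'equal balance'; the function name and toy data are hypothetical, and the logistic regression itself is taken as given.

```python
def balanced_threshold(probabilities, labels):
    """Cut-off at which false positives and false negatives are most equal.

    `probabilities` are predicted probabilities that a protein contains
    a repeat; `labels` are the observed outcomes (True if it does).
    """
    best_threshold, best_gap = None, None
    for threshold in sorted(set(probabilities)):
        false_pos = sum(1 for p, y in zip(probabilities, labels)
                        if p >= threshold and not y)
        false_neg = sum(1 for p, y in zip(probabilities, labels)
                        if p < threshold and y)
        gap = abs(false_pos - false_neg)
        if best_gap is None or gap < best_gap:
            best_threshold, best_gap = threshold, gap
    return best_threshold


# Hypothetical toy data for one residue type and repeat length.
probs = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90]
has_repeat = [False, False, True, False, True, True, True]
threshold = balanced_threshold(probs, has_repeat)  # 0.40: one error of each kind
```

In practice the scan would be repeated separately for each repeat length, as described above.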
For the second modelling step, weighted subsampling was used to correct for the highly skewed nature of the data: Much fewer repeats were observed for higher repeat lengths, or at lower amino acid frequencies. For example, there were 13,675, 4,357, 1,642, 642, 294, and 145 R-repeats of lengths five to ten. Training sets were therefore compiled that balanced the number of observations in bins of defined protein lengths, amino acid composition, and repeat length. The total number of data points subsampled was limited to 2,000 for practical reasons in the RVM training.
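A balanced training set of this kind can be sketched as stratified subsampling with an equal quota per bin. The binning scheme, function name, and toy data below are illustrative assumptions rather than the procedure actually used for the RVM training.

```python
import random
from collections import defaultdict


def balanced_subsample(observations, bin_key, total, seed=0):
    """Draw up to `total` observations, spread evenly over bins."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for obs in observations:
        bins[bin_key(obs)].append(obs)
    if not bins:
        return []
    per_bin = max(1, total // len(bins))  # equal quota for every bin
    sample = []
    for key in sorted(bins):
        members = bins[key]
        # Sparse bins contribute everything they have.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample[:total]


# Hypothetical toy data: (protein length bin, frequency bin, repeat length).
observations = [(0, 0, 5)] * 1000 + [(1, 0, 7)] * 50 + [(1, 1, 10)] * 3
sample = balanced_subsample(observations, bin_key=lambda obs: obs, total=30)
```

Here the abundant bin is cut down to its quota of ten while the rare bin of three observations is kept in full, which counteracts the skew towards short repeats.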
Model quality was verified in 100-fold cross-validation, with two thirds of the comprehensive set of non-fragmented eukaryotic proteins used for training, and the remaining third as independent test set. For each repetition, we computed the ratio of observed and predicted repeat counts as a test score. We assess model fit by comparing the distribution of test scores to the perfect score of 1, calculating an empirical p-value. In plots, the standard deviation of this distribution is shown to indicate the prediction model uncertainty. Capturing average performance for each amino acid, we calculate the root mean square error (RMSE) over all repeat lengths. Calculations were performed on a log-scale for symmetrical weighting of over- and underprediction.
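The scores described above can be illustrated as follows. The log-scale RMSE follows the text directly; the exact form of the empirical p-value is not specified here, so the two-sided tail estimate with a pseudo-count is an assumption, as are the function names and example values. Counts are assumed to be positive.

```python
import math


def log_rmse(observed_counts, predicted_counts):
    """RMSE of log(observed / predicted) over all repeat lengths."""
    errors = [math.log(o / p) for o, p in zip(observed_counts, predicted_counts)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def empirical_p_value(test_scores, perfect=1.0):
    """Two-sided empirical p-value for the perfect score (assumed form).

    Counts how many test scores fall on the less populated side of the
    perfect score, adds a pseudo-count, and doubles the tail fraction.
    """
    below = sum(1 for s in test_scores if s <= perfect)
    above = sum(1 for s in test_scores if s >= perfect)
    return min(1.0, 2.0 * (min(below, above) + 1) / (len(test_scores) + 1))


# Two-fold under- and over-prediction contribute equally on the log scale.
rmse = log_rmse([2, 8], [4, 4])

# Hypothetical scores from 99 repetitions that all exceed the perfect score.
p_value = empirical_p_value([1.2] * 99)
```

Working on the log scale means that a ratio of 0.5 and a ratio of 2 are penalised identically, which is the symmetry referred to above.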
For all residue types except tryptophan both ZIRVM and ZIPol models could be fit and verified in 100-fold cross-validation. Longer tryptophan repeats are extremely rare, which makes an independent validation of the model fit difficult.
This validation demonstrates the ability of a model to capture the relevant sequence properties in a comprehensive set of proteins, thus identifying a relevant sequence model for the data examined. With an appropriately chosen reference set, it can serve as a background model.
To demonstrate the benefits of introducing application specific background models we explored the abundance of repeats in signal peptides.
For a conservative prediction of signal peptides, we combined the neural network and Hidden-Markov-Model predictors of SignalP 3.0 [54, 55] applied using their default settings. Predictions were accepted if both methods agreed on the location of the cleavage site, at least one of the prediction scores met its default threshold, and the worse score reached at least half its threshold. Combining both predictors exploits the high sensitivity of the neural network in the detection of signal peptides, while still allowing a discrimination of signal anchors by the Hidden-Markov-Model component.
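The acceptance rule can be written as a small predicate. The sketch assumes that 'worse' refers to the score that is lower relative to its own threshold; the function name, argument names, and the threshold values in the example are hypothetical and are not SignalP's actual defaults or output format.

```python
def accept_signal_peptide(nn_site, hmm_site, nn_score, hmm_score,
                          nn_threshold, hmm_threshold):
    """Combine neural network and HMM predictions conservatively."""
    # Both predictors must report the same cleavage site.
    if nn_site is None or nn_site != hmm_site:
        return False
    # Express each score relative to its own threshold.
    relative = [nn_score / nn_threshold, hmm_score / hmm_threshold]
    # At least one score meets its threshold; the worse reaches half of it.
    return max(relative) >= 1.0 and min(relative) >= 0.5


# Hypothetical scores and thresholds, for illustration only.
accepted = accept_signal_peptide(22, 22, nn_score=0.6, hmm_score=0.3,
                                 nn_threshold=0.5, hmm_threshold=0.5)
```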
In the first phase of this analysis, a ZIRVM model was trained on the non-fragmented eukaryotic proteins having neither a signal peptide nor a signal anchor. We then studied the ratios of observed vs predicted SAAR counts for the mature sequences of secreted and type I membrane proteins. Deviations from the ZIRVM background model were assessed by empirical p-values computed from the distribution of test scores compiled as before. The reported two-sided test significance of differences for each amino acid is after Holm correction for testing multiple repeat lengths.
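For reference, the Holm step-down correction applied across repeat lengths can be sketched as below. This is a generic implementation of the standard procedure, not the code used in the study, and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical p-values for four repeat lengths of one amino acid.
adjusted = holm_adjust([0.01, 0.04, 0.03, 0.005])
# approximately [0.03, 0.06, 0.06, 0.02]
```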
For a valid test of repeat enrichment in the signal peptide relative to the mature sequence, in the second phase of the analysis, the mature sequences of secreted and type I membrane proteins were used to train the background model. Testing for enrichment, reported significances are for one-sided tests.
Note that a separate assessment of model fit by cross-validation for the alternative training sets gave similar results to our model comparison on the comprehensive set of proteins (data not shown).
The authors gratefully acknowledge support by the Vienna Science and Technology Fund (WWTF), Baxter AG, the Austrian Institute of Technology, and the Austrian Centre of Biopharmaceutical Technology. The authors are grateful to G. Kreil for advice and helpful discussions. PPL also thanks A. Bardet and G. Leparc for helpful discussions.
- Reinert G, Schbath S, S WM: Probabilistic and Statistical Properties of Words: An Overview. J Comp Biol 2000, 7(1–2):1–46. 10.1089/10665270050081360View ArticleGoogle Scholar
- Xie J, Kim NK: Bayesian Models and Markov Chain Monte Carlo Methods for Protein Motifs with the Secondary Characteristics. J Comp Biol 2005, 12(7):952–970. 10.1089/cmb.2005.12.952View ArticleGoogle Scholar
- Thomas-Chollier M, Sand O, Turatsinze JV, Janky R, Defrance M, Vervisch E, Brohée S, van Helden J: RSAT: regulatory sequence analysis tools. Nucleic Acids Research 2008, 36(suppl 2):W119-W127.PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, J LD: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Lopez R, Silventoinen V, Robinson S, Kibria A, Gish W: WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res 2003, 31(13):3795–8. 10.1093/nar/gkg573PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–80. 10.1093/nar/22.22.4673PubMed CentralView ArticlePubMedGoogle Scholar
- Notredame C, Higgins DG, J H: T-Coffee: A novel method for fast and accurate multiple sequence alignment. Proc Int Conf Intell Syst Mol Biol 2000, 302: 205–17.Google Scholar
- Birney E, Thompson JD, Gibson TJ: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res 1996, 24(14):2730–9. 10.1093/nar/24.14.2730PubMed CentralView ArticlePubMedGoogle Scholar
- Sonnhammer EL, Eddy SR, Durbin R: PFAM: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28(3):405–20. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-LView ArticlePubMedGoogle Scholar
- Pande VS, Grosberg AY, Tanaka T: Nonrandomness in protein sequences: evidence for a physically driven stage of evolution. Proceedings of the National Academy of Sciences of the United States of America 1994, 91(26):12972–12975. 10.1073/pnas.91.26.12972PubMed CentralView ArticlePubMedGoogle Scholar
- Weiss O, Jimenez-Montano MA, Herzel H: Information Content of Protein Sequences. Journal of Theoretical Biology 2000, 206(3):379–386. 10.1006/jtbi.2000.2138View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915PubMed CentralView ArticlePubMedGoogle Scholar
- Chechetkin VR: Block structure and stability of the genetic code. Journal of Theoretical Biology 2003, 222(2):177–188. 10.1016/S0022-5193(03)00025-0View ArticlePubMedGoogle Scholar
- Ptitsyn OB, Volkenstein MV: Protein structure and neutral theory of evolution. J Biomol Struct Dyn 1986, (4):137–56.View ArticlePubMedGoogle Scholar
- Altschul SF, Wootton JC, Gertz EM, Agarwala R, Morgulis A, Schäffer AA, Yu YK: Protein database searches using compositionally adjusted substitution matrices. FEBS J 2005, 272(20):5101–9. 10.1111/j.1742-4658.2005.04945.xPubMed CentralView ArticlePubMedGoogle Scholar
- Sadreyev RI, Grishin NV: Accurate statistical model of comparison between multiple sequence alignments. Nucl Acids Res 2008, 36(7):2240–2248. 10.1093/nar/gkn065PubMed CentralView ArticlePubMedGoogle Scholar
- Wootton J, Federhen S: Statistics of local complexity in amino-acid-sequences and sequence databas. Computers & chemistry 1993, 17(2):149–163.View ArticleGoogle Scholar
- Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA: CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics 2000, 16(10):915–22. 10.1093/bioinformatics/16.10.915
- Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D: A census of protein repeats. J Mol Biol 1999, 293: 151–60. 10.1006/jmbi.1999.3136
- Kreil DP, Ouzounis CA: Comparison of sequence masking algorithms and the detection of biased protein sequence regions. Bioinformatics 2003, 19(13):1672–81. 10.1093/bioinformatics/btg212
- Kuznetsov IB, Hwang S: A novel sensitive method for the detection of user-defined compositional bias in biological sequences. Bioinformatics 2006, 22(9):1055–1063. 10.1093/bioinformatics/btl049
- Delot E, King LM, Briggs MD, Wilcox WR, Cohn DH: Trinucleotide expansion mutations in the cartilage oligomeric matrix protein (COMP) gene. Hum Mol Genet 1999, 8: 123–8. 10.1093/hmg/8.1.123
- Siwach P, Ganesh S: Tandem repeats in human disorders: mechanisms and evolution. Front Biosci 2008, 13: 4467–84.
- Hands S, Sinadinos C, Wyttenbach A: Polyglutamine gene function and dysfunction in the ageing brain. Biochim Biophys Acta 2008, 1779(8):507–21.
- Gerber H, Seipel K, Georgiev O, Hofferer M, Hug M, Rusconi S, Schaffner W: Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science 1994, 263(5148):808–811. 10.1126/science.8303297
- Brown L, Paraso M, Arkell R, Brown S: In vitro analysis of partial loss-of-function ZIC2 mutations in holoprosencephaly: alanine tract expansion modulates DNA binding and transactivation. Hum Mol Genet 2005, 14(3):411–420.
- Hancock JM, Simon M: Simple sequence repeats in proteins and their significance for network evolution. Gene 2005, 345: 113–118. 10.1016/j.gene.2004.11.023
- Fondon JW, Garner HR: Molecular origins of rapid and continuous morphological evolution. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(52):18058–18063. 10.1073/pnas.0408118101
- Caburet S, Cocquet J, Vaiman D, Veitia RA: Coding repeats and evolutionary 'agility'. BioEssays 2005, 27(6):581–587. 10.1002/bies.20248
- Kashi Y, King DG: Simple sequence repeats as advantageous mutators in evolution. Trends Genet 2006, 22(5):253–9. 10.1016/j.tig.2006.03.005
- Mar Alba M, Santibanez-Koref MF, Hancock JM: Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol 1999, 49(6):789–97. 10.1007/PL00006601
- Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ: Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA 2002, 99: 333–8. 10.1073/pnas.012608599
- Depledge DP, Dalby AR: COPASAAR - a database for proteomic analysis of single amino acid repeats. BMC Bioinformatics 2005, 6: 196. 10.1186/1471-2105-6-196
- Łabaj PP, Leparc GG, Bardet AF, Kreil G, Kreil DP: Single amino acid repeats in signal peptides. FEBS Journal 2010, 277(15):3147–3157. 10.1111/j.1742-4658.2010.07720.x
- Karlin S: Statistical significance of sequence patterns in proteins. Curr Opin Struct Biol 1995, 5(3):360–71. 10.1016/0959-440X(95)80098-0
- Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J, Tang K: Distributional gradient of amino acid repeats in plant proteins. Genome 2006, 49(8):900–5. 10.1139/G06-054
- Huntley MA, Clark AG: Evolutionary analysis of amino acid repeats across the genomes of 12 Drosophila species. Mol Biol Evol 2007, 24(12):2598–609. 10.1093/molbev/msm129
- Siwach P, Sengupta S, Parihar R, Ganesh S: Spatial positions of homopolymeric repeats in the human proteome and their effect on cellular toxicity. Biochem Biophys Res Commun 2009, 380(2):382–6. 10.1016/j.bbrc.2009.01.101
- Hegde RS, Bernstein HD: The surprising complexity of signal sequences. Trends Biochem Sci 2006, 31(10):563–71. 10.1016/j.tibs.2006.08.004
- Gouridis G, Karamanou S, Gelis I, Kalodimos CG, Economou A: Signal peptides are allosteric activators of the protein translocase. Nature 2009, 462: 363–367. 10.1038/nature08559
- Koestler T, von Haeseler A, Ebersberger I: FACT: Functional annotation transfer between proteins with similar feature architectures. BMC Bioinformatics 2010, 11: 417. 10.1186/1471-2105-11-417
- Lyne R, Smith R, Rutherford K, Wakeling M, Varley A, Guillier F, Janssens H, Ji W, Mclaren P, North P, Rana D, Riley T, Sullivan J, Watkins X, Woodbridge M, Lilley K, Russell S, Ashburner M, Mizuguchi K, Micklem G: FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biology 2007, 8(7):R129. 10.1186/gb-2007-8-7-r129
- InterMine home page [http://www.intermine.org/]
- The UniProt Consortium: The universal protein resource (UniProt). Nucleic Acids Res 2008, 36(Database issue):D190–195.
- The R Project home page [http://www.r-project.org/]
- van Helden J, André B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology 1998, 281(5):827–842. 10.1006/jmbi.1998.1947
- Hoebeke M, Schbath S: R'MES: Finding Exceptional Motifs, version 3. User guide. 2006. [http://migale.jouy.inra.fr/outils/mig/rmes]
- The QuickScore home page [http://algo.inria.fr/dolley/QuickScore/]
- Nuel G: Numerical Solutions for Patterns Statistics on Markov Chains. Statistical Applications in Genetics and Molecular Biology 2006, 5: 26.
- Tipping M: The Relevance Vector Machine. Advances in neural information processing systems 2000, 12: 652–658.
- Tipping M: Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 2001, 1: 211–244.
- Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 1997, 10: 1–6. 10.1093/protein/10.1.1
- Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 2004, 340(4):783–95. 10.1016/j.jmb.2004.05.028
- Nielsen H, Krogh A: Prediction of signal peptides and signal anchors by a hidden Markov model. Proc Int Conf Intell Syst Mol Biol 1998, 6: 122–30.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.