FISH Amyloid – a new method for finding amyloidogenic segments in proteins based on site specific co-occurence of aminoacids
© Gasior and Kotulska; licensee BioMed Central Ltd. 2014
Received: 27 March 2013
Accepted: 3 February 2014
Published: 24 February 2014
Amyloids are proteins capable of forming fibrils whose intramolecular contact sites assume a densely packed zipper pattern. Their oligomers can underlie serious diseases, e.g. Alzheimer’s and Parkinson’s diseases. Recent studies show that short segments of amino acids can be responsible for the amyloidogenic properties of a protein. A few hundred such peptides have been found experimentally, but experimental testing of all candidates is currently not feasible. Here we propose an original machine learning method for classification of amino acid sequences, based on discovering a segment with a discriminative pattern of site-specific co-occurrences between sequence elements. The pattern is based on the positions of residues with correlated occurrence over a sliding window of a specified length. The algorithm first recognizes the most relevant training segment in each positive training instance. The classification is then based on maximal distances between the co-occurrence matrix of the relevant segments in positive training sequences and the matrix from negative training segments. The method was applied to study amino acid sequences with regard to their amyloidogenic properties.
Our method was first trained on available datasets of hexapeptides with amyloidogenic classification, using 5- or 6-residue sliding windows. Depending on the choice of training and testing datasets, the area under the ROC curve reached up to 0.80 for experimental datasets and 0.95 for datasets generated computationally with the 3D profile method. Importantly, the results on 5-residue segments were not significantly worse, although the classification required the algorithm to first recognize the most relevant training segments. Long sequences, such as the sup35 prion and a few other amyloid proteins, were used to test the method and gave encouraging results. Our web tool FISH Amyloid, trained on all available experimental data 4-10 residues long, offers prediction of amyloidogenic segments in protein sequences.
We proposed a new, original classification method which recognizes co-occurrence patterns in sequences. The method reveals the characteristic classification pattern of the data and finds the segments where its score is the strongest, also in long training sequences. Applied to the problem of recognizing amyloidogenic segments, it showed good potential for classification problems in bioinformatics.
Keywords: Machine learning, Amyloid, Intramolecular contact sites, Hot spot
Amyloids are proteins which aggregate into oligomers and then fibrils that accumulate in cells. Their intramolecular contact sites form a characteristic zipper pattern. Although a few functional amyloids are known, the majority of proteins lose their physiological function when they aggregate and they become cytotoxic for cells [1–5]. The exact reason for this cytotoxicity is still unclear, but many studies show that intermediate oligomeric structures are the main culprits. The number of amyloidogenic diseases following misfolding of a protein into the amyloid is constantly increasing and includes Alzheimer’s disease (amyloid-β, tau), Parkinson’s disease (α-synuclein), type 2 diabetes (amylin), Creutzfeldt-Jakob’s disease (prion protein), Huntington’s disease (huntingtin), amyotrophic lateral sclerosis (SOD1), and many others (for a review see e.g. ). They affect a constantly increasing number of people, especially in well-developed countries. Recognition of the factors responsible for protein misfolding can contribute to a better understanding of its mechanisms and to potential drug design. Recent studies indicate that there may be certain protein sequence determinants responsible for the affinity to form amyloids. These may be short segments of amino acids, which are called hot spots [7, 8]. Those fragments are harmless only when they are buried inside a protein. The fragments responsible for the amyloidogenicity of the whole protein are believed to be 4-10 residues long, and it is often assumed that 6-residue fragments with amyloidogenic properties are typical “hot spots” . Amyloidogenic fragments can be recognized computationally, for example with physicochemical methods such as Tango , ZipperDB [10, 11], FoldAmyloid [12, 13], Pasta [14, 15], AggreScan , PreAmyl , Zyggregator , CamFold , NetCSSP , AmyloidMutant [21, 22], BetaScan , and the consensus method AmylPred . Statistical methods have also been employed in the classification.
In our previous work we used classical machine learning methods  implemented in WEKA . Other methods include Waltz , which uses Position Specific Scoring Matrices (PSSM), and a Bayesian classifier and weighted decision tree applied to long sequences of bacterial antibodies . Only a few hundred amyloid peptides have been found experimentally, so the available dataset is very limited. Computational methods also generate databases of potential amyloids, such as 3D profile [9, 29], a physicochemical method that generated the largest computational dataset, ZipperDB .
In this manuscript we propose a new machine learning method for the identification of amyloidogenic segments in amino acid sequences, based on the presence of a segment with the highest score for co-occurrence of residue pairs. By applying a sliding window, the algorithm automatically recognizes the most relevant training segments in positive training instances.
Machine learning method
Our classification method is based on the assumption that amino acid sequences (such as amyloidogenic fragments) exhibit a certain well-defined pattern of residue distribution, which is position specific and, most importantly, involves co-occurrence of two amino acids at different positions. For example, the pattern would not only include a high chance of valine occurring at position 2, but the valine would also entail isoleucine at position 4. The pattern should be contained in one segment and limited in length. A pattern in the negative dataset is not important, as long as it is different from that of the positive set. However, it may happen that the discriminative pattern is more pronounced in the negative set; we also test our method in this regard. To investigate the co-occurrence pattern, a relevant window length needs to be specified. This window is equivalent to the minimal fragment of a protein sequence displaying the classification property.
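The site-specific co-occurrence described above can be illustrated with a minimal sketch. It assumes that a pattern is stored as the frequency of each residue pair at each pair of window positions; the function names and the toy peptides are hypothetical and are not taken from the datasets used in this work.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(segments, window):
    """Count site-specific residue pairs. A key (i, a, j, b) means residue a
    at position i occurs together with residue b at position j (0-based, i < j)."""
    counts = Counter()
    for seg in segments:
        if len(seg) != window:
            raise ValueError("every segment must have the window length")
        for i, j in combinations(range(window), 2):
            counts[(i, seg[i], j, seg[j])] += 1
    return counts

def cooccurrence_frequencies(segments, window):
    """Normalize the counts by the number of segments."""
    counts = cooccurrence_counts(segments, window)
    return {key: c / len(segments) for key, c in counts.items()}

# Toy peptides invented for illustration only.
toy_positive = ["AVAIA", "GVGIG", "SVTIS", "AVALA"]
freq = cooccurrence_frequencies(toy_positive, window=5)

# Valine at position 2 together with isoleucine at position 4 (1-based).
print(freq[(1, "V", 3, "I")])  # 0.75
```

In this toy set three of the four peptides carry valine at position 2 together with isoleucine at position 4, so the pair frequency is 0.75, whereas valine with leucine at position 4 occurs only once.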
The most relevant segments in positive training sequences, carrying the classification pattern, are found in an iterative procedure that selects those which are most distant from the averaged pattern of negative segments, as well as closest to the segments selected from other positive sequences. The distance, w, between positive and negative segments is represented by the sum of the elements of the array MatrixYES divided by MatrixNO. The procedure, which results in the choice of optimal segments in the set of positive training fragments, gives the maximum distance value, w_d, which is used in the classification procedure as a threshold value.
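A simplified sketch of such a selection procedure is given below. It is not the training algorithm itself: it assumes that the distance is the sum of element-wise ratios MatrixYES / MatrixNO, that a small floor (`PSEUDO`) replaces zero entries of MatrixNO, and that each positive sequence is scored against the windows currently chosen in the other positive sequences. All names and toy sequences are hypothetical, and every sequence is assumed to be at least k residues long.

```python
from collections import Counter
from itertools import combinations

PSEUDO = 0.01  # assumed floor for MatrixNO entries, avoids division by zero

def pair_keys(segment):
    """Site-specific residue pairs (i, a, j, b) with i < j in one segment."""
    return [(i, segment[i], j, segment[j])
            for i, j in combinations(range(len(segment)), 2)]

def pair_frequencies(segments):
    """Frequency of each site-specific pair over equal-length segments."""
    counts = Counter(key for seg in segments for key in pair_keys(seg))
    return {key: c / len(segments) for key, c in counts.items()}

def windows(sequence, k):
    """All windows of length k, ordered by start position."""
    return [sequence[s:s + k] for s in range(len(sequence) - k + 1)]

def ratio_sum(keys_and_values, matrix_no):
    """Sum of element-wise ratios MatrixYES / MatrixNO."""
    return sum(v / max(matrix_no.get(key, 0.0), PSEUDO)
               for key, v in keys_and_values)

def select_relevant_segments(positives, negatives, k, max_rounds=10):
    """Choose one window per positive sequence; return (chosen, w_d)."""
    matrix_no = pair_frequencies([w for seq in negatives for w in windows(seq, k)])
    chosen = [windows(seq, k)[0] for seq in positives]  # arbitrary starting choice
    for _ in range(max_rounds):
        previous = list(chosen)
        for idx, seq in enumerate(positives):
            # MatrixYES built from the windows chosen in the OTHER positives.
            matrix_yes = pair_frequencies(chosen[:idx] + chosen[idx + 1:])
            chosen[idx] = max(
                windows(seq, k),
                key=lambda w: ratio_sum(
                    [(key, matrix_yes.get(key, 0.0)) for key in pair_keys(w)],
                    matrix_no))
        if chosen == previous:  # the selection no longer changes
            break
    w_d = ratio_sum(pair_frequencies(chosen).items(), matrix_no)
    return chosen, w_d

# Toy sequences invented for illustration only.
positives = ["AVAIAG", "GGVGIG", "SVTISA"]
negatives = ["GGGGGG", "AAAAAA", "SSSSSS"]
chosen, w_d = select_relevant_segments(positives, negatives, k=5)
print(chosen)  # ['AVAIA', 'GVGIG', 'SVTIS']
```

In the toy data the second sequence starts from the window GGVGI, but the shared V-I pair at positions 2 and 4 moves the choice to GVGIG after one round.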
In the classification of test sequences, a distance w_l is defined as an a priori assumed fraction of w_d (between 0 and 1), providing the threshold value used in the classification test of sequences. The detailed training algorithm of the method is presented in Figure 2. In the classification of the test set (or a set of unclassified sequences), the greatest actual distance ratio, w_s, between MatrixYES of the tested sequence and MatrixNO is calculated. If w_s assumes a value greater than the selected value of w_l, then the window is classified as positive (Figure 2).
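The thresholding step can be sketched as follows, under the assumption that a window is scored by summing the ratios MatrixYES / MatrixNO over the residue pairs it contains and that w_l is obtained as a fraction (`ratio`) of w_d. The function names, the floor `pseudo` and the toy matrices are hypothetical.

```python
from itertools import combinations

def window_score(window, matrix_yes, matrix_no, pseudo=0.01):
    """Score one window: pairs frequent in MatrixYES and rare in MatrixNO
    contribute most. `pseudo` is an assumed floor for MatrixNO entries."""
    total = 0.0
    for i, j in combinations(range(len(window)), 2):
        key = (i, window[i], j, window[j])
        total += matrix_yes.get(key, 0.0) / max(matrix_no.get(key, 0.0), pseudo)
    return total

def classify(sequence, matrix_yes, matrix_no, w_d, ratio, k):
    """Return (is_positive, w_s), where w_s is the best window score in the
    sequence and the threshold is w_l = ratio * w_d with 0 <= ratio <= 1."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    w_l = ratio * w_d
    scores = [window_score(sequence[s:s + k], matrix_yes, matrix_no)
              for s in range(len(sequence) - k + 1)]
    w_s = max(scores, default=0.0)
    return w_s > w_l, w_s

# Toy matrices invented for illustration only.
matrix_yes = {(1, "V", 3, "I"): 0.75, (0, "A", 4, "A"): 0.25}
matrix_no = {(1, "V", 3, "I"): 0.125, (0, "A", 4, "A"): 0.25}

print(classify("GGAVAIAGG", matrix_yes, matrix_no, w_d=10.0, ratio=0.5, k=5))
# (True, 7.0): the window AVAIA scores 0.75/0.125 + 0.25/0.25 = 7.0 > 5.0
```

Lowering `ratio` makes the classifier more permissive, which is how different operating points are obtained from one trained model.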
where TP, FP, FN and TN represent the numbers of true positives, false positives, false negatives and true negatives, respectively.
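Assuming that the measures in question are the standard sensitivity (Sn) and specificity (Sp) reported in the results, they can be computed from the four counts as in the following sketch. The function name and the example counts are hypothetical.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Standard definitions: Sn = TP / (TP + FN), Sp = TN / (TN + FP)."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return sn, sp

# Invented counts, for illustration only.
sn, sp = sensitivity_specificity(tp=30, fp=15, fn=10, tn=45)
print(f"Sn = {sn:.0%}, Sp = {sp:.0%}")  # Sn = 75%, Sp = 75%
```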
Our classification method was first trained and validated on 3 experimental datasets of short peptide fragments, specifying their amyloid or β-aggregation propensities: AmylHex  with 6-residue sequences including 67 positive and 91 negative instances, Waltz  with 6-residue sequences including 49 positive and 71 negative instances, and Tango (TG, tested for aggregation)  with fragments of variable length (4-43 residues) including 71 positive and 172 negative instances, downloaded from the FoldAmyloid database . The choice of experimental datasets is very limited since very few data are available, and our choice included all of them. Unfortunately, all these datasets are biased, which can influence the results of machine learning.
To compare the performance of our classification method with classical machine learning methods, we used another dataset of 4481 hexapeptides, which was computationally obtained with the 3D profile method . The 3D profile method was originally proposed in  and applied in ZipperDB to generate the database of amyloidogenic hexapeptides. This computational dataset was generated with a faster version of the 3D profile algorithm . It is not as biased as the experimental datasets and it was previously used in tests with a number of classical machine learning methods .
Then, our classification method, trained with a 5-residue sliding window on the set of short peptides from the Waltz dataset , was tested on full-length amyloidogenic proteins: amyloid-β and tau (Alzheimer’s disease), α-synuclein (Parkinson’s disease), amylin (type 2 diabetes), and prion protein sup-35 (Creutzfeldt-Jakob’s disease). The Waltz dataset was selected for the training since it did not contain fragments of the tested proteins. In these proteins, the method indicated amyloidogenic regions, classified with various values of the classification threshold w_l, which were compared with experimentally validated data.
Finally, we merged all the experimental datasets. The full dataset included all experimentally tested peptides from different groups, whose length did not exceed 10 aminoacids, and involved also fragments from prion sup35. The full dataset consisted of 436 (146 positive and 290 negative) fragments (see Additional file 1). This dataset was first used in 4-fold cross-validation of our method, and then to train our web service FISH Amyloid, which is now freely available for classification.
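A 4-fold cross-validation split of a dataset with these class sizes can be sketched as follows. The sketch assumes a stratified split, so that each fold holds roughly a quarter of the positive and of the negative fragments; whether the folds in this work were stratified is not stated, and the function name and seed are hypothetical.

```python
import random

def stratified_folds(labels, n_folds=4, seed=0):
    """Assign each index to a fold so that both classes are spread evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

labels = [1] * 146 + [0] * 290  # class sizes of the merged dataset
folds = stratified_folds(labels)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Train on train_idx (about 75% of the data), evaluate on test_idx.
    print(len(train_idx), len(test_idx))
```

Each fragment is held out exactly once, so the four test results together cover the whole dataset.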
Results and discussion
Our method was trained on hexapeptides from different datasets, using two sliding window lengths: 5 and 6 (note that training on the 6-residue fragments with a window of length 6 eliminates the stage of finding the most relevant pattern-carrying windows in the training and testing sequences). The results, obtained with different values of the classification threshold w_l, were represented as ROC curves.
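How a ROC curve and its area follow from sweeping the threshold can be sketched as below. The sketch assumes that each test sequence already has a best-window score, that the threshold is swept as a fraction of w_d in equal steps, and that both classes are present; the names and the toy scores are hypothetical.

```python
def roc_points(scores, labels, w_d, steps=100):
    """Sweep the threshold fraction from 0 to 1 and collect (1 - Sp, Sn) points."""
    points = set()
    for step in range(steps + 1):
        w_l = (step / steps) * w_d
        tp = sum(s > w_l and y == 1 for s, y in zip(scores, labels))
        fn = sum(s <= w_l and y == 1 for s, y in zip(scores, labels))
        fp = sum(s > w_l and y == 0 for s, y in zip(scores, labels))
        tn = sum(s <= w_l and y == 0 for s, y in zip(scores, labels))
        points.add((fp / (fp + tn), tp / (tp + fn)))
    # Anchor the curve at both corners so that the area is well defined.
    points.update({(0.0, 0.0), (1.0, 1.0)})
    return sorted(points)

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Toy scores invented for illustration only: four positives, four negatives.
scores = [9.0, 7.0, 6.0, 2.0, 5.0, 3.0, 1.0, 0.5]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(auc(roc_points(scores, labels, w_d=10.0)))  # 0.875
```

The value 0.875 equals the fraction of positive-negative pairs in which the positive sequence scores higher (14 of 16), which is the usual interpretation of the area under the ROC curve.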
Testing the quality of our new classification method and comparing it with other methods is only possible when working with the same datasets as those state-of-the-art methods. Therefore, to compare the performance of our method with classical machine learning methods, we first ran tests on the less biased computational dataset generated with the physicochemical 3D profile method . The result can be used for comparison with other machine learning methods since the same dataset was previously classified with several classical machine learning methods implemented in WEKA . In this case, the AUC ROC obtained with our method was 0.95 for a 6-residue window and 0.87 for a 5-residue sliding window. The top results of the state-of-the-art methods from WEKA, working on hexapeptides, were very similar. For example, the neural network (multilayer perceptron, MLP) and the alternating decision tree, which showed the highest performance for this dataset among over 100 machine learning methods available in WEKA, obtained AUC ROC = 0.96 . This is very similar to the result of the method presented here for the 6-residue window. Other classical methods implemented in WEKA obtained lower quality. Moreover, the result of the new method was not significantly worse when it worked on a sliding window of length 5, although this first required the algorithm to find the most relevant windows in the training and testing sequences. Hence, the classification quality of the new method was very close to the top results obtained with classical machine learning methods on the same dataset. In addition, none of the classical methods was capable of finding the most relevant training window, which is an asset of our new method.
Training set (horizontal)
Tested set (vertical)
sliding window of length 5
0.75 | 0.62
0.82 | 0.21
0.77 | 0.42
0.62 | 0.60
0.69 | 0.60
0.59 | 0.51
0.69 | 0.60
0.84 | 0.31
0.81 | 0.47
sliding window of length 6
0.76 | 0.57
0.77 | 0.30
0.78 | 0.44
0.54 | 0.45
0.69 | 0.61
0.61 | 0.43
0.48 | 0.57
0.82 | 0.25
0.79 | 0.47
Tests on prion sup35 fragments
With the optimal parameters, FoldAmyloid was reported to obtain Sn = 75% and Sp = 74% for the scale of the expected packing density, Sn = 69% and Sp = 78% for the donor scale, and Sn = 77% and Sp = 74% for the acceptor scale . Our method, trained on the same dataset as FoldAmyloid with a 5-residue sliding window, obtained AUC ROC = 0.82; at the diagonal point of the ROC curve, Sn = 75% and Sp = 75%.
The method was capable of finding most of the segments that have already been experimentally confirmed. Other fragments were also indicated as potential hot spots; however, most of them have not been experimentally tested.
Based on this extended experimental dataset, we trained our method for finding amyloidogenic windows in amino acid sequences, and made it available as a web tool called FISH Amyloid (Hot Spot Is Found in Amyloid - reversed), which is currently available at http://www.comprec.pwr.wroc.pl/COMPREC_home_page.html. The service uses 5-residue sliding windows, both for training and classification, and displays the score value at the beginning of each window. Those residues that belong to at least one positive window are classified as positive and denoted by “1”. The list of fragments that constituted the extended dataset is also available at the service site.
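The residue-level labelling rule can be sketched as follows. The sketch assumes that a score is available for every window start and that residues outside all positive windows are denoted by “0”; the function name, the toy scores and the threshold are hypothetical and do not reproduce the output format of the web service.

```python
def residue_labels(sequence, window_scores, threshold, k=5):
    """Mark with '1' each residue covered by at least one positive window.
    window_scores[s] is the score of the window starting at position s."""
    marks = ["0"] * len(sequence)
    for start, score in enumerate(window_scores):
        if score > threshold:
            for pos in range(start, min(start + k, len(sequence))):
                marks[pos] = "1"
    return "".join(marks)

# Toy sequence and scores invented for illustration only.
sequence = "GGAVAIAGGG"
window_scores = [0.0, 0.0, 7.0, 0.0, 0.0, 0.0]  # one score per window start
print(residue_labels(sequence, window_scores, threshold=5.0))  # 0011111000
```

Overlapping positive windows merge into one longer positive region, because a residue needs only one positive window to be marked.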
The classification on the extended dataset was also compared with the performance of the Waltz and FoldAmyloid (packing density) methods. Using 75% of the data in each of 4 tests, FoldAmyloid showed Sn = 58% and Sp = 75%, Waltz obtained Sn = 71% and Sp = 83%, and FISH Amyloid in the same 4 tests achieved Sn = 76% and Sp = 76% (see Additional file 1).
Co-localized pairs of aminoacids
All + sup35
S-I, Y-Y, L-Y, I-I
V-I, T-I, V-Y, F-I
R-R, T-I, V-I
V-I, L-I, N-Y
I-I, I-F, Y-R, Y-I
We proposed an original classification method which recognizes a classification pattern in sequences, taking into account the position-dependent frequency of amino acids and the site-specific co-occurrence between their pairs. The method reveals the characteristic co-occurrence pattern of the data. Moreover, it is able to find the segments with the highest-scoring co-occurrence pattern, also in long training sequences, and use them for the training. Our method was applied to the problem of recognizing amyloidogenic segments and showed good potential for their classification. We obtained good results for sliding windows of lengths 6 and 5. The web tool FISH Amyloid, which uses this method trained with a 5-residue sliding window on the full experimental dataset of amyloid fragments 4-10 amino acids long, is currently available at our server: http://www.comprec.pwr.wroc.pl/COMPREC_home_page.html (it will be moved to http://www.comprec.edu.pwr.wroc.pl/COMPREC_home_page.html). FISH Amyloid offers prediction of amyloidogenic segments in protein sequences.
This work was supported in part by the grant N N519 643540 from the National Science Center of Poland. Access to computers and clusters through the Wroclaw Centre for Networking and Supercomputing at Wroclaw University of Technology is gratefully acknowledged.
- Jaroniec CP, MacPhee CE, Bajaj VS, McMahon MT, Dobson CM, Griffin RG: High-resolution molecular structure of a peptide in an amyloid fibril determined by magic angle spinning NMR spectroscopy. Proc Natl Acad Sci U S A. 2004, 101: 711-716. 10.1073/pnas.0304849101.
- Makin OS, Atkins E, Sikorski P, Johansson J, Serpell LC: Molecular basis for amyloid fibril formation and stability. Proc Natl Acad Sci U S A. 2005, 102: 315-320. 10.1073/pnas.0406847102.
- Nelson R, Sawaya MR, Balbirnie M, Madsen AO, Riekel C, Grothe R, Eisenberg D: Structure of the cross-beta spine of amyloid-like fibrils. Nature. 2005, 435: 773-778. 10.1038/nature03680.
- Sawaya MR, Sambashivan S, Nelson R, Ivanova MI, Sievers SA, Apostol MI, Thompson MJ, Balbirnie M, Wiltzius JJW, McFarlane HT, Madsen AØ, Riekel C, Eisenberg D: Atomic structures of amyloid cross β-spines reveal varied steric zippers. Nature. 2007, 447: 453-457. 10.1038/nature05695.
- Thompson MJ, Balbirnie M, Wiltzius JJW, McFarlane HT, Madsen AØ, Riekel C, Eisenberg D: Atomic structures of amyloid cross β-spines reveal varied steric zippers. Nature. 2007, 447: 453-457. 10.1038/nature05695.
- Uversky VN, Fink AL: Conformational constraints for amyloid fibrillation: the importance of being unfolded. Biochim Biophys Acta. 2004, 1698: 131-153. 10.1016/j.bbapap.2003.12.008.
- Rousseau F, Schymkowitz J, Serrano L: Protein aggregation and amyloidosis: confusion of the kinds?. Curr Opin Struct Biol. 2006, 16: 118-126. 10.1016/j.sbi.2006.01.011.
- Serrano L, de la Paz Lopez M: Sequence determinants of amyloid fibril formation. Proc Natl Acad Sci U S A. 2004, 101: 87-92. 10.1073/pnas.2634884100.
- Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L: Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol. 2004, 22: 1302-1306. 10.1038/nbt1012.
- Thompson MJ, Sievers SA, Karanicolas J, Ivanova MI, Baker D, Eisenberg D: The 3D profile method for identifying fibril-forming segments of proteins. Proc Natl Acad Sci U S A. 2006, 103: 4074-4078. 10.1073/pnas.0511295103.
- Goldschmidt L, Tenga PK, Riek R, Eisenberg D: Identifying the amylome, proteins capable of forming amyloid-like fibrils. PNAS. 2010, 107: 3487-3492. 10.1073/pnas.0915166107.
- Galzitskaya OV, Garbuzynskiy SO, Lobanov MY: Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput Biol. 2006, 2: e177. 10.1371/journal.pcbi.0020177.
- Garbuzynskiy SO, Lobanov MY, Galzitskaya OV: FoldAmyloid: a method of prediction of amyloidogenic regions from protein sequence. Bioinformatics. 2010, 26: 326-332. 10.1093/bioinformatics/btp691.
- Trovato A, Chiti F, Maritan A, Seno F: Insight into the structure of amyloid fibrils from the analysis of globular proteins. PLoS Comput Biol. 2006, 2: e170. 10.1371/journal.pcbi.0020170.
- Trovato A, Seno F, Tosatto SC: The PASTA server for protein aggregation prediction. Protein Eng Des Sel. 2007, 20: 521-523. 10.1093/protein/gzm042.
- Conchillo-Solé O, de Groot NS, Avilés FX, Vendrell J, Daura X, Ventura S: AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC Bioinformatics. 2007, 8: 65. 10.1186/1471-2105-8-65.
- Zhang Z, Chen H, Lai L: Identification of amyloid fibril-forming segments based on structure and residue-based statistical potential. Bioinformatics. 2007, 23: 2218-2225. 10.1093/bioinformatics/btm325.
- Tartaglia GG, Vendruscolo M: The Zyggregator method for predicting protein aggregation propensities. Chem Soc Rev. 2008, 37: 1395-1401. 10.1039/b706784b.
- Tartaglia GG, Vendruscolo M: Proteome-level interplay between folding and aggregation propensities of proteins. J Mol Biol. 2010, 402: 919-928. 10.1016/j.jmb.2010.08.013.
- Kim C, Choi J, Lee SJ, Welsh WJ, Yoon S: NetCSSP: web application for predicting chameleon sequences and amyloid fibril formation. Nucleic Acids Res. 2009, 37: W469-W473. 10.1093/nar/gkp351.
- O'Donnell CW, Waldispühl J, Lis M, Halfmann R, Devadas S, Lindquist S, Berger B: A method for probing the mutational landscape of amyloid structure. Bioinformatics. 2011, 27: i34-i42. 10.1093/bioinformatics/btr238.
- Bryan AW, O'Donnell CW, Menke M, Cowen LJ, Lindquist S, Berger B: STITCHER: dynamic assembly of likely amyloid and prion β-structures from secondary structure predictions. Proteins. 2011, 80: 410-420.
- Bryan AW, Menke M, Cowen LJ, Lindquist SL, Berger B: BETASCAN: probable beta-amyloids identified by pairwise probabilistic analysis. PLoS Comput Biol. 2009, 5: e1000333. 10.1371/journal.pcbi.1000333.
- Frousios KK, Iconomidou VA, Karletidi CM, Hamodrakas SJ: Amyloidogenic determinants are usually not buried. BMC Struct Biol. 2009, 9: 44. 10.1186/1472-6807-9-44.
- Stanislawski J, Kotulska M, Unold O: Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides. BMC Bioinformatics. 2013, 14: 21. 10.1186/1471-2105-14-21.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter. 2009, 11: 10-18. 10.1145/1656274.1656278.
- Maurer-Stroh S, Debulpaep M, Kuemmerer N, de la Paz Lopez M, Martins IC, Reumers J, Morris KL, Copland A, Serpell L, Serrano L, Schymkowitz JW, Rousseau F: Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nat Methods. 2010, 7: 237-242. 10.1038/nmeth.1432.
- David MP, Concepcion GP, Padlan EA: Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies. BMC Bioinformatics. 2010, 11: 79. 10.1186/1471-2105-11-79.
- Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A. 2000, 97: 10383-10388. 10.1073/pnas.97.19.10383.
- server: http://services.mbi.ucla.edu/zipperdb/
- server: http://bioinfo.protres.ru/fold-amyloid/amyloid_base.html
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
- server: http://waltz.switchlab.org/
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.