Amyloids are proteins capable of forming aberrant intramolecular contact sites that are characteristic of the beta zipper configuration and can lead to fibrils instead of the functional structure of a protein [1–5]. The process of amyloid oligomerization, which precedes fibril formation, is currently regarded as responsible for serious health conditions such as Alzheimer’s disease (amyloid-β, tau), Parkinson’s disease (α-synuclein), type 2 diabetes (amylin), Creutzfeldt-Jakob disease (prion protein), Huntington’s disease (huntingtin), amyotrophic lateral sclerosis (SOD1), and many others (for a review see e.g.). It is therefore of great interest to develop methods for predicting the mechanisms leading to this phenomenon. It has been proposed that short segments of amino acids can be responsible for the amyloidogenic properties [7, 8]. Such fragments are harmless only when they are buried inside a protein. The fragments responsible for the amyloidogenicity of a whole protein are believed to be 4–10 residues long, and it is often assumed that 6-residue fragments with amyloidogenic properties are sufficient “hot spots”. Amyloidogenic fragments can be recognized computationally, for example with physico-chemical methods such as Tango, ZipperDB [9, 11], Pasta, AggreScan, PreAmyl, Zyggregator, CamFold, NetCSSP, FoldAmyloid, AmyloidMutant [19, 20], BetaScan, and the consensus method AmylPred. Statistical methods have also been employed in the classification. In our previous work we used classical machine learning methods based on WEKA. Other methods include Waltz, which uses Position Specific Scoring Matrices (PSSM), and a Bayesian classifier and weighted decision tree applied to long sequences of bacterial antibodies.
No more than two hundred such hexapeptides have been found experimentally, so new computational algorithms are trained or validated on a scarce experimental dataset. Two papers published in BMC Bioinformatics presenting machine learning methods, Pafig and another approach based on Pafig, used their own method for extending the training and testing datasets. The authors assumed that all hexapeptides belonging to an amyloid protein can be regarded as amylo-positive, while those from proteins never reported as amyloid are always amylo-negative. Different machine learning methods were then trained to classify amyloid hexapeptides on a few thousand full-length proteins cut into hexapeptides, which were labeled according to this scheme. The classification, validated on hexapeptides obtained in the same way, produced seemingly good results.
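The protein-level labeling scheme described above can be illustrated with a minimal Python sketch. It is not the code used by Pafig; the function names (`hexapeptides`, `label_by_protein`, `conflicting`) and the two toy sequences are hypothetical and chosen only to show how every overlapping 6-residue window inherits the label of its parent protein.

```python
from collections import defaultdict

WINDOW = 6  # hexapeptide length


def hexapeptides(sequence, window=WINDOW):
    """Return every overlapping fragment of the given length."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]


def label_by_protein(proteins):
    """Map each hexapeptide to the set of labels it inherits.

    proteins: iterable of (sequence, is_amyloid) pairs, where the label
    describes the whole protein, not the individual fragment.
    """
    labels = defaultdict(set)
    for sequence, is_amyloid in proteins:
        for fragment in hexapeptides(sequence):
            labels[fragment].add(is_amyloid)
    return dict(labels)


def conflicting(labels):
    """Hexapeptides that received both a positive and a negative label."""
    return sorted(f for f, seen in labels.items() if len(seen) > 1)


# Toy input (invented sequences, for illustration only):
# one protein labeled amyloid, one labeled non-amyloid.
toy = [("AAAKLVFFAE", True), ("GGKLVFFAGG", False)]
labels = label_by_protein(toy)
print(conflicting(labels))
```

In this toy case the fragment shared by both sequences receives contradictory labels, and every other window of the "amyloid" sequence is marked positive regardless of its own properties, which is the kind of label noise such a scheme can introduce.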
However, experimental observations indicate that the amyloid propensity of a full protein can result from only one amyloidogenic fragment in this protein, while an amyloidogenic part that is well hidden inside the protein may never lead to fibril formation. This was confirmed by the results of the 3D profile method, which produced the largest computational database of potential amyloid hexapeptides, ZipperDB. The database contains many examples of proteins that include highly amyloidogenic fragments but have never been observed to form an amyloid. It is possible that those fragments are screened inside the protein and deprived of contacts with other fragments of high amyloid propensity, and are hence unable to start oligomerization and fibril formation.
Therefore, we decided to look more closely at the dataset proposed in Pafig (Hexpepset) and to validate the results of this method, which was trained on a dataset constructed contrary to these observations. For this purpose we performed a statistical analysis of the dataset with regard to possible false patterns or undesirable biases. We then used other state-of-the-art computational methods to classify amyloid hexapeptides and compared their results with those of Pafig by means of a clustering approach. The objective was to study how compatible Pafig is with other classification methods.