Amyloids are proteins that can form fibrils, highly ordered aggregates with a characteristic zipper structure [1–4]. The majority of these proteins have a completely different functional structure in their physiological state, although functional amyloids also exist [5, 6]. One hypothesis holds that in vivo amyloidogenic regions are usually capped by gatekeeper amino acids, such as prolines and glycines, which prevent aggregation and may have a high affinity for chaperone proteins. Amyloids often lead to serious diseases, such as Alzheimer disease (amyloid-β, tau), Parkinson disease (α-synuclein), type 2 diabetes (amylin), Creutzfeldt-Jakob disease (prion protein), Huntington disease (huntingtin), and amyotrophic lateral sclerosis (SOD1). The number of diseases found to be amyloid-associated is constantly increasing. Their toxicity is believed to be related to the insertion of immature aggregates into plasma membranes, where they act as non-selective ion channels.
Recently, it was discovered that amyloidogenic properties can be due to short segments of amino acids in a protein sequence (hot spots), which can transform the structure when they are not buried. It was proposed that hexapeptides can sufficiently represent such hot spots, although their length may vary between 4 and 10 amino acids. A few hundred such peptides have been identified experimentally; however, testing all combinations is not feasible. Instead, they can be predicted by computational methods.
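To illustrate the hexapeptide representation, a protein sequence can be scanned with a sliding window of six residues. The following is a minimal sketch in Python; the helper name `hexapeptide_windows` and the example sequence are hypothetical and chosen only for illustration. The code enumerates candidate segments and makes no prediction about them.

```python
def hexapeptide_windows(sequence, size=6):
    """Return (start, segment) pairs for every window of `size` residues.

    Hypothetical helper: it only lists candidate segments; deciding
    which of them are hot spots requires a separate classifier.
    """
    seq = sequence.upper()
    # A sequence shorter than `size` yields an empty list.
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


# Arbitrary example sequence, not taken from any dataset.
windows = hexapeptide_windows("MKVLAAGIVG")
for start, segment in windows:
    print(start, segment)
```

With 20 standard amino acids there are 20^6 = 64,000,000 possible hexapeptides, which gives a sense of why exhaustive experimental testing is out of reach.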
Several physico-chemical methods have been proposed to predict the amyloidogenicity of a peptide, e.g. Tango, ZipperDB [10, 11], Pasta, AggreScan, PreAmyl, Zyggregator, CamFold, NetCSSP, FoldAmyloid, AmyloidMutant [19, 20], BetaScan, and the consensus method AmylPred. The majority of these methods predict the probability that a sequence forms β-aggregates. This approach has not always been successful. Although β-aggregation is related to amyloidosis, their structural and biophysical properties differ [7, 9]. β-aggregation is quite common in highly concentrated proteins that do not form fibers. On the other hand, certain amyloids, such as prions, are poorly predicted by tools dedicated to β-aggregates.
Methods such as the 3D profile, applied in ZipperDB and AmyloidMutant, take into account more specific structural features of amyloids, namely their resemblance to a steric zipper, and work better in such cases. Statistical elements also seem to help in the classification, as shown by Waltz, which uses Position Specific Scoring Matrices (PSSM), and by a Bayesian classifier and a weighted decision tree applied to long sequences of bacterial antibodies.
Experimental datasets upon which new classification methods could be built are still very limited. Sequences that show amyloid propensity are rarely well characterized. For the majority of them, it is not known which segment is responsible for their amyloidogenicity, and few have a high-resolution experimental structure. The largest database of potentially amyloidogenic hexapeptides, generated with the 3D profile method, is ZipperDB. The classical 3D profile method uses over 2,500 scaffolds resembling a steric zipper structure, onto which the tested hexapeptides are threaded, and their minimal energy is calculated. If the minimal energy of one chain is below a threshold value, which can be obtained from an experimental dataset of hexapeptides, the hexapeptide is classified as amyloidogenic. The method is reasonable and quite accurate: the authors of Waltz tested it on an independent, experimentally derived dataset from the prion protein Sup35 and reported that the 3D profile method achieved an accuracy of 0.8, with a sensitivity of 0.67 and a specificity of 0.84. The ZipperDB database, which is freely available online, is constantly growing. It currently covers all ORFs from three genomes, H. sapiens, S. cerevisiae, and E. coli, with 50% redundancy. Interestingly, the database shows hot spots in the majority of proteins. This does not mean that they can easily turn into amyloids under physiological conditions, but it reveals new and interesting aspects of the topic. Unfortunately, the 3D profile method is computationally very expensive and not simple to use.
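The decision rule of the 3D profile method and the three reported quality measures can be sketched as follows. This is only an illustration under stated assumptions: the threshold value, the peptide labels, and all energies are made-up placeholders, each peptide has three scaffold energies instead of thousands, and the threading step that produces the energies is not modelled at all.

```python
# Placeholder cutoff (assumption); a real value would be derived from an
# experimental dataset of hexapeptides.
ENERGY_THRESHOLD = -23.0


def is_amyloidogenic(scaffold_energies, threshold=ENERGY_THRESHOLD):
    """Apply the rule: minimal energy over all scaffolds below the threshold."""
    return min(scaffold_energies) < threshold


def evaluate(predicted, actual):
    """Compute accuracy, sensitivity and specificity from boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of amyloids detected
        "specificity": tn / (tn + fp),  # fraction of non-amyloids rejected
    }


# Toy data (invented): peptide label -> (scaffold energies, experimental label).
toy_data = {
    "PEP_A": ([-25.1, -18.0, -21.3], True),
    "PEP_B": ([-19.5, -22.0, -17.2], True),
    "PEP_C": ([-15.0, -12.4, -20.1], False),
    "PEP_D": ([-24.0, -10.0, -11.0], False),
    "PEP_E": ([-14.0, -16.5, -13.0], False),
}

predicted = [is_amyloidogenic(energies) for energies, _ in toy_data.values()]
actual = [label for _, label in toy_data.values()]
metrics = evaluate(predicted, actual)
print(predicted)
print(metrics)
```

Taking the minimum over scaffolds means that a single favourable scaffold is enough to label a peptide amyloidogenic, which is also why the cost grows with the number of scaffolds that must be evaluated for every candidate.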
In this paper, we propose two methods to extend the ZipperDB dataset by classifying hexapeptide candidates at lower computational cost. The first is closely related to the original idea of ZipperDB and only reduces the number of profiles. The second, which provides the main gain in efficiency, uses a completely different, statistical approach: machine learning. Both methods are tested against the original ZipperDB classification.