Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies
BMC Bioinformatics volume 11, Article number: 79 (2010)
All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences.
The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.16% for a holdout test set comprising 103 amyloidogenic and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size; for the holdout test set, this accuracy was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%.
This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and expanding the training set to include not only more derivatives but also more alignments would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
Antibodies are used in a number of therapeutic procedures such as target-specific anti-cancer therapy, immunosuppression, and purging prior to bone marrow transplants. Most of those antibodies are of nonhuman origin, and their administration often results in the generation of adverse immune responses, which also limit their efficacy. Humanization is usually performed to lessen the occurrence of these responses, to improve circulation half-life, and to restore effector functions [1, 2]. Current humanization strategies include the retention of variable domains or the specificity-determining residues (SDR) only, grafting of complementarity-determining regions (CDR), and veneering [3–6].
Humanization, however, may decrease the thermal stability of an antibody and result in affinity reduction, as well as amyloid fibril formation, especially when the substitutions leave the humanized antibody prone to unfolding [3, 7, 8]. Studies indicate that the potential to form fibrils is a general property of polypeptide chains, but the propensity for amyloidosis is largely influenced by the sequence of a chain and the stability of its native state [9–11]. Furthermore, there is evidence that some antibody sequences, notably kappa light chain sequences, become prone to fibril formation due to point mutations acquired during affinity maturation. Apart from these, events that lead to misfolding, such as conformational transitions between alpha helices and beta sheets, and partial or complete unfolding, could lead to amyloidosis [13–15]. Consequently, it would be of interest to develop a method to predict such events, as well as to identify mutations that could lead to amyloidosis. Currently, a number of computational methods are available for amyloidogenic potential prediction [16–18]. These generally use either the physicochemical properties of amino acids to create models for predicting aggregation rate on mutation and identifying hotspots, or the information from overlapping amyloidogenic polypeptide decomposition. Recently, a method using mean packing density profiling has also been reported, and has been found to predict both amyloidogenic and intrinsically disordered regions in both peptides and proteins. Nevertheless, these methods yield predictions on which regions of a sequence are potentially amyloidogenic; for highly similar sequences, as is the case with both amyloidogenic and non-amyloidogenic antibodies, results from such methods are difficult to distinguish (see Supplementary Information, additional file 1).
In this paper, we explore the use of naive Bayesian and decision tree classification methods for predicting the amyloidogenic propensities of antibody sequences, with the primary application of predicting amyloidogenic propensities of engineered antibodies in mind. The naive Bayesian method provides the advantage of taking the effects of mutations at specific combinations of positions into account. The decision tree, on the other hand, intuitively allows the evaluation of more factors that may contribute to the amyloidogenic potential. For generating the classifiers in both methods, 143 amyloidogenic antibody sequences derived from twelve different germlines and 158 corresponding non-amyloidogenic derivatives were used. The unambiguous assignment of amyloidogenic and non-amyloidogenic sequences to their respective germlines is a critical premise in this paper. Germlines are DNA elements that define the basic, inherited antibody repertoire of an individual, which are rearranged and mutated during the response to foreign antigens. As indicated previously, some sequences become prone to fibril formation after this mutation process; consequently, the generation of separate alignments for the amyloidogenic and non-amyloidogenic derivatives of a single germline might lead to the identification of mutation patterns or characteristics exclusively associated with amyloidosis. It is critical that sequences are assigned correctly to a germline in order to ensure that the mutations observed are actual mutations, and do not arise from incorrect alignments. All alignments used in this paper are hand-annotated.
To test the classifiers and to evaluate the effects of the training set size, a holdout test set consisting of an additional 103 amyloidogenic sequences and 28 non-amyloidogenic sequences for eight of the twelve germlines was used. The naive Bayesian method, which is solely based on positional information, yields a prediction accuracy of 60.84% for amyloid-formers after LOO cross-validation, which is consistent with the 61.16% accuracy for the holdout test set. When the latter is included in the training set, LOO cross-validation accuracy increases to 81.08%. Sequences classified using a decision tree, on the other hand, yielded an average prediction accuracy of 78.64% for the holdout test set.
A direct implementation of the Naive Bayesian method results in prediction accuracies between 60.84% and 81.08%
LOO cross-validation was performed to evaluate the accuracy of the Bayesian classifier; this particular method was used to allow the calibration data to be reused as test samples while simulating the prediction of future unknowns. The average accuracy from this validation was 60.84 ± 35.96% for classifying amyloidogenic sequences, with 25.95% of the non-amyloidogenic sequences being misclassified (Table 1, AMC and NAMC). Validation performed on the holdout test set yielded an average accuracy of 61.16 ± 13.75%, which falls within the range of the LOO cross-validation result (Table 1, AM Test).
To evaluate the effects of training set size, the holdout test set was combined with the original training set to generate a new set of classifiers. These were again subjected to LOO cross-validation, yielding a higher average accuracy of 81.08 ± 29.33% (Table 1, AMC, new).
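The LOO procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `leave_one_out` helper, the majority-class stand-in classifier, and the toy samples (sets of mutated positions with `AM`/`NAM` labels) are all assumptions introduced here. Per-class accuracies are returned because the text reports amyloidogenic and non-amyloidogenic results separately.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Sample = Tuple[Any, str]  # (features, class label)


def leave_one_out(
    samples: List[Sample],
    train: Callable[[List[Sample]], Any],
    predict: Callable[[Any, Any], str],
) -> Dict[str, float]:
    """Return, per class, the fraction of held-out samples predicted correctly."""
    hits: Counter = Counter()
    totals: Counter = Counter()
    for i, (features, label) in enumerate(samples):
        # Retrain on every sample except the i-th, then predict the i-th.
        model = train(samples[:i] + samples[i + 1:])
        totals[label] += 1
        if predict(model, features) == label:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy stand-in classifier (hypothetical): always predicts the majority class.
def train_majority(training: List[Sample]) -> str:
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model: str, features: Any) -> str:
    return model


# Toy data (hypothetical): mutated positions and a class label.
toy = [({24, 30}, "AM"), ({30}, "AM"), ({24, 91}, "AM"), ({53}, "NAM")]
loo = leave_one_out(toy, train_majority, predict_majority)
print(loo)
```

Any pair of `train` and `predict` functions with the same shape can be passed in, so the same loop could wrap a per-germline classifier; the model is rebuilt once per held-out sequence.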
Germline-specific decision trees result in an average prediction accuracy of 78%
In order to construct a decision tree, we analyzed the nature of the mutations exclusively associated with amyloid formers using an algorithm and accompanying visualization program that we have previously developed [22, 23]. Results indicate that mutations occurring exclusively in CDR residues or in FR residues of amyloidogenic derivatives are most likely the biggest contributors to misfolding, with 69% of the mutations in exposed CDRs resulting in a general increase in sheet-forming propensity, as opposed to 36% in buried FRs (Figures 1 and 2; Table 2). In contrast, the complements (31% for exposed CDRs and 64% for buried FRs) resulted in decreased sheet-forming propensities. We used this information as branch weights for an initial decision tree (Table 3); before establishing the weight thresholds for classification, however, we checked whether the paths taken by amyloidogenic and non-amyloidogenic derivatives can be generalized. Interestingly, we found no consensus paths for either amyloidogenic or non-amyloidogenic sequences; instead, consensus paths appear to exist for each germline (Figure 3A, Table 4). Consequently, we constructed a second decision tree which takes the germline of origin into account, as was the case in the Bayesian analysis. Depending on the germline, weights along selected paths are either boosted or decreased (Figure 3B, Table 4). Thresholds for separation were chosen to maximally distinguish samples in the training set (Table 5), and were evaluated using the holdout test set. Table 6 lists the classification results per germline.
The diversity of the antibody repertoire is generated through the combinatorial recombination of a small pool of germline genes and their somatic hypermutation. Nevertheless, these diversification processes have setbacks, including the generation of autoreactive antibodies as well as structurally compromised antibodies. The latter are implicated in diseases that range from benign, high-level soluble light-chain production to pathological deposition in glomerular basal membrane cells, bone marrow plasma cells, interstitial tissues, arterial walls and basement membranes [24, 25]. These unwanted effects often result from a set of mutations whose consequences on the structure are so subtle that the resulting unstable light chains evade elimination during posttranslational quality control [24, 26]. Avoiding such mutations or combinations thereof is critical in antibody engineering.
From studies carried out on amyloidogenic antibodies, some patterns that can be linked to amyloidosis have been found. Poshusta and co-workers, for instance, have reported that non-conservative mutations account for 0.6 - 0.79 of the total mutations in Vλ sequences, and for 0.4 - 0.59 of the mutations in Vκ sequences. They also reported differences in the location of these mutations in patients with different secreted levels of light chains. Specifically, it is implied that the position of mutations, and not the amount secreted, plays a more important role in light chain amyloidogenic propensity, based on studies on patients with very low light chain levels but advanced amyloid deposition. Consequently, it is clear that at least two factors have to be considered in generating a protocol for predicting amyloid formation: the combination of positions at which the mutations occur, and how these affect the structural stability of the antibody.
A review by Caflisch classified the computational approaches used in predicting protein and peptide aggregation propensity into two general groups. The first makes use of the physicochemical properties of the amino acids to create phenomenological models for predicting aggregation behavior on mutation. The second, on the other hand, uses the decomposition of amyloidogenic peptides into overlapping segments. These are then simulated at the atomic level to obtain estimates of aggregation propensity, as well as the structural details of the aggregates. Some programs that have since been developed to deal with amyloidosis include the PASTA server [28, 29], a fibril prediction program, AGGRESCAN, Zyggregator, and Pafig, among others. Nevertheless, these algorithms deal with the prediction of the segments involved or possibly involved in amyloidosis, but do not generate direct predictions on whether a given sequence will be amyloidogenic or not. Here, we propose methods that may be used to complement existing prediction protocols in obtaining direct predictions about the amyloidogenicity of an antibody sequence; the method may be extended to other protein types, provided that there are sufficiently related positive and negative training sets.
A Naive Bayesian classifier uses probabilities to link hypotheses to events defined by a set of attributes. In Mitchell, the Naive Bayesian classifier $v_{NB}$ is defined as:

$$v_{NB} = \underset{v_j \in V}{\arg\max}\; P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

where $v_j$ is one of a set $V$ of classes and $a_i$ is one of $n$ attributes describing an event.
This approach is attractive for the current problem, where there are only two possible outcomes. The most straightforward way of applying it is to use information of the combinations of positions at which mutations occur in amyloidogenic and non-amyloidogenic derivatives of a single germline. For example, to gauge the probability that a test sequence x derived from a germline g will be amyloidogenic, one would use the Bayes equation to evaluate the association between the positional combination of mutations, c, in x and the two hypotheses:
where $x_{m1}, x_{m2}, \ldots, x_{mn}$ define $c$, and with $p_{AM}$ and $p_{NAM}$ being defined by the positional mutational probabilities in amyloidogenic and non-amyloidogenic derivatives, respectively. Applying this method (Methods section, equations 4 and 5; Figure 4) yielded an average prediction accuracy of 60.84%; for an independent test set, the accuracy was 61.16% (Table 1). When the test set is used for training as well, the accuracy of amyloid sequence classification increases to 81.08%. Misclassification of non-amyloidogenic sequences is also reduced by an average of 3% (Table 1, NAM Test). This correlation between the size of the training set and prediction accuracy has been previously observed. It is worth noting that the prediction accuracy for derivatives of the germline X72813 did not improve significantly even after the augmentation of the data set. Predictions for this germline are similarly low with the decision tree. Interestingly, most of the derivatives of X72813 are implicated in light chain deposition disease (LCDD). A notable feature of LCDD-associated sequences is that when these are synthesized in vitro, the resulting proteins do not aggregate. Furthermore, the analysis of these sequences frequently shows no obvious predisposition towards misfolding. This may explain the difficulty in obtaining correct predictions for the amyloid-forming derivatives of this germline. If this set is treated as an outlier, the average prediction accuracy is 83.64 ± 18.49%.
In general, however, it is imperative to increase the training set size, not only in terms of the number of derivatives per germline but also in terms of the number of germlines covered, in order to improve the performance of the classifier. Developing a program for automatically generating training sets is a non-trivial task, however, and is beyond the scope of this study. It could also be possible to consider other characteristics, such as the physico-chemical and structural effects of a mutation, as factors for defining $p_{AM}$ or $p_{NAM}$. Nevertheless, the question of how such factors would be incorporated in the calculation has to be justified first, from both statistical and biological points of view. Since our main interest is to provide a proof-of-concept that a simple set of classification algorithms may be used for predicting amyloidosis, we opted to complement the Bayesian method with a decision tree, where one could factor in additional effects of mutations for classifying sequences.
Decision trees are particularly useful in classifying unknowns into one of a finite number of categories, based on the results of a series of tests on the attributes of a sample [36, 37]. A tree works by posing a series of questions about the features associated with unknowns; each question is contained in a node, and each node has child nodes for each possible answer to its question [38, 39]. It eventually terminates in leaves, which correspond to a classification. There are many variants of decision trees; in the simplest form, 'yes'/'no' paths are followed throughout the classification process; in others, probability distributions over the classes are used in order to estimate the conditional probability that an item reaching a leaf belongs to the class it defines. In biology, decision trees have been used in Parkinson's disease management, disease severity profiling [41, 42], toxicity analysis, large-scale proteomic studies [44, 45], microarray data classification and phylogenetic analysis, among other applications. Depending on the number of factors that will be considered to classify the samples, decision trees may be made by hand or constructed automatically using a learning or an optimization algorithm [38, 47]. Choosing these factors and their arrangement on the tree to optimally separate samples remains a challenge in the creation of decision trees; algorithms have since been developed for optimal tree creation [36–38]. For this study, four splitting variables were considered, based on the mutation trends observed in both amyloidogenic and non-amyloidogenic samples.
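The simplest yes/no form of a decision tree described above can be illustrated as follows. This is a generic sketch, not the tree used in this study: the `Node` and `Leaf` classes, the attribute names `in_cdr` and `exposed`, and the leaf labels are hypothetical and chosen only to echo the CDR/FR and exposed/buried distinctions discussed in the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # classification assigned to items that end here


@dataclass
class Node:
    question: Callable[[Dict[str, bool]], bool]  # test on one attribute
    yes: Union["Node", Leaf]  # child followed when the answer is yes
    no: Union["Node", Leaf]   # child followed when the answer is no


def classify(tree: Union[Node, Leaf], item: Dict[str, bool]) -> str:
    """Follow yes/no answers from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(item) else tree.no
    return tree.label


# Hypothetical two-level tree over two illustrative attributes.
toy_tree = Node(
    question=lambda m: m["in_cdr"],
    yes=Node(lambda m: m["exposed"], Leaf("exposed CDR"), Leaf("buried CDR")),
    no=Node(lambda m: m["exposed"], Leaf("exposed FR"), Leaf("buried FR")),
)

result = classify(toy_tree, {"in_cdr": True, "exposed": False})
print(result)
```

A weighted variant would attach a number to each branch and combine the numbers along the path instead of returning only the leaf label.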
In order to obtain weights for the splitting variables, mutation matrices were generated for the amyloidogenic and non-amyloidogenic derivatives of the different germlines. An interesting result from the analysis of these matrices is that 69% of the mutations exclusively found in exposed CDR residues of amyloid formers appear to be implicated in higher sheet-forming propensities, while 64% of those exclusive to buried FR residues involve shifts to residues with lower sheet-forming propensities (Figures 1 and 2, Table 2). This may suggest that mutations stabilizing sheet structures in the CDR, which normally assume loop structures, contribute as much to amyloidosis as those that destabilize the sheet structure in critical regions (i.e. buried FR residues). This is not unlikely, based on some previous observations. Hurle et al., for instance, performed a positional analysis of 36 amyloidogenic sequences to find mutations that occur in less than 1% of all sequences at a particular position. These mutations were mostly found in CDRs, notably CDR1, for both κ and λ light chains. Furthermore, Stevens et al. observed that 24 out of the 26 invariant residues in κ light chains which drastically affect the structure of the antibody upon mutation are found on the protein surface, and make no obvious contributions to folding. Mutations in CDRs are generally more varied, and their contributions to amyloidosis, though not as easy to pinpoint, are probably very significant. Finally, these results are consistent with predictions using other methods (see supplementary information, additional file 1); this consistency may be viewed as a validation of our observations.
From these observations, a decision tree was created to approximate the contribution of each mutation to the overall amyloidogenicity of a sequence. The use of this tree on the independent test set yielded a prediction accuracy of 78.64% (Table 6), which is close to the 75% prediction accuracy obtained when the decision tree is tested on training set sequences. LOO cross validation was not performed for this method, since this would require weights to be changed as many times as there are sequences. Classifiers generated with the training set appear to perform better than those from the naive Bayesian method. One possible reason is that more factors are taken into consideration: the tree approximates the effect of the mutation itself, as well as the effect of its occurring in a particular region; at the same time, it also roughly approximates the combined effect of mutations, which are likely to be equally responsible for misfolding as individual mutations [27, 50]. Nevertheless, this does not imply that the naive Bayesian method is entirely without merit, since it is clear that the position or combination of positions at which mutations occur has a key role in amyloidosis. It is also evident that more sequences have to be used, as with the naive Bayesian method. Prediction results will probably also be improved by including additional factors such as hydrophilicity, size and charge changes as splitting variables, or by refining the positions based on precedent studies. In adding splitting variables, the construction of a decision tree could be performed using an [automated] optimization algorithm.
A caveat for both methods, however, is the possibility of overfitting, which is the description of random error, instead of true correlations. This phenomenon is one of the key problems in machine learning, and may occur when there are more degrees of freedom than data [51, 52]. Overfitted model results are not representative of the population behavior, and are unlikely to be replicated. There are several rules of thumb for avoiding overfitting, which include having a minimum of 10 - 15 observations per predictor variable, with larger sample sizes required in cases where the effect sizes are small, or when predictors are highly correlated. For binary response models, the sample size may not be directly relevant, although for this problem, it appears that sample size plays an important role. Due to the limited sample set size, it was only possible to perform a single holdout validation and LOO cross validation, whose results were consistent. However, for future work involving larger training sets, it would be possible to include measures and perform more definitive tests to ensure that overfitting is eliminated or minimized.
This exploratory study indicates that the Naive Bayesian classifier and decision trees may be used for "yes"- or "no"-type predictions on the amyloidogenicity of a sequence. Analysis of results from both methods suggests that prediction accuracy may be improved by optimizing the training set sizes, and by incorporating more information about the alterations brought about by mutations into the calculations. Some other factors that may be considered include hydrophilicity and charge changes brought about by the replacement residues, with respect to their location, as well as the way the mutations cluster in sequences with known structures. Another factor that might be considered is the sequence of immunoglobulin folding and the implications of having mutations in the N-terminal region, which is the first to be folded. The further development of these classification techniques, including the possibility of creating a hybrid between Naive Bayesian and decision trees, appears to be worthwhile; these methods may eventually be adapted for predicting the amyloidogenicity of non-immunoglobulin sequences.
The training set, comprising 143 amyloidogenic and 158 non-amyloidogenic derivatives of the germlines, was obtained from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/). A holdout test set comprising 103 amyloidogenic and 28 non-amyloidogenic sequences, chosen on account of the absence of gaps, as well as the possibility of assigning these unambiguously to a germline set, was also obtained from the NCBI. Sequences were assigned to the closest germline using ClustalW, and resulting alignments were manually annotated. Kabat numbering and CDR/FR definitions were applied to all sequences. The non-amyloidogenic derivation sets were constructed from randomly chosen derivatives of each germline which have, as a derivation set, approximately the same total number of mutations as the amyloidogenic counterparts. The first five amino acid residues were omitted from the analysis, since these may have been primer-derived. All sequences of the amyloidogenic and non-amyloidogenic antibodies used in the analysis, which are identified by their NCBI accession codes, as well as their putative germline derivation, are in the supplementary information (additional file 2).
Naive Bayesian Classification
We generated a Naive Bayesian Classifier for each germline on the basis of its amyloidogenic and non-amyloidogenic derivatives. Briefly, the probability $p$ of a mutation occurring at position $x$ was quantified for both amyloidogenic ($p_{AM}$) and non-amyloidogenic ($p_{NAM}$) derivatives of the same germline. Raw values of $p_{AM}$ and $p_{NAM}$ can take the value of 0; to avoid this, we used the Laplace correction method, where 1 is added to the numerator and 2 to the denominator. The respective complements, $q_{AM}$ and $q_{NAM}$, which represent the retention of the residue, are given by $1 - p_{AM}$ and $1 - p_{NAM}$, respectively. These probabilities are then used to calculate the amyloidogenic and non-amyloidogenic propensities for a test sequence $s$ derived from the same germline as the training set. Supposing that $s$ has mutations at positions defined by the set $M$, the amyloidogenic probability $AM$ will be calculated as:
while the non-amyloidogenic probability is calculated as:
where $x$ refers to the position (Figure 4). If $AM$ is greater than $NAM$, then the sequence is classified as amyloidogenic; otherwise, it is classified as non-amyloidogenic. Classifier accuracy was cross-checked against both the training and test sets. Due to the limited number of sequences obtained, validation is preliminary, and consists of a LOO cross-validation, performed for all amyloidogenic and non-amyloidogenic derivatives, and a one-time holdout test validation.
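The per-germline classification rule can be sketched as below. The sketch rests on one reading of the description above, which is an assumption rather than a reproduction of equations 4 and 5: each propensity is taken as the product of $p$ over the mutated positions in $M$ and of $q = 1 - p$ over the retained positions, with no class prior. Logarithms are summed instead of multiplying probabilities, purely for numerical stability. The function names and the toy position sets are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Set


def mutation_probabilities(
    derivatives: List[Set[int]], positions: Iterable[int]
) -> Dict[int, float]:
    """Laplace-corrected probability of a mutation at each position.

    Each derivative is given as the set of positions at which it differs
    from the germline; 1 is added to the count and 2 to the total.
    """
    n = len(derivatives)
    return {
        x: (sum(1 for d in derivatives if x in d) + 1) / (n + 2)
        for x in positions
    }


def log_propensity(p: Dict[int, float], mutated: Set[int]) -> float:
    """Sum of log p at mutated positions and log q = log(1 - p) elsewhere."""
    return sum(
        math.log(p[x]) if x in mutated else math.log(1.0 - p[x]) for x in p
    )


def classify(mutated: Set[int], p_am: Dict[int, float], p_nam: Dict[int, float]) -> str:
    """Amyloidogenic if AM > NAM, otherwise non-amyloidogenic."""
    am = log_propensity(p_am, mutated)
    nam = log_propensity(p_nam, mutated)
    return "amyloidogenic" if am > nam else "non-amyloidogenic"


# Toy data (hypothetical): three aligned positions, three derivatives per class.
positions = [1, 2, 3]
p_am = mutation_probabilities([{1, 2}, {1}, {1, 3}], positions)
p_nam = mutation_probabilities([{3}, {3}, {2, 3}], positions)

print(classify({1}, p_am, p_nam))
print(classify({3}, p_am, p_nam))
```

Because the Laplace correction keeps every probability strictly between 0 and 1, both logarithms are always defined, which is the practical reason for applying it before taking products.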
Decision tree generation and sequence classification
A weighted decision tree was constructed to provide a quantitative estimate of both individual and joint contributions of mutations as functions of location (i.e. CDR/FR), exposure and changes in sheet-forming propensity. The steps for generating the tree are shown in Figure 5. Initially, separate mutation matrices for buried CDR residues, buried FR residues, exposed CDR residues, and exposed FR residues were generated for alignments of amyloidogenic and non-amyloidogenic derivatives, based on a previously described algorithm. Here, exposed residues were defined as residues having ≥ 25% accessible surface; exposure information was generated for each alignment using structural homologues of the germline sequence (see supplementary information, additional file 2). The matrices were then visualized to facilitate analysis, then post-processed by subtracting the non-amyloidogenic from the amyloidogenic matrix image, resulting in an image where the relative intensities are proportional to the predominance of specific mutations. A binary matrix containing mutations exclusive to amyloid-formers was also generated. In the matrices, residues were arranged according to increasing β-sheet-forming propensities (Table 7), with the original residues in the rows and the replacement residues in the columns, such that all mutations to the right of the diagonal are associated with increased sheet-forming propensities, while those to the left correspond to decreased sheet-forming propensities (Figure 2; Figure 5, step 1). The trends observed in these matrices (Figures 1, 2 and 5, step 2; Table 2) were then used as weights, which were associated with the branches of the tree. At this point, we determined whether the paths taken by amyloid and non-amyloid-formers could be generalized, or whether these showed germline dependence.
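The matrix bookkeeping described above can be illustrated with a small sketch. It is not the previously published algorithm: the six-residue `ORDER` list is an arbitrary placeholder standing in for the propensity ordering of Table 7, the subtraction is done on raw counts rather than on matrix images, and all function names and toy mutations are hypothetical.

```python
from typing import Dict, List, Tuple

# Placeholder ordering by increasing sheet-forming propensity.
# This is NOT Table 7; it is an arbitrary six-residue list for illustration.
ORDER: List[str] = ["G", "D", "S", "A", "L", "V"]
INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(ORDER)}

Mutation = Tuple[str, str]  # (original residue, replacement residue)
Matrix = List[List[int]]


def mutation_matrix(mutations: List[Mutation]) -> Matrix:
    """Count mutations: rows are original residues, columns are replacements."""
    m = [[0] * len(ORDER) for _ in ORDER]
    for original, replacement in mutations:
        m[INDEX[original]][INDEX[replacement]] += 1
    return m


def difference(am: Matrix, nam: Matrix) -> Matrix:
    """Amyloidogenic counts minus non-amyloidogenic counts, cell by cell."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(am, nam)]


def exclusive(am: Matrix, nam: Matrix) -> Matrix:
    """Binary matrix: 1 where a mutation is seen only in amyloid formers."""
    return [
        [int(a > 0 and b == 0) for a, b in zip(ra, rb)]
        for ra, rb in zip(am, nam)
    ]


def propensity_change(original: str, replacement: str) -> str:
    """Right of the diagonal means increased sheet-forming propensity."""
    delta = INDEX[replacement] - INDEX[original]
    return "increase" if delta > 0 else "decrease" if delta < 0 else "none"


# Toy mutations (hypothetical) for one region class, e.g. exposed CDR.
am = mutation_matrix([("G", "V"), ("G", "V"), ("V", "G")])
nam = mutation_matrix([("V", "G")])
diff = difference(am, nam)
only_am = exclusive(am, nam)
print(propensity_change("G", "V"))
```

In the procedure described in the text, one such pair of matrices would be built for each of the four region classes (buried or exposed, CDR or FR), and the fraction of amyloid-exclusive mutations on each side of the diagonal would supply the branch weights.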
This led to the identification of paths that may be used in maximizing separation between amyloidogenic and non-amyloidogenic derivatives per germline (Table 4; Figure 5, step 3); for instance, amyloidogenic derivatives of X93627 can be maximally separated from corresponding non-amyloidogenic derivatives by giving a tenfold higher score to mutations that follow the path leading to leaf 2 and a tenfold lower score for mutations leading to leaf 8. Boosted and decreased paths to specific leaves are indicated in Table 4 in boldface and italics, respectively. Consequently, tracing the path through the tree that describes each mutation yields a score, $s$, calculated as the product of the weights along the path. Using this strategy, the average amyloidogenic potential for every sequence, $AM_{seq}$, was calculated as follows:

$$AM_{seq} = \frac{1}{n} \sum_{i=1}^{n} s_i$$

where $s_i$ corresponds to the score of the $i$-th individual mutation, and $n$ corresponds to the number of mutations in a sequence. Since $s$ is amplified in certain paths, amyloidogenic sequences are expected to have higher $AM_{seq}$ values. Thresholds for classifying sequences as amyloidogenic or non-amyloidogenic were defined per germline based on the average scores of amyloidogenic derivatives (Figure 5, step 4). Cross-validation was performed on the holdout test set (Figure 5, step 5).
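The scoring step can be sketched as follows, under stated assumptions. The tenfold boost for leaf 2 and tenfold reduction for leaf 8 follow the X93627 example in the text; the branch weights of 0.5, the threshold of 1.0, the use of `>=` at the threshold, and the value returned for a sequence without mutations are hypothetical choices made only so the example runs.

```python
from math import prod
from typing import Dict, Sequence


def mutation_score(
    path_weights: Sequence[float], leaf: int, leaf_multipliers: Dict[int, float]
) -> float:
    """Product of the branch weights on the path, scaled by any
    germline-specific boost or reduction attached to the leaf reached."""
    return prod(path_weights) * leaf_multipliers.get(leaf, 1.0)


def am_seq(scores: Sequence[float]) -> float:
    """Average score over the n mutations of one sequence.

    Returning 0.0 for a sequence with no mutations is an assumption.
    """
    return sum(scores) / len(scores) if scores else 0.0


def classify(value: float, threshold: float) -> str:
    """Compare AM_seq with a per-germline threshold (>= is an assumption)."""
    return "amyloidogenic" if value >= threshold else "non-amyloidogenic"


# Multipliers for the X93627 example: leaf 2 boosted, leaf 8 reduced tenfold.
multipliers = {2: 10.0, 8: 0.1}

# Two mutations with hypothetical branch weights, ending in leaves 2 and 8.
scores = [
    mutation_score([0.5, 0.5], 2, multipliers),
    mutation_score([0.5, 0.5], 8, multipliers),
]
value = am_seq(scores)
print(classify(value, threshold=1.0))
```

Because the boosted and reduced paths differ by two orders of magnitude, a single mutation on a boosted path can dominate the average, which is consistent with the expectation that amyloidogenic sequences reach higher $AM_{seq}$ values.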
Presta L: Antibody engineering. Curr Opin Biotechnol 1992, 3: 394–398. 10.1016/0958-1669(92)90168-I
Presta L: Antibody engineering for therapeutics. Current Opinion in Structural Biology 2003, 13(4):519–525. 10.1016/S0959-440X(03)00103-9
Padlan E: A possible procedure for reducing the immunogenicity of antibody variable domains while preserving their ligand-binding properties. Molecular Immunology 1991, 28(4–5):489–498. 10.1016/0161-5890(91)90163-E
Roguska M, Pedersen J, Keddy C: Humanization of murine monoclonal antibodies through variable domain resurfacing. Proceedings of the National Academy of Sciences 1994, 91: 969–973. 10.1073/pnas.91.3.969
Clark M: Antibody humanization: a case of the 'Emperor's new clothes'? Immunol Today 2000, 21: 397–402. 10.1016/S0167-5699(00)01680-7
Ewert S, Honegger A, Plückthun A: Stability improvement of antibodies for extracellular and intracellular applications: CDR grafting to stable frameworks and structure-based framework engineering. Methods 2004, 34(2):184–199. 10.1016/j.ymeth.2004.04.007
Hurle M, Helms L, Li L, Chan W, Wetzel R: A role for destabilizing amino acid replacements in light-chain amyloidosis. Proceedings of the National Academy of Sciences 1994, 91: 5446–5450. 10.1073/pnas.91.12.5446
Mateo C: Humanization of a mouse monoclonal antibody that blocks the epidermal growth factor receptor: recovery of antagonistic activity. Immunotechnology 1997, 3: 71–81. 10.1016/S1380-2933(97)00065-1
de la Paz ML, Serrano L: Sequence determinants of amyloid fibril formation. Proceedings of the National Academy of Sciences 2004, 101: 87–92. 10.1073/pnas.2634884100
Srisailam S, Wang HM, Kumar T, Rajalingam D, Sivaraja V, Sheu HS, Chang YC, Yu C: Amyloid-like Fibril Formation in an All beta-Barrel Protein Involves the Formation of Partially Structured Intermediate(s). Journal of Biological Chemistry 2002, 277(21):19027. 10.1074/jbc.M110762200
Villegas V, Zurdo J, Filimonov V, Aviles F, Dobson C, Serrano L: Protein engineering as a strategy to avoid formation of amyloid fibrils. Protein Science 2000, 9: 1700–1708. 10.1110/ps.9.9.1700
Vidal R, Goni F, Stevens F, Aucouturier P, Kumar A, Frangione B, Ghiso J, Gallo G: Somatic Mutations of the L12a Gene in V-kappa1 Light Chain Deposition Disease: Potential Effects on Aberrant Protein Conformation and Deposition. American Journal of Pathology 1999, 155(6):2009.
Uversky VN, Fink AL: Conformational constraints for amyloid fibrillation: the importance of being unfolded. Biochimica et Biophysica Acta (BBA) - Proteins & Proteomics 2004, 1698(2):131–153. 10.1016/j.bbapap.2003.12.008
Ding F, Borreguero J, Buldyrev S: Mechanism for the α-helix to β-hairpin transition. Proteins: Structure, Function and Genetics 2003, 53: 220–228. 10.1002/prot.10468
Gross M, Gross M, Wilkins DK, Wilkins DK, Pitkeathly MC, Pitkeathly MC, Chung EW, Chung EW, Higham C, Higham C, Clark A, Clark A, Dobson CM, Dobson CM: Formation of amyloid fibrils by peptides derived from the bacterial cold shock protein CspB. Protein Sci 1999, 8(6):1350. 10.1110/ps.8.6.1350
Conchillo-Solé O, Groot NSD, Avilés FX, Vendrell J, Daura X, Ventura S: AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC bioinformatics 2007, 8: 65. 10.1186/1471-2105-8-65
Caflisch A: Computational models for the prediction of polypeptide aggregation propensity. Current opinion in chemical biology 2006, 10(5):437–44. 10.1016/j.cbpa.2006.07.009
Zavaljevski N, Stevens F, Reifman J: Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics 2002, 18: 689–696. 10.1093/bioinformatics/18.5.689
Galzitskaya O, Garbuzynskiy S, Lobanov M: Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput Biol 2006, 2: e177. 10.1371/journal.pcbi.0020177
Behar SM, Scharff MD: Somatic diversification of the S107 (T15) VH11 germ-line gene that encodes the heavy-chain variable region of antibodies to double-stranded DNA in (NZB × NZW)F1 mice. Proc Natl Acad Sci USA 1988, 85(11):3970. 10.1073/pnas.85.11.3970
Hawkins D: The problem of overfitting. J Chem Inf Comput Sci 2004, 44: 1–12.
David M, Asprer J, Ibana J, Concepcion G, Padlan E: A study of the structural correlates of affinity maturation: antibody affinity as a function of chemical interactions, structural plasticity and stability. Molecular Immunology 2007, 44: 1342–1351. 10.1016/j.molimm.2006.05.006
David M, Lapid C, Daria V: An efficient visualization tool for the analysis of protein mutation matrices. BMC bioinformatics 2008, 9: 218. 10.1186/1471-2105-9-218
Stevens FJ, Argon Y: Pathogenic light chains and the B-cell repertoire. Immunol Today 1999, 20(10):451–7. 10.1016/S0167-5699(99)01502-9
Perfetti V, Ubbiali P, Vignarelli M, Diegoli M, Fasani R, Stoppini M, Lisa A, Mangione P, Obici L, Arbustini E: Evidence that amyloidogenic light chains undergo antigen-driven selection. Blood 1998, 91(8):2948.
Stefani M: Protein misfolding and aggregation: new examples in medicine and biology of the dark side of the protein world. BBA-Molecular Basis of Disease 2004, 1739: 5–25. 10.1016/j.bbadis.2004.08.004
Poshusta TL, Sikkink LA, Leung N, Clark RJ, Dispenzieri A, Ramirez-Alvarado M, Hofmann A: Mutations in Specific Structural Regions of Immunoglobulin Light Chains Are Associated with Free Light Chain Levels in Patients with AL Amyloidosis. PLoS ONE 2009, 4(4):e5169. 10.1371/journal.pone.0005169
Trovato A, Seno F, Tosatto S: The PASTA server for protein aggregation prediction. Protein Engineering Design and Selection 2007, 20: 521–523. 10.1093/protein/gzm042
Trovato A, Chiti F, Maritan A, Seno F: Insight into the structure of amyloid fibrils from the analysis of globular proteins. PLoS Comput Biol 2006, 2: 1608–1618. 10.1371/journal.pcbi.0020170
Zhang Z, Chen H, Lai L: Identification of amyloid fibril-forming segments based on structure and residue-based statistical potential. Bioinformatics 2007, 23(17):2218–2225. 10.1093/bioinformatics/btm325
Tartaglia GG, Pawar AP, Campioni S, Dobson CM, Chiti F, Vendruscolo M: Prediction of aggregation-prone regions in structured proteins. J Mol Biol 2008, 380(2):425–36. 10.1016/j.jmb.2008.05.013
Tian J, Wu N, Guo J, Fan Y: Prediction of amyloid fibril-forming segments based on a support vector machine. BMC bioinformatics 2009, 10(Suppl 1):S45. 10.1186/1471-2105-10-S1-S45
Mitchell T: Machine Learning. McGraw Hill; 1997.
Vega V, Bressan S: Continuous Naive Bayesian classifications. In Lecture Notes in Computer Science. Volume 2911. Edited by: et al TS. Heidelberg: Springer; 2003:279–289.
Rocca A, Khamlichi A, Aucouturier P, Noel L, Denoroy L, Preud'homme J, Cogne M: Primary structure of a variable region of the V kappa I subgroup (ISE) in light chain deposition disease. Clinical and Experimental Immunology 1993, 91: 506–509.
Moret B: Decision trees and diagrams. Computing Surveys 1982, 4: 595–623.
Quinlan J: Decision trees and decision-making. IEEE transactions on systems, man and cybernetics 1990, 20: 339–346. 10.1109/21.52545
Norton S: Generating better decision trees. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI, USA Edited by: Sridharan N. 1989, 800805: 800–805.
Kingsford C, Salzberg SL: What are decision trees? Nat Biotechnol 2008, 26(9):1011. 10.1038/nbt0908-1011
Olanow C, Watts R, Koller W: An algorithm (decision tree) for the management of Parkinson's disease (2001): treatment guidelines. Neurology 2001, 56: 1–88.
Adam B, Qu Y, Davis J, Ward M, Clements M, Cazares L, Semmes O, Schellhammer P, Yasui Y, Feng Z, Wright G: Serum Protein Fingerprinting Coupled with a Pattern-matching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplase and Healthy Men. Cancer Research 2002, 62: 3609–3614.
Kang X, Xu Y, Wu X, Liang Y, Wang C, Guo J: Proteomic Fingerprints for Potential Application to Early Diagnosis of Severe Acute Respiratory Syndrome. Clinical Chemistry 2005, 51: 56–64. 10.1373/clinchem.2004.032458
Dunkley E, Isbister G, Sibbritt D: The Hunter Serotonin Toxicity Criteria: simple and accurate diagnostic decision rules for serotonin toxicity. Q J Med 2003, 96: 635–642.
Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A, Cort JR, Booth V, Mackereth CD, Saridakis V, Ekiel I, Kozlov G, Maxwell KL, Wu N, Mcintosh LP, Gehring K, Kennedy MA, Davidson AR, Pai EF, Gerstein M, Edwards AM, Arrowsmith CH: Structural proteomics of an archaeon. Nature Structural & Molecular Biology 2000, 7(10):903. 10.1038/82823
Geurts P, Fillet M, Seny DD, Meuwis M: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005, 21: 318–3145. 10.1093/bioinformatics/bti494
Wang Y, Tetko I, Hall M, Frank E: Gene selection from microarray data for cancer classification--a machine learning approach. Computational Biology and Chemistry 2005, 29: 37–46. 10.1016/j.compbiolchem.2004.11.001
Bennett K: Decision tree construction via linear programming. In Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, Utica, Illinois Edited by: Evans M. 1992, 97–101.
Hurle M, Helms L, Li L, Chan W, Wetzel R: A role for destabilizing amino acid replacements in light-chain amyloidosis. Proceedings of the National Academy of Sciences 1994, 91(12):5446–5450. 10.1073/pnas.91.12.5446
Abraham RS, Geyer SM, Ramírez-Alvarado M, Price-Troska TL, Gertz MA, Fonseca R: Analysis of somatic hypermutation and antigenic selection in the clonal B cell in immunoglobulin light chain amyloidosis (AL). J Clin Immunol 2004, 24(4):340–53. 10.1023/B:JOCI.0000029113.68758.9f
Depristo MA, Weinreich DM, Hartl DL: Missense meanderings in sequence space: a biophysical view of protein evolution. Nature Reviews Genetics 2005, 6(9):678–687. 10.1038/nrg1672
Vezhnevets A, Barinova O: Avoiding boosting overfitting by removing confusing samples. In European Conference on Machine Learning (ECML07), LNAI Edited by: et al K. 2007, 430–441.
Babyak M: What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine 2004, 66: 411–421. 10.1097/01.psy.0000127692.23278.a9
Zanetti M, Capra J: The antibodies. Volume 1. CRC Press; 1996.
Minor DL, Kim PS: Measurement of the beta-sheet-forming propensities of amino acids. Nature 1994, 367(6464):660–3. 10.1038/367660a0
MPCD, GPC and EAP jointly conceptualized the project. EAP obtained and manually annotated the amyloidogenic sequences and their germline assignments. MPCD implemented the programs for Naive Bayesian analysis and decision tree-based classification and performed the analysis of the results. All authors have read and approved the manuscript in this form.