Table 2 Sequence Representations used for Model Selection. CDR3 sequences were cut into snippets of varying length and represented as DNA sequence, amino acid sequence, or Atchley factors [10]. Classification accuracy results are reported as the fraction of patients for which the model’s prediction of the diagnosis is correct.

From: Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis

| Snippet Length | Sequence Representation | Classification Accuracy on the Training Data Set by Exhaustive 1-Holdout Cross-Validation |
| --- | --- | --- |
| 4 Amino Acids | Atchley Factors | 11/23 ≈ 47.8% |
| 5 Amino Acids | Atchley Factors | 15/23 ≈ 65.2% |
| 6 Amino Acids | Atchley Factors | 20/23 ≈ 87.0% |
| 7 Amino Acids | Atchley Factors | 14/23 ≈ 60.9% |
| 2 DNA Triplets | DNA Nucleotides | 12/23 ≈ 52.2% |
| 6 DNA Triplets | DNA Nucleotides | 8/23 ≈ 34.8% |
| 6 Amino Acids | Amino Acid Residue | 15/23 ≈ 65.2% |
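
As a rough illustration of the two operations the caption describes, the following Python sketch cuts a sequence into fixed-length snippets and formats an accuracy as a fraction of patients. It rests on assumptions not stated in the table: snippets are taken here as every overlapping contiguous window, the function names `snippets` and `accuracy` are hypothetical, and the example CDR3 string is made up for illustration. The mapping of residues to Atchley factors is not shown.

```python
def snippets(sequence: str, length: int) -> list[str]:
    """Return every contiguous window of `length` symbols.

    Assumption: windows overlap and advance one symbol at a time.
    A sequence shorter than `length` yields an empty list.
    """
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def accuracy(correct: int, total: int) -> str:
    """Format correct held-out predictions as 'correct/total ≈ percent'."""
    return f"{correct}/{total} ≈ {100 * correct / total:.1f}%"


# Hypothetical CDR3 amino acid sequence, for illustration only.
cdr3 = "CASSLGQGAYEQYF"

six_mers = snippets(cdr3, 6)
print(six_mers[:3])      # first three 6-residue snippets
print(accuracy(20, 23))  # the 6-residue Atchley factor row of the table
```

In exhaustive 1-holdout cross-validation on 23 patients, each patient is held out once, so the denominator of every accuracy in the table is 23.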