Table 2. Sequence Representations used for Model Selection. CDR3 sequences were cut into snippets of varying length and represented as DNA sequence, amino acid sequence, or Atchley factors [10]. Classification accuracy is reported as the fraction of patients for whom the model’s prediction of the diagnosis is correct.

From: Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis

| Snippet Length | Sequence Representation | Classification Accuracy on the Training Data Set by Exhaustive 1-Holdout Cross-Validation |
|---|---|---|
| 4 Amino Acids | Atchley Factors | 11/23 ≈ 47.8% |
| 5 Amino Acids | Atchley Factors | 15/23 ≈ 65.2% |
| 6 Amino Acids | Atchley Factors | 20/23 ≈ 87.0% |
| 7 Amino Acids | Atchley Factors | 14/23 ≈ 60.9% |
| 2 DNA Triplets | DNA Nucleotides | 12/23 ≈ 52.2% |
| 6 DNA Triplets | DNA Nucleotides | 8/23 ≈ 34.8% |
| 6 Amino Acids | Amino Acid Residue | 15/23 ≈ 65.2% |
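
The quantities in this table can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes snippets are overlapping contiguous windows of a fixed length and that 1-holdout cross-validation holds out one patient at a time. The names `cut_snippets`, `leave_one_out_accuracy`, `format_accuracy`, and `fit_predict`, and the example sequence `CASSLG`, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def cut_snippets(sequence: str, length: int) -> List[str]:
    """Return every contiguous snippet of `length` characters.

    Assumption: snippets are overlapping sliding windows. A sequence
    shorter than `length` yields no snippets.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def leave_one_out_accuracy(
    samples: Sequence[object],
    labels: Sequence[object],
    fit_predict: Callable[[list, list, object], object],
) -> Tuple[int, int]:
    """Hold out each sample once, train on the rest, and count correct predictions.

    `fit_predict(train_samples, train_labels, held_out_sample)` is a
    hypothetical callable standing in for whatever classifier is used.
    Returns (number correct, number of samples).
    """
    correct = 0
    for held_out in range(len(samples)):
        train_x = [s for i, s in enumerate(samples) if i != held_out]
        train_y = [y for i, y in enumerate(labels) if i != held_out]
        if fit_predict(train_x, train_y, samples[held_out]) == labels[held_out]:
            correct += 1
    return correct, len(samples)


def format_accuracy(correct: int, total: int) -> str:
    """Format an accuracy the way the table does, e.g. '20/23 ≈ 87.0%'."""
    return f"{correct}/{total} ≈ {100 * correct / total:.1f}%"


if __name__ == "__main__":
    # Hypothetical CDR3 fragment, cut into snippets of 4 amino acids.
    print(cut_snippets("CASSLG", 4))
    # Reproduce the formatting of the best row in the table.
    print(format_accuracy(20, 23))
```

With 23 patients, each held-out prediction changes the accuracy by 1/23 (about 4.3 percentage points), which is why every entry in the table is a multiple of that step.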