Table 1 Performance of Transformer-based architectures compared to baseline models

From: Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

 

| Progression | Accuracy [95% CI] | Precision [95% CI] | AUROC [95% CI] | F1 [95% CI] | AUPRC [95% CI] | Recall [95% CI] | MCC [95% CI] |
|---|---|---|---|---|---|---|---|
| BERT-Base | 0.88 [0.87, 0.90] | 0.71 [0.66, 0.76] | 0.92 [0.91, 0.94] | 0.72 [0.68, 0.76] | 0.76 [0.72, 0.81] | 0.74 [0.69, 0.79] | 0.65 [0.60, 0.70] |
| BERT-Med | 0.88 [0.86, 0.89] | 0.69 [0.64, 0.73] | 0.92 [0.91, 0.94] | 0.73 [0.69, 0.76] | 0.75 [0.69, 0.80] | 0.77 [0.72, 0.81] | 0.65 [0.60, 0.69] |
| BERT-Mini | 0.85 [0.83, 0.87] | 0.61 [0.56, 0.66] | 0.89 [0.88, 0.91] | 0.68 [0.64, 0.72] | 0.68 [0.62, 0.73] | 0.77 [0.72, 0.81] | 0.59 [0.54, 0.63] |
| BERT-Tiny | 0.80 [0.78, 0.82] | 0.51 [0.46, 0.56] | 0.84 [0.82, 0.86] | 0.56 [0.52, 0.61] | 0.56 [0.50, 0.63] | 0.63 [0.58, 0.68] | 0.43 [0.38, 0.49] |
| Longformer | 0.86 [0.84, 0.87] | 0.70 [0.63, 0.75] | 0.89 [0.87, 0.91] | 0.62 [0.57, 0.67] | 0.70 [0.65, 0.75] | 0.55 [0.50, 0.61] | 0.54 [0.48, 0.59] |
| Clinical BERT | 0.88 [0.86, 0.89] | 0.69 [0.64, 0.74] | 0.93 [0.91, 0.94] | 0.72 [0.68, 0.75] | 0.77 [0.72, 0.82] | 0.75 [0.70, 0.80] | 0.64 [0.59, 0.69] |
| DFCI-ImagingBERT (BERT frozen, CNN head) | 0.90 [0.89, 0.92] | 0.75 [0.70, 0.79] | 0.95 [0.94, 0.96] | 0.78 [0.74, 0.81] | 0.84 [0.80, 0.87] | 0.81 [0.77, 0.85] | 0.72 [0.68, 0.76] |
| DFCI-ImagingBERT (BERT unfrozen, linear head) | 0.90 [0.89, 0.92] | 0.74 [0.69, 0.79] | 0.95 [0.94, 0.96] | 0.78 [0.74, 0.81] | 0.85 [0.81, 0.89] | 0.81 [0.77, 0.85] | 0.71 [0.67, 0.76] |
| CNN | 0.89 [0.87, 0.90] | 0.72 [0.66, 0.76] | 0.93 [0.92, 0.95] | 0.74 [0.70, 0.78] | 0.81 [0.77, 0.85] | 0.77 [0.72, 0.82] | 0.67 [0.62, 0.72] |
| TF-IDF | 0.88 [0.86, 0.89] | 0.72 [0.67, 0.77] | 0.92 [0.90, 0.93] | 0.69 [0.64, 0.73] | 0.75 [0.71, 0.80] | 0.66 [0.61, 0.71] | 0.61 [0.56, 0.66] |
| Flan-T5-XXL (zero-shot) | 0.89 [0.87, 0.90] | 0.77 [0.72, 0.82] | 0.92 [0.91, 0.94] | 0.71 [0.66, 0.75] | 0.77 [0.72, 0.81] | 0.65 [0.60, 0.71] | 0.64 [0.59, 0.69] |

 

| Response | Accuracy [95% CI] | Precision [95% CI] | AUROC [95% CI] | F1 [95% CI] | AUPRC [95% CI] | Recall [95% CI] | MCC [95% CI] |
|---|---|---|---|---|---|---|---|
| BERT-Base | 0.93 [0.92, 0.95] | 0.80 [0.74, 0.85] | 0.93 [0.90, 0.95] | 0.73 [0.68, 0.78] | 0.78 [0.73, 0.83] | 0.67 [0.61, 0.74] | 0.70 [0.64, 0.75] |
| BERT-Med | 0.93 [0.92, 0.94] | 0.75 [0.69, 0.81] | 0.92 [0.90, 0.95] | 0.71 [0.66, 0.76] | 0.78 [0.72, 0.83] | 0.68 [0.62, 0.74] | 0.67 [0.62, 0.73] |
| BERT-Mini | 0.92 [0.91, 0.94] | 0.72 [0.65, 0.78] | 0.90 [0.88, 0.93] | 0.71 [0.66, 0.76] | 0.74 [0.67, 0.79] | 0.71 [0.65, 0.77] | 0.67 [0.61, 0.72] |
| BERT-Tiny | 0.89 [0.88, 0.91] | 0.59 [0.53, 0.66] | 0.86 [0.83, 0.89] | 0.61 [0.55, 0.67] | 0.63 [0.57, 0.70] | 0.63 [0.57, 0.70] | 0.55 [0.49, 0.61] |
| Longformer | 0.92 [0.90, 0.93] | 0.80 [0.72, 0.87] | 0.89 [0.86, 0.91] | 0.61 [0.54, 0.67] | 0.71 [0.64, 0.77] | 0.49 [0.42, 0.56] | 0.59 [0.52, 0.65] |
| Clinical BERT | 0.93 [0.92, 0.94] | 0.77 [0.70, 0.83] | 0.93 [0.90, 0.95] | 0.72 [0.66, 0.77] | 0.77 [0.70, 0.83] | 0.67 [0.61, 0.74] | 0.68 [0.62, 0.73] |
| DFCI-ImagingBERT (BERT frozen, CNN head) | 0.94 [0.93, 0.95] | 0.83 [0.77, 0.89] | 0.94 [0.93, 0.96] | 0.76 [0.71, 0.80] | 0.81 [0.76, 0.86] | 0.69 [0.63, 0.76] | 0.73 [0.67, 0.78] |
| DFCI-ImagingBERT (BERT unfrozen, linear head) | 0.94 [0.93, 0.95] | 0.84 [0.77, 0.89] | 0.93 [0.91, 0.95] | 0.73 [0.68, 0.78] | 0.80 [0.75, 0.85] | 0.65 [0.59, 0.72] | 0.71 [0.65, 0.76] |
| CNN | 0.93 [0.92, 0.94] | 0.92 [0.86, 0.97] | 0.94 [0.92, 0.96] | 0.67 [0.60, 0.72] | 0.82 [0.77, 0.87] | 0.52 [0.45, 0.59] | 0.66 [0.60, 0.72] |
| TF-IDF | 0.93 [0.91, 0.94] | 0.81 [0.74, 0.87] | 0.93 [0.91, 0.95] | 0.68 [0.63, 0.73] | 0.75 [0.69, 0.81] | 0.59 [0.53, 0.66] | 0.65 [0.59, 0.71] |
| Flan-T5-XXL (zero-shot) | 0.92 [0.90, 0.93] | 0.69 [0.63, 0.76] | 0.90 [0.87, 0.93] | 0.69 [0.64, 0.74] | 0.69 [0.61, 0.75] | 0.68 [0.61, 0.75] | 0.64 [0.58, 0.70] |

1. Performance of Transformer-based architectures compared to baseline models for the document classification tasks of identifying cancer progression/worsening and response/improvement. Additional model characteristics are provided in Table 2. Precision, Recall, and F1 measures are calculated using the model output score threshold that maximizes the F1 score in the training set. The best AUROC for each outcome is in bold face, as are the AUROCs for any model that are not statistically significantly different from the best AUROC for that outcome
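
As the footnote states, the thresholded metrics (Precision, Recall, F1) are computed at the output-score cutoff that maximizes F1 on the training set, then applied to held-out data. Below is a minimal sketch of that procedure using scikit-learn; the function and variable names are illustrative assumptions, not the authors' code.

```python
# Sketch: pick the score threshold maximizing F1 on the training set,
# then compute the table's thresholded metrics on held-out data.
import numpy as np
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score, f1_score, matthews_corrcoef)

def f1_maximizing_threshold(y_train, train_scores):
    """Return the output-score cutoff that maximizes F1 on the training set."""
    precision, recall, thresholds = precision_recall_curve(y_train, train_scores)
    # precision/recall have one more entry than thresholds; drop the final point
    with np.errstate(divide="ignore", invalid="ignore"):
        f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
    f1 = np.nan_to_num(f1)  # treat 0/0 (no positive predictions) as F1 = 0
    return thresholds[np.argmax(f1)]

def thresholded_metrics(y_test, test_scores, threshold):
    """Apply the training-set threshold and compute the thresholded metrics."""
    y_pred = (np.asarray(test_scores) >= threshold).astype(int)
    return {
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred),
    }
```

Ranking-based metrics (AUROC, AUPRC) are threshold-free and would be computed directly from the raw scores; only the Precision, Recall, F1, and MCC columns depend on the cutoff chosen above.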