Table 1 Performance of Transformer-based architectures compared to baseline models

From: Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports

 

| Progression | Accuracy [95% CI] | Precision [95% CI] | AUROC [95% CI] | F1 [95% CI] | AUPRC [95% CI] | Recall [95% CI] | MCC [95% CI] |
|---|---|---|---|---|---|---|---|
| BERT-Base | 0.88 [0.87, 0.90] | 0.71 [0.66, 0.76] | 0.92 [0.91, 0.94] | 0.72 [0.68, 0.76] | 0.76 [0.72, 0.81] | 0.74 [0.69, 0.79] | 0.65 [0.60, 0.70] |
| BERT-Med | 0.88 [0.86, 0.89] | 0.69 [0.64, 0.73] | 0.92 [0.91, 0.94] | 0.73 [0.69, 0.76] | 0.75 [0.69, 0.80] | 0.77 [0.72, 0.81] | 0.65 [0.60, 0.69] |
| BERT-Mini | 0.85 [0.83, 0.87] | 0.61 [0.56, 0.66] | 0.89 [0.88, 0.91] | 0.68 [0.64, 0.72] | 0.68 [0.62, 0.73] | 0.77 [0.72, 0.81] | 0.59 [0.54, 0.63] |
| BERT-Tiny | 0.80 [0.78, 0.82] | 0.51 [0.46, 0.56] | 0.84 [0.82, 0.86] | 0.56 [0.52, 0.61] | 0.56 [0.50, 0.63] | 0.63 [0.58, 0.68] | 0.43 [0.38, 0.49] |
| Longformer | 0.86 [0.84, 0.87] | 0.70 [0.63, 0.75] | 0.89 [0.87, 0.91] | 0.62 [0.57, 0.67] | 0.70 [0.65, 0.75] | 0.55 [0.50, 0.61] | 0.54 [0.48, 0.59] |
| Clinical BERT | 0.88 [0.86, 0.89] | 0.69 [0.64, 0.74] | 0.93 [0.91, 0.94] | 0.72 [0.68, 0.75] | 0.77 [0.72, 0.82] | 0.75 [0.70, 0.80] | 0.64 [0.59, 0.69] |
| DFCI-ImagingBERT (BERT frozen, CNN head) | 0.90 [0.89, 0.92] | 0.75 [0.70, 0.79] | 0.95 [0.94, 0.96] | 0.78 [0.74, 0.81] | 0.84 [0.80, 0.87] | 0.81 [0.77, 0.85] | 0.72 [0.68, 0.76] |
| DFCI-ImagingBERT (BERT unfrozen, linear head) | 0.90 [0.89, 0.92] | 0.74 [0.69, 0.79] | 0.95 [0.94, 0.96] | 0.78 [0.74, 0.81] | 0.85 [0.81, 0.89] | 0.81 [0.77, 0.85] | 0.71 [0.67, 0.76] |
| CNN | 0.89 [0.87, 0.90] | 0.72 [0.66, 0.76] | 0.93 [0.92, 0.95] | 0.74 [0.70, 0.78] | 0.81 [0.77, 0.85] | 0.77 [0.72, 0.82] | 0.67 [0.62, 0.72] |
| TF-IDF | 0.88 [0.86, 0.89] | 0.72 [0.67, 0.77] | 0.92 [0.90, 0.93] | 0.69 [0.64, 0.73] | 0.75 [0.71, 0.80] | 0.66 [0.61, 0.71] | 0.61 [0.56, 0.66] |
| Flan-T5-XXL (zero-shot) | 0.89 [0.87, 0.90] | 0.77 [0.72, 0.82] | 0.92 [0.91, 0.94] | 0.71 [0.66, 0.75] | 0.77 [0.72, 0.81] | 0.65 [0.60, 0.71] | 0.64 [0.59, 0.69] |

 

| Response | Accuracy [95% CI] | Precision [95% CI] | AUROC [95% CI] | F1 [95% CI] | AUPRC [95% CI] | Recall [95% CI] | MCC [95% CI] |
|---|---|---|---|---|---|---|---|
| BERT-Base | 0.93 [0.92, 0.95] | 0.80 [0.74, 0.85] | 0.93 [0.90, 0.95] | 0.73 [0.68, 0.78] | 0.78 [0.73, 0.83] | 0.67 [0.61, 0.74] | 0.70 [0.64, 0.75] |
| BERT-Med | 0.93 [0.92, 0.94] | 0.75 [0.69, 0.81] | 0.92 [0.90, 0.95] | 0.71 [0.66, 0.76] | 0.78 [0.72, 0.83] | 0.68 [0.62, 0.74] | 0.67 [0.62, 0.73] |
| BERT-Mini | 0.92 [0.91, 0.94] | 0.72 [0.65, 0.78] | 0.90 [0.88, 0.93] | 0.71 [0.66, 0.76] | 0.74 [0.67, 0.79] | 0.71 [0.65, 0.77] | 0.67 [0.61, 0.72] |
| BERT-Tiny | 0.89 [0.88, 0.91] | 0.59 [0.53, 0.66] | 0.86 [0.83, 0.89] | 0.61 [0.55, 0.67] | 0.63 [0.57, 0.70] | 0.63 [0.57, 0.70] | 0.55 [0.49, 0.61] |
| Longformer | 0.92 [0.90, 0.93] | 0.80 [0.72, 0.87] | 0.89 [0.86, 0.91] | 0.61 [0.54, 0.67] | 0.71 [0.64, 0.77] | 0.49 [0.42, 0.56] | 0.59 [0.52, 0.65] |
| Clinical BERT | 0.93 [0.92, 0.94] | 0.77 [0.70, 0.83] | 0.93 [0.90, 0.95] | 0.72 [0.66, 0.77] | 0.77 [0.70, 0.83] | 0.67 [0.61, 0.74] | 0.68 [0.62, 0.73] |
| DFCI-ImagingBERT (BERT frozen, CNN head) | 0.94 [0.93, 0.95] | 0.83 [0.77, 0.89] | 0.94 [0.93, 0.96] | 0.76 [0.71, 0.80] | 0.81 [0.76, 0.86] | 0.69 [0.63, 0.76] | 0.73 [0.67, 0.78] |
| DFCI-ImagingBERT (BERT unfrozen, linear head) | 0.94 [0.93, 0.95] | 0.84 [0.77, 0.89] | 0.93 [0.91, 0.95] | 0.73 [0.68, 0.78] | 0.80 [0.75, 0.85] | 0.65 [0.59, 0.72] | 0.71 [0.65, 0.76] |
| CNN | 0.93 [0.92, 0.94] | 0.92 [0.86, 0.97] | 0.94 [0.92, 0.96] | 0.67 [0.60, 0.72] | 0.82 [0.77, 0.87] | 0.52 [0.45, 0.59] | 0.66 [0.60, 0.72] |
| TF-IDF | 0.93 [0.91, 0.94] | 0.81 [0.74, 0.87] | 0.93 [0.91, 0.95] | 0.68 [0.63, 0.73] | 0.75 [0.69, 0.81] | 0.59 [0.53, 0.66] | 0.65 [0.59, 0.71] |
| Flan-T5-XXL (zero-shot) | 0.92 [0.90, 0.93] | 0.69 [0.63, 0.76] | 0.90 [0.87, 0.93] | 0.69 [0.64, 0.74] | 0.69 [0.61, 0.75] | 0.68 [0.61, 0.75] | 0.64 [0.58, 0.70] |

1. Performance of Transformer-based architectures compared to baseline models for the document classification tasks of identifying cancer progression/worsening and response/improvement. Additional model characteristics are provided in Table 2. Precision, Recall, and F1 measures are calculated using the model output score threshold that maximizes the F1 score in the training set. The best AUROC for each outcome is in bold face, as are the AUROCs for any model that are not statistically significantly different from the best AUROC for that outcome
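
As the footnote states, the thresholded metrics (Precision, Recall, F1) are computed at the output-score cutoff that maximizes F1 on the training set, then applied to held-out data. Below is a minimal sketch of that procedure using scikit-learn; the function and variable names are illustrative assumptions, not the authors' code.

```python
# Sketch: pick the score threshold maximizing F1 on the training set,
# then compute the table's thresholded metrics on held-out data.
import numpy as np
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score, f1_score, matthews_corrcoef)

def f1_maximizing_threshold(y_train, train_scores):
    """Return the output-score cutoff that maximizes F1 on the training set."""
    precision, recall, thresholds = precision_recall_curve(y_train, train_scores)
    # precision/recall have one more entry than thresholds; drop the final point
    with np.errstate(divide="ignore", invalid="ignore"):
        f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
    f1 = np.nan_to_num(f1)  # treat 0/0 (no positive predictions) as F1 = 0
    return thresholds[np.argmax(f1)]

def thresholded_metrics(y_test, test_scores, threshold):
    """Apply the training-set threshold and compute the thresholded metrics."""
    y_pred = (np.asarray(test_scores) >= threshold).astype(int)
    return {
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "MCC": matthews_corrcoef(y_test, y_pred),
    }
```

Ranking-based metrics (AUROC, AUPRC) are threshold-free and would be computed directly from the raw scores; only the Precision, Recall, F1, and MCC columns depend on the cutoff chosen above.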