From: Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports
Model | Architecture | Parameters (trainable/total) | Pre-trained contextual token embeddings? | Language model pre-trained in domain? | Language model frozen for classification training? | Final classification layer/strategy tested |
---|---|---|---|---|---|---|
TF-IDF | Bag-of-words logistic regression with elastic net regularization | 40 K/40 K | No | N/A | N/A | N/A |
CNN | One-dimensional convolutional neural network with global max-pooling | 7 M/7 M | No | N/A | N/A | N/A |
BERT-base | BERT | 766 K/110 M | Yes | No | Yes | CNN head |
BERT-med | BERT | 521 K/42 M | Yes | No | Yes | CNN head |
BERT-mini | BERT | 275 K/11 M | Yes | No | Yes | CNN head |
BERT-tiny | BERT | 152 K/4.5 M | Yes | No | Yes | CNN head |
Longformer [13] | RoBERTa with sliding-window local attention plus global attention | 766 K/128 M | Yes | No | Yes | CNN head |
ClinicalBERT [10] | BERT | 766 K/110 M | Yes | Partial (pre-trained on MIMIC-III ICU notes) [39] | Yes | CNN head |
DFCI-ImagingBERT, frozen | BERT | 766 K/110 M | Yes | Yes (pre-trained on DFCI imaging reports) | Yes | CNN head |
DFCI-ImagingBERT, unfrozen | BERT | 110 M/110 M | Yes | Yes (pre-trained on DFCI imaging reports) | No | Linear head |
Flan-T5 XXL | Text-to-Text Transfer Transformer (T5) | 0/11 B | Yes | No | N/A (zero-shot only) | 1 − predicted probability of the token “no” |
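The dominant strategy in the table is to freeze the pre-trained language model and train only a small one-dimensional convolutional head over its contextual token embeddings. Below is a minimal PyTorch sketch of that setup, not the authors' code: the checkpoint name, filter count, kernel size, and two-class output are illustrative assumptions, and the Hugging Face `transformers` library is assumed.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class FrozenBertCnnClassifier(nn.Module):
    """Frozen BERT encoder + trainable 1-D CNN classification head."""

    def __init__(self, encoder_name="bert-base-uncased",
                 n_filters=100, kernel_size=3, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # language model stays frozen
        hidden = self.encoder.config.hidden_size
        self.conv = nn.Conv1d(hidden, n_filters, kernel_size)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # contextual embeddings are fixed features
            out = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask)
        x = out.last_hidden_state.transpose(1, 2)  # (batch, hidden, seq)
        x = torch.relu(self.conv(x))
        x = x.max(dim=2).values  # global max-pooling over the sequence
        return self.fc(x)        # only conv + fc receive gradients

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = FrozenBertCnnClassifier()
batch = tokenizer(["No evidence of disease progression."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

Because only the convolutional and linear layers are updated, the trainable parameter counts in the table sit orders of magnitude below the totals (e.g., 766 K versus 110 M for BERT-base).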
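The last row scores reports with no task-specific training at all: the report is posed as a yes/no question, and the outcome score is one minus the model's probability for generating the token “no”. The sketch below illustrates that scoring rule under an assumed prompt; the prompt wording is hypothetical, and a small Flan-T5 checkpoint stands in for the 11 B XXL model used in the table to keep the example lightweight.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The table's row uses the 11 B google/flan-t5-xxl checkpoint; the small
# checkpoint here is an assumption made so the sketch runs on modest hardware.
name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Hypothetical prompt; the paper's exact wording is not shown in the table.
prompt = ("Does this imaging report show cancer progression? "
          "Answer yes or no. Report: No evidence of disease progression.")
inputs = tokenizer(prompt, return_tensors="pt")

# Score only the first generated token of the answer.
start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=start).logits  # (1, 1, vocab)

probs = torch.softmax(logits[0, -1], dim=-1)
no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
score = 1.0 - probs[no_id].item()  # 1 − P("no") as the positive-class score
```

This zero-shot strategy is why the row lists no trainable parameters and why the frozen/unfrozen distinction does not apply.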