Table 1 Criteria of datasets used in studies applying deep learning to bladder cancer histopathology images

From: Which data subset should be augmented for deep learning? a simulation study using urothelial cell carcinoma histopathology images

| Reference | Dataset source(s) | Count and pathology of patients/slides | Stain | Magnification | Count of images (tiles/patches) | Dimensions and selection of images (tiles/patches) | Data augmentation method(s) | Training (: validation) : testing ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [29] (a) | TCGA | ≈ 500 slides of UCC or adjacent normal cuts | H&E | 20× | 4711 normal and 73,425 cancer (depending on slide-level labels) | 512 × 512; non-overlapping; after background removal | None | 70:30 (of slides); stratified |
| [29] (b) | TCGA | 388 UCC slides | H&E | Not mentioned | 185,064 total | 512 × 512; non-overlapping; excluding normal tiles | None | 70:30 (of slides) |
| [54] | Not mentioned | Eight bladder biopsy slides; pathology was not mentioned | H&E | 40× | Not mentioned | For training and validation: 64 × 64 at 10×, non-overlapping, after background removal. For testing: 64 × 64 by a sliding window with 8-pixel steps | None | Not mentioned |
| [55] | The Ohio State University | 39 T1 bladder cancer (c) slides | H&E | 40× | Excluding background tiles: 13,606 training, 1360 validation, and 1359 testing | 512 × 512; non-overlapping; including background | None | 31:4:4 (of slides); non-stratified for tiles/classes |
| [56] | University Hospital of Stavanger, Norway | 32 UCC patients/slides | HES | 400× (100× and 25× by down-sampling) | 139,861 (after augmentation) at each magnification level | 128 × 128. 400× tiles: non-overlapping for all classes (including background) except muscle and stroma, where 50% overlap was present. 100× and 25× tiles: centered at corresponding 400× tiles | For muscle and stroma training tiles only: rotation and flipping | Five-fold cross-validation (of patients) using only training and testing sets (no validation set) |
| [57] (d) | Three centers in the Netherlands | 328 non-muscle invasive UCC specimens from 232 patients | H&E | 20× | ≈ 500,000 total | 572 × 572; 25% overlap; excluding patches with ≥ 75% background pixels | Random color variation, flipping, and mirroring of the training patches | 60:20:20 (of patients) |
| [57] (e) | Three centers in the Netherlands | 328 non-muscle invasive UCC specimens from 232 patients | H&E | 20× | 123,132 undefined, 564,710 low grade, and 493,374 high grade | 224 × 224; 25% overlap; from regions of urothelium segmented by U-Net | Random flipping and mirroring of the training patches | 60:20:20 (of patients) |
| [14] | TCGA and University of Florida Health Shands Hospital in the United States | 913 UCC slides | H&E | 40× | Training: 148,671; validation: 8371; testing: not mentioned | 1024 × 1024; randomly; from manually partially annotated tumor and non-tumor regions; each has a binary annotation mask | Rotation, horizontal and vertical flips, and random crop; not mentioned to which data it was applied | 620:193:100 (of slides) |
| [58] | Edinburgh hospitals | 100 muscle-invasive UCC patients/slides | IF (PanCK, Hoechst) | 20× | Not mentioned | Not mentioned | None | Not mentioned |
| [59] (f) | TCGA | 100 UCC patients/slides | H&E | 20× | Excluding testing: 79,747 tumor and 92,797 non-tumor | 512 × 512; non-overlapping; including background | Random rotation, zooming, flipping, and color-based; during training | 48:12:40 (of slides) |
| [59] (g) | TCGA | 253 UCC patients/slides (124 low and 129 high tumor mutational burden) | H&E | For AP clustering: 2.5×. For feature extraction: 20× | 125,358 total tumor tiles, from which AP clustering selected 11,164 representative tiles | For AP clustering: 128 × 128, non-overlapping, from segmented tumor. For feature extraction: 1024 × 1024, selected by AP clustering | None | Leave-one-out cross-validation |
| [60] | University of Rochester Medical Center | 1177 UCC images (460 stage Ta and 717 stage T1); not mentioned if each image came from a separate slide | H&E | 100× | Not mentioned | 700 × 700; one to four images were cropped from the central part of each raw image | None | 70:30 (after sampling 460 Ta and 460 T1 images (h)) |
| [61] | TCGA and local institution of the authors | Muscle-invasive UCC. TCGA: 318 slides from 294 patients. Local institution: 38 slides from 13 patients | H&E | 10× | Training patches: 18,552, 68,880, 264,550, and 1,044,158 at effective 2.5×, 5×, 10×, and 20×, respectively. Rest of patches: not mentioned | 300 × 300 (at effective 2.5×, 5×, 10×, and 20×); non-overlapping; from manually annotated tumor regions | Random rotation, flipping, warping, brightness, and contrast; during training | TCGA: 146:73:75 (of patients). Local institution: all testing |
| [15] | TCGA and University Clinic Hospital Erlangen | Muscle-invasive bladder cancer (i). TCGA: 363 (training and validation) patients/slides. Erlangen: 16 (testing) patients/slides | H&E | TCGA: not mentioned. Erlangen: 40× | TCGA: 807,943 total, but only a random 250,833 were used. Erlangen: not mentioned | 512 × 512 (j); non-overlapping; from manually annotated tumor regions | Random flipping, mirroring, contrast / saturation / brightness changes, and cutouts; not mentioned to which data it was applied | TCGA: 90:10 (of slides), stratified |
| [62] (k) | The Stanford tissue microarray database | 2139 bladder cancer (g) slides (542 GATA3, 514 CK14, 544 S100P, and 539 S0084) | IHC | Not mentioned | Not mentioned | 224 × 224 (Inception-v1) and 229 × 229 (Inception-v3 and Inception-ResNet-v2); not mentioned how tiles were derived from slides | None | 70:15:15 (of slides) |
| [62] (l) | The Stanford tissue microarray database | 2137 bladder cancer (g) slides (680 Score 0, 235 Score 1, 284 Score 2, and 938 Score 3) | IHC | Not mentioned | Not mentioned | 224 × 224 (Inception-v1) and 229 × 229 (Inception-v3 and Inception-ResNet-v2); not mentioned how tiles were derived from slides | None | 70:15:15 (of slides) |
| [63] | TCGA | 332 UCC patients; slide count was not mentioned | H&E | 20× | Not mentioned | 512 × 512; non-overlapping; from manually annotated tumor regions | Random horizontal and vertical flipping; during training | Stratified three-fold cross-validation (of patients) |
| [64] | TCGA | 381 UCC slides | H&E | For the lymphocyte CNN: 20×. For the necrosis CNN: 6.67× | Not mentioned | Non-overlapping. For the lymphocyte CNN: 100 × 100, excluding background. For the necrosis CNN: 333 × 333 | Only for the lymphocyte CNN: random cropping (m), color perturbing, rotation, and mirroring; for training and testing separately | Not mentioned |
| [65] | TCGA | 290 UCC patients/slides | H&E | 20× | 10,000 patches per slide | 100 × 100; non-overlapping | None | Not mentioned |
| [66] | Amsterdam University Medical Center | Non-muscle invasive UCC; 359 and 281 patients for 1- and 5-year survival, respectively; slide count was not mentioned | H&E | 20× | 1-year: ≈ 5,500,000 (recurrence in 35%). 5-year: ≈ 4,400,000 (recurrence in 64%) | 224 × 224; non-overlapping; from urothelium segmented by U-Net [57] | None | 60:20:20 (of patients) |

  - (a) Dataset to distinguish cancer from normal. Approximate figures were retrieved from graphs because exact figures were given neither in the paper nor in the supplementary materials
  - (b) Dataset to classify TP53 mutation status
  - (c) No specific histology was stated
  - (d) Dataset to segment urothelium
  - (e) Dataset to grade the segmented urothelium
  - (f) Dataset for tumor segmentation
  - (g) Dataset for patient-level tumor mutational burden classification into low or high categories
  - (h) Not stated if these were raw images or tiles
  - (i) UCC from TCGA, but histology not specified for the Erlangen cohort
  - (j) A supplementary figure suggests that tile resolution is 1 μm/pixel, i.e., 10×
  - (k) Dataset for biomarker classification
  - (l) Dataset for biomarker staining score classification
  - (m) Input patches were randomly cropped from a larger image. However, it is unclear how this is consistent with subdividing the whole-slide image into non-overlapping patches
  - Abbreviations: AP, affinity propagation; CNN, convolutional neural network; H&E, hematoxylin and eosin; HES, hematoxylin eosin saffron; IF, immunofluorescence; IHC, immunohistochemistry; PanCK, pan-cytokeratin; TCGA, The Cancer Genome Atlas; UCC, urothelial cell carcinoma
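
Several tile-selection schemes recur in the table: non-overlapping tiling, tiling with a fractional overlap, sliding windows with a fixed pixel step, and exclusion of tiles that are mostly background. The Python sketch below illustrates how such schemes could be expressed. It is not taken from any of the cited studies. The function names, the grayscale background threshold of 220, and the example region size are hypothetical, and the magnification helper only encodes the rule of thumb implied by note j (1 μm/pixel corresponding to about 10×), which may vary between scanners.

```python
def tile_origins(width, height, tile, stride=None):
    """Return top-left (x, y) corners of all tiles that fit fully inside the image.

    stride == tile gives non-overlapping tiles; a smaller stride gives overlap.
    """
    if stride is None:
        stride = tile  # non-overlapping by default
    if tile <= 0 or stride <= 0:
        raise ValueError("tile and stride must be positive")
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]


def background_fraction(gray_pixels, threshold=220):
    """Fraction of pixels at or above `threshold`.

    Assumes 8-bit grayscale values where bright pixels are background;
    the default threshold is an illustrative assumption.
    """
    pixels = list(gray_pixels)
    if not pixels:
        raise ValueError("tile has no pixels")
    return sum(1 for p in pixels if p >= threshold) / len(pixels)


def approx_magnification(microns_per_pixel):
    """Rule-of-thumb magnification, assuming 1 um/pixel corresponds to 10x."""
    return 10.0 / microns_per_pixel


if __name__ == "__main__":
    # Hypothetical 2048 x 2048 region, used only for illustration.
    non_overlapping = tile_origins(2048, 2048, tile=512)
    quarter_overlap = tile_origins(2048, 2048, tile=572, stride=int(572 * 0.75))
    sliding = tile_origins(128, 128, tile=64, stride=8)
    print(len(non_overlapping), len(quarter_overlap), len(sliding))

    # A tile would be excluded when its background fraction is >= 0.75.
    mostly_white = [255] * 80 + [100] * 20
    print(background_fraction(mostly_white) >= 0.75)
    print(approx_magnification(1.0))
```

Expressing overlap through the stride keeps one function for all three schemes: a 25% overlap on 572-pixel tiles corresponds to a stride of 429 pixels, and an 8-pixel sliding window is simply a stride of 8.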