Skip to main content
Fig. 1 | BMC Bioinformatics

Fig. 1

From: Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data

Fig. 1

Schematic. An overview of the quality control, data processing, and training pipelines. Data from recount undergoes several quality checks at the sample and study level, resulting in a dataset of approximately 45,000 samples divided into training, testing, and validation datasets. Twelve different datasets are created from these data, each with a different gene set (all, comprising all genes; O, comprising key GO categories; OT, comprising O genes that are known transcription factors) and transform (“TPM”, transcripts per million; “CLR”, a centered-log-ratio transform of TPM; “Z-score”, a Z-score normalization of the CLR data relative to healthy tissue expression levels in GTEx; “Z-ternary”, a ternarization of Z-score). The training data is used to train unsupervised models capable of embedding the data (a “no embedding” model is also included, which does not alter the data). These embedded features, along with labels for individual tasks, are used to train a variety of supervised models. The supervised models are trained and evaluated using nested cross validation

Back to article page