Fig. 3 | BMC Bioinformatics

Fig. 3

From: CIPHER: a flexible and extensive workflow platform for integrative next-generation sequencing data analysis and genomic regulatory element prediction

Outline of the random forest machine learning process for enhancer prediction by CIPHER. (a) Enhancer elements can be identified de novo in a preferred cell line by supplying select histone modification and chromatin accessibility data to CIPHER, which applies its model to that cell line and outputs a list of predicted enhancer elements. To build the model, genomic features (histone modification and chromatin accessibility data) are calculated for defined enhancers obtained from the ENCODE project. Non-enhancer elements are promoter regions ±1 kb from the TSS of all known genes. A subset of all enhancer and non-enhancer elements is split into two groups: (1) a testing and (2) a training dataset. The training dataset is used to generate the machine-learning model, in which decision trees are generated until the model can effectively separate enhancers from non-enhancers. The testing dataset is used to validate the model, and a confusion matrix is used to calculate the accuracy of the model. (b) Enhancer identification workflow. DNase chromatin accessibility (DHS) and chromatin signatures (H3K4me1, H3K4me3, and H3K27ac) are used as input data. CIPHER splits the reference genome into 200-bp windows and then applies its random forest-based machine learning model to classify each window as an ‘enhancer’ or ‘non-enhancer’. Enhancer windows are then merged so that windows within 1 bp of each other form a single continuous enhancer element. (c) Genome browser tracks of DHS and enhancer signature markers (H3K27ac and H3K4me1) alongside the positions of the predicted enhancer elements (blue blocks) output by CIPHER’s machine learning model.
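
The windowing, merging, and accuracy steps described in the caption can be illustrated with a minimal Python sketch. This is not CIPHER's code: the function names (`tile_windows`, `merge_windows`, `accuracy`), the 0-based half-open coordinates, the reading of "within 1 bp" as a gap of at most 1 bp, and the example classifier labels are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Hypothetical interval type: (chromosome, start, end), 0-based half-open.
Interval = Tuple[str, int, int]


def tile_windows(chrom: str, length: int, size: int = 200) -> Iterator[Interval]:
    """Yield consecutive fixed-size windows covering one chromosome."""
    for start in range(0, length, size):
        # The last window is truncated at the chromosome end.
        yield (chrom, start, min(start + size, length))


def merge_windows(windows: List[Interval], max_gap: int = 1) -> List[Interval]:
    """Merge same-chromosome windows separated by at most max_gap bp."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev = merged[-1]
            # Extend the previous element instead of starting a new one.
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Accuracy from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


if __name__ == "__main__":
    windows = list(tile_windows("chr1", 1000))
    # Made-up classifier output, one label per window (True = enhancer).
    labels = [True, True, False, True, False]
    enhancer_windows = [w for w, is_enh in zip(windows, labels) if is_enh]
    print(merge_windows(enhancer_windows))
    print(accuracy(tp=45, fp=5, tn=40, fn=10))
```

In this sketch, adjacent enhancer windows collapse into one element, while a window separated from its neighbour by a full non-enhancer window stays distinct. The classifier itself is left out; any model that yields one label per window could supply `labels`.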
