MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

Background Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which can be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. Results We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics—Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest–derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods—including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests—, with RF-MOCCA yielding the best results. Conclusion MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on Github at https://github.com/bjornbredesen/MOCCA. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04143-2.


Genome assemply
We used the D. melanogaster genome assembly release 5 [1] for all analyses in the present work.

Polycomb/Trithorax Response Elements (PREs)
For comparability with our previous work [2], we acquired PcG-targets from Schwartz et al. [3], from the article's Supplementary Table S6, and converted coordinates from D. melanogaster genome assembly 4 to 5.

Boundary Elements (BEs)
As candidates for Boundary Elements (BEs), we downloaded Topologically Associating Domains (TADs) from Sexton et al. [4], from the article's Supplementary Table S1, and we placed 3kb-long regions centered at the boundaries between TADs.
For the second experiment with modelling BEs, we acquired TADs from the Ramirez et al. [5] study, from their Supplementary Data 1, and we placed 3kb-long regions centered at the boundaries between TADs.

Markov chains
For generating negatives, we used 4th-order Markov chains. We used the Markov chain implementation from Gnocis (article submitted), with pseudocounts of 1. We trained a Markov chain genome-wide and generated 3kb-long sequences, henceforth called dummy genomic sequences. We also trained Markov chains on both positives (PREs or BEs) and generated 3kb-long sequences, henceforth called dummy PREs or dummy BEs, depending on the training class.

Coding sequences
As a third class of negatives, we acquired annotated coding sequences from FlyBase [1]. We concatenated the coding sequences and split them into non-overlapping 3kb-long fragments.

ModENCODE Polycomb targets
We downloaded PcG-enriched regions from modENCODE and denoted predictions that overlap with one or more relevant modENCODE regions as "evidenced".

ModENCODE Boundary Elements
We downloaded regions for boundary element-associated factors from modENCODE and denoted predictions that overlap with one or more relevant modENCODE regions as "evidenced".

Polycomb Response Element DNA sequence motifs
The M2019 motif set is identical to the one used in [2].
For the MPWM motif set, we replaced the Pho, Zeste-and GAF motifs with Position Specific Scoring Matrices (PSSMs) from the Fly Factor Survey [8]. Boundary Element DNA sequence motifs The M2012 motif set is as was used in [9].
For M2020, we replaced the Su(Hw) motifs with a Position Specific Scoring Matrix from the Fly Factor Survey [8]. Additionally, we acquired peaks for Ibf1 and Ibf2 from [10], extracted the underlying sequences, performed motif discovery using MEME-ChIP [11] and added the top three motifs for each factor (all of which were IUPAC motifs discovered by DREME [12]) to our list.

Cross-validation
In order to aid comparability with our previous work [2], we used an identical cross-validation procedure, for both PREs and BEs. Models were trained with 110 positive and negative training sequences (110 for each negative class for SVM-MOCCA and RF-MOCCA) and applied to score 50 independent positives and 5000 independent negatives. In order to account for random variability, we repeated the procedure 20 times.

Model configurations
For the CPREdictor, we used a window size of 500bp, and for SVM-MOCCA and RF-MOCCA a window size of 3kb, as in [2]. For MOCCA models, we used a step size of 100bp. For genome-wide prediction of BEs with RF-MOCCA, in order to reduce running time, we used an increased step size of 1000bp. For SVM-MOCCA, a quadratic kernel (polynomial degree 2), was used. For RF-MOCCA, 500 trees were used. For the features of SVM-MOCCA and RF-MOCCA, we used local motif and dinucleotide occurrence frequencies. For core-CRE prediction, we used a step size of 1000 and the default mode (continuous maximum).
When applying the jPREdictor, to aid comparability, we used the M2019 motifs, a window size of 500, and step size of 100. When applying cdBEST, we used the basic-version script. In our cross-validation procedure, we applied cdBEST to each of our test sequences separately, and checked whether cdBEST had predicted any boundaries in the sequence, by parsing the output file "hits table.txt", yielding a binary prediction per sequence. Based on these predictions, we calculated 20 confusion matrices (one per cross-validation fold) per test case (BEs versus dummy genomic sequences, and BEs versus dummy BEs).