AIKYATAN: mapping distal regulatory elements using convolutional learning on GPU

Background: The deluge of genomic data can be leveraged by sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system Aikyatan, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications' combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes).

Results: We develop a suite of ML models, under the banner Aikyatan, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, that deep learning approaches have a computational advantage. Moreover, convolutional neural networks (CNNs) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order-of-magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-à-vis the competition: Aikyatan-CNN achieved a 40% higher validation rate than CSIANN and the same accuracy as RFECS.

Conclusions: Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets has image-like properties and benefits from spatial pooling of features. Our Aikyatan suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.

Electronic supplementary material: The online version of this article (10.1186/s12859-019-3049-1) contains supplementary material, which is available to authorized users.


Supplementary
Machine Learning Models

Linear Support Vector Machine

Given training vectors $x_i \in \mathbb{R}^d$, $i = 1, \cdots, n$, and a vector $y \in \mathbb{R}^n$ such that $y_i \in \{1, 0\}$, a linear classifier learns the weight vector $w$. The decision function is $\mathrm{sgn}(w^T x)$. We chose the L2-regularized, L2-loss SVM to solve our problem, the primal formulation of which is defined as follows:

$$\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(0,\, 1 - y_i w^T x_i\right)^2,$$

where $C$ is the penalty parameter for the misclassification error and is thus used to set the regularization level. The objective function of an SVM balances two competing goals: finding a separating hyperplane with the maximum margin, and finding a decision boundary that correctly classifies as many training instances as possible. Therefore, we need to strike a balance between the two through our choice of hyperparameters.
The latter is controlled by the penalty parameter $C$, which imposes a penalty for misclassifying an instance. A low $C$ gives a large margin, while a high $C$ results in a tighter fit, i.e., a smaller margin with lower tolerance for outliers, and may overfit the training dataset.
The dual problem of an SVM, which lends itself naturally to kernels, is:

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T (Q + D)\, \alpha - e^T \alpha \quad \text{subject to} \quad \alpha_i \geq 0,\ \forall i,$$

where $e$ is the vector of all ones, $D$ is a diagonal matrix with $D_{ii} = \frac{1}{2C}, \forall i$, and $Q_{ij} = y_i y_j x_i^T x_j$. We used the linear support vector machine (SVM) module of the scikit-learn Python ML package for the linear SVM model. To determine the ideal value of the penalty parameter $C$, we performed a validation experiment and calculated the validation rate on the validation set. This process was executed for each $C \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$, and the best model had $C = 10^{2}$. Then, the model was tested on a significantly larger dataset (the test set), distinct from the training and validation sets, and the validation rate was calculated again to determine the model's effectiveness.
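To make this tuning procedure concrete, below is a minimal sketch (not the authors' original code) of a scikit-learn grid search over $C$ for an L2-regularized, L2-loss linear SVM. The data, the feature dimensionality of 480, and the variable names are placeholder assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: in the paper, X holds histone-modification feature
# vectors and y holds binary DRE labels; here we use random values.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 480)), rng.integers(0, 2, 1000)
X_val, y_val = rng.random((200, 480)), rng.integers(0, 2, 200)

best_C, best_score = None, -1.0
for C in [10.0**k for k in range(-3, 4)]:  # C in {1e-3, ..., 1e3}
    # L2 penalty + squared hinge loss = L2-regularized, L2-loss SVM
    clf = LinearSVC(C=C, penalty="l2", loss="squared_hinge")
    clf.fit(X_train, y_train)
    score = clf.score(X_val, y_val)  # accuracy on the validation set
    if score > best_score:
        best_C, best_score = C, score
print(f"best C = {best_C}, validation accuracy = {best_score:.3f}")
```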

RBF Kernel Support Vector Machine Model
Kernel SVM projects the training vectors into another feature space $Z$ by switching $Q_{ij}$ from $y_i y_j x_i^T x_j$ to $y_i y_j K(x_i, x_j)$, where $K(\cdot,\cdot)$ is a kernel function of choice. In our study, we chose the Radial Basis Function (RBF) kernel, which is defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right),$$

where $\gamma = \frac{1}{2\sigma^2}$, and $\sigma$ is a free parameter. For $\gamma = 1$, the RBF kernel can be expanded as follows using the Taylor series:

$$K(x_i, x_j) = e^{-\|x_i\|^2}\, e^{-\|x_j\|^2} \sum_{k=0}^{\infty} \frac{2^k \left(x_i^T x_j\right)^k}{k!}.$$

The feature space $Z$ onto which the RBF kernel projects has infinitely many dimensions. With this mathematical property, it is always possible to find a hyperplane in the infinite-dimensional feature space $Z$ that separates two classes of data points. We also used the scikit-learn package implementation for our kernel SVM algorithm. We tune the penalty parameter of the error term, $C$, and the additional parameter $\gamma$ introduced by the RBF kernel. Strictly speaking, $\gamma$ is not an SVM parameter but a kernel parameter: it controls the standard deviation of the Gaussian function that defines each training example's sphere of influence, delimited by a hypersphere around it. It can be thought of as the inverse of the radius of influence of the samples selected by the model as support vectors. Smoother models have lower $\gamma$ values, while more complex models have higher values. We evaluate the prediction accuracy for every combination of $C \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ and $\gamma \in \{\frac{1}{480} \cdot 10^{-3}, \frac{1}{480} \cdot 10^{-2}, \ldots, \frac{1}{480} \cdot 10^{3}\}$. The best model had $C = 10^{0}$ and $\gamma = \frac{1}{480} \cdot 10^{1}$.
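A minimal sketch of this two-dimensional grid search, again with placeholder data, using scikit-learn's SVC. The scaling of $\gamma$ by $\frac{1}{480}$ mirrors the grid above (and scikit-learn's own default of $\gamma = 1/n_{\text{features}}$); the feature count of 480 is our assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 480)), rng.integers(0, 2, 1000)
X_val, y_val = rng.random((200, 480)), rng.integers(0, 2, 200)

best = (None, None, -1.0)  # (C, gamma, validation accuracy)
for C in [10.0**k for k in range(-3, 4)]:
    for gamma in [(1 / 480) * 10.0**k for k in range(-3, 4)]:
        clf = SVC(C=C, kernel="rbf", gamma=gamma)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if score > best[2]:
            best = (C, gamma, score)
print(f"best C = {best[0]}, gamma = {best[1]:.3e}, accuracy = {best[2]:.3f}")
```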

Random Forest
Since the construction of a random forest is intuitive, we only provide a reference here for the curious reader [22]. To select the best model, we tried the number of trees from the list {10, 50, 100} and the tree depth from the list {50, 100}, and selected the best combination for our problem.
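A corresponding sketch for this small grid, assuming scikit-learn's RandomForestClassifier; as before, the data is a placeholder.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 480)), rng.integers(0, 2, 1000)
X_val, y_val = rng.random((200, 480)), rng.integers(0, 2, 200)

best = (None, None, -1.0)  # (n_trees, depth, validation accuracy)
for n_trees in (10, 50, 100):
    for depth in (50, 100):
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_depth=depth, random_state=0)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if score > best[2]:
            best = (n_trees, depth, score)
print(f"best: {best[0]} trees, depth {best[1]}, accuracy {best[2]:.3f}")
```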

PR Metric and Data Preprocessing
Performance Metrics: The PR experiment poses a harder problem because, in reality, there are many more unknown functional sites than just the p300-, CBP-, NANOG-, SOX2-, and OCT4-binding regulatory sites, or the TSSs. We would have to evaluate performance on all such sites, yet not all of them are accounted for in the ground-truth data. These unknown sites are possibly mislabeled as negative, thus feeding incorrect labels to the ML models. Distinct from the PR metric, the VR metric relaxes the "prediction true" condition to account for the nature of the data set. Regulatory sequences generally span a relatively small region of the genome; most are smaller than 1000 base pairs (bp), but some can be between 1000 bp and 5000 bp in size. Therefore, locations in proximity to a regulatory site (identified by a p300, TFBS, or DHS peak) can potentially be positives. We refer to the set of locations within a small window of proximity to all known regulatory sites as the gray area. If the prediction classifies a point in the gray area as positive, that is taken to be a correct prediction and contributes positively to the VR. The PR metrics consider peaks only, and since some unknown regulatory sites are falsely labeled (owing to the lack of ground-truth data), PR accuracy becomes lower. In contrast, VR metrics only consider how much of the peaks and gray areas is validated as true, and hence the relaxed condition makes it easier to achieve a higher VR value. Thus, the VR metric has been used in most recent publications [31, 12].
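To make the relaxation concrete, here is a minimal, hypothetical sketch of how a VR-style check could score a predicted positive: any prediction falling on, or within a tolerance window (the gray area) of, a known regulatory site counts as correct. The helper function, the coordinate representation, and the 2500 bp window are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vr_hit(pred_pos, known_sites, gray_bp=2500):
    """Return True if a predicted positive (genomic coordinate, in bp)
    lies on, or within gray_bp of, any known regulatory site."""
    known_sites = np.asarray(known_sites)
    return bool(np.any(np.abs(known_sites - pred_pos) <= gray_bp))

# Toy usage: one prediction sits in the gray area of a peak at 12000 bp.
known = [12_000, 58_500]            # known peak centers (placeholder)
predictions = [12_900, 40_000]      # predicted positive locations
hits = [vr_hit(p, known) for p in predictions]
validation_rate = sum(hits) / len(hits)
print(hits, validation_rate)        # [True, False] 0.5
```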
Training and Testing Data Sets for PR- and VR-based Prediction Performance: PR and VR target different criteria for prediction performance: PR evaluates predictions on peaks alone, while VR evaluates predictions on peaks plus the gray area. Thus, we must prepare two different test data sets. Specifically, the test data set for VR should contain peaks and gray area, whereas the test set for PR should contain only peaks. Likewise, the training set for VR contains peaks and gray area, whereas the training set for PR does not. Next, the data set for PR was generated using a 500 bp sliding window, whereas the data set for VR was generated using a 2500 bp sliding window. The reason for PR's smaller sliding window is to narrow down the gray area around each peak, yielding stricter "peak regions".
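The sketch below illustrates such sliding-window extraction over a per-base-pair signal track. The 100 bp bin size, the stride, and the toy signal array are our assumptions for illustration; only the 500 bp and 2500 bp window sizes come from the text above.

```python
import numpy as np

def sliding_windows(signal, window_bp, stride_bp, bin_bp=100):
    """Cut a per-bp histone-modification signal into fixed-size windows,
    averaging the signal within bin_bp bins to form feature vectors."""
    feats = []
    for start in range(0, len(signal) - window_bp + 1, stride_bp):
        window = signal[start:start + window_bp]
        binned = window.reshape(-1, bin_bp).mean(axis=1)  # one value per bin
        feats.append(binned)
    return np.array(feats)

signal = np.random.default_rng(0).random(100_000)  # toy 100 kb track
pr_feats = sliding_windows(signal, window_bp=500, stride_bp=500)    # PR set
vr_feats = sliding_windows(signal, window_bp=2500, stride_bp=2500)  # VR set
print(pr_feats.shape, vr_feats.shape)  # (200, 5) and (40, 25)
```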
Data Preprocessing: We also remove redundant samples (63% of the samples have no histone modification signal) and balance the data set (in the PR data set, positive samples comprise only 0.5%). Figure S1a describes the pipeline for generating training and test sets from the raw histone-modification input. To reduce variance, we divide the test set into five non-overlapping test sets and also generate five overlapping training sets for each training set size. The five training sets are paired with the five test sets in a one-to-one correspondence. The final PR or VR numbers are averaged over the five train-test pairs for each experiment.
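As an illustration of the two filtering steps, the sketch below drops zero-signal rows and then balances classes by undersampling negatives. The balancing strategy, the target ratio, and the names are our assumptions; the paper does not specify its balancing mechanism here.

```python
import numpy as np

def preprocess(X, y, neg_per_pos=1, seed=0):
    """Drop samples with no histone-modification signal, then balance
    classes by undersampling negatives (illustrative strategy)."""
    keep = X.sum(axis=1) > 0          # remove all-zero feature rows
    X, y = X[keep], y[keep]
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    rng = np.random.default_rng(seed)
    neg = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)),
                     replace=False)
    idx = rng.permutation(np.concatenate([pos, neg]))
    return X[idx], y[idx]

# Toy usage: 3 zero-signal rows are dropped, then negatives are sampled.
X = np.vstack([np.zeros((3, 4)), np.ones((2, 4)), np.full((5, 4), 2.0)])
y = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
Xb, yb = preprocess(X, y)
print(Xb.shape, yb.mean())  # (4, 4) 0.5
```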

PR Results
In this section, we compare the precision-recall curves (PRCs) for linear SVM, kernel SVM, DNN, CNN, random forest, and RFECS, by measuring the average area under the precision-recall curve (AUPR) across five training sets of the same size. We trained on 50 MB, 300 MB, 600 MB, and 1500 MB datasets, and tested on 7.8 GB test sets for all ML models. Figure S1b shows the average AUPR at different training set sizes for Aikyatan. The decision boundary that classifies regulatory sites for linear SVM is a linear hyperplane; if the dataset is not linearly separable, accuracy suffers. This appears to be a characteristic of our data, leading to the worst-in-class performance of linear SVM, especially at larger training set sizes. Another approach to improving prediction performance is to use an ensemble of weak learners to solve a single problem; we chose random forest as our representative ensemble method.

Algorithmic Performance: In our dataset, RFECS outperforms SVMs and DNN, and kernel SVM outperforms DNN, random forest, and linear SVM, for each training set size. DNN's performance is similar to linear SVM's at 50 MB and 300 MB but surpasses it as the training set size increases to 600 MB and beyond. CNN's performance is similar to kernel SVM's at 50 MB and 300 MB but surpasses it from 600 MB onward, likely because a CNN can capture spatial nuances in the input. To rank the average PR performance of these models, we perform statistical significance testing to compare all ML models on the 1500 MB training sets. We use paired, one-tailed t-tests and obtain p-values lower than 0.05. Therefore, we conclude that, when evaluated using the PR metric, CNN outperforms the other five ML models, shining as best-in-class in our Aikyatan suite; RFECS outperforms kernel SVM; kernel SVM outperforms random forest and DNN; and linear SVM performs the worst. From these results, one might consider kernel SVM or RFECS as candidates for this classification task when not using deep learning approaches. However, as we have discussed previously, the computational cost of kernel SVM is significantly higher, and for random forests, the memory required to build the classifier grows linearly with the training set size. Thus, for large-scale data analysis, these two methods pose challenges for building satisfactory inference models.

Figure S1: (a) The pipeline for generating the PR dataset. (b) Average AUPR as a function of training set size.

Figure S2: Sensitivity analysis on the size of the convolutional kernels, the size of the pooling window, and the number of kernels in the convolutional layers. Figure S2a shows the validation rate as a function of the kernel size of the first convolutional layer: the larger the kernels, the lower the validation rate. Since varying the kernel size of the second convolutional layer shows a similar pattern, we omit those results. Figure S2b shows the validation rate as a function of the size of the pooling window: the larger the window, the lower the validation rate. Figure S2c shows the validation rate as a function of the number of kernels per convolutional layer: the validation rate increases steadily until the number of kernels reaches 64; beyond that, with 128 kernels per convolutional layer, the validation rate drops.
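Returning to the significance testing described under Algorithmic Performance above, here is a minimal sketch of a paired, one-tailed t-test between two models' AUPR values over the five train-test pairs, using SciPy's scipy.stats.ttest_rel (the alternative keyword requires SciPy >= 1.6). The AUPR numbers are placeholders, not measurements from the paper.

```python
from scipy.stats import ttest_rel

# Placeholder AUPR values over the five paired train-test splits.
aupr_cnn   = [0.91, 0.90, 0.92, 0.91, 0.93]
aupr_rfecs = [0.89, 0.88, 0.90, 0.89, 0.91]

# One-tailed paired t-test: H1 is "CNN's mean AUPR > RFECS's mean AUPR".
stat, p = ttest_rel(aupr_cnn, aupr_rfecs, alternative="greater")
print(f"t = {stat:.2f}, p = {p:.4f}")  # p < 0.05 would support CNN > RFECS
```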