Comparative analysis and prediction of nucleosome positioning using integrative feature representation and machine learning algorithms

Han, Guo-Sheng; Li, Qi; Li, Ying

doi:10.1186/s12859-021-04006-w

Volume 22 Supplement 6

19th International Conference on Bioinformatics 2020 (InCoB2020)

Research
Open access
Published: 02 June 2021



BMC Bioinformatics volume 22, Article number: 129 (2021)


Abstract

Background

Nucleosomes play an important role in genome expression, DNA replication, DNA repair and transcription. Research on nucleosome positioning has therefore received extensive attention. Considering the diversity of DNA sequence representation methods, we integrated multiple features and analyzed their effect on nucleosome positioning prediction. This analysis can also deepen the theoretical understanding of nucleosome positioning.

Results

Here, we not only used frequency chaos game representation (FCGR) to construct DNA sequence features, but also integrated it with other features and applied principal component analysis (PCA) to reduce the feature dimension. Support vector machine (SVM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP) and convolutional neural network (CNN) models were each used as predictors of nucleosome positioning. The prediction quality of the integrated feature vector is significantly superior to that of a single feature. After PCA was used to reduce the feature dimension, the prediction quality on the H. sapiens dataset improved significantly.

Conclusions

Comparative analysis and prediction on the H. sapiens, C. elegans, D. melanogaster and S. cerevisiae datasets demonstrate that applying FCGR to nucleosome positioning is feasible, and that an integrative feature representation performs better.

Background

The nucleosome is the basic structural unit of eukaryotic chromatin. It is formed by the combination of histones and DNA. The core is an octamer formed by two copies each of histones H2A, H2B, H3 and H4, around which DNA is wound about 1.65 turns. The DNA wrapped around the octamer is called core DNA and is 147 base pairs in length; the DNA that connects two adjacent nucleosomes is called linker DNA and ranges from 20 to 60 base pairs [1]. In eukaryotic cells, nucleosomes play a crucial role in genome expression, DNA replication, DNA repair and transcription [2,3,4,5,6]. In addition, studies have demonstrated that abnormal histone modifications in the nucleosome structure are directly related to diseases such as tumors [7] and lupus erythematosus [8]. The mechanism of nucleosome positioning on the DNA sequence is therefore of great research value and is one of the current hot spots in epigenetics research.

The precise position of the nucleosome on the DNA sequence in the whole genome is called nucleosome positioning. Early experiments mainly used micrococcal nuclease to digest chromatin in order to determine nucleosome positions [9]. In recent years, thanks to the development and application of high-throughput experimental techniques such as chromatin immunoprecipitation-chip (ChIP-chip) and chromatin immunoprecipitation sequencing (ChIP-Seq), many breakthroughs have been made in nucleosome positioning experiments. Nucleosome positioning maps have been obtained for species including Saccharomyces cerevisiae [10, 11], Homo sapiens [12], Caenorhabditis elegans [13] and Drosophila melanogaster [14], which provides researchers with a large body of data for theoretical research and prediction.

Much of the research on nucleosome positioning is based on DNA sequence analysis [15, 16]. The DNA sequence consists of four nucleotides: A, T, C and G. Studies have shown that the affinity between genomic DNA sequences and histones clearly depends on sequence order, which indicates that the DNA sequence order affects where nucleosomes form. Although some studies support the view that nucleosome positioning is affected by multiple factors such as DNA sequence, ATP-dependent nucleosome remodeling enzymes and transcription factors [17, 18], many researchers have used sequence analysis methods to represent the characteristics of nucleosomal DNA sequences and then perform nucleosome positioning and recognition.

In the past decade, with the growing popularity of machine learning algorithms, many computational models based on DNA sequence information have been proposed. Chen et al. proposed the "iNuc-Physchem" nucleosome prediction model, which uses 12 physicochemical features of DNA to identify the core DNA and linker DNA of nucleosomes in the yeast genome [19]. Later, the same research group established a biophysical model based on the deformation energy of DNA sequences to predict the sequence of nucleosomes [20]. Guo et al. used pseudo k-tuple nucleotide composition to represent the DNA sequence as a feature vector, and trained a support vector machine (SVM) classifier on H. sapiens, C. elegans and D. melanogaster data [21]. The 3LS model used similar methods and combined the distributions of nucleotide combinations of different lengths in the sequence to further improve the prediction accuracy [22]. The ZCMM model, which is based on the Z-curve theory and the position weight matrix (PWM), achieves excellent prediction performance on D. melanogaster [23].

Deep learning has also been applied to nucleosome positioning and has achieved good prediction quality. These deep learning models all use one-hot encoding. Gangi et al. [24] constructed a deep learning model that integrates convolutional layers and long short-term memory networks. The LeNup model added the Inception module and a gated convolutional network to the convolutional neural network to improve nucleosome positioning prediction [25].

In this work, we first use frequency chaos game representation (FCGR) to construct DNA sequence features. This feature representation method has not previously been used for nucleosome positioning. Second, we integrate FCGR with other feature vectors and apply the principal component analysis (PCA) algorithm to reduce the feature dimension. Finally, several machine learning algorithms, namely support vector machine (SVM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP) and convolutional neural network (CNN), are used to perform comparative analysis and prediction of nucleosome positioning.

Results

Rule of performance evaluation

Cross-validation is a statistical method for validating a model. The basic idea is to divide the original data into a training set and a test set: the training set is used to train the model, and the test set is then used to assess the classification or prediction performance of the resulting model. In this work, we used K-fold cross-validation to evaluate the performance of the predictors through four measures: sensitivity ($S_{n}$), specificity ($S_{p}$), accuracy (ACC), and Matthews correlation coefficient (MCC). They are defined as follows:

$$\left\{ {\begin{array}{*{20}c} {S_{n} = \frac{TP}{{TP + FN}}} \\ {S_{p} = \frac{TN}{{TN + FP}}} \\ {ACC = \frac{TP + TN}{{TP + TN + FP + FN}}} \\ {MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {(TP + FN) \times (TP + FP) \times (TN + FN) \times (TN + FP)} }}} \\ \end{array} } \right.$$

(1)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively [25]. $S_{n}$ is the true positive rate; $S_{n} = 1$ means that all nucleosome core DNA sequences have been correctly predicted. $S_{p}$ is the true negative rate; $S_{p} = 1$ means that all linker DNA sequences have been correctly predicted. ACC is the ratio of the number of correctly predicted samples, over both categories, to the total number of samples. MCC evaluates the prediction results comprehensively and takes values in $[-1, 1]$. MCC = −1 means that the prediction is completely opposite to the true category, MCC = 1 means that the prediction is completely correlated with the true category, and MCC = 0 means that the prediction is no better than random.
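As a worked illustration of Eq. (1), the following minimal Python sketch computes the four measures from confusion-matrix counts. The function name `classification_metrics`, the example counts, and the convention of returning MCC = 0 when the denominator vanishes are assumptions made for illustration; they are not taken from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts, following Eq. (1).

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                      # sensitivity: true positive rate
    sp = tn / (tn + fp)                      # specificity: true negative rate
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # Assumed convention: report MCC = 0 if the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 100 core DNA (positive) and 100 linker DNA (negative) samples.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

With these hypothetical counts, $S_{n}$ = 0.9, $S_{p}$ = 0.8 and ACC = 0.85, while MCC is about 0.70, which shows that MCC penalizes the imbalance between the two error types more than ACC does.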

The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are often used to evaluate a binary classifier. AUC is the area under the ROC curve and usually lies between 0.5 and 1. As a single value, AUC gives a more intuitive measure of classifier quality: the larger the AUC, the better the classifier. To keep the paper concise, we report only the AUC values and do not draw the individual ROC curves.
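AUC can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below uses this pairwise formulation; the function name `auc_score` and the example scores are hypothetical, and the paper does not state how its AUC values were computed.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels (1 = positive, e.g. core DNA).
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive sample (0.4) is ranked below one negative (0.6).
print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation gives the same value in O(n log n) time for large datasets.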

Performance of predictors

As described in the Methods section, the choice of the K-nucleotide length K affects how FCGR represents the features of a DNA sequence [26]. A large K value means a high feature dimension, and high-dimensional features are generally sparse, so the fitting quality may not be good. Choosing an appropriate K value therefore has a considerable impact on the classification performance of each classifier. Some studies have combined several DNA sequence features [22, 23, 27, 28]. Similarly, FCGR can use combinations of different K values as feature vectors.
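To make the link between K and feature dimension concrete, the following sketch builds a $2^{K} \times 2^{K}$ FCGR matrix of normalized K-nucleotide frequencies and concatenates the flattened matrices for several K values. The corner assignment (A, C, G, T), the handling of ambiguous bases, and the function names `fcgr` and `combined_features` are assumptions for illustration; the paper's exact FCGR convention is given in its Methods and may differ.

```python
# Assumed corner layout (x, y) of the chaos game square; other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of normalized k-nucleotide frequencies."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        # The last base of the k-mer selects the coarsest quadrant, as in the chaos game.
        for j, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            x += cx << j
            y += cy << j
        grid[y][x] += 1
        total += 1
    if total:
        grid = [[count / total for count in row] for row in grid]
    return grid


def combined_features(seq, ks):
    """Flatten the FCGR matrix for each K in ks and concatenate the results."""
    features = []
    for k in ks:
        features.extend(value for row in fcgr(seq, k) for value in row)
    return features


seq = "ACGTACGT"
for k in (1, 2, 4):
    print(k, len(combined_features(seq, (k,))))   # dimensions 4, 16, 256
print(len(combined_features(seq, (1, 2, 4))))     # 4 + 16 + 256 = 276
```

Each flattened matrix has $4^{K}$ entries, so the dimension grows quickly with K, while a short sequence such as a 147 bp core DNA fragment contains at most 147 − K + 1 K-nucleotides. This is why large K values give sparse features.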

Feasibility of FCGR

In this work, we flatten the FCGR matrix into a normalized one-dimensional vector of K-nucleotide frequencies, which serves as the input of SVM and ELM [27]. The inputs of the MLP and CNN models include both single-channel FCGR images (2D) [26, 27] and multi-K-value images; the image size is 64 × 64. For multi-K-value input, we used multiple channels to feed in the combination of K values when training the model, and used simple averaging to calculate the final prediction. To find the appropriate K value or combination of K values, we used 10-fold cross-validation. Figure 1 shows the classification accuracy of each classifier for different K values and combinations.
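The 10-fold cross-validation used to compare K values can be sketched as follows. This is a minimal, unstratified version in plain Python; the helper names `k_fold_indices`, `cross_val_accuracy` and `majority_fit_predict` are hypothetical, and the majority-class predictor is only a placeholder for the SVM, ELM, MLP or CNN models used in the paper.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_accuracy(features, labels, fit_predict, n_folds=10):
    """Mean test accuracy over the folds.

    fit_predict(train_x, train_y, test_x) must return one predicted label per test sample.
    """
    accuracies = []
    for train, test in k_fold_indices(len(labels), n_folds):
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def majority_fit_predict(train_x, train_y, test_x):
    """Placeholder model: always predicts the most frequent training label."""
    majority = max(set(train_y), key=train_y.count)
    return [majority] * len(test_x)


# Hypothetical toy data: 15 positive and 10 negative samples with dummy features.
features = [[i] for i in range(25)]
labels = [1] * 15 + [0] * 10
print(cross_val_accuracy(features, labels, majority_fit_predict))
```

To select K, one would call `cross_val_accuracy` once per candidate K value or combination, each time with the corresponding feature vectors, and keep the setting with the highest mean accuracy.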

For SVM, the accuracy on H. sapiens and C. elegans peaks with the combination K = 1, 2 and 4, and the accuracy on D. melanogaster is highest with K = 2 and 4. For ELM, the accuracy on D. melanogaster peaks when K = 2, and the accuracy on H. sapiens peaks with K = 2 and 4; as with SVM, the classification accuracy on C. elegans is best with K = 1, 2 and 4.

For MLP, the accuracy on H. sapiens and D. melanogaster peaks with K = 3, 4 and 5, and the classification accuracy on C. elegans is best with K = 3 and 4. For CNN, H. sapiens has the best classification quality when using the FCGR image with K = 4; the accuracy on C. elegans peaks with K = 4 and 5, and the accuracy on D. melanogaster peaks with K = 3, 4 and 5. Table 1 shows the best prediction results for the four species via 10-fold cross-validation.

Table 1 The prediction results for four species via 10-fold cross-validation by SVM, ELM, MLP, CNN
