Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

Ueno, Daishin; Kawabe, Harunori; Yamasaki, Shotaro; Demura, Taku; Kato, Ko

doi:10.1186/s12859-021-04291-5

Research
Open access
Published: 22 July 2021

Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

Daishin Ueno¹,
Harunori Kawabe¹,
Shotaro Yamasaki¹,
Taku Demura¹ &
…
Ko Kato¹

BMC Bioinformatics volume 22, Article number: 380 (2021) Cite this article

2197 Accesses
13 Citations
2 Altmetric
Metrics details

Abstract

Background

RNA degradation is important for the regulation of gene expression. Despite the identification of proteins and sequences related to deadenylation-dependent RNA degradation in plants, endonucleolytic cleavage-dependent RNA degradation has not been studied in detail. Here, we developed truncated RNA end sequencing in Arabidopsis thaliana to identify cleavage sites and evaluate the efficiency of cleavage at each site. Although several features are related to RNA cleavage efficiency, the effect of each feature on cleavage efficiency has not been evaluated by considering multiple putative determinants in A. thaliana.

Results

Cleavage site information was acquired from a previous study, and cleavage efficiency at the site level (CS_site value), which indicates the number of reads at each cleavage site normalized to RNA abundance, was calculated. To identify features related to cleavage efficiency at the site level, multiple putative determinants (features) were used to perform feature selection using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model. The results indicated that whole RNA features were important for the CS_site value, in addition to features around cleavage sites. Whole RNA features related to the translation process and nucleotide frequency around cleavage sites were major determinants of cleavage efficiency. The results were verified in a model constructed using only sequence features, which showed that the prediction accuracy was similar to that determined using all features including the translation process, suggesting that cleavage efficiency can be predicted using only sequence information. The LASSO regression model was validated in exogenous genes, which showed that the model constructed using only sequence information can predict cleavage efficiency in both endogenous and exogenous genes.

Conclusions

Feature selection using the LASSO regression model in A. thaliana identified 155 features. Correlation coefficients revealed that whole RNA features are important for determining cleavage efficiency in addition to features around the cleavage sites. The LASSO regression model can predict cleavage efficiency in endogenous and exogenous genes using only sequence information. The model revealed the significance of the effect of multiple determinants on cleavage efficiency, suggesting that sequence features are important for RNA degradation mechanisms in A. thaliana.

Peer Review reports

Background

RNA degradation is an important process for regulating gene expression in living organisms [1]. RNA degradation is mediated by deadenylation-dependent or endonucleolytic cleavage-dependent RNA degradation mechanisms [2]. In both degradation mechanisms, the final step in RNA digestion is catalyzed by exonucleases. Deadenylation-dependent RNA degradation has been studied in plants [3], and some proteins and related sequences were identified and analyzed; however, endonucleolytic cleavage-dependent RNA degradation has not been analyzed in detail.

Degradome sequencing methods, which are techniques for detecting truncated RNA ends using next-generation sequencing (NGS), have been developed to study the mechanisms underlying endonucleolytic cleavage–dependent RNA degradation [4,5,6]. Although these methods contribute to the identification of cleavage sites, the detected cleavage sites are biased toward the 3′ end of the transcript because of poly A selection in library preparation [6, 7]. In previous work from our group, we attempted to solve this problem by developing truncated RNA end sequencing (TREseq) in A. thaliana [8]. TREseq analysis showed high G nucleotide frequency around the cleavage sites; the cleavage sites were highly accumulated around the start and stop codons, and three-nucleotide periodicity was observed in the coding sequences (CDSs) [8, 9]. These tendencies are similar to ribosome movements, which are reported in ribosome profiling methods, suggesting that the translation process affects RNA cleavage [8,9,10]. In our previous study, we evaluated the relationships between single determinants and cleavage efficiency in A. thaliana using Pearson’s correlation analysis [11]. We found that cleavage efficiencies were related to several determinants (e.g., G nucleotide frequency around cleavage sites and RNA length). However, because Pearson’s correlation analysis can evaluate only one-to-one relationships, the effects of multiple putative determinants on cleavage efficiency remain to be elucidated.

To evaluate several features, multiple regression analysis has been performed using large feature sets [12, 13]. Sequence information (e.g., nucleotide sequence, codons, or amino acid usage) was used, and RNA abundance or RNA stability was predicted by a multiple regression model. However, multiple regression models can result in predictions with large variance, thereby affecting the accuracy of prediction in high-dimensional data [14]. To overcome the shortcomings of multiple regression in high-dimensional data, the Least Absolute Shrinkage and Selection Operator (LASSO) regression model was designed [15]. LASSO penalizes the absolute size of regression coefficients. Therefore, by setting as many coefficients as possible to zero, the objective variable is regressed with as few features as possible. Necessary features can thus be extracted according to the coefficient in the LASSO regression model. In a previous study, a LASSO regression model was applied to predicting the ribosome occupancy on RNA (i.e., the translation process) [16]. Sequence information was used, and approximately 60 features were reduced to a subset of 10–20 features. The results revealed that specific nucleotides or codon sequence patterns are important for the translation process. Although these integrated analyses using the LASSO regression model were used to predict the transcription (RNA expression level) or translation process (ribosome occupancy) [16, 17], little information is available in RNA degradation mechanism, especially for endonucleolytic cleavage–dependent RNA degradation in A. thaliana. Although some cleavage sites induced by microRNA were analyzed and could be predicted by computing RNA complementary sequences (nucleotide information) [18], the majority of cleavage sites did not appear to be induced by microRNA [8, 19, 20]. Therefore, the determinants of genome-wide RNA cleavage sites remain to be elucidated in A. thaliana.

To address this issue, we obtained cleavage sites throughout the genome and performed feature selection for cleavage efficiency using LASSO in A. thaliana. In the LASSO regression model, 155 features were selected, and the coefficients indicated that in addition to features around cleavage sites, features of the whole RNA were also important. We also confirmed the selected features in the LASSO regression model using a different regression model (Ridge regression), which addressed some of the problems of multiple regression models but did not decrease the number of features relative to LASSO. In addition, we attempted to predict cleavage efficiency in endogenous and exogenous genes using only sequence information, and the prediction accuracy was similar to that of the model using all features. These results suggest that sequence features in whole RNA and around cleavage sites are critical for determining the cleavage efficiency at each site in both endogenous and exogenous genes in A. thaliana.

Materials and methods

Plant material

Arabidopsis thaliana T87 cell suspension was obtained from Riken Cell Bank (Tsukuba,

Japan) and cultured in modified Murashige–Skoog medium, as described previously [21].

Data processing for TREseq

Reads from cultured cells and seeds of A. thaliana ecotype Columbia-0 transformed with p35S::firefly luciferase (F-luc)::heat shock protein 18.2 terminator (HSPT) [11, 22] were acquired from previous TREseq analyses [8, 11] and mapped to the TAIR version 10 reference genome (www.arabidopsis.org) or the p35S::F-Luc::HSPT sequence using HISAT2. After mapping, the first nucleotide (5′ end) of each read was counted using BED files as described previously [9]. Cap RNA with more than 50 reads at each gene was used for RNA abundance information [9]. To estimate the cleavage efficiency at each site, the reads at each 5′ degradation intermediate normalized to RNA abundance were defined as the cleavage score at the site level (CS_site). At the gene level, we defined the total CS_site value at each gene as the CS_gene value.

Library construction for ribosome profiling

Ribosome-protected fragments (RPFs) were selected as described previously [23, 24]. In brief, A. thaliana T87 cells were harvested 3 days after inoculation and frozen in liquid nitrogen, followed by homogenization in extraction buffer (200 mM Tris–HCl, pH 8.5, 50 mM KCl, 25 mM MgCl₂, 2 mM EGTA, 100 µg/ml heparin, 100 µg/ml cycloheximide, 2% polyoxyethylene 10-tridecyl ether, and 1% sodium deoxycholate), and centrifuged at 15,000 g for 10 min at 4 °C [23]. Cells were incubated with 6 μl RNase I (Thermo Fisher Scientific, MA, USA) for 30 min, and the reaction was stopped by addition of 10 μl RNase inhibitor (Thermo Fisher Scientific). A 26.25–71.25% sucrose density gradient buffer (200 mM Tris–HCl, pH 8.5, 200 mM KCl, and 200 mM MgCl₂) was used to collect monosomes by sucrose density gradient centrifugation at 55,000 rpm for 50 min at 4 °C in an SW55 rotor (Beckman Coulter, CA, USA). After isolation of monosomes, RPFs were purified using the TruSeq Ribo Profile kit (Illumina). The libraries were sequenced on an Illumina NextSeq 500 (Illumina).

Data processing for ribosome profiling

The adapter sequences were trimmed, and reads were mapped to the TAIR version 10 reference genome (www.arabidopsis.org) using the modified MOIRAI system [8, 9]. After mapping, the first nucleotide (5′ end) of each read was counted using BED files. To estimate the RPFs at each site, the average of RPF reads at the 5' end of each site normalized to RNA abundance was defined as ribosome occupancy at the site level (RO_site). At the gene level, the total RO_site values at each gene were defined as RO_gene values.

Analysis of cleavage sites using LASSO and Ridge regression

In TREseq analysis, cleavage sites are detected in approximately 2 million sites. To select reliable sites, we selected sites in the genes with > 20% cleaved sites relative to RNA length and whose CS_gene values were between the 5th and 95th percentile. In addition, we removed expected microRNA-induced cleavage sites using psRNATarget [18]. Data were separated into training and test sets (9: 1), and training data were used to construct the model.

Explanatory variables in the model

In the model, features around the cleavage sites or features in whole RNA were extracted (Fig. 1). RNA sequence information (nucleotide, codon, and corresponding amino acid sequence) were obtained from the TAIR10 database (www.arabidopsis.org), and the minimum free energy (stability of secondary structure) was predicted using the RNAfold software (http://rna.tbi.univie.ac.at/) based on RNA nucleotide sequences [25]. In addition, in the ribosome profiling method used in this analysis, RO_site and RO_gene values were used to obtain ribosome occupancy information (Fig. 1). For the features around cleavage sites, we determined nucleotide, codon, and the corresponding amino acid sequences ± 30 nucleotides around the cleavage sites. A comprehensive search of the features around the cleavage sites was performed and nucleotide, codon, or corresponding amino acid frequencies were calculated. The window size was changed by one nucleotide (minimum length, 1 nucleotide; maximum length, 60 nucleotides) and the sliding window was shifted by one nucleotide. Because ribosome occupancy affects long-distance cleavage efficiency [26], the region was extended by ± 200 nucleotides around the cleavage site. In terms of secondary structure, because the minimum free energy of short RNA sequences cannot be predicted by RNAfold, the minimum window size was changed to 5 nucleotides (minimum length, 5 nucleotides; maximum length, 60 nucleotides), and the sliding window was shifted by 5 nucleotides.

In terms of whole-RNA features, we extracted 5′-UTR, CDS, 3′-UTR, and whole RNA sequences from the TAIR10 database (www.arabidopsis.org) and calculated the nucleotide, codon, and corresponding amino acid frequencies. The minimum free energy was predicted using RNAfold in each region (5′-UTR, CDS, 3′-UTR, or whole RNA). The sum of RO_site values in each region (5′-UTR, CDS, 3′-UTR, or whole RNA) was used to obtain ribosome occupancy information. In addition, 50-nucleotide sequences were extracted from the 5’ or 3’ end of each region (5′-UTR, CDS, 3′-UTR, or whole RNA), and the nucleotide frequency information was added to the model. Because several codons around start codon appeared to be related to the translation process [27, 28], 10 codons or their corresponding amino acid sequences were extracted from 5’ or 3’ end of CDS, and the codons or corresponding amino acid frequencies were used for the model. An example of an explanatory variable using “nucleotide sequence” is shown in Fig. 2.

To remove multi-collinearity from explanatory variables, we calculated the Spearman rank correlation coefficient among features. If the correlation coefficient between features was ≥ 0.6, the feature with the highest correlation to the CS_site value was used for the model. In addition, explanatory variables that showed no correlation (r < 0.1) according to the Spearman coefficient were removed. Finally, 1,191 features were used for the LASSO and Ridge regression models. RNA sequence information can be obtained from the TAIR10 database, and the data processed in this analysis (cleavage sites and ribosome profiling information) are available in GitHub (https://github.com/daishin-ueno/LASSO_and_Ridge_regression/tree/main/Datasets).

Construction of the LASSO and Ridge regression models

For feature selection, we used linear_model.Lasso or linear_model.Ridge from the Python package scikit-learn [29]. In the i-th observation, ${y}_{i}$ was the objective variable, and the cleavage efficiencies (CS_site values) and the vector ${\mathrm{x}}_{i}=\left({x}_{i1},{x}_{i2},\cdots ,{x}_{ip}\right)$ were the corresponding feature value set. The linear regression model of the cleavage efficiencies was defined as follows:

$${\widehat{y}}_{i}=\alpha +{\varvec{\upbeta}}\bullet {\mathbf{x}}_{i}=\alpha +\sum_{j=1}^{p}{\beta }_{j}{x}_{ij} ,$$

where ${\varvec{\upbeta}}\bullet {\mathbf{x}}_{i}$ represents the inner product of the vectors ${\varvec{\upbeta}}=\left({\beta }_{1},{\beta }_{2},\cdots ,{\beta }_{p}\right)$ and ${\mathbf{x}}_{i}$, ${\beta }_{j}$ is the coefficient for j-th feature, ${\alpha}$ is the intercept, and $p$ is the number of features.

The LASSO regression estimator uses the L1 regularization penalty:

$${{\varvec{\upbeta}}}_{LASSO}=\underset{{\varvec{\upbeta}}}{\mathrm{arg min}}\left\{{{\sum }_{i=1}^{N}\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}+\lambda {\sum }_{j=1}^{p}\left|{\beta }_{j}\right|\right\},$$

where $\lambda {\sum }_{j=1}^{p}\left|{\beta }_{j}\right|$ is the L1 regularization penalty on the coefficient ${\beta }_{j}$ and $\lambda \ge 0$ is the tuning parameter.

The Ridge regression estimator uses the L2 regularization penalty:

$${{\varvec{\upbeta}}}_{Ridge}=\underset{{\varvec{\upbeta}}}{\mathrm{arg min}}\left\{{{\sum }_{i=1}^{N}\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}+\lambda {\sum }_{j=1}^{p}{\beta }_{j}^{2}\right\},$$

where $\lambda {\sum }_{j=1}^{p}{\beta }_{j}^{2}$ is the L2 regularization penalty on ${\beta }_{j}$ and $\lambda \ge 0$ is the tuning parameter.

The tuning parameter λ was determined using training data. In LASSO regression, mean squared error (MSE) between predicted and measured CS_site values was calculated by changing the parameter λ (10^–10–10^–1) in tenfold cross validation [30] using model_selection.cross_val_score from the Python package scikit-learn [29, 31]. To increase interpretability in the LASSO regression model, we determined the parameter λ with a reduced number of features while maintaining MSE (Fig. 3a). The same data processing approach was used in the LASSO regression model, which was constructed using only sequence information (Fig. 3b). In Ridge regression, MSE between predicted and measured CS_site values was calculated by changing the parameter λ (10^–10–10^–1) in tenfold cross validation [30] using model_selection.cross_val_score from the Python package scikit-learn [29, 31]; we determined that λ = 10⁵ where MSE was smallest in the range 10^–10–10^–1. Source codes for the LASSO or Ridge regression model using linear_model.Lasso or linear_model.Ridge from the Python package scikit-learn are available at GitHub (https://github.com/daishin-ueno/LASSO_and_Ridge_regression/tree/main/Source_code).

Results and discussion

Data processing in the LASSO regression model

For the objective variable, indicators of cleavage efficiency at the site level (CS_site value) were calculated using our previous TREseq data [8], with the number of reads at each cleavage site normalized to RNA abundance. Explanatory variables (features) can be divided into three categories: sequence (nucleotide, codon, or corresponding amino acid frequency), stability of secondary structure, and ribosome occupancy (Fig. 1). For the explanatory variables (features), RNA sequences were obtained from the TAIR10 database, and nucleotide, codon, and corresponding amino acid frequencies were acquired. Minimum free energies of secondary structures were calculated using the RNAfold software based on RNA nucleotide sequences. In addition, ribosome occupancy information (DRA010802) was obtained under the conditions used for our previous TREseq data [8], and the RO_site value (the number of ribosome-protected fragments at the site level, normalized against RNA abundance) was used for model. Because multi-collinearity among features negatively affects the prediction accuracy, we removed features with high correlations between explanatory variables. In addition, features that showed no correlation between objective and explanatory variables according to Spearman’s correlation test (r < 0.01) were removed from the feature extraction process (Fig. 4). To obtain reliable cleaved sites, we selected genes with > 20% cleaved sites relative to RNA length. The total numbers of analyzed genes and sites were 1,107 and 429,185 sites, respectively. Sites were separated into training and test data sets (Table 1), and the CS_site value was used as the objective variable. Input data were formatted using CS_site values and features (Fig. 4) and are available in GitHub (https://github.com/daishin-ueno/LASSO_and_Ridge_regression/tree/main/Datasets/Final_input_data_for_model_construction). The LASSO or Ridge regression model was then constructed using training data, and its performance was evaluated using test data. Features of non-zero coefficients were selected, and each feature was estimated according to its importance score in the model (Fig. 4).

Table 1 Training and test data in the LASSO and Ridge regression models

Full size table

Prediction of CS_site value using the LASSO regression model

The prediction accuracy (Pearson’s correlation coefficient) was calculated using test data; the correlation coefficient was r = 0.74 (Fig. 5). Features with a coefficient of zero were removed, which decreased the number of features from 1,191 to 155. The resultant 155 features predicted the cleavage efficiency. These features were divided into positive (contributing to increased cleavage efficiency) and negative (contributing to decreased cleavage efficiency) groups according to the correlation coefficient. In the positive coefficient (effect) group, the correlation coefficient of whole RNA features around cleavage sites was 0.78 (59.7%) and that of whole RNA features was 0.53 (40.3%) (Fig. 6a). Nucleotide sequences accounted for approximately 90% of the positive coefficients in the features around cleavage sites (Fig. 6b). This result is consistent with those of previous studies reporting that nucleotide frequency around cleavage sites has a positive effect on CS_site values [8, 9, 11], and suggests that sequence features are major determinants of CS_site values in the positive coefficient features. When we focused on the positive coefficients of whole RNA features, ribosome occupancy, codon, and corresponding amino acid sequence accounted for approximately 50% (Fig. 6c). These results suggest that the translation process (codon, corresponding amino acid sequence, or ribosome occupancy) has a positive effect on CS_site value in features of the whole RNA, but not in features around cleavage sites.

In the negative coefficient group, whole RNA features were also related to cleavage efficiency; however, the majority of features were around cleavage sites (Fig. 7a). Analysis of the features around the cleavage sites indicated that nucleotide frequency was a major determinant of CS_site values in the negative coefficient group (Fig. 7b), which is similar to the observation in the positive coefficient features around cleavage sites (Fig. 6b). Among whole RNA features, codon or corresponding amino acid sequences were selected in addition to nucleotide sequences in the negative coefficient group (Fig. 7c). Considering that some nucleotide sequence patterns have an effect on ribosome occupancies [16], the nucleotide sequences in whole RNA features seemed to be involved in the translation process and to affect cleavage efficiency at the site level (Figs. 6c, 7c). These tendencies were also observed after selecting the five most positive or negative features based on the correlation coefficient in the LASSO regression model (Tables 2, 3). In particular, a G nucleotide frequency of − 4 to + 2 (positive) and a G nucleotide frequency of + 4 to + 5 (negative) were selected. These results are consistent with the nucleotide frequency around the cleavage sites because G nucleotide frequency was high from the − 3 to + 1 position, whereas it was low around the + 4 position, as reported in a previous TREseq analysis [8]. Taken together, these results indicate that the 155 features explained the CS_site value in the LASSO regression model, and whole RNA features (e.g., translation process) were related to cleavage efficiency, in addition to nucleotide frequency around cleaved sites.

Table 2 Positive coefficient features in the LASSO regression model

Full size table

Table 3 Negative coefficient features in the LASSO regression model

Full size table

Confirmation of selected features using a different model

The advantages of the proposed LASSO regression model include reduction of the number of features (explanatory variables) and increased interpretability. On the other hand, the LASSO regression model cannot select several features if they have similar correlations to the objective variable. Hence, it is possible that some important features for cleavage efficiencies were removed from the LASSO regression model. To overcome this limitation, we needed to confirm that similar trends would be observed if we used a different model. Therefore, to confirm the importance of the features selected in the LASSO regression model, we performed an experiment using the Ridge regression model. We constructed the Ridge regression model based on data processing in the LASSO regression model. The prediction accuracy of Ridge was first tested using Pearson’s correlation coefficient, which showed that the prediction accuracy was comparable to that of the LASSO regression model (Fig. 8). The features with a coefficient of zero were removed, and the 1191 features were reduced to 1,051 features (Fig. 9a). We also calculated the correlation coefficient of feature importance, which is common in the LASSO and Ridge regression models, using Pearson’s correlation coefficient, and similar tendencies were observed in both models (Fig. 9b). In addition, when the five most positive or negative features were selected (Tables 4, 5), nucleotide frequency around the cleavage sites and whole RNA features related to the translation process were selected. These results suggest that features selected in the LASSO regression model are reliable.

Table 4 Positive coefficient features in the Ridge regression model

Full size table

Table 5 Negative coefficient features in the Ridge regression model

Full size table

Predicting CS_site values using only sequence information

Although ribosome occupancy had the highest positive coefficient in the LASSO regression model, most coefficients were related to sequence information. In addition, ribosome occupancy was explained by nucleotide or codon sequence in a previous study [15]. Thus, we hypothesized that we could predict cleavage efficiency at each site using only sequence information. We removed features related to ribosome occupancy or secondary structure information and re-constructed the LASSO regression model. Prediction accuracy (Pearson’s correlation coefficient) was calculated using test data; the correlation coefficient was r = 0.68 (Fig. 10). These results indicate that cleavage efficiency at the site level could be explained using only sequence information.

Validation of the LASSO regression model through prediction of exogenous genes

The LASSO regression model was constructed using endogenous genes. If this model could explain cleavage efficiency in plant cells, we hypothesized that cleavage efficiency could be predicted in exogenous genes in A. thaliana. We obtained the CS_site values of the Firefly luciferase (F-luc) gene, which was inserted into the A. thaliana genome (DRA009373) [11, 22]. Because ribosome profiling information for the F-luc gene was lacking, we used a model constructed using only sequence information (Fig. 10). We predicted CS_site values in F-luc RNA and calculated the Pearson's correlation coefficient between measured and predicted CS_site values. The prediction accuracy was r = 0.71 (Fig. 11). These results suggest that the selected features for CS_site values in the LASSO regression model are reliable, and RNA cleavage efficiency at the site level in both endogenous and exogenous genes can be predicted using only sequence information in A. thaliana.

Conclusions

We conducted feature selection using the LASSO regression model in A. thaliana. The model revealed the significance of the effect of multiple determinants on cleavage efficiency at each site, and suggests that whole RNA features are important for determining cleavage efficiency in addition to features around the cleavage sites. The selected features in the LASSO regression model were validated using a different method, and this model could predict cleavage efficiency in both endogenous and exogenous genes using only sequence information. These results indicate that mathematical models can predict cleavage efficiency at the site level in transgenes in plants, providing new insight into the importance of sequence features for RNA degradation mechanisms in A. thaliana.

Availability of data and materials

TREseq reads of cultured cells and F-luc RNA are available in the DDBJ Sequence Read Archive (DRA) database under accession numbers DRA005995 (https://ddbj.nig.ac.jp/DRASearch/study?acc=DRP003990) and DRA009373 (https://ddbj.nig.ac.jp/DRASearch/submission?acc=DRA009373). Ribosome profiling reads in A. thaliana are available under the accession number DRA010802 (https://ddbj.nig.ac.jp/DRASearch/submission?acc=DRA010802). Processed data in this study are available at GitHub (https://github.com/daishin-ueno/LASSO_and_Ridge_regression).

References

Keene JD. Minireview: global regulation and dynamics of ribonucleic acid. Endocrinology. 2010;151:1391–7.
Article CAS Google Scholar
Parker R. RNA degradation in Saccharomyces cerevisae. Genetics. 2012;191:671–702.
Article CAS Google Scholar
Chiba Y, Green PJ. mRNA degradation machinery in plants. J Plant Biol. 2009;52:114–24.
Article CAS Google Scholar
Gregory BD, O’Malley RC, Lister R, Urich MA, Tonti-Filippini J, Chen H, et al. A link between RNA metabolism and silencing affecting arabidopsis development. Dev Cell. 2008;14:854–66.
Article CAS Google Scholar
German MA, Pillay M, Jeong DH, Hetawal A, Luo S, Janardhanan P, et al. Global identification of microRNA-target RNA pairs by parallel analysis of RNA ends. Nat Biotechnol. 2008;26:941–6.
Article CAS Google Scholar
Addo-Quaye C, Eshoo TW, Bartel DP, Axtell MJ. Endogenous siRNA and miRNA targets identified by sequencing of the arabidopsis degradome. Curr Biol. 2008;18:758–62.
Article CAS Google Scholar
Weinberg DE, Shah P, Eichhorn SW, Hussmann JA, Plotkin JB, Bartel DP. Improved ribosome-footprint and mRNA measurements provide insights into dynamics and regulation of yeast translation. Cell Rep. 2016;14:1787–99.
Article CAS Google Scholar
Ueno D, Yamasaki S, Demura T, Kato K. Comprehensive analysis of mRNA internal cleavage sites in Arabidopsis thaliana. J Biosci Bioeng. 2018;125:723–8.
Article CAS Google Scholar
Ueno D, Mukuta T, Yamasaki S, Mikami M, Demura T, Matsui T, et al. Different plant species have common sequence features related to mRNA degradation intermediates. Plant Cell Physiol. 2020;61:53–63.
Article CAS Google Scholar
Ibrahim F, Maragkakis M, Alexiou P, Mourelatos Z. Ribothrypsis, a novel process of canonical mRNA decay, mediates ribosome-phased mRNA endonucleolysis. Nat Struct Mol Biol. 2018;25:302–10.
Article CAS Google Scholar
Ueno D, Mikami M, Yamasaki S, Kaneko M, Mukuta T, Demura T, et al. Changes in mRNA degradation efficiencies under varying conditions are regulated by multiple determinants in Arabidopsis thaliana. Plant Cell Physiol. 2020;62:143–55.
Article Google Scholar
Nie L, Wu G, Zhang W. Correlation of mRNA expression and protein abundance affected by multiple sequence features related to translational efficiency in Desulfovibrio vulgaris: a quantitative analysis. Genetics. 2006;174:2229–43.
Article CAS Google Scholar
Cheng J, Maier KC, Avsec Ž, Petra RUS, Gagneur J. Cis-regulatory elements explain most of the mRNA stability variation across genes in yeast. RNA. 2017;23:1648–59.
Article CAS Google Scholar
Kyung M, Gilly J, Ghoshz M, Casellax G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5:369–412.
Google Scholar
Tishbirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B (Methodological). 1996;58:267–88.
Hu Q, Merchante C, Stepanova AN, Alonso JM, Heber S. Mining transcript features related to translation in Arabidopsis using LASSO and random forest. 2015 IEEE 5th Int Conf Comput Adv Bio Med Sci ICCABS 2015. 2015;1–6.
Qabaja A, Alshalalfa M, Bismar TA, Alhajj R. Protein network-based Lasso regression model for the construction of disease-miRNA functional interactions Computational methods for biomarker discovery and systems biology research. Eurasip J Bioinforma Syst Biol. 2013;2013:1–11.
Article Google Scholar
Dai X, Zhao PX. PsRNATarget: A plant small RNA target analysis server. Nucleic Acids Res. 2011;39(SUPPL. 2):155–9.
Article Google Scholar
Hou CY, Lee WC, Chou HC, Chen AP, Chou SJ, Chen HM. Global analysis of truncated RNA ends reveals new insights into Ribosome Stalling in plants. Plant Cell. 2016;28:2398–416.
Article CAS Google Scholar
Yu X, Willmann MR, Anderson SJ, Gregory BD. Genome-wide mapping of uncapped and cleaved transcripts reveals a role for the nuclear mrna cap-binding complex in cotranslational rna decay in arabidopsis. Plant Cell. 2016;28:2385–97.
Article CAS Google Scholar
Matsui T, Takita E, Sato T, Kinjo S, Aizawa M, Sugiura Y, et al. N-glycosylation at noncanonical Asn-X-Cys sequences in plant cells. Glycobiology. 2011;21:994–9.
Article CAS Google Scholar
Yamasaki S, Sanada Y, Imase R, Matsuura H, Ueno D, Demura T, et al. Arabidopsis thaliana cold-regulated 47 gene 5′-untranslated region enables stable high-level expression of transgenes. J Biosci Bioeng. 2018;125.
Lei L, Shi J, Chen J, Zhang M, Sun S, Xie S, et al. Ribosome profiling reveals dynamic translational landscape in maize seedlings under drought stress. Plant J. 2015;84:1206–18.
Article CAS Google Scholar
Yamasaki S, Matsuura H, Demura T, Kato K. Changes in polysome association of mRNA throughout growth and development in Arabidopsis thaliana. Plant Cell Physiol. 2015;56:2169–80.
CAS PubMed Google Scholar
Gruber AR, Lorenz R, Bernhart SH, Neuböck R, Hofacker IL. The Vienna RNA websuite. Nucleic Acids Res. 2008;36 Web Server issue:70–4.
Simms CL, Yan LL, Zaher HS. Ribosome collision is critical for quality control during no-go decay. Mol Cell. 2017;68:361–73.
Article CAS Google Scholar
Zalucki YM, Power PM, Jennings MP. Selection for efficient translation initiation biases codon usage at second amino acid position in secretory proteins. Nucleic Acids Res. 2007;35:5748–54.
Article CAS Google Scholar
Bivona L, Zou Z, Stutzman N, Sun PD. Influence of the second amino acid on recombinant protein expression. Protein Expr Purif. 2010;74:248–56.
Article CAS Google Scholar
Barupal DK, Fiehn O. Machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Google Scholar
Hasan MM, Manavalan B, Shoombuatong W, Khatun MS, Kurata H. i6mA-Fuse: improved and robust prediction of DNA 6 mA sites in the Rosaceae genome by fusing multiple feature representation. Plant Mol Biol. 2020;103:225–34. https://doi.org/10.1007/s11103-020-00988-y.
Article CAS PubMed Google Scholar
Varoquaux G, Buitinck L, Louppe G, Grisel O, Pedregosa F, Mueller A. SCIKIT-LEARN: machine learning without learning the machinery machine. GetMobile Mob Comput Commun. 2015;19:29–33.
Article Google Scholar

Download references

Acknowledgements

We thank DNAFORM for excellent deep sequencing analysis and helpful suggestions. We also thank Yuichi Sakumura at NAIST for helpful technical advice and discussions.

Funding

This work was supported by “Development of Production Techniques for Highly Functional Biomaterials Using Smart Cells of Plants and Other Organisms” from the New Energy and Industrial Technology Development Organization (NEDO).

Author information

Authors and Affiliations

Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, 630-0192, Japan
Daishin Ueno, Harunori Kawabe, Shotaro Yamasaki, Taku Demura & Ko Kato

Authors

Daishin Ueno
View author publications
You can also search for this author in PubMed Google Scholar
Harunori Kawabe
View author publications
You can also search for this author in PubMed Google Scholar
Shotaro Yamasaki
View author publications
You can also search for this author in PubMed Google Scholar
Taku Demura
View author publications
You can also search for this author in PubMed Google Scholar
Ko Kato
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

DU, SY, TD, and KK designed the research. DU, HK, and SY performed the experiments. DU and KK wrote the article. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ko Kato.

Ethics declarations

Ethics approval and consent to participate

Collection of A. thaliana T87 cultured cells complied with the Genetically Modified Organism Safety Management Regulations of the Nara Institute of Science and Technology (NAIST).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Ueno, D., Kawabe, H., Yamasaki, S. et al. Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana. BMC Bioinformatics 22, 380 (2021). https://doi.org/10.1186/s12859-021-04291-5

Download citation

Received: 12 March 2021
Accepted: 07 July 2021
Published: 22 July 2021
DOI: https://doi.org/10.1186/s12859-021-04291-5

Feature selection for RNA cleavage efficiency at specific sites using the LASSO regression model in Arabidopsis thaliana

Abstract

Background

Results

Conclusions

Background

Materials and methods

Plant material

Data processing for TREseq

Library construction for ribosome profiling

Data processing for ribosome profiling

Analysis of cleavage sites using LASSO and Ridge regression

Explanatory variables in the model

Construction of the LASSO and Ridge regression models

Results and discussion

Data processing in the LASSO regression model

Prediction of CSsite value using the LASSO regression model

Confirmation of selected features using a different model

Predicting CSsite values using only sequence information

Validation of the LASSO regression model through prediction of exogenous genes

Conclusions

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us

Prediction of CS_site value using the LASSO regression model

Predicting CS_site values using only sequence information