ORFik: a comprehensive R toolkit for the analysis of translation

Tjeldnes, Håkon; Labun, Kornel; Torres Cleuren, Yamila; Chyżyńska, Katarzyna; Świrski, Michał; Valen, Eivind

doi:10.1186/s12859-021-04254-w

Software
Open access
Published: 19 June 2021

ORFik: a comprehensive R toolkit for the analysis of translation

Håkon Tjeldnes¹,
Kornel Labun¹,
Yamila Torres Cleuren^1,2,
Katarzyna Chyżyńska¹,
Michał Świrski³ &
…
Eivind Valen ORCID: orcid.org/0000-0003-1840-6108^1,2

BMC Bioinformatics volume 22, Article number: 336 (2021) Cite this article

4407 Accesses
13 Citations
12 Altmetric
Metrics details

Abstract

Background

With the rapid growth in the use of high-throughput methods for characterizing translation and the continued expansion of multi-omics, there is a need for back-end functions and streamlined tools for processing, analyzing, and characterizing data produced by these assays.

Results

Here, we introduce ORFik, a user-friendly R/Bioconductor API and toolbox for studying translation and its regulation. It extends GenomicRanges from the genome to the transcriptome and implements a framework that integrates data from several sources. ORFik streamlines the steps to process, analyze, and visualize the different steps of translation with a particular focus on initiation and elongation. It accepts high-throughput sequencing data from ribosome profiling to quantify ribosome elongation or RCP-seq/TCP-seq to also quantify ribosome scanning. In addition, ORFik can use CAGE data to accurately determine 5′UTRs and RNA-seq for determining translation relative to RNA abundance. ORFik supports and calculates over 30 different translation-related features and metrics from the literature and can annotate translated regions such as proteins or upstream open reading frames (uORFs). As a use-case, we demonstrate using ORFik to rapidly annotate the dynamics of 5′ UTRs across different tissues, detect their uORFs, and characterize their scanning and translation in the downstream protein-coding regions.

Conclusion

In summary, ORFik introduces hundreds of tested, documented and optimized methods. ORFik is designed to be easily customizable, enabling users to create complete workflows from raw data to publication-ready figures for several types of sequencing data. Finally, by improving speed and scope of many core Bioconductor functions, ORFik offers enhancement benefiting the entire Bioconductor environment.

Availability

http://bioconductor.org/packages/ORFik.

Background

Messenger RNAs (mRNAs) can be divided into three regions: the transcript leader sequence also known as the 5′ untranslated region (5′ UTR), the coding sequence (CDS), and the trailer sequence or 3′ untranslated region (3′ UTR). Eukaryotic translation is normally initiated by the binding of the 40S ribosomal small subunit (SSU) and associated initiation factors, adjacent to the mRNA 5′ cap. The SSU then proceeds to scan downstream, until it reaches a favorable initiation context at the start of an open reading frame (ORF). Here, it recruits the 60S large ribosomal unit, which together with the SSU forms the 80S elongating complex which starts translating the protein. Inside the 80S ribosome there are three binding sites that accomodate tRNA base-pairing to mRNA: the aminoacyl-site (A), the peptidyl-site (P), and the exit-site (E). Incoming aminoacyl-tRNAs enter the ribosome at the A site, binding to the mRNA codon. The peptidyl-tRNA carrying the growing polypeptide chain is held in the P site, while the E site holds deacylated tRNAs just before they exit the ribosome (Additional file 1: Fig. S1). The 80S proceeds to translate the ORF, processing it codon-by-codon by translocating 3 nucleotides (nts) for each step, until it reaches a terminating codon and the protein synthesis is complete [1].

While eukaryotic transcripts typically encode only a single protein, evidence from high-throughput methods has revealed that many 5′ UTRs contain short upstream ORFs (uORFs) that can be translated [2]. While the functional importance of uORFs is still debated, several uORFs have been found to regulate gene expression [2, 3]. This primarily occurs by hindering ribosomes from reaching the protein-coding ORF leading to translational inhibition. This demonstrates that at least a subset of uORFs is functionally important.

While translation was previously studied on a gene-by-gene basis, the introduction of ribosome profiling (ribo-seq) and later, translation complex profiling (TCP-seq) and ribosome complex profiling (RCP-seq) has made it possible to obtain a snapshot of translating and scanning ribosomes across the whole transcriptome [4,5,6]. Together with information on RNA levels and isoforms, protein-coding ORFs and uORFs can be identified and their translational levels can be quantified. Getting functional insight from sequencing data requires robust computational analysis. Ribo-seq, being a mature assay, has a number of software packages and web services designed specifically to handle it. TCP-seq [5, 7, 8] and RCP-seq [6], on the other hand, are much less supported. These methods need tools that can consider both scanning SSU and 80S elongating ribosome dynamics as well as the relationship between these.

Another complicating factor is that many genes have alternative transcription start sites (TSSs) [9]. The study of translation initiation requires accurate 5′ UTRs annotation. It is otherwise challenging to determine which uORF candidates should be included in the analysis. The choice of TSS dictates which uORFs are present in the 5′ UTR. In certain cases, uORFs are only present in specific tissues with the correct variant of the 5′ UTR [10].

To address these challenges and provide a comprehensive tool for studying translation in custom regions, we developed ORFik, a Bioconductor software package that streamlines the analysis of translation. It supports accurate 5′ UTR annotation through RNA-seq and cap analysis of gene expression (CAGE), detection and classification of translated uORFs, characterization of sequence features, and the calculation of over 30 features and metrics used in the analysis of translation (Additional file 1: Table S1).

Implementation

ORFik is implemented as an open-source software package in the R programming language, with parts in C++ to efficiently process large datasets. While tools for analyzing translation exist, none of them support comprehensive analysis of ribo-seq, TCP-seq and RCP-seq in combination with CAGE. Furthermore, many of these are either online tools [11], or are limited to studying only specific steps or aspects of translation [12]. A full comparison between functionality of related tools can be found in Additional file 1: Table S2 [13,14,15,16,17,18,19,20,21,22,23] and benchmarks comparing ORFik to related tools can be found in Additional file 1: Table S3-Table S5 and Additional file 1: Figs. S2–S4.

ORFik is highly optimized and fast. To achieve this, we have reimplemented several functions in the Bioconductor core package GenomicFeatures that are currently inefficient for larger datasets, like converting from transcript coordinates to genomic coordinates and vice versa. In addition, to aid with the ever-increasing size of datasets, we have focused on allowing faster computation of large bam files with our format “.ofst” based on the Facebook compression algorithm zstd [24, 25]. “.ofst” is a serialized format (see the section on Optimized File Format), with optional collapsing of duplicated reads, enabling near-instantaneous data loading (Fig. 1; Additional file 1: Table S5).

Overview

A typical workflow takes transcriptome/genome annotation and high-throughput sequencing data as input, processes these to make transcriptome-wide tracks (Fig. 2a), and use these to either make summary statistics for all genes or transcripts, or to characterize one or more specific transcriptomic regions. Regions can be of any type and are completely user-customizable, but typically consist of genes, 5′ UTRs, CDSs, uORFs, start codons, or similar. ORFik can then be used to calculate summary statistics and features for all candidate regions.

ORFik supports standard translation analysis: it can map reads from RNA-seq and ribo-seq, it performs trimming and P-site shifting of ribo-seq reads, quantifies ribosomal occupancy (Fig. 2b), characterizes ORFs, and creates a range of plots and predictions. In addition, it supports the analysis of translation initiation through TCP-seq, RCP-seq, and CAGE. It can quantify translation initiation through scanning efficiency and ribosome recruitment, and can correlate these with sequence elements. Overall, ORFik provides a toolbox of functions that is extremely versatile and enables the user to go far beyond standardized pipelines.

Obtaining and preprocessing data

The first step in ORFik is obtaining and preprocessing data (Fig. 2a). It automates direct download of datasets from the NGS repositories: SRA [26], ENA [27], and DRA [28], download of annotations (FASTA genome and GTF annotation) through a wrapper to biomartr [29], while also supporting local NGS datasets and annotations. The sequencing data can be automatically trimmed with fastp [30] with auto removal of adapter sequences or any user defined sequence. Adapter presets for the most common Illumina adapters are included (TruSeq, small RNA-seq and Nextera), but other sequencing platforms can be used by customizing adapter removal. Following this, ORFik can detect and remove any of the known contaminants (e.g., PhiX 174, rRNA, tRNA, non-coding RNAs) and finally align the reads using STAR [31].

Quality control

Post-alignment quality control is initiated by calling the function ORFikQC. ORFik outputs quality control plots and tables for comparisons between all runs in an experiment and the whole process is streamlined to be as fast and simple as possible for the user. Among others, plots of the meta coverage across all transcript regions and correlation plots between all pairs of samples (Fig. 2d; Additional file 1: Fig. S5) are generated. Furthermore, essential mapping statistics are calculated such as the fraction of reads mapping to important features, contaminants, and transcript regions (Fig. 2c).

Single-base resolution of transcription start sites

Many organisms have extensive variations in the use of isoforms between tissues and can often use a range of sites to initiate transcription. Since many analyses of translation depend on an accurate annotation of 5′UTRs, ORFik supports the use of CAGE data. CAGE is a high-throughput assay for the precise determination of transcription start sites (TSSs) at single-base pair resolution [32]. ORFik makes use of CAGE (or similar 5′ detection assays) to reannotate transcripts. In a typical workflow (Additional file 1: Fig. S7), ORFik identifies all CAGE peaks in promoter-proximal regions and assigns the largest CAGE peak as the TSS. This can be customized to only consider specific thresholds or exclude ambiguous TSSs that are close to the boundary of other genes.

Automatic read length determination

When performing ribo-seq it is customary to size-select a range of fragments that correspond to the size protected by a single ribosome. This is because not all isolated fragments necessarily originate from regions protected by ribosomes. Instead, these can be the product of other sources such as RNA structure, RNA binding proteins or incomplete digestion [33]. This filtering of sizes is typically performed first in the lab and then computationally. To identify which read lengths most likely originate from actual ribosome footprints (RFPs), ORFik identifies read lengths that display 3 nt periodicity over protein-coding regions, indicative of ribosome translocation (Fig. 2a; Additional file 1: Fig. S1). For each fragment length, we sum the 5′ ends of footprints mapping to the first 150 nucleotides in the CDSs of the top 10% of protein-coding genes ranked by coverage and keep lengths with at least 1000 reads. The resulting vectors are subject to discrete Fourier transform, and the fragment lengths whose highest amplitude corresponds to a period of 3 are considered to be bonafide RFPs (Additional file 1: Fig. S8). By default, all read lengths that lack this periodicity are filtered out [34].

Sub-codon resolution through P-shifting

Sequenced RFP reads span the whole region where the ribosome was situated. In many analyses, however, it is interesting to increase the resolution and determine exactly which codon the ribosome was processing when it was captured, the so-called P-site. Several methods of determining P-site location within footprints (P-shifting) have previously been developed [12, 23, 34, 35]. ORFik predicts the P-site offset per read length, taking inspiration from the Shoelaces algorithm [34]. To determine the P-site offset in the protected footprints, ORFik considers the distribution of the different read lengths over the translation initiation site (TIS) region. For each read length ORFik takes the 5′ end of all reads from all genes and sums these for each coordinate relative to the TIS. ORFik then performs a change point analysis to maximize the difference between an upstream and downstream window relative to the changepoint (Additional file 1: Fig. S9). This analysis results in the most probable offset to shift the 5′ ends of fragments protected by initiating ribosomes exactly at the TIS (Fig. 2b). The function can process any number of libraries and can supply log files and heatmaps over the start and stop codon used to verify that the P-site detection was correct (Additional file 1: Fig. S10) and users can manually override the suggested shifts. The user can choose between several formats, where the two default formats are wiggle format (wig) and ofst (for very fast loading into R).

Analysis

After preprocessing the user creates an instance of the “experiment” class. This summarizes and provides information about the data and makes it possible to analyze, plot, or create features for any number of NGS libraries. Experiments contain all relevant metadata so that naming and grouping in plots can be performed automatically. It also specifies the correct annotation and genomic sequence files for the data, enabling automatic downloading of these. To facilitate sharing between researchers the experiment class is constructed to not contain any local information like file paths. This makes it easy to send a short single script to collaborators that can perfectly replicate the entire process starting from scratch and producing identical downstream analyses and plots.

ORFik extends GenomicRanges and GenomicFeatures to the transcriptome

To facilitate the analysis of any region in the transcriptome, ORFik extends the functionality of two Bioconductor core packages: GenomicRanges and GenomicFeatures [24]. The intention of the first package was to provide robust representation and facilitate manipulation of contiguous (non-spliced) genomic intervals. GenomicFeatures, on the other hand, aimed at extending this functionality to spliced ranges and more complicated annotations, like gene models. The high-speed functions for intra- and inter-ranges manipulation of GenomicRanges are, however, inconsistent or completely lacking for the spliced ranges. ORFik extends these packages with more than 50 new functions for spliced ranges and more effectively converts between genomic and transcriptomic coordinates (Additional file 1: Table S5). ORFik also supports fully customizable subsets of transcript regions and direct subsetting of among others start codons, stop codons, transcription start sites (TSS), translation initiation sites (TIS), and translation termination sites (TTS).

ORFik also contains functions to facilitate rapid calculation of read coverage over regions. Beyond basic coverage per nucleotide, ORFik supports different coverage summaries like read length or translation frame in ORFs (Additional file 1: Table S7). All coverage functions also support the input reads as collapsed reads (merged duplicated reads with a meta column describing the number of duplicates). This greatly speeds up calculations and reduces memory consumption, especially for short-read sequencing data characterized by high duplication level (e.g., ribo-seq).

Visualizing meta coverage

Metaplots are a useful way of visualizing read coverage over the same relative region across multiple transcripts. ORFik implements these plots through a generalized syntax used for intuitive one-line functions. All plots can be extended or edited as ggplot objects (R internal graphic objects [36]). Since meta coverage and heat maps can be represented in multiple ways that emphasize different features of the data ORFik provides 14 different data transformations for metaplots; more basic ones like the sum, mean and median, and more advanced transformations like z-score (number of reads mapped to position—mean reads of region / standard deviation over region), position-normalized (number of reads mapped to position / total number of reads mapped to region), or mRNA-seq normalized (number of reads mapped to position / mRNA-seq FPKM of the entire gene) (equations in Additional file 1: Table S7). It also provides filters to avoid bias from single occurrences such as extreme peaks caused by contaminants in the data (e.g., non-coding RNAs). ORFik can filter these peaks or whole transcripts, to make plots more representative of the majority of the regions.

Identifying and characterizing open reading frames

ORFs can be split into a hierarchy based on the available evidence: (1) sequence composition and presence in a transcript, (2) active translation, measured with ribo-seq, (3) peptide product detection (4) confirmation of function. ORFik addresses the two first levels of this hierarchy.

Obtaining all possible ORFs based on sequence is accomplished through a scan of user-provided FASTA sequences to find candidate ORFs. The search is efficiently implemented using the Knuth–Morris–Pratt [37] algorithm with binary search for in-frame start codons per stop codon (Additional file 1: Table S5). It supports circular prokaryotic genomes and fast direct mapping to genomic coordinates from transcript coordinates. The provided sequence data can be any user-provided files or biostrings objects but the most straightforward approach is to obtain these from ORFiks own data loading functions. After identification, ORFs can be saved (e.g., in BED format with color codes) and loaded into a genome browser for visualization.

If based purely on the sequence, the number of ORFs in most genomes is vast [38]. When searching for uORFs or novel genes it can therefore be advantageous to move to the second step of the ORF hierarchy and also consider the ribosomal occupancy of the novel region. However, when analyzing small regions, like putative uORFs, simply observing the presence of reads will often have low predictive power. This stems from two issues: The first is a sampling problem, in that ribosomes over a short region might be transient and simply not get sampled and sequenced unless the occupancy is particularly high. The second is the aforementioned problem that not all reads originate from ribosomes. A weak ribo-seq signal alone is therefore not conclusive as evidence for translation [39].

To address this, several metrics or features have been produced that quantify how RNA fragments that originate from ribosome protection behave over verified protein-coding regions. ORFik currently supports more than 30 of these metrics that have been previously described in the literature, by us and others (Additional file 1: Table S1). ORFik also provides a wrapper function computeFeatures that produces a complete matrix of all supported metrics with one row per ORF or region. This resulting output matrix can be used to characterize specific ORFs or as input to machine learning classifiers that can be used to predict novel functional ORFs. By using a set of verified ORFs (e.g., known protein-coding sequences) the user can construct a positive set and use the features learned from them to classify the set of new candidate ORFs. Alternatively, the matrix may be exported and provided as input to other tools.

Differential expression

ORFik supports various ways of estimating differential expression or preparing data for other tools. For visualization, it supports standard plots for studying fold changes of translation and RNA, and the relationship between expression and translational efficiency (Fig. 3a). Since many tools require raw counts as input for analysis of expression [40,41,42,43], ORFik can provide count data tables through the countTable function. This can optionally collapse and merge replicates or normalize data to FPKM values. These count tables can be supplied to DESeq2, anota2seq, and other tools to perform differential expression analysis [40, 44]. Alternatively, an implementation of the deltaTE algorithm [45] is also included in ORFik, which can detect translational regulation between conditions using a Wald test for statistical significance (Fig. 3b).

Results

To illustrate the functionality of ORFik we show two use cases, where we (1) study translation initiation with profiling of scanning SSUs, and (2) make a simple pipeline to predict and characterize translated uORFs.

Using ORFik to analyze translation initiation

When analyzing data that includes small subunit profiling the goal is often to understand or characterize the regulation of translation initiation. ORFik was here applied to a TCP-seq data set from HeLa cells [8], and easily produces metaprofiles over transcripts, separated into scanning SSU and 80S elongating ribosomes (Fig. 4a). As expected, the SSU scanning complexes display the highest coverage in 5′UTRs, while elongating ribosomes are enriched over the CDSs. To obtain accurate measurements of scanning over 5′UTRs, CAGE data was used to reannotate the TSSs. The effect on the coverage of SSU complexes around the TSS as a result of this reannotation can be viewed in Fig. 4b. These heatmaps have been transcript-normalized, meaning that all the reads mapping to one transcript have been normalized to sum to 1. This has the effect of weighing each transcript equally, but ORFik supports a range of other normalization methods that emphasize different aspects of the data (Additional file 1: Fig. S11). The increased coverage and sharper delineation of the TSS that can be seen in these plots illustrate that CAGE-reannotation is important even in well-annotated transcriptomes like the human. Accurate TSS is also important when studying features at the start of the transcript such as the presence of a TOP motif [47]. TOP motifs are known to be involved in the regulation of specific genes and ORFik can correlate such features to other observations of the transcript. An example of such an analysis is associating motifs with the scanning efficiency (SE) — the number of scanning ribosomes on the 5′UTR relative to the RNA abundance [5]. In our latest study, we showed an association between ribosome recruitment and TOP motifs during early zebrafish development [6]. ORFik easily produces a similar analysis for the HeLa cells revealing no such relationship in these cells.

Beyond recruitment, an important part of translation initiation is the recognition of the start codon. This is facilitated by the context surrounding it and ORFik provides several ways of exploring these features. At the level of basic visualization, heatmaps similar to those for TSS can be produced to explore the conformations and periodicity around the start codon (Additional file 1: Fig. S10). Provided with small subunit profiling, the initiation rate (IR)—the number of 80S elongating ribosomes relative to SSU scanning [5] — can also be calculated. Together with the sequence, this allows for investigating which initiation sequences lead to the most productive elongation (Fig. 4d, e).

Detecting translated upstream open reading frames

To illustrate an example of a search for novel translated regions, we applied ORFik to the problem of discovering differential uORF usage across three developmental stages of zebrafish embryogenesis: 12 h post-fertilization (hpf), 24 hpf, and 48 hpf [46]. Using public ribo-seq, RNA-seq [46], and CAGE [50] data we used ORFik to derive stage-specific 5′UTRs and searched these for all ATGs and near-cognate start codons. ORFik supports a number of features that have been shown to correlate with translation activity (Additional file 1: Table S1). We used ORFik to calculate all these metrics per stage for: (1) all uORF candidates, (2) all CDSs of known protein-coding genes, and (3) random regions from the 3′ UTRs. We used these to train a random forest model on each stage with the H2O R-package [51] using the CDSs as a positive set and the 3′UTR regions as the negative set (Additional file 1: Supplementary Note 1). This model was used to predict translational activity over the uORF candidates resulting in 8191 unique translated uORFs across all 3 stages (12hpf: 4042, 24hpf: 2178, 48hpf: 2676). Of these, only 133 are translated in all stages (Fig. 5a). Given the stringency of our training set which consists of long, translated proteins (positive), contrasted with short non-translated 3′UTR regions (negative) this should be considered a very conservative prediction. More nuanced predictions could be achieved by tuning these sets to better represent the typically shorter, more weakly translated uORFs.

For the predicted translated uORFs a clear difference in the usage of start codons can be seen relative to all possible candidate uORFs by sequence (Fig. 5b). While purely sequence-predicted uORFs show a relatively uniform distribution in start codon usage, a clear bias towards ATG followed by CTG can be seen for the translated uORFs. Since the classifier favors uORFs that start with start codons known to be strong initiators, despite not using start codon sequence as a feature, suggest that it is able to identify uORFs that are actively translated. Since the classifier is trained on CDSs, the resulting uORFs also have features that resemble known protein-coding genes (Fig. 5c).

Summary

In summary, we have developed ORFik, a new API and toolkit for streamlining analysis of ORFs and translation. ORFik introduces hundreds of tested, documented, and optimized methods to analyze and visualize ribosome coverage over transcripts. It supports a range of data formats and can be used to create complete pipelines from read processing and mapping to publication-ready figures. We demonstrate its use on a transcriptome-wide study of translation initiation and quick annotation of translated uORFs. Together, this empowers users to perform complex translation analysis with less time spent on coding, allowing the user to focus on biological questions.

Availability of data and materials

Code for figures, tables and alignment of data: https://github.com/Roleren/ORFik_article_code. The datasets analysed during the current study will be downloaded automatically when running the scripts above. They can also be found in the respective sequence read archives referenced in Additional file 1: Table S9.

Availability and requirements

Project name: ORFik

Project home page: http://bioconductor.org/packages/ORFik

Archived version: ORFik

Operating system(s): Platform independent

Programming language: R

Other requirements: R version ≥ 3.6, Bioconductor version ≥ 3.10

License: MIT+ file LICENSE

Any restrictions to use by non-academics: none.

Abbreviations

ORF:: Open reading frame
CDS:: Coding sequence (main ORF of mRNA)
5′ UTR:: 5′ Untranslated region (also called transcript leader)
3′ UTR:: 3′ Untranslated region (also called transcript trailer)
uORF:: Upstream open reading frame
NGS:: Next-generation sequencing
RFP:: Ribosome-protected footprint (also referred to as RPF)
P-, E-, A-site:: TRNA binding sites in the ribosome (peptidyl, exit and aminoacyl)
P-shifting:: Relative 5′ end shifting of Ribo-seq reads to their estimated P-site
HPF:: Hours post-fertilization
TSS:: Transcription start site
TIS:: Translation initiation site
SSU:: Small subunit (ribosomal)
80S:: 80S elongating ribosome

References

Jackson RJ, Hellen CUT, Pestova TV. The mechanism of eukaryotic translation initiation and principles of its regulation. Nat Rev Mol Cell Biol. 2010;11:113.
Article CAS PubMed PubMed Central Google Scholar
David R. Morris APG: upstream open reading frames as regulators of mRNA translation. Mol Cell Biol. 2000;20:8635.
Article Google Scholar
Barbosa C, Peixeiro I, Romão L. Gene expression regulation by upstream open reading frames and human disease. PLoS Genet. 2013;9:66.
Article CAS Google Scholar
Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009;324:218–23.
Article CAS PubMed PubMed Central Google Scholar
Archer SK, Shirokikh NE, Beilharz TH, Preiss T. Dynamics of ribosome scanning and recycling revealed by translation complex profiling. Nature. 2016;535:570–4.
Article CAS PubMed Google Scholar
Giess A, Torres Cleuren YN, Tjeldnes H, Krause M, Bizuayehu TT, Hiensch S, Okon A, Wagner CR, Valen E. Profiling of small ribosomal subunits reveals modes and regulation of translation initiation. Cell Rep. 2020;31:107534.
Article CAS PubMed Google Scholar
Wagner S, Herrmannová A, Hronová V, Gunišová S, Sen ND, Hannan RD, Hinnebusch AG, Shirokikh NE, Preiss T, Valášek LS. Selective translation complex profiling reveals staged initiation and co-translational assembly of initiation factor complexes. Mol Cell. 2020;79:546.
Article CAS PubMed PubMed Central Google Scholar
Bohlen J, Fenzl K, Kramer G, Bukau B, Teleman AA. Selective 40S footprinting reveals cap-tethered ribosome scanning in human cells. Mol Cell. 2020;79:66.
Article CAS Google Scholar
de Klerk E, de Klerk E. ‘t PA: Alternative mRNA transcription, processing, and translation: insights from RNA sequencing. Trends Genet. 2015;31:128–39.
Article PubMed CAS Google Scholar
Kurihara Y, Makita Y, Kawashima M, Fujita T, Iwasaki S, Matsui M. From the Cover: Transcripts from downstream alternative transcription start sites evade uORF-mediated inhibition of gene expression in Arabidopsis. Proc Natl Acad Sci USA. 2018;115:7831.
Article CAS PubMed PubMed Central Google Scholar
Liu Q, Shvarts T, Sliz P, Gregory RI. RiboToolkit: an integrated platform for analysis and annotation of ribosome profiling data to decode mRNA translation at codon resolution. Nucleic Acids Res. 2020;48:W218–29.
Article CAS PubMed PubMed Central Google Scholar
Lauria F, Tebaldi T, Bernabò P, Groen EJN, Gillingwater TH, Viero G. riboWaltz: optimization of ribosome P-site positioning in ribosome profiling data. PLoS Comput Biol. 2018;14:e1006169.
Article PubMed PubMed Central CAS Google Scholar
Verbruggen S, Ndah E, Van Criekinge W, Gessulat S, Kuster B, Wilhelm M, Van Damme P, Menschaert G. PROTEOFORMER 2.0: further developments in the ribosome profiling-assisted proteogenomic hunt for new proteoforms. Mol Cell Proteomics. 2019;18:S126–40.
Article CAS PubMed PubMed Central Google Scholar
Legrand C, Tuorto F. RiboVIEW: a computational framework for visualization, quality control and statistical analysis of ribosome profiling data. Nucleic Acids Res. 2019;48:e7–e7.
Article PubMed Central CAS Google Scholar
Legendre R, Baudin-Baillieu A, Hatin I, Namy O. RiboTools: a Galaxy toolbox for qualitative ribosome profiling analysis. Bioinformatics. 2015;31:2586–8.
Article CAS PubMed Google Scholar
Ozadam H, Geng M, Cenik C. RiboFlow, RiboR and RiboPy: an ecosystem for analyzing ribosome profiling data at read length resolution. Bioinformatics. 2020;36:2929–31.
Article CAS PubMed PubMed Central Google Scholar
Tyler WH. Backman TG: systemPipeR: NGS workflow and report generation environment. BMC Bioinfor. 2016;17:66.
Article CAS Google Scholar
Perkins P, Mazzoni-Putman S, Stepanova A, Alonso J, Heber S. RiboStreamR: a web application for quality control, analysis, and visualization of Ribo-seq data. BMC Genomics. 2019;20:422.
Article PubMed PubMed Central CAS Google Scholar
Michel AM, Mullan JPA, Velayudhan V, O’Connor PBF, Donohue CA, Baranov PV. RiboGalaxy: a browser based platform for the alignment, analysis and visualization of ribosome profiling data. RNA Biol. 2016;13:316–9.
Article PubMed PubMed Central Google Scholar
Calviello L, Sydow D, Harnett D, Ohler U: Ribo-seQC: comprehensive analysis of cytoplasmic and organellar ribosome profiling data.
Dunn JG, Weissman JS. Plastid: nucleotide-resolution analysis of next-generation sequencing and genomics data. BMC Genomics. 2016;17:958.
Article PubMed PubMed Central CAS Google Scholar
Chung BY, Hardcastle TJ, Jones JD, Irigoyen N, Firth AE, Baulcombe DC, Brierley I. The use of duplex-specific nuclease in ribosome profiling and a user-friendly software package for Ribo-seq data analysis. RNA. 2015;21:1731–45.
Article CAS PubMed PubMed Central Google Scholar
RiboProfiling: a Bioconductor package for standard Ribo-seq pipeline processing. PubMed—NCBI. https://www.ncbi.nlm.nih.gov/pubmed/27347386.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol. 2013;9:e1003118.
Article CAS PubMed PubMed Central Google Scholar
Zstandard - Fast real-time compression algorithm. https://github.com/facebook/zstd. Accessed 20 May 2020.
Leinonen R, Sugawara H, Shumway M. International nucleotide sequence database collaboration: the sequence read archive. Nucleic Acids Res. 2011;39:D19-21.
Article CAS PubMed Google Scholar
Amid C, Alako BTF, Balavenkataraman Kadhirvelu V, Burdett T, Burgin J, Fan J, Harrison PW, Holt S, Hussein A, Ivanov E, Jayathilaka S, Kay S, Keane T, Leinonen R, Liu X, Martinez-Villacorta J, Milano A, Pakseresht A, Rahman N, Rajan J, Reddy K, Richards E, Smirnov D, Sokolov A, Vijayaraja S, Cochrane G. The European Nucleotide Archive in 2019. Nucleic Acids Res. 2019;48:D70–6.
PubMed Central Google Scholar
Nakamura Y, Kodama Y, Saruhashi S, Kaminuma E, Sugawara H, Takagi T, Okubo K. DDBJ sequence read archive/DDBJ omics archive. Nat Proc. 2010;4:1.
Google Scholar
Drost H-G, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics. 2017;66:btw821.
Article CAS Google Scholar
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90.
Article PubMed PubMed Central CAS Google Scholar
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15.
Article CAS PubMed Google Scholar
Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, Kodzius R, Watahiki A, Nakamura M, Arakawa T, Fukuda S, Sasaki D, Podhajska A, Harbers M, Kawai J, Carninci P, Hayashizaki Y. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci USA. 2003;100:15776.
Article CAS PubMed PubMed Central Google Scholar
Fremin BJ, Bhatt AS. Structured RNA contaminants in bacterial Ribo-Seq. mSphere. 2020;5:66.
Article Google Scholar
Birkeland Å, ChyŻyńska K, Valen E. Shoelaces: an interactive tool for ribosome profiling processing and visualization. BMC Genomics. 2018;19:66.
Article CAS Google Scholar
Ahmed N, Sormanni P, Ciryam P, Vendruscolo M, Dobson CM, O’Brien EP. Identifying A- and P-site locations on ribosome-protected mRNA fragments using Integer Programming. Sci Rep. 2019;9:66.
Article CAS Google Scholar
Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org. Accessed 20 May 2020.
Knuth DE, Morris JH Jr, Pratt VR. Fast pattern matching in strings. SIAM J Comput. 1977;6:323–50.
Article Google Scholar
Mir K, Neuhaus K, Scherer S, Bossert M, Schober S. Predicting statistical properties of open reading frames in bacterial genomes. PLoS ONE. 2012;7:66.
Google Scholar
Xu Z, Hu L, Shi B, Geng S, Xu L, Wang D, Lu ZJ. Ribosome elongating footprints denoised by wavelet transform comprehensively characterize dynamic cellular translation events. Nucleic Acids Res. 2018;46:109.
Article CAS Google Scholar
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
Article PubMed PubMed Central CAS Google Scholar
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
Article CAS PubMed Google Scholar
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–97.
Article CAS PubMed PubMed Central Google Scholar
Li W, Wang W, Uren PJ, Penalva LOF, Smith AD. Riborex: fast and flexible identification of differential translation from Ribo-seq data. Bioinformatics. 2017;33:1735–7.
Article CAS PubMed PubMed Central Google Scholar
Oertlin C, Lorent J, Murie C, Furic L, Topisirovic I, Larsson O. Generally applicable transcriptome-wide analysis of translation using anota2seq. Nucleic Acids Res. 2019;47:e70.
Article CAS PubMed PubMed Central Google Scholar
Chothani S, Adami E, Ouyang JF, Viswanathan S, Hubner N, Cook SA, Schafer S, Rackham OJL. deltaTE: detection of translationally regulated genes by integrative analysis of Ribo-seq and RNA-seq data. Curr Protoc Mol Biol. 2019;129:e108.
Article CAS PubMed Google Scholar
Bazzini AA, Johnstone TG, Christiano R, Mackowiak SD, Obermayer B, Fleming ES, Vejnar CE, Lee MT, Rajewsky N, Walther TC, Giraldez AJ. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 2014;33:981–93.
Article CAS PubMed PubMed Central Google Scholar
Iadevaia V, Caldarola S, Tino E, Amaldi F, Loreni F. All translation elongation factors and the e, f, and h subunits of translation initiation factor 3 are encoded by 5′-terminal oligopyrimidine (TOP) mRNAs. RNA. 2008;14:1730.
Article CAS PubMed PubMed Central Google Scholar
Grzegorski SJ, Chiari EF, Robbins A, Kish PE, Kahana A. Natural variability of Kozak sequences correlates with function in a Zebrafish model. PLoS ONE. 2014;9:108475.
Article CAS Google Scholar
Kozak M. An analysis of 5′-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987;15:8125–48.
Article CAS PubMed PubMed Central Google Scholar
Nepal C, Hadzhiev Y, Previti C, Haberle V, Li N, Takahashi H, Suzuki AMM, Sheng Y, Abdelhamid RF, Anand S, Gehrig J, Akalin A, Kockx CEM, van der Sloot AAJ, van Ijcken WFJ, Armant O, Rastegar S, Watson C, Strähle U, Stupka E, Carninci P, Lenhard B, Müller F. Dynamic regulation of the transcription initiation landscape at single nucleotide resolution during vertebrate embryogenesis. Genome Res. 2013;23:1938–50.
Article CAS PubMed PubMed Central Google Scholar
H2O.ai (Oct. 2016). R Interface for H2O, R package version 3.10.0.8. https://github.com/h2oai/h2o-3. Accessed 20 May 2020.

Download references

Acknowledgements

We would like to thank the Bioconductor core team and community for providing valuable feedback for developing ORFik.

Funding

This work was supported by the Trond Mohn Foundation, the Research Council of Norway (#250049) and the Norwegian Cancer Society (Project #190290). M.S. was supported by the foundation for Polish Science co-financed by the European Union under the European Regional Development Fund [TEAM POIR.04.04.00-00-5C33/17-00]. The funding bodies played no role in study and provided no input to it.

Author information

Authors and Affiliations

Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway
Håkon Tjeldnes, Kornel Labun, Yamila Torres Cleuren, Katarzyna Chyżyńska & Eivind Valen
Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway
Yamila Torres Cleuren & Eivind Valen
Institute of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Warsaw, Poland
Michał Świrski

Authors

Håkon Tjeldnes
View author publications
You can also search for this author in PubMed Google Scholar
Kornel Labun
View author publications
You can also search for this author in PubMed Google Scholar
Yamila Torres Cleuren
View author publications
You can also search for this author in PubMed Google Scholar
Katarzyna Chyżyńska
View author publications
You can also search for this author in PubMed Google Scholar
Michał Świrski
View author publications
You can also search for this author in PubMed Google Scholar
Eivind Valen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

KL, HT and EV conceived the idea for ORFik. HT, MS and KL wrote the code. YTC, KC and EV have contributed to the code and functionality. All authors wrote and approved the final manuscript.

Corresponding author

Correspondence to Eivind Valen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Supplementary file for ORFik: a comprehensive R toolkit for the analysis of translation

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Tjeldnes, H., Labun, K., Torres Cleuren, Y. et al. ORFik: a comprehensive R toolkit for the analysis of translation. BMC Bioinformatics 22, 336 (2021). https://doi.org/10.1186/s12859-021-04254-w

Download citation

Received: 18 January 2021
Accepted: 09 June 2021
Published: 19 June 2021
DOI: https://doi.org/10.1186/s12859-021-04254-w

ORFik: a comprehensive R toolkit for the analysis of translation

Abstract

Background

Results

Conclusion

Availability

Background

Implementation

Overview

Obtaining and preprocessing data

Quality control

Single-base resolution of transcription start sites

Automatic read length determination

Sub-codon resolution through P-shifting

Analysis

ORFik extends GenomicRanges and GenomicFeatures to the transcriptome

Visualizing meta coverage

Identifying and characterizing open reading frames

Differential expression

Results

Using ORFik to analyze translation initiation

Detecting translated upstream open reading frames

Summary

Availability of data and materials

Availability and requirements

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us