Computational approach for calculating the probability of eukaryotic translation initiation from ribo-seq data that takes into account leaky scanning

Background
Ribosome profiling (ribo-seq) provides experimental data on the density of elongating or initiating ribosomes at the whole transcriptome level that can be potentially used for estimating absolute levels of translation initiation at individual Translation Initiation Sites (TISs). These absolute levels depend on the mutual organisation of TISs within individual mRNAs. For example, according to the leaky scanning model of translation initiation in eukaryotes, a strong TIS downstream of another strong TIS is unlikely to be productive, since only a few scanning ribosomes would be able to reach the downstream TIS. In order to understand the dependence of translation initiation efficiency on the surrounding nucleotide context, it is important to estimate the strength of TISs independently of their mutual organisation, i.e. to estimate with what probability a ribosome would initiate at a particular TIS.

Results
We designed a simple computational approach for estimating the probabilities of ribosomes initiating at individual start codons using ribosome profiling data. The method is based on the widely accepted leaky scanning model of translation initiation in eukaryotes which postulates that scanning ribosomes may skip a start codon if the initiation context is unfavourable and continue on scanning. We tested our approach on three independent ribo-seq datasets obtained in mammalian cultured cells.

Conclusions
Our results suggested that the method successfully discriminates between weak and strong TISs and that the majority of numerous non-AUG TISs reported recently are very weak. Therefore the high frequency of non-AUG TISs observed in ribosome profiling experiments is due to their proximity to mRNA 5′-ends rather than their strength. Detectable translation initiation at non-AUG codons downstream of AUG codons is comparatively infrequent. The leaky scanning method will be useful for the characterization of differences in start codon selection between tissues, developmental stages and in response to stress conditions.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0380-4) contains supplementary material, which is available to authorized users.

We wished to incorporate the inhibitory effect of ORF length on downstream initiation into our LS method (equation (4) in Results, main text) for TISs that belong to overlapping ORFs and TISs that belong to the same ORF. Our strategy was to incorporate artificial distance starts, TIS_Di, between TISs in an mRNA with TIS_1, TIS_2, TIS_3, ..., TIS_k, TIS_u, as follows:

TIS_1, TIS_D1, TIS_2, TIS_D2, TIS_3, TIS_D3, ..., TIS_k, TIS_u    (5)

(following on from equation (4) in Results, main text), where the artificial starts TIS_Di represent scanning ribosomes that have disassociated between TIS_i and TIS_{i+1}, and D_i represents the nucleotide distance between TIS_i and TIS_{i+1}.
The probability of ribosomes disassociating between TIS_i and TIS_{i+1} should positively correlate with the distance D_i (the longer the ORF of TIS_i, the more likely scanning ribosomes will encounter elongating ribosomes and be forced to disassociate from the mRNA before reaching TIS_{i+1}).
The probability of scanning ribosomes disassociating should also correlate positively with the number of footprint reads for TIS_i: the more ribosomes that initiate at TIS_i, the higher the density of elongating ribosomes between TIS_i and TIS_{i+1}, and consequently, the more likely scanning ribosomes will encounter elongating ribosomes and disassociate from the mRNA.
Hence, the probability of disassociation, P_di, at TIS_Di should approach 1 when D_i and R_i (the absolute number of footprint reads at TIS_i) approach infinity. Likewise, P_di should approach 0 when D_i and R_i approach 0 (there can be no loss of scanning ribosomes if no ribosomes initiate at TIS_i). We propose the following function, which satisfies the above criteria:

$$P_{di} = 1 - k^{D_i R_i} \qquad (6)$$

where k is a parameter that can range from 0 to 1. Note that k=1 gives P_di = 0 for every distance and is therefore equivalent to not taking the distance between TISs into account. A suitable value for k can be determined by fitting different values for k to the data. However, in order to do this, the number of disassociated scanning ribosomes (R_di) for each artificial distance start (TIS_Di) needs to be incorporated into our approach.
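The limiting behaviour described above can be checked numerically. The following is a minimal sketch, assuming the dissociation probability has the form 1 - k^(D_i R_i) that the text substitutes for P_di from (6); the function name and argument names are hypothetical and are not taken from the authors' software.

```python
def dissociation_probability(k: float, distance_nt: float, reads: float) -> float:
    """Assumed form of the dissociation probability: P_di = 1 - k**(D_i * R_i).

    k           -- distance parameter, expected in [0, 1]
    distance_nt -- D_i, nucleotide distance between TIS_i and TIS_{i+1}
    reads       -- R_i, absolute number of footprint reads at TIS_i
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k is expected to lie between 0 and 1")
    if distance_nt < 0 or reads < 0:
        raise ValueError("distance and read count must be non-negative")
    return 1.0 - k ** (distance_nt * reads)


if __name__ == "__main__":
    # k = 1 switches the distance correction off; smaller k increases the loss.
    for k in (1.0, 0.99999, 0.9999):
        print(k, dissociation_probability(k, distance_nt=300, reads=100))
```

Because the exponent is a product of nucleotides and read counts, it is typically large, so in this sketch only values of k very close to 1 give probabilities that are not effectively 1.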
The number of disassociated scanning ribosomes is equal to the product of the probability of scanning ribosomes disassociating and the number of available ribosomes:

$$R_{di} = P_{di} \sum_{s = D_i}^{u} R_s \qquad (7)$$

where the sum over s starts from TIS_Di (it does not include footprint reads from the previous TIS_i).
We do not know the number of scanning ribosomes that can potentially disassociate at each artificial distance start TIS_Di. However, we can use the absolute number of footprints detected at each TIS_i in the data to express R_di. The simplest scenario of two TISs (TIS_1, TIS_2), with artificial distance start TIS_D1 and 3' artificial start TIS_u, will be used to illustrate the estimation of R_d1. From (7),

$$R_{d1} = P_{d1}\,(R_{d1} + R_2 + R_u) \quad\Rightarrow\quad R_{d1} = \frac{P_{d1}\,(R_2 + R_u)}{1 - P_{d1}}$$

Substituting 1 - k^{D_i R_i} for P_di from (6), with i = 1, we get

$$R_{d1} = \frac{\left(1 - k^{D_1 R_1}\right)(R_2 + R_u)}{k^{D_1 R_1}}$$

Extending this to the general case:

$$R_{di} = \frac{1 - k^{D_i R_i}}{k^{D_i R_i}} \sum_{s > D_i} R_s$$

where the sum runs over all starts, real and artificial, downstream of TIS_Di. Having an estimate (based on the number of actual footprint reads at each detected TIS) of the number of disassociated scanning ribosomes R_di for each TIS_Di, we can then calculate the probabilities P_i for each TIS_i in an mRNA using our LS method (equation (4) in Results, main text), now including the estimated number of disassociated ribosomes R_di for each artificial distance start TIS_Di, for an mRNA with TIS_1, TIS_D1, TIS_2, TIS_D2, TIS_3, TIS_D3, ..., TIS_k, TIS_u.
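The calculation described in this paragraph can be sketched in code. The sketch rests on two assumptions that are not spelled out in this supplementary text: that equation (4) of the main text has the form P_i = R_i / (R_i + reads at all downstream starts), and that R_di = P_di × (R_di + reads at all downstream starts) with P_di = 1 - k^(D_i R_i). The function `ls_with_distance` and its arguments are hypothetical names chosen for illustration.

```python
from typing import List, Sequence, Tuple


def ls_with_distance(
    reads: Sequence[float],      # R_1..R_n, footprint reads at each detected TIS (5' to 3')
    distances: Sequence[float],  # D_1..D_{n-1}, nucleotides between consecutive TISs
    r_u: float,                  # reads assigned to the 3' artificial start TIS_u
    k: float = 1.0,              # distance parameter; k = 1 disables the correction
) -> Tuple[List[float], List[float]]:
    """Return (P_i for each TIS, R_di for each artificial distance start)."""
    n = len(reads)
    if n == 0 or len(distances) != n - 1:
        raise ValueError("need n reads and n - 1 distances")
    if not 0.0 < k <= 1.0 or r_u <= 0:
        raise ValueError("expected 0 < k <= 1 and r_u > 0")

    probs = [0.0] * n
    r_d = [0.0] * (n - 1)
    tail = float(r_u)  # reads at every start downstream of the current position
    for i in range(n - 1, -1, -1):  # walk from the 3' end towards the 5' end
        if i < n - 1:
            q = k ** (distances[i] * reads[i])  # q = 1 - P_di
            if q == 0.0:
                raise ValueError("k**(D_i*R_i) underflowed; use k closer to 1")
            # Solving R_di = P_di * (R_di + tail) for R_di (assumed relation).
            r_d[i] = (1.0 - q) / q * tail
            tail += r_d[i]
        # Assumed LS form: reads at this TIS over reads here and downstream.
        probs[i] = reads[i] / (reads[i] + tail)
        tail += reads[i]
    return probs, r_d


if __name__ == "__main__":
    print(ls_with_distance([100.0, 50.0], [200.0], r_u=10.0, k=1.0))
    print(ls_with_distance([100.0, 50.0], [200.0], r_u=10.0, k=0.9999))
```

Working from the 3' end means each R_di only needs the running total of downstream reads. Under these assumptions the correction lowers P_1 and leaves the probability of the last TIS unchanged.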
The question remains as to what value of k is suitable. To estimate k for the different datasets, we used single-isoform transcripts with two TISs and no in-frame stop codon between TIS_1 and TIS_2, and set R_u (3' artificial TIS) equal to the minimum TIS detection threshold used in each study (Methods). For different values of k, we generated simple linear regressions of the ratios P_1/P_2 onto the corresponding nucleotide distances between TIS_1 and TIS_2 for the transcripts considered. We compared these regression slopes (blue plots in Supplementary Figure S6) with the slopes obtained from regressing P_1/P_2 onto the distances between TIS_1 and TIS_2 where distance is not accounted for (equivalent to k=1) (red plots in Supplementary Figure S6).
The motivation and assumptions for this are explained below:
1. We assume that the further TIS_2 is from TIS_1, the more scanning ribosomes are lost, which should result in an overestimation of P_1 and an underestimation of P_2. That is, P_1/P_2 increases with distance (positive upward slope) (see slopes in the red plots in Supplementary Figure S6).
2. We assume that TISs of different strengths are distributed randomly in the mRNA.
3. If probabilities are estimated accurately there should be no correlation between the probability of initiation at the second codon and the distance between TIS_1 and TIS_2. This suggests that if the distance factor is correctly accounted for, the slope of the curve of P_1/P_2 ratios regressed onto the distances between TIS_1 and TIS_2 should become close to 0.
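The selection of k by slope can be illustrated as follows. This is a simplified sketch, not the authors' procedure: it assumes the two-TIS forms P_2 = R_2 / (R_2 + R_u) and P_1 = R_1 / (R_1 + R_d1 + R_2 + R_u) with R_d1 derived from P_d1 = 1 - k^(D_1 R_1), uses ordinary least squares on unbinned distances, and picks the candidate k whose slope is closest to zero. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (R_1, R_2, distance in nucleotides between the TISs).
Transcript = Tuple[float, float, float]


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the ordinary least-squares line of y regressed onto x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical")
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx


def p1_over_p2(r1: float, r2: float, dist: float, r_u: float, k: float) -> float:
    """P_1/P_2 for a two-TIS transcript under the assumed forms above."""
    tail = r2 + r_u            # reads downstream of the artificial distance start
    q = k ** (dist * r1)       # q = 1 - P_d1
    if q == 0.0:
        raise ValueError("k**(D_1*R_1) underflowed; use k closer to 1")
    p1 = r1 / (r1 + tail / q)  # tail / q equals R_d1 + R_2 + R_u
    p2 = r2 / tail
    return p1 / p2


def ratio_slope(transcripts: Sequence[Transcript], r_u: float, k: float) -> float:
    xs: List[float] = [t[2] for t in transcripts]
    ys: List[float] = [p1_over_p2(t[0], t[1], t[2], r_u, k) for t in transcripts]
    return ols_slope(xs, ys)


def choose_k(transcripts: Sequence[Transcript], candidates: Sequence[float], r_u: float) -> float:
    """Candidate k whose regression slope is closest to zero."""
    return min(candidates, key=lambda k: abs(ratio_slope(transcripts, r_u, k)))


if __name__ == "__main__":
    toy = [(100.0, 50.0, 100.0), (120.0, 40.0, 300.0), (150.0, 30.0, 500.0)]
    for k in (1.0, 0.999999, 0.99999):
        print(k, ratio_slope(toy, r_u=10.0, k=k))
```

The toy values are made up for illustration only. With real data the distances could first be grouped into bins before regression, which this sketch does not do.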
The following values for k were found to redress the slopes nearer to zero (slopes in blue plots compared to the slopes in the red plots): Human (Lee et al. [...] The corresponding probability distribution plots using these values of k (Supplementary Figure S6), however, do not show any improvement in discriminating the strength of initiation of AUG TISs from CUG TISs compared to when the distance between TISs is not taken into account (equivalent to k=1).

Exploration of the effects of varying the distance parameter k on Kozak context discrimination.
The effect of accounting for the distances between TISs on the discrimination of Kozak contexts was investigated. However, as can be seen in Supplementary Figure S7, incorporating a distance factor had little effect for the three ribo-seq datasets analysed. The slopes of the regression curves when the distance between TISs is not considered (equivalent to k=1) (blue plots) are steeper than the slopes for lower values of the distance parameter k (green plots with different values for k).
Nevertheless, our LS method with a default distance parameter value of k=1 provides better discrimination of Kozak contexts compared to the regression slopes obtained when Kozak context scores are regressed onto the probabilities obtained using the PAS method (red plots).

Supplementary Figures

Figure S1. The frequency of each codon as the first or the last TIS in an mRNA. The distributions were generated using Lee et al. [...]

[...] (4) in Results, main text) is applied. A 3' artificial start value of R_u = 0.05 R_LTM-R_CHX was used. B. Generated for Mouse (Lee et al. [4] data) with the same description as panel A. C. Generated for Mouse (Ingolia et al. [3] data) with the same description as panel A except for different cumulative #Harringtonine footprint coverage thresholds and a 3' artificial start value of R_u = 50 #Harr FPs.

Figure S5. Regression curves to determine the optimum value for the distance factor k. A. The red plots provide the slopes obtained from regressing P_1/P_2 onto the distances between TIS_1 and TIS_2 where distance is not accounted for (equivalent to k=1). The blue plots provide the slopes when the ratios P_1/P_2, obtained with different values for k, are regressed onto the distances between starts. The distances are in bins of 100 nucleotides. The plots are generated for Human (Lee et al. [4] data) and a 3' artificial start value of R_u = 0.05 R_LTM-R_CHX was used. B. Generated for Mouse (Lee et al. [4] data) with the same description as panel A. C. Generated for Mouse (Ingolia et al. [3] data) with the same description as panel A except that a 3' artificial start value of R_u = 50 #Harr FPs was used.

[...] A.1, B.1, C.1. Probability scores are calculated using the LS approach when the distance between TISs is not considered (equivalent to k=1). A.2, B.2, C.2. Probability scores are calculated using the LS approach with the distance factor k that best redressed the regression slope as described in Supplementary Text S1. Transcripts with two TISs without an in-frame stop codon between the first TIS and second TIS were used. A 3' artificial start value of R_u = 0.05 R_LTM-R_CHX was used for the Lee et al. [4] data and a 3' artificial start value of R_u = 50 #Harr FPs was used for the Ingolia et al. [3] data.

Figure S7. Exploration of the effects of varying the distance parameter k on Kozak context discrimination. A. The red plot provides the slope obtained from regressing Kozak context scores onto initiation probability scores estimated using the proportion of footprints for the TISs (PAS method, equation (1) in Results, main text). The blue plot provides the slope for the LS approach (equation (4), main text) when distance is not accounted for (equivalent to k=1). The slopes in the green plots are for different values of the distance parameter k using the LS method (equation (4)). The plots are generated for Human (Lee et al. [4] data) and a 3' artificial start value of R_u = 0.05 R_LTM-R_CHX was used. B. Generated for Mouse (Lee et al. [4] data) with the same description as panel A. C. Generated for Mouse (Ingolia et al. [3] data) with the same description as panel A except that a 3' artificial start value of R_u = 50 #Harr FPs was used.

Figure S10
TIS initiation probability browser tracks in GWIPS-viz (http://gwips.ucc.ie/). Visualization of two AUG TISs for the PTEN gene in mouse from GWIPS-viz (generated from Lee et al. [4] data). As can be seen from the reading frames, the first AUG TIS originates from a uORF (green bars represent AUGs and red bars represent stops). The second AUG TIS denotes the annotated start codon for PTEN. An alternative translation initiation site at an upstream CUG codon in-frame with the canonical AUG translation initiation codon [27] was not detected under the conditions of the Lee et al. [4] study.