Improving the prediction of mRNA extremities in the parasitic protozoan Leishmania
© Smith et al. 2008
Received: 07 November 2007
Accepted: 20 March 2008
Published: 20 March 2008
Skip to main content
© Smith et al. 2008
Received: 07 November 2007
Accepted: 20 March 2008
Published: 20 March 2008
Leishmania and other members of the Trypanosomatidae family diverged early on in eukaryotic evolution and consequently display unique cellular properties. Their apparent lack of transcriptional regulation is compensated by complex post-transcriptional control mechanisms, including the processing of polycistronic transcripts by means of coupled trans-splicing and polyadenylation. Trans-splicing signals are often U-rich polypyrimidine (poly(Y)) tracts, which precede AG splice acceptor sites. However, as opposed to higher eukaryotes there is no consensus polyadenylation signal in trypanosomatid mRNAs.
We refined a previously reported method to target 5' splice junctions by incorporating the pyrimidine content of query sequences into a scoring function. We also investigated a novel approach for predicting polyadenylation (poly(A)) sites in-silico, by comparing query sequences to polyadenylated expressed sequence tags (ESTs) using position-specific scanning matrices (PSSMs). An additional analysis of the distribution of putative splice junction to poly(A) distances helped to increase prediction rates by limiting the scanning range. These methods were able to simplify splice junction prediction without loss of precision and to increase polyadenylation site prediction from 22% to 47% within 100 nucleotides.
We propose a simplified trans-splicing prediction tool and a novel poly(A) prediction tool based on comparative sequence analysis. We discuss the impact of certain regions surrounding the poly(A) sites on prediction rates and contemplate correlating biological mechanisms. This work aims to sharpen the identification of potentially functional untranslated regions (UTRs) in a large-scale, comparative genomics framework.
Leishmania is a unicellular eukaryote that belongs to the Trypanosomatidae family; a strictly parasitic order of kinetoplastida. Leishmania is the causative agent of leishmaniases, vector-borne parasitic diseases with a large spectrum of clinical manifestations in humans ranging from self-resolving skin lesions to life-threatening visceral diseases . Leishmaniasis is endemic in 88 countries mainly in tropical and subtropical regions with an estimated 12 million people presently infected worldwide and at least 350 million people being at risk of infection .
Trypanosomatid protozoan parasites have diverged early on in eukaryotic evolution . Their evolutionary closeness to bacterial ancestors is delineated by intrinsic cellular characteristics such as tandem arranged genes , polycistronic transcription [5, 6], mitochondrial RNA editing , lack of transcriptional control , infrequent introns , and trans-splicing . The latter consists of the 5' cleavage of polycistronic RNA precursors into individual mRNA transcripts by addition of an exogenous 39 to 41 base long capped RNA fragment, namely the splice leader (SL) or mini-exon, provided by a highly abundant SL-RNA , yet similar processes have also been discovered in nematodes and even in mammals [12, 13]. This process is somewhat similar to cis-splicing in other organisms, as RNA is cleaved at an AG dinucleotide downstream of a polypyrimidine stretch.
In addition to co-transcriptional trans-splicing, polyadenylation of the upstream transcript is also required in order to generate monocistronic mRNAs in these organisms. Trypanosomatid protozoa are believed to lack a conserved polyadenylation (poly(A)) signal, in contrast to other eukaryotes who generally require a cytoplasmic polyadenylation motif for successful polyadenylation . Several studies support that polyadenylation is mechanistically coupled to trans-splicing and that it depends upon the presence of polypyrimidine tracts [15–19], thus leading to the belief that the spliceosome complex interacts with the polyadenylation machinery in trypanosomatids. It has also been reported that distant pyrimidine tracts may be responsible for polyadenylated positions further away from the downstream 5' splice site in trypanosomes [17, 20]. These analyses also convey the non-specific nature of poly(A) site selection in trypanosomatids, as polyadenylation seems to occur in a given region rather than at a specific position.
The apparent heterogeneity of kinetoplastid mRNA polyadenylation and its dependence on successful trans-splicing make 3'-untranslated region (3'UTR) length predictions troublesome. Currently, there exists a 3'UTR prediction method for Trypanosoma brucei derived from the statistical analysis of mRNA transcript extremity lengths from expressed sequence tag (EST) data . The prediction is essentially obtained by selecting the position located at an empirical distance (100 bases) upstream of the polypyrimidine tract closest to the open reading frame (ORF). The authors claim a 38% prediction rate within a 73-nucleotide window. These metrics are somewhat inappropriate for predictions in the Leishmania genus since the species flaunt larger intergenic (IR) sequences, higher average UTR lengths, and less stringent splice acceptors [4, 21].
In addition to the statistical analysis of transcript length distributions for 3'UTRs, 5'UTR prediction has been submitted to supplementary investigation [22–24]. Prediction algorithms that essentially focus on selecting the first AG dinucleotide after the longest polypyrimidine stretch can reportedly identify 62% of valid splice junctions in trypanosomes and 51% in Leishmania [20, 23]. For Leishmania, it has been shown that by fragmenting the non-coding sequence upstream of a start codon at every occurrence of AG, the AG following the longest fragment corresponds to a valid splice junction in 60% of the cases. When combining this method with a linear discriminant analysis of dinucleotide composition, the later method can obtain a prediction accuracy as high as 92% on selected high-scoring sequences .
Considering that regulation of gene expression in kinetoplastids occurs mostly at the post-transcriptional level, it has become apparent that UTRs bear essential regulatory tags [8, 25–29]. From the standpoint of computational motif discovery, it is imperative to discriminate between functional and non-functional sequences in order to successfully identify novel conserved regulatory regions. This premise is most important when dealing with non-coding RNA as it is exposed to less stringent evolutionary pressure than open reading frames . It can be expected that limiting sequence and structure motif searches to legitimate mRNA UTRs will generate more informative results while reducing the inherent computational cost of search algorithms.
This paper aims to further improve the in-silico prediction of mRNA extremities in kinetoplastid organisms. We polish trans-splicing prediction in Leishmania by incorporating the pyrimidine content of intergenic regions into a previously developed scoring function, and propose a polyadenylation prediction method based on the global nucleotide composition observed in published expressed sequence tag (EST) data. The selection of different genomic regions surrounding the poly(A) site and their impact on prediction rates has validated the impact of adenosines and downstream polypyrimidines on trypanosomatid polyadenylation.
Previously, the best method to predict splice acceptor sites in trypanosomatids combined statistical analysis of dinucleotide composition with inter-AG fragment length assessment . We simplified the procedure by discarding the statistical discrimination of inter-AG fragments based on dinucleotide composition, thus only considering the inter-AG fragment size for predictions. This approach was compared to two pyrimidine-bias scoring functions that rate inter-AG segments in proportion to their pyrimidine content in addition to their size (see Methods). Both functions are such that inter-AG fragments displaying lower than average pyrimidine content are proportionately penalised whereas those with higher than average content are rewarded.
Splice junction prediction sensitivities of three different scoring models.
< 25 nt
Longest Inter-AG Length
Inter-AG Length + LDA*
Of the 12,052 Leishmania EST sequences in GenBank, 81% correspond to L. infantum and 19% to L. major cDNA. We filtered the data to collect sequences harbouring significant poly-A or poly-T stretches near their extremities in order to search for polyadenylation signals. Only 850 sequences (7% of initial data) satisfied our search constraints (see Methods) of which a mere 218 (1.8%) were successfully mapped to genomic intergenic regions of L. infantum (the accession numbers for the 218 ESTs can be viewed in Additional File 1). The L. infantum EST data contains several flagrantly erroneous and repeated sequences. Comparing the pair-wise identity of mapped ESTs revealed 4 pairs of highly similar sequences which, once aligned, proved to be the only example of alternatively polyadenylated sequence in our data (GenBank accession IDs: CV669949.1, CV670417.1, CV668181.1, CV665773.1, CV670284.1, CV668879.1, CV667130.1, CV661593.1).
Global statistics of genomic sequences in Leishmania infantum.
Capturing such blatant genomic signals in addition to more discrete parameters, like progressive shifts in nucleotide and dinucleotide compositions, could be an effective means of identifying poly(A) sites in unresolved sequences. Such a comparative approach is appealing since conserved sequence motifs surrounding poly(A) sites in trypanosomatid species are not as common as in higher eukaryotes. Using motif detection programs such as MEME  did not yield conclusive results (data not shown). Indeed, the intergenic regions of Leishmania parasites are riddled with low-complexity regions (i.e., short consecutive repeats of 1–3 nucleotides) that can bias the scoring metrics of such programs. To surmount this shortcoming, we investigated over-represented motifs in the regions directly surrounding genomic poly(A) sites in Leishmania using the word enumeration program YMF  in combination with FindExplanator . Hexamers that are over-represented in the regions directly flanking known genomic poly(A) sites were compared to those found in more distant regions (see Additional File 2 for details). The highest-ranking motifs are present in only a fraction of all known poly(A) sites and appear to be randomly distributed within their vicinity (data not shown).
We converted the genomic alignment into a position specific scoring matrix (PSSM) that can subsequently be used to scan non-coding sequences. The PSSM is aligned to every position within the intergenic sequence and emits a bit-score for each position (see Methods). The higher the score, the closer the current position in the intergenic sequence resembles the global composition of a polyadenylated region. We present the depicted prediction rates of a given PSSM as a measure of its sensitivity, or ability to detect valid poly(A) sites. Since the biological data is limited, sensitivity was determined using tenfold cross-validation (see Methods) which allows for unbiased testing, as the testing data is excluded from the training data . The position displaying the highest PSSM score is retained as the poly(A) candidate.
Accuracy of mRNA extremity predictions.
< 25 nt
< 10 nt
< 100 nt
The 5' splice junction prediction methods disclosed in this work were conceived to estimate trans-splicing sites for all input sequences using a simple and effective metric. Since pyrimidines play an important role in trans-splicing, including such a parameter into the inter-AG splice prediction model was forthright and can be warranted by the subsequent increase in sensitivity. Although rather effective, the inter-AG metric's principal hoodwink resides in its synthetic nature, as the underlying biological process is difficult to conceive. The assessment of polypyrimidine tract length was not considered in this work as it has been shown that the inter-AG metric is more powerful . Even if our splice junction prediction results are encouraging, some uncertainty subsists when testing on unconfirmed sequences. This may potentially be a consequence of the parasitic nature of trypanosomatids, which coerces these protozoa to alternate between different life-stages depending on their insect and mammalian host. An additional level of complexity may be essential to improve in-silico predictions in view of the fact that trans-splicing of certain transcripts is developmentally regulated in trypanosomes [36, 37].
When compared to previously published trans-splicing prediction rates , the models we propose here appear to be just as effective at predicting known trans-splicing sites when tested on the same search space (Table 1). Their accuracy remains significant even when increasing the query sequence size (1.75× increase in search space at the cost of 0.9× accuracy). The augmented search space is in order to ensure that the full inter-AG fragments upstream of putative splice sites are considered. Overlapping into the downstream coding sequence is vindicated by erroneous genome annotations; it is not uncommon that the furthest in-frame ATG is selected as a start codon. Also, our scoring function rates all inter-AG fragments, unlike the previously proposed study that selects high-scoring fragments based upon their dinucleotide composition . As shown in Table 3, a scoring threshold can be implemented to ensure that few false-positives are unsuitably identified as splice-junctions at the cost of slightly lower specificity. However, a threshold will necessarily neglect certain sequences, which may be objectionable when dealing with few or essential queries. Since our method is more dependent on correct annotations, it is conceivable that coupling it to linear discriminant analysis would generate even better predictions at the cost of higher complexity.
Predicting poly(A) sites with PSSM's have previously been shown to successfully predict poly(A) sites in humans . Capturing the global nucleotide composition surrounding known poly(A) sites and utilizing it as a comparative predictor has also proven to be a successful prediction procedure in Leishmania. Albeit the public EST data appears to be of questionable quality, stringent screening has permitted to reveal specific polyadenylated sequence traits. Given the nature of the sequence data, smaller mRNA transcripts may be favoured and this should be considered when analyzing results. Nonetheless, PSSM scanning is more than 10 times more effective at identifying poly(A) sites than the distance-only approach when precision is fundamental (Figures 2A and 4). This result can be interpreted as evidence that distance is not as powerful for targeting poly(A) sites in Leishmania than in trypanosomes.
For Leishmania, precision may not be essential when predicting 3'UTR extremities given that several mappings display heterogeneous poly(A) positions . This observation motivates the use of an error margin, which is interpreted as lowering the resolution of sensitivity testing in this work. Allowing correct predictions to be within a certain range of the mapped position emulates the identification of a polyadenylation region. We also tested a window scanning approach, where the cumulative bit-scores for a given range were averaged over the size of the window instead of considering each position independently. Such an approach yielded weaker overall predictions than the position-specific approach (data not shown), perhaps because the extent of polyadenylation regions varies among different transcripts.
The best 3'UTR predictions emanate from the grouping of distance limitation and scanning with dual PSSMs. Combining both metrics proved to be more effective than either one individually (Figures 2, 4, and 5), a result that hints at the importance of each factor when predicting poly-A sites in Leishmania. For restraining PSSM scanning, we tested various distances instead of using a specific confidence interval since spacer sequences display somewhat of a bias towards longer fragments. Although the data is partially derived from estimations, such a shift in the distribution supports the notion that polyadenylation does not occur randomly in Leishmania. Poly(A) sites further away from the splice junction may be an effect of distant polypyrimidine tracts, a situation that has already been observed in trypanosomes . One must also consider that the longer non-coding regions in Leishmania may contain non-annotated genes or provide alternative stage-specific polyadenylation sites, which could explain the longer spacer sequences. These are considerations that motivated the exclusion of intergenic sequences longer than 5000 nucleotides for sensitivity testing.
To our knowledge, no other method can predict poly(A) sites as effectively in Leishmania spp. as the one described in this work. Even enforcing a highly-selective threshold only faintly affects this method's specificity (Table 3). The rather unusual and non-specific nature of kinetoplastid polyadenylation is a line of reasoning to substantiate low computational prediction rates. Although over-represented A-rich hexamer motifs are found (Additional File 2), these are not however present in all the genomic poly(A) sites, which suggests that they may not play a central role in driving polyadenylation in Leishmania. In addition, the genomic alignment of polyadenylated EST mappings cannot be used to mark out a precise consensus sequence, as it is impossible to distinguish the exact cleavage site among multiple consecutive adenosines on the unprocessed transcript. The heterogeneity of poly(A) sites in Leishmania mRNA transcripts is extra incentive for using PSSMs that embody a global trend in nucleotide composition. Furthermore, neglecting secondary structure and stage-specificity are additional factors that make it difficult to conceive obtaining higher prediction accuracies at this point.
Notwithstanding the possibility that no consensus motif drives polyadenylation in kinetoplastids, there is evidence for a biological model based on sequence context. The low sensitivity obtained from a poly(A) prediction algorithm based on spacing metrics alone is an evidence for a more dynamic biological model. Also, the correlation between certain regions of the genomic alignment and their respective prediction rates is most interesting, as best illustrated by the sensitivity surface plots (Figure 2). The data is presented in order to asses the innate characteristics that have an impact on poly(A) targeting.
Two main common sequence features appear to directly influence the prediction sensitivities. Firstly, the adenosine-rich region within close range to the mapped poly(A) site. Secondly, the pyrimidine-rich region 300 to 600 positions downstream. The latter, which represents the polypyrimidine tracts known to be crucial for trans-splicing, generates the best predictions when loosening the accuracy and bounding the search space. In turn, the A-rich region is responsible for the best predictions when precision is fundamental. Interestingly, the affluence of polypyrimidines (most notably thymines) in the -50 to -25 region (Figure 1) may play a role in 3'UTR cleavage since its exclusion from scanning matrices reduces the sensitivity at close range (Figure 2). The matrix encoding the sequence information of zero upstream bases and 25 downstream (0A25) is somewhat futile at predicting poly(A) sites, a rather surprising observation seeing as the adenosine concentration is comparable. Upon closer inspection, it is apparent that adenosine-rich regions are not a fundamental marker because many sequences do not contain profuse adenosine residues at their poly(A) site.
PSSMs can be regarded as a simplistic representation of the interaction between an enzymatic complex and a strand of nucleic acids. The highest scoring position corresponds to a region that is most similar to the consensus of all poly(A) sites, which relates to a high affinity region for the polyadenylation complex. In this perspective and based on our results, it is enticing to contemplate a generic biological model where adenosine richness (possibly contrasted by a pyrimidine-rich upstream region) helps to direct the polyadenylation of specific positions downstream of polypyrimidine tracts in unprocessed mRNA transcripts. Deletion studies directed at these features followed by mapping the modified transcript's poly(A) site could shed additional light into the biological process. Moreover, in-vitro UV cross-linking could help identifying novel ribonucleoproteins (RNPs) that might interact with the trans-splicing/polyadenylation complexes.
The computational tools we describe in this work have been implemented in a small JAVA program named PRED-A-TERM (PREDicting poly(A) sites and TERMinal splice junctions) that can be downloaded from Additional File 4. It emits poly(A) and trans-splicing predictions from intergenic sequence input with partial coding sequence overlap and allows end-users the possibility to select various prediction parameters. The program is tuned for L. infantum but is suitable for other Leishmania species. Although trypanosomes have shorter average intergenic regions than Leishmania, both share similar trans-splicing machinery [39, 40]. Scanning Trypanosoma IRs will however, require additional sequence analysis and subsequent tuning of the model.
We present a simplified 5'UTR prediction function that can predict more than 65% of known trans-splicing sites within 25 nucleotides. This approach performs as good as previously published methods but it significantly reduces computational cost. We also present a novel 3'UTR prediction method for the trypanosomatid Leishmania that compares query sequences to known polyadenylated sequences using position specific scanning matrices. Such an approach is capable of predicting almost 50% of known poly(A) sites within 100 nucleotides, thus doubling the accuracy of the previous distance based approach. The final algorithm implemented in PRED-A-TERM is summarized in Figure 6.
By increasing the precision of large-scale transcriptome predictions in trypanosomatids, the prospective identification of novel regulatory non-coding RNA structures is now within reach. The relatively recent fervour for investigating regulatory functions of non-coding RNA has propelled the emergence of multiple structural RNA detection algorithms [41, 42]. These modern computational methods combined with biological validation could facilitate the discovery of innovative targets for therapeutic treatments.
After aligning the EST data to the genome, we extracted 500 nucleotides upstream of the coding sequence associated to the EST and the first 200 nucleotides downstream of the annotated start codon. Trans-splicing predictions are based upon the most recently published method . Sequences are fragmented at every occurrence of "AG" and each fragment's size is calculated. In the simplest scoring scheme, the longest inter-AG fragment is retained and the sequence's final position is considered as a splice junction candidate. Linear and polynomial pyrimidine bias models calculate the relative pyrimidine concentration of inter-AG fragments and modify each fragment's score proportionately using the following functions:L = λ + 150•λ•δP = λ + 150•λ•δ3
where L and P are the linear and polynomial model scores respectively, λ is the inter-AG fragment length, and δ corresponds to the difference between the pyrimidine concentration of the inter-AG fragment and the average intergenic concentration (55%). In both cases, the last position of the highest scoring inter-AG fragment is retained as the putative splice junction. Optimal coefficients were determined by trial and error testing. Sensitivity testing was performed on the same 214 EST sequences from Leishmania major as reported in that article.
Leishmania sequences for the poly(A) analysis were downloaded from GenBank's expressed sequence tag (EST) public database . Data were filtered to retain sequences having at least 12 adenine (A) or thymine (T) residues at their 3' or 5' end, respectively. Poly-T sequences were subsequently reverse-complemented. Leishmania infantum sequences were aligned to the genome (version 3 downloaded from ) using BLAST with low-complexity filtering disabled . Hits over 100 nucleotides long that displayed over 95% sequence identity were retained. We define an EST sequence as being polyadenylated if it satisfies the following criteria: (i) The last position of the best BLAST hit must immediately precede the poly(A) stretch. (ii) There should be no more than 9 "A" residues out of the next 12 genomic nucleotides following the BLAST hit. (iii) The last alignment match must not be a "N" in the genomic or EST sequence. The polyadenylation site is defined as the last non-"A" residue shared between the EST extremity and the genomic sequence. The full list of polyadenylated EST accession numbers can be viewed in Additional File 1. All filtering steps were achieved using ad-hoc JAVA scripts.
The genomic sequences of the polyadenylated ESTs were aligned and anchored at the mapped poly(A) site, as previously defined. From this alignment, we calculated the specific nucleotide composition for each position relative to the poly(A) site. The resulting nucleotide frequencies were divided by their corresponding average genomic frequency (A = 20.1%, T = 20.2%, C = 29.7%, G = 30.0%) to create an odds matrix. The final position specific scoring matrix (PSSM) was obtained by log-transforming the odds matrix to generate bit scores for each matrix entry.
The genomic intergenic regions (IRs) associated to the retained ESTs were extracted and extended 600 bases past the stop and start codons of the most recent L. infantum genome annotation (version 3). Only IRs inferior or equal to 5000 bases in length were retained. IRs of interest were scanned with PSSM sizes ranging from 1 to 300 upstream and 1 to 600 downstream of the poly(A) location. When scanning the entire IR, query sequences are scanned such that the position corresponding to the anchored poly(A) site in the PSSM is aligned to the first non-coding position downstream of the stop codon; at this point, the upstream matrix positions overlap the ORF. A cumulative bit-score is emitted for each given position and this step is repeated for every position of the intergenic sequence (the positions downstream of the matrix's poly(A) position may overlap the ORF when scanning the last positions). The positions with the highest scores are retained as putative polyadenylation sites. When predicting a polyadenylation region instead of a single position, the cumulative individual bit-scores are averaged over the length of the region scanned. The optimal prediction algorithm is summarized in Figure 6.
The prediction accuracies presented in this work arise from cross-validation sensitivity testing, where the polyadenylated EST data are divided into 10 subsets. Nine of those are used as a training set (in this case, to build a PSSM) which are subsequently tested on the left-over subset. This step is repeated for all subsets and the results are averaged to obtain the mean sensitivity. The average and standard deviation of 15 runs of cross-validation were performed for PSSM scanning and 30 runs for distance-only predictions. All testing was performed using ad-hoc JAVA scripts.
BP is a member of a Canadian Institute of Health Research (CIHR) Group on Host-Pathogen Interactions and of a Fonds Québecois de la Recherche sur la Nature et les Technologies (FQRNT) Center for Host-Parasite Interactions. This work was supported by an operating grant from the CIHR (MOP-12182) awarded to BP.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.