A multispecies polyadenylation site model

Background Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Even though most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, exhibit high deviation from this genomic structure, making the construction of a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All methods exploit the skewed nucleotide profile across the poly(A) sites, and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which increases the dimensionality of the models to crippling degrees, and typically are not validated against many kinds of genomes. Results We propose a poly(A) site model that employs minimal features to capture the essence of poly(A) sites, and yet, produces better prediction accuracy across diverse species. Our model consists of three dior-trinucleotide profiles identified through principle component analysis, and the predicted nucleosome occupancy flanking the poly(A) sites. We validated our model using two machine learning methods: logistic regression and linear discriminant analysis. Results show that models achieve 85-92% sensitivity and 85-96% specificity in seven animals and plants. When we applied one model from one species to predict poly(A) sites from other species, the sensitivity scores correlate with phylogenetic distances. Conclusions A four-feature model geared towards small motifs was sufficient to accurately learn and predict poly(A) sites across eukaryotes.


Existing poly(A) sites methods
*Names assigned by authors as no name were given in the original papers.

Feature identification by Principal Component Analysis (PCA)
For a proof of concept, we simulated 2,000 200-nt long sequences where all types of nucleotides were equiprobable. Sequence positions were labeled from -100 to +100 to order to mimic poly(A) sequences where the cleavage site is at the midpoint. We setup an uneven distribution of dinucleotides ("dimers") by tagging these sequences with designated dimers as follows: a) 'AC' was specially planted at -90 and +90 in each sequence b) Half of the sequences at +12 were tagged with 'GT', and the other half were tagged with 'AT' c) 'CG' was planted at -10 in half of the sequences d) 'TA' was planted at +52 in each sequence These sequences were first transformed into a position-by-2mer matrix (200 rows and 16 columns) before being processed by PCA using R [10] ( Figure S1A). Description of position-by-kmer matrix can be found in Materials and Methods section. Each blue dot corresponds to a position in the sequence. Its distance from the origin reflects the distribution of dimers at that position. If the distribution is skewed, implying localization, the data point will reside far from the origin. Arrows in red represent individual dimers.
Their lengths reflect the extent of localization of the dimers they represent. The longer the arrow, the higher is the localization. Blue dots pointed by the arrow indicate the most frequent locations of the dimer. The angle between arrows reflects the correlation between the pair of dimers. The smaller the angle, the higher is the correlation. Note that it is inappropriate to compare the length of an arrow with the distance of a position (blue dot) from the origin. Only positions and arrows show unusual localization are labeled ( Figure S1A). Localizations of planted dimers 'AC' and 'CG' in respective positions are noticeable. However, it is less obvious for 'CG, 'GT' and 'AT' due to their moderate localizations, which is reflected by the length of arrows. One issue faced by the typical PCA plot ( Figure S1A) is that it turns crowded when too many arrows and/or positions are considered in the case of longer kmer model and/or larger region. An alternative is to view the distance of positions (blue dots in Figure S1A) and kmers (red arrows in Figure   S1A) from the origin, namely PCA-distances, as shown Figure S1C. We call Figure S1C PCA-position profile. Each blue vertical bar denotes the PCA-distance (units in left yaxis) of each position from the origin. Black dots represent the direction pointed by corresponding positions. Direction is measured by the angle θ, called PCA-angle, as diagrammed in Figure S1B. Note that angles near to +180 and -180 should be considered to point to the similar direction. PCA-angle is measured degrees according to the y-axis on the right. The same idea is applied to display the PCA-distance of dimers as shown in Figure S1D, namely PCA-oligo profile. Black dots indicate the direction pointed by the corresponding dimers (red arrows) in Figure S1A. The five sharp peaks reveal certain dimers are localized in those positions ( Figure S1C). As shown in the PCA plot ( Figure S1A), these positions and dimers point approximately to same direction.
Thus, the common PCA-angle shared between positions in PCA-position profile and kmers in PCA-oligo profile serves as the key to identify the localized kmers in the sequences. For example, the peak in position +12 points to ~135 degree ( Figure S1C), which corresponds to 'AT', 'GT' and 'TT' in Figure S1D. 'TT' likely overlaps with 'GT' in 3' direction as its PCA-distance is slightly shorter. Hence, simulation results support the use of PCA to unravel hidden localizations in the sequences. Ta g g e d_r a n d o m se q u e n c es P C A d ist a n c e  With the human intergenic sequences, false positive rates are 57% and 84% for the trimer and hexamer models, respectively. The hexamer models acheived 88% sensitivity in predicting human poly(A) sequences, which was 3% higher than the trimer model.

PCA of Arabidopsis and Human for trimers, hexamers and octamers
However, the trimer model attained 90% specificity compared to 83% in the hexamer model.
In addition, polya_svm [5] was used to make predictions for the simulated sequences. It committed 82% false positive rate.

Receiver operating characteristic (ROC)
ROC method was used to determine the optimal threshold for predictions produced by the logistic model ( Figure S3). The ROC plot illustrates the influence of threshold on true positive rate (sensitivity) and false positive rate (specificity) for Arabidopsis and human models. According to ROC, the model achieved optimal performance when threshold was set between 0.5 and 0.6. Thus, the threshold for logistic method is set to 0.5 such that a sequence scored above 0.5 is considered as a poly(A) site; negative if otherwise. Figure S3: ROC of logistic method. Thresholds vary from 0.1 to 0.9 with increment of 0.1. Arabidopsis, and human predictions are colored in red, and blue, respectively.

Predictions in transcribed, non-poly(A) sequences
Besides poly(A) sites, we have also considered how well our methods can differentiate poly(A) sequences from other transcribed, non-poly(A) genomic sequences. Two types of sequences were used to test this point. One is the coding sequence (CDS) in which introns are spliced out. The other is the unspliced gene sequence. As shown in Table S2 below, our methods perform better than the other two methods in both types of transcribed sequences in human and Arabidopsis.

Relative contribution of features
We determined the contribution of each feature in prediction. We did that by using only a single feature each time to build the model, and then assessed its performance relative to the full model (Table S3). In humans, the features in decreasing importance are the upstream trimer, the cleavage dimer, the downstream trimer, and the nucleosome occupancy. Such order persists regardless of the ML methods. The relative contribution of features for Arabidopsis shows that the importance of the least important features are reversed compared to humans: the upstream trimer, the cleavage dimer, the nucleosome occupancy, followed by the downstream trimer. This order is the same in both ML methods.

Prediction of larger poly(A) regions
Our results thus far deal only with 600-nt long poly(A) sequences, meaning that the poly(A) site, if exists, is at the midpoint of the sequence. Here, we demonstrate that our model is capable of making prediction for sequence of any length longer than 600 nts.
According to our data, the neighboring positions (±2 nts) of a poly(A) site are usually scored high but not for spurious poly(A) sites. Therefore, we imposed a smoothing method to the scores in a 5-nt window. This method calculates the average scores within the window after dropping the highest and the lowest scores. Five hundred real poly(A) sequences, each 2,000 nts long, were sampled randomly from human, and Arabidopsis datasets. The actual poly(A) site is located in the middle of the sequence. The logistic model used a 600-nt window to scan the sequences from left to right to make predictions regarding whether the midpoint of the window was a poly(A) site or not. Figure  The results show that 8.5% of them were predicted as poly(A) sites by the logistic model.
The error rate is consistent with the result in Table 3 in the main text. The accumulation of predicted poly(A) sites per position is represented in the lower panel. Similar results were obtained for human data. The error rate is 9%, which is also consistent with the result in Table 3 in the main text.