The effects of multiple features of alternatively spliced exons on the KA/KS ratio test

Background The evolution of alternatively spliced exons (ASEs) is of primary interest because these exons are suggested to be a major source of functional diversity of proteins. Many exon features have been suggested to affect the evolution of ASEs. However, previous studies have relied on the KA/KS ratio test without taking into consideration information sufficiency (i.e., exon length > 75 bp, cross-species divergence > 5%) of the studied exons, leading to potentially biased interpretations. Furthermore, which exon feature dominates the results of the KA/KS ratio test and whether multiple exon features have additive effects have remained unexplored. Results In this study, we collect two different datasets for analysis – the ASE dataset (which includes lineage-specific ASEs and conserved ASEs) and the ACE dataset (which includes only conserved ASEs). We first show that information sufficiency can significantly affect the interpretation of relationship between exons features and the KA/KS ratio test results. After discarding exons with insufficient information, we use a Boolean method to analyze the relationship between test results and four exon features (namely length, protein domain overlapping, inclusion level, and exonic splicing enhancer (ESE) frequency) for the ASE dataset. We demonstrate that length and protein domain overlapping are dominant factors, and they have similar impacts on test results of ASEs. In addition, despite the weak impacts of inclusion level and ESE motif frequency when considered individually, combination of these two factors still have minor additive effects on test results. However, the ACE dataset shows a slightly different result in that inclusion level has a marginally significant effect on test results. Lineage-specific ASEs may have contributed to the difference. Overall, in both ASEs and ACEs, protein domain overlapping is the most dominant exon feature while ESE frequency is the weakest one in affecting test results. Conclusion The proposed method can easily find additive effects of individual or multiple factors on the KA/KS ratio test results of exons. Therefore, the system can analyze complex conditions in evolution where multiple features are involved. More factors can also be added into the system to extend the scope of evolutionary analysis of exons. In addition, our method may be useful when orthologous exons can not be found for the KA/KS ratio test.


Background
Alternative splicing (AS) is suggested to be a mechanism to relax selection pressure [1][2][3]. It allows generation of different transcript/protein isoforms from the same genes, leading to increased functional diversity of the proteome. The evolution of alternatively spliced exons (ASEs) has been a topic of extensive studies. A number of exon features have been reported to influence the evolutionary rates of ASEs, such as length [4] and inclusion level (defined as the fraction of ESTs that include a certain exon) [2,[5][6][7][8][9][10][11]. Previous studies used the non-synonymous to synonymous substitution rate (K A /K S ) ratio test to evaluate the relationships between exon features and evolutionary rates of ASEs. The K A /K S ratio test is frequently utilized to examine the evolutionary rates of ASEs. Coding exons are regarded under strong negative selection if they pass the test (i.e. K A /K S ratio significantly smaller than one [12,13]). Therefore, the proportion of ASEs that fail the test (i.e. failing-test exon proportion, or FTE proportion) can be used to indicate the strength of selection pressure and the level of amino acid changes normalized by synonymous substation rate. It was suggested that exons ≤ 75 bp or has ≤ 5% nucleotide substitution rate between species (collectively we call these two exon types "non-applicable exons") may contain insufficient information, rendering the K A /K S ratio test powerless [12]. However, previous studies did not take into account the limitations of the K A /K S ratio test. Since ASEs tend to be short in length and have small genetic distances [14][15][16][17][18], a large portion of ASEs is "non-applicable". Inclusion of non-applicable ASEs may result in high FTE proportion [12] and lead to questionable inferences of ASE evolution. Therefore, it is necessary to re-examine the relationships between inclusion level/length and evolutionary rates of ASEs using only "applicable" exons.
Furthermore, the evolutionary rates of ASEs may be simultaneously affected by multiple factors. The relative strength of individual factors and additive effects of multiple factors on ASE evolution have not been systematically explored. Two factors other than length and inclusion level may also affect the evolution of ASEs: protein domain overlapping and frequency of exonic splicing enhancers (ESEs). Domain overlapping is important because functional domains are suggested to be under strong selection pressure [1,17,19,20]. Meanwhile, ESEs are cis-regulatory elements that regulate pre-mRNA splicing [18,21,22]. The conservation of ESE motifs in ASEs supposedly would reduce the evolutionary rates in these exons.
In this study, we would like to address the following questions: (1) Which of the four factors stated above has the greatest effect on the evolutionary rates of ASEs? (2) Are there additive effects between these factors? (3) What are the combinations of these factors that make ASEs most conserved? We first examine whether non-applicable exons affect the interpretations of the exon feature-FTE proportion relationships. Then we design a Boolean function combined with the Karnaugh map [23] to represent the evolutionary effects of combinational factors (or multiple factors) and to determine which conditions have dominant (or powerless) impacts on the FTE proportion of ASEs. Furthermore, since splicing patterns may differ between human and mouse, the mouse orthologues of human ASEs can be in fact constitutively spliced exons (CSEs). Therefore, we collected two datasets -the ASE dataset and the ACE (conserved ASEs or alternative conserved exons defined in [17]) dataset -for testing the effects of exon features on evolution (see Methods for more details).

Non-applicable exons significantly affect the relationship between FTE proportion and inclusion level/length of ASEs
The basic features of studied human CSEs (4630 exons) and ASEs (508 major form exons and 270 non-majorform exons, see Methods for definition) are listed in Table  1. It appears that ASEs include more short exons and more low-divergence exons than CSEs (P-value < 0.01 by Fisher's exact test). Therefore, it is clear that ASEs include a much higher proportion of non-applicable exons than CSEs (P-value < 10 -6 ).
The FTE proportions of CSEs, major-form ASEs, and nonmajor-form ASEs are shown in Figure 1a. When all exons are considered, inclusion level has a clear negative relationship with FTE proportion. Both the FTE proportions of major-form ASEs and non-major-form ASEs are significantly higher than that of CSEs (P-value < 10 -9 and P-value < 10 -17 , respectively). We then try to exclude non-applicable exons and re-analyze the new dataset. It appears that exclusion of these exons decreases the FTE proportions in all three exon types while the negative relationship between inclusion level and FTE proportion remains (Fig.  1A). However, the relationship is remarkably weakened (P-values > 0.05 for both CSE vs. major-form and majorform vs. non-major-form). Also worth noting is that the FTE proportions in all three exon types have fallen short of 9% (the genome-wide average of FTE proportion [12]), indicating that even for non-major-from exons, the K A /K S ratio test is an adequate prediction tool as long as nonapplicable exons are excluded. In addition, Figure 1A also shows that non-applicable exons have a very high proportion of FTEs, even for CSEs (20% of non-applicable CSEs fail the test). Surprisingly, more than 40% of non-applicable ASEs fail the K A /K S ratio test. The differences in FTE proportion between overall and applicable exons are highly significant regardless of inclusion level (P-value < 10 -4 for all three exon types). Overall, the inclusion level of ASEs appears to be weakly associated with the strength of selection pressure. The weakened relationship between inclusion level and percentage of FTEs when non-applicable exons are excluded implies that inclusion level may not be the most important factor that affects evolutionary rates of ASEs.
On the other hand, FTE proportion reduces remarkably when the lengths of ASEs exceed 100 bp ( Figure 1B). The difference in FTE proportion between short ASEs (≤ 100 bp) and longer ASEs (> 100 bp) is highly significant (Pvalue < 0.001) even when non-applicable exons are excluded. Furthermore, the FTE proportion is as high as 75% for ASEs with length ≤ 50 bp (data not shown). In comparison, only 11% of short, applicable ASEs fail the test ( Figure 1B). The difference in FTE proportion between all short exons (31%) and short, applicable exons (11%) is highly significant (P-value < 10 -5 ). Therefore, although the negative relationship between length and FTE proportion remains significant after excluding non-applicable exons, the applicability of exons still has remarkable influences on the FTE proportions of ASEs, particularly for short ASEs. Overall, our results reveal that interpretations of the relationships between FTE proportion and exon features without excluding non-applicable exons may be misleading. As a result, the following analysis includes only applicable exons.

Influences of protein domain and ESE motif frequency on the FTE proportions of ASEs
Since functional protein domains are usually under selection pressure [1,17,19,20], ASEs overlapping protein domains may be well conserved.  Figure 2A shows that ASEs that overlap with protein domains have a much lower FTE proportion than those that do not (6% vs. 24%, P-values < 10 -4 by the two-tailed Fisher's exact test). Therefore, our results suggest that protein domain overlapping has a significant effect on the evolution of ASEs.
Meanwhile, we used three packages: ESEfinder [28], RES-CUE-ESE [29][30][31], and PESE [32,33], to identify ESEs in the studied ASEs. Figure 2B illustrates that the mean ESE motif frequencies identified by the three packages in passing-test and failing-test ASEs are somewhat different. However, none of the three differences is significant (all P > 0.1 by the Mann-Whitney U-test). Consequently, the observation implies that ESE motif frequencies are only weakly related to FTE proportions of ASEs. This unexpected result may have resulted from overestimation of ESE motifs by the ESE prediction programs [34]. Alternatively, it is also possible that ESEs in fact are not under very strong selection pressure because intronic regulatory 101~200 >200

Influences of single and multiple factors on evolutionary rates of ASEs
So far we have discussed the influences of single factors on the evolution of ASEs. It is of interest to explore the following questions: (i) which one of the four factors mentioned above has the largest influence on FTE proportion; and (ii) how do combinations of these factors affect FTE proportion. For simplicity, we denote these four factors as where the definitions of low/high ESE motif frequency is given in Methods. Table 2 illustrates the impact of each single factor on the FTE proportions of ASEs. It reveals that short exons have the highest proportion of FTEs (11.2%), followed by exons that do not overlap protein domains (8.8%), nonmajor-form exons (7.3%), and lastly by exons with low ESE motif frequency (6.4%). Factors A and B are significantly related to FTE proportion (P < 0.001, all tests used in the section are the two-tailed Fisher's exact test), whereas factors C and D are not (P > 0.1). To determine which of A and B impacts the K A /K S ratio more, we compared the FTE proportions of two different combinations of the two factors. Table 2 shows that long exons without overlapping protein domains have a slightly higher FTE proportion than short exons that overlap protein domains (5.8% vs. 4.0%). However, the difference is not significant (P > 0.1). Therefore, we suggest that length and protein domain overlapping have similar impacts on FTE proportions of ASEs.  (6), which is significantly larger than that of all applicable exons (15.1% vs. 5.1%, P < 0.01). Here the Karnaugh map is employed because it can analyze the interactions of up to six factors [23]. In contrast, the ANOVA analysis can hardly yield reliable results while analyzing interactions of more than three factors. If we use a more stringent threshold and set T as 2.5, then we can obtain a     In summary, our results indicate that length and protein domain overlapping are the two most important factors that affect evolutionary rates of ASEs. The two factors are correlated because longer exons have a higher probability of overlapping protein domains. Nevertheless, the additive effect of the two factors implies that length has some evolutionary effects that are irrelevant with protein domain (and vice versa). Intuitively, longer exons may include more regulatory signals than shorter ones, thus making them subject to stronger selective pressure. However, our results also show that ESEs have only minor effects on the evolutionary rates of ASEs. Although this may have resulted form inaccurate ESE predictions or limited ESE conservation, it appears likely that other regulatory signals (e.g. post-translational modification sites) and protein size/structure also play an important role in ASE evolution. Meanwhile, our results also imply that protein domain has evolutionary effects independent of exon length. This is understandable because very minor modifications in protein domains can result in dramatic structural/functional changes. Therefore, short, domainoverlapping exons that pass the K A /K S ratio test may be good targets for structural analysis for protein function inferences.
Meanwhile, our results also indicate that multiple exon features are interrelated in affecting evolutionary rates of ASEs. The observation implies that these exon features may be functionally correlated. Therefore, evolutionary and functional studies of ASEs should take into consideration the effects of multiple factors. In this sense, the method proposed in this study is a handy tool, for it not only distinguishes relative strength between factors, but also delineates combinational effects of multiple factors.

Effects of exon features on ACEs
It is noteworthy that the ASEs analyzed so far are defined using human splicing patterns. Therefore, these ASEs in fact include both ACEs (73 out of 778 ASEs, ~ 10%) and lineage-specific ASEs (i.e., human-mouse orthologous exon pairs that are observed to be skipping in human but to be constitutive in mouse). To further explore the effects of exon features on ASE evolution, we retrieved ACEs from the ASD database [37] for the K A /K S ratio test. As shown in Table 3, ACEs have somewhat different evolutionary features from ASEs (Table 2). Similar to the results obtained from the ASE dataset, domain overlapping remains the most important, while ESE frequency insignificant in affecting the FTE proportion of ACEs. However, in the new dataset the influence of length is only marginally significant. Moreover, inclusion level also has a marginal effect on FTE proportion, which is different from the ASEbased results. These differences may have resulted from lineage-specific ASEs. Since these ASEs occurred after speciation events, they might have been under different selective pressure from that imposed on ACEs. Furthermore, we found that the ESE frequencies are positively related to inclusion level of ACEs, though the relationship is not significant [see Additional file 1].
It is surprising that the frequency of ESEs does not affect the FTE proportion in both datasets. ESEs are suggested to be enriched in CSEs [34], which have been shown to be under stronger selective pressure than ASEs [9,38,39]. It implies that exons with low K A /K S ratios have a higher frequency of ESEs. Therefore, ESE frequency is expected to be lower in FTEs than in passing-test exons. However, this scenario may be oversimplified for the following reasons. Firstly, other regulatory elements, including ESS (exonic splicing silencer), ISE (intronic splicing enhancer), and ISS (intronic splicing silencer), also participate in splicing regulation. Therefore, ESEs may not be the only (or the most important) factor that determines skipping of exons.

Conclusion
In this study, we evaluate the relative strength and combinational effects of multiple exon features on evolutionary rates of ASEs. We have reached the following conclusions: Firstly, non-applicable exons will bias the exon featureevolutionary rate relationship and should be excluded from analysis. Secondly, length and protein domain overlapping individually, and also additively, have the greatest effects on ASE evolutionary rates. However, ACEs shown a somewhat different trend in that inclusion level and length both have a marginally significant effect on FTE proportion. The difference possibly has resulted from differences between lineage-specific ASEs and ACEs. Thirdly, ESE motif frequency is likely to be the weakest of the four factors studied. This is surprising because ESEs are regarded important for alternative splicing regulation and evolutionarily conserved. Fourthly, length and protein domain overlapping have additive effects on evolutionary rates. Finally, we have identified a combination of ASE features (E FTE = 0 ) that characterize evolutionarily conserved exons. This can be useful when no orthologous exons are available for comparative analysis.

Extraction of human-mouse orthologous exon pairs
For the ASE dataset, well-annotated human CSEs and ASEs were downloaded from the online database ASAP (the Alternative Splicing Annotation Project [40,41] [37,43]). We used EST (i.e., the human UniGene EST database) counts to classify human major-form and nonmajor-form ACEs by the definition stated above. The sequences of exons analyzed in this study (including CSEs, ASEs, and ACEs) are available [44].

The K A /K S ratio test
For the K A /K S ratio analysis of orthologous exon pairs, we performed the following procedures: (i) calculating the numbers of synonymous and non-synonymous sites, K A , K S , and K A /K S values, using the yn00 program of the PAML package [45,46]. The exon pairs were aligned in FASTA format and checked for correct reading frame before submitting to the PAML program; (ii) creating two-way contingency tables with rows comprising numbers of synonymous and non-synonymous sites and columns comprising numbers of changed and unchanged sites; and (iii) testing the independence between the numbers of changed synonymous and non-synonymous sites using one-tailed Fisher's exact test. The Fisher's exact test was performed in R statistics system [47]. If the ratio K A /K S is significantly smaller than one at 5% level (P < 0.05), it stands for that the orthologous exon pairs tested are under strong negative selection. Such exons are termed "passingtest exons"; otherwise the exons are termed FTEs (failingtest exons). The FTEs are likely under positive selection or relaxed negative selection, which may accelerate changes at the amino acid level [12,13].

Identification of ESE motifs
Three programs were used for ESE motif prediction: ESEfinder, RESCUE-ESE and PESE, all with default parameters. The three programs yielded somewhat different results. ESEfinder uses weight matrices to predict ESE motifs that are responsive to four human SR proteins (SF2/ASF, SC35, SRp40 and SRp55). The matrices are based on frequency values derived from alignments of SELEX-derived sequences [28]. In contrast, both RESCUE-ESE [29][30][31] and PESE [32,33] are ab initio methods. The two ab initio methods differ in that the former identifies hexamers preferentially associated with CSEs with weak splice sites, while the latter detects octamers overrepresented in non-coding exons compared with the 5'UTR of intronless genes and pseudo exons [34].

Determination of low/high ESE motif frequencies of exons
Using ESEfinder [28]

AS -Alternative Splicing
ASEs -Alternatively Spliced Exons

CSEs -Constitutively Spliced Exons
ACEs -Alternative Conserved Exons The K A /K S ratio test -the non-synonymous to synonymous substation rate ratio test