No statistical support for correlation between the positions of protein interaction sites and alternatively spliced regions

Background Alternative splicing is an efficient mechanism for increasing the variety of functions fulfilled by proteins in a living cell. It has been previously demonstrated that alternatively spliced regions often comprise functionally important and conserved sequence motifs. The objective of this work was to test the hypothesis that alternative splicing is correlated with contact regions of protein-protein interactions. Results Protein sequence spans involved in contacts with an interaction partner were delineated from atomic structures of transient interaction complexes and juxtaposed with the location of alternatively spliced regions detected by comparative genome analysis and spliced alignment. The total of 42 alternatively spliced isoforms were identified in 21 amino acid chains involved in biomolecular interactions. Using this limited dataset and a variety of sophisticated counting procedures we were not able to establish a statistically significant correlation between the positions of protein interaction sites and alternatively spliced regions. Conclusions This finding contradicts a naïve hypothesis that alternatively spliced regions would correlate with points of contact. One possible explanation for that could be that all alternative splicing events change the spatial structure of the interacting domain to a sufficient degree to preclude interaction. This is indirectly supported by the observed lack of difference in the behaviour of relatively short regions affected by alternative splicing and cases when large portions of proteins are removed. More structural data on complexes of interacting proteins, including structures of alternative isoforms, are needed to test this conjecture.

encoded by a relatively small number of genes. While the estimates of this number still vary significantly in the range of 25000 to 35000 [1] it is clear that we have fewer genes than maize [2], and only approximately five times more than the unicellular eukaryote S. cerevisiae [3]. Earlier predictions postulated that the human genome should contain between 50.000 and 100.000 genes [4]. These recent findings helped to realize the importance of post-transcriptional gene regulation in generating the proteomic complexity of the human cell.
One of the mechanisms for generating the protein diversity in eukaryotes is alternative splicing (AS). Current estimates of the prevalence of AS in the human genome range from one third of genes [5,6] to about 60% [7]. In fact, analysis of EST data showed that almost all human genes have possible alternatively spliced forms, although many of them seem to be non-functional [8]. AS is also widely observed in genomes of other multicellular eukaryotes [9]. There are two aspects of the AS phenomenon. Firstly, recent results demonstrate that alternative isoforms, especially tissue-specific ones, are often evolutionary young [10][11][12]. One possible interpretation for that is that AS provides a convenient evolutionary mechanism for generating new proteins without sacrificing existing ones [10]. Secondly, AS leads to generation of proteins identical in some domains and different in others. This is a powerful regulatory mechanism [13,14]. Indeed, it has been shown that AS, compared to a random expectation, tends to avoid disrupting protein domains, to shuffle entire domains, and to target functional sites in proteins [15]. AS is frequent in genes involved in signal transduction and regulatory interactions [16].
To summarize, AS is a powerful mechanism for modulating the protein mode of action. Perhaps the most important type of functional context in which proteins execute their function is constituted by protein-protein interactions (PPI). It is thus tempting to speculate that AS could influence the structure of protein-protein interaction networks by selectively blocking or activating individual interactions dependent on cellular conditions. Occasional evidence of the interplay between AS and PPI is scattered in the biomedical literature. For example, the domain composition of the Shank postsynaptic density protein is regulated by AS, defining the spectrum of its interaction partners [17]. The interaction of Annexin XI with calcyclin was shown to be isoform specific, owing to partial deletion of the calcyclin-binding site in one of the splice forms [18]. In Drosophila melanogaster the regulation of dynein targeting to various cellular organelles via binding to dynactin is modulated by structural variations in the N-terminal portion caused by AS [19]. At the same time, other studies show no interdependence between protein interaction patterns and AS. For example, Liu et al. [20] reported that specific protein-protein interactions of cytoplasmic serine hydroxymethyltransferase are not regulated by AS.
To test the hypothesis that protein-protein interactions can be modulated in an alternative splicing dependent manner we have undertaken a study of AS events occurring in proteins known to be involved in transient interactions (see Figure 1). We were able to identify the total of 42 alternatively spliced isoforms in the amino acid chains participating in structurally characterized interaction complexes, as surveyed recently by Thornton and Nooren [21]. In addition to statistics based on the entire dataset we also considered separately AS events changing more than a half of the protein length, where complete domain disruption is a likely result of AS [15], as well AS influencing less than 25% that could be expected to have less drastic effect. In all cases, no statistically significant relationship between AS and PPI could be established.

Results
We begin with presenting several examples illustrating the interplay between PPI and AS. In the first case, shown in Figure 2, two alternatively spliced regions were identified in the cyclin-dependent kinase 2 (cdk2). The first (N-terminal) splice event eliminates nearly completely the PSTAIRE region of cdk2 which plays a central role in its interaction with cyclin [22]. The second (C-terminal) sequence span deleted by splicing overlaps partially with the T-loop region of cdk, also involved in cdk2-cyclin interface. In the second example, presented in Figure 3, AS eliminates a large portion of the phosducin N-terminal domain, including one of the regions that contact the βγsubunits of the retinal G-protein transducin [23]. The corresponding isoform, called phosducin-like orphan protein 1, was reported to be less abundant than phosducin and not to be able to bind transducin, although the functional significance of this and other phosducin orphans remains unclear [24]. Finally, in our third positive example ( Figure 4) a significant portion of the interface between the human importin-β and the Ran protein [25] is spliced out, including the entire tandem repeat 7 as well as parts of repeats 6 and 8. This region also includes the so called "stalk region" with unknown function.
In all three examples above there is a clear overlap between the location of alternatively spliced regions and protein interaction sites. In many other cases, though, no such correlation could be established. For example, as seen in Figure 5, AS does not seem to affect the regions of the RhoA protein with which it binds the effector domain of serine-threonine kinases [26]. Based on an extensive manual analysis of our PPI/AS dataset we were not able to formulate a definitive opinion as to the degree to which the location of AS regions correlates with PPI interfaces.
We subsequently attempted to investigate the strength of the correlation between the positions of AS and PPI regions using statistical analysis. Tables 1 and 2 present the entire body of structural PPI evidence and AS information used for this purpose. Our analysis resulted in delineation of 30 alternatively spliced isoforms for 16 proteins involved in heterodimer interactions and 12 isoforms for 5 proteins involved in homodimer interactions. These data were used as input to different counting procedures described in the Methods section. The overall summary for all types of dimers (with smoothing) is given in Table 3 (for individual positions; [see Additional file 1]) and Table 4 (for entire segments; [see Additional file 2]). Table 3, in 15 out of 42 cases, the majority of random windows show less correlation with the contact positions than the real AS segments, in 3 cases the random windows show the same correlation, and in 24 cases the majority of random windows show more correlation with the contact positions than the real AS segment(s). Similarly, for contact segments (Table 4) in 13 cases the majority of windows show less correlation with the contact segments than the real AS segments, in 4 cases the correlation is the same, and in 25 cases most random windows show more correlation. Thus, compared to random control, AS shows a weak tendency to avoid contact regions; however, the chi-squared test shows that in both cases the difference is insignificant. The results of our study are thus negative: the hypothesis that AS might serve as a regulation mechanism for PPI networks could not be confirmed.

As shown in
A schematic representation of the study Figure 1 A schematic representation of the study. The positions of sequence regions involved in protein-protein interactions were delineated from structurally characterized protein complexes and juxtaposed with the location of putative alternatively spliced regions.

Discussion
This study represents the first attempt to investigate the relationship between two complicated biological phenomena -protein-protein interactions and alternative splicing -based on the scarce experimental evidence currently available in the public databases. It was prompted by results of Kriventseva et al. [15] that could be interpreted as evidence that AS tends to target functionally important regions in proteins. However, we were not able to obtain convincing evidence of correlation between contact regions in protein-protein interactions and protein segments corresponding to alternatively spliced regions in mRNA. One possible explanation for that could be that our database of AS events was corrupted by aberrant splicing that abounds in EST databases [8]. To guard against this possibility we considered only AS events expected to be reliable by multiple criteria, including presence in multiple ESTs from several clone libraries and conservation in the mouse genome. Another explanation could be that removal by AS of more than a half of the protein sequence would lead to complete disruption of the 3D protein structure, and thus any analysis relying on contact regions in the intact protein becomes meaningless.
However, we obtained essentially the same results on a subset of cases where the influence of AS was relatively modest (less than 25% of the protein sequence removed or substituted by AS; data not shown). Such shorter alternatively spliced regions are still much longer that a typical functionally important sequence motif of the type considered in [15]. On the other hand, as described above, complete removal or substitution of the interaction domain is a well-established mechanism of regulating PPI by AS.
Given more data, it would be interesting to repeat the analysis of the type presented here on the domain level. For instance, one of the remaining open questions is whether AS tends to target (completely remove or partially destroy) interacting domains or other domains of proteins involved in PPI. Partially, this issue was addressed in a recent study [27] where it was shown that domains of types involved in protein-protein interactions frequently are removed or inserted by alternative splicing, although there was no observed increase in the rate of alternative splicing of such domains compared to other types of domains.

Conclusions
In this study we were not able to establish a statistically significant correlation between the positions of protein interaction sites and alternatively spliced regions. This finding implies that alternative splicing is not a preferred mechanism for controlling protein interaction networks, although many individual protein interactions are known to be isoform-dependent.

Protein interaction data
A comprehensive survey of structurally characterized transient protein interaction complexes has been recently published [21]. We used tables 3 and 4 from this study which list the PDB [28] codes and chain identifiers of interacting proteins pairs in 24 weak homodimers and 25 heterodimers, respectively. Protein sequence spans involved in physical interactions were identified with the help of the Protein-Protein Interaction Server http:// www.biochem.ucl.ac.uk/bsm/PP/server [29]. Contacts between amino acid residues were defined based on the distance threshold of 6 Å between any pair of non-hydrogen atoms. PDB sequences were saved and the positions of residues participating in contacts recorded for each sequence separately.

Detection of AS events
Amino acid chains with known positions of interaction regions were further analysed to detect putative AS events. Initial mapping of the protein sequences to the human genome was conducted using TBLASTN [30] searches. Each hit that coincided both with a human gene annotated in LocusLink [31] and the UniGene [31] cluster linked to the latter was used to obtain a complete set of mRNA and EST sequences corresponding to the gene. mRNAs with annotated protein-coding regions were used to derive the corresponding protein sequences. The exonintron structure of each gene with all known alternatives was established by spliced alignment of ESTs and mRNAs to the genomic DNA sequence using Pro-EST [5]; proteins were spliced-aligned to DNA using Pro-Frame [32]. For further analysis only alternatives changing the coding region covered by the known three-dimensional structure were considered.
An alternative region (that is, a region present in a fraction of isoforms) was assumed to be reliable if it was supported by a full-length mRNA or a protein, or by several ESTs from different clone libraries. An alternative region was discarded if it was observed in less than three ESTs or ESTs from only one clone library. Further, only conserved AS events, defined by comparison with the mouse genome, were retained. The procedure for establishing the conservation of the alternative splicing events is described in detail in [11]. Briefly, for each human gene the orthologous mouse gene was obtained from HomoloGene [31], Alternative splicing in importin-β Figure 4 Alternative splicing in importin-β. a. Amino acid sequence of the protein. The spliced region was identified based on the homology with the human cDNA sequence with the GenBank accession number BE744080. b. Graphical representation of the spliced and interacting regions. c. Ribbon representation of the three-dimencional structure (PDB code 1IBR, chain b). Colouring as in Figure 2.
Alternative splicing in the human rhoa protein    and human protein isoforms were spliced-aligned to the mouse gene DNA using Pro-Frame. An alternative region was considered as conserved if it could be aligned with the same similarity level as the rest of the protein, and the corresponding mouse exons were bounded by canonical splicing sites.

Counting coincidences of AS and PPI regions
Positions of contact regions in protein sequences were compared with positions of AS regions to determine whether there was a significant correlation between them. For each protein sequence considered, two bit strings were generated representing the involvement of each amino acid position in AS and PPI, as depicted in Figure 6. One technical difficulty is that many contact regions are very short (often comprised by just one amino acid residue) and/or may be interrupted by individual residues or groups of residues that do not take part in interactions. We thus used two smoothing parameters -N and M -to group interacting residues in a given amino acid chain into contiguous contact segments. N was defined as the minimal allowed length of a spacer between two residues involved in interactions, and M represents the minimal allowed length of a contact segment. Contact segments separated by spacers shorter than N were merged, and then the segments shorter than M were deleted. We tested different combinations of N and M, and found that the obtained results were robust as regards these parameters (data not shown). The results for N = 5 and M = 1 are reported below.
Smoothing positions of PPI regions Figure 6 Smoothing positions of PPI regions. Amino acid sequence is represented as a bit string in which 1 indicates those residue positions that are involved in PPI, and 0 -those that are not. Subsequently smoothing is conducted such that contact segments separated by spacers shorter than N residues are merged, and then the segments shorter than M are deleted. In this example a sequence fragment of the human importin β-chain (PDB code 1ibr; residue positions from 7 to 77) is shown which interacts with the GTP-binding nuclear protein RAN. The interaction involves residues situated on one side of two α-helices. After smoothing with N = 5 and M = 1 entire helices are considered interacting regions. Several different approaches to calculate the correlation between AS and PPI bit strings with and without smoothing were tested. The first, residue-by-residue approach involved considering each amino acid position separately and counting the number of positions in which the values of AS and PPI bit strings are equal or differ (Figure 7a). Each position can thus be classified as belonging to one of the four classes: AS/PPI, AS/non-PPI, non-AS/PPI, or non-AS/non-PPI. In a second, area-based approach (Figure  7b), we first grouped the adjacent positions of bit strings possessing identical values and then calculated the correlation between occurrences of such groups as a whole. Entire PPI segments were classified into three classes: entirely overlapping with AS, partially overlapping with AS, and non-overlapping with AS.
To estimate the degree of correlation, two types of statistical analysis were employed. Initially we applied the standard χ 2 test to the contingency tables formed by two parameters describing each position (PPI/non-PPI, AS/ non-AS). However, as the results of this analysis were inconclusive (data not shown), we compared the observed correlation with that assuming fixed contacts and random placement of AS segments (cf. the procedure in [15]). Formally, for each AS isoform we considered a window of the same length as the length of the alternatively spliced segment. For each position of the window the correlation between the contact positions (or entire segments) and the current position of the window was computed as described in the previous paragraph. Then for the current position of the window the computed correlation and the correlation of the real AS segment were compared. The current window was subsequently classified into one of three classes: higher correlation, same correlation, and lower correlation. The same procedure was generalized for the case of two (or more) AS segments in an isoform. In this case, all non-overlapping pairs of windows were considered and the same procedure was applied. Again, the window sizes were set to the sizes of the real AS segments.

Authors' contributions
MO conducted protein structure analysis and implemented statistical techniques used in this work. He also Counting coinciding PPI and AS regions b did a major part of data analysis work, created statistical tables and figures presented in this paper. RNN identified the positions of alternatively spliced regions based on comparative genome analysis and spliced alignment. MSG proposed the statistical methods used in this work and supervised their development in collaboration with MO. DF and MSG, together, conceived the original hypothesis of this study. DF proposed the specific approach adopted in this study and supervised the work of MO in the part related to protein structure analysis. The manuscript was written by DF and MSG. All authors read and approved the manuscript.