Extraction, integration and analysis of alternative splicing and protein structure distributed information
© D'Antonio and Masseroli; licensee BioMed Central Ltd. 2009
Published: 15 October 2009
Alternative splicing has been demonstrated to affect most human genes; different isoforms from the same gene encode proteins that differ in a limited number of residues, thus yielding similar structures. This suggests possible correlations between alternative splicing and protein structure. To support the investigation of such relationships, we have developed the Alternative Splicing and Protein Structure Scrutinizer (PASS), a Web application to automatically extract, integrate and analyze human alternative splicing and protein structure data sparsely available in the Alternative Splicing Database, Ensembl databank and Protein Data Bank. Primary data from these databases have been integrated and analyzed using the Protein Identifier Cross-Reference, BLAST, CLUSTALW and FeatureMap3D software tools.
A database has been developed to store the considered primary data and the results from their analysis; a system of Perl scripts has been implemented to automatically create and update the database and analyze the integrated data; a Web interface has been implemented to make the analyses easily accessible; a second database has been created to manage user access to the PASS Web application and store users' data and searches.
PASS automatically integrates data from the Alternative Splicing Database with protein structure data from the Protein Data Bank. Additionally, it comprehensively analyzes the integrated data with publicly available well-known bioinformatics tools in order to generate structural information of isoform pairs. Further analysis of such valuable information might reveal interesting relationships between alternative splicing and protein structure differences, which may be significantly associated with different functions.
Most genes in higher eukaryotes contain introns. The presence of many introns in higher eukaryotic genes allows a single gene to express different proteins (isoforms) in different tissues, a phenomenon known as alternative splicing. In lower eukaryotes, however, alternative splicing is very rare.
It has been estimated that at least 60% of human genes produce alternatively spliced forms, but alternative splicing variants have been detected for only a small part of the human genes, because the regulatory processes that lead to alternative splicing are not yet well understood. Alternative splicing may be important in understanding cancer, since some cancer-associated genes have alternatively spliced forms that differ from the forms in normal tissues. More generally, aberrant splicing events can give rise to pathologies.
Alternative splicing is also important from an evolutionary point of view: different transcripts may be translated in different compartments of the cell, in different tissues and at different developmental stages, giving rise to cell, tissue and developmental diversity and specificity; alternative splicing also allows evolution of the gene structure. It has been demonstrated that the alternative first exon plays an important role in protein diversity at cell and tissue level [5, 6].
The main types of alternative splicing events are:
skipped exons (cassette exons and mutually exclusive exons): an entire exon appears in some transcripts but not in others;
exon/intron isoforms: the borders of the exon/intron are different, leading to truncation/extension of introns/exons;
intron retentions: an intron is not spliced out (Figure 1).
In addition to these, alternative promoters or alternative poly-A tails exist: in this case the coding sequence is the same for all the isoforms, but the untranslated regions may vary, e.g. alternative promoters that involve different transcriptional controls.
There are often functional differences among spliced forms: insertion, deletion or substitution of protein domains by alternative splicing frequently modifies protein function by inserting or deleting functional residues or by substituting the sequence that includes a functional site. Alternative splicing tends to insert or delete whole protein domains rather than to occur within protein domains. This can be explained by the fact that exons may reflect domain boundaries, or that natural selection may eliminate meaningless alternative splicing variants which do not result from full domains.
To investigate how alternative splicing affects protein function, it is important to understand whether the protein structure is influenced by the insertion/deletion/substitution of residues. The secondary structure of a protein is determined by the spatial arrangements that result from the folding of localized parts of the protein. Many different types of secondary structure may be present inside the same protein, due to different non-covalent interactions between the residues, e.g. hydrogen bonds give rise to alpha helix and beta sheet secondary structures. On average, 60% of a protein chain is alpha helix or beta sheet and the other 40% is loop. To date, many different alphabets to define the secondary structure exist; the most commonly used alphabet is the Definition of Secondary Structure of Proteins (DSSP), which includes 8 types of structures: H (alpha helix), B (isolated beta-bridge), E (extended strand, participating in beta ladder), G (3-helix, i.e. 3/10 helix), I (5-helix, i.e. pi helix), T (hydrogen bonded turn), S (bend), and a final class for all remaining residues (other).
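As a minimal illustration of the DSSP alphabet described above, the following sketch tallies the fraction of helix, strand and loop states in a per-residue DSSP string. The use of "-" for the "other" class and the reduction of the eight codes to three broad classes are assumptions made for this example; they are not taken from PASS.

```python
from collections import Counter

# DSSP one-letter codes as listed in the text.
# Assumption: "-" is used here to stand for the "other" class.
DSSP_CODES = {
    "H": "alpha helix",
    "B": "isolated beta-bridge",
    "E": "extended strand",
    "G": "3/10 helix",
    "I": "pi helix",
    "T": "hydrogen bonded turn",
    "S": "bend",
    "-": "other",
}

# Hypothetical three-class reduction (helix / strand / loop).
BROAD_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "strand", "B": "strand",
    "T": "loop", "S": "loop", "-": "loop",
}


def secondary_structure_fractions(dssp_string):
    """Return the fraction of residues falling in each broad class."""
    if not dssp_string:
        raise ValueError("empty DSSP string")
    counts = Counter(BROAD_CLASS[code] for code in dssp_string)
    total = len(dssp_string)
    return {cls: counts.get(cls, 0) / total for cls in ("helix", "strand", "loop")}


# Toy 10-residue string, invented for illustration.
print(secondary_structure_fractions("HHHHEE--TS"))
```

Comparing such fractions between two isoforms gives a first, coarse indication of whether an alternative element changes the secondary structure distribution.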
When analyzing the protein structure, two additional properties are worth investigating: accessibility and flexibility. The former is a measure of the static solvent exposure, i.e. the number of water molecules in contact with each residue of the protein; the average accessibility of the protein is calculated as the average value over all the residues. The latter, which refers to the vibration of an atom around its rest position, is measured as the thermal motion of the alpha carbon of each residue.
To automatically extract, integrate, and analyze the alternative splicing and protein structure data sparsely available in different distributed databanks, we created a Web application called Alternative Splicing and Protein Structure Scrutinizer (PASS). In order to build the PASS Web application, we: 1) defined a database schema suitable to store and integrate alternative splicing and protein structure information extracted from the Alternative Splicing Database (ASD), the Ensembl databank http://www.ensembl.org/, and the Protein Data Bank (PDB, http://www.rcsb.org/); 2) developed software capable of creating and updating the database by automatically retrieving and integrating data from different databanks accessible through the Internet; 3) created software to analyze the retrieved data and store the results from the analysis inside the database, and 4) designed and implemented a Web interface that allows users to query the database in order to examine the integrated data and use them to evaluate their own gene or protein sets.
Results and discussion
PASS database and designed analysis steps
To obtain information about the possible relationship between alternative splicing and protein structure, a multi-step analysis procedure of the integrated data has been designed. The first step filters all the reference protein sequences in Ensembl in order to consider only those proteins which have a resolved structure in PDB; this allows all the subsequent analyses to be performed on a smaller subset of proteins, thus limiting the computational load required, in particular, by some time-consuming operations. The filtering is achieved by using the Protein Identifier Cross-Reference (PICR, http://www.ebi.ac.uk/Tools/picr/) Web service, which provides a mapping between Ensembl and PDB identifiers. PICR maps 3,480 Ensembl human genes, which correspond to 15.1% of the total human genes in Ensembl, to 1,942 distinct PDB structures, with a total of 13,972 associations (on average 4 structures per gene, as more than one protein structure in PDB can be associated with the same alternative splicing gene). Mapping results are stored in the PDB_ENSP table of the PASS database (Figure 2).
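The summary figures quoted for the PICR mapping can be derived from a plain list of gene-to-structure associations. The sketch below shows one way to do so; the function name, the returned keys and the toy identifiers are hypothetical, and only the totals in the final two lines come from the text.

```python
def summarize_mapping(associations, total_genes):
    """Summarize (gene_id, pdb_id) associations.

    associations: iterable of (gene_id, pdb_id) pairs.
    total_genes:  number of genes in the reference set.
    """
    pairs = set(associations)            # ignore repeated identical pairs
    genes = {gene for gene, _ in pairs}
    structures = {pdb for _, pdb in pairs}
    return {
        "mapped_genes": len(genes),
        "distinct_structures": len(structures),
        "associations": len(pairs),
        "percent_of_genes": 100.0 * len(genes) / total_genes,
        "structures_per_gene": len(pairs) / len(genes) if genes else 0.0,
    }


# Toy data with invented identifiers.
toy = [("geneA", "1abc"), ("geneA", "2xyz"), ("geneB", "1abc")]
print(summarize_mapping(toy, total_genes=10))

# Consistency check on the totals quoted in the text.
print(round(100.0 * 3480 / 22997, 1))   # 15.1 (% of Ensembl genes)
print(round(13972 / 3480, 1))           # 4.0 (associations per gene)
```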
A further filtering step is made by performing BLAST alignments against PDB of each isoform sequence selected through PICR, in order to verify the associations between ASD isoforms and PDB structures suggested by the correspondences between Ensembl reference sequences and PDB structures obtained from PICR. This BLAST search selects 3,056 best alignments between isoforms from ASD and PDB structures (9.5% of the isoforms in ASD), which correspond to 953 genes (9.6% of the genes in ASD, and 4.1% of the human genes in Ensembl). BLAST results are stored in the BLAST_Homology table of the PASS database (Figure 2).
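The selection of one best alignment per isoform can be sketched on BLAST tabular output (12 tab-separated columns, ending with e-value and bit score). The text does not state the criterion PASS uses to pick the best alignment; ranking by bit score is an assumption of this example, and the sample identifiers are invented.

```python
import csv
import io


def best_hits(tabular_text):
    """Keep, for each query sequence, the hit with the highest bit score.

    Expects BLAST tabular output: query id, subject id, % identity,
    alignment length, mismatches, gap openings, q.start, q.end,
    s.start, s.end, e-value, bit score.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity = float(row[2])
        evalue = float(row[10])
        bit_score = float(row[11])
        if query not in best or bit_score > best[query]["bit_score"]:
            best[query] = {
                "subject": subject,
                "identity": identity,
                "evalue": evalue,
                "bit_score": bit_score,
            }
    return best


# Invented example rows: two hits for iso1, one for iso2.
SAMPLE = (
    "iso1\t1ABC_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t400\n"
    "iso1\t2XYZ_B\t45.0\t180\t90\t5\t10\t190\t5\t185\t2e-30\t120\n"
    "iso2\t3DEF_A\t100.0\t150\t0\t0\t1\t150\t1\t150\t1e-80\t310\n"
)
for query, hit in best_hits(SAMPLE).items():
    print(query, hit["subject"], hit["bit_score"])
```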
Once all proteins with a resolved structure are selected, CLUSTALW is executed on each pair of isoforms for which an alternative splicing event is defined in ASD (i.e. imported and stored in the AlternativeSplicingEvents table of the PASS database). This allows annotating each residue of the two isoforms according to the alignment between their alternatively spliced sequences: hence it is possible to verify whether a residue is conserved between the two spliced forms, and whether there are mismatches or gaps. Through CLUSTALW the annotation for both aligned sequences and the position of the alternative elements are determined. Seven different possible positions for the alternative elements were defined and stored in the Annotation table of the PASS database (Figure 2): 0 (inner alternative element), 1 (more than one alternative element), 2 (terminal alternative element), 3 (only mismatches between the two sequences are present), 4 (element aligned to a gap: this happens if the two sequences do not overlap), 5 (element shared between the two sequences), 6 (either mismatches or insertions are defined). The data derived from the analysis with CLUSTALW are stored in the ClustalwAlignedSequences table of the PASS database (Figure 2). In ASD 52,248 events of alternative splicing are described for 23,295 isoforms (72.6% of the isoforms defined in ASD) and 6,384 genes (64.2% of the genes in ASD). By combining the alternative splicing event data with the results from the previous step of BLAST filtering and the residue annotation through CLUSTALW, 3,951 alternative splicing events are annotated (7.6% of the events in ASD), which involve 599 genes and 2,149 isoforms (6% of the genes and 6.7% of the total isoforms in ASD).
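The seven position codes listed above lend themselves to a small enumeration, which keeps the integer values stored in the database readable in analysis code. The class and member names below are hypothetical; only the numeric codes and their meanings are taken from the text.

```python
from enum import IntEnum


class AltElementPosition(IntEnum):
    """Position of the alternative element (codes as given in the text)."""
    INNER = 0                   # inner alternative element
    MULTIPLE = 1                # more than one alternative element
    TERMINAL = 2                # terminal alternative element
    MISMATCHES_ONLY = 3         # only mismatches between the two sequences
    ALIGNED_TO_GAP = 4          # element aligned to a gap (no overlap)
    SHARED = 5                  # element shared between the two sequences
    MISMATCH_OR_INSERTION = 6   # either mismatches or insertions


def describe(code):
    """Return a readable label for a stored position code."""
    return AltElementPosition(code).name.lower().replace("_", " ")


print(describe(2))
print([int(p) for p in AltElementPosition])
```

Passing a value outside 0 to 6 raises a ValueError, which can help catch corrupted records early.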
Summary of the data integrated and stored in the PASS database (table; cell values not preserved). Column headings: source and type of data; % of total in Ensembl; % of total in ASD; alternative splicing events. Rows: Ensembl genes with a resolved structure (from PICR); mapping from BLAST against PDB; alignments produced by CLUSTALW; results from FeatureMap3D.
Software system to create and update the database automatically
In order to manage user registrations to the PASS Web application, as well as user input and PASS database search data, the relational PassUsers database has been developed using the MySQL DBMS. It allows: 1) maintaining the information about PASS registered users and recognizing them when they log in; 2) storing all the data that registered users upload to the PASS system; 3) storing all the queries to search and extract data in the PASS database that registered users manually define and want to save, so that they can be performed at a later time without needing to be redefined.
The Queries table contains all the predefined SQL queries the user can use to interrogate the PASS database. Each record of the table contains information about the FROM and WHERE clauses of the SQL queries, while the SELECT clauses of the queries are visually composed by the user and stored in the Fields table, associated with the user-performed database searches stored in the Searches table. In this way, the user can choose the query to perform, as well as the result fields to display in the Web interface, and can save them to be used at a later time to perform the desired searches in the PASS database.
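The split between predefined FROM/WHERE clauses and user-selected result fields can be illustrated as follows. This is a sketch under stated assumptions: SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained, and the column names, the whitelist check and the sample rows are hypothetical rather than taken from the PASS schema.

```python
import sqlite3


def compose_query(select_fields, from_clause, where_clause, allowed_fields):
    """Combine user-chosen result fields with predefined FROM/WHERE clauses.

    Field names are checked against a whitelist because they are
    interpolated into the SQL text.
    """
    if not select_fields:
        raise ValueError("at least one result field is required")
    unknown = [f for f in select_fields if f not in allowed_fields]
    if unknown:
        raise ValueError(f"fields not allowed: {unknown}")
    sql = f"SELECT {', '.join(select_fields)} FROM {from_clause}"
    if where_clause:
        sql += f" WHERE {where_clause}"
    return sql


# In-memory stand-in for a PASS table (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Annotation (isoform TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO Annotation VALUES (?, ?)",
    [("iso1", 0), ("iso2", 2), ("iso3", 0)],
)

# Predefined part (as a Queries record would hold) plus user-chosen fields.
sql = compose_query(["isoform"], "Annotation", "position = 0",
                    allowed_fields={"isoform", "position"})
rows = conn.execute(sql + " ORDER BY isoform").fetchall()
print(sql)
print(rows)   # [('iso1',), ('iso3',)]
```

Storing the two parts separately lets the same predefined query be reused with different sets of displayed fields.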
to determine whether a particular type of insertion has a preferred structural pattern;
in presence of an inner alternative exon, to evaluate whether the inserted element might be more rigid or more flexible than the rest of the protein;
in presence of mismatches, to determine whether the substituted residues are on the protein's surface (as indicated by high accessibility values); furthermore, if these have a low similarity with those in the other splicing pattern (as indicated by their low values in the BLOSUM62 matrix), then the overall properties of the isoforms, such as polarity and solubility, might differ significantly.
A last set of implemented queries allows the PASS user to export the processed data stored in the PASS database by downloading query results as flat files, which may be imported into other programs such as Excel, MATLAB or R for further analysis. To this end, several possibilities have been implemented: 1) extract the average structure and the accessibility and flexibility values for all the alternatively spliced genes; 2) extract the average structures sorted by annotation, in order to determine whether conserved residues have a different structural pattern than the alternatively spliced ones; 3) extract the average structures sorted by position of the alternative element, in order to understand whether a particular type of insertion may affect the protein structure; and 4) extract the average structures sorted by the type of alternative splicing event, in order to understand whether the event type may affect the protein structure.
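An export of this kind amounts to grouping per-residue values by a key (annotation, position or event type), averaging them, and writing a tab-separated file. The sketch below illustrates the idea; the record layout, the field names and the three-decimal formatting are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def average_by_group(records, group_key, value_key):
    """Average value_key over records sharing the same group_key."""
    sums = defaultdict(lambda: [0.0, 0])     # group -> [running sum, count]
    for rec in records:
        acc = sums[rec[group_key]]
        acc[0] += rec[value_key]
        acc[1] += 1
    return {group: total / n for group, (total, n) in sums.items()}


def to_tsv(averages, header):
    """Render the averages as tab-separated text, one group per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for group in sorted(averages):
        writer.writerow([group, f"{averages[group]:.3f}"])
    return buf.getvalue()


# Invented per-residue records.
records = [
    {"annotation": 0, "accessibility": 0.2},
    {"annotation": 0, "accessibility": 0.4},
    {"annotation": 3, "accessibility": 0.9},
]
averages = average_by_group(records, "annotation", "accessibility")
print(to_tsv(averages, ["annotation", "mean_accessibility"]))
```

The resulting text can be saved to a file and read directly by spreadsheet or statistics software.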
The Alternative Splicing and Protein Structure Scrutinizer (PASS) has been developed as a Web application able to make large scale automatic analyses of alternative splicing and protein structures of human genes. PASS can provide numerous results: from the simple analyses of protein sequences (either structural or about alternative splicing), to the extraction of protein structural information, which is the basis for determining the relationship between alternative splicing and protein structure. All data processing and analyses are performed automatically; this allows executing large scale analyses: more than one thousand protein structures have been investigated and results are available in the PASS database. By using them the PASS Web application is able to process all the human genes and proteins required by the user and provide information about the analysis of their structure. The results may be either visual (in form of bar plots that enable the user to immediately perceive the differences in protein secondary structure distribution among the considered sequences), or in form of tables that may be downloaded for further investigations with specific software tools. To our knowledge, at present there is no other software or databank available that can provide similar integrated information. Evaluation of such valuable information stored in the PASS database might reveal interesting correlations between alternative splicing and protein structure differences significantly associated with different functions.
The Alternative Splicing Database (ASD) from the European Bioinformatics Institute (EBI) http://www.ebi.ac.uk/asd/ was used as source for human alternative splicing data. This is a computer generated high quality databank which includes data from gene/transcript sequence (EST/mRNA) alignments. These data are cleaned from ambiguities and analyzed: alignment gaps are potential introns while matches in the alignment correspond to exons if they are flanked on both sides by introns. Introns and exons which overlap with one another correspond to alternative splicing events; these events are described as skipped exons, exon/intron isoforms, or alternative exon events. The alternative splicing events are defined between different isoforms from a single gene: each of them contains the type of alternative splicing, the splicing patterns between which the event occurs, and the introns/exons which are involved in the alternative splicing event. The latest version (release 3) of the alternatively spliced protein sequences (AltSplice-rel3.peptides.fasta) and the definition of the alternative splicing events (AltSplice-rel3.events.txt) have been downloaded from the ASD FTP site at EBI ftp://ftp.ebi.ac.uk/pub/databases/astd/altsplice/human/latest/. The ASD includes analyses on 9,945 genes and 32,079 isoforms, among which 52,248 events of alternative splicing are defined (Figure 7a).
The Ensembl databank http://www.ensembl.org/ was used as reference benchmark for the human genes. The release 48 of the list of human reference protein sequences, which includes 22,997 human genes, was downloaded from the Ensembl FTP site ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/. Each considered entry contains the Ensembl gene and protein identifiers, the position of the gene on the chromosome, and the protein sequence.
The Protein Data Bank (PDB) release of February 10th 2008, with protein sequences in FASTA format (pdb_seqres.txt), has been downloaded from ftp://ftp.wwpdb.org/pub/pdb/derived_data/; 111,015 protein structures are included in this release.
Bioinformatics software and tools
The Protein Identifier Cross-Reference (PICR) http://www.ebi.ac.uk/Tools/picr/ has been developed by EBI to map identifiers and sequences between different databanks. We used its Web service version to map all the Ensembl protein identifiers (ENSP) to the PDB identifiers, in order to avoid doing a BLAST of all Ensembl against all PDB.
BLASTALL, which has been downloaded as a stand-alone executable program from http://www.ebi.ac.uk/blastall/, is used to align alternatively spliced protein sequences against PDB with BLAST; it has been chosen because it allows the database of hit sequences to be defined by the user. Two steps are required: 1) define the database of hit sequences by running the command "formatdb", and 2) run the command "blastall -p blastp -m 8", where the parameter "-p blastp" specifies that the BLAST search is performed on amino acid sequences and the parameter "-m 8" specifies that the output must be in tabular format. By taking advantage of the results from PICR, only the alternatively spliced protein sequences from the genes which have a significant hit in PDB are aligned against the protein sequences which have a defined three-dimensional structure in PDB. Proceeding this way, the computational load is significantly reduced in comparison with an all-against-all BLAST search.
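The two steps can be scripted by building the command lines and handing them to a process runner. The sketch below only assembles and prints the commands, so it runs without BLAST installed. The text specifies only "-p" and "-m 8"; the remaining flags ("-i", "-d", "-o" and the "-p T" option of formatdb) are those of the legacy NCBI BLAST suite and should be checked against the installed version. The query and output file names are hypothetical.

```python
import shlex


def formatdb_command(fasta_path):
    """Command to index a protein FASTA file as a BLAST database."""
    # -i: input FASTA file; -p T: sequences are proteins
    return ["formatdb", "-i", fasta_path, "-p", "T"]


def blastall_command(query_path, db_path, out_path):
    """Command for a protein-protein search with tabular output."""
    return [
        "blastall",
        "-p", "blastp",    # protein query against protein database
        "-d", db_path,     # database built by formatdb
        "-i", query_path,  # query sequences (FASTA)
        "-m", "8",         # tabular output
        "-o", out_path,    # output file
    ]


# pdb_seqres.txt is the PDB sequence file named in the text;
# the other two file names are invented for this example.
steps = [
    formatdb_command("pdb_seqres.txt"),
    blastall_command("isoforms.fasta", "pdb_seqres.txt", "isoforms_vs_pdb.tsv"),
]
for cmd in steps:
    print(shlex.join(cmd))
    # To execute for real: subprocess.run(cmd, check=True)
```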
CLUSTALW, downloaded as a stand-alone executable program from http://www.ebi.ac.uk/Tools/clustalw2/, is used to annotate every residue in each considered pair of alternatively spliced sequences between which an alternative splicing event is defined (Figure 7b). Based on the output from CLUSTALW (an ALN file), the annotation is defined as follows: 0 (residue shared between the two spliced forms); 1 (residue present only in the analyzed splicing pattern); 2 (start and finish position of a gap for the analyzed isoform); 3 (mismatch between the two spliced forms, with substitution between two residues which are very alike; i.e. when the value between the two mismatched residues in the BLOSUM62 matrix is > 0); 4 (mismatch between the two spliced forms, with substitution between two residues which have a value = 0 between them in the BLOSUM62 matrix); 5 (mismatch between the two spliced forms, with substitution between two residues which have opposite properties, e.g. a positively charged and a non-polar residue; i.e. when the value between the two mismatched residues in the BLOSUM62 matrix is < 0) (Figure 7c). It is relevant to understand whether a mismatch is between similar residues (a high value in the BLOSUM62 matrix) because similar residues have similar properties, such as polarity or size; thus a substitution between such residues is unlikely to affect the whole protein structure.
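The annotation rule can be sketched as a function over the columns of a gapped pairwise alignment, with 0 taken to denote a shared residue. Two simplifications are assumed here: every gap column in the analyzed sequence receives code 2 (the text refers to the start and finish positions of a gap), and only a short excerpt of BLOSUM62 is embedded, so a complete matrix (for instance from a library such as Biopython) would be needed in practice. Function names and sequences are invented.

```python
# Excerpt of BLOSUM62 scores for a few residue pairs (symmetric matrix).
BLOSUM62_EXCERPT = {
    frozenset(("K", "R")): 2,
    frozenset(("D", "E")): 2,
    frozenset(("I", "L")): 2,
    frozenset(("A", "S")): 1,
    frozenset(("A", "T")): 0,
    frozenset(("K", "L")): -2,
    frozenset(("G", "W")): -2,
}


def blosum_score(a, b):
    """Look up a mismatch score; raises KeyError outside the excerpt."""
    return BLOSUM62_EXCERPT[frozenset((a, b))]


def annotate(analyzed, other, score=blosum_score):
    """Annotate each alignment column from the analyzed isoform's side."""
    if len(analyzed) != len(other):
        raise ValueError("aligned sequences must have equal length")
    codes = []
    for a, b in zip(analyzed, other):
        if a == "-":
            codes.append(2)      # gap in the analyzed isoform
        elif b == "-":
            codes.append(1)      # residue only in the analyzed isoform
        elif a == b:
            codes.append(0)      # shared residue
        else:
            s = score(a, b)
            codes.append(3 if s > 0 else 4 if s == 0 else 5)
    return codes


# Invented aligned pair ("-" marks a gap).
print(annotate("MKTA-LD", "MRTTGL-"))   # [0, 3, 0, 4, 2, 0, 1]
```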
FeatureMap3D http://www.cbs.dtu.dk/services/FeatureMap3D/ is a Web tool that searches PDB for the structure of a protein by aligning the query sequence against PDB with BLASTP. If an annotation is added to the query sequence (by submitting a TAB file instead of a FASTA file), FeatureMap3D assigns different colors to the differently annotated parts of the retrieved structure; otherwise it uses the alignment of the query sequence against the PDB structure as the annotation for the coloring, and generates a PyMOL script and an annotated pairwise alignment. Different types of output may be downloaded from FeatureMap3D: the PDB file of the identified structure; the PyMOL script to color the structure as defined by the annotation; the PNG image of the structure; and the FeatureMap3D report, which is divided into two parts: first, a summary of the information about the protein (protein name, user-defined name, PDB identifier, resolution of the structure, and a report of the BLAST alignment between the query protein sequence and PDB), and second, a table of the protein structural properties, residue by residue (Figure 7d). The report file is used to extract all the structural information imported and stored into the PASS database.
We are grateful to Søren Brunak, who gave us the idea to study the relationship between alternative splicing and protein structures, and to Anne Mølgaard and Rasmus Wernersson for their support during the biological analysis.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 12, 2009: Bioinformatics Methods for Biomedical Complex System Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S12.
1. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al.: The genome sequence of Drosophila melanogaster. Science 2000, 287(5461):2185–2195. doi:10.1126/science.287.5461.2185
2. Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res 2001, 29(13):2850–2859. doi:10.1093/nar/29.13.2850
3. Hiller M, Backofen R, Heymann S, Busch A, Glaesser TM, Freytag JC: Efficient prediction of alternative splice forms using protein domain homology. In Silico Biol 2004, 4(2):195–208.
4. Boue S, Letunic I, Bork P: Alternative splicing and evolution. Bioessays 2003, 25(11):1031–1034. doi:10.1002/bies.10371
5. Zhang T, Haws P, Wu Q: Multiple variable first exons: a mechanism for cell- and tissue-specific gene regulation. Genome Res 2004, 14(1):79–89. doi:10.1101/gr.1225204
6. Nakao M, Barrero RA, Mukai Y, Motono C, Suwa M, Nakai K: Large-scale analysis of human alternative protein isoforms: pattern classification and correlation with subcellular localization signals. Nucleic Acids Res 2005, 33(8):2355–2363. doi:10.1093/nar/gki520
7. Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S: Increase of functional diversity by alternative splicing. Trends Genet 2003, 19(3):124–128. doi:10.1016/S0168-9525(03)00023-4
8. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. doi:10.1002/bip.360221211
9. Thanaraj TA, Stamm S, Clark F, Riethoven JJ, Le Texier V, Muilu J: ASD: the Alternative Splicing Database. Nucleic Acids Res 2004, 32(Database issue):D64-D69. doi:10.1093/nar/gkh030
10. Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al.: Ensembl 2008. Nucleic Acids Res 2008, 36(Database issue):D707-D714.
11. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235–242. doi:10.1093/nar/28.1.235
12. Cote RG, Jones P, Martens L, Kerrien S, Reisinger F, Lin Q, Leinonen R, Apweiler R, Hermjakob H: The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 2007, 8:401. doi:10.1186/1471-2105-8-401
13. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
14. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. doi:10.1093/nar/22.22.4673
15. Wernersson R, Rapacki K, Staerfeldt HH, Sackett PW, Molgaard A: FeatureMap3D – a tool to map protein features and sequence conservation onto homologous structures in the PDB. Nucleic Acids Res 2006, 34(Web Server issue):W84-W88. doi:10.1093/nar/gkl227
16. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–10919. doi:10.1073/pnas.89.22.10915
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.