Domain fusion analysis by applying relational algebra to protein sequence and domain databases
© Truong and Ikura 2003
Received: 25 February 2003
Accepted: 6 May 2003
Published: 6 May 2003
Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom and TIGRFAMs, and amalgamated domain databases like InterPro, continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages these efforts will become increasingly powerful.
This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From the scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at http://calcium.uhnres.utoronto.ca/pi.
As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
The complex metabolic and signaling pathways within the cell are controlled by highly coordinated and intricate protein-protein interactions. Information regarding such protein-protein interactions can be obtained from biochemical and biophysical methods like co-immunoprecipitation, yeast two-hybrid and mass spectrometry [3, 4]. To complement these often time-consuming experimental methods, computational methods for predicting functional linkages have been developed. Some methods use protein surface interfaces [5, 6]; some use the ordering and/or proximity of genes in genomes [7–9]; while others use the co-occurrences of genes in genomes [10, 11].
Recently, computational methods that exploit domain-domain relationships have been introduced and proven to be useful for the prediction of functional linkages in genomic research [12–15]. In particular, domain fusion analysis exploits the fact that certain proteins in a given genome consist of fused domains that correspond to single, full-length proteins in other genomes [13–15]. The proteins with fused domains in a given genome are likely to directly interact or be involved in the same metabolic and signaling pathways. In their analysis of the M. genitalium genome, Huynen et al. showed that the occurrence of a domain fusion event was highly correlated with function.
The query genome is defined as the genome where functional linkages are predicted, while the reference genome is the amalgamation of all other genomes excluding the query. Domain fusion events found in the reference genome predict functional linkages between proteins in the query. To date, most domain fusion analyses have compared complete genomes of relatively small sizes and have relied on a BLAST comparison between every protein of the query genome and every protein of the reference [13, 14, 16, 18–20]. The analysis has not been applied to larger non-redundant sequence databases such as SWISS-PROT, although the analysis becomes a more powerful prediction tool when more reference genomes are included. One reason for this limitation is that the computation time becomes "prohibitively expensive".
Other groups have already appreciated the use of relational databases for domain fusion analysis [21, 22]. To complement their work, we present a fast computational method that enables domain fusion analysis on partial or complete genomes in a non-redundant sequence database using simple relational algebra operations. Instead of using BLAST comparisons, we leveraged existing efforts to predict protein domains by Pfam's HMM domain database. Beginning with Pfam's domain layout prediction of each protein in the SWISS-PROT+TrEMBL protein sequence database, we applied successive relational algebra operations using SQL to identify putative functional linkages, especially in H. sapiens and S. cerevisiae. These results are compared with experimentally demonstrated cases and published protein interaction databases. Finally, we discuss various factors that can generate false positives.
The majority of protein sequence and domain databases are built on the relational database architecture. Typically, data is acquired from a database of this type by relational algebra operations in the form of SQL queries. Therefore, a method that can be performed directly using these operations will avoid unnecessary conversion of data and leverage the scalability and efficiency of commercial RDBMS software (relational database management systems). Our method for finding domain fusions can be performed entirely using relational algebra operations.
The method is described using relational algebra notation with the following conventions: bold text refers to a table; A.attribute refers to an attribute or column of A; σ(predicate)(A) is the selection operation with the predicate in parentheses, which selects rows in A that satisfy the predicate; A × B is the cartesian product operation, which pairs every row of A with every row of B; π(A.attribute1, A.attribute2,...)(A) is the projection operation, which extracts the specified attributes from A.
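As a minimal sketch of how these three operations map onto SQL, the example below uses Python's built-in sqlite3 module rather than the Oracle system used in the paper. The tables A and B, their columns (seq, dom, family) and the sample rows are hypothetical and serve only to illustrate the notation.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (seq TEXT, dom TEXT);
    CREATE TABLE B (dom TEXT, family TEXT);
    INSERT INTO A VALUES ('P1', 'SH2'), ('P1', 'SH3'), ('P2', 'PDZ');
    INSERT INTO B VALUES ('SH2', 'signaling'), ('PDZ', 'scaffold');
""")

# Selection, sigma(A.dom = 'SH2')(A): the predicate becomes a WHERE clause.
selection = con.execute(
    "SELECT seq, dom FROM A WHERE dom = 'SH2'"
).fetchall()

# Cartesian product, A x B: every row of A paired with every row of B.
product = con.execute("SELECT * FROM A CROSS JOIN B").fetchall()

# Projection, pi(A.seq)(A): the column list; DISTINCT gives set semantics.
projection = con.execute(
    "SELECT DISTINCT seq FROM A ORDER BY seq"
).fetchall()

print(selection)     # [('P1', 'SH2')]
print(len(product))  # 6 (3 rows in A times 2 rows in B)
print(projection)    # [('P1',), ('P2',)]
```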
Let F_query and F_ref be the tables of all possible domain fusion templates (DFTs) in the query and reference genomes, respectively. The idea of DFTs is conceptually similar to Rosetta stone and composite proteins. For example, if a gene has four different domains ABCD, there are six different DFTs: AB, AC, AD, BC, BD, and CD.
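The count of six DFTs for a four-domain gene can be checked with a short sketch. The function name domain_fusion_templates is hypothetical; it simply lists the unordered pairs of distinct domains in one protein.

```python
from itertools import combinations

def domain_fusion_templates(domains):
    """Return the unordered pairs of distinct domains found in one protein."""
    unique = sorted(set(domains))  # ignore repeated copies of a domain
    return list(combinations(unique, 2))

dfts = domain_fusion_templates(["A", "B", "C", "D"])
print(dfts)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```

In general a protein with n different domains yields n(n-1)/2 DFTs.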
Let D_q1 and D_q2 be copies of D_query and let D_r1 and D_r2 be copies of D_ref. F_query and F_ref can be found by performing a projection and selection operation following a cartesian product between the corresponding domain tables. This operation will enumerate all ordered pairs of domains. For example, if a gene has three different domains ABC, then there are nine possible ordered pairs: AA, AB, AC, BA, BB, BC, CA, CB, and CC. The desired DFTs do not pair a domain with itself (i.e., AA, BB, and CC are excluded) and order does not matter (i.e., AB is the same as BA). To remove same-domain DFTs, the following clauses are added to the selection predicates: (D_q1.dom ≠ D_q2.dom) for F_query and (D_r1.dom ≠ D_r2.dom) for F_ref. At this stage, it is not necessary to remove one of the two alternatively ordered DFTs.
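The self-join described above might look as follows in SQL, run here through Python's sqlite3 module. This is a sketch under stated assumptions: the table name d_query and the column seq (the protein identifier) are hypothetical, only the dom attribute appears in the text, and the predicate requiring both copies to refer to the same protein (q1.seq = q2.seq) is assumed rather than quoted from the paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE d_query (seq TEXT, dom TEXT);   -- hypothetical schema
    INSERT INTO d_query VALUES ('G1', 'A'), ('G1', 'B'), ('G1', 'C');
""")

# Cartesian product of two copies of the domain table, then selection
# (same protein, different domains), then projection onto the domain pair.
f_query = con.execute("""
    SELECT DISTINCT q1.dom AS dom1, q2.dom AS dom2
    FROM d_query q1, d_query q2
    WHERE q1.seq = q2.seq      -- assumed: both domains lie in one protein
      AND q1.dom <> q2.dom     -- removes AA, BB and CC
    ORDER BY dom1, dom2
""").fetchall()

print(f_query)
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
```

Of the nine ordered pairs, the three same-domain pairs are removed and both orderings of each remaining DFT are kept, as the text allows at this stage.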
Let F_put be the table of valid DFTs that can be used in the prediction of functionally linked proteins in the query genome. Therefore, F_put can be found by the difference between F_ref and F_query.
$$F_{\text{put}} = F_{\text{ref}} - F_{\text{query}} \tag{7}$$
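The set difference in equation (7) corresponds to a single SQL set operator. The sketch below uses SQLite's EXCEPT through Python's sqlite3 module; Oracle spells the same operator MINUS. The table and column names (f_ref, f_query, dom1, dom2) and the sample rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE f_ref   (dom1 TEXT, dom2 TEXT);
    CREATE TABLE f_query (dom1 TEXT, dom2 TEXT);
    -- Both orderings of each DFT are stored, as in the preceding step.
    INSERT INTO f_ref   VALUES ('A','B'), ('B','A'), ('A','C'), ('C','A');
    INSERT INTO f_query VALUES ('A','B'), ('B','A');
""")

# DFTs seen in the reference genome but not in the query genome.
f_put = con.execute("""
    SELECT dom1, dom2 FROM f_ref
    EXCEPT
    SELECT dom1, dom2 FROM f_query
    ORDER BY 1, 2
""").fetchall()

print(f_put)  # [('A', 'C'), ('C', 'A')]
```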
Finally, let P_put be the table of putative functional linkages in the query genome. P_put can be obtained by performing a projection and selection operation following a cartesian product between D_q1, D_q2 and F_put. For each DFT in F_put, this operation pairs every protein that contains the first domain of the DFT with every protein that contains the second domain. Note that this operation can be performed more efficiently if F_put includes only domains found in the query genome.
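A sketch of this final join, again using Python's sqlite3 module, is given below. The schema (d_query with columns seq and dom, f_put with columns dom1 and dom2) and the sample proteins P1 to P3 are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE d_query (seq TEXT, dom TEXT);   -- hypothetical schema
    CREATE TABLE f_put   (dom1 TEXT, dom2 TEXT);
    INSERT INTO d_query VALUES ('P1', 'A'), ('P2', 'C'), ('P3', 'C');
    INSERT INTO f_put   VALUES ('A', 'C'), ('C', 'A');
""")

# Product of two copies of the query domain table with F_put, selection on
# the domain matches, and projection onto the pair of protein identifiers.
p_put = con.execute("""
    SELECT DISTINCT q1.seq AS seq1, q2.seq AS seq2
    FROM d_query q1, d_query q2, f_put f
    WHERE q1.dom = f.dom1
      AND q2.dom = f.dom2
    ORDER BY 1, 2
""").fetchall()

print(p_put)
# [('P1', 'P2'), ('P1', 'P3'), ('P2', 'P1'), ('P3', 'P1')]
```

Because both orderings of each DFT are present in f_put, each linkage appears twice, once in each direction.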
Recall that the alternatively ordered DFTs have not been removed. Therefore, if there is a putative functional linkage between protein A and protein B, there will also be a functional linkage between protein B and protein A in P_put. To remove these redundant putative functional linkages, it is easiest to re-insert all rows of P_put into a new table with a database trigger enabled that rejects the insertion of a row for protein A and protein B if the row for protein B and protein A already exists.
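The paper removes the mirrored rows with an insert trigger. As an alternative sketch, not the authors' trigger, the same result can be obtained with a plain query that keeps only the row whose first identifier sorts before the second. This relies on both orderings of every linkage being present in the table. The table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE p_put (seq1 TEXT, seq2 TEXT);   -- hypothetical schema
    INSERT INTO p_put VALUES
        ('P1', 'P2'), ('P2', 'P1'),
        ('P1', 'P3'), ('P3', 'P1');
""")

# Keep one row per unordered pair: the one with seq1 < seq2.
unique_linkages = con.execute("""
    SELECT DISTINCT seq1, seq2
    FROM p_put
    WHERE seq1 < seq2
    ORDER BY 1, 2
""").fetchall()

print(unique_linkages)  # [('P1', 'P2'), ('P1', 'P3')]
```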
[Table: Analysis of the S. cerevisiae and H. sapiens sequences in SWISS-PROT+TrEMBL. Rows: valid DFTs involving domains found in the query sequences; putative functional linkages; filtered functional linkages; functional linkages supported by the scientific literature.]
[Table: Types of functional linkages from the scientific literature.]
[Table: Top ten sources of DFTs in H. sapiens. Columns: total sequences in SWISS-PROT+TrEMBL; sequences per DFT.]
[Table: Top ten sources of DFTs in S. cerevisiae. Columns: total sequences in SWISS-PROT+TrEMBL; sequences per DFT.]
Previous methods for domain fusion analysis [13, 14, 20] are essentially identical to our method, except that our method specifically finds individual "domain" fusions, whereas the previous methods used full-length proteins from one organism, which correspond to a fused full-length protein in another organism. We chose our approach as many proteins consist of multiple domains. For example, consider a fusion protein in the reference organism consisting of domains ABCD, which corresponds to two separate proteins in the query organism, consisting of domains AB and CD. Using our method, the list of reference DFTs would be AB, AC, AD, BC, BD and CD; the list of query DFTs would be AB and CD. Therefore, the valid DFTs that can be used for predicting functional linkages are AC, AD, BC and BD. All four DFTs would predict the same functional linkage between the two query organism proteins. In contrast, previous methods would have only a single fusion event that predicts this functional linkage. Therefore, an additional advantage of our approach is that the number of different DFTs predicting a functional linkage could be used to rank our prediction confidence.
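The ABCD example can be traced with a few lines of Python. This is an illustrative sketch using sets rather than SQL; the protein identifiers R1, Q1 and Q2 and the helper dfts are hypothetical, and counting supporting DFTs is only one possible way to realize the ranking idea mentioned above.

```python
from itertools import combinations
from collections import Counter

def dfts(domains):
    """Unordered pairs of distinct domains in one protein."""
    return set(combinations(sorted(set(domains)), 2))

reference = {"R1": "ABCD"}           # fused protein in the reference organism
query = {"Q1": "AB", "Q2": "CD"}     # separate proteins in the query organism

f_ref = set().union(*(dfts(d) for d in reference.values()))
f_query = set().union(*(dfts(d) for d in query.values()))
f_put = f_ref - f_query              # valid DFTs

# Count how many valid DFTs support each pair of query proteins.
support = Counter()
for a, b in f_put:
    for p, q in combinations(sorted(query), 2):
        p_doms, q_doms = query[p], query[q]
        if (a in p_doms and b in q_doms) or (b in p_doms and a in q_doms):
            support[(p, q)] += 1

print(sorted(f_put))  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
print(dict(support))  # {('Q1', 'Q2'): 4}
```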
Splice variants are treated as separate genes in the intermediate steps of our method, since each variant may interact with different proteins. For example, consider a query gene with two variants, one consisting of domains ABC and the other consisting of domains AC. If it is found that BD is a valid DFT for functional linkage prediction, then the first splice variant could be involved in a putative functional linkage in which the second is not. Finally, the putative functional linkages of the gene are the union of the functional linkages of its splice variants.
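The splice variant example can be sketched as follows. The identifiers (gene G, variants G-1 and G-2, partner protein X) are hypothetical, and the data structures are chosen only to make the per-variant prediction and the final union explicit.

```python
# Each variant maps to (parent gene, set of domains); names are hypothetical.
variants = {
    "G-1": ("G", {"A", "B", "C"}),
    "G-2": ("G", {"A", "C"}),
}
others = {"X": {"D"}}    # another query protein carrying domain D
f_put = {("B", "D")}     # BD is a valid DFT

# Step 1: predict linkages for each variant as if it were a separate gene.
variant_links = {name: set() for name in variants}
for name, (_, doms) in variants.items():
    for partner, partner_doms in others.items():
        for a, b in f_put:
            if (a in doms and b in partner_doms) or (b in doms and a in partner_doms):
                variant_links[name].add(partner)

# Step 2: the linkages of a gene are the union over its variants.
gene_links = {}
for name, (gene, _) in variants.items():
    gene_links.setdefault(gene, set()).update(variant_links[name])

print(variant_links)  # {'G-1': {'X'}, 'G-2': set()}
print(gene_links)     # {'G': {'X'}}
```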
Any prediction method can produce false positive results. Here, we consider several sources of false positives that may be generated by the present method. A false positive occurs when a functional linkage is predicted between two proteins where none exists. One possible source of false positives in domain fusion analysis is the promiscuity or paralogy of domains (for example, BTB, PDZ, SH2 and SH3 domains), which occur at a high frequency in many different protein sequences that do not share similar functions [13, 14, 20]. The removal of promiscuous domains reduces false positives, but choosing a criterion for classifying them is a difficult problem. One criterion relies on finding domains with a Z-score greater than 10 [13, 20], while another relies on finding domains that are involved in domain fusion events with more than 25 other domains.
Another possible source of false positives is the inability to list all the DFTs in the query genome. For example, consider two query genes, one consisting of a domain A and the other consisting of a domain B. If it is found that AB is a valid DFT for functional linkage prediction, then the two query genes are perhaps functionally linked. However, if the query genome's DFT list is incomplete, AB may in fact exist as a fused protein in the query genome; AB would then be wrongly retained as a valid DFT, and the two query genes may be falsely predicted as functionally linked. A number of factors can cause this problem, including the use of an incomplete query genome, absent or inaccurate profile HMM domains and the erroneous prediction of intron and exon sites.
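The second criterion, a cutoff on the number of fusion partners, is simple to express in code. The sketch below is an assumption-laden illustration: the function names are hypothetical, the default cutoff of 25 is taken from the text, and the sample DFTs (SH3 paired with 30 placeholder domains D0 to D29) are invented for the demonstration.

```python
from collections import defaultdict

def promiscuous_domains(dfts, max_partners=25):
    """Domains fused with more than max_partners other domains."""
    partners = defaultdict(set)
    for a, b in dfts:
        partners[a].add(b)
        partners[b].add(a)
    return {dom for dom, others in partners.items() if len(others) > max_partners}

def filter_dfts(dfts, max_partners=25):
    """Drop every DFT that involves a promiscuous domain."""
    banned = promiscuous_domains(dfts, max_partners)
    return {(a, b) for a, b in dfts if a not in banned and b not in banned}

# Invented data: SH3 is fused with 30 different domains, X and Y with one.
example = {("SH3", f"D{i}") for i in range(30)} | {("X", "Y")}

print(promiscuous_domains(example))  # {'SH3'}
print(filter_dfts(example))          # {('X', 'Y')}
```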
The domain fusion analysis using relational algebra presented here relies on the prediction of domains from profile HMMs. In contrast, previous approaches to domain fusion analysis often employed heuristic local pairwise sequence alignment (PSA) algorithms such as BLAST. Such algorithms emphasize finding long, high-scoring local alignments; however, the most strongly conserved residues are commonly distributed across the domain. Therefore, the key drawback of a heuristic PSA-based approach in domain fusion analysis is its relative insensitivity for finding remote homologs and, consequently, domain fusions. Within the last decade, however, the sensitivity of sequence searching techniques has been improved by profile- or motif-based analysis, like the profile HMM, which uses information derived from multiple sequence alignments to construct and search for sequence domains and patterns [35–37]. Unlike the heuristic PSA algorithms, a profile or motif can exploit additional information, such as the position and identity of residues that are conserved throughout the domain, as well as variable insertion and deletion probabilities. Therefore, the advantage of the profile HMM is its sensitivity and accurate delineation of domains; its key drawback is its reliance on the accurate construction of a profile HMM for every domain. If the profile HMM of a domain has not been constructed or has been constructed carelessly, the method will not find all putative domains and, consequently, will miss domain fusions. Thus, as the quality and quantity of separate domain databases such as BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom and TIGRFAMs, and of amalgamated domain databases such as InterPro, increase, our approach to domain fusion analysis will also become increasingly powerful.
The relational algebra method presented here offers an alternative approach to performing domain fusion analysis that leverages existing efforts to improve the size and quality of domain and motif databases. We have illustrated the efficacy of the method by identifying many possible functional linkages in H. sapiens and S. cerevisiae sequences in the SWISS-PROT+TrEMBL database. Interestingly, the genomic distribution of the sources of DFTs suggests that DFTs are found predominantly neither in the most closely related nor in the most remotely related organisms; rather, there is a balance between the two extremes that is tilted toward closely related organisms. Finally, future work could extend the method presented here to other genomes of interest.
The analysis was performed on the Oracle RDBMS (version 8) installed on a computer with dual 750 MHz UltraSPARC-III processors and 4 GB of RAM running SunOS 5.8. Sequence information from SWISS-PROT (Release 39) + TrEMBL (Release 17) and domain architecture information from Pfam were migrated to the sequence table and domain layout table of the database, respectively, by Perl and Oracle SQL*Loader scripts. To perform the analysis, relational algebra expressions were converted to SQL statements and executed by an Oracle SQL*Plus client connected to the database server. The total computation times for H. sapiens and S. cerevisiae were approximately 4 and 3 hours, respectively.
We would like to thank Amanda Mayo for her assistance in verifying functional linkages from protein interaction databases and scientific literature. This work was supported by a Canadian Institutes of Health Research (CIHR) fellowship to KT and a National Cancer Institute of Canada grant to MI. MI is a CIHR Investigator.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.