Gene functional similarity search tool (GFSST)

Background With the completion of the genome sequences of human, mouse, and other species and the advent of high throughput functional genomic research technologies such as biomicroarray chips, more and more genes and their products have been discovered and their functions have begun to be understood. Increasing amounts of data about genes, gene products and their functions have been stored in databases. To facilitate selection of candidate genes for gene-disease research, genetic association studies, biomarker and drug target selection, and animal models of human diseases, it is essential to have search engines that can retrieve genes by their functions from proteome databases. In recent years, the development of Gene Ontology (GO) has established structured, controlled vocabularies describing gene functions, which makes it possible to develop novel tools to search genes by functional similarity. Results By using a statistical model to measure the functional similarity of genes based on the Gene Ontology directed acyclic graph, we developed a novel Gene Functional Similarity Search Tool (GFSST) to identify genes with related functions from annotated proteome databases. This search engine lets users design their search targets by gene functions. Conclusion An implementation of GFSST which works on the UniProt (Universal Protein Resource) for the human and mouse proteomes is available at GFSST Web Server. GFSST provides functions not only for similar gene retrieval but also for gene search by one or more GO terms. This represents a powerful new approach for selecting similar genes and gene products from proteome databases according to their functions.


Background
Cellular function in a biological system normally involves participation and interaction of multiple genes. Mutations that alter function of any one of these genes can potentially increase disease susceptibility. For example, the tumor suppressor gene BRCA1 suppresses cell growth and participates in transcription-coupled DNA damage repair. Mutations in BRCA1 increase the risk of early onset breast cancer as well as ovarian and prostate cancer [ [1,2], and [3]]. Genes with functions similar to BRCA1 can be considered additional candidate genetic risk factors for breast, ovarian, prostate, or other cancers.
One common approach to identify functionally similar genes is to find genes that share significant sequence homology. However, functional similarity does not always require sequence similarity. For example, both P53 [4] and BRCA1 [5] function as tumor suppressor genes. Similar to BRCA1, mutations in P53 have also been found in breast cancer patients [6]. The two genes share no sequence homology. As a result, a sequence similarity search tool, such as BLAST [7], is unable to reveal their functional similarity.
An alternative to sequence homology search is key word search, but this approach has two weaknesses. First, key words for gene functions are not well defined. Second, key word search is a static binary operation and therefore cannot quantify the significance of the search results, which makes it difficult for a user to evaluate the outcome.
In recent years, the GO Consortium has made major advances in establishing structured, controlled vocabularies describing gene functions. Each vocabulary is structured as a directed acyclic graph (DAG), wherein any term may have more than one parent term as well as zero, one, or more child terms. The DAG structure provides a much richer description of the underlying biology than would be possible with a strictly hierarchical tree graph. GO describes genes and proteins using three categories: Biological Process, Molecular Function, and Cellular Component [ [8] and [9]]. "Molecular Function (MF) describes catalytic or binding activities, at the molecular level. Individual molecular function terms include the broad concept such as 'kinase activity' and the more specific '6phosphofructokinase activity', a subtype of kinase activity. Biological Process (BP) describes biological processes accomplished by one or more ordered assemblies of functions. High-level processes such as 'cell death' can have subprocesses, such as 'apoptosis' and 'apoptotic chromosome condensation'. Cellular Component (CC) describes subcellular structures and macromolecular complexes. Examples of cellular components include 'nuclear inner membrane', with the synonym 'inner envelope', and the 'ubiquitin ligase complex', with several subtypes of these complexes represented." Currently, the GO vocabulary consists of more than 18,000 terms [ [10,11], and [12]].
UniProt (Universal Protein Resource) is a comprehensive catalogue of information on protein sequence and function created by integrating data from Swiss-Prot, TrEMBL, and PIR. UniProt is the central access point for extensive curated protein information, including function, classification, and cross-reference. GO assignments have been applied to data sets in UniProt representing the complete human and mouse proteomes by a combination of electronic mappings and manual curation [ [10] and [11]]. This vocabulary has also been applied to the nonredun-dant proteome sets in UniProt for all other completely sequenced organisms as well as to proteins from a wide range of organisms where the genome is not yet complete.
Recently, some web servers (UCSC Gene Sorter [13] and GOToolBox [14]) have provided gene search engines for shared Gene Ontology (GO) terms and some stand-alone software packages (GO Graph [15] and DynGO [16]) also provide gene similarity search functions.
Using different similarity measures and search methodologies, we have developed a search engine: Gene Functional Similarity Search Tool (GFSST), to identify candidate genes based on their functional similarities. We have implemented GFSST for the human and mouse proteomes in the UniProt database. For a given protein with known functions or a set of protein functions, and a given proteome, the GFSST search engine not only can find any proteins in the proteome with the same functions (shared GO terms) but also can discover proteins with similar functions (not necessarily shared, but with very similar GO terms). We have defined a statistical measure: D-value (Distribution value) to quantify functional similarity of genes based on the GO directed acyclic graph (DAG). There are three levels of definition for D-values. First, the D-value for each GO term on GO DAG is defined as a positive value with additive and monotonic properties. The D-value for each term is the sum of the D-values of its direct children (additive property). Obviously, the Dvalue of the child cannot be greater than the D-value of its parent (monotonic property). In its current implementation, the D-value at the root of GO DAG is assigned a value of 1, and all leaf terms have the same D-value. Then, the D-value for every pair of GO terms is defined as the minimum D-value of their common parent terms. For identical GO terms, the D-value is set to zero. The D-value for a pair of genes is defined as the mean of the D-values for their matched GO term pairs. More details can be found in the METHODS section and in the DISCUSSION.

Results
We have developed a GFSST search engine for human and mouse proteomes. Input data can be a set of GO terms or a protein identifier (either a protein name or an accession number) in the human or mouse UniProt database. The search can be carried out against the human or the mouse proteome. The search engine can also be used as retrieval tool for the human and mouse UniProt databases.

Search by a source protein
Given a source protein, GFSST can retrieve functionally related proteins from a specific proteome. To illustrate this process, we used human BRCA1 as an example. BRCA1_HUMAN (accession # P38398) in UniProt (EBI GOA_human release 28) has 9 GO terms in the Biological Process category ( Table 1). Most of those terms refer to DNA repair and regulation of apoptosis. The top 10 hits of BRCA1_HUMAN searched by GFSST are listed in Table  2, one of which is P53_HUMAN. The two genes share one common GO term GO:0045786 (negative regulation of cell cycle) and 8 similarly matched pairs of GO terms. One of the pairs is GO:0006978 (DNA damage response, signal transduction by P53 class mediator resulting in transcription of P21 class mediator) and GO:0008630 (DNA damage response, signal transduction resulting in induction of apoptosis). Their common parent with minimum D-value 0.0000752370 is GO:0042770 (DNA damage response, signal transduction) ( Figure 1). So the D-value for the pair of GO terms is 0.0000752370. More GO terms matched for P53 and BRCA1 are shown in Table 3. There are a total of 97 hits with D-values < 0.05 in the three GO categories after BRCA1_HUMAN search by GFSST, one of which is BRCA2_HUMAN with D-value 0.0476423351.
In mouse model research, it is important to find mouse proteins that are functionally similar to a human protein.
The top 10 matches with D-values < 0.05 in searching the UniProt mouse proteome (EBI GOA_mouse release 14) with gene BRCA1_HUMAN in the Biological Process category are listed in Table 4.

Search by GO terms
GFSST provides a robust retrieval tool for gene and gene products based on their associated GO terms. This is a more flexible approach than searching with a source protein. Users can design their search targets by a single or a combination of GO terms. Given GO terms, GFSST can retrieve gene products from a specific proteome. Thus users can design their search by providing a set of gene functions (GO terms). GFSST can find genes or gene products matched by those functions and/or by similar functions.
For example, glucose metabolism is a critical pathway in the study of diabetes. The target proteins with the glucose metabolism function (GO term GO: 0006006) in the Biological Process category will thus be relevant to diabetes. GFSST delivered 19 exact matches for this GO term in the UniProt human proteome. Insulin tops the search results.
DNA damage response, signal transduction by P53 class mediator (GO:0030330) is a very important function in cancer research. Performing GFSST search for this GO term in the Biological Process category we find no exactly matching gene products in the UniProt human proteome. There are four protein hits, including BRCA1_HUMAN Specifically binds to the upstream regulatory region of type I IFN and IFN-inducible MHC class I genes (the interferon consensus sequence (ICS)) and activates those genes.  lation of apoptosis), and GO:0006917 (induction of apoptosis), describe the biological processes involved in angiogenesis inhibition. GFSST found 402 proteins with D-value < 0.05 via the above four GO terms. The first 10 proteins are listed in Table 5.

Discussion
GFSST is a new search engine that facilitates the selection of candidate genes and gene products for human disease, for mouse models, for biomarkers, and for drug targets. GFSST can retrieve function-specific gene products (such as proteins, antibodies, and enzymes) from the UniProt database based on assigned ontology annotations rather than sequence, sequence motif, or other sequence attributes. The GFSST search quality will improve with the increasing accuracy and completeness of GO DAG and gene annotation by GO terms. Its usefulness will grow as biological studies expand our current knowledge of the proteome.
It is an important and unique feature of GFSST that users are able to retrieve genes not only from annotated genes, but also from a combination of GO terms. GFSST can retrieve not only the genes with shared GO terms but also the genes with very similar (not necessarily the same) GO terms. In order to make GFSST useful not only for users who are familiar with GO terms, more technologies may be needed such as natural language search tools.
Well-defined D-values for GO terms, for a pair of GO terms, and for a pair of annotated genes (equivalent to a pair of sets of GO terms) are the foundation of GFSST. We have not yet considered evidence codes in the calculation of D-values. For stronger evidence, a heavier weighting factor should be used. We plan to add evidence codes to our algorithm in the next version of our web server.
It is possible to assign different distributions to GO terms. Lord and his collaborators [15] have used the frequencies of the matches to GO terms by UniProt gene products as distributions. We can calculate probabilities p for GO terms and pairs of GO terms exactly as done in [15], and then calculate the D-value for two groups of GO terms with our greedy pairing algorithm. Using this D-value system, we have tested GFSST for BRCA1_HUMAN. Different but similar results are obtained. Outputs obtained using the two measures for BRCA1 gene similarity search are listed in Table 6 (data source: DAG-Edit version 1.416; EBI  GOA_human release 19). Our D-value distributions do not need to calculate the hit frequencies, and are not sensitive to annotation changes; they are in fact essentially annotation-independent. In other words, the D-value is a universal measure for GO terms and pairs of GO terms, and so can be used for any GO annotated database search. The similarity measure used in [15] is dependent on the probability of the specific term occurring in the annotated database, and thus is not a universal measure. This becomes a significant issue for cross-species, such as human-to-mouse, searches, and searches against sources with different annotations, since separate probabilities, p(c), would need to be calculated for every individual and every pair of species/database similarity searches. Nonetheless, in future releases of GFSST, an option for selecting measures other than D-values will be provided for limited databases.
There are some available web servers for gene similarity search that are not entirely restricted to homology. The UCSC Gene Sorter [13] "displays a table of genes within a selected genome that are related to one another." The GOToolBox [14] "has developed methods and tools based on the Gene Ontology (GO) resource allowing the identification of statistically over-or under-represented terms in a gene dataset; the clustering of functionally related genes within a set; and the retrieval of genes sharing annotations with a query gene." We tested Gene Sorter and GOToolBox with the source protein BRCA1_HUMAN. A comparison of the results from GFSST, Gene Sorter and GOToolBox is presented in Table  7. Most of the GOToolBox gene products listed in Column 2 of Table 7 fall within the parts of the ontology dealing with apoptosis, androgen binding, and similar categories, perhaps because the algorithm will usually search for terms no more than one step up in the GO DAG. In contrast, the hits using GFSST (Column 1) seem to cover a broader spectrum of the ontology, e.g., DNA damage response, cell proliferation, centriole regulation, receptor signaling in addition to those already mentioned, due to the fact that the computation is not limited to only one step up the hierarchy. Hence, GFSST provides the opportunity to find unsuspected gene products that may cooperate (either directly or indirectly) to subserve larger biological, enzymatic or metabolic functions, for example shared elements of multiple signal transduction pathways. In the case of the GO search directed from the UCSC Browser (Column 3), only those gene products which have been characterized in the available gene expression studies, albeit quite a large portion of the known genes, will be targeted; this may explain the different hits, for a perhaps narrower range of specific categories, and a few seemingly more farfetched ones.
The similarity measures based on the shared GO terms are simple and useful methodologies. But they are limited to the fact that if there are no shared terms, the similarity score (similarity percentage formulas (3), (4), and (5) in [14]) will be 0 even if two genes have very similar functions. Recognizing this weakness, the authors of GOTool-Box extended the shared GO term from "common associated term" to "common parent terms" [14]. We understand this to mean the most immediate "parent terms". Our D-value measurement defines a similarity score for every pair of GO terms. For example, the pair GO: 0006978 (DNA damage response, signal transduction by P53 class Mediator resulting in transcription of P21 class mediator) and GO: 0008630 (DNA damage response, signal transduction resulting in induction of apoptosis) do not have the most immediate one step up parent (see Figure 1). GOToolBox considered this pair not shared. In contrast, the D-value in GFSST is 0.0000752370 for this pair, which is one of the aligned pairs for BRCA1_HUMAN and P53_HUMAN (see Table  3). In Table 7, GFSST listed P53_HUMAN as the protein with the most significant score for the BRCA1_HUMAN search, but it is not at the top of the GOToolBox search results. It is very interesting that Gene Sorter listed TP53 at the top of its output, which may be explained by the fact that, in addition to shared terms, Gene Sorter uses popularity score in its search.
In short, different search engines have different algorithms and different features. If one wishes to find similar genes with shared GO terms, Gene Sorter and GOToolBox are good choices. If one is interested not only in shared, but also similar GO terms, GFSST is preferable. Beyond that, GFSST allows users the flexibility to design their search targets.

Conclusion
An implementation of GFSST which works on the UniProt (Universal Protein Resource) for the human and mouse proteomes is available at GFSST Web Server [17]. GFSST provides functions not only for similar gene retrieval but also for gene search by one or more GO terms. This represents a powerful new approach for selecting similar genes and gene products from proteome databases according to their functions.

Statistical measures on the GO Directed Acyclic Graph (DAG)
Expressions in the GO are descriptions of concepts situated in an ontology formed by a DAG with concept inclusion as the ordering relation from less specialized terms (parents) to more specialized terms (children). In order to keep the ontology consistent in concept, there are no cycles in the DAG representation. We use a recursive procedure to define a D-value for each term in the DAG. First, we set the count to 1 for each leaf. Then we assign the count for every term recursively by summing the counts of all its direct children from the bottom up. For example, in Figure 1 (a subgraph of Biological Process GO DAG, DAG-Edit version 1.419 rev 3), there are three leaf terms GO:0006978, GO:0006977, and GO:0042771; each is assigned a count of 1. The term GO:0030330 is assigned a count of 3 by summing the counts of its direct children.
The other terms at the same level as GO:0030330 have only one direct child each, and each direct child has count 1; each of these terms is thus assigned a count of 1. The term GO:0042770 therefore has direct children with counts 1, 3, 1 and 1, and is thus assigned a count of 6. According to previous reports [18], the distributions for terms on the GO DAG are monotonically nondecreasing, moving from child term to parent term. In our implementation, the distribution (D-value) for a term is its count divided by the count of the root of the whole GO DAG. As shown in Figure 1, the D-value equals the count divided by 79751 (the count for the root of the Biological Process GO DAG). Since the count for any term is the sum of the counts of its direct children, the distribution of a term is the sum of the distributions of its direct children. In addition to its monotonic property, our distribution is additive.
We define the D-value for a pair of GO terms based on the definition of D-value for a single GO term. For a pair of identical terms, it is natural to define the pair's D-value as equal to 0. Given two different terms c1 and c2 on the GO DAG, they can share parents by multiple paths, as GO allows multiple parents for each term. Let P (c1, c2) denote the set of parental terms shared by both c1 and c2. We define the D-value for the pair of terms c1 and c2 as the minimum D-value of their parents, as shown in the following equation.
The above formula to define the D-value for a pair of GO terms is the same as formula (1) in [15] to define the probability of the minimum subsumer. Instead of using distribution D(c) for GO terms, Lord et al use a frequency probability, p(c). Both definitions are monotonic.
We implement the GFSST application in four steps in the preparation phase. First, we construct an objective class for Gene Ontology structure: the specific directed acyclic graph. Second, a topological search is performed for the GO DAG. Any DAG can be ordered along a vertical line so that all directed edges go from up to down and all children are under their parents [19]. Third, we set the count to 1 for leaf terms. Then we calculate the counts for nonleaf terms by recursive addition on topologically ordered terms from bottom to top, and calculate the D-value by dividing counts by the total count. An example of D-value calculation is shown in Figure 1. First, the counts are calculated, for example, the GO:0042770 has 6 counts and its distribution is 0.

Statistical measures for gene products
Gene products (proteins) often have more than one term in the GO DAG. In order to retrieve all genes associated with the same and/or similar terms from a given proteome, it is necessary to develop a measure between two term groups on the GO DAG. Instead of using a simple (1) For a pair of gene products with the same number (N) of terms on GO DAG, every term for one gene product is matched once and only once to a term for another gene product. There are N factorial numbers of combinations.
The best solution will be one with the minimum sum of their D-values, but determining the minimum is an NPcomplete problem. We have developed a greedy algorithm to obtain an approximate solution. First we calculate D-values for all pairs of terms from different gene products. After sorting all pairs according to D-value, we choose pairs one by one, starting from the pair with the minimum D-value, following the once and only once rule.
(2) If the source gene product has fewer terms than the target gene product, we choose the same number of terms from the target gene product so that the sum of the D-values generated by our greedy algorithm is the minimum. In our implementation, the same greedy procedure is performed, since the number of terms from the source gene product is less than the number of terms from the target gene product. When the once and only once rule is satisfied for the source gene product, the solution is obtained and the procedure is finished. Thus some terms from the target gene product will have no matches.
(3) If the source gene product has more terms than the target gene product, the once and only once rule should be followed for source gene product and the at least once rule is followed for target gene product. Thus some terms from the target gene product will be matched more than once.
In summary, there are two rules that are followed in our algorithm: the minimization of the final D-value and the once and only once rule for the source gene product. After pairing, it is straightforward to get the average.
In order to understand the meaning of the D-value for two genes or gene products, we present some examples. We have calculated the D-values for two pairs of genes in the Biological Process category. The pair BRCA1_HUMAN and P53_HUMAN has D-value 0.0064676759; and the pair BRCA1_HUMAN and SIAS_HUMAN (Sialic acid synthase) has D-value 0.1631506857. From a biological point of view, P53 is closer to BRCA1 than SIAS, consistent with our D-values.
Our algorithm is different from a simple average [15,16].
In the simple average formula, every GO term in one group will match every GO term in the other group, which does not always make sense. For two identical groups of GO terms or identical genes, it makes sense to match only identical GO terms. Our algorithm guarantees identical matches. Therefore the average D-value is obviously 0, the best value.
The Czekanowski-Dice [14] "formula emphasizes the importance of the shared GO terms by giving more weight to similarities than to differences. Consequently, for two genes that do not share any GO terms, the distance value is 1, the highest possible value, whereas for two genes sharing exactly the same set of GO terms, the distance value is 0, the lowest possible value." GFSST emphasizes the importance of the shared GO terms in a different way. First every hit should reach the threshold D-value. Then GFSST presents the search results by sorting according to the number of exact matches. If there  BRCA1_HUMAN  BRCA1_HUMAN  BRCA1  P53_HUMAN  CDK7_HUMAN  TP53  VHL_HUMAN  RNF4_HUMAN  BARD1  P73_HUMAN  RNF14_HUMAN  BIRC2  INHA_HUMAN  Q96PK6  BIRC3  INHBA_HUMAN  PIAS2_HUMAN  MNAT1  CGRF1_HUMAN  TIP60_HUMAN  PML  MAT1_HUMAN  PIAS1_HUMAN  RAD18  PMS1_HUMAN  ABL1_HUMAN  RNF14  RBBP8_HUMAN  P73_HUMAN  TIF1  IRF1_HUMAN  VHL_HUMAN  TRIM28