LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics
© Smith et al. 2007
Published: 09 May 2007
A key abstraction in representing proteomics knowledge is the notion of unique identifiers for individual entities (e.g. proteins) and the massive graph of relationships among them. These relationships are sometimes simple (e.g. synonyms) but are often more complex (e.g. one-to-many relationships in protein family membership).
We have built a software system called LinkHub using Semantic Web RDF that manages the graph of identifier relationships and allows exploration with a variety of interfaces. For efficiency, we also provide relational-database access and translation between the relational and RDF versions. LinkHub is practically useful in creating small, local hubs on common topics and then connecting these to major portals in a federated architecture; we have used LinkHub to establish such a relationship between UniProt and the Northeast Structural Genomics Consortium. LinkHub also facilitates queries and access to information and documents related to identifiers spread across multiple databases, acting as "connecting glue" between different identifier spaces. We demonstrate this with example queries discovering "interologs" of yeast protein interactions in the worm and exploring the relationship between gene essentiality and pseudogene content. We also show how "protein family based" retrieval of documents can be achieved. LinkHub is available at hub.gersteinlab.org and hub.nesg.org with supplement, database models and full source code.
LinkHub leverages Semantic Web standards-based integrated data to provide novel information retrieval to identifier-related documents through relational graph queries, simplifies and manages connections to major hubs such as UniProt, and provides useful interactive and query interfaces for exploring the integrated data.
Biological research is producing vast amounts of data (e.g. from high-throughput experiments such as sequencing projects and microarray experiments) at a prodigious rate. Most of this data is made freely available to the public, and this has created a large and growing number of web-accessible biological data resources that are distributed, heterogeneous, and highly variable in size, ranging from huge mega-databases such as UniProt down to medium, small or "boutique" databases (e.g. TRIPLES) generated for medium- or small-scale experiments or particular purposes. Most computational analyses of biological data require multiple integrated datasets, and integrated data, together with rich, flexible and efficient interfaces to it, encourages exploratory data analysis that can lead to serendipitous new discoveries: the whole is greater than the sum of its parts. Currently, integration must often be done manually in a labor- and time-intensive way: finding relevant datasets, understanding them, writing code to combine them, and finally doing the desired analysis. The basic requirements for better, more seamless integrated analysis are uniformity and accessibility; data are ineffectual if scattered among incompatible resources.
Web search engines and hyperlinks are the basic and commonly used ways to find and navigate web content, but they do not enable fine-grained cross-site analysis of data. Improving on this requires standards, their widespread adoption, and tools that support them. Biological data is too vast for brute-force centralization to be the complete solution to data interoperability. We need standards and systems that let people and groups create and publish data independently (although ultimately cooperatively and collaboratively) while still ensuring that all or most of the pieces of biological knowledge and data end up connected in semantically rich ways. The W3C's Semantic Web [4–6] is a promising candidate: it allows web information to be expressed in fine-grained, structured ways so that applications can more readily and precisely extract and cross-reference key facts without having to disambiguate meaning from natural language text. Standard, machine-readable ontologies such as the Gene Ontology are also being created and their common use encouraged to further reduce semantic ambiguity, although there is a need to make these ontologies more machine-friendly.
A basic problem preventing this graph of relationships from being more fully realized is the problem of nomenclature. Often, there are many synonyms for the same underlying entity caused by independent naming, e.g. structural genomics centers assigning their own protein identifiers in addition to UniProt's. There can also be lexical variants of the same underlying identifier (e.g. GO:0008150 vs. GO0008150 vs. GO-8150). Synonyms are a small part of the overall problem, however, and more generally there are many kinds of relationships including one-to-one and one-to-many relationships. For example, a single Gene Ontology or PFAM identifier can be related with many UniProt identifiers (i.e. they all share the same functional annotation or family membership). PFAM represents an important structuring principle for proteins and the genes they come from, the notion of families (or domains) based on evolution; proteins sharing common PFAM domains are evolutionarily related (called homologs) and likely have the same or similar functions. Gene Ontology consists of three widely used structured, controlled vocabularies (ontologies) that describe gene products such as proteins in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. The conceptual graph of identifier relationships is richly connected, and a transitive closure even a few levels deep can lead to indirect relationships with a great number of other entities. Being able to store, manage, and work with this graph of entities and relationships can lead to many opportunities for interesting exploratory analysis and LinkHub is such a system for doing this.
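The lexical-variant problem mentioned above can be illustrated with a minimal sketch that collapses the three Gene Ontology spellings from the text into one canonical form. The function name `normalize_go_id` and the set of accepted separators are assumptions made for illustration only; this is not LinkHub's actual normalization code.

```python
import re

# Accept "GO" followed by an optional separator and 1-7 digits.
# The separator set is an assumption chosen to cover the variants in the text.
_GO_PATTERN = re.compile(r"^GO[:\-_ ]?(\d{1,7})$", re.IGNORECASE)


def normalize_go_id(raw):
    """Return the canonical 'GO:NNNNNNN' form, or None if raw is not GO-like."""
    match = _GO_PATTERN.match(raw.strip())
    if match is None:
        return None
    # Gene Ontology identifiers are written with a zero-padded 7-digit number.
    return "GO:{:07d}".format(int(match.group(1)))


variants = ["GO:0008150", "GO0008150", "GO-8150"]
canonical = {normalize_go_id(v) for v in variants}
```

With the three variants from the text, `canonical` contains a single identifier, so all three spellings would map to the same graph node.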
Centralized data integration to an extent does make sense, e.g. a lab or organization might want to create a local data warehouse of interconnections among its individual data resources; but it does not want to have to explicitly connect its data resources up to everything in existence, which is impossible. The key idea is that if groups independently maintaining data resources each connect their resources up to some other resource X, then any of them can reach any other through these connections to X, and we can collectively achieve incremental global data integration in this way. LinkHub is a software architecture and system which aims to help realize this goal by enabling one to create such local minor hubs of data interconnections and connect them to major hubs such as UniProt in a federated "hub of hubs" framework and this is illustrated in figure 1b.
In the results section next, we will demonstrate how LinkHub enables novel information retrieval to documents attached to LinkHub graph nodes based on the relational structure of the LinkHub graph; a particular practical use case of this, providing "family views" to data, will be given. We will then give concrete examples of the kinds of integrated, cross-database queries that can be done with LinkHub, in combination with a previous system of ours called YeastHub, in support of scientific exploratory analysis; example queries discovering "interologs" of yeast protein interactions in the worm and exploring the relationship between gene essentiality and pseudogene content will be given. We will then discuss related work to LinkHub and future directions before concluding. In the methods section we describe implementation details of LinkHub, including its data models and how they are populated with data and LinkHub's web interactive and query interfaces.
The "path type" interface to LinkHub allows one to flexibly retrieve useful subsets of the web documents attached to identifier nodes in the graph based on the graph's relational structure. Search engines relying on keyword searches cannot provide such access, and LinkHub thus enables novel information retrieval over its known web documents. An important practical use of this "path type" interface is as a secondary, orthogonal interface to other biological databases, providing different views of their underlying data. For example, MolMovDB provides movie clips of likely 3D motions of proteins, and one can access it by PDB identifiers. However, a useful alternative interface (actually provided by LinkHub) is a "family view", where one queries with a PDB identifier and can view all available motion pages for proteins in the same family as the query PDB identifier. LinkHub also provides a similar "family view" interface to structural genomics data in the SPINE system. The system is flexible and one can easily imagine other similar applications, e.g. a "functional view" showing all pages for proteins that have the same Gene Ontology function as a given protein, or a "pseudogene family view" showing all pages for pseudogenes of proteins in the same family. While the "path type" interface is a simple way of providing novel, relational access to LinkHub identifier node-linked documents, RDF query language access to the LinkHub relational graph would allow the most flexible information retrieval.
To demonstrate the data interaction and exploration capabilities enabled by the RDF version of LinkHub, the RDF-formatted LinkHub dataset is loaded into our YeastHub system, which uses Sesame as its native RDF repository. The two demonstration queries below, written in SeRQL (the Sesame implementation of RQL), show that one can efficiently carry out the kinds of preliminary scientific investigation and exploratory analysis commonly done at the beginning of research initiatives (e.g. to see whether they are worth pursuing further). These queries make use of information present in both YeastHub and LinkHub (and thus could not be done without joining the two systems), and LinkHub is used as "glue" to provide connections (both direct and indirect) between different identifiers. It is noteworthy that these queries can be formulated and run in relatively little time (a few hours at most) and that they roughly duplicate some results from published papers. In effect, LinkHub does the time-consuming manual work of integrating multiple datasets up front, and the integrated data is then generally useful for efficient formulation and execution of queries; the published papers, by contrast, likely required extensive "one-off" effort to combine the necessary data to achieve their results.
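As a rough illustration of a "path type" lookup, the sketch below walks a small typed identifier graph along a sequence of identifier types and collects the documents attached to the nodes it reaches. The graph layout, the function names, the toy identifiers and the `example.org` URLs are all hypothetical; LinkHub's own implementation is not shown in this article.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map over (id_type, identifier) nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def follow_type_path(graph, start, type_path):
    """Return the nodes reached from start by stepping through type_path in order."""
    frontier = {start}
    for wanted_type in type_path:
        frontier = {
            neighbor
            for node in frontier
            for neighbor in graph.get(node, ())
            if neighbor[0] == wanted_type
        }
    return frontier


def documents_for(nodes, documents):
    """Collect the (sorted) document URLs attached to a set of nodes."""
    return sorted(url for node in nodes for url in documents.get(node, []))


# Toy data: placeholder identifiers, not real database entries.
edges = [
    (("PDB", "1abc"), ("UniProt", "P00001")),
    (("PDB", "2xyz"), ("UniProt", "P00002")),
    (("PDB", "3qrs"), ("UniProt", "P00003")),
    (("UniProt", "P00001"), ("Pfam", "PF00001")),
    (("UniProt", "P00002"), ("Pfam", "PF00001")),
    (("UniProt", "P00003"), ("Pfam", "PF00002")),
]
documents = {
    ("PDB", "1abc"): ["http://example.org/motions/1abc"],
    ("PDB", "2xyz"): ["http://example.org/motions/2xyz"],
    ("PDB", "3qrs"): ["http://example.org/motions/3qrs"],
}

graph = build_graph(edges)
FAMILY_VIEW = ["UniProt", "Pfam", "UniProt", "PDB"]
family = follow_type_path(graph, ("PDB", "1abc"), FAMILY_VIEW)
family_pages = documents_for(family, documents)
```

Here the "family view" for `1abc` returns the pages for `1abc` and `2xyz`, which share a Pfam family in the toy data, but not `3qrs`. Other views would only need a different type path.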
yeast gene name → UniProt Accession → Pfam accession → UniProt Accession → WormBase ID.
Then, for each pair in the yeast protein interaction dataset, we determine whether both of its yeast gene names lead to WormBase IDs in this way and, if so, print those WormBase IDs as possible protein interactions.
Pseudogenes are genomic DNA sequences similar to normal genes (and usually derived from them) that are not expressed as functional proteins; they are regarded as defunct relatives of functional genes [21, 22]. In the queries here we explore the relationship between gene essentiality (a measure of how important a gene is to survival of an organism) and the number of pseudogenes in an organism. We might hypothesize that more essential genes have larger numbers of pseudogenes, and we explore this idea with queries of the joined YeastHub and LinkHub data. First, YeastHub has the MIPS Essential Genes dataset, and we use this as our data on gene essentiality; LinkHub contains a small dataset of yeast pseudogenes.
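The interolog procedure described above can be sketched in plain Python as a chain of identifier mappings. This is only an illustration of the logic, not the SeRQL query used in the paper; the mapping dictionaries, function names and identifiers are made-up toy values.

```python
from itertools import product


def traverse(start, mappings):
    """Follow a chain of dict-of-set mappings from a single starting identifier."""
    current = {start}
    for mapping in mappings:
        reached = set()
        for identifier in current:
            reached |= mapping.get(identifier, set())
        current = reached
    return current


def candidate_interologs(interactions, mappings):
    """Return worm ID pairs for interactions whose two partners both reach worm IDs."""
    pairs = []
    for gene_a, gene_b in interactions:
        worm_a = traverse(gene_a, mappings)
        worm_b = traverse(gene_b, mappings)
        if worm_a and worm_b:
            pairs.extend(product(sorted(worm_a), sorted(worm_b)))
    return pairs


# Toy mappings for: yeast gene name -> UniProt -> Pfam -> UniProt -> WormBase ID.
gene_to_uniprot = {"GENE_A": {"Y1"}, "GENE_B": {"Y2"}, "GENE_C": {"Y3"}}
uniprot_to_pfam = {"Y1": {"PF_X"}, "Y2": {"PF_Y"}, "Y3": {"PF_Z"},
                   "W1": {"PF_X"}, "W2": {"PF_Y"}}
pfam_to_uniprot = {"PF_X": {"Y1", "W1"}, "PF_Y": {"Y2", "W2"}, "PF_Z": {"Y3"}}
uniprot_to_wormbase = {"W1": {"WB1"}, "W2": {"WB2"}}

path = [gene_to_uniprot, uniprot_to_pfam, pfam_to_uniprot, uniprot_to_wormbase]
yeast_interactions = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]
interologs = candidate_interologs(yeast_interactions, path)
```

In the toy data only the first interaction yields a candidate pair, because `GENE_C` has no family member with a WormBase ID.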
Abstractly, for each yeast gene name in the list of essential genes, we determine its pseudogenes by traversing identifier type paths in the LinkHub graph like the following:
yeast gene name → UniProt Accession → yeast pseudogene
For each essential yeast gene we then determine how many pseudogenes it has. We can then inspect the list of essential genes to see if there is a relationship between essentiality and number of pseudogenes. Humans have a large number of known pseudogenes, but gene essentiality is difficult to characterize in humans (with many tissue types and developmental states complicating the issue). Since essentiality is well studied in yeast, one thing we can do is determine the human homologs of yeast essential genes, which are plausibly "more important" in a survival sense, and examine them for patterns associated with essentiality. For each yeast gene name in the list of essential genes, we can find the homologous pseudogenes in human by traversing identifier type paths in the LinkHub graph like the following:
yeast gene name → UniProt Accession → Pfam accession → human UniProt Id → UniProt Accession → Pseudogene LSID
Part of the SeRQL for the first query (for yeast pseudogenes) and results from both queries can be seen in figure 3; they show that few yeast essential genes are associated with pseudogenes, whereas this is not the case in human. This may reflect the difference in the processes that created the predominant classes of yeast and human pseudogenes (duplication vs retrotransposition, see [21, 22]).
The basic conceptual underpinnings of LinkHub, i.e. the importance of biological identifiers and of linking them, were laid out by Karp. LinkHub uses a Semantic Web approach to build a practical system based on and extending Karp's ideas on database links. The Semantic Web approach can also be used to implement database integration solutions based on the general approaches of data warehousing [27, 28] and federation [29–31]. Essentially, data warehousing focuses on data translation, i.e. translating and combining multiple datasets into a single database, whereas federation focuses on query translation, i.e. translating and distributing the parts of a query across multiple distinct databases and collating their results into one. A methodological overview and comparison of these database integration approaches has been discussed in the biomedical context. LinkHub's architecture is a hybrid of these two approaches: individual LinkHub instantiations are a kind of mini, local data warehouse of commonly grouped data, and these are connected to large major hubs such as UniProt in a federated fashion; efficiency is gained by obviating the need for all source datasets to be individually connected to the major hubs.
LinkHub differentiates itself by not integrating all aspects of biological data but rather focusing on an important and more manageable high-level structuring principle, namely biological identifiers and the relationships (and relationship types) among them; hyperlinks to identifier-specific pages present in the "Links" section of the LinkHub web interface give access to additional attributes and data. In fact, our YeastHub system addressed integration more generally by transforming many datasets to a common RDF format and storing and giving RDF query access to them in an RDF database. The problem with YeastHub was that the integration was thin, with only limited connections among the integrated datasets. LinkHub is thus useful and complementary to YeastHub in this respect as a "connecting glue" among the datasets, in that it makes and stores these cross-references and enables better integrated access to the YeastHub data; the example queries above demonstrated this.
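The counting step behind the yeast query (yeast gene name → UniProt Accession → yeast pseudogene) can be sketched as follows. The dictionaries, function names and identifiers are hypothetical toy values, and the numbers they produce say nothing about the real datasets.

```python
def pseudogene_counts(essential_genes, gene_to_uniprot, uniprot_to_pseudogenes):
    """Count distinct pseudogenes reachable from each essential gene."""
    counts = {}
    for gene in essential_genes:
        pseudogenes = set()
        for accession in gene_to_uniprot.get(gene, ()):
            pseudogenes |= uniprot_to_pseudogenes.get(accession, set())
        counts[gene] = len(pseudogenes)
    return counts


def fraction_with_pseudogenes(counts):
    """Fraction of genes that have at least one associated pseudogene."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 0) / len(counts)


# Toy data only.
essential_genes = ["GENE_A", "GENE_B", "GENE_C"]
gene_to_uniprot = {"GENE_A": {"Y1"}, "GENE_B": {"Y2"}}
uniprot_to_pseudogenes = {"Y1": {"PSEUDO_1", "PSEUDO_2"}}

counts = pseudogene_counts(essential_genes, gene_to_uniprot, uniprot_to_pseudogenes)
fraction = fraction_with_pseudogenes(counts)
```

The human variant of the query would differ only in the mapping chain, which passes through Pfam to human UniProt entries before reaching pseudogene identifiers.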
Currently, LinkHub has limited web document hyperlinks attached to its nodes, and if this could be increased the utility of the novel information retrieval based on querying the LinkHub relational graph, e.g. "path type" interface, would be enhanced. We are working to leverage the rich information in the LinkHub relational graph for enhanced automated information retrieval to web or scientific literature (MedLine) documents relevant to identifier nodes, e.g. proteomics identifiers, in the graph. A simple search for the identifier itself would likely not give optimal results due to conflated senses of the identifier text and identifier synonyms. In general, we need to consider and query for the key related concepts of an identifier, and these are present in the LinkHub subgraph surrounding the identifier. We consider the web pages attached to the identifiers in the subgraph as a "gold standard" for what additional relevant documents should be like, and we plan to use them as training sets to construct classifiers used to score and rank additional documents for relevance. We feel that this idea could be generalized and that the Semantic Web, which provides detailed information about terms and their relationships, could be leveraged to provide enhanced automated information retrieval or web search for Semantic Web terms.
We also hope to explore how other relevant Semantic Web-related technologies could be effectively used in LinkHub, in particular named graphs and Life Science IDentifiers or LSIDs. Named graphs allow RDF graphs to be named by URI, allowing them to be described by RDF statements; named graphs could be used to provide additional information (metadata) about identifier mappings, such as source, version, and quality information. LSID is a standard object naming and distributed lookup mechanism being promoted for use on the Semantic Web, with emphasis on life sciences applications. An LSID names and refers to one unchanging data object, and allows versioning to handle updates. The LSID lookup system is in essence analogous to what the Domain Name System (DNS) does in converting named internet locations to IP addresses. We could possibly use LSID for naming objects in LinkHub and incorporate LSID lookup functionality. Finally, as software such as Napster and Gnutella did for online file sharing, we plan to explore enhancing LinkHub to enable multiple distributed LinkHub instantiations to interact in peer-to-peer networks for dynamic biological data sharing, possibly using web services technologies such as Web Services Description Language (WSDL) and Universal Description, Discovery and Integration (UDDI) for dynamic service discovery, and available peer-to-peer toolkits.
Our paper demonstrates the natural use of Semantic Web RDF to inter-connect identifiers of data entries residing in separate web-accessible biological databases. Based on such a semantic RDF graph of biological identifiers and their relationships, useful, non-trivial cross-database queries, inferences, and semantic data navigation can be performed through web interactive and query access. In addition, these semantic relationships enable flexible and novel information retrieval access, based on queries of the LinkHub graph's relational structure, to web documents attached to identifier nodes. LinkHub can also simplify and manage connections to major hubs such as UniProt for a lab or organization. LinkHub can be evaluated by considering its current active and practical use in a number of settings. We have already established the "hub of hubs" relationship between UniProt and LinkHub (i.e. UniProt cross-references to our LinkHub). In addition, LinkHub cross-references the targets of the structural genomics initiative to UniProt and serves as a "related links" and "family viewer" gateway for the Northeast Structural Genomics Consortium with which we are affiliated; LinkHub also serves as the "family viewer" for MolMovDB. LinkHub is a step towards answering the question "a life science Semantic Web: are we there yet?".
A key problem in populating the LinkHub database (described below) is how to determine the relationships among biological identifiers, a specific case of the so-called ontology alignment problem [42, 43]. Biology is blessed with a fundamental, commonly accepted principle around which data can be organized, namely biological sequences such as DNA, RNA, and protein, and various string matching techniques for biological sequences (such as dynamic programming and BLAST) can solve a large part of the ontology alignment problem in biology. LinkHub thus takes advantage of biological sequence matching, in particular conservative, exact sequence matching, to cross-reference or align biological identifiers. LinkHub also takes advantage of available sources of pre-computed identifier mappings, the most important being UniProt, which is arguably the most important major proteomics resource and serves as LinkHub's backbone content (i.e. most relationships between identifiers in LinkHub are indirect through UniProt). The general strategy for mapping identifiers in LinkHub is to first take advantage of known and trusted pre-computed identifier mappings; if such pre-computed mappings are unavailable, an attempt is made to map identifiers based on exact matches of their underlying sequences to UniProt and to other sources of sequence data whose identifiers are stored in LinkHub.
Efficient, exact sequence matching programs were developed and used to do quick inter-database cross-referencing or alignment based on exact sequence matches (e.g. to cross-reference TargetDB to UniProt, see below). A custom Perl module was developed and used to index UniProt (and, in general, sequence databases in FASTA format) to support this fast exact sequence matching. Specialized Perl web crawlers and other scripts were written to fetch and extract data from different sources in different formats; identifiers, identifier relationships, and other related information were extracted from the sources and inserted into the LinkHub MySQL database (which is also converted to RDF and inserted into the RDF version of LinkHub; see below). A running instantiation of the LinkHub system is at http://hub.gersteinlab.org and http://hub.nesg.org, and it is actively used and populated with data from the Gerstein Lab and related to the lab's research interests. Thus, while the ideas of LinkHub are applicable more generally to biological data, the concrete instantiation of LinkHub focuses heavily on proteomics data, as that is a key research initiative of the Gerstein Lab. The "hub of hubs" relationship described above has already been established between UniProt and LinkHub (i.e. UniProt hyperlinks to the LinkHub instantiation and cross-references to it in its DR lines). In addition, LinkHub cross-references the proteins which are targets of the structural genomics initiative (obtained from the TargetDB resource) to UniProt, and the LinkHub instantiation serves as a "related links" and "family viewer" (more below) gateway for the Northeast Structural Genomics Consortium (NESG) with which the Gerstein Lab is affiliated. Additional focuses of the LinkHub instantiation are yeast resources, macromolecular motions, and pseudogenes.
LinkHub is conceptually based on the Semantic Web (graph) model and we thus represent and store it in RDF. RDF is a popular data model (or ontological language) for the Semantic Web that represents data as a directed labelled graph. Essentially, in RDF, URIs are used for globally unique naming of the nodes (which represent objects) and the edges (which represent relationships between nodes) of the graph, and literal values may also be used in place of pointed-to nodes. In addition, RDF comes with query languages (e.g. RDQL) that allow the user to pose semantic queries against graph data. While there are more advanced ontological languages such as the Web Ontology Language (OWL) that support data reasoning based on Description Logics (DL), RDF is easy to learn and use and much can be effectively modelled with it. For example, the benefits of representing proteomics data in RDF have been discussed, and UniProt data has also recently been made available in RDF format. However, RDF database technology is new and may have performance and scalability problems, which can be an important impediment to more active and widespread use of the Semantic Web. In this regard, the creation of high-performance RDF databases should be a research priority of the Semantic Web community. Thus, while we would ideally use only RDF, to support LinkHub's practical daily use through its web interactive interfaces we also model and store its data using relational database technology (MySQL) for efficiency and robustness. A drawback is that relational databases do not naturally model graph structures or provide efficient graph operations, so special procedural code is necessary (e.g. for the "path type" view described below). Mapping between the relational and RDF versions of LinkHub is straightforward, and we have written Java code to do this.
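To make the LSID naming idea concrete, the sketch below parses an identifier of the general form `urn:lsid:authority:namespace:object[:revision]`, where the optional revision carries the version. The `LSID` tuple, the `parse_lsid` function and the example identifier are illustrative assumptions, not part of LinkHub or of any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID string into its components, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID URN: %r" % text)
    if any(not part for part in parts[2:]):
        raise ValueError("empty LSID component: %r" % text)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier under a placeholder authority.
example = parse_lsid("urn:lsid:example.org:pseudogene:42:1")
```

A resolver would then use the authority component to locate the service that can return the data or metadata for the named object, much as DNS is used to locate a host.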
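The mapping strategy described above (trusted pre-computed mappings first, exact sequence matching otherwise) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Perl code used for LinkHub: the function names are hypothetical, FASTA records are identified by the first token of the header line, and the sequences are toy values.

```python
from collections import defaultdict


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].split()
            identifier, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if identifier is not None:
        yield identifier, "".join(chunks)


def build_sequence_index(fasta_text):
    """Index reference identifiers by their exact (upper-cased) sequence."""
    index = defaultdict(set)
    for identifier, sequence in parse_fasta(fasta_text):
        index[sequence].add(identifier)
    return index


def map_identifiers(query_fasta, reference_index, precomputed=None):
    """Prefer pre-computed mappings; otherwise fall back to exact sequence matches."""
    mappings = {}
    for identifier, sequence in parse_fasta(query_fasta):
        if precomputed and identifier in precomputed:
            mappings[identifier] = set(precomputed[identifier])
        else:
            mappings[identifier] = set(reference_index.get(sequence, ()))
    return mappings


REFERENCE_FASTA = ">P1 toy protein\nMKT\nAYI\n>P2\nMKTAYIA\n"
QUERY_FASTA = ">T1\nmktayi\n>T2\nMKTAYIA\n>T3\nMKT\n"

index = build_sequence_index(REFERENCE_FASTA)
mapping = map_identifiers(QUERY_FASTA, index)
```

Because matching is exact, `T3` (a prefix of `P1`) is deliberately left unmapped, which reflects the conservative choice described in the text; a hash of the sequence could replace the full string as the index key for large databases.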
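As a rough picture of how relational rows can be mapped to RDF, the sketch below turns rows of an identifier-relationship table into N-Triples statements, with one URI per node and one per relationship type. The row layout, the `http://example.org/linkhub/` namespace and the function names are assumptions for illustration; they do not reflect LinkHub's actual schema or its Java converter, and literal values are omitted for brevity.

```python
from urllib.parse import quote

BASE = "http://example.org/linkhub/"  # hypothetical namespace


def node_uri(id_type, identifier):
    """Build a URI naming an identifier node of a given type."""
    return "{}id/{}/{}".format(BASE, quote(id_type, safe=""), quote(identifier, safe=""))


def relation_uri(name):
    """Build a URI naming a relationship type (an edge label)."""
    return "{}relation/{}".format(BASE, quote(name, safe=""))


def rows_to_ntriples(rows):
    """Convert (src_type, src_id, relation, dst_type, dst_id) rows to N-Triples lines."""
    lines = []
    for src_type, src_id, relation, dst_type, dst_id in rows:
        lines.append("<{}> <{}> <{}> .".format(
            node_uri(src_type, src_id),
            relation_uri(relation),
            node_uri(dst_type, dst_id),
        ))
    return lines


# Toy rows standing in for the result of a relational query.
rows = [
    ("UniProt", "P12345", "family_member", "Pfam", "PF00001"),
    ("UniProt", "P12345", "annotated_with", "GO", "GO:0008150"),
]
triples = rows_to_ntriples(rows)
```

Each row becomes one subject-predicate-object statement, so the reverse mapping only needs to split the URIs back into type and identifier.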
Given protein in database → equivalent UniProt protein → Pfam family → UniProt proteins → other equivalent proteins in database.
AS and MG's funding for this work is from NIH/NIGMS grant P50 GM62413-01. KC's funding for this work is from NIH grant K25 HG02378 and NSF grant DBI-0135442.
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 3, 2007: Semantic e-Science in Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.