The PathSys system
Architecture
The system architecture of PathSys is shown in Figure 1. The system is designed around a warehouse that holds the data according to an internal schema (discussed in the next subsection), a number of specialized index structures that facilitate graph operations, and a Data Manager that keeps the data and external indices synchronized.
We consider two kinds of users. The first is a typical information systems person who uses the Integration client to create a new integrated schema, to add a new data source to an existing integrated schema, or to define new queries that support a specific kind of analysis. The process of adding a new data source is as follows. The user first verifies that the data schema is specified in a language accepted by PathSys (e.g., a relational schema or an XML schema). Next, the schema is sent to PathSys, which validates it and stores it in the Schema Library. The user then specifies the mapping between the schema elements and the internal data model of PathSys described in Figure 2. Finally, the mapping is stored in the Schema Map Library and the data are ingested into the PathSys warehouse through the Data Importer, much like a bulk loading operation in a standard DBMS.
The second kind of user is a biologist who enters PathSys through the visualization client, BiologicalNetworks [47]. In one sense, the visualization and graph manipulation capabilities of BiologicalNetworks are comparable to those of existing visual information integration systems such as Cytoscape [25] and VisANT [26, 27], as well as commercially available tools such as GeneGO [28] and PathwayAssist [29]. A user's query to the system is first analyzed by the User Query Manager and then decomposed into a combination of acyclic graph and regular graph queries, which are handled by their respective query engines (Figure 1). The system uses two graph query engines to execute specialized algorithms [30] customized for each kind of graph. Both engines reach the stored data and indices through an API, exposed by the Warehouse Manager, that provides logical access to them.
In contrast to visual integration systems such as Cytoscape and VisANT, PathSys has a more comprehensive data model, in which the semantic concepts of biological objects, molecular states, and interaction types are mapped more closely to the data elements, as shown in Figure 2. Those systems have a client-side graph manipulation engine with some basic operations, and most data manipulation operations are performed through plug-in function modules. They do not, however, have a server-side graph and relational query engine that can evaluate and optimize arbitrary combinations of operations in a scalable fashion. While the BiologicalNetworks interface does allow a subset of these operations, the full power of the PathSys engine is accessible through the query language described in a later section.
The PathSys data model
A number of systems, such as Cytoscape, model MIGs as a ternary relation (node1, edge-label, node2), where the edge-label specifies the nature of the interaction. We find that model to be inadequate for the following reasons:
(1) Nodes should not only represent proteins or genes, but should also designate their state while participating in an interaction.
(2) For complex molecules, one needs to distinguish between the interactions of the complex and those of the component molecules.
(3) A mechanism should be available to add as many interaction properties as needed and to capture more abstract types than is possible with a simple labeled edge.
(4) One needs to represent the fact that one interaction can be regulated by the occurrence of other interactions, thus necessitating a (hyper-)edge that connects two (or more) other edges.
In PathSys we distinguish three types of nodes: primary node, connector node and graph node.
Primary node
All macromolecules (e.g., DNAs, RNAs and proteins), small molecules (e.g., ions, ATP, lipids) and physical events (heat, radiation, mechanical stress) fall under the 'primary node' definition.
Connector node
A connector node is designed to depict the properties of a relationship between a set of source nodes and a set of target nodes. All types of interactions (binding, chemical reaction, expression, etc.) are represented by connector nodes. Note that a connector node is not a simple edge label but a placeholder for "interaction type" and "interaction properties", as shown in Figure 3. Interactions, as defined here, are m:n relations; hence we can represent interactions such as chemical reactions with m reactants and n products. The reason for implementing edges as connector nodes with their own properties is that an integration system should be extensible enough to hold different information coming from multiple sources. If two sources describe a protein-DNA interaction between a protein node P and a "chromosome-fragment" node D, it is quite possible that they will specify two different properties of this interaction. For example, one source could state that the interaction is a "transcription factor binding" while another might state that the interaction is conserved in other species. Modeling the connectors as special nodes allows us to scale up seamlessly by adding as many node properties as needed as information on that edge grows. This could not be accomplished if interactions were modeled just as labeled edges. We illustrate the role of a connector node in terms of the expressive power of the system. Consider the edge as a triple (n1 'activates' n2), where n1, n2 are node constants and 'activates' is an edge name (i.e., an edge label). Our query system allows us to associate a variable x with the edge, thus representing it as x: (n1 'activates' n2). Now the triple (n3 'inhibits' x) is equivalent to the statement "n3 inhibits the activation of n2 by n1". Graphically, this would be represented as an "edge" from the node n3 to the connector node between n1 and n2.
Now we can construct queries like "Find all proteins which have properties P1 and P2 and regulate the activation of n2". The answer will find n3 (if n3 has P1 and P2). Similarly, we can represent "competing" interactions as x: (n1 'activates' n2), y: (n3 'activates' n4), (x 'competes_with' y), where the last clause is an "edge" between a pair of connector nodes.
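The connector-node idea can be sketched in a few lines of Python. This is only an illustration of the model described above, not the PathSys implementation: the class names `Node` and `Connector`, the helper `connect`, the property keys, and the function `regulators_of_activation` are all assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # eq=False: nodes are compared by identity
class Node:
    name: str
    props: dict = field(default_factory=dict)

@dataclass(eq=False)
class Connector(Node):
    # An interaction is itself a node with a type, open-ended properties,
    # and m source nodes and n target nodes (an m:n relation).
    interaction_type: str = ""
    sources: list = field(default_factory=list)
    targets: list = field(default_factory=list)

def connect(interaction_type, sources, targets, **props):
    """Hypothetical helper that builds a connector node."""
    return Connector(name=interaction_type, props=dict(props),
                     interaction_type=interaction_type,
                     sources=list(sources), targets=list(targets))

n1, n2 = Node("n1"), Node("n2")
n3 = Node("n3", {"P1": True, "P2": True})

x = connect("activates", [n1], [n2])      # x: (n1 'activates' n2)
y = connect("inhibits", [n3], [x])        # (n3 'inhibits' x): target is a connector

# Two sources can attach different properties to the same interaction.
x.props["binding_kind"] = "transcription factor binding"
x.props["conserved_in_other_species"] = True

def regulators_of_activation(connectors, target, required_props):
    """Nodes with all required properties that act on an activation of `target`."""
    found = []
    for c in connectors:
        for t in c.targets:
            if (isinstance(t, Connector) and t.interaction_type == "activates"
                    and target in t.targets):
                found += [s for s in c.sources
                          if all(s.props.get(p) for p in required_props)]
    return found

result = regulators_of_activation([x, y], n2, ["P1", "P2"])
```

Because `y` targets the connector `x` rather than a primary node, the query "find nodes with P1 and P2 that regulate the activation of n2" reduces to a scan over connectors whose targets are themselves connectors.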
Graph node (Hypernode)
In biological systems, molecules often form clusters and groups to perform tasks, behaving like a single unit. In our system all complex objects (protein complexes, cellular processes) that might contain graphs are defined by graph nodes (hypernodes; cf. VisANT [26, 27]). Binding relations within the hypernode are represented as well. A molecular complex like the proteasome is treated as a hypernode of the type molecular complex. The hypernode gets its own node identifier that is distinct from those of all its member nodes (the proteins that form subunits of the proteasome). A hypernode may have interactions with single nodes or other hypernodes in the graph. Moreover, members of the hypernode can independently participate in different processes. A hypernode may contain members from different cellular compartments. These features are incorporated in the notion of Graph Node. For a visual representation of metanodes, see the Additional file or the supplemental materials at [46], Section: Data Visualization.
Hypernodes play a crucial role in processing graph queries such as path and neighborhood finding; the algorithmic details of their use in query evaluation are provided in the supplementary materials.
The internal data model of the graph (Figure 2) consists of a node type hierarchy N ('child of' relation in the NodeType view), an attribute category hierarchy A ('child of' relation in the AttributeTypeCategory view), bags of nodes N and edges E and a data source D.
For some node types, e.g. gene, one can specify rules to automatically create derived node types such as mRNA(gene) and protein(gene). The node type hierarchy N can be a directed acyclic graph because it admits multiple inheritance; for example, a nuclear transcription factor is both a nuclear-localized protein and a transcription factor protein.
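A type hierarchy with multiple inheritance can be stored as a 'child of' map and tested by walking upward through the DAG. The sketch below is an assumption about how such a test could work, not the PathSys code; the dictionary `CHILD_OF`, the function `is_under`, and the choice to treat a type as being "under" itself are all introduced here for illustration.

```python
# child type -> parent types ('child of' relation); a DAG, not a tree
CHILD_OF = {
    "protein": [],
    "gene": [],
    "nuclear-localized protein": ["protein"],
    "transcription factor protein": ["protein"],
    "nuclear transcription factor": [
        "nuclear-localized protein",
        "transcription factor protein",
    ],
}

def is_under(node_type, ancestor, child_of=CHILD_OF):
    """True if node_type is ancestor or reaches it through 'child of' links."""
    stack, seen = [node_type], set()
    while stack:
        t = stack.pop()
        if t == ancestor:
            return True
        if t not in seen:            # guard against revisiting shared parents
            seen.add(t)
            stack.extend(child_of.get(t, []))
    return False
```

With this map, `is_under("nuclear transcription factor", "protein")` succeeds through either parent, which is the behavior multiple inheritance requires.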
We distinguish the type of an attribute, which reflects its storage data type (for example, the tuple {int, int}), from its semantic category (for example, a "chromosomal interval"). In our model, attributes are attached to node instances rather than node types. Thus, if one source provides one set of attributes for a node and a second source provides a different set of attributes for the same node, we can combine both sets of attributes. This enables us, for example, to unite putative transcription factor binding sites from the Yeast Promoter database at Cold Spring Harbor Labs with intergenic binding probability information from the MIT data [24] on compatible chromosomal intervals.
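Attaching attributes to node instances means that records from different sources simply accumulate on the same node. A minimal sketch follows, assuming a flat record layout of (node id, source, semantic category, storage type, value); the function `merge_attributes`, the node id, the source names and all values are placeholders, not data from the paper.

```python
from collections import defaultdict

def merge_attributes(records):
    """Group attribute records by node instance, keeping their provenance."""
    merged = defaultdict(list)
    for node_id, source, category, storage_type, value in records:
        merged[node_id].append(
            {"source": source, "category": category,
             "type": storage_type, "value": value}
        )
    return dict(merged)

# Placeholder records: two sources describe the same node instance.
records = [
    ("region_1", "source_A", "chromosomal interval", ("int", "int"), (1000, 1500)),
    ("region_1", "source_B", "binding probability", "float", 0.5),
]
merged = merge_attributes(records)
```

Keeping the storage type and the semantic category as separate fields mirrors the distinction drawn above between how a value is stored and what it means.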
To illustrate our graph model, consider the highly simplified fact that activation of Ste11 to the phosphorylated state Ste11(p) increases the rate of phosphorylation of another protein, Ste7, which is thereby activated (Figure 3). Simultaneously, the molecular complex of the Ste4 and Ste18 proteins also increases Ste7 phosphorylation. Activated Ste7 ultimately inhibits the cell cycle process by producing a G1 mitotic checkpoint arrest [31]. The nodes in this case are Ste11, Ste7, Ste4, Ste18, Ste11(p) (phosphorylated) and Ste7(p), of protein type and kinase subtype; two Graph Nodes, a protein complex and the cell cycle pathway; and three Connector Nodes, two of type phosphorylation and one of type Cellular Process. An edge incident to a connector node denotes that the source nodes participate in the process depicted by the connector node. An edge from a connector node denotes that the process represented by the connector node impacts the target nodes of the edge. The choice of using the connector node implies that the so-called edge label is now a property of the connector node. Syntactic sugar in the query language lets a query be specified in terms of the edge label, and the system translates it into a query on the connector nodes. A few special edge types are defined that can connect two primary nodes without having to go through a connector node. We describe two such special edge types here. The first is a subgraph edge (edge.relationship = 'subgraph'), which goes from one graph-type node to another graph-type node, where the latter is a subgraph of the former; this can be used, for example, to create named subgraphs. A subgraph may be named (i.e., assigned a separate id) for semantic reasons; for instance, it may represent a functional subgroup of interacting proteins within a larger interaction graph. Alternatively, a subgraph may be named because it has a special property. For example, the system indexes all cliques with more than 3 members.
These cliques are denoted as special graph nodes that are used during query processing. A second special edge is a member-of edge between a node n and a graph-typed node g that designates that n belongs to the graph represented by g.
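The two special edge types can be illustrated with plain triples. The sketch below is a simplified reading of the text, not the PathSys storage layout: the triple format, the function `members`, and the graph node name "pheromone pathway" are assumptions, as is the rule that members of a named subgraph also count as members of the enclosing graph.

```python
# (source, relationship, target) triples for the two special edge types.
# A 'subgraph' edge goes from the larger graph node to its subgraph.
special_edges = [
    ("Ste4", "member of", "Ste4-Ste18 complex"),
    ("Ste18", "member of", "Ste4-Ste18 complex"),
    ("Ste7", "member of", "pheromone pathway"),               # hypothetical graph node
    ("pheromone pathway", "subgraph", "Ste4-Ste18 complex"),  # hypothetical containment
]

def members(graph_node, edges):
    """All nodes of graph_node, including members of its named subgraphs."""
    result, stack, seen = set(), [graph_node], set()
    while stack:
        g = stack.pop()
        if g in seen:
            continue
        seen.add(g)
        for src, rel, dst in edges:
            if rel == "member of" and dst == g:
                result.add(src)           # direct member of g
            elif rel == "subgraph" and src == g:
                stack.append(dst)         # descend into the named subgraph
    return result

complex_members = members("Ste4-Ste18 complex", special_edges)
pathway_members = members("pheromone pathway", special_edges)
```

Since a hypernode has its own identifier, the complex can appear as the endpoint of other edges while Ste4 and Ste18 remain individually addressable.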
Graph attributes
A significant class of systems biology queries addresses graph-theoretic properties of source graphs as well as the integrated graph. PathSys maintains a set of graph attributes for each source graph to answer these aggregate queries. At present they include in- and out-degrees, betweenness centrality and clustering coefficient. The betweenness centrality of node k is defined as $b_k = \sum_{i,j} g_{ijk} / g_{ij}$, where $g_{ij}$ is the number of shortest paths from node i to node j, and $g_{ijk}$ is the number of shortest paths from i to j that pass through k. For node k, the clustering coefficient is the ratio of the number of edges between k's neighbors to the maximum number of possible edges between those neighbors. These parameters, together with other measures, such as the graph diameters, are maintained and indexed using conventional index structures. For regions of the graph where neighboring nodes have high clustering coefficient, a "clustering coefficient" attribute is maintained by creating a system-defined graph node that represents the highly connected neighbors. Inclusion of any number of such attributes is possible.
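Both attributes can be computed directly from their definitions on a small graph. The sketch below is a brute-force illustration, not the indexed computation PathSys performs. It assumes an undirected graph stored as a dictionary of neighbor sets, sums over ordered pairs (i, j) with i, j and k all distinct, and uses the standard clustering coefficient (existing links among k's neighbors divided by the possible links among them). The function names are introduced here.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths to each reachable node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:    # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, k):
    """b_k = sum over ordered pairs (i, j) of g_ijk / g_ij."""
    info = {s: bfs_counts(adj, s) for s in adj}
    dist_k, sigma_k = info[k]
    total = 0.0
    for i in adj:
        dist_i, sigma_i = info[i]
        for j in adj:
            if len({i, j, k}) < 3:
                continue
            if j not in dist_i or k not in dist_i or j not in dist_k:
                continue                   # unreachable pair contributes nothing
            if dist_i[k] + dist_k[j] == dist_i[j]:
                # g_ijk = (paths i->k) * (paths k->j)
                total += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return total

def clustering_coefficient(adj, k):
    nbrs = adj[k]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

# Triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

In this graph every shortest path to or from d passes through c, so c is the only node with nonzero betweenness, while a and b have the highest clustering coefficient because their two neighbors are linked.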
Integrating graph sources
The task of integrating a new data source into an existing integrated graph schema consists of three steps: defining a new, unpopulated data source in the integrator; mapping the just-imported schema to nodes, node attributes, and edges of the integrated graph; and expressing conflict resolution policies.
Source definition
An external data source can be a relational database schema, a tree-structured XML document, a set of RDF-style triples that describes the edge set of a graph, or a DAG-structured OWL [32] document. Typically, a new ontology or a node/attribute type hierarchy, such as the phenotype classification tree from MIPS, is presented to the system as tree-structured data (here, as an OWL description), and a collection of node/edge instances and node properties is presented as relational data. To import this data into PathSys, we first define a new data source:
CREATE DATA SOURCE yeast phenotype (
fullname 'Yeast Phenotype Classification',
reference 'localhost://phenotype.owl',
description...)
format XML-RDF-OWL;
where the newly imported data source is nicknamed yeast phenotype. XML-RDF-OWL is a format known to the system. For a relational data source, we would declare the format as SQL. With the data source defined, we now specify a PathSys schema element for the new source:
CREATE TREE phenotype tree (
version STRING VALUE '2.3',. . .)
SOURCE yeast phenotype;
Schema mapping
The task of schema mapping is to specify how an element of the imported source should be interpreted as an element of the internal schema of PathSys. In PathSys a tree is a special case of a graph that is used internally for query evaluation. For a tree-structured source, the OWL schema populates the node type hierarchy in Figure 2. The mapping declarations are:
IMPORT NODE TYPE FROM yeast phenotype (
Class as name,
)GRAPH phenotype tree
IMPORT RELATIONSHIP FROM yeast phenotype(
subClassOf as child of
)GRAPH phenotype tree
In a relational mapping, the source integration step imports a relational schema (a fragment of the MIPS database) into the graph elements of the internal model (see the supplemental material for details). For each schema mapping, the wrapper generator automatically creates the code to populate the PathSys schema from the new data source.
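To make the tree mapping concrete, the sketch below shows what generated population code could do for the two IMPORT declarations: each source Class becomes a node type name, and each subClassOf link becomes a 'child of' relationship. This is an assumption about the effect of the mapping, not the code the wrapper generator emits; the function `import_tree`, the input layout, and the class names are placeholders.

```python
# Source element -> internal element, as declared by the two IMPORT statements.
NODE_TYPE_MAP = {"Class": "name"}
RELATIONSHIP_MAP = {"subClassOf": "child of"}

def import_tree(owl_classes):
    """Turn parsed class entries into node type records and 'child of' edges."""
    node_types, relationships = [], []
    for entry in owl_classes:
        node_types.append({NODE_TYPE_MAP["Class"]: entry["Class"]})
        parent = entry.get("subClassOf")
        if parent is not None:             # the root class has no parent
            relationships.append(
                (entry["Class"], RELATIONSHIP_MAP["subClassOf"], parent)
            )
    return node_types, relationships

# Placeholder class entries standing in for a parsed OWL document.
owl_classes = [
    {"Class": "phenotype"},
    {"Class": "phenotype A", "subClassOf": "phenotype"},
    {"Class": "phenotype B", "subClassOf": "phenotype"},
]
node_types, relationships = import_tree(owl_classes)
```

Keeping the mapping in data rather than in code is what would let a generator produce such a loader for each new source.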
Once the new graph is integrated, the system computes all graph indices for the new incoming graph and updates the indices for the whole integrated graph. Detailed information on how the data are physically represented and on the Data Definition Language is provided in the Additional file or the supplemental materials at [46], Section: Architecture.
Conflict resolution
Crucial to the information integration process is the resolution of data conflicts. Reconciliation problems are detected by a set of conflict detection rules and, where they cannot be reconciled automatically, are resolved by expert user intervention. Here are some example rules:
(1) Two genes with different names have the same chromosomal location. For this, we have an automated reconciliation procedure that assigns the multiple names as synonyms to the same ORF.
(2) Two genes with the same name have different chromosomal locations. Problems like this arise from differently assigned gene boundaries, alternative splicing, etc., and are resolved by scientists.
(3) Several genes have names such that one name is contained in the other, e.g., 'IME1', 'IME1-TAP(342–531)' and 'IME1(modified:Thr:210)'. The first record refers to the gene IME1, the second to a fragment of gene IME1 that is modified by fusion to a domain called TAP, and the third to the protein encoded by IME1 (IME1p) with the qualifier that the amino acid 'Thr' at position 210 was modified. Thus, the records seemingly referring to an item called 'IME1' really refer to objects that are not equal, and the conflict must be resolved by an expert.
(4) Two genes with different names and chromosomal locations have over 95% similar graph neighborhoods. Products of such genes are likely to be part of the same protein complex and/or to interact physically. Cases like this can be starting points for biological discovery, identifying functionally related candidate genes.
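The four example rules can be written as a pairwise check over gene records. This is a sketch under stated assumptions, not the PathSys rule engine: the record layout (name, location), the finding labels, the function names, and the use of Jaccard similarity for "95% similar graph neighborhoods" are all choices made here, and every gene name other than the IME1 examples, together with all locations, is a placeholder.

```python
from itertools import combinations

def jaccard(a, b):
    """Neighborhood similarity; the measure itself is an assumption."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def detect_conflicts(records, neighbors, threshold=0.95):
    """records: list of (name, location); neighbors: name -> set of neighbor ids."""
    findings = []
    for (n1, loc1), (n2, loc2) in combinations(records, 2):
        if n1 != n2 and loc1 == loc2:
            findings.append(("synonym", n1, n2))               # rule 1, automated
        elif n1 == n2 and loc1 != loc2:
            findings.append(("location_conflict", n1, n2))     # rule 2, expert
        elif n1 in n2 or n2 in n1:
            findings.append(("name_containment", n1, n2))      # rule 3, expert
        elif jaccard(neighbors.get(n1, set()), neighbors.get(n2, set())) > threshold:
            findings.append(("similar_neighborhood", n1, n2))  # rule 4, candidate
    return findings

fragment = "IME1-TAP(342–531)"
records = [
    ("geneA", ("chrI", 100, 200)),
    ("geneB", ("chrI", 100, 200)),
    ("geneC", ("chrII", 5, 50)),
    ("geneC", ("chrII", 7, 50)),
    ("IME1", ("chrX", 1, 2)),
    (fragment, ("chrX", 3, 4)),
    ("geneD", ("chrIV", 10, 20)),
    ("geneE", ("chrV", 30, 40)),
]
neighbors = {"geneD": {"p1", "p2", "p3"}, "geneE": {"p1", "p2", "p3"}}
findings = detect_conflicts(records, neighbors)
```

Only the first kind of finding would be applied automatically; the others would be queued for a curator, and the last is better read as a discovery lead than as an error.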
Querying graphs in PathSys
BioNetSQL, our query language for interaction networks, has the flavor of SQL: it queries sets and bags of nodes, edges and their attributes, but additionally allows the returned values to be bags of paths, trees and graphs. Further, the language allows path, tree and graph operations. While a complete description of the language and the query evaluation process is beyond the scope of this paper, we present a few features of the language through one example, in which we use graph operations in the body of the query and the return data type is a graph: "Find networks of co-localized proteins that are parts of a protein complex and are connected by either a 2-hybrid (y2h) edge or a coimmunoprecipitation (coIP) edge."
SELECT
graph(N2(n.name, n.source),
E2(e.label, e.source))
FROM
yeastGraphDB G1(N, E)
WHERE
n:N and c:N and e:E
and n.type << 'protein'
and c.type = 'protein complex'
and (e.label = 'y2h' or e.label = 'coIP')
and pathExpr(G1, c// [member of]n) = true
The query declares a variable c whose type is protein complex. The query returns a graph whose nodes n are tuples with the attributes name and source (i.e., data source), and whose edges e have a label and a source from which that edge is known. Recall that the system will convert the edge-label condition into a query on a connector node. The << operation specifies that the type of the node is "under" "protein" in the node type hierarchy N. The last line reads as "n has an edge whose label has the value member of, and this edge points to c", where c is declared above. Note that we did not mention the relationship between nodes n and edges e, namely, that an instance of the returned edge set e connects instances of the returned node set n. This constraint, expressed as n.edge = e, is implied by the construct of line 2, where n and e are constrained to be parts of the same graph. For more features of the language and examples, see the supplemental material.
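The meaning of the example query can be restated as an in-memory filter. The Python below is only a reading of the query's semantics, not how the PathSys engines evaluate it: the dictionary layout, the functions `is_protein` and `complex_member_network`, and the nodes P1, P2, P3, C1 with their sources are hypothetical, and the subtype test is reduced to a fixed set instead of a walk over the type hierarchy.

```python
def is_protein(node_type, protein_types=("protein", "kinase")):
    # Stand-in for `n.type << 'protein'`; a real check would walk the hierarchy.
    return node_type in protein_types

def complex_member_network(nodes, edges, member_of):
    """Proteins that belong to some protein complex, linked by y2h or coIP edges."""
    complexes = {nid for nid, n in nodes.items() if n["type"] == "protein complex"}
    selected = {
        nid for nid, n in nodes.items()
        if is_protein(n["type"]) and complexes & set(member_of.get(nid, ()))
    }
    out_nodes = sorted((nodes[nid]["name"], nodes[nid]["source"]) for nid in selected)
    # The implicit constraint n.edge = e: both endpoints must be returned nodes.
    out_edges = [
        (e["from"], e["to"], e["label"], e["source"])
        for e in edges
        if e["label"] in ("y2h", "coIP")
        and e["from"] in selected and e["to"] in selected
    ]
    return out_nodes, out_edges

nodes = {
    "P1": {"name": "P1", "type": "protein", "source": "srcA"},
    "P2": {"name": "P2", "type": "kinase", "source": "srcB"},
    "P3": {"name": "P3", "type": "protein", "source": "srcA"},
    "C1": {"name": "C1", "type": "protein complex", "source": "srcA"},
}
member_of = {"P1": ["C1"], "P2": ["C1"]}
edges = [
    {"from": "P1", "to": "P2", "label": "y2h", "source": "srcA"},
    {"from": "P2", "to": "P3", "label": "coIP", "source": "srcB"},
    {"from": "P1", "to": "P2", "label": "genetic", "source": "srcB"},
]
out_nodes, out_edges = complex_member_network(nodes, edges, member_of)
```

P3 is dropped because it belongs to no complex, which also removes the coIP edge that touches it; this is the effect of requiring nodes and edges to belong to the same returned graph.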