Overall architecture
SPIKE is composed of three main software components: 1) A database (DB) of biological signaling pathways with an interface that supports data uploading and querying by multiple end-users. Carefully curated information from the literature as well as data from large public sources constitute distinct tiers of the DB. SPIKE DB is implemented using MySQL as the DB management system (DBMS). 2) A Java-based visualization package that allows interactive graphic representations of regulatory interactions stored in the DB, dynamic layout and navigation through the networks, and superposition of high-throughput genomic and proteomic data on the displayed maps. 3) An algorithmic inference engine that analyzes the networks for novel functional interplays between network components, and enhances the analysis of omic datasets (Figure 1).
Modeling scheme
SPIKE uses a formal modeling scheme in which information on signaling pathways is summarized in a structured language amenable to computerized manipulation and analysis. The scheme is based upon three fundamental requirements: First, it should be of sufficient expressive power to capture information on most aspects of regulatory pathways. Second, it should focus on signaling rather than metabolic pathways, and on the regulatory interplays maintained between network components (the biochemical mechanisms by which the regulatory effects are elicited are of secondary importance in our model). Third, the scheme should be kept simple in order to ease data input by the SPIKE user community. Such input is crucial as our long-term goal is to have the manually curated tier of the DB populated primarily by the users' community.
With these considerations in mind, we adopted (using the terminology of Kitano et al. [11]) an 'entity-relationship' scheme over a 'state transition' one. The 'state transition' scheme regards different post-translational modified versions of a protein as separate entities (or as different 'states' of a protein), visually represents them as distinct nodes in the map, and seeks to trace the transition between these states. The 'entity-relationship', on the other hand, does not distinguish between different modified states of a protein and does not look for such state transitions; it focuses, rather, on the regulatory effects maintained between different proteins within a signaling network (Figure 2A–B). In a state transition scheme (adopted, for example, by Reactome [6], PATIKA [10], CellDesigner [11]), representation of signaling pathways could become extremely complex due to combinatorial growth of possible states (for example, a protein that can be modified on only 5 residues has 25 = 32 different states). An entity-relationship scheme (adopted, for example, by KEGG [5] and Kohn's maps [14, 15]) is much simpler, having all these states represented by only one protein entity, but at the price of not accounting for possible temporal order constraints on the state transitions (e.g., activation of a protein by its phosphorylation on residue B can occur only after it is first phosphorylated on residue A).
The entity-relationship scheme we designed for SPIKE is based on two basic types of objects: "biological entities" and "relations".
Biological entities
SPIKE's modeling scheme comprises four types of biological entities:
a. Gene/RNA/Protein
Genes (and their respective RNA and protein products) are the primary atomic elements of the scheme. To avoid ambiguities, only characterized human genes that are assigned a formal designation by the Human Genome Nomenclature Committee (HGNC) are included in SPIKE's gene space. These objects are uniquely identified by their Entrez Gene IDs [16]. For simplicity, we decided not to define distinct objects for genes, their transcribed RNAs and translated proteins, and their modifications. See below a discussion of scheme limitations. Non-protein coding genes (e.g., genes encoding for rRNAs, tRNAs and microRNAs) are also included in this entity type.
b. Family
A family is a group of isoform genes (encoded by distinct genomic loci) with high sequence homology and whose encoded proteins share most of their biological activities. Well known examples are the JNK family, which includes the JNK1, JNK2 and JNK3 proteins (whose official names are MAPK8, MAPK9 and MAPK10), and the p38 family, which is comprised of four p38 isoforms (officially designated as MAPK11–14).
c. Complex
A complex is a group of proteins (or protein families) that carry out a specific function only when associated with their complex mates. An example is the DNA-damage sensor MRN complex, which is composed of the proteins MRE11, RAD50 and NBS1 [17]. The complex entity supports a nested structure, namely, a complex can be built from sub-complexes.
d. Small molecule
This entity enables the modeling scheme to include descriptions of regulations involving small signaling molecules such as GTP, cAMP, Ca++, etc. Signaling molecules in SPIKE are identified using their ID in the ChEBI DB [18].
In contrast to genes/proteins and small molecules, no controlled nomenclature is available for designating families and complexes. For these entities we use the most common name in the literature.
Relations
The second type of object in our modeling scheme is relations. The scheme defines three types of relations:
a. Regulation
Every regulation in SPIKE is defined by a triplet of the following simple form: Source, Regulatory effect, and Target. For example, "ATM activates TP53", and "CDKN1A inhibits the CDK2-CCNE complex". A regulatory interaction can be defined between any two biological entities, or between a biological entity and another regulation. A regulation can have one of three regulatory effects: "promote/activate", "inhibit/repress", or "unknown". Each regulation (defined by its source, target and effect) is associated with several attributes: the biochemical mechanism by which it is driven (e.g., phosphorylation, transcriptional regulation, ubiquitination), one or more supporting references, quality level and status flag for quality control (see below), and the submitter/data source.
The fact that our scheme allows one regulation as the target of another enhances its expressivity. First, it enables us to describe regulations that affect only a subset of downstream regulations emanating from an entity, as illustrated in Figure 3A. Second, it improves the specificity of the description, as demonstrated by the comparison between Figure 3B and Figure 3C. For regulations that act on other regulations, an additional attribute is added: the physical target, which specifies the physical entity on which the biochemical interaction is exerted (see Figure 3C).
b. Interaction
Much data exist on pairs of proteins that are known to physically interact, but whose regulatory effect is not yet understood. Interaction relations represent such physical interactions, which, unlike regulations, are symmetrical rather than directional. Each interaction is defined by the two participating proteins, and has the following attributes: the experimental method used to identify it (e.g., Y2H, co-immunoprecipitation), supporting reference(s), quality and status flags, and the submitter/data source.
c. Containment
This type of relation is used to define the relationship between families/complexes and the members they contain.
Limitations
SPIKE's modeling scheme is intentionally simple, facilitating easy input to the DB by many end-users. Despite its simplicity, the scheme can describe most key aspects of signal transduction pathways. However, at this stage it deliberately does not account for metabolic reactions, splice variants, cell types and cellular compartments, or for any quantitative aspects of regulations. It also deliberately avoids distinction between a gene and its RNA and protein. SPIKE focuses on regulatory effects maintained between components within signaling networks. We found that for our goals it is not essential to explicitly model the layer in which the regulatory effect is exercised (i.e., transcriptional regulation, post-translational modification, etc.). However, information on the specific biochemical mechanism is stored in the DB, being one of the attributes attached to each regulation. While SPIKE's modeling scheme can accommodate data on any signaling pathway from any organism, at present the implemented DB supports data obtained only in human cells (as genes are associated with the human Entrez-Gene ID and are designated using their HGNC official symbol).
Despite these limitations, SPIKE's design is flexible enough to allow its adjustment to support many of the above features in the future, as biological knowledge expands and standard ontologies and nomenclatures become widely accepted.