iRefIndex: A consolidated protein interaction database with provenance
© Razick et al; licensee BioMed Central Ltd. 2008
Received: 06 May 2008
Accepted: 30 September 2008
Published: 30 September 2008
Interaction data for a given protein may be spread across multiple databases. We set out to create a unifying index that would facilitate searching for these data and that would group together redundant interaction data while recording the methods used to perform this grouping.
We present a method to generate a key for a protein interaction record and a key for each participant protein. These keys may be generated by anyone using only the primary sequence of the proteins, their taxonomy identifiers and the Secure Hash Algorithm. Two interaction records will have identical keys if they refer to the same set of identical protein sequences and taxonomy identifiers. We define records with identical keys as a redundant group. Our method required that we map protein database references found in interaction records to current protein sequence records. Operations performed during this mapping are described by a mapping score that may provide valuable feedback to source interaction databases on problematic references that are malformed, deprecated, ambiguous or unfound. Keys for protein participants allow for retrieval of interaction information independent of the protein references used in the original records.
We have applied our method to protein interaction records from BIND, BioGRID, DIP, HPRD, IntAct, MINT, MPact, MPPI and OPHID. The resulting interaction reference index is provided in PSI-MITAB 2.5 format at http://irefindex.uio.no. This index may form the basis of alternative redundant groupings based on gene identifiers or near sequence identity groupings.
Protein interaction data are an increasingly important bioinformatics dataset used in biomedical research. These data are generated by a multitude of methods including both high-throughput and more traditional low-throughput proteomics studies as well as in silico predictions based on known interactions.
The past several years have seen a proliferation of interaction databases as the field focuses on ways to collect these data into machine-readable formats where they may be more easily computed on and reliably exchanged between users. The International Molecular Exchange (IMEx) represents one effort to consolidate the work of interaction databases by facilitating exchange of information between primary databases according to an agreed standard exchange language called the Human Proteome Organization's Proteomics Standards Initiative Molecular Interaction format (HUPO PSI-MI). Archival members of IMEx agree to share and provide a full dataset of globally available IMEx molecular interaction records (since March 31st, 2006) in a manner similar to the International Nucleotide Sequence Database Collaboration (INSDC). IMEx also serves to coordinate curation tasks between the partner databases to avoid redundant efforts. IMEx is open to new members and has five active partners: DIP, IntAct, MINT, MPact and BioGRID.
The potential for new IMEx members is enormous: one recent compilation of protein-protein interaction resources listed over 90 databases . This large list represents a lively interest in interaction data; it also represents a problem for the user searching for information since there is no unifying index.
We set out to create such an index with two goals in mind. First, the index should be capable of grouping together equivalent protein interactions into a single group. The measure of equivalence should be based on exact sequence matches of the protein participants according to the source record without further interpretation based on, for example, encoding genes or near sequence identity. Second, operations performed to map protein and interaction records to these redundant groups should be preserved; this would allow for the mapping to be recreated for data integrity checking and would help identify and classify potential problem records.
Presently, IMEx archival databases provide a complete set of records generated by their members; however, these records are meant to be archival and do not resolve redundancies between records. This becomes problematic when trying to compile a non-redundant list of interactors for a given protein, especially when records may use different identifiers to describe the same protein. A number of recent studies have addressed this issue and reported on databases and/or software that aim to generate consolidated interaction data sets from primary interaction databases (MiMI, PIANA and cPath). MiMI groups together redundant protein interactors and redundant interactions using "keyless identity functions" and "Deep Merging"; however, these methods are never explicitly defined. PIANA software also allows for integration of protein interaction data from multiple sources and apparently resolves redundancies between different protein identifier types; again, the methods used to do this are never explicitly defined. cPath software also allows users to integrate protein interaction data from multiple sources. Redundant proteins are identified using lookup tables that may be defined by the user; however, identical interactions and complexes are not grouped together. The solution presented in this study is unique in that it uses a well-defined, reproducible method to assign distinct identifiers to each distinct protein and to each distinct interaction and/or complex in which the protein participates. Our method is also unique in that it was designed to trace this process and provide feedback to source databases on problematic assignments.
Our construction of a protein interaction index involved parsing PSI-MI files provided by nine interaction databases including BIND [22, 23], BioGRID, DIP, HPRD [26, 27], IntAct [28, 29], MINT, MPact, MPPI and OPHID. Mapping proteins (and interactions) to redundant groups employed SEGUID-based keys and required addressing a number of issues including malformed and deprecated identifiers, incorrectly assigned taxonomy identifiers and resolution of ambiguous mappings. Methods used to make these assignments were recorded, allowing for a detailed examination of potential problems with source records and our own system.
This paper describes the software system and the consolidated interaction dataset that it constructs. We demonstrate the utility of the data set and argue that the method is well suited to integrating high-throughput proteomics studies with existing interaction knowledge. We also suggest that the resulting index could provide useful feedback and search capabilities to source databases. Interaction data that may be publicly redistributed under the license agreement of the source database are indexed by the iRefIndex (interaction reference index) and made freely available in taxon-specific divisions via anonymous FTP in the PSI-MITAB 2.5 tab-delimited text format.
The non-redundant index of interactions was constructed in four steps summarized here and described further in the following sections. First, SHA-1 digest sequence identifiers (SEGUID's) for proteins were compiled from several sources and were cross-referenced with the source database and most recent accession and taxonomy identifiers for the protein sequence record. Second, interaction data were collected from several sources and compiled in a single relational database. Third, for each protein interactor in each interaction record, a redundant object group (ROG) was assigned using the SEGUID and taxonomy identifier for the protein. Fourth, each protein-protein binary interaction or complex was assigned to a redundant interaction group (RIG) on the basis of the ROG assignments made above. The number of distinct interaction records was examined for each source database and for each of several taxa.
Sequence Globally Unique Identifiers (SEGUID's) were employed to provide a unique key for each protein in the interaction dataset that was independent of the source database and accession used to describe the protein. This key may be derived by external groups using only the primary amino acid sequence and the algorithm described below. This key was used to map protein accessions used in interaction records to redundant groups and, in turn, group together redundant interaction records.
The algorithm for the creation of a SEGUID has been described previously. Briefly, an amino acid sequence in single-letter code is converted to upper case after all non-letter characters and trailing or leading spaces are removed. The Secure Hash Algorithm (SHA-1) is used to construct a 160-bit message digest of this amino acid sequence; we used the java.security.MessageDigest implementation of SHA-1. This digest was converted to its base64 representation using the Base64 Java class (Robert Harder). All trailing "=" characters used for padding were removed to yield the final 27-character SEGUID string. SEGUID's may also be derived from primary amino acid sequence using the web interface and services provided by the SEGUID database. In addition, pre-calculated SEGUID's and their mapping to various protein database accessions, aliases and FASTA files can be directly downloaded from the SEGUID FTP site. SEGUID's may be used to refer to a group of accessions that all refer to the same primary amino acid sequence (i.e. a redundant sequence group). Since two proteins in two different organisms may share the same sequence, we also employed a ROG (redundant object group) identifier to distinguish between identical protein sequences in different organisms. A ROG identifier consists of a SEGUID string concatenated with the NCBI taxonomy identifier. So, for instance, while the proteins pointed to by accessions RefSeq:NP_313053 and UniProt:Q3YUU1 belong to the same redundant sequence group (SEGUID = 2c4yjE+JqjvzYF1d0OmUh8pCpz8), these proteins belong to different ROG's (ROGID's 2c4yjE+JqjvzYF1d0OmUh8pCpz8386585 and 2c4yjE+JqjvzYF1d0OmUh8pCpz8300269 respectively).
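The construction of SEGUID and ROG identifiers described above can be sketched in a few lines. The version below is written in Python for illustration only (the authors worked in Java); the function names `seguid` and `rogid` are assumptions of this sketch, and the cleaning step simply discards every character that is not an ASCII letter before upper-casing.

```python
import base64
import hashlib
import re


def seguid(sequence: str) -> str:
    """Return the 27-character SEGUID for an amino acid sequence (sketch)."""
    # Drop everything that is not a letter (spaces, digits, line breaks),
    # then convert to upper case.
    cleaned = re.sub(r"[^A-Za-z]", "", sequence).upper()
    # SHA-1 yields a 160-bit (20-byte) digest.
    digest = hashlib.sha1(cleaned.encode("ascii")).digest()
    # 20 bytes encode to 28 base64 characters, the last being "=" padding,
    # which is stripped to leave 27 characters.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def rogid(sequence: str, taxid: int) -> str:
    """Return the ROG identifier: SEGUID followed by the NCBI taxonomy id."""
    return seguid(sequence) + str(taxid)
```

Two records that carry the same sequence with different formatting (case, embedded whitespace) receive the same SEGUID, while the same sequence under two taxonomy identifiers receives two different ROG identifiers.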
2. Initial compilation of molecular interaction data
Table 1. Molecular interaction data sources incorporated by iRefIndex.
For each protein interactor, four elements were retrieved from the interaction record where available:
1) a single primary reference describing the interactor in an external sequence database (PSI-MI 2.5 Path: entrySet/entry/interactorList/interactor/xref/primaryRef),
2) a list of secondary references that may also point to the interactor in an external sequence database (PSI-MI 2.5 Path: entrySet/entry/interactorList/interactor/xref/secondaryRef),
3) an NCBI taxonomy identifier describing the source organism of the interactor (PSI-MI 2.5 Path: entrySet/entry/interactorList/interactor/organism) and
4) the primary amino acid sequence of the protein interactor (PSI-MI 2.5 Path: entrySet/entry/interactorList/interactor/sequence).
Only the first element is mandatory under the PSI-MI schema. The other three elements were retrieved and recorded whenever present. All four elements were used during the next stage of processing in an attempt to map the interactor to a redundant object group (ROG).
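The four elements listed above can be pulled from a PSI-MI 2.5 XML file by following the given paths. The following Python sketch uses the standard library's ElementTree; the function name `extract_interactors` is hypothetical, and the attribute names `db`, `id` and `ncbiTaxId` are assumed from the PSI-MI 2.5 schema rather than taken from this paper. Elements that are absent are returned as `None` or an empty list.

```python
import xml.etree.ElementTree as ET


def extract_interactors(xml_text):
    """Yield one dict per interactor with the four elements listed above."""
    root = ET.fromstring(xml_text)
    # "{*}" matches any (or no) XML namespace; this needs Python 3.8 or later.
    for node in root.findall(".//{*}interactorList/{*}interactor"):
        primary = node.find("{*}xref/{*}primaryRef")
        secondary = node.findall("{*}xref/{*}secondaryRef")
        organism = node.find("{*}organism")
        sequence = node.find("{*}sequence")
        yield {
            # (database, accession) pair of the primary reference
            "primary": (primary.get("db"), primary.get("id"))
            if primary is not None else None,
            # zero or more (database, accession) pairs
            "secondary": [(ref.get("db"), ref.get("id")) for ref in secondary],
            # NCBI taxonomy identifier, kept as a string
            "taxid": organism.get("ncbiTaxId") if organism is not None else None,
            # amino acid sequence, if the record supplies one
            "sequence": sequence.text.strip()
            if sequence is not None and sequence.text else None,
        }
```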
3. Mapping protein interactors to redundant object groups (ROG's)
Table 2. Features of the ROG assignment score and their corresponding character representations.
Character: Description of feature (when the value is 1).
P: The interaction record's primary (P) reference for the protein was used to make the assignment.
D: The source database (D) listed in the interaction record is different than what is expected for the given accession for the protein. In specific cases, this difference is tolerated and the assignment is made.
T: The taxonomy (T) identifier for the protein (as supplied by the interaction record) differed from what was found in the protein sequence record. This discrepancy was tolerated and the assignment was made.
M: The protein reference listed by the interaction record was a typographical modification (M) of a known accession. In specific cases, this variation is tolerated and the assignment is made.
V: The protein reference listed by the interaction record contained version (V) information that was ignored. For example, RefSeq accession.version NP_012420.1 was listed but treated as RefSeq accession NP_012420.
Q: The protein reference used to make the assignment was of the type "see-also". See PSI-MI Path: entrySet/entry/interactorList/interactor/xref/primaryRef/refType = "see-also".
U: The protein reference listed in the interaction record and used to make the assignment was a secondary UniProt accession and was updated (U) to a primary UniProt accession in order to make the assignment.
E: The protein reference was a retired NCBI Identifier. NCBI's eUtils (E) were used to retrieve the current accession and/or sequence.
I: The protein reference used was an NCBI GenInfo Identifier (I).
G: The interaction record's reference for the protein was an EntrezGene (G) identifier. The corresponding products of the gene were used to make the assignment.
S: One of the interaction record's secondary (S) references for the protein was used to make the assignment.
+: More than one assignment is possible (+). This case may arise in one of three ways. 1) The reference supplied by the interaction record requires updating but more than one possibility exists. For example, Q7XJL8 was found to be a secondary accession in three separate UniProt records (Q3EBZ2, Q6DR20, and Q8GWA9). 2) The secondary references supplied by the interaction record point to more than one unique protein sequence. 3) An EntrezGene identifier is provided in the interaction record as a protein reference. This identifier points to more than one protein product. An attempt is made to resolve this ambiguity as indicated by ROG score features O, X or L (see below).
O: More than one assignment is possible (see + above). The assignment chosen has a SEGUID that is identical to the SEGUID of the original (O) sequence provided in the interaction record.
X: More than one assignment is possible (see + above). The assignment chosen has the same taxonomy (X) identifier as listed in the interaction record.
L: More than one assignment is possible (see + above). The assignment with the largest (L) SEGUID is arbitrarily chosen (see Methods).
N: The protein reference, taxonomy identifier and sequence for the protein as provided in the interaction record are used to make a new entry in the SEGUID table. The protein interactor is assigned the newly (N) generated ROG identifier.
3.1 Detailed description of the assignment process
Each logical block in the assignment process is represented in Figure 2(a–i) and is further described below.
a: using the primary reference
The assignment process begins with a consideration of the primary protein reference given by the interaction record. The reference consists of an external database identifier (e.g. UniProt) and an accession pointing to a record in that database (e.g. P31946). The taxonomy identifier for the protein is also considered. If the accession is found in the UniProt partition of the SEGUID table with the expected taxonomy identifier, then the protein interactor is assigned the corresponding SEGUID and ROG identifiers with an assignment score of "P". Seventy-nine percent of all assignments were made using the primary reference (see Table 2). The external database or taxon provided by the interaction record may be ambiguous (e.g. "protein accession" is listed as the database or "mammalia" is listed as the taxon). In these cases, database (D) and/or taxon (T) criteria may be relaxed in searching the SEGUID table; the assignment is made with the corresponding score (PD, PT or PTD). Unexpected database names are only tolerated when the provided protein accession is alphanumeric. Unexpected taxon identifiers are always tolerated. In all cases, these discrepancies are recorded along with the expected database name and/or taxon. In some cases, the accession found in the interaction record may require modification in order to find a matching entry in the SEGUID table (see Methods). For example, minor typographical changes are allowed (NP 012420 is allowed to match NP_012420) or the version number of an accession is ignored (NP_777219.1 is allowed as a match to NP_777219). These cases are indicated by score characters M and V respectively. Lastly, protein references are associated with a "type" that may be either "identity" or "see also". The latter indicates that the reference does not point to a record about the interactor's sequence but to a record where additional information can be retrieved about the protein. In a few rare cases, the primary reference was of the type "see also". Assignments were made with these references and marked by the score character Q (see Table 2).
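The two tolerated accession modifications named in block a (ignored version numbers and minor typographical changes) can be illustrated with a small lookup helper. This is a sketch under stated assumptions: the function name `normalise_accession` and the `known` set standing in for the SEGUID table are hypothetical, and only the two examples given in the text are handled; the full set of tolerated modifications is described in Methods.

```python
import re


def normalise_accession(accession, known):
    """Return (matched_accession, score_flags), or (None, "") if no match.

    `known` is a set of accessions standing in for the SEGUID table.
    """
    candidate = accession.strip()
    if candidate in known:
        return candidate, ""
    # V: ignore a trailing ".<version>", e.g. NP_777219.1 -> NP_777219.
    unversioned = re.sub(r"\.\d+$", "", candidate)
    if unversioned in known:
        return unversioned, "V"
    # M: tolerate whitespace typed in place of the underscore,
    # e.g. "NP 012420" -> "NP_012420".
    repaired = re.sub(r"\s+", "_", candidate)
    if repaired in known:
        return repaired, "M"
    return None, ""
```

Returning the flag alongside the match keeps a record of which relaxation was needed, in the same spirit as the score characters that are appended to the assignment score.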
b: updating accessions
Protein sequence records from UniProt may be altered over time. In some cases, the new record will receive a new primary accession and the old accession will be included in the new record as a secondary accession. It is also possible that a sequence record may be used to make more than one new sequence record. In these cases, the old accession will appear as a secondary accession in each of the new records. Our workflow checks for these cases. In the event that a UniProt accession for a protein (as provided in the interaction record) is not found in the SEGUID table, an attempt is made to update the accession. This is accomplished by searching for UniProt records that list the interaction-record-provided accession as a secondary accession. If only a single record is found, the UniProt primary accession in this record is taken as the updated accession. The corresponding assignment score will contain a U (see Table 2). On the other hand, it is possible that more than one record is found and that these records describe different protein sequences (ROG's). In this case, the mapping from the interaction-record-provided accession to an updated accession is said to be ambiguous. This ambiguity may be resolved in block g using the protein's sequence when provided by the interaction record (see below).
NCBI accessions do not require an analogous logical block. An NCBI accession follows a sequence record for its lifetime. Changes in the sequence are indicated by changes to its version number and its primary GenInfo Identifier (GI) [38, 39]. The assignment process considers NCBI accessions and will ignore version information where provided (the assignment score contains a V). In the case that GI's or Protein Data Bank identifiers are provided, Entrez Programming Utilities (eUtils) are used to retrieve the GenBank or RefSeq accession. Corresponding assignment scores will contain the letter I to indicate a GI was used and the letter E to indicate that eUtils were used. In summary, an attempt is always made by the assignment process to update the protein's reference to its most recent version.
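The update step for UniProt accessions amounts to a reverse lookup from secondary to primary accessions, followed by a check on how many distinct ROG's the candidates describe. A minimal sketch follows; `update_uniprot_accession`, `secondary_index` and `rog_of` are hypothetical names, and the two dictionaries stand in for tables that would be built from UniProt records and the SEGUID table.

```python
def update_uniprot_accession(accession, secondary_index, rog_of):
    """Return (candidate_rogs, score_flags) for an accession to be updated.

    secondary_index: secondary accession -> list of primary accessions
                     whose records list it as a secondary accession.
    rog_of:          primary accession -> ROG identifier.
    """
    primaries = secondary_index.get(accession, [])
    # Several records only count as ambiguous if they describe different ROG's.
    rogs = sorted({rog_of[p] for p in primaries if p in rog_of})
    if not rogs:
        return [], ""       # no update possible
    if len(rogs) == 1:
        return rogs, "U"    # unambiguous update
    return rogs, "U+"       # ambiguous; left for block g to resolve
```

With the examples cited in this paper, O95686 appears as a secondary accession in a single record (Q9UQK1) and would be flagged U, whereas Q7XJL8 appears in three records (Q3EBZ2, Q6DR20 and Q8GWA9) and would be flagged as ambiguous if those records describe different sequences.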
c: using gene identifiers
In the event that the primary accession is still not found in the SEGUID table and the accession provided is an Entrez GeneID, the protein accessions corresponding to the gene are retrieved. In the best case, only one protein product is present and its taxonomy identifier matches the one given in the interaction record; the assignment score will contain a G. This block also allows for a relaxed taxon match during search of the SEGUID table (score contains a T). In the event that more than one protein is encoded by the gene, the assignment may be made using the protein's sequence when provided by the interaction record (similar to g-block code).
d, e, f: using secondary references
In those cases where an assignment cannot be made using the primary reference provided by the interaction record, the secondary references are consulted. Corresponding assignment scores will contain the letter S. These three blocks of code essentially follow the same logic as their analogous blocks for the primary reference (a, b, c). Only those secondary identifiers with the type "identity" are considered in these steps. These blocks also account for the possibility that the set of secondary references may point to more than one protein (ROG). This ambiguity may be resolved in block g using the protein's sequence.
g: resolving ambiguities using interaction record provided sequence
This logical block attempts to resolve ambiguous assignments that may have arisen in the above blocks by using the protein sequence (as provided by the interaction record). This case is marked in the assignment score by the character "+" and may be resolved if the SEGUID for the sequence (provided by the interaction record) matches one (and only one) of the possible assignments (see O in Table 2).
h: resolving ambiguities using arbitrary methods
This logical block makes an arbitrary assignment where more than one assignment is possible and the ambiguity could not be resolved by using the protein sequence (block g). This case is marked in the score by the character "+" and may be arbitrarily resolved by choosing the assignment that has the expected taxonomy identifier (X) or by simply choosing the assignment with the largest SEGUID (L). The largest SEGUID is determined as the last in a list of SEGUID strings that have been sorted in ascending lexicographical order (see Methods). Arbitrary assignment is a stop-gap measure and assignments with scores containing L or X (without an O) should be treated with caution (see X and L in Table 2).
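The order in which blocks g and h try to resolve an ambiguous assignment can be summarised in code. The sketch below is illustrative only: the function name `resolve_ambiguity` and the representation of candidates as `(seguid, taxid)` pairs are assumptions, as is the choice to accept a taxonomy match only when exactly one candidate has the expected taxon.

```python
def resolve_ambiguity(candidates, record_seguid=None, record_taxid=None):
    """Pick one of several candidate assignments.

    candidates: non-empty list of (seguid, taxid) pairs.
    Returns ((seguid, taxid), score_flags).
    """
    # Block g: exactly one candidate has the sequence given in the record.
    same_sequence = [c for c in candidates if c[0] == record_seguid]
    if record_seguid is not None and len(same_sequence) == 1:
        return same_sequence[0], "O+"
    # Block h, first option: one candidate carries the expected taxon.
    same_taxon = [c for c in candidates if c[1] == record_taxid]
    if record_taxid is not None and len(same_taxon) == 1:
        return same_taxon[0], "X+"
    # Block h, fallback: the last SEGUID in ascending lexicographical order.
    # Python compares ASCII strings by code point, so max() gives that entry.
    return max(candidates, key=lambda c: c[0]), "L+"
```

The fallback is deterministic but carries no biological meaning, which is why assignments flagged X or L without an O call for caution.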
i: using interaction record sequences and archival sequences
In the event that no assignment is possible (i.e., no matching entry is found in the SEGUID table), but the interaction record lists a sequence, the SEGUID is calculated for the sequence and a new entry is made in the SEGUID table (see N in Table 2). The interactor is assigned to the corresponding ROG. Likewise, if a protein reference can be used to retrieve an archived (obsolete) sequence from eUtils, the SEGUID is calculated for the sequence and a new (N) entry is made in the SEGUID table. This mechanism serves as a stop-gap measure (see section 3.3).
3.2 Review of assignments made by database
Assignment of protein interactors to ROG's.
3.3 Review of assignment scores
Table 4. Number of protein references successfully assigned to ROG's, broken down by assignment score.
Columns: total number with this score type (%), ROG assignment score, number of cases, details for one example.
UniProt:Q15118 is cited in the interaction record as the primary reference (P).
UniProt:P94102 is cited in the interaction record as the secondary reference (S).
"protein accession" is cited as the source database for accession Q9Z2F5 (D).
Accession NP 191913 is cited in a modified form (M) without the underscore.
EntrezGeneId:26207 (G) encodes multiple proteins (+) but only one matches the original (O) sequence given in the interaction record (RefSeq:NP_858057.1).
UniProt:O95686 is cited and updated (U) to UniProt/KB:Q9UQK1.
GenBank GI:12962935 is cited and updated to RefSeq:NP_002458.2 using eUtils (E).
UniProt:P38706 is cited. Two possible updates are possible (+) but only one matches the original (O) sequence in the interaction record (P0C2H6).
Protein reference cites taxon id as 9534 (African green monkey) but the sequence record cites taxon 9606.
Protein reference cites taxon id as 40674 (mammalia) but the sequence record cites taxon 9606 (human).
UniProt:O04063 is cited with taxon identifier 4530 (rice). More than one updated accession exists (+U). Only one possibility has the same sequence as cited in the interaction record (P0C5B0) with taxon identifier 39947 (a specific strain of rice).
The primary reference cited is not found. 49 secondary references are cited (S). 15 of these were found to map to 8 distinct proteins (+). The protein with the largest (L) SEGUID is arbitrarily chosen.
UniProt:Q9MAY7 is cited with a taxon id of 4530 (rice). Two updated accessions are available (+U). Neither one has the expected sequence or taxon id (T) given in the interaction record. The accession with the largest (L) SEGUID is arbitrarily chosen.
EntrezGene:9912 is cited (G). This gene encodes two proteins (+). Neither has the sequence expected from the interaction record. The one with the largest (L) SEGUID is selected.
Primary accession P84244 cited as a "see also" (Q) reference with taxon id 9606. The sequence record cites taxon id 10090 (T).
Q95Q01 is an obsolete accession. The sequence is retrieved from the interaction record. The SEGUID and ROGID are calculated and stored locally as a new entry (N).
RefSeq:NP_010441 is an obsolete accession. The sequence is retrieved using eUtils (E). The SEGUID and ROGID are calculated and stored locally as a new entry (N).
EntrezGene 196549 (G) is cited and encodes two proteins (+). The protein accessions cited by EntrezGene are retired. Sequences are retrieved using eUtils (E). One matches the sequence cited in the interaction record (O). The SEGUID and ROGID are calculated and stored locally as a new entry (N).
Type 1 assignments were least problematic. In all cases, an unambiguous assignment to a ROG was possible using either a primary or secondary reference (P, S). In a few cases, version information was ignored (V), the source database (D) was relaxed or minor modifications (M) to the accession were allowed in order to find the corresponding entry in our SEGUID table. In total, type 1 assignments accounted for 77% of the assignments made.
Type 2 assignments required that the accession provided by the source database be updated using either UniProt secondary accessions (U) or NCBI eUtils (E). In all cases, an unambiguous assignment was made. In a few cases, the sequence provided by the interaction database was required to accomplish this (score PUO+). In total, type 2 assignments accounted for 3% of the assignments made. There is likely always to be some asynchrony between interaction database releases and the major sequence databases; the ability to map accessions to their most recent versions is therefore an essential component of any integrative effort.
Type 3 assignments involved references where the taxonomy identifier provided by the interaction database was different than the 'true' taxon provided by the source sequence record. Type 4 assignments represent those rare cases where both an update to an accession was required and the true taxonomy identifier was different than expected. Type 3 and 4 cases accounted for 16% of all assignments and were typically the result of the interaction record listing a taxon that is parental to the true taxon (e.g. mammalia is listed in place of human or a species taxon is used in place of the 'true' sub-strain identifier). For the most part, this practice was not a problem for our purposes because the protein reference pointed to a single sequence record where the 'true' taxonomy identifier was listed. However, these differences must be carefully considered when analyzing data or when designing search strategies based on taxonomy identifiers.
Type 5 assignments involved references that could be mapped to a number of different proteins (see + in assignment score) and that could not be resolved using sequence data provided in the record. In some cases, this was resolved by choosing the ROG that had the expected taxonomy identifier (X) or by the arbitrary method of choosing the assignment with the largest SEGUID (L) according to its ASCII value. The majority of cases arose from use of internal identifiers for the protein's primary reference and where the list of alternative secondary accessions pointed to multiple proteins. PSI-MI guidelines suggest that proteins be represented with stable identifiers such as UniProtKB or RefSeq accessions, and our results would recommend that distribution of internal identifiers in PSI-MI files be avoided. This would have been possible in the majority of cases. In a few cases, retired UniProt accessions were found that had been split into several new records. The correct mapping was not discernible even when taking into account taxon and sequence information present in the interaction record (see scores with characters U, L and +). Ambiguity also arose from the use of EntrezGene identifiers that point to multiple protein products.
Finally, type 6 assignments involved interactors for which no matching reference or sequence existed in our SEGUID table. The protein sequence provided by the interaction record (or retrieved from archival sources) was used to construct a new (N) SEGUID entry. This served to group together any other interactors that might have the same sequence in the current build of the index. This is a stop-gap measure and new SEGUID entries are discarded from one build of the database to another. The majority of these cases are due to NCBI accessions that have been retired and for which mappings to new accessions are non-existent and not easily automated (e.g. RefSeq:NP_116649). This discontinuity between related sequences is a limitation of NCBI accession use. In some cases, UniProt accessions circumvent this problem by providing secondary (retired) accessions in active records as a built-in history of the sequence; however, in cases where sequence records are split, this may lead to multiple mappings that are not easily disambiguated (see above). On the other hand, the NCBI accession.version and GenInfo system allows for simple and unambiguous updating of a sequence while the accession is active (BIND uses this system and had no ambiguous (type 5) assignments). In the end, there is no perfect system and integrating data will be dependent on updates and clarifications from source databases.
In summary, our method has allowed us to unambiguously map 96% of all protein interactors to redundant object groups (Table 4, score types 1–4). The remaining 4% of proteins are problematic either because our mappings are ambiguous (score type 5) or because we are unable to make any mapping at all to a current sequence (score type 6 and unassigned). These identifiers will be the subject of further investigation with the source databases. The Protein Identifier Cross-Reference Service was released while this project was under development; implementation of this resource provides access to UniParc sequences and may resolve some of these unassigned identifiers. Table 4 results have been broken down by database and will be made available to interested source databases.
4. Mapping protein-protein binary interactions and complexes to redundant interaction groups (RIG's)
Summary of mapping interaction records to RIG's.
Redundancy between pairs of interaction datasets processed in this study.
Number of unique RIG's and ROG's by source organism.
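An interaction key with the properties stated earlier (identical for any two records that refer to the same set of ROG's, and derivable from sequences, taxonomy identifiers and the Secure Hash Algorithm alone) can be obtained by hashing a canonical ordering of the participants' ROG identifiers. The sketch below shows one such construction. It is an assumption made for illustration, not a statement of the exact procedure used to build RIG identifiers, and the name `interaction_key` is hypothetical.

```python
import base64
import hashlib


def interaction_key(rogids):
    """Return an order-independent key for a list of participant ROG ids."""
    # Sorting makes the key independent of the order in which the source
    # record happens to list its participants.
    canonical = "".join(sorted(rogids))
    digest = hashlib.sha1(canonical.encode("ascii")).digest()
    # Same encoding as used for SEGUID's: base64 with "=" padding removed.
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because the input is a list of any length, the same function covers binary interactions and complexes. Whether repeated participants (for example in a homodimer) should be collapsed before hashing is a design choice that this sketch leaves open.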
Utility of consolidated data
The publicly distributable subset of the iRefIndex is available in a PSI-MITAB 2.5 tab-delimited text file under a Creative Commons license (see Methods). This format allows for the types of analyses described above.
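As an illustration of the kind of analysis a tab-delimited index permits, the following sketch builds a non-redundant list of interaction partners for each interactor. It assumes only what the PSI-MITAB 2.5 format specifies, namely that the first two columns hold the identifiers of interactors A and B; the function name `partners_by_interactor` is hypothetical, and which identifier type the iRefIndex files place in those columns should be checked against the distributed files.

```python
from collections import defaultdict


def partners_by_interactor(lines):
    """Map each interactor identifier to the set of its partners.

    lines: iterable of PSI-MITAB text lines (for example an open file).
    """
    partners = defaultdict(set)
    for line in lines:
        # Skip blank lines and comment or header lines starting with "#".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue
        a, b = columns[0], columns[1]
        # Sets absorb redundant records and the two orientations of a pair.
        partners[a].add(b)
        partners[b].add(a)
    return partners
```

Passing an open file handle, for example `partners_by_interactor(open(path))` with `path` pointing at a downloaded index file, yields the mapping for the whole file.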
This paper presents a detailed description of methods used to consolidate protein interaction data from a variety of sources. We have paid special attention to describing exactly how assignments to redundant groups have been made. Our goals have been three-fold. First, we hope that this resource will be used by source interaction databases to identify problem records and improve data accessibility. Second, we believe that a non-redundant dataset is essential to future work (especially where it involves complex data). Third, we have presented methods to generate global keys for protein interactions and complexes.
Redundancy can be defined in a number of ways. We have chosen to define ROG's based on exact sequence and taxonomy identifier match. ROG's may be further grouped into larger redundant groups that include all products derived from a single gene or proteins that are very similar in sequence. We could have based ROG's on gene identifiers to begin with; but, we do not believe that this is the best possible course for three reasons. First, not all proteins are easily mapped to Entrez Gene identifiers. Second, many genes encode redundant protein products. Finally, mapping an interaction to a set of gene identifiers suggests that all products of those genes are involved in the interaction. Once this generalization has been made, there is no way back to specify that some pairs of gene products do not interact. Additionally, the methods suggested here allow external groups to generate universal RIG's and ROG's for their own data sets and integrate them with this dataset. They may then choose to further redefine redundancy according to their own purposes.
As an example, modelling the structure of large macromolecular complexes such as the nuclear pore complex is dependent on information from a collection of sources (including interaction data) that helps constrain possible structure solutions. Interaction data for complex structure components may be collected from a variety of databases using our index. Further, these data may be supplemented with interaction data for analogous components in a range of organisms. These protein components may have anywhere from low sequence similarity to near sequence identity with the modelled components; their interactions could be easily overlaid with the modelled complex using a simple Perl script and a file that maps analogous proteins into the same redundant object group. As a further example, recent work has shown that genes associated with complex diseases tend to cluster in the human interaction network [46, 47]. Again, these clusters could be supplemented with interaction information from analogous gene products in other organisms. Locating these interactions is facilitated using a list of sequence-related ROG identifiers since they are independent of the protein database references used to construct the original interaction records.
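The overlay described above could be sketched roughly as follows. This is an illustrative Python sketch, not the Perl script mentioned in the text; the function name regroup, the mapping group_of and every identifier shown are hypothetical placeholders rather than real ROG identifiers.

```python
def regroup(interactions, group_of):
    """Collapse interactions onto user-defined groups of ROG identifiers.

    interactions: iterable of (record_id, list_of_rog_identifiers)
    group_of:     dict mapping a ROG identifier to a group label; identifiers
                  missing from the mapping keep their own ROG identifier
    """
    merged = {}
    for record_id, rogids in interactions:
        # Sorting makes the key independent of the order of the interactors.
        key = tuple(sorted(group_of.get(r, r) for r in rogids))
        merged.setdefault(key, []).append(record_id)
    return merged


# Hypothetical data: placeholder strings stand in for real ROG identifiers.
group_of = {
    "rogA_yeast": "groupA", "rogA_human": "groupA",
    "rogB_yeast": "groupB", "rogB_human": "groupB",
}
interactions = [
    ("rec1", ["rogA_yeast", "rogB_yeast"]),
    ("rec2", ["rogB_human", "rogA_human"]),
    ("rec3", ["rogA_human", "rogC_human"]),
]
merged = regroup(interactions, group_of)
for members, records in sorted(merged.items()):
    print(members, records)
```

Under this assumed mapping, the yeast and human records rec1 and rec2 fall into the same group-level interaction, while rec3 stays separate because one of its interactors has no mapped group.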
Our experience suggests that the casual user of interaction data would find it practically impossible to collect and consolidate interaction data using web-interfaces to the numerous source databases. In addition, the effort expended in setting up and optimizing this system suggests that the task is also inaccessible to most bioinformatics groups unless they are dedicated to analyzing interaction data (see Methods).
The present index provides useful access to a non-redundant view of interaction data using updated protein identifiers and corrected taxonomy identifiers. This paper represents more than a year of development time but is still a work in progress that will require further dialogue with the source databases. In the meantime, users are advised to keep in mind a number of cautionary points.
First, no assessment was made of the accuracy of source records. Source records and literature references should be consulted for further details. We have assumed that protein references listed in interaction records represent those interactors described in the primary literature. In truth, in some cases, the authors of this primary literature themselves may be uncertain of the identity of the exact gene products or protein isoforms that mediate the interaction. In other cases, authors may reference proteins using only a name with no database reference. There is no allowance (or prescribed method) within the PSI-MI specification to deal with such ambiguity and different databases deal with this problem in different ways. BioGRID, for instance, intentionally curates interactions at the gene level; all of the protein products for an interactor's gene may be listed within an interaction record. This practice led to a number of ambiguous assignments for BioGRID interactors. Members of the IMEx consortium adhere to published curation guidelines and we would suggest that other databases also publish their guidelines. Guidelines will understandably differ (even within the same database over time); making these guidelines transparent is an essential part of providing access to data.
Second, PSI-MI XML is a means of exchanging data. It does not guarantee uniform representation of meaning. For instance, some experimental methods (such as immunoprecipitations from cell extracts) will result in "complex data". IntAct PSI-MI XML records will represent these data as a list of multiple interactors in a single interaction. This "complex" grouping carries no information about the binary interactions between member proteins or about their stoichiometry. In contrast, BioGRID will represent complex data using a series of binary interactions with one "hub" interactor in common where the hub may be a tagged "bait" protein used to immunoprecipitate other proteins from an extract (the so-called "spoke" representation). The latter representation implies binary interactions that may not exist and runs the risk of confounding complexes that share a common "hub". The IntAct method is more amenable to our bipartite representation in Figure 4 where complexes are represented by a separate node type. These differences in representation may be accounted for during analysis or possibly even normalized prior to analysis. Again, publication of curation guidelines is an important first step towards this. At present, the iRefIndex PSI-MITAB file preserves complex information where it is present in the source record.
In the near future, we hope to collaborate with members of the IMEx consortium to create a reference index of all publicly available interaction data. The two BioGRID examples given above came to light after soliciting feedback from all source databases on our results; other examples are certain to follow. Eventually, the issues identified by this process can be built into checks carried out by the PSI-MI Validator [4, 48] to avoid recurrence of these problems by databases and data submitters.
Providing a web interface to these data is an obvious priority. We believe this would be best accomplished by providing a programmatic web-services interface to our data. This would allow source databases and applications such as Cytoscape to provide a user interface to these data that would re-direct users to the appropriate source database(s). A Common Query Interface was proposed at the recent HUPO-PSI meeting in Toledo, Spain and development is in progress on its implementation by each of the databases.
Finally, our index has the ability to consolidate interaction records derived from a common publication by multiple source databases. This view would facilitate cross-checking between databases where such duplications exist. Further, this view may help in the process of normalizing the representation of interactions using common standards and controlled vocabulary. This process is expected to be most important for legacy records predating the 2006 IMEx consortium agreement.
The iRefIndex dataset represents a carefully constructed non-redundant index of interaction data. This resource has numerous applications and may form the basis of further efforts to improve access to information and to normalize representation of interactions.
Construction of SEGUID tables
The SEGUID proteome database was downloaded in tab-delimited format and loaded into a MySQL database table with corresponding column names and data types. This data set lists SEGUID's for about 6 million proteins along with their corresponding protein database sources, accessions and taxonomy identifiers. This table was supplemented with our own source-database identifier (allowing us to map the provided source database names to a normalized list of databases provided by the PSI-MI database-citation controlled vocabulary). In addition, we added a Redundant Object Group (ROG) identifier and unique integer equivalents for each distinct SEGUID and ROG identifier to facilitate faster record retrieval on an integer key.
We independently regenerated SEGUID entries for recent releases of UniProt (Release 13.1) and RefSeq (Release 28). UniProt SEGUID's were regenerated in order to resolve SEGUID differences between GenBank and UniProt versions of the same accession due to asynchrony between the two databases at the time. GenBank's versions of UniProt sequences were deleted from the original SEGUID dataset to prevent this problem from recurring. Protein isoform sequences were also retrieved for UniProt sequences and their corresponding SEGUID entries were made. Finally, we independently regenerated SEGUID entries for GenBank sequence records derived from the Protein Data Bank's (PDB) structural records in order to retrieve chain identifiers for protein accessions. These suggested modifications have been relayed to the SEGUID database.
Calculation of SEGUID's for approximately 3.8 million RefSeq records required 32 minutes using the hardware configuration described below. This included the time required to read each sequence from a database, calculate the SEGUID and write the results to a data table.
The final SEGUID table was partitioned into five tables according to the source database in order to reduce lookup and retrieval time. These divisions included UniProt, RefSeq, GenBank, "all other sequence databases" and sequences found in interaction records.
Source data for interactions
Original molecular interaction data were downloaded from nine interaction databases detailed in Table 1. Interaction databases included BIND [22, 23], BioGRID, DIP, HPRD [26, 27], IntAct [28, 29], MINT, MPact, MPPI and OPHID. Interaction records formatted in version 2.5 of the PSI-MI standard were used wherever possible. In the case of the BIND database, we experienced problems with the available PSI-MI 1.0 dataset and so BIND's flat files (tab-delimited text files) were used instead. These flat files are derived by parsing BIND database records written in the BIND data format. BIND records that describe protein-protein interactions using EntrezGene identifiers are not properly included in this flat file, which led to a high number of unassigned interactors (Table 3, column 4). Flat files and BIND-XML files are available from the authors upon request since they are no longer available from the BIND site now administered by Thomson Scientific as part of the BOND database [56, 57]. OPHID is no longer updated and is being replaced by I2D.
Parsing interaction data
PSI-MI XML files were processed using a Java parser employing the Streaming API for XML library (StAX). StAX is a pull-parser and does not generate a memory tree during operation; this allowed for processing of very large files without exceeding available RAM. Data retrieved from each interaction record included the interaction database name and record accession. Interaction record accessions were not provided by HPRD, MPPI or by OPHID.
Separate configuration files were written for each data source allowing the parser to handle both PSI-MI version 1.0 and 2.5 formatted files and variations in the use of the data structure by different interaction data providers. Interaction records providing evidence that some interaction does not occur were excluded from the consolidated interaction database (PSI-MI 2.5 Path: entrySet/entry/interactionList/interaction/negative).
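The streaming approach and the exclusion of negative interactions can be illustrated with a short sketch. It is written in Python using the standard library's incremental parser (xml.etree.ElementTree.iterparse) as a stand-in for the Java StAX parser described above; it is not the authors' code. The function names and the sample document are hypothetical, and the sketch assumes that the negative flag appears as a direct child of the interaction element with the value "true" or "1".

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace such as '{net:sf:psidev:mi}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_positive_interactions(source):
    """Yield the id attribute of every interaction not flagged as negative."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "interaction":
            continue
        negative = any(
            local(child.tag) == "negative"
            and (child.text or "").strip() in ("true", "1")
            for child in elem
        )
        if not negative:
            yield elem.get("id")
        # Discard the finished interaction subtree to limit memory use.
        elem.clear()


# Hypothetical, heavily simplified PSI-MI-like document for illustration.
SAMPLE = b"""<entrySet><entry><interactionList>
<interaction id="1"><negative>true</negative></interaction>
<interaction id="2"/>
<interaction id="3"><negative>false</negative></interaction>
</interactionList></entry></entrySet>"""

kept = list(iter_positive_interactions(io.BytesIO(SAMPLE)))
print(kept)
```

Clearing each interaction element once it has been handled is what keeps memory use modest on large files; a real parser would also need the per-source configuration described above.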
Parsing 3220 IntAct files (835 Mb) required 47 minutes using the hardware configuration described below. This time included reading XML files from disk, parsing and writing information back to the relational database.
Redundant object group (ROG) assignments were made for protein interactors as described in Figure 2 and the accompanying text in the results section. Assignment of approximately 628 thousand protein references to ROG's required 3 hours using the hardware configuration described below. This time included reading data from a relational database, making the assignment and writing information back to the relational database.
Physical implementation for dataset production
All development work was performed on a single Linux workstation (2 X Intel(R) Xeon(TM) dual core 3.00 GHz CPUs each with 2 GB RAM). However, the backend relational database management processing required a more capable server-grade system, due to memory and disk storage space requirements. EMBNet Norway provided a server with 2 quad core Xeon x86_64 processors and a total of 16 Gigabytes of RAM that ran a suitably tuned MySQL 5.0 database server.
The production of the datasets relied on the InnoDB storage engine in order to ensure 'ACID' based transaction safety. The MySQL InnoDB buffer was scaled to 8 Gigabytes (approximately 50% of the system RAM) ensuring that a large portion of table index operations were performed in memory and thus minimizing disk I/O operations as much as possible.
Additional server-side optimizations included the configuration of certain parameters of the underlying ext3 file system. Briefly, a 4 Kbyte ext3 file system block size gave fast file system response for MySQL and other backend processes, which typically create large files (several Gigabytes). This block size also permits a maximum file size of 2 Tbytes, which is essential to accommodate the very large files created for MySQL tables.
Finally, Ethernet link aggregation technology was employed to connect the RDBMS server to a number of development workstations and central file storage areas, in order to increase the speed of secondary post production file access operations via the Network File System.
All these optimizations reduced the production time of a typical dataset from weeks (non-optimized system) to a few days and produced a backend platform that could scale appropriately as the size and number of integrated databases grow.
PSI-MI validation and tag counts
We attempted to validate each of the PSI-MI input files against their relevant schemas (2.5 or 1.0). Only IntAct, MINT and HPRD files validated; all other files returned errors of various types including missing elements, identifiers and unexpected data types. This required that the PSI-MI file parser be customized for each input file in order to account for varying uses of the PSI-MI schema. We used an independent method to ensure our parser had found all interactors and interactions in input files. Tags indicating end of interactor and interaction elements were counted using text parsing methods. These counts exactly matched the number of elements retrieved using the XML parser on each of the input files. A similar analysis confirmed the number of protein interactors and interactions returned by our BIND flat-file parser.
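The independent tag-count check can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the authors' implementation: the function names and the sample document are hypothetical, and the text-based count only looks at closing tags, so self-closing elements would not be counted.

```python
import re
import xml.etree.ElementTree as ET


def count_closing_tags(xml_text, name):
    """Count closing tags for `name` by text matching, ignoring XML structure."""
    pattern = r"</(?:[\w.-]+:)?" + re.escape(name) + r"\s*>"
    return len(re.findall(pattern, xml_text))


def count_parsed_elements(xml_text, name):
    """Count elements called `name` (any namespace) found by an XML parser."""
    root = ET.fromstring(xml_text)
    return sum(1 for elem in root.iter() if elem.tag.rsplit("}", 1)[-1] == name)


# Hypothetical, heavily simplified PSI-MI-like document for illustration.
SAMPLE = """<entrySet><entry>
<interactorList>
  <interactor id="1"><names><shortLabel>a</shortLabel></names></interactor>
  <interactor id="2"><names><shortLabel>b</shortLabel></names></interactor>
</interactorList>
<interactionList>
  <interaction id="3">
    <participantList>
      <participant id="4"><interactorRef>1</interactorRef></participant>
      <participant id="5"><interactorRef>2</interactorRef></participant>
    </participantList>
  </interaction>
</interactionList>
</entry></entrySet>"""

for tag in ("interactor", "interaction"):
    by_text = count_closing_tags(SAMPLE, tag)
    by_parser = count_parsed_elements(SAMPLE, tag)
    print(tag, by_text, by_parser, "OK" if by_text == by_parser else "MISMATCH")
```

The pattern matches the exact element name only, so closing tags such as interactorList or interactorRef do not inflate the interactor count.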
Extensive spot-checking was performed during development and after the final build. In addition, we used web-services provided by IntAct and MINT to confirm the number of interactions and interactors parsed from those sources. No errors were found. Some interactions were returned that were not present in our data set because they were not included in the IntAct release that we parsed or because of differences between identifiers distributed in the XML file versus the web service.
Lexicographical ordering of SEGUID's and ROGID's
SEGUID's are SHA-1 keys written in canonical base64 form with trailing = characters removed. ROG identifiers concatenate a SEGUID with a numerical taxonomy identifier. Therefore, the allowable characters in a SEGUID or ROG identifier are (in ascending ASCII or Unicode value): +, /, 0-9, A-Z and a-z.
Lists of SEGUID or ROG identifiers were sorted in ascending ASCII-based lexicographical order. The comparison of two strings in a list is achieved by comparing the successive characters in each character index (starting from 0) until one string is determined to be greater than the other. The ASCII or Unicode value of each character determines if one character is 'less' than, 'equal' to or 'greater' than the other. This ordering was implemented using the Arrays.sort method in Java. The Perl sort function can be used to achieve equivalent results. Example code is provided with a test case on the iRefIndex FTP site (see below).
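The ordering can be illustrated with a short Python sketch (the authors used Arrays.sort in Java; Python's built-in sorted compares strings by code point and so should give the same order for these characters). The function compare_ascii and the short identifiers are hypothetical, and placing a string before any longer string that it prefixes is an assumed convention that the text does not spell out.

```python
from functools import cmp_to_key


def compare_ascii(a, b):
    """Compare two strings character by character, starting from index 0."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ord(ca) < ord(cb) else 1
    # All shared positions are equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


# Hypothetical short strings standing in for SEGUID or ROG identifiers.
ids = ["b1", "B1", "+1", "91", "/1", "a1", "B", "B10"]

ordered = sorted(ids, key=cmp_to_key(compare_ascii))
print(ordered)

# The built-in sort gives the same result without a custom comparison.
assert ordered == sorted(ids)
```

Note that upper-case letters sort before lower-case letters in this scheme, which is why locale-aware or case-insensitive sorting should be avoided when generating keys.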
Creation of RIGID's
A RIG identifier (RIGID) is constructed by concatenating ROG identifiers (after sorting them in ascending lexicographical order as described above), applying the SHA-1 algorithm to the resulting string, converting the digest to its base64 representation and removing all trailing "=" characters used for padding. It is important to note that the starting list includes a ROG identifier for each interactor listed in the interaction (even if these ROGID's are repeated). As a result, the RIGID takes into account the stoichiometry of the interactors present in an interaction; a complex composed of three A-type subunits will have a different RIGID than a complex with four A-type subunits.
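The construction above can be sketched in Python using only the standard library. This is a minimal sketch, not the authors' Java code: the function names are hypothetical, the sequences are invented toy strings, the sequence is assumed to be upper-cased before hashing (as in the SEGUID definition), and the ROG identifiers are assumed to be joined with no separator.

```python
import base64
import hashlib


def sha1_base64(text):
    """SHA-1 digest of text in base64, with trailing '=' padding removed."""
    digest = hashlib.sha1(text.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def seguid(sequence):
    """SEGUID of a protein sequence (assumed to be upper-cased first)."""
    return sha1_base64(sequence.upper())


def rogid(sequence, taxid):
    """ROG identifier: SEGUID followed by the numerical taxonomy identifier."""
    return seguid(sequence) + str(taxid)


def rigid(rogids):
    """RIG identifier for a list of ROG identifiers, repeats included."""
    return sha1_base64("".join(sorted(rogids)))


# Invented toy sequences; 9606 is the NCBI taxonomy identifier for human.
a = rogid("MKTAYIAK", 9606)
b = rogid("MADEEKLP", 9606)

print(rigid([a, b]) == rigid([b, a]))        # interactor order is irrelevant
print(rigid([a, a, a]) == rigid([a, a, a, a]))  # stoichiometry is not
```

Because the list is sorted before hashing, two records listing the same interactors in different orders receive the same key, while repeated ROG identifiers change the concatenated string and therefore the key.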
Availability and requirements
A subset of data consolidated by iRefIndex is available under a Creative Commons Attribution license via anonymous FTP at ftp://ftp.no.embnet.org/irefindex (user name: "ftp" password: "anonymous"). Presently, iRefIndex is updated manually by rebuilding the entire dataset. Releases will be accompanied by a detailed README file listing the release number, release date, a detailed description of the format and any change notices. No regular release schedule has been set at this time.
This index is provided in a tab-delimited, PSI-MITAB 2.5 text format. Data in this format may be imported into the Cytoscape interaction viewer [69, 70]; however, users are advised that importing the entire index into Cytoscape is likely to be time-consuming and they should instead first select those interactions (rows) of interest for visualization. Additional details are available at http://iRefIndex.uio.no. iRefIndex provides a single entry for each distinct interaction group with links to source databases describing that interaction. ROG assignment scores are not provided but are available on request. Other distributions of iRefIndex data are possible and are actively being developed; this format was chosen in the hopes that it would prove immediately useful to the widest possible audience. Suggestions are welcome.
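Selecting rows of interest before visualization might look like the following Python sketch. It assumes the standard 15-column PSI-MITAB 2.5 layout, in which the tenth and eleventh columns hold the taxonomy identifiers of interactors A and B in the form taxid:NNNN(name). The function names, the sample rows and the handling of comment lines beginning with "#" are hypothetical, and the README accompanying each release should be consulted for the exact column contents.

```python
def taxids(field):
    """Return the set of 'taxid:NNNN' identifiers in one taxonomy column."""
    return {item.split("(", 1)[0] for item in field.split("|")}


def select_rows(lines, taxid):
    """Yield rows (as column lists) where both interactors have this taxid."""
    wanted = "taxid:%s" % taxid
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment or empty lines (assumed convention)
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a complete MITAB row
        if wanted in taxids(cols[9]) and wanted in taxids(cols[10]):
            yield cols


def make_row(id_a, id_b, tax_a, tax_b):
    """Build a hypothetical 15-column row with '-' in the unused columns."""
    cols = ["-"] * 15
    cols[0], cols[1], cols[9], cols[10] = id_a, id_b, tax_a, tax_b
    return "\t".join(cols)


# Hypothetical rows; a real run would iterate over the opened index file.
sample = [
    "#hypothetical header line",
    make_row("idA", "idB", "taxid:9606(human)", "taxid:9606(human)"),
    make_row("idC", "idD", "taxid:4932(yeast)", "taxid:4932(yeast)"),
    make_row("idE", "idF", "taxid:9606(human)", "taxid:96060(other)"),
]
selected = list(select_rows(sample, 9606))
print(len(selected), selected[0][0], selected[0][1])
```

Comparing whole taxid tokens rather than substrings avoids accidental matches such as taxid:96060 when filtering for taxid:9606.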
iRefIndex currently includes only those data sets that clearly can be redistributed under the licenses of the source databases. HPRD, DIP and MPact are not included in the public distribution at this time. The entire index can be provided on request to academic researchers under a collaborative research agreement. Source databases interested in having their data included in the iRefIndex should contact the corresponding author. Source code is available on request under a GNU-GPL license.
We would like to thank the following for useful feedback and discussions: Andrew Winter, Mike Tyers, Ulrich Gueldener, Sandra Orchard, Luisa Montecchi, Henning Hermjakob, Tony Chiang, Philipp Pagel, Lukasz Salwinski, Igor Jurisica, Paul Boddie and Katerina Michalickova.
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS computational biology. 2007, 3 (3): e42. 10.1371/journal.pcbi.0030042.
- Shoemaker BA, Panchenko AR: Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS computational biology. 2007, 3 (4): e43. 10.1371/journal.pcbi.0030043.
- IMEx. [http://imex.sourceforge.net/]
- Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, Vinod N, Bader GD, Xenarios I, Wojcik J, Sherman D, Tyers M, Salama JJ, Moore S, Ceol A, Chatr-Aryamontri A, Oesterheld M, Stumpflen V, Salwinski L, Nerothin J, Cerami E, Cusick ME, Vidal M, Gilson M, Armstrong J, Woollard P, Hogue C, Eisenberg D, Cesareni G, Apweiler R, Hermjakob H: Broadening the horizon–level 2.5 of the HUPO-PSI format for molecular interactions. BMC biology. 2007, 5: 44. 10.1186/1741-7007-5-44.
- INSDC: International Nucleotide Sequence Database Collaboration. [http://www.insdc.org]
- DIP: Database of Interacting Proteins. [http://dip.doe-mbi.ucla.edu]
- IntAct. [http://www.ebi.ac.uk/intact]
- MINT: The Molecular Interaction Database. [http://mint.bio.uniroma2.it/mint]
- MPact. [http://mips.gsf.de/genre/proj/mpact]
- BioGRID. [http://www.thebiogrid.org]
- Bader GD, Cary MP, Sander C: Pathguide: a pathway resource list. Nucleic Acids Res. 2006, 34 (Database issue): D504-506. 10.1093/nar/gkj126.
- Jayapandian M, Chapman A, Tarcea VG, Yu C, Elkiss A, Ianni A, Liu B, Nandi A, Santos C, Andrews P, Athey B, States D, Jagadish HV: Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together. Nucleic Acids Res. 2007, 35 (Database issue): D566-571. 10.1093/nar/gkl859.
- Aragues R, Jaeggi D, Oliva B: PIANA: protein interactions and network analysis. Bioinformatics. 2006, 22 (8): 1015-1017. 10.1093/bioinformatics/btl072.
- Cerami EG, Bader GD, Gross BE, Sander C: cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioinformatics. 2006, 7: 497. 10.1186/1471-2105-7-497.
- Clark T, Martin S, Liefeld T: Globally distributed object identification for biological knowledgebases. Briefings in bioinformatics. 2004, 5 (1): 59-70. 10.1093/bib/5.1.59.
- Cote RG, Jones P, Martens L, Kerrien S, Reisinger F, Lin Q, Leinonen R, Apweiler R, Hermjakob H: The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics. 2007, 8: 401. 10.1186/1471-2105-8-401.
- Iragne F, Barre A, Goffard N, De Daruvar A: AliasServer: a web server to handle multiple aliases used to refer to proteins. Bioinformatics. 2004, 20 (14): 2331-2332. 10.1093/bioinformatics/bth241.
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004, 4 (7): 1985-1988. 10.1002/pmic.200300721.
- Smith M, Kunin V, Goldovsky L, Enright AJ, Ouzounis CA: MagicMatch–cross-referencing sequence identifiers across databases. Bioinformatics. 2005, 21 (16): 3429-3430. 10.1093/bioinformatics/bti548.
- Babnigg G, Giometti CS: A database of unique protein sequence identifiers for proteome studies. Proteomics. 2006, 6 (16): 4514-4522. 10.1002/pmic.200600032.
- SEGUID Proteome Database. [http://bioinformatics.anl.gov/seguid]
- Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003, 31 (1): 248-250. 10.1093/nar/gkg056.
- Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, Buzadzija K, Cavero R, D'Abreo C, Donaldson I, Dorairajoo D, Dumontier MJ, Dumontier MR, Earles V, Farrall R, Feldman H, Garderman E, Gong Y, Gonzaga R, Grytsan V, Gryz E, Gu V, Haldorsen E, Halupa A, Haw R, Hrvojic A: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005, 33 (Database issue): D418-424.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (Database issue): D535-539. 10.1093/nar/gkj109.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32 (Database issue): D449-451. 10.1093/nar/gkh086.
- Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13 (10): 2363-2371. 10.1101/gr.1680803.
- Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, Menon S, Hanumanthu G, Gupta M, Upendran S, Gupta S, Mahesh M, Jacob B, Mathew P, Chatterjee P, Arun KS, Sharma S, Chandrika KN, Deshpande N, Palvankar K, Raghavnath R, Krishnakanth R, Karathia H, Rekha B, Nayak R, Vishnupriya G: Human protein reference database–2006 update. Nucleic Acids Res. 2006, 34 (Database issue): D411-414. 10.1093/nar/gkj141.
- Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H: IntAct–open source resource for molecular interaction data. Nucleic Acids Res. 2007, 35 (Database issue): D561-565. 10.1093/nar/gkl958.
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004, 32 (Database issue): D452-455. 10.1093/nar/gkh052.
- Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007, 35 (Database issue): D572-574. 10.1093/nar/gkl950.
- Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V: MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res. 2006, 34 (Database issue): D436-441. 10.1093/nar/gkj003.
- Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, Ruepp A, Frishman D: The MIPS mammalian protein-protein interaction database. Bioinformatics. 2005, 21 (6): 832-834. 10.1093/bioinformatics/bti115.
- Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21 (9): 2076-2082. 10.1093/bioinformatics/bti273.
- Secure Hash Algorithm. Federal Information Processing Standards Publication. 2002, 180-2.
- Base64 Java Class. [http://iharder.sourceforge.net/current/java/base64]
- NCBI Taxonomy Browser. [http://www.ncbi.nlm.nih.gov/Taxonomy]
- Bairoch A, Apweiler R, Wu C: UniProt Knowledgebase User Manual. UniProt Consortium. 2008, 12.8.
- The NCBI Handbook: Data flow and processing. [http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.section.ch13.Data_Flow_Components]
- Sirotkin K, Tatusova T, Yaschenko E, Cavanaugh M: The Processing of Biological Sequence Data at NCBI. The NCBI Handbook. NCBI. 2006.
- Entrez Programming Utilities. [http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]
- Entrez Gene. [http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene]
- Bermudez VP, Maniwa Y, Tappin I, Ozato K, Yokomori K, Hurwitz J: The alternative Ctf18-Dcc1-Ctf8-replication factor C complex required for sister chromatid cohesion loads proliferating cell nuclear antigen onto DNA. Proc Natl Acad Sci USA. 2003, 100 (18): 10237-10242. 10.1073/pnas.1434308100.
- Scholtens D, Gentleman R: Making sense of high-throughput protein-protein interaction data. Stat Appl Genet Mol Biol. 2004, 3: Article39.
- Scholtens D, Vidal M, Gentleman R: Local modeling of global interactome networks. Bioinformatics. 2005, 21 (17): 3548-3557. 10.1093/bioinformatics/bti567.
- Alber F, Dokudovskaya S, Veenhoff LM, Zhang W, Kipper J, Devos D, Suprapto A, Karni-Schmidt O, Williams R, Chait BT, Rout MP, Sali A: Determining the architectures of macromolecular assemblies. Nature. 2007, 450 (7170): 683-694. 10.1038/nature06404.
- Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104.
- Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen AG, Rigina O, Hinsby AM, Tumer Z, Pociot F, Tommerup N, Moreau Y, Brunak S: A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007, 25 (3): 309-316. 10.1038/nbt1295.
- Orchard S, Salwinski L, Kerrien S, Montecchi-Palazzi L, Oesterheld M, Stumpflen V, Ceol A, Chatr-aryamontri A, Armstrong J, Woollard P, Salama JJ, Moore S, Wojcik J, Bader GD, Vidal M, Cusick ME, Gerstein M, Gavin AC, Superti-Furga G, Greenblatt J, Bader J, Uetz P, Tyers M, Legrain P, Fields S, Mulder N, Gilson M, Niepmann M, Burgoon L, De Las Rivas J: The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol. 2007, 25 (8): 894-898. 10.1038/nbt1324.
- MySQL. [http://dev.mysql.com/downloads]
- OLS: Ontology Lookup Service. [http://www.ebi.ac.uk/ontology-lookup]
- Consortium U: The universal protein resource (UniProt). Nucleic Acids Res. 2008, 36 (Database issue): D190-195.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (Database issue): D61-65. 10.1093/nar/gkl842.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic acids research. 2008, 36 (Database issue): D25-30.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
- Bader GD, Hogue CW: BIND–a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics. 2000, 16 (5): 465-477. 10.1093/bioinformatics/16.5.465.
- BIND. [http://bond.unleashedinformatics.com/]
- Hogue CW: The other side of staying out of a BIND. Nat Biotechnol. 2007, 25 (9): 971. 10.1038/nbt0907-971a.
- OPHID: The Online Predicted Human Interaction Database. [http://ophid.utoronto.ca/ophid]
- StAX. [https://java.sun.com/webservices/docs/1.6/api/index.html]
- PSI-MI 2.5 browser. [http://psidev.sourceforge.net/mi/rel25/doc]
- EMBnet Norway. [http://www.biotek.uio.no/EMBNET]
- InnoDB MySQL Manual. [http://dev.mysql.com/doc/mysql/en/innodb.html]
- ACID transactional properties. [http://en.wikipedia.org/wiki/ACID]
- The Linux ext3 file system. [http://en.wikipedia.org/wiki/Ext3]
- IEEE 802.3ad Link Aggregation website. [http://www.ieee802.org/3/ad/]
- The Network File System (NFS) protocol. [http://tools.ietf.org/html/rfc3530]
- The Base16, Base32, and Base64 Data Encodings. [http://tools.ietf.org/html/rfc4648]
- Creative Commons. [http://creativecommons.org]
- Cytoscape. [http://cytoscape.org]
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13 (11): 2498-2504. 10.1101/gr.1239303.
- Parrish JR, Yu J, Liu G, Hines JA, Chan JE, Mangiola BA, Zhang H, Pacifico S, Fotouhi F, DiRita VJ, Ideker T, Andrews P, Finley RL: A proteome-wide protein interaction map for Campylobacter jejuni. Genome Biol. 2007, 8 (7): R130. 10.1186/gb-2007-8-7-r130.
- Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K: A protein interaction map of Drosophila melanogaster. Science. 2003, 302 (5651): 1727-1736. 10.1126/science.1090289.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.