Molecular interaction data are crucial to the study and understanding of the molecular biology of a cell. These data are large and complex, but the creation of a standardised data interchange format (PSI-MI XML) allowed easier access, enabling users to merge data from disparate resources and encouraging the development of tools and software that facilitated network visualisation and analysis. Version 1.0 [1] of the format allowed only a relatively simple description of protein interactions. As the data grew, limitations of the original format were identified, and an updated version, PSI-MI XML 2.5 [2], was released in 2007. It allows the description of interactions between molecules other than proteins, and enables the detailed capture of both experimental context and the constructs used in each assay. This version of the interchange format is still widely used to capture experimental data, but the need to describe more abstract concepts has recently resulted in the release of PSI-MI XML 3.0 [3]. PSI-MI XML 3.0 allows the capture of details of cooperative or allosteric binding sites, the composition of protein complexes taken from multiple publications, and more complex data types such as dynamic interaction networks that change with time or with concentration of agonist. A simpler tab-delimited representation of molecular interaction data has also been available since 2007, but this too has grown in complexity in response to user requests, and MITAB2.5, 2.6 and 2.7 are now all available [2]. Additionally, at the 2017 HUPO-PSI workshop, the Molecular Interaction workgroup decided that the newly developed MI-JSON would be its recommended format for serving interaction data to web pages and visualisation tools.
PSI-MI XML, MITAB and MI-JSON can all hold the same data, in differing degrees of detail, and are all annotated using a single shared controlled vocabulary, but they exist to serve different user groups. The XML format is largely used by software developers and database managers, MITAB by biologists interested in a simple binary representation, and MI-JSON for visual representation. Updating any data format necessitates changes to many dependent systems. A broad range of software, including curation, editing, export, visualisation, validation and analysis packages, uses the PSI-MI formats to access and manipulate the data and consequently needs to be updated with every format update. Format updates add complexity to existing software packages, as the programs need to be extended to use the new version whilst continuing to support existing, widely used versions. The software and standards are used by a diverse group of organisations with different levels of resources, ranging from PhD students in small research groups to data pipeline specialists in pharmaceutical or bioinformatics companies. Some groups may end up using legacy standards and software for many years simply because they do not possess the skills, time, or budget to update their software.
Supporting such diverse needs is time- and resource-intensive, yet securing funding for software maintenance is challenging [4]. Each new data format is useful and must be maintained, but each update generates a new library, with duplicated code, requiring parallel testing and generating its own bugs. In summary, while new formats meet a genuine need, they also result in an expensive cascade of changes to software and tools.
The JAMI (Java Molecular Interaction framework) library was developed, using an object-orientated approach, to address these concerns. JAMI can import, inter-convert and re-export molecular interaction data in a variety of formats and versions. The software has been designed so that modules to read and write new format types can easily be written and added to the library, thus providing a single change-resilient software component to handle all molecular interaction data. It is generally intended that the JAMI library will be used within a Java application, rather than being made available as an API, but users could develop a programmatic interface using the JAMI framework, if required. Given the change-resilient remit of the JAMI framework, it was necessary to ensure that JAMI can handle multiple use cases. It needs to concurrently support legacy data models, contemporary data models, and any changes required in the future, as interaction data become ever more sophisticated. For this reason, the data model was deliberately kept flexible enough to expand, with every class defined as an interface with a default implementation. Implementations may be added, edited or removed over time if necessary. The main entities in the data model include Complex, Interaction, Entity, Participant, and Publication, each an interface with a default implementation and format-specific overloaded behaviours. For example, PSI-MI XML 2.5 [2] allowed an experiment description to contain either a cross-reference to a Publication object or a list of attributes such as author and journal, whereas in PSI-MI XML 3.0 it is possible to associate both of these data members with an experiment [3]. Since the Publication and XML export classes are only interfaces, exporting the two different types of Publication can be handled by the same software, with implementation classes reconciling the two XML versions.
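The interface-plus-default-implementation pattern described above can be illustrated with a minimal sketch. All type names here (`PublicationExportSketch`, `PublicationWriter`, `Xml25StyleWriter`, `Xml30StyleWriter`) are hypothetical and are not taken from the JAMI code base, the emitted markup is a simplified placeholder rather than schema-valid PSI-MI XML, and the identifier and bibliographic values are made up. The sketch assumes only the rule stated in the text: the 2.5-style export writes either the cross-reference or the attribute list, while the 3.0-style export may write both.

```java
import java.util.List;

// Hypothetical sketch: illustrative types only, NOT the JAMI API.
public class PublicationExportSketch {

    /** Model interface: calling code depends on this, never on a concrete class. */
    interface Publication {
        String getPubmedId();       // cross-reference; may be null
        List<String> getAuthors();  // bibliographic attribute; may be empty
        String getJournal();        // bibliographic attribute; may be null
    }

    /** Default implementation supplied alongside the interface. */
    static final class DefaultPublication implements Publication {
        private final String pubmedId;
        private final List<String> authors;
        private final String journal;

        DefaultPublication(String pubmedId, List<String> authors, String journal) {
            this.pubmedId = pubmedId;
            this.authors = List.copyOf(authors);
            this.journal = journal;
        }

        @Override public String getPubmedId() { return pubmedId; }
        @Override public List<String> getAuthors() { return authors; }
        @Override public String getJournal() { return journal; }
    }

    /** Export interface: one implementation per format version. */
    interface PublicationWriter {
        String write(Publication publication);
    }

    // Placeholder markup, not the real PSI-MI element layout.
    static String xref(Publication p) {
        return "<xref id=\"" + p.getPubmedId() + "\"/>";
    }

    static String attributes(Publication p) {
        String journal = p.getJournal() == null ? "" : p.getJournal();
        return "<attributeList authors=\"" + String.join(", ", p.getAuthors())
                + "\" journal=\"" + journal + "\"/>";
    }

    /** Assumed 2.5-style rule: the cross-reference OR the attributes, never both. */
    static final class Xml25StyleWriter implements PublicationWriter {
        @Override public String write(Publication p) {
            String body = p.getPubmedId() != null ? xref(p) : attributes(p);
            return "<bibref>" + body + "</bibref>";
        }
    }

    /** Assumed 3.0-style rule: the cross-reference AND the attributes may coexist. */
    static final class Xml30StyleWriter implements PublicationWriter {
        @Override public String write(Publication p) {
            StringBuilder body = new StringBuilder();
            if (p.getPubmedId() != null) {
                body.append(xref(p));
            }
            if (p.getJournal() != null || !p.getAuthors().isEmpty()) {
                body.append(attributes(p));
            }
            return "<bibref>" + body + "</bibref>";
        }
    }

    public static void main(String[] args) {
        // Made-up values, used purely for illustration.
        Publication pub = new DefaultPublication("12345", List.of("Doe J"), "Example Journal");
        for (PublicationWriter writer : List.of(new Xml25StyleWriter(), new Xml30StyleWriter())) {
            System.out.println(writer.write(pub));
        }
    }
}
```

The calling code in `main` handles one `Publication` and one `PublicationWriter` type throughout; the difference between the two format versions lives entirely in the writer implementations, which is the property the design aims for.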
When included as a library in bioinformatics software, JAMI hides the complexity of supporting multiple data formats. It facilitates data import, integration and analysis, simplifying software development by offering a single API. JAMI also eases the creation of new interchange formats, such as JSON-LD or RDF. A new format needs to be added to JAMI only once and is then supported in multiple software packages with little effort. Similarly, JAMI prevents code duplication: the software packages that draw on JAMI share code, so less effort is spent developing multiple XML/MITAB parsing modules.
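One way to picture how a format can be added once and then become available to every dependent package is a small writer registry. This is a sketch under stated assumptions, not a description of how JAMI is implemented: `FormatRegistrySketch`, `WriterRegistry`, `InteractionWriter` and `BinaryInteraction` are hypothetical names, the format keys are arbitrary, and the output strings are toy layouts rather than real MITAB or MI-JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: illustrative types only, NOT the JAMI API.
public class FormatRegistrySketch {

    /** Minimal stand-in for a binary interaction between two molecules. */
    static final class BinaryInteraction {
        final String idA;
        final String idB;

        BinaryInteraction(String idA, String idB) {
            this.idA = idA;
            this.idB = idB;
        }
    }

    /** One implementation per output format. */
    interface InteractionWriter {
        String write(BinaryInteraction interaction);
    }

    /** Single entry point shared by every application that needs to export data. */
    static final class WriterRegistry {
        private final Map<String, InteractionWriter> writers = new LinkedHashMap<>();

        void register(String format, InteractionWriter writer) {
            writers.put(format, writer);
        }

        Set<String> formats() {
            return writers.keySet();
        }

        String export(String format, BinaryInteraction interaction) {
            InteractionWriter writer = writers.get(format);
            if (writer == null) {
                throw new IllegalArgumentException("Unsupported format: " + format);
            }
            return writer.write(interaction);
        }
    }

    /** Registry preloaded with two toy formats (not real MITAB or MI-JSON layouts). */
    static WriterRegistry defaultRegistry() {
        WriterRegistry registry = new WriterRegistry();
        registry.register("tab", i -> i.idA + "\t" + i.idB);
        registry.register("json", i -> "{\"a\":\"" + i.idA + "\",\"b\":\"" + i.idB + "\"}");
        return registry;
    }

    public static void main(String[] args) {
        WriterRegistry registry = defaultRegistry();
        BinaryInteraction interaction = new BinaryInteraction("P1", "P2");
        // Calling code loops over whatever formats are registered and never changes
        // when a new format is added.
        for (String format : registry.formats()) {
            System.out.println(format + ": " + registry.export(format, interaction));
        }
    }
}
```

In this sketch, supporting a further format amounts to one additional `register` call; applications that call `export` are unaffected, which mirrors the reduction in duplicated parsing and writing code described above.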