LabPipe: an extensible bioinformatics toolkit to manage experimental data and metadata

Background: Data handling in clinical bioinformatics is often inadequate. No freely available tools provide straightforward approaches for consistent, flexible metadata collection and linkage of related experimental data generated locally by vendor software.
Results: To address this problem, we created LabPipe, a flexible toolkit which is driven through a local client that runs alongside vendor software and connects to a light-weight server. The toolkit allows re-usable configurations to be defined for experiment metadata and local data collection, and handles metadata entry and linkage of data. LabPipe was piloted in a multi-site clinical breathomics study.
Conclusions: LabPipe provided a consistent, controlled approach for handling metadata and experimental data collection, collation and linkage in the exemplar study and was flexible enough to deal effectively with different data handling challenges.

In response, we developed a new bioinformatics toolkit (LabPipe) which enabled us to create customised re-usable configurations for consistent experimental metadata and data management. In this article, the toolkit is described, along with how it was deployed in an exemplar study, handling breathomics data collected across multiple sites and analytical chemistry platforms.

Implementation
LabPipe uses a modular client-server architecture (Fig. 1) with an extensible plugin mechanism for adding new features. The toolkit consists of LabPipe Server (LPS): a group of light-weight REpresentational State Transfer (REST) based APIs; and LabPipe Client (LPC): a locally installed front-end to support and manage local data/metadata collection and configuration. The toolkit was created iteratively with Continuous Integration approaches, and includes unit tests covering all data operations within the code-base.
The toolkit was developed taking into account the STRIDE threat model [4] and as such includes a number of security measures. At a basic level it is designed to prevent attacks through potential vectors (e.g. XSS and SQL injection). User credentials are encrypted in both LPC and LPS to minimise the risk of identity theft, and HTTPS connections are enforced by default to prevent man-in-the-middle attacks. The LPS APIs are protected from external injection by sanitising incoming data and rejecting unauthorised query operators such as $where. Both LPC and LPS maintain logs whenever configuration or record data changes, so that unauthorised changes can be traced. Access is further limited by role-based access controls on the API, which restrict users to their allocated boundaries, and by a customisable rate limit on each API to prevent basic denial-of-service attacks. Despite the inclusion of these security measures, we would not advise that LabPipe is used to collect identifiable or sensitive information without additional mitigation and testing.
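The operator check described above can be illustrated with a minimal sketch. This is not LabPipe's own code: the allow-list contents and the function name `is_safe_query` are assumptions made for illustration, and the text does not state which operators LPS actually permits.

```python
from typing import Any

# Hypothetical allow-list; the operators LPS really accepts are not given in the text.
ALLOWED_OPERATORS = {"$eq", "$gt", "$gte", "$lt", "$lte", "$in"}


def is_safe_query(query: Any) -> bool:
    """Return False if any '$'-prefixed key outside the allow-list appears anywhere."""
    if isinstance(query, dict):
        for key, value in query.items():
            # Keys starting with '$' are MongoDB operators; reject unknown ones.
            if isinstance(key, str) and key.startswith("$") and key not in ALLOWED_OPERATORS:
                return False
            if not is_safe_query(value):
                return False
        return True
    if isinstance(query, list):
        # Operators can be nested inside arrays, so check every element.
        return all(is_safe_query(item) for item in query)
    # Plain values (strings, numbers, None) carry no operators.
    return True
```

An allow-list is used here rather than a block-list, so that any operator not explicitly permitted is rejected, including ones introduced in later database versions.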
Fig. 1 The technical architecture of LabPipe. Controllers in LPS and LPC handle specific aspects of the system. Communication between the two is made through the LPS REST API. Metadata and data files are stored locally in an embedded database and local file storage respectively, and uploaded if or when a network connection becomes available. Additional details are provided in the main text.
The APIs provided by LPS handle configuration, data and metadata download/upload and are protected by role-based authorisation. The server stores configurations in a document-based NoSQL database, which allows more flexibility than a relational database. These configurations include pre-defined common parameters such as data collection form templates and study design components (location, operator, instrument). LPS also supports user-defined parameters. Both pre-defined and user-defined parameters are grouped into collections. An editable manifest tells the LPC instances which parameters to retrieve. Access control parameters such as role, access token and API-role mapping are also part of the server configurations. When networked, these configurations are fetched by local LPCs upon set up/use and are also cached locally to enable offline use. It should be possible to set up an LPS instance on any server with Java version 8 or higher and MongoDB version 4 or higher installed, including cloud-based services such as Amazon Web Services. Raw data files are stored as is on the server, with a link between the metadata and the data file stored in MongoDB. Larger data files (by default, those of at least 50 MB) are transferred in pieces and re-assembled at the end, with a checksum used to verify integrity. LPS uses a combination of security token and user/password based approaches to handle authentication.
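The piece-wise transfer of large files can be sketched as follows. This is an illustrative example under stated assumptions, not the LPS implementation: the text does not name the checksum algorithm (SHA-256 is assumed here), and the chunk size and function names are hypothetical.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4  # bytes; deliberately tiny for illustration (hypothetical value)


def sha256_hex(data: bytes) -> str:
    """Checksum of the whole file, computed by the sender before transfer."""
    return hashlib.sha256(data).hexdigest()


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut the payload into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def checksum_matches(chunks: List[bytes], expected_checksum: str) -> bool:
    """Re-assemble the pieces in order and compare against the sender's checksum."""
    return sha256_hex(b"".join(chunks)) == expected_checksum


def reassemble(chunks: List[bytes], expected_checksum: str) -> bytes:
    """Return the re-assembled file, or raise if a piece is missing or corrupt."""
    if not checksum_matches(chunks, expected_checksum):
        raise ValueError("checksum mismatch: re-assembled file is incomplete or corrupt")
    return b"".join(chunks)
```

Computing the checksum over the complete file, rather than per piece, means a missing or reordered piece is detected as well as a corrupted one.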
LPC was built using the Electron framework and acts as both a data collection assistant and a configuration manager for an LPS instance. In the former role, LPC retrieves configurations and form templates from a connected LPS and generates forms for defined processes. In the latter role, users with appropriate privileges can remotely view, add and edit configurations and form templates on the connected LPS. Form templates can be customised to cover different aspects of metadata collection (e.g. from sample collection quality control checks to clinical observations). Field validation can also be added (e.g. minimum; maximum; specific data types; regular expressions; basic ontology support) to enforce data limits and present appropriate errors and warnings to the user. Basic ontology support is set up through an optional field property which specifies the ontology to use and the ID of the concept. The LPS instance verifies the ontology using the BioPortal API [5]. LabPipe also supports different locales, so that it can be used across countries with different numerical and date/time conventions.
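To make the field validation options concrete, the sketch below checks a value against a data type, minimum, maximum and regular expression. The `FieldRule` structure, its property names and the example fields are assumptions for illustration only; they are not LabPipe's actual configuration schema, and ontology checks are omitted.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FieldRule:
    """Hypothetical validation rule for one form field."""
    name: str
    data_type: type
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    pattern: Optional[str] = None  # regular expression the whole value must match


def validate(rule: FieldRule, value: Any) -> List[str]:
    """Return a list of error messages; an empty list means the value is valid."""
    if not isinstance(value, rule.data_type):
        # A type mismatch makes the remaining checks meaningless, so stop here.
        return [f"{rule.name}: expected {rule.data_type.__name__}"]
    errors: List[str] = []
    if isinstance(value, (int, float)):
        # Range limits only apply to numeric values.
        if rule.minimum is not None and value < rule.minimum:
            errors.append(f"{rule.name}: below minimum {rule.minimum}")
        if rule.maximum is not None and value > rule.maximum:
            errors.append(f"{rule.name}: above maximum {rule.maximum}")
    if rule.pattern is not None and isinstance(value, str):
        if re.fullmatch(rule.pattern, value) is None:
            errors.append(f"{rule.name}: does not match pattern {rule.pattern}")
    return errors
```

Returning messages rather than raising on the first failure lets a form show every problem with a field at once.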
A step-by-step guide to setting up and using LabPipe is available as Additional file 1 and at docs.labpipe.org/step-by-step. In brief, it involves setting up a central LPS instance and installing the LPC tool on each of the vendor software PCs, then configuring them to connect to the LPS instance. Following this, a user with the superuser role configures the connected LPS with appropriate forms/fields. Once configuration is complete, LabPipe can be used to manage the data and metadata collection process. Users enter metadata into forms generated by the LPC (according to the LPS configuration) at the point of vendor software data generation/collection. The local LPC then links this metadata to the data files generated by the vendor software (or guides the user to do this) and uploads both to the LPS instance.
For testing purposes, an LPS instance has been made available at https://try-server.labpipe.org, although an LPC will need to be installed on a local PC in order to test it. Some example form/field configurations are also provided which can be implemented.

Results and discussion
LabPipe was piloted in the EMBER study [6], a breathomics study involving staff from different backgrounds with various skill sets. Figure 2 shows how LabPipe was set up to support metadata collection and sample data acquisition from 10 analytical chemistry instruments, covering four different techniques for breath sampling across three sites.
Breath sampling instruments in the study included both online technologies (in which samples were analysed in real-time on the instrument) and offline technologies (in which samples were collected and later transferred to be analysed in the lab), and LabPipe was able to handle data management for these different scenarios. Each patient visit involved the generation of multiple sets of data using vendor software, which was supported through LabPipe and linked to metadata entered at the time samples/sample data were acquired.
LabPipe was enhanced through an interdisciplinary collaboration between informaticians, researchers and clinical staff. The aim was to create a platform which was accessible and easy to use. This was helped through an iterative development process guided by qualitative surveys and informal team discussions to assess pain points in the software. Members of our team agreed that the resulting wizard-based system was better than alternative approaches as it could be set up to guide non-technical members through entire standard operating procedures. In response to researchers' needs, support was also added to LabPipe for controlled access to data through the LPS API and a search/data export user interface in LPC, which made it easier for researchers to carry out their analyses.
The introduction of LabPipe into the EMBER study reduced manual data handling and management, saved time and increased efficiency. Prior to its deployment, paper-based records, manual transcription and removable storage devices were used to manage metadata and data from local PCs. By deploying LabPipe and a guided process for metadata and data collection/linkage, the likelihood of erroneous data and data loss was reduced.
Fig. 2 LabPipe setup and workflow in the exemplar EMBER breathomics study. The EMBER LPS was set up to handle standard operating procedures for four analytical chemistry instrument data/sample collection techniques: proton-transfer-reaction mass spectrometry (PTR-MS); gas chromatography ion mobility spectrometry (GC-IMS); atmospheric pressure chemical ionisation compact mass spectrometry (APCI-CMS); and a breath sampler device which collected samples for later analysis. Re-usable configurations for each technique (shown in the diagram with distinct colours) were set up with data collection forms containing appropriate field processing/validations and data file handlers. Data collection at each site was managed through LPC instances which loaded appropriate configurations to generate metadata forms and guide the user through any manual steps required when saving data files. Once these steps were completed, the LPC automatically uploaded data and metadata to the LPS.
Existing freely available bioinformatics tools with the same scope as LabPipe were investigated [2,3] and only one comparable tool was identified: MASTR-MS.
As in LabPipe, the mechanism for vendor software data acquisition in MASTR-MS is a tool which runs in the background on the client machine (the DataSync Client). However, unlike the LabPipe client, this runs as a light-weight service, and linked metadata is collected using the MASTR-MS web application. While this approach has some advantages, it means that MASTR-MS cannot handle intermittent/no network connections effectively when collecting both metadata and data.
Furthermore, while MASTR-MS provides some flexibility in terms of data collection, it is specifically tailored to metabolomics. In contrast, LabPipe metadata forms/fields can be customised to collect different types of experimental metadata and the vendor data collection/linkage approach can be adapted using the extensible architecture. For example, depending on the vendor tools used to collect data, linkage can be implemented by generating unique file identifiers (concatenating entered data fields into an identifier); through a file watcher/notifier; or by building a vendor software specific data linkage plugin. The extensibility of LabPipe also enables other types of features to be developed. For example, during the EMBER study, a plugin was created which would send researchers notification emails when breath sample metadata and/or data had been uploaded to LabPipe.
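The first linkage option mentioned above, concatenating entered data fields into a unique file identifier, can be sketched as follows. The field names, separator and character clean-up rule are hypothetical choices for illustration; the text does not specify how LabPipe composes its identifiers.

```python
import re
from typing import Mapping, Sequence


def build_file_identifier(
    fields: Mapping[str, str],
    order: Sequence[str],
    separator: str = "_",
) -> str:
    """Join the chosen metadata fields, in the given order, into one identifier."""
    parts = []
    for name in order:
        value = str(fields[name]).strip()
        if not value:
            raise ValueError(f"field '{name}' is empty, identifier would be ambiguous")
        # Replace runs of characters that are unsafe in file names (including
        # the separator itself) with a single hyphen.
        parts.append(re.sub(r"[^A-Za-z0-9.-]+", "-", value))
    return separator.join(parts)


# Example with made-up metadata values:
identifier = build_file_identifier(
    {"site": "SiteA", "instrument": "PTR-MS 1", "subject": "P0012", "date": "2020-03-05"},
    order=["site", "instrument", "subject", "date"],
)
print(identifier)  # SiteA_PTR-MS-1_P0012_2020-03-05
```

Because the identifier is derived only from the entered metadata, the same string can be used to name the vendor data file and to look up the matching metadata record.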
Critically, unlike LabPipe, which is being actively developed, it is unclear whether MASTR-MS is still being supported. The latest release was in 2017 and it was developed using a now end-of-life programming language (Python 2).
Although the exemplar presented here is a breathomics study, we believe that LabPipe is flexible enough to handle data collection from vendor software for other experiment types where a similar metadata/data management approach would be effective. This ability will be further enhanced by our plan to add support for data standards such as ISA [7] and mzTab [8]. To facilitate this, ontology support will be improved, and form handling capabilities will be extended to allow more complex metadata configurations, such as multi-faceted and embedded forms.

Conclusion
Through its deployment in the exemplar study we have shown that LabPipe provides a consistent, controlled way to handle metadata and experimental data collection, collation and linkage for clinical bioinformatics. We believe it provides a straightforward, fully configurable and extensible approach to experiment data handling which could help address the needs of modern lab management.