Schema Playground: a tool for authoring, extending, and using metadata schemas to improve FAIRness of biomedical data

Background Biomedical researchers are strongly encouraged to make their research outputs more Findable, Accessible, Interoperable, and Reusable (FAIR). While many biomedical research outputs are more readily accessible through open data efforts, finding relevant outputs remains a significant challenge. Schema.org is a metadata vocabulary standardization project that enables web content creators to make their content more FAIR. Leveraging Schema.org could benefit biomedical research resource providers, but it can be challenging to apply Schema.org standards to biomedical research outputs. We created an online browser-based tool that empowers researchers and repository developers to utilize Schema.org or other biomedical schema projects. Results Our browser-based tool includes features which can help address many of the barriers towards Schema.org-compliance such as: The ability to easily browse for relevant Schema.org classes, the ability to extend and customize a class to be more suitable for biomedical research outputs, the ability to create data validation to ensure adherence of a research output to a customized class, and the ability to register a custom class to our schema registry enabling others to search and re-use it. We demonstrate the use of our tool with the creation of the Outbreak.info schema—a large multi-class schema for harmonizing various COVID-19 related resources. Conclusions We have created a browser-based tool to empower biomedical research resource providers to leverage Schema.org classes to make their research outputs more FAIR. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05258-4.


Background
Funding agencies, international consortia, institutional policies, and publisher requirements have helped promote the adoption of the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles [4,41] for biomedical research data sharing to varying degrees of success.While it is now standard to make datasets accessible and potentially reusable via deposition of the dataset in a repository, metadata *Correspondence: gtsueng@scripps.edustandardization issues (i.e.-lack of standardization in how datasets are described) continue to make it challenging for researchers to make datasets findable, interoperable, and reusable.To address these issues, domain experts and data stewards have been inspecting the gap between principle and practice [23]; extending [19], adapting [15], and adopting the principles [12]; creating their own metadata standards [6] and data schemas [12,16,29].However a large gap remains between the communities that develop standards and the adoption of these standards by data and resource providers due to issues in communication, education/training, incentives, and the availability of supportive tools [14,17].For example, the Dublin Core Metadata Initiative (DCMI) provides a metadata ontology (i.e.-a structured vocabulary for classifying and describing metadata): terms and data elements (Dublin Core Metadata Initiative [9], two generaluse schema classes (i.e.-sets of metadata vocabulary used to describe a conceptual entity): core and qualified, and a thorough guide for utilizing their ontology with their model-based framework for creating schemas: the Dublin Core Application Profile (DCAP) guide [8].The DCAP guide was intended to empower data providers to mix and match Dublin Core (and other) metadata terms/elements (properties) to create new application profiles (schemas) to suit their needs.While the core (data element) schema has been widely-adopted, the lack of authoring tools to help create more type/conceptspecific schemas and the lack of tools for transforming schemas into working formats for consumption and implementation has hampered the adoption and implementation of DCAP [1].Even after standardization communities successfully introduce standards, their adoption, modification, and implementation are frequently defined by widely used tools or repositories within a specific community [12].
Schema.org is a metadata vocabulary standardization project founded by the major search engine companies such as Google, Microsoft, Yahoo, and Yandex.It is an open source, collaborative initiative that develops metadata standards for improved searchability.While domain-neutral, Schema.orgwelcomes proposals and discussions of new properties and classes from anyone, including domain-specific ontology or schema development groups, via participation in their W3C group [38].Members of metadata ontology development communities (including the aforementioned DCMI, as well as LRMI, and other W3C groups) [3,39] have been involved with, have influenced, and have successfully integrated some of their vocabulary into Schema.org[2,32].Schema.org already includes some biomedically relevant classes (i.e.-conceptual entities) like Datasets and Medical Study, and applying Schema.orgclasses to biomedical research resources would improve interoperability, enabling researchers readily ingest existing resources and to leverage search engine-based solutions (like Google Dataset Search) to find resources of interest.Furthermore, the hierarchical nature of schemas from Schema.org allows for inheritance of vocabulary sets (sets of properties) from parent schemas.Although there have been some efforts to leverage Schema.org to improve findability of scientific research data [20,29,31]  Bioschemas is an open and collaborative effort that has been actively promoting the use of Schema.org in the life sciences by serving as a hub for researchers to create new biomedically relevant classes with the goal of refining and proposing these classes to Schema.org[11], Profiti et al. [30], and by raising awareness about the usefulness of metadata schemas.The Bioschemas community has also identified the need for easy-to-use tools to help improve public accessibility and participation in the schema development process.
Here, we describe the Data Discovery Engine's (DDE) Schema Playground, a webbased tool that improves the ease of using any registered schema or Schema.orgclasses.Our tool allows users to easily find and visualize relevant classes from Schema.org,Bioschemas, BioLink [5], and others, extend them; create JSON schema validation rules [22]; and save/share the newly created classes for others to reuse.Our tool expresses schemas in JSON-LD format, improving interoperability of schemas which might normally otherwise be viewed as HTML tables.Our tool also includes a framework for building data registries and creating guides for data submission; however, the implementation and integration of these features on our site is restricted to partner organizations.We introduce the features of this tool, review its value to different types of users, demonstrate its application towards the creation of a new schema for COVID-19-related resources, and discuss its adoption by the Bioschemas metadata standardization community.

Implementation
The Data Discovery Engine's Schema Playground is a browser-based tool built with Vue.js [43], Python/Tornado [35], and the BioThings Software Development Toolkit [25].Schemas from Schema.org and other consortia/projects are stored and made searchable using MongoDB [27] and Elasticsearch [34].The code for the Schema Playground can be found at https:// github.com/ bioth ings/ disco very-app and is free to use under the Apache License 2.0.The schema generated by the DDE are exported as JSON-LD files [21], following RDF schema specifications [40] with embedded JSON Schema metadata validation rules [22].The COVID-19 Outbreak.inforesource schemas were developed by comparing metadata properties across multiple type-specific repositories to identify properties in common.For example, metadata from LitCovid/PubMed, BioRxiv/ MedRxiv, various journals like JAME, NEJM and others, and the metadata from publications found on Zenodo, Figshare and others were compared in order to identify a suitable schema for COVID-19-related publications.Similarly, protocols from protocols.ioand the BioSchemas LabProtocol class were compared to develop a schema for COVID-19-related protocols.Once the desired properties and structure for each class of COVID-19-related resource was identified, the schemas were created by extending existing Schema.orgclasses using the DDE Schema Playground.

Results
The DDE Schema Playground consists of two standard (and fully-accessible) components and two related, custom (limited-access) components (Fig. 1).The standard components improve the ease of use of schemas and classes, while the custom components help communities to reap the benefits of their use.The Schema Editor allows users to import community standard schemas like Schema.org and customize them for biomedical purposes.These extended schemas can then be shared in the Schema Registry, which allows users to view the schemas and reuse them.When used in conjunction with Data Portals built with BioThings SDK [25], The DDE Schema Playground can automatically generate data submission forms known as Data Guides.
To understand how the Schema Playground might help to bridge the gap between data standardization communities and data resource providers, we identified potential utility and value of each of the DDE Schema Playground components for different types of users in our partner communities (Fig. 2).
Any data portals and guides can be used by anyone with sufficient access rights, but the creation of a data portal or data guide requires partnerships with our team to actualize.For the outbreak portal, data submission via the guide is open to all and utilizes GitHub for authentication.For other portals, access may be restricted as required by the responsible partner organization.The data portal and data guides allow data providers and data consumers to collect, share, and use data.Since the data guide converts a custom schema into a web-based data submission form, it enables data consumers and data standardizers to visually inspect and understand the burden of structure.
The schema registry and editor allows data providers and/or standardization communities to find, customize, and share schemas.Sharing schemas via the registry will make it easier for data consumers to understand how to consume data from a data provider and to create data validation if one is not available from the data provider.For example, data-use restrictions usually require a data consumer to create an account with the data provider in order to access data.However, data consumers cannot easily determine whether the data locked behind the account-creation process will actually be useful prior to creating an account.Sharing the data schema via the DDE registry could address this issue by allowing data consumers to understand what's available without actually displaying any restricted-access data.Having a central location for schemas submitted by data providers will also make it easier for data standardization communities to evaluate the needs of the biomedical research community.To our knowledge, the DDE's schema registry is the only crowd-sourced registry for type/concept-specific schemas created specifically for the biological and biomedical research space.To further illustrate the value of the schema registry and editor, we compare and detail the features of the DDE Schema Playground with available tools for creating, applying, and consuming other major schemas such as Schema.organd Bioschemas.
Schema.org, Bioschemas and other data standardization efforts have built strong communities to generate consensus on data modeling for the creation of new schemas or the improvement of existing schemas.Hence, there are extensive processes in place (but few tools) for the creation of a new schema based on Schema.org or any other schemas.Because of its widespread adoption, there are third party tools available for utilizing and consuming markup from Schema.org.The Bioschemas community has developed a process for defining new classes and has a set of tools which cover both the creation of a new schema (google spreadsheet conversion), utilization of a schema (markup generation), and evaluation of use (markup validation, scraping), but these tools vary in usability based on the users programming experience.In contrast to Schema.org,Bioschemas also defines cardinality (allowable number of values per property) and marginality (optional vs required value) in its profile schemas as these are important to the life sciences research community.Although the DDE schema playground was developed independently from the Bioschemas community, our interests aligned and we sought to provide complimentary schema tools to facilitate biomedical schema development and adoption.To do this, we identified schema tools and features available directly from the Schema.organd Bioschemas communities.We expanded the list of tools by searching for "Schema.orgtools", "schema generation tools", "schema creation tools", "schema editing tools", "schema validation tools", "bioschemas tools" in Google and in bio.tools).Bio.tools yielded two relevant results (biovalidator and ObjTables), while Google yielded multiple tool reviews/lists which generally featured similar sets of tools.Most user-friendly tools were aimed towards the generation, extraction, or validation of schema-compatible markup rather than the development of schemas themselves (Additional file 2: Supplemental Table 1).The Bioschemas community has a few well-documented tools for schema development, but many of those tools were only available as source code and required basic programming experience.We focused our efforts on features for which userfriendly tools for schema creation and reuse, resulting in a web-based application that empowers individual data resource providers to utilize and customize existing schemas from Schema.org and other similar efforts.As seen in Table 1, these features include:

Searching and viewing schemas from Schema.org and other metadata standardization efforts
The DDE Schema Playground allows for the visualization of JSON-LD-formatted schemas hosted online either on GitHub or elsewhere (Additional file 1: Supplemental Figure 1A).This allows users who are familiar with Schema.org to review their compliant schema in a more human readable format.The DDE Schema Playground also has a searchable registry of classes from Schema.org,BioLink, BioThings, Bioschemas, and others.Users may browse and visualize the schemas for various classes from these sources to identify the classes of most interest to them (Additional file 1: Supplemental Figure 1B).If a community like Bioschemas or consortia like the National COVID Cohort Collaborative (N3C) [13] is interested in making a new schema available for searching and viewing, they can import and register their JSON-LD-formatted schema.The DDE Schema Playground also enables users to compare up to four schemas.For example, there are multiple Dataset schemas available in the registry, and users can compare them to see what properties are unique to each and what properties they share (Additional file 1: Supplemental Figure 1C).

Extending and customizing a pre-existing schema for a particular use
The ability to browse and inspect pre-existing schemas makes it easier for a user to customize or extend the schema to suit their own purpose.All the properties from the pre-existing schema will be inherited in the extended schema; however, the user may select properties for which validation is desirable.The user can also create new properties to be included in the extended schema.For example, the Dataset class from Schema.org serves as a potential foundation, but a schema focused on COVID-19-related datasets may need additional fields (e.g., infectiousAgent).To tailor the Dataset schema, we find and extend it from the registry (Additional file 1: Supplemental Figure 2A).After we create a name for our schema (the namespace) and the class, we can customize it.We can select to include any property that is available from the schema we are extending (Additional file 1: Supplemental Figure 2B), and we can create new properties (e.g., infectiousAgent) that are tailored to our needs (Additional file 1: Supplemental Figure 2C).This feature also serves as an easy way to maintain Bioschemas profiles as users can update a registered profile by extending from it, making the necessary changes, and pushing them back to Bioschemas.Outside the tool, there is only manual writing/editing of JSON-LD, YAML, TTL, SHACL, ShEx, or other file types and running command-line tools for customizing an existing schema in an interoperable format and making it human-friendly viewable online.

Creating validation for the schema for data quality enforcement
Marginality (whether a property is required or not) and cardinality (whether a property can have one or multiple values) are two aspects of schema properties that are not expressed well by Schema.orgbut are desirable to biomedical researchers (Additional file 1: Supplemental Figure 3A and 3C).In the DDE Schema Playground, this is handled via the creation of JSON Schema validation rules, and the DDE's Schema Validation Editor provides a simple drag and drop mechanism to create straightforward validations (Additional file 1: Supplemental Figure 3B).For slightly more complex validations, the user can edit the validation rule before dragging and dropping it into the property of interest.In our example Dataset schema, we may want to restrict the values for our new property (infectiousAgent) such that they map to and are standardized by an ontology.We edit the example JSON schema validation rules for an ontology to tailor it to the NCBI Taxon ontology (Additional file 1: Supplemental Figure 3D).Schema development working groups often leverage the work done by ontology groups to ensure that the values of a property are standardized.Once these JSON schema validation rules have been created, they can be used to test the validity of JSON-LD-formatted metadata using any of the many third party metadata validation tools and program libraries that are already available.Future features of the DDE will include a built-in metadata validation tool.

Exporting and saving a schema generated by the Schema Playground editor
The DDE Schema Playground allows you to export/download your newly created schema locally and it is also integrated with GitHub, allowing users to save to their GitHub repository (Additional file 1: Supplemental Figure 4A-C).The integration with GitHub allows the edits to the schema to be made by multiple parties and provides the schema owner the option of pulling changes to the schema.Additionally, the schema can be forked and edited/customized allowing for re-use of the schemas which in turn improves findability and reusability of resources which follow the schemas.

Registering a newly created schema in the DDE schema registry to facilitate its extension and re-use
Once saved in GitHub, users can review their schema with the schema viewer and add it to the registry to enable others to easily re-use it (Additional file 1: Supplemental Figure 1A).This provides a user-friendly interface for editing, customizing, and re-using schemas for those who prefer not to manually edit text and format in JSON-LD.
The DDE Schema Playground offers any user the ability to reuse and extend existing schemas.This tool is primarily to assist in the authoring of schemas for use in other applications.In addition, we have converted three Dataset schemas into "guides", which are web-based forms for annotating resources using schemas authored in the DDE Schema Playground.Annotations created using these guides are stored within a resource registry hosted within the DDE.There are currently three public guides based on the Dataset schemas for the Outbreak.infoweb application [28], the N3C initiative, and the CD2H consortium [7].While the creation of guides from schemas is not a fully-automated feature that is available to all users, most of the underlying components are reusable, additional guides can be constructed and hosted within the DDE through collaboration.The Bioschemas community has integrated the DDE schema playground as part of its schema creation and update process to improve participation by members who lack the programming expertise needed to participate via their previous pipeline.

Creating the COVID-19 Outbreak schema using the Schema Playground
Schema.org classes are often simultaneously too broad (lacking properties needed) and too narrow (including too many irrelevant properties) for a specific research purpose.For this reason, it becomes necessary to adapt schemas to suit needs of a biomedical research project.Outbreak.info is a project from the Su, Wu, and Andersen labs at Scripps Research to unify COVID-19 and SARS-CoV-2 epidemiology and genomic data, published research, and other resources [10,37].The standardization of published research and other resources was accomplished by creating a single, multiclass schema to harmonize the metadata: The COVID-19 Outbreak schema.This schema can be found in the DDE registry at https:// disco very.bioth ings.io/ view/ outbr eak/ and was built via the DDE Schema Playground with some manual editing (for merging all the classes into a single schema).There are six principal classes in the Outbreak schema (Analysis, Dataset, Clini-calTrial, ComputationalTool, Protocol, Publication) and many subclasses to support the principal classes.As seen in Table 2, the classes in the Outbreak schema were extended from related Schema.orgclasses (whenever available) and were created based on metadata comparisons from a variety of related sources.By extending from existing schemas, we reuse existing metadata properties when appropriate, and create new properties only when necessary.
For example, the level of detail provided by Protocol Registration System (PRS) schema used by the National Clinical Trial (NCT) registry is more granular than Schema.org'sMedicalStudy class, but broad enough that it encompasses properties from both child classes of MedicalStudy (MedicalTrial and MedicalObservationalStudy).The child classes of MedicalStudy only differ in the property name for the study design (tri-alDesign vs studyDesign), and this property is not delineated in PRS.Further, the PRS includes many properties not currently available in any of these Schema.orgclasses.Adopting the PRS directly was also problematic as we planned to ingest records from other registries like the World Health Organization's Clinical Trial registry (WHOCT), and the PRS was also more granular than WHOCT.For this reason, the Outbreak.infoClinicalTrial class was created by using the DDE to extend from Schema.org, leveraging the PRS-WHO crosswalk [42], and creating properties that could help with issues previously identified [26].In addition to adapting Schema.orgclasses to normalize record data from multiple sources within a class, Outbreak.infoneeded to normalize common metadata properties between different classes.The hierarchical nature of Schema.orgclasses simplified this process, as many derivative classes inherit properties from the Thing class.For example, the Protocol class in the Outbreak schema was extended from the HowTo class in Schema.organd was based on properties identified from available metadata in protocols.ioand the LabProtocol profile from Bioschemas.Since both the Schema.orgclasses, MedicalStudy and HowTo, are derivatives of Thing, the Outbreak schema naturally has properties in common across multiple classes and can normalize the metadata across these classes allowing for cleaner query design and improved search functionality.This schema is currently used to harmonize and improve FAIRness of metadata from over 300,000 resource entries in the Outbreak.inforesearch library at https:// outbr eak.info/ resou rces.

Adoption of the Schema Playground into the Bioschemas schema development and maintenance pipeline
Previously, the pipeline for updating a Bioschemas specification involved the use of a google spreadsheet for attaining community consensus, a command-line tool for converting the CSV from the spreadsheet to yaml, cloning the Bioschemas website repository and copying/editing HTML and YAML files, running Jekyll to test the changes, editing example files in the Bioschemas specifications repository, and creating pull requests for the Bioschemas website repository once everything had performed as tested.The level of expertise needed in order to update a specification has been discussed in multiple Bioschemas community calls as a potential barrier to participation.After initial tests during and after Biohackathon 2021, the Bioschemas community has decided to adopt the DDE into its schema development and maintenance pipeline.Manuals for using the DDE to create or update Bioschemas specifications have been developed, and automated scripts using GitHub actions have been developed to more tightly integrate the tool into the pipeline.As seen in Fig. 3, the process for updating Fig. 3 The Bioschemas profile update process before (left) and after (right) the inclusion of the DDE a Bioschemas profile requires less technical expertise after the integration of the DDE.While the process prior to and after the DDE still requires the ability to edit a YAML/ JSON file (brown) and the ability to use GitHub (black), the DDE-based process does not require the user to have the technical knowledge needed to run tools via the command line (green), or to use Jekyll (blue).

Discussion
In an effort to make scientific resources more FAIR, communities in the biological sciences (Bioschemas), earth sciences (ESIP's Science on Schema.orgcluster), and more are working diligently to align and influence Schema.org to suit the needs of the scientific research community [11,33].These communities serve as an important bridge between domain-specific ontology development groups and the domain-neutral Schema.orgby introducing Schema.org to the scientific research resource providers, identifying existing ontologies to leverage, and creating tailored schemas more suitable for the research community.For example, ontology development groups, like PPEO/MIAPPE [24] and EDAM [18], have been consulted or have participated in the development of the Sample and ComputationalWorkflow schemas by the corresponding Bioschemas working groups.A term from PPEO/MIAPPE was included as a property in the Sample schema, while JSON schema validation rules enforce the use of terms from EDAM as values for certain properties in the ComputationalWorkflow schema.
Although communities like Bioschemas have helped to create more relevant classes or improve existing classes, it is difficult to push these suggestions to Schema.orgwithout compelling use cases or widespread adoption of these tailored classes.For example, the Bioschemas community first developed the Gene class (with input from gene resource providers, Gene Ontology proponents and gene resource consumers) in 2018.However, it was not included as a pending class in Schema.orguntil 2021 due to a lack of widespread adoption.The Bioschemas community spent considerable time and effort on education and training in order to increase the adoption of Bioschemas classes; however, participation in the development of the classes was hampered by the technological expertise needed in order to update a Bioschemas class.The availability of user-friendly tools can make it easier to find and use Schema.organd other community-driven schema classes, and empower data providers and researchers to engage in schema authoring and sharing.
Most tools for utilizing existing Schema.orgclasses focus on the utilization of an existing schema (such as markup generation) and lack the ability to customize the schema in a Schema.org-compliantway.Tools that do allow customizing/creating a schema (e.g., Bioschemas GoWeb) often require some degree of programming.The DDE Schema Playground is a browser-based tool that enables members of the research community to easily adapt schemas to suit their need and to enable community re-use of their schemas through the DDE schema registry.This encourages and empowers researchers to structure their data in a Schema.org-compliantfashion earlier on in the scientific research process rather than as an afterthought.The schema authoring by the research community, for the research community will encourage the creation and adoption of new classes and properties, which may have previously been neglected due to the absence of representation (e.g., volunteers with subject matter expertise) in data standardization communities.In this fashion, the DDE Schema Playground allows for researchers to express and share their data structuring needs with the data standardization community without diverting attention away from their primary research efforts.Data standardization communities also benefit because their volunteer time can be concentrated on classes already in use by researchers (but could benefit from some standardization), and diverted away from classes which lack interest/support from the research community at large.
There are many ways to express schemas (i.e., SHACL, ShEx), but the DDE only supports the expression of schemas in JSON-LD/JSON Schema format due to the widespread adoption of the JSON-LD format by resource providers and library/tool developers.In addition to this restriction, there are important limitations as to what can or cannot be registered into the DDE schema registry.Schema registration in the DDE is currently limited class-based schema (i.e., classes described by sets of properties) rooted in Schema.org,while many well-used, domain-neutral metadata ontologies (such as DCMI) and schema have properties that are not necessarily tied to any class.These classless metadata vocabularies intentionally do not group the properties into classes in order to encourage the mix-and-match of properties.Although classless metadata vocabularies cannot be registered in their entirety as classless properties in the DDE at this time, the DDE can flexibly ingest properties from any metadata vocabulary (whether or not they are class-based) as long as it is properly formatted (i.e., conforms to JSON-LD/JSON Schema formatting).This means that users can build their schema by extending from Schema.org,Bioschemas, or any registered schema, and incorporate properties from OWL, DCMI, or any other accessible vocabulary as needed.For example, all Bioschemas profile classes also include the conformsTo property from DCMI, and the NIAID Dataset schema [36] also leverages properties from OWL.In theory, classes inheriting just a single property from a Schema.orgclass, but otherwise built entirely from other metadata ontologies can be viewed and registered in the DDE.
We tested the use of the DDE Schema Playground to create customized Schema.orgcompliantclasses that could be used to normalize metadata between multiple types (datasets, clinical trials, publications, etc.) of COVID-19-related resources and applied these schemas towards a searchable resource site (https:// outbr eak.info).The Outbreak resource schema is available in the DDE schema registry which is also includes schemas from Schema.org,Bioschemas, BioLink, the National COVID Cohort Collaborative (N3C), the National Institute of Allergy and Infectious Diseases (NIAID) and more.We hope others will join us in making their open data more interpretable, interoperable, and reusable by adding their schemas to the schema registry.

Conclusion
We have created a user-friendly browser-based tool which facilitates the application of Schema.orgtowards biomedical research outputs.We demonstrate its use with the creation of the Outbreak.infoschema, its adoption into the Bioschemas schema development pipeline, and we encourage others to register and reuse Schema.org-compliantschemas.We welcome user feedback which has and continues to help identify desirable new features and tools (i.e., metadata validation tools) which will be added in the near future.
and many generic repositories (like Figshare and Zenodo) are compliant, Schema.orgremains largely underutilized by the biomedical research community.

Fig. 1 Fig. 2
Fig. 1 Components of the DDE Schema Playground and how they work together

Table 1
Comparison of Schema.org,Bioschemas, and DDE Schema Playground ✓ available, separate tools available, *process in place, **feature exists, but not generally available

Table 2
Classes in the Outbreak schema and how they were created and used