Ontology-based instance data validation for high-quality curated biological pathways

Background: Modeling in systems biology is vital for understanding the complexity of biological systems across scales and for predicting system-level behaviors. To obtain high-quality pathway databases, it is essential to improve the efficiency of model validation and of model updates based on appropriate feedback.

Results: We have developed a new method that guides the creation of high-quality biological pathways through rule-based validation. The rules check models against biological semantics and improve them for dynamic simulation. In this work, we define 40 rules that constrain event-specific participants and their related features, and that add missing processes based on biological events. The approach is applied to data in the Cell System Ontology (CSO), a comprehensive ontology that represents complex biological pathways with dynamics and visualization. The experimental results show that relatively simple rules can efficiently detect errors made during curation, such as misassignment and misuse of ontology concepts and terms in curated models.

Conclusions: A new rule-based approach has been developed to facilitate model validation and model complementation. Because the validation rules embed biological semantics, they enable us to provide high-quality curated biological pathways. This approach can serve as a preprocessing step for model integration, data exchange and extraction, and simulation.

• Pre-defined instances in CSO and variables for instances are in italics. Pre-defined instances in CSO carry an apostrophe prefix to distinguish them from variables, for example 'FT phosphorylated and 'ME Binding.
For the details of CSO and its schema, please refer to [16].
Criterion 1: validation for structurally correct models
Rule 1. Given one process and one entity, there should be only one connector between them. Otherwise, raise an alert.
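As an illustration of Rule 1, the following minimal sketch flags every process-entity pair that is linked by more than one connector. The tuple representation of connectors and the function name find_duplicate_connectors are assumptions made for this example; they are not part of CSO or of the authors' implementation.

```python
from collections import Counter


def find_duplicate_connectors(connectors):
    """Return the (process, entity) pairs linked by more than one connector."""
    counts = Counter((process, entity) for process, entity, _kind in connectors)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical model: each connector is a (process_id, entity_id, connector_type) tuple.
connectors = [
    ("p1", "e1", "InputProcessBiological"),
    ("p1", "e2", "OutputProcessBiological"),
    ("p1", "e1", "InputInhibitorBiological"),  # second connector between p1 and e1
]

for process, entity in find_duplicate_connectors(connectors):
    print(f"ALERT: more than one connector between {process} and {entity}")
```

Counting connectors per pair keeps the check independent of connector type, which matches the purely structural nature of Criterion 1.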

Criterion 2: validation for biologically correct models
To represent rules for Criterion 2, we first describe three properties that are not defined in CSO. As described before, we are interested in biological interactions and the four related types of connectors. An input entity is an entity that is connected to a process via one of the three input connectors InputAssociationBiological, InputInhibitorBiological, and InputProcessBiological. An entity connected to a process via InputProcessBiological is called an inputprocess entity, to distinguish it from the other two types of input entities. Lastly, an output entity is an entity connected to a process via OutputProcessBiological. Formally, the three properties are as follows:

In the rules, sameAs defines an equality relationship between two instances or two values, whereas differentFrom defines an inequality relationship.
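The three properties described above (input, inputprocess, and output entities) can be sketched as simple lookups over a list of connectors. The connector type names come from the text; the tuple layout, the function names, and the sample identifiers are assumptions for illustration only.

```python
# Connector types named in the text as input connectors.
INPUT_CONNECTORS = {
    "InputAssociationBiological",
    "InputInhibitorBiological",
    "InputProcessBiological",
}


def input_entities(process, connectors):
    """Entities connected to the process via any of the three input connectors."""
    return {e for p, e, kind in connectors if p == process and kind in INPUT_CONNECTORS}


def inputprocess_entities(process, connectors):
    """Entities connected to the process via InputProcessBiological only."""
    return {e for p, e, kind in connectors
            if p == process and kind == "InputProcessBiological"}


def output_entities(process, connectors):
    """Entities connected to the process via OutputProcessBiological."""
    return {e for p, e, kind in connectors
            if p == process and kind == "OutputProcessBiological"}


# Hypothetical model: (process_id, entity_id, connector_type) tuples.
connectors = [
    ("p1", "substrate", "InputProcessBiological"),
    ("p1", "enzyme", "InputAssociationBiological"),
    ("p1", "product", "OutputProcessBiological"),
]
print(input_entities("p1", connectors))
print(inputprocess_entities("p1", connectors))
print(output_entities("p1", connectors))
```

Every inputprocess entity is also an input entity under these definitions, so the first set is always a superset of the second.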
[Group 1: rules that need cardinality and type constraints]
Rule 2. The process needs only one input entity and one output entity, and should not have any regulator entities such as input-association or input-inhibitor entities.

Rule 3. The process needs at least two inputprocess entities and one output entity whose type is Complex.

Rule 4. The process needs at least two inputprocess entities and one output entity. One inputprocess entity should be of type Dna, and one output entity should be of type Complex.

Rule 5. The process needs one input entity and one output entity of type Dna.

Rule 6. The process needs only one inputprocess entity, whose type is Complex, and at least two output entities.

Rule 7. The process needs at least one inputprocess entity whose name is GTP and one output entity whose name is GDP.

Rule 8. The process needs only one inputprocess entity and at least one output entity, all of type SmallMolecule.

Rule 9. The process needs at least one inputprocess entity and at least one output entity, all of type SmallMolecule.

Rule 10. The process needs only one inputprocess entity, whose type is Protein or Complex.

Rule 11. The process needs one inputprocess entity whose type is Protein.

Rule 12. The process needs one inputprocess entity whose type is Protein, Complex, mRNA, Dna, or SmallMolecule.

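The Group 1 rules all combine a cardinality constraint with a type constraint. As a minimal sketch, Rule 3 can be checked as follows. Two assumptions are made here and are not taken from the source: "one output entity" is read as exactly one output entity, and the tuple and dictionary representations, the function name check_rule_3, and the sample identifiers are hypothetical.

```python
def check_rule_3(process, connectors, entity_types):
    """Return a list of violations of Rule 3 for the given process (empty if valid)."""
    inputs = [e for p, e, kind in connectors
              if p == process and kind == "InputProcessBiological"]
    outputs = [e for p, e, kind in connectors
               if p == process and kind == "OutputProcessBiological"]

    problems = []
    if len(inputs) < 2:
        problems.append("fewer than two inputprocess entities")
    # Assumption: "one output entity" is interpreted as exactly one.
    if len(outputs) != 1:
        problems.append("expected exactly one output entity")
    elif entity_types.get(outputs[0]) != "Complex":
        problems.append("output entity is not of type Complex")
    return problems


# Hypothetical model: two proteins forming a complex.
connectors = [
    ("p1", "a", "InputProcessBiological"),
    ("p1", "b", "InputProcessBiological"),
    ("p1", "ab", "OutputProcessBiological"),
]
entity_types = {"a": "Protein", "b": "Protein", "ab": "Complex"}
print(check_rule_3("p1", connectors, entity_types))

# Same model with the output mistyped as Protein during curation.
mistyped = dict(entity_types, ab="Protein")
print(check_rule_3("p1", connectors, mistyped))
```

Returning a list of messages rather than a boolean lets a curator see which part of the rule failed, which is the kind of feedback the validation is meant to provide.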

[Group 2: cardinality and FEATURETYPE property constraints]
In the following rules, hasFeature(x_1, 'x_2) means that an entity x_1 has the feature type 'x_2, where 'x_2 is a pre-defined term for FeatureType in CSO. The formal definition is as follows:

Note that instances with the prefix ('), such as 'ME Autophosphorylation and 'FT phosphorylated, are pre-defined terms (instances) in CSO.
Rules 13-24. The process needs at least one inputprocess entity and one output entity whose uni-molecule references (XREF) are the same. The output entity should have a feature type that is a pre-defined value.
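The pattern shared by Rules 13-24 can be sketched as a check that some inputprocess entity and some output entity share an XREF and that this output entity carries the required feature type. The dictionary layout, the function names, and the XREF value "X1" are assumptions for illustration; the apostrophe prefix of the CSO term is dropped in the Python strings.

```python
def has_feature(entity, feature_type):
    """Sketch of hasFeature: the entity carries the given pre-defined feature type."""
    return feature_type in entity["features"]


def check_feature_rule(inputs, outputs, required_feature):
    """True if an inputprocess/output pair shares an XREF and the output has the feature."""
    return any(
        i["xref"] == o["xref"] and has_feature(o, required_feature)
        for i in inputs
        for o in outputs
    )


# Hypothetical entities: the same molecule before and after modification.
unmodified = {"name": "A", "xref": "X1", "features": set()}
modified = {"name": "A-P", "xref": "X1", "features": {"FT phosphorylated"}}

print(check_feature_rule([unmodified], [modified], "FT phosphorylated"))
# A curated model whose output is missing the feature fails the check.
print(check_feature_rule([unmodified], [unmodified], "FT phosphorylated"))
```

Comparing XREF values rather than names reflects the rule's requirement that the input and output refer to the same molecule even when their names differ.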

Rules 26-27. The process needs only one inputprocess entity and at least one output entity. The inputprocess entity should have a defined feature type. For example, only a phosphorylated entity can be dephosphorylated, so the inputprocess entity must have the feature type 'FT phosphorylated.

Process(x_1) ∧ BIOLOGICALEVENT(x_1, 'ME Dephosphorylation / 'ME Deubiquitination)

[Group 3: cardinality and STOICHIOMETRY property constraints]
In CSO, the stoichiometric coefficient is a property of the connector that links a process to an inputprocess entity, because the same entity can be involved in many processes and its stoichiometric coefficient can differ from one process to another. In these rules, hasStoichiometry means that, given a process x_1, the participating inputprocess entity x_3 has x_7 as its stoichiometric coefficient.
Rule 28. The process needs only one inputprocess entity, whose stoichiometric coefficient is equal to 2, and only one output entity, whose type is Complex.

Process(x_1) ∧ BIOLOGICALEVENT(x_1, 'ME Oligomerization)

Rule 30. The process needs only one inputprocess entity, whose stoichiometric coefficient is 21 or more, and only one output entity, whose type is Complex.
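Because the stoichiometric coefficient belongs to the connector rather than to the entity, a check for the Group 3 rules has to read it from the connector record. The sketch below covers the shape shared by Rules 28 and 30; the dictionary layout, the function name check_stoichiometry_rule, and the sample identifiers are assumptions, while the thresholds are the ones stated in the rules.

```python
def check_stoichiometry_rule(process, connectors, entity_types, coefficient_ok):
    """True if the process has exactly one inputprocess entity with an accepted
    stoichiometric coefficient and exactly one output entity of type Complex."""
    inputs = [c for c in connectors
              if c["process"] == process and c["type"] == "InputProcessBiological"]
    outputs = [c for c in connectors
               if c["process"] == process and c["type"] == "OutputProcessBiological"]
    return (
        len(inputs) == 1
        and coefficient_ok(inputs[0]["stoichiometry"])
        and len(outputs) == 1
        and entity_types.get(outputs[0]["entity"]) == "Complex"
    )


def rule_28(coefficient):
    return coefficient == 2  # coefficient stated in Rule 28


def rule_30(coefficient):
    return coefficient >= 21  # threshold stated in Rule 30


# Hypothetical model: the coefficient is stored on the connector, not on the entity.
connectors = [
    {"process": "p1", "entity": "m", "type": "InputProcessBiological", "stoichiometry": 2},
    {"process": "p1", "entity": "mm", "type": "OutputProcessBiological", "stoichiometry": 1},
]
entity_types = {"m": "Protein", "mm": "Complex"}

print(check_stoichiometry_rule("p1", connectors, entity_types, rule_28))
print(check_stoichiometry_rule("p1", connectors, entity_types, rule_30))
```

Passing the coefficient test as a function keeps one structural check reusable across the rules in this group, which differ only in the accepted coefficient.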