Propagating semantic information in biochemical network models
© Schulz et al.; licensee BioMed Central Ltd. 2012
Received: 2 November 2011
Accepted: 30 January 2012
Published: 30 January 2012
Skip to main content
© Schulz et al.; licensee BioMed Central Ltd. 2012
Received: 2 November 2011
Accepted: 30 January 2012
Published: 30 January 2012
To enable automatic searches, alignments, and model combination, the elements of systems biology models need to be compared and matched across models. Elements can be identified by machine-readable biological annotations, but assigning such annotations and matching non-annotated elements is tedious work and calls for automation.
A new method called "semantic propagation" allows the comparison of model elements based not only on their own annotations, but also on annotations of surrounding elements in the network. One may either propagate feature vectors, describing the annotations of individual elements, or quantitative similarities between elements from different models. Based on semantic propagation, we align partially annotated models and find annotations for non-annotated model elements.
Semantic propagation and model alignment are included in the open-source library semanticSBML, available on sourceforge. Online services for model alignment and for annotation prediction can be used at http://www.semanticsbml.org.
Systems biologists aim to understand the dynamic behaviour of cellular pathways with the help of quantitative models. To construct large-scale models flexibly from existing parts, models need to be retrieved, compared, and combined. For this purpose, modellers need to find and match equivalent model elements, for instance, variables describing identical metabolites. In the future, such alignments should be automated or, at least, sensible suggestions should be provided by software.
A simple way to match model elements is by comparing their names or identifiers as they appear in the model's code. Greedy alignments and model combination based on element names have been used in [1–4], but these approaches may fail if models stem from different sources and therefore use different naming schemes. A safer and more general approach is to compare model elements by the biological objects or concepts they stand for. Semantic annotations, for instance the MIRIAM-compliant annotations  used in the Systems Biology Markup Language (SBML) , provide a qualified naming scheme by relating model elements to entries in public web resources or ontologies. Knowledge from these ontologies can be used to compare alike biological objects and to define quantitative similarity scores between them [7, 8].
Obviously, such comparisons will fail if annotations are missing. However, there are also algorithms that compare biological networks by their structures and can therefore handle missing annotations. Since the comparison of graph structures is computationally hard , these algorithms either yield approximative results  or are restricted to models having a simple structure like a path [11, 12] or a tree [13, 14]. The initial comparison of network nodes, which is later refined using structural information, also varies from approach to approach. While some of them use plain node labels, others compare nodes by chemical structures [10, 15] or semantic information like EC numbers  or Gene Ontology terms . With this information, they can either refine the alignments  or speed up the computations by reducing the search space . A recent review on these and similar works can be found in .
In this article, we propose new heuristics for aligning network models with missing annotations. The basic idea is simple: a reaction, for instance in an SBML model, refers to its substrates and products. If two reactions are not annotated but their reactants are, we can trace the reactants, evaluate their annotations, and use this information for comparing the reactions. This logic can be applied whenever model elements show cross-references. Instead of collecting and combining this information step by step for each element, we developed a method to propagate all semantic information simultaneously across the network. Starting from an original direct similarity score, which compares elements only by their own annotations, we obtain a new inferred similarity score that incorporates information obtained from other elements.
Two applications of semantic propagation are presented in this article. The first one is the improved alignment of systems biology models. The second one is a method for predicting missing annotations in a model by aligning it to fully annotated network models. Our present implementation, which is a part of the tool semanticSBML , works for SBML models and for specific similarity measures . However, semantic propagation is a general approach that applies to a wide range of network models and similarity scores and that can be combined with different model matching algorithms, including the ones mentioned above.
We have developed methods for aligning partially annotated biological networks and implemented them for SBML models with MIRIAM-compliant annotations. A model alignment is based on two kinds of knowledge: similarities between model elements, which are computed from semantic annotations, and references between elements within each model (called here the "network structure").
In the second step, the actual model alignment, elements are matched between two or more models according to their inferred similarities. We use a greedy heuristic that arranges similar elements into tuples, supposed to be matched: all elements in a tuple have the same type (e.g. reaction), stem from different models, and each element is part of one tuple (possibly of size 1). Elements within a tuple have to show a high similarity, while elements in different tuples are supposed to be dissimilar.
The semantic propagation can be carried out in two ways, as feature propagation (FP) or as similarity propagation (SP). For feature propagation, each element has to be associated with a feature vector. The components of this vector (usually numbers between 0 and 1) describe how closely the element resembles certain biological concepts. A model species describing a phosphorylated MAP kinase, for example, could carry annotations referring to UniProt entry P28482 (MAPK1) and KEGG Compound entry C00562 (Phosphoprotein). The feature vector corresponding to this species would contain mostly zeros except for the two entries specifying the relation to these web resource entries. Two elements are similar if their feature vectors point to similar directions, i.e. if they are related to similar biological concepts like a phosphorylated and a non-phosphorylated MAP kinase. The size of the feature vector depends on the number of biological concepts considered, e.g. the entries of all web resources being referred to in BioModels Database .
In feature propagation elements inherit semantic information from their neighbour nodes, which might contribute to the definition of the elements' identity. The inferred feature vectors do therefore not only characterise the model element itself, but also its surrounding elements in the network. In the case of a reaction element, they do not only describe the reaction, but also its reactants, enzyme, or regulators. The transition from the original to the inferred feature vectors is shown by an abstract example in Figure 1A. In this example the colours red, green, and blue correspond to the features in a colour vector of size three, which encode for semantic information. The propagation of this information (Figure 1B) leads to mixed colours in the center of the network, which give each node its own identity and allow for a unique self-alignment of the network. The mathematical details of feature propagation are explained in the Methods section.
Model alignments can be used to complete missing annotations in a model. To do so, one may align the model to a large, annotated map of all physiological pathways and copy annotations from elements in the map to the model elements they have been aligned to. Even if not all of these annotations are correct, they may still be presented to users as suggestions during manual model annotation.
Our semanticSBML web page for model annotation (http://semanticsbml.org) provides this functionality and uses BioModels Database as a replacement for the large, annotated pathway map. The idea behind the annotation prediction is to apply feature propagation to a sparsely annotated model and all BioModels. Afterwards, the similarities between non-annotated model elements and all database elements are calculated. Finally, the annotations from the most similar elements contained in the database are presented to the user as suggestions for new annotations. Details of this procedure are described in the Methods section.
Despite the indifference of the results with respect to the models which are compared, the quality of the results varied in between examples in which only species or reaction annotations had been removed. The more detailed the annotations in a specific subset of model elements was, the more dramatic the decline in matching quality has been after their removal. For a deeper analysis of the matching quality after the removal of either species or reaction annotations from BioModel 9, the reader is referred to additional file 1.
In order to test the quality of our annotation prediction heuristic, we applied it to BioModel 61, a well-studied model of glycolysis . For our first evaluation, we removed all annotations from the model and tried to re-predict them via the web interface. The test confirms that the predictions are not perfect, but the top 10 results usually contain relevant annotations. In general, the predictions become better as the fraction of correctly annotated model elements increases and as we scan more of the top results for the correct annotation.
We have implemented both propagation methods in Python. The source code of this implementation is included in our library semanticSBML, which is freely available (GPL 3) from SourceForge (http://sourceforge.net/projects/semanticsbml/) and can be used to e.g. annotate and merge SBML models.
Furthermore, we have included the semantic propagation methods into our web tool semanticSBML http://www.semanticsbml.org. First, this web tool allows users to visually compare the results of the three methods to compare model elements (direct similarity, feature propagation, and similarity propagation). Second, for model merging, all three methods can be used to suggest an initial matching of model elements. Third, for model annotation, new element annotations based on propagated features from BioModels will be suggested after the "Predict annotations" button has been clicked.
Annotating the elements of systems biology models is laborious. Interactive software can facilitate this work by keyword searches and by proposing annotations based on element names. Our software semanticSBML [21, 26] helps modellers to annotate and combine SBML models based on MIRIAM-compliant annotations . In the present work, we addressed an important open issue, the alignment of models with missing annotations.
As the examples have shown, element matchings based on our inferred similarity measures perform well in practice. In our tests, semantic propagation increased the recall of correct element pairs, but at the price of a lower precision if less than half of the elements were annotated. Our two approaches, feature and similarity propagation, differ slightly in their quality, but also in their computational costs. While similarity propagation tends to yield better results, the computational effort is much higher than in feature propagation instead of , where |M| and |N| denote the numbers of elements in the two models).
We furthermore showed how new annotations for model elements can be suggested based on the annotations already present in a model and an annotated pathway map. For our predictions, we have selected the BioModels Database as a resource for highly curated annotation data. Analyses using other resources (see additional file 1 for details) show no loss in prediction quality and we plan to extend our background pathway map by using various other pathway resources in the future.
The details on how information is transfered during semantic propagation depends on the choice of propagation weights. In our tests, changing their values within reasonable ranges had little influence on the model alignment. In any case, the scaling factor λ (see Methods) allowed to avoid an overly strong propagation, which might lead to spurious similarities. As soon as a sufficient number of predefined model matchings will be available as a gold standard, the propagation weights may be further improved by machine learning.
Compared to existing approaches, our methods for model alignment have three advantages. First, by processing semantic annotations instead of ad-hoc node labels, we can compare models from different sources and compare elements with similar, yet different annotations. Second, considering the network structure allows us to distinguish between model elements carrying identical annotations, e.g. proteins in different phosphorylation states. Third, our similarity measures can be combined with various structure-based model comparison algorithms, e.g. . Although our approach has only been implemented for SBML models, it can be extended to any computational models that include structural information and semantic annotations.
To compare non-annotated model elements by their biological meaning (e.g. the reactions in Figure 2A), we have developed two alternative propagation schemes. In feature propagation (FP) elements receive information about semantic annotations from their neighbours, while in similarity propagation element pairs receive information about pairwise similarities between neighbouring elements. In both methods information is propagated along the references between model elements (Figure 2B). The strength of information transfer is determined by a |M| × |M| propagation matrix ρ M where |M| is the number of elements in model M. The sparse structure of this matrix reflects the connections between model elements, and its positive, real-valued entries determine the strength of the information transfer between the different types of elements. For instance, we may choose the values and whenever a reaction x refers to a species a as its reactant or product, and if a does not appear in reaction x. The parameter α controls the information transfer from species to reactions, while the parameter β controls information transfer in the opposite direction (Figure 2B). To prevent semantic propagation between unrelated reactions, we stop the propagation at cofactors, which might be highly connected hubs in the network structure. Therefore, we set whenever the annotations on a species a suggest that it is a cofactor (see additional file 1 for details).
Like in Eq. (4), if σ is non-negative, λ can be chosen such that ψ sp is also non-negative. As an example, explicit calculations of both similarity measures for a partially annotated model of the phosphoglucoisomerase reaction are shown in section 5 in additional file 1.
Feature propagation resembles a diffusion process on the reaction network. Unlike diffusion in the strict sense, our process is not symmetric and does not conserve mass. Methods with a similar background idea have already been successfully applied to assign protein functions in protein-protein interaction networks [31, 32].
whose stationary distribution is given by Eq. (8).
The structure of the pair propagation graph determines how information about element similarities can spread during propagation. For instance, the graph in Figure 7B shows that an initial matching between species (edges 1, 2, 4, and 5) or between reactions (edge 9) cannot lead to a matching between species and reactions (edges 3, 6, 7, and 8) because they appear in separate subgraphs.
A main task in model merging is to detect and combine corresponding elements between models. In such a model alignment the element matching has to be non-ambiguous (an element cannot be matched to several elements in another model) and transitive (if x, y, and z are elements from three different models and if the pairs (x,y) and (y,z) are matched, the pair (x,z) is matched automatically).
Instead of maximising the score exactly (which would be computationally hard), we employ a greedy algorithm: first, it chooses the element pair with the highest ψ value. If this value is positive, the two elements are matched and the pair is removed from the list. Then, we continue to match the element pairs with the highest remaining ψ value until no positive ψ values are left. If this leads to inconsistencies (i.e, if one of the two elements already has a different matching partner in the other model), the next best matching pair is considered instead. In the end, the alignment score is given by the sum of all ψ values collected.
For aligning three or more models, we use the same matching score (for element pairs), but consider the transitivity constraint: Whenever we match two elements, we also match all their previously determined matching partners (and consider the respective mutual similarities in the score function).
Semantic propagation can be used to suggest annotations for non-annotated model elements. For this purpose the considered partially annotated model is aligned to a fully annotated large pathway map. Afterwards, annotations are transferred from elements in the map to the corresponding, non-annotated elements in the model.
As a prerequisite, we collected many annotated model elements appearing in BioModels Database, which served as a replacement for the large, annotated pathway map, and calculated their propagated feature vectors. For each element (x ∈ M) in each of the models, its propagated feature vector (w x ) is computed. For every annotation i on the element we do the following. The corresponding feature is removed from the vector (w x,i = 0) and the pair of the feature and this propagated feature vector is added to the collection.
To find annotations for a non-annotated element of interest, we calculate its propagated feature vector within its model and check if there are similar vectors in our collection. Afterwards, the annotations associated with these vectors are presented to the user. The general idea is that given a model from the BioModels Database that lacks annotations for a single feature on a given element, this method would be able to predict annotations for this element with perfect accuracy.
For annotation prediction on the semanticSBML website, the feature propagation algorithm has been slightly modified. To reduce the number of annotations suggested, we set the propagation values ρ between reaction - reactant/product to and all others to 0. Moreover, information is propagated only to direct neighbours (i.e. the sum in Eq. (4) does not go to infinity but to 1).
The authors thank Isabel Rojas for suggesting the idea of semantic propagation, Falko Krause for implementation work in semanticSBML, Ron Milo for kindly hosting W.L. at the Weizmann Institute of Science, and an anonymous referee for valuable suggestions.
This work was supported by the German Research Foundation [CRC 618 and GZ: LI 1676/2-1], the International Max Planck Research School for Computational Biology and Scientific Computing, and the European Commission [grant number LSHG-CT-2006-037469].