Propagating semantic information in biochemical network models
© Schulz et al.; licensee BioMed Central Ltd. 2012
Received: 2 November 2011
Accepted: 30 January 2012
Published: 30 January 2012
To enable automatic searches, alignments, and model combination, the elements of systems biology models need to be compared and matched across models. Elements can be identified by machine-readable biological annotations, but assigning such annotations and matching non-annotated elements is tedious work and calls for automation.
A new method called "semantic propagation" allows the comparison of model elements based not only on their own annotations, but also on annotations of surrounding elements in the network. One may either propagate feature vectors, describing the annotations of individual elements, or quantitative similarities between elements from different models. Based on semantic propagation, we align partially annotated models and find annotations for non-annotated model elements.
Semantic propagation and model alignment are included in the open-source library semanticSBML, available on SourceForge. Online services for model alignment and for annotation prediction can be used at http://www.semanticsbml.org.
Systems biologists aim to understand the dynamic behaviour of cellular pathways with the help of quantitative models. To construct large-scale models flexibly from existing parts, models need to be retrieved, compared, and combined. For this purpose, modellers need to find and match equivalent model elements, for instance, variables describing identical metabolites. In the future, such alignments should be automated or, at least, sensible suggestions should be provided by software.
A simple way to match model elements is by comparing their names or identifiers as they appear in the model's code. Greedy alignments and model combination based on element names have been used in [1–4], but these approaches may fail if models stem from different sources and therefore use different naming schemes. A safer and more general approach is to compare model elements by the biological objects or concepts they stand for. Semantic annotations, for instance the MIRIAM-compliant annotations used in the Systems Biology Markup Language (SBML), provide a qualified naming scheme by relating model elements to entries in public web resources or ontologies. Knowledge from these ontologies can be used to compare related biological objects and to define quantitative similarity scores between them [7, 8].
Obviously, such comparisons will fail if annotations are missing. However, there are also algorithms that compare biological networks by their structures and can therefore handle missing annotations. Since the comparison of graph structures is computationally hard, these algorithms either yield approximate results or are restricted to models having a simple structure like a path [11, 12] or a tree [13, 14]. The initial comparison of network nodes, which is later refined using structural information, also varies from approach to approach. While some of them use plain node labels, others compare nodes by chemical structures [10, 15] or by semantic information like EC numbers or Gene Ontology terms. With this information, they can either refine the alignments or speed up the computations by reducing the search space. These and similar works have recently been reviewed.
In this article, we propose new heuristics for aligning network models with missing annotations. The basic idea is simple: a reaction, for instance in an SBML model, refers to its substrates and products. If two reactions are not annotated but their reactants are, we can trace the reactants, evaluate their annotations, and use this information for comparing the reactions. This logic can be applied whenever model elements show cross-references. Instead of collecting and combining this information step by step for each element, we developed a method to propagate all semantic information simultaneously across the network. Starting from an original direct similarity score, which compares elements only by their own annotations, we obtain a new inferred similarity score that incorporates information obtained from other elements.
Two applications of semantic propagation are presented in this article. The first one is the improved alignment of systems biology models. The second one is a method for predicting missing annotations in a model by aligning it to fully annotated network models. Our present implementation, which is part of the tool semanticSBML, works for SBML models and for specific similarity measures. However, semantic propagation is a general approach that applies to a wide range of network models and similarity scores and that can be combined with different model matching algorithms, including the ones mentioned above.
We have developed methods for aligning partially annotated biological networks and implemented them for SBML models with MIRIAM-compliant annotations. A model alignment is based on two kinds of knowledge: similarities between model elements, which are computed from semantic annotations, and references between elements within each model (called here the "network structure").
The alignment proceeds in two steps. In the first step, the semantic propagation, the direct similarities between elements are turned into inferred similarities that also reflect the network structure. In the second step, the actual model alignment, elements are matched between two or more models according to their inferred similarities. We use a greedy heuristic that arranges similar elements into tuples of elements that are supposed to be matched: all elements in a tuple have the same type (e.g. reaction) and stem from different models, and each element is part of exactly one tuple (possibly of size 1). Elements within a tuple have to show a high similarity, while elements in different tuples are supposed to be dissimilar.
Semantic propagation can be carried out in two ways, as feature propagation (FP) or as similarity propagation (SP). For feature propagation, each element has to be associated with a feature vector. The components of this vector (usually numbers between 0 and 1) describe how closely the element resembles certain biological concepts. A model species describing a phosphorylated MAP kinase, for example, could carry annotations referring to UniProt entry P28482 (MAPK1) and KEGG Compound entry C00562 (Phosphoprotein). The feature vector corresponding to this species would contain mostly zeros except for the two entries specifying the relation to these web resource entries. Two elements are similar if their feature vectors point in similar directions, i.e. if they are related to similar biological concepts, like a phosphorylated and a non-phosphorylated MAP kinase. The size of the feature vector depends on the number of biological concepts considered, e.g. the entries of all web resources being referred to in BioModels Database.
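The comparison of feature vectors by their direction can be illustrated with a small sketch. It assumes plain cosine similarity on sparse vectors stored as dictionaries; the similarity measure actually used in semanticSBML may differ, for instance by accounting for relations between concepts. The function name, the identifier strings used as keys, and the glucose example are illustrative assumptions, not taken from the implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse feature vectors (dicts)."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an element without features has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors; keys name web resource entries.
mapk_phospho = {"uniprot:P28482": 1.0, "kegg.compound:C00562": 1.0}
mapk_plain = {"uniprot:P28482": 1.0}
glucose = {"kegg.compound:C00031": 1.0}

sim_related = cosine_similarity(mapk_phospho, mapk_plain)
sim_unrelated = cosine_similarity(mapk_phospho, glucose)
```

In this toy setting the phosphorylated and non-phosphorylated kinase share one feature and therefore obtain a similarity between 0 and 1, whereas the kinase and glucose share none and obtain 0.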
In feature propagation, elements inherit semantic information from their neighbour nodes, which might contribute to the definition of the elements' identity. The inferred feature vectors therefore characterise not only the model element itself, but also its surrounding elements in the network. In the case of a reaction element, they describe not only the reaction, but also its reactants, enzyme, or regulators. The transition from the original to the inferred feature vectors is shown by an abstract example in Figure 1A. In this example the colours red, green, and blue correspond to the features in a colour vector of size three, which encode semantic information. The propagation of this information (Figure 1B) leads to mixed colours in the centre of the network, which give each node its own identity and allow for a unique self-alignment of the network. The mathematical details of feature propagation are explained in the Methods section.
Model alignments can be used to complete missing annotations in a model. To do so, one may align the model to a large, annotated map of all physiological pathways and copy annotations from elements in the map to the model elements they have been aligned to. Even if not all of these annotations are correct, they may still be presented to users as suggestions during manual model annotation.
Our semanticSBML web page for model annotation (http://semanticsbml.org) provides this functionality and uses BioModels Database as a replacement for the large, annotated pathway map. The idea behind the annotation prediction is to apply feature propagation to a sparsely annotated model and all BioModels. Afterwards, the similarities between non-annotated model elements and all database elements are calculated. Finally, the annotations from the most similar elements contained in the database are presented to the user as suggestions for new annotations. Details of this procedure are described in the Methods section.
Self-alignment of a linear chain
Alignment between MAP kinase pathways
Although the results depended little on which models were compared, their quality differed between tests in which only species annotations or only reaction annotations had been removed. The more detailed the annotations in a specific subset of model elements were, the more strongly the matching quality declined after their removal. For a deeper analysis of the matching quality after the removal of either species or reaction annotations from BioModel 9, the reader is referred to additional file 1.
Automatic suggestion of element annotations
To test the quality of our annotation prediction heuristic, we applied it to BioModel 61, a well-studied model of glycolysis. For our first evaluation, we removed all annotations from the model and tried to re-predict them via the web interface. The test confirms that the predictions are not perfect, but the top 10 results usually contain relevant annotations. In general, the predictions become better as the fraction of correctly annotated model elements increases and as we scan more of the top results for the correct annotation.
We have implemented both propagation methods in Python. The source code of this implementation is included in our library semanticSBML, which is freely available (GPL 3) from SourceForge (http://sourceforge.net/projects/semanticsbml/) and can be used, for example, to annotate and merge SBML models.
Furthermore, we have included the semantic propagation methods in our web tool semanticSBML (http://www.semanticsbml.org). First, this web tool allows users to visually compare the results of the three methods for comparing model elements (direct similarity, feature propagation, and similarity propagation). Second, for model merging, all three methods can be used to suggest an initial matching of model elements. Third, for model annotation, new element annotations based on propagated features from BioModels are suggested after the "Predict annotations" button has been clicked.
Annotating the elements of systems biology models is laborious. Interactive software can facilitate this work by keyword searches and by proposing annotations based on element names. Our software semanticSBML [21, 26] helps modellers to annotate and combine SBML models based on MIRIAM-compliant annotations. In the present work, we addressed an important open issue, the alignment of models with missing annotations.
As the examples have shown, element matchings based on our inferred similarity measures perform well in practice. In our tests, semantic propagation increased the recall of correct element pairs, but at the price of a lower precision if fewer than half of the elements were annotated. Our two approaches, feature and similarity propagation, differ slightly in their quality, but also in their computational costs. While similarity propagation tends to yield better results, its computational effort is much higher than that of feature propagation: it operates on the |M|·|N| pairs of elements rather than on the elements of each model separately, where |M| and |N| denote the numbers of elements in the two models.
We furthermore showed how new annotations for model elements can be suggested based on the annotations already present in a model and an annotated pathway map. For our predictions, we have selected the BioModels Database as a resource for highly curated annotation data. Analyses using other resources (see additional file 1 for details) show no loss in prediction quality and we plan to extend our background pathway map by using various other pathway resources in the future.
The details of how information is transferred during semantic propagation depend on the choice of propagation weights. In our tests, changing their values within reasonable ranges had little influence on the model alignment. In any case, the scaling factor λ (see Methods) allowed us to avoid an overly strong propagation, which might lead to spurious similarities. As soon as a sufficient number of predefined model matchings is available as a gold standard, the propagation weights may be further improved by machine learning.
Compared to existing approaches, our methods for model alignment have three advantages. First, by processing semantic annotations instead of ad-hoc node labels, we can compare models from different sources and compare elements with similar, yet different annotations. Second, considering the network structure allows us to distinguish between model elements carrying identical annotations, e.g. proteins in different phosphorylation states. Third, our similarity measures can be combined with various structure-based model comparison algorithms. Although our approach has only been implemented for SBML models, it can be extended to any computational models that include structural information and semantic annotations.
Propagation of semantic information
To compare non-annotated model elements by their biological meaning (e.g. the reactions in Figure 2A), we have developed two alternative propagation schemes. In feature propagation (FP), elements receive information about semantic annotations from their neighbours, while in similarity propagation (SP), element pairs receive information about pairwise similarities between neighbouring elements. In both methods, information is propagated along the references between model elements (Figure 2B). The strength of information transfer is determined by a |M| × |M| propagation matrix ρM, where |M| is the number of elements in model M. The sparse structure of this matrix reflects the connections between model elements, and its positive, real-valued entries determine the strength of the information transfer between the different types of elements. For instance, we may choose the values ρ_xa = α and ρ_ax = β whenever a reaction x refers to a species a as its reactant or product, and ρ_xa = ρ_ax = 0 if a does not appear in reaction x. Here ρ_xa denotes the weight of the transfer from species a to reaction x, so the parameter α controls the information transfer from species to reactions, while the parameter β controls information transfer in the opposite direction (Figure 2B). To prevent semantic propagation between unrelated reactions, we stop the propagation at cofactors, which might be highly connected hubs in the network structure. Therefore, we set ρ_xa = ρ_ax = 0 whenever the annotations on a species a suggest that it is a cofactor (see additional file 1 for details).
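The following sketch illustrates how such propagation weights might be set up and applied. It is not the semanticSBML implementation. It assumes that the inferred feature vectors are obtained as a truncated series w = Σ_k (λρ)^k v over the original vectors v, which is one reading of the remark that the sum in Eq. (4) can be cut off after the first term. All function names, the dictionary-based data layout, and the example identifiers are hypothetical.

```python
def build_propagation_weights(reactions, alpha, beta, cofactors=()):
    """Return sparse weights rho[target][source] for one model.

    `reactions` maps a reaction id to the species ids it refers to.
    """
    rho = {}
    for rxn, species in reactions.items():
        for sp in species:
            if sp in cofactors:
                continue  # no propagation through cofactor hubs
            rho.setdefault(rxn, {})[sp] = alpha  # species -> reaction
            rho.setdefault(sp, {})[rxn] = beta   # reaction -> species
    return rho


def propagate_features(features, rho, lam, steps):
    """Assumed truncated series: w = sum over k = 0..steps of (lam * rho)^k v."""
    inferred = {el: dict(vec) for el, vec in features.items()}
    current = {el: dict(vec) for el, vec in features.items()}
    for _ in range(steps):
        nxt = {}
        for target, sources in rho.items():
            acc = {}
            for source, weight in sources.items():
                for concept, value in current.get(source, {}).items():
                    acc[concept] = acc.get(concept, 0.0) + lam * weight * value
            if acc:
                nxt[target] = acc
        for el, vec in nxt.items():
            slot = inferred.setdefault(el, {})
            for concept, value in vec.items():
                slot[concept] = slot.get(concept, 0.0) + value
        current = nxt
    return inferred


# Toy model: one non-annotated reaction with two annotated reactants/products
# and one cofactor (all identifiers are illustrative).
reactions = {"r1": ["glc", "g6p", "atp"]}
features = {
    "glc": {"kegg.compound:C00031": 1.0},
    "g6p": {"kegg.compound:C00092": 1.0},
    "atp": {"kegg.compound:C00002": 1.0},
    "r1": {},
}
rho = build_propagation_weights(reactions, alpha=1.0, beta=0.5, cofactors={"atp"})
inferred_1 = propagate_features(features, rho, lam=0.5, steps=1)
inferred_2 = propagate_features(features, rho, lam=0.5, steps=2)
```

After one step, the non-annotated reaction r1 carries features of its two non-cofactor species; after two steps, each species also receives a weaker contribution from the species on the other side of the reaction.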
As in Eq. (4), if σ is non-negative, λ can be chosen such that ψsp is also non-negative. As an example, explicit calculations of both similarity measures for a partially annotated model of the phosphoglucoisomerase reaction are shown in section 5 of additional file 1.
Analogy to diffusion processes
Feature propagation resembles a diffusion process on the reaction network. Unlike diffusion in the strict sense, our process is not symmetric and does not conserve mass. Methods with a similar background idea have already been successfully applied to assign protein functions in protein-protein interaction networks [31, 32].
whose stationary distribution is given by Eq. (8).
The structure of the pair propagation graph determines how information about element similarities can spread during propagation. For instance, the graph in Figure 7B shows that an initial matching between species (edges 1, 2, 4, and 5) or between reactions (edge 9) cannot lead to a matching between species and reactions (edges 3, 6, 7, and 8) because they appear in separate subgraphs.
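The separation into subgraphs can be reproduced with a small sketch. It assumes one plausible construction of the pair propagation graph: the nodes are all element pairs (x, y) with x from model M and y from model N, and two pairs (x, y) and (a, b) are linked whenever x refers to a in M and y refers to b in N. The construction, function names, and toy models are assumptions made for illustration.

```python
from itertools import product


def pair_graph(adj_m, adj_n):
    """Link (x, y) to (a, b) if x references a in M and y references b in N."""
    graph = {pair: set() for pair in product(adj_m, adj_n)}
    for (x, y) in graph:
        for a in adj_m[x]:
            for b in adj_n[y]:
                graph[(x, y)].add((a, b))
    return graph


def connected_components(graph):
    """Collect connected components by depth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Two toy models, each with one reaction connecting two species.
adj_m = {"s1": {"r"}, "s2": {"r"}, "r": {"s1", "s2"}}
adj_n = {"t1": {"q"}, "t2": {"q"}, "q": {"t1", "t2"}}
components = connected_components(pair_graph(adj_m, adj_n))
same_type = next(c for c in components if ("r", "q") in c)
```

For these toy models, the nine element pairs fall into two components: one containing the species-species and reaction-reaction pairs, and one containing the mixed species-reaction pairs, so similarities cannot spread from one group to the other.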
A main task in model merging is to detect and combine corresponding elements between models. In such a model alignment the element matching has to be non-ambiguous (an element cannot be matched to several elements in another model) and transitive (if x, y, and z are elements from three different models and if the pairs (x,y) and (y,z) are matched, the pair (x,z) is matched automatically).
Instead of maximising the score exactly (which would be computationally hard), we employ a greedy algorithm: first, it chooses the element pair with the highest ψ value. If this value is positive, the two elements are matched and the pair is removed from the list. Then, we continue to match the element pairs with the highest remaining ψ value until no positive ψ values are left. If this leads to inconsistencies (i.e., if one of the two elements already has a different matching partner in the other model), the next best matching pair is considered instead. In the end, the alignment score is given by the sum of all ψ values collected.
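The greedy procedure for two models can be sketched as follows. The sketch assumes that the ψ values are given as a dictionary over element pairs of the same type; the function name and the example scores are hypothetical.

```python
def greedy_alignment(psi):
    """Greedily match element pairs (x from model M, y from model N).

    `psi` maps (x, y) to a similarity score. Returns the list of matched
    pairs and the alignment score (sum of the psi values collected).
    """
    matched_m, matched_n = set(), set()
    alignment, score = [], 0.0
    ranked = sorted(psi.items(), key=lambda item: item[1], reverse=True)
    for (x, y), value in ranked:
        if value <= 0:
            break  # no positive psi values are left
        if x in matched_m or y in matched_n:
            continue  # inconsistent: one element already has a partner
        alignment.append((x, y))
        matched_m.add(x)
        matched_n.add(y)
        score += value
    return alignment, score


# Hypothetical similarity scores between elements of two models.
psi = {
    ("glc", "glucose"): 0.9,
    ("glc", "g6p"): 0.4,
    ("g6p_m", "g6p"): 0.7,
    ("hk", "hexokinase"): 0.6,
    ("hk", "pgi"): -0.2,
}
alignment, score = greedy_alignment(psi)
```

Here the pair ("glc", "g6p") is skipped because both elements already have partners, and the pair with a negative score is never matched.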
For aligning three or more models, we use the same matching score (for element pairs), but consider the transitivity constraint: Whenever we match two elements, we also match all their previously determined matching partners (and consider the respective mutual similarities in the score function).
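A possible way to handle the transitivity constraint is to maintain tuples of matched elements and to merge two tuples whenever a pair of their elements is matched. The sketch below is an assumption about how this could be organised, not the semanticSBML code: a merge is rejected if the merged tuple would contain two elements from the same model, and the score of a merge is taken as the sum of all pairwise ψ values between the two tuples.

```python
def greedy_multi_alignment(psi, model_of):
    """Greedy matching across several models with transitive tuples.

    `psi` maps frozenset({x, y}) to a score; `model_of` maps each element
    to the name of its model. Returns the set of tuples and the score.
    """
    tuple_of = {el: frozenset([el]) for el in model_of}
    score = 0.0
    ranked = sorted(psi.items(), key=lambda item: item[1], reverse=True)
    for pair, value in ranked:
        if value <= 0:
            break
        x, y = tuple(pair)
        tx, ty = tuple_of[x], tuple_of[y]
        if tx == ty:
            continue  # already matched through transitivity
        models_x = {model_of[e] for e in tx}
        models_y = {model_of[e] for e in ty}
        if models_x & models_y:
            continue  # would place two elements of one model in a tuple
        # Matching x and y also matches all their previous partners.
        score += sum(psi.get(frozenset((a, b)), 0.0) for a in tx for b in ty)
        merged = tx | ty
        for el in merged:
            tuple_of[el] = merged
    return set(tuple_of.values()), score


# Hypothetical elements from three models A, B, and C.
model_of = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
psi_multi = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("b1", "c1")): 0.8,
    frozenset(("a1", "c2")): 0.7,
    frozenset(("a1", "c1")): 0.3,
}
tuples, multi_score = greedy_multi_alignment(psi_multi, model_of)
```

In this example, matching b1 with c1 automatically matches a1 with c1, and c2 remains in a tuple of size 1 because its model is already represented in the tuple of a1.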
Finding element annotations based on model alignments
Semantic propagation can be used to suggest annotations for non-annotated model elements. For this purpose the considered partially annotated model is aligned to a fully annotated large pathway map. Afterwards, annotations are transferred from elements in the map to the corresponding, non-annotated elements in the model.
As a prerequisite, we collected many annotated model elements appearing in BioModels Database, which served as a replacement for the large, annotated pathway map, and calculated their propagated feature vectors. For each element (x ∈ M) in each of the models, its propagated feature vector (wx) is computed. For every annotation i on the element, we do the following: the corresponding feature is removed from the vector (wx,i = 0), and the pair consisting of the feature and this propagated feature vector is added to the collection.
To find annotations for a non-annotated element of interest, we calculate its propagated feature vector within its model and check if there are similar vectors in our collection. Afterwards, the annotations associated with these vectors are presented to the user. The general idea is that given a model from the BioModels Database that lacks annotations for a single feature on a given element, this method would be able to predict annotations for this element with perfect accuracy.
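The collection and the lookup described above can be sketched as follows. The sketch assumes cosine similarity for comparing propagated feature vectors and a simple ranking of the collection; the function names, data layout, and example vectors are hypothetical and only illustrate the idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_collection(propagated, annotations):
    """For each annotation i on element x, store (i, w_x with w_{x,i} = 0)."""
    collection = []
    for element, vector in propagated.items():
        for annotation in annotations.get(element, ()):
            masked = dict(vector)
            masked[annotation] = 0.0  # remove the feature to be predicted
            collection.append((annotation, masked))
    return collection


def suggest_annotations(query_vector, collection, top_k=10):
    """Return up to top_k distinct annotations, most similar vectors first."""
    ranked = sorted(collection,
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    suggestions = []
    for annotation, _ in ranked:
        if annotation not in suggestions:
            suggestions.append(annotation)
        if len(suggestions) == top_k:
            break
    return suggestions


# Hypothetical propagated vectors of two annotated database reactions.
propagated = {
    "db_hk": {"ec-code:2.7.1.1": 1.0,
              "kegg.compound:C00031": 0.5, "kegg.compound:C00092": 0.5},
    "db_pgi": {"ec-code:5.3.1.9": 1.0,
               "kegg.compound:C00092": 0.5, "kegg.compound:C05345": 0.5},
}
annotations = {"db_hk": ["ec-code:2.7.1.1"], "db_pgi": ["ec-code:5.3.1.9"]}
collection = build_collection(propagated, annotations)

# Propagated vector of a non-annotated reaction in the model of interest.
query = {"kegg.compound:C00031": 0.5, "kegg.compound:C00092": 0.5}
suggestions = suggest_annotations(query, collection)
```

Because the query reaction shares both of its inherited features with the first database reaction, that reaction's annotation is ranked first among the suggestions.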
For annotation prediction on the semanticSBML website, the feature propagation algorithm has been slightly modified. To reduce the number of annotations suggested, we use non-zero propagation values ρ only between reactions and their reactants or products and set all others to 0. Moreover, information is propagated only to direct neighbours (i.e. the sum in Eq. (4) does not go to infinity but to 1).
The authors thank Isabel Rojas for suggesting the idea of semantic propagation, Falko Krause for implementation work in semanticSBML, Ron Milo for kindly hosting W.L. at the Weizmann Institute of Science, and an anonymous referee for valuable suggestions.
This work was supported by the German Research Foundation [CRC 618 and GZ: LI 1676/2-1], the International Max Planck Research School for Computational Biology and Scientific Computing, and the European Commission [grant number LSHG-CT-2006-037469].
- Goodfellow M, Wilson J, Hunt E: Biochemical network matching and composition. Proceedings of the 2010 EDBT Workshops, ACM 2010, 1–7.
- Randhawa R, Shaffer C, Tyson J: Model aggregation: a building-block approach to creating large macromolecular regulatory networks. Bioinformatics 2009, 25(24):3289. 10.1093/bioinformatics/btp581
- Moutselos K, Kanaris I, Chatziioannou A, Maglogiannis I, Kolisis F: KEGGconverter: a tool for the in-silico modelling of metabolic networks of the KEGG Pathways database. BMC Bioinformatics 2009, 10: 324. 10.1186/1471-2105-10-324
- Wang YT, Huang YH, Chen YC, Hsu CL, Yang UC: PINT: Pathways INtegration Tool. Nucleic Acids Res 2010.
- Le Novère N, Finney A, Hucka M, Bhalla U, Campagne F, Collado-Vides J, Crampin E, Halstead M, Klipp E, Mendes P, Nielsen P, Sauro H, Shapiro B, Snoep J, Spence H, Wanner B: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology 2005, 23(12):1509–1515. 10.1038/nbt1156
- Hucka M, Finney A, Sauro H, Bolouri H, Doyle J, Kitano H, Arkin A, Bornstein B, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles E, Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH, Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N, Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y, Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS, Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003, 19(4):524–531. 10.1093/bioinformatics/btg015
- Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, Martinez-Cruz L, Corrales F, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2005, 2(4):330–338. 10.1109/TCBB.2005.50
- Schulz M, Krause F, Le Novère N, Klipp E, Liebermeister W: Retrieval, alignment, and clustering of computational models based on semantic annotations. Mol Syst Biol 2011, 19(7):513.
- Gay S, Soliman S, Fages F: A graphical method for reducing and relating models in systems biology. Bioinformatics 2010, 26(18):i575. 10.1093/bioinformatics/btq388
- Yang Q, Sze S: Path matching and graph matching in biological networks. Journal of Computational Biology 2007, 14: 56–67. 10.1089/cmb.2006.0076
- Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T: PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 2004, 1(32 Web Server):W83.
- Shlomi T, Segal D, Ruppin E, Sharan R: QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics 2006, 7: 199. 10.1186/1471-2105-7-199
- Pinter R, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401. 10.1093/bioinformatics/bti554
- Dost B, Shlomi T, Gupta N, Ruppin E, Bafna V, Sharan R: Research in Computational Molecular Biology. Springer; 2007:1–15.
- Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Journal of the American Chemical Society 2003, 125(39):11853–11865. 10.1021/ja036030u
- Tohsato Y, Matsuda H, Hashimoto A: A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB'00) 2000, 376–383.
- Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275. 10.1093/bioinformatics/btg153
- Gamalielsson J, Olsson B: GOSAP: Gene Ontology-Based Semantic Alignment of Biological Pathways. International Journal of Bioinformatics Research and Applications 2008, 4(3):274–294. 10.1504/IJBRA.2008.019575
- Wernicke S, Rasche F: Simple and fast alignment of metabolic pathways by exploiting local diversity. Bioinformatics 2007, 23(15):1978–1985. 10.1093/bioinformatics/btm279
- Fionda V, Palopoli L: Biological Network Querying Techniques: Analysis and Comparison. Journal of Computational Biology 2011, 18(4):595–625. 10.1089/cmb.2009.0144
- Krause F, Uhlendorf J, Lubitz T, Schulz M, Klipp E, Liebermeister W: Annotation and merging of SBML models with semanticSBML. Bioinformatics 2010, 26(3):421. 10.1093/bioinformatics/btp642
- Le Novère N, Bornstein B, Broicher A, Courtot M, Donizelli M, Dharuri H, Li L, Sauro H, Schilstra M, Shapiro B, Snoep J, Hucka M: BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems. Nucleic Acids Research 2006, 34: D689-D691. 10.1093/nar/gkj092
- Huang C, Ferrell J: Ultrasensitivity in the mitogen-activated protein kinase cascade. Proceedings of the National Academy of Sciences 1996, 93(19):10078. 10.1073/pnas.93.19.10078
- Levchenko A, Bruck J, Sternberg P: Scaffold proteins may biphasically affect the levels of mitogen-activated protein kinase signaling and reduce its threshold properties. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(11):5818. 10.1073/pnas.97.11.5818
- Hynne F, Danø S, Sørenson P: Full-scale model of glycolysis in Saccharomyces cerevisiae. Biophysical Chemistry 2001, 94: 121–163. 10.1016/S0301-4622(01)00229-0
- Schulz M, Uhlendorf J, Klipp E, Liebermeister W: SBMLmerge, a System for Combining Biochemical Network Models. Genome Informatics 2006, 17: 62–71.
- Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 2008, (36 Database):D344-D350.
- Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al.: KEGG for linking genomes to life and the environment. Nucleic Acids Research 2008, 36(suppl 1):D480.
- Salton G: The SMART retrieval system - experiments in automatic document processing. Prentice-Hall, Inc. Upper Saddle River, NJ, USA; 1971.
- Becker J, Kuropka D: Topic-based vector space model. Proceedings of the 6th International Conference on Business Information Systems 2003, 7–12.
- Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21(suppl 1):i302. 10.1093/bioinformatics/bti1054
- Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences 2008, 105(35):12763. 10.1073/pnas.0806627105
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.