From SNPs to pathways: integration of functional effect of sequence variations on models of cell signalling pathways
© Bauer-Mehren et al. 2009
Published: 27 August 2009
Single nucleotide polymorphisms (SNPs) are the most frequent type of sequence variation between individuals, and represent a promising tool for finding genetic determinants of complex diseases and understanding the differences in drug response. In this regard, it is of particular interest to study the effect of non-synonymous SNPs in the context of biological networks such as cell signalling pathways. UniProt provides curated information about the functional and phenotypic effects of sequence variation, including SNPs, as well as of mutations of protein sequences. However, no strategy has been developed to integrate this information with biological networks, with the ultimate goal of studying the impact of the functional effect of SNPs on the structure and dynamics of biological networks.
First, we identified the different challenges posed by the integration of the phenotypic effect of sequence variants and mutations with biological networks. Second, we developed a strategy for the combination of data extracted from public resources, such as UniProt, NCBI dbSNP, Reactome and BioModels. We generated attribute files containing phenotypic and genotypic annotations to the nodes of biological networks, which can be imported into network visualization tools such as Cytoscape. These resources allow the mapping and visualization of mutations and natural variations of human proteins and their phenotypic effect on biological networks (e.g. signalling pathways, protein-protein interaction networks, dynamic models). Finally, an example on the use of the sequence variation data in the dynamics of a network model is presented.
In this paper we present a general strategy for the integration of pathway and sequence variation data for visualization, analysis and modelling purposes, including the study of the functional impact of protein sequence variations on the dynamics of signalling pathways. This is of particular interest when the SNP or mutation is known to be associated to disease. We expect that this approach will help in the study of the functional impact of disease-associated SNPs on the behaviour of cell signalling pathways, which ultimately will lead to a better understanding of the mechanisms underlying complex diseases.
Single nucleotide polymorphisms (SNPs), among other types of short range sequence variants (see Additional File 1 for definitions of terms), represent the most frequent type of genomic variation between individuals (0.1% of sequence variation in a diploid genome). Moreover, their widespread distribution in the genome and their low mutation rate have enabled the use of SNPs as genetic markers of phenotypic traits, including diseases. SNPs are currently used in candidate gene association studies, genome wide association studies and pharmacogenomics studies. Once the SNPs associated with the disease phenotype are identified, elucidating the functional effect of the predisposing SNPs is a key step towards understanding the mechanisms underlying the disease.
Several publications and tools have approached the study of the functional effect of SNPs by assessing their effect on the protein structure or their impact on functional sites at the protein or DNA level [2–6]. All these approaches, although valuable, consider the effect at the single molecule level. It is a well established concept in systems biology that the function of a protein has to be understood by learning how the pathways in which it participates work. In this context, the functional consequences of SNPs are better appreciated if the evaluation is performed at the biological system level, for instance by determining their effect on the dynamics of signalling pathways. Consequently, it is important to consider the effect of SNPs, in particular those having an impact at the protein level (non synonymous SNPs, nsSNPs), in the context of biological networks. Synonymous SNPs and SNPs located in regions that modulate gene expression (e.g. promoters, introns, splice sites, transcription factor binding sites) can also alter gene or protein function and as a consequence lead to disease [8–11]. In this study, however, we focus on nsSNPs, as they have a more evident effect on protein function in biological processes and are more prevalent in databases and literature.
The study of the functional consequences of nsSNPs in relation to the molecular basis of diseases requires the integration and aggregation of several pieces of heterogeneous information such as protein sequence and its natural variations, experimental perturbations on protein function, the networks of reactions between proteins, and the phenotypes that are affected by the alterations on the protein function. Several resources collect information about SNPs [12, 13] and their association with diseases [2, 14] as well as mutations of clinical relevance . The study of protein function is usually assessed by experiments aimed at disrupting the activity of the protein, for instance by means of altering the protein sequence at residues suspected to be critical for the function (e.g. in vitro mutagenesis experiments). This information is documented in the biomedical literature, and it has already been recognized that text mining techniques are required to harvest it from free text. Nevertheless, much of this information is already collected in curated databases. One example is the UniProt database , which, along with information about protein sequence, structure, and function, records information about the functional effect and the association to disease phenotypes of nsSNPs, referred to as "natural variants" by UniProt. Thus, UniProt provides information about the functional effect of SNPs as well as on the effect of experimental mutation of specific protein residues. This information is recorded as sequence features in each protein entry (see for example http://www.uniprot.org/uniprot/P00533#section_features, for the entry P00533, in the "Sequence features" section, under "Natural variations" and "Experimental info"). This knowledge is extracted from the biomedical literature by UniProt curators and assigned to the corresponding protein entry [17, 18]. 
Therefore, it represents a reliable source of information about the natural variations of a protein and their associated phenotypes, and on the functional effect of mutations (obtained by experimental mutagenesis of protein residues) on the protein function.
Regarding the participation of proteins in pathways, several databases offer information about models of biological networks such as protein-protein interactions and signalling pathways (for a review on this topic, see ). An exemplary resource is Reactome , which contains manually curated information about pathways and reactions that involve human proteins. In addition, public repositories of models describing the dynamic behaviour of cellular pathways are also available (see  for an example).
With the public availability of resources such as pathway databases and curated datasets on the phenotypic effect of sequence variants, the study of genetic factors that contribute to complex disease phenotypes in the context of the structure and dynamics of biological networks should be feasible. In this regard, there are some reports detailing the integration of SNP data with protein structural data and pathways [22–24]. However, most of them focus on the visualization of nsSNPs on the protein structure, and only provide cross references to pathway databases [22, 24]. For instance, DataBins is a web service for the retrieval and aggregation of pathway data from KEGG and from sequence databases such as dbSNP, with the aim of mapping nsSNPs onto the proteins of a pathway. However, these approaches provide no utility for visualizing nsSNP data on the pathways, nor for analysing the functional effect of the nsSNPs in the pathway context. A different kind of approach uses statistical analyses to find and prioritise metabolic pathways associated with complex diseases based on SNP frequency data (see  for an example). However, the functional effects of SNPs have not been incorporated in these analyses. To our knowledge, no strategy has attempted to integrate these sources of information (proteins and their sequence variants such as SNPs, phenotypic effect of SNPs and models of biological networks) with the final goal of assessing the effect of SNPs on the structure and dynamics of biological networks. In this paper we first identify the challenges that have to be faced to perform this integration automatically. Then, we present a general strategy for the integration of pathway and sequence variation data, towards their use for network visualization and analysis, including the modelling of signalling pathways.
The goal of this project was to design and implement a general strategy for the integration of pathway and sequence variation data towards their use for network visualization and analysis. In general, in the different models of cellular networks (e.g. signalling pathways, dynamic models, protein-protein interaction networks) the proteins are always represented as nodes, and the edges represent reactions or interactions between proteins. Thus, in practice, the integration involves the mapping of SNPs and mutant residues to the protein nodes of a network and the mapping of their functional effect to the edges of a network (e.g. reactions or relationships between nodes), for their use in the visualization and dynamic analysis of pathways.
In the following sections, we describe and analyse the challenges and approaches for the integration of the phenotypic effect of sequence variations in the context of biological networks, which are:
Integration of data coming from diverse and heterogeneous sources.
Visualization of information about sequence variations in the context of biological pathways.
Incorporation of the effect of the perturbation caused by the sequence variation in dynamic models of the pathways.
The first step to achieve such an integration is to map SNPs and mutant residues to the protein nodes of a network, and second to map their functional effect to the edges of the network. As described in the Introduction section, UniProt was chosen because it contains manually curated information about nsSNPs and mutant residues of proteins. As described in the Methods section, we identified and extracted human protein entries from UniProt with annotations on natural variation and mutagenesis experiments, which are suitable for integration with biological networks such as protein-protein interaction networks, signalling pathways and dynamic models. In this study we focus on the pathway database Reactome  and the dynamic models repository BioModels . The data of these resources are available in standard formats: Reactome reactions and pathways are published in the data exchange format for biological pathway BioPAX  (level 2), and dynamic models in the BioModels repository are made available in the SBML standard . As mentioned above, the integration process between the UniProt derived data and the network representations can be considered at two levels. The first level involves the mapping of proteins for which there are natural variation/mutagenesis annotations in UniProt to proteins in biological network models (e.g. signalling pathways, dynamic models, protein-protein interaction networks). This is the simplest task, and was performed by matching the UniProt identifiers from both data sources. In this regard, it is important to note that the different states of a protein such as its level of phosphorylation or its cellular location appear as different entities in a pathway exchange format such as BioPAX and in a model representation such as SBML. However, all the entities that represent different states of a protein are characterized by the same sequence identifiers, e.g. UniProt identifiers. 
Consequently, the annotations of a given protein were mapped onto all the corresponding instances in Reactome and BioModels, that is, to all the nodes that contain the same UniProt identifier. As a result, data containing the sequence features (natural variations or mutagenesis experiments) extracted from UniProt can be incorporated to visualize, filter and search the biological network, for example using Cytoscape, a software for network visualization  (see section "Visualization of SNPs on biological networks" for a complete description).
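The first level of integration described above, in which annotations are attached to every node that shares a UniProt identifier, can be illustrated with a minimal sketch. The node names, the second identifier and the annotation strings below are hypothetical placeholders, not data taken from Reactome, BioModels or UniProt; only P00533 is an identifier mentioned in the text.

```python
from collections import defaultdict

# Hypothetical network nodes: different states of one protein appear as
# separate nodes but carry the same UniProt identifier.
nodes = [
    {"node_id": "EGFR_membrane", "uniprot": "P00533"},
    {"node_id": "EGFR_phosphorylated", "uniprot": "P00533"},
    {"node_id": "other_protein", "uniprot": "HYPO01"},   # placeholder identifier
    {"node_id": "small_molecule", "uniprot": None},      # node without a protein identifier
]

# Hypothetical annotations keyed by UniProt identifier (placeholder text).
annotations = {
    "P00533": ["placeholder natural variant annotation"],
    "HYPO01": ["placeholder mutagenesis annotation"],
}


def map_annotations(nodes, annotations):
    """Attach the annotations of a protein to every node with its identifier."""
    mapped = defaultdict(list)
    for node in nodes:
        accession = node["uniprot"]
        if accession in annotations:
            mapped[node["node_id"]].extend(annotations[accession])
    return dict(mapped)


node_annotations = map_annotations(nodes, annotations)
```

In this sketch both EGFR nodes receive the same annotation list, while the node without a UniProt identifier receives none, which mirrors the identifier-matching step described in the text.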
The second level of data integration involves the incorporation of the effect of the sequence variation in the biological process in which the protein participates. The effect of the sequence variation is expressed in natural language in the Description field of the UniProt files, and comprises one or more phrases. One can be tempted to think that state of the art text mining approaches will easily solve the problem of identification and extraction of the required information in order to map the functional effect onto the biological process represented in the biological model. However, the identification and extraction of the relevant information and its subsequent mapping to the reactions was found to be a non trivial task. An example is presented here in order to illustrate the difficulties that this task implies, and to highlight the challenges that an automatic text mining system should aim to handle. For clarity purposes, the example is analysed from the point of view of a domain expert (e.g. biologist) performing the interpretation of the data and their subsequent integration with pathways.
An additional difficulty appears if the fact that the SNP produces a "constitutive" activation has to be considered as well. But before addressing this issue, the biologist needs to interpret the meaning of "constitutive RAS activation". A possible interpretation of this assertion would be the following: mutated SOS1 does not depend on the binding of the activated EGFR receptor in order to activate RAS, and thus RAS is activated by SOS1 in a constitutive, ligand-independent manner (see  for an example). At this stage, an automatic system should deduce that in the presence of the allelic variant W->L of SOS1, there is no requirement for the signal originated by the binding of the EGF to its receptor to activate RAS. To accomplish this, this knowledge should be appropriately represented in an ontology.
In summary, this single example reveals the complexity of the integration process. The steps required to achieve the integration in an automatic way can be expressed as follows:
1. Extraction and mapping of information from natural language description of SNPs and mutations onto reactions or relations in networks. This requires a text mining system able to identify genes/proteins, along with their function and biological process in which they participate.
2. Identification of the entities/relationships in the network and mapping of both representations (text, network). The main difficulties here are the different levels of granularities and different perspectives used in text and pathways to describe the same process.
Solving these challenges will require ontologies and the use of sophisticated text mining tools able to map information extracted from text to information represented in networks. Once the information is represented in an OWL-DL based format, such as Reactome, reasoning could be applied in order to mimic the interpretations performed by a human expert [32–34].
Table 1: Node attribute description
(The node attribute names are not preserved in this text; each entry below is the description of one node attribute.)
Entrez Gene identifier
List of the mutagenesis information: contains the amino acid exchange, the sequence position and the textual phenotypic description from UniProt
List of the natural variant/polymorphism information: contains the amino acid exchange, the sequence position, the textual phenotypic description from UniProt and, if available, a MIM id and the textual description of the disease association; if mutagenesis data is also available at the same position, this data is listed as a sub-list of the polymorphism
Disease name associated with the natural variant
List of GO biological process terms that are associated to the natural variant or mutant
List of GO biological process identifiers that are associated to the natural variant or mutant
List of GO molecular function terms that are associated to the natural variant or mutant
List of GO molecular function identifiers that are associated to the natural variant or mutant
List of GO cellular component terms that are associated to the natural variant or mutant
List of GO cellular component identifiers that are associated to the natural variant or mutant
List of UniProt identifiers that are associated to the natural variant or mutant
Attribute required for the visual styles, taking one of the following values:
1: only mutagenesis information available
2: only polymorphism information available
3: mutagenesis and polymorphism information available but not at the same position
4: mutagenesis and polymorphism information available at the same position
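The four values of the attribute used by the visual styles can be derived from the sequence positions of the two kinds of annotation. The following is a minimal sketch of that rule; the function name and the use of plain position lists are assumptions made for illustration, and returning None for proteins without any annotation is a choice of this sketch that the table does not cover.

```python
def visual_style_code(mutagenesis_positions, variant_positions):
    """Return the code 1-4 listed in Table 1, or None if nothing is annotated."""
    mutagenesis = set(mutagenesis_positions)
    variants = set(variant_positions)
    if mutagenesis and variants:
        # 4 if at least one position carries both kinds of information, else 3
        return 4 if mutagenesis & variants else 3
    if mutagenesis:
        return 1
    if variants:
        return 2
    return None


# Hypothetical positions, chosen only to exercise each case.
code_only_mutagenesis = visual_style_code([10, 25], [])
code_only_variants = visual_style_code([], [42])
code_different_positions = visual_style_code([10], [42])
code_same_position = visual_style_code([10, 42], [42])
```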
The previous example was only chosen for illustrative purposes, to exemplify the usefulness of incorporating sequence variation data in a modelling exercise.
This approach opens the possibility of evaluating the functional effect of SNPs and mutations on the structure and dynamics of network models.
In this paper we have presented a general strategy for the integration of pathway and sequence variation data, towards their use in network visualization and analysis, as well as in the modelling of signalling pathways. In principle, all the data derived from UniProt could be used for this purpose, provided that the relevant models are available. Several difficulties were found when we tried to combine the data from two structured databases, UniProt and Reactome. Even though the data from these resources is already structured (the entities participating in the interactions are specified), it remains difficult to identify the reactions and nodes in the networks that are affected by the mutation or the SNP. These difficulties go beyond the tasks that current text mining systems are able to handle, since at a minimum NER and relationship extraction tools are required. The difficulties are mainly related to the different perspectives that can be used to refer to the same biological process, to the problem of mapping these different representations to a single concept, and to the complexity of the processes inherent to the knowledge domain. Similar issues were also discussed in relation to the manual annotation of a corpus describing events in the field of molecular biology [37, 38]. In these papers, the authors described the difficulty of mapping events expressed in natural language to reactions represented in pathways.
The intended integration allowing the mapping of the phenotypic effect of SNPs on biological networks (signalling pathways, protein-protein interaction networks, and dynamic models) has evident practical usefulness. The clinical phenotypic effect (e.g. sequence variation associated with colon cancer) and the functional phenotypic effect (e.g. sequence variation produces a decrease of enzymatic activity) can be evaluated in the context of the reactions and processes that are affected by the SNP. This is a very important issue as it provides information about the functional effect of mutations at the cellular level that are relevant in the clinical practice. Disease-associated variants or specific mutations of interest could be evaluated in the context of network models. Moreover, it would be possible to assess the effect of different sequence variations in the same model, an approach particularly relevant to consider the polygenic character of complex diseases. This can have significant consequences for understanding mechanisms of disease and the design of new therapeutical approaches.
In this paper we have presented a general strategy for the integration of pathway and sequence variation data, towards the use of the integrated information for network visualization and analysis, and for the modelling of signalling pathways. This will aid the modellers in studying the functional impact of protein sequence variations on the model dynamics and proposing relevant experiments. This is of particular interest when the SNP or mutation is known to be associated to disease. We expect that this approach will help in the study of the functional impact of disease-associated SNPs in the behaviour of cell signalling pathways, which ultimately will lead to a better understanding of the mechanisms underlying complex diseases.
Mutagenesis and natural variant information was obtained from UniProt/SwissProt (release 57.0, March 2009). The pathway "Signaling by EGFR" (http://www.reactome.com/cgi-bin/eventbrowser_st_id?ST_ID=REACT_9417) was downloaded in BioPAX level 2 format from Reactome (release 27) (see Additional File 2). The network model of "ErbB signalling" developed by Birtwistle et al. was downloaded in SBML format from BioModels (http://www.ebi.ac.uk/biomodels-main/publ-model.do?mid=BIOMD0000000175). This model was used for the visualization in Cytoscape and for the network modelling. Cytoscape version 2.6.0 supports SBML Level 2 Version 1 (SBML L2 V1). As the model downloaded from BioModels is in SBML L2 V3 format, it had to be modified for visualization in Cytoscape (see Additional File 3). Since the model downloaded from the original publication does not contain a mapping to UniProt identifiers, the mapping between all proteins appearing in the ErbB signalling network and UniProt was obtained from the annotations in the BioModels database and is provided as a mapping file as part of the supplementary materials (see Additional File 4).
Furthermore, text mining techniques were applied to the text in the Description field in order to identify concepts from controlled vocabularies in the phenotype description of the natural variants and mutations. The EBIMed system  was accessed through SOAP web service of Whatizit  and applied to the free text. EBIMed contains a NER module for the identification of mentions of GO  terms. In addition, it identifies mentions of proteins, which are disambiguated or normalized to UniProt identifiers. These data were extracted as additional annotations on the natural variations and mutations, providing a characterization of the functional effect in terms of GO terms and associated proteins.
All these data were combined for the development of attribute files that can be loaded into Cytoscape allowing the mapping, visualization, filtering and searching of the SNP information in the context of biological pathways.
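As an illustration of how such an attribute file might be generated, the sketch below formats a node attribute table as plain text. The layout (attribute name on the first line, then one `node = value` line per node) is our assumption about the simple node attribute format read by Cytoscape 2.x and should be checked against the Cytoscape manual. The attribute name `VariationCode` and the node identifiers are hypothetical and are not the names used in the Additional Files.

```python
def format_node_attributes(attribute_name, values):
    """Format one node attribute as text: a header line, then 'node = value' lines.

    The layout is an assumed simplification of the Cytoscape 2.x node
    attribute format; it is not taken from the files provided with the paper.
    """
    lines = [attribute_name]
    for node_id, value in sorted(values.items()):
        lines.append(f"{node_id} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical node identifiers and codes (1-4 as described for the visual styles).
example_values = {"node_B": 4, "node_A": 1}
attribute_text = format_node_attributes("VariationCode", example_values)

# The text could then be saved for import, for example:
# with open("variation_code.attrs", "w") as handle:
#     handle.write(attribute_text)
```

Sorting the node identifiers only keeps the output reproducible; the order of the lines carries no meaning in this sketch.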
For pathway visualization, we used Cytoscape version 2.6.0. Cytoscape is a widely used open-source software for the visualization and analysis of networks. In Cytoscape, networks are represented as graphs where the nodes are the entities (e.g. proteins) and the edges their interactions (e.g. reactions). For the visualization of mutagenesis and natural variant information in the context of biological networks, we developed three different node attribute files (for a detailed description of the attributes see Table 1) and two visual style files that can easily be imported into Cytoscape. In detail, we provide separate attribute files for the mutagenesis data (see Additional File 5) and the natural variant (polymorphism) data (see Additional File 6), and a third one for a combined view of both (see Additional File 7). The two visual styles account for the two major network representation formats, SBML (see Additional File 8) and BioPAX (see Additional File 9), which differ in their representation of node attributes. Here, the nodes are coloured according to the kind of information that is available (see Figure 4). Furthermore, we provide a guide explaining how to use the attribute and visual style files within Cytoscape (see Additional File 10), as well as an example Cytoscape session (see Additional File 11). As mentioned above, the mapping of the mutations and natural variants on the pathway requires the existence of UniProt identifiers for the nodes in the pathway. We want to emphasize that pathways in BioPAX format downloaded from Reactome include a node attribute containing the UniProt identifier. For pathways in SBML format, a mapping of the nodes to UniProt identifiers is provided during curation in the BioModels database.
We used COPASI (version 4.4) to model the dynamics of ErbB signalling with inactivated MEK. For this purpose, we modified the species "MEKstar", which represents the activated/phosphorylated MEK1, in the SBML file of the original model. We set the species attributes "constant" and "boundaryCondition" to "true" and kept the "initialConcentration" at zero. This implies that the concentration of MEKstar remains zero throughout the simulation. The simulation of the system dynamics and the data normalization were performed as in the original publications. All plots were generated with R. The modified model is available as Additional File 12.
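The modification of the species attributes can also be scripted. The following minimal sketch applies the three attribute changes to a toy SBML fragment using the Python standard library. The fragment, the compartment name and the assumption that "MEKstar" is stored in the species `id` (or `name`) attribute are illustrative; this is not the procedure or the model file used for Additional File 12.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 3 documents.
SBML_NS = "http://www.sbml.org/sbml/level2/version3"

# Toy SBML fragment (hypothetical; the real model contains many more elements).
sbml_text = f"""<sbml xmlns="{SBML_NS}" level="2" version="3">
  <model id="toy_model">
    <listOfSpecies>
      <species id="MEKstar" compartment="cell" initialConcentration="0"/>
      <species id="MEK" compartment="cell" initialConcentration="1"/>
    </listOfSpecies>
  </model>
</sbml>"""


def clamp_species_to_zero(root, species_label, namespace=SBML_NS):
    """Mark one species as a constant boundary species with zero concentration."""
    for species in root.iter(f"{{{namespace}}}species"):
        if species_label in (species.get("id"), species.get("name")):
            species.set("constant", "true")
            species.set("boundaryCondition", "true")
            species.set("initialConcentration", "0")
            return species
    raise KeyError(f"species {species_label!r} not found")


root = ET.fromstring(sbml_text)
modified = clamp_species_to_zero(root, "MEKstar")
modified_sbml = ET.tostring(root, encoding="unicode")
```

With "constant" and "boundaryCondition" both set to "true", a simulator is expected to hold the species at its initial value, here zero, which corresponds to the inactivated MEK scenario described above.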
SNP: single nucleotide polymorphism
NER: Named Entity Recognition
SBML L2 V1: SBML Level 2 Version 1
This work was generated in the framework of the @neurIST and the EU-ADR projects co-financed by the European Commission through the contracts no. IST-027703 and ICT-215847, respectively. The Research Unit on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB). It is also member of the COMBIOMED network. We thank the Departament d'Innovació, Universitat i Empresa (Generalitat de Catalunya) for a grant to author ABM.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 8, 2009: Proceedings of the European Conference on Computational Biology (ECCB) 2008 Workshop: Annotation, interpretation and management of mutations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/10?issue=S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.