An integrative methodology based on protein-protein interaction networks for identification and functional annotation of disease-relevant genes applied to channelopathies

Marín, Milagros; Esteban, Francisco J.; Ramírez-Rodrigo, Hilario; Ros, Eduardo; Sáez-Lara, María José

doi:10.1186/s12859-019-3162-1

Methodology article
Open access
Published: 12 November 2019

Milagros Marín¹,³,
Francisco J. Esteban²,
Hilario Ramírez-Rodrigo³,
Eduardo Ros¹ &
…
María José Sáez-Lara ORCID: orcid.org/0000-0001-7439-9817³

BMC Bioinformatics volume 20, Article number: 565 (2019)

Abstract

Background

Biologically data-driven networks have become powerful analytical tools that handle massive, heterogeneous datasets generated from biomedical fields. Protein-protein interaction networks can identify the most relevant structures directly tied to biological functions. Functional enrichments can then be performed based on these structural aspects of gene relationships for the study of channelopathies. Channelopathies refer to a complex group of disorders resulting from dysfunctional ion channels with distinct polygenic manifestations. This study presents a semi-automatic workflow using protein-protein interaction networks that can identify the most relevant genes and their biological processes and pathways in channelopathies to better understand their etiopathogenesis. In addition, the clinical manifestations that are strongly associated with these genes are also identified as the most characteristic in this complex group of diseases.

Results

In particular, a set of nine representative disease-related genes was detected, these being the most significant genes in relation to their roles in channelopathies. In this way we attested the implication of some voltage-gated sodium (SCN1A, SCN2A, SCN4A, SCN4B, SCN5A, SCN9A) and potassium (KCNQ2, KCNH2) channels in cardiovascular diseases, epilepsies, febrile seizures, headache disorders, neuromuscular, neurodegenerative diseases or neurobehavioral manifestations. We also revealed the role of Ankyrin-G (ANK3) in the neurodegenerative and neurobehavioral disorders as well as the implication of these genes in other systems, such as the immunological or endocrine systems.

Conclusions

This research provides a systems biology approach to extract information from interaction networks of gene expression. We show how large-scale computational integration of heterogeneous datasets, PPI network analyses, functional databases and published literature may support the detection and assessment of potential therapeutic targets in the disease. Applying our workflow makes it feasible to spot the most relevant genes and unknown relationships in channelopathies, and shows its potential as a first-step approach to identify both genes and functional interactions in clinical-knowledge scenarios of target diseases.

Methods

An initial gene pool is first defined by searching general databases under a specific semantic framework. From the resulting interaction network, a subset of genes is identified as the most relevant through a workflow that includes centrality measures and other filtering and enrichment databases.

Background

The genetic aetiology of many complex diseases comprises different specific clinical symptoms and evolution. The identification of their causal agents becomes essential for the detection of suitable targets, the management of their diagnosis and the selection of the most adequate therapies [1,2,3]. The increasing availability of large bibliographic data volumes lays the foundations for the identification of these candidate genes [2, 4]. However, the integration of all this knowledge requires understanding the diverse biomedical information sources available. The extraction of data performed by valid association procedures and the comprehensive interpretation of all this current knowledge is complex. This is in and of itself an issue of utmost importance for the purpose mentioned above [4,5,6].

Traditional reductionist strategies that deal with this diverse wealth of information focus on the study of particular molecules or signalling pathways that are useful for the identification of diagnostic biomarkers. Nevertheless, such strategies do not seem sufficient to address the full complexity of the system [2, 4]. Alternatively, interdisciplinary research is developing new technologies and integrative computational methodologies in order to better understand pathogeneses [7, 8]. Some studies using these integrative methodologies have uncovered co-morbidities between Alzheimer’s disease and some types of cancers [9], where genetic factors can play an important role along with other factors such as the environment, lifestyle, and drug treatments. They are also being used to perform a genome-wide search for autism gene candidates [1]. These new tools are able to manage deductive analyses by gaining insight into the connections among diseases, even between those a priori not related by traditional bibliographic searches, which usually tend to be subjective, time-consuming or not reproducible [10]. However, the large range of diverse new tools created with different focuses hinders the emergence of a single approach to, or a consensus on, their usage. Thus, data extraction through ad-hoc approaches using specific tools may again be complex, not reproducible or subjective. In this context, network analyses and functional annotation tools represent some of the best strategies for interpreting biomedical data objectively and for coping with higher levels of biological complexity [1,2,3, 11].

The identification of relevant genes is being addressed through the global analysis of multiple interactions at different levels, usually employing networks as representations of the complex biological interactions underlying clinical disorders [11,12,13]. One way to systematically decode cellular signalling networks consists in identifying the interactome in order to detect the central nodes which maintain the structure and information fluxes of the functional network [11, 14]. Despite some limitations, protein-protein interaction (PPI) networks have been suitably applied to the definition of biological mechanisms by integrating PPI data with transcriptional changes [1, 13,14,15]. There is evidence that, in disease networks in which the alteration is produced by mutations, the mutated node or nodes play a primary role in the development of diseases and thus have a central position in the network [16]. In the case of multifactorial diseases, the nodes which seem to be the causal factor could be located in the periphery. However, the key nodes in the main biological and molecular processes affected, i.e. potential pharmacological targets, tend to have a central position in the network [1, 17,18,19]. Thus, for the identification of the most significant genes in a disease as molecular targets there are useful, high-impact software tools [20,21,22,23,24]. One of them is STRING [22], a database used to build predicted and well-known PPI networks. The interactions in STRING are mainly derived from automated text-mining and databases of previous knowledge, among other resources. Another well-known tool is Cytoscape [23], an open-source software platform which has been designed for the purpose of visualizing, analysing and modelling complex biological networks and pathways.

Furthermore, in a systems biology approach it is highly important to know the biological and molecular processes in which the complex set of genes involved play a joint key role. Moreover, if the aim is to identify pharmacological targets, it is also necessary to determine whether these candidate genes are related to other diseases as comorbidities [9, 25]. These annotations and associations can be performed through traditional bibliographic search systems, which are inefficient, subjective and time-consuming when done by hand [10], or by using some of the highest-impact tools from the large number of platforms developed for functional annotation in objective, quick and reproducible ways [26]. This is the case of DAVID [24], which has been shown to provide an automatic, comprehensive set of functional annotation tools for the biological interpretation of large gene lists as pharmacological targets [7]. It is also very useful in unveiling other related diseases, providing a more comprehensive view of the importance of treatments [9, 19].

In this regard, the aim of this study is to present a semi-automatic workflow using PPI networks for the identification and functional annotation of the most relevant genes in diseases. This new contribution to the extant methods is based on the integration of a set of multidimensional data from different biological levels (genomics, transcriptomics and proteomics) in order to analyse genetic correlations among diseases with different clinical symptomatologies and/or clinical prognoses (and still based on similar molecular mechanisms). To illustrate the value of this integrative approach and demonstrate its usefulness, we applied this methodology to the case of channelopathies as a proof of concept. The goal was to understand their most common polygenic influences, which contributes to the overall understanding of the pathomechanisms underlying these altered-channel diseases and of how mutations can modify disease severity [27], and may shed some light on effective treatments [19, 28, 29]. We showed that the proposed workflow is able to mine currently available databases and platforms in the context of channelopathies.

Results

In this section, we illustrate the experimental application of the semi-automatic workflow (Fig. 1) to the case of channelopathies.

Semi-automatic workflow applied to channelopathies

Gene dataset of the disease under study

First, the gene dataset of channelopathies was created by introducing the term “channelopathies” in the first stage of the present workflow (Fig. 1 Stage 1), which generated a list of 42 genes involved in this complex group of disorders: SCN5A, KCNH2, KCNQ1, HLA-B, RYR2, SCN2A, SCN4A, CACNA1C, KCNE1, KCNE2, CACNA1S, ATP8B4, DCHS1, SCN4B, SCN2B, SCN9A, SNTA1, CDKL5, STK11, STXBP1, TGFB1, TGFB2, TRPC4, SCN1A, SCN1B, HLA-DRB5, HSPB2, KCNQ2, LOXL2, CNGB3, SCN3B, PCDH19, KCNE3, AKAP9, PRRT2, CLCN1, ASB10, ARX, DMPK, SPESP1, ANK3, HLA-A.

Identification of the most relevant genes

Then, the list of gene names was used as the input for Stage 2 (Fig. 1 Stage 2). Our target organism was H. sapiens, and a PPI network was generated through the STRING database (interactome network presented in Additional file 1) and then analysed with the Cytoscape platform (Fig. 2). We employed degree and betweenness, the main features used as centrality parameters (as described in Methods), to identify the most important vertices within the graph. Thus, starting from the 42 genes involved in channelopathies, the nine genes with the highest degree and betweenness in the network were selected as the most relevant in channelopathies: SCN9A, ANK3, SCN5A, SCN2A, KCNQ2, SCN1A, KCNH2, SCN4B and SCN4A. This same set of nine relevant genes was also obtained using other connectivity features, such as closeness, eigenvector centrality and radiality (Fig. 3). The result therefore proves to be robust and concordant with that from Stage 2 of the workflow using only degree and betweenness.
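As an illustration of the two centrality parameters used in Stage 2, the following minimal sketch computes degree and (unnormalised) betweenness on a small undirected graph using only the Python standard library. It is not the Cytoscape analysis performed in the study: the edge list is a hypothetical toy network, and the selection rule (keeping nodes above the network mean on both measures) is an assumption made for the example, since the actual thresholds are described in Methods.

```python
from collections import deque

# Hypothetical toy edges for illustration only; NOT the STRING network of the study.
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("E", "F")]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Number of direct interaction partners of each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def betweenness(adj):
    """Unnormalised betweenness (Brandes' algorithm, unweighted, undirected)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                            # nodes in non-decreasing distance from s
        preds = {n: [] for n in adj}          # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)         # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                          # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):             # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair is visited from both endpoints, so halve the totals.
    return {n: value / 2 for n, value in bc.items()}


def select_relevant(adj):
    """Assumed rule: keep nodes above the network mean for both measures."""
    deg, btw = degree(adj), betweenness(adj)
    mean_deg = sum(deg.values()) / len(deg)
    mean_btw = sum(btw.values()) / len(btw)
    return sorted(n for n in adj if deg[n] > mean_deg and btw[n] > mean_btw)


adjacency = build_adjacency(EDGES)
print(degree(adjacency))
print(betweenness(adjacency))
print(select_relevant(adjacency))
```

In this toy graph, node D lies on many shortest paths but has only two neighbours, so it is filtered out by the degree criterion; requiring both measures is what distinguishes well-connected hubs from simple bridges.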

Gene functional annotation

Finally, the functional annotation of each gene was automatically generated in Stage 3 using the DAVID search tool (Fig. 1 Stage 3). All the functional annotation results are detailed in section 4.1 of Additional file 4.

Validation of the workflow

To measure the quality of the results obtained, we carried out an alternative, more conventional search in order to compare the workflow annotation results with those offered by two other widespread families of bibliographic methods: systematic review and exhaustive review.

Comparison criterion

Using “MeSH” ontology [33], we selected four upper-level categories with their corresponding lower-level ones. Each of these lower-level categories refers to one or more diseases linked to these genes. We used “health disorder” as the specific comparator which contains up to four upper-level categories: 1) cardiovascular diseases, 2) nervous system diseases, 3) mental diseases, and 4) other diseases. This frame comprises all the phenotypes of each relevant gene in channelopathies to facilitate the visualization and comparison of functional annotation results (as specified in Table 1). In Additional file 3 we can find the “MeSH”-based terminological hierarchies of the selection of the lower-level categories.
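The comparison frame can be pictured as a roll-up from lower-level to upper-level categories. The sketch below is only an illustration under stated assumptions: the lower-level names are those mentioned in the Discussion for the cardiovascular and nervous system groups (the complete list is in Table 1 and Additional file 3 and is not reproduced here), the gene identifiers are placeholders, and sending unmapped terms to "other diseases" is a simplification chosen for the example.

```python
# Partial, illustrative mapping; the full hierarchy is given in Table 1.
UPPER_LEVEL = {
    "vascular diseases": "cardiovascular diseases",
    "cardiac arrhythmias": "cardiovascular diseases",
    "other heart diseases": "cardiovascular diseases",
    "epilepsies": "nervous system diseases",
    "febrile seizures": "nervous system diseases",
    "headache disorders": "nervous system diseases",
}


def roll_up(annotations, mapping=UPPER_LEVEL, default="other diseases"):
    """Group (gene, lower-level term) pairs by upper-level category.

    Returns {upper: {lower: set(genes)}}. Terms missing from the mapping
    fall into `default` (an assumption of this sketch).
    """
    grouped = {}
    for gene, term in annotations:
        upper = mapping.get(term, default)
        grouped.setdefault(upper, {}).setdefault(term, set()).add(gene)
    return grouped


# Placeholder annotations, not results from the study.
example = [
    ("GENE_A", "cardiac arrhythmias"),
    ("GENE_B", "cardiac arrhythmias"),
    ("GENE_A", "epilepsies"),
    ("GENE_C", "unmapped term"),
]
for upper, lower_terms in roll_up(example).items():
    print(upper, lower_terms)
```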

Table 1 “MeSH”-based categories selected. A total of four upper-level categories and their corresponding lower-level categories capture all the phenotypes manifested by more than one of these genes. We resorted to “MeSH” terminological-based hierarchical networks that include all the phenotypes as referred in the third column (included in Additional file 3)

From the functional annotation results through the last stage of the proposed workflow (using DAVID) (Table 4.1.8 in Additional file 4) and applying “health disorder” as the specific-domain category, we obtained the results (consigned in Table 4.2.1 in Additional file 4) that will be visually represented in the final results of this work (Figs. 5, 6 and 7).

Systematic review and exhaustive review as other traditional search systems

In the systematic review we searched by phenotype nomenclatures, filtered by H. sapiens as the target organism and removed duplicate entries (Fig. 4). Finally, we extracted nine gene entries from the OMIM and Gene databases and 32 disease records from the MedGene database (Table 2; all the diseases extracted through the systematic review can be found in Additional file 5). Following the same four upper-level categories, we created an equivalent table relating each disease or clinical manifestation to its corresponding genes. We used “health disorder” as the specific-domain category and obtained the results shown in Additional file 6. We also compared DAVID against other high-impact phenotype-oriented databases, which again supported the selection of this tool in Stage 3 (information included in Additional file 9).

Table 2 Gene accession numbers filtered through systematic review

As our third step, exhaustive review was performed by using the query words “gene product nomenclature” + “diseases” in the search box of PubMed and MEDLINE resources, the evidence filtering being the most time-consuming task. We took the same categories and created an equivalent table containing each disease or clinical manifestation related to the corresponding genes, its “health disorder” as the specific-domain category, and its bibliographic references (Additional file 7). While performing this traditional review, we could also expand the functional annotation of the most relevant genes with further information, detailed in Additional file 8.

Representation through genotype-phenotype association networks

From the genotype-phenotype relationships found by the three search systems used in this work, namely the last stage of the workflow (Table 4.2.1 in Additional file 4), the systematic review (Additional file 5) and the exhaustive review (Additional file 7), and considering all the categories selected for every phenotype, we represented association networks for cardiovascular diseases (Fig. 5), nervous system diseases (Fig. 6), and mental diseases and other disorders (Fig. 7).

For cardiovascular diseases, the DAVID search (Fig. 5a) found more diseases than the systematic review (Fig. 5b) and the exhaustive review (Fig. 5c), with the exception of a connection between the gene SCN4B and the “other heart diseases” category, which was retrieved by the exhaustive review but not by the DAVID or systematic searches. This is due to the fact that the gene product of SCN4B is an auxiliary subunit; hence it influences but does not directly cause the disease. In fact, it has been found to be associated with various inherited arrhythmia syndromes (Brugada syndrome, long-QT syndrome type 3, progressive cardiac conduction defect, sick sinus node syndrome, atrial fibrillation, and dilated cardiomyopathy) [34]. For nervous system diseases, the DAVID search (Fig. 6a) provided many more phenotypic connections among genes than the systematic review (Fig. 6b) or the exhaustive review (Fig. 6c), which obtained the same amount of information as each other. In fact, we could observe that the only gene lacking a disease association is SCN4B which, as mentioned above, is associated with cardiovascular diseases only. Finally, for mental and other disorders we only found phenotypic connections for the most relevant genes in DAVID (Fig. 7a), but not through the systematic review (Fig. 7b) or the exhaustive review (Fig. 7c).
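This kind of comparison between search systems amounts to set operations over gene-category pairs. The following sketch is a hedged illustration rather than the procedure used to build Figs. 5, 6 and 7: the function name and the input layout are assumptions, and all pairs are placeholders except SCN4B with “other heart diseases”, which is the connection reported above as retrieved only by the exhaustive review.

```python
def compare_methods(results):
    """Summarise associations per method.

    `results` maps a method name to a set of (gene, category) pairs.
    For each method the summary reports how many pairs it found, which
    pairs only it found, and which pairs found elsewhere it missed.
    """
    all_pairs = set().union(*results.values())
    summary = {}
    for name, pairs in results.items():
        others = set().union(*(p for n, p in results.items() if n != name))
        summary[name] = {
            "total": len(pairs),
            "unique": sorted(pairs - others),
            "missed": sorted(all_pairs - pairs),
        }
    return summary


# Placeholder data (GENE_X, GENE_Y, category_1, category_2 are hypothetical).
toy_results = {
    "workflow": {
        ("GENE_X", "category_1"),
        ("GENE_Y", "category_1"),
        ("GENE_X", "category_2"),
    },
    "systematic": {("GENE_X", "category_1")},
    "exhaustive": {
        ("GENE_X", "category_1"),
        ("SCN4B", "other heart diseases"),
    },
}

for method, stats in compare_methods(toy_results).items():
    print(method, stats)
```

Reporting the "unique" and "missed" pairs separately makes exceptions such as the SCN4B connection visible instead of hiding them behind a single count per method.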

Discussion

In the present study, we addressed the prediction of the most relevant genes in the context of a group of pathologies not necessarily homogeneous but linked by a common term, as is the case of channelopathies. The identification of those genes may present several shortcomings: 1) finding key genes through scientific literature might be a burdensome task due to the fuzzy and textual nature of the information, 2) completely objective criteria are hard to define, and 3) the comparison and validation of different search methodologies might not be objectively carried out. To tackle limitation 1), we developed an integrative methodology using a workflow which starts from genes linked to particular diseases. Then we built a protein-protein interaction network from which key genes are identified through the determination of centrality measures. Finally, we proceeded to functionally annotate these key genes by applying data analysis tools that are widely used in the literature.

Although the proposed methodology is general-purpose, in this study it was applied to the set of diseases termed channelopathies. In this clinical context, our method allowed the identification of the most relevant genes (those with the highest degree and betweenness centrality) related to channelopathies. The products of these genes are mostly channels of two different types: voltage-gated sodium channels (SCN1A, SCN2A, SCN4A, SCN4B, SCN5A and SCN9A), which are involved in the rapid depolarisation in cardiac conduction (Reactome ID: R-HSA-5576892, Table 4.1.6 in Additional file 4), and voltage-gated potassium channels (KCNQ2 and KCNH2), which are responsible for the activation of the voltage-gated potassium channel family in the neuronal system (Reactome ID: R-HSA-1296072, Table 4.1.6 in Additional file 4) [35,36,37]. KCNH2 is also involved in the rapid repolarisation in cardiac conduction (Reactome ID: R-HSA-5576890, Table 4.1.6 in Additional file 4). On the other hand, Ankyrin-G (ANK3) is a protein involved in vesicle-mediated transport within membrane trafficking (Reactome ID: R-HSA-374562, Table 4.1.6 in Additional file 4) and is also responsible for linking integral membrane proteins, such as the voltage-gated sodium channel, with the spectrin-based membrane skeleton [38]. In particular, all the genes except KCNH2 contribute to the interaction between cytoskeleton adaptor ankyrins and a type of adhesion receptor (L1) which inhibits nerve growth in the neural development pathway (Reactome ID: R-HSA-445095, Table 4.1.6 in Additional file 4) [35, 39].

Defects in the ion channels throughout the human body have been involved in a wide phenotypic variability in channelopathies. This remarkable causal heterogeneity makes the diseases hard to classify [40]. Some reviews deal with the categorization of channelopathies based on the organ system with which they are mainly associated in both clinical and pathophysiological aspects [28, 40,41,42,43]. Other reviews opt to classify channelopathies according to the ion channel proteins in order to improve the understanding of how their specific mutations can be linked to diseases [27, 44,45,46]. In current reviews the implication of voltage-gated sodium channels with cardiac pathologies (such as long-QT syndrome and fatal arrhythmias) and epilepsies is easily retrievable [27]. The role of some voltage-gated potassium channels with cardiac pathologies (heart arrhythmias, dilated cardiomyopathies), epilepsies and chronic pain is also well studied [27]. On the contrary, we do not know much about the clustering of Ankyrin-G at the axonal initial segments in the nervous system with voltage-gated sodium channels [47, 48] and some potassium channels [49]. In our work we found this implication of voltage-gated sodium and potassium channels in cardiovascular diseases (SCN2A-SCN9A-KCNH2 cluster for vascular diseases, SCN2A-SCN5A-KCNH2 cluster for cardiac arrhythmias and SCN5A-SCN4B-KCNH2 cluster for other heart diseases) (Fig. 5). We also discovered a very high interconnection and participation of the genes selected not only in epilepsies, but also in febrile seizures, headache disorders, neuromuscular and neurodegenerative diseases and neurobehavioral manifestations (Fig. 6). It is interesting to highlight that in our results the above mentioned participation of Ankyrin-G in the nervous system (Fig. 6) is also reflected, specifically in neurobehavioral manifestations (ANK3-SCN5A-KCNH2 cluster) and neurodegenerative diseases (ANK3-SCN2A-SCN4A-SCN9A cluster). 
Finally, our results showed the implication of the genes obtained in other types of diseases, such as tobacco use disorder, diabetes mellitus type 2 or sudden death (Fig. 7), which consequently suggests the involvement of these genes in other systems, such as the immunological system [50] or the endocrine system [40]. As discussed above, we found that these results corroborate the conclusions collected by the current literature on channelopathies, including outcomes that the other traditional literature-mining methods did not retrieve.

Validating the proposed methodology by statistical comparison with other extant methods would be difficult due to their very different nature and properties. For that reason, we compared our proposal with two traditional and widespread families of methods, namely systematic review and exhaustive review. Among the three methods employed, our workflow and the systematic review proved to be more objective approaches than the exhaustive review. Our results indicate that our methodology is able to find more correlations among the nine genes selected than either of the other two methods. In particular, the present approach allows the detection of many more correlations than the systematic review (as seen in Figs. 5, 6 and 7).

Therefore, the proposed methodology is able to gather as much significant information as any other traditional literature search system mentioned in this work. At the same time, it was shown to work more flexibly, making it a convenient and easy-to-perform first-level approach compared to the above-mentioned methods.

Conclusion

We showed the usefulness of a semi-automatic integrative workflow that draws on successful, currently available data-mining databases and platforms based on protein-protein interaction networks, applied to channelopathies. This workflow yields results as productive as those of non-automatic research but more quickly, functioning as a bridge-builder among fields and allowing the extraction of information which a priori might not seem relevant when the starting point is a very large group of genes in a disease. We encourage future lines of research to focus on the full automation of the workflow and the use of more specific statistical resources such as principal component analysis or machine learning classifiers.

Methods

In this section, we present the semi-automatic workflow (Fig. 1) and describe the current systems biology tools and processes used. The course of action runs as follows: first, a gene dataset of the disease under study is extracted; second, a protein-protein interaction network is built and analysed, and the most significant genes in the disease are selected; third, the functional annotation of each relevant gene is performed.
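The three stages can be read as a simple chain in which each stage consumes the output of the previous one. The skeleton below is only a structural sketch: `run_workflow` and its four callable parameters are hypothetical names introduced here, and the stub functions stand in for the database queries, STRING/Cytoscape analysis and DAVID annotation, none of which are reimplemented.

```python
def run_workflow(term, fetch_genes, build_network, select_relevant, annotate):
    """Chain the three stages; every callable is supplied by the caller."""
    genes = fetch_genes(term)                # Stage 1: gene dataset for the disease term
    network = build_network(genes)           # Stage 2: PPI network construction
    relevant = select_relevant(network)      # Stage 2: centrality-based selection
    return {gene: annotate(gene) for gene in relevant}  # Stage 3: functional annotation


# Stub stages with placeholder data, for illustration only.
def fetch_genes_stub(term):
    return ["GENE_A", "GENE_B", "GENE_C"]


def build_network_stub(genes):
    # Connect consecutive genes so the middle one has the most neighbours.
    return {genes[i]: set(genes[max(0, i - 1):i] + genes[i + 1:i + 2])
            for i in range(len(genes))}


def select_relevant_stub(network):
    top = max(len(neighbours) for neighbours in network.values())
    return [gene for gene, neighbours in network.items() if len(neighbours) == top]


def annotate_stub(gene):
    return f"annotation for {gene}"


print(run_workflow("example disease term", fetch_genes_stub, build_network_stub,
                   select_relevant_stub, annotate_stub))
```

Passing the stages in as functions keeps the sketch independent of any particular database or tool, which mirrors the general-purpose intent of the workflow.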