Relation extraction for biological pathway construction using node2vec



Systems biology is an important field for understanding whole biological mechanisms composed of interactions between biological components. One approach for understanding complex and diverse mechanisms is to analyze biological pathways. However, because information on the interactions that make up these pathways is scattered across a large number of biomedical reports, text-mining techniques are essential for extracting these relationships automatically.


In this study, we applied node2vec, an algorithmic framework for feature learning in networks, to relationship extraction. To this end, we extracted genes from paper abstracts using PKDE4J, a text-mining tool for detecting entities and relationships. Using the extracted genes, a co-occurrence network was constructed, and node2vec was applied to the network to generate a latent representation. To demonstrate the efficacy of node2vec in extracting relationships between genes, performance was evaluated for gene-gene interactions involved in a type 2 diabetes pathway. Moreover, we compared the results of node2vec to those of baseline methods such as co-occurrence and DeepWalk.


Node2vec outperformed existing methods in detecting relationships in the type 2 diabetes pathway, demonstrating that this method is appropriate for capturing the relatedness between pairs of biological entities involved in biological pathways. The results demonstrated that node2vec is useful for automatic pathway construction.

Background


In the field of biology, biological pathway analysis is important for gaining insight into the underlying phenomenon of complex interactions between biological components [1,2,3]. Biological pathways are constructed based on collective interpretations of biomedical knowledge determined in many different studies, which demands considerable human effort [4]. Specifically, to construct pathways, biologists must read and interpret a large number of biomedical reports [5]. However, with the exponential growth in research papers in biology, it has become increasingly difficult to remain updated on new developments [6, 7]. This has increased interest in text mining techniques that can detect and extract biological entities, such as genes, diseases, and cells, as well as the relationships between these entities [8].

Numerous text mining techniques for relationship extraction have been proposed, ranging from simple but flexible methods such as co-occurrence-based relationship extraction [9, 10] to complex techniques including rule-based [11,12,13,14,15], unsupervised [16, 17], and supervised methods [18,19,20,21,22,23,24]. However, most relationship extraction studies have used feature-based supervised methods. Feature-based techniques for relationship extraction require a large amount of manually labeled data [17, 25], which is costly and time-consuming to produce. Moreover, feature engineering and extraction are important tasks because the performance of supervised learning techniques is largely dependent on the features [21], which in turn requires domain expert knowledge.

To tackle the training data issue, distantly/weakly supervised learning methods have been introduced [26,27,28]. Specifically, in the distantly supervised approach, an existing knowledge base is used to automatically label entities in the text, and the annotated data are then used to train a classifier [29]. Moreover, weakly supervised learning techniques can work with small, inexact, and inaccurate training data [30]. However, these supervised learning techniques still depend on a knowledge base in a given scientific domain and on labeled data.

Self-supervised learning is a type of supervised learning that learns representations entirely from unlabeled data; examples include autoencoders [31], Word2Vec [32], and node2vec [33]. These methods can be used for prediction tasks without training data. To take advantage of this strength, in this study, we applied node2vec, a network-embedding algorithm, to relation extraction in biological pathways. Another reason to use node2vec is that relationship extraction can be framed as link prediction between two biological entities in a network. Node2vec can learn continuous feature representations of nodes in networks by using a biased random walk to sample neighborhoods of nodes [33]. As such, without annotated data, node2vec can learn rich feature representations for all nodes in a network.

As a result, in this study, we predicted whether two biological entities can be connected in a network using the node2vec algorithm. A series of experiments showed that the network embedding technique is well-suited for relationship extraction between genes in a biological pathway.

Results


Evaluation of gene-gene interactions in the type 2 diabetes pathway

The type 2 diabetes mellitus pathway consists of 25 genes, 14 other biological components, such as diseases and molecular functions, and their direct/indirect relationships. Figure 1 shows the type 2 diabetes pathway provided by the KEGG PATHWAY database [34]. It is well-known that type 2 diabetes is strongly associated with insulin resistance [35]. Therefore, we focused on the pathways related to ‘insulin resistance’ within the type 2 diabetes pathway of KEGG. Specifically, the pathways linked to ‘insulin resistance’ contained 19 biological entities, including genes, molecular functions, and diseases, and 26 connections between these entities, as shown in Fig. 1. These biological components are listed in Table 1.

Fig. 1

Type 2 diabetes pathway. Pathway data ©2017 KEGG. Retrieved December 24, 2017, from Screenshot by author

Table 1 Entities belonging to the paths connected to insulin resistance

Among the 25 relationships, we evaluated 18 gene-gene interactions in the type 2 diabetes pathway, because relationships between genes make up the majority of the pathway. The 18 pairs of genes and their relationship types (direct/indirect) are shown in Table 2. The interactions shown in the KEGG type 2 diabetes pathway fall into 2 categories: direct and indirect interactions. The dotted lines in the KEGG pathway denote an indirect relationship between two biological entities. Two entities in an indirect relationship interact with each other through several other entities. For example, as shown in Table 2, the relationship between ADIPO and GLUT4 is indirect because the two genes are engaged in the adipocytokine signaling pathway as well as the type 2 diabetes pathway; biological components can participate in multiple pathways. Figure 2 shows that these two genes interact with each other through ADIPOR, AMPKK, and AMPK.

Table 2 Eighteen gene-gene interactions, with the interaction type and the other participating pathway of each relationship

Fig. 2

Interaction between adiponectin and GLUT4 in the adipocytokine signaling pathway. Pathway data ©2017 KEGG. Retrieved December 24, 2017, from Screenshot by author

Accordingly, when the two entities in an indirect relationship also participate in another pathway, we expanded that indirect interaction to capture more detailed information on how the interaction proceeds. Among the 18 gene-gene interactions, 9 were found to be indirect; the other pathway in which each of these gene pairs participates is listed in Table 2.

Therefore, these 9 indirect paths were expanded to identify direct relationships. Table 3 reports the extended paths and the direct paths in each extended path. For PI3K and GLUT4 (relationship no. 9), there are three possible ways to connect PI3K to GLUT4: PI3K-PKC-GLUT4, PI3K-PDK1/2-PKC-GLUT4, and PI3K-PDK1/2-AKT-GLUT4. Because PDK1/2, the upstream kinase, is responsible for the phosphorylation and activation of AGC kinase members regulated by PI3K [36,37,38,39,40,41,42], we selected the two expanded paths that pass through it, PI3K-PDK1/2-AKT-GLUT4 and PI3K-PDK1/2-PKC-GLUT4. As a result, including the 9 direct links in the type 2 diabetes pathway (relationship no. 2, 3, 4, 5, 6, 15, 16, 17, and 18), a total of 30 direct gene-gene interactions were used for performance evaluation.

Table 3 Extended paths and directed links in each expanded path

Each entity participating in the direct interactions was selected in sequence as a starting node to discover its closest terms. With each starting node, the 100 most similar genes were extracted by calculating cosine similarity between a given starting gene and other gene vectors. Next, the starting gene was paired with each of the 100 extracted genes and the newly generated relationships were ranked by cosine similarity. We evaluated the performance of node2vec by examining whether a given path in the type 2 diabetes pathway was ranked high in the results. For instance, when INS was a starting node, genes with high similarity to INS were extracted and the ranking of INSR was examined. If a direct path was not shown in the 100 newly created relationships, we considered that node2vec did not capture the path.
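The ranking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `rank_of_target` are hypothetical, and the 3-dimensional vectors are made-up toy values standing in for learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_of_target(embeddings, start, target, top_k=100):
    """Return the 1-based rank of `target` among the `top_k` genes most
    similar to `start`, or None if `target` is not among them."""
    scores = [
        (gene, cosine(embeddings[start], vec))
        for gene, vec in embeddings.items()
        if gene != start
    ]
    # Most similar genes first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    for rank, (gene, _) in enumerate(scores[:top_k], start=1):
        if gene == target:
            return rank
    return None

# Toy vectors (hypothetical values, for illustration only).
toy_embeddings = {
    "INS": [1.0, 0.0, 0.0],
    "INSR": [0.9, 0.1, 0.0],
    "GLUT4": [0.0, 1.0, 0.0],
    "SOCS": [0.5, 0.5, 0.0],
}

print(rank_of_target(toy_embeddings, "INS", "INSR"))
print(rank_of_target(toy_embeddings, "INS", "GLUT4", top_k=2))
```

A return value of `None` corresponds to the case in which a direct path does not appear among the newly created relationships and is therefore treated as not captured.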

The 30 direct links and the ranking of each relationship are described in Table 4. Among these 30 links, 25 gene-gene interactions were ranked within the top 100, and 24 of them were ranked within the top 10 (direct relationship no. 1–11, 13, 16–27). For example, IRS1/IRS-PI3K, INSR-SOCS, and SOCS-IRS1/IRS (direct relationship no. 3, 4, and 5) directly interact with each other in the type 2 diabetes pathway, which is supported by the additional pathway information shown in Fig. 3.

Table 4 Thirty direct gene-gene interactions and the ranking of each link

Fig. 3

INS-INSR and IRS-PI3K interactions in insulin signaling (a), and the effect of SOCS on insulin signaling (b). Adapted from “Diabetes and suppressors of cytokine signaling proteins,” by S. G. Rønn, N. Billestrup and T. Mandrup-Poulsen, 2007, Diabetes, 56(2), 541–548, p

According to a previous study [43], insulin (INS) binding to the insulin receptor (INSR) initiates phosphorylation of the receptor and IRS proteins, which activates PI3K. Moreover, SOCS interacts with the phosphorylated receptor, preventing the binding and activation of IRS proteins.

The results listed in Table 4 show that INS is a close term to INSR (ranking 10th), INSR to IRS1/IRS (ranking 4th), IRS1/IRS to PI3K (ranking first), INSR to SOCS (ranking 8th), and SOCS to IRS1/IRS (ranking second). These results indicate that node2vec can accurately reflect the relatedness of two directly related genes, demonstrating the feasibility of applying node2vec to relationship extraction.

However, 5 direct paths were not ranked within the top 100. Node2vec could not capture the similarity between the two entities in each of these paths because the entities and the relationships among them were not observed in the collected type 2 diabetes-related abstracts. This issue is described further in the Discussion section.

Moreover, we compared our results with those generated by the baseline methods, co-occurrence and DeepWalk. To compare the 3 techniques (node2vec, co-occurrence, and DeepWalk), we extracted 100 co-occurring gene pairs as well as DeepWalk-generated pairs for every starting node of the 30 direct links. First, co-occurring links were ranked by their co-occurrence counts. For example, in the case of direct link no. 1, genes frequently co-occurring with INS were extracted, such as INS-GLP-1 (co-occurrence frequency: 3959) and INS-TNF-alpha (co-occurrence frequency: 3145). The co-occurrence link INS-INSR (co-occurrence frequency: 1819) was ranked 9th. DeepWalk-generated gene paths were ranked in the same way as the paths generated by node2vec.

The results are listed in Table 4. Specifically, 22 co-occurring links and 25 paths generated using node2vec were ranked within the top 100. Thus, node2vec reflects the relatedness of the two genes in 3 paths (direct relationships no. 1, 8, and 12) better than co-occurrence; these 3 links were not observed in the co-occurrence results. Among the 30 direct relationships, only 4 co-occurrence links (direct relationships no. 2, 24, 26, and 27) were ranked higher than the corresponding node2vec-generated links. In contrast, 9 node2vec-generated paths (direct relationships no. 3, 4, 5, 7, 9, 16, 17, 20, and 21) were ranked higher than the co-occurrence links. The remaining 9 path rankings were identical.
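A comparison of this kind can be tallied with a few lines of code. The sketch below is illustrative only: the function `compare_ranks` and the link labels are hypothetical, the rank values are made up and do not reproduce Table 4, and a rank of `None` stands for a link that was not found within the top 100.

```python
def compare_ranks(ranks_a, ranks_b):
    """Tally, per link, whether method A or method B ranks it higher.

    A smaller number is a higher rank; None means the link was not
    retrieved by that method.
    """
    summary = {"a_higher": 0, "b_higher": 0, "tie": 0,
               "only_a": 0, "only_b": 0, "neither": 0}
    for link in sorted(set(ranks_a) | set(ranks_b)):
        a = ranks_a.get(link)
        b = ranks_b.get(link)
        if a is None and b is None:
            summary["neither"] += 1
        elif b is None:
            summary["only_a"] += 1
        elif a is None:
            summary["only_b"] += 1
        elif a < b:
            summary["a_higher"] += 1
        elif a > b:
            summary["b_higher"] += 1
        else:
            summary["tie"] += 1
    return summary

# Made-up ranks for five hypothetical links.
toy_node2vec = {"L1": 10, "L2": 4, "L3": 1, "L4": None, "L5": 7}
toy_cooccurrence = {"L1": None, "L2": 2, "L3": 3, "L4": None, "L5": 7}

toy_summary = compare_ranks(toy_node2vec, toy_cooccurrence)
print(toy_summary)
```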

In addition, 23 DeepWalk-generated paths were ranked within the top 100, meaning that node2vec captured 2 additional direct paths (direct relationships no. 1 and 16) that DeepWalk did not. Among the 30 direct relationships, only 1 DeepWalk path (direct relationship no. 4) ranked higher than the corresponding node2vec-generated path. In contrast, 8 paths generated by node2vec ranked higher than those generated by DeepWalk. The remaining 14 paths showed the same rank. These results demonstrate that node2vec performs better than co-occurrence and DeepWalk in capturing the relatedness of two genes in the extended type 2 diabetes pathway.

Discussion


In this study, we applied the node2vec algorithm to extract direct paths in a biological pathway. The results revealed the possibility of its application in automated pathway extraction. We further examined whether node2vec can capture the directions between pairs of biological components in the pathway. It is essential to extract these directions because biological reactions in the pathway flow from reactants to products, in reverse, or both [44, 45].

The directions were expressed from starting nodes to target nodes, which means that the biological reaction between a given gene pair flows from the starting entity to the target entity. To investigate whether node2vec reflects the directions in the ranking, we swapped the positions of the two entities in 5 pairs: INSR-IRS1/IRS, AMPKK-AMPK, Raf-MEK1/2, MEK1/2-ERK1/2, and PKCZ-GLUT4. That is, we set the original target genes (IRS1/IRS, AMPK, MEK1/2, ERK1/2, and GLUT4) as starting entities, and the top 100 closest genes were extracted and ranked by similarity between pairs of entities.

The results are presented in Table 5, which shows that the rankings of the newly generated paths were much lower than those of the original links. Specifically, for direct relationships no. 2 and 24, the newly generated paths did not appear in the results at all. Thus, the new target genes, INSR, AMPKK, Raf, MEK1/2, and PKCZ, were ranked much lower, or not extracted at all, as genes similar to the new starting nodes, IRS1/IRS, AMPK, MEK1/2, ERK1/2, and GLUT4. Based on our results, node2vec can capture the direction of flow between two genes, although the input co-occurrence network was undirected.
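One property of the ranking procedure helps explain how ranks can differ by direction: cosine similarity itself is symmetric, but the rank of a gene within another gene's neighbor list is not, because each starting node competes against a different set of neighbors. The sketch below illustrates only this arithmetic point with made-up 2-dimensional vectors and hypothetical gene names; it is not an analysis of the embeddings learned in this study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank(vectors, start, target):
    """1-based rank of `target` in the similarity-sorted neighbors of `start`."""
    others = [g for g in vectors if g != start]
    others.sort(key=lambda g: cosine(vectors[start], vectors[g]), reverse=True)
    return others.index(target) + 1

# Hypothetical vectors: geneB has closer neighbors than geneA does.
toy_vectors = {
    "geneA": [1.0, 0.0],
    "geneB": [0.8, 0.6],
    "geneC": [0.6, 0.8],
    "geneD": [1.0, 1.0],
}

forward = rank(toy_vectors, "geneA", "geneB")   # geneB seen from geneA
backward = rank(toy_vectors, "geneB", "geneA")  # geneA seen from geneB
print(forward, backward)
```

In this toy case the similarity between geneA and geneB is identical in both directions, yet geneB is the closest neighbor of geneA while geneA is only the third-closest neighbor of geneB.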

Table 5 Ranking reflecting the reverse directions

In addition to the flow directions, a pathway network contains hub nodes, which have the highest degree [46, 47]. Hub genes are considered important because they are likely essential for organism survival [48]. To identify the hub genes and determine how well node2vec captures the relatedness between these hubs, we constructed an extended type 2 diabetes pathway network using 2 other pathways: the insulin signaling pathway and the adipocytokine signaling pathway. This expanded network was visualized using Gephi [49], a network visualization tool, and is illustrated in Fig. 4.

Fig. 4

Extended type 2 diabetes pathway network

The extended network consists of 29 genes and 35 edges between these entities. Edge thickness distinguishes directed from undirected links among genes; the 5 edges drawn as thin lines are undirected links (mTOR-TRAF2, TRAF2-JNK, TRAF2-IKK, GLUT4-AMPK, and ADIPOR-AMPKK). Nodes with a high degree centrality are hub genes in the extended network. Specifically, IRS1/IRS showed the highest degree centrality (8), demonstrating that IRS1/IRS is a hub gene in the network. The genes connected directly to IRS1/IRS are INSR, PI3K, SOCS, IKK, JNK, PKCZ, PKCD/E, and GRB2. These 8 direct links were ranked within the top 10 in the node2vec results, showing that node2vec is an appropriate technique for extracting important relationships in the network, which is essential for pathway construction.
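Degree counting of this kind can be sketched as below. The edge list is only a partial one, assembled from gene pairs named in this article rather than the full 35-edge network, and edge direction is ignored, so the degrees of nodes other than IRS1/IRS are not meaningful; the variable names are hypothetical.

```python
from collections import Counter

# Partial edge list taken from pairs mentioned in the text (not the full network).
partial_edges = [
    ("IRS1/IRS", "INSR"), ("IRS1/IRS", "PI3K"), ("IRS1/IRS", "SOCS"),
    ("IRS1/IRS", "IKK"), ("IRS1/IRS", "JNK"), ("IRS1/IRS", "PKCZ"),
    ("IRS1/IRS", "PKCD/E"), ("IRS1/IRS", "GRB2"),
    ("INS", "INSR"), ("INSR", "SOCS"),
    ("mTOR", "TRAF2"), ("TRAF2", "JNK"), ("TRAF2", "IKK"),
    ("GLUT4", "AMPK"), ("ADIPOR", "AMPKK"), ("AMPKK", "AMPK"),
    ("PKCZ", "GLUT4"), ("GRB2", "SOS"), ("SOS", "Ras"),
]

# Degree of a node = number of edges touching it (direction ignored here).
degree = Counter()
for a, b in partial_edges:
    degree[a] += 1
    degree[b] += 1

hub_gene, hub_degree = degree.most_common(1)[0]
print(hub_gene, hub_degree)
```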

Moreover, the node2vec model was generated based on the adjacency matrix of biological components extracted from the text. Therefore, if entities and relationships are not extracted from the text and thus not included in the adjacency matrix, node2vec is unlikely to capture the relatedness of the two entities. For the direct relationships TNFR1-TRADD, TRADD-TNFR2, TNFR2-TRAF2, GRB2-SOS, and SOS-Ras, the similarity between the two genes in each pair was not captured by node2vec. This is because the number of abstracts including each pair of genes is zero, as shown in Table 6.

Table 6 Number of abstracts including direct gene pairs

Specifically, the 5 gene pairs in Table 6 do not appear in the abstracts but do appear in the full text. As such, using the full-text papers available in PMC would be helpful for constructing a more precise co-occurrence network, which could enhance the performance of node2vec.

Conclusions


In this study, we applied node2vec to extract relationships between biological entities in the extended type 2 diabetes pathway. We showed that node2vec successfully extracted a high percentage of the gene pairs belonging to the expanded pathway. Moreover, it outperformed existing techniques such as co-occurrence and DeepWalk. In addition, we demonstrated that node2vec captured the direction of flow between two genes, which is essential because reaction flow exists in biological pathways. Accordingly, node2vec was shown to be a suitable technique for extracting relationships between entity pairs in pathways.

However, we evaluated our results for path prediction against the existing relationships in the pathway. The relationships extracted using node2vec that have not yet been verified should therefore be examined further, which is the main theme of our follow-up study. Moreover, several direct paths in our extended pathway were not captured by node2vec. As described in the Discussion section, the two entity names belonging to each of these paths did not appear in abstracts but rather in full-text articles. Accordingly, using full-text articles would make it possible to construct a more precise co-occurrence network, which should ultimately increase node2vec performance in extracting relationships of biological pathways. This is another principal topic of our follow-up study.

Methods


To demonstrate the efficacy of node2vec for relation extraction, we selected ‘type 2 diabetes’ as a case study. The methodology used in this study is described below.

Data collection

Type 2 diabetes-related data were collected from PubMed, which contains over 26 million references to journal articles in the life sciences and biomedicine. We used the keyword ‘type 2 diabetes’ to retrieve all papers indexed with this search term. Only articles including the term in the title or abstract were collected. PubMed XML records were retrieved using the EFetch API [50]. In total, 99,689 records, published from 1978 to 2018, were collected. Finally, PMIDs, titles, and abstracts were extracted from the XML records and preprocessed for entity and relationship extraction.
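The record-processing step can be sketched with the Python standard library as follows. This is a minimal sketch, not the pipeline used in this study: the helper names `build_efetch_url` and `parse_records` are hypothetical, the sample record (including its PMID) is made up, and no network request is issued. The element names follow the PubMed XML layout (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(pmids):
    """Build an EFetch URL that requests PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EFETCH_BASE + "?" + urlencode(params)

def parse_records(xml_text):
    """Extract PMID, title, and abstract text from PubMed XML."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        title_el = article.find("MedlineCitation/Article/ArticleTitle")
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        # Structured abstracts contain several AbstractText sections.
        parts = [
            "".join(el.itertext()).strip()
            for el in article.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        abstract = " ".join(p for p in parts if p)
        records.append({"pmid": pmid, "title": title, "abstract": abstract})
    return records

# Made-up record used only to exercise the parser.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>A made-up title about type 2 diabetes</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Insulin binds INSR.</AbstractText>
          <AbstractText Label="RESULTS">SOCS inhibits IRS1.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

sample_records = parse_records(SAMPLE_XML)
print(sample_records[0]["pmid"], sample_records[0]["abstract"])
```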

Entity and relation extraction

For entity extraction, PKDE4J [14], a biomedical text mining tool, was utilized. With this tool, biomedical entities can be extracted using a dictionary, supervised learning, or both. In our experiment, a combined approach was used to extract biological entities. Specifically, candidate biological entities were identified using the Stanford NER model [51], and the candidates were then mapped to Unified Medical Language System (UMLS) concepts to decrease false positives. The UMLS is a vocabulary database of biomedical concepts and the relationships among them, developed by the National Library of Medicine. The biomedical concepts in the UMLS Metathesaurus are categorized into 143 semantic types [52], so semantic types can be selected to extract specific types of entities. In this study, semantic types matching Gene/Protein were used for gene extraction from biomedical text. These entity types are: Cell Component; Gene or Genome; Enzyme; Receptor; Nucleic Acid, Nucleoside, or Nucleotide; Amino Acid, Peptide, or Protein; Molecular Sequence; Nucleotide Sequence; and Amino Acid Sequence.

For relationship extraction, two biological components were linked when the entities were mentioned together in the same sentence. The assumption behind this approach is that entities that frequently co-occur in the same sentence are more likely to be related than those that merely occur together in the same abstract. A co-occurrence network, in which nodes represent biological entities and edges represent co-occurrence relationships, was constructed and used as the input for node2vec.
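Sentence-level co-occurrence counting can be sketched as below. The sketch assumes that entity extraction has already produced one list of entity names per sentence; the function names and the toy sentences are hypothetical. The second function ranks the partners of a gene by co-occurrence count, which mirrors the co-occurrence baseline used in the evaluation.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(sentences):
    """Count how often each unordered entity pair shares a sentence.

    `sentences` is an iterable of lists of entity names, one list per sentence.
    """
    edges = Counter()
    for entities in sentences:
        # Deduplicate and sort so each pair has one canonical key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges

def partners_ranked(edges, gene):
    """Rank the co-occurrence partners of `gene` by count, highest first."""
    partners = Counter()
    for (a, b), count in edges.items():
        if a == gene:
            partners[b] += count
        elif b == gene:
            partners[a] += count
    return partners.most_common()

# Toy input: entities found in four made-up sentences.
toy_sentences = [
    ["INS", "INSR"],
    ["INS", "INSR", "IRS1"],
    ["SOCS", "IRS1"],
    ["INS"],
]

toy_edges = build_cooccurrence(toy_sentences)
print(partners_ranked(toy_edges, "INS"))
```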

Node2vec for latent path prediction

Node2vec is “an algorithmic framework for learning continuous feature representations for nodes in the networks” [33]. It can be used for path prediction in a network by maximizing the probability of preserving the network neighborhoods of nodes via second-order random walks [33]. In networks, nodes exhibit homophily, structural equivalence, or both. Thus, node2vec employs biased random walks in which the return parameter p and the in-out parameter q adjust the walks to sample neighborhoods of nodes, leading to embeddings that correspond to structural equivalence, homophily, or both. Node2vec improves the random walk phase of DeepWalk [53], another feature learning technique for networks, by introducing hyperparameters that control the depth and breadth of the random walks. Many studies have shown that node2vec outperforms DeepWalk [54,55,56].

Specifically, in the random walk process, if the return parameter p is high (p > max(q, 1)), the walk is less inclined to revisit already visited nodes. In contrast, if p is low (p < min(q, 1)), the walk tends to backtrack, so the search is restricted to nearby nodes, which is essential for ascertaining structural equivalence [33]. For the in-out parameter q, if q is less than 1 (q < 1), we are more likely to sample nodes that are further away from the source node. Thus, “the sampled nodes more accurately reflect a macro-view of the neighborhood” [33], which is crucial for discovering communities/clusters based on homophily. Because the goal of our study was to identify nodes that are closely interconnected and belong to the same communities (homophily equivalence [33, 57]), we set p = 1 and q = 0.5.
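The role of p and q can be made concrete with a small sketch of the second-order transition rule from the node2vec paper [33]: the unnormalized weight of a step is 1/p when it returns to the previous node, 1 when it moves to a node adjacent to the previous node, and 1/q otherwise. The sketch assumes an unweighted, undirected toy graph (in a weighted network these factors would multiply the edge weights); the function names and the graph are hypothetical.

```python
import random

def transition_weights(graph, prev, curr, p, q):
    """Unnormalized weights for the next step from `curr`, given that the
    walk arrived at `curr` from `prev` (unweighted graph assumed)."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # step back to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0       # stays at distance 1 from prev
        else:
            weights[nxt] = 1.0 / q   # moves outward, distance 2 from prev
    return weights

def biased_walk(graph, start, length, p, q, rng):
    """Generate one biased random walk containing up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = sorted(graph[curr])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
        else:
            w = transition_weights(graph, walk[-2], curr, p, q)
            walk.append(rng.choices(neighbors, weights=[w[n] for n in neighbors])[0])
    return walk

# Hypothetical toy graph as adjacency sets.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}

# With p = 1 and q = 0.5, the outward step B -> D gets twice the weight.
toy_weights = transition_weights(toy_graph, prev="A", curr="B", p=1.0, q=0.5)
toy_walk = biased_walk(toy_graph, "A", length=10, p=1.0, q=0.5, rng=random.Random(0))
print(toy_weights)
```

With q below 1, the outward move receives the largest weight, which is the behavior that favors sampling nodes further from the source.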

In addition to the p and q parameters, we set the other node2vec parameters to d = 128, r = l = 10, and k = 10, where d, r, l, and k denote the embedding dimensions, walks per node, walk length, and context size, respectively. Parameter values were selected based on the parameter sensitivity analysis in the original paper [33] to obtain the best performance. Moreover, to compare node2vec with DeepWalk fairly, we used the same parameters for both methods.

The constructed co-occurrence network was used as input for node2vec and DeepWalk to learn a rich feature representation for every node in the network. Node2vec extends the Skip-gram architecture [58] to networks: it learns node embeddings by generating random walks and optimizing a network-based objective function using stochastic gradient descent (SGD).
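To illustrate how random walks feed a Skip-gram model, the sketch below turns a walk into (center, context) training pairs using a fixed symmetric window. This is a simplification under stated assumptions: the function name is hypothetical, and actual Skip-gram implementations may sample the window size dynamically or apply subsampling.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every node in `walk`, pairing
    each node with the nodes at most `window` positions away."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# A made-up walk over three nodes.
toy_pairs = skipgram_pairs(["A", "B", "C"], window=1)
print(toy_pairs)
```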

With the embedding matrix, the relatedness between each pair of biological entities (e1, e2) shown in the biological pathway was identified by computing the cosine similarity of their corresponding embedding vectors (ve1, ve2).
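For reference, the standard definition of cosine similarity, written for the two vectors named above, is:

```latex
\[
\cos\left(\mathbf{v}_{e_1}, \mathbf{v}_{e_2}\right)
  = \frac{\mathbf{v}_{e_1} \cdot \mathbf{v}_{e_2}}
         {\lVert \mathbf{v}_{e_1} \rVert \, \lVert \mathbf{v}_{e_2} \rVert}
\]
```

The value ranges from -1 to 1, with higher values indicating that the two entities are embedded closer together in direction.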

Performance evaluation

To evaluate node2vec performance for predicting relationships between biological components, pathway-based analysis was conducted. Specifically, the type 2 diabetes mellitus pathway sourced from the KEGG PATHWAY database was used for the evaluation task. The pathway map in the KEGG PATHWAY provides knowledge regarding diverse molecular networks composed of nodes such as orthologs, genes, small molecules, and their reactions and interactions [34]. As such, node2vec performance was evaluated based on the entity-entity relationships shown in the KEGG pathway map. Moreover, we compared node2vec results with those generated by other baseline methods such as co-occurrence and DeepWalk.



Abbreviations

UMLS: Unified Medical Language System


References


  1. Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, et al. A systems biology approach for pathway level analysis. Genome Res. 2007;17(10):1537–45.

  2. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

  3. Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, et al. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12(7):615.

  4. Huang M, Zhu X, Hao Y, Payan DG, Qu K, Li M. Discovering patterns to extract protein–protein interactions from full texts. Bioinformatics. 2004;20(18):3604–12.

  5. Kemper B, Matsuzaki T, Matsuoka Y, Tsuruoka Y, Kitano H, Ananiadou S, Tsujii JI. PathText: a text mining integrator for biological pathway visualizations. Bioinformatics. 2010;26(12):i374–81.

  6. Ananiadou S, Kell DB, Tsujii JI. Text mining and its potential applications in systems biology. Trends Biotechnol. 2006;24(12):571–9.

  7. Poon H, Toutanova K, Quirk C. Distant supervision for cancer pathway extraction from text. In: Pacific Symposium on Biocomputing Co-Chairs; 2014. p. 120–31.

  8. Sahu SK, Anand A, Oruganty K, Gattu M. Relation extraction from clinical texts using domain invariant convolutional neural network. arXiv preprint arXiv:1606.09370. 2016.

  9. Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, et al. CoPub mapper: mining MEDLINE based on search term co-publication. BMC Bioinforma. 2005;6(1):51.

  10. Tsuruoka Y, Tsujii JI, Ananiadou S. FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics. 2008;24(21):2559–60.

  11. Ono T, Hishigaki H, Tanigami A, Takagi T. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics. 2001;17(2):155–61.

  12. Fundel K, Küffner R, Zimmer R. RelEx—relation extraction using dependency parse trees. Bioinformatics. 2006;23(3):365–71.

  13. Bui QC, Campos D, van Mulligen E, Kors J. A fast rule-based approach for biomedical event extraction. In: Proceedings of the BioNLP shared task 2013 workshop. Association for Computational Linguistics; 2013. p. 104–8.

  14. Song M, Kim WC, Lee D, Heo GE, Kang KY. PKDE4J: entity and relation extraction for public knowledge discovery. J Biomed Inform. 2015;57:320–32.

  15. Ravikumar KE, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database. 2017;2017(1):baw156.

  16. Ciaramita M, Gangemi A, Ratsch E, Saric J, Rojas I. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: IJCAI; 2005. p. 659–64.

  17. Quan C, Wang M, Ren F. An unsupervised text mining method for relation extraction from biomedical literature. PLoS One. 2014;9(7):e102039.

  18. Miller S, Fox H, Ramshaw L, Weischedel R. A novel use of statistical parsing to extract information from text. In: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference. Association for Computational Linguistics; 2000. p. 226–33.

  19. Kambhatla N. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In: Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics; 2004. p. 22.

  20. Tongtep N, Theeramunkong T. A feature-based approach for relation extraction from Thai news documents. In: Pacific-Asia Workshop on Intelligence and Security Informatics. Berlin, Heidelberg: Springer; 2009. p. 149–54.

  21. Lamurias A, Clarke LA, Couto FM. Extracting microRNA-gene relations from biomedical literature using distant supervision. PLoS One. 2017;12(3):e0171929.

  22. GuoDong Z, Jian S, Jie Z, Min Z. Exploring various knowledge in relation extraction. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics; 2005. p. 427–34.

  23. Zhao S, Grishman R. Extracting relations with integrated information using kernel methods. In: Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics; 2005. p. 419–26.

  24. Xiao J, Su J, Zhou GD, Tan C. Protein-protein interaction extraction: a supervised learning approach. In: Proc Symp on Semantic Mining in Biomedicine; 2005. p. 51–9.

  25. Chen J, Ji D, Tan CL, Niu Z. Relation extraction using label propagation based semi-supervised learning. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2006. p. 129–36.

  26. Mintz M, Bills S, Snow R, Jurafsky D. Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics; 2009. p. 1003–11.

  27. Crandall DJ, Huttenlocher DP. Weakly supervised learning of part-based spatial models for visual object recognition. In: European Conference on Computer Vision. Berlin, Heidelberg: Springer; 2006. p. 16–29.

  28. Medlock B, Briscoe T. Weakly supervised learning for hedge classification in scientific literature. In: ACL; 2007. p. 992–9.

  29. Augenstein I, Maynard D, Ciravegna F. Distantly supervised web relation extraction for knowledge base population. Semantic Web. 2016;7(4):335–49.

  30. Zhou ZH. A brief introduction to weakly supervised learning. National Science Review. 2017.

  31. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.

  32. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.

  33. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2016. p. 855–64.

  34. 34.

    Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2009;38(suppl_1):D355–60.

  35.

    Taylor R. Insulin resistance and type 2 diabetes. Diabetes. 2012;61(4):778–9.

  36.

    Boucher J, Kleinridders A, Kahn CR. Insulin receptor signaling in normal and insulin-resistant states. Cold Spring Harb Perspect Biol. 2014;6(1):a009191.

  37.

    Björnholm, M., & Zierath, J. R. (2005). Insulin signal transduction in human skeletal muscle: identifying the defects in type II diabetes.

  38.

    Hale LJ, Coward RJ. Insulin signalling to the kidney in health and disease. Clin Sci. 2013;124(6):351–70.

  39.

    Hajiaghaalipour F, Khalilpourfarshbafi M, Arya A. Modulation of glucose transporter protein by dietary flavonoids in type 2 diabetes mellitus. Int J Biol Sci. 2015;11(5):508.

  40.

    Kim B, Feldman EL. Insulin resistance in the nervous system. Trends Endocrinol Metab. 2012;23(3):133–41.

  41.

    Thong FS, Dugani CB, Klip A. Turning signals on and off: GLUT4 traffic in the insulin-signaling highway. Physiology. 2005;20(4):271–84.

  42.

    Watson RT, Pessin JE. Bridging the GAP between insulin signaling and GLUT4 translocation. Trends Biochem Sci. 2006;31(4):215–22.

  43.

    Rønn SG, Billestrup N, Mandrup-Poulsen T. Diabetes and suppressors of cytokine signaling proteins. Diabetes. 2007;56(2):541–8.

  44.

    Starr, C., Evers, C., & Starr, L. (2010). Biology: concepts and applications without physiology. Cengage Learning.

  45.

    Pavlopoulos GA, Secrier M, Moschopoulos CN, Soldatos TG, Kossida S, Aerts J, et al. Using graph theory to analyze biological networks. BioData Min. 2011;4(1):10.

  46.

    Barabási AL, Gulbahce N, Loscalzo J. Network medicine: a network-based approach to human disease. Nat Rev Genet. 2011;12(1):56.

  47.

    Teschendorff AE, Banerji CR, Severini S, Kuehn R, Sollich P. Increased signaling entropy in cancer requires the scale-free property of protein interaction networks. Sci Rep. 2015;5:9646.

  48.

    Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard meta-analysis? PLoS One. 2013;8(4):e61505.

  49.

    Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. ICWSM. 2009;8:361–2.

  50.

    Sayers, E. (2009). Entrez programming utilities help. URL

  51.

    Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics (pp. 363–370). Association for Computational Linguistics.

  52.

    Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl_1), D267-D270.

  53.

    Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701–710). ACM.

  54.

    Pimentel, T., Veloso, A., & Ziviani, N. (2017). Unsupervised and scalable algorithm for learning node representations.

  55.

    Prokhorov, V., Pilehvar, M. T., Kartsaklis, D., Lió, P., & Collier, N. (2017). Learning Rare Word Representations using Semantic Bridging. arXiv preprint arXiv:1707.07554.

  56.

    Chen, H., Perozzi, B., Hu, Y., & Skiena, S. (2017). HARP: Hierarchical Representation Learning for Networks. arXiv preprint arXiv:1706.07845.

  57.

    Gu, W., Gong, L., Lou, X., & Zhang, J. (2017). The Hidden Flow Structure and Metric Space of Network Embedding Algorithms Based on Random Walks. arXiv preprint arXiv:1704.05743.

  58.

    Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).


This work was supported by the Bio-Synergy Research Project (NRF-2013M3A9C4078138) of the Ministry of Science, ICT, and Future Planning through the National Research Foundation.


Publication of this article was funded by the Bio-Synergy Research Project (NRF-2013M3A9C4078138) of the Ministry of Science, ICT, and Future Planning through the National Research Foundation.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 8, 2018: Proceedings of the 11th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 2017). The full contents of the supplement are available online at

Author information




MUK made considerable contributions to the design of the study, data collection, co-occurrence network construction, performance evaluation, analysis of the case study, and drafting of the manuscript. SHB participated in the design of the study and in the analysis and interpretation of data. MS was involved in the design and coordination of the study and in revising the manuscript for principal intellectual content. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Min Song.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Kim, M., Baek, S.H. & Song, M. Relation extraction for biological pathway construction using node2vec. BMC Bioinformatics 19, 206 (2018).


  • Biological pathways
  • Relation extraction
  • Pathway extraction