Mining a stroke knowledge graph from literature

Background: Stroke has an acute onset and a high mortality rate, making it one of the most fatal diseases worldwide. Its underlying biology and treatments have been widely studied in both “Western” biomedicine and Traditional Chinese Medicine (TCM). However, these two approaches are often studied and reported in isolation, both in the literature and in the associated databases.
Results: To aid research into effective prevention methods and treatments, we integrated knowledge from the literature and a number of databases (e.g. CID, TCMID, ETCM). We employed a suite of biomedical text mining (i.e. named-entity recognition) approaches to identify mentions of genes, diseases, drugs, chemicals, symptoms, Chinese herbs, patent medicines, etc. in a large set of stroke papers from both the biomedical and TCM domains. Then, combining a rule-based approach with a pre-trained BioBERT model, we extracted and classified links and relationships among stroke-related entities as expressed in the literature. We constructed StrokeKG, a knowledge graph that includes more than 46 k nodes of nine types and 157 k links of 30 types, connecting diseases, genes, symptoms, drugs, pathways, herbs, chemicals, ingredients and patent medicines.
Conclusions: StrokeKG can provide practical and reliable stroke-related knowledge to support stroke research, such as exploring new research directions and ideas for drug repurposing and discovery. We make StrokeKG freely available at http://114.115.208.144:7474/browser/ (please click "Connect" directly) and the source structured data for stroke at https://github.com/yangxi1016/Stroke
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04292-4.

particular in declining stroke mortality [2]. Western therapies such as drug injection and endovascular therapy [3], as well as traditional Chinese treatments such as herbal medicine and acupuncture [4], have contributed substantially to stroke prevention and post-stroke recovery. However, stroke remains one of the most fatal diseases worldwide (the second leading cause of death) [5] because of its acute onset, and recovery imposes an enormous economic burden on those who survive. There is therefore a need to further investigate potential pathogenic genes, risk factors, and aura symptoms of stroke in order to find efficient preventative and therapeutic approaches.
There are some existing structured knowledge sources focused on stroke [6-8]. Still, a large amount of stroke-related information is available in scientific articles. For example, a recent search for 'stroke' in PubMed resulted in over 327 K papers. In this study, we aim to develop a stroke-related knowledge base by combining information extracted from these scientific papers and existing knowledge bases. The large volume of texts requires automated and computational methods to extract useful information from these unstructured data to build structured databases.
Knowledge graphs (KGs) [9] are widely known as knowledge domain visualization or knowledge domain mapping graphs in the library and information industry [10]. They are often represented as a series of graphs that show the development of knowledge and its structural relationships. Visualization technology is used to describe, analyze, construct, and display knowledge and its inter-relationships [11]. Such representations can promote the understanding of relations between biomedical entities, which is vital for researchers to refine their research scope and improve personalized medicine. They also make it possible to discover new knowledge (e.g., new drugs [12] and effective prevention/treatment methods [13,14]). However, constructing a KG manually is laborious and time-consuming. Therefore, computational approaches have been used to support the automated or semi-automated construction of knowledge graphs in specific domains [15,16].
In this paper we introduce a stroke-related knowledge graph (StrokeKG) built by combining information extracted from these scientific papers with existing knowledge bases. In addition to biomedical entities, we also add entities from Traditional Chinese Medicine (TCM) [17]. TCM pays close attention to the medical characteristics of the human body as a whole system, which makes it a promising candidate for the treatment of stroke [18]. We use a suite of tools to extract genes, diseases, drugs, symptoms, Chinese herbal medicines, and other entities, and link them using relationship extraction methods. As a result, StrokeKG includes 46,983 nodes of 9 types and 157,302 relationships of 30 types, connecting diseases, genes, symptoms, drugs, pathways, Chinese Patent Medicines (CPMs), herbs, chemicals, and ingredients. In addition, we marked 265 CPM entities and 404 CPM-Disease relationships through verification against existing databases and manual annotation, in order to provide practical and accurate stroke-related knowledge. The graph can be used to facilitate our understanding of this complex disease, for example, by exploring precursor symptoms and sequelae of stroke, therapeutic drugs, and the pathways for treating related diseases.

Related work
In the field of biomedicine, knowledge bases (KBs) such as Gene Ontology [19], Disease Ontology [20], reference terms for national drug archives [21], and basic models of anatomy [22] have been prominent examples of efforts to provide structured knowledge systematically. Some of these KBs, e.g. OpenKG [23], BenevolentAI [24], and KnowLife [14], have made significant contributions to the development of the biomedical field; knowledge graphs have also been used for recent drug repositioning for COVID-19 [25], in SemaTyP [26], and in a protein-drug target KG [15]. Despite many efforts to provide more structured data, vast amounts of relevant knowledge are still hidden in the biomedical literature [27]. There are three main limitations to previous work on KB construction [9]. First, most biomedical KBs are manually constructed and curated, which prevents them from keeping up with the pace of novel discoveries. Second, potentially useful text sources such as health portals, online communities, or other sources of information are often ignored. Finally, most previous works focused on one molecular level or on chemical genomics, such as protein-protein interactions [28] or gene-drug relationships [29], or on highly specific topics such as drug effects.
Natural language processing tools are indispensable to extract useful information from biomedical literature [30]. We need to start with the named entity recognition process and then relationship extraction. Biomedical Named Entity Recognition (NER) [31] aims to identify specific biomedical concepts in the text. NER consists of two steps: (1) classifying specific substrings obtained from the text to determine whether it is the name of a specific type of entity; (2) selecting a standard name or a unique identifier for one kind of entity [32]. There are already many NER tools available for different types of biomedical entities, such as genes/proteins [33], diseases [34,35], species [36], mutations [37], chemicals [38,39] and biological pathways [40]. Still, many essential concept types such as RNAs, phenotypes, Chinese Patent Medicines (CPMs), and herbal medicines do not have corresponding NER tools.
The task of Relation Extraction (RE) has been a focus of research in recent years. Due to the inherent complexity of biomedical text, most relation extraction systems work at the sentence level. Common relationships include protein-protein interactions [28], drug-drug interactions [41], gene regulatory events [42], and associations between mutations and diseases [43]. Early systems used a co-occurrence approach [44], while pattern-based systems [45] rely on a set of manually or automatically collected patterns to extract relations and classify relation types between entities. Rule-based methods [46,47] use a set of procedures or heuristic algorithms to build rules that are either manually defined by domain experts or automatically generated from training data. Such rules add multiple constraints to scope specific relationships: for example, BioNLP'09 [48] focused on nine common molecular events. More recently, with improved accuracy and the expanded availability of curated corpora, deep learning models have become widely used in natural language processing. Convolutional Neural Networks (CNN) [49], Recurrent Neural Networks (RNN) [50], Long Short-Term Memory networks (LSTM) [41], Capsule Networks (CapsNet) [51], Graph Neural Networks [52,53], and BERT [54] are prevalent models employed in relation extraction and have contributed greatly to biomedical text mining.
In the field of traditional Chinese medicine, manually curated TCM databases such as TCMID [55] and TCM-MESH [56], and the Chinese medicine network pharmacology platforms ETCM [57] and TCMSP [58], bring convenience to research on Chinese medicine. However, to the best of our knowledge, there is no text mining tool specifically for traditional Chinese medicine, and existing stroke-related knowledge sources are undisclosed or incomplete [6][7][8]. Therefore, in this research, we extend the application of text mining to the construction of Chinese medicine knowledge and, based on this and the state of the art, construct a stroke-related knowledge graph.

Methods
In this work, we designed a computational workflow to mine the stroke-related and TCM-related literature for biomedical entities and the relations between them. We split the stroke-related abstracts into 463,225 sentences, and the analysis pipeline tags mentions of the following entities: drugs, chemicals, genes, pathways, and diseases, as well as traditional Chinese treatments such as herbs, Chinese Patent Medicines (CPMs), and ingredients. To enlarge the Chinese medicine data set on stroke-related diseases, we then split the TCM-related abstracts into sentences to extract diseases, CPMs and herbs. We then use several approaches to extract relations between entities. After verifying and cleaning the results, we use Neo4j to construct StrokeKG.
The steps of our workflow are explained below (Fig. 1).

Data source
A PubMed search for "Stroke AND treatment OR gene OR Herbs OR TCM" returned 45,080 stroke-related abstracts, and a search for "Traditional Chinese Medicine" returned 72,410 TCM-related abstracts; we used both sets as the dataset for information extraction. In addition, manually created databases and annotated corpora are also main sources of our knowledge graph data: the drug-disease relation resources CDR [59] and CTD [60], the gene-disease relation corpus EU-ADR [42], and the TCM databases TCMID [55], ETCM [57], and TCMSP [58]. Table 1 details the data sources of our research.

Pre-processing
We re-formatted the PubMed abstracts into the PubTator [63] format so that the data match the input expected by the NER tools, and then split the text into sentences using NLTK [64].
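The pre-processing step can be illustrated with a minimal sketch, assuming each record is a (PMID, title, abstract) triple. The `PMID|t|title` and `PMID|a|abstract` lines follow the PubTator layout; the regular-expression splitter is only a dependency-free stand-in for the NLTK sentence tokenizer used in this work, and the function names and the sample record are hypothetical.

```python
import re

def to_pubtator(pmid, title, abstract):
    """Render one record as the two PubTator text lines (title, then abstract)."""
    return f"{pmid}|t|{title}\n{pmid}|a|{abstract}\n"

# Naive boundary rule: sentence-final punctuation, whitespace, then a capital letter.
# Illustrative stand-in for NLTK's sentence tokenizer; it will mis-split abbreviations.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split a block of text into sentences with the naive rule above."""
    return [s.strip() for s in _SENT_BOUNDARY.split(text) if s.strip()]

# Hypothetical record, used only to show the data flow.
record = to_pubtator("12345", "Aspirin in stroke.",
                     "Aspirin reduces risk. It is widely used.")
abstract_text = record.splitlines()[1].split("|", 2)[2]
sentences = split_sentences(abstract_text)
```

In a real pipeline the annotation lines produced by the NER tools would follow the two text lines of each record.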

Named entity recognition
We extract mentions of nine named-entity types (diseases, drugs, genes, symptoms, pathways, Chinese Patent Medicines (CPMs), herbs, chemicals, and ingredients). We use state-of-the-art NER methods, including DNorm [34] to extract and normalize disease mentions, tmChem [38] as a chemical named entity identifier, GNormPlus [33] to handle both gene mention and identifier detection, and PWTEES [40] to identify pathways. We used a pre-trained BiLSTM-CRF [65] model together with the Plant-disease corpus [62] to build an NER classifier that identifies herbs. The lack of annotated corpora poses a considerable challenge to using deep learning methods to build the other NER tools needed for our study. We have therefore developed dictionary- and rule-based methods for the other entity types. The rule-based tool PKDE4J [46], built on a modified Stanford CoreNLP pipeline, was used to extract entities based on drug dictionaries. Symptoms and ingredients are recognized with a dictionary-based method: we collected terms from the downloaded CPM database [55] and ingredient database [56], constructed a symptoms dictionary, and inserted these dictionaries into the PKDE4J model. To avoid including entities that occur only by chance, we apply a threshold on the number of occurrences of each entity: when an entity occurs fewer than 3 times, we manually determine whether it is related to stroke.
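The dictionary-based recognition and the occurrence threshold described above can be sketched as follows. This is a simplified illustration, not the PKDE4J implementation: it assumes a lexicon that maps lower-cased surface forms to canonical names, matches whole words case-insensitively, and routes entities seen fewer than 3 times to manual review. The function names and the toy lexicon are hypothetical.

```python
import re
from collections import Counter

def dictionary_ner(sentences, lexicon):
    """Return (sentence_index, start, end, canonical_name) for every lexicon hit.

    `lexicon` maps lower-cased surface forms to canonical entity names.
    """
    # Longest terms first so multi-word names win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for idx, sentence in enumerate(sentences):
        for m in pattern.finditer(sentence):
            hits.append((idx, m.start(), m.end(), lexicon[m.group(0).lower()]))
    return hits

def split_by_frequency(hits, min_count=3):
    """Accept entities seen at least `min_count` times; flag the rest for manual review."""
    counts = Counter(hit[3] for hit in hits)
    accepted = {name for name, c in counts.items() if c >= min_count}
    needs_review = {name for name, c in counts.items() if c < min_count}
    return accepted, needs_review

# Toy symptom lexicon and sentences (hypothetical data).
lexicon = {"headache": "Headache", "dizziness": "Dizziness"}
sentences = [
    "Headache preceded the stroke.",
    "Patients reported headache and dizziness.",
    "Severe headache was common.",
]
hits = dictionary_ner(sentences, lexicon)
accepted, needs_review = split_by_frequency(hits)
```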

Relation extraction
We focus on eleven relationship types as specified in Table 2. These have been taken from the existing databases and from existing corpora (see below).
The relation extraction process is shown in Fig. 2. We first use a simple co-occurrence method. When two entities appear in the same sentence, we consider that there is a particular relationship between them. Secondly, a rule-based method has been used to extract 'evidence' for the relationship between two entities. Finally, we developed a machine-learning model to further classify relation types according to existing databases or corpora.
• Co-occurrence extraction We use NLTK [64] to segment the text into sentences and locate each entity in a sentence according to the entity positions determined by the multiple NER models (see Fig. 2①).
• Rule-based approach We used PKDE4J [46] to create a dependency tree containing syntactic and grammatical structures. We rely on standard features and structures in sentences that may represent relationships, and extract the keywords that may express the relationship between two entities identified via co-occurrence (Fig. 2②). We then designed a set of matching rules to classify these keywords into eleven relationship types (e.g., positive association; therapeutic; induce; etc.) between specific pairs of entities (e.g., Gene-Disease; Herb-Chemical) as specified in the existing biomedical databases (e.g., TCMID [55], CTD database [60]) (as shown in Table 1).

• Extracting relations with BioBERT
We chose BioBERT [54] as the pre-trained model, since it shares potential latent features with our data, having been pre-trained on biomedical corpora. Following the parameter configuration of BioBERT, we use the gold standard data sets [42,60,61] as the training sets and our co-occurrence results as the test set, and select the output of the 20th epoch as the final result of our relation extraction. The corpora we use for relationship extraction are listed in Table 2.
The co-occurrence method establishes that two entities appear in the same sentence, which indicates a possible relationship between them. Rule-based methods can classify entity relationships well when keywords can be extracted. When keywords cannot be extracted, we use BioBERT's classification results, which can classify all relationships but depend heavily on the richness of the corpus and the accuracy of the model.
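The three-stage cascade can be illustrated with a minimal sketch. It is a simplification under stated assumptions: keyword rules are matched on word prefixes rather than on a dependency tree as PKDE4J does, the rule table is a hypothetical three-entry example using relation names mentioned above, and `fallback` stands in for a trained classifier such as BioBERT.

```python
import re
from itertools import combinations

# Hypothetical keyword rules: word prefix -> relation type.
KEYWORD_RULES = {
    "treat": "therapeutic",
    "induce": "induce",
    "associat": "positive association",
}

def cooccurring_pairs(entities):
    """All unordered pairs of distinct entities tagged in one sentence."""
    return list(combinations(sorted(set(entities)), 2))

def rule_based_type(sentence):
    """Return (keyword, relation_type) for the first word matching a rule, else None."""
    for token in re.findall(r"[a-z]+", sentence.lower()):
        for prefix, rel_type in KEYWORD_RULES.items():
            if token.startswith(prefix):
                return token, rel_type
    return None

def classify(sentence, entities, fallback):
    """Cascade: co-occurrence -> keyword rule -> fallback classifier."""
    results = []
    for head, tail in cooccurring_pairs(entities):
        match = rule_based_type(sentence)
        if match is not None:
            keyword, rel_type = match
        else:
            # No keyword found: defer to the (stand-in) learned model.
            keyword, rel_type = None, fallback(sentence, head, tail)
        results.append((head, tail, keyword, rel_type))
    return results

def dummy_model(sentence, head, tail):
    """Placeholder for a trained classifier; always answers 'other'."""
    return "other"

with_keyword = classify("Aspirin treats ischemic stroke.", ["aspirin", "stroke"], dummy_model)
without_keyword = classify("Aspirin and stroke were studied.", ["aspirin", "stroke"], dummy_model)
```

Keeping the extracted keyword alongside the relation type mirrors the idea of storing the textual evidence for each relation.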
Because an entity pair may appear in different sentences, the classification results may differ. To find all relations between a pair of entities, we calculated the confidence that the pair is related by a particular relation over all the sentences in which that pair co-occurs. We keep only those relationships whose confidence is greater than the threshold, in order to eliminate noisy relationships that occur by chance. Afterward, we analyze the final relationship results for the entity.
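The text does not give the exact confidence formula, so the following sketch makes an explicit assumption: the confidence of a relation type for an entity pair is the fraction of the pair's co-occurring sentences that were classified with that type. The threshold value of 0.5 and the function name are likewise hypothetical.

```python
from collections import Counter, defaultdict

def aggregate(predictions, threshold=0.5):
    """Aggregate sentence-level predictions into pair-level relations.

    `predictions` is an iterable of (head, tail, relation_type), one per sentence.
    Assumed confidence: share of the pair's sentences carrying that relation type.
    Returns {(head, tail, relation_type): confidence} for confidence > threshold.
    """
    per_pair = defaultdict(Counter)
    for head, tail, rel_type in predictions:
        per_pair[(head, tail)][rel_type] += 1

    kept = {}
    for (head, tail), counts in per_pair.items():
        total = sum(counts.values())
        for rel_type, n in counts.items():
            confidence = n / total
            if confidence > threshold:
                kept[(head, tail, rel_type)] = confidence
    return kept

# Toy input: four sentences mention the same pair (hypothetical data).
predictions = [("aspirin", "stroke", "therapeutic")] * 3 + [("aspirin", "stroke", "induce")]
kept = aggregate(predictions)
```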

Entity annotation
To verify the effectiveness of our tool for mining entities related to Chinese herbal medicine, we annotated herbs and Chinese Patent Medicines in 450 TCM-related abstracts. We regard the mentions mined by the tool as pre-annotations. Therefore, using the vocabulary provided by TCMID [55] and ETCM [57], we only need to correct the wrong annotations and add annotations for undetected entities, instead of annotating entities from scratch.
The definition of the target entities we are concerned with is as follows: Chinese Patent medicine: including clinical prescription, TCM formulas and CPM.

Relation annotation
In the relation annotation task, we only considered two relations between entities. For each relationship, we classified its type based on the two annotation guidelines. Whenever two target entities appear in the same sentence, we label the relationship between them. Chinese patent medicine-disease: this relation indicates that the medicine treats or induces the disease. Following the Plant-disease corpus, the relationships are divided into 3 categories: treatment, cause and others.

Evaluation of text-mined results
NER and RE were evaluated by comparing the extracted results with existing databases or manually annotated corpora.
For the TCM-related NER tools, we first check whether the results we extracted overlap with the existing databases. Secondly, for CPM entities, we compare the results of the dictionary-based tool with our manual annotations.
For the relation extraction results, we likewise check the overlap between the entity pairs we extracted and the existing databases, and then calculate the correct rate (CR) of relationship classification on the overlapping pairs.
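The two kinds of measurement can be sketched as follows, assuming entities are compared as sets and relations are stored as a mapping from an entity pair to its relation type. Precision, recall and F1 follow their standard definitions; the correct rate is computed only over pairs present in both the mined results and the reference database. The function names and toy data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall and F1."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def correct_rate(mined, reference):
    """Return (overlap_size, CR) where CR is the share of overlapping pairs
    whose mined relation type equals the reference relation type."""
    overlap = mined.keys() & reference.keys()
    if not overlap:
        return 0, 0.0
    correct = sum(mined[pair] == reference[pair] for pair in overlap)
    return len(overlap), correct / len(overlap)

# Toy data (hypothetical).
p, r, f1 = precision_recall_f1({"A", "B", "C"}, {"B", "C", "D", "E"})
mined = {("x", "y"): "treatment", ("x", "z"): "cause", ("q", "r"): "treatment"}
reference = {("x", "y"): "treatment", ("x", "z"): "treatment"}
overlap_size, cr = correct_rate(mined, reference)
```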

Knowledge graph construction
A knowledge graph is a compelling visual representation of entities and relationships. Entities and relationships are embedded in the knowledge graph to carry information, and such embeddings are widely used in learning tasks to accelerate knowledge graph completion and recommendation. By mapping the stroke-related entities from our results and from existing data sources (TCMID [55], CDR [66], CTD [60], TCMSP [58] and ETCM [57]) into a common ID space, we can combine these triplets into one single dataset and construct a comprehensive stroke-related knowledge graph suitable for drug repurposing.
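The mapping into a common ID space can be illustrated with a short sketch. It assumes a lookup table from lower-cased entity names to shared identifiers and keeps track of which sources support each merged triplet. The function name, the toy synonym table and the placeholder identifier `DIS:stroke` are hypothetical; `DB00945` is the DrugBank identifier for aspirin quoted later in this paper.

```python
def merge_triples(sources, id_map):
    """Merge triplets from several sources after mapping names to common IDs.

    `sources`: {source_name: [(head_name, relation, tail_name), ...]}
    `id_map`:  {lower-cased name: common ID}
    Returns {(head_id, relation, tail_id): set of supporting source names}.
    Triplets with an unmapped entity are skipped.
    """
    merged = {}
    for source, triples in sources.items():
        for head, relation, tail in triples:
            head_id = id_map.get(head.lower())
            tail_id = id_map.get(tail.lower())
            if head_id is None or tail_id is None:
                continue
            merged.setdefault((head_id, relation, tail_id), set()).add(source)
    return merged

# Toy data: two sources name the same drug differently.
id_map = {
    "aspirin": "DB00945",
    "acetylsalicylic acid": "DB00945",
    "stroke": "DIS:stroke",  # placeholder ID, not a real database identifier
}
sources = {
    "text_mining": [("Aspirin", "therapeutic", "stroke"), ("unknown herb", "therapeutic", "stroke")],
    "CTD": [("acetylsalicylic acid", "therapeutic", "Stroke")],
}
merged = merge_triples(sources, id_map)
```

Recording the supporting sources per triplet makes it straightforward to separate text-mined relations from those confirmed by an existing database.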

Results statistics
The results mainly include the entities and relations we mined. The summary statistics and the specific results for drugs, chemicals, symptoms, pathways, etc. are shown in Table 3 and at https://github.com/yangxi1016/Stroke. Relation extraction statistics are given in Table 4.

Evaluation for NER
Compared with our manually labeled CPM results, the recall, precision and F1-score of the rule-based CPM NER are shown in Table 5.
The low recall is mainly due to missing abbreviations (e.g., CY-Tang for Chungsim-Yeunja-Tang) and to different spellings of TCM names caused by different pronunciations (for example, Hwangryun-Hae-Dok-tang and Huanglian-Jie-Du-Tang).

Compare with existing database
To assess the validity of the literature-derived knowledge, we compared the stroke-related CPMs, herbs, and genes in StrokeKG with those in two Chinese medicine pharmacology knowledge bases, ETCM and TCMSP. Figure 3 shows the result of the comparison with ETCM [57] and TCMSP [58].
Our entity recognition results partially overlap with the existing databases, which indicates that our entity recognition is effective. More importantly, we have unearthed many stroke-related entities that do not exist in these databases, which provides a new direction for future research.
Fig. 3 The comparison of NER results with ETCM [57] and TCMSP [58]

Evaluation for RE
The recall, precision and F1-score of the CPM-Disease RE, measured against our manually labeled CPM-Disease relations, are shown in Table 6. For some relationship pairs, the model cannot judge whether the relation is a Treatment or a Cause and classifies it as Other, which accounts for most of the errors.
As shown in Fig. 4a and Table 7, our mining results include 190 pairs of CPM-Herbs, 4 pairs of CPM-Ingredients, and 515 pairs of Herbs-Ingredients. Compared with the existing TCMID database (only the CPM component and herb component tables), 275 pairs of relationships overlap, and the correct rate of the relationship classification on them is 91.42%. Secondly, our mining results include 404 pairs of CPM-Disease (with 704 CPM for stroke-related disease), which we compared with TCMID (only checking whether the herbal medicine has a therapeutic effect on the disease); the correct rate is 84.37%. The correct rates for gene-disease and drug-disease relations are 90.47% and 88.86%, respectively.
The classification results for the overlapping pairs are examined in Table 8. Through detailed analysis, we found that our relation extraction method can accurately extract two entities in the same sentence, but errors occur in the classification of the relations. The main reason is the inability to identify keywords during relation extraction.
At the same time, another purpose of constructing our knowledge graph is to extract, from the vast ocean of data, knowledge that may be useful but is not included in existing data sets. To this end, we compared the size of our data set with existing common biomedical knowledge bases and proposed possible new directions for clinical medical research.

StrokeKG
StrokeKG (http://114.115.208.144:7474/browser/) contains a total of 46,983 entities belonging to K = 9 entity types, and a total of 157,302 triplets belonging to R = 30 edge types with 659,838 properties. Part of the results is shown in Fig. 5a: entities are used as graph nodes, and each node contains the entity ID, entity name, and standard classification (MeSH). As shown in Fig. 5b, the PMID of the article in which two entities co-occur is recorded on the edge between them. In particular, the edge also contains the keyword (RelationKeyword) extracted by PKDE4J and the relationship classification result (RelationType) based on the BERT model.
To enhance the reliability of our knowledge graph, we also marked 32,031 nodes of 9 types and 4,800 relationships of 16 types as reliable, with evidence taken from the entirely correct part of the evaluation results and from the information in the existing databases.
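The node and edge layout described in this section could be loaded into Neo4j with Cypher statements along the following lines. This is a sketch under assumptions, not the loader used for StrokeKG: the relationship label `CO_OCCURS`, the property names `id`, `name`, `mesh` and `pmid`, and the function name are hypothetical, while `RelationKeyword` and `RelationType` are the edge properties named above. The function only builds a parameterised query and does not contact a database.

```python
import re

# Node labels cannot be passed as Cypher parameters, so they are validated
# before being interpolated into the query text.
_LABEL = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def edge_to_cypher(head, tail, edge):
    """Build a parameterised Cypher statement for one edge and its two nodes."""
    for label in (head["type"], tail["type"]):
        if not _LABEL.match(label):
            raise ValueError(f"unsafe node label: {label!r}")
    query = (
        f"MERGE (h:{head['type']} {{id: $head_id}}) "
        "SET h.name = $head_name, h.mesh = $head_mesh "
        f"MERGE (t:{tail['type']} {{id: $tail_id}}) "
        "SET t.name = $tail_name, t.mesh = $tail_mesh "
        "MERGE (h)-[r:CO_OCCURS {pmid: $pmid}]->(t) "
        "SET r.RelationKeyword = $keyword, r.RelationType = $rel_type"
    )
    params = {
        "head_id": head["id"], "head_name": head["name"], "head_mesh": head["mesh"],
        "tail_id": tail["id"], "tail_name": tail["name"], "tail_mesh": tail["mesh"],
        "pmid": edge["pmid"], "keyword": edge["keyword"], "rel_type": edge["type"],
    }
    return query, params

# Toy nodes and edge; IDs and classifications are placeholders.
drug = {"type": "Drug", "id": "DRUG:1", "name": "aspirin", "mesh": "MESH:placeholder-1"}
disease = {"type": "Disease", "id": "DIS:1", "name": "stroke", "mesh": "MESH:placeholder-2"}
edge = {"pmid": "00000000", "keyword": "treats", "type": "therapeutic"}
query, params = edge_to_cypher(drug, disease, edge)
```

Using `MERGE` rather than `CREATE` keeps node insertion idempotent when the same entity appears in many triplets.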

Stroke-related gene and relation between stroke-genes and stroke-related disease-genes
Gene mutations are related to the incidence of stroke. Through relation extraction on disease-gene pairs, we found 5953 distinct genes (180,280 mentions). We linked 1238  (1177), nucleotide (1144), the impact of these main compounds on human stroke and related diseases is of the greatest concern to the medical community. Using the drug list provided by DrugBank, we normalized and classified 2156 kinds of drug entities. Apart from the individual chemical elements counted among the chemicals, the drugs with the greatest impact on stroke are aspirin (DB00945, 1475), warfarin (DB00682, 1034), and clopidogrel (DB00758, 666).

TCM for treating stroke and stroke-related disease
We have identified 294 Chinese patent medicines that play a role in the prevention and treatment of stroke and related diseases. According to our mining results, GUALOUGUIZHI DECOCTION (10), KUDIEZI INJECTION (10), DANHONG INJECTION (20), and BUYANGHUANWU DECOCTION (36) are potent medicines for treating stroke. We also extracted 420 species of herbs (11,671 mentions). Dan-Shen (58), Chuan-Xiong (50), Dang-Gui (23), Huang-Lian (21), and Bai-Fu-Zi (19) appear in various Chinese patent medicines or prescriptions for the treatment of stroke and related diseases. In ingredient extraction, besides common ingredients such as glucose (1947), cholesterol (1394), glutamate (767), and dopamine (478), ingredients unique to Chinese herbal medicine such as Hyperin (265) and catechol (207) are important for treating stroke-related diseases.

Pathways
In our results, a total of 105,337 pathway mentions were identified. In the subsequent relation extraction process, we use these results to analyze what molecular role the chemical in a medicine or a herbal ingredient plays in the disease, and to identify the key genes and pathways involved in stroke-related diseases.
For example, the ERK1/2 activity generated by cytokines and free radicals or other inflammatory factors after stroke may worsen ischemic damage, whereas the ERK1/2 activity produced by exogenous growth factors, estrogen, and preconditioning favors neuroprotection.

Discover possible existing CPM to treat stroke
StrokeKG can be used to discover existing drugs, CPMs, or herbs that might treat stroke-related diseases and thereby reduce the risk of stroke. Such a task can be expressed as direct link prediction between a drug and a disease entity, or indirectly as a link between any pair of biological entities involved in a particular pathway. For example, in the article with PMID 31348992, intersection analysis between DZXXI's putative targets and ischemic stroke-associated genes identified two important targets (PTGS1, PTGS2) (Fig. 6).
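The intersection analysis in this example can be sketched as a simple set operation over gene links. This is an illustration rather than a link prediction model: candidates are ranked by the number of genes they share with the disease. Only the DZXXI links to PTGS1 and PTGS2 come from the example above; the other entities (`GENE_X`, `GENE_Y`, `CPM_B`) are hypothetical placeholders, as are the function names.

```python
from collections import defaultdict

def build_gene_index(edges):
    """Turn (entity, gene) links into {entity: set of genes}."""
    index = defaultdict(set)
    for entity, gene in edges:
        index[entity].add(gene)
    return index

def rank_candidates(index, disease, candidates):
    """Rank candidate medicines by the genes they share with the disease."""
    disease_genes = index.get(disease, set())
    scored = []
    for candidate in candidates:
        shared = sorted(index.get(candidate, set()) & disease_genes)
        if shared:
            scored.append((candidate, shared))
    # Most shared genes first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-len(item[1]), item[0]))

# Toy links; GENE_X, GENE_Y and CPM_B are placeholders.
edges = [
    ("DZXXI", "PTGS1"), ("DZXXI", "PTGS2"), ("DZXXI", "GENE_X"),
    ("CPM_B", "GENE_Y"),
    ("ischemic stroke", "PTGS1"), ("ischemic stroke", "PTGS2"),
    ("ischemic stroke", "GENE_Y"),
]
index = build_gene_index(edges)
ranking = rank_candidates(index, "ischemic stroke", ["CPM_B", "DZXXI", "CPM_C"])
```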

Conclusions
In this study, we analyzed stroke-related literature with natural language processing, including named entity recognition and relation extraction. We showed that state-of-the-art text mining tools can efficiently extract the critical information hidden in unstructured data in the biomedical domain.
Through the knowledge base and knowledge graph, we gain a clearer understanding of stroke-related diseases, symptoms, the gene mutations that cause stroke, and the vital role of Chinese and Western medicine in preventing and treating stroke. We constructed StrokeKG, which successfully represents the relations among stroke-related entities.