Automatic extraction of ranked SNP-phenotype associations from text using a BERT-LSTM-based method

Bokharaeian, Behrouz; Dehghani, Mohammad; Diaz, Alberto

doi:10.1186/s12859-023-05236-w

Research
Open access
Published: 12 April 2023

Behrouz Bokharaeian¹,
Mohammad Dehghani² &
Alberto Diaz³

BMC Bioinformatics volume 24, Article number: 144 (2023)

Abstract

Extraction of associations between single nucleotide polymorphisms (SNPs) and phenotypes from biomedical literature is a vital task in BioNLP. Recently, some methods have been developed to extract mutation-disease associations. However, no available method for extracting SNP-phenotype associations from text considers their degree of certainty. In this paper, several machine learning methods were developed to extract ranked SNP-phenotype associations from biomedical abstracts and were then compared with each other. The methods developed in this study comprise shallow machine learning methods (random forest, logistic regression, and decision tree), two kernel-based methods (subtree and local context), a rule-based method, a deep CNN-LSTM-based method, and two BERT-based methods. The experiments indicated that although the linguistic features used could be employed to implement an association extraction method that outperforms its kernel-based counterparts, the deep learning and BERT-based methods exhibited the best performance, and PubMedBERT-LSTM outperformed all the other developed methods. Similar experiments were conducted to estimate the degree of certainty of the extracted associations, which can be used to assess the strength of a reported association. The experiments revealed that our proposed PubMedBERT–CNN-LSTM method outperformed the sophisticated methods on the task.

Introduction

A single-nucleotide polymorphism (SNP) is a single-base mutation at the DNA level [1]. Variations in DNA sequences can affect how humans develop diseases and respond to pathogens, chemicals, drugs, and other agents. A genome-wide association (GWA) study is an observational study of a set of genome-wide genetic variants in different individuals to determine whether any variant is associated with a trait such as a major human disease. The first successful GWA study dates back to 2005, when Klein et al. studied patients with age-related macular degeneration. It was the beginning of a worldwide trend that has led to the discovery of thousands of SNP associations. Figure 1 depicts the increasing number of papers published in the field from 2004 to 2020, obtained from the PubMed search engine for the query “Single Nucleotide Polymorphisms” (performed in November 2021). SNPs are also crucial for personalized medicine.

A phenotype is an organism's recognizable characteristics or traits, such as its development, biochemical or physiological properties, behavior, and products of behavior [2]. An SNP can "relate" to a phenotype when a specific variant (one allele) is more frequent within samples obtained from subjects who exhibit that phenotype. The degree to which a phenotype is shaped by the environment rather than determined by the genotype is referred to as "phenotypic plasticity" [3].

Every individual carries genetic instructions for growth and development; however, environmental factors influence an individual’s phenotype during embryonic growth and throughout life. How much environmental factors contribute to an individual’s ultimate phenotype is a matter of serious scientific debate. Environmental influences include nutrition, weather, disease, and stress level. For example, the ability to taste food is a phenotype estimated to be 85% determined by genetic inheritance [4], yet it can also be affected by environmental factors such as a dry mouth or recently eaten food. Phenotypic plasticity is considered high if environmental factors have a strong influence. Conversely, if phenotypic plasticity is low, the genotype can be used to predict phenotypes reliably. The large amount of data generated from these studies necessitates an automatic approach to facilitate the study of extracted associations.

Recently, a few methods have been developed to extract mutation-disease associations from text, such as [5] and [6]. Owing to the importance of the task, the authors produced the SNPPhenA corpus, which can be used for benchmarking purposes [7]. Figure 2 presents two sample associations between two SNPs (highlighted in blue) and a phenotype (PPA, highlighted in green).

The procedure for producing the corpus consisted of gathering the relevant abstracts, performing named-entity recognition, and annotating the associations, negation, modality markers, and degree of certainty of the associations.

Identification of negation in text is one of the essential tasks in biomedical text mining. Linguists define negation as a morphosyntactic operation [8] through which a lexical item either denies or inverts the meaning of another item or construction. The importance of negation in biomedical text mining becomes clear when we consider that negation is commonplace in such texts and that ignoring it reduces the precision of automatic information retrieval systems [9]. For example, in the sentence below, there is no association between "APOE polymorphisms" and "serum HDL-C"; however, if the negation is neglected, a wrong association might be identified:

There were <{no} associations between APOE polymorphisms and serum HDL-C, APO-CIII, and triglycerides>
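The effect of neglecting negation on the example sentence can be illustrated with a minimal Python sketch. It is not the method developed in this paper: the cue list, the function names, and the assumption that a cue's scope runs from the cue to the end of the sentence are all hypothetical simplifications, whereas the corpus annotates cues and scopes manually.

```python
import re

# Hypothetical cue list for illustration only; not taken from the corpus.
NEGATION_CUES = ("no", "not", "neither", "nor", "without")


def cooccurrence_label(sentence, entity_a, entity_b):
    """Naive baseline: a sentence mentioning both entities is 'positive'."""
    text = sentence.lower()
    both = entity_a.lower() in text and entity_b.lower() in text
    return "positive" if both else "none"


def negation_aware_label(sentence, entity_a, entity_b):
    """Return 'negative' when both entities fall inside a negation scope.

    Assumption: the scope starts at the first cue and ends with the sentence.
    """
    if cooccurrence_label(sentence, entity_a, entity_b) == "none":
        return "none"
    text = sentence.lower()
    cue = re.search(r"\b(" + "|".join(NEGATION_CUES) + r")\b", text)
    if cue is None:
        return "positive"
    scope = text[cue.end():]
    in_scope = entity_a.lower() in scope and entity_b.lower() in scope
    return "negative" if in_scope else "positive"


sentence = ("There were no associations between APOE polymorphisms and "
            "serum HDL-C, APO-CIII, and triglycerides")
naive = cooccurrence_label(sentence, "APOE polymorphisms", "serum HDL-C")
aware = negation_aware_label(sentence, "APOE polymorphisms", "serum HDL-C")
```

Under these assumptions, the co-occurrence baseline labels the pair as positive, while the negation-aware variant labels it as negative, which is the reading intended by the sentence.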

Linguistic modality is another linguistic phenomenon applied in this research. In general, modals are particular words that state modality and express the speaker’s internal attitudes and beliefs, such as facility, probability, inevitability, commitment, permissibility, capability, wish, and contingency [10]. In the current study, the author’s confidence in a sentence is determined in order to indicate the strength of the SNP-phenotype associations stated in the corpus.

Although many machine learning methods have been used to extract biomedical relations from text, recent advances in biomedical text mining have come from deep learning models [11, 12]. Nevertheless, the direct use of sophisticated natural language processing (NLP) methodologies to extract biomedical relations has some limitations. In particular, a biomedical text mining model often runs into problems when its word representations were learned from general corpora. Therefore, recent biomedical text mining models rely primarily on adapted versions of word representations, such as SciBERT [13] for scientific texts and PubMedBERT [14] for biomedical texts.

In this study, the authors develop and compare some common machine learning techniques, along with some deep learning-based approaches that extract associations between SNPs and phenotypes. The rest of this paper is organized as follows. Section "SNPPhenA Corpus" discusses some of the fundamental characteristics of the SNPPhenA corpus, and section "Related works" introduces some related research works. Section "Method" expounds the proposed methods. Afterward, section "Evaluation" presents the results and statistical analysis. Finally, section "Discussion and conclusion" concludes the paper and provides some suggestions for further research.

SNPPhenA corpus

The SNPPhenA corpus was developed to extract the ranked associations of SNPs and phenotypes from GWA studies. The process of producing the corpus entailed collecting relevant abstracts, performing named-entity recognition, and annotating the associations, negation cues and scopes, modality markers, and degree of certainty of the associations [7].

As opposed to the previous biomedical relation extraction corpora containing true and false types of relations, the associations annotated in the corpus were divided into three classes: positive, negative, and neutral candidates.

Unlike distinguished association candidates, which include the author’s remarks on the association, a neutral candidate does not contain any such remarks [15]. In other words, neutral candidates are those SNP-phenotype candidates that show no clear evidence as to the presence or lack of an association between the SNP and the phenotype. Identification of neutral candidates is critical for negation processing, because the status of such candidates and their corresponding degree of certainty classification do not change when they are located in the scope of negation terms; in contrast, the status of distinguished association candidates changes in such cases. McDonald et al. are one of the very few groups of researchers who have investigated neutral candidates in terms of the RE task [16].

Similarly, a neutral candidate's degree of certainty or uncertainty does not change if it is located in the scope of a speculation or modality term. Hence, determination of the effect of negation as well as modality terms requires identification of neutral candidates.
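The asymmetry between neutral and distinguished candidates can be sketched in a few lines of Python. This is an illustrative simplification and not the implementation used in this study: the function name `apply_negation` is hypothetical, and the assumption that negation swaps positive and negative labels is ours; the text only states that the status of distinguished candidates changes.

```python
# Assumed flip for distinguished candidates inside a negation scope.
FLIPPED = {"positive": "negative", "negative": "positive"}


def apply_negation(label, in_negation_scope):
    """Return the candidate label after taking a negation scope into account.

    Neutral candidates keep their label whether or not they are in scope;
    distinguished (positive or negative) candidates are flipped when in scope.
    """
    if label not in ("positive", "negative", "neutral"):
        raise ValueError(f"unknown label: {label}")
    if not in_negation_scope or label == "neutral":
        return label
    return FLIPPED[label]


results = {
    label: apply_negation(label, in_negation_scope=True)
    for label in ("positive", "negative", "neutral")
}
```

The sketch makes the dependency explicit: the effect of a negation cue cannot be computed until it is known whether the candidate is neutral, which is why neutral candidates have to be identified first.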

Examples

SNP-phenotype candidates were classified as positive, negative, or neutral. Positive SNP-phenotype relation candidates are those with clearly indicated associations (Fig. 3). In contrast, negative SNP-phenotype relation candidates are those in which a lack of association is evident (Fig. 4). In addition to these typical classes of relationships, a neutral class is defined for candidates that fall into neither of the two other classes, where the presence or absence of an association is not noted in the sentence (see Figs. 5 and 6).

In addition to the mentioned annotations, the confidence level of each positive association in the corpus was annotated in one of three categories: strong, moderate, or weak degree of certainty. Figures 7 and 8 display two samples of weak and strong associations.
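The annotation scheme described above (three association classes, plus a three-level degree of certainty for positive associations) can be represented as a small data structure. The following Python sketch is only one possible representation: the class and field names are hypothetical, the SNP identifier and phenotype are placeholders rather than corpus entries, and the actual file format of the SNPPhenA corpus is not assumed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Association(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


class Certainty(Enum):
    WEAK = "weak"
    MODERATE = "moderate"
    STRONG = "strong"


@dataclass(frozen=True)
class Candidate:
    """One SNP-phenotype candidate within a sentence (hypothetical layout)."""
    snp: str
    phenotype: str
    association: Association
    certainty: Optional[Certainty] = None

    def __post_init__(self):
        # The text describes certainty levels for positive associations only.
        if (self.certainty is not None
                and self.association is not Association.POSITIVE):
            raise ValueError("certainty applies to positive associations only")


# Placeholder values, not taken from the corpus.
example = Candidate("rs0000000", "example phenotype",
                    Association.POSITIVE, Certainty.STRONG)
```

Rejecting a certainty level on negative or neutral candidates keeps the two annotation layers consistent with each other, so a ranking step can rely on every certainty value belonging to a positive association.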

Characteristics of the SNPPhenA corpus

This section provides detailed statistics regarding the linguistic and non-linguistic properties of the corpus. Table 1 presents the basic properties of the corpus, including the statistics of the produced corpus in terms of test and training parts. As the table shows, the candidates with a positive association comprised the largest category, while the negatively associated candidates constituted the smallest category.

Table 1 Basic statistics of the SNPPhenA corpus in terms of test and train parts
