A hybrid algorithm for clinical decision support in precision medicine based on machine learning

Zhang, Zicheng; Lin, Xinyue; Wu, Shanshan

doi:10.1186/s12859-022-05116-9

Research
Open access
Published: 03 January 2023

BMC Bioinformatics volume 24, Article number: 3 (2023)

Abstract

Purpose

The objective of this manuscript is to propose a hybrid algorithm that combines the improved BM25 algorithm, k-means clustering, and the BioBERT model to better identify relevant biomedical articles in the PubMed database, so that a larger number of the retrieved articles contain information closely related to a query about a specific disease.

Design/methodology/approach

In this paper, a two-stage information retrieval method is proposed to carry out an improved Text-Rank algorithm. In the first stage, the improved BM25 algorithm assigns scores to the biomedical articles in the database and identifies the 1000 publications with the highest scores. In the second stage, a method called cluster-based abstract extraction shortens the article abstracts to fit the input constraints of the BioBERT model, and a BioBERT-based document similarity matching method then computes the similarity between each document and the retrieval morphemes to obtain the best-matching search results. For reproducibility, the code is available at https://github.com/zzc1991/TREC_Precision_Medicine_Track.

Findings

The experimental study uses the TREC 2017 and TREC 2018 data sets to train the proposed model and the TREC 2019 data as a validation set. The results confirm the effectiveness and practicability of the proposed algorithm, which generalizes well and could be implemented for clinical decision support in precision medicine.

Originality/value

This research integrates multiple machine learning and text processing methods into a hybrid method for retrieving medical literature in specific domains. The proposed algorithm achieves a P@10 that is 3% higher than that of the state-of-the-art algorithm in TREC 2019.

Introduction

Precision medicine is a new medical paradigm that integrates modern scientific and technological means with conventional medical methods by scientifically detailing human bodily functions and the nature of diseases. It thereby systematically optimizes the principles and practices of disease prevention and health care, with the eventual aim of maximizing both individual and social health benefits through more effective, safer, and more economical medical services [1, 2]. In precision medicine, diagnostic methods are selected for each patient to achieve minimal iatrogenic damage, minimal medical costs, and optimal patient recovery [3, 4]. In addition, drawing extensively on both the genomic profiles and the healthcare data of patients leads to personalized treatments [5]. Hence, a clinical system that adopts this approach attends to all types of useful information about the genes, microbiomes, environmental conditions, family history, and lifestyles of patients in order to choose precise diagnoses and therapeutic alternatives that result in better treatment for each individual [6]. In other words, precision medicine is considered a tool for predictive, preventive, personalized, and participatory healthcare that uses all available data sources, such as genetics, omics, and patient history [7].

Precision medicine covers various areas, ranging from drug discovery, design, and development, the analysis of drug sensitivity in pharmacology, and the construction of clinical decision support systems in health analytics to a better understanding of several diseases and their relationships with genes, family history, and other attributable factors in medicine [8,9,10,11].

With the advancement of medical technologies, the number of biomedical articles has grown exponentially, so finding relevant articles that match the symptoms of a patient in massive article databases has become increasingly difficult. For example, when just “precision medicine” is entered in the search bar of the Science Direct database, 229,126 articles are found. Therefore, extracting useful and practical insights from such an immense collection requires carefully devised methods and approaches.

Information retrieval (IR) plays a significant role in precision medicine and refers to the process and technology of organizing and accessing information according to the requirements of users. The main goal of information retrieval is to obtain the required information as accurately, quickly, and comprehensively as possible. Moreover, since the volume of accumulated data is growing sharply, big data-based processing and modeling have been gaining momentum, especially since 2008 [12]. Hence, more precise and refined outcomes could potentially be reached by employing carefully devised methods or algorithms.

Although the BM25 algorithm is one of the earliest and most widely used algorithms in text ranking tasks and often serves as the basis for improved algorithms, most BM25 variants consider only abstracts and ignore the search morphemes, and their co-occurrence relationships, that can be found in chemicals, MeSH headings, and keywords. Zhang [13] proposed an improved BM25 algorithm that computes three scores, for the vocabulary, the co-words, and the expanded words, and combines them into a composite retrieval function whose parameters are optimized by the cuckoo optimization algorithm, which yielded better search results. The model was trained on the 2017 dataset, and the trained parameters also improved the search results on both the 2018 and 2019 datasets, so that research provides a reference for parameter selection for the BM25 algorithm. Several available algorithms use BM25 as the first step of a search pipeline and then employ a deep learning model to obtain more accurate matches. It should be kept in mind, however, that the effectiveness of deep learning models depends on how well they are trained, and that the similarity results can be strongly affected by the output of the method employed in the first stage. Consequently, using the improved BM25 in the first stage helps the proposed algorithm attain better search results.

This manuscript builds on the improved BM25 approach to select the 1000 highest-scoring articles in PubMed and then applies a clustering algorithm that splits each abstract into N clusters so that it meets the input length limit of the pre-trained BioBERT model, which generates better text ranking results from search terms for diseases, genes, and individual traits. Similarity-matching results are then obtained by running the BioBERT model, which is also employed as a pre-training model and calculates the similarity between the article abstract/title and the retrieval morphemes as a score. The input vector length of the BERT model is limited to 512 tokens (words or characters) of an article abstract, and negative samples are generated for the training data set to improve the training effect.

The motivation of this research is to propose a hybrid algorithm, consisting of a two-stage information retrieval method based on the improved BM25 algorithm, k-means clustering, and the BioBERT model, that better identifies the biomedical articles most relevant to specific diseases, genes, and individual traits.

The remainder of the article is organized as follows. Section "Related work" presents the related work. Section "Method" describes the improved BM25 algorithm and the two components of the proposed algorithm, document similarity matching and cluster-based abstract extraction. Section "The proposed method and its implementation" describes the proposed method with a flow chart and its execution details, including the data structure and the method for generating negative training samples. Section "Experimental results" describes the data and parameters used by the proposed algorithm and compares it experimentally with the selected algorithm presented in Track 2020. Section "Summary and future work" concludes the research.

Related work

Preliminary

In this subsection, we briefly introduce the development of text retrieval. The Boolean model, the original information retrieval model, was used as early as 1957. It is a simple retrieval model based on set theory and Boolean algebra whose basic idea is to represent both the query of a user and a document as sets of words; the similarity of the two sets is then determined by Boolean operations. The Boolean model performs keyword matching, that is, documents containing the keywords in a query are retrieved. However, the retrieved results usually correlate poorly with the target. In some research fields, weighting the index terms has been shown to greatly improve the retrieval results, which led to the development of vector models [14, 15].

BM25 and its modified versions, which are conventional probabilistic models that employ the two-Poisson approximation of the term-frequency distribution, have long been effective tools in text ranking, and BM25 is generally used as the baseline against which newly introduced models are compared [16, 17]. In addition, typical vector models include the term frequency-inverse document frequency (TF-IDF) approach, and the BM25 model has been widely studied on the basis of this approach. As a result, the emergence of vector models has substantially increased the relevance of retrieved documents to the retrieval target and led to the concepts of document scoring and ranking [18,19,20].

With the advancement of machine learning in recent years, several ranking algorithms have been developed that aim to rank texts better when matching a query with the most relevant articles. Machine learning is also expected to make the process more automatic and to yield better outcomes. Learning-to-rank methods are generally classified into three categories according to how they are trained: pointwise, pairwise, and listwise [21,22,23]. In the pointwise method, each document in the training set is treated as a separate sample, so the task is essentially a single-document classification or regression problem. Widely implemented pointwise algorithms include Prank [24], McRank [25], and RankProp [26]. In the pairwise method, two documents that carry different labels for the same query in the training set are treated as one sample, which transforms the ranking problem into a binary classification problem. Broadly used pairwise algorithms include RankBoost [27] and FRank [28]. In the listwise method, the entire document sequence is taken as a sample, and the evaluation measure of the retrieved information is optimized by defining a loss function. Widely used listwise algorithms include ListNet [29], SVM-MAP [30], and AdaRank [31].

When machine learning algorithms are implemented, the pre-training process contributes to their success [32,33,34]. A pre-trained language representation approach called BERT (Bidirectional Encoder Representations from Transformers), a multilayer bidirectional transformer encoder stack, was proposed in [35] and was found to perform better than the approaches available in the literature. Park et al. [36] used a BERT classifier to train on retrieved articles and used word vectors to represent medical articles. The studies were ranked according to the similarity scores between the query semantic elements and the article, and the results showed that the accuracy was greatly improved over existing algorithms. Pan et al. [19] combined patient health records with biomedical articles and used three methods to expand the phrases used in queries; the experimental results showed that the proposed model yielded a promising average weighted accuracy, better stability, and better applicability. Maciej et al. [37] investigated the effectiveness of a BERT-based ranking model on different platforms, and the results also verified the accuracy of the BERT model for precision medicine. In [9], Bayesian networks were introduced into query expansion, together with probabilistic models, to expand the query semantic elements and increase query accuracy. Two types of BERT models, BERT_BASE and BERT_LARGE, are available [38]. Articles covering various related modifications of BERT can be found in [39,40,41,42].

BioBERT model

With the BioBERT model [43,44,45,46], natural language processing tasks extract relations better and generate more accurate outcomes. Rather than being pre-trained only on generic data sets, BioBERT requires derived data sets to perform well; otherwise, poor performance would be expected. The BioBERT model is used for various improvement purposes. For example, functional links between proteins have recently been identified by fine-tuning weights from BioBERT [44]. In addition, several studies in the literature have reported better outcomes when the BioBERT model is implemented [47,48,49,50].

Method

Baseline algorithm

Our baseline algorithm is the improved BM25 algorithm previously proposed by the author [13]. For completeness, the fundamental aspects of the improved BM25 algorithm are revisited here.

First, we defined the abstract score,

$$AS(Q,d)=\sum_{i}^{n}IDF({q}_{i})\times \frac{{f}_{i}\times \left({k}_{1}+1\right)}{{f}_{i}+{k}_{1}\times \left(1-{b}_{1}+{b}_{1}\times \frac{dl}{avgdl}\right)}$$

(1)

where $IDF({q}_{i})$ is the inverse document frequency (IDF) of the search morpheme ${q}_{i}$, ${k}_{1}$ and ${b}_{1}$ are adjustment factors that are usually set according to the experience of users, and ${f}_{i}$ is the frequency of ${q}_{i}$ in $d$. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient. $dl$ is the text length of document $d$, and $avgdl$ is the average text length of all documents.
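As an illustration of Eq. (1), the following is a minimal Python sketch of the abstract score. It is not the authors' implementation: the function names, the toy corpus, and the values `k1=1.2` and `b1=0.75` (common BM25 defaults, not the optimized parameters from [13]) are assumptions, and text length is taken to be the number of tokens.

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF as described in the text: log(total documents / documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus contributes nothing
    return math.log(len(docs) / containing)

def abstract_score(query, doc, docs, k1=1.2, b1=0.75):
    """Eq. (1): sum over query morphemes of IDF(q_i) times the length-normalized term frequency."""
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length in tokens
    tf = Counter(doc)
    score = 0.0
    for q in query:
        f = tf[q]  # frequency of q in this document
        denom = f + k1 * (1 - b1 + b1 * len(doc) / avgdl)
        score += idf(q, docs) * f * (k1 + 1) / denom
    return score

# Toy corpus of tokenized abstracts (hypothetical data).
docs = [
    "braf mutation in melanoma patients".split(),
    "melanoma treatment outcomes".split(),
    "diabetes and diet".split(),
]
query = ["braf", "melanoma"]
scores = [abstract_score(query, d, docs) for d in docs]
```

In this toy corpus the first abstract matches both morphemes and scores highest, the second matches only the more common one, and the third matches neither and scores zero.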

We propose a wordlist to combine the chemical words, MeSH headings, and keywords of a retrieved document, and the scores are defined as follows:

$$WS\left( {Q,d} \right) = \mathop \sum \limits_{i}^{n} \frac{{tfw\left( {Q,d} \right) \times \left( {k_{2} + 1} \right)}}{{tfw\left( {Q,d} \right) + k_{2} \times \left( {1 - b_{2} + b_{2} \times \frac{dwl}{{avgdwl}}} \right)}}$$

(2)

where $tfw$ is the sum of the IDF values of each retrieved morpheme, and $k_{2}$ and $b_{2}$ are adjustment factors, which are usually set according to the experience of users. $dwl$ is the number of words in the wordlist of document $d$, and $avgdwl$ is the average number of words in the wordlist of all documents.

We also define the co-word score: when the disease and the gene in the search morphemes (including expansion words) co-occur in the abstract and the wordlist, the co-occurrence score is recorded as follows:

$$\mathrm{CWS}\left(Q,d\right)=\sum_{i}^{n}{IDF}_{word}({g}_{i},d)$$

(3)

where ${IDF}_{word}({g}_{i},d)$ represents the score based on the expression gene ${g}_{i}$ for query $Q$. The summation is used because some tasks can contain more than one gene.

To place them on the same scale as the scores of the similarity method in this manuscript, we standardize the sum of the three scores using the max–min method, as shown in Eq. (4):

$${x}_{norm}=\frac{x-\mathrm{min}(X)}{\mathrm{max}(X)-\mathrm{min}(X)}$$

(4)

where ${x}_{norm}$ represents the normalized value, $x$ represents the value before normalization, $\mathrm{min}(X)$ represents the minimum value of the sequence to be standardized, and $\mathrm{max}(X)$ represents the maximum value of the sequence to be standardized. In the algorithm, we also added query expansion to extend the MeSH terms. The algorithm and a detailed evaluation of its performance can be found in [13].
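The standardization in Eq. (4) can be sketched in a few lines of Python. The score values below are hypothetical, and mapping a constant sequence to zeros is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```python
def min_max_normalize(values):
    """Eq. (4): rescale a sequence of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant sequence maps to 0
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical abstract, word, and co-word scores for three documents.
abstract_scores = [2.0, 0.5, 1.0]
word_scores = [1.0, 0.5, 0.5]
co_word_scores = [1.0, 0.0, 0.0]

# The sum of the three scores is standardized, as described above.
combined = [a + w + c for a, w, c in zip(abstract_scores, word_scores, co_word_scores)]
normalized = min_max_normalize(combined)
```

Here `combined` is `[4.0, 1.0, 1.5]`, so the best document maps to 1, the worst to 0, and the remaining one to a value in between.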

Document similarity matching

Similarity matching between articles and retrieval tasks is an important step in the information retrieval process. In [24], the Bidirectional Encoder Representation from Transformers (BERT) model is employed to train on the abstracts/titles and query tasks. The model structure is shown in Fig. 1. [CLS], a special token, is prepended to the input before it is sent to BERT, and [SEP], a special tag that separates sentences, is inserted as a separator between the abstract/title and the query. The output of the BERT model (the embedding of the sentence pair) is then taken, and the [CLS] output is used for the similarity calculation. A sigmoid is applied to the output to obtain the similarity between the abstract/title and the query, which is taken as the matching score between them.
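The input layout and the final sigmoid can be sketched without any model, as below. This is an illustration only: placing the query before the abstract/title, truncating the document side to fit 512 tokens, and scoring the [CLS] vector with a single linear layer (`weights`, `bias`) are assumptions, and all names are hypothetical. In practice the [CLS] vector would come from the fine-tuned BioBERT encoder.

```python
import math

def pack_pair(query_tokens, doc_tokens, max_len=512):
    """Build [CLS] query [SEP] document [SEP], truncating the document side to fit max_len."""
    budget = max_len - len(query_tokens) - 3  # room left after the three special tokens
    if budget < 0:
        raise ValueError("query alone exceeds the input limit")
    return ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens[:budget] + ["[SEP]"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matching_score(cls_vector, weights, bias=0.0):
    """Map the [CLS] output vector to a similarity in (0, 1) via a linear layer and a sigmoid."""
    logit = sum(c * w for c, w in zip(cls_vector, weights)) + bias
    return sigmoid(logit)

# A 600-token document is cut so that the packed input has exactly 512 tokens.
packed = pack_pair(["melanoma", "braf"], ["token"] * 600)

# Toy 2-dimensional [CLS] vector; a real one would have 768 dimensions.
score = matching_score([0.5, -0.5], [1.0, 1.0])
```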

Clustering-based abstract extraction

Because the BERT model is limited to 512 tokens (words or characters), the abstract needs to be shortened further and its key content extracted. An extractive abstract generation method is employed to preserve the writing style and the meaning of the original abstract as far as possible. This article adopts the clustering-based abstract extraction method, whose process is as follows:

1. The BioBert pretraining model is utilized to generate a sentence vector for each sentence in the abstract to obtain a sentence-level vector representation, which is a 1 × 768 dimensional vector.

2. Sentences are clustered by using the K-means clustering to obtain N categories, where the number N is preassigned by the implementer.

3. From each category, the sentence closest to the center of the cluster is selected, and selection continues until the overall length reaches 512 tokens (words or characters), forming a new abstract text.
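The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the sentence vectors are toy 2-dimensional stand-ins for the 1 × 768 BioBERT vectors, the k-means routine is a plain deterministic Lloyd iteration written here for self-containment, tokens are approximated by whitespace-separated words, and visiting the clusters in turn (nearest sentence first) and restoring the original sentence order are one possible reading of step 3.

```python
import math

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def extract_abstract(sentences, vectors, n_clusters, max_tokens=512):
    labels, centroids = kmeans(vectors, n_clusters)
    # For each cluster, sentence indices ordered from nearest to farthest from the centroid.
    ranked = [
        sorted((i for i, l in enumerate(labels) if l == c),
               key=lambda i: math.dist(vectors[i], centroids[c]))
        for c in range(n_clusters)
    ]
    chosen, used, depth = [], 0, 0
    while any(depth < len(r) for r in ranked):
        for r in ranked:
            if depth < len(r):
                cost = len(sentences[r[depth]].split())  # whitespace words stand in for tokens
                if used + cost > max_tokens:
                    return " ".join(sentences[i] for i in sorted(chosen))
                chosen.append(r[depth])
                used += cost
        depth += 1
    return " ".join(sentences[i] for i in sorted(chosen))

sentences = [
    "BRAF mutations are common.",
    "Many tumors carry BRAF changes.",
    "BRAF V600E drives melanoma growth.",
    "Patients were followed for years.",
    "Follow-up continued after treatment ended.",
    "Median follow-up was two years.",
]
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
           [10.0, 10.0], [10.0, 11.0], [10.0, 10.4]]
summary = extract_abstract(sentences, vectors, n_clusters=2, max_tokens=10)
```

With a budget of 10 tokens, the sketch keeps the one sentence nearest to each of the two cluster centers and stops before a third sentence would exceed the limit.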

The proposed method and its implementation

The proposed algorithm

This research integrates multiple machine learning and text processing methods into a hybrid method for retrieving medical literature in specific domains. The flow chart of the algorithm is depicted in Fig. 2. The algorithm is a hybrid, two-stage information retrieval method based on the improved BM25 algorithm, k-means clustering, and the BioBERT model that better identifies the biomedical articles most relevant to specific diseases, genes, and individual traits.

The improved BM25 algorithm computes three scores, for the vocabulary, the co-words, and the expanded words, and combines them into a composite retrieval function whose parameters are optimized by the cuckoo optimization algorithm. Afterward, the BioBERT pre-trained model is used to generate a vector for each sentence in the abstract, giving a sentence-level representation that is a 1 × 768 dimensional vector. The sentences are then clustered with K-means, and the sentences closest to the center of each cluster are selected until the overall length reaches 512 tokens, forming a new abstract text. Finally, the output of the BERT model employed in the BioBERT-based document similarity matching method is used to obtain the similarity between the document and the retrieval morphemes.

To illustrate the procedure, patient information and medical articles are first input into the system. Patient information includes the disease, demographics, genes, and other attributes; medical article information includes the title, abstract, MeSH headings, chemical list, and keyword list. The patient information is looked up in the MeSH library to obtain the expanded query information, and the patient information and the expanded words are input into the improved BM25 algorithm [13] to obtain the abstract score, word score, and co-word score, which are then standardized. Afterward, the articles are sorted in descending order of their composite retrieval scores, and the top 1000 are kept. For these top 1000 articles, the abstract and title similarity scores between each document and the query are calculated with the BioBERT document similarity matching method. The standardized scores are then added to the improved BM25 scores, and the articles are sorted in descending order of the final scores.
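The two-stage flow described above can be sketched as a small re-ranking function. The sketch rests on several assumptions that the text leaves open: both the BM25 scores and the similarity scores are min-max normalized over the retained candidates before being added, the similarity model is passed in as an arbitrary callable (`similarity_fn`), and the document identifiers and scores are invented for illustration.

```python
def min_max(values):
    """Min-max normalization to [0, 1]; a constant sequence maps to 0 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rerank(bm25_scores, similarity_fn, top_k=1000):
    """Stage 1: keep the top_k documents by BM25. Stage 2: add similarity scores and re-sort."""
    top = sorted(bm25_scores, key=bm25_scores.get, reverse=True)[:top_k]
    bm25_norm = min_max([bm25_scores[d] for d in top])
    sim_norm = min_max([similarity_fn(d) for d in top])
    final = {d: b + s for d, b, s in zip(top, bm25_norm, sim_norm)}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical composite BM25 scores and similarity scores for four documents.
bm25 = {"pmid_a": 12.0, "pmid_b": 9.0, "pmid_c": 8.0, "pmid_d": 1.0}
sims = {"pmid_a": 0.2, "pmid_b": 0.9, "pmid_c": 0.6, "pmid_d": 0.99}

ranking = rerank(bm25, sims.get, top_k=3)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

In this toy run `pmid_d` never reaches the second stage despite its high similarity, which illustrates why the quality of the first-stage BM25 ranking matters for the final result.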

Structured data

Table 1 summarizes the evaluation results obtained from 2017 through 2019 for the initial screening of the literature. It is a screening factor for human precision medicine (PM), and the co-occurrence of diseases and genes is also an important factor for determining the correlation. Therefore, the co-word method proposed in the improved BM25 algorithm [13] can increase the scores of potentially relevant articles. When the search elements are defined, the term “human” is used as one of the search elements of the baseline to distinguish between humans and animals. Because the PM tasks in 2020 differed from those from 2017 through 2019, with demographics replaced by treatment, the 2020 tasks are excluded and the tasks from 2017 through 2019 are used as the PM retrieval tasks for the research data. Table 2 shows the PM retrieval tasks from 2017 through 2019. It can be observed that disease and genes are fixed expressions, whereas age and gender need to be classified during retrieval. The classification criteria are shown in Table 3. A regular expression extracts the age from the abstract: expressions such as years-old/year-old/years old are all extracted and mapped to the corresponding category. The NLTK stemmer is used to extract the words that express gender in the abstract, such as woman, man, girl, and boy. If the abstract does not contain demographic information, matching items are searched for in the MeSH headings.
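The demographic extraction step might look like the sketch below. It is an approximation rather than the authors' code: the regular expression is one possible pattern for the years-old/year-old/years old forms, a fixed word list (including plurals and "female"/"male", which are added here) replaces the NLTK stemmer to keep the example dependency-free, and the mapping of ages to the categories of Table 3 is left out because those criteria are not reproduced here.

```python
import re

# Matches forms such as "45-year-old", "45 years old", and "45 years-old".
AGE_PATTERN = re.compile(r"\b(\d{1,3})[\s-]?years?[\s-]?old\b", re.IGNORECASE)

# Hypothetical word lists standing in for stemming-based gender detection.
FEMALE_WORDS = {"woman", "women", "girl", "girls", "female", "females"}
MALE_WORDS = {"man", "men", "boy", "boys", "male", "males"}

def extract_demographics(abstract):
    """Return the ages (in years) and the genders mentioned in an abstract."""
    ages = [int(m) for m in AGE_PATTERN.findall(abstract)]
    words = set(re.findall(r"[a-z]+", abstract.lower()))  # whole words only
    genders = set()
    if words & FEMALE_WORDS:
        genders.add("female")
    if words & MALE_WORDS:
        genders.add("male")
    return ages, genders

text = "A 45-year-old woman and a boy aged 7 years old carried the BRAF mutation."
ages, genders = extract_demographics(text)
```

When both results are empty, the text above indicates that the MeSH headings of the article would be searched instead; that fallback is not shown.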

Table 1 Raw judgments for Scientific Abstracts
