Matching health information seekers' queries to medical terms

Soualmia, Lina F; Prieur-Gaston, Elise; Moalla, Zied; Lecroq, Thierry; Darmoni, Stéfan J

doi:10.1186/1471-2105-13-S14-S11

Volume 13 Supplement 14

Research from the Eleventh International Workshop on Network Tools and Applications in Biology (NETTAB 2011)

Research
Open access
Published: 07 September 2012

BMC Bioinformatics volume 13, Article number: S11 (2012)

Abstract

Background

The Internet is a major source of health information but most seekers are not familiar with medical vocabularies. Hence, their searches fail due to bad query formulation. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques or knowledge-based methods. However, it would be useful to clean those queries which are misspelled. In this paper, we propose a simple yet efficient method in order to correct misspellings of queries submitted by health information seekers to a medical online search tool.

Methods

In addition to query normalizations and exact phonetic term matching, we tested two approximate string comparators: the similarity score function of Stoilos and the normalized Levenshtein edit distance. We propose here to combine them to increase the number of matched medical terms in French. We first took a sample of query logs to determine the thresholds and processing times. In a second run, at a greater scale, we tested different combinations of query normalizations before or after misspelling correction, using the thresholds retained from the first run.

Results

According to the total number of suggestions (around 163, the number of queries in the first sample), at a threshold comparator score of 0.3, the normalized Levenshtein edit distance gave the highest F-Measure (88.15%) and at a threshold comparator score of 0.7, the Stoilos function gave the highest F-Measure (84.31%). By combining Levenshtein and Stoilos, the highest F-Measure (80.28%) is obtained with thresholds of 0.2 and 0.7 respectively. However, queries are composed of several terms that may be combinations of medical terms. A process of query normalization and segmentation is thus required. The highest F-Measure (64.18%) is obtained when this process is performed before spelling correction.

Conclusions

Despite the widely known high performance of the normalized edit distance of Levenshtein, we show in this paper that its combination with the Stoilos algorithm improved the results for misspelling correction of user queries. Accuracy is improved by combining spelling, phoneme-based information and string normalizations and segmentations into medical terms. These encouraging results have enabled the integration of this method into two projects funded by the French National Research Agency-Technologies for Health Care. The first aims to facilitate the coding process of clinical free texts contained in Electronic Health Records and discharge summaries, whereas the second aims at improving information retrieval through Electronic Health Records.

Background

The Internet is fast becoming a recognized source of information in many fields, including health. In this domain, as in others, users are now experiencing huge difficulties in finding precisely what they are looking for among the numerous documents available online, and this in spite of existing tools. In medicine and health-related information accessible on the Internet, general search engines, such as Google, or general catalogues, such as Yahoo, cannot solve this problem efficiently [1]. This is because they usually offer a selection of documents that turn out to be either too large or ill-suited to the query. Free text word-based search engines typically return innumerable completely irrelevant hits, which require much manual weeding by the user, and also miss important information resources.

In this context, several health gateways [2] have been developed to support systematic resource discovery and help users find the health information they are looking for. These information seekers may be patients but also health professionals, such as physicians searching for clinical trials. Health gateways rely on thesauri and controlled vocabularies. Some of them are evaluated in [3]. Thesauri are a proven key technology for effective access to information since they provide a controlled vocabulary for indexing information. They therefore help to overcome some of the problems of free-text search by relating and grouping relevant terms in a specific domain. Nonetheless, medical vocabularies are difficult for non-professionals to handle.

Many tools have been developed to improve information retrieval from such gateways. They exploit techniques such as natural language processing, statistics, and lexical and background knowledge. However, a simple spelling corrector, such as Google's "Did you mean:" or Yahoo's "Also try:" feature, may be a valuable tool for non-professional users who may approach the medical domain in a more general way [4]. Such features can improve the performance of these tools and provide the user with the necessary help. Spelling errors represent a major challenge for an information retrieval system: if misspellings in the queries (composed of one or multiple words) submitted by information seekers remain undetected, the search may return no results. Spelling correctors may be classified into two categories. The first relies on a dictionary of well-spelled terms and selects the top candidate based on a string edit distance. An approximate string matching algorithm, or function, is required to detect errors in users' queries; it then recommends a list of terms from the dictionary that are similar to each query word. The second category uses lexical disambiguation tools to refine the ranking of the candidate terms that might be a correction of the misspelled query. Several studies have been published on this subject. Grannis [5] describes a method for calculating similarity in order to improve medical record linkage; it uses different algorithms such as Jaro-Winkler, Levenshtein [6] and the longest common subsequence (LCS). In [7] the authors suggest improving the algorithm for computing Levenshtein similarity by using the frequency and length of strings. In [8] a phonetic transcription corrects users' queries when they are misspelled but have similar pronunciation (e.g. Alzaymer vs. Alzheimer). In [9] the authors propose a simple and flexible spell checker using efficient associative matching in a neural system and also compare their method with other commonly used spell checkers.

The problem of automatic spell checking is not new. Research in this area started in the 1960s [10] and many different techniques for spell checking have been proposed since then. Some of these techniques exploit general spelling error tendencies and others exploit a phonetic transcription of the misspelled term to find the correct term. The process of spell checking can generally be divided into three steps: (i) error detection: the validity of a term in a language is verified and invalid terms are identified as spelling errors; (ii) error correction: valid candidate terms from the dictionary are selected as corrections for the misspelled term; and (iii) ranking: the selected corrections are sorted in decreasing order of their likelihood of being the intended term. Many studies have been performed to analyze the types and the tendencies of spelling errors for the English language. According to [11], spelling errors are generally divided into two types: (i) typographic errors and (ii) cognitive errors. Typographic errors occur when the correct spelling is known but the word is mistyped. These errors are mostly related to keyboard errors and therefore do not follow any linguistic criteria (58% of these errors involve adjacent keys [12]; they occur because the wrong key is pressed, two keys are pressed, or keys are pressed in the wrong order). Cognitive errors, or orthographic errors, occur when the correct spelling of a term is not known. The pronunciation of the misspelled term is similar to the pronunciation of the intended correct term. In English, the sound similarity of characters is a factor that often affects error tendencies [12]. However, phonetic errors are harder to correct because they deform the word more than a single insertion, deletion or substitution. By contrast, over 80% of errors fall into one of the following four single edit operation categories: (i) single letter insertion; (ii) single letter deletion; (iii) single letter substitution and (iv) transposition of two adjacent letters [10, 11].
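The four single edit operation categories can be illustrated with a short sketch. This is only an illustration, not the method evaluated in this paper: the function name `single_edits` is hypothetical, and the alphabet is assumed to be unaccented lowercase ASCII, which would not cover French accented characters.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string that is exactly one edit operation away from `word`.

    Covers the four categories: insertion, deletion, substitution and
    transposition of two adjacent letters.
    """
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # Substituting a letter by itself reproduces the word, so remove it.
    return (deletions | transpositions | substitutions | insertions) - {word}
```

With this sketch, "euthanasia" is reachable from "eutanasia" (one insertion) and "trigonocephalie" from "trigonocepahlie" (one transposition), whereas a phonetic error such as "alzaymer" for "alzheimer" is not reachable with a single edit.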

The third step in spell-checking is the ranking of the selected corrections. Most spell-checking techniques do not provide any explicit ranking mechanism. However, statistical techniques rank the corrections by probability scores, with good results [13–15].

HONselect [16] is a multilingual and intelligent search tool integrating heterogeneous web resources in health. In the medical domain, spell-checking is performed on the basis of a medical thesaurus by offering information seekers several medical terms that differ from the original query by one to four characters. Exploiting the frequency of a given term in the medical domain can also significantly improve spelling correction [17]: an edit distance technique is used for correction, along with term frequencies for ranking. In [18] the authors use normalization techniques, aggressive reformatting and abbreviation expansion for unrecognized words, as well as spelling correction, to find the closest drug names within RxNorm for drug name variants that can be found in local drug formularies. Their method returns only drug name suggestions. To match queries with the MeSH thesaurus, Wilbur et al. [19] propose a technique based on the noisy channel model and statistics from the PubMed logs.

Research has focused on several different areas, from pattern matching algorithms and dictionary searching techniques to optical character recognition of spelling corrections in different domains. However, relatively few groups have studied spelling correction of medical queries in French. In this paper, a simple method is proposed: it combines two approximate string comparators, the well-known Levenshtein edit distance [6] and the Stoilos similarity function defined in [20] for ontologies. We apply and evaluate these two measures, alone and combined, on a set of sample queries in French submitted to the health gateway CISMeF [21]. The queries may be submitted by health professionals in their clinical practice as well as by patients. The system we have designed aims to correct errors resulting in non-existent terms, and thus to reduce the silence of the associated search tool.

Methods

Similarity functions

Similarity functions between two text strings S₁ and S₂ give a similarity or dissimilarity score between S₁ and S₂ for approximate matching or comparison. For example, the strings "Asthma" and "Asthmatic" can be considered similar to a certain degree. Modern spell-checking tools are based on the simple Levenshtein edit distance [6], which is the most widely known. This function takes two input strings and returns the minimum number of elementary operations required to transform a string S₁ into a string S₂. There are three possible operations: replacing a character with another, deleting a character and adding a character. This measure takes its values in the interval [0, ∞). The Normalized Levenshtein [22] (LevNorm), in the range [0, 1], is obtained by dividing the Levenshtein distance Lev(S₁, S₂) by the length of the longest string and is defined by equation (1):

\mathrm{LevNorm}(S_{1}, S_{2}) = \frac{\mathrm{Lev}(S_{1}, S_{2})}{\max(|S_{1}|, |S_{2}|)}

(1)

LevNorm(S₁, S₂) ∈ [0, 1] as Lev(S₁, S₂) ≤ Max(|S₁|, |S₂|).

For example, LevNorm(eutanasia, euthanasia) = 0.1, as Lev(eutanasia, euthanasia) = 1 (one insertion of the character h); |eutanasia| = 9 and |euthanasia| = 10.
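Equation (1) can be sketched in a few lines of Python. This is a minimal illustration using the classic dynamic-programming formulation of the edit distance; the function names `levenshtein` and `lev_norm` are chosen here for illustration, and returning 0.0 for two empty strings is an assumption, since that case is not defined by equation (1).

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s1 into s2."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def lev_norm(s1, s2):
    """Normalized Levenshtein distance of equation (1), in [0, 1]."""
    longest = max(len(s1), len(s2))
    # Assumption: two empty strings are treated as identical.
    return levenshtein(s1, s2) / longest if longest else 0.0
```

Calling `lev_norm("eutanasia", "euthanasia")` reproduces the value 0.1 of the example above. Note that a transposition of two adjacent letters counts as two operations with this distance.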

We complement the Levenshtein distance with the Stoilos similarity function proposed in [20]. It has been specifically developed for strings that are labels of concepts in ontologies. It is based on the idea that the similarity between two entities is related to their commonalities as well as their differences. Thus, the similarity should be a function of both these features. It is defined by equation (2), where Comm(S₁, S₂) stands for the commonality between the strings S₁ and S₂, Diff(S₁, S₂) for the difference between S₁ and S₂, and Winkler(S₁, S₂) for the improvement of the result using the method introduced by Winkler in [23]:

\mathrm{Sim}(S_{1}, S_{2}) = \mathrm{Comm}(S_{1}, S_{2}) - \mathrm{Diff}(S_{1}, S_{2}) + \mathrm{Winkler}(S_{1}, S_{2})

(2)

The commonality function is determined by the substring function. The longest common substring between two strings (MaxComSubString) is computed. This process is further extended by removing the common substring and by searching again for the next longest common substring until none can be identified. The commonality function is given by equation (3):

\mathrm{Comm}(S_{1}, S_{2}) = \frac{2 \times \sum_{i} |\mathrm{MaxComSubString}_{i}|}{|S_{1}| + |S_{2}|}

(3)

For example, for the strings S₁ = Trigonocepahlie and S₂ = Trigonocephalie we have: |MaxComSubString₁| = |Trigonocep| = 10 and |MaxComSubString₂| = |lie| = 3, so Comm(Trigonocepahlie, Trigonocephalie) = 2 × (10 + 3) / (15 + 15) ≈ 0.867.
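The iterative extraction of common substrings in equation (3) can be sketched as follows. The function names are hypothetical, and the stopping rule is an assumption: the search stops when no common substring of at least `min_len = 2` characters remains, which is what reproduces the two substrings of the example above (single shared letters such as "a" or "h" are not counted).

```python
def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length) of the longest common substring."""
    best = (0, 0, 0)
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run ending at a[i-2], b[j-2].
                current[j] = previous[j - 1] + 1
                if current[j] > best[2]:
                    best = (i - current[j], j - current[j], current[j])
        previous = current
    return best


def common_substrings(s1, s2, min_len=2):
    """Repeatedly extract the longest common substring of s1 and s2."""
    a, b = s1, s2
    found = []
    while True:
        i, j, n = longest_common_substring(a, b)
        if n < min_len:  # assumed stopping rule, see the lead-in
            break
        found.append(a[i:i + n])
        # Replace the match by two different sentinel characters so that
        # the remaining fragments cannot match across the removed part.
        a = a[:i] + "\x00" + a[i + n:]
        b = b[:j] + "\x01" + b[j + n:]
    return found


def comm(s1, s2, min_len=2):
    """Commonality of equation (3)."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings are identical
    total = sum(len(sub) for sub in common_substrings(s1, s2, min_len))
    return 2 * total / (len(s1) + len(s2))
```

For the pair Trigonocepahlie and Trigonocephalie, this sketch finds the substrings "Trigonocep" and "lie" and returns 26/30.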

The difference function Diff(S₁, S₂) is based on the length of the unmatched strings resulting from the initial matching step. It is defined in equation (4), where p ∈ [0, ∞) and |u_S1| and |u_S2| represent the lengths of the unmatched substrings from the strings S₁ and S₂, each scaled by the length of the corresponding string:

\mathrm{Diff}(S_{1}, S_{2}) = \frac{|u_{S_1}| \times |u_{S_2}|}{p + (1 - p) \times (|u_{S_1}| + |u_{S_2}| - |u_{S_1}| \times |u_{S_2}|)}

(4)

For example for S₁ = Trigonocepahlie and S₂ = Trigonocephalie and p = 0.6 we have: |u_S1| = 2/15; |u_S2| = 2/15; Diff(S₁, S₂) = 0.0254.
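As a quick check of equation (4), the difference can be computed directly from the two scaled unmatched lengths. The function name `diff` and its argument names are chosen here for illustration; `p = 0.6` is the value used in the example above.

```python
def diff(u1, u2, p=0.6):
    """Difference of equation (4).

    u1 and u2 are the unmatched lengths of the two strings,
    each already divided by the length of its string.
    """
    product = u1 * u2
    return product / (p + (1 - p) * (u1 + u2 - product))
```

With `u1 = u2 = 2/15`, the function returns approximately 0.0254, in agreement with the example. When one of the strings is fully matched (`u1 = 0` or `u2 = 0`), the difference is 0.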

The Winkler parameter Winkler(S₁, S₂) is a factor that improves the results [5, 23]. It is defined by equation (5), where L is the length of the common prefix of the strings S₁ and S₂, up to a maximum of 4 characters, and P is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler's work is P = 0.1:

\mathrm{Winkler}(S_{1}, S_{2}) = L \times P \times (1 - \mathrm{Comm}(S_{1}, S_{2}))

(5)

For example, consider Sim(S₁, S₂) between the strings S₁ = hyperaldoterisme and S₂ = hyperaldosteronisme. We have |S₁| = 16 and |S₂| = 19; the common substrings between S₁ and S₂ are hyperaldo, ter, and isme. Thus Comm(S₁, S₂) = 0.914; Diff(S₁, S₂) = 0; Winkler(S₁, S₂) = 0.034 and Sim(hyperaldoterisme, hyperaldosteronisme) = 0.948.
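The Winkler term of equation (5) and the final combination of equation (2) can be sketched as below. The function names are hypothetical, and the commonality and difference values are passed in as plain numbers (here taken from the example above) rather than recomputed.

```python
def common_prefix_length(s1, s2, max_prefix=4):
    """Length L of the common prefix, capped at max_prefix characters."""
    length = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or length == max_prefix:
            break
        length += 1
    return length


def winkler(s1, s2, comm_value, scaling=0.1):
    """Winkler term of equation (5); `scaling` is the constant P."""
    return common_prefix_length(s1, s2) * scaling * (1 - comm_value)


def sim(comm_value, diff_value, winkler_value):
    """Stoilos similarity of equation (2)."""
    return comm_value - diff_value + winkler_value


# Values from the hyperaldoterisme / hyperaldosteronisme example:
# Comm = 2 * (9 + 3 + 4) / (16 + 19) and Diff = 0.
comm_value = 2 * (9 + 3 + 4) / (16 + 19)
winkler_value = winkler("hyperaldoterisme", "hyperaldosteronisme", comm_value)
similarity = sim(comm_value, 0.0, winkler_value)
```

Here the common prefix is capped at L = 4, so the Winkler term is about 0.034 and the similarity is about 0.9486, matching the example up to rounding.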

Processing users' queries

As detailed in [12], spelling errors can be classified as typographic and phonetic. Cognitive errors are caused by a writer's lack of knowledge, and phonetic ones are due to the similar pronunciation of a misspelled word and its correct form. To handle the latter, the queries are pre-processed by a phonetic transcription before applying the Levenshtein edit distance along with the Stoilos similarity function.

CISMeF is a quality-controlled health gateway developed at Rouen University Hospital in France [21]. Doc'CISMeF is the search tool associated with CISMeF. Many ways of navigation and information retrieval are possible through the catalogue. The most used is the simple search, with a free text interface. The information retrieval algorithm is based on the subsumption relationships (specialization/generalization) between medical terms, using their hierarchical information, going from the top of the hierarchy to the bottom. If the user query can be matched to an existing term from the terminology, the result is the union of the resources indexed by the term and the resources indexed by the terms it subsumes, either directly or indirectly, in all the hierarchies it belongs to. For example, a query on the term Hepatitis gives a set of documents indexed by the descriptor Hepatitis but also by the descriptors Hepatitis A, Hepatitis B and so on. However, the vocabularies of medical terminologies are difficult to grasp for a user who is not familiar with the domain.
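The subsumption-based retrieval described above can be sketched on toy data. This is not the Doc'CISMeF implementation: the function names, the `children` hierarchy and the `index` of document identifiers are all hypothetical and serve only to show the union over a term and the terms it subsumes.

```python
def descendants(term, children):
    """All terms subsumed by `term`, directly or indirectly."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def retrieve(term, children, index):
    """Union of the resources indexed by `term` and by the terms it subsumes."""
    results = set()
    for t in {term} | descendants(term, children):
        results |= index.get(t, set())
    return results


# Hypothetical hierarchy and document index.
children = {"hepatitis": ["hepatitis a", "hepatitis b"]}
index = {
    "hepatitis": {"doc1"},
    "hepatitis a": {"doc2"},
    "hepatitis b": {"doc1", "doc3"},
}
hits = retrieve("hepatitis", children, index)
```

In this toy example, a query on "hepatitis" returns the three documents, while a query on "hepatitis a" returns only the document indexed by that narrower term.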

The different materials that we have used to apply the method of spell-checking are related mainly to the search tool Doc'CISMeF: a set of queries and a dictionary of entry terms.

First set of test queries

We first selected a set of queries sent to Doc'CISMeF by different users. A set of 127,750 queries was extracted from the query log server (3 months of logs). Only the most frequent queries were selected, since some queries are more frequent than others. For example, the query "swine flu" appears more often in the query log than "chlorophyll". We eliminated the duplicates (68,712 queries remained). From these 68,712 queries, we selected 25,000 queries and extracted those with no answers (7,562). From these, we selected queries with misspellings from the most frequent queries in the original set and constituted a first test sample of 163 queries. To handle phonetic misspellings, we first performed a phonetic transcription of this sample with the "Phonemisation" function, which is detailed below.

Phonetic transcription of queries and dictionary

Soundex ("Indexing on sound") was the first phonetic string-matching algorithm, developed in 1918 [24] for name matching. The idea was to assign common codes to similar sounding names. Intuitively, names referring to the same person have identical or similar Soundex codes. The length of the code is four and it is of the form letter, digit, digit, digit. The first letter of the code is the same as the first letter of the word. For each subsequent consonant of the word, a digit is concatenated at the end of the code. All vowels and duplicate letters are ignored. The letters h, w and y are also ignored. If the code exceeds the maximum length, extra characters are ignored. If the length of the code is less than 4, zeroes are concatenated at the end. The digits assigned to the different letters for English in the original Soundex algorithm are shown in Table 1. For example, Soundex(Robert) = R163; Soundex(Robin) = R150 (an extra 0 is added to obtain 3 digits); Soundex(Smyth) = S530 and Soundex(Smith) = S530.

Table 1 Soundex codes
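The coding procedure described above can be sketched as follows. The letter-to-digit mapping used here is the standard American Soundex mapping and is assumed to correspond to Table 1; the handling of h and w (they do not separate two letters with the same code) follows the usual convention rather than anything stated in this paper, and the function name `soundex` is chosen for illustration.

```python
# Standard American Soundex mapping (assumed to correspond to Table 1).
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Four-character Soundex code: first letter followed by three digits."""
    word = word.lower()
    if not word:
        return ""
    code = word[0].upper()
    last = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")  # vowels, h, w and y have no digit
        if digit and digit != last:
            code += digit
        if ch not in "hw":
            # Vowels reset the previous code; h and w do not.
            last = digit
    # Pad with zeroes, then truncate to the maximum length of 4.
    return (code + "000")[:4]
```

This sketch yields R163 for Robert, R150 for Robin, and the same code S530 for Smith and Smyth.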
