Matching health information seekers' queries to medical terms
© Soualmia et al.; licensee BioMed Central Ltd. 2012
Published: 7 September 2012
The Internet is a major source of health information but most seekers are not familiar with medical vocabularies. Hence, their searches fail due to bad query formulation. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques or knowledge-based methods. However, it would be useful to clean those queries which are misspelled. In this paper, we propose a simple yet efficient method in order to correct misspellings of queries submitted by health information seekers to a medical online search tool.
In addition to query normalizations and exact phonetic term matching, we tested two approximate string comparators: the similarity score function of Stoilos and the normalized Levenshtein edit distance. We propose here to combine them to increase the number of matched medical terms in French. We first took a sample of query logs to determine the thresholds and processing times. In the second run, at a greater scale we tested different combinations of query normalizations before or after misspelling correction with the retained thresholds in the first run.
According to the total number of suggestions (around 163, the number of queries in the first sample), at a threshold comparator score of 0.3, the normalized Levenshtein edit distance gave the highest F-Measure (88.15%) and at a threshold comparator score of 0.7, the Stoilos function gave the highest F-Measure (84.31%). By combining Levenshtein and Stoilos, the highest F-Measure (80.28%) is obtained with 0.2 and 0.7 thresholds respectively. However, queries are composed of several terms that may be combinations of medical terms. A process of query normalization and segmentation is thus required. The highest F-Measure (64.18%) is obtained when this process is performed before spelling correction.
Despite the widely known high performance of the normalized edit distance of Levenshtein, we show in this paper that its combination with the Stoilos algorithm improved the results for misspelling correction of user queries. Accuracy is improved by combining spelling, phoneme-based information and string normalizations and segmentations into medical terms. These encouraging results have enabled the integration of this method into two projects funded by the French National Research Agency-Technologies for Health Care. The first aims to facilitate the coding process of clinical free texts contained in Electronic Health Records and discharge summaries, whereas the second aims at improving information retrieval through Electronic Health Records.
The Internet is fast becoming a recognized source of information in many fields, including health. In this domain, as in others, users now experience great difficulty in finding precisely what they are looking for among the numerous documents available online, in spite of existing tools. For medical and health-related information accessible on the Internet, general search engines, such as Google, or general catalogues, such as Yahoo, cannot solve this problem efficiently. This is because they usually offer a selection of documents that turns out to be either too large or ill-suited to the query. Free text word-based search engines typically return innumerable completely irrelevant hits, which require much manual weeding by the user, and also miss important information resources.
In this context, several health gateways  have been developed to support systematic resource discovery and help users find the health information they are looking for. These information seekers may be patients but also health professionals, such as physicians searching for clinical trials. Health gateways rely on thesauri and controlled vocabularies. Some of them are evaluated in . Thesauri are a proven key technology for effective access to information since they provide a controlled vocabulary for indexing information. They therefore help to overcome some of the problems of free-text search by relating and grouping relevant terms in a specific domain. Nonetheless, medical vocabularies are difficult to handle by non-professionals.
Many tools have been developed to improve information retrieval from such gateways. They exploit techniques such as natural language processing, statistics, and lexical and background knowledge. However, a simple spelling corrector, such as Google's "Did you mean:" or Yahoo's "Also try:" feature, may be a valuable tool for non-professional users who may approach the medical domain in a more general way. Such features can improve the performance of these tools and provide the user with the necessary help. In fact, spelling errors represent a major challenge for an information retrieval system. If misspellings in the queries (composed of one or multiple words) generated by information seekers remain undetected, the search may return no results. Spelling correctors may be classified into two categories. The first relies on a dictionary of well-spelled terms and selects the top candidate based on a string edit distance calculation. An approximate string matching algorithm, or function, is required to detect errors in users' queries. It then recommends a list of terms from the dictionary that are similar to each query word. The second category of spelling correctors uses lexical disambiguation tools in order to refine the ranking of the candidate terms that might be a correction of the misspelled query. Several studies have been published on this subject. We cite the work of Grannis  which describes a method for calculating similarity in order to improve medical record linkage. This method uses different algorithms such as Jaro-Winkler, Levenshtein  and the longest common subsequence (LCS). In  the authors suggest improving the algorithm for computing Levenshtein similarity by using the frequency and length of strings. In  a phonetic transcription corrects users' queries when they are misspelled but have similar pronunciation (e.g. Alzaymer vs. Alzheimer).
In  the authors propose a simple and flexible spell checker using efficient associative matching in a neural system and also compare their method with other commonly used spell checkers.
In fact, the problem of automatic spell checking is not new. Research in this area started in the 1960s  and many different techniques for spell checking have been proposed since then. Some of these techniques exploit general spelling error tendencies and others exploit phonetic transcription of the misspelled term to find the correct term. The process of spell checking can generally be divided into three steps: (i) error detection: the validity of a term in a language is verified and invalid terms are identified as spelling errors; (ii) error correction: valid candidate terms from the dictionary are selected as corrections for the misspelled term; and (iii) ranking: the selected corrections are sorted in decreasing order of their likelihood of being the intended term. Many studies have been performed to analyze the types and the tendencies of spelling errors for the English language. According to  spelling errors are generally divided into two types, (i) typographic errors and (ii) cognitive errors. Typographic errors occur when the correct spelling is known but the word is mistyped. These errors are mostly related to keyboard errors and therefore do not follow any linguistic criteria (58% of these errors involve adjacent keys  and occur because the wrong key is pressed, two keys are pressed, or keys are pressed in the wrong order). Cognitive errors, or orthographic errors, occur when the correct spelling of a term is not known. The pronunciation of the misspelled term is similar to the pronunciation of the intended correct term. In English, the sound similarity of characters is a factor that often affects error tendencies . However, phonetic errors are harder to correct because they deform the word more than a single insertion, deletion or substitution.
Indeed, over 80% of errors fall into one of the following four single edit operation categories: (i) single letter insertion; (ii) single letter deletion; (iii) single letter substitution and (iv) transposition of two adjacent letters [10, 11].
The third step in spell-checking is the ranking of the selected corrections. Most spell-checking techniques do not provide any explicit ranking mechanism. However, statistical techniques rank the corrections based on probability scores, with good results [13–15].
HONselect  is a multilingual and intelligent search tool integrating heterogeneous web resources in health. In the medical domain, spell-checking is performed on the basis of a medical thesaurus by offering information seekers several medical terms, ranging from one to four differences related to the original query. Exploiting the frequency of a given term in the medical domain can also significantly improve spelling correction : an edit distance technique is used for correction, along with term frequencies for ranking. In  the authors use normalization techniques, aggressive reformatting and abbreviation expansion for unrecognized words as well as spelling correction to find the closest drug names within RxNorm for drug name variants that can be found in local drug formularies. It returns only drug name suggestions. To match queries with the MeSH thesaurus, Wilbur et al.  propose a technique based on the noisy channel model and statistics from the PubMed logs.
Research has focused on several different areas, from pattern matching algorithms and dictionary searching techniques to optical character recognition of spelling corrections in different domains. However, relatively few groups have studied spelling corrections regarding medical queries in French. In this paper, a simple method is proposed: it combines two approximate string comparators, the well-known Levenshtein  edit distance and the Stoilos similarity function defined in  for ontologies. We apply and evaluate these two distances, alone and combined, on a set of sample queries in French submitted to the health gateway CISMeF . The queries may be submitted by health professionals in their clinical practice as well as by patients. The system we have designed aims to correct errors resulting in non-existent terms, thus reducing the silence (absence of results) of the associated search tool.
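The normalized Levenshtein edit distance is $\mathrm{LevNorm}(S_1, S_2) = \mathrm{Lev}(S_1, S_2) / \max(|S_1|, |S_2|)$, with $\mathrm{LevNorm}(S_1, S_2) \in [0, 1]$ since $\mathrm{Lev}(S_1, S_2) \le \max(|S_1|, |S_2|)$.
For example, LevNorm(eutanasia, euthanasia) = 1/10 = 0.1, as Lev(eutanasia, euthanasia) = 1 (one character, h, is added), |eutanasia| = 9 and |euthanasia| = 10.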
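The normalized distance can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation; it assumes that LevNorm divides the edit distance by the length of the longer string, which is consistent with the euthanasia example, and the function names `levenshtein` and `lev_norm` are chosen here for the example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lev_norm(s1: str, s2: str) -> float:
    """Edit distance divided by the length of the longer string (0 = identical)."""
    longest = max(len(s1), len(s2))
    return levenshtein(s1, s2) / longest if longest else 0.0


print(lev_norm("eutanasia", "euthanasia"))  # 0.1
```

A lower score means a closer match, so a candidate is kept when its score falls below the chosen threshold.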
For example, for the strings $S_1$ = Trigonocepahlie and $S_2$ = Trigonocephalie we have: |MaxComSubString_1| = |Trigonocep| = 10 and |MaxComSubString_2| = |lie| = 3, so Comm(Trigonocepahlie, Trigonocephalie) = 2(10 + 3)/(15 + 15) = 0.866.
For example, for $S_1$ = Trigonocepahlie, $S_2$ = Trigonocephalie and p = 0.6 we have: $|u_{S_1}|$ = 2/15; $|u_{S_2}|$ = 2/15; Diff($S_1$, $S_2$) = 0.0254.
For example, consider Sim($S_1$, $S_2$) between the strings $S_1$ = hyperaldoterisme and $S_2$ = hyperaldosteronisme. We have $|S_1|$ = 16 and $|S_2|$ = 19; the common substrings between $S_1$ and $S_2$ are hyperaldo, ter, and isme. Comm($S_1$, $S_2$) = 0.914; Diff($S_1$, $S_2$) = 0; Winkler($S_1$, $S_2$) = 0.034 and Sim(hyperaldoterisme, hyperaldosteronisme) = 0.914 − 0 + 0.034 = 0.948.
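The three worked examples can be reproduced with the sketch below. It is not the authors' code. It assumes the usual formulation of the Stoilos function, which agrees with the figures above: Comm = 2·Σ|MaxComSubString_i| / (|S1| + |S2|), Diff = u1·u2 / (p + (1 − p)(u1 + u2 − u1·u2)) where u1 and u2 are the unmatched fractions of each string, Winkler = 0.1 · min(common prefix length, 4) · (1 − Comm), and Sim = Comm − Diff + Winkler. The cut-off `min_len` for a common substring is a further assumption: the examples only show that single-character matches are not counted.

```python
def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length) of a longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def stoilos_parts(s1, s2, p=0.6, min_len=2):
    """Return (Comm, Diff, Winkler, Sim) for two non-empty strings."""
    if not s1 or not s2:
        raise ValueError("both strings must be non-empty")
    a, b, common = s1, s2, 0
    # Repeatedly remove the longest common substring from both strings.
    while True:
        i, j, n = longest_common_substring(a, b)
        if n < min_len:  # assumed cut-off, see the lead-in
            break
        common += n
        a = a[:i] + a[i + n:]
        b = b[:j] + b[j + n:]
    comm = 2 * common / (len(s1) + len(s2))
    u1, u2 = len(a) / len(s1), len(b) / len(s2)  # unmatched fractions
    prod = u1 * u2
    diff = 0.0 if prod == 0 else prod / (p + (1 - p) * (u1 + u2 - prod))
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        prefix += 1
    winkler = 0.1 * min(prefix, 4) * (1 - comm)
    return comm, diff, winkler, comm - diff + winkler


comm, diff, winkler, sim = stoilos_parts("hyperaldoterisme", "hyperaldosteronisme")
print(round(comm, 3), diff, round(winkler, 3))
```

Unlike the normalized Levenshtein distance, a higher score here means a closer match, so candidates are kept when the score exceeds the threshold.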
As detailed in , spelling errors can be classified as typographic or cognitive. Cognitive errors are caused by a writer's lack of knowledge; among them, phonetic errors arise when the misspelled word is pronounced like the correct word. The queries are pre-processed by a phonetic transcription before applying the Levenshtein edit distance along with the Stoilos similarity function.
CISMeF is a quality-controlled health gateway developed at Rouen University Hospital in France . Doc'CISMeF is the search tool associated with CISMeF. Many ways of navigation and information retrieval are possible through the catalogue. The most used is the simple search, with a free text interface. The information retrieval algorithm is based on the subsumption relationships (specialization/generalization) between medical terms, using their hierarchical information, going from the top of the hierarchy to the bottom. If the user query can be matched to an existing term from the terminology, the result is thus the union of the resources indexed by the term, and the resources that are indexed by the terms it subsumes, either directly or indirectly, in all the hierarchies it belongs to. For example, a query on the term Hepatitis gives a set of documents indexed by the descriptor Hepatitis but also by the descriptors Hepatitis a, Hepatitis b and so on. However, the vocabularies of medical terminologies are difficult to apprehend for a user who is not familiar with the domain.
The different materials that we have used to apply the method of spell-checking are related mainly to the search tool Doc'CISMeF: a set of queries and a dictionary of entry terms.
We first selected a set of queries sent to Doc'CISMeF by different users. A set of 127,750 queries was extracted from the query log server (three months of logs). Only the most frequent queries were selected, since some queries are more frequent than others: for example, the query "swine flu" appears more often in the query log than "chlorophyll". After eliminating duplicates, 68,712 queries remained. From these 68,712 queries, we selected 25,000 queries and extracted those with no answers (7,562). From these, we selected queries with misspellings from the most frequent queries in the original set, which constituted a first test sample of 163 queries. To handle phonetic misspellings, we first performed a phonetic transcription of this sample with the "Phonemisation" function, which is detailed below.
Examples of Soundex letter groups sharing one code: b, f, p, v (code 1); c, g, j, k, q, s, x, z (code 2).
Many variations of the basic Soundex algorithm, such as changing the code length, assigning a code to the letter of the string or making N-Gram substitutions before code assignment have been tested.
"é" [e] "è" [ε] "e" [ø]
The Phonemisation function that we have developed for medical terms allows us to find a word even if it is misspelled, provided that it sounds right. For example, for the query "kollesterraulle" (instead of "cholesterol"), Phonemisation(kollesterraulle) = Phonemisation(cholesterol) = "kolesterol". We have also manually constituted a list of words that end in "er" or "ed" and whose ending is pronounced "e" in French. To encode the terms, changes are made according to the letters that follow or precede groups of letters that have a particular sound. For example, for the word "insomnia" the letters 'in' are replaced by the code '1', giving Phonemisation(insomnia) = "1somnia". However, in the word "inosine" we find the same combination of letters 'in' but, as the next letter "o" is a vowel, no changes are made to the word.
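The context-dependent rule for 'in' can be illustrated with a short sketch. It covers only this single rule, not the full Phonemisation function, whose complete rule set is not reproduced here. The name `encode_in`, the vowel set and the assumption that the input is already lowercased and deaccented are choices made for the example.

```python
import re

# Assumed vowel set for deaccented, lowercased input.
VOWELS = "aeiouy"

# 'in' is treated as a nasal sound unless a vowel follows it.
NASAL_IN = re.compile(r"in(?![" + VOWELS + r"])")


def encode_in(word: str) -> str:
    """Replace 'in' by the code '1' when it is not followed by a vowel."""
    return NASAL_IN.sub("1", word.lower())


print(encode_in("insomnia"))  # 1somnia
print(encode_in("inosine"))   # inosine (both 'in' groups precede a vowel)
```

Two spellings that produce the same code can then be matched exactly, which is what allows a phonetically plausible misspelling to reach the intended term.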
String modifications according to letter combinations and the groups of letters before and after the combination
Some modifications according to letter combinations
Some sound matching
Composition of the reference dictionary based on the MeSH in French
Structure of the queries (with no answer) obtained from the logs
The query was segmented into words using a list of segmentation characters and string tokenizers. This list is composed of all the non-alphanumerical characters (e.g.: * $,!§;|@).
We applied two types of character normalization at this stage. MeSH terms are in the form of non-accented uppercase characters. Nevertheless, the terms used in the CISMeF terminology are in mixed-case and accented. (1) Lowercase conversion: all the uppercase characters were replaced by their lowercase version; "A" was replaced by "a". This step was necessary because the controlled vocabulary is in lowercase. (2) Deaccenting: all accented characters ("éèêë") were replaced by non-accented ("e") ones. Words in the French MeSH were not accented, and words in queries were either accented, not accented, or wrongly accented ("hèpatite" instead of "hépatite").
We eliminated all stop words (such as the, and, when) in the query. Our stop word list was composed of 1,422 elements in French (vs. 135 in PubMed).
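The three pre-processing steps (segmentation, character normalization and stop word removal) can be sketched as follows. This is an illustration under stated assumptions: `STOP_WORDS` is a tiny hypothetical sample rather than the 1,422-element list, it is stored unaccented, and normalization is applied before splitting, which gives the same tokens for these steps.

```python
import re
import unicodedata

# Hypothetical sample; the real list contains 1,422 French stop words.
STOP_WORDS = {"de", "du", "des", "le", "la", "les", "l", "d", "et"}


def deaccent(text: str) -> str:
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query: str) -> list:
    """Lowercase, deaccent, split on non-alphanumerics, drop stop words."""
    text = deaccent(query.lower())
    words = re.split(r"[^0-9a-z]+", text)
    return [w for w in words if w and w not in STOP_WORDS]


print(normalize_query("Hèpatite de l'enfant !"))  # ['hepatite', 'enfant']
```

Because the wrongly accented "hèpatite" and the correct "hépatite" both normalize to "hepatite", accent errors disappear before any approximate comparison is attempted.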
We use regular expressions to match the exact expression of each word of the query with the terminology. This step allowed us to take into account the complex terms (composed of more than one word) of the vocabulary and also to avoid some inherent noise generated by truncations. The query 'accident' is matched with the term 'circulation accident' but not with the terms 'accident' and 'chute accidentelle'. The query 'sida' is matched with the terms 'lymphome lié sida' and 'sida atteinte neurologique' but not with the terms 'glucosidases', 'agrasidae' and 'bêta galactosidase'.
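Whole-word matching of this kind can be sketched with word-boundary anchors. The exact expressions used in the system are not given, so the pattern below is an assumption, and `matches_whole_word` is a name chosen for the example.

```python
import re


def matches_whole_word(query: str, term: str) -> bool:
    """True when the query occurs in the term as a complete word."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, term, flags=re.IGNORECASE) is not None


for term in ["lymphome lié sida", "sida atteinte neurologique",
             "glucosidases", "bêta galactosidase"]:
    print(term, matches_whole_word("sida", term))
```

The two multi-word terms containing the word 'sida' match, while the terms in which 'sida' is only a substring do not, which is the truncation noise the step is meant to avoid.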
The function is as described in the previous section. It converts a word into its French phonemic transcription: e.g. the query alzaymer is replaced by the reserved term alzheimer.
The algorithm searched for the largest set of words in the query corresponding to a reserved term. The query was segmented. The stop words were eliminated. The other words were transformed with the Phonemisation function and sorted alphabetically. The different reserved term bags were formed iteratively until no further combinations were possible. The query 'therapy of the breast cancer' gave two reserved terms: 'therapeutics' and 'breast cancer' (therapy being a synonym of the reserved term therapeutics).
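One possible reading of this Bag-of-Words search is sketched below: word subsets are tried from the largest to the smallest, each subset is sorted to form a key, and matched words are removed before the search continues. The greedy strategy, the `VOCABULARY` mapping and the function names are assumptions made for illustration; the phonetic transcription step is omitted to keep the sketch short.

```python
from itertools import combinations


def bag_key(words):
    """Order-independent key for a bag of words."""
    return tuple(sorted(words))


# Hypothetical vocabulary: sorted word bags mapped to reserved terms.
VOCABULARY = {
    bag_key(["breast", "cancer"]): "breast cancer",
    bag_key(["therapy"]): "therapeutics",  # synonym of the reserved term
    bag_key(["therapeutics"]): "therapeutics",
}


def match_bags(words, vocabulary):
    """Greedily map the largest matching word subsets to reserved terms."""
    remaining = list(words)
    found = []
    size = len(remaining)
    while size > 0:
        match = None
        for combo in combinations(remaining, size):
            term = vocabulary.get(bag_key(combo))
            if term is not None:
                match = (combo, term)
                break
        if match is None:
            size -= 1  # no bag of this size, try smaller bags
            continue
        combo, term = match
        found.append(term)
        for word in combo:
            remaining.remove(word)
        size = min(size, len(remaining))
    return found, remaining


# Words of 'therapy of the breast cancer' after stop word removal.
print(match_bags(["therapy", "breast", "cancer"], VOCABULARY))
```

Words left in the second returned list match no reserved term and are the natural candidates for the approximate comparators.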
Numbers of proposed corrections with the Levenshtein edit distance at different thresholds
Numbers of proposed corrections with the Stoilos function at different thresholds
Numbers of proposed corrections (between brackets the number by query) at different thresholds with the Stoilos function combined with the Levenshtein edit distance
As shown in Tables 8, 9 and 10 and Figure 1, the number of suggestions provided to the user to correct a query is variable, and the task of correcting queries may become overwhelming if the user has to select the correct word from hundreds, or even millions, of suggestions (for Levenshtein < 0.9). Manageable results (around 163, the number of queries) are obtained for the following thresholds: (i) Levenshtein < 0.3; (ii) Stoilos > 0.7 and (iii) the combination of Levenshtein < 0.3 and Stoilos > 0.6.
Evaluations and numbers of corrected queries for Levenshtein edit distance with different thresholds
Evaluations and numbers of corrected queries for Stoilos function with different thresholds
Evaluation (P: Precision, R: Recall, F: F-Measure) and number of corrected queries (Q) with Levenshtein and Stoilos combinations
P = 100; R = 30.67; F = 46.94
P = 95.94; R = 43.55; F = 59.91
P = 93.97; R = 47.85; F = 63.41
P = 96.42; R = 46.69; F = 65.58
P = 96.62; R = 52.76; F = 68.25
P = 93.57; R = 62.57; F = 75.00
P = 92.72; R = 62.57; F = 74.72
P = 91.20; R = 63.81; F = 75.09
P = 90.43; R = 63.80; F = 74.82
P = 100; R = 03.60; F = 07.10
P = 87.39; R = 63.80; F = 73.75
P = 85.36; R = 64.41; F = 73.42
P = 82.30; R = 65.64; F = 73.03
P = 100; R = 36.19; F = 53.15
P = 96.90; R = 57.66; F = 72.30
P = 94.21; R = 69.93; F = 80.28
P = 83.46; R = 65.03; F = 73.10
P = 81.53; R = 65.03; F = 72.35
P = 83.72; R = 66.25; F = 73.97
P = 94.26; R = 70.55; F = 80.70
P = 83.84; R = 66.87; F = 74.75
Note that the function Phonemisation gave a 38% recall, 42% precision and 39.90% F-Measure, which are lower than the methods based on string edit distance or similarity function.
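For reference, the F-Measure values quoted here are consistent with the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below assumes that this balanced form is the one used and reproduces the Phonemisation figure; the function name is chosen for the example.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-Measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Phonemisation: 42% precision and 38% recall.
print(round(f_measure(42, 38), 2))  # 39.9
```

Small differences of about 0.01 can appear for other table entries, since the published precision and recall are themselves rounded.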
According to all these results (mainly precision, total number of suggestions and number of corrected queries), we retained a threshold of 0.2 for the Levenshtein edit distance and 0.7 for the Stoilos function when they are combined for spelling correction.
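The way the two retained thresholds could act together is sketched below. The sketch assumes that a candidate must satisfy both conditions at once (normalized distance below 0.2 and similarity above 0.7). To stay short, it uses `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in similarity: this is not the Stoilos function, only a placeholder with the same orientation (higher means closer). The comparators are passed as parameters so that a real Stoilos implementation can be substituted.

```python
from difflib import SequenceMatcher


def lev_norm(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder for the Stoilos function (NOT the same measure)."""
    return SequenceMatcher(None, a, b).ratio()


def suggest(query, dictionary, distance=lev_norm,
            similarity=stand_in_similarity,
            max_distance=0.2, min_similarity=0.7):
    """Keep terms passing both thresholds, sorted alphabetically."""
    kept = [term for term in dictionary
            if distance(query, term) < max_distance
            and similarity(query, term) > min_similarity]
    return sorted(kept)


print(suggest("eutanasia", ["euthanasia", "eugenics", "anesthesia"]))
# ['euthanasia']
```

Requiring both conditions trades some recall for fewer suggestions per query, which matches the stated aim of keeping the list manageable for the user.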
Number of suggestions according to the size of the query (Nb suggestions by query)
1-word query: Min = 3; Avg = 10.49; Max = 25 | Avg = 0.39; Max = 5
2-word query: Min = 5; Avg = 18.36; Max = 41 | Avg = 0.22; Max = 6
3-word query: Min = 10; Avg = 24.39; Max = 54 | Avg = 0.13; Max = 1
4-word (and more) query: Min = 11; Avg = 37.30; Max = 113 | Avg = 0.06; Max = 1
Evaluation measures of the different methods: Bag-of-Words (BoW), Levenshtein along with Stoilos (LS), LS performed before BoW, and BoW performed before Levenshtein combined with Stoilos
Query sets by size: 1 word (set of 310 queries among 1,061); 2 words (set of 450 queries among 1,636); 3 words (set of 594 queries among 1,443); 4 words and more (set of 710 queries among 2,157); total (set of 2,064 queries among 6,297)
Methods compared: LS before BoW; BoW before LS
Several studies have explored the problem of spelling corrections, but the literature is quite sparse in the medical domain, which is a distinct problem because of the complexity of medical vocabularies. Nonetheless, the work of  uses word frequency based sorting to improve the ranking of suggestions generated by programs such as GNU Gspell and GNU Aspell. This method does not detect any misspellings nor generate suggestions but reports that Aspell gives better results than Gspell. In  Ruch studied contextual spelling correction to improve the effectiveness of a health Information Retrieval system. In  the authors created a prototype spell checker using UMLS and WordNet as English sources of knowledge for cleaning reports on adverse events following immunization. We also cite the work of  which proposes a program for automatic spelling correction in mammography reports. It is based on edit distances and bi-gram probabilities but is applied to a very specific sub-domain of medicine, and not to queries but to plain text. In  the authors use normalization techniques, aggressive reformatting and abbreviation expansion for unrecognized words as well as spelling correction to find the closest drug names within RxNorm for drug name variants found in local drug formularies. The spelling algorithm is that of the RxNorm API, which returns only drug name suggestions. The unknown word must have a minimum length of five characters for spelling correction to be tried. However, the effective usage of the spelling correction component was only 7.6% in the approximate matching of drug names. In addition, many spelling corrections were applied to unknown tokens which were not intended to be drugs. The different experiments we performed show that, with 38% recall and 42% precision, Phonemisation cannot correct all errors: it can only be applied when the query and the entry term of the vocabulary have similar pronunciation.
However, when characters are reversed in the query, the error is of another type: the sound is not the same, and similarity distances such as Levenshtein and Stoilos can be exploited here. Conversely, when certain characters are used instead of others ("ammidale" instead of "amygdale"), string similarity functions are not efficient. The best results (F-Measure 64.18%) are obtained with multi-word queries by performing the Bag-of-Words algorithm first and then the spelling correction based on similarity measures. Because the number of correction suggestions is relatively small (min 1 and max 6) and can be handled manually by a health information seeker, we have chosen to return an alphabetically sorted list rather than a ranked one.
The general idea of spelling correction is based on comparing the query with either dictionaries or controlled vocabularies. If a query does not match the vocabulary, one or more suggestions are proposed to the user. Recent research has focused on the development of algorithms for recognizing a misspelled word, even when the word is in the dictionary, based on the calculation of similarity distances. Damerau  indicated that 80% of all spelling errors are the result of (i) transposition of two adjacent letters (ashtma vs. asthma); (ii) insertion of one letter (asthmma vs. asthma); (iii) deletion of one letter (astma vs. asthma) and (iv) replacement of one letter by another (asthla vs. asthma). Each of these wrong operations costs 1, i.e. the distance between the misspelled word and the correct word is 1.
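The four single-edit categories can be checked with a distance that also counts an adjacent transposition as one operation. The sketch below uses the optimal string alignment variant of the Damerau-Levenshtein distance; it is given for illustration only and is not the comparator evaluated in this paper, which relies on the plain Levenshtein distance.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertion, deletion, substitution and
    transposition of two adjacent letters as one operation each."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


for wrong in ["ashtma", "asthmma", "astma", "asthla"]:
    print(wrong, osa_distance(wrong, "asthma"))  # each pair is at distance 1
```

With the plain Levenshtein distance the transposition "ashtma" would cost 2, since it must be expressed as two substitutions or as a deletion plus an insertion.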
In this paper, we present a method to automatically correct misspelled queries submitted to a health search tool that may be used both by patients and by health professionals such as physicians during their clinical practice. We have described how to adapt the Levenshtein and Stoilos functions to calculate similarity in spell-checking medical terms when there is character reversal. We have also presented the combined approach of two similarity functions and defined the best thresholds. Our results show that using these distances improves phonetic transcription results. This latter step is not only necessary but is less expensive than calculating distance. The best results (in terms of quality and quantity) are obtained by performing the Bag-of-Words algorithm (which includes phonetic transcription) before the combination of Levenshtein and Stoilos similarity functions.
The use of keyboard configuration, by studying the distances between keys, is another possible direction for suggesting spelling corrections, for example when the user types a "Q" instead of an "A", which is located just above it on the keyboard. This is similar to the work detailed in  for correcting German brand names of drugs. These errors are more frequent when queries are submitted from a tablet PC or a smartphone, the keyboard being smaller in size.
This method may also be used to extract medical information from clinical free texts of electronic health records or discharge summaries. Indeed, the efforts to recognize medical terms in text have focused on finding disease names in electronic medical records, discharge summaries, clinical guideline descriptions and clinical trial summaries. The survey of Meystre et al.  describes several studies on detecting information elements in clinical texts using natural language processing and shows their impact on clinical practice. These information elements may be diseases , treatments  in English, or other medical information in French . However, as in any free text, clinical notes may contain misspellings. Using our method may be a preliminary step to cleaning these notes before coding. The algorithms we have presented in this paper will be integrated into the first work package of the following two research projects, both of which are funded by the French National Research Agency: the RAVEL project for information retrieval through patient medical records and the SIFADO project for helping health professionals to code discharge summaries, whose free-text components require manual processing by human encoders.
The authors are grateful to Nikki Sabourin, Rouen University Hospital, for reviewing the manuscript in English.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 14, 2012: Selected articles from Research from the Eleventh International Workshop on Network Tools and Applications in Biology (NETTAB 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S14
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.