Mapping biological entities using the longest approximately common prefix method

Background The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation. Results This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method. Conclusions The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix outperforms these nine approximate string matching methods in its Maximum F1 measure when evaluated on three out of the four datasets, and in its average precision on two of the four datasets.

Results: This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method.

Conclusions:
The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix outperforms these nine approximate string matching methods in its Maximum F 1 measure when evaluated on three out of the four datasets, and in its average precision on two of the four datasets.

Background
The term-matching problem has been widely addressed in multiple contexts, which resulted in a number of string similarity metrics designed, applied and evaluated in various research studies [1]. In the biomedical domain, various ASM methods are used by scientists to solve current research tasks such as retrieving sequences from existing databases that are homologous to newly discovered ones, and establishing multiple sequence alignment to discover similarity patterns to predict the function, structure, and evolutionary history of biological sequences [2].
The recent expansion of healthcare information systems that draw from multiple medical databases has resulted in redundant information, among other problems. This phenomenon, also known as the duplicate detection problem, has caused problems with record linkage across medical databases. Previous research has addressed problems such as patient record aggregation from multiple databases based on a minimum profile (i.e., name, gender and date of birth) [3] and term matching for source integration, spelling correction and biomedical data mining applications. In this paper, these tasks are considered in the context of terminologies such as Systemized Nomenclature of Medicine Clinical Terms (SNOMED CT) and the Unified Medical Language System (UMLS) [4]. Approximate String Matching (ASM) methods are used for augmenting, updating, and auditing UMLS vocabularies. ASM methods are also important for facilitating biomedical information extraction, relationship search, and concept discovery [5].
The UMLS is an extensive terminological knowledge base comprised of three major components: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools. The current 2013AB release of the Metathesaurus contains more than 2.9 million concepts and 11.4 million unique terms retrieved from over 160 source vocabularies [6]. UMLS source integration is a complicated multistep process and, despite the availability of numerous algorithmic tools, managing these vocabularies requires considerable human involvement. As additional sources are integrated into the UMLS, they will require reintegration with existing vocabularies [4].
These disadvantages motivate the search for a new method for approximate string matching and UMLSbased evaluation. In this paper, we introduce the Longest Approximately Common Prefix (LACP) method for ASM and present the results of its use to improve the operation of a number of applications in biomedical informatics and related domains.
It bears noting that, in contrast to the well-known SPECIALIST lexicon tools Norm, Word Index or LVG [7], LACP does not perform text manipulations. Instead, it assesses the similarity or dissimilarity of two strings.
Other three highly praised instruments, MetaMap [8], NCBO Annotator [9] and ConceptMapper [10] are publicly available concept recognition systems designed for text annotation from various ontologies [11]. The general rationale of these tools is to split the input text into smaller constructions, such as phrases or tokens, which are subsequently looked up in a dictionary. For instance, MetaMap splits the input text into phrases and produces their variants. Then it generates a candidate set, which is mapped to an ontology. The LACP method, introduced in this paper, may be used as an inner component of such a system for calculating the similarity of a candidate phrase or token when matching to various ontology terms. The authors consider implementation of a text annotation system incorporating the LACP method as a direction for future research.
The rest of the section is dedicated to the analysis of the relevant research approaches and the related work studying the application of well-known similarity measures in the biomedical domain.
Tan et al. [12] applied the classic Levenshtein score incorporated with a particular threshold to medical ontology alignment. Tolentino et al. [13] utilized the Levenshtein technique in combination with other string similarity algorithms to construct a UMLS-based spell checker. Sahay et al. [14] employed more advanced combinations of the Jaro and Jaro-Winkler similarity metrics combined with Term Frequency/Inverse Document Frequency (TFIDF) to compute similarity values between ontological concepts and phrases. Cohen et al. [15] described, implemented and evaluated the above-mentioned hybrid distances in the SecondString Java toolkit.
Plaza et al. [16] applied heuristic rules with a clustering algorithm to the problem of biomedical text summarization. Their work mapped terms found in a given document to UMLS concepts. Using the relationships between the identified UMLS concepts, the authors then represented the document in a graph. They graphed the concepts and assigned sentences to clusters based on semantic similarity. Finally, the most important sentences were selected to be included in a document summary.
Zhen et al. [17] introduced a TFIDF string distance method within their clustering algorithm and applied it to biomedical ontologies. The evaluation of their method demonstrated superior values of the F-measure on two datasets derived from the MeSH and GO ontologies.
In a previous paper, we developed a novel Markov Random Field-based Edit Distance (MRFED) and applied it to the ASM problem in GO ontologies [18]. Similarly, Wellner et al. [19] used Conditional Random Fields in a distance metric method on a UMLS Metathesaurus dataset. Bodenreider et al. [20] applied the Cosine, Jaccard and Dice string similarity coefficients to aligning the UMLS Semantic Network with the Metathesaurus.
Yamaguchi et al. [21] tested four similarity metrics for clustering terms, which appeared in the UMLS Metathesaurus. The authors compared the performances of Monge-Elkan, SoftTFIDF, Jaro-Winkler and the bigram Dice coefficient methods evaluating these techniques on chemical and non-chemical terms grouped into two datasets. They demonstrated that normalized string distances performed better than the standard measures for the evaluation of precision, recall, and F-measure, and that similarity metrics required different parameters such as threshold values for chemical and non-chemical terms, among other findings.
Sauleau et al. [22] propose a novel method for linking medical records by examining the connections between stand-alone and clustered databases. The authors developed a three-step approach: 1) preprocessing the data and applying blockers, 2) matching pairs of records using the Porter-Jaro-Winkler score calculation, and 3) clustering the data. The authors suggest that their method is useful for inserting new entities into large databases.
Zunner et al. [23] studied the semi-automated mapping of non-English terms to Logical Observation Identifiers Names and Codes (LOINC) [24] using the Regenstrief LOINC Mapping Assistant (RELMA) [25]. Their approach resulted in a mapping rate of 500 terms per day, which they considered satisfactory.
In research by Parcero et al. [26], mapping a local terminology to the LOINC dataset led to the development of an automated tool that uses an approximate string matching function. McDonald et al. benchmarked Jaccard, Levenshtein, Monge-Elkan, and Soft TFIDF metrics for LOINC integration, and the Jaccard method was selected as the best choice for such a task [24].
The present research employs the Shortest Path Edit Distance (SPED) algorithm we developed previously [27] to compute a string distance based on substring matching and graph-based transformations. To adjust the dissimilarity values in the final results, we applied a re-scorer set according to the length of equal string prefixes. This final step produced a major improvement in results and inspired this paper on the Longest Approximately Common Prefix (LACP) method, a novel string similarity metric based on the approximate prefix match of two strings. This paper demonstrates how this fast string distance method provides performance that is superior to other methods on datasets from SNOMED CT and from multiple UMLS sources (Table 1) in terms of average precision and Maximum F 1 .

Methods
The Longest Approximately Common Prefix (LACP) method is based on an approximate histogram match of string prefixes. It identifies matches by determining the similarity value of a pair of strings. The method compares the histogram differences between the prefixes of two strings to parameter α. It begins its search in the first characters of the strings. The prefix length is returned when the histogram difference is equal to α or the last character of the shorter string is reached. The prefix length is then divided by the average length of the pair of strings. The division takes into consideration string lengths, since strings that have significantly varying lengths are more dissimilar than strings that do not. The division also assures that the value of the LACP function stays in the [0, 1] interval. The formula for the LACP function (1) is as follows: where prefLength is the length of the longest approximately common prefix. According to formula (1), for two identical strings, LACP is 0, whereas LACP is 1 for two strings not sharing any common prefix under a certain selection of the parameter α. The formula for prefLength is given in (2) below: where prefHistDiff is a histogram difference function of string prefixes, α is a parameter, and S 1..i and T 1..i are prefixes of strings S and T of length i. For example, for the strings S = Anorexia and T = Angina, with an α = 2, the prefLength would be 3, because two initial characters match and α allows only one mismatch. Alternatively, with α = 3 the prefLength would be 4 because two mismatches are allowed. The histogram difference function for string prefixes is defined in formula (3): where hist is a histogram, and i satisfies the inequality (4): A histogram is an array, that counts the number of occurrences of each distinct symbol in a string. In formulae (2) and (3), i denotes a prefix length. By subtracting from i the number of characters that are common to the histograms of both prefixes, the number of non-common characters remains in the difference. This number of noncommon prefixes is matched against the parameter α, as is shown in formula (2). During the evaluation phase, we used α = 3, which allowed two mismatches in histogram difference.
The expression hist(S 1..i ) ∩ hist(T 1..i ) denotes the histogram intersection of two string prefixes. Figure 1 depicts the histogram intersection of two UMLS terms, ammonium and ammonium ion. The histogram of ammonium is in Figure 1a, the histogram of ammonium ion is in Figure 1b. The intersection (Figure 1c) is computed as the minimum for each pair of argument values of the same character, with missing values in one argument omitted from the result.
For example, ammonium contains one "o" while there are two letters "o" in ammonium ion. As min(1, 2) = 1, the resulting histogram in Figure 1c contains the entry "1" for the letter "o". As there is no blank in ammonium, there is also no entry for the blank character in the resulting histogram. In order to compute the size (the "absolute value" ||) of the histogram intersection in Figure 1c, the sum of all the numbers in the result matrix is calculated. For Figure 1c, the size of the histogram intersection is (1 + 1 + 3 + 1 + 1 + 1) = 8.
An example of three strings sharing the same prefix is shown in Table 2. Strings (1) and (2) comprise the first pair, and strings (1)  the first pair of strings is more similar than the second pair. To account for this and similar cases, the length of the approximately common prefix is divided by the average string length in formula (1). In Table 2, strings (1) and (2) belong to the UMLS concept with Concept Unique Identifier (CUI) C0002611, while string (3) is associated with (CUI) C1816069. The LACP algorithm is in Table 3. The algorithm begins by setting the histogram intersection at 0. The search for the longest approximately common prefix begins with the first character of each string. In steps 3 and 4, the characters at the current position i of strings S and T are added to the corresponding histograms. In steps 5 through 9, all characters in the histogram of string S are compared against the histogram of string T at the current iteration i. At this point, the search has advanced to the i-th character of each string. Steps 6 and 7 describe the following: when a character c is found in both histograms, operation Get(c) retrieves the count of this character from both HistS and HistT. Then the smaller of the two values is added to the intersection. The search continues until the parameter α is reached, as shown in line 9, or the last character of the shorter string is processed, as specified in line 2. In the latter case, the length of the shorter string is computed in line 11. Depicts the histogram intersection of two UMLS terms, ammonium and ammonium ion. The histogram of ammonium is in a, the histogram of ammonium ion is in b. The intersection (c) is computed as the minimum for each pair of argument values of the same character, with missing values in one argument omitted from the result. For example, ammonium contains one "o" while there are two letters "o" in ammonium ion. As min (1, 2) = 1, the resulting histogram in c contains the entry "1" for the letter "o." As there is no blank in ammonium, there is also no entry for the blank character in the resulting histogram. In order to compute the size (the "absolute value" ||) of the histogram intersection in c, the sum of all the numbers in the result matrix is calculated. For c, the size of the histogram intersection is (1 + 1 + 3 + 1 + 1 + 1) = 8.  Despite its linear time computational complexity, the simplicity of the LACP algorithm ensures a short execution time. The big-O computational complexity is commonly used for estimating the speed of an algorithm in computer science. The calculation of the LACP method time complexity is shown in Table 3. The inner loop in step 5 is bound by the number of printable characters and therefore constant [28]. Thus, the complexity of the LACP algorithm is linear, i.e., O(n), which is fast comparing to other algorithms evaluated in this paper.

LACP-based interactive spell checker
We have employed the LACP method to develop an interactive online spell checker [29] for SNOMED CT terms. The spell checker is a program written in PHP, which connects to a MySQL database containing SNOMED CT terms from the 2009AB edition of the UMLS. The goal of the application is to evaluate LACP performance by revealing the set of SNOMED CT terms that are similar to the user-provided input term.
The spell checker accepts an input query and interactively outputs the SNOMED CT terms satisfying the condition LACP(S, T) < t. Here, S is the input term, T is a SNOMED CT term, and t is a threshold. To reduce the run time, the algorithm limits the set of search terms by applying length criteria as described below.
There are several parameters that define the performance of the spell checker depending on the mode of operation. The length of a SNOMED CT term |T| that is considered a potential match is bound by formulas (8), (10), and (11) in conformity with each of the three modes of operation. Parameters A and B are used in (11) to determine the values of the lower and upper limits for |T|, respectively. Parameter α sets the upper bound for the number of allowed character mismatches in the prefixes of strings S and T. Threshold t defines the "cutoff point" for the LACP score; a pair of strings S and T is considered to be a match when the LACP score is less than the threshold t.
Three modes of operation are implemented: (a) a search with dynamically estimated parameters; (b) a search with static parameters; and (c) a search with user-defined parameters. In case (a), the search is limited to the database terms meeting the criterion (5), while α is defined in (6) and threshold t is 0.1.
For example, for string S = Ischemia, |S| = 8. Thus, according to (5), the dynamic search would be limited to terms longer than 4 characters and shorter than 12 characters. In case (a), parameter α is set individually for each pair of strings S and T as shown in (6): In case (b), α is set to 1, threshold t is 0.1, and the length of a term should be in the following range (7): In case (c), a user selects parameter values from predefined sets. The search is restricted to terms with lengths within the interval (8).
The dynamic search option adjusts the number of allowed misspellings α along with minimum and maximum term length parameters according to the input query. The dynamic search offers flexibility without user intervention. The threshold t is set to 0.1 for this search mode.
The static search option operates with constant parameter values. It allows only one misspelling. The lengths of the returned strings must be in the neighbourhood of ±3 characters of the input query length. This option decreases the search time for longer input terms compared to the dynamic search option.
The search mode based on user-defined parameters expands parameter options within pre-defined ranges. This mode is intended for users who are not satisfied with the results of the dynamic and static modes or who seek more refined results.
In summary, the dynamic option is suggested when results significantly vary in length from the search query. The static search option should be used when the resulting strings is expected to lie in the neighbourhood of the input term. The search with user-defined parameters is intended for fine-tuning results or for a more advanced search.

Results
The LACP was compared to nine other well-known approximate string distance metrics: Jaccard [30], Jaro [31], Jaro-Winkler [32], Levenshtein [33], Monge-Elkan [34], Needleman-Wunsch [35], Smith-Waterman [36], TFIDF [37], and Soft TFIDF [15]. LACP was compared with these string matching methods on four datasets derived from Version 2009AB of the UMLS (Table 1). Dataset D 1 was obtained by counting occurrences of each Concept Unique Identifier (CUI) within the UMLS [38], retrieving all terms corresponding to the 100 most frequent CUIs and eliminating records with duplicate terms. D 2 was created in the same way, but limited to concepts from SNOMED CT [39]. D 3 was built by retrieving the 5,000 longest terms from the multiple UMLS sources. D 4 was constructed by taking the 5,000 longest terms from SNOMED CT.
SecondString [17], an open-source Java toolkit, was used as an experimental test bed. During the experiments, each term was matched against those within a set of candidate pairs. This type of set reduces the problem size and speeds up experiment execution. The candidate set includes pairs of terms from the dataset that share one or more common words. The goal was to determine whether every pair of terms has the same CUI. Using common performance evaluation methods from information retrieval [27], we calculated average precision (P), recall (R) and Maximum F 1 values (formulae (9), (10), and (11)), and graphed precision-recall (P-R) curves for our method and for the competing techniques. Precision and recall are tradeoffs against one another: on the one hand, it is possible to obtain the maximum value of recall with a low value of precision by retrieving all documents for all queries. On the other hand, the precision usually decreases as the number of retrieved documents grows. A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall [40].
In (9) and (10), D r denotes the number of relevant items retrieved, D t is the total number of retrieved items, and N r is the number of relevant items in the collection.
LACP achieves the highest average precision for datasets D 1 and D 4 (Table 4) and the best values of Maximum F 1 for D 1 , D 2 , and D 4 ( Table 5). TFIDF and Soft TFIDF achieve the best scores of average precision for D 1 and D 2 and the largest Maximum F 1 for D 3 . It is worth noting that TFIDF and Soft TFIDF demonstrate exactly the same values of average precision and Maximum F 1 for each dataset, although Soft TDIDF executes the operation at a significantly slower pace. Table 6 shows that LACP is the fastest method on every dataset. Figure 2 depicts four precision-recall charts plotting interpolated precision values at 11 recall levels [27]. The horizontal axis shows 11 recall points; the vertical axis displays interpolated precision values. A method with a larger area under its curve demonstrates a better result. The differences in performance between LACP, TFIDF and Soft TFIDF are easily apparent. For D 1 and D 4, LACP consistently outperforms the other two methods. It is important to note, however, that on D 2 , LACP experiences a rapid precision drop after recall = 0.5, and that on D 3 , LACP is inferior to most methods.

Discussion
The primary advantage of the LACP method is its short execution times, a feature that is highly desirable when dealing with the large data sets involved in Medical Informatics. The performance of the LACP method can be interpreted by studying the structure of the datasets D 1 , −D 4 . Datasets D 1 , D 2 , and D 4 have higher numbers of terms per concept compared to dataset D 3 (see Table 1). Thus, D 1 , D 2 , and D 4 have a higher number of records that have the same CUIs and have approximately common Figure 2 Precision-recall curves of the evaluated methods. Figure 2 depicts four precision-recall charts plotting interpolated precision values at 11 recall levels. The horizontal axis shows 11 recall points; the vertical axis displays interpolated precision values. A method with a larger area under its curve demonstrates a better result. The differences in performance between LACP, TFIDF and Soft TFIDF are easily apparent. For D 1 and D 4, LACP consistently outperforms the other two methods. It is important to note, however, that on D 2 , LACP experiences a rapid precision drop after recall = 0.5, and that on D 3 , LACP is inferior to most methods.  prefixes. This allows the LACP algorithm to outperform other more complicated well-known methods on D 1 , D 2 , and D 4 . However, the LACP method performed poorly on D 3 . This is due to the large number of concepts with similar terms. As shown in Table 7, five terms share a 146character-long common prefix, for example. By design, such terms are evaluated by LACP as very similar, which in fact is incorrect. Large numbers of such similarly spelled UMLS terms with different identifiers leave no chance for the LACP algorithm to succeed in these contexts.
We note that the current online spell checker is a prototype. It has not been optimized for speed nor is it intended to compete with the well-known Google Instant Search [10], which displays search predictions as the user types a query. Instead, our goal is to create a spell checker specifically for use with biomedical terminologies. The remarkable difference between the excellent performance of LACP on datasets D 1 , D 2 , and D 4 and its disappointing performance on D 3 indicates that approximate string matching methods exhibit a certain degree of domain dependence. In fact, as detailed in an extensive research report by Rudniy [41], domain dependence has been shown to be a common phenomenon.

Conclusions
LACP is a novel method we have developed for computing approximate string similarities based on assessing the length of approximately common string prefixes. The algorithm implements a normalization technique by dividing the length of the approximately common prefix by the average length of the pair of strings. LACP performed better than a number of well-known string similarity algorithms on three out of four datasets and demonstrated the shortest execution times on all four. For the average precision measure, LACP achieved the highest values of 0.62 on dataset D 1 and 0.84 on dataset D 4 . On D 3 , LACP was second best, with an average precision of 0.51. Our method had the best values of Maximum F 1 on three datasets: 0.69 on D 1 , 0.61 on D 2 , and 0.92 on D 4 . However, LACP experienced a drop in performance on dataset D 3 . In terms of execution time, LACP was on average two times faster than the Jaccard method, which achieved the second best times.
The LACP method demonstrated superior performance on certain types of biomedical datasets though its productivity has to be determined for other corpora. Another common limitation of the approximate string matching methods lies in the inability to determine that differently spelled synonyms correspond to the same concept. For such cases, either semantic methods or expert insight are required.
In future work, we will attempt to identify the cause and solve the problem of performance variability due to differences in dataset characteristics. Another branch of future research consists of investigating the best value for parameter α. The ultimate-though difficult-goal is to develop an approximate string matching method that recognizes and adapts to the distinctive characteristics of each dataset.