Automated vocabulary discovery for geo-parsing online epidemic intelligence
© Keller et al; licensee BioMed Central Ltd. 2009
Received: 03 June 2009
Accepted: 24 November 2009
Published: 24 November 2009
Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human.
Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach which relies on a relatively small expert-built gazetteer, thus limiting the need for human input, but focuses on learning the context in which geographic references appear. We show, in a set of experiments, that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon.
The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting.
Web-based information sources such as online news media, government websites, mailing lists, blogs and chatrooms provide valuable epidemic intelligence by disseminating current, highly local information about outbreaks, especially in geographic areas that have limited public health infrastructure. These data have been credited with providing early evidence of disease events, such as SARS (Severe Acute Respiratory Syndrome) and avian influenza. The availability of open source and freely available technology has spawned a new generation of disease "mashups" that scour the web and provide real-time outbreak information. HealthMap [2, 3] is one such system that monitors, analyzes and disseminates disease outbreak alerts in news media from all around the world. Each hour, the system automatically queries over 20,000 sources, using news aggregators such as Google News, for relevant reports. It filters the retrieved documents into several taxonomies and provides on its website, http://www.HealthMap.org, a geographic and disease-based display of the ongoing alerts. HealthMap provides a starting point for real-time intelligence on a broad range of emerging infectious diseases, and is designed for a diverse set of users, including public health officials and international travelers [4, 5].
This real-time surveillance platform is composed of a number of Information Retrieval and Natural Language Processing modules, such as outbreak alert retrieval and categorization, information extraction, etc. In the present work we are interested in a critical task of the last phase of the information processing scheme: the geographic parsing ("geo-parsing") of a disease outbreak alert, that is, the extraction of the geographic information contained in such a textual document. This information is needed for the precise geographic mapping, as well as for the identification/characterization of the particular disease outbreak described in the alert. Indeed each alert is uniquely characterized by its disease category, a set period in time and its precise geographic location. A good characterization of the outbreak allows the system to discriminate between duplications of an alert and new alerts. It is also essential for the evaluation of the system and for long-term analysis. Most importantly, the automated high resolution geographic assignment of alerts aids rapid triaging of important information by system users, including public agencies such as the World Health Organization and the US Centers for Disease Control and Prevention that use these data as part of their daily surveillance efforts.
So far, HealthMap assigns incoming alerts to a low resolution geographic description such as its country, and in some cases its immediately lower geographic designation (for the USA and Canada, it would provide for example the state or province). The system uses a rule-based approach relying on a specially crafted gazetteer, which was built incrementally by adding relevant geographic phrases extracted from the specific kind of news report intended for mapping. The approach consists in a look-up tree algorithm which tries to find a match between the sequences of words in the alert and the sequences of words in the entries of the gazetteer. It also implements a set of rules which use the position of the phrase in the alert to decide whether or not the phrase is related to the reported disease.
The gazetteer contains around 4,000 key phrases. Some refer to geographic locations at several resolution levels (from hospitals to countries), some are negative phrases (≈ 500 phrases, eg Brazil nut or turkey flock, which are not considered location references), and others are phrases specific to the kind of data processed (Center for Disease Control, Swedish health officials, etc.).
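To make the look-up tree idea concrete, the following is a minimal sketch of a word-level trie matcher with negative phrases. It is not the HealthMap implementation: the tiny phrase lists, the longest-match policy and the names `build_trie` and `tag_locations` are assumptions made for illustration, and the position-based rules mentioned above are not modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature gazetteer; the real HealthMap entries are not reproduced here.
LOCATION_PHRASES = ["brazil", "new york", "turkey", "canada"]
NEGATIVE_PHRASES = ["brazil nut", "turkey flock", "canada geese"]

END = None  # non-string key marking a complete phrase; cannot collide with a word


def build_trie(locations: List[str], negatives: List[str]) -> Dict:
    """Build a word-level look-up tree; each complete phrase stores its label under END."""
    root: Dict = {}
    for label, phrases in (("loc", locations), ("neg", negatives)):
        for phrase in phrases:
            node = root
            for word in phrase.split():
                node = node.setdefault(word, {})
            node[END] = label
    return root


def tag_locations(tokens: List[str], trie: Dict) -> List[Tuple[int, int]]:
    """Return (start, end) token spans of the longest matches labeled 'loc'."""
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        node, best_end, best_label = trie, None, None
        j = i
        # Walk down the tree as long as the next word continues some phrase.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                best_end, best_label = j, node[END]
        if best_end is None:
            i += 1
            continue
        if best_label == "loc":
            spans.append((i, best_end))
        i = best_end  # skip past the match, so negative phrases suppress their prefix
    return spans


tokens = "Outbreak in New York linked to Brazil nut imports from Brazil".split()
spans = tag_locations(tokens, build_trie(LOCATION_PHRASES, NEGATIVE_PHRASES))
```

Because the longest match wins in this sketch, "Brazil nut" is consumed as a negative phrase and only the final "Brazil" is tagged, which mirrors the role the negative phrases play in the gazetteer.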
While the current limited gazetteer has proven useful for a high level view of ongoing threats, there is a public health need to develop a method that provides the highest resolution geographic assignments, especially for public health practitioners who require this information for outbreak verification and follow-up. The approach presented in the following section is an attempt at producing a geo-parser using the prior knowledge encoded in the gazetteer as a base. At first glance, it would seem that in order to increase the resolution of the HealthMap geo-parser, expanding the size of the gazetteer should be enough. However, in our experience, adding new terms to the gazetteer without careful supervision often results in an upsurge of false positives for the system. The ability to predict statistically if a word in a sentence is a geographic reference is a valuable feature for a geo-parser. Indeed, no matter the size of the gazetteer, it cannot contain every geographic reference, and even words in the gazetteer might be false positives in the alert (Canada geese). The typical way of obtaining a statistical predictor is to have access to annotated training data. However, data annotation is a very time-consuming and expensive task, and thus the approach we present is an attempt at circumventing it by using the already available gazetteer.
The inspiration behind the present approach is based on the intuition that a human reader presented with a text containing a phrase that is out of his vocabulary would most likely be able to guess whether this phrase refers to a geographic location or not. This reader would infer the semantic role of the phrase with a certain accuracy, because he has prior knowledge of the syntactic context in which geographic references appear, maybe also of their particular character distribution or the fact that they generally begin with a capital letter, etc. Our approach in some sense simulates this situation with a learning algorithm in the guise of an artificial "reader." We use the HealthMap gazetteer as the supervision (or reference) for training the reader. Some of the location words in the training texts are purposely hidden from the reader's vocabulary in order to divert the attention of the learning algorithm to the context in which these words appear instead of the words themselves. Previous related natural language processing approaches reported in the literature, like ours, use a limited knowledge base to generate a broader one.
The task we are trying to solve, namely finding geographic location references in a text, falls into the more generic Natural Language Processing problem of Information Extraction [7–9], which involves automatically selecting sub-strings, containing specific types of information, from a text. It is in particular closely related to the Named Entity Recognition task, in which texts are parsed in search of references (mainly proper names) to persons, organizations and locations or, more recently, gene names. However, we are here interested in more than just named entities, since any hint of location, even, for example, adjectives (French authorities) or public health organizations (INSERM: a French institute of medical research), can provide us with desired information.
There have been a number of approaches to named entity recognition and more generally to information extraction problems (see eg [12, 13], or for a named entity recognition system being used in biosurveillance), exploiting, as we do, syntactic and contextual information. They usually rely, however, on supervised approaches, which require heavily annotated datasets to account for the human experience. Building these annotated corpora is extremely time-consuming and expensive, and results in a so-called knowledge-engineering bottleneck. On the other hand, large numbers of unlabeled texts are easily available through, for example, the Internet.
In order to take advantage of the unlabeled data, while avoiding the cumbersome need for annotation, a number of approaches, sometimes referred to as Automatic Knowledge Acquisition, have been developed. The domains to which each of these methods is applied are very diverse. Some concentrate specifically on the named entity classification problem [16, 17], while others, like ours, have a different information extraction scope [15, 18, 19]. Whether they use a few rules [16, 19] (eg Mr. [Proper Noun] → Person) or a small lexicon [15–17, 20] (such as a small gazetteer) as seeds for the information to be extracted, all approaches, including ours, begin with a partially labeled corpus. These approaches are related to semi-supervised learning, where a few labeled examples are used in conjunction with a large number of unlabeled ones. The goal of all these models is to exploit the redundancy of language and to learn a generalization of the context in which labels appear.
Strategies on how to use these few labeled examples diverge. Most of these approaches go through their training sets in several steps and incrementally add inferred labels to their labeled examples, in a bootstrap fashion [15–18, 20]. Of course, the addition of every inferred labeled example (even false positives) can quickly produce a drift towards a noisy solution. These approaches thus rely on heuristics to decide which examples to add at each iteration. Other approaches, like ours, are built to learn everything they need from the initial input only.
While our present work focuses on geo-parsing an English-language corpus, HealthMap surveillance so far covers alerts in 4 more languages (Spanish, French, Russian and Chinese), and plans to expand to other languages. Most of these approaches [15, 16, 19, 20] use elaborate linguistic knowledge either to represent the words or to target groups of words to tag. Relying on complex linguistic features requires language-specific expert knowledge that is difficult to obtain. Our approach relies on low-level syntactic features, making it easily portable to other languages. In addition, it is based on statistical machine learning principles (like [16, 17]), as opposed to rule-based ones, which also reduces the need for expert knowledge.
Our dataset consists of English-language disease outbreak alerts retrieved in 2007 by the HealthMap system. We used the HealthMap rule-based approach to tag the words in the alert that match geographical references found in the gazetteer. To enrich the dataset with syntactic information, we tagged the alerts with a part-of-speech tagger (see Methods section for more details). Given that, in English, location names often begin with capital letters or appear as acronyms, we assigned to the words in the alerts, in addition to their part-of-speech tags, a capitalization status, ie none, first character, upper case.
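The capitalization status can be computed with a few lines of code. The sketch below is one possible reading of the three categories; the function name `capitalization_status` and the treatment of edge cases (single capital letters, dotted acronyms) are assumptions, not details given in the text.

```python
def capitalization_status(word: str) -> str:
    """Return 'upper case', 'first character' or 'none' for a token."""
    letters = [c for c in word if c.isalpha()]
    # Assumption: at least two letters, all capitals, counts as an acronym ("USA", "U.S.").
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "upper case"
    # Assumption: a single capital letter ("A") is treated as a capitalized word.
    if word[:1].isupper():
        return "first character"
    return "none"


statuses = [capitalization_status(w) for w in ["USA", "Boston", "outbreak", "U.S."]]
```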
As explained previously, one important characteristic of this experiment is the fact that the words (that is, the lexical information) are only partially accessible to the learning algorithm. This is implemented by the choice of the dictionary D mentioned in the previous paragraph. As we will explain shortly, some words in the corpus are purposely left out of the dictionary. Consequently, the first components of their sparse representation, the ones referred to as dictionary index in Fig. 2, are all set to zero. Such words are thus represented only by their part-of-speech tag and their capitalization status.
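As an illustration of this partial access to lexical information, the following sketch encodes a word as the concatenation of three one-hot blocks (dictionary index, part-of-speech tag, capitalization status). The exact layout used in the paper's figure is not reproduced here; the block order, the lowercasing of dictionary keys and the name `encode_word` are assumptions.

```python
from typing import Dict, List


def encode_word(word: str, pos: str, cap: str,
                dictionary: Dict[str, int],
                pos_tags: List[str],
                cap_states: List[str]) -> List[int]:
    """Concatenate one-hot blocks: [dictionary index | POS tag | capitalization]."""
    vec = [0] * (len(dictionary) + len(pos_tags) + len(cap_states))
    idx = dictionary.get(word.lower())  # assumption: dictionary keys are lowercased
    if idx is not None:
        vec[idx] = 1  # words left out of the dictionary keep an all-zero first block
    vec[len(dictionary) + pos_tags.index(pos)] = 1
    vec[len(dictionary) + len(pos_tags) + cap_states.index(cap)] = 1
    return vec


dictionary = {"outbreak": 0, "in": 1}        # hypothetical two-word dictionary D
pos_tags = ["NN", "IN", "NNP"]               # hypothetical reduced tag set
cap_states = ["none", "first character", "upper case"]

known = encode_word("in", "IN", "none", dictionary, pos_tags, cap_states)
hidden = encode_word("Kinshasa", "NNP", "first character", dictionary, pos_tags, cap_states)
```

In this sketch `hidden` carries no lexical information at all: only its part-of-speech and capitalization components are non-zero, so a model trained on such inputs has to rely on those features and on the neighbouring words.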
given a window (n - 1 = 2 × hw) of preceding and following words. A threshold on NN(i, x) allows us to decide if the input is a location or not. The neural network was trained by negative log-likelihood minimization using stochastic gradient descent. An extensive description of the learning algorithm architecture and optimization is provided in the "Methods" section.
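For readers unfamiliar with this training criterion, the sketch below shows one stochastic gradient step on the negative log-likelihood of a two-class softmax. To stay short it uses a single linear layer as a stand-in for the neural network, so it illustrates the criterion only; the names `nll` and `sgd_step` and the learning rate are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights: List[List[float]], x: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def nll(weights: List[List[float]], x: List[float], y: int) -> float:
    """Negative log-likelihood of the true class y (0 = none, 1 = loc)."""
    return -math.log(softmax(class_scores(weights, x))[y])


def sgd_step(weights: List[List[float]], x: List[float], y: int, lr: float) -> None:
    """One in-place stochastic gradient step on a single example."""
    probs = softmax(class_scores(weights, x))
    for k, row in enumerate(weights):
        # d(NLL)/d(score_k) = p_k - 1[k == y]
        grad_k = probs[k] - (1.0 if k == y else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad_k * x[j]


weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # one row of weights per class
x, y = [1.0, 0.0, 1.0], 1                     # toy input labeled as a location
loss_before = nll(weights, x, y)
sgd_step(weights, x, y, lr=0.5)
loss_after = nll(weights, x, y)
```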
The same lack of labeled data that we were faced with for training the geo-parser applies to the question of how we can test the performance of a trained model. Indeed, to measure the accuracy of the outputs of the geo-parser, we would need "correct" annotations to compare with. Ideally, we would even have a test corpus annotated by several humans independently and thus be able to measure the difficulty of the task we are trying to solve. As stated previously, annotating such a dataset is a tedious and expensive task and we thus consider it as a last resort. To evaluate the performance of the algorithm nonetheless, we devised two approximate solutions.
First, as in the training phase, we can reuse the HealthMap gazetteer to tag the test corpus and then evaluate how much of the gazetteer annotation the neural network geo-parser (or NN geo-parser) is able to recover. It might seem that retrieving the locations that were used for training is an easy task. However, the same lexical information that was hidden in the training corpus would be hidden in the test dataset as well. As a consequence, both training and test examples provide only general context information, and thus rediscovering the HealthMap gazetteer labels is not such a trivial task.
In a second experiment, we used a comprehensive (subscription only) commercial geo-parser (MetaCarta GeoTagger) to tag 500 alerts with what we would consider "true" location references. Note however that despite the fact that there is good overlap between the HealthMap gazetteer and the MetaCarta tags, there are also a certain number of tags that are found only in the HealthMap annotation. Taking both sets of tags as a whole: 52.4% are found by MetaCarta only, both geo-parsers agree on 38.7% of the tags and the remaining 8.9% come from the HealthMap gazetteer only. Some of the HealthMap-only tags are due to minor uninteresting variations in the annotation schema, while others suggest a specialisation of the HealthMap gazetteer to public health content that the more generic MetaCarta geo-parser is obviously not trained to provide.
We trained several neural network geo-parsers on the three datasets T0 (1,000 alerts), T1 (2,500 alerts) and T2 (5,000 alerts), with extracted dictionaries of varying sizes according to our lower bound λ. Given the approximate nature of the solution found when training neural networks by stochastic gradient descent, we repeated the learning process for each condition 5 times to estimate the variance. The evaluation corpus contains 500 disease outbreak alerts subsequent to the ones used for training, respecting the temporal nature of the HealthMap surveillance process. This represents 201,643 words to tag, among which 5,030 are considered locations by the HealthMap gazetteer approach and 7,385 are considered locations by the commercial geo-tagger. Of the latter, 3,315 are tagged by the commercial geo-tagger alone, whereas 960 words are considered locations only by the HealthMap gazetteer approach.
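The word counts above can be cross-checked with simple arithmetic: the number of words tagged by both systems must be the same whether it is derived from the HealthMap total or from the commercial geo-tagger total. The short sketch below performs that check; the variable names are ours, and since these are word-level counts, the resulting shares need not coincide with percentages computed over tags.

```python
# Word-level counts quoted for the 500-alert evaluation corpus.
healthmap_total = 5030   # words tagged as locations by the HealthMap gazetteer
metacarta_total = 7385   # words tagged as locations by the commercial geo-tagger
metacarta_only = 3315    # tagged by the commercial geo-tagger alone
healthmap_only = 960     # tagged by the HealthMap gazetteer alone

# The overlap can be derived from either side and should agree.
both_from_healthmap = healthmap_total - healthmap_only
both_from_metacarta = metacarta_total - metacarta_only

# Words tagged as a location by at least one of the two systems.
union = metacarta_only + healthmap_only + both_from_healthmap

shares = {
    "metacarta_only": metacarta_only / union,
    "both": both_from_healthmap / union,
    "healthmap_only": healthmap_only / union,
}
```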
To show that this approach is not just a memorization of the gazetteer-based approach, the results are sliced into the F1 scores of words inside and outside the algorithm's dictionary, ie words that were represented with and without their dictionary index feature. As the value of λ increases, the size of the dictionary we allow the NN geo-parser to see decreases and thus the number of words in the evaluation corpus that are out of the algorithm's dictionary increases. This makes the task of identifying which words in the text refer to locations more difficult, and as a visible consequence the overall performance of the system decreases (solid red line). On the other hand, the increase in out-of-dictionary examples greatly improves the ability of the system to correctly identify locations that are out of the algorithm's dictionary (dash-dotted blue line), an ability that is non-existent when the whole training set vocabulary is available to the learning algorithm (λ = 0). As explained in previous sections, the idea behind this approach is to consider those purposely "hidden" location words as surrogates of the location words unknown to the gazetteer, those we want to be able to discover. The observed increase in performance in the retrieval of those words suggests that this approach is appropriate for this task, without too high a loss in performance for "visible" words (dashed green line).
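The slicing of the F1 score can be sketched as follows. This is a minimal word-level version under stated assumptions: every word is assigned to a slice according to whether its lowercased form is in the dictionary, and precision and recall are computed within each slice. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def f1_score(gold: List[bool], pred: List[bool]) -> float:
    """Word-level F1 for the 'location' class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sliced_f1(words: List[str], gold: List[bool], pred: List[bool],
              dictionary: Set[str]) -> Dict[str, float]:
    """F1 over all words, and over the in- and out-of-dictionary slices."""
    inside = [i for i, w in enumerate(words) if w.lower() in dictionary]
    outside = [i for i, w in enumerate(words) if w.lower() not in dictionary]
    return {
        "all": f1_score(gold, pred),
        "in_dictionary": f1_score([gold[i] for i in inside], [pred[i] for i in inside]),
        "out_of_dictionary": f1_score([gold[i] for i in outside], [pred[i] for i in outside]),
    }


words = ["cases", "in", "Goma", "and", "Paris", "rose"]
gold = [False, False, True, False, True, False]   # reference location labels
pred = [False, False, True, True, True, False]    # hypothetical geo-parser output
scores = sliced_f1(words, gold, pred, {"cases", "in", "and", "paris", "rose"})
```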
Best performances with respect to MetaCarta labels (optimal F1-score).
Comb. loc T0    0.61 (± 0.006)
Comb. loc T1    0.63 (± 0.008)
Comb. loc T2
It is worth noting that some of the words considered here as false positives (and which thus contribute to the decrease in precision) are in fact generalizations of words that HealthMap sees as locations but that the more generic geo-parser ignores.
We have presented an approach to the geo-parsing of disease outbreak alerts in the absence of annotated data. The identification of precise geographic information in the context of disease outbreak surveillance from informal text sources (eg news stories) is essential to the characterization of the actual outbreaks, and would increase the resolution of a visualization scheme such as the one proposed by HealthMap. Such precise information extraction typically requires a dataset of texts with the desired information carefully annotated by a human expert. A corpus of this kind is expensive and time-consuming to create. Instead, we propose a statistical machine learning approach which generalizes the existing HealthMap rule-based geo-parsing by making use of the lexical and syntactic context in which the existing gazetteer phrases appear. We have demonstrated that the described model does indeed have the ability to discover a substantial number of geographic references that are not present in the gazetteer. In addition, our approach also limits the number of false positives it produces.
Nevertheless, there is still a portion of those geographic references that remains a challenge to retrieve. There are several components of this approach that could be refined in order to improve the performance of the algorithm. For example, a more sophisticated word-withholding strategy could be implemented. In addition to relying on the term frequency, the strategy could also account for the part-of-speech of the word to hide (eg verbs are unlikely surrogates for locations), and for the alert frequency of the term (a word that appears a few times in the corpus but only in a single alert might be more likely to be a location than one with occurrences in several alerts). Another place for improvement could be the representation of words. Other features, aside from part-of-speech and capitalization, could provide additional information about the semantic status of the word. However, these features should be simple enough to implement in several languages, in order to comply with the requirement of portability formulated previously. A further area of exploration is the weight given to words tagged as none during training. Some of those words are actually unknown locations that, during the learning process, are strongly supervised as not being locations. Even if we compensate for this error by tuning the decision threshold, it should also be possible to act on the problem by relaxing the supervision during the training stage. Finally, the geo-parsing of an alert is only an intermediate step. Finding which terms in an alert are geographic references is crucial, but the final goal is to identify which among these terms is the definitive disease outbreak location and be able to disambiguate it (Paris, Texas vs Paris, France). Ideally, we would like to integrate these different tasks into one, so that the information that is learned from one can benefit the others.
From a more general point of view, the presented approach also describes a way of incorporating the prior knowledge encoded in a rule-based procedure into a more general statistical framework. This could be adapted to the extraction of other types of information that would also prove useful in the characterization of an outbreak. For example, we are interested in the extraction, from the alerts, of attributes related to the individuals involved in the outbreak such as age, sex, setting, clinical outcomes when specified.
We have presented an approach to the geo-parsing of disease outbreak alerts in the absence of annotated data. The results of this analysis provide a framework for future automated global surveillance that reduces manual effort and improves timeliness of reporting. Ultimately, the automated content analysis of news media and other nontraditional sources of surveillance data can facilitate early warning of emerging disease threats and improve timeliness of response and intervention.
Our methodology can be summarized as follows. Using alerts retrieved by HealthMap we generated a dataset specially tailored to train a geo-parsing algorithm. To generate this dataset, the HealthMap gazetteer-based algorithm was first applied to this set of alerts in order to extract the words in the text referring to geographic locations. The same alerts were then run through the part-of-speech tagger algorithm provided by NEC's project SENNA (a Neural Network Architecture for Semantic Extraction), making the syntax of the text explicit. This part-of-speech tagger has a reported accuracy of 96.85% on the reference Penn Treebank dataset. We then assigned to every word in the alerts a capitalization status, ie none, first character, upper case. After these 3 steps in the data generation process, each word in each alert had 4 features: the word itself, its part-of-speech tag, its capitalization status and a label indicating if the word is a geographic location or not. The last step in the data generation process consisted in replacing the lexical feature of the words with lowest frequency by a blank, as explained in the Results section.
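The last step of the data generation process can be sketched as follows. The four-feature token mirrors the description above; the threshold `lam` is assumed here to be a raw corpus count (a word is kept only if it occurs more than `lam` times, so that `lam = 0` keeps the whole vocabulary), which is our reading rather than a definition given in this section. The names `Token` and `withhold_rare_words` are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: Optional[str]  # None once the lexical feature has been blanked
    pos: str             # part-of-speech tag
    cap: str             # capitalization status
    label: str           # 'loc' or 'none', from the gazetteer-based tagger


def withhold_rare_words(alerts: List[List[Token]], lam: int) -> List[List[Token]]:
    """Blank the lexical feature of every word occurring at most `lam` times."""
    counts = Counter(tok.word.lower() for alert in alerts for tok in alert)
    return [
        [tok if counts[tok.word.lower()] > lam else tok._replace(word=None)
         for tok in alert]
        for alert in alerts
    ]


alerts = [
    [Token("Cholera", "NN", "first character", "none"),
     Token("in", "IN", "none", "none"),
     Token("Goma", "NNP", "first character", "loc")],
    [Token("cholera", "NN", "none", "none"),
     Token("in", "IN", "none", "none"),
     Token("Paris", "NNP", "first character", "loc")],
]
masked = withhold_rare_words(alerts, lam=1)
```

In this toy corpus both location names occur once and are blanked, while their part-of-speech tag, capitalization status and label are left untouched.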
The outputs $\phi(x_{i-hw}), \ldots, \phi(x_i), \ldots, \phi(x_{i+hw})$ are concatenated into a vector $z \in \mathbb{R}^{d \times n}$, which is itself given as input to a second multi-layer perceptron. This second MLP, called $\psi$ in Figure 7, has as output layer a softmax filtering function which allows us to consider the outputs of the neural network as probabilities. A threshold on $P(y_i = \mathrm{loc} \mid x_{i-hw}, \ldots, x_i, \ldots, x_{i+hw})$ allows us to decide if the input is a location or not. This threshold and the hyper-parameters of the neural network are tuned on a separate validation set. Tuning this threshold away from 0.5 compensates for the fact that some none labels are in fact locations unknown to the HealthMap gazetteer.
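The forward pass described above can be sketched in a few lines. For brevity, both the shared word-level map and the second network are reduced to a single layer each, whereas the text describes multi-layer perceptrons, so this is a simplified illustration rather than the architecture used in the experiments. All sizes, weights and names (`W_phi`, `W_psi`, `location_probability`) are assumptions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def tanh_layer(W: Matrix, x: Vector) -> Vector:
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def location_probability(window: List[Vector], W_phi: Matrix, W_psi: Matrix) -> float:
    """P(y_i = loc | window), where the window holds n = 2*hw + 1 word vectors."""
    z: Vector = []
    for x in window:
        # The same weights W_phi are applied to every word in the window.
        z.extend(tanh_layer(W_phi, x))
    # Second stage: linear scores for (none, loc) followed by a softmax.
    scores = [sum(w * v for w, v in zip(row, z)) for row in W_psi]
    return softmax(scores)[1]


def is_location(prob: float, threshold: float) -> bool:
    return prob >= threshold


random.seed(0)
hw, input_dim, d = 1, 4, 2
n = 2 * hw + 1
W_phi = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(d)]
W_psi = [[random.uniform(-1, 1) for _ in range(d * n)] for _ in range(2)]
window = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
p_loc = location_probability(window, W_phi, W_psi)
```

Lowering the threshold passed to `is_location` below 0.5 makes the decision more permissive, which is one way to account for none labels that are in fact unknown locations.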
This work was supported by grant G08LM009776-01A2 from the National Library of Medicine and the National Institutes of Health and a research grant from Google.org.
- Mawudeku A, Lemay R, Werker D, Andraghetti R, John RS: The Global Public Health Intelligence Network. In Infectious Disease Surveillance. Edited by: M'ikanatha N, Lynfield R, Beneden CV, de Valk H. Blackwell Publishing, MA; 2007.
- Brownstein JS, Freifeld CC: HealthMap: the development of automated real-time internet surveillance for epidemic intelligence. Euro Surveill 2007, 12(48):3322.
- Freifeld CC, Mandl KD, Reis BY, Brownstein JS: HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports. J Am Med Inform Assoc 2008, 15(2):150–157.
- Holden C: Netwatch: Diseases on the move. Science 2006, 314(5804):1363d.
- Larkin M: Technology and public health: Healthmap tracks global diseases. Lancet Infect Dis 2007, 7: 91.
- Woodruff AG, Plaunt C: GIPSY: Automated Geographic Indexing of Text Documents. Journal of the American Society for Information Science 1994, 45(9):645–655.
- Proceedings of the Third Message Understanding Conference, Morgan Kaufmann. 1991.
- Proceedings of the Fourth Message Understanding Conference, Morgan Kaufmann. 1992.
- Proceedings of the Fifth Message Understanding Conference, Morgan Kaufmann. 1993.
- Proceedings of the Sixth Message Understanding Conference, San Francisco, CA: Morgan Kaufmann. 1995.
- McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics 2005, 6(Suppl 1):S6. doi:10.1186/1471-2105-6-S1-S6
- Tjong Kim Sang EF, De Meulder F: Introduction to the CoNLL-2003 shared task: Language-independent Named Entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-03). Edmonton, Canada. Edited by: Daelemans W, Osborne M. 2003, 142–147.
- Carreras X, Màrquez L: Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. Proceedings of the 9th Conference on Natural Language Learning, CoNLL-2005, Ann Arbor, MI USA 2005.
- Kawazoe A, Jin L, Shigematsu M, Barrero R, Taniguchi K, Collier N: The development of a schema for the annotation of terms in the BioCaster disease/detection tracking system. Proc. of the Int'l Workshop on Biomedical Ontology in Action (KR-MED 2006), Baltimore, Maryland, USA, November 8 2006.
- Yangarber R, Lin W, Grishman R: Unsupervised learning of generalized names. In Proceedings of the 19th international conference on Computational linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 2002:1–7.
- Collins M, Singer Y: Unsupervised Models for Named Entity Classification. Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC 1999, 1–10.
- Cucerzan S, Yarowsky D: Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC 1999, 90–99.
- Strzalkowski T, Wang J: A self-learning universal concept spotter. In Proceedings of the 16th conference on Computational linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 1996:931–936.
- Riloff E: An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence 1996, 85: 101–134.
- Yangarber R, Grishman R, Tapanainen P, Huttunen S: Automatic acquisition of domain knowledge for Information Extraction. In Proceedings of the 18th international conference on Computational linguistics. Morristown, NJ, USA: Association for Computational Linguistics; 2000:940–946.
- Rauch E, Bukatin M, Baker K: A confidence-based framework for disambiguating geographic terms. Proceedings of the HLT-NAACL, Workshop on Analysis of Geographic References (WS9) 2003.
- Collobert R, Weston J: Fast Semantic Extraction Using a Novel Neural Network Architecture. Proceedings of 45th Annual Meeting of the Association for Computational Linguistics 2007.
- Marcus M, Marcinkiewicz M, Santorini B: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 1993, 19(2):313–330.
- Bromley J, Guyon I, LeCun Y, Sackinger E, Shah R: Signature Verification using a Siamese Time Delay Neural Network. Advances in Neural Information Processing Systems 6 1993.
- Bengio Y, Ducharme R, Vincent P, Gauvin C: A Neural Probabilistic Language Model. JMLR 2003, 3: 1137–1155.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.