A system for de-identifying medical message board text
© Benton et al. 2011
Published: 9 June 2011
There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge that needs to be met. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system to de-identify the authors of message board posts automatically, taking into account the aforementioned challenges. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text like operative reports, pathology reports, discharge summaries, or newswire.
Medical message boards (MMBs) serve as forums for emotional support and information exchange, usually for patients with similar conditions. Users of MMBs communicate by asynchronously posting messages to the board in threads, groups of messages that are typically centered on a single topic. Because of the sheer number, inexpensiveness, and candid nature of messages posted on these boards, many researchers have begun to treat MMB threads as “virtual focus groups” to gain more knowledge about patient experiences [1–3]. Additionally, our group is currently using MMBs as a source for identifying undocumented adverse effects from drugs and dietary supplements.
As more patients gain access to the Internet and join these communities, more MMB text on patient experiences will become available, providing researchers with further opportunities to investigate. However, in order to adhere to ethical requirements in quoting from or performing research on MMB corpora, all information that may identify the user should be removed. In fact, the University of Pennsylvania’s institutional review board requires this. This information includes personal names and usernames, email and postal addresses, telephone numbers, and uniform resource locators (URLs), collectively defined here as identifiers. There has been considerable research in the domain of Named Entity Recognition (NER), the task of identifying instances of a particular type, such as people or companies, within free text. Many NER systems have been developed and perform reasonably well. However, since MMB text is much more unstructured and noisy than the text for which most NER systems are developed, these methods are not as effective at capturing identifiers within MMB posts. For example, running the Stanford Named Entity Recognizer trained on a corpus of US and UK newswires to identify proper names within a random sample of 500 posts resulted in correctly identifying 81.2% of proper names with a precision of 61.7%. This does not take into account any usernames that were present in these documents. In comparison, this same system was originally reported to be able to identify proper names with an F-score of 92.3% over a sample of newswire. We frame the task of de-identifying MMB text as a specialized form of NER.
There is also a well-developed body of research regarding medical document de-identification. Many systems have already been developed to de-identify different types of free text medical documents such as pathology reports, nursing notes, and discharge summaries. Many of these systems rely heavily on heuristics and pattern matching in order to remove identifying information [7–9]. Others have used statistical models in order to detect identifying information, including maximum entropy classifiers, support vector machines, and conditional random fields (CRFs). The problem of de-identifying medical records has been addressed by numerous researchers up to this point and the performance of some of these systems is exceptional, with F-scores over 98% for the best systems. Unfortunately, these methods, which have been tuned to the more regular text of medical documents, will not translate easily to de-identifying MMB text. This is because MMB text is extremely noisy: it contains frequent typographical errors, a large variation in possible names, and terms and abbreviations that are specific to Internet posts or even a particular message board (e.g., bilat mx, onc), and its authors frequently refer to their friends and family members in their posts.
In this paper, we present a system that is able to remove phone numbers, e-mail addresses, uniform resource locators (URLs), proper names, and usernames from MMB text. We focus here specifically on name de-identification.
We used two corpora to train and validate our system. The first corpus, the breast cancer (BC) corpus, was generated by downloading the messages from 12 different BC message board sites. Downloaded messages were then cleaned with scripts specifically tailored to the layout of each message board, to fit to a standard format.
In addition to the BC corpus, we also compiled a corpus of arthritis posts gathered from three different arthritis message boards: http://www.healthboards.com/, http://arthritisinsight.com/forum/, and http://boards.webmd.com. We randomly sampled messages from this corpus to generate the test set for validating our system. This corpus was generated in the same manner as the BC corpus and contained over 100,000 messages and 14,000 threads. We selected the test set from this corpus in order to realistically determine how well our system would perform over a completely novel MMB, with different conditions and usernames than the training set.
Before names were identified, the corpus was passed through a pre-processing step corresponding to step 1 in Fig. 1. In this step, e-mail addresses, URLs, and phone numbers were identified via regular expressions. For example, e-mail addresses were identified with the regular expression “[\w.]+@\w+(.[\w])*” and phone numbers were identified with “(\d\d\d[-._ ]?)+\d\d\d\d” where \d refers to the set of digits and \w refers to the set of alphanumeric characters and underscore.
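The pre-processing step can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the phone pattern is the one printed above, the e-mail pattern is an assumed variant of the printed one (with the dot escaped and whole domain labels matched), and the URL pattern, the placeholder strings, and the function name `scrub_identifiers` are hypothetical.

```python
import re

# Phone pattern as printed in the text.
PHONE_RE = re.compile(r"(\d\d\d[-._ ]?)+\d\d\d\d")
# Assumed variant of the printed e-mail pattern: the dot is escaped and
# each domain label may contain more than one word character.
EMAIL_RE = re.compile(r"[\w.]+@\w+(\.\w+)*")
# Hypothetical URL pattern; the text does not give one.
URL_RE = re.compile(r"(https?://|www\.)\S+")


def scrub_identifiers(text):
    """Replace e-mail addresses, URLs, and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = URL_RE.sub("[URL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

E-mail addresses are replaced first so that a domain inside an address is not picked up separately by the URL pattern.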
After these identifiers were discovered, the remaining text was split into tokens by whitespace and any punctuation marks. Once the text had been tokenized, our system generated a feature vector for each item in the output. Each token’s feature vector is a set of properties that describe that particular instance of the token, and is used by the name classifier to determine the likelihood that a given token is a name.
The features that we used to train the CRF can be grouped into two classes: features that do not rely on the structure of MMBs and those that take advantage of the way that MMBs are structured.
The MMB non-structure features tend to be features that would be helpful in identifying names in many different media, not necessarily MMBs. The features that we used to describe each token include the token itself, the token lower-cased, and its length. The case of the token was encoded as either lower, upper, title, or mixed case, each being a binary feature. The token’s two and three character suffixes and prefixes were also included. These features are helpful in identifying names in many different media, since names tend to be capitalized (although to a lesser extent in MMB text) and certain prefixes and suffixes may also indicate a name. We also included the distance of the token (in number of tokens) from the beginning and end of the message as a feature to take advantage of the fact that MMB posts often begin by addressing another user and end with the author’s name.
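A minimal Python sketch of these token-level features is given below. The tokenizer, the feature names, and the choice to lower-case prefixes and suffixes are assumptions made for illustration; the distances are left as raw token counts rather than binned.

```python
import re


def tokenize(text):
    """Split on whitespace and treat each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def case_of(token):
    """Assign exactly one case class: lower, upper, title, or mixed."""
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    return "mixed"


def token_features(tokens, i):
    """Features for tokens[i] that do not depend on message board structure."""
    tok = tokens[i]
    low = tok.lower()
    case = case_of(tok)
    return {
        "token": tok,
        "lower": low,
        "length": len(tok),
        # One binary feature per case class.
        "isLower": case == "lower",
        "isUpper": case == "upper",
        "isTitle": case == "title",
        "isMixed": case == "mixed",
        # Two- and three-character prefixes and suffixes.
        "prefix2": low[:2],
        "prefix3": low[:3],
        "suffix2": low[-2:],
        "suffix3": low[-3:],
        # Distance, in tokens, from the start and end of the message.
        "distFromStart": i,
        "distFromEnd": len(tokens) - 1 - i,
    }
```

For the message "Hi Kathy, hugs!", the token "Kathy" is title-cased, has prefix "ka" and suffix "thy", and sits one token from the start of the message.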
Word lists used to generate features
-- Ispell, GNU spell-checker dictionary; inspired by Thomas et al.
-- Generic list of very common English words
-- Generated from the Cerner Multum Drug Lexicon (Denver, CO)
-- Compiled by hand (e.g., mr., mrs, dr.)
-- Users that have posted to this message board, generated from the “author” field of each message
-- Users who have posted to this particular thread, with variants of these names derived automatically (strip digits, split by delimiters/camel case/known names and words)
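The automatic derivation of username variants might look like the following Python sketch. The function name and regular expressions are hypothetical, and only digit stripping, delimiter splitting, and camel-case splitting are shown; splitting on known names and words would need the word lists themselves and is omitted.

```python
import re


def username_variants(username):
    """Derive lower-cased variants of a username."""
    variants = {username.lower()}
    # Strip digits, then split on delimiters (non-alphanumerics and underscore).
    stripped = re.sub(r"\d+", "", username)
    parts = [p for p in re.split(r"[\W_]+", stripped) if p]
    variants.add("".join(parts).lower())
    for part in parts:
        variants.add(part.lower())
        # Split camel case: "JanieMarie" -> "Janie", "Marie".
        for piece in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part):
            variants.add(piece.lower())
    variants.discard("")
    return variants
```

With this sketch, a username such as "hippie96321" also yields "hippie", so a post that addresses the user by the shorter form can still match the thread's username list.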
The vector also includes the features of the two previous tokens and the two next tokens. We included the features of the two previous and following tokens in each feature vector because the system performed better than the conditions where we included no surrounding tokens, included the immediately surrounding tokens, or included three tokens on either side. This may be due to the fact that certain words strongly indicate a proper name (e.g., honorifics).
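The context window can be sketched as below, assuming each token's features are held in a dictionary. The function name `add_context` and the offset-prefixed key format are illustrative choices, not taken from the system itself.

```python
def add_context(per_token, window=2):
    """Copy each neighbour's features into the token's vector, keyed by offset.

    per_token is a list of feature dictionaries, one per token, in order.
    """
    out = []
    for i, feats in enumerate(per_token):
        merged = dict(feats)
        for off in range(-window, window + 1):
            j = i + off
            # Skip the token itself and positions outside the message.
            if off == 0 or not 0 <= j < len(per_token):
                continue
            for name, value in per_token[j].items():
                merged[f"{off:+d}:{name}"] = value
        out.append(merged)
    return out
```

With a window of two, the vector for "smith" in "thanks dr . smith !" carries the feature `-2:lower=dr`, which is how an honorific two tokens back can inform the classifier.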
An MMB corpus can be segmented in several different ways. For example, one can consider each message as a separate document. Likewise, one can consider each thread, all threads within a particular message board, or all messages posted by a particular user as separate documents. Certain words will be repeated much more frequently within a document than in the entire corpus. In other words, they are “document-specific”. Many of these document-specific words are in fact names. At the level of a thread, these are most likely the names of the users participating in that thread. At the level of a particular user, they would likely be their own name and other users that they frequently converse with. We use the term frequency-inverse document frequency metric (tf-idf) in two ways: by treating all messages that belong to a particular message board as a document, and by treating all messages that a particular user posts as a document. Each token in the document is ranked by its tf-idf value and its rank is included in the feature vector (e.g., inTop1%=TRUE).
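Assuming the standard tf-idf form, which matches the symbol definitions that follow (the logarithm base and any smoothing are not specified here), the score of token $i$ in document $j$ can be written as:

```latex
\mathrm{tfidf}_{i,j} = tf_{i,j} \times \log\frac{N}{n_i}
```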
where $tf_{i,j}$ is the frequency of a particular token $i$ within a document $j$ normalized by the square root of the total number of tokens within $j$, $N$ is the total number of documents within the corpus, and $n_i$ is the total number of documents that contain at least one instance of token $i$. This metric favors terms that occur many times within the current document, but occur in very few other documents. The list below shows the top 25 tokens when ranking by tf-idf in the http://breastcancer.org message board, which is one of 12 different message board sites that comprise the BC corpus.
-- Likely names (obvious name or variant in username dictionary): mjb, shirlann, mena, nicki, barbe, marsha, ravdeb, lisa, shokk, hippie96321, laura, odalys, jankay, harley, kaygirl, janiemarie, ginadcnj, spar, vickie, binney, luann
-- Not likely to be names: hugs, tx, onc, dh
Another virtue of this type of metric is that it tends to assign higher values to names that are rarer in general and may not occur in the proper name or even username lists. A similar feature that takes advantage of the fact that a particular name will occur multiple times in a particular document but not throughout the corpus was used by Minkov, Wang, and Cohen to identify names in e-mail messages.
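The tf-idf ranking feature can be sketched in Python as follows. The sketch assumes the standard idf term log(N / n_i) with a natural logarithm, together with the square-root length normalization described above; the function names and the rank cut-offs are illustrative.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """documents maps a document id (e.g., a board or a user) to its tokens."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        norm = math.sqrt(len(tokens))  # square root of the document length
        scores[doc_id] = {
            tok: (count / norm) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        }
    return scores


def rank_features(doc_scores, token, top_n=10, top_fraction=0.01):
    """Binary features recording where a token ranks within its document."""
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    rank = ranked.index(token)
    cutoff = max(1, math.ceil(len(ranked) * top_fraction))
    return {
        f"inTop{top_n}": rank < top_n,
        f"inTop{top_fraction:.0%}": rank < cutoff,
    }
```

A token that appears in every document scores zero regardless of how often it is used, which is why board-wide vocabulary such as greetings tends to rank below names that are concentrated in one board or one user's posts.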
We also used the likelihood that a token would appear near the beginning or end of a paragraph over the entire corpus scaled by the logarithm of the number of times that token appeared in the corpus as a feature. This was used to take advantage of the fact that although a particular name may not be used to end a message or greet another user at its current location, perhaps it has been frequently used in this manner in other messages.
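One way to compute this paragraph-border feature is sketched below. It assumes that "near" the beginning or end means within a fixed number of tokens of the paragraph edge (two here) and that a natural logarithm is used for the scaling; neither detail is stated in the text, and the function name is hypothetical.

```python
import math
from collections import Counter


def border_scores(paragraphs, edge=2):
    """Score tokens by how often they occur near a paragraph border.

    paragraphs is a list of token lists covering the whole corpus. The score
    is the fraction of a token's occurrences that fall within `edge` tokens
    of a paragraph border, scaled by the log of its corpus frequency.
    """
    total = Counter()
    border = Counter()
    for tokens in paragraphs:
        n = len(tokens)
        for i, tok in enumerate(tokens):
            total[tok] += 1
            if i < edge or i >= n - edge:
                border[tok] += 1
    return {
        tok: (border[tok] / total[tok]) * math.log(total[tok])
        for tok in total
    }
```

Under this scaling a token seen only once scores zero, so a single chance appearance at the start of a paragraph does not make a token look like a greeting or sign-off name.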
Overview of features used by our system
MMB non-structure features
-- case: isLower=True, isCapitalized=False, …
-- prefixes and suffixes: suffix2=hy, prefix2=ka, suffix3=thy, …
-- distance from beginning/end: w/in1FromEdge=True, w/in2FromEdge=True, …
-- in word list: isProperName=True, isCommon=False, isUsername=False, …
-- possibly in word list: editDist1ProperName=True, editDist2ProperName=True, …
-- Also include features of two previous and following tokens
MMB structure features
-- tf-idf over message boards: inTop10=False, inTop1%=False, …
-- tf-idf over user posts: inTop10=False, inTop1%=True, …
-- border of paragraph likelihood: inTop5=True, inTop10%=True, …
Once feature vectors were generated for each token, a CRF name classifier was run over the tokens to estimate the marginal tag probabilities for each particular token. This corresponds to step 2 in Fig. 1. A CRF is a discriminative probabilistic model that has been widely used in natural language processing in order to tag sequences. The particular name classifier that we used was trained on a 1,000-message sample from the BC corpus containing a total of 91,344 tokens, 822 proper names, and 682 usernames.
After returning tag probabilities for each token, any token with a cumulative probability greater than 0.05 of being either a proper name or username was tagged as a proper name or username (whichever tag was more likely). We applied this step (step 3 in Fig. 1) in order to increase the system’s recall, without sacrificing a great deal of precision, since identifying as many names as possible is more important than preventing non-name tokens from being removed.
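The thresholding rule can be sketched as follows, assuming the CRF's marginal probabilities for one token are available as a dictionary. The tag labels `PROPER`, `USER`, and `OTHER` and the function name are hypothetical.

```python
def assign_tag(marginals, threshold=0.05):
    """Tag a token as a name if its combined name probability exceeds the threshold.

    marginals holds the token's marginal probabilities for the tags
    'PROPER', 'USER', and 'OTHER' (illustrative labels).
    """
    p_name = marginals["PROPER"] + marginals["USER"]
    if p_name > threshold:
        # Use whichever of the two name tags is more likely.
        return "PROPER" if marginals["PROPER"] >= marginals["USER"] else "USER"
    return "OTHER"
```

Note that a token can be tagged as a name even when "other" is by far its most likely tag; this is the intended trade of precision for recall.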
In order to improve and validate our system, we created a development set with 500 messages sampled from the BC corpus (31,232 non-punctuation tokens, 483 names total) distinct from the set on which our classifier was trained, and a test set with 500 messages sampled from the arthritis corpus (28,146 non-punctuation tokens, 432 names total). Both of these sets were manually tagged by a human coder in order to evaluate the effectiveness of our system. Any token that referred to a user of the message board or anyone that they had personal contact with was tagged as a name. This may have been overly harsh since many of these tokens were acronyms or nicknames that were unlikely to identify the user.
Although the majority of these sets were tagged by a single coder (AB), a subset of 120 messages was also tagged by another coder (AC) to produce an estimate of the sole coder’s reliability. Over this subset, AB tagged 83 tokens as names and AC tagged 82; the two coders agreed on 81 tokens (97.6% of the tokens tagged as names by AB and 98.8% of those tagged by AC).
Although the F-score decreases as the name likelihood threshold decreases, recall increases dramatically, by 15.5 percentage points (from 82.6% to 98.1%), when moving from the baseline of tagging tokens whose cumulative probability of being a name is greater than 50% to tagging any token whose cumulative probability exceeds 5%. For our task of de-identification, recall is much more important than precision, since the primary goal is to preserve the authors' anonymity.
Note that many tokens in the original coding of the development and test sets were unlikely to give much information as to the identity of the author. In order to accurately reflect our system’s ability to remove identifying information, a second coding of these sets was performed, where any tokens that were originally tagged as a name were tagged as “other” if they were one of: an acronym, a nickname that was obviously unrelated to the author’s username or personal name, or a substring of the username that was three or fewer characters long where the username was over twice that length. Our system achieved a name recall of 99.1% over the development set and 95.4% over the test set after this recoding.
[Table: Breakdown of the names that were not tagged as names by our system. Categories include ambiguous common words.]
[Table: Breakdown of the tokens that were incorrectly tagged as names by our system.]
In Table 4, “People” refers to tokens that were names of actual people, but were unrelated to any of the MMB authors (e.g., Oprah Winfrey, Tom Petty). “Places/Institutions” refers to tokens referring to a location or organization. “Medical” tokens were tokens that referred to a drug, supplement, procedure, or some other medical concept and could be useful to researchers investigating these posts. “Other” tokens could not be placed in any of the previous four categories and would probably not be very useful to researchers. Some examples of these are: “kiddo”, “june”, “april”, “morning”, “crispy”, and “sweetie”.
In order to compare our system’s performance against a currently available de-identification system, we also ran the “Deid system” [9, 16] (http://www.physionet.org/physiotools/deid/) over our development and test sets. The Deid system consists of a single Perl script that relies on a combination of heuristics, regular expressions, and word lists to remove identifiers. We ran the system under several conditions. The system was run first without altering any of its word lists, then by appending all the usernames on the message board to its list of ambiguous names, and finally by appending all of those usernames to the list of unambiguous names instead. The system was first judged only on its ability to identify proper names and then on its ability to identify both proper names and usernames. The system’s ability to identify usernames was poor in all cases (9.0%, 18.3%, and 67.0% recall for each of the three conditions respectively over the development set, and similarly over the test set) and the system’s precision under the final condition was prohibitively low.
[Table: Performance of Deid system and Stanford named entity recognizer on development and test sets considering only proper names. Deid conditions: out of the box; ambiguous name list with usernames; unambiguous name list with usernames.]
Performance of state-of-the-art MIST de-identification system against our system, over the development and test sets
In this table, recall is defined as the proportion of proper names and usernames that were tagged by the system with some name tag (either a proper name or a username tag, the only possible name tags), and precision is defined as the proportion of tokens tagged by the system with a name tag that were in fact proper names or usernames. We used these definitions of recall and precision because the removal of identifying information is more important than the specific name tags with which the identifiers are replaced.
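The scoring can be sketched as below, where a name counts as found if the system gives it either name tag. The sketch assumes the conventional definition of precision (the share of name-tagged tokens that are truly names) and also reports specificity; the tag labels and the function name are hypothetical.

```python
NAME_TAGS = {"PROPER", "USER"}  # illustrative tag labels


def name_metrics(gold, predicted):
    """Score name removal, treating the two name tags as interchangeable."""
    pairs = list(zip(gold, predicted))
    tp = sum(g in NAME_TAGS and p in NAME_TAGS for g, p in pairs)
    fp = sum(g not in NAME_TAGS and p in NAME_TAGS for g, p in pairs)
    fn = sum(g in NAME_TAGS and p not in NAME_TAGS for g, p in pairs)
    tn = sum(g not in NAME_TAGS and p not in NAME_TAGS for g, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
        "specificity": specificity,
    }
```

Under this scoring, a username that the system tags as a proper name still counts as a true positive, since the token is removed either way.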
Our system performs as well over MMB text as some of the other de-identification systems perform over other medical documents. In a recent challenge to remove private health information from medical discharge records, out of the sixteen systems evaluated, two systems exhibited an F-score of less than 78.1% and eight systems exhibited recall of less than 93.8% in identifying patient names. Given that MMB text is much noisier than discharge records, it is understandable that our system does not achieve state-of-the-art performance at de-identifying this text. However, it performs better over our MMB test sets than even the best of these systems (MIST). We believe that it is a great step forward in developing a system that can adequately de-identify medical message board text.
We chose to directly compare our system’s performance against the Deid system over the same corpus, because it was one of the few de-identification systems that were freely available. The Deid system uses a very different method of hand-tailored rules and word lists to remove identifiers. It is not surprising that this system performs poorly since it was developed for de-identifying medical records, not MMB text. The expanded word list conditions in Table 5 show either little change or no improvement in the Deid system’s performance. This is because the word lists were expanded with author usernames, which are tokens unlikely to be labeled as proper names (Table 5 only evaluates the performance of these systems over proper names). Running the Stanford Named Entity Recognizer over a sample of 500 BC posts may be more comparable, since it also detects proper names using a CRF. Even then, its newswire-trained classifier performed with a precision of 61.7% and recall of 81.2% over just proper names in the development set (61.7% precision/69.7% recall over both proper and user names within the same sample).
Even the MIST system, which was trained on the same training set and had access to the same dictionaries as our system, did not perform well. In particular, Table 6 suggests that it was unable to identify usernames well (recall of 49.5% over the development set and 34.3% over the test set), even though its training set contained explicitly marked usernames. This suggests that the default feature set that MIST uses to describe tokens is not suitable for de-identifying MMB text, although it may be expressive enough to discover identifiers in more regular text, such as medical records.
The poor performance of these systems over our MMB corpus suggests that current de-identification methods cannot readily be applied to this new text medium, and that our specialized method is useful and novel. These systems perform very well over the medium they were designed for. Neamatullah et al. report that the Deid system was able to identify proper names in a corpus of nursing notes with 72.5% precision and 98.9% recall. Finkel, Grenager, and Manning report that the Stanford NER system was able to identify person names over the Seventh Conference on Computational Natural Language Learning named entity recognition shared task with an F-score of 92.3%, and the MIST system achieved the best overall score in the de-identification task of the 2006 AMIA Challenges in Natural Language Processing for Clinical Data. However, we show that they are unable to reliably identify names when applied to the very different medium of MMB text. We were unable to find a system specifically designed to de-identify MMB text, which is why we chose to evaluate the performance of our system against two medical record de-identification systems, Deid and MIST, and a named entity recognition system, the Stanford NER.
One of the main difficulties in identifying usernames is that many usernames are common words. Some examples of tokens ambiguous between common words and usernames that our system failed to classify were “one”, “boo”, “breezy”, “tiger”, “girl”, and “ash”. The majority of names that were missed were of this form. Another class of names that our system failed to tag was acronyms of names. However, it is difficult to imagine how a human reading the message would be able to discover the actual username based on this acronym.
Medical words incorrectly identified as names over development and test sets
The tokens removed from the test set pose a much greater concern. “Doxy” was used as an abbreviation of the pharmaceutical doxycycline, which is why it was not marked as a non-name in the post-processing step. Nicknames for drugs could pose a great problem for our system, since nicknames seem to be much more common in MMB text than in medical records, and, like “doxy”, often look very similar to usernames or nicknames of authors. “Hashimoto” and “sjogren” are difficult as well, since they are ambiguous between proper names and medical terms (Hashimoto’s disease, Sjögren’s syndrome). Within our test set, they appeared as conditions that users were discussing rather than people that they knew.
In spite of the low precision, the specificity of our system is very good, only removing about 0.7% of all non-names.
Although our system takes a great step in de-identifying MMB text, there are several modifications that we can make in order to improve our system’s performance. First, increasing the size of the training set for the name classifier would likely improve the precision of our system by reducing the number of common words that were mistagged as names. We could also include a gazetteer of locations in order to reduce the number of mistagged places.
Second, we have not included the part of speech (POS) of the token in the feature vectors generated for tokens. Six out of the 16 systems evaluated in a 2007 discharge summary de-identification challenge used POS tags to inform identification of private health information. Many statistical de-identification systems rely on this feature. As of now, we are unsure of how effective a POS tagger would be over MMB text, since these systems are often trained on text from newswires or the Wall Street Journal, which is more regular than message board text. Nevertheless, it may be worthwhile to experiment with this particular feature.
Finally, the identifiers that our system currently removes are far from full de-identification, but they are some of the most pervasive identifiers in MMB text. We intend to improve our system by specifically identifying institution names and locations as well. The removal of these terms is currently a by-product of our name classifier and we have not evaluated its performance at removing locations and institution identifiers. As investigators continue exploring MMB text to gain a greater awareness of patients’ experiences, systems such as ours will become more important than ever in protecting the privacy of the members of these communities.
We have developed a system that can de-identify MMB posts by identifying and removing both proper names and usernames with acceptable precision and recall. Not only is this a boon to researchers investigating these MMBs, but it also suggests that NER can be effective in even some of the noisiest forms of free text. We welcome any improvements that others can offer to our system.
AB, SH, LU, and JH all contributed to the initial design of the system. All authors participated in discussions about the interpretation of our data and results. AB and AC tagged the validation sets and evaluated their coding. AB implemented the system and drafted the manuscript. All authors contributed to revising the manuscript and provided feedback on the methods used to evaluate system performance. All authors read and approved the final manuscript.
This project is supported by the National Library of Medicine (RC1LM010342). Access to the Multum database was supported by the National Center for Research Resources (5KL2RR024132). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Library of Medicine or the National Institutes of Health.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 3, 2011: Machine Learning for Biomedical Literature Analysis and Text Retrieval. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.