Figure 3 | BMC Bioinformatics

Figure 3

From: Building a protein name dictionary from full text: a machine learning term extraction approach

Heuristic filtering of the protein catalog to produce the protein dictionary. The dictionary was constructed by considering how a given term was classified across the different articles in which it appears. Step 1: we removed terms that were predicted to be a protein name in fewer than 75% of the articles where a prediction was made. For example, if term A appears in 4 articles and is classified as a protein name in 3 of them (75%), term A is accepted into the dictionary. This step yielded 61,312 terms. Step 2: we removed terms of two characters or fewer. Step 3: to reduce ambiguity with protein names that are also common nouns, we filtered the dictionary against Webster's Revised Unabridged Dictionary (G. & C. Merriam Co., 1913, edited by Noah Porter, provided by Patrick Cassidy of MICRA, Inc., and retrieved from http://www.dict.org). We estimate that this edition contains about 80 common protein names (e.g., amylase). Step 4: we filtered the dictionary against species names from the NCBI taxonomy database [30].
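The four filtering steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_dictionary`, the `(term, article_id, is_protein)` input layout, and the case-insensitive matching against the word lists are all assumptions made for illustration, and the sketch assumes one prediction per term per article.

```python
from collections import defaultdict


def build_dictionary(predictions, english_words, species_names,
                     min_fraction=0.75, min_length=3):
    """Apply the four heuristic filters to per-article predictions.

    predictions: iterable of (term, article_id, is_protein) tuples,
    assumed to hold one prediction per term per article (hypothetical layout).
    """
    # term -> [articles predicting "protein", articles with any prediction]
    votes = defaultdict(lambda: [0, 0])
    for term, _article_id, is_protein in predictions:
        votes[term][1] += 1
        if is_protein:
            votes[term][0] += 1

    # Case-insensitive lookup is an assumption, not stated in the caption.
    english = {word.lower() for word in english_words}
    species = {name.lower() for name in species_names}

    accepted = set()
    for term, (positive, total) in votes.items():
        if positive / total < min_fraction:  # Step 1: below 75% agreement
            continue
        if len(term) < min_length:           # Step 2: two characters or fewer
            continue
        if term.lower() in english:          # Step 3: common English word
            continue
        if term.lower() in species:          # Step 4: species name
            continue
        accepted.add(term)
    return accepted


# Hypothetical toy data: "term A" is a protein in 3 of 4 articles (75%).
sample = [
    ("term A", 1, True), ("term A", 2, True),
    ("term A", 3, True), ("term A", 4, False),
    ("term B", 1, True), ("term B", 2, False),
]
result = build_dictionary(sample, english_words=[], species_names=[])
```

With the toy data, `result` contains only "term A", matching the worked example in the caption: 3 of 4 articles meets the 75% threshold, while "term B" at 1 of 2 does not.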
