- Open Access
Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot
© Ehrler et al; licensee BioMed Central Ltd 2005
- Published: 24 May 2005
In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories.
The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot.
Our system achieved the best combination of recall and precision for both passage retrieval and text categorization, as evaluated by the official evaluators. However, text categorization results were far below those obtained in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas categorization with other biomedical controlled vocabularies, such as the Medical Subject Headings, achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities.
From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.
- Gene Ontology
- Noun Phrase
- Text Categorization
- Mean Average Precision
- Unified Medical Language System
Numerous techniques help researchers locate relevant documents in an ever-growing mountain of scientific information. Next, it becomes important to develop tools able to help people process this data for use in digital libraries and electronic databases (see  for a survey). The BioCreative initiative, a joint evaluation campaign organized by the Centro Nacional de Biotecnologia (CNB) and MITRE and supported by the European Molecular Biology Organization (EMBO), aimed at exploring the application of text mining tools to support annotation of molecular biology databases. Four different types of tasks were proposed:
Gene and protein named entity boundary detection (task 1a). This is a classical task in information extraction, and has been largely investigated in the context of the MUC  conferences as well as, more recently, in more biomedical forums, such as the JNLPBA workshop shared task proposed this year at COLING http://www.genisis.ch/~natlang/JNLPBA04/.
Passage retrieval (task 2.1). The task is well known in question answering . The point of this task is to retrieve a short passage rather than a complete document.
Text categorization (tasks 1b, 2.2). In task 1b, the targeted categories are a set of gene and protein names, while in task 2.2, the categories are the terms listed in the Gene Ontology (GO). For task 2.2, the passage supporting the annotation is also to be provided (task 2.1).
Ad hoc information retrieval (tasks 2.3 and 2.4). These two tasks were discarded due to the lack of participants.
Our participation focused on task 2.2, which also includes task 2.1. From a functional point of view, task 2.1 is defined as follows: given a Swiss-Prot triplet, i.e. a protein, a GO term and a related article, participants had to extract a short passage that substantiates selection of a GO category. Task 2.2 is more complex: given a Swiss-Prot pair, containing a protein and a relevant article, participants had to automatically assign a set of GO categories and then, for each assigned GO category, locate the appropriate passage supporting the attribution of the GO term. The experimental design assumes that the number of GO categories assigned for each protein is known a priori.
The plan of the paper is the following: introduction of the background research supporting our work in section 2; description of the data sets and the architecture of the system in section 3; results, the official evaluation merged with evaluations made after the competition, in section 4; conclusion and future work, in section 5.
In this section, we relate the content of the paper to the state of the art. Both passage retrieval and automatic text categorization are introduced; however, like the rest of the paper, which reflects the second BioCreative task, the presentation focuses on the categorization task.
Passage retrieval is an important step in question answering (QA). It bridges the gap between document retrieval and the very short textual answers needed for QA. However, the purpose of the passage retrieval task proposed in BioCreative is to find a short fragment that appropriately supports 1) the already known GO annotation in task 2.1, and 2) the automatic GO term assignment in task 2.2. In both cases the targeted text is already known. Thus, the task is similar to the known-item search task . In TREC, this task aimed at retrieving a single known document in corrupted collections. Corruptions were caused by misspellings  or by running optical character recognition (OCR) tools.
Text Categorization (TC) aims at attributing a set of concepts to an input text. Typical applications select a set of keywords from a glossary. TC is performed daily by professional indexers working in digital or classical libraries. However, keyword assignment is only a particular instance of text categorization. TC can also be seen as an information extraction task, when conducted for named-entity (NE) recognition purposes as investigated in task 1b. Computer-based concept mapping technologies include:
retrieval based on string matching, which attributes concepts to texts based on shared features (words, stems, phrases...);
empirical learning of text-concept associations from a training set of texts and their associated concepts.
In the former approach, the targeted concepts are indexed, and each indexing unit is attributed a specific weight. In the latter, a more complex model of the data is built in order to provide text-concept associations beyond strict feature sharing. Retrieval based on string matching is often presented as the weaker method  of the two, but in many real situations, like those defined in the BioCreative challenge, learning approaches cannot be applied. For instance, empirical learning methods require large training sets that are usually not available and whose development costs would exceed the budget of most research groups. Additionally, the size of category sets can be some orders of magnitude above the capacities of current learning algorithms running on a standard computing framework. Designing TC as a retrieval task means indexing a collection of terms, in our case terms from the GO, as if they were documents, and then processing each document as if it were a query. The retrieval tool then uses the score attributed to each term to rank them. Because the document collection is made of entities (terms in a controlled vocabulary) that are clearly shorter than usual documents, our study aims at exploring the behavior of classical statistical models under these conditions. For TC, a vector space engine, using both stems and linguistically-motivated indexing features, and its combination with a search tool based on pattern matching constitute the main modules of our system. We also investigated some refinements of this core combination.
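The retrieval view of categorization described above can be illustrated with a minimal sketch. The tokenizer, the toy vocabulary and the plain tf.idf dot product below are assumptions made for illustration only; they are not the weighting or the index used by the actual system.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_file(terms):
    """Map each word to the set of term ids in which it occurs."""
    inverted = defaultdict(set)
    for term_id, term in enumerate(terms):
        for word in set(tokenize(term)):
            inverted[word].add(term_id)
    return inverted

def rank_terms(document, terms, inverted):
    """Use the document as a query: score every term sharing a word with it."""
    n_terms = len(terms)
    doc_tf = Counter(tokenize(document))
    scores = defaultdict(float)
    for word, tf in doc_tf.items():
        postings = inverted.get(word, ())
        if not postings:
            continue  # word does not occur in any indexed term
        idf = math.log(n_terms / len(postings))
        for term_id in postings:
            scores[term_id] += (1 + math.log(tf)) * idf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(terms[i], score) for i, score in ranked]

# Toy vocabulary of three GO-like terms (illustrative only).
go_terms = ["protein kinase activity",
            "transcription factor activity",
            "integral to membrane"]
index = build_inverted_file(go_terms)
ranking = rank_terms("SPRK is a protein kinase that activates the JNK pathway",
                     go_terms, index)
```

Because only the postings of words present in the input text are visited, the cost of categorizing one document depends on the document length rather than on the size of the vocabulary, which is what makes the approach attractive for very large class sets.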
Automatic text categorization has been extensively studied and has led to an impressive number of papers. A partial list (see http://www.math.unipd.it/~fabseb60/ for an updated bibliography) of machine learning approaches applied to text categorization includes naive Bayes , support vector machines , boosting , and rule-learning algorithms . However, most of these studies apply text classification to a small set of classes, usually a few hundred, as in the Reuters collection . In comparison, our retrieval methods are designed to handle large class sets since they rely on an inverted file to allow fast categorization. The inverted file relates each indexing unit (word or stem) to the terms where it occurs in the GO. The size of the inverted file, which additionally stores the weight of each word (or stem), is an important parameter, but 10^5 to 10^6 is still a modest range, so that even large controlled vocabularies can be indexed.
In text categorization based on learning methods, the scalability issue is twofold. It concerns both the ability of these data-driven systems to work with large concept sets and their ability to learn and generalize regularities for rare events. In theory, large multi-class problems can be recast as sets of binary classification problems in order to be solved by learning approaches, but in practice this is often difficult. Larkey and Croft  show that the frequency of concepts in the collection is a major parameter. Our approach is data-poor because it only demands a small collection of annotated texts for fine tuning, as opposed to data-intensive machine learning approaches, which require large annotated sets.
To our knowledge, the largest set of categories ever used by text classification systems is above 10^4. These systems were applied to the domain of life sciences. Yang and Chute  worked with the International Classification of Diseases (about 12000 concepts). Similarly, the OHSUMED collection contains 14301 Medical Subject Headings (MeSH). In contrast, our system is tailored to be applied to much larger class sets. The Unified Medical Language System (UMLS) contains 871,584 different concepts and 2.1 million terms (with synonyms), while TrEMBL contains about 700,000 protein names, often including synonyms. For the BioCreative competition, the categorization space of our system was restricted to the GO partition of the Unified Medical Language System (UMLS).
In addition to the usual word-based features, more elaborate indexing units have been proposed in information retrieval (IR). The general idea in indexing entities other than words (or stems) is to handle information as conveyed in word collocations. Thus, expressions such as cystic fibrosis can be seen as one semantic entry in an inverted file. Various phrase indexing methods have been proposed in the past and, in general, conclusions on the retrieval or categorization performance of phrases as indexing units have been inconsistent . For IR, Hull et al.  and Strzalkowski et al.  used phrases and were able to report some improvement. For text categorization, Tan et al.  and Mongovi et al.  have reported that statistical bigrams increased performance, while Toole and Chen  relied on linguistically-motivated phrases. Mitra et al.  re-examined the use of statistical and syntactic phrases for retrieval and came to the conclusion that "once a good basic ranking scheme is used, the use of phrases do not have a major effect on precision at high ranks". For linguistically-motivated phrases, Arampatzis et al.  question the use of syntactic structures as a substitute for semantic content. As for our present concerns, statistical phrase indexing is problematic. Usually inspired by mutual information measures , it requires large volumes of training data, while we aim at designing a data-independent system. Therefore, in our systems phrases are based on syntactic parsing  rather than statistical analysis. However, let us remark that the data needed to identify statistical phrases are not of the same kind as those needed for training a classifier: the former requires only large corpora, while the latter needs supervision, i.e. annotated data. Both tasks are therefore data-intensive, but statistical phrase extraction is much cheaper than text categorization.
Most data sets and metrics are common to all of the subtasks; we therefore introduce these aspects first and then report the methods used for conducting each task.
The data resources used in the experiments can be separated into three subsets: the document collection, the Swiss-Prot  records and the GO terms . The Gene Ontology merges three structured vocabularies, organized as ontologies, that describe gene products in terms of their associated biological process, cellular component (1368) and molecular function in a species-independent manner. The molecular function terms describe activities at the molecular level. A biological process is accomplished by one or more ordered assemblies of molecular functions. The cellular component is a component of the cell that is part of some larger object, for example an anatomical structure or a gene product group.
Collections and metrics
GO terms per record in DSI (columns: # GO terms, # Swiss-Prot records).
For task 2.1 the expert has to decide whether the evidence text corresponds to the given GO concept and protein, or if it is not appropriate. Additionally, in task 2.2, the judge assesses whether the GO concept has been correctly predicted for each text. There are three different marks (high, generally, low) to evaluate the quality of the results. These marks evaluate the GO concept and the protein separately. For task 2.1, high, generally and low evaluate the relevance of the sentence that supports the annotation of the protein with GO concepts. For task 2.2, high means that the protein or the GO concept has been correctly predicted. As an evaluation of the GO term, "generally" means that the term is not totally wrong but too general to be useful for annotation. As an evaluation of the protein, "generally" means that the specific protein has not been found, but that a homologue from another organism or a reference to the protein family has been found instead. Low means that the answer was wrong.
The purpose of the passage retrieval task is to facilitate and improve annotation by offering a short segment of text that can indicate the correct GO term. Our approach is based on the idea that the relevant passage and the GO terms share some kind of lexical, and hopefully semantic, similarity. Therefore, the basic method consists of searching for the concept directly in the text. For the passage retrieval task, only the GO term is used to rank passages from the input text. Although using the protein name and synonyms of the GO term could have been useful to expand the matching power of our approach, we decided to focus on high-precision matching rather than relying on additional materials. A possible improvement would be to boost GO concepts that occur more than once in the candidate list. Identifying parts of GO terms in text is a simple strategy, which does not require any training data set and which can be manually tuned. The main difficulty encountered with this approach is defining a distance that measures the similarity between a GO concept and a given sentence. Different types of distances were tested, but the basic idea is to rank the candidate sentences and to select a single top-ranked passage. Two independent modules were developed: a sentence splitter, which defines the basic retrieval units, and the sentence ranker. Although specializing the parameters for each of the three GO axes could have been beneficial, we used the same settings for all three.
As preliminary observations we noted that applying our tools on full text articles rather than on abstracts did require improving our pre-processing tools, especially to detect sentence boundaries, therefore the official competition experiments were done using abstracts. For experiments conducted afterwards, the impact of using full-text articles was investigated.
For passage retrieval, the length of the appropriate segment to be considered was crucial. Following what was learned from the information extraction task of the TREC Genomics track, we assumed that sentences were likely to be relevant segments . The TREC task aimed at returning a Gene Reference into Function (GeneRIF), i.e. a short passage that provides information on the function of a protein in the LocusLink repository.
Lacking clean training data, we decided not to investigate the use of machine learning approaches to solve the sentence pre-processing problem (as in ), and instead we decided to use simple manually crafted regular expressions. The tool relies on a set of finite-state automata, which are applied sequentially. Although the system is simple, it offers a certain level of maintainability and a good accuracy (97%), which is similar to more advanced sentence boundary detection methods.
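As a rough illustration of rule-based sentence splitting with regular expressions, the following sketch applies two steps in sequence: a candidate boundary pattern, then a repair step for abbreviations. The pattern and the abbreviation list are assumptions chosen for the example; they are not the automata used in the system and will not reach the accuracy reported above.

```python
import re

# Abbreviations after which a period does not end a sentence (illustrative list).
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "Fig.", "vs.")

# Candidate boundary: '.', '!' or '?' followed by whitespace and an
# uppercase letter or a digit.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")

def split_sentences(text):
    """Split at candidate boundaries, then re-join pieces cut after an abbreviation."""
    pieces = BOUNDARY.split(text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + piece  # false boundary: undo the split
        else:
            sentences.append(piece)
    return sentences

example = ("SPRK is a kinase (Fig. 2). It activates the JNK pathway in vivo. "
           "Cdc42 induces activation, e.g. in mammalian cells.")
sentences = split_sentences(example)
```

Keeping the rules in an ordered list of small steps is what makes such a splitter easy to maintain: a new failure case is usually handled by adding one abbreviation or one pattern.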
Two similarity measures are used to compute a score between sentences and GO terms: 1) an exact match method, with high precision but low discriminative power, and 2) a fuzzy string-edit distance, with low precision but good recall. These two measures are then linearly combined to obtain a unique score S for each sentence in the input document, as in the following equation:
\[ S = w_0 \, s_0 + w_1 \, s_1 \]
with the following parameters:
s_0: perfect score;
s_1: fuzzy score;
w_0: weight of s_0;
w_1: weight of s_1.
The direct match method computes a Dice-like distance between a GO concept T and a candidate passage P, as in the following equation:
\[ \mathrm{match}(T, P) = \frac{|T \cap P|}{|T|} \]
Each time a word of the GO concept is found in the candidate passage, the intersection of the GO term and the passage, |T ∩ P|, is increased by one. This count is divided by |T|, the total number of words composing the concept. The normalization factor is important to smooth length variations in the GO controlled vocabulary. It is also interesting to notice that a full and exact match is unusual, but when it occurs (e.g. when a five-token GO term is found in the document) very high precision is achieved, so precision becomes a trivial issue. Quite unusually for categorization and information retrieval, recall is more difficult to achieve. Indeed, in large text collections (MEDLINE, Web...), the natural redundancy of information helps to find a relevant document whatever words are used to query the system; searching for a relevant passage in a single abstract offers no such redundancy and is therefore more challenging regarding recall.
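The direct match score and its linear combination with a fuzzy score can be sketched as follows. The tokenizer and the default weights are placeholders introduced for the example; the weights actually used in the system are not given here.

```python
import re

def tokenize(text):
    """Lowercase word tokens; '/' and '-' are kept inside tokens (an assumption)."""
    return re.findall(r"[a-z0-9][a-z0-9/\-]*", text.lower())

def direct_match_score(go_term, passage):
    """Fraction of the GO term's words that also occur in the passage."""
    term_words = set(tokenize(go_term))
    if not term_words:
        return 0.0
    passage_words = set(tokenize(passage))
    return len(term_words & passage_words) / len(term_words)

def combined_score(s0, s1, w0=0.5, w1=0.5):
    """Linear combination of the exact score s0 and the fuzzy score s1.

    The weights w0 and w1 are illustrative defaults, not tuned values.
    """
    return w0 * s0 + w1 * s1

score = direct_match_score("protein kinase activity", "SPRK is a protein kinase")
```

Dividing by the number of words in the term, rather than in the passage, keeps the score comparable across GO terms of very different lengths.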
The string-edit distance module computes a distance between two strings. The score counts the minimal number of modifications (insertions, deletions and substitutions) needed to transform the first string into the second one (see , for a short introduction or  and , for a comprehensive presentation). String-edit distance operations are very sensitive to small cost variations, making this step very time-consuming.
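For reference, the basic edit distance (the Levenshtein variant) can be computed with the classical dynamic programming scheme below. The operation costs are exposed as parameters because the text stresses their influence; the unit defaults are the textbook values, not the tuned costs of the system.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimal total cost of insertions, deletions and substitutions
    needed to turn `source` into `target`."""
    # previous[j] holds the cost of turning the processed prefix of source
    # into the first j characters of target.
    previous = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * del_cost]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            current.append(min(previous[j] + del_cost,     # delete s_char
                               current[j - 1] + ins_cost,  # insert t_char
                               substitution))              # match or substitute
        previous = current
    return previous[-1]
```

Only two rows of the table are kept in memory, so the space cost is linear in the length of the target string while the time cost remains proportional to the product of the two lengths.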
Table 3. Example of distances for task 2.1.
The choice of the best distances was made empirically. Some characters, such as "-" or digits, have a very low replacement cost. As exemplified in Table 3, given the three following sentences and the term "protein serine/threonine kinase activity", the Smith-Waterman distance generally performed well:
S1. Cdc42-induced activation of the mixed-lineage kinase SPRK in vivo.
S2. Src homology 3 domain (SH3)-containing proline-rich protein kinase (SPRK)/mixed-lineage kinase (MLK)-3 is a serine/threonine kinase that upon overexpression in mammalian cells activates the c-Jun NH(2)-terminal kinase pathway.
S3. This is, to the best of our knowledge, the first demonstrated example of a Cdc42-mediated change in the in vivo phosphorylation of a protein kinase.
In this example, we assume that S2 is the best candidate sentence. Two direct matches are observed in S2 and S3, so these two segments are better candidates than S1; to rank segments S2 and S3, however, we relied on the string-edit distance module. In Table 3, we see that both the Smith-Waterman and Jaccard measures are discriminant, while neither Jaro nor Levenshtein is effective. The final score is a linear combination that favors the Smith-Waterman and Jaccard distances over the Jaro and Levenshtein distances. This score will be used in our evaluations to estimate the reliability of the passage assignment.
Distribution of tokens per term in the Gene Ontology (counts given as # GO terms).
Our system does not use any specific string normalization module. The system extracts every contiguous sequence of 5 tokens by moving a window through the abstract. These pentagrams are then matched against the collection of GO terms. Basically, the manually crafted finite-state automata allow two insertions or one deletion within a GO term. Ranking of the proposed candidate terms is based on these two basic edit operations: an insertion costs 1, while a deletion costs 2. The resulting pattern matcher acts as a term proximity scoring system , but with a 5-token matching window. Krallinger and Padron  use a similar strategy, but they generalize the idea and vary the window size too.
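A possible reading of this window-based matching is sketched below. The functions are hypothetical and replace the finite-state automata with plain list operations: a term matches a window either with up to two extra tokens interleaved (cost 1 each) or with exactly one of its tokens missing (cost 2). How the actual automata resolve ambiguous cases is not specified here, so the sketch simply returns the cheapest option.

```python
def windows(tokens, size=5):
    """Yield every contiguous run of `size` tokens (the whole list if shorter)."""
    if len(tokens) <= size:
        yield tokens
        return
    for start in range(len(tokens) - size + 1):
        yield tokens[start:start + size]

def insertions_needed(term, window):
    """Fewest extra tokens interleaved with an in-order match of term, or None."""
    best = None
    for start, token in enumerate(window):
        if token != term[0]:
            continue
        pos, matched = start, 1
        # Greedily match the remaining term tokens at their earliest positions.
        while matched < len(term) and pos + 1 < len(window):
            pos += 1
            if window[pos] == term[matched]:
                matched += 1
        if matched == len(term):
            gap = (pos - start + 1) - len(term)
            best = gap if best is None else min(best, gap)
    return best

def contains(window, sequence):
    """True if `sequence` occurs contiguously in `window`."""
    n = len(sequence)
    return any(window[i:i + n] == sequence for i in range(len(window) - n + 1))

def match_cost(term, window, max_insertions=2):
    """Cheapest match of term in window: 1 per insertion, 2 for one deletion."""
    costs = []
    gap = insertions_needed(term, window)
    if gap is not None and gap <= max_insertions:
        costs.append(gap)
    if len(term) > 1:
        for skip in range(len(term)):
            if contains(window, term[:skip] + term[skip + 1:]):
                costs.append(2)
                break
    return min(costs) if costs else None

tokens = "the protein has serine threonine kinase activity in vivo".split()
term = ["protein", "kinase", "activity"]
found = [c for c in (match_cost(term, w) for w in windows(tokens)) if c is not None]
best_cost = min(found) if found else None
```

In this toy input no window contains all three term tokens close enough together, so the term is only retrieved through the one-deletion rule, at cost 2.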
Vector space classifier
Term Weights in the SMART System.
Term frequency (logarithmic): 1 + log(tf)
Term frequency (augmented): α + β × (tf/max(tf))
Inverse Document Frequency
The engine uses stems (Porter, with minor modifications) as indexing units and a stop word list (544 items). We observed that cosine normalization was especially effective for our task. This is not surprising, considering the fact that cosine normalization performs well when all documents have the same length .
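The weighting factors named above can be illustrated with a short sketch of an "ltc" weighting (logarithmic term frequency, inverse document frequency, cosine normalization) and of the augmented term frequency factor. The idf form log(N/df) is the common SMART definition and is assumed here; the function names and the document-frequency table in the usage line are hypothetical.

```python
import math
from collections import Counter

def ltc_weights(tokens, doc_freq, n_docs):
    """SMART 'ltc': (1 + log tf) * log(N / df), then cosine normalization."""
    tf = Counter(tokens)
    raw = {}
    for word, count in tf.items():
        df = doc_freq.get(word, 0)
        if df == 0:
            continue  # word absent from the indexed collection
        raw[word] = (1 + math.log(count)) * math.log(n_docs / df)
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {}
    # Cosine normalization: the resulting vector has unit length.
    return {word: w / norm for word, w in raw.items()}

def augmented_tf(count, max_count, alpha=0.5, beta=0.5):
    """SMART 'a' factor: alpha + beta * (tf / max(tf))."""
    return alpha + beta * (count / max_count)

weights = ltc_weights(["kinase", "kinase", "activity"],
                      {"kinase": 10, "activity": 50}, n_docs=100)
```

With cosine normalization every indexed item is mapped to a vector of length 1, so very short items such as GO terms are not penalized or favored merely because of their length.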
Sample of GO synonyms for each axis.
function: cholesterol O-acyltransferase – sterol O-acyltransferase activity
component: protoplasm – intracellular
process: cell division – cytokinesis
GO terms contain between 1 and 28 words and are almost verb-free noun phrases (NP), if we omit some rare participle forms such as in "cell-cell signaling involved in cell fate commitment", which occur in less than 0.01% of GO terms. Noun phrase indexing was expected to be beneficial because of the profile of these terms. In our approach, only the content of GO terms is stored in the indexes, and phrase recognition is only applied on the input document in order to identify possible GO terms. Formally, this manipulation of the abstract can be viewed as a reformulation process. The abstract is translated into a set of noun phrases before being matched to the list of GO terms. Our working hypothesis is a weak variant of the Phrase Retrieval Hypothesis . We assumed that NP recognition can help reduce noisy mapping for subterms.
Our shallow parser uses both statistical and manually written patterns, applied at the syntactic level (part-of-speech) of each sentence , to identify noun phrase boundaries. The parser concentrates on adjective (A) and noun (N) sequences, such as [A*] [N*], i.e. N, AN, NN, ANN, NNN, AANN, ANNN, NNNN, AANNN, NNNNN... Adjectives as well as prepositions such as of and with are optional. Unlike in other technical glossaries , we observed that templates with conjunctions are rare in GO terms. We counted 1423 occurrences of conjunction tokens in the GO terminology (i.e., almost 1%); we therefore decided to ignore them.
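The adjective and noun templates can be illustrated with a minimal chunker over part-of-speech tagged tokens. The tag names "A" and "N" and the input format are assumptions made for the example; the sketch ignores prepositions and the statistical patterns mentioned above.

```python
def noun_phrases(tagged_tokens):
    """Return maximal adjective*/noun+ sequences from (token, tag) pairs."""
    phrases, adjectives, nouns = [], [], []

    def flush():
        # A phrase is emitted only if it contains at least one noun.
        if nouns:
            phrases.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for token, tag in tagged_tokens:
        if tag == "N":
            nouns.append(token)
        elif tag == "A":
            if nouns:  # an adjective after nouns starts a new phrase
                flush()
            adjectives.append(token)
        else:
            flush()    # any other tag closes the current phrase
    flush()
    return phrases

tagged = [("cystic", "A"), ("fibrosis", "N"), ("is", "V"), ("a", "D"),
          ("severe", "A"), ("genetic", "A"), ("disease", "N")]
phrases = noun_phrases(tagged)
```

For the tagged input above, the chunker returns the two phrases "cystic fibrosis" and "severe genetic disease", which correspond to the AN and AAN templates.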
We call noisy subterm mapping an erroneous behavior of the mapping process, in which it selects erroneous GO terms that are part of a relevant term. Thus, considering an input text dealing with the term cystic fibrosis, both cystic and fibrosis are irrelevant subterms likely to be proposed as indexing units, so being able to recognize that cystic fibrosis constitutes a noun phrase will help discard these two noisy candidates. However, discarding all subterms from the candidate list may have negative effects, so subterm removal must be based on contextual evidence. If a subterm occurs in the input text as an autonomous noun phrase, then it is kept in the candidate list. Therefore two different indexes (or views of the input text) are constructed. The merger of the phrase index with the index of stems is described in the next paragraph.
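The contextual test for subterms can be sketched as a simple check against the list of noun phrases recognized in the input text. The function name and the whitespace tokenization are hypothetical; the sketch only flags candidates and does not reproduce the downscoring applied by the system.

```python
def is_noisy_subterm(candidate, phrases):
    """True if candidate only occurs inside longer noun phrases,
    never as a noun phrase of its own."""
    cand = candidate.lower().split()
    n = len(cand)
    inside_longer = False
    for phrase in phrases:
        words = phrase.lower().split()
        if words == cand:
            return False  # autonomous occurrence: keep the candidate
        if any(words[i:i + n] == cand for i in range(len(words) - n + 1)):
            inside_longer = True
    return inside_longer

# Noun phrases recognized in a hypothetical abstract.
recognized = ["cystic fibrosis", "fibrosis"]
```

With this phrase list, cystic is flagged as noisy because it only appears inside cystic fibrosis, whereas fibrosis is kept because it also occurs as a phrase of its own.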
Fusion of classifiers
The hybrid system combines the regular expression classifier with the vector-space classifier. Unlike Larkey and Croft , we do not merge our classifiers by linear combination, because the RegEx module does not return a score consistent with that of the vector space system. The combination of classifiers uses the list returned by the vector space module as a reference list (RL), while the list returned by the regular expression module is used as a boosting list (BL). This method serves to improve the ranking of terms listed in RL. A third factor takes into account the length of terms. Both the number of characters (L1) and the number of tokens (L2, with L2 > 3) are computed, so that long and compound terms, which appear in both lists, are favored over single and short terms. We assume that the reference list has good recall, and we do not set any threshold on it. For each concept t listed in the RL, the combined Retrieval Status Value (cRSV, equation 1) is:
The value of the k parameter is set empirically.
The index of phrases is used to reorder the set of terms returned by the engine. The strategy is the following: when a given term is found in the list of terms (TL) returned by the hybrid system (RegEx + VS), and this term is not found alone in the phrase list (PL) stored for this abstract, then the RSV of this concept is downscored. The shorter the subterm, the more its RSV is affected, as expressed in the following equation, which gives the final RSV (fRSV); m = 16 in equation 2, since GO terms contain no more than 15 words:
In principle, to transform a retrieval engine, which returns a ranked list of concepts, into a categorization system, which makes binary decisions on each concept, it is necessary to set a threshold on the retrieval status value. However, the number of concepts to be returned for each protein-GO axis pair is known, so this threshold may be a priori ignored in the current design of the categorizer.
GO definition, prior probability and full article
Sample of GO definitions.
term: TRAIL receptor 2 biosynthesis
definition: The formation from simpler components of TRAIL-R2 (TNF-related apoptosis inducing ligand receptor 2), which engages a caspase-dependent apoptotic pathway and mediates apoptosis via the intracellular adaptor molecule FADD/MORT1.
term: trans-2-enoyl-CoA reductase (NADPH) activity
definition: Catalysis of the reaction: acyl-CoA + NADP+ = trans-2,3-dehydroacyl-CoA + NADPH + H+.
Distribution of the most frequent GO terms in the 640 items Swiss-Prot data set (DSI): cut-off at 14 occurrences.
integral to plasma membrane
transcription factor activity
integral to membrane
protein amino acid phosphorylation
In this section, we present and discuss the official results as well as results gathered after the competition. All official results were provided by the BioCreative judges.
Passage retrieval: results for each GO axis (columns: # submitted passages, # evaluated passages).
Gene Ontology Annotation
Results for different system settings (columns include GO Definition + Prior, MRP (δ %), MAP (δ %)).
For r1, we see that the optimal weighting schema for the vector space engine (i.e. anc.atn) is not the best schema for combination with the regular expression pattern matcher. The best combination is achieved with ltc.lnn (r2). As expected, the impact of the pattern matcher is especially strong at high ranks (+31.3% of MRP), while the improvement of the MAP is less significant (+19.1%). In r3, we observe that the thesaurus has a positive but marginal impact, from 15.86 to 16.10 for MRP. The submitted run (r4) confirms that linguistically-motivated phrase indexing is beneficial, from 16.06 to 16.45 for MRP and from 7.16 to 7.72 for MAP. In r4, we used ltc.lnn, but experiments performed after the competition show (in r5) that a better tf.idf combination is anc.ltn. For the augmented term frequency factor, noted a, the value of the parameters is α = β = 0.5. Finally, the use of the GO definition to expand the document/term matching features is also beneficial (from 17.04 to 17.17 for MRP and from 8.32 to 8.61 for MAP). Run r7 uses the same settings as the official run but applied to the full articles. Although using full articles rather than abstracts degrades the classification with regard to both MAP and MRP, we cannot conclude that abstracts should be preferred to full articles. In fact, we cannot expect the best combination for processing short abstracts to remain optimal for long articles, and additional experiments with different parameters are therefore needed to study this issue.
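For readers unfamiliar with the MAP measure used in this comparison, the sketch below computes it under the standard definition (precision averaged over the ranks of the relevant items, then averaged over queries). It assumes that each ranked list contains no duplicates; the function names are hypothetical and the sketch is not the evaluation script used for the reported figures.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant items that are never retrieved contribute zero.
    return total / len(relevant)

def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["a", "b", "c"], {"a", "c"})
```

In the usage line, the relevant items are found at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.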
Consistent with conclusions drawn from task 1b, data-poor retrieval methods based on string matching  are competitive with more complex data-intensive approaches . The features appearing in GO definitions and related resources, which were used by some of the competitors, appear to be promising extensions . Such expansion strategies could improve both the categorization and the passage retrieval tasks, and we believe that further experiments are necessary to fully exploit these resources. Another important evaluation parameter is the size of the retrieved passage. Guidelines were not explicit and therefore some participants  decided to return document sections rather than sentences. Their precision (high results) was not improved, but they retrieved a large number of results in the generally category. Furthermore, it is very interesting to analyse the way these other participants envisage the relationship between tasks 2.1 and 2.2. Passage retrieval (task 2.1) is seen as a feature reduction step, which is preparatory for the GO annotation task (task 2.2). Working with full text articles, other systems must first reduce the categorization space to a shorter passage; categorization is then applied. This design is the opposite of ours, in which categorization is performed first and passage retrieval is then driven by the GO category. Such strictly inverted strategies suggest that a wide span of approaches can be equally effective. However, considering a related task proposed last year in the context of the TREC Genomics track (automatic extraction of GeneRIFs in LocusLink), it seems that passages longer than a sentence are generally not appropriate for protein annotation. Thus, for the TREC Genomics track , Ruch et al.  report that sentence shortening was an effective strategy to model GeneRIF extraction as performed by humans. Finally, recent advances in text mining applied to biomedical literature suggest that argumentative content , i.e. paragraphs or sentences specific to categories such as purpose, methods, results and conclusion, might be of interest for information retrieval  and extraction of gene and protein functions .
We have reported on the development and evaluation of a passage retrieval tool used to support an automatic text categorization tool for protein annotation. For passage retrieval, the tool combines an exact match strategy and a string-to-string edit distance to select the best ranked sentence. For text categorization, the system combines: 1) a pattern matcher, based on regular expressions; 2) a vector space retrieval engine that uses stems and phrases as indexing units, a traditional tf.idf weighting schema, and cosine as normalization factor. The use of noun phrases seems to improve the categorization's average precision by at least 3%. The combined system can be applied to any controlled vocabulary, even when manually annotated data are not available. The system achieved very competitive results in the context of the BioCreative challenge.
We would like to thank Christine Chichester as well as the reviewers for their valuable comments. We also would like to thank the organizers of the BioCreative challenges as well as the evaluators at the European Bioinformatics Institute. Finally, we would like to thank Christian Blaschke, Dietrich Rebholz-Schuhmann, Karin Verspoor and Evelyn Camon for the stimulating discussion at the BioCreative workshop in Granada. For the UMLS and TrEMBL, statistics are given for September 2003 releases. The easyIR toolkit can be downloaded: http://www.natlang.hcuge.ch/People/ruch/. The study has been supported by the European Union 6th framework program via the SemanticMining Network of Excellence (EU Grant 507505 – Swiss Federal Office for Education and Science Grant 03.0399).
- Hirschman L, Park J, Tsujii J, Wong L, Wu C: Accomplishments and challenges in literature data mining for biology. Bioinformatics 2002, 18(12):1553–1561. 10.1093/bioinformatics/18.12.1553
- Chinchor N: MUC-7 Named-Entity task Definition. MUC. 1997.
- Hull D: Xerox TREC-8 Question Answering Track Report. TREC-8 Report 2000.
- Kantor P, Voorhees E: The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Information Retrieval 2000, 165–76. 10.1023/A:1009902609570
- Ruch P: Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. COLING 2002.
- Mittendorf E, Schauble P: Measuring the Effects of Data Corruption on Information Retrieval. SDAIR Proceedings 1996.
- Yang Y: Sampling strategies and learning efficiency in text categorization. Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access 1996.
- McCallum A, Nigam K: A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization 1998.
- Joachims T: Making Large-Scale SVM Learning Practical. Advances in Kernel Methods – Support Vector Learning 1999.
- Schapire R, Singer Y: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning 2000, 39(2/3):135–168. 10.1023/A:1007649029923
- Apté C, Damerau F, Weiss S: Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS) 1994, 12(3):233–251. 10.1145/183422.183423
- Hayes P, Weinstein S: A System for Content-Based Indexing of a Database of News Stories. Proceedings of the Second Annual Conference on Innovative Applications of Intelligence 1990.
- Larkey L, Croft W: Combining classifiers in text categorization. SIGIR, ACM Press, New York, US; 1996:289–297.
- Yang Y, Chute C: A linear least squares fit mapping method for information retrieval from natural language texts. COLING 1992, 447–453.
- Rasolofo Y, Savoy J: Term Proximity Scoring for Keyword-based Retrieval Systems. ECIR 2003, 101–116.
- Hull D, Grefenstette G, Schulze B, Gaussier E, Schutze H, Pedersen J: XEROX TREC-5 site report: Routing, Filtering, NLP, and Spanish tracks. TREC-5 NIST Special Publication 500–238 1997, 167–180.
- Strzalkowski T, Stein G, Wise GB, Carballo JP, Tapanainen P, Jarvinen T, Voutilainen A, Karlgren J: Natural Language Information Retrieval: TREC-7 Report. Text REtrieval Conference 1998, 164–173.
- Tan C, Wang Y, Lee C: The Use of BiGrams to Enhance Text Categorization. Information Processing and Management 2002, 38(4):529–546. 10.1016/S0306-4573(01)00045-0
- Kongovi M, Guzman J, Dasigi V: Text Categorization: An Experiment Using Phrases. ECIR 2002, LNCS 2291: 213–220.
- Tolle K, Chen H: Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science 2000, 51(4):352–370. 10.1002/(SICI)1097-4571(2000)51:4<352::AID-ASI5>3.0.CO;2-8
- Mitra M, Buckley C, Singhal A, Cardie C: An analysis of Statistical and Syntactic Phrases. RIAO 1997, 200–214.
- Arampatzis A, van der Weide T, van Nommel P, Koster C: Linguistically Motivated Information Retrieval. Encyclopedia of Library and Information Science 2000, 69.
- Stolz W: A probabilistic procedure for grouping words into phrases. Language and Speech 1965, 8.
- Ruch P, Baud R, Bouillon P, Robert G: Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models. CoNLL 2000, 111–116.
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28(1):45–8. 10.1093/nar/28.1.45
- Consortium TGO: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556
- Ruch P, Chichester C, Cohen G, Coray G, Ehrler F, Ghorbel H, Müller H, Pallotta V: Report on the TREC 2003 Experiment: Genomic Track. TREC-12 2004. [http://trec.nist.gov/pubs/trec12/t12_proceedings.html]
- Reynar J, Ratnaparkhi A: Entropy Approach to Identifying Sentence Boundaries. Proceedings of the ANLP 1997.
- Ruch P, Baud R, Geissbühler A: Using Lexical Disambiguation and Named-Entity Recognition to Improve Spelling Correction in the Electronic Patient Record. Art Intell Med 2003, 29(1–2):169–184. 10.1016/S0933-3657(03)00052-6
- Wagner R, Fisher M: The string-to-string correction problem. Journal of the Association of Computing Machinery 1974, 1: 168–173.
- Cohen W, Fienberg PRS: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003, 73–78.
- Krallinger M, Padron M: Prediction of GO annotation by Combining Entity Specific Sentence Sliding Windows Profiles. BioCreative Notebook Papers, CNB 2004. [http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/]
- Salton G, McGill M: Introduction to Modern Information Retrieval. McGraw Hill Book; 1983.
- Amati G, van Rijsbergen C: Probabilistic Models of Information Retrieval based on Measuring the Divergence from Randomness. ACM Transactions on Information Systems (TOIS) 2002, 20(4):357–389. 10.1145/582415.582416
- Singhal A, Buckley C, Mitra M: Pivoted document length normalization. ACM-SIGIR 1996, 21–29.
- Grabtree I, Soltysiak S: Identifying and Tracking Changing Interests. International Journal of Digital Libraries 1998, 2: 38–53. 10.1007/s007990050035
- Klinkenberg R, Joachims T: Detecting Concept Drift with Support Vector Machines. Proceedings of ICML 2000, 487–494.
- Park Y, Byrd R, Boguraev B: Automatic Glossary Extraction: Beyond Terminology Identification. COLING 2002.
- Verspoor K, Cohn J, Joslyn C, Mniszewski S: Protein Annotation as Term categorization in the Gene Ontology. BioCreative Notebook Papers, CNB 2004. [http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/]
- Couto F, Silva M, Coutinho P: FIGO: Findings GO Terms in UnStructured Text. BioCreative Notebook Papers, CNB 2004. [http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/]
- Hanish D, Fundel K, Mevissen H, Zimmer R, Fluck J: ProMiner: Organism-specific protein name detection using approximate string matching. BioCreative Notebook Papers, CNB 2004. [http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/]
- Crim J, McDonald R, Pereira F: Automatically Annotating Documents with Normalized Gene Lists. BioCreative Notebook Papers, CNB 2004. [http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04/handout/]
- Hersh W, Bhupatiraju R: TREC GENOMICS Track Overview. TREC-12 2004. [http://trec.nist.gov/pubs/trec12/t12_proceedings.html]
- Mizuta Y, Collier N: Zone Identification in Biology Articles as a Basis for Information Extraction. COLING Workshop on Natural Language Processing in Biomedicine and its Application (JNLPBA) 2004.
- Tbahriti I, Chichester C, Lisacek F, Ruch P: Using Argumentation to Retrieve Articles with Similar Citations from MEDLINE. COLING Workshop on Natural Language Processing in Biomedicine and its Application (JNLPBA) 2004.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.