Biomedical Text Mining (TM) has become increasingly popular due to the great need to provide access to the tremendous body of texts available in the biomedical sciences. Considerable progress has been made in the development of basic resources (e.g. ontologies, annotated corpora) and techniques in this area, e.g. in Information Retrieval (IR) (i.e. the identification of relevant documents) and Information Extraction (IE) (i.e. the identification of specific information in documents, e.g. proteins and genes, and specific relations between them), and research has begun to focus on increasingly challenging tasks, e.g. summarization and the discovery of novel information in the biomedical literature [1–4].
The major current challenge is to extend TM techniques with richer and deeper analysis and to apply them to support real-world tasks in biomedicine. In the recent past, there has been an increasing trend towards research driven by actual user needs rather than by technical developments. Corpus annotation and classification schemes applicable to a wider variety of biomedical literature have been developed to support biologists with diverse TM needs [6, 7]. Shared tasks (e.g. BioCreative and the TREC Genomics track) targeting the actual workflow of biomedical researchers have appeared, along with studies exploring the TM needs of specific tasks (e.g. literature curation, library services for biomedical applications) [8, 9]. Several practical tools supporting IR and IE from the biomedical literature have been developed for working scientists [10–13]. However, the understanding of user needs remains one of the neglected areas of biomedical TM, and further user-centered evaluations and systems grounded in real-life tasks are required to determine which tools and services are actually useful.
In our recent work, we investigated the user needs of a challenging task yet to be tackled by text mining: Cancer Risk Assessment (CRA) [15, 16]. CRA involves examining the published evidence to determine the relationship between exposure to a chemical and the likelihood of developing cancer from that exposure. It has become increasingly important over the past years as the link between environmental chemicals and cancer has become evident and strict legislation governing chemical safety has been introduced worldwide. For example, the recently established European Community REACH (Registration, Evaluation, Authorisation and Restriction of Chemical substances) legislation requires that all chemicals manufactured or imported in high quantities undergo thorough CRA (EC 1907/2006).
Performed manually by experts in health-related institutions, CRA is a demanding exercise which requires combining scientific knowledge with elaborate literature review. It involves searching, locating and interpreting relevant information in repositories of peer-reviewed scientific journal articles - a process which can be extremely time-consuming because the data required for the CRA of just a single carcinogen may be scattered across thousands of articles. In recent years, while the need for CRA has grown, the task has also become increasingly complex due to the rapid development of molecular biology techniques, the increased knowledge of the mechanisms involved in cancer development, and the exponentially growing volume of CRA literature (e.g. the MEDLINE database of biomedical research articles grew by over 0.5 million references last year and now contains over 17 million in total). Under these circumstances, CRA is becoming too challenging to manage by manual means.
To gain an understanding of how TM could best assist CRA, we conducted an initial study in which we interviewed 14 experienced risk assessors working for different national and international CRA authorities in Sweden. During this study, the risk assessors described the following steps of their work: (1) identifying the journal articles relevant for the CRA of the chemical in question, (2) identifying the scientific evidence in these articles which helps to determine whether/how the chemical causes cancer, (3) classifying and analysing the resulting (partly conflicting) evidence to build the toxicological profile of the chemical, and (4) preparing the risk assessment report. These steps are conducted largely manually, relying on standard literature search engines (e.g. that provided with PubMed) and word processors as technical support. The CRA of a single chemical may take several years when done on a part-time basis. The risk assessors were unanimous about the need to increase the productivity of their work to meet the current CRA demand. They reported that locating and classifying the scientific evidence in the literature is the most time-consuming phase of their work, and that a tool capable of assisting this phase and ensuring that all the potentially relevant evidence is found would be particularly helpful.
It became clear to us that a prerequisite for the development of such a tool is an extensive specification of the scientific evidence used for CRA. This evidence -- which forms the basis of all the subsequent steps of CRA -- is described in the guideline documents of major international CRA agencies, e.g. the European Chemicals Agency (ECHA), the United States Environmental Protection Agency (EPA), and the International Agency for Research on Cancer (IARC). The guideline documents describe various human, animal (in vivo), cellular (in vitro) and other mechanistic data which provide evidence for both hazard identification (i.e. the assessment of whether a chemical is capable of causing cancer) and the assessment of the Mode of Action (MOA) (i.e. the sequence of key events that result in cancer formation, e.g. mutagenesis, increased cell proliferation, and receptor activation). However, our investigation showed that although these documents constitute the main reference material available for CRA, they cover only the main types of evidence, do not specify the evidence at the level of detail required for comprehensive data gathering (e.g. they do not provide complete lists of relevant keywords or terms), and are not updated regularly to include the latest developments in the biomedical sciences. For example, the most recent EPA CRA guideline was published in 2005 and the data requirements have not been updated since then.
The same guidelines emphasise, however, the importance of investigating all the published scientific data on the chemical in question which might be of potential relevance for CRA. For example, according to ECHA, "failure to collect all of the available information on a substance may lead to duplicate work, wasted time, increased costs and potentially unnecessary animal use" (page 7). Recent research has revealed that conflicting risk assessments of the same chemical are surprisingly common [22, 23]. Inadequate or imbalanced data may give rise to such problems. Extensive data gathering is therefore essential not only for the coverage but also for the accuracy of CRA.
Where the guidelines fail to provide sufficient information, risk assessors rely on their experience and expert knowledge. This is not ideal since chemical carcinogenesis is such a complex process that even the most experienced risk assessor is incapable of memorizing the wide range of relevant evidence without the support of a thorough specification.
Here we report our work on obtaining a more adequate specification of the scientific evidence for CRA. Ideally, a comprehensive knowledge resource is needed which specifies the range of relevant evidence and provides extensive lists of keywords to support the gathering of this evidence from the literature. Given the dynamic nature of CRA data, the best long-term approach would be to develop technology for the automatic acquisition and updating of such a resource from the CRA literature [1, 2]. However, the very development of such technology requires a target specification of the scientific evidence that is more comprehensive than the one currently available. Therefore, in this first work, we opted for expert annotation of the biomedical literature according to the evidence it offers for CRA.
Following the recommended practices of biomedical corpus design by Cohen et al. as far as practical, we constructed a representative, balanced CRA corpus of 1297 MEDLINE abstracts from a set of journals typically used for CRA. A user-friendly annotation tool was designed which experts could use to annotate abstracts (i) for their relevance to CRA and (ii) according to the types of evidence they provide for the task. Three experts (experienced risk assessors) agreed on the annotation guidelines and produced a corpus which contains 1164 abstracts judged as relevant and annotated with 1742 unique keywords (words or phrases) indicating the evidence they offer for CRA. The experts grouped the keywords according to the types of evidence they provide for the task, and organized them into a taxonomy which contains 48 distinct classes and covers a variety of data related to carcinogenic activity, MOA and toxicokinetics. We measure the inter-annotator agreement of both the relevance and the keyword annotation tasks. In addition, we report a series of experiments which involve training and testing automatic classifiers to assign PubMed abstracts to taxonomy classes. Finally, a simple user test in a near real-world CRA scenario is reported. The evaluation demonstrates that our taxonomy is highly accurate and can be useful for practical CRA. The materials we have produced can thus provide valuable support for manual CRA as well as facilitate the development of a TM-based approach. We discuss refining and extending the taxonomy further via manual and machine learning approaches, and the subsequent steps required to develop TM to support the entire CRA workflow.
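To illustrate the general idea of assigning abstracts to taxonomy classes, the sketch below shows a minimal keyword-matching baseline. The class names and keyword lists are illustrative placeholders, not entries from the actual 48-class taxonomy, and the actual classifiers used in our experiments are described in the Methods section:

```python
import re

# Toy taxonomy: evidence class -> indicative keywords.
# These entries are hypothetical examples, not the real CRA taxonomy.
TAXONOMY = {
    "mutagenesis": ["mutation", "mutagenic", "dna adduct"],
    "cell proliferation": ["cell proliferation", "mitogenic", "hyperplasia"],
    "receptor activation": ["receptor binding", "receptor activation"],
}

def classify(abstract):
    """Return the taxonomy classes whose keywords occur in the abstract."""
    text = abstract.lower()
    return sorted(
        cls
        for cls, keywords in TAXONOMY.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords)
    )

example = ("Exposure induced DNA adduct formation and increased "
           "cell proliferation in rat liver.")
print(classify(example))  # ['cell proliferation', 'mutagenesis']
```

A trained classifier would replace the hand-listed keywords with weights learned from the annotated corpus, but the input/output contract (abstract text in, set of evidence classes out) stays the same.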
The rest of this paper is organized as follows: the Methods section introduces the CRA corpus, the annotation tool, the annotation guidelines, the principles of taxonomy construction, and the automatic classification methods. The Results section first describes the annotation work and the resulting taxonomy, and then reports the results of the inter-annotator agreement tests, the automatic classification experiments and the user test. The Discussion and Conclusion section concludes the paper with a comparison to related research and directions for future work.