Extraction of semantic biomedical relations from text using conditional random fields

Background The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can be accessed efficiently only by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has reached a level of maturity sufficient to form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, classifying the type of relation is also of great importance, and this is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition.
Results We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather, the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph.
Conclusion We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive with leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of the detection of entities and of entity boundaries, which will also greatly improve the relation extraction performance.


General description
The corpus described here consists of 5,720 sentences extracted from 453 Entrez [2] gene entries. The corresponding Entrez gene IDs are provided as supplementary data as well. The corpus was generated to evaluate our relation extraction approach, which identifies relations holding between genes and diseases and extracts the relation type. We defined the type of a specific gene-disease relation according to the level of molecular activity (DNA, mRNA, protein etc.) at which a gene is mentioned to be related to a specific disease. Thus, we defined the following assertions: a gene/protein can be in a state of altered expression, genetic variation or regulatory modification. With these relation types we aim to cover the molecular conditions, ranging from genetic through transcriptional to phosphorylation events, under which a gene/protein is hypothesized to be associated with a specific disease. Moreover, if no specific information about the molecular state is available, a gene/protein can have any relation or, if a relation to a disease is negated in a specific sentence, be unrelated to a disease. See section Annotation for the definition of the predefined types of relations.
Table 1 lists the total number of semantic relations in the corpus. Inter-annotator agreement was estimated in a way similar to that used in [1] at the BioCreAtIvE I evaluation conference. We took a small sample of the corpus (5%) and computed the fraction of agreements vs. disagreements. We marked an association as a disagreement if one of the two annotators disagreed [1]. The inter-annotator agreement was estimated to be about 84%.
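The agreement estimate described above can be illustrated with a minimal sketch. It assumes that agreement is the share of sampled associations on which both annotators assigned the same relation type; the function name and the toy labels are hypothetical and are not taken from the corpus.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same label.

    Assumption: each list holds one relation label per sampled association,
    in the same order for both annotators.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labeled item is required")
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreements / len(labels_a)


# Hypothetical toy sample: the annotators differ on the last item only.
annotator_a = ["any", "altered_expression", "unrelated", "genetic_variation", "any"]
annotator_b = ["any", "altered_expression", "unrelated", "genetic_variation", "unrelated"]
print(percent_agreement(annotator_a, annotator_b))
```

This simple ratio does not correct for chance agreement; it only mirrors the agreement-versus-disagreement count described in the text.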

Labeling
The annotation unit was the sentence: if a GeneRIF was composed of several sentences, these sentences were split automatically and annotated independently of each other. In a further preprocessing step these sentences were tokenized, and each token had to be labeled. The labels were a combination of the entity type (i.e. a disease) and the type of relation holding between the gene/protein (key entity) and the disease. In general, GeneRIFs are concise phrases created by domain experts. Another feature of GeneRIF phrases is that each phrase refers to a key entity, a certain Entrez gene/protein. As a consequence, the gene/protein is already given, and all entities mentioned in the phrase encode a relation to the key entity. Thus, the key entity does not have to be labeled.
Note that we use SGML format for labeling, even though the examples here are shown in the MUC (Message Understanding Conference) format. All tokens not inside the disease tags are marked as outside (e.g. "MMP-12/O polymorphisms/O" in SGML). The tokens inside an entity are marked with the type of entity plus the relation holding between the entities. In addition, a flag indicates whether the token is at the beginning of or inside an entity (e.g. Alzheimer/B-gen_var_disease disease/I-gen_var_disease).
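As an illustration of the labeling scheme, the following sketch converts a sentence with inline disease tags into token/label pairs with B-, I- and O markers. It is a sketch under assumptions, not the preprocessing actually used for the corpus: whitespace tokenization, the regular expression, the function name and the label spelling any_disease (modeled on gen_var_disease above) are all hypothetical. The input sentence is example E2 from the Annotation section.

```python
import re

# Assumption: disease entities are marked inline as <disease type="..."> ... </disease>.
TAG_RE = re.compile(r'<disease type="([^"]+)">(.*?)</disease>', re.DOTALL)


def tagged_sentence_to_labels(sentence):
    """Return a list of (token, label) pairs for one tagged sentence.

    Tokens outside disease tags get "O"; the first token of an entity gets
    "B-<relation>_disease" and the remaining tokens "I-<relation>_disease".
    Whitespace tokenization is a simplification.
    """
    labeled = []
    pos = 0
    for match in TAG_RE.finditer(sentence):
        # Tokens before the entity lie outside any disease tag.
        labeled.extend((tok, "O") for tok in sentence[pos:match.start()].split())
        relation = match.group(1)
        for i, tok in enumerate(match.group(2).split()):
            prefix = "B" if i == 0 else "I"
            labeled.append((tok, f"{prefix}-{relation}_disease"))
        pos = match.end()
    # Remaining tokens after the last entity.
    labeled.extend((tok, "O") for tok in sentence[pos:].split())
    return labeled


example_e2 = ('ENG and ALK-1 genes may have roles in <disease type="any"> '
              'hereditary haemorrhagic telangiectasia </disease> in the italian population.')
for token, label in tagged_sentence_to_labels(example_e2):
    print(f"{token}/{label}")
```

The key entity gene is deliberately left unlabeled here, in line with the observation that it is already given by the GeneRIF.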

Annotation
In this section we describe the rules that guided the two annotators through the manual labeling process. The annotation scheme for the associations was defined as follows:

Any:
A pair consisting of a disease and a gene/protein entity is marked as related if the sentence clearly states a role of the gene in the disease, but does not specify whether a certain molecular state of the gene plays a role. Thus, the annotators use this association if there is clearly a role for a gene in a disease, but no explicit description that a specific observed state is linked with the disease.
Example E2: ENG and ALK-1 genes may have roles in <disease type="any"> hereditary haemorrhagic telangiectasia </disease> in the italian population.

Altered expression:
A pair is marked with this association if an unusual or altered expression level of the gene/protein was observed for a certain disease. This means that the unusual expression level is somehow linked with the disease. Even though this could be merely a by-product of the disease, the information could be quite valuable, for instance for identifying new biomarker candidates. It is also possible that at least genes from the same pathway are etiologically linked with the disease.
In what follows, we list the additional issues the human annotators had to keep in mind when labeling the data set:

1. Which tokens shall be tagged as part of a disease entity?
As one can see from example E2, the three tokens hereditary haemorrhagic telangiectasia were labeled as one disease entity, although only the entry for Telangiectasis can be found in the MeSH ontology. We decided to label all additional information as part of the disease entity if the tokens specify the disease more precisely. Note that this tagging of extra information can cause problems when named entities are later mapped to entries of taxonomies/ontologies such as MeSH. However, we believe that these additional disease descriptions represent valuable extra information.

2. What influence do several gene mentions in a sentence have when assigning relations between the key entity gene and an occurring disease?
Recall that the task is to infer the relation holding between the key entity gene given by the GeneRIF and a disease. As noted earlier, if mentions of several different genes occur in the sentence, a more fine-grained relation (e.g. altered expression) can be asserted for the key entity only if there is an explicit statement about the key entity with respect to the disease. If there are specific molecular statements about observed states only for genes other than the key entity, but not for the key entity itself, the entity pair can only be assigned an any or an unrelated relation for that sentence (see example E6).
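The rule for sentences with several gene mentions can be summarized in a small sketch. The relation identifiers are hypothetical spellings of the relation types named in this document, and the boolean input stands for the annotator's judgment on whether the sentence makes an explicit molecular statement about the key entity; neither is part of the original annotation tooling.

```python
# Hypothetical identifiers for the relation types named in the text.
FINE_GRAINED = {"altered_expression", "genetic_variation", "regulatory_modification"}
COARSE = {"any", "unrelated"}


def admissible_relations(key_entity_has_explicit_statement):
    """Relation types an annotator may assign to a (key gene, disease) pair.

    Fine-grained types are admissible only if the sentence makes an explicit
    statement about the key entity itself; molecular statements about other
    genes in the sentence do not count.
    """
    if key_entity_has_explicit_statement:
        return FINE_GRAINED | COARSE
    return set(COARSE)


# A sentence that describes molecular states only for genes other than the key entity:
print(sorted(admissible_relations(False)))
```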

3. How to handle ambiguities of relations?
If ambiguity made it unclear which relation should be assigned to an entity pair, we consulted the PubMed abstract of that GeneRIF. If the abstract clearly stated which category holds for the entity pair, that association was chosen. If a single GeneRIF phrase indeed stated two possible associations for a disease and a gene (e.g. a methylation event and, as a consequence, an altered expression event), we chose the association that comes first from a biological point of view (i.e. the association/event that occurs earlier in the biological machinery).

4. How to handle nested entities?
Our strategy for handling nested entities was twofold: disease mentions in gene/protein names such as breast cancer 2 gene were not tagged, since nothing is stated explicitly about a relation between breast cancer and the key entity gene/protein, whereas disease mentions in cell type names such as prostate cancer cells were labeled.