Table 1 Document sets (corpora) used in this work

From: Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

| No. | ID | Name | # of docs | Size (MB) | Type | Description |
|---|---|---|---|---|---|---|
| 1 | RN | RegulonDB Network References | 724 | 24.9 | full-text | Full-text papers from the RegulonDB database references that curators have identified as referring specifically to the regulatory network, as opposed to those referring to other objects from the database. |
| 2 | RP | RegulonDB papers | 2,475 | 99 | full-text | Full-text papers from the complete RegulonDB references that we were able to access and download. |
| 3 | RA | RegulonDB Abstracts | 3,075 | 3.3 | abstracts | Abstracts from the complete RegulonDB references, as of June 2006. |
| 4 | RS | RegulonDB search strategies | 12,059 | 12.3 | abstracts | Corpus generated by using the RegulonDB curator's search strategies, without any subsequent filtering. |
| 5 | EA | EcoCyc Abstracts | 13,334 | 14.4 | abstracts | Abstracts from references in the 2006 EcoCyc database, which describes the genome and the biochemical machinery of E. coli. |
| 6 | ST | STRING-IE | 58,312 | 10.7 | sentences | Corpus of distinct sentences generated by the STRING-IE team by searching PubMed for "E. coli" (and synonyms) and two gene/protein names in the same abstract, from 195,000 abstracts. |
  1. Description of the full-text and abstract corpora used for the extraction of regulatory interactions. The document sets are based on PubMed searches and on reference lists from database curation efforts.