Table 1 Document sets (corpora) used in this work

| No. | ID | Name | # of docs | Size (MB) | Type | Description |
|---|---|---|---|---|---|---|
| 1 | RN | RegulonDB Network References | 724 | 24.9 | full-text | Full-text papers from the RegulonDB database references that curators have identified as referring specifically to the regulatory network, as opposed to other objects in the database. |
| 2 | RP | RegulonDB papers | 2,475 | 99 | full-text | Full-text papers from the complete RegulonDB references that we were able to access and download. |
| 3 | RA | RegulonDB Abstracts | 3,075 | 3.3 | abstracts | Abstracts from the complete RegulonDB references, as of June 2006. |
| 4 | RS | RegulonDB search strategies | 12,059 | 12.3 | abstracts | Corpus generated with the RegulonDB curators' search strategies, without any subsequent filtering. |
| 5 | EA | EcoCyc Abstracts | 13,334 | 14.4 | abstracts | Abstracts from references in the 2006 EcoCyc database, which describes the genome and the biochemical machinery of E. coli. |
| 6 | ST | STRING-IE | 58,312 | 10.7 | sentences | Corpus of distinct sentences generated by the STRING-IE team by searching PubMed for abstracts containing "E. coli" (or a synonym) and two gene/protein names, drawn from 195,000 abstracts. |
