Figure 4 | BMC Bioinformatics

Figure 4

From: BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

Comparative performance of cdf and other commonly employed IR methods. This figure plots predictive performance, measured as the cross-validation AUC, against the cumulative number of characteristics (out of 484) for each of the commonly employed methods. Notes: (1) ctf: an average of 1,520 tokens per drug (IQR: 598-3,781 tokens) were retained after elimination. (2) Stemming: applying the stemming algorithm reduced the number of tokens by 20% (median: 973 tokens per drug; IQR: 455-1,144 tokens). Closer investigation revealed, however, that stemming did not always group concepts consistently (see Additional file 2 for further discussion). (3) Drug synonyms: compared with MEDLINE searches using only generic names, searches that included trade names retrieved a mean of 3.4% more abstracts (IQR: 0-0.54% more abstracts). The dotted lines indicate AUCs of 0.8 and 0.9, respectively. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens generated by retrieving abstracts with both generic and trade names for a given drug.
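As a reading aid, the two quantities on the axes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that AUC is computed in the usual rank-based way and that the "cumulative number of characteristics" at a given AUC level means the number of characteristics whose cross-validated AUC is at or above that level. The function names `auc` and `cumulative_count` and the toy values are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cumulative_count(aucs: Sequence[float], threshold: float) -> int:
    """Number of characteristics with AUC at or above the threshold
    (assumed meaning of the figure's cumulative axis)."""
    return sum(a >= threshold for a in aucs)


# Toy, made-up per-characteristic AUCs; the paper evaluates 484 characteristics.
toy_aucs = [0.95, 0.91, 0.85, 0.79, 0.62]

# Counts at the two dotted reference lines mentioned in the caption.
counts = {t: cumulative_count(toy_aucs, t) for t in (0.8, 0.9)}
print(counts)
```

Under this reading, a method whose curve lies further along the cumulative axis at the 0.8 and 0.9 lines predicts more characteristics at that level of accuracy.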
