Figure 4. Comparative performance of cdf with other commonly employed IR methods.
From: BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

This figure plots predictive performance, measured as the cross-validation AUC, against the cumulative number of characteristics (out of 484) for each of the commonly employed methods.

Notes:
(1) ctf: an average of 1,520 tokens per drug (IQR: 598-3,781 tokens) was retained after elimination.
(2) Stemming: applying the stemming algorithm reduced the number of tokens by 20% (median: 973 tokens per drug, IQR: 455-1,144 tokens). In-depth investigation revealed, however, that stemming did not always group concepts consistently (see Additional file 2 for further discussion).
(3) Drug synonyms: compared with MEDLINE searches using only generic names, searches that also used trade names retrieved a mean of 3.4% more results (IQR: 0-0.54% more abstracts).

The dotted lines indicate AUCs of 0.8 and 0.9, respectively.

Abbreviations of the method names:
cdf: conditional document frequency
ctf: conditional term frequency
ctf-icdf: conditional term frequency-inverse conditional document frequency
Stemming: cdf of tokens reduced by Porter's stemming algorithm
Synonyms: cdf of tokens generated by retrieving abstracts with both generic and trade names for a given drug
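
The curves in this figure relate an AUC level to the number of characteristics that reach it. As a minimal sketch of how such counts could be tabulated, assuming one cross-validation AUC per characteristic and per method is already available, the Python below counts the characteristics at or above each dotted-line threshold. The function name `cumulative_counts` and the AUC values are hypothetical illustrations, not the authors' code or data.

```python
from typing import Dict, List, Sequence


def cumulative_counts(aucs: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """For each threshold, count the characteristics whose AUC is at least that value."""
    return [sum(1 for a in aucs if a >= t) for t in thresholds]


# Hypothetical per-characteristic AUCs for two methods (NOT values from the paper).
example_aucs: Dict[str, List[float]] = {
    "cdf": [0.95, 0.91, 0.85, 0.78, 0.62],
    "ctf": [0.92, 0.84, 0.79, 0.70, 0.55],
}

# The two reference levels drawn as dotted lines in the figure.
thresholds = [0.9, 0.8]

curves = {
    method: cumulative_counts(aucs, thresholds)
    for method, aucs in example_aucs.items()
}

for method, counts in curves.items():
    for t, n in zip(thresholds, counts):
        print(f"{method}: {n} characteristics with AUC >= {t}")
```

With real data, each list would hold 484 AUCs, and evaluating a finer grid of thresholds would trace out one curve per method.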