
Table 5 Comparative evaluation of cdf-based predictions with other commonly used IR methods

From: BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

Method columns (cdf, ctf, ctf-icdf, Stemming, Synonyms) give the number of characteristics (%).

| Category | Best AUC | cdf | ctf | ctf-icdf | Stemming | Synonyms |
|---|---|---|---|---|---|---|
| AMH major classes | > 0.80 | 20 (100) | 20 (100) | 20 (100) | 19 (95) | 20 (100) |
| | > 0.90 | 19 (95) | 17 (85) | 18 (90) | 17 (85) | 18 (90) |
| | > 0.95 | 12 (60) | 12 (60) | 15 (75) | 12 (60) | 11 (55) |
| AMH minor classes | > 0.80 | 135 (69) | 106 (54)* | 157 (80) | 152 (77) | 156 (79) |
| | > 0.90 | 123 (62) | 100 (51) | 150 (76)* | 145 (74) | 151 (77)* |
| | > 0.95 | 114 (58) | 92 (47) | 144 (73)* | 142 (72)* | 143 (73)* |
| AMH adverse events | > 0.80 | 159 (67) | 148 (62) | 173 (73) | 153 (64) | 155 (65) |
| | > 0.90 | 86 (36) | 88 (37) | 100 (42) | 84 (35) | 84 (35) |
| | > 0.95 | 41 (17) | 42 (18) | 55 (23) | 44 (18) | 44 (18) |
| PKIS perpetrator | > 0.80 | 7 (47) | 6 (40) | 7 (47) | 12 (80) | 10 (67) |
| | > 0.90 | 5 (33) | 5 (33) | 4 (27) | 3 (20) | 4 (27) |
| | > 0.95 | 2 (13) | 2 (13) | 2 (13) | 3 (20) | 3 (20) |
| Narrow therapeutic index drugs | > 0.80 | 9 (64) | 9 (64) | 11 (79) | 10 (71) | 10 (71) |
| | > 0.90 | 8 (57) | 7 (50) | 10 (71) | 9 (64) | 10 (71) |
| | > 0.95 | 5 (36) | 6 (43) | 9 (64) | 9 (64) | 10 (71) |
| Overall | > 0.80 | 330 (68) | 289 (60)* | 368 (76)* | 346 (71) | 351 (73) |
| | > 0.90 | 241 (50) | 217 (45) | 282 (58)* | 258 (53) | 267 (55) |
| | > 0.95 | 174 (36) | 154 (32) | 225 (46)* | 210 (43) | 211 (44) |

The numbers in this table are the counts (percentages) of characteristics that achieved an AUC above the given thresholds in stratified cross-validation evaluations. For each method, the results from the best of 4 algorithms were compared. The AUC thresholds can be interpreted as good (> 0.8), very good (> 0.9), and excellent (> 0.95), respectively. Entries labelled (*) indicate a significantly better or worse performance than cdf for predicting drug characteristics. Fisher's exact tests were applied to 2 × 2 tables with α = 0.05, adjusted for a family of four comparisons using the Bonferroni method. The numbers in boldface indicate the best performing method(s) for each characteristic category above the AUC = 0.8 threshold. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens calculated by retrieving abstracts with both generic and trade names for a given drug.
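The significance marking described in the footnote can be illustrated with a minimal sketch, which is not the authors' code. It compares cdf against each other method for the AMH minor classes row at AUC > 0.80, using a two-sided Fisher's exact test on a 2 × 2 table (above threshold vs. not, cdf vs. other method) and a Bonferroni-adjusted α of 0.05 / 4. Two things are assumptions here: the category size of 197 characteristics is inferred from the reported counts and percentages and is not stated in the table, and the two-sided form of the test is assumed. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def pmf(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = (pmf(x) for x in range(lo, hi + 1))
    # Sum over every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= p_observed * (1 + 1e-7)))


ALPHA = 0.05
N_COMPARISONS = 4                    # family of four comparisons against cdf
alpha_adj = ALPHA / N_COMPARISONS    # Bonferroni-adjusted threshold

N_TOTAL = 197   # ASSUMPTION: category size inferred from counts and percentages
cdf_hits = 135  # AMH minor classes, AUC > 0.80, cdf column
other_hits = {"ctf": 106, "ctf-icdf": 157, "Stemming": 152, "Synonyms": 156}

results = {}
for method, hits in other_hits.items():
    p = fisher_exact_two_sided(cdf_hits, N_TOTAL - cdf_hits,
                               hits, N_TOTAL - hits)
    results[method] = (p, p < alpha_adj)
    print(f"cdf vs {method}: p = {p:.4f}, significant = {p < alpha_adj}")
```

Under these assumptions, only differences whose p-value falls below the adjusted threshold of 0.0125 would be flagged, which is stricter than the unadjusted 0.05 level.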