Discovery of novel biomarkers and phenotypes by semantic technologies

BMC Bioinformatics

Table 4 Precision and recall

*Benchmark*	*Benchmark corpus*	*InfoCodex corpus*	*Precision*	*Recall*
I2E raw	PubMed	PubMed	(exact)	(exact)
			<1% obesity	5% obesity
			3-5% diabetes	9-11% diabetes
			3-7% MDOB	7% MDOB
I2E normalized	PubMed	PubMed	(exact)	(exact)
I2E normalized	PubMed	PubMed	3-7% MDOB	3-7% MDOB
I2E manual	PubMed	PubMed	1-5% obesity	9-33% obesity
			3-11% diabetes	9-31% diabetes
			3-26% MDOB	4-15% MDOB
UMLS + GO + OMIM	UMLS + GO + OMIM	PubMed	1-4%	3-22%
UMLS + GO + OMIM	UMLS + GO + OMIM	PubMed	1-8% (unary)	4-35% (unary)
Thomson Reuters	Thomson Reuters	PubMed	7-36% obesity	36% obesity
			7-36% obesity	18% DM2
			9-49% DM2	22% DM1
			9-49% DM2	25% DI
TGI	TGI	PubMed	0-5% obesity	(exact) 2.5%
			0-4% diabetes
			1-14% MDOB
I2E manual	PubMed	ClinicalTrials.gov	(preferred terms) 27-59%	(preferred terms) 3-7%
UMLS + GO + OMIM	UMLS + GO + OMIM	ClinicalTrials.gov	(preferred terms) 1-2%	(preferred terms) <1%
I2E manual	PubMed	Merck internal	(preferred terms) 8-14%	(preferred terms) 1-2%
UMLS + GO + OMIM	UMLS + GO + OMIM	Merck internal	(preferred terms) <1%	(preferred terms) <1%

Precision and recall of InfoCodex candidate biomarkers/phenotypes compared to various benchmarks. “(exact)” and “(preferred terms)” refer to sub-ranges according to the 2x2 matching matrix described in the text under “Methods - Precision/recall”. “MDOB” refers to the InfoCodex output subset containing references to the 27 Merck D&O biomarkers. “(unary)” means that all InfoCodex candidate biomarkers/phenotypes were lumped together across obesity, diabetes, and MDOB, in contrast to the default binary criterion for matching.
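The difference between the binary and unary criteria can be illustrated with a minimal sketch. It assumes that binary matching compares (term, disease) pairs and that unary matching compares terms alone after pooling across disease categories; this reading, the function name `precision_recall`, and the toy terms are all hypothetical and are not taken from the study's data or from the 2x2 matching matrix defined in the Methods.

```python
def precision_recall(candidates, benchmark):
    """Return (precision, recall) of a candidate set against a benchmark set."""
    candidates, benchmark = set(candidates), set(benchmark)
    hits = candidates & benchmark  # exact matches only
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(benchmark) if benchmark else 0.0
    return precision, recall


# Hypothetical toy data: (term, disease) pairs, not from the paper.
candidates = {
    ("leptin", "obesity"),
    ("adiponectin", "diabetes"),
    ("ghrelin", "obesity"),
    ("leptin", "diabetes"),
}
benchmark = {
    ("leptin", "obesity"),
    ("insulin", "diabetes"),
    ("adiponectin", "obesity"),
}

# Binary criterion (assumed): term and disease must both match.
binary = precision_recall(candidates, benchmark)

# Unary criterion (assumed): pool terms across diseases, then match on term only.
unary = precision_recall(
    {term for term, _ in candidates},
    {term for term, _ in benchmark},
)

print("binary:", binary)
print("unary:", unary)
```

Under these assumptions, pooling can only keep or add matches, which is consistent with the unary ranges in the table being somewhat wider than the corresponding binary ones.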

ISSN: 1471-2105