Evaluation of a large-scale biomedical data annotation initiative
© Lacson et al; licensee BioMed Central Ltd. 2009
Published: 17 September 2009
This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.
There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories – breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures.
We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
Large repositories of gene expression data are currently available and serve as online resources for researchers, including the Gene Expression Omnibus (GEO), the Center for Information Biology Gene Expression Database (CIBEX), the European Bioinformatics Institute's ArrayExpress and the Stanford Tissue Microarray Database [1–4]. Repositories for gene expression data such as GEO allow for widespread distribution of gene expression measurements in order to: (1) validate experimental results, (2) enable progressive accumulation of data that may support, modify or further develop prior work, and (3) facilitate use of archived measurements to generate novel hypotheses that naturally develop from continuous updating of accumulated data. Although GEO contains a vast number of measurements from numerous samples, the link between measurements and phenotypic characteristics of each individual sample, including the sample's disease and tissue type, is not readily accessible because these characteristics are encoded as free text. Furthermore, there are no standardized documentation rules, so phenotypic and/or protocol information resides in multiple documents and physical locations. Such information may be included in text describing the experiment or protocol, in sample and sampling descriptions, or only in the published journal article that may accompany the submission. In order to increase utility and improve ease of use of this resource, data should be readily available and easily comprehensible, not only for researchers, but also for automatic retrieval. In particular, the data have to contain sufficient detail to allow for appropriate combination of similar experimental subjects and protocols that may then collectively facilitate the verification, support, or development of new hypotheses.
Many centers have focused on re-annotating biomedical data with the goal of increasing utility for researchers. The promise of fast-paced annotation amid rapid accumulation of data has spurred great interest in progressive development of automated methods [4, 5]. To date, manually annotated data is the de facto gold standard for most annotation efforts [4, 5]. Therefore, it becomes critical to ensure that manually annotated data are accurately described and evaluated.
[Figure: (a) taken from GDS showing three axes – "cell line", "disease state", and "stress" with corresponding values; (b) taken from GDS showing cell line descriptors.]
It is not surprising, therefore, that re-annotating GEO and other large microarray data repositories is the focus of several groups. In particular, automatic text processing is being used to capture disease states corresponding to a given sample from GDS annotations. In a recently published article in which the objective was to identify disease and control samples within an experiment, the GDS subsets were analyzed using representative text phrases and algorithms for negation and lexical variation. Although this algorithm was successful in identifying 62% of controls, the study was evaluated using only 200 samples, and it highlighted an urgent need for a methodical solution for annotating GEO using a controlled vocabulary. Another study performed re-annotation of the Stanford Tissue Microarray Database using the National Cancer Institute (NCI) thesaurus. The authors were successful in representing annotations for 86% of the samples with 86% precision and 87% recall, but the study was evaluated using only 300 samples. While diagnosis remains one of the most useful annotation points for a given experimental sample, there are many more categories of interest to investigators and users, for example treatment interventions, sample demographics (e.g. age, gender, race), and various phenotypic information that affects gene expression. A re-annotation of these rapidly growing repositories has to take into account all these variables and use a controlled vocabulary for identifying sample variables and values.
We therefore describe a large-scale manual re-annotation of data samples in GEO, including variable fields derived from the NCI thesaurus and corresponding values that also utilize primarily controlled terminology. The objective is to create an annotation scheme for various disease states that is flexible, comprehensive and scalable. We subsequently present a framework for evaluating the annotation structure by measuring coverage and agreement between annotators.
Three sections below specifically: (1) enumerate the iterative process used for developing an annotation structure, (2) describe the annotation tool and the annotators' characteristics, and (3) describe the framework for evaluation.
An iterative process was designed for identifying the variables selected for annotation, as follows:
1. Variable generation – Human experts developed a list of variables for annotation. This procedure was based on guidelines and publications related to the disease category. Variables were then trimmed based on consensus among three physicians.
2. Supervised domain annotation – A trained annotator was instructed to start annotating the given variables under physician supervision. Whenever a new variable deemed important was identified, it was listed for further deliberation. The process was then repeated, returning to step 1 above, until no further variables were identified or the number of samples set aside for preliminary annotation was reached (i.e. 10% of the total samples for annotation within each domain).
3. Unsupervised annotation – A trained human annotator then performed unsupervised annotation independently, after receiving a standardized, written instruction protocol. Instructions were specifically developed for each disease category. Two human annotators were assigned to code each data sample. Randomized assignment between annotators was performed by disease category to minimize the occurrence of two coders being assigned to annotate the same disease category (and therefore the same samples) repeatedly.
4. Disagreement and partial agreement identification – After the human annotators finished coding their assigned experiments, the data was compiled and the assigned values were compared to measure agreement. The method to assess agreement is further described below.
5. Re-annotation – Finally, the samples containing values that were not in agreement initially were re-annotated and the correct annotation was determined by a majority vote. In the event of a three-way tie, one of the investigators performed a manual review and final adjudication.
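As an illustration only, the adjudication rule in step 5 could be sketched as follows. The function name `adjudicate` and the example values are hypothetical, and the sketch assumes that each disputed variable ends up with a small list of candidate values (for instance the two original annotations plus one re-annotation); the source does not specify how the votes were collected.

```python
from collections import Counter
from typing import Optional, Sequence


def adjudicate(values: Sequence[str]) -> Optional[str]:
    """Return the majority value, or None when no single value leads.

    None signals that an investigator has to review the sample manually.
    """
    counts = Counter(values).most_common()
    if not counts:
        return None
    # A tie exists when the two most frequent values share the same count.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical candidate values for one variable of one sample.
print(adjudicate(["breast", "breast", "colon"]))   # breast
print(adjudicate(["breast", "colon", "unknown"]))  # None
```

Returning `None` instead of picking an arbitrary winner keeps the tie case explicit, so that it can be routed to manual review.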
The variable "tissue" was assigned several different values, one of which was "breast." This assignment provided flexibility, allowing for the addition of other tissue types whenever the disease domain changes. There was also sufficient granularity to allow the database to be queried for future hypothesis generation or validation. A full description of the web-based annotation tool and the quantity of samples annotated over time is provided in a separate paper.
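The variable and value structure described for "tissue" can be pictured as a mapping from each variable to a set of permitted values that grows as new disease domains are added. The sketch below is an assumption about one possible representation, not the data model of the annotation tool; the names `scheme`, `add_values` and `is_valid` are hypothetical, and the real scheme draws its terms from the NCI Thesaurus.

```python
from typing import Dict, Iterable, Set

# Hypothetical controlled value sets for illustration.
scheme: Dict[str, Set[str]] = {"tissue": {"breast", "unknown"}}


def add_values(scheme: Dict[str, Set[str]], variable: str, values: Iterable[str]) -> None:
    """Extend the permitted values of a variable, creating it if needed."""
    scheme.setdefault(variable, set()).update(values)


def is_valid(scheme: Dict[str, Set[str]], variable: str, value: str) -> bool:
    """Check whether a value is permitted for the given variable."""
    return value in scheme.get(variable, set())


# Adding a new disease domain only requires adding its tissue types.
add_values(scheme, "tissue", ["colon"])
print(is_valid(scheme, "tissue", "colon"))  # True
print(is_valid(scheme, "tissue", "liver"))  # False
```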
Evaluation of annotations
There were a total of six annotators, including four senior biology students, one graduate student in the biological sciences field, and one physician. As noted previously, each sample had at least two annotators assigning values to variables. The annotation task was to provide phenotypic information for each data sample that was available in GEO for breast and colon cancer, IBD, DM, SLE, and RA. Thus, it was critical to obtain standardized values for most of the annotation variables to ensure that the annotations would be consistent. This entailed a review of data descriptions listed in various sources – the data sets (GDS), series information (GSE) and sample information (GSM). In addition, information was available in supplementary files and in published scientific articles, which are not in GEO. Manual review of all these data sources was necessary to obtain sufficient variable coverage. Coverage was defined as the percentage of non-'unknown' values that were assigned to a variable. Specifically, it can be represented as:
Coverage = X/Y, where X represents the number of variables with values that are not "unknown" and Y represents the total number of variables that were annotated.
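A minimal sketch of this calculation is given below. It assumes that unknown values are recorded with the literal string "unknown" and that comparison is case-insensitive; the function name `coverage` and the example values are hypothetical. The same function can be applied to all values of one variable across samples, or to all variables of one sample.

```python
from typing import Iterable

UNKNOWN = "unknown"  # assumed marker for a value that could not be determined


def coverage(values: Iterable[str]) -> float:
    """Fraction of annotated values that are not 'unknown' (X / Y)."""
    values = list(values)
    if not values:
        raise ValueError("no annotated values")
    known = sum(1 for v in values if v.strip().lower() != UNKNOWN)  # X
    return known / len(values)  # X / Y


# Hypothetical values of the variable "tissue" across four samples.
tissue_values = ["breast", "breast", "unknown", "colon"]
print(f"{coverage(tissue_values):.0%}")  # 75%
```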
Criteria for measuring agreement
Strict: Exactly the same variable value between annotators.
Semantic: There is lexical discordance, but the words match to the same concept. This subsumes hierarchical similarity.
Partial: Partial agreement, some degree of discordance.
To validate the reliability of the annotation scheme, we computed the percentage of agreement between annotators, defined as the number of variables for which both annotators gave the same value, divided by the total number of variables that were annotated. We calculated percentage agreement for each level of similarity across all disease categories.
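The percentage agreement at each level of similarity could be computed along the following lines. This is a sketch under stated assumptions rather than the procedure used in the study: the concept table `CONCEPTS` is hypothetical and stands in for a lookup against a controlled vocabulary, hierarchical similarity is not modelled, and partial agreement is approximated by the two values sharing at least one token. The function names are also hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical term-to-concept table (illustrative labels, not NCI Thesaurus IDs).
CONCEPTS: Dict[str, str] = {
    "breast cancer": "concept:breast_cancer",
    "breast carcinoma": "concept:breast_cancer",
}


def agreement_level(a: str, b: str) -> str:
    """Classify a pair of values as 'strict', 'semantic', 'partial' or 'none'."""
    if a == b:
        return "strict"  # exact string match
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    if CONCEPTS.get(a_norm, a_norm) == CONCEPTS.get(b_norm, b_norm):
        return "semantic"  # different strings, same concept
    if set(a_norm.split()) & set(b_norm.split()):
        return "partial"  # assumed proxy: at least one shared token
    return "none"


def percent_agreement(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Cumulative percentage agreement at each similarity level."""
    if not pairs:
        raise ValueError("no annotated variables")
    levels = [agreement_level(a, b) for a, b in pairs]
    n = len(levels)
    strict = levels.count("strict")
    semantic = strict + levels.count("semantic")
    partial = semantic + levels.count("partial")
    return {
        "strict": 100 * strict / n,
        "semantic": 100 * semantic / n,
        "semantic+partial": 100 * partial / n,
    }


# Hypothetical value pairs from two annotators, one pair per variable.
pairs = [
    ("breast", "breast"),
    ("breast cancer", "breast carcinoma"),
    ("T4b N2a M0", "T4b N2a M3b"),
    ("male", "female"),
]
print(percent_agreement(pairs))
# {'strict': 25.0, 'semantic': 50.0, 'semantic+partial': 75.0}
```

Reporting the levels cumulatively mirrors the way strict agreement is compared with the more lenient semantic and partial measures.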
[Table: Disease categories annotated from GEO. Entries: Inflammatory Bowel Disease (IBD); Insulin Dependent Diabetes Mellitus (DM); Rheumatoid Arthritis (RA); Systemic Lupus Erythematosus (SLE).]
[Table: Sample variables that are annotated for three disease categories – breast and colon cancer and rheumatoid arthritis. Entries: Past breast cancer; Degree of differentiation.]
[Table: Coverage of the top ten variables. Column headings: Top Ten Variables; NCI Thesaurus ID.]
Semantic + Partial
[Table: Disagreement between Annotators. Entries: T4b N2a M0; T4b N2a M3b.]
Repositories for gene expression data such as GEO are expanding very rapidly. However, the critical details necessary for understanding the experiments and sample information are encoded as free text and are not readily available for analysis. We described a large-scale re-annotation performed on a substantial portion of GEO, consisting of 12,500 samples. This re-annotation was accomplished within a reasonable amount of time: it was completed in only five weeks. In addition, we were able to annotate samples in great detail. The annotations used controlled terminology from the NCI thesaurus, with the advantage of allowing generalizability of the annotations for other research applications.
This study's re-annotation evaluation was performed on sample quantities that are two orders of magnitude larger than those in most prior reports [4, 5, 12]. A major contribution of this research effort is the large amount of well-annotated data, with substantial coverage for a large number of phenotypic variables and with excellent accuracy, particularly at the semantic level.
We also described the methodology used for identifying relevant variables in each disease category. This iterative process is efficient and provided a mechanism for identifying relevant variables for domain categories. This technique provides a framework for inducing structure of a specific domain in an iterative and consultative manner. Excellent inter-annotator agreement confirmed that the annotation variables were robust and easily identifiable.
Finally, we provided a framework for measuring inter-annotator agreement. Apart from strict agreement measured using exact string matching between variable values, we defined and considered two other similarity categories that were known to be especially useful for annotations that relied heavily on free text. We showed an improvement in agreement using these more lenient similarity measures. The degree of improvement was mitigated by the very controlled terminology from the NCI Thesaurus that annotators utilized, and was augmented by the annotation tool. Several studies use semantic similarity as a measurement of agreement in annotation of microarray data [4, 5]. Several other studies use partial agreement, especially when annotated text contains fragments that are not exactly similar [12, 14]. Manual curation is usually the gold standard and determines whether terms that were used are semantically appropriate or not. Our results show better strict, semantic, and partial agreement compared to most other re-annotation studies [12, 16].
Phenotypic annotations and data sample information are critically important for translational research. In particular, it is important to have good coverage of vital information specific to the clinical domain, and to provide accurate annotations. We show that it is possible to perform manual re-annotation of a large repository in a reliable and efficient manner.
The authors would like to thank the annotators who worked diligently on this project: Evelyn Pitzer, Pierre Cornell, Karrie Du, Lindy Su and Anthony Villanova. Galante was funded by grant D43TW007015 from the Fogarty International Center, NIH. This work was funded in part by grant FAS0703850 from the Komen Foundation.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
1. Barrett T, Edgar R: Reannotation of array probes at NCBI's GEO database. Nat Methods 2008, 5(2):117. 10.1038/nmeth0208-117b
2. Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y: CIBEX: center for information biology gene expression database. C R Biol 2003, 326(10–11):1079–1082. 10.1016/j.crvi.2003.09.034
3. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A: ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2005, (33 Database):D553–5.
4. Shah NH, Rubin DL, Supekar KS, Musen MA: Ontology-based annotation and query of tissue microarray data. AMIA Annu Symp Proc 2006, 709–713.
5. Dudley J, Butte AJ: Enabling integrative genomic analysis of high-impact human diseases through text mining. Pac Symp Biocomput 2008, 580–591.
6. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, et al.: NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res 2005, (33 Database):D562-D566.
7. de Coronado S, Haber MW, Sioutos N, Tuttle MS, Wright LW: NCI Thesaurus: using science-based terminology to integrate cancer research results. Stud Health Technol Inform 2004, 107(Pt 1):33–37.
8. Lee HW, Park YR, Sim J, Park RW, Kim WH, Kim JH: The tissue microarray object model: a data model for storage, analysis, and exchange of tissue microarray experimental data. Arch Pathol Lab Med 2006, 130(7):1004–1013.
9. Lindberg DA, Humphreys BL, McCray AT: The Unified Medical Language System. Methods Inf Med 1993, 32(4):281–291.
10. Pitzer E, Lacson R, Hinske C, Kim J, Galante P, Ohno-Machado L: Large scale sample annotation of GEO. AMIA Summit in Translational Bioinformatics 2009.
11. Fan JW, Friedman C: Semantic classification of biomedical concepts using distributional similarity. J Am Med Inform Assoc 2007, 14(4):467–77. Epub 2007 Apr 25. 10.1197/jamia.M2314
12. Wilbur WJ, Rzhetsky A, Shatkay H: New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 2006, 7:356. 10.1186/1471-2105-7-356
13. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 2009, (37 Database):D885–90. Epub 2008 Oct 21. 10.1093/nar/gkn764
14. Pevzner L, Hearst M: A critique and improvement of an evaluation metric for text segmentation. Association for Computational Linguistics 2002, 28(1).
15. Fung KW, Hole WT, Nelson SJ, Srinivasan S, Powell T, Roth L: Integrating SNOMED CT into the UMLS: an exploration of different views of synonymy and quality of editing. J Am Med Inform Assoc 2005, 12(4):486–94. Epub 2005 Mar 31. 10.1197/jamia.M1767
16. Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, et al.: An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 2005, 6(Suppl 1):S17. 10.1186/1471-2105-6-S1-S17
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.