The language of gene ontology: a Zipf’s law analysis
© Kalankesh et al.; licensee BioMed Central Ltd. 2012
Received: 11 June 2011
Accepted: 15 May 2012
Published: 7 June 2012
Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf’s law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language.
Annotations from the Gene Ontology Annotation project were found to follow Zipf’s law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation.
Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
The gene ontology and annotation
The Gene Ontology (GO) is used extensively in biology. It provides a structured set of concepts that can be used to describe genes and gene products. These concepts are divided into three separate sub-ontologies focused on molecular function (MF), biological process (BP) and cellular component (CC). The GO has now been used to annotate many of the standard databases of genes and gene products. This annotation helps to integrate biological resources across various experimental organisms and different databases [2–4]. The power of the GO annotation is that it allows unambiguous communication of knowledge among biologists as to the functionality of gene products, while at the same time making the biological knowledge computer-comprehensible [3, 4]. GO annotation is undertaken manually, automatically, or by some combination of both. The GO Consortium provides codes that indicate the evidence supporting the association between a specific GO term and a gene product (for example through sequence similarity or direct experimental support). Evidence codes should not be used directly as a measure of annotation quality; they can, however, help inform the level of belief a user might have in the GO terms assigned.
A number of studies have attempted to address issues of annotation quality, for example by looking at the consistency of coding between different annotators. Another study introduced an Annotation Confidence Scoring system for comparing the annotation of genes and gene products to those found in a reference genome set. Others have used the GO evidence codes and term depth in the GO to provide evidence of quality. There is some evidence that sources annotated through manual curation are of higher quality than those annotated automatically, as they are the result of the combined effort of many scientists. None of these methods, however, has addressed the core question of how effective the annotations are in conveying meaning to a wider biological audience. We therefore need methods that determine the extent to which annotation is meeting user requirements. Unfortunately, we have very few ways of judging whether the set of annotations produced to describe a collection of genes or gene products in a database communicates knowledge effectively between the annotator and the end user of those annotations.
Language and the principle of least effort
The GO provides a vocabulary used by annotators to encode information regarding gene product function, information that the wider community then need to decode. The annotation associated with a gene product can be thought of as a sentence made up of words from GO.
It has long been known that natural languages show power-law behaviour. For example, Zipf’s law states that for any sufficiently large corpus, word frequency is approximately inversely proportional to word rank (in which words are ordered by their frequency within the text, the most common ranked first). Indeed, Zipf’s law is regarded as a statistical characteristic of human language [12, 13], and as a wider property of many different complex systems. This pattern has even been observed in a number of extinct and undeciphered languages such as Meroitic, and in the mysterious encrypted 15th century Voynich manuscript.
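Mathematically, this can be written as

$$ f(r) \propto r^{-\alpha} $$

where f(r) is the frequency of the word of rank r and α is the Zipf’s law exponent.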
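The rank-frequency relationship described above can be illustrated with a short Python sketch. It counts how often each term occurs, ranks the terms by frequency, and estimates α as minus the least-squares slope of log frequency against log rank. The term names and counts are invented placeholders, and the log-log regression is only a quick illustration; it is not the fitting procedure used in this study.

```python
import math
from collections import Counter

def rank_frequency(terms):
    """Return (rank, frequency) pairs, with the most frequent term ranked first."""
    counts = sorted(Counter(terms).values(), reverse=True)
    return list(enumerate(counts, start=1))

def zipf_exponent(pairs):
    """Estimate alpha as minus the least-squares slope of log f against log r."""
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Hypothetical toy corpus: frequencies chosen as 120 / rank, i.e. an ideal Zipf law.
corpus = (["TERM_A"] * 120 + ["TERM_B"] * 60 + ["TERM_C"] * 40
          + ["TERM_D"] * 30 + ["TERM_E"] * 24)
pairs = rank_frequency(corpus)
alpha = zipf_exponent(pairs)
print(pairs, round(alpha, 3))
```

Because the toy frequencies are exactly inversely proportional to rank, the estimated exponent is 1 up to floating-point error. Real annotation data are noisy in the low-frequency tail, which is one reason a regression of this kind is usually replaced by a maximum-likelihood fit.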
For typical single author sources in English β is about 2 [18–20]. There can, however, be variations around this value. For example, in the speech of young children β is around 1.6, whereas β > 2 has been found in sets of nouns taken from single author texts. Almost all texts analysed have values of β in the range 1.6 to 2.4. Zipf further argued that the power law behaviour arose from a principle of “least effort” in communication. A communication process can be thought of as having three components: a speaker, a listener and a message. The principle of “least effort” examines the work required from the speaker and the listener in communicating a message [12, 24].
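The text quotes values of β without stating how β relates to the rank exponent α. One common convention, assumed here rather than taken from the source, is that β is the exponent of the distribution of term frequencies, P(f) ∝ f^{-β}. Under that assumption the two exponents are linked as follows:

```latex
% Assumption: P(f) \propto f^{-\beta} is the proportion of words with frequency f.
% The rank of a word with frequency f is the number of words at least as frequent:
r(f) \;\propto\; \int_{f}^{\infty} P(f')\,\mathrm{d}f' \;\propto\; f^{-(\beta - 1)}
% Inverting the rank-frequency law f(r) \propto r^{-\alpha} gives
r(f) \;\propto\; f^{-1/\alpha}
% Equating the two exponents:
\beta = 1 + \frac{1}{\alpha}
```

With α ≈ 1, as in the classic statement of Zipf’s law, this gives β ≈ 2, which is consistent with the value quoted for single author English texts.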
Similarly, we can view annotation as a process of communication. Consider the process of annotating the cellular location of the gene product integrin alpha8. The simplest annotation for the speaker (annotator) to produce is a frequently used (and ambiguous) term such as “cell” (GO:0005623). Such an annotation would, however, push greater effort on to the person using the annotation, the listener. The listener’s job is easiest if the term used is clear and unambiguous, for example “integrin complex” (GO:0008305). This, however, requires significant effort from the speaker in identifying such rarely used GO terms.
Zipf's law and the gene ontology
In this paper we explore the hypotheses that:
GO annotations from complete genomes show power law behaviour;
the exponent of the power law provides insights into the nature of the underlying annotation;
computational linguistic analysis provides insights into the annotation process.
To do this we have retrieved genome annotations from the Gene Ontology Annotation (GOA) project. In particular, the GOA data can be regarded as a gold-standard annotation set, with a significant portion that has been extensively curated by human experts.
Table: Total number of annotations and number of distinct GO identifiers for each of the data sets used in the study, for each of the three sub-ontologies. Columns: total number of annotations; number of distinct GO IDs.
Table: Total number of annotations and number of distinct GO identifiers for the Homo sapiens (Hs) and Mus musculus (Mm) data sets, for each of the three sub-ontologies, by evidence code. Columns: number of distinct GO IDs; total number of annotations.
The cumulative frequency graph is well defined for all values of x, and removes the problem of noise in the low frequency terms.
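As a rough illustration of the cumulative view, the following Python sketch computes, for each distinct term frequency x, the fraction of terms whose frequency is at least x. Treating the "cumulative frequency graph" as this complementary cumulative distribution is an assumption, and the input values are invented placeholders rather than data from the study.

```python
from collections import Counter

def cumulative_distribution(frequencies):
    """Map each distinct value x to the fraction of observations that are >= x."""
    n = len(frequencies)
    counts = Counter(frequencies)
    result = {}
    remaining = n  # number of observations >= the current x
    for x in sorted(counts):
        result[x] = remaining / n
        remaining -= counts[x]
    return result

# Hypothetical usage counts for four terms.
term_frequencies = [1, 1, 2, 3]
ccdf = cumulative_distribution(term_frequencies)
print(ccdf)
```

Unlike a histogram, this curve needs no binning and is defined at every observed value, so the sparse counts among rarely used terms do not produce empty or noisy bins.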
The data on the GO identifier frequencies were therefore analysed using the Matlab packages plfit, plplot and plpva (version 1.0.10, published in January 2010) developed by Clauset and Shalizi. These packages attempt to fit a power law model to the empirical data (represented as a Pareto distribution) and determine the extent to which the data can be effectively modelled using a power law. These tools provide two statistics describing the data. The first is a P-value that is used to determine the extent to which the power law model is appropriate. If the P-value is greater than 0.1 we can regard the power law as a plausible model of our data. The second statistic produced is β, the exponent of the power law.
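The core of such a fit can be sketched in a few lines of Python. The sketch below uses the standard continuous maximum-likelihood estimator for a power law exponent above a fixed lower cut-off. It is not the Matlab code used in the study: the function name `fit_exponent` and the fixed `x_min` are illustrative assumptions, the cut-off is not optimised, no P-value is computed, and term counts are treated as continuous values, which is only an approximation for discrete data.

```python
import math
import random

def fit_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of beta for the tail x >= x_min.

    Returns the estimate and its approximate standard error.
    """
    tail = [x for x in values if x >= x_min]
    n = len(tail)
    beta = 1.0 + n / sum(math.log(x / x_min) for x in tail)
    std_err = (beta - 1.0) / math.sqrt(n)
    return beta, std_err

# Synthetic check: draw samples from a power law with a known exponent
# by inverse-transform sampling, then recover the exponent.
rng = random.Random(0)
true_beta = 2.0
sample = [(1.0 - rng.random()) ** (-1.0 / (true_beta - 1.0)) for _ in range(20000)]
beta_hat, err = fit_exponent(sample, x_min=1.0)
print(round(beta_hat, 2), round(err, 3))
```

Checking the estimator on synthetic data with a known exponent, as here, is a useful sanity test before applying it to annotation counts.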
Annotation and power law behaviour
Some of the most frequently used terms in the annotation data are some of the most generic (low term depth). For example the term GO:0005515 (protein binding) is typically one of the top two most frequent terms in all the MF data analysed and is only two levels down from the root of the molecular function sub-ontology. The top 25% of the most commonly used GO terms for human molecular function have an average depth of 4.6, compared with an average depth of 6.4 for the 25% least commonly used terms. A similar pattern is repeated for all the sub-ontologies in all species examined in this paper (data not shown). This difference is significant (p < 0.001 in a paired t-test), demonstrating that the most commonly used terms are typically less specific (higher in the ontology) than those which are used less frequently (deeper in the ontology).
Results obtained from the power law analysis of each of the data sets characterized in Table 2
Results obtained from power law analysis of each of the data sets characterized in Table 2
Again there is a clear trend visible in these results, with the low confidence data showing consistently lower exponents than the full data set, and the highest exponents being measured for the filtered high confidence data. A paired t-test on the exponents measured from the high confidence and low confidence data indicates that the difference between these data sets is significant (p = 0.01). It is also interesting to note that two of the annotation data sets with lower values of β have P-values < 0.1, i.e. cannot be effectively represented by a power law.
We have used computational linguistics methods to examine a range of gene annotation data sets used to populate genome resources. In almost all cases these data sets obey Zipf’s law, with exponents typical of those for human languages (Table 3). This supports the hypothesis that the GO annotation can be thought of as a language, and that we can think of annotation as a form of communication process with the characteristics of a natural language. This then provides us with a framework in which to look at the effectiveness of the communication process using power laws. For example, we have observed a real and significant difference in the power law exponents measured for annotation using the biological process sub-ontology (β ≈ 2.1) compared with that using the molecular function and cellular component sub-ontologies (β ≈ 1.8).
The measured exponent changes in a predictable and significant way as a function of the evidence codes that have been used to support annotation, but not as a function of the size of the annotation available (Figure 2). However, it is not clear that the absolute value of the exponent can be interpreted as a quality measure; for example, we would not want to state that the BP annotations are of higher quality than those done with the MF and CC ontologies. We therefore need to look more deeply into the linkage between the exponent and information transfer. For example, some insights can be drawn from work on statistical mechanics approaches to understanding the behaviour of language. In this work it is hypothesised that the exponent β is proportional to the “temperature” of the communication system, where temperature is to be interpreted as a “willingness to communicate”. This would therefore imply that the increase we see in the value of β as a function of the annotation source (Table 4) reflects an increasing effort in the communication process. Indeed, this observation has been made previously in a number of studies of human language, in which the value of the exponent has been somewhat controversially linked to communication effectiveness [23, 24, 27, 28]. Similarly, there is a large literature which debates the interpretations that can legitimately be made of the Zipf’s law exponent in linguistics and the extent to which these variations provide insights into communication, whether in whistles between dolphins, the nature of the schizophrenic brain or language in children. In particular, much of this analysis has investigated the ways in which differences in language use, communication effectiveness or brain structure are reflected in the measured exponent.
An inference that might be drawn from the differences in exponents between the various GO sub-ontologies is that the information conveyed by BP is fundamentally more complex than that described by the other two sub-ontologies, capturing the processes in which the molecule is involved, rather than a simple molecular function description, or a location in which the activity takes place. That is, we simply have more to say about process than we do about function and cellular location; the biology is more complex in processes. This might intrinsically require more “willingness to communicate” than is needed to describe aspects of molecular function or cellular component. An anomaly in this analysis is the observed low exponent for the D. rerio BP sub-ontology, from which we might infer that the information content captured in the annotation for biological processes in this model species is lower than that from the other model organisms (as reflected in the significantly smaller number of published papers on D. rerio compared to those of the other model species listed).
One key difference between this analysis and that more generally used in computational linguistics is in the variation of word length. In the GO annotation all words have the same length (the GO Identifier) whereas in natural languages word lengths can vary. A recent paper has revisited one of Zipf’s original observations that word length correlates inversely with frequency. The key finding was that the correlation between word length and information content was better than that between word length and frequency. The analysis presented here, in the rather more controlled environment of genome annotation, has the potential to throw new light on this long-running debate in computational linguistics, as we can separate out the effects of word length and focus specifically on the information content and frequency of terms.
In principle we also believe that the straightforward computational linguistics methods we have applied to GO data in this paper should be more widely applicable to any situation in which data are described using terms from an ontology; for example, medical patient data described using terms from SNOMED-CT. Indeed, we have recently observed very similar Zipf’s law behaviour in a large corpus of primary care general practice data describing patients in Salford (UK) (data not shown).
In this paper we have demonstrated that computational linguistics, in the form of Zipf’s law, provides a powerful and innovative framework in which to examine GO annotation. As hypothesised, the GO annotation does follow Zipf’s law and there is some evidence that the exponent does provide information on the nature of the annotation; for example, it responds in a predictable way as a function of the evidence codes used to support the annotation. An unexpected finding is that the power law exponent of data described using the process sub-ontology is significantly different to that measured for data in the function and component ontologies. We do not know whether this difference is some fundamental feature of the structure of the GO sub-ontologies, the nature of the biology being communicated, or whether it reflects thought processes in the annotation teams. Such an understanding could be useful in helping improve the use of ontologies for annotation.
While other studies have focussed on consistency or depth of annotation for assessing the quality of annotation [7–9], no study has explored the nature of the annotation from the perspective of the communication of information. The method should provide a straightforward technique for assessing corpora described using terms from an ontology in areas beyond just biology and bioinformatics.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig J: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32 (Database issue): D262-D266.
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshal B: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32 (Database issue): D258-D261.
- Rhee SY, Wood V, Dolinski K, Draghici S: Use and misuse of Gene Ontology annotations. Nat Rev Genet. 2008, 9 (7): 509-515. 10.1038/nrg2363.
- Guide to GO Evidence Codes. [http://www.geneontology.org/go.evidence.shtml]
- Gross A, Hartung M, Kirsten T, Rahm E: Estimating the Quality of Ontology-Based Annotation by considering Evolutionary Changes. DILS 2009. Edited by: Paton NW, Missier P, Hedeler C. 2009, Berlin Heidelberg: Springer-Verlag, 81-87.
- Dolan M, Ni L, Camon E, Blake JA: A procedure for assessing GO annotation consistency. Bioinformatics. 2005, 21 (Suppl 1): i136-i143. 10.1093/bioinformatics/bti1019.
- Yang Y, Gilbert D, Kim S: Annotation confidence score for genome annotation: a genome comparison approach. Bioinformatics. 2010, 26 (1): 22-29. 10.1093/bioinformatics/btp613.
- Buza TJ, McCarthy FM, Wang N, Bridge SM, Burgess SC: Gene Ontology annotation quality analysis in model eukaryotes. Nucleic Acids Res. 2008, 36 (2): e12.
- Mulas F, Curk T, Bellazzi R, Zupan B: On quality of different annotation sources for gene expression analysis. Artificial Intelligence in Medicine 2009. Edited by: Combi C, Shahar Y, Abu-Hanna A. 2009, Heidelberg: Springer-Verlag Berlin.
- Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008, 9 (1): R7. 10.1186/gb-2008-9-1-r7.
- Zipf G: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. 1949, Oxford: Addison Wesley.
- Exact Methods in the Study of Language and Text. Edited by: Grzybek P, Köhler R. 2007, Berlin: Walter de Gruyter GmbH & Co.
- Manin DY: Zipf's Law and Avoidance of Excessive Synonymy. Cogn Sci: A Multidisciplinary J. 2008, 32 (7): 1075-1098. 10.1080/03640210802020003.
- Smith R: Investigation of the Zipf-plot of the extinct Meroitic language. Glottometrics. 2007, 15: 53-61.
- Landini G: Evidence of Linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia. 2001, 25 (4): 275-295. 10.1080/0161-110191889932.
- Newman M: Power laws, Pareto distribution and Zipf's law. Contemp Phys. 2005, 46 (5): 323-351. 10.1080/00107510500052444.
- Ferrer i Cancho R, Sole R: Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited. J Quant Linguist. 2001, 8 (3): 165-173. 10.1076/jqul.188.8.131.5201.
- Montemurro M: Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A. 2001, 300: 567-578. 10.1016/S0378-4371(01)00355-7.
- Montemurro M, Zanette D: Frequency-rank distribution of words in large text samples: phenomenology and the mode. Glottometrics. 2002, 4: 87-98.
- Piotrowski RG, Pashkovskii VE, Piotrowski VR: Psychiatric linguistic and automatic text processing. Automatic Doc Math Linguist. 1995, 28 (5): 28-35.
- Balasubrahmanyan VK, Naranan S: Quantitative linguistics and complex system studies. J Quant Linguist. 1996, 3 (3): 177-228. 10.1080/09296179608599629.
- Ferrer i Cancho R: The variation of Zipf's law in human language. The European Phys J B. 2005, 44: 249-257. 10.1140/epjb/e2005-00121-8.
- Ferrer i Cancho R, Sole R: Least Effort and the origins of scaling in human language. PNAS. 2003, 100 (3): 788-791. 10.1073/pnas.0335980100.
- Clauset A, Shalizi C, Newman M: Power law distribution in empirical data. SIAM Rev. 2009, 51 (4): 661-703. 10.1137/070710111.
- Kosmidis K, Kalampokis A, Argyrakis P: Statistical mechanical approach to human language. Phys A: Stat Mechanics Appl. 2006, 366: 495-502.
- Ferrer i Cancho R: Decoding least effort and scaling in signal frequency distributions. Phys A: Stat Mechanics Appl. 2005, 345: 275-284.
- Ferrer i Cancho R: Zipf's law from a communicative phase transition. The European Phys J B. 2005, 47 (3): 449-457. 10.1140/epjb/e2005-00340-y.
- McCowan B, Doyle LR, Jenkins JM, Hanser SF: The appropriate use of Zipf’s law in animal communication studies. Anim Behav. 2005, 69: F1-F7. 10.1016/j.anbehav.2004.09.002.
- Ferrer i Cancho R, McCowan B: A Law of Word Meaning in Dolphin Whistle Types. Entropy. 2009, 11: 688-701. 10.3390/e11040688.
- Ferrer i Cancho R: When language breaks into pieces. A conflict between communication through isolated signals and language. Biosystems. 2006, 84: 242-253. 10.1016/j.biosystems.2005.12.001.
- Mayor J, Plunkett K: Vocabulary explosion: are infants full of Zipf?. Proceedings of the 32nd Annual Meeting of the Cognitive Science Society. 2010, Cognitive Science Society.
- Piantadosi ST, Tily H, Gibson E: Word lengths are optimized for efficient communication. PNAS. 2011, 108 (9): 3526-3529. 10.1073/pnas.1012551108.
- Zipf G: The Psychobiology of Language. 1936, London: Routledge.
- Cornet R, de Keizer N: Forty years of SNOMED: a literature review. BMC Medical Informatics and Decision Making. 2008, 8 (Suppl 1): S2. 10.1186/1472-6947-8-S1-S2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.