Social tagging in the life sciences: characterizing a new metadata resource for bioinformatics
© Good et al. 2009
Received: 11 March 2009
Accepted: 25 September 2009
Published: 25 September 2009
Academic social tagging systems, such as Connotea and CiteULike, provide researchers with a means to organize personal collections of online references with keywords (tags) and to share these collections with others. One of the side-effects of the operation of these systems is the generation of large, publicly accessible metadata repositories describing the resources in the collections. In light of the well-known expansion of information in the life sciences and the need for metadata to enhance its value, these repositories present a potentially valuable new resource for application developers. Here we characterize the current contents of two scientifically relevant metadata repositories created through social tagging. This investigation helps to establish how such socially constructed metadata might be used as it stands currently and to suggest ways that new social tagging systems might be designed that would yield better aggregate products.
We assessed the metadata that users of CiteULike and Connotea associated with citations in PubMed with the following metrics: coverage of the document space, density of metadata (tags) per document, rates of inter-annotator agreement, and rates of agreement with MeSH indexing. CiteULike and Connotea were very similar on all of the measurements. In comparison to PubMed, document coverage and per-document metadata density were much lower for the social tagging systems. Inter-annotator agreement within the social tagging systems and the agreement between the aggregated social tagging metadata and MeSH indexing were both low, though the latter could be increased through voting.
The most promising uses of metadata from current academic social tagging repositories will be those that find ways to utilize the novel relationships between users, tags, and documents exposed through these systems. For more traditional kinds of indexing-based applications (such as keyword-based search) to benefit substantially from socially generated metadata in the life sciences, more documents need to be tagged and more tags are needed for each document. These issues may be addressed both by finding ways to attract more users to current systems and by creating new user interfaces that encourage more collectively useful individual tagging behaviour.
As the volume of data in various forms continues to expand in the life sciences and elsewhere, it is increasingly important to find mechanisms to generate high quality metadata rapidly and inexpensively. This indexing information - the subjects linked to documents, the functions annotated for proteins, the characteristics identified in images, etc. - is what makes it possible to build the software required to provide researchers with the ability to find, integrate, and interact effectively with distributed scientific information.
Current practices for generating metadata within the life sciences, though varying across initiatives and often augmented by automated techniques, generally follow a process closely resembling that long employed by practitioners in the library and information sciences [1, 2]. First, semantic structures, such as thesauri and ontologies, are created by teams of life scientists working in cooperation with experts in knowledge representation or by individuals with expertise in both areas. Next, annotation pipelines are created whereby professional annotators utilize the relevant semantic structures to describe the entities in their domain. Those annotations are then stored in a database that is made available to the public via websites and sometimes Web services. As time goes on, the semantic structures and the annotations are updated based on feedback from the community and from the annotators themselves.
This process yields useful results, but it is intensive in its utilization of an inherently limited supply of professional annotators. As the technology to produce new information and the capacity to derive new knowledge from that information increases, so too must the capacity for metadata provision. Technologies that support this process by partially automating it, such as workflows for genome annotation  and natural language indexing systems [4–6], provide important help in this regard, but manual review of automated predictions remains critical in most domains [7, 8]. There is clearly a need for an increase in the number of human annotators to go along with the increase in the amount of data.
The tagline of Connotea - "Organize, Share, Discover" - illustrates the purposes that social tagging systems are intended to enable for their users. Tags can be used to organize personal collections in a flexible, location independent manner. The online nature of these services allows users to easily share these collections with others - either within their circle of associates or with the general public. Through the public sharing of these annotated references, it is possible for users to discover other users with similar interests and references they may not otherwise have come across.
This basic functionality has already attracted tens of thousands of users to these systems.
The expanding numbers of users and the concomitant increase in the volume of the metadata they produce suggest the possibility of new applications that build on the socially generated metadata to achieve purposes different from the personal ones listed above. For example, it has been shown that the relevance of Web search results achieved with both search engines and hand-curated directories could be improved by integrating results produced by social tagging services. In fact, the "social search engines" suggested by this result are already starting to appear (for an example, see worio.com).
As we consider the creation of new applications like these within the life sciences, it is important to begin with an understanding of the nature of the metadata that they will be built upon. This study is thus intended to provide a thorough characterization of the current products of social tagging services in biomedical, academic contexts. This is achieved through an empirical assessment of the tags used to describe citations in PubMed by users of Connotea and CiteULike. Selecting PubMed citations as the resource-focus for this investigation makes it possible to compare socially generated metadata, produced initially to support disparate personal needs, directly with professionally generated metadata produced for the express purpose of enabling applications that serve the whole community. Where commonalities are noted, similar kinds of community-level uses can be imagined for the socially generated metadata; where differences occur, opportunities are raised to envision new applications.
This data suggests that, despite the very large numbers of registered users of academically-focused social tagging services - on November 10, 2008, Connotea reported more than 60,000 (Ian Mulvaney, personal communication) - the actual volume of metadata generated by these systems remains quite low. While the sheer numbers of users of these systems make it possible that this volume could increase dramatically, that possibility remains to be shown.
Density refers simply to the number of metadata terms associated with each resource described. Though providing no direct evidence of the quality of the metadata, it helps to form a descriptive picture of the contents of metadata repositories that can serve as a starting point for exploratory comparative analyses. To gain insight into the relative density of tags used to describe citations in academic social tagging services, we conducted a comparison of the number of distinct tags per PubMed citation for a set of 19,118 unique citations described by both Connotea and CiteULike. This set represents the complete intersection of 203,314 PubMed citations identified in the CiteULike data and 106,828 PubMed citations found in Connotea.
Table: Tag density in Connotea, CiteULike and MEDLINE on PubMed citations. (Values not recovered; surviving labels: 'coefficient of variation', 'Connotea per tagging', 'CiteULike per tagging'.)
In terms of tags per post, the users of CiteULike and Connotea were very similar. As Table 1 indicates, the mean number of tags added per biomedical document by individual users was 3.02 for Connotea and 2.51 for CiteULike, with a median of 2 tags/document for both systems. These figures are consistent with tagging behaviour observed throughout both systems and with earlier findings on a smaller sample from CiteULike which indicated that users typically employ 1-3 tags per resource [21, 22]. On independent samples of 500,000 posts (tagging events) for both CiteULike and for Connotea, including posts on a wide variety of subjects, the medians for both systems were again 2 tags/document and the means were 2.39 tags/document for CiteULike and 3.36 for Connotea. The difference in means is driven, to some extent, by the fact that CiteULike allows users to post bookmarks to their collections without adding any tags while Connotea requires a minimum of one tag per post. Other factors that could influence observed differences are that the user populations for the two systems are not identical nor are the interfaces used to author the tags. In fact, given the many potential differences, the observed similarity in tagging behaviour across the two systems is striking.
As more individuals tag any given document, more distinct tags are assigned to it. After pooling the tags added to each citation in the sample by all of the users who tagged it, the mean number of distinct tags/citation was 4.15 for Connotea and 5.10 for CiteULike. This difference reflects the larger number of CiteULike posts describing the citations under consideration. In total, 45,525 CiteULike tagging events produced tags for these citations, while data from just 28,236 Connotea tagging events were considered.
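The two density figures discussed above (tags per individual post versus distinct tags per citation after pooling) can be illustrated with a minimal sketch. The `posts` records, user names and identifiers below are invented for illustration and are not data from the study; the original analysis was carried out with Java programs against a local database.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical tagging events: (user, pubmed_id, tags). Illustrative only.
posts = [
    ("u1", "100", {"vegf", "angiogenesis"}),
    ("u2", "100", {"vegf", "todo"}),
    ("u3", "200", {"astrocytes"}),
]

def tags_per_post(posts):
    """Number of tags in each individual tagging event (per-post density)."""
    return [len(tags) for _, _, tags in posts]

def distinct_tags_per_citation(posts):
    """Pool every user's tags for a citation and count the distinct ones."""
    pooled = defaultdict(set)
    for _, pmid, tags in posts:
        pooled[pmid] |= tags
    return {pmid: len(tags) for pmid, tags in pooled.items()}

per_post = tags_per_post(posts)
per_citation = distinct_tags_per_citation(posts)
print("mean tags/post:", mean(per_post), "median tags/post:", median(per_post))
print("distinct tags/citation:", per_citation)
```

In this toy input the citation tagged by two users ends up with three distinct tags even though each post carries only two, which is the effect described in the paragraph above.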
Measures of inter-annotator agreement quantify the level of consensus regarding annotations created by multiple annotators. Where consensus is assumed to indicate correctness, it is used as a measure of quality. The higher the agreement between multiple annotators, the higher the perceived confidence in the annotations.
In a social tagging scenario, agreement regarding the tags assigned to particular resources can serve as a rough estimate of the quality of those tags from the perspective of their likelihood to be useful to people other than their authors. When the same tag is used by multiple people to describe the same thing, it is more likely to directly pertain to the important characteristics of the item tagged (e.g. 'VEGF' or 'solid organ transplantation') than to be of a personal or erroneous nature (e.g. 'BIOLS_101', 'todo', or '**'). Rates of inter-annotator agreement can thus be used as an approximation of the quality of tag assignments from the community perspective. Note that, as has been discussed elsewhere, there may be interesting, community-level uses for other kinds of tags, such as those bearing emotional content. For example, tags like 'cool' or 'important' may be useful in the formation of recommendation systems as implicit positive ratings of content. However, the focus of the present study is on the detection and assessment of tags from the perspective of subject-based indexing. Note also that the small numbers of tags per document in the systems under consideration here bring into question the relationship between consensus and quality.
Table: Examples of different levels of granularity. (Partially recovered entries: CUI 0001584: 'Adolescent Psychology'; UMLS Semantic Type: SUI T090: 'Biomedical Occupation or Discipline'; UMLS Semantic Group: entry not recovered.)
The reason for including multiple levels of granularity in the measures of agreement is to provide a thorough comparison of the meanings of the tags. Since the tags are created dynamically by users entering simple strings of text, we expect large amounts of variation in the representations of the same concepts due to the presence of synonyms, spelling errors, differences in punctuation, differences in plural versus singular forms, etc. The mapping to UMLS concepts should help to reduce the possibility of such non-semantic variations masking real conceptual agreements. Furthermore, by including analyses at the levels of semantic types and semantic groups, we can detect potential conceptual similarities that exact concept matching would not reveal. (While the present study is focused on measures of agreement, in future work this data could be used to pose questions regarding the semantic content of different collections of tags - for example, it would be possible to see if a particular semantic group like 'concepts and ideas' was over-represented in one group versus another.)
Table: Positive Specific Agreement among pairs of social taggers on PubMed citations. (Values not recovered; surviving column headers, each appearing twice: 'N pairs measured', 'Mean terms per post'.)
One interpretation of the low levels of agreement is that some users are providing incorrect descriptions of the citations. Another interpretation is that there are many concepts that could be used to correctly describe each citation and that different users identified different, yet equally valid, concepts. Given the complex nature of scientific documents and the low number of concepts identified per post, the second interpretation is tempting. Perhaps the different social taggers provide different, but generally valid views on the concepts of importance for the description of these documents. If that is the case, then, for items tagged by many different people, the aggregation of the many different views would provide a conceptually multi-faceted, generally correct description of each tagged item. Furthermore, in cases where conceptual overlap does occur, strength is added to the assertion of the correctness of the overlapping concepts.
To test both of these assumptions, some way of measuring 'correctness' regarding tag assignments is required. In the next several sections, we offer comparisons between socially generated tags and the MeSH subject descriptors used to describe the same documents. Where MeSH annotation is considered to be correct, the provided levels of agreement can be taken as estimates of tag quality; however, as will be shown in the anecdote that concludes the results section and addressed further in the Discussion section, MeSH indexing is not and could not be exhaustive in identifying relevant concepts nor perfect in assigning descriptors within the limits of its controlled vocabulary. There are likely many tags that are relevant to the subject matter of the documents they are linked to yet do not appear in the MeSH indexing; agreement with MeSH indexing cannot be taken as an absolute measure of quality - it is merely one of many potential indicators.
As both another approach to quality assessment and a means to precisely gauge the relationship between socially generated and professionally generated metadata in this context, we compared the tags added to PubMed citations to the MeSH descriptors added to the same documents. For these comparisons, we again used PSA, but in addition, we report the precision and the recall of the tags generated by the social tagging services with respect to the MeSH descriptors. (For readers familiar with machine learning or information retrieval studies, in cases such as this where one set is considered to contain true positives while the other is considered to contain predicted positives, PSA is equivalent to the F measure - the harmonic mean of precision and recall.)
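The equivalence between PSA and the F measure noted above can be checked numerically. The following is a minimal sketch, assuming the standard set-based definitions (a = shared terms, b = terms only in the tag set, c = terms only in the MeSH set, PSA = 2a / (2a + b + c)); the function name and the example term sets are illustrative and not taken from the study data.

```python
import math

def precision_recall_psa(predicted, reference):
    """Compare a set of predicted terms (tags) with a reference set (MeSH)."""
    a = len(predicted & reference)   # terms in both sets
    b = len(predicted - reference)   # terms only in the predicted set
    c = len(reference - predicted)   # terms only in the reference set
    # Returning 0.0 for empty denominators is a convention chosen for this sketch.
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    psa = 2 * a / (2 * a + b + c) if a + b + c else 0.0
    return precision, recall, psa

# Illustrative term sets only.
tags = {"astrocytes", "vision", "two photon"}
mesh = {"astrocytes", "microscopy confocal", "visual cortex", "hemodynamics"}

p, r, psa = precision_recall_psa(tags, mesh)
f_measure = 2 * p * r / (p + r)      # harmonic mean of precision and recall
print(p, r, psa, f_measure)
```

Here precision is 1/3, recall is 1/4, and both PSA and the harmonic mean come out to 2/7.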
Table: Average agreement between social tagging aggregates and MeSH indexing. (Values not recovered; surviving labels: 'CiteULike versus MEDLINE', 'Connotea versus MEDLINE'.)
Focusing specifically on precision, we see that approximately 80% of the concepts that could be identified in both social tagging data sets fell into UMLS Semantic Groups represented by UMLS Concepts linked to the MeSH descriptors for the same resources. At the level of the Semantic Types, 59% and 56% of the kinds of concepts identified in the Connotea and CiteULike tags respectively, were found in the MeSH annotations. Finally, at the level of UMLS Concepts, just 30% and 20% of the concepts identified in the Connotea and CiteULike tags matched Concepts from the MeSH annotations.
The data in Table 4 represents the conceptual relationships between MeSH indexing and the complete, unfiltered collection of tagging events in CiteULike and Connotea. In certain applications, it may be beneficial to identify tag assignments likely to bear a greater similarity to a standard like this - for example, to filter out spam or to rank search result lists. One method for generating such information in situations where many different opinions are present is voting. Assuming that there is a greater tendency for tag assignments to agree with the standard than to disagree - where multiple tag assignments for a particular document are present - then the more times a tag is used to describe a particular document the more likely that tag is to match the standard.
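The voting idea can be sketched as a simple threshold on how many distinct posts used a tag for a given document. This is an illustrative sketch rather than the procedure used to produce the reported figures; the function name, the `min_votes` parameter and the example posts are assumptions.

```python
from collections import Counter

def tags_with_votes(posts_for_document, min_votes):
    """Keep the tags applied to one document by at least `min_votes` posts.

    `posts_for_document` is an iterable of tag collections, one per tagging
    event; each post contributes at most one vote per tag.
    """
    votes = Counter()
    for tags in posts_for_document:
        votes.update(set(tags))
    return {tag for tag, count in votes.items() if count >= min_votes}

# Illustrative posts for a single document.
posts_for_document = [
    {"astrocytes", "vision"},
    {"astrocytes", "wow"},
    {"astrocytes", "vision", "two photon"},
]

for threshold in (1, 2, 3):
    print(threshold, sorted(tags_with_votes(posts_for_document, threshold)))
```

Raising the threshold trades recall for precision: fewer tags survive, but under the stated assumption the survivors are more likely to match the standard.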
Though the bulk of the socially generated metadata investigated above is sparse - with most items receiving just a few tags from a few people - it is illuminating to investigate the properties of this kind of metadata when larger amounts are available both because it makes it easier to visualize the complex nature of the data and because it suggests potential future applications. Aside from enabling voting processes that may increase confidence in certain tag assignments, increasing numbers of tags also provide additional views on documents that may be used in many other ways. Here, we show a demonstrative, though anecdotal example where several different users tagged a particular document and use it to show some important aspects of socially generated metadata - particularly in contrast to other forms of indexing.
In some cases, the tags added by the users of the social tagging systems are more precise than the terms used by the MeSH indexers. For example, the main experimental method used in the article was two-photon microscopy - a tag used by two different social taggers (with the strings 'two-photon' and 'twophoton'). The MeSH term used to describe the method in the manuscript is 'Microscopy, Confocal'.
Within the MeSH hierarchy, two-photon microscopy is most precisely described by the MeSH heading 'Microscopy, Fluorescence, Multiphoton' which is narrower than 'Microscopy, Fluorescence' and not directly linked to 'Microscopy, Confocal'; hence it appears that the social taggers exposed a minor error in the MeSH annotation. In other cases, the social taggers chose more general categories - for example, 'hemodynamics' in place of the more specific 'blood volume'.
The tags in Figure 14 show two important aspects of socially generated metadata: diversity and emergent consensus formation. As increasing numbers of tags are generated for a particular item, some tags are used repeatedly and these tend to be topically relevant; for this article, we see 'astrocytes' and 'vision' emerging as dominant descriptors. In addition to this emergent consensus formation (which might be encouraged through interface design choices), other tags representing diverse user backgrounds and objectives also arise, such as 'hemodynamic', 'neuroplasticity', 'two-photon', and 'WOW'. In considering applications of such metadata, both phenomena have important consequences. Precision of search might be enhanced by focusing query algorithms on high-consensus tag assignments or by enabling Boolean combinations of many different tags. Recall may be increased by incorporating the tags with lower levels of consensus.
While we assert that this anecdote is demonstrative, a sample of one is obviously not authoritative. It is offered simply to expose common traits observed in the data where many tags have been posted for a particular resource.
The continuous increase in the volume of data present in the life sciences, illustrated clearly in Figure 2 by the growth of PubMed, renders processes that produce value-enhancing metadata increasingly important. It has been suggested by a number of sources that social tagging services might generate useful metadata, functioning as an effective intermediate between typically inexpensive, but low precision automated methods and expensive professional indexing involving controlled vocabularies [15, 21, 22, 29]. Evidence in favour of this claim comes from reported improvements in the relevance of Web search results gained by integrating information from social tagging data into the retrieval process . Where a substantial density of socially generated tags is present, we demonstrated that it is possible to achieve both deep resource descriptions (Figure 14) and improvements in annotation precision via aggregation (Figure 12). Unfortunately, the results presented here also suggest that much of this potential is as yet unavailable in the context of the life sciences because the coverage of the domain is still very narrow and the number of tags used to describe most of the documents is generally very low.
If metadata from social tagging services is to be useful in support of applications that are similar in purpose and implementation to those currently in operation, more documents need to be tagged and more tags need to be assigned per document. These objectives can be approached by both expanding the number of users of these systems and adjusting the interfaces that they interact with. Looking forward, the increasing volume of contributors to social tagging services should help to increase resource coverage and, to some extent, tag density, yet both the rich-get-richer nature of citation and the limited actual size of the various sub-communities of science will likely continue to result in skewed numbers of posts per resource. To make effective use of the annotations produced by social tagging applications, the metadata generated by individual users needs to be improved in terms of density and relevance because, in most cases, the number of people to tag any particular item will be extremely low. Identifying design patterns that encourage collectively useful tagging behaviour is thus a critical area for future investigations. It has been shown that careful interface and interaction design can be used to guide individual users towards tagging behaviours that produce more useful metadata at the collective level [30, 31]. Future research will help to provide a better understanding of this process, illuminating methods for guiding user contributions in particular directions, e.g. towards the use of larger numbers of more topical tags, without reducing the individual benefits of using these systems that seem to provide the primary incentive for participation . One such experiment in interaction design would be to inform the users of these systems that the annotations they create for themselves are to be used in the creation of applications that operate on a collective level and thus benefit the community as a whole. 
By making the desire to create such applications known and by explaining the attributes of the annotations required to make these applications effective, it is possible that some individuals might act intentionally to improve the collective product. Such an experiment would help to shed light on the question of why there are such differences between the tagging behaviours of typical users and the annotations produced in professional contexts. Perhaps an increased overlap in purpose would result in increased overlap in product.
Aside from such overt requests, changes to the interfaces used to author annotations within social tagging systems might also have substantial effects. One key area of development in terms of tagging interface design is the incorporation of controlled vocabularies into the process. Emerging systems in this domain let users tag with controlled terms [33, 34] and automatically extract relevant keywords from text associated with the documents to suggest as potential tag candidates. By providing the well-known benefits of vocabulary control - including effective recognition and utilization of relationships such as synonymy and hyponymy - and by gently pressing users towards more convergent vocabulary choices and fewer simple spelling errors, such systems seem likely to produce metadata that would improve substantially on that analyzed here. In preliminary investigations of such 'semantic social tagging' applications - including Faviki, the Entity Describer [36, 37], and ZigTag - the degrees of inter-tagger agreement do appear higher than for the free-text interfaces; however, the number of tags per document remains about the same (data not shown). Systems that aid the user in selecting tags - for example, by mining them from relevant text - may aid in the expansion of the number of tags added per document.
In addition to recruiting more users and producing interfaces that guide them towards more individually and collectively useful tagging behaviours, additional work is needed to better understand other aspects of the metadata from social tagging systems that are both important and completely distinct from previous forms of indexing. For example, one of the fundamental differences between socially generated and institutionally generated indexes is the availability of authorship information in the social data . It is generally not possible to identify the person responsible for creating the MeSH indexing for a particular PubMed citation, but it is usually possible to identify the creator of a public post in a social tagging system. This opens up whole new opportunities for finding information online whose consequences are little understood. For example, it is now possible for users to search based on other users e.g. searching for items in Connotea that have been tagged by 'mwilkinson'  or 'bgood' . In addition to this simple yet novel pattern of information interaction, research is being conducted into ways to incorporate user-related data into keyword-based retrieval algorithms .
Academic social tagging systems provide scientists with fundamentally new contexts for collaboratively describing, finding, and integrating scientific information. In contrast to earlier forms of personal information management, the public nature and open APIs characteristic of social tagging services make the records of these important scientific activities accessible to the community. These new public metadata repositories provide a novel resource for system developers who wish to improve the way scientists interact with information.
Based on the results presented above, it is clear that the information accumulating in the metadata repositories generated through social tagging offers substantial differences from other kinds of metadata. In particular, both the number of documents described by these systems and the density of tags associated with each document remain generally very low and very unequally distributed across both the user and the document space. While expanding numbers of user-contributors and improving user interfaces will likely help to encourage the formation of greater numbers of tagged documents and more useful tags, the unbalanced distribution of scientific attention will almost certainly result in the continuation of the skewed numbers of taggers (and thus tags) per document displayed in Figures 6, 7, 8 and 9.
At a broad level, the key implication of these results from the standpoint of bioinformatics system design is that - despite surface similarities - these new metadata resources cannot be used in the same manner as metadata assembled in other ways. Rather, new processes that make use of the additional social context made accessible through these systems need to be explored. In the long run, it may turn out that the primary benefit of social tagging data lies not in the relationships between tags and documents explored here but in the information linking documents and tags to users and users to each other.
The Connotea data was gathered using the Connotea Web API  and a client-side Java library for interacting with it . All tagging events accessible via the API prior to November 10, 2008 were retrieved and, with the exception of a small number lost due to XML parsing errors, stored in a local MySQL database for analysis.
The CiteULike data was downloaded on November 9, 2008 from the daily database export provided online . Once again, the data was parsed and loaded into a local MySQL database for processing.
Once the Connotea and CiteULike data was gathered, the associated PubMed identifiers from both datasets were used to retrieve the PubMed records using a Java client written for the United States National Center for Biotechnology Information's Entrez Programming Utilities. This client retrieved the metadata, including MeSH term assignments, for each identifier and stored it in the local database.
The coverage of PubMed by Connotea and CiteULike was estimated through inspection of the number of unique identifiers supplied for each posted citation in the downloaded data. Only citations that were linked by the tagging systems to PubMed identifiers were counted.
The data generated for the tag density tables and figures was assembled from the local database using Java programs. The figures were generated using R .
Equation 1: Positive Specific Agreement for the members of sets S1, S2 whose intersection is a and where b = S1 excluding a and c = S2 excluding a, with a, b and c counted as numbers of members:

$$\mathrm{PSA}(S_1, S_2) = \frac{2a}{2a + b + c}$$

For more information, see .
To provide an estimation for quality of tag assignments in academic social tagging systems, we measure the levels of agreement between the sets of tags assigned to the same resource by multiple users as follows:
- For resources associated with more than one tagging event
◦ For pairs of users to tag the resource
▪ measure and record the positive specific agreement (PSA) between the sets of tags assigned to the resource by the two users in the pair
- Summarize by average PSA for each distinct (user-pair, resource) combination
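The procedure above can be sketched as follows. This is a minimal illustration, not the authors' Java implementation: the record layout, the decision to merge repeated posts by the same user on the same resource, and the choice to score two empty tag sets as 0.0 are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

def psa(s1, s2):
    """Positive specific agreement, 2a / (2a + b + c), between two sets."""
    a = len(s1 & s2)
    b = len(s1 - s2)
    c = len(s2 - s1)
    total = 2 * a + b + c
    return 2 * a / total if total else 0.0  # assumption: empty vs empty scores 0

def mean_pairwise_psa(posts):
    """Average PSA over every (user-pair, resource) combination.

    `posts` is an iterable of (user, resource, tags) tagging events.
    """
    by_resource = defaultdict(dict)
    for user, resource, tags in posts:
        # Assumption: repeated posts by one user on one resource are merged.
        by_resource[resource].setdefault(user, set()).update(tags)

    scores = []
    for by_user in by_resource.values():
        if len(by_user) < 2:
            continue  # only resources with more than one tagger are compared
        for u1, u2 in combinations(sorted(by_user), 2):
            scores.append(psa(by_user[u1], by_user[u2]))
    return mean(scores) if scores else None

# Illustrative tagging events only.
posts = [
    ("u1", "r1", {"vegf", "cancer"}),
    ("u2", "r1", {"vegf"}),
    ("u3", "r1", {"todo"}),
    ("u1", "r2", {"astrocytes"}),
]
result = mean_pairwise_psa(posts)
print(result)
```

In this toy input only resource r1 has multiple taggers; its three user pairs score 2/3, 0 and 0, giving a mean of 2/9.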
As PSA is a metric designed for comparing sets, to use it, it is necessary to define a rigid equivalence function to define the members of the sets. For comparisons between concepts, types, and groups from the UMLS, unique identifiers for each item are used; however, for comparisons between tags, only the strings representing the tag are available. For the results presented at the level of standardized strings, operations were applied to the tags prior to the comparisons as follows:
1. All non-word characters (for example, commas, semi-colons, underscores and hyphens) were mapped to spaces using a regular expression. So the term "automatic-ontology_evaluation" would become "automatic ontology evaluation".
2. CamelCase  compound words were mapped to space separated words - "camelCase" becomes "camel case".
3. All words were made all lower case ("case-folded").
4. Any redundant terms were removed such that, after operations 1-3, each term in a set was a string of characters that was unique within that set.
5. Porter stemming was applied to all terms and sub-terms .
6. All sub-terms were sorted alphabetically.
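The six operations above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the regular expressions are assumptions about how the described mappings might be written, and the Porter stemmer is left as a pluggable `stem` function (identity by default) so that the example stays self-contained; a real stemmer, such as `PorterStemmer().stem` from NLTK, could be passed in instead.

```python
import re

def normalize_tag(tag, stem=lambda word: word):
    """Apply the string standardization steps to a single tag."""
    # Step 1: map non-word characters to spaces. Python's \W does not match
    # the underscore, so it is listed explicitly.
    text = re.sub(r"[\W_]+", " ", tag)
    # Step 2: split camelCase compounds at lower-to-upper case boundaries.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Step 3: case-fold, then break the term into its sub-terms.
    words = text.lower().split()
    # Steps 5 and 6: stem each sub-term, then sort the sub-terms alphabetically.
    return " ".join(sorted(stem(word) for word in words))

def normalize_tag_set(tags, stem=lambda word: word):
    """Normalize every tag in a post and return the resulting set of terms."""
    # Step 4 (de-duplication) falls out of collecting results in a set: terms
    # that are identical after steps 1-3 remain identical after steps 5-6.
    normalized = (normalize_tag(tag, stem) for tag in tags)
    return {term for term in normalized if term}

print(normalize_tag("automatic-ontology_evaluation"))
print(normalize_tag("camelCase"))
print(normalize_tag_set({"two-photon", "Two_Photon", "twophoton"}))
```

Note that 'two-photon' and 'Two_Photon' collapse to the same term, while 'twophoton' stays distinct at the string level; reconciling variants like that is left to the mapping to UMLS concepts rather than to string standardization.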
For MeSH terms, associated UMLS concepts were identified within the information provided in the 2008 version of the MeSH XML file provided by the NLM . In a few cases, concepts were missing from this file in which case they were retrieved using a Java client written to make use of the Web services made available as part of the UMLS Knowledge Source Server (UMLSKS) .
For the tags, the UMLSKS client program was designed to identify matching concepts with high precision. For each tag, the UMLSKS web service method findCUIByExact was used to identify concepts from any of the source vocabularies represented in the metathesaurus where at least one of the names assigned to that concept matched the tag directly . To further increase precision, only concepts whose primary name (rather than one of the several possible alternate names) matched the tag were included.
To assess the performance of this concept identification protocol, we tested it on its ability to rediscover the concepts associated with MeSH descriptors using the text of the preferred label for the descriptor (acting as a tag) as the input to the system. The concepts already associated with each MeSH descriptor in the MeSH XML file provided by the NLM were used as true positive concept calls for comparison. On a test of 500 MeSH descriptors, the concept calling protocol used to generate the data presented above produced a precision of 0.97 and a recall of 0.91. Without the requirement that the primary concept name match the query string, precision decreases to 0.82 while the recall increases to 1.0 for the same query set. The reduction in the precision is due to false positives such as 'Meningeal disorder' being identified for the query term 'Meninges'. Once a unique concept identifier was identified, the Java client was used to extract its semantic type and semantic group and store this information in our local database.
BMG is supported by a University Graduate Fellowship provided by the University of British Columbia. MDW is supported by an award from the Natural Sciences and Engineering Research Council of Canada.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.