In the previous sections, we have highlighted technologies and methodologies that can efficiently support the new data integration and search challenges set by NGS and the development of new high-throughput equipment. We have especially stressed the role of modelling, standardization, and interoperability. We also hinted at addressing outstanding methodological and technological issues through the expansion of collaborative efforts. As in other fields, community efforts such as data annotation and curation are progressively enabled by the growing support of social information and communication technologies. The technical environments available for community annotation, data publishing and integration play an increasingly important role in the life sciences [31–34]. Yet, some factors still limit the valuable contributions that could arise from such social efforts. Here, some of these factors are briefly discussed, focusing on those that restrain the participation of scientists in bio-data integration, mining and validation. In particular, we identified two major difficulties. First, scientists currently appear to lack the motivation to contribute to annotation in databases or knowledge bases. Second, valuable work done by authors who do not produce de novo data, but carefully select data from repositories for reanalysis, is poorly acknowledged. The following sheds some light on these questions and the various possible answers brought by communities with different social roles.
Data curation: from ignoring to cooperating
Many of us search or browse bioinformatics online resources. While doing so, we occasionally activate a link that unexpectedly breaks or points to nonsensical content. We mentally complain about it, but we usually ignore it and resume browsing. Some of us do spend the couple of minutes necessary to report the broken or mistaken link to the development team (if it still exists) and thereby spare other users the trouble. In this case, "contributing" means pointing out errors, not solving issues. Indeed, too few of us regard online resources as a community asset to which contributing would mean definite improvement and added value, i.e. a form of curation that benefits all.
The activity of biological data curation has evolved over the years to the point where there is now an organised International Society of Biocuration [35], within which the question of community-based curation is debated and promoted, among other themes. The need for coordinated action in this domain was emphasised, for instance, when the Swiss-Prot team introduced the "adopt a protein" scheme in 2007 [36], encouraging specialists of a given protein to oversee the update of the corresponding UniProtKB/Swiss-Prot entry. As it turned out, scientists are not born protein adopters and the initiative could not be sustained. In the same period, a more sophisticated wiki-based attempt was made in WikiProteins: the 2008 paper calling for a million minds [Mons 2008] has meanwhile collected more than 120 citations, a figure not matched by the number of community annotations in WikiProteins, and the attempt was discontinued. Other wiki-based approaches, such as the GeneWiki (in the context of Wikipedia) [32] and WikiPathways [33], have met with slightly more traffic, but to the best of our knowledge the only community annotation effort that really took off to a satisfactory level is ChemSpider [37]. With these lessons learned, however, an alternative invitation to contribute was devised a few years later in the human protein-centric knowledge platform neXtProt [38]. The neXtProt scheme promotes users' participation through the specific input of a selected network of specialists. Experts contribute by submitting experimental data sets and defining metrics for quality filtering in agreement with the neXtProt team. Very recently, curated associations from neXtProt, for instance on post-translational modifications, have been formatted as nanopublications; this will allow community contributions to specific snippets of information to be fully recognized (see next section).
In essence, the history of biological data curation tends to show that direct contribution, when limited to the addition of comments or facts on a Web page, may not be the best strategy for gathering quality information and attracting potential contributors. Instead, guided input, capturing and shaping information according to criteria collectively agreed upon in advance, seems a more realistic approach. Future tools should rely on social interfaces encouraging users' cooperation in a constructive and targeted manner. Some efforts have already been made in this direction, e.g. for collaborative ontology development [39] and for interactive knowledge capture by means of Semantic Web technologies [40]. Yet, this important future area of scientific contribution suffers from the same roadblock as the "data-based science" discussed before. Unless a culture develops in which these contributions (properly measured) influence the careers of the next generation of scientists, community contributions will always be limited to the "altruistic few" [41].
Exchange, access, provenance and reward models
An outstanding issue is the social reward system and the perceptions prevailing around data sharing. White papers advocating data sharing usually emphasise technical challenges rather than the actual process of data sharing, although the social challenges associated with actual data sharing are just as numerous as the technical ones. Obviously, the technical challenges come first: data can only be shared if they are interoperable in format or have been captured with proper metadata attached.
Making data Open Access is clearly not enough; accessibility and reusability of the data by others than the data generators is what really matters. As stated in previous sections, reuse of valuable data sets will support e-science discovery processes. In this context, provenance is key for users planning to include existing data in a meta-analysis. Prior to adding a data set to the analysis mix, an e-scientist needs to evaluate the set, its overall relevance, its quality and the underlying methods. For this crucial decision step, metadata, including rich provenance, are needed.
In many cases, data can be included in or excluded from an analysis workflow by properly instructed machines. For instance, all data on genes of a given species, e.g. mouse, can be automatically discarded, as long as sufficient provenance is associated with each candidate data set. It is thus very important that the concept "mouse" as Mus musculus is associated with a data set based entirely on mouse experiments, and is properly referred to with a computer-readable identifier. But it is also important that such an identifier appears at the appropriate position in the metadata fields, or in an RDF graph; this allows, for instance, the distinction between an occasional mention of the concept Mus musculus in a table or graph and the statement "this entire set was generated in 'mouse' experiments". However, this ideal situation is currently far from reality. Even with Digital Object Identifiers (DOIs) for data sets and initiatives such as FigShare [42], Dryad [43] and the Research Data Alliance [44], it will take many years before each valuable data set can be properly judged and interpreted by others than its creators. This becomes even more pertinent for multi-scale modeling and the associated multi-omics and multi-technology data sets that increasingly dominate contemporary biology. It is not enough to "find a data set of potential relevance", because soon there will be too many, or to see some metadata on how the study was performed, although the latter is a conditio sine qua non. For real e-science approaches in biology, we need to see the provenance of each individual data element as it may appear, for instance, as a crucial edge in a graph-based hypothetical discovery interface.
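As an illustration only, the sketch below shows how such dataset-level provenance could be expressed in machine-readable form with Python and rdflib, using the PROV-O and Dublin Core vocabularies; the data set URI, the assay and ORCID identifiers, and the choice of predicates are our own illustrative assumptions, not a prescribed format.

```python
# Minimal sketch (assumptions, not a standard): state that an ENTIRE data set was
# generated in Mus musculus experiments, using a computer-readable taxon identifier,
# so that a workflow can include or exclude the set without inspecting every record.
from rdflib import Graph, Namespace, URIRef, Literal

PROV = Namespace("http://www.w3.org/ns/prov#")
DCT = Namespace("http://purl.org/dc/terms/")
OBO = Namespace("http://purl.obolibrary.org/obo/")  # NCBITaxon_10090 = Mus musculus

g = Graph()
g.bind("prov", PROV)
g.bind("dcterms", DCT)

dataset = URIRef("https://example.org/datasets/expr-2014-001")  # hypothetical data set URI

# Dataset-level statements: organism, generating activity and producer apply to the whole set.
g.add((dataset, DCT.subject, OBO.NCBITaxon_10090))  # organism: Mus musculus
g.add((dataset, PROV.wasGeneratedBy, URIRef("https://example.org/assays/rna-seq-run-42")))
g.add((dataset, PROV.wasAttributedTo, URIRef("https://orcid.org/0000-0000-0000-0000")))
g.add((dataset, DCT.description, Literal("Expression data from mouse liver (illustrative)")))

print(g.serialize(format="turtle"))
```

Because these triples are attached to the data set as a whole rather than to a single cell in a table, a machine can safely read them as "this entire set concerns mouse", which is exactly the distinction discussed above.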
Nanopublications and micropublications are important here. Entire complex data sets with hundreds of thousands, and sometimes millions, of interesting associations can be published as nanopublications; the associations are then no longer lost behind a huge number of hyperlinks to remote repositories, but each individual association becomes a research object in its own right, discoverable by computers and humans alike, all across the Web.
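To make the concept concrete, the sketch below builds a single nanopublication following the general nanopublication model: an assertion graph, a provenance graph and a publication-info graph, linked from a head graph. The claim, URIs and identifiers are hypothetical placeholders of our own, not data taken from any resource mentioned in the text.

```python
# Minimal nanopublication sketch with rdflib: one claim, its provenance and its
# publication info, each in a named graph. All example.org URIs are hypothetical.
from rdflib import Dataset, Namespace, URIRef, Literal
from rdflib.namespace import RDF

NP = Namespace("http://www.nanopub.org/nschema#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/np1#")

ds = Dataset()
head = ds.graph(EX.Head)
assertion = ds.graph(EX.assertion)
provenance = ds.graph(EX.provenance)
pubinfo = ds.graph(EX.pubinfo)

# Head graph: declares the nanopublication and points to its three parts.
head.add((EX.np1, RDF.type, NP.Nanopublication))
head.add((EX.np1, NP.hasAssertion, EX.assertion))
head.add((EX.np1, NP.hasProvenance, EX.provenance))
head.add((EX.np1, NP.hasPublicationInfo, EX.pubinfo))

# Assertion graph: the single claim, e.g. a protein carries a phosphorylation site.
assertion.add((EX.ProteinX, EX.hasModification, EX.PhosphositeSer15))

# Provenance graph: how the claim was obtained (here, derived from a hypothetical data set).
provenance.add((EX.assertion, PROV.wasDerivedFrom, EX.massSpecDataset42))

# Publication-info graph: who published this nanopublication and when.
pubinfo.add((EX.np1, PROV.wasAttributedTo, URIRef("https://orcid.org/0000-0000-0000-0000")))
pubinfo.add((EX.np1, PROV.generatedAtTime, Literal("2014-01-01")))

print(ds.serialize(format="trig"))
```

Because the claim, its provenance and its attribution are separate but linked graphs, each association can be cited, trusted or filtered on its own, which is what makes the next step, building a "market" for such units, conceivable.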
Assuming nanopublications and micropublications can solve the data integration issue, the next challenge is to "build a market" for such small information units. They do not intrinsically carry enough evidence to be trusted. The decision to "trust" them or not is taken on the basis of the source information, the associated methods and the reasoning that led to the claim in question. Again, additional steps (in fact, a form of annotation) are needed to create such computer- and human-readable units with sufficiently rich provenance. This raises again the question of how to motivate scientists to spend extra time on annotation.
Naturally, we lack much of the technical infrastructure needed to make all this a reality, but, in a practical sense, these needs are "easier" to fulfil than breaking through the science-ecosystem hurdle of making "preparation for sharing" of data a core activity for every data creator and publisher. In fact, what we need is "desktop publishing" of data and information, much as authors today carefully pre-format their papers according to author guidelines. Modern publishers should become data publishers as well as narrative publishers, and should assist scientists in curating and shaping newly published data sets and their provenance, much as they currently do for narrative.
Especially for data that are sensitive, both in terms of privacy and of competitiveness, trusted-party status for the required data publishing and stewardship infrastructure is a conditio sine qua non. Therefore, such a data exchange environment can only be built effectively as a federated and "approved" infrastructure, serving national as well as international data-driven projects, and as a public-private partnership.
For purposes of clarity, Figure 2 summarizes, and grossly oversimplifies, the basic workflow of data-driven science. What is needed for e-science is, in fact, a completely new way of publishing, using, searching and reasoning with massive data output, in an open, software-driven, interactive environment.
Relevant scientific data, such as open access publications (e.g. Public Library of Science (PLoS) or BioMed Central (BMC)), individual assertions from closed access publications, abstracts (e.g. PubMed) and relevant legacy data sources (e.g. ChEMBL, UniProt), which together constitute the central core of biological information requested by almost all domains, should be made available in an interoperable format, so that their direct integration, comparison and modeling with new data becomes possible (Figure 3). Currently, only a small percentage of the information in databases, for instance SNP-phenotype associations, can be recovered by text mining from abstracts, or even from the entire narrative of full-text articles. Many of these associations appear in tables and figures, which escape ordinary text mining algorithms, or in supplementary data, which are ignored by text mining. It is therefore crucial to move to a situation where massive numbers of associations can be published in a "discoverable" and interoperable format, with proper references to the producers of the data and to the associated narrative elements, so that the efforts can be properly credited.
Notably, publishers have already played a role by imposing annotated data submission and, in some cases, by being involved in the definition of related standards, e.g. the MINSEQE [45] and MIAPE [46] standards that govern the corresponding data repositories: ArrayExpress [47] for MINSEQE and PRIDE [48] for MIAPE. However, a major roadblock at this point in time is that many grant and manuscript reviewers still do not recognise the value of studies that do not entail the production of new experimental data, but only exploit results from data repositories. Without challenging the sustained importance of proving a biological hypothesis with sound experimental data, it should nonetheless be admitted that validation does not necessarily require being the creator of the data used as evidence.
Finally, social hurdles for data sharing are not limited to the conservatism of publishers and funders, which will hopefully be overcome soon. More importantly, there is no "scientific" reward for sharing, i.e. no acknowledgement of its value as a scientific product. If no generally acknowledged reward exists for sharing and for making one's own data discoverable, well annotated, interoperable and citable, a routine of data sharing is not likely to be established. Movements like Altmetrics [49] are crucial to raising the discussion and demonstrating the technical feasibility of a fine-grained judgement of an individual's contribution to the scientific record. However, until such a "reward" reaches steady and wide acceptance by reviewers, funding bodies and publishers, nothing will change. These actors have the means to push researchers to make proper data stewardship part of their natural workflow. It is only recently that the "reusability" of the data being generated has had to be taken into account at the study design stage. Several funders already require a well-drafted data stewardship plan for any proposal that will generate significant data sets. This practice should be encouraged; proper standards, best practices, guidelines and reward systems should be implemented and made easily findable, so that biologists with little or no affinity for bioinformatics or data sharing can still participate. Only when all these prerequisites for data sharing are in place may the culture change and a genuine open data exchange culture in the life sciences be established.