Trends in the production of scientific data analysis resources
© Hennessey et al.; licensee BioMed Central Ltd. 2014
Published: 21 October 2014
As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically linked to by an Internet URL, and URLs have been shown to decay in a time-dependent fashion. What is less clear is whether SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence, or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources.
We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis.
We identified 23,820 non-archival URLs produced between 1996 and 2013, out of which 11,977 were classified as SDARs. Production of SDARs as measured with the Gini coefficient is more widely distributed among institutions (.62) and ZIP codes (.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (.91) and ZIPs (.96). An estimated one percent of institutions produced 68% of published research whereas the top 1% only accounted for 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have only published one SDAR. Interestingly, decayed SDARs have significantly fewer average authors (4.33 +/- 3.06) than available SDARs (4.88 +/- 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers.
SDAR production is less dependent upon institutional location and resources, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet, SDAR team size correlates positively with SDAR accessibility, suggesting a possible sociological factor involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether or not this is a general error rate that extends to additional published entities.
Technological advances have enabled the rapid production of data in many scientific fields. Because gathering data is frequently the beginning point for scientific inquiry rather than the end goal, the number of publications that mention the use of software, databases, web servers or informatics components has increased steadily over the past several decades. The field of bioinformatics has grown, in large part, commensurate with this increase in data, and focuses on developing methods to better understand the implications of gathered data. These methods encompass computer programs, databases and other Internet-accessible software products that we will collectively refer to as Scientific Data Analysis Resources (SDARs). Some SDARs have a profound impact on science - three of the four most cited papers in the past 25 years (as of March 26th, 2014, according to Web of Science) were SDARs, including BLAST for sequence analysis (cited 37,641 times), Clustal-W for multiple sequence alignment (cited 40,364 times), and SHELX for protein structure determination (cited 35,311 times). However, one concern with SDARs is that, due to their digital nature, they can suddenly disappear. That is, the equipment hosting the SDAR may become inaccessible for various reasons outside the control of the authors, such as catastrophic data loss, maintenance neglect, or loss of funding to support the resource. Alternatively, new methods may replace old ones, and links to the obsolete methods may simply be retired.
Internet-accessible resources, locatable by their Uniform Resource Locator (URL) addresses, are being increasingly used in scientific publications. However, unlike print, URLs are dynamic: their content can change, and they can become inaccessible altogether. This phenomenon, URL decay, whereby URLs become inaccessible in a time-dependent manner, has been documented in numerous studies to date and has been reported across academic disciplines (e.g., medicine, law, business, social science, etc.) [4–6]. These problems in the persistence of online resources are being dealt with in a variety of ways, including archival sites like http://webcitation.org and efforts such as the Neuroscience Information Framework (http://www.neuinfo.org/) to identify, catalog and standardize existing resources.
Here, we are interested in a related phenomenon - factors that are associated with historical SDAR production and whether or not any of them also correlate with eventual decay. For example, does the number of authors publishing an SDAR correlate with its stability? In terms of vigilance in maintaining the online presence of SDARs, there are several possibilities as to what factors are most effective with reference to the original team creating the SDAR. It is possible that online stability correlates with the number of authors on the paper - perhaps because there might be more people with a vested interest in its maintenance and more people able to maintain it. Or, perhaps more authors might dilute the perception of personal responsibility for SDAR decay, and thus fewer authors per paper might lend itself to greater SDAR stability. What is the distribution in SDAR production by institution and does it differ from scientific research in general? How many groups publish multiple SDARs and are their SDARs more or less likely to be accessible? Alternatively, SDAR decay may be more a function of external factors, such as whether a project to create the SDAR was underfunded, a hosting institution changing policies regarding external access of internal network content, or a lack of interest from the scientific community in using the SDAR.
An interesting corollary to these questions is to contrast SDARs with the biomedical literature in general from an economic perspective. If publications are the currency of academia, can we measure the "health" of its economy? To that end, the Gini coefficient (aka the Gini index or Gini ratio) has been employed as a measure of wealth disparity by economists and sociologists to reflect the contrast between the richest and poorest in a nation. Ranging from 0 (a completely equal distribution of wealth) to 1 (a complete concentration of wealth), it is defined as the ratio of the area between the Lorenz curve (the cumulative share of wealth accounted for by a given cumulative percentage of the population) and the line of equality (going from 0,0 to 1,1) to the total area under the line of equality. By contrasting publication production between biomedical literature and SDARs, we can estimate whether SDAR production tends to be a product of those with infrastructure and resources (i.e., wealth in the traditional Gini coefficient model) or is more a function of individual initiative and effort.
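As a minimal sketch of how such a coefficient can be computed from per-institution (or per-ZIP code) publication counts, the following Python function uses the standard sorted-rank form of the sample Gini coefficient. The function name and the toy counts are illustrative assumptions, not the code or data used in this study.

```python
def gini(counts):
    """Sample Gini coefficient of non-negative counts (0 = perfectly equal).

    Uses the sorted-rank identity
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    where x_1 <= ... <= x_n and i runs from 1 to n.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical publication counts for four institutions.
evenly_spread = [5, 5, 5, 5]
concentrated = [0, 0, 0, 20]
print(gini(evenly_spread), gini(concentrated))
```

With this uncorrected sample form, the maximum attainable value for n entities is (n - 1)/n rather than exactly 1, so very small samples understate complete concentration.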
All of the extracted data, including MEDLINE URLs, abstracts, author names, dates and institutions, were obtained from the May, 2014 release of the National Library of Medicine's MEDLINE XML dataset. To prevent the introduction of bias created by certain journals appearing in the MEDLINE index sooner than others, the current year (2014) was excluded.
Institution names and their ZIP codes pertaining to individual citations were extracted from the Affiliation element, using the "extract_data.py" program, primarily using heuristics and regular expressions. One of the primary goals in extracting the names was to find the most generic identifier that uniquely pertained to an entity so that we could focus on discussing institutions, though this was challenging due to different punctuation, spelling, abbreviations and languages. The Affiliation string was first tokenized using commas, after which a prioritized set of keywords were searched within the substrings to identify high-level entities. Due to the noisy nature of the data, all institutions only appearing once were excluded from analysis.
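The comma tokenization and prioritized keyword search described above might look roughly like the following sketch. The keyword list, its ordering and the function name are hypothetical choices for illustration; the actual heuristics in "extract_data.py" are more extensive and are not reproduced here.

```python
# Hypothetical keywords, highest priority first.
PRIORITY_KEYWORDS = ["universi", "institut", "istituto", "hospital", "college"]


def extract_institution(affiliation):
    """Return the first comma-separated substring matching the
    highest-priority keyword, or None if nothing matches."""
    parts = [p.strip() for p in affiliation.split(",") if p.strip()]
    for keyword in PRIORITY_KEYWORDS:      # try keywords in priority order
        for part in parts:                 # scan substrings left to right
            if keyword in part.lower():    # case-insensitive comparison
                return part
    return None


print(extract_institution(
    "Department of Biology, University of Foo, Footown, USA"))
```

Looping over keywords in the outer loop means a lower-priority term such as "hospital" is only used when no substring mentions a university.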
Attempts were made to handle the international and multilingual nature of biomedical publishing. Some keywords were introduced to cover their English equivalents (like "hôpital" (French) and "Istituto" (Italian)), while others only needed to be shortened due to their common lingual history (like "Universi", which matched English, French, German, Italian, Portuguese and Spanish). Unicode text was reduced to its closest English transliteration using the unidecode Python package, as it was observed that names with non-English characters were already inconsistently transliterated. All comparisons were done in a case-insensitive manner.
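To illustrate the idea of transliteration followed by case-insensitive matching on a shortened keyword, here is a sketch that uses only the Python standard library. It is a stand-in under stated assumptions: it strips accents via Unicode decomposition and does not use the unidecode package, so characters without a decomposition (for example "ß") are simply dropped here, which a full transliteration table would handle differently.

```python
import unicodedata


def to_ascii(text):
    """Approximate ASCII form: decompose accented characters, then
    discard the combining marks and anything still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def matches_keyword(text, keyword):
    """Case-insensitive substring match on the ASCII-reduced text."""
    return keyword.casefold() in to_ascii(text).casefold()


for name in ["Université de Foo", "Università di Foo", "Universität Foo"]:
    print(to_ascii(name), matches_keyword(name, "Universi"))
```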
Various functions of standardization were applied (such as eliminating a leading "The"), the most controversial of which would likely be the dropping of any text after "University" if there was text before it. This reduced names such as "Foo University School of Medicine" down to "Foo University" while leaving others like "University of Foo" as they were. This forced judgment calls to be made in terms of what constitutes an institution, like in the case of the most represented institution, the University of California system, which is managed by a central board of regents. At the same time, there are many researchers at universities who list their college (such as a medical school) as their primary entity, making it difficult to fold them back into their university. With over 12 million institute strings, each heuristic added also has the potential to introduce a new parsing error.
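The two standardization rules named above (dropping a leading "The", and truncating after "University" only when text precedes it) can be sketched as follows. This is an illustrative reconstruction from the description, with a hypothetical function name, not the code used in the study.

```python
import re


def standardize_institution(name):
    """Apply two of the described normalization rules to a name."""
    name = name.strip()
    # Rule 1: eliminate a leading "The".
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)
    # Rule 2: drop text after "University" if there is text before it.
    match = re.search(r"\buniversity\b", name, flags=re.IGNORECASE)
    if match and match.start() > 0:
        name = name[:match.end()]
    return name


print(standardize_institution("Foo University School of Medicine"))
print(standardize_institution("The University of Foo"))
```

The first call reduces the name to "Foo University", while the second yields "University of Foo" because nothing precedes "University" once the leading article is removed.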
ZIP codes were extracted simply by verifying that an entry looked somewhat USA-related (containing "USA" or "United States" or at least not containing another country's name) and looking for a 5 digit number, preferably toward the end of the string.
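A rough sketch of this ZIP code heuristic is shown below. The short list of other country names and the function name are assumptions made for illustration; a real list would need to be far longer.

```python
import re

# Hypothetical, deliberately short list of non-US country names.
OTHER_COUNTRIES = ("canada", "germany", "france", "japan", "china",
                   "united kingdom", "italy", "switzerland")


def extract_us_zip(affiliation):
    """Return the last standalone 5-digit number in a US-looking
    affiliation string, or None."""
    text = affiliation.lower()
    # Word boundaries avoid matching "usa" inside words like "Lausanne".
    looks_us = bool(re.search(r"\busa\b", text)) or "united states" in text
    if not looks_us and any(country in text for country in OTHER_COUNTRIES):
        return None
    # Exactly five digits, not part of a longer run of digits.
    candidates = re.findall(r"(?<!\d)\d{5}(?!\d)", affiliation)
    # Prefer the candidate closest to the end of the string.
    return candidates[-1] if candidates else None


print(extract_us_zip("University of Michigan, Ann Arbor, MI 48109-2218, USA"))
```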
URLs were extracted from the XML abstracts using a previously described Visual Basic program. To determine their availability, they were then tested over a period of 10 days at 3 random times per day, following the same protocol and using the same previously published "check_urls_web.py" program. Previous results in [4–6] showed a very small percentage of URLs were only intermittently available (about 3% of URLs were available between 0% and 90% of the time), and the current survey is consistent with that (Additional file 1), with 2% showing intermittent availability. For convenience, we defined a URL to be "inaccessible" if it was accessible on fewer than 3 of the 30 occasions it was queried, and "accessible" otherwise. We classified URLs as SDARs by key terms that appeared in the abstract with the URL, including "informatics", "algorithm", "software", "web server" and "computer program".
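The two decision rules in this paragraph can be written down compactly. The sketch below assumes the number of successful checks per URL has already been counted, and it reduces SDAR classification to simple presence of the listed key terms in the abstract; the function names are hypothetical and any proximity weighting between key term and URL is not modeled.

```python
# Key terms listed in the text (the list there is introduced with "including").
SDAR_TERMS = ("informatics", "algorithm", "software",
              "web server", "computer program")


def availability(successes, min_successes=3):
    """Label a URL given how many of its 30 checks succeeded."""
    return "accessible" if successes >= min_successes else "inaccessible"


def looks_like_sdar(abstract):
    """True if any key term occurs in the abstract (case-insensitive)."""
    text = abstract.lower()
    return any(term in text for term in SDAR_TERMS)


print(availability(2), availability(30))
print(looks_like_sdar("We present a web server, freely available at ..."))
```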
Archival site entry errors are infrequent yet potentially problematic
For analysis purposes, we separated out 2,529 websites (S2 in Additional file 1) that were created by organizations to archive static, non-SDAR content (e.g., text, multimedia) from those that appeared to either be more author-initiated or were for organizational archival of SDARs (e.g., http://code.google.com, http://sourceforge.net, etc). This was done on the basis of the top-level domain (TLD). For example, URLs pointing to Digital Object Identifiers (DOIs), http://Webcitation.org, http://ClinicalTrials.gov, and journal-based archive sites were separated out and excluded from the rest of the URLs. These URLs (particularly DOIs) are expected to be more stable due to their long-term organizational support.
Interestingly, of the 666 DOIs detected, 19 (3%) were inaccessible. Although this is a very low rate, the very idea behind DOIs is to provide a permanent locator. Upon further examination, nine contained apparent spelling/formatting errors, four of which could be corrected and were then accessible. The other five were missing a critical field standard to DOI formatting, but what the field should have been could not be determined. The remaining 10 may have also had spelling errors but, if so, they were non-obvious. Furthermore, this 3% erroneous entry rate seemed to be a general phenomenon, as 85 of the 2,529 archival URLs (3.4%) were inaccessible. Manual inspection of the URLs (as extracted and as written in PubMed) showed that many of them were incorrectly formatted. For example, two links to the archival site http://www.webcitation.org left off the ".org" suffix. The most common errors involved one of the most highly cited web sites, http://clinicaltrials.gov, in attempts to cite the clinical trials number. The proper formatting is http://clinicaltrials.gov/ct2/show/# (where # is the clinical trials registry number), but dozens of publications have misspelled the URL. Examples include hyphenating "clinical-trials" (which does not automatically redirect) and misspelling "trials" as "trails", but most often the URL was simply not formatted correctly. What fraction may be the fault of the authors, the journal publication process or entry into MEDLINE is not known.
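A format check of this kind could flag many such entry errors before publication. The sketch below tests a URL against the pattern given above; it additionally assumes that a registry number consists of "NCT" followed by eight digits, which is not stated in the text, and the function name is hypothetical.

```python
import re

# Pattern for the form http://clinicaltrials.gov/ct2/show/#, assuming
# registry numbers of the form NCT plus eight digits.
CT_PATTERN = re.compile(
    r"^https?://(www\.)?clinicaltrials\.gov/ct2/show/NCT\d{8}$",
    re.IGNORECASE,
)


def is_well_formed_ct_url(url):
    """True if the URL matches the expected registry-link format."""
    return bool(CT_PATTERN.match(url.strip()))


for url in ["http://clinicaltrials.gov/ct2/show/NCT00000102",
            "http://clinical-trials.gov/ct2/show/NCT00000102",
            "http://clinicaltrails.gov/ct2/show/NCT00000102"]:
    print(url, is_well_formed_ct_url(url))
```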
URL decay continues unabated
The number of authors per SDAR is increasing and correlates with future accessibility
Logistic regression coefficients from modeling SDAR decay as a function of Number of Authors, Number of years since publication (SDAR_AGE) and Publication Year.
ANOVA table for the logistic linear model decay~NumAuthors+year
ANOVA table for the logistic linear model decay~NumAuthors+year+SDAR_Age
The resulting statistics for the coefficients of the 3-predictor model, decay ~ NumAuthors+SDAR_Age+Year, can be seen in Table 1. As expected, the variable SDAR_Age has the strongest influence on decay propensity (p = 6.9e-35). Each additional year since publication multiplies the odds of SDAR decay by 1.307 (=e^0.268). The number of authors also had a strong association with decay (p = 8.32e-04) but in the opposite direction (log odds ratio = -0.0236). That is, more authors on the original SDAR publication correlates with a higher probability that the SDAR will be accessible in the future. The statistics table also shows a slightly higher than expected decay rate in 2008 (p = 5.17e-04, log odds ratio = 0.458), which might account for the remaining marginal significance of the year variable overall (p = 0.0258 as computed by the ANOVA procedure). The overall logistic model with three predictor variables has a residual deviance of 11502 on 12035 degrees of freedom. Its improvement relative to the null model, which has a deviance of 12890 on 12047 degrees of freedom, is quantified by a chi-square statistic of 1388 on 12 degrees of freedom, and a corresponding p-value no larger than 2.2e-16.
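The arithmetic behind these figures is easy to verify. The short sketch below converts the reported log odds ratios into odds ratios and recovers the chi-square statistic and its degrees of freedom from the two deviances; it only reproduces the calculation from the numbers quoted above and does not refit the model.

```python
import math

# Reported coefficients (log odds ratios) from the text.
LOG_ODDS_AGE = 0.268        # per additional year since publication
LOG_ODDS_AUTHORS = -0.0236  # per additional author

odds_ratio_age = math.exp(LOG_ODDS_AGE)          # about 1.307
odds_ratio_authors = math.exp(LOG_ODDS_AUTHORS)  # about 0.977

# Likelihood-ratio (deviance) comparison against the null model.
null_deviance, null_df = 12890, 12047
resid_deviance, resid_df = 11502, 12035
chi_square = null_deviance - resid_deviance  # 1388
chi_square_df = null_df - resid_df           # 12

print(round(odds_ratio_age, 3), round(odds_ratio_authors, 3))
print(chi_square, chi_square_df)
```

An odds ratio just below 1 for the number of authors means each additional author is associated with roughly 2% lower odds of decay, holding the other predictors fixed.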
Decay rates are similar for multi-SDAR and single-SDAR authors
We examined whether or not senior authors that have published multiple SDARs have less overall URL decay (URLs pointing to their SDARs) than those that have published only one. There are competing hypotheses as to whether or not publication of multiple URLs correlates with greater or lesser stability. On one hand, senior authors that produce many SDARs likely have focused on developing the necessary infrastructure and have likely dedicated a substantial portion of their research to providing SDARs. On the other hand, a researcher publishing multiple SDARs might have a single point of failure (e.g., if they change institutions) or simply have too many to effectively keep track of.
Senior authors of papers that report the development of SDARs, including how many SDARs they have published as of May, 2014.
SDAR production is more widely distributed than research in general
Top institutional sources of scientific publication production.
University of California
Harvard Medical School
Johns Hopkins University
University of Toronto
Top US-based ZIP codes for scientific publication production.
National Institute of Health, Bethesda, MD
VA Hospital, Houston, TX
University of Michigan, Ann Arbor, MI
Mayo Clinic, Rochester, MN
New York, NY
UC San Francisco, San Francisco, CA
St. Louis, PA
Top institutional sources of SDAR production.
# of URLs
University of California
European Bioinformatics Institute
University of Washington
University of Manchester
University of Michigan
Iowa State University
Top sources of URL production by ZIP code.
UC San Diego, La Jolla, CA
University of Michigan, Ann Arbor, MI
Iowa State University, Ames, IA
UC Berkeley, Berkeley, CA
University of Washington, Seattle, WA
UC Los Angeles, Los Angeles, CA
National Institute of Health, Bethesda, MD
Publishing disparities are shown by the Gini coefficient
Programmatic resources for the analysis of scientific data are integral to modern scientific analysis. Whereas some technologies are still restricted to institutions and researchers with substantial capital, the Internet in combination with a relatively rapid decline in the cost of computer technology has enabled worldwide access to these Scientific Data Analysis Resources (SDARs). Three of the four most cited papers in all of science within the past 25 years have reported the development of SDARs, which speaks to the extent to which they have influenced research. But the continued accessibility of SDARs remains an issue. We sought here to examine some of the factors related to production and stability of SDARs, such as the size of the scientific laboratories that produce them, a lab's general proclivity to produce SDARs as part of their research focus, and the general distribution of SDAR production among institutions.
We find the average number of authors per accessible SDAR is significantly higher than the average number per inaccessible SDAR. There were no substantial outliers in the data that biased the average (most authors per accessible SDAR was 58, and inaccessible SDAR was 53), so there are several possible explanations. The first is that, perhaps, authors who create SDARs also tend to be users of them and therefore more people tend to have a personal interest in its continued availability. A second possibility may be more sociological in that the more authors per SDAR, the more people there are to contact about its decay, to be vigilant about its accessibility, or to provide options if the primary maintainer becomes unable to maintain it (e.g., changes institutions). It is also possible that larger projects that involve more people also tend to result in a more useful and/or stable end-product.
The distribution of SDAR production suggests that bioinformatics may occupy somewhat of a unique niche among scientific disciplines, as it does not require extensive infrastructure to develop and deploy analysis software. Specialized institutions such as the European Bioinformatics Institute are able to make substantial contributions while not being among the top general players. Consequently, published SDARs can be produced from institutions that lack the resources of more elite institutions. However, whether the SDARs from elite institutions are more stable or more cited is not known and will be the subject of future study.
With a slow but steady annual rise in the Gini coefficient for SDARs, the data may be hinting that this more distributed state of affairs could be gradually approaching the more centralized production of biomedical research. Perhaps nimble early adopters contributed substantially to early efforts, and that contribution is now waning? It may also be that as the novelty of developing and sharing digital analysis resources wears off, large and more established organizations are venturing into that arena, a model that has been suggested for Internet technology in general.
Finally, by identifying which URLs should be stable and accessible on the basis of their support by organizational entities (e.g., publishers, DOIs, and archival sites like http://Webcitation.org), we found that a number of published errors have crept into the scientific record, at an approximate rate of 3%. Although this is a relatively low rate, it raises the question of how extensively such errors permeate other reported entities such as numbers (e.g., transposed digits), record identifiers, and names. Variation in URL construction is consistent with other reported variation such as the creation of acronym-definition pairs as well as chemical name spellings within text and even within databases, but computers are not as flexible as humans when it comes to tolerating this type of variation.
There are several limitations to this study. First, we relied upon automated classification of SDARs by keywords present within the abstract. In the future, we plan to crowdsource classification of URLs to better determine which ones are scientific data analysis resources. Another limitation is that not all SDARs are linked to by a URL present within the abstract, so the coverage of published SDARs may be incomplete. For identifying institutions, we attempted to be relatively generic, yet we may have been too generic. For example, the top-publishing institution was extracted as "The University of California", yet this is a system with many different universities. Thus, there is some bias in the institutional results due to the way names were parsed, which is why ZIP codes were analyzed as well. An area of future research would be to identify better mechanisms for identifying institution names, whether through text mining techniques such as term frequency/inverse document frequency, a curated list of institutions, or a combination of the two.
URL decay continues unabated, but in this study we attempted to analyze a subclass of URL, those reporting the development of Scientific Data Analysis Resources (SDARs). We found that average team size for SDAR production tends to be lower than for scientific publications in general, although larger team size correlated with SDAR persistence. SDAR production is less dependent upon institutional location and resources, as judged by the Gini coefficient, and groups producing multiple SDARs do not seem to differ from groups that have produced only one SDAR in the probability that their SDARs are still accessible. Finally, errors are creeping into the public record at a rate of about 3%, rendering the affected URLs, as written, invalid from the moment of publication.
Conflict of interest
The authors declare that they have no competing interests.
Declaration of funding
The authors would like to acknowledge the National Science Foundation (NSF) for funding this research and publication from grant # ACI-1345426.
The authors would like to acknowledge the National Science Foundation (NSF) for funding this research from grant # ACI-1345426. The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
- Marx V: Biology: The big challenges of big data. Nature. 2013, 498 (7453): 255-260. 10.1038/498255a.
- Perez-Iratxeta C, Andrade-Navarro MA, Wren JD: Evolving research trends in bioinformatics. Brief Bioinform. 2007, 8 (2): 88-95.
- Sheldrick GM: A short history of SHELX. Acta Crystallogr A. 2008, 64 (Pt 1): 112-122.
- Ducut E, Liu F, Fontelo P: An update on Uniform Resource Locator (URL) decay in MEDLINE abstracts and measures for its mitigation. BMC Med Inform Decis Mak. 2008, 8: 23. 10.1186/1472-6947-8-23.
- Hennessey J, Ge S: A cross disciplinary study of link decay and the effectiveness of mitigation techniques. BMC Bioinformatics. 2013, 14 (Suppl 14): S5. 10.1186/1471-2105-14-S14-S5.
- Wren JD: URL decay in MEDLINE--a 4-year follow-up study. Bioinformatics. 2008, 24 (11): 1381-1385. 10.1093/bioinformatics/btn127.
- Eysenbach G, Trudel M: Going, going, still there: using the WebCite service to permanently archive cited web pages. Journal of medical Internet research. 2005, 7 (5): e60. 10.2196/jmir.7.5.e60.
- Gardner D, Akil H, Ascoli GA, Bowden DM, Bug W, Donohue DE, Goldberg DH, Grafstein B, Grethe JS, Gupta A: The neuroscience information framework: a data and knowledge environment for neuroscience. Neuroinformatics. 2008, 6 (3): 149-160. 10.1007/s12021-008-9024-z.
- Gini C: Variabilità e mutabilità (Italian Transl: 'Variability and Mutability'). 1912, Bologna.
- Leasing Journal Citations. [http://www.nlm.nih.gov/databases/journal.html]
- Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, Schilling LM: Information science. Going, going, gone: lost Internet references. Science. 2003, 302 (5646): 787-788. 10.1126/science.1088234.
- Baethge C: Publish together or perish: the increasing number of authors per article in academic journals is the consequence of a changing scientific culture. Some researchers define authorship quite loosely. Deutsches Arzteblatt international. 2008, 105 (20): 380-383.
- Zetterstrom R: The number of authors of scientific publications. Acta paediatrica. 2004, 93 (5): 581-582. 10.1111/j.1651-2227.2004.tb02980.x.
- Wren JD, Kozak KZ, Johnson KR, Deakyne SJ, Schilling LM, Dellavalle RP: The write position. A survey of perceived contributions to papers based on byline position and number of authors. EMBO Rep. 2007, 8 (11): 988-991. 10.1038/sj.embor.7401095.
- O'Brien T, Yamamoto K, Hawgood S: Commentary: Team science. Academic medicine : journal of the Association of American Medical Colleges. 2013, 88 (2): 156-157. 10.1097/ACM.0b013e31827c0e34.
- Disis ML, Slattery JT: The road we must take: multidisciplinary team science. Science translational medicine. 2010, 2 (22): 22cm29.
- Zeileis A: ineq: Measuring Inequality, Concentration, and Poverty. 2014.
- Damgaard C, Weiner J: Describing inequality in plant size or fecundity. Ecology. 2000, 81 (4): 1139-1142. 10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2.
- Schneier B: The Battle for Power on the Internet. 2013, The Atlantic.
- Wren JD, Chang JT, Pustejovsky J, Adar E, Garner HR, Altman RB: Biomedical term mapping databases. Nucleic acids research. 2005, 33 (Database): D289-293.
- Wren JD: A scalable machine-learning approach to recognize chemical names within large text databases. BMC bioinformatics. 2006, 7 (Suppl 2): S3. 10.1186/1471-2105-7-S2-S3.
- Akhondi SA, Kors JA, Muresan S: Consistency of systematic chemical identifiers within and between small-molecule databases. Journal of cheminformatics. 2012, 4 (1): 35. 10.1186/1758-2946-4-35.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.