BMC Bioinformatics BioMed Central Correspondence

Background Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. Results The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. Conclusion Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. 
This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and in warning those who seek to interpret the value of a PID reported in the literature.


Background
The large amount of supporting resources necessary to replicate biomedical experiments includes but is not limited to raw data, experimental design specifications, specific software, statistical models, and experimental protocols. Researchers interested in extending or replicating results detailed in a published paper may attempt to use the supplementary resources located at a link within the paper together with their own interpretation of these other factors. Much has been written about the increasingly complex nature of replicating this form of work, including attempts to quantify the ability to replicate the original experimental design, environment, workflow and statistical interpretation. In this paper we focus on the simple ability to retrieve data that the original authors felt was of sufficient importance to reference in support of their results and to provide specifically as supplementary data, ostensibly available via an Internet accessible link. While some may question the value or necessity of supplemental data [1], there are numerous reasons for publishing data external to the article itself. These reasons range from size constraints of the journal format and various editorial concerns to the fact that some types of data simply cannot be usefully represented in traditional text or image format. The latter category includes supplemental items such as software (either executable or source code), databases and large data sets that others may wish to re-analyze or include in meta-analyses with other data. Hence, in some cases, supplemental data is a necessity if readers are to evaluate the published work, and the persistency of supplemental data is an important concern. To evaluate the long-term availability of supplemental data, we tested for the persistency of the data links from a representative subset of journals indexed within PubMed from 1998 to 2005.

Data retention and current journal supplementary data policies
Making data freely and easily available should be of concern to most academic researchers who publish in biomedical journals. The National Institutes of Health (NIH) released a Policy Statement in 2003 stating that data must be maintained for three years after the termination of an NIH-sponsored grant [2]. In a separate notice from 2002, the NIH also states it "will expect investigators supported by NIH funding to make their research data available to the scientific community for subsequent analyses." [3]. Large research universities are now mandating, as part of their responsible conduct of research policies, that published research data, whether or not funded by bodies such as the NIH, be maintained and easily accessible for up to six years after the conclusion of research. It is reasonable to assume that these policies will become more common and widespread in the future. Given that grant funding is generally not available to support long term storage and maintenance of data generated on previous funding, and given that researchers may switch institutions, change careers or retire, it may not be possible in practice to assure proper data storage and availability for the lengths of time specified by either the NIH or the local institution. Hence, the ability to submit supplemental data to either a journal or a third party data repository provides a level of stability to data access that may not be achievable by the researcher who is left to his or her own devices.
There is no widespread agreement among biomedical journals on supplementary data policies. One reason for this variance is the differing importance and relevance that domain-specific journals place on the different forms of supplementary resources. These resources come in many forms: small and large data sets, experimental protocols, supplementary discussion, links to online biomedical databases, web-based software, source code with or without example data sets, software manuals, etc. Most journals state that data that is directly relevant to a manuscript should be included within the paper, and that additional data that supports conclusions should be made publicly available. Some journals give very specific instructions for each type of desired supplementary resource; manuscripts involving sequences or structural biological data are typically required to submit the data to a particular public repository prior to or by publication. Other journals state that manuscripts that reference databases should make these databases freely available to all and without password-protection. Few state as clear a policy as the journal Nature, which requires that supplementary material be stored at either Nature or an accredited independent website, and that "such material cannot solely be hosted on an author's personal or institutional site." [4]. Nature additionally provides a "Materials complaint" procedure if these guidelines are not followed. In general it appears that supplementary data is accommodated by most biomedical journals, but few appear to require that the data be submitted directly to the journal itself, though this is often possible. However, some journals take an approach that is almost the opposite of Nature's: they discourage authors from submitting supplemental data and instead suggest that authors host said data on their own site. The apparent motivation for such a policy is to limit the long term cost to the journal for data storage and maintenance. Our personal experiences and occasional frustrations in trying to obtain supplemental data led us to perform a study of the persistency of said data.

Supplemental data links within PubMed abstracts
For the set of records that specified a link within the abstract, we found that an average of 74% of manuscripts published between 1998 and 2005 had links that were still accessible (Chart 1). We note that this result is influenced by the low number of manuscripts published prior to 2001; for manuscripts published since 2001, an average of 85% of links were still available. Of the inaccessible links, 93% were to locations outside the journal of publication (Figures 1 and 2).

Supplemental data links within full text manuscripts from three selected journals
Within this set we found an average of 83% of links were available approximately a year after publication. Of the inaccessible links, 55% were to locations outside the journal of publication (Figure 2). This varied between the journals; manuscripts published within Nucleic Acids Research had no links to data outside the journal, whereas Bioinformatics had the bulk of its total links to data (73%) referring to locations outside the journal. All of the inaccessible data associated with the journal Genetics (33%) came from manuscripts that stated "supplementary data available at genetics.org", where the data was not in fact present at the supplementary data portion of that website (despite reasonable amounts of manual effort to find said data). Despite these individual journal differences and the varying author compliance with supplementary data policies, we feel that the finding of an average of 18% of publications within this dataset having unavailable links confirms the results from 2001-2005 identified by searching abstracts alone. In addition, we were quite surprised to find such a large percentage of supplemental data (17%) that was not available only 1 year and a few months past publication. This result, combined with a non-zero y-intercept at the most recent time point on the right of Figure 2, suggests that approximately 10% of all supplemental data links in published articles never actually had the supplemental data available. This further suggests that the availability of supplemental data is often not rigorously checked by editors or manuscript reviewers prior to publication.

Limitations
Our study sampled supplemental data links in both abstracts and a small number of selected full text manuscripts over a 7-year and a 3-month time period, respectively. Our relatively simple text searches resulted in a fairly small sample size of 655 links from abstracts and 161 links from the selected full text publications. For the earliest years (1998-2000) in which we found links to supplemental data in abstracts, the sample sizes were quite small and it is difficult to draw conclusions from these early data. However, in the later years, a fairly constant 10-20% of the links do not have supplemental data available and these results are consistent with those obtained from our selected full text mining.
It is possible that some of the links we checked were down only temporarily during the time period in which we checked them. Prior work determined that this could be the case up to 19% of the time, but also noted that approximately the same amount (19%) were consistently unavailable [6]. In addition, even if said data was only temporarily missing, it was missing nonetheless, and this is a reasonable reflection of what a researcher in the field would find. It is also possible that the missing data could have been obtained with further efforts on our part, perhaps through direct email contact with authors. However, our goal was not to evaluate whether or not supplemental data could be obtained at any cost but rather to evaluate if data that was ostensibly available through published links could be quickly, easily and conveniently obtained. In addition, it should be noted that automated data mining and aggregation tools would require that such links work.
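One way to separate temporary outages from consistently missing data is to check each link on several occasions. The following Python sketch is only an illustration and is not the procedure used in this study, where each link was checked manually. The function name, the three category labels and the choice to record one boolean per check are all assumptions made for the example.

```python
from typing import Sequence


def classify_availability(observations: Sequence[bool]) -> str:
    """Summarize repeated checks of one link.

    Each entry in `observations` is True if the link answered on that
    occasion and False otherwise. The labels are hypothetical.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    if all(observations):
        return "available"
    if not any(observations):
        return "consistently unavailable"
    # A mix of successes and failures suggests a temporary outage.
    return "intermittent"
```

A link that failed on every occasion would be reported as consistently unavailable, while a link that failed only on some occasions would be flagged as intermittent and could be rechecked later.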

Conclusions and recommendations
Biomedical manuscripts are virtually guaranteed to increasingly refer to large data sets and supporting technical material that cannot be contained within the scope of the published manuscript. Journals that are focused on their unique research domains will place different emphasis on the varieties of supplementary data or technical materials relevant to their published manuscripts. A journal interested in public health may, for example, consider data derived from large population based data sets to be crucial to its research, whereas computational biology journals may place a higher emphasis on software code and example data sets. Despite this variance in the definition of what is relevant, we feel that there is a broad need for improvement in providing persistent access to these resources, regardless of the journal's research focus. There are multiple initiatives at both federal funding agencies and local institutional levels that are calling for greater data sharing and research collaboration. We feel that the following five recommendations address a practical approach to ensuring data persistency for biomedical research publications.
1) Journal policies. At present, journal policies with respect to supplemental data are inconsistent and widely varied: some (such as Nature) require that all supplemental materials be provided with the manuscript for storage by the publisher or submitted to an independent and credible repository, while other journal policies are relatively silent on this issue. Our research shows that supplemental data that is stored on a publisher's website has a significantly higher probability of being persistent than data stored on an author's own website. Hence, we encourage all journals to adopt and extend a policy similar to Nature's if the supplemental data is directly supporting conclusions drawn within the manuscript. This policy states that others should be able to replicate and build upon the author's claims, that specialized data such as DNA sequences or atomic coordinate data must be submitted to and referenced from a third party repository such as PDB/GenBank/EMBL/DDBJ or SWISS-PROT and that an author's own web site is not acceptable for these forms of data. The policy further states that any supporting data sets should be deposited in publicly accessible databases wherever possible, but for occasions for which there is no public repository they should be made available at the authors' own website, though this can be cause for refusal of publication if the Nature referees cannot be assured of the resources being freely available to the community. Most importantly, the journal also provides a "Materials Complaint Procedure" [9] that allows readers to complain to the journal about problems with gaining access to supplementary data for published manuscripts. At this time Nature does not have a recommendation for access to dynamic data such as software source code.
If a third party attempting to check or reproduce the results is likely to require or strongly benefit from availability of the supplemental data, we feel the publisher has an obligation to assure that such data is available for review, either by storing the data on their own site or by requiring the authors to submit it to a credible third party for redistribution. In addition, we encourage publishers to consider accepting and maintaining other supplemental data described within the manuscript that authors feel would benefit the research community even when such data is not directly required to support the conclusions of the manuscript. In many cases, data produced as part of a publication but not specifically required to support the conclusions drawn within could be useful to others who have different research questions or who wish to mine this data in a larger context.
2) Authors should be required to call out all links within a manuscript either in a specially labelled section near the beginning or end of the manuscript or via separate entry into a web-based form upon submission. The motivation for gathering all links in a common and separate area of the manuscript is to make it simpler for reviewers to identify and check the availability of the resources at said links. In addition, this process (especially if all links were submitted separately in a web based form) would make it easier to automate checking of data availability.
3) Publishers should develop systems to automatically check if links provided within submitted manuscripts are "alive". While this would not assure that the correct data was available at such a link, it would catch the majority of data problems we have discovered, in which external links are simply not available or in which the provided URL was malformed or mistyped. In addition, we feel it would be wise for publishers to develop a database of all links to supplemental data within manuscripts so that ongoing monitoring of data availability could be accomplished post-publication. Such a system could be developed to contain the original link, the authors' email addresses, a redirected link and perhaps a small amount of associated annotation (a reference to the article and brief text description of the data). With such a system in place, it would be a simple matter to write a script that would perform regular checks to see if the link provided was still available. When it was not, an email notification could be sent to the authors to alert them to this fact so that the problem could be corrected or a re-directed link could be provided. While this would not solve all problems associated with missing supplemental data, we posit that a proactive approach such as that suggested would significantly increase overall data availability.
4) Reviewers and editors should be specifically required to assure that all supplemental data is actually available upon submission. Our work suggests that approximately 10% of all supplemental data was not available at the time of publication, which further implies that data availability was not carefully checked in the review or editing process.

5) We encourage the NIH to develop not only policies but, more importantly, funding mechanisms and/or NIH supported sites for the long term storage and maintenance of heterogeneous supplemental data. We recognize that certain types of supplemental information, such as dynamically generated web sites that are connected to sophisticated databases and/or analytical tools, may necessarily require storage and maintenance by the authors. However, supplemental data that is instantiated in flat files (documents, spreadsheets, images, source code, executable code etc) should be stored in a system designed for long-term data persistency. In addition, we would encourage the NIH to perform an informal audit of the ability of researchers to comply with presently existing policies when funding for long term data storage is not necessarily provided to either the researcher or the researcher's institution. Our own experience, while admittedly anecdotal, suggests that long term maintenance of digital data within a researcher's own lab is often not effectively managed due to a variety of circumstances that include a lack of funding to adequately support an internal IT infrastructure, a lack of sophistication in data storage and backup, and social/human factors.
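The link database and periodic check described in recommendation 3 could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not an existing publisher system: the record fields follow the ones listed in the recommendation, but the class and function names, the use of an HTTP HEAD request, the 10 second timeout and the wording of the notification are all hypothetical. Actually sending the email is left out.

```python
import http.client
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class LinkRecord:
    """One supplemental-data link (hypothetical schema)."""
    original_url: str
    author_emails: List[str]
    article_reference: str
    description: str = ""
    redirected_url: Optional[str] = None

    def current_url(self) -> str:
        # Prefer the redirected link when the authors have supplied one.
        return self.redirected_url or self.original_url


def is_alive(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL responds without an error status.

    This only shows that something answers at the address; it does not
    show that the correct data is there. Some servers reject HEAD
    requests, so a real system might fall back to GET.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = getattr(response, "status", None)
            return status is None or status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts and malformed URLs.
        return False


def find_dead_links(
    records: Iterable[LinkRecord],
    checker: Callable[[str], bool] = is_alive,
) -> List[LinkRecord]:
    """Return the records whose current link fails the check."""
    return [r for r in records if not checker(r.current_url())]


def notification_text(record: LinkRecord) -> str:
    """Draft a message asking the authors to fix or redirect the link."""
    return (
        f"The supplemental data link {record.current_url()} cited in "
        f"{record.article_reference} could not be reached. Please restore "
        "the data or provide a redirected link."
    )
```

Passing the checker as a parameter keeps the scheduling, storage and email steps independent of how a link is tested, so the same loop could be run at submission time and again at regular intervals after publication.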
In conclusion, we feel that long-term persistent access to the rapidly increasing and predominantly digital data that supports modern biomedical research should be treated with the same diligence applied to the published research work itself. Journal publishers are helping drive their individual fields, and as such have a special responsibility to maintain accurate references to supplementary data that specifically supports conclusions in their manuscripts for both present and future researchers. Our work suggests that the assurance of data persistency should not be left solely to the authors, but should be managed by clear policies of the publishing journal or other responsible institution. In addition, while we do not specifically address data persistency for the considerably larger set of data that is not published, our work suggests that the persistency of unpublished data is likely to also become a future research issue. Funding organizations such as the NIH and NSF may need to develop additional policies and, more importantly, specific funding mechanisms to assure that such data is available into the future. Similar issues are likely faced by major funding agencies in other countries, so these recommendations may have merit outside the US.
relatively high numbers of hits in our searches of abstracts so we could reasonably anticipate that we would find large numbers of supplemental data links within small samples of full text manuscripts. Third, these journals all have reasonably high impact factors and sample a variety of different types of biological researchers (from "bench biologist" to bioinformatician). We searched the full text of all the publications in each journal for manuscripts published between October and December of 2004 for the phrases "supplementary information", "supplemental data", "supplemental material" or "supplementary data". This returned 71, 60 and 30 unique manuscripts for Bioinformatics, Nucleic Acids Research and Genetics, respectively.
For both methods 1 and 2, each link was manually checked to determine if the supplemental data mentioned in the text was available [see Additional file 1]. In addition, each link was categorized by whether the supplemental data was hosted on the journal or a non-journal (typically the authors') website.
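The phrase search and the journal versus non-journal categorization could be expressed in code along the following lines. This Python sketch is illustrative only: the four phrases are those listed above, but the function names, the case-insensitive matching and the rule of comparing a link's host name against a list of journal domains are assumptions, and the categorization in this study was done manually.

```python
import re
from typing import Iterable
from urllib.parse import urlparse

# The four search phrases used for the full text manuscripts.
SUPPLEMENT_PHRASES = (
    "supplementary information",
    "supplemental data",
    "supplemental material",
    "supplementary data",
)
# Case-insensitive matching is an assumption, not stated in the text.
_PHRASE_RE = re.compile(
    "|".join(re.escape(p) for p in SUPPLEMENT_PHRASES), re.IGNORECASE
)


def mentions_supplement(full_text: str) -> bool:
    """Return True if the text contains any of the search phrases."""
    return _PHRASE_RE.search(full_text) is not None


def classify_host(url: str, journal_domains: Iterable[str]) -> str:
    """Label a link 'journal' if its host is a journal domain or a
    subdomain of one, otherwise 'non-journal'."""
    host = (urlparse(url).hostname or "").lower()
    for domain in journal_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return "journal"
    return "non-journal"
```

For example, with `journal_domains=["genetics.org"]`, a link on `www.genetics.org` would be labelled as journal hosted, while a link on an author's laboratory server would be labelled as non-journal.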