Archived Comments for: An analysis of the Sargasso Sea resource and the consequences for database composition

  1. Good to see criticism of this type of data

    Neil Saunders, University of Queensland

    20 April 2006

    It's good to see someone take a critical look at environmental sequence data. One thing that the authors don't mention is the questionable validity of many protein sequences that are annotated as "hypothetical". The Sargasso sequences in GenBank do not appear to have been annotated for 23S rRNA genes, and many of the short, hypothetical ORFs are in fact just translated 23S regions. You can see this for yourself: BLAST a 23S sequence (e.g. from E. coli) against the env_nt dataset, note the sequence coordinates of the 23S hit, then visit the GenBank entry for that hit (e.g. gi 44249358). In many cases the so-called hypothetical ORFs lie within a 23S rDNA gene.

    Perhaps NCBI and the other databases should consider segregating environmental data from the bulk of the nr dataset, to avoid contaminating it with junk.

    Competing interests

    None declared
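
The manual check described in the comment above (compare the coordinates of a 23S BLAST hit with the coordinates of the annotated ORFs on the same GenBank entry) can be sketched as a simple interval-overlap test. This is only an illustration: the `Feature` type, the function names, the 0.9 threshold, and all coordinates are hypothetical and are not taken from any real BLAST output or GenBank record. In practice the hit coordinates would come from the BLAST report and the ORF coordinates from the entry's feature table.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    """An annotated ORF on a contig (hypothetical type for illustration)."""
    name: str
    start: int  # 1-based, inclusive; assumed start <= end
    end: int    # 1-based, inclusive


def overlap_fraction(feature: Feature, hit: Tuple[int, int]) -> float:
    """Fraction of the feature's length covered by the rRNA hit."""
    # A minus-strand BLAST hit is reported with start > end, so sort first.
    hit_start, hit_end = sorted(hit)
    overlap = min(feature.end, hit_end) - max(feature.start, hit_start) + 1
    length = feature.end - feature.start + 1
    return max(overlap, 0) / length


def orfs_inside_hit(features: List[Feature], hit: Tuple[int, int],
                    min_fraction: float = 0.9) -> List[Feature]:
    """Return ORFs that lie almost entirely within the rRNA hit.

    The 0.9 threshold is an arbitrary choice for this sketch.
    """
    return [f for f in features if overlap_fraction(f, hit) >= min_fraction]


# Made-up coordinates, not from a real record.
hit_23s = (2100, 150)  # subject coordinates of a minus-strand 23S hit
orfs = [
    Feature("orf_a", 200, 520),    # entirely inside the hit
    Feature("orf_b", 1900, 2400),  # partial overlap only
    Feature("orf_c", 2500, 2900),  # outside the hit
]

flagged = orfs_inside_hit(orfs, hit_23s)
print([f.name for f in flagged])
```

ORFs flagged this way are only candidates for mis-annotated rRNA; each one would still need to be inspected by hand, as the comment suggests.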