BMC Bioinformatics

Open Access

Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics

  • Barry R Zeeberg (1),
  • Joseph Riss (2),
  • David W Kane (3),
  • Kimberly J Bussey (1),
  • Edward Uchio (4),
  • W Marston Linehan (4),
  • J Carl Barrett (2) and
  • John N Weinstein (1)
Contributed equally
BMC Bioinformatics 2004, 5:80

https://doi.org/10.1186/1471-2105-5-80

Received: 05 March 2004

Accepted: 23 June 2004

Published: 23 June 2004

Archived Comments

  1. not only excel

    30 June 2004

    Heikki Lehvaslaiho, European Bioinformatics Institute

    I quickly tested a few common open source spreadsheet programs, openoffice.org calc, gnumeric and kspread, for this automatic symbol mutation ability.

    The following crude text table indicates whether the conversion happens by default in these programs. "date" means that a DEC1-type string gets converted; "float" means that RIKEN identifiers of the type "2310009E13" get converted.

    program     "date"   "float"
    calc        yes      yes
    gnumeric    no       yes
    kspread     no       yes

    Be careful out there!
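
A minimal sketch of how the two conversion types in the table above could be screened for before a file is opened in a spreadsheet. The patterns and the function name `conversion_risk` are assumptions made for illustration, not part of the original comment, and the month list is not guaranteed to cover every string a given spreadsheet program reinterprets.

```python
import re

# Assumption: month-like prefixes followed by a day number, e.g. DEC1 or SEPT2.
_MONTHS = "JAN|FEB|MAR(?:CH)?|APR|MAY|JUN|JUL|AUG|SEPT?|OCT|NOV|DEC"
DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?\d{{1,2}}$", re.IGNORECASE)

# Digits, a single E, then digits: parsed as scientific notation,
# e.g. the RIKEN-style identifier 2310009E13.
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def conversion_risk(identifier):
    """Return 'date', 'float', or None for one identifier string."""
    text = identifier.strip()
    if DATE_LIKE.match(text):
        return "date"
    if FLOAT_LIKE.match(text):
        return "float"
    return None


if __name__ == "__main__":
    for name in ["DEC1", "2310009E13", "TP53"]:
        print(name, conversion_risk(name))
```

Identifiers flagged this way can be protected by importing the column explicitly as text rather than relying on the program's default type guessing.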

    Competing interests

    None declared

  2. Well spotted

    21 July 2004

    Andrew Clegg, Birkbeck

    One to pin up on lab walls everywhere. I shudder to think how many pieces of work this might have affected.

    Competing interests

    None declared

  3. Special Interest group on spreadsheet risks

    26 July 2004

    Patrick O'Beirne, Eusprig

    The European Spreadsheet Risk Interest Group (EUSPRIG) discusses the prevention and detection of spreadsheet errors. You can read about the emergence of the discipline of Spreadsheet Engineering and other related information at our website, http://www.eusprig.org. We have just completed our fifth international conference and now have a corpus of approximately 100 peer reviewed papers in our subject domain.

    For more reports of spreadsheet errors, see our stories: http://www.eusprig.org/stories.htm

    We're not specifically a group to discuss Excel bugs and workarounds; the Excel-L list (http://peach.ease.lsoft.com/archives/excel-l.html) is a very busy source of information on these, as is, of course, the MS Knowledgebase.

    We are very interested in hearing from users about how they mitigate spreadsheet risks, what good practices they adopt, and so on. We are working with the ECDL Foundation on a syllabus of good practice for end users.

    Patrick O'Beirne, chair, Eusprig

    Competing interests

    none

  4. Good point.

    27 July 2004

    Carol Bult, The Jackson Laboratory

    The article raises a very good point. I've experienced similar behavior in Excel for other data types. I would add that it is always a good idea to carry a unique numeric database id along with gene names/symbols. Database accession ids may be less likely to be munged by Excel (unless the ids are alpha-numeric!), and since they are usually unique and permanent, they can be used to restore and/or update lists of gene names/symbols (which change all of the time).
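
The restore step suggested here might look roughly like the sketch below. It assumes each row pairs a stable database id with a symbol read back from a spreadsheet, and that a trusted id-to-symbol mapping is available; the function name `restore_symbols` and the ids in the example are made-up placeholders, not real accessions.

```python
def restore_symbols(rows, reference):
    """Replace possibly mangled symbols using a trusted id -> symbol mapping.

    rows:      list of (database_id, symbol) pairs read back from a spreadsheet
    reference: dict mapping database_id -> trusted symbol
    Returns (restored_rows, changed_ids).
    """
    restored, changed = [], []
    for database_id, symbol in rows:
        # Keep the existing symbol when the id is not in the reference.
        trusted = reference.get(database_id, symbol)
        if trusted != symbol:
            changed.append(database_id)
        restored.append((database_id, trusted))
    return restored, changed


if __name__ == "__main__":
    # Hypothetical ids, for illustration only.
    rows = [("ID0001", "1-Dec"), ("ID0002", "TP53")]
    reference = {"ID0001": "DEC1", "ID0002": "TP53"}
    print(restore_symbols(rows, reference))
```

Returning the list of changed ids alongside the restored rows makes it easy to review which entries were altered before the corrected list is reused.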

    Competing interests

    No competing interests

  5. 19 probe sets in Affymetrix's human U133Plus2.0

    28 July 2004

    Chao Lu, Hospital for Sick Children, Toronto

    A good point. Many people did not pay attention to this 'small' error.

    Here is a list of 19 probe sets with errors in their gene symbol (June 23, 04 annotation, Affymetrix) when opened in Excel:

    1570394_at ===> 1-Sep

    200902_at ===> 15-Sep

    208999_at ===> 8-Sep

    209000_s_at ===> 8-Sep

    212413_at ===> 6-Sep

    212414_s_at ===> 6-Sep

    212415_at ===> 6-Sep

    212698_s_at ===> 10-Sep

    213666_at ===> 6-Sep

    214298_x_at ===> 6-Sep

    214720_x_at ===> 10-Sep

    220781_at ===> 1-Dec

    221129_at ===> 2-Apr

    223362_s_at ===> 3-Sep

    225814_at ===> 1-Sep

    226627_at ===> 8-Sep

    227034_at ===> 10-Sep

    227552_at ===> 1-Sep

    233632_s_at ===> 1-Sep
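
A list like the one above could be produced by scanning an exported annotation file for symbol values that look like spreadsheet dates. The sketch below assumes the day-month form shown in the list ("1-Sep", "2-Apr"); the pattern, the function name `find_mangled`, and the placeholder row in the example are illustrative assumptions. It only flags suspicious values: the original symbol may not be recoverable unambiguously from the date alone.

```python
import re

# Assumption: mangled symbols appear as day-month strings such as "1-Sep".
MANGLED = re.compile(
    r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def find_mangled(pairs):
    """Return the (probe_set, value) pairs whose value looks like a date."""
    return [(probe, value) for probe, value in pairs if MANGLED.match(value)]


if __name__ == "__main__":
    pairs = [
        ("220781_at", "1-Dec"),     # from the list above
        ("221129_at", "2-Apr"),     # from the list above
        ("placeholder_at", "TP53")  # hypothetical unaffected row
    ]
    for probe, value in find_mangled(pairs):
        print(probe, value)
```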

    Competing interests

    None declared

  6. And the lesson is...

    11 April 2008

    Neil Saunders, University of Queensland

    And that's why bioinformaticians don't use Excel for this purpose. Or more generally, don't use spreadsheets as "databases".

    Competing interests

    None declared

  7. MS should pick this up

    12 May 2011

    Richard Jackson, Independent

    I believe a large part of bioinformatics is about providing a conduit between experts in different fields, as well as novel discovery. Often, people have their own preferences for data manipulation packages, and frequently scientists with less technical expertise tend towards Excel. Moving data back and forth between individuals in such ways gives ample opportunities for errors like this to arise.

    Hence, I think the situation is ubiquitous and serious enough to warrant intervention by Microsoft. I don't know if they've picked up on this article yet. Sadly, they don't seem to have anything in terms of a suggestion box on their website (I spent an hour looking!)

    Competing interests

    None declared

Authors’ Affiliations

(1) Genomics & Bioinformatics Group, Laboratory of Molecular Pharmacology, Center for Cancer Research (CCR), National Cancer Institute (NCI), National Institutes of Health (NIH)
(2) Laboratory of Biosystems and Cancer, CCR
(3) SRA International
(4) Urologic Oncology Branch, National Institutes of Health