Scientists had to rename human genes because Microsoft Excel mistakes them for dates
Microsoft Excel’s automatic formatting is part of what makes the program a powerful tool for data visualization and analysis. But for geneticists, it’s this function that has proved problematic for their work.
In fact, it’s become so troublesome that the HUGO Gene Nomenclature Committee (HGNC), the scientific body tasked with standardizing human gene names, released new guidelines last week to prevent Excel’s automatic date formatting from altering data. For example, the gene known as MARCH1 (membrane associated ring-CH-type finger 1) is now labeled MARCHF1; the same goes for genes with symbols like SEPT2 and DEC1, which now go by SEPTIN2 and DELEC1, respectively.
The update was a long time coming. Over the past year, 27 human genes have been renamed because Excel kept misreading their symbols as dates, HGNC coordinator Elspeth Bruford told the Verge.
Easier than changing cell formats
By default, Microsoft Excel is preprogrammed to make it easier to enter dates, so it converts any entry that looks like a date without asking: type MARCH1 into a cell and it becomes 1-Mar. For many people who use Excel, that autoformatting is a godsend.
The corrections can spell trouble for analysis work, however, causing the software to skip or misinterpret the autocorrected genes on the spreadsheet. In addition, scientists looking for particular genes by their name may not see entries that have been corrupted.
“It’s really, really annoying,” explained systems biologist Dezső Módos, speaking to the Verge. He added that part of the reason why these errors occur is that scientists have relied on Excel for years to process numerical data. Those who know Excel can avoid this problem, but it’s particularly easy for mistakes to be introduced.
“[The date autoformat function] is a widespread tool and if you are a bit computationally illiterate you will use it,” Módos said.
Neil Saunders, a data scientist at the Centre for Education and Statistics in Australia, says there are plenty of better alternatives. He even wrote about it on his site in 2012, saying the problem has persisted since 2004.
“Excel is on their computers and they feel familiar with it, even if they can’t actually use it properly. Biologists, in particular, are reluctant to invest time in learning programming skills,” he told the Register. The problem could be avoided, he added, if scientists took the time to fix their settings when they import spreadsheets.
“But no one does this – they just click on a file name, it opens in Excel – boom, the damage is done.”
A study by the Alfred Research Alliance in Australia looked at genetic data from 3,597 published papers to determine how often Microsoft Excel autoformats gene names. The team discovered that over a fifth of the papers they sampled had erroneous gene name conversions.
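The article does not describe how the researchers scanned their data, so the following is only a rough Python sketch of the general idea: flag entries in a gene column that look like Excel-style dates instead of gene symbols. The pattern, the function name `find_converted_symbols`, and the sample column are all assumptions made for illustration, not the study's method.

```python
import re

# Assumed pattern: day-month or month-day strings such as "1-Mar" or "Sep-2",
# the form Excel typically displays after converting MARCH1 or SEPT2.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def find_converted_symbols(values):
    """Return the entries in a gene column that look like dates."""
    return [v for v in values if DATE_LIKE.match(v.strip())]


# Hypothetical column from a supplementary spreadsheet.
column = ["TP53", "1-Mar", "SEPTIN2", "2-Sep", "BRCA1", "MARCHF1"]
print(find_converted_symbols(column))
```

A check like this only spots entries that have already been converted; it cannot tell which gene an entry such as 1-Mar originally referred to.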
The HGNC even made a YouTube video on how to avoid the problem.
The new HGNC guidance is a step in the right direction, according to scientists. For one, it says that gene symbols should not be the same as “commonly used abbreviations.” It also requires them to be written using uppercase Latin letters and Arabic numerals.
Prior to the guidance's release, the HGNC had been working with scientists to address this issue, said Bruford. It was also the first time the guidelines were updated to counter problems caused by Excel. Earlier revisions were made to avoid potential offense.
For his part, Saunders believes the update isn’t a great solution, saying that Excel’s non-explicit conversion of data types is the main issue.
“But given that Microsoft won’t change its default Excel behavior and 16-plus years of attempts to educate biologists on the issue have failed, I suppose it is a practical solution.”