Maybe my bug? I wrote most of the original Excel text-based file import code in 1984 (including the SLYK format, which was the main way we migrated files from Multiplan to Excel).
Hi mckoss, awesome to hear one of the main Excel architects is here on HN!
Any interesting stories to share? I think we'd all be interested.
Mind if I ask a question directly related to one of my comments a few days ago? I was lamenting the fact that the ASCII codes 29, 30, 31 (Group, Record, and File separators) never really became widely implemented, as these were specifically designed to delimit data. Ie, one could easily include commas, new lines/carriage returns, etc in data cells without clashing. But instead CSV seems to be the most common standard for tabular data.
Were these ASCII codes ever considered for tabular files?
You couldn't conveniently type those codes on a keyboard and there was no standard way to display them. You'd need a special editor for such files, eliminating the main advantage of an interchangeable text based file format.
This is one of the great things about HN, and indeed the internet at large. Someone posts a link to a little quirk of our world, and then someone is able to pipe up and say "hey that's me! I made that thing!" Even better when it's something like a software bug.
This is why I like HN. Opening the comments and seeing "oh maybe this is due to something I did back in the 1980s when I wrote most of that, here's some inside info/explanation" about something as significant as Excel's file handling is something that will never grow old for me.
That's probably how it has been fixed. I cannot test right now, but since the article doesn't apply to newer versions of Office, I guess they are no longer susceptible to that.
Another bug with CSV files is that opening a CSV file triggers a full recalculation, even if the calculations are set to manual. This is clearly a wrong behaviour.
This is why HN is cool. You see a Microsoft knowledgebase article documenting an obsolete enterprise software's bug and its workaround, then an anonymous account takes credit for it. This is what I come here for.
Maybe this was written quite a bit later, but can you explain why the following would allow data to be returned in Excel 97?
"If you move your mouse pointer continuously while the data is being returned to Microsoft Excel, the query may not fail. Do not stop moving the mouse until all the data has been returned to Microsoft Excel."
I've run into this error many times for a decade, always when I make a CSV file of a table whose first column is "ID".
It seems to me that Microsoft by now could have improved its tests. If the first two letters are "ID", but if "there are no valid SYLK codes after the 'ID' characters," then maybe it was never meant to be an SYLK file. If the file's suffix is .csv, then maybe you should just treat it as a CSV file. If the file's suffix is .txt, then maybe you should treat it as a text file.
I've run into this bug too and it was mystifying and (iirc) not overridden by manually loading the file using the Text Import wizard. But this...
"If the file's suffix is .csv, then maybe you should just treat it as a CSV file"
doesn't make me any happier, because there are a lot of slight variations on CSV files or a lot of different ways you might want to load a CSV file (perhaps you'd like to specify the character set, or set columns to be text or date format, or not have leading zeroes stripped or numbers in brackets turned into negative values), and if I absentmindedly name them .csv then again I can't use the full loader, because Excel knows best.
And if you want to read a CSV file with a VB macro it's even worse, because you can specify all the parameters but it just silently overrides them all with the CSV defaults just because the file extension is CSV. Hours of debugging spent on that...
For historical reasons, the MZ is interpreted as the header of a 16-bit executable. When you run something under CMD (hopefully I'm getting this right, it's been a few years) it first runs it using the CreateProcess API, which just tries to run the file as an executable, regardless of extension, then it falls back to ShellExecute if it doesn't have an MZ header, which dispatches based on extension.
Basically you get a dialog saying the text file isn't a compatible executable. If you change the MZ to anything else, it opens in notepad.
Windows 7 64 bit here - it just has the error message box, and the same text in the cmd window, it doesn't open the file though :)
"This version of foo.txt is not compatible with the version of Windows you're
running. Check your computer's system information to see whether you need a
x86 (32-bit) or x64 (64-bit) version of the program, and then contact
the software publisher."
So this would be another problem that could be avoided if "CSV" would actually friggin mean "comma separated values" and not "whatever Windows might consider comma-equivalent depending on locale and/or phase of the moon"?
> A SYLK file is a text file that begins with "ID" or "ID_xxxx", where xxxx is a text string. The first record of a SYLK file is the ID_Number record. When Excel identifies this text at the beginning of a text file, it interprets the file as a SYLK file. Excel tries to convert the file from the SYLK format, but cannot do so because there are no valid SYLK codes after the "ID" characters. Because Excel cannot convert the file, you receive the error message.
This is usually called "magic string" or "magic number" (https://en.m.wikipedia.org/wiki/Magic_number_(programming)). It has nothing to do with the comma, and everything to do with SYLK using a pretty risky magic string (ID) and Excel not having a "try SYLK, if that fails, try as csv".
tl;dr: this is about guessing the input format and has nothing to do with the delimiter.
It's not just ID, there's also mandatory fields after that, separated by a semicolon. Thus 'ID;P' would be a valid header for SYLK and e.g. for German "CSV"s. If CSV would stick to commas (and boohoo, just escape more fields when this causes locale issues), then even such overly simple heuristics would've sufficed, wouldn't it?
Although you'd just run into other pathetic cases with such an informal format (CSV, not SYLK). But hey, I need to interchange formats with some IT guys and don't have a big IT department (ready) and they don't want to parse XLS(X), so I just use CSV. Now you have two problems.
Pretty sure any locale-aware OS treats comma different depending on what locale has been used, no? Anyway commas are historically used as seperators in numbers in some countries, and in text, etc, so if you really want to shed your blame all over something, maybe it shouldn't be Windows here but rather the people who came up with comma as a record seperator in the first place. Probably it was ok for them in that time, at that place (US I'm guessing) and for the usage needed back then, but I'd really wish they chose a different seperator.
I expect what the parent is truly imagining (but hasn't really thought through the usability implications of) is a format where ASCII newline + ASCII comma are used as delimiters, and then everything else is encoded further so as to basically become opaque octet-strings (whose encoding is such that it will never include '\n' or ',').
In other words, if you take CSV and make it completely useless for hand-editing in a text-editor by taking it 90% of the way to being a length-prefixed binary encoding, you'll have what the programmer's mind intuitively jumps to as "what CSV should be."
Why not just use ASCII 31/30 instead of comma/new line. Just give it a new name like ASV (ASCII separated values). What would take to give glyphs to those characters? Just add them to some font?
Actually newlines and commas are OK, and in fact arbitrary bytes can be put into CSV, because officially[1] CSV supports escaping using quotes (and escaping quotes by doubling them).
Jeez, how many SLYK files out there? I venture to guess there is an order of degrees more attempts at opening CSV files with ID as the first header than attempted opens of unconverted SLYK files in excel.
When I admire Microsoft, it's most often because despite nearly 4 decades of bloat to support, they still can ship reliable software to millions.
I kind of wish I could just use a version from a decade or 2 ago. The (4 year old) versions of word and excel that the uni site I work for use are so clunky to navigate. I loved using word in pre OSX says - I wonder how much of that is rose tinted nostalgia.
"A SYLK file is a text file that begins with "ID" or "ID_xxxx", where xxxx is a text string. The first record of a SYLK file is the ID_Number record. When Excel identifies this text at the beginning of a text file, it interprets the file as a SYLK file. Excel tries to convert the file from the SYLK format, but cannot do so because there are no valid SYLK codes after the "ID" characters. Because Excel cannot convert the file, you receive the error message. "
I particularly like the step by step explaination of how to add an apostrophe at the begining of the first line with a description of which keyboard key to press...
I feel like this is one of those bugs that would have taken less time to fix than it took to document.
After many years, I still have to slap myself when I catch myself thinking this way. Unless you know the architecture of the software, I've learned that it's often significantly harder than you imagine to fix seemingly simple and obvious bugs without breaking something else.
Fine, a slight correction then. It should be easy to fix except for terrible software architecture. Terrible software architecture is not a valid excuse when you have plenty of time and billions of dollars. So there should be little forgiveness for this bug still existing.
As a consumer affected by this bug, sure, you may find it inexcusable. But the basic truth about any mature software product used by tens of millions is that there will be a laundry list of bugs competing for limited resources. So there will always be bugs that you want to fix but simply isn't a high enough priority relative to all the other bugs or features.
There are limited resources, but there are enough resources to improve the architecture over more than a decade. And if they decided to avoid architecture changes, there would be nothing in the way of throwing lots of siloed software engineers at bugs to get more of them fixed. They're not lacking for money with so many customers. It's possible for them to set a much more inclusive cutoff.
They could have fewer bugs per feature than almost any other product after being stable and popular for so long. But they choose not to make that a goal.
This is basically what happens, at least on my Mac - it complains about it not being a valid SYLK file, then I think, "I have no idea what that means", then I dismiss the error and it opens it up as a CSV.
Excel is quite often useless with CSV, even if its contents are pure ASCII. Whether or not Excel will open a standard CSV file delimited with comma's (as per RFC 4180) properly also depends on the locale of the computer running Excel. So if you run Excel on Windows with a Dutch locale, it will by default fail to open the CSV normally and jam all its contents in one cell, because it expects a semi-colon as delimiter instead of a comma. To open standard CSV you have to import the file instead.
Someone at some point decided that the Dutch use semi-colons instead of commas to delimit tabulated data, and decided to apply this logic to CSV files as well. I would like to know the history behind such a decision! Switch your locale to English US, and it opens normally.
LibreOffice just opens the CSV and asks the user to confirm the delimiter and what not regardless of locale.
Yes, this is so horrible.
As a workaround I sometimes put "sep=," on the first line of the spreadsheet. This forces Excel to use the comma as a separator, regardless of the locale.
The downside of course is that every other tool on the planet shows the extra row, so it's not very useful.
I suffer from the reverse problem. I need to ingest CSV files produced by excel in various locales. Or is there some generally available alternative to excel that would produce valid RFC CSVs?
RFC 4180¹ has been in existence since 2005 and seems perfectly straight forward. The basic syntax is two pages of very understandable text. I hold that to be the definition of valid CSV.
CSV libraries tend to adhere to it (and often support additional options encountered in the wild as well); e.g., Apache Commons CSV².
A VBA import routine will always run in the US locale. However, csv files containing dates formatted according to other locales will have their days and months switched as long as the day is not bigger than 12.
Exactly. Unfortunately, double-clicking the file does not trigger that dialogue. You would be surprised how many IT-staffers simply give up at that point („Can I get the raw data?”, „Sure, here is the automatically generated CSV dump.”, „I can't open it…”).
Surprisingly, Excel can import CSV from UTF8 with BOM (at least new versions of Excel). However, be careful, as Ctrl+S will save it in some arbitrary codepage and you'll probably end up with a document like:
ID,Name
1,??? ????
2,Kevin Dub?is
so take care to always save such documents as .xmlx.
(If you convert them to your local codepage and import that - when at all relevant - it will be saved in the local codepage, so less risk of data loss.)
> Surprisingly, Excel can import CSV from UTF8 with BOM (at least new versions of Excel).
A BOM is abnormal and not recommended in UTF-8, it's basically a shitty MS hack: the BOM is necessary to detect endianness differences between the document and the host system, endianness has no impact impact on UTF-8.
And as freak_nl above notes, Excel uses locale-dependent field separators, so in some countries/locales it will try to import "CSV" with semicolon separators.
It actually does have CSV and UTF-8, but finding the correct settings is very hidden. I'm always amazed at how unable Microsoft is to improve the base of Excel and instead just keeps piling UI layers on top to hide the ugly places.
"Applies to
Microsoft Excel 2002 Standard Edition, Microsoft Excel 2000 Standard Edition, Microsoft Excel 97 Standard Edition, Microsoft Office Excel 2003 "
I had to fake a csv file for testing yesterday, and started out with id, then went "ahahah", and used name.
I know that's boring but this is one of those generational pieces of knowledge like "keep your docs up to date" that we need to build into software training somehow. (Or rather, this bug is not important, but the kind of training that imparts this knowledge will be vital in building a real software profession)
I don't know enough about SLYK to suggest a full solution, but couldn't additional checks be performed that confirm it's a SLYK file? Other than all SLYK files starting with ID, are there any other properties that exist in all SLYK files that could be checked against?