
Try creating a UTF8 CSV-file that will open correctly in Excel.

Hint: It's not possible. My favourite chart about Excel's insufficiencies:

    Encoding  BOM      Win                            Mac
    --------  ---      ----------------------------   ------------
    utf-8              scrambled                      scrambled
    utf-8     BOM      WORKS                          scrambled
    utf-16             file not recognized            file not recognized
    utf-16    BOM      file not recognized            Chinese gibberish
    utf-16LE           file not recognized            file not recognized
    utf-16LE  BOM      characters OK,                 same as Win
                       row data all in first field
(https://stackoverflow.com/questions/6588068/which-encoding-o...)
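Per the chart, UTF-8 with a BOM is the one combination that Windows Excel opens correctly. In Python, the `utf-8-sig` codec prepends the BOM automatically, so producing such a file is a one-liner (filename is just an example):

```python
import csv

# "utf-8-sig" writes the UTF-8 BOM (EF BB BF) at the start of the file,
# which is what Windows Excel keys off to detect UTF-8.
with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Müller", "Zürich"])
```

Note `newline=""`: the `csv` module handles line endings itself, and this prevents Python from translating them a second time on Windows.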


I found this unbelievable when I first came across it a few years ago.

More frustrating is when you need to produce a UTF-8 CSV from Excel. I always used OpenOffice Calc for this as a workaround.

To import a UTF-8 CSV into Excel you just run the text import wizard and specify the encoding.


Yes.

My relief in this area was discovering that most (western) non-software companies dealing with Excel CSV know that encodings are a problem, and many have settled on using CP-1252 instead of UTF-8.
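Re-encoding for such a partner is straightforward as long as the data fits in CP-1252; anything outside it (CJK text, most emoji) raises `UnicodeEncodeError` unless you pick an error handler. A minimal sketch, with the content inlined for illustration (in practice you'd read it from the UTF-8 file):

```python
# Both "ü" and "€" exist in CP-1252, so this round-trips cleanly;
# a character like "日" would need errors="replace" or similar.
text = "name;price\nMüller;€10\n"
data = text.encode("cp1252")
with open("output.csv", "wb") as f:  # "output.csv" is an assumed name
    f.write(data)
```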


Hi, you should check out my shell script gist that helps you with it. Remember to change the path to your LibreOffice executable https://gist.github.com/jhanschoo/bcd6d08d1e0f4b5d52735287f3...


CSV is a poorly defined format. It doesn't have a standard. No wonder it screws up when you start varying character encoding.

The first practical lesson of CSV is that it WILL fail miserably when you change character encodings, use non-ASCII characters, quote strings, or vary end-of-line characters...


> CSV is a poorly defined format. It doesn't have a standard.

Wrong: CSV is standardized in RFC 4180:

> https://tools.ietf.org/html/rfc4180


Ex post facto. It was a "standard" long before the RFC, so there are many CSV documents which need to be read but don't conform to the RFC spec. You should create a CSV writer that conforms, but your reader needs to support much more. Then if you save back to the RFC spec, the document might break other downstream readers which were written long ago and don't follow the new RFC spec. It's an imperfect file format.
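The "strict writer, lenient reader" split is easy to see with Python's `csv` module. The writer below emits RFC 4180 form (CRLF line endings, fields containing commas or quotes get quoted, embedded quotes doubled), while the reader happily accepts pre-RFC variants like bare LF and unquoted fields:

```python
import csv, io

# Conformant output: CRLF terminators, minimal quoting, doubled quotes.
buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
w.writerow(['he said "hi"', "a,b"])

# Lenient input: LF-only line ending and an unquoted trailing field
# still parse, which covers a lot of legacy CSV in the wild.
rows = list(csv.reader(io.StringIO('"he said ""hi""",a,b\n')))
```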


Indeed. Differences in encoding or presence/lack of a BOM in UTF-8 are a small set of issues. I've dealt with financial data from very large public banks and vendors that are horrible about vending proper CSV data. Either improper or unescaped quotes (or delimiters). It takes weeks to get the data corrected, so you manually correct the data or write a custom parser to try and navigate around it. One of my faves was a vendor that didn't escape the legal name of a company, so "Microsoft,Inc" would be 2 fields, rendering most of that record gibberish.

I generally agree with the idiom "If you write your own CSV parser, you're doing it wrong", but sometimes you have no choice but to work around the idiocy of others.
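The "Microsoft,Inc" case above can sometimes be patched after the fact if you know the expected column count and which column overflowed. This is a hypothetical sketch, not a general fix: it assumes the extra commas all came from one known column, which is exactly why such repairs stay ad hoc.

```python
import csv, io

EXPECTED = 3   # assumed column count for this feed
NAME_COL = 1   # assumed position of the unescaped legal-name column

def repair(row):
    """Re-join fields split by unescaped commas in the name column."""
    extra = len(row) - EXPECTED
    if extra > 0:
        row = (row[:NAME_COL]
               + [",".join(row[NAME_COL:NAME_COL + extra + 1])]
               + row[NAME_COL + extra + 1:])
    return row

raw = "42,Microsoft,Inc,100.0\n"  # 4 fields where 3 were intended
rows = [repair(r) for r in csv.reader(io.StringIO(raw))]
```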


It is amazing how badly Excel handles CSV files. My favorite thing is where it will decide for some goddamn reason that a numeric or string column is actually supposed to be a date, or a currency, or some other random thing, and automatically format the column. Cue the calls from customers who don't know how to turn this "feature" off, when they export something as CSV and think it is corrupted.


This misfeature is also locale dependent. It's impossible to know how your customer's Excel will behave, unless you test every possible localization.
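One commonly used workaround (a community trick, not anything Excel guarantees) is to emit the fragile values as formulas that evaluate to text, e.g. `="03-04"`, so Excel displays the literal string instead of guessing March 4th or April 3rd depending on locale:

```python
import csv

def excel_text(value):
    """Wrap a value in a ="..." formula so Excel treats it as text."""
    return '="{}"'.format(value.replace('"', '""'))

# "safe.csv" is an example filename; utf-8-sig per the chart upthread.
with open("safe.csv", "w", encoding="utf-8-sig", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "code"])
    w.writerow(["1", excel_text("03-04")])
```

The downside is that the file now contains formulas, so it is only suitable for files destined for Excel, not for further machine processing.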



