I find it quicker and easier to create one-off text filters in awk than Perl or Ruby. That is, I may have a set of csv files for teachers, administrators and students by grade. I want to extract fields x, y and z from all these files, transform the data in some specific ways and then create a single output file (maybe to upload somewhere else or to use as part of a larger program). I certainly could write a script in an interpreted language to do this, but I often find it faster to do it in awk. (There's a sweet spot here: if the job is too complex or I'm going to do it over and over, a script may become the better choice.) I use awk for this kind of thing 3-5 times a week, easily.
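For data known to be simple (no quoted fields, no embedded commas — an assumption worth stating out loud), such a one-off filter can be a single awk line. The file names and field positions below are made up for illustration:

```shell
# Pull fields 1, 3 and 5 out of several simple CSV files and merge
# them into one tab-separated output file. FNR > 1 skips each file's
# header row. Assumes plain CSV: no quoted fields, no embedded commas.
awk -F',' 'FNR > 1 { print $1 "\t" $3 "\t" $5 }' \
    teachers.csv administrators.csv students_grade9.csv > combined.tsv
```

From there the combined.tsv can be uploaded somewhere or fed to a larger program, as described above.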
(I know your work, and I respect you. I mean the following as a compliment - truly, no sarcasm or snark.)
Sounds to me like you're being a professional programmer. I'm not. I'm a self-taught amateur. I program a little for myself, a little for the school I work for and a bunch to do sysadmin tasks (for myself and the school). I don't worry all that much about the special cases of CSV. I look at the data, massage it a little beforehand if I need to (usually with awk, sed, etc. as well), then run it through awk and do the job. If the result isn't all perfect, I clean it up quickly by hand using Vim. In this kind of context, 'robust' doesn't mean anything to me. I'm not expecting to do the exact same task ever again. As for running faster, I call bullshit and remind you of this[1]. awk runs plenty fast - seconds or less for such cases. As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.
The bottom line for me is that I like awk. Its defaults make sense to me, it's fast, it's very flexible and powerful. Finding the right Perl module or Ruby gem, learning its API, opening my editor, etc. - that all takes time. For small, one-off jobs, it's not worth it.
"As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script."
I'm a lot faster not writing code than I am writing code, and when I have to write code, I write code much faster when I let the computer massage and clean up data.
This is getting pretty silly. The computer doesn't magically massage data itself. You're saying you prefer to use Perl and a module from CPAN to help do that. Ok. I'm saying that in many cases, I prefer to use some combination of standard *nix tools, an editor and awk. The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish. So I'm just not seeing any argument from "write code much faster".
Again, this isn't robust. It's not professional. No tests in advance. But it works for me day in, day out, every week.
"I mentioned this approach on the #perl IRC channel once, and I was immediately set upon by several people who said I was using the wrong approach, that I should not be shelling out for such a simple operation. Some said I should use the Perl File::Compare module; most others said I should maintain a database of MD5 checksums of the text files, and regenerate HTML for files whose checksums did not match those in the database.
I think the greatest contribution of the Extreme Programming movement may be the saying "Do the simplest thing that could possibly work." Programmers are mostly very clever people, and they love to do clever things. I think programmers need to try to be less clever, and to show more restraint. Using system("cmp -s $file1 $file2") is in fact the simplest thing that could possibly work. It was trivial to write, it's efficient, and it works. MD5 checksums are not necessary. I said as much on IRC.
People have trouble understanding the archaic language of "sufficient unto the day is the evil thereof," so here's a modern rendering, from the New American Standard Bible: "Do not worry about tomorrow; for tomorrow will care for itself. Each day has enough trouble of its own." (Matthew 6:34)
People on IRC then argued that calling cmp on each file was wasteful, and the MD5 approach would be more efficient. I said that I didn't care, because the typical class contains about 100 slides, and running cmp 100 times takes about four seconds. The MD5 thing might be more efficient, but it can't possibly save me more than four seconds per run. So who cares?"
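The shell-level version of that "simplest thing" is just an exit-status check. The slides/ and cache/ directory layout and the echo placeholder here are invented for this sketch, but cmp -s (silent compare, exit status 0 only when the files are byte-identical) is the same tool the quote describes:

```shell
# Rebuild output only for inputs that actually changed, using the
# exit status of cmp -s (0 = byte-identical, non-zero otherwise).
# The slides/ and cache/ layout is hypothetical.
for f in slides/*.txt; do
    cached="cache/$(basename "$f")"
    if ! cmp -s "$f" "$cached"; then
        echo "regenerating HTML for $f"    # the real build step would go here
        cp "$f" "$cached"                  # remember this version for next run
    fi
done
```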
"The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish."
I underestimate how much time it'll take me to do a job manually when I already have great tools that do it the right way every time.
Maybe you're just that much better a programmer than I am--but I know when I need to parse something like CSV, Text::xSV will get it right without me having to think about whether there are any edge cases in the data at all. (If Text::xSV can't get it right, then there's little chance I will get it right in an ad hoc fashion.)
In the same way, I could write my own simple web server which listens on a socket, parses headers, and dumps files by resolving paths, or I could spend three minutes extending a piece of Plack middleware once and not worrying about the details of HTTP.
Again, maybe I'm a stickler about laziness and false laziness, but I tend to get caught up in the same cleverness MJD rightfully skewers. Maybe you're different, but part of my strategy for avoiding my own unnecessary cleverness is writing as little parsing code as I can get away with.
I could take a few more minutes to use something that makes me not have to worry about CSV edge cases, or I could just recognize that those cases don't apply to me, throw together an awk line in 20 seconds, and get to the pub early and get a pint or two down before my buds show up.
Who cares about what is better on paper? I have a life to live.
Personally, unless I knew the data, I'd have trouble enjoying that pint while worrying about data fields containing embedded commas, which my naive script would fail on...
In my experience, files have mixed formats far less often than you might suspect. Generally they were kicked out by a script written by some other bored bloke who just wanted to get home. If I don't get the results I'm expecting, I'll investigate, but the time I save by assuming things will work more than offsets that.
Usually the issues I find are things that invalidate the entire file, and indicate a problem further up the stream. In those cases I'm actually glad my script was not robust.
Sorry for coming in late, but... that argument can motivate the use of simpler tools (here, awk instead of perl) in any situation. :-)
(I didn't mention "mixed formats"; I mentioned the specific problem of embedded commas (and possibly embedded quotes) in CSV fields. You will find such characters even in names, especially if entered by hand or via OCR.)
Alternative: first pipe through a CSV to strict TSV converter, such that "awk -F'\t'" is correct. I do this all the time, because awk is far superior to standalone perl/python scripts for quick queries, tests, and reporting.
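A minimal version of that pipeline might use Python's standard csv module as the converter (an assumption — any CSV-aware converter works), after which awk -F'\t' sees exactly one field per column. Fields that themselves contain literal tabs or newlines would still come out quoted, so this yields strict TSV only for tab-free, newline-free data; input.csv and the field printed are illustrative:

```shell
# Normalize quoted CSV (embedded commas and all) to tab-separated
# output with Python's csv module, then query it with awk -F'\t'.
python3 -c '
import csv, sys
writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
for row in csv.reader(sys.stdin):
    writer.writerow(row)
' < input.csv > output.tsv

# Now awk field splitting is correct even for fields like "Smith, Ann".
awk -F'\t' 'NR > 1 { print $1 }' output.tsv
```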