Why you should learn just a little Awk - A Tutorial by Example (gregable.com)
224 points by ColinWright on Aug 27, 2011 | 76 comments


I used awk 20 years back. Then I moved on. My awk scripts still run; those in other languages don't (major version changes, for example). Also, I began manipulating text using grep, cut, sed and sort, but found that these read through the file each time, becoming slow as the data grows. Using awk you can search, filter and manipulate data in ONE pass over the file, making it very fast. And so I once again brushed up on awk.
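As a minimal sketch of the one-pass point (the log file name and field layout here are invented for illustration), a grep/cut/sort/uniq chain can often collapse into a single awk invocation:

    # grep ERROR access.log | cut -d' ' -f1 | sort | uniq -c
    # ...becomes one pass over the file:
    awk '/ERROR/ { count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log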

btw, I suggest using gawk since it has date functions.
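For example, gawk's systime() and strftime() extensions (not in POSIX awk) make it easy to stamp output with dates:

    # prefix each input line with today's date (gawk only)
    gawk '{ print strftime("%Y-%m-%d", systime()), $0 }' file.txt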


I wrote an ebook 2 months ago on Awk called "Awk One-Liners Explained."

http://www.catonmat.net/blog/awk-book/

It teaches Awk through many practical examples, so-called one-liners: small, short programs that each do just one task, such as joining lines, printing lines matching a pattern, summing up numbers on lines, converting text, etc.
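A couple of one-liners in that spirit (illustrative only, not taken from the book):

    # sum the numbers in the first column
    awk '{ sum += $1 } END { print sum }' numbers.txt

    # join each pair of consecutive lines with a comma
    awk 'NR % 2 { prev = $0; next } { print prev "," $0 }' file.txt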

Check it out!


If I recall, you also wrote a (wonderful) gawk script to download youtube videos (which I contributed some updates to). I still use it often (the famous python program link fails, btw). http://www.catonmat.net/blog/downloading-youtube-videos-with...


Yes, yes I did! :)

It was more like proof of concept that GNU Awk is awesome and can do things like binary data IO and networking.


Looks like a valuable and useful book, but it really needs to be in Kindle (or other ebook) format. PDF files are not suited to reading on an ebook device.


Would Awk be useful for end user plain text databases? I want to keep a listing of all my books but I would prefer not to use a database and to use something that works in the Unix console.


The Awk Programming Language[1], by Aho, Kernighan and Weinberger, has a whole chapter on using Awk as a relational database engine. (They implement a small query language in Awk itself.) It's also a great book, period.

[1] http://cm.bell-labs.com/cm/cs/awkbook


I agree, great book. Have you checked the price lately?


Luckily, I don't need to: I picked it up from Amazon (university library sell-off) after silentbicycle recommended it here some time back.

(You made me curious: there are used copies at Amazon starting at around $6.00[1].)

[1] http://www.amazon.com/gp/offer-listing/020107981X/ref=dp_olp...


I wrote an e-book on Awk two months ago:

http://www.catonmat.net/blog/awk-book/

It teaches Awk through many practical examples, so-called one-liners: small, short programs that each do just one task, such as joining lines, printing lines matching a pattern, summing up numbers on lines, converting text, etc.


Something like the 'NoSQL' project might be just what you're looking for: http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20P...

This 'NoSQL' has absolutely no similarity to the 'NoSQL' databases of the last few years (and predates them all by many years). Instead it's a tool for managing databases of plain text files in the Unix environment. It seamlessly ties together several parts of the Unix toolchain, with Awk as the centerpiece, and has surprising features, like searches with joins between tables (i.e., files).

I'll also second that the Aho Kernighan Awk book is a beautiful piece of work.


That link (nosql) looks interesting. Also do look at crush-tools. You can even pivot data with it.

http://code.google.com/p/crush-tools/wiki/CrushTutorial


Depends on how complex the expected operations are. I think awk is absolutely essential for dealing with things like logfiles. Do you have a reason against using sqlite[1]? That works just fine in the console, doesn't need a server. You could still use awk with it, technically. For example:

    bash..$ sqlite3 books.sqlite
    sqlite$ create table books ( title varchar(128), author varchar(128) );
    sqlite$ insert into books (title, author) values ("Snow Crash", "Neal Stephenson");
    sqlite$ .quit
    bash..$ echo "select * from books;" | sqlite3 books.sqlite
    Snow Crash|Neal Stephenson
[1] http://sqlite.org/


You don't even need to pipe the command in; just put it as the second argument after the database file.

    $ sqlite3 books.sqlite "select * from books"


Thanks for the tip


Why would you want to do that? It looks like XML all over again: none of the advantages of a user-readable format and none of the advantages of a real database. Talking to the sqlite CLI tool is probably a little bit more useful.


XML? I keep databases in a user-readable/writable format that I also parse with awk. The whole point of awk is to parse human-readable text. No XML nonsense. Just text with a non-rigid structure.
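As a minimal sketch of what such a plain-text database might look like (the file layout is invented for illustration):

    # books.txt: one record per line, tab-separated fields: title, author, year
    # list every book by a given author
    awk -F'\t' '$2 == "Neal Stephenson" { print $1, "(" $3 ")" }' books.txt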


That was my question - why do that?

If you need a database, you really shouldn't allow the text to be parsed as anything other than originally intended. With "just text with a non-rigid structure" it's just too easy to make a silly mistake. So what is the advantage here, given that we have bdb and sqlite on almost every machine nowadays?

Or the other way - if you need something human-readable, why use it as a database? Why not keep the data where the data belongs and generate the reports when needed from there?


You might be interested by recutils: https://www.gnu.org/s/recutils/


(n/g)AWK doesn't just make sense for one-liners: it's a full (one could argue: the first) scripting language. Thinking it was only for one-liners put me off learning it at first. That was a mistake.



Ha! Thanks. I just spent 5 minutes trying to figure out why I can't reply to that thread.

I had hopped from HN to Gregable, then followed the link on Gregable to the 2010 HN thread (thinking I had returned to the original).


If you didn't already know that Ruby scripts can be written in a very similar way to awk, this makes for a good read.

http://tomayko.com/writings/awkward-ruby


  07.46.199.184 [28/Sep/2010:04:08:20] "GE...
07, huh? :)


I use awk a lot, but only via canned solutions I found on the web; your article definitely convinced me to learn it. Thanks.



TO ALL THE NAYSAYERS REGARDING AWK'S POWER:

Q. Have you had any surprises in the way that AWK has developed over the years?

A. One Monday morning I walked into my office to find a person from the Bell Labs micro-electronics product division who had used AWK to create a multi-thousand-line computer-aided design system. I was just stunned. I thought that no one would ever write an AWK program with more than a handful of statements. But he had written a powerful CAD development system in AWK because he could do it so quickly and with such facility. My biggest surprise is that AWK has been used in many different applications that none of us had initially envisaged. But perhaps that's the sign of a good tool, as you use a screwdriver for many more things than turning screws.

- from the 2008 Computerworld interview with Alfred V. Aho (http://goo.gl/OVtFU)


Serious question: What can you use a screwdriver for (besides turning a screw)?


chisel

hole punch

lever

prybar

stethoscope

electrical bus bar

electrical probe

paint stirrer

scraper

hammer

engraving tool

depth gauge

compass

And yes, I've used screwdrivers for all those purposes.


I've found a screwdriver to be one of my go-to tools for car work. Oftentimes I need to remove a rubber gasket that's been in the car for years. It's usually completely hardened from heat and baked onto a metal surface. A screwdriver is really useful for scraping that off before I install a new one.

When I worked in a restaurant I saw the line cook do the same with food burned into the salamander.


I can remember using a screwdriver as a hole punch, a paint can opener, a cutting edge to open plastic bags, and a door stop. I'm sure I've used them for other things as well.

When a screwdriver is the only tool handy, you'll often make that work rather than spend the time finding/buying a better tool.


This week I used a screwdriver to pry open the top of a rack mounted server case because it was stuck shut.


You can score many surfaces with a screwdriver, say for marking measurements when cutting wood, or steel, or something.


A prybar.


Well written.


Ahhhh, so that's how I get a job at google.


Let it go, it's 2011. Every time someone uses Awk in this day and age god kills a baby seal.


And why do you think that? Awk is simply the best tool for many tasks.


Name one?


Hacking together something to parse text?

I used it to find a house[1]. I needed to rent two apartments close to each other. I wrote some shell/awk scripts that take several real estate web sites as input, parse them, extract all data about apartments including URL, address and price, pipe the addresses into the Google Maps API, find the exact coordinates of all apartments, calculate distances between them and produce a list of pairs of apartments sorted by the distance between them. It works incredibly well; it routinely finds houses that are on the same street or apartments that are in the same building.

The first version of the code was written when I was in Vienna, looking for a house. It took me an evening.

The only thing that comes even close to this in power is perl, but I don't like perl; I think it's overkill when you only need to parse text rather than do some other computation, and I did all this work under Plan 9, which doesn't have perl anyway.

Oh, and what was most useful after all this was creating one-liners for doing filtering, like finding pairs that had total cost under some value, total number of rooms above some value and distance between them in some interval. I found these one-liners to be easier in awk than perl.

Your argument against awk because it's 2011 is the same as saying let's drop C because it's 2011. Both C and awk solve some things well; for other things there are other languages.

[1] http://code.google.com/p/operation-housefinder/


I find that any argument along the lines of "Oh goodness, it's [YEAR] for crying out loud" is generally rubbish. Not always, but often enough that I've noticed the correlation.

I know Perl, I know Ruby, I know Brainfuck and I know Awk. I'm also not a masochist. I don't use Brainfuck, but I use the hell out of Awk. Other Awk users like 4ad here and I aren't using Awk because we're crazy; we've calculated effort/rewards/tradeoffs and come upon a solution.

I find people suggesting that others should or should not use various tools to be incredibly condescending. In doing so you are effectively refusing to recognize others as your peers. If you're Dijkstra and everyone around you has a hardon for GOTO, then be my guest, but you are not.


I find it quicker and easier to create one-off text filters in awk than Perl or Ruby. That is, I may have a set of csv files for teachers, administrators and students by grade. I want to extract fields x, y and z from all these files, transform the data in some specific ways and then create a single output file (maybe to upload somewhere else or to use as part of a larger program). I certainly could write a script in an interpreted language to do this, but I often find it faster to do it in awk. (There's a sweet spot here: if the job is too complex or I'm going to do it over and over, a script may become the better choice.) I use awk for this kind of thing 3-5 times a week, easily.
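A sketch of that kind of one-off filter (the file names, field positions and transformation are made up for the example):

    # skip each file's header line, pull fields 2, 5 and 7 from simple
    # comma-separated data, uppercase the name, write one combined file
    awk -F',' 'FNR > 1 { print toupper($2) "," $5 "," $7 }' teachers.csv students.csv > combined.csv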


By the time you've dealt with all the special cases of the CSV format, your awk script isn't going to look very clean.

A few lines of Perl with Text::CSV is going to be much more robust, run faster, and take less time to get working.


(I know your work, and I respect you. I mean the following as a compliment - truly, no sarcasm or snark.)

Sounds to me like you're being a professional programmer. I'm not. I'm a self-taught amateur. I program a little for myself, a little for the school I work for and a bunch to do sysadmin tasks (for myself and the school). I don't worry all that much about the special cases of CSV. I look at the data, massage it a little beforehand if I need to (usually with awk, sed, etc. as well), then run it through awk and do the job. If the result isn't all perfect, I clean it up quickly by hand using Vim. In this kind of context, 'robust' doesn't mean anything to me. I'm not expecting to do the exact same task ever again. As for running faster, I call bullshit and remind you of this[1]. awk runs plenty fast - seconds or less for such cases. As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.

The bottom line for me is that I like awk. Its defaults make sense to me; it's fast, very flexible and powerful. Finding the right Perl module or Ruby gem, learning its API, opening my editor, etc. - that all takes time. For small, one-off jobs, it's not worth it.

[1] http://news.ycombinator.com/item?id=1116224


As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.

I'm a lot faster not writing code than I am writing code, and when I have to write code, I write code much faster when I let the computer massage and clean up data.


This is getting pretty silly. The computer doesn't magically massage data itself. You're saying you prefer to use Perl and a module from CPAN to help do that. Ok. I'm saying that in many cases, I prefer to use some combination of standard *nix tools, an editor and awk. The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish. So I'm just not seeing any argument from "write code much faster".

Again, this isn't robust. It's not professional. No tests in advance. But it works for me day in, day out, every week.

Last thought from me - this is beginning to remind me of this slide from Mark Jason Dominus: http://perl.plover.com/yak/12views/samples/notes.html#sl-3,

"I mentioned this approach on the #perl IRC channel once, and I was immediately set upon by several people who said I was using the wrong approach, that I should not be shelling out for such a simple operation. Some said I should use the Perl File::Compare module; most others said I should maintain a database of MD5 checksums of the text files, and regenerate HTML for files whose checksums did not match those in the database.

I think the greatest contribution of the Extreme Programming movement may be the saying "Do the simplest thing that could possibly work." Programmers are mostly very clever people, and they love to do clever things. I think programmers need to try to be less clever, and to show more restraint. Using system("cmp -s $file1 $file2") is in fact the simplest thing that could possibly work. It was trivial to write, it's efficient, and it works. MD5 checksums are not necessary. I said as much on IRC.

People have trouble understanding the archaic language of "sufficient unto the day is the evil thereof," so here's a modern rendering, from the New American Standard Bible: "Do not worry about tomorrow; for tomorrow will care for itself. Each day has enough trouble of its own." (Matthew 6:34)

People on IRC then argued that calling cmp on each file was wasteful, and the MD5 approach would be more efficient. I said that I didn't care, because the typical class contains about 100 slides, and running cmp 100 times takes about four seconds. The MD5 thing might be more efficient, but it can't possibly save me more than four seconds per run. So who cares?"


The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish.

I underestimate how much time it'll take me to do a job manually when I already have great tools to do it the right way all the time.

Maybe you're just that much better a programmer than I am--but I know when I need to parse something like CSV, Text::xSV will get it right without me having to think about whether there are any edge cases in the data at all. (If Text::xSV can't get it right, then there's little chance I will get it right in an ad hoc fashion.)

In the same way, I could write my own simple web server which listens on a socket, parses headers, and dumps files by resolving paths, or I could spend three minutes extending a piece of Plack middleware once and not worrying about the details of HTTP.

Again, maybe I'm a stickler about laziness and false laziness, but I tend to get caught up in the same cleverness MJD rightfully skewers. Maybe you're different, but part of my strategy for avoiding my own unnecessary cleverness is writing as little parsing code as I can get away with.


I could take a few more minutes to use something that makes me not have to worry about CSV edge cases, or I could just recognize that those cases don't apply to me, throw together an awk line in 20 seconds, and get to the pub early and get a pint or two down before my buds show up.

Who cares about what is better on paper? I have a life to live.


Personally, unless I knew the data, I'd have problems enjoying that pint without worrying about data fields containing, e.g., "," characters, which my naive script would fail on...


In my experience, files have mixed formats far less often than you might suspect. Generally they were kicked out by another script written by some other bored bloke who just wanted to get home. If I don't get the results I'm expecting then I'll investigate, but the time I save myself by assuming things will work more than offsets that.

Usually the issues I find are things that invalidate the entire file, and indicate a problem further up the stream. In those cases I'm actually glad my script was not robust.


Sorry for coming in late, but... that argument can motivate the use of simpler tools (here, awk instead of perl) in any situation. :-)

(I didn't mention "mixed formats"; I mentioned the specific problem of, e.g., "," characters (and possibly embedded '"' characters) in CSV fields. You will find such characters even in e.g. names, especially if entered by hand/OCR.)


Alternative: first pipe through a CSV to strict TSV converter, such that "awk -F'\t'" is correct. I do this all the time, because awk is far superior to standalone perl/python scripts for quick queries, tests, and reporting.

(My converter script: https://github.com/brendano/tsvutils/blob/master/csv2tsv )
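Usage looks something like this (illustrative; csv2tsv refers to the converter script linked above, and the query itself is made up):

    # convert quoted/escaped CSV to strict TSV, then query it safely with awk
    ./csv2tsv < books.csv | awk -F'\t' '$3 > 1990 { print $1 }'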


    !x[$1]++

I use that at least 10 times a day.

Also, Larry Wall (perl) said "I still say awk '{print $1}' a lot."
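For anyone who hasn't seen that idiom: used as a bare pattern it prints a line only the first time its first field appears, i.e. it de-duplicates on column 1 while preserving order. A minimal sketch:

    # print each line whose first field has not been seen before
    awk '!seen[$1]++' data.txt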


Don't forget about `cut -f1` though!

Of course, anything more complex than that, but not requiring the complexity of Perl/Ruby, is generally still best done with Awk. In my world, that is a lot.


cut -f1 and awk '{print $1}' are not equivalent, though: awk's default field splitting treats a run of whitespace as a single separator, so it does not produce empty fields. To be precise:

    pi:~$ echo 'foo  bar' | cut -d ' ' -f 2

    pi:~$ echo 'foo  bar' | awk '{print $2}'
    bar


Just for the interested reader: when the first command line is changed into

    pi:~$ echo 'foo  bar' | tr -s ' ' | cut -d ' ' -f 2
it also outputs 'bar'. The -s (for "squeeze") option of tr turns every sequence of the specified character (space in this case) into one instance of this character.

Of course, the awk solution is more succinct and elegant in this case - I just think that tr -s / cut -d is handy to know from time to time, too.


The interested reader should not assume hastily, however, that tr -s is enough to make cut behave exactly like awk. Hint: leading spaces.

    pi:~$ echo ' foo' | tr -s ' ' | cut -d ' ' -f 1 
    
    pi:~$ echo ' foo' | awk '{print $1}'
    foo


Because it's shorter than `perl -lane 'print $F[0]'`.


It's certainly easier to remember... what's the lane?


"what's the lane?"

four options squeezed together:

-l: 1. "chomps" the input (which here means: removes the newline, if present, from each line; a more general explanation: http://perldoc.perl.org/functions/chomp.html) and 2. automatically adds a newline to each line of output (see below how to achieve this in a shorter way with a relatively new Perl feature).

-a: turns on auto-split-mode: this splits the input (like awk) into the @F (for fields) array. The default separator is one (or many) spaces, i.e. it behaves like AWK.

-n: makes Perl process the input in an implicit while loop around the input, which is processed line-wise: "as long as there is another line, process it!"

-e: execute the next command line argument as a Perl program (the argument in this case being 'print $F[0]').

Note that the example can be shortened if you use -E instead of -e. -E enables all extensions and new features of Perl (which aren't enabled with -e because of backwards compatibility). This allows you to use 'say' instead of 'print' which adds a trailing newline automatically and lets you drop the -l option (if you don't need the 'chomp' behaviour explained above):

    $ perl -anE 'say $F[0]'
Of course, the AWK command line is still shorter - and that's expected, because AWK is more specialized than Perl.

Still, Perl one liners are often very useful and can do other things better / shorter than AWK - especially if the one-liner uses one of the many libraries that are available for Perl.

A thorough reference of all Perl command line options is available at:

    http://perldoc.perl.org/perlrun.html
or just:

    $ man perlrun


Thank you, that tempts me to go and look at perl. At the moment I tend to use simple awk, or a full python script. I find python really doesn't lend itself to "one line", or even particularly short, programs however. I keep meaning to go back and look at perl. I was tempted to wait for Perl 6, but I think the time has come to just look at Perl 5 :)


If you're genuinely curious about -lane, my favorite site for Perl one liners has gone the way of all flesh, but the Wayback Machine can still get you a copy[1].

[1] http://web.archive.org/web/20090602215912/http://sial.org/ho...


Care to elaborate? I'd love to hear alternatives you use for the same kind of task.


Perl, and for very simple tasks grep and sed.


Mostly chains of sort, uniq, sed, tr, comm and friends.

Anything that doesn't fit into 80 chars that way is usually a sure sign that it's time to reach for a real scripting language rather than brewing up one of those line-noise blobs.

Oh, and of course I don't mind when people use it on their shell-prompt.

I do mind when I run into crap like this in a shell-script:

  awk -F'>' '{ pack[$1]=pack[$1] $2 } END {for (val in pack) print val ">", "(" pack[val] ")"}'


Wait, you're buttressing your argument that people shouldn't use awk because it's 2011, by advocating equally complex command composition with more tools? I mean, I love sort and uniq but your opposition to awk sounds more like religious zealotry than anything rational.

"It's 2011, time for a sed v. awk religious war!"


"Mostly chains of sort, uniq, sed, tr, comm and friends."

Awk wasn't invented for lack of those things, Awk was invented to supplement those things. And for that purpose it is every bit as relevant today as it ever was.

P.S. Properly formatted, that awk line is perfectly sensible. I don't see what your issue here is.
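For reference, the same one-liner laid out as a script (a behaviour-preserving reformat; it collects the text after each '>' keyed by the text before it, then prints each key with its collected text):

    awk -F'>' '
        { pack[$1] = pack[$1] $2 }    # append the text after ">" to this key
        END {
            for (val in pack)
                print val ">", "(" pack[val] ")"
        }
    '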


Crap can be written in any language; no language is immune and no language is better in this regard.


Except that in awk the middle-ground between "could be trivially done without awk" and "illegible" is extremely narrow.

When I 'grep awk /usr/bin/*' I see only two types of usage.

Most commonly it's used in place of grep, sed or cut for no obvious reason. Thankfully less common, but all the more annoying, is when someone tries to go all "fancy" with it...


Awk has everything "real" languages have. The same math, flow control and data structures. It also has a subset of string mangling ops that give it the "illegible" appearance. But it is fine as a general purpose dynamic scripting language. As an example:

http://kmkeen.com/awk-music/

It could be clearer without the variable-name golfing and as a separate .awk file (instead of a CLI one-liner), but it reads pretty much the same as C without any pesky type declarations.
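A small sketch of awk used as a plain scripting language (a standalone, invented .awk file rather than a one-liner): math, loops and associative arrays, no type declarations.

    #!/usr/bin/awk -f
    # wordfreq.awk: print the ten most frequent words on stdin
    {
        for (i = 1; i <= NF; i++)
            count[tolower($i)]++
    }
    END {
        # repeatedly pick the current maximum (avoids gawk-only sort functions)
        for (n = 0; n < 10; n++) {
            best = ""
            for (w in count)
                if (best == "" || count[w] > count[best])
                    best = w
            if (best == "") break
            print count[best], best
            delete count[best]
        }
    }

Run it as, e.g., awk -f wordfreq.awk < somefile.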


a sure sign that it's time to reach for a real scripting language

Awk is Turing-complete.


So is brainfuck, and whitespace. This isn't really a useful thing to point out.

For any sort of script that's going to be run many times and may need to be improved or extended, readability is a huge plus.


Awk is incredibly readable in a shell script. It's a proper scripting language that actually looks cleaner than the Bourne shell.


BF and whitespace are LOLCODE, and thus not "real" for my purposes.


So is brainfuck...


It's possible to put yourself through some real pain by inappropriate use of awk.

(Case in point: university coursework - processing HTML. Not pretty.)

Pretty easy to knock together a quick line of code for dealing with files where there are obviously columns though.



