I used awk 20 years back. Then I moved on. My awk scripts still run; those in other languages don't (major version changes, for example). Also, I began manipulating text using grep, cut, sed and sort, but found that these read through the file each time, becoming slow as data increases. Using awk you can search, filter and manipulate data in ONE pass over the file, making it very fast.
And so I once again brushed up on awk.
btw, I suggest using gawk since it has date functions.
It teaches Awk through many practical examples, so-called one-liners: short programs that each do one task, such as joining lines, printing lines matching a pattern, summing up numbers on lines, converting text, etc.
Looks like a valuable and useful book, but it really needs to be in Kindle (or other ebook) format. PDF files are not suited to reading on an ebook device.
Would Awk be useful for end user plain text databases? I want to keep a listing of all my books but I would prefer not to use a database and to use something that works in the Unix console.
The Awk Programming Language[1], by Aho, Kernighan and Weinberger, has a whole chapter on using Awk as a relational database engine. (They implement a small query language in Awk itself.) It's also a great book, period.
This 'NoSQL' has absolutely no similarity to the 'NoSQL' databases of the last few years (and predates them all by many years). Instead, it's a tool to manage databases of plain text files in the Unix environment. It seamlessly ties together several parts of the Unix toolchain, with Awk as the centerpiece, and has surprising features, like searches with joins between tables (i.e., files).
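To give a flavor of that kind of join in plain awk (this is just an illustration, not the NoSQL tool itself, and the file names and layout are made up: authors.tsv is "id TAB author", books.tsv is "id TAB title"):

    awk -F'\t' 'NR == FNR { author[$1] = $2; next }
                { print $2 "\t" author[$1] }' authors.tsv books.tsv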
I'll also second that the Aho Kernighan Awk book is a beautiful piece of work.
Depends on how complex the expected operations are. I think awk is absolutely essential for dealing with things like logfiles. Do you have a reason against using sqlite[1]? That works just fine in the console, doesn't need a server. You could still use awk with it, technically. For example:
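One way such a combination could look (the schema here, a hypothetical books.db with a books(title, author, year) table, is purely for illustration):

    # query sqlite3 from the shell, post-process its default '|'-separated output with awk
    sqlite3 books.db 'SELECT title, author, year FROM books' |
      awk -F'|' '$3 >= 2000 { print $1 " by " $2 }'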
Why would you want to do that? It looks like XML all over again: none of the advantages of a user-readable format and none of the advantages of a real database. Talking to the sqlite CLI tool is probably a little more useful.
XML? I keep databases in a user-readable/writable format that I also parse with awk. The whole point of awk is to parse human-readable text. No XML nonsense. Just text with a non-rigid structure.
If you need a database, you really shouldn't allow the text to be parsed as anything other than originally intended. With "just text with a non-rigid structure" it's just too easy to make a silly mistake. So what is the advantage here, given we have bdb and sqlite on almost every machine nowadays?
Or the other way around - if you need something human-readable, why use it as a database? Why not keep the data where the data belongs and generate the reports when needed from there?
(n/g)AWK doesn't just make sense for one-liners: it's a full (one could argue: the first) scripting language. Thinking it was only for one-liners put me off learning it at first. That was a mistake.
Q. Have you had any surprises in the way that AWK has developed over the years?
A. One Monday morning I walked into my office to find a person from the Bell Labs micro-electronics product division who had used AWK to create a multi-thousand-line computer-aided design system. I was just stunned. I thought that no one would ever write an AWK program with more than a handful of statements. But he had written a powerful CAD development system in AWK because he could do it so quickly and with such facility. My biggest surprise is that AWK has been used in many different applications that none of us had initially envisaged. But perhaps that's the sign of a good tool, as you use a screwdriver for many more things than turning screws.
- from the 2008 Computerworld interview with Alfred V. Aho (http://goo.gl/OVtFU)
I've found a screwdriver to be one of my go-to tools for car work. Oftentimes I need to remove a rubber gasket that's been in the car for years. It's usually completely hardened from heat and baked onto a metal surface. A screwdriver is really useful for scraping that off before I install a new one.
When I worked in a restaurant I saw the line cook do the same with food burned into the salamander.
I can remember using a screwdriver as a hole punch, a paint can opener, a cutting edge to open plastic bags, and a door stop. I'm sure I've used them for other things as well.
When a screwdriver is the only tool handy, you'll often make that work rather than spend the time finding/buying a better tool.
I used it to find a house[1]. I needed to rent two apartments close to each other. I wrote some shell/awk scripts that take several real estate web sites as input, parse them, extract all data about apartments including URL, address and price, pipe the addresses into Google maps API, find the exact coordinates of all apartments, calculate distances between them and produce a list of pairs of apartments sorted by the distance between them. It works incredibly well, it routinely finds houses that are on the same street or apartments that are in the same building.
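A minimal sketch of the distance step (not the original scripts; it assumes a made-up apartments.txt with one "url lat lon" line per listing) could look like this:

    awk 'function rad(d) { return d * 3.14159265358979 / 180 }
         { url[NR] = $1; lat[NR] = rad($2); lon[NR] = rad($3) }
         END {
           # great-circle (haversine) distance in km for every pair of listings
           for (i = 1; i <= NR; i++)
             for (j = i + 1; j <= NR; j++) {
               dlat = lat[i] - lat[j]; dlon = lon[i] - lon[j]
               a = sin(dlat/2)^2 + cos(lat[i]) * cos(lat[j]) * sin(dlon/2)^2
               printf "%.3f %s %s\n", 2 * 6371 * atan2(sqrt(a), sqrt(1-a)), url[i], url[j]
             }
         }' apartments.txt | sort -n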
The first version of the code was written when I was in Vienna, looking for a house. It took me an evening.
The only thing that comes even close to this in power is Perl, but I don't like Perl; I think it's overkill when you only need to parse text rather than do some other computation, and I did all this work under Plan 9, which doesn't have Perl anyway.
Oh, and what was most useful after all this was creating one-liners for filtering, like finding pairs that had total cost under some value, total number of rooms above some value and distance between them in some interval. I found these one-liners to be easier in awk than in Perl.
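For example, with a made-up pairs.txt of "url1 url2 price1 price2 rooms1 rooms2 km" per line, such a filter might look like:

    # total rent under 2000, at least 5 rooms combined, less than 1 km apart
    awk '$3 + $4 < 2000 && $5 + $6 >= 5 && $7 < 1' pairs.txt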
Your argument against awk because it's 2011 is the same as saying let's drop C because it's 2011. Both C and awk solve some things well; for other things there are other languages.
I find that any argument along the lines of "Oh goodness, it's [YEAR] for crying out loud" is generally rubbish. Not always, but often enough that I've noticed the correlation.
I know Perl, I know Ruby, I know Brainfuck and I know Awk. I'm also not a masochist. I don't use Brainfuck but I use the hell out of Awk. Other Awk users like 4ad here and I aren't using Awk because we're crazy; we've calculated the effort/reward tradeoffs and come to a solution.
I find people suggesting that others should or should not use various tools to be incredibly condescending. In doing so you are effectively refusing to recognize others as your peers. If you're Dijkstra and everyone around you has a hardon for GOTO, then be my guest, but you are not.
I find it quicker and easier to create one-off text filters in awk than in Perl or Ruby. That is, I may have a set of CSV files for teachers, administrators and students by grade. I want to extract fields x, y and z from all these files, transform the data in some specific ways and then create a single output file (maybe to upload somewhere else or to use as part of a larger program). I certainly could write a script in an interpreted language to do this, but I often find it faster to do it in awk. (There's a sweet spot here: if the job is too complex or I'm going to do it over and over, a script may become the better choice.) I use awk for this kind of thing 3-5 times a week, easily.
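A sketch of that kind of job (a hypothetical students.csv with name,grade,email columns and no quoted commas): pull the fields out, upcase the grade, and emit a tab-separated file.

    awk -F',' -v OFS='\t' 'NR > 1 { print $1, toupper($2), $3 }' students.csv > students.tsv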
(I know your work, and I respect you. I mean the following as a compliment - truly, no sarcasm or snark.)
Sounds to me like you're being a professional programmer. I'm not. I'm a self-taught amateur. I program a little for myself, a little for the school I work for and a bunch to do sysadmin tasks (for myself and the school). I don't worry all that much about the special cases of CSV. I look at the data, massage it a little beforehand if I need to (usually with awk, sed, etc. as well), then run it through awk and do the job. If the result isn't quite perfect, I clean it up quickly by hand in Vim. In this kind of context, 'robust' doesn't mean anything to me. I'm not expecting to do the exact same task ever again. As for running faster, I call bullshit and remind you of this[1]. awk runs plenty fast - seconds or less for such cases. As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.
The bottom line for me is that I like awk. Its defaults make sense to me; it's fast, flexible and powerful. Finding the right Perl module or Ruby gem, learning its API, opening my editor, etc. - that all takes time. For small, one-off jobs, it's not worth it.
As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.
I'm a lot faster not writing code than I am writing code, and when I have to write code, I write code much faster when I let the computer massage and clean up data.
This is getting pretty silly. The computer doesn't magically massage data itself. You're saying you prefer to use Perl and a module from CPAN to help do that. OK. I'm saying that in many cases, I prefer to use some combination of standard *nix tools, an editor and awk. The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish. So I'm just not seeing any argument from "write code much faster".
Again, this isn't robust. It's not professional. No tests in advance. But it works for me day in, day out, every week.
"I mentioned this approach on the #perl IRC channel once, and I was immediately set upon by several people who said I was using the wrong approach, that I should not be shelling out for such a simple operation. Some said I should use the Perl File::Compare module; most others said I should maintain a database of MD5 checksums of the text files, and regenerate HTML for files whose checksums did not match those in the database.
I think the greatest contribution of the Extreme Programming movement may be the saying "Do the simplest thing that could possibly work." Programmers are mostly very clever people, and they love to do clever things. I think programmers need to try to be less clever, and to show more restraint. Using system("cmp -s $file1 $file2") is in fact the simplest thing that could possibly work. It was trivial to write, it's efficient, and it works. MD5 checksums are not necessary. I said as much on IRC.
People have trouble understanding the archaic language of "sufficient unto the day is the evil thereof," so here's a modern rendering, from the New American Standard Bible: "Do not worry about tomorrow; for tomorrow will care for itself. Each day has enough trouble of its own." (Matthew 6:34)
People on IRC then argued that calling cmp on each file was wasteful, and the MD5 approach would be more efficient. I said that I didn't care, because the typical class contains about 100 slides, and running cmp 100 times takes about four seconds. The MD5 thing might be more efficient, but it can't possibly save me more than four seconds per run. So who cares?"
The cases I'm thinking of take (total) between 30 seconds and 5 minutes, start to finish.
I underestimate how much time it'll take me to do a job manually when I already have great tools to do it the right way all the time.
Maybe you're just that much better a programmer than I am--but I know when I need to parse something like CSV, Text::xSV will get it right without me having to think about whether there are any edge cases in the data at all. (If Text::xSV can't get it right, then there's little chance I will get it right in an ad hoc fashion.)
In the same way, I could write my own simple web server which listens on a socket, parses headers, and dumps files by resolving paths, or I could spend three minutes extending a piece of Plack middleware once and not worrying about the details of HTTP.
Again, maybe I'm a stickler about laziness and false laziness, but I tend to get caught up in the same cleverness MJD rightfully skewers. Maybe you're different, but part of my strategy for avoiding my own unnecessary cleverness is writing as little parsing code as I can get away with.
I could take a few more minutes to use something that makes me not have to worry about CSV edge cases, or I could just recognize that those cases don't apply to me, throw together an awk line in 20 seconds, and get to the pub early and get a pint or two down before my buds show up.
Who cares about what is better on paper? I have a life to live.
Personally, unless I knew the data, I'd have trouble enjoying that pint without worrying about data fields containing, e.g., "," characters, which my naive script would fail on...
In my experience, files have mixed formats far less often than you might suspect. Generally they were kicked out by another script that some other bored bloke, who just wants to get home, wrote himself. If I don't get the results I'm expecting then I'll investigate, but the time I save myself by assuming things will work more than offsets that.
Usually the issues I find are things that invalidate the entire file, and indicate a problem further up the stream. In those cases I'm actually glad my script was not robust.
Sorry for coming in late, but... that argument can motivate the use of simpler tools (here, awk instead of perl) in any situation. :-)
(I didn't mention "mixed formats", I mentioned the specific problem of, e.g., "," (and possibly '"') characters in CSV. You will find such characters even in, e.g., names, especially if entered by hand/OCR.)
Alternative: first pipe through a CSV to strict TSV converter, such that "awk -F'\t'" is correct. I do this all the time, because awk is far superior to standalone perl/python scripts for quick queries, tests, and reporting.
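A sketch of such a converter using gawk's FPAT (it handles simple quoted fields, not embedded newlines; the file names are just placeholders):

    # csv2tsv.awk
    BEGIN { FPAT = "([^,]*)|(\"([^\"]|\"\")*\")" }
    {
        out = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            if (f ~ /^".*"$/) {              # quoted field: drop quotes, unescape ""
                f = substr(f, 2, length(f) - 2)
                gsub(/""/, "\"", f)
            }
            gsub(/\t/, " ", f)               # keep stray tabs from breaking the TSV
            out = out (i > 1 ? "\t" : "") f
        }
        print out
    }
    # usage: gawk -f csv2tsv.awk data.csv | awk -F'\t' '{ print $3 }'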
Of course, anything more complex than that, but not requiring the full complexity of Perl/Ruby, is generally still best done with Awk. In my world, that is a lot.
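The commands being compared below aren't quoted above; given an input line like 'foo   bar', they were presumably along these lines:

    echo 'foo   bar' | awk '{ print $2 }'
    echo 'foo   bar' | tr -s ' ' | cut -d ' ' -f 2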
It also outputs 'bar'. The -s (for "squeeze") option of tr turns every sequence of the specified character (a space in this case) into a single instance of that character.
Of course, the awk solution is more succinct and elegant in this case - I just think that tr -s / cut -d is handy to know from time to time, too.
-l:
1. "chomps" the input, which here means it removes the trailing newline (if present) from each input line (more general explanation: http://perldoc.perl.org/functions/chomp.html), and
2. automatically adds a newline to each line of output (see below for how to achieve this in a shorter way with a relatively new Perl feature).
-a: turns on autosplit mode: this splits each input line (like awk does) into the @F (for "fields") array. The default separator is whitespace, i.e. it behaves like AWK.
-n: wraps the program in an implicit while loop over the input, which is processed line by line: "as long as there is another line, process it!"
-e: execute the next command-line argument as a Perl program (the argument in this case being 'print $F[0]').
Note that the example can be shortened if you use -E instead of -e. -E enables all the extensions and new features of Perl (which aren't enabled with -e for backwards-compatibility reasons). This lets you use 'say' instead of 'print', which adds the trailing newline automatically and lets you drop the -l option (if you don't need the 'chomp' behaviour explained above):
$ perl -anE 'say $F[0]'
Of course, the AWK command line is still shorter - and that's expected, because AWK is more specialized than Perl.
Still, Perl one liners are often very useful and can do other things better / shorter than AWK - especially if the one-liner uses one of the many libraries that are available for Perl.
A thorough reference of all Perl command-line options is available in the perlrun documentation (perldoc perlrun).
Thank you, that tempts me to go and look at perl. At the moment I tend to use simple awk, or a full python script. I find python really doesn't lend itself to "one line", or even particularly short, programs however. I keep meaning to go back and look at perl. I was tempted to wait for Perl 6, but I think the time has come to just look at Perl 5 :)
If you're genuinely curious about -lane, my favorite site for Perl one liners has gone the way of all flesh, but the Wayback Machine can still get you a copy[1].
Mostly chains of sort, uniq, sed, tr, comm and friends.
Anything that doesn't fit into 80 chars that way is usually a sure sign that it's time to reach for a real scripting language rather than brewing up one of those line-noise blobs.
Oh, and of course I don't mind when people use it on their shell-prompt.
I do mind when I run into crap like this in a shell-script:
awk -F'>' '{ pack[$1]=pack[$1] $2 } END {for (val in pack) print val ">", "(" pack[val] ")"}'
Wait, you're buttressing your argument that people shouldn't use awk because it's 2011, by advocating equally complex command composition with more tools? I mean, I love sort and uniq but your opposition to awk sounds more like religious zealotry than anything rational.
"Mostly chains of sort, uniq, sed, tr, comm and friends."
Awk wasn't invented for lack of those things, Awk was invented to supplement those things. And for that purpose it is every bit as relevant today as it ever was.
P.S. Properly formatted, that awk line is perfectly sensible. I don't see what your issue is here.
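Laid out as a standalone script, for instance, the same program might read:

    #!/usr/bin/awk -f
    BEGIN { FS = ">" }

    # append the second ">"-separated field to the entry keyed by the first
    { pack[$1] = pack[$1] $2 }

    # print each key followed by its concatenated values in parentheses
    END {
        for (val in pack)
            print val ">", "(" pack[val] ")"
    }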
Except that in awk the middle-ground between "could be trivially done without awk" and "illegible" is extremely narrow.
When I 'grep awk /usr/bin/*' I see only two types of usage.
Most commonly it's used in place of grep, sed or cut for no obvious reason. Thankfully less common, but all the more annoying, is when someone tries to go all "fancy" with it...
Awk has everything "real" languages have: the same math, flow control and data structures. It also has a set of string-mangling ops that give it the "illegible" appearance. But it is fine as a general-purpose dynamic scripting language. As an example:
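(A hypothetical stand-in for the example, in the same golfed one-liner style: print the most frequent word in a file.)

    awk '{ for (i = 1; i <= NF; i++) c[tolower($i)]++ }
         END { for (w in c) if (c[w] > m) { m = c[w]; b = w }
               print b, m }' file.txt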
It could be clearer without the variable-name golfing and as a separate .awk file (instead of a CLI one-liner), but it reads pretty much the same as C without any pesky type declarations.