
Contrary to the current HN title, the article points out:

Evidence presented during Private Manning’s court-martial for his role as the source for large archives of military and diplomatic files given to WikiLeaks revealed that he had used a program called “wget” to download the batches of files. That program automates the retrieval of large numbers of files, but it is considered less powerful than the tool Mr. Snowden used.

So the tool wasn't wget. curl, perhaps?



Having done this type of work before for a legitimate purpose, it is almost certainly a python or perl script with a nice library in front of it that makes it easy to follow links.

wget is too brittle, not extensible enough, and not as maintainable as a nice python script.


I believe Manning actually used Windows batch scripting to automate wget, or so the government alleged from forensics at the trial. (I observed a couple of days of the trial.)

Manning did not have Snowden's technical skills, though; she wasn't necessarily doing things in the most effective or elegant way, but it worked.


Probably, but it could be something like lftp. Its name belies its capabilities.

Or maybe Kermit? Half-smiley; only if he's a masochist. http://www.kermitproject.org/ckscripts.html


Wget is also single-threaded, which makes it a slow way to download large numbers of pages.


that's what xargs -n x is for
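Roughly like this sketch — it assumes a `urls.txt` with one URL per line, and prints the wget commands instead of running them (drop the `echo` to actually download):

```shell
# Sketch: parallelize single-threaded wget with xargs.
# urls.txt (one URL per line) is a made-up example.
printf '%s\n' \
  'http://example.com/page1.html' \
  'http://example.com/page2.html' \
  'http://example.com/page3.html' > urls.txt

# -n 1: pass one URL per wget invocation; -P 4: up to 4 parallel processes.
# echo is a dry run; remove it to actually fetch.
xargs -n 1 -P 4 echo wget -q < urls.txt
```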


Can you elaborate?


The other day I had the task of batch-downloading product pictures from a website. Every picture had a session ID in the URI, so I couldn't do a simple wget of the images. I wrote a simple Python script that generated a shell script consisting of a long series of "wget -E -H -k -p" and "sleep 30" lines, and ran it through a cloud server for a couple of days. After that, some simple scripts for renaming the pictures and a few regular expressions here and there, and voilà: 250k perfectly named pictures for my product catalog. (It's for an intranet, so I guess I won't have copyright problems.)
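The generated script presumably looked something like this — a hypothetical reconstruction using the flags from the comment (the URLs and session ID are made up):

```shell
# Hypothetical reconstruction of the generated downloader script.
# -E: adjust extension; -H: span hosts; -k: convert links; -p: page requisites.
cat > fetch.sh <<'EOF'
#!/bin/sh
wget -E -H -k -p 'http://example.com/catalog/img1.jpg?sessionid=abc123'
sleep 30
wget -E -H -k -p 'http://example.com/catalog/img2.jpg?sessionid=abc123'
sleep 30
EOF
chmod +x fetch.sh
```

The `sleep 30` between requests keeps the crawl slow enough to avoid hammering the server (or tripping rate limits).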


FYI, you have exactly the same copyright issues on an intranet. You're just less likely to get caught, I guess.


curl is just a library with a slim command-line interface. It can't scrape pages by itself. Perhaps you're thinking of curlmirror? Even then, I doubt it can be considered more powerful than a good wget configuration.
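For comparison, a "good wget configuration" for mirroring might look like this sketch (flags per wget's manual; example.com is a placeholder, and the command is printed rather than run since it needs network access):

```shell
# Sketch of a capable wget mirror invocation.
#   --mirror            shorthand for -r -N -l inf --no-remove-listing
#   --convert-links     rewrite links so the mirror browses locally
#   --adjust-extension  save HTML/CSS with matching file extensions
#   --page-requisites   also fetch images, CSS, etc. needed to render pages
#   --wait/--random-wait  throttle requests politely
cmd="wget --mirror --convert-links --adjust-extension --page-requisites \
 --wait=1 --random-wait http://example.com/"
echo "$cmd"
```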


Nutch/Solr could provide a way to do a crawl, refine parameters, and then feed into a tool to download the actual resources.



