
Contrary to the current HN title, the article points out:

Evidence presented during Private Manning’s court-martial for his role as the source for large archives of military and diplomatic files given to WikiLeaks revealed that he had used a program called “wget” to download the batches of files. That program automates the retrieval of large numbers of files, but it is considered less powerful than the tool Mr. Snowden used.

So the tool wasn't wget. curl, perhaps?



Having done this type of work before for a legitimate purpose, it is almost certainly a python or perl script with a nice library in front of it that makes it easy to follow links.

wget is too brittle, not extensible enough, and not as maintainable as a nice python script.


I believe Manning actually used Windows batch scripting to automate wget, or so the government alleged from forensics at the trial. (I observed a couple of days of the trial.)

Manning did not have Snowden's technical skills, though; she wasn't necessarily doing things in the most effective or elegant way, but it worked.


Probably, but it could be something like lftp. Its name belies its capabilities.

Or maybe Kermit? Half-smiley; only if he's a masochist. http://www.kermitproject.org/ckscripts.html


Wget is also single-threaded, which makes it a slow way to download large numbers of pages.


that's what xargs -n x is for
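Roughly like this sketch — it assumes a `urls.txt` with one URL per line, and prints the wget commands instead of running them (drop the `echo` to actually download):

```shell
# Sketch: parallelize single-threaded wget with xargs.
# urls.txt (one URL per line) is a made-up example.
printf '%s\n' \
  'http://example.com/page1.html' \
  'http://example.com/page2.html' \
  'http://example.com/page3.html' > urls.txt

# -n 1: pass one URL per wget invocation; -P 4: up to 4 parallel processes.
# echo is a dry run; remove it to actually fetch.
xargs -n 1 -P 4 echo wget -q < urls.txt
```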


Can you elaborate?


The other day I had the task of batch-downloading product pictures from a website. Every picture had a session ID in the URI, so I couldn't do a simple wget of the images. I wrote a simple Python script that generated a shell script consisting of a long series of "wget -E -H -k -p" and "sleep 30" lines, and ran it through a cloud server for a couple of days. After that, some simple scripts for renaming the pictures and a few regular expressions here and there, and voilà: 250k perfectly named pictures for my product catalog. (It's for an intranet, so I guess I won't have copyright problems.)
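The generated script presumably looked something like this — a hypothetical reconstruction using the flags from the comment (the URLs and session ID are made up):

```shell
# Hypothetical reconstruction of the generated downloader script.
# -E: adjust extension; -H: span hosts; -k: convert links; -p: page requisites.
cat > fetch.sh <<'EOF'
#!/bin/sh
wget -E -H -k -p 'http://example.com/catalog/img1.jpg?sessionid=abc123'
sleep 30
wget -E -H -k -p 'http://example.com/catalog/img2.jpg?sessionid=abc123'
sleep 30
EOF
chmod +x fetch.sh
```

The `sleep 30` between requests keeps the crawl slow enough to avoid hammering the server (or tripping rate limits).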


FYI, you have exactly the same copyright issues on an intranet. You're just less likely to get caught, I guess.


curl is just a library with a slim command-line interface. It can't scrape pages by itself. Perhaps you're thinking of curlmirror? Even then, I doubt it can be considered more powerful than a good wget configuration.
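For comparison, a "good wget configuration" for mirroring might look like this sketch (flags per wget's manual; example.com is a placeholder, and the command is printed rather than run since it needs network access):

```shell
# Sketch of a capable wget mirror invocation.
#   --mirror            shorthand for -r -N -l inf --no-remove-listing
#   --convert-links     rewrite links so the mirror browses locally
#   --adjust-extension  save HTML/CSS with matching file extensions
#   --page-requisites   also fetch images, CSS, etc. needed to render pages
#   --wait/--random-wait  throttle requests politely
cmd="wget --mirror --convert-links --adjust-extension --page-requisites \
 --wait=1 --random-wait http://example.com/"
echo "$cmd"
```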


Nutch/Solr could provide a way to do a crawl, refine parameters, and then feed into a tool to download the actual resources.



