Keep in mind that your average repository doesn't only contain code that is compiled and executed (or interpreted); there is also documentation, static assets such as images (which may be processed), configuration, computed files (which may make sense to pre-compute once rather than recompute in a hundred people's environments on every build), and so forth.
Also, it doesn't only include the current file set - it also includes files that have been deleted, split into modular files, merged, wholesale rewritten, or moved into a new hierarchy (some VCS systems handle this better than others).
(I work at Facebook, but not on the team looking into this stuff. I'm a happy user of their systems though. Keep in mind that the 1.3 million file repo is a synthetic test, not reality.)
The follow-up email still mentions a working directory of 9.5 GB. I cannot fathom working on a code repository consisting of 9.5 GB of text. There must be something else going on here, even considering peripheral projects like the iOS and Android apps, etc.
(edit: if there are huge generated files intermingled with code, shouldn't those be hosted on a "pre-generated cache" web server instead of git, for example?)
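A minimal sketch of what that could look like, assuming a hypothetical cache server at assets.example.com and a manifest that maps file paths to content hashes (every name and URL here is made up for illustration):

    #!/usr/bin/env python3
    """Fetch pre-generated files from a cache server instead of versioning them in git.

    Hypothetical sketch: the server URL, manifest format, and paths are illustrative.
    """
    import hashlib
    import json
    import pathlib
    import urllib.request

    CACHE_URL = "https://assets.example.com/cache"       # hypothetical cache server
    MANIFEST = pathlib.Path("generated-manifest.json")   # {"relative/path": "sha256hex", ...}

    def sha256(path: pathlib.Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def sync() -> None:
        manifest = json.loads(MANIFEST.read_text())
        for rel_path, digest in manifest.items():
            target = pathlib.Path(rel_path)
            # Skip files that are already present with the expected content.
            if target.exists() and sha256(target) == digest:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # Content-addressed fetch: the cache stores blobs under their hash.
            urllib.request.urlretrieve(f"{CACHE_URL}/{digest}", target)
            print(f"fetched {rel_path}")

    if __name__ == "__main__":
        sync()

The manifest (a small text file) is what gets committed, so git only tracks a pointer while the heavyweight blobs live on the cache server.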
Our codebase at my employer currently hovers around 5 GB in SVN. Binaries and other generated code are intermixed with the source for historical reasons, and removing them is a non-starter due to the amount of time it would take. The best solution I've come up with so far is to break things out into multiple SVN repos (one for images, one for generated language files, etc.) and then, hopefully, get the code onto GitHub while pulling in the SVN repos externally for the stuff that shouldn't be versioned in a distributed manner (versioning that stuff is still useful as a convenience - avoiding conflicts, etc.).
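A rough sketch of how a developer machine could stitch that split layout back together, assuming the code moves to git and the asset repos stay in SVN (all URLs and directory names below are hypothetical):

    #!/usr/bin/env python3
    """Bootstrap a working copy: code from git, bulky assets from separate SVN repos.

    Hypothetical sketch: every URL and path is made up for illustration.
    """
    import subprocess

    # Code lives in git (distributed); generated/binary content stays in central SVN.
    GIT_CODE_REPO = "git@github.com:example/app.git"
    SVN_EXTERNALS = {
        "assets/images": "https://svn.example.com/repos/images/trunk",
        "assets/lang":   "https://svn.example.com/repos/generated-lang/trunk",
    }

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def bootstrap() -> None:
        # Clone the (now much smaller) code repository.
        run(["git", "clone", GIT_CODE_REPO, "app"])
        # Check out each centrally versioned asset repo into the working tree.
        # (app/assets would be listed in .gitignore so the checkouts aren't committed.)
        for path, url in SVN_EXTERNALS.items():
            run(["svn", "checkout", url, f"app/{path}"])

    if __name__ == "__main__":
        bootstrap()

Depending on which side ends up being the "parent" repo, svn:externals or git submodules could achieve much the same thing without a bootstrap script.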