Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on hacker news is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely the same issue.
So instead of a potentially very enlightening conversation identifying and discussing limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
Your comment was at the top so I continued to read expecting to find a bunch of ignorant group think about how git is awesome and Facebook is dumb, but that's not really what's going on down below.
I don't know what Facebook's use case is, so I have no idea if their repositories are optimally structured. However, I've used git on a very large repository and ran into some of the same performance issues that they did (30+ seconds to run git status), so I don't think it's terribly hard to imagine they're in a similar situation.
What we did to solve it is exactly what you're excoriating the people below for suggesting: we split the repo and used other tools to manage multiple git repos ('Repo' in some situations, git submodules in others).
However, we moved to that workflow mainly because it had a number of other advantages, not just because it made day-to-day git operations faster.
I hope git gets faster; some of the performance problems described are things we saw too. But things are always more complicated, and I see nothing below that looks like the knee-jerk, ignorant consensus you're describing.
Sometimes the answer to "it hurts when I do this" is "don't do that... because there's other ways to solve the same issue that work better for a number of other reasons and we haven't bothered fixing that particular one because most of the time the other way works better anyway."
On a similar note, I've heard of people who would hit the size limit on Fortran files, so they put every variable into a function call to the next file, which itself contained one function and a function call to the next file after that (if necessary).
It is intuitively obvious that it is better to be rich and healthy than poor and ill. Sadly, in reality the choices are often neither. And you cannot always just split a repo; hearing some ideas for that case would have been interesting.
Solving a scaling problem by splitting it is, well, obvious.
And, yes, I also ran git on a couple of projects at $work, and the issues are real; I've seen them.
So, if it hurts when I try to use git, the answer will be "don't use git"... But the conveniences are so tempting...
Stat'ing a million files is going to take a long time. Perforce doesn't have this problem because you explicitly check out files (p4 edit). (Perforce marks the whole tree read-only, as a reminder to edit the file before you save.)
It seems like large-repo git could implement the same feature. You would just disable (or warn on) operations which require stat'ing the whole tree.
Then the question is how to make the rest of the operations perform well -- git add taking 5-10 seconds seems indicative of an interesting problem, doesn't it?
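Stat'ing the whole tree is exactly the cost that grows with checkout size. As a rough, self-contained illustration (this is not git's actual code, and the file count here is tiny compared to a million-file tree), the work a status-style command must do just to detect local modifications looks like:

```python
import os
import tempfile
import time

def stat_all(root):
    """Stat every file under root, as a status-style command must
    do to detect local modifications when there is no watcher."""
    count = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    return count

# Build a small synthetic working tree; a million-file checkout
# would have ~1,000x more files than this.
with tempfile.TemporaryDirectory() as root:
    n_files = 1000
    for i in range(n_files):
        with open(os.path.join(root, f"f{i}.txt"), "w") as fh:
            fh.write("x")
    start = time.perf_counter()
    statted = stat_all(root)
    elapsed = time.perf_counter() - start

print(f"statted {statted} files in {elapsed:.4f}s")
```

Extrapolating to a million files on a cold cache easily reaches tens of seconds, which matches the numbers in the thread; an explicit-checkout model (like p4 edit) or an OS-level change journal sidesteps the loop entirely.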
It seems eminently obvious to me that having basically a "change log" for a (part of a) filesystem is something that's valuable independent of your build system, revision control system, whatnot.
At least that's what I'd like to see - it's functionality that's orthogonal to those tools.
Mac OS X's FSEvents API has something similar to that. When you create an FSEvents listener you can pass in an old event ID so the system can give you all the stuff that happened while you weren't listening [1]. Apple uses this for Time Machine (and I suspect Spotlight, too).
> To better understand this technology, you should first understand what it is not. It is not a mechanism for registering for fine-grained notification of filesystem changes. It was not intended for virus checkers or other technologies that need to immediately learn about changes to a file and preempt those changes if needed. [...]
>
> The file system events API is also not designed for finding out when a particular file changes. For such purposes, the kqueues mechanism is more appropriate.
>
> The file system events API is designed for passively monitoring a large tree of files for changes. The most obvious use for this technology is for backup software. Indeed, the file system events API provides the foundation for Apple’s backup technology.
IMO, only telling users what directories changed is a smart move. It means that the amount of data that must be kept around is much smaller. That allows the OS to keep this list around 'forever' (I do not know how long 'forever' actually is).
NTFS has this optionally in the "USN Change Journal"; see http://msdn.microsoft.com/en-us/library/aa363798.aspx. It's used by a few Microsoft features like indexing and file replication, but it's available to third party programs too.
The git add problem is because .git/index is rewritten from scratch each time a new change is staged. With a 100 MB index file, that takes as long as it takes to write that much data to disk (cache). Much room for improvement here.
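To put a rough number on that: a back-of-the-envelope calculation, where the 100 MB index size is from the comment above but the write throughput is an assumption, not a measurement:

```python
# Back-of-the-envelope: rewriting a 100 MB index on every staged change.
# The 100 MB size is from the thread; 500 MB/s is an assumed throughput
# for writes landing in the page cache.
index_bytes = 100 * 1024 * 1024
write_bytes_per_sec = 500 * 1024 * 1024

seconds_per_add = index_bytes / write_bytes_per_sec
print(f"{seconds_per_add:.2f} s of pure I/O per git add")  # 0.20 s
```

Even with everything in the page cache, every git add pays the full rewrite before locking, hashing, and fsync overheads are counted; an index format that rewrites only the touched entries would amortize this (git's later split-index mode takes roughly that approach).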
I found the original email equally disappointing, though. It boils down to "We pushed the envelope on size, it's too slow, we'd like to speed it up." Well, duh.
He uses the word 'scalability' early in the email, but shows no indication that he knows what it means. I'd love to hear if different operations slow down at different rates as the repo accumulates commits. Do they scale linearly, sublinearly, or superlinearly as the repo grows? Are there step functions at which there's a sudden dramatic slowdown (ran out of RAM, etc.)?
It's intentionally vague, but with enough details that if you're actually in a position to help, you'll recognize what's going on and can contact them directly to get more information.
You don't spill internal processes and configurations without some kind of disclosure agreements and certainly not in a public forum.
There's no need to spill internal processes and configurations. The fellow said he had a synthetic repo that he used to benchmark various operations. Surely whatever generated that test repo can scale it up or down to whatever size they like, so you can benchmark at various points and collect the data that would tell us if there is some horrible non-linear scaling going on under the covers.
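With such size-vs-time data points in hand, answering the scaling question is straightforward: fit a power law t = c * n^a on a log-log scale and read off the exponent. A minimal sketch; the timings below are invented for illustration, not Facebook's numbers:

```python
import math

def fit_scaling_exponent(sizes, times):
    """Least-squares fit of log(t) = a*log(n) + b; returns a.
    a ~ 1 means linear scaling, a > 1 superlinear, a < 1 sublinear."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Invented timings of one git operation at four synthetic repo sizes.
sizes = [10_000, 50_000, 250_000, 1_250_000]   # e.g. number of commits
times = [0.5, 2.6, 13.1, 66.0]                 # seconds (made up)
exponent = fit_scaling_exponent(sizes, times)
print(f"exponent ~ {exponent:.2f}")  # close to 1.0, i.e. roughly linear here
```

A clear exponent above 1, or a sudden jump in the residuals at some repo size, would point at exactly the kind of "interesting problem" worth reporting upstream.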
Right now it sounds like he's just trying to see what the possible solutions for his issues are. If he can provide additional benchmarks, etc., great. But he's under no obligation to provide any more than he has. Once there's a solution, then maybe.
Of course he's not under any obligation to provide any more info than he has. But given that he already has the test harness set up, and that only he has access to the hardware on which his benchmarks ran, it seems that he could easily enable more people to help him by providing additional data points.
I'm not asking for secrets here. I'm asking for some sign that he has a well-defined problem to solve.
How git performs as repo size grows to 15GB isn't hidden in a vault at Facebook somewhere; I suspect they just haven't done anything more detailed than a superficial time measurement.
And as much as I like being truly open as an ideal, it falls apart when you're dealing with competition (not cooperation) and money. At best you try to keep things open enough.
I don't see how Facebook's build needs to be kept secret. It's a purely internal process, and while they might lose something by giving details, they can also gain if someone suggests improvements. That said, there are plenty of things they need to keep secret. E.g., letting anyone export FB's full social graph would be really stupid.
Git and HG:
1. Require you to be sync'ed to tip before pushing.
2. Cannot selectively check out files.
The former means that in any reasonably sized team, you will be forced to sync 30 times a day, even if you are the only one editing your section of the source tree. The latter means that Joe who is checking in the libraries for (huge open source project) for some testing increases everyone's repo by that much, forever, even if it's deleted later.
Needless to say, the universal response is that I'm doing it wrong.
Perforce 4 life!
But seriously, it says that Google adopted Git for their repo --- does anyone know how they use it? I would expect them to want a linear history, but their teams are way too big to be able to have everyone sync'ed to tip to push...
That's not the case. In fact, in the context of Linux kernel development, there are many emails on LKML where Linus is telling someone that they shouldn't be merging random-kernel-of-the-day into their development branch.
Git is not used for their main repo. Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
> Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
That's a common use for git at Google, but not the only one (I'm a SWE). When I do use Perforce I've got enough rhythm that it doesn't get in my way, but I really like git at Google for local branches on rapidly-changing subtrees. A lot of the time I'll work on a branch to submit as a CL, but then realize I should do something else that depends on it. Perforce is a mess in this situation if the tree is changing much, and git is perfect if you just make a new branch.
That's the problem. It's NOT an easy problem to solve.
A lot of posts on hn describing some problem elicit "Why, that's no problem at all!" responses or "That's the wrong problem to think about" responses.
Honestly, that mindset is often really useful in programming, but when we get a problem that doesn't have a shortcut and is relevant, conversation goes to shit. Because I guess that's when programmers normally go into a hole and brute-force brain it out.
How to use mass comms to talk about a difficult open problem is, I suppose, itself an open problem.
It comes out of the Linux kernel, where you need a secure hash of a segment to prevent compromise. For big projects you have submodules; you can only get a level higher later.
In a company, you trust the sources. With Perforce you check out files and work with the part you want.
It is a design decision and they could have known before.
It is my guess (though I have no proof) that most places with particularly large repositories have lots of binary files in them. It's hard to get a 15GB repository if you just have text.
This sort of thing suggests a centralized check-in/check-out model, because binary files are difficult to merge sensibly, and nobody wants to spend terabytes of hard drive space storing the repository locally. And your centralized check-in/check-out needs, whatever scale they might be, are probably tolerably well served by one of the existing solutions.
Yes, but why is that a show-stopper? It's a small market, filled only with people who typically have large fistfuls of cash and are dependent on version control. It's a small market, but companies in it have the resources for a good solution.
Because those companies generally already pay the $$$ for Perforce (which has any number of deeply terrifying, shiny red, candy-like self-destruct buttons, and makes git's user interface look kind), and which, for all its other faults, handles this specific use case extremely well.
And also paying Perforce fistfuls of cash in licensing fees. I hear that Perforce is a quite a small company, and the founder wrote the lion's share of the code a couple decades ago.
I think they are probably on par with Craigslist in profits per employee (i.e., much higher than Google or Facebook). Interestingly, I think Facebook has about 1/10 the employees of Google with 1/10 the profits (off the top of my head; feel free to correct me), so I don't think they blew it out of the park with their IPO filing.
Perforce is quite expensive, yes. I don't understand the rest of your comments though. I'm not sure why company size, code author, or profit margins are relevant. Perforce is used by every major gaming studio, Pixar, Nvidia, and many more.
If I were to make a snarky comment it would be that Git is for poor people and Perforce is what you use when you grow up. That's not an even remotely reasonable statement, but it does have a teeny, tiny hint of truth to it. :)
I'm just pointing out that Perforce is making crazy profit, and somewhat ironically it's doing so more efficiently (I conjecture) than Facebook, which you are hearing a lot more about.
Perforce is a great system, but it's showing its age by now. I think there is probably room for someone to make another product in the high-end space and make boatloads of cash from big companies, but it's not easy.
Yes partly. Doing lots of commits locally before pushing to others is definitely something I like.
Another part of it is working disconnected -- with so many people coding on their laptops that's actually a pretty common use case.
Also the lack of need to do sysadmin work on git/hg is really nice. I used to run the free Perforce server a long time ago for myself, but it was annoying to do the backups. With git or hg you get whole-repository backups for free.
The "big repository with all dependencies" model has its drawbacks, but it's interesting that Facebook finds a lot of use for it, and that git is unsuitable for it. Perforce is probably still their best choice in that case.
The most recent version of Perforce added streams which is their primary answer to git and hg. Easy creation, management, and switching between branches. I've only used this at home and not in a large scale environment yet, but it's promising.
Later this year they are adding p4 Sandbox which allows for disconnected work. When that is complete and working I'm honestly not sure what advantage git will have left other than being free.
Perforce just raised the limit on its free version to 20 users and 20 workspaces; it used to be 2 users. We use it at Tinkercad and have been very happy; I used it at Google previously. The price (free) is acceptable for almost any small development organization.
If you have used ClearCase, you will know that while it's a great solution to a "what shall we do with our buckets of cash" problem, it's not something that anyone encountering performance problems would reach for.