Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on hacker news is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely the same issue.
So instead of a potentially very enlightening conversation identifying and discussing limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
Your comment was at the top so I continued to read expecting to find a bunch of ignorant group think about how git is awesome and Facebook is dumb, but that's not really what's going on down below.
I don't know what Facebook's use case is, so I have no idea if their repositories are optimally structured. However, I've used git on a very large repository and ran into some of the same performance issues that they did (30+ seconds to run git status), so I don't think it's terribly hard to imagine they're in a similar situation.
What we did to solve it is exactly what you're excoriating the people below for suggesting: we split the repo and used other tools to manage multiple git repos ('Repo' in some situations, git submodules in others).
However, we moved to that workflow mainly because it had a number of other advantages, not just because it made day-to-day git operations faster.
I hope git gets faster; some of the performance problems described are things we saw too. But things are always more complicated, and I see nothing below that looks like the knee-jerk, ignorant consensus you're describing.
Sometimes the answer to "it hurts when I do this" is "don't do that... because there's other ways to solve the same issue that work better for a number of other reasons and we haven't bothered fixing that particular one because most of the time the other way works better anyway."
On a similar note, I've heard of people who would hit the size limit on Fortran files, so they put every variable into a function call to the next file, which itself contained one function and a function call to the next file after that (if necessary).
It is intuitively obvious that it is better to be rich and healthy than poor and ill. Sadly, in reality the choices are often neither. And you cannot always just split a repo; hearing some ideas for that case would have been interesting.
Solving a scaling problem by splitting it is, well, obvious.
And, yes, I also ran git on a couple of projects at $work, and the issues are real; I've seen them.
So, if it hurts when I try to use git, the answer will be "don't use git"... But the conveniences are so tempting...
Stat'ing a million files is going to take a long time. Perforce doesn't have this problem because you explicitly check out files (p4 edit). (Perforce marks the whole tree read-only, as a reminder to edit the file before you save.)
It seems like large-repo git could implement the same feature. You would just disable (or warn on) operations which require stat'ing the whole tree.
Then the question is how to make the rest of the operations perform well -- git add taking 5-10 seconds seems indicative of an interesting problem, doesn't it?
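Stat'ing the whole tree is exactly the cost that grows with checkout size. As a rough, self-contained illustration (this is not git's actual code, and the file count here is tiny compared to a million-file tree), the work a status-style command must do just to detect local modifications looks like:

```python
import os
import tempfile
import time

def stat_all(root):
    """Stat every file under root, as a status-style command must
    do to detect local modifications when there is no watcher."""
    count = 0
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    return count

# Build a small synthetic working tree; a million-file checkout
# would have ~1,000x more files than this.
with tempfile.TemporaryDirectory() as root:
    n_files = 1000
    for i in range(n_files):
        with open(os.path.join(root, f"f{i}.txt"), "w") as fh:
            fh.write("x")
    start = time.perf_counter()
    statted = stat_all(root)
    elapsed = time.perf_counter() - start

print(f"statted {statted} files in {elapsed:.4f}s")
```

Extrapolating to a million files on a cold cache easily reaches tens of seconds, which matches the numbers in the thread; an explicit-checkout model (like p4 edit) or an OS-level change journal sidesteps the loop entirely.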
It seems eminently obvious to me that having basically a "change log" for a (part of a) filesystem is something that's valuable independent of your build system, revision control system, whatnot.
At least that's what I'd like to see - it's functionality that's orthogonal to those tools.
Mac OS X's FSEvents API has something similar to that. When you create an FSEvents listener you can pass in an old event ID so the system can give you all the stuff that happened while you weren't listening [1]. Apple uses this for Time Machine (and I suspect Spotlight, too).
> To better understand this technology, you should first understand what it is not. It is not a mechanism for registering for fine-grained notification of filesystem changes. It was not intended for virus checkers or other technologies that need to immediately learn about changes to a file and preempt those changes if needed. [...]
>
> The file system events API is also not designed for finding out when a particular file changes. For such purposes, the kqueues mechanism is more appropriate.
>
> The file system events API is designed for passively monitoring a large tree of files for changes. The most obvious use for this technology is for backup software. Indeed, the file system events API provides the foundation for Apple’s backup technology.
IMO, only telling users what directories changed is a smart move. It means that the amount of data that must be kept around is much smaller. That allows the OS to keep this list around 'forever' (I do not know how long 'forever' actually is).
NTFS has this optionally in the "USN Change Journal"; see http://msdn.microsoft.com/en-us/library/aa363798.aspx. It's used by a few Microsoft features like indexing and file replication, but it's available to third party programs too.
The git add problem is because .git/index is rewritten from scratch each time a new change is staged. With a 100 MB index file, that takes as long as it takes to write that much data to disk (cache). Much room for improvement here.
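To put a rough number on that: a back-of-the-envelope calculation, where the 100 MB index size is from the comment above but the write throughput is an assumption, not a measurement:

```python
# Back-of-the-envelope: rewriting a 100 MB index on every staged change.
# The 100 MB size is from the thread; 500 MB/s is an assumed throughput
# for writes landing in the page cache.
index_bytes = 100 * 1024 * 1024
write_bytes_per_sec = 500 * 1024 * 1024

seconds_per_add = index_bytes / write_bytes_per_sec
print(f"{seconds_per_add:.2f} s of pure I/O per git add")  # 0.20 s
```

Even with everything in the page cache, every git add pays the full rewrite before locking, hashing, and fsync overheads are counted; an index format that rewrites only the touched entries would amortize this (git's later split-index mode takes roughly that approach).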
I found the original email equally disappointing, though. It boils down to "We pushed the envelope on size, it's too slow, we'd like to speed it up." Well, duh.
He uses the word 'scalability' early in the email, but shows no indication that he knows what it means. I'd love to hear if different operations slow down at different rates as the repo accumulates commits. Do they scale linearly, sublinearly, or superlinearly as the repo grows? Are there step functions at which there's a sudden dramatic slowdown (ran out of RAM, etc.)?
It's intentionally vague, but with enough details that if you're actually in a position to help, you'll recognize what's going on and can contact them directly to get more information.
You don't spill internal processes and configurations without some kind of disclosure agreements and certainly not in a public forum.
There's no need to spill internal processes and configurations. The fellow said he had a synthetic repo that he used to benchmark various operations. Surely whatever generated that test repo can scale it up or down to whatever size they like, so you can benchmark at various points and collect the data that would tell us if there is some horrible non-linear scaling going on under the covers.
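With such size-vs-time data points in hand, answering the scaling question is straightforward: fit a power law t = c * n^a on a log-log scale and read off the exponent. A minimal sketch; the timings below are invented for illustration, not Facebook's numbers:

```python
import math

def fit_scaling_exponent(sizes, times):
    """Least-squares fit of log(t) = a*log(n) + b; returns a.
    a ~ 1 means linear scaling, a > 1 superlinear, a < 1 sublinear."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Invented timings of one git operation at four synthetic repo sizes.
sizes = [10_000, 50_000, 250_000, 1_250_000]   # e.g. number of commits
times = [0.5, 2.6, 13.1, 66.0]                 # seconds (made up)
exponent = fit_scaling_exponent(sizes, times)
print(f"exponent ~ {exponent:.2f}")  # close to 1.0, i.e. roughly linear here
```

A clear exponent above 1, or a sudden jump in the residuals at some repo size, would point at exactly the kind of "interesting problem" worth reporting upstream.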
Right now it sounds like he's just trying to see what the possible solutions for his issues are. If he can provide additional benchmarks, etc., great. But he's under no obligation to provide any more than he has. Once there's a solution, then maybe.
Of course he's not under any obligation to provide any more info than he has. But given that he already has the test harness set up, and that only he has access to the hardware on which his benchmarks ran, it seems that he could easily enable more people to help him by providing additional data points.
I'm not asking for secrets here. I'm asking for some sign that he has a well-defined problem to solve.
How git performs as repo size grows to 15GB isn't hidden in a vault at Facebook somewhere; I suspect they just haven't done anything more detailed than a superficial time measurement.
And as much as I like being truly open as an ideal, it falls apart when you're dealing with competition (not cooperation) and money. At best you try to keep things open enough.
I don't see how Facebook's build needs to be kept secret. It's a purely internal process, and while they might lose something by giving details, they can also gain if someone suggests improvements. That said, there are plenty of things they need to keep secret. E.g., letting anyone export FB's full social graph would be really stupid.
Git and HG:
1. Require you to be sync'ed to tip before pushing.
2. Cannot selectively check out files.
The former means that in any reasonably sized team, you will be forced to sync 30 times a day, even if you are the only one editing your section of the source tree. The latter means that Joe who is checking in the libraries for (huge open source project) for some testing increases everyone's repo by that much, forever, even if it's deleted later.
Needless to say, the universal response is that I'm doing it wrong.
Perforce 4 life!
But seriously, it says that Google adopted Git for their repo --- does anyone know how they use it? I would expect them to want a linear history, but their teams are way too big to be able to have everyone sync'ed to tip to push...
That's not the case. In fact, in the context of Linux kernel development, there are many emails on LKML where Linus is telling someone that they shouldn't be merging random-kernel-of-the-day into their development branch.
Git is not used for their main repo. Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
> Git is used as a local cache for perforce where a branch roughly corresponds to a CL. Only subtrees of interest are checked out.
That's a common use for git at Google, but not the only one (I'm a SWE). When I do use Perforce I've got enough rhythm that it doesn't get in my way, but I really like git at Google for local branches on rapidly-changing subtrees. A lot of the time I'll work on a branch to submit as a CL, but then realize I should do something else that depends on it. Perforce is a mess in this situation if the tree is changing much, and git is perfect if you just make a new branch.
That's the problem. It's NOT an easy problem to solve.
A lot of posts on hn describing some problem elicit "Why, that's no problem at all!" responses or "That's the wrong problem to think about" responses.
Honestly, that mindset is often really useful in programming, but when we get a problem that doesn't have a shortcut and is relevant, conversation goes to shit. Because I guess that's when programmers normally go into a hole and brute-force brain it out.
How to use mass comms to talk about a difficult open problem is, I suppose, itself an open problem.
It comes out of the Linux kernel, where you need a secure hash of a segment to prevent compromise. For big projects you have submodules; you can only get a level higher later.
In a company, you trust the sources. With Perforce you check out files and work with the part you want.
It is a design decision and they could have known before.
It is my guess (though I have no proof) that most places with particularly large repositories have lots of binary files in them. It's hard to get a 15GB repository if you just have text.
This sort of thing suggests a centralized check-in/check-out model, because binary files are difficult to merge sensibly, and nobody wants to spend terabytes of hard drive space storing the repository locally. And your centralized check-in/check-out needs, whatever scale they might be, are probably tolerably well served by one of the existing solutions.
Yes, but why is that a show-stopper? It's a small market, filled only with people who typically have large fistfuls of cash and are dependent on version control. It's a small market, but companies in it have the resources for a good solution.
Because those companies generally already pay the $$$ for Perforce (which has any number of deeply terrifying, shiny red, candy-like self-destruct buttons, and makes git's user interface look kind), and which, for all its other faults, handles this specific use case extremely well.
And also paying Perforce fistfuls of cash in licensing fees. I hear that Perforce is a quite a small company, and the founder wrote the lion's share of the code a couple decades ago.
I think they are probably on par with Craigslist in profits per employee (i.e., much higher than Google or Facebook). Interestingly, I think Facebook has about 1/10 the employees of Google with 1/10 the profits (off the top of my head; feel free to correct me), so I don't think they blew it out of the park with their IPO filing.
Perforce is quite expensive, yes. I don't understand the rest of your comments though. I'm not sure why company size, code author, or profit margins are relevant. Perforce is used by every major gaming studio, Pixar, Nvidia, and many more.
If I were to make a snarky comment it would be that Git is for poor people and Perforce is what you use when you grow up. That's not an even remotely reasonable statement, but it does have a teeny, tiny hint of truth to it. :)
I'm just pointing out that Perforce is making crazy profit, and somewhat ironically it's doing so more efficiently (I conjecture) than Facebook, which you are hearing a lot more about.
Perforce is a great system, but it's showing its age by now. I think there is probably room for someone to make another product in the high-end space and make boatloads of cash from big companies, but it's not easy.
Yes partly. Doing lots of commits locally before pushing to others is definitely something I like.
Another part of it is working disconnected -- with so many people coding on their laptops that's actually a pretty common use case.
Also the lack of need to do sysadmin work on git/hg is really nice. I used to run the free Perforce server a long time ago for myself, but it was annoying to do the backups. With git or hg you get whole-repository backups for free.
The "big repository with all dependencies" model has its drawbacks, but it's interesting that Facebook finds a lot of use for it, and that git is unsuitable for it. Perforce is probably still their best choice in that case.
The most recent version of Perforce added streams which is their primary answer to git and hg. Easy creation, management, and switching between branches. I've only used this at home and not in a large scale environment yet, but it's promising.
Later this year they are adding p4 Sandbox which allows for disconnected work. When that is complete and working I'm honestly not sure what advantage git will have left other than being free.
Perforce just raised the limit on its free version to 20 users and 20 workspaces; it used to be 2 users. We use it at Tinkercad and have been very happy; I used it at Google previously. The price (free) is acceptable for almost any small development organization.
If you have used ClearCase, you will know that while it's a great solution to a "what shall we do with our buckets of cash" problem, it's not something that anyone encountering performance problems would reach for.