
Great write up!

I wish there were a little more color around what "availability" really means. Yes, some data stores offer higher availability, but that doesn't mean they are immune to node failure. Even for a "CP" system, latencies can go up, and with enough node failures, partial unavailability and data loss can occur.

And few systems survive the loss of a rack without some issues, let alone a partition that splits the network in half!

In theory, a strongly consistent system could have a failover window so small that it is, from a user's point of view, always available. It would have a very easy-to-understand programming model. But such a system doesn't really exist in open source at this time, so we just don't know.



> I wish there were a little more color around what "availability" really means.

The definitions used by Gilbert and Lynch in their CAP proof require availability for updates at all nodes. Other proofs use availability for updates (or consistent reads) at a majority of nodes.

> Yes, some data stores offer higher availability, but that doesn't mean they are immune to node failure.

Theoretical models like this are always subject to real-world implementation limitations. Still, it's very useful to think about them, because they tell us what we could achieve with an ideal implementation. Knowing that allows us to choose implementation techniques and approaches suitable to the problem, and to not waste time trying to implement the impossible. Just because something is possible doesn't mean it can be, or has been, practically implemented. If something is impossible, though, we know we shouldn't spend time trying to do it.

> Even for a "CP" system, latencies can go up, and with enough node failures, partial unavailability and data loss can occur.

Of course. These models are actually pretty silent on durability, which is frequently treated as a separate matter, or as a limiting case of availability, depending on the area of research. Durability is extremely important in practice, but this isn't the area of research that addresses it.

Partial unavailability because of theoretical limitations and partial unavailability because of implementation limitations are different things. We can improve on the latter, and insist our software vendors do the same, but the former we just need to work around.

> In theory, a strongly consistent system could have a failover window so small that it is, from a user's point of view, always available.

The definitions of consistency and availability don't allow you to "fail over". Setting availability aside entirely, there are two cases of failover: one where the system still appears to be consistent (using one of the strong consistency models described here), and one where the system becomes eventually consistent. The various proofs (such as Gilbert and Lynch's CAP proof) imply that you can't "fail over" and keep consistency in the case where some nodes are uncontactable. The definition of "some nodes" depends on the exact proof, but there is no way to fail over into a minority partition and keep consistency. It's not possible.

On the other hand, there are loads of practical and useful ways to fail over into a minority partition that still give useful eventual-consistency semantics. It all depends on what you need.
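
To make that concrete, here's a rough Python sketch (all names, such as Replica and reachable_peers, are hypothetical; it assumes a fixed-size cluster and a simple majority-quorum rule) of why a node stranded in a minority partition has to refuse strongly consistent writes rather than "fail over":

  # Hypothetical majority-quorum write path. A replica that cannot reach a
  # majority of its cluster refuses the write; accepting it would let the
  # minority diverge from writes committed on the majority side.
  class QuorumError(Exception):
      pass

  class Replica:
      def __init__(self, cluster_size):
          self.cluster_size = cluster_size
          self.reachable_peers = set()  # peers currently contactable
          self.log = []                 # committed, strongly consistent writes

      def majority(self):
          return self.cluster_size // 2 + 1

      def write(self, value):
          reachable = 1 + len(self.reachable_peers)  # ourselves plus peers
          if reachable < self.majority():
              # Minority partition: refusing is the only consistent option.
              raise QuorumError("in a minority partition, cannot commit")
          self.log.append(value)
          return value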


Yes, there is the notion of what CAP requires, but we cannot cling to the strict definitions of these papers. They must be translated into layperson's terms. Otherwise, how can you communicate with your stakeholders? How can you communicate with your users?

Personally, I reject the notion that this is outside the scope of our jobs. Think of the ability and power of a so-called 'renaissance person', someone who can do anything.

> The definitions of consistency and availability don't allow you to "fail over". Setting availability aside entirely, there are two cases of failover: one where the system still appears to be consistent (using one of the strong consistency models described here), and one where the system becomes eventually consistent. The various proofs (such as Gilbert and Lynch's CAP proof) imply that you can't "fail over" and keep consistency in the case where some nodes are uncontactable. The definition of "some nodes" depends on the exact proof, but there is no way to fail over into a minority partition and keep consistency. It's not possible.

Minority partition - maybe you could restate this paragraph in terms of discoveries such as Paxos? Does Paxos not allow progress and data retrieval during some failure scenarios? Yes, realistically, once enough nodes are lost, things grind to a halt. But this is how Dynamo works as well: once the number of available nodes falls below the R or W factor, the algorithm stops making progress.
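
Roughly, something like this sketch is what I have in mind (names are made up; assume N replicas with a read quorum R and write quorum W, as in the Dynamo paper): once fewer than W replicas are reachable, writes stop making progress, and likewise reads below R.

  # Hypothetical Dynamo-style quorum operations. With N replicas, picking
  # R + W > N makes read and write quorums overlap; progress halts as soon
  # as fewer than W (for writes) or R (for reads) replicas respond.
  def quorum_write(replicas, key, value, W):
      acks = 0
      for r in replicas:
          if r.is_reachable():
              r.store(key, value)
              acks += 1
      if acks < W:
          raise RuntimeError("write quorum not met; no progress")
      return acks

  def quorum_read(replicas, key, R):
      replies = [r.load(key) for r in replicas if r.is_reachable()]
      if len(replies) < R:
          raise RuntimeError("read quorum not met; no progress")
      # Version reconciliation (picking the newest reply) is elided here.
      return replies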


> but we cannot cling to the strict definitions of these papers

This seems like an incredibly strange sentiment; I cannot grasp what you mean. In another field, it seems like you would be saying "we cannot cling to the strict definitions of gravity." At the end of the day, proofs are proofs.



