Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Paxos (martinfowler.com)
240 points by joeyespo on Jan 6, 2022 | hide | past | favorite | 47 comments


I have a bit of a man-crush on Leslie Lamport (and have (unsuccessfully) tried to get TLA+ to be used at every job I've had for the last five years), and I have read through both the original Paxos paper in addition to the Paxos Made Simple paper, and I'm just going to say that I still found it about five times harder to grok than Raft. Honestly the TLA+ specification of Paxos made it click more than the papers for me, but that could just be because of the repeated readings before.

It's a beautiful algorithm, but Raft has the advantage of being a lot simpler.


I am also a huge fan of Lamport. Paxos is so concise and of such beauty that once you understand it, all other consensus protocols seem just different ways of phrasing Paxos with additional features (e.g. membership changes, make it into a log, etc.). Paxos is the smallest, most elegant consensus building block that I have seen so far. Raft's protocol covers a much bigger area than just consensus (it gives you a log, supports reconfiguration, etc.). Paxos' entire protocol fits a single slide. Period.

I wrote a few posts about how to understand paxos, and this is my personal favorite – https://blog.the-pans.com/understanding-paxos/.

Lamport did comment on it (via email) saying that understanding Paxos is less important than proving it works (via TLA+).


Your view that Paxos is an algorithm for read-modify-write is a really deep insight. Once you adopt this point of view, things like Raft are just RMW updates of a log. (I like to think of Paxos as the fault-tolerant version of the load-linked/store-conditional pair of instructions.)

However, there is a difficulty in expressing exactly what RMW-Paxos does, because it may end up "modifying" a value that was never "committed".

For example, assume that I have three acceptors storing a replicated counter. The proposer executes phase-1, increments the value received from the acceptor with the highest ballot, and sends the value to all acceptors in phase-2. Consider now an execution where all acceptors initially hold 0. The proposer manages to store a new value 1 in one acceptor, and then crashes. Thus, 1 was not committed and a reader may conclude that the consensus value is still 0. Now a new proposer arrives, receives (1, 0, 0) from the three acceptors, chooses 1 as the newest value, and writes (2, 2, 2) to all acceptors. Thus "2" is now committed, and the commit of the "2" retroactively commits the "1". Thus, the condition for being committed is no longer "there exists a phase-2 quorum ...", but it is more complicated and depends on future history.

I have verified with TLA+ that this algorithm does indeed implement an atomic read-modify-write operation, but only under a complicated notion of being "committed" that boils down to either having a phase-2 quorum, or one of the successor operations having a phase-2 quorum.

Did you ever figure out a simpler way to say this?


Instead of Read-Modify-Write, I like to think of it as Read/Determine/Write cycle.

A note worthy point is that, Paxos actually expects proposers to drop their value and carry-forward someone-else's value to complete the consensus. This is typically not an intended behavior from client's point of view, but proposers role is to form the consensus, not push their value.

Read phase is basically understanding if there is any value that could've been chosen when F nodes are unavailable.

Determine phase is to choose to carry-forward any value (if-any) from the Read phase or use the new value from the client.

Write phase is to actually ask majority of the acceptors to save the value selected in the Determine phase.


You are totally correct that in Paxos the proposer is supposed to "carry-forward" the value proposed by somebody else. In this view, which is how everybody understands Paxos, Paxos is an algorithm for consensus.

However, the proposer can also modify the value proposed by somebody else, and it turns out that this is a perfectly valid algorithm for implementing a distributed read-modify-write. This variant does more than just consensus---it basically behaves like a fault-tolerant memory with an atomic update operation (think compare-and-swap or load-linked/store-conditional). In his blog post, GP also mentions that paxos is a read-modify-write transaction, but perhaps I read too much into what he is saying. Either way, this RMW variant is correct, I have verified it exhaustively with TLA+, and it is used in a few production systems that I have implemented.

The difficulty is now to say exactly what this RMW variant does. I was hoping that GP (or anybody else) may have some insights.


In case anyone wants to read more about the RMW variant, this is called CASPaxos: https://arxiv.org/abs/1802.07000

I first learned about this as a project called gryadka and I had trouble believing that extending Paxos from consensus (write once) to a full linearizable register worked until model checking a TLA+ spec for it that I wrote.


> Lamport did comment on it (via email) saying that understanding Paxos is less important than proving it works (via TLA+).

I suppose that's not wrong; if I know for a fact that something is going to behave how I expect it too, I don't generally care how it works. I can treat the algorithm as a black box and just go from there.

That said, I dont' completely agree. If I am implementing Paxos in NewLangOfTheMonth, it's likely that a straight one-to-one port of another version will not fully work due to tiny differences in the language semantics. If I don't understand Paxos, then it's unlikely I'll be able to correct any mistakes in the implementation.


Yeah I agree. I think understanding the protocol makes it much less likely to introduce protocol level bugs during implementation. E.g. why certain steps need to be done in certain order.


Agreed, being able to trust it is probably what I want in order to use it at my job to get shit done - being able to understand it/implement it is what I want to do on my own time.

Unless you're one of the rare group who gets paid to implement it in x language/library/technology, it's more of a mental exercise than anything, something nice to go "ahhh cool!" when you understand it, then move on.


Re: "proving it works", in a previous interview Lamport said that some folks measure simplicity by showing something to people and they say whether or not they understand it, while he measures simplicity by showing something to people and they can prove it's correct.


Allowing me to bikeshed for a second, was the progress bar a stylistic choice on this blog or part of some default styling? I keep seeing it pop up on random sites and I can't help but think people are developing on browsers that elide the scrollbar. Adding it back in seems like a signal that it was actually useful information for the user. I know safari removes it, not sure about chrome. Firefox does not remove it on desktop but does on mobile.


Their slide [1] looks like just a reworded version of yours [2]?

[1]: https://martinfowler.com/articles/patterns-of-distributed-sy...

[2]: https://blog.the-pans.com/paxos-explained/#paxos


I totally agree. Paxos is such a simple solution that it is so amazing.


You might be interested in this paper [1] which presents Paxos in the same way as Raft is presented in the Raft paper. In my view it really illustrates that most of the simplicity of Raft comes from its excellent presentation and not necessarily from its algorithmic simplicity.

[1]https://arxiv.org/pdf/2004.05074


> I have a bit of a man-crush on Leslie Lamport

+1 - I really admire him too. I wonder, what makes people like him and Edsger Dijkstra so charismatic in this sense?


I think a lot of it has to do with the fact that he's a very prolific writer and speaker, in addition to contributing to a fairly diverse set of Computer Science fields.

I usually enjoy Lamport's speeches, and I have said before that his book Specifying Systems should be mandatory reading for nearly anyone interested in CS, due to how approachable it is, and yet how deep it goes.


It's interesting that you mentioned Dijkstra because Lamport in one interview said that the original Paxos paper (The Part-time Parliment) was written in the way it was exactly because he was inspired by Dijkstra in his famous examples such as Sleeping barber, etc.


Dijkstra was and Lamport is very good at writing clearly and distinctly. Many people with that level of raw brainpower don't make that effort, so the ones that do are easy to admire and be grateful to.


My grey matter is failing me but does anyone remember a rant about 5-10 years ago about developing software in this era that had some punch line about consensus algorithms that went a little like, "so, there's this guy named Diego?"

edit: found it

https://circleci.com/blog/its-the-future/


That convo reminded me of Google inside joke: i just want to serve 5TB.

https://youtu.be/3t6L-FlfeaI


As a longtime yahoo, ouch and double-ouch as we’ve always had our own version of this hell, just with different names.


I thought you were going to link my favourite humor piece on consensus https://scholar.harvard.edu/files/mickens/files/thesaddestmo...


Thank you so much. This is hilarious. It really is in the same vain of being at whit's end of fundamental truths of space and time.

> It may be difficult if you can’t trust anything and the entire concept of happiness is a lie designed by unseen overlords of endless deceptive power.


>Why don’t I just use Google’s thing?

>-You think that’s going to be around in 6 months?

bahaha.


Shameless plug of a blog series where I try to list and summarize every paxos variant:

https://vadosware.io/post/paxosmon-gotta-concensus-them-all/


Great read. One thing you may be interested to know: when do you need paxos or similar, and when can you get away with gossip and CRDTs? the answer is that the set of domains that can be expressed using CRDTs is exactly the set of monotonic logic.

It’s a bit tricky to explain precisely in an HN comment but basically, something is monotonic if, once you learn that something is true, nothing can disprove it. For example, let’s say you know of a bunch of nodes on a graph, and you’re slowly learning about edges between the nodes. Once you see a cycle, no amount of new edges will ever “disprove” the cycle. So this means that, if what you care about is cycles, then a CRDT is fine. However, let’s say you’re trying to do disturbed garbage collection of nodes that have no edges leading to them. (And as before you’re slowly learning about new edges.) You can see a node has no edges leading to it that you know of, but if you think a node of garbage-collectible you might always be proved wrong by learning about a new edge. So the problem of distributed garbage collection can’t be solved using CRDTs.

There’s a paper on this that you’ll find if you google for the CALM theorem.


Thanks for taking the time to write this out -- I'm digging into the CALM theorem right now to try and get it to sink in. This explanation is fantastic and it makes me feel like I understand the concepts better than I probably do in reality.


This is utterly fantastic stuff. Assuming it's correct, which with Paxos...


Don't worry you don't have to take my word for it! I left all the links in and copied some PDFs to my domain as well just to make sure people could read the actual papers with the actual insight! :)


That was an interesting read, thanks!


Thanks! Hope it is a decent introductory reference -- there's so much out there.


Lamport has published a TLA+ spec of the Paxos consensus algorithm¹, but none I could find of the Paxos algorithm itself (aka multi-Paxos).

In 2016 some academics at Stony Brook University did, along with a machine-checked TLAPS proof, updated in 2019:

https://arxiv.org/abs/1606.01387

¹Paxos consensus refines specs for consensus and voting. See the spec and accompanying material here: https://lamport.azurewebsites.net/tla/paxos-algorithm.html


Is there a mistake on the second diagram where prepare(1,e) is sent to Byzantium? Wonder if it should be prepare(1,a), since Athens hasn't learned the (1,e) value which only Ephesus and Delphi know about at that point?

Otherwise I think it's a pretty good description of Paxos. I like that it walks through a good number of failure scenarios and mentions interesting corner cases such as a lack of an explicit commit in the original papers and that reading also needs to run the full Paxos algorithm.


It does seem to be a typo, based on the table and text below it.


There are huge gaps between the scientific paper and how you actually operate and scale Paxos from an engineering point of view. See "Paxos Made Live - An Engineering Perspective" by Tushar Chandra et.al.

Paxos also has drawbacks like being slow or lacking liveness guarantees. Some alternatives are discussed in https://aws.amazon.com/builders-library/leader-election-in-d...


I prefer a simpler approach where only 100% read uptime is guaranteed: http://talk.binarytask.com/task?id=4662372011677848884

The way to do this is to asynconously write to all nodes in real-time, here you can try it out: http://root.rupy.se

Go to http://root.rupy.se/node?make&info and press Make and you'll se the replication in real-time! (although it's too fast so you'll only see it as being delivered in one go, if you wireshark it you'll see multiple HTTP chunks...)

The source is also here: https://github.com/tinspin/rupy in Root.java


Paxos has a reputation of being hard to understand. I was under the impression that Raft was now preferred for distributed consensus. The creators of the Raft algorithm explicitly had understandability as a goal when designing it and called out Paxos as being hard to understand.

https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf


(Multi-)Paxos and Raft are incredibly similar. I would even argue that Raft is just a variant of Paxos in the same way that e.g. Fast Paxos and EPaxos are variants of Paxos. Most of the Raft algorithm is identical to the Paxos algorithm. The main difference between the two algorithms is how leadership election works. In my view the main reason why Raft is considered to be simpler is because it was presented in a really clear paper. Paxos becomes equally simple when using a similar representation as used in that paper. See this paper: https://arxiv.org/pdf/2004.05074


> Unmesh Joshi

It feels kinda strange to see an author other than Martin Fowler blogging on martinfowler.com....


Unmesh Joshi is the Chief Technologist at ThoughtWorks and a colleague of Martin. Martin publishes a lot of posts on behalf of many colleagues to give visibility to tech posts that interest Martin. It's more like a collaboration and acknowledging contributions.


This came up with another martinfowler.com post, and I checked the past couple years. Roughly half the content is Fowler's, and the other half are other writers (from memory, not recounting it now). I don't know the ratio going back prior years (only checked 2021 and 2020) but I've seen lots of posts by others on his site for the past few years.

https://martinfowler.com/articles/patterns-of-distributed-sy... - other content by the same author under title Patterns of Distributed Systems.


Martin Fowler is just a brand now. He's not a spectacular dev that knows all the right ways of doing things, he's just good at finding decent approaches to things and great at writing and communicating them. It's good that he found other writers to keep high quality content.

I used to hate people that preached his ways, but I recognized they're often pretty good. People just take them too much as laws and not opinions.

People have to remember that he's just human like the rest of us.


Maybe so, although he has been featuring other authors on his blog regularly now. He works with them to keep the quality high, check them out.


Paxos reminds me of all the TCP variants: Tahoe, Reno, Vegas, Veno, Westwood, Africa, Fast-Reno, Eifel, DOOR, Nice, YeAH, BIC, CUBIC, Illinois, Libra, Hybla, ARENO, Fusion.


I think a flow chart would have been easier to understand, or perhaps a video or .gif


No, it doesn't. A video might help you _feel_ like you understand it but it doesn't really help you understand it. That's just not how distributed systems work.


Now that's a little silly to say, some people grok more from video, some more from articles with pictures.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: