MySQL High Availability at GitHub (githubengineering.com)
225 points by samlambert on June 20, 2018 | 53 comments


I feel like this article gets at an interesting tradeoff in distributed databases: using consensus to store your data, or using consensus to point at the master for the data.

A lot of the bleeding-edge DBMSs out there (e.g. CockroachDB, TiDB) use the former, probably because of the Spanner influence, but maybe the latter is good enough for almost everyone.


Replicating data in consensus groups gives some nice guarantees; in particular, it can guarantee monotonic read consistency even if you're not reading from the leader, at the cost of a network round-trip. Switching between master-slave replicas might cause you to go back and forth in time, even if replication is synchronous.

There are also some potential availability benefits to storing data in consensus groups, though most implementations use leader leases to improve performance, which reduces availability in case of leader failure. A master-slave system that does synchronous replication could do failovers that are just as fast as leader changes in a consensus protocol.

The pain of storing data in consensus groups is in performance, especially concurrency. Every write to the log requires some synchronization with a quorum to ensure consensus on the order. That introduces extra round-trips and makes it hard to handle high concurrency.

There are quite a few bleeding edge distributed databases that have trouble matching the throughput of a single node RDBMS unless you're willing to spend an order of magnitude more on hardware or your workload follows a very specific pattern.
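
To make that concrete, here's a toy sketch of why a consensus-replicated log pays that synchronization cost: the leader can't acknowledge a write until a majority of replicas (itself included) has accepted it at the same log position. All names and the failure simulation are made up; this isn't any particular database's internals.

    # Toy model of a consensus-style log append: the leader only acknowledges
    # a write once a majority of replicas (itself included) has accepted it at
    # the same log index. Names and the failure simulation are made up.
    import random

    class Replica:
        def __init__(self, name):
            self.name = name
            self.log = []

        def append(self, index, entry):
            if random.random() < 0.2:      # simulate an unreachable replica
                return False
            if index == len(self.log):     # accept only contiguous appends
                self.log.append(entry)
                return True
            return False

    def replicated_write(leader, followers, entry):
        index = len(leader.log)
        leader.log.append(entry)           # leader writes locally...
        acks = 1 + sum(f.append(index, entry) for f in followers)
        majority = (1 + len(followers)) // 2 + 1
        return acks >= majority            # ...but only acks with a quorum

    leader = Replica("leader")
    followers = [Replica("f1"), Replica("f2")]
    print(replicated_write(leader, followers, "UPDATE accounts ..."))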


NewSQL databases may still use consensus to point to a leader for data; it's just that data is transparently partitioned and each partition can have a different leader.

In TiDB, for example, each region (data partition) has its own leader, and only that leader can serve writes (but the leader can change via a Raft-based election). By default, all servers are both leaders of some partitions and followers of others [1].

Putting all your data on one master stops scaling once the data gets large enough. However, even before that point, and even without high-availability requirements, you are making trade-offs which may be good or bad (e.g. throughput is limited on a single master, but latency is low).

[1]: https://pingcap.github.io/docs/tikv/tikv-overview/
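
To illustrate the per-partition leadership idea, here's a toy model (not TiKV's actual data structures; region boundaries and node names are invented): each key range maps to a region, each region has its own current leader, and a write gets routed to that leader.

    # Toy model of per-partition leadership: each region (key range) has its
    # own leader, and a write is routed to the leader of the region that owns
    # the key. Region boundaries and node names here are invented.
    regions = [
        {"range": ("a", "m"), "leader": "node1", "followers": ["node2", "node3"]},
        {"range": ("m", "z"), "leader": "node2", "followers": ["node1", "node3"]},
    ]

    def region_for(key):
        for r in regions:
            lo, hi = r["range"]
            if lo <= key < hi:
                return r
        raise KeyError(key)

    def route_write(key, value):
        r = region_for(key)
        # In a real system the leader may have changed; the client would retry
        # against the new leader reported by the placement component.
        return f"send PUT {key}={value} to {r['leader']}"

    print(route_write("apple", 1))   # goes to node1
    print(route_write("melon", 2))   # goes to node2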


Isn't sharding orthogonal to this? It seems completely possible to have a system where the individual data partitions are asynchronously replicated, but the metadata of leader leases and locations for all the partitions is kept in a separate strongly-consistent, fault-tolerant store. I think Vitess does something like that, and certainly Bigtable [1] was an early implementation, using Chubby to point to the "root" tablet.

[1] https://static.googleusercontent.com/media/research.google.c...
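
For concreteness, the lookup path in such a design is tiny; a rough sketch, assuming a Consul agent on localhost and a hypothetical KV key per cluster (the key layout is made up). The data path stays plain replication; only the "where is the master" metadata lives in the strongly consistent store.

    # Look up the current master of a MySQL cluster from Consul's KV store,
    # then connect to it. The key name "mysql/clusterA/master" is hypothetical.
    import base64
    import json
    import urllib.request

    def current_master(cluster):
        url = f"http://127.0.0.1:8500/v1/kv/mysql/{cluster}/master"
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        # Consul returns KV values base64-encoded.
        return base64.b64decode(payload[0]["Value"]).decode()

    host = current_master("clusterA")
    print(f"connect to {host}:3306")  # reads can still go to async replicas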


Sharding is generally used to refer to a particular type of partitioning: physically distributing data on different machines with the purpose of scaling horizontally (to increase write capacity).

What many NewSQL databases use is a logical partitioning scheme where data is partitioned into <= 1 GB chunks. This makes it easy to both scale horizontally and achieve fault tolerance (and change these characteristics without interrupting service). In some systems that component may be coupled and hidden away. In TiDB it is a separately deployed component called the Placement Driver (which embeds etcd, a Chubby equivalent): https://pingcap.com/blog/2016-11-09-Deep-Dive-into-TiKV/#pla...

With sharding you do still need a system to locate your data. A big difference is that most sharding systems are not designed to distribute transactions (or possibly even joins) across shards. So Vitess has a special 2PC for cross-shard transactions, with poorer performance than a single-shard transaction.
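
The extra cost is easy to see in a sketch of two-phase commit (toy code, not Vitess's implementation): the coordinator needs a prepare round-trip to every shard involved and can only commit after all of them vote yes, so a cross-shard transaction is strictly more expensive than a single-shard one.

    # Minimal two-phase commit coordinator. Each "shard" votes in the prepare
    # phase; the transaction commits only if every participant votes yes.
    class Shard:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def prepare(self, txn):
            # A real shard would acquire locks and persist a prepare record here.
            return self.healthy

        def commit(self, txn):
            print(f"{self.name}: commit {txn}")

        def abort(self, txn):
            print(f"{self.name}: abort {txn}")

    def cross_shard_txn(shards, txn):
        # Phase 1: one round-trip to every shard involved in the transaction.
        if all(s.prepare(txn) for s in shards):
            # Phase 2: another round-trip to actually commit.
            for s in shards:
                s.commit(txn)
            return True
        for s in shards:
            s.abort(txn)
        return False

    cross_shard_txn([Shard("shard1"), Shard("shard2")], "txn-42")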


Here is a video demo of FaunaDB maintaining ACID transactional correctness even while datacenters go offline. Relevant because the underlying cluster manager maintains correctness at both the master-tracking and data level.

https://blog.fauna.com/demonstrating-transactional-correctne...


From this article I can't tell whether they use plain MySQL replication with GTID or MySQL Cluster/Galera/a similar solution.


Vitess is another system for horizontally scaling MySQL, used at YouTube: https://vitess.io/


Another cool one is ProxySQL. We opted for that rather than Vitess because it allowed a crawl, walk, run approach. Vitess came across as all-or-nothing.


GitHub is evaluating Vitess. They'll probably talk about it if and when they deploy it into production.


Am I mistaken, or is promoting a slave to master kind of a fragile operation (detecting that it really did fail)? I mean, there are more robust methods:

1. There is a BookKeeper cluster providing a replicated log. (No, not Kafka! Hold on a second, see step 4.)

2. All nodes of a single MySQL cluster run a service with two responsibilities: accepting writes and following the log. The MySQL service itself only serves reads; writes are performed indirectly by the writer service running beside MySQL.

3. BookKeeper (and DistributedLog on top of it) enforces a single writer, so essentially one node is appending to the log and the others are following it.

4. Whenever one of the slaves thinks the master has failed (or is partially failed), it issues a FENCE request to BookKeeper, which either stops the current master from appending to the log or fails to get a majority vote from the BookKeeper nodes. Either way the log stays consistent (there are never two masters at a single point in time).

5. If a slave is able to fence the current master, it just notifies the service discovery component to redirect clients. This doesn't have to happen atomically: the old master is fenced and can't append to the log anyway, so its client operations will be safely rejected.

This approach can convert any data store into a consistent, highly available store. It can even run multiple databases grouped as one cluster, since the replicated log is a separate module; replicating from MongoDB to SQL Server is possible, for example. It has the advantage that you don't have to adopt any untested databases. Just use good old MySQL/PG/...

I understand that it requires quite a lot of work to refactor already-running infrastructure, but it is more robust and somewhat safer: BookKeeper's built-in fencing mechanism precludes any stray writes from a partially failed master.
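
A rough sketch of that fencing property (invented names; BookKeeper's real mechanism works on ledgers and quorums of bookies, this just shows why a stale master's writes get rejected):

    # Toy fencing: the log only accepts appends tagged with the current epoch.
    # A failover bumps the epoch, so a partially failed old master that is still
    # trying to write gets rejected instead of corrupting the log.
    class FencedLog:
        def __init__(self):
            self.epoch = 1
            self.entries = []

        def fence(self):
            # Called by the slave that wants to take over; in BookKeeper this
            # requires acknowledgment from a quorum of bookies.
            self.epoch += 1
            return self.epoch

        def append(self, writer_epoch, entry):
            if writer_epoch != self.epoch:
                raise PermissionError("stale writer: fenced")
            self.entries.append(entry)

    log = FencedLog()
    old_master_epoch = log.epoch
    log.append(old_master_epoch, "write 1")     # ok

    new_master_epoch = log.fence()              # slave fences before promotion
    log.append(new_master_epoch, "write 2")     # ok
    try:
        log.append(old_master_epoch, "write 3") # old master is rejected
    except PermissionError as e:
        print(e)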


I think the joyent-manatee HA [https://github.com/joyent/manatee] solution is the safest way of taking a system like PG that supports synchronous replication and giving it good availability properties while preserving durability.

They basically have 3 nodes:

- the master which must synchronously commit all writes to the slave

- the slave which accepts synchronous writes from the master and may be promoted to the master

- the asynchronous slave that accepts writes asynchronously from the slave [or from the master if you support rolling back the timeline]

If the master dies, the slave can promote itself to master if it gains the support of the asynchronous slave.

If the slave dies, then the master can no longer commit, and it tries to promote the asynchronous slave to be the new slave. If there is some confusion over who is dead, the asynchronous slave basically adjudicates, because it can only be promoted to slave once, either by the master or by the original slave.

Basically, writes are only accepted if at least 2 nodes acknowledge them, and the system can survive one node dying. But because you need the cooperation of 2 of 3 nodes in order to actually commit data, the system doesn't have the split-brain issues you get with an external system checking for health and then telling nodes to stop connecting to the unhealthy master.
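
A toy version of that 2-of-3 rule (hypothetical names; the real Manatee state machine is more involved): a write commits only if the sync slave acks it, and a promotion happens only with the remaining node's agreement.

    # Toy version of the 2-of-3 rule: the master commits a write only if its
    # synchronous slave acks it, and the slave may promote itself only if the
    # asynchronous slave agrees that the master is gone.
    class Node:
        def __init__(self, name, alive=True):
            self.name = name
            self.alive = alive

    def commit(master, sync_slave, entry):
        # entry is only committed if both the master and sync slave persist it.
        if not (master.alive and sync_slave.alive):
            return False          # fewer than 2 of 3 nodes: stop accepting writes
        return True

    def try_promote(sync_slave, async_slave, master):
        # The async slave adjudicates: it supports exactly one promotion.
        if not master.alive and sync_slave.alive and async_slave.alive:
            return f"{sync_slave.name} becomes master, {async_slave.name} becomes sync slave"
        return "no promotion"

    m, s, a = Node("master"), Node("sync"), Node("async")
    print(commit(m, s, "row 1"))      # True
    m.alive = False                   # master dies
    print(try_promote(s, a, m))       # sync promotes with async's support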

I think the actual state machine used is a bit more complicated, because they need to support removing/adding nodes, and I think they end up requiring ZooKeeper to coordinate.

Flynn [https://flynn.io/docs/databases] also has an HA solution for Postgres/MySQL based on this.

Though I can see why this kind of setup might not perform as well as one where you accept writes after committing 2 out of N to the read replicas, especially when you already need lots of read replicas.


“the slave which accepts synchronous writes from the master and may be promoted to the master”

That is failover, not high availability, since there is downtime, however short. High availability means zero downtime, like with Vertica.


Sounds like you're looking for a different word from "high". Maybe "continuous"?


I can understand that VIPs can sometimes be awkward: ARP fails, or even worse, VIPs aren't correctly released due to unknown issues.

However, I do wonder why they chose such a complicated thing instead of looking into something like BGP that would've replaced their ARP setup.

Also, they should probably have looked into K8s, since such problems don't exist inside a cluster. (There you have the problem of needing a good LB at the edge of your cluster.)


> something like BGP that would've replaced their ARP setup

Just a guess, but anecdotally it seems to me that inside a modern mid-size dev org there are a lot more engineers familiar with K/V stores like etcd/consul/ZK than IP routing protocols like OSPF and BGP.

It's not necessarily the wrong decision to forgo an existing well-tested tool in favour of a hand-rolled solution, if you don't have any experts on that existing tool in your org.


> It's not necessarily the wrong decision to forgo an existing well-tested tool in favour of a hand-rolled solution, if you don't have any experts on that existing tool in your org.

It will, however, lead to other problems, especially:

- reinvention of the wheel (aka: time waste)

- starting from scratch: technologies like OSPF, BGP, or (when we're at higher levels) my/pgSQL vs noSQL are sometimes decades old and battle tested. Which means that there are a lot more edge cases that have been discovered and fixed/documented, as well as best practices that have evolved. Same goes for security. Or even fundamental stuff like ACID criteria (which many of the noSQL friends have had to re-discover and learn)...

There's a reason yuge enterprises and government offices still run on mainframes (or, mainframe software on modern hardware): the stability and resilience of these beasts with uptimes in year-scales is something the hipster crap crowd is enviously looking at. And there's a reason that redesign projects in these areas fail so spectacularly: there are decades worth of domain knowledge and experience encoded in these systems. It's impossible to recreate this in a year or two, especially when outsourcing the work to body shops.

Only thing I wonder is: it should be, by now, common knowledge (or: state of the art) that outsourcing yuge projects to body shops, especially foreign ones, leads to disasters. So why are managers and politicians not prosecuted or otherwise held responsible when screw-ups originate? After all, when one e.g. designs a building that violates fire code and people die as a result, or one grossly ignores security best practices, the responsible persons can get charged, too...


> my/pgSQL vs noSQL are sometimes decades old and battle tested

> run on mainframes (or, mainframe software on modern hardware): the stability and resilience of these beasts with uptimes in year-scales is something the hipster crap crowd is enviously looking at

Why so much ignorance? Distributed systems are not a hipster thing; it's an actual science that helps you understand why no amount of battle testing can make this old tech resilient: it's fundamentally broken. There is just no resilience without at least proper distributed algorithms.



+1 for BGP/OSPF-based VIPs in a layer 3 network. It's 2018. No reason why you can't easily run a layer 3 network with rich routing protocols. Super easy.

You don't need application level "discovery" when the IP never changes; the network will tell you where it currently lives.

I'll take a layer 3 network over raft/consul any day and all night.
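
For anyone wondering what that looks like in practice: a common pattern is a small health-check process driven by a BGP speaker such as ExaBGP, which announces a service VIP only while the local MySQL master is healthy and withdraws it otherwise. A rough sketch (the VIP, the health check, and the timings are placeholders):

    # Sketch of an anycast-style health check: announce the service VIP via the
    # local BGP speaker while MySQL is healthy, withdraw it otherwise. ExaBGP
    # runs this script and reads the announce/withdraw lines from stdout.
    import subprocess
    import time

    VIP = "192.0.2.10/32"   # placeholder service address

    def mysql_healthy():
        # Placeholder check; in reality you'd verify read-write status, lag, etc.
        return subprocess.call(["mysqladmin", "ping"],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    announced = False
    while True:
        healthy = mysql_healthy()
        if healthy and not announced:
            print(f"announce route {VIP} next-hop self", flush=True)
            announced = True
        elif not healthy and announced:
            print(f"withdraw route {VIP} next-hop self", flush=True)
            announced = False
        time.sleep(2)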


I don't think I've ever heard BGP described as "easy", but then again it's not really my area.

Where could I go to read up on this without being a networking expert? Are there any safe ways to play with BGP in a lab environment? How expensive would said lab be to set up?


> Are there any safe ways to play with BGP in a lab environment?

Actually there are multiple ways of running BGP in a lab. There are cheap variants that are routed onto the internet with IPv6, and there are ways to just play without talking to the outside network.

> How expensive would said lab be to set up?

It heavily depends on what you want. For a highly available lab you would of course need multiple nodes/routers; it depends on what you want to do.

For just playing with BGP, using BIRD http://bird.network.cz/ is good enough.

RIPE also sometimes publishes articles about BGP. Here is one that sets up anycast routing with BGP: https://labs.ripe.net/Members/samir_jafferali/build-your-own...


Juniper and Cisco have lab versions of their products (Olive, vMX, vSRX) for free. Arista has something, too.

BIRD is super easy to set up, too.


If I'm not studying for any particular cert (I'm just looking to fill some gaps in my knowledge) then it looks like BIRD is the way to go.


bird is the word


Network people who understand BGP are cheaper and easier to find than someone who actually knows how Consul/Raft works. You might think you know how Consul works, but do you really? Consul and Raft have caused several high-profile outages -- including at places full of smart people like Amazon. If Amazon can't do it right, what chance do you have?

Especially when there are several-year-old bugs open against Consul where it just completely loses consensus and can't deal with losing a single node (0).

0: https://github.com/hashicorp/consul/issues/993#issuecomment-...


This is good to know if I ever find myself working at this scale.

I don't have skin in the game but it makes sense that a much older, tested technology would have more people who know it well.


That's exactly what we are doing at Hostinger ;-) This one: http://blog.donatas.net/blog/2017/02/26/exazk/


Why not InnoDB Cluster [1] with a single primary? There is a slight decrease in write throughput, as it is effectively similar to synchronous replication to a majority, but you get native failover and very easy cluster membership management.

[1] https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster...
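
For reference, bootstrapping one is short with MySQL Shell's AdminAPI; a sketch in its Python mode (host names and the admin account are placeholders, and each instance has to be configured and pass the prerequisite checks first):

    # Run inside mysqlsh in Python mode (\py), connected to the seed instance.
    # Host names and the admin account below are placeholders.
    dba.configure_instance('clusteradmin@db1:3306')
    cluster = dba.create_cluster('prod')           # single-primary by default
    cluster.add_instance('clusteradmin@db2:3306')  # joins via Group Replication
    cluster.add_instance('clusteradmin@db3:3306')
    print(cluster.status())                        # shows the primary and member state

MySQL Router can then be bootstrapped against the cluster so clients automatically follow the primary after a failover.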


Would really love to know why this got downvoted instead of someone replying "InnoDB Cluster is not good because...".

It's actually the most recommended way to do replication in MySQL now. They are heavily deprecating everything around traditional replication; they even removed MySQL Fabric (the proxy for clients to work well with failover) in favor of MySQL Router, which works with just InnoDB Cluster.

HN, go figure...


Small clarification: using MySQL Router does not require InnoDB Cluster, although agreed that most Router development is focused there.


That's only practical for a single datacenter. You still really need async replication across geographic boundaries, with a failover mechanism.


Thank you for the interesting take; it makes sense. Do you think that if you were deploying just in multiple availability zones (a la AWS, so extremely low network latency but good physical/logical isolation), the same concerns would still apply?

Thanks


How about another implementation of "master election", like this one: http://blog.donatas.net/blog/2017/02/26/exazk/ ?


I feel like I'm missing something. I see two layers of Raft consensus at play here: one at the orchestrator level and another at the Consul level. Would it be possible to manage consensus at only one level?


High availability requires a database cluster which provides synchronous multi-master replication. What they have is failover, not high availability.


Failover is a valid high-availability technique. One with limitations and not for every scenario, but still one.


By definition, highly available means no interruption of service. Failover does not meet this criterion, since there is an interruption.


I've never seen it defined that way, neither in industry material nor academia, so I guess "citation needed"?

These things always have limits, and what's defined as "availability" and acceptable failure rates/times is application-dependent. High availability merely describes reaching a goal that's higher than a "conventional" setup can achieve, and failover is a fairly obvious way of massively reducing time-to-recovery.


Why not TiDB?


FYI: TiDB is relevant because it is distributed and protocol-compatible with MySQL.


I think at this point it's basically a non-option, though. Ti is nowhere near mature enough, the ops investment cost is extremely nontrivial, and it still has some interesting issues to work out (e.g. [1]).

Furthermore, protocol-compatible doesn't directly imply API-compatible (I'm thinking of very very subtle differences between things like datatypes or isolation level behavior). I know a bit more about CockroachDB, which consciously does some things differently from Postgres.

[1] https://github.com/pingcap/tidb/issues/2712


It's not a drop-in, but several very large websites migrated from MySQL to TiDB, so it definitely is an option.


Source? I'm guessing you're referring to Mobike and Yiguo... any others?

https://www.pingcap.com/cases/


Those two are just case studies. There is a fuller list here: https://pingcap.com/docs/adopters/


TiDB developer here. There are more than one hundred companies using TiDB in their production environments, and the number keeps growing. The link in sanxiyn's post shows some of them. I think it is safe to say that TiDB is mature enough.


MS SQL Server may be a good choice, too.


Why not Aurora?


Besides being AWS-only, Aurora is very expensive once you factor in the IO costs for a large deployment.

That's not to say I'm not a fan of it, but it's for smaller deployments than GitHub's. You can see this for yourself in the customer list here:

https://aws.amazon.com/rds/aurora/customers/


They're not on AWS.


So in other words, they had to hack together some tools to get clustering features on a crucial piece of company infrastructure. I understand that relational databases have been around for 40 years, but very few of them come close to offering the multi-master replication or master-election protocols implemented in modern NoSQL systems.


Actually, MySQL has well-working multi-master replication with a great automatic master-election protocol and automatic client routing to the current master. It's called InnoDB Cluster, and my other comment, where I was trying to start a discussion around it, was downvoted.


Correct.



