>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.
- the internal DNS services have issues (most likely, since that would explain the snowball effect)
- it could also be a core storage issue that all their VMs rely on; if they expect it to last a long time and don't want to block third-party websites, they might prefer to answer nothing in DNS for now, so requests fail instantly at the client instead of hanging, which drains the application/database servers and lets them reboot with less load (see the dig sketch below)
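A quick sketch of what that looks like from the client's side; nothing here is specific to what Facebook actually did, it just illustrates the fail-fast behaviour described above:

    # Normally this returns facebook.com's A records almost instantly.
    # If the name resolves to nothing (or the resolver answers SERVFAIL),
    # the client fails right away instead of holding connections open while
    # waiting on application servers that will never answer.
    dig @1.1.1.1 facebook.com A +time=2 +tries=1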
I was on a video call during the incident. The service was working but with super-low bandwidth for 30 minutes, then I got disconnected and every FB property went down suddenly. Seems more suggestive of someone pulling the plug than a DNS issue, although it could also be both.
Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).
That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to put a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until something goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).
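For reference, the classic form of those equations is small enough to quote (x is prey, y is predators, and α, β, γ, δ are positive constants):

    \frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y

Setting both derivatives to zero gives the non-trivial equilibrium x* = γ/δ, y* = α/β: the prey's steady level is set entirely by the predator's parameters and vice versa, which is exactly the kind of counterintuitive cross-coupling that makes these feedback loops hard to reason about at a glance.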
It seems like an easy redundancy split, but it's a bit like driving two cars down the freeway at the same time because you got a flat tire in one of them the other day.
In order to actually be redundant you need two separate sets of serving infrastructure, and even then, if internal resolution is down, the external one is basically useless anyway. Capacity planning becomes twice as much work (because you're inside Facebook and can't pretend that all data centers everywhere are connected by an infinitely fast network). Rolling out updates across a couple thousand teams isn't trivial in the first place; now you also have to cordon them off appropriately.
I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.
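Just to make the contrast concrete, here's roughly what the naive "unbound on a spare box" setup amounts to (Debian/Ubuntu paths; the 10.0.0.0/8 range is a placeholder, not anything about Facebook's network):

    # Install a recursive resolver on a leftover server.
    apt-get install unbound

    # Minimal config: listen everywhere, allow queries from the internal range.
    cat > /etc/unbound/unbound.conf.d/internal.conf <<'EOF'
    server:
        interface: 0.0.0.0
        access-control: 10.0.0.0/8 allow
    EOF

    systemctl restart unbound

That gets you a working resolver and none of the capacity planning, update coordination, or internal/external split discussed above.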
Even the Name servers are not returning any values. That's bad.
dig @8.8.8.8 +short facebook.com NS
These are usually anycasted, meaning that the one IP you get back for a name server is in fact several servers spread across several regions. Traffic is routed to the nearest one via BGP, through agreements with ISPs. Very interesting, because it seems it took a single DNS misconfiguration to withdraw millions of dollars' worth of devices from the internet.
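If you want to see the anycast part for yourself: many anycast DNS operators answer a CHAOS-class query that tells you which physical instance you actually reached (1.1.1.1 is just a convenient example here, and whether a given operator answers these at all is up to them):

    # Run this from two different networks/regions: the same IP will typically
    # name different sites, because BGP routes each of you to the nearest
    # anycast instance. The conventional names are "id.server" and
    # "hostname.bind"; support varies by operator.
    dig @1.1.1.1 id.server TXT CH +short
    dig @1.1.1.1 hostname.bind TXT CH +short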
So far the pattern isn't the same. Slack published a DNSSEC record that got cached, then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case the records are just completely gone, as if "facebook.com", "instagram.com", et al. just didn't exist.
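The difference shows up directly in dig, at least in principle (answers depend on timing and on which resolver you ask, so treat this as a sketch):

    # The Slack pattern: a validation failure. A validating resolver returns
    # SERVFAIL, but the same query with DNSSEC checking disabled (+cd) still
    # gets an answer, because the data exists -- it just doesn't validate.
    dig @8.8.8.8 slack.com A
    dig @8.8.8.8 slack.com A +cd

    # The Facebook pattern: the authoritative servers themselves are unreachable,
    # so the query fails with or without +cd -- there's simply nothing to hand back.
    dig @8.8.8.8 facebook.com A
    dig @8.8.8.8 facebook.com A +cd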