>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.
- the internal DNS services have issues (most likely, since that would explain the snowball effect)
- it could also be a core storage issue that all their VMs rely on; if they expect it to last a long time and don't want to block third-party websites, they might prefer to answer nothing in DNS for now, so requests fail instantly at the client instead of hanging, which drains the application/database servers and lets them reboot with less load (see the dig sketch below)
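A quick sketch of what that looks like from the client's side; nothing here is specific to what Facebook actually did, it just illustrates the fail-fast behaviour described above:

    # Normally this returns facebook.com's A records almost instantly.
    # If the name resolves to nothing (or the resolver answers SERVFAIL),
    # the client fails right away instead of holding connections open while
    # waiting on application servers that will never answer.
    dig @1.1.1.1 facebook.com A +time=2 +tries=1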
I was on a video call during the incident. The service was working but with super-low bandwidth for 30 minutes, then I got disconnected and every FB property went down suddenly. Seems more suggestive of someone pulling the plug than a DNS issue, although it could also be both.
Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).
That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to put a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until something goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).
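For reference, the classic form of those equations is small enough to quote (x is prey, y is predators, and α, β, γ, δ are positive constants):

    \frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y

Setting both derivatives to zero gives the non-trivial equilibrium x* = γ/δ, y* = α/β: the prey's steady level is set entirely by the predator's parameters and vice versa, which is exactly the kind of counterintuitive cross-coupling that makes these feedback loops hard to reason about at a glance.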
It seems like an easy redundancy split, but it's a bit like driving two cars down the freeway at the same time because you got a flat tire in one of them the other day.
In order to actually be redundant you need two separate sets of serving infrastructure, and even then, if internal resolution is down, the external one is basically useless anyway. Capacity planning becomes twice as much work (because you're inside Facebook and can't pretend that all data centers everywhere are connected by an infinitely fast network). Rolling out updates across a couple thousand teams isn't trivial in the first place; now you also have to cordon them off appropriately.
I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.
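Just to make the contrast concrete, here's roughly what the naive "unbound on a spare box" setup amounts to (Debian/Ubuntu paths; the 10.0.0.0/8 range is a placeholder, not anything about Facebook's network):

    # Install a recursive resolver on a leftover server.
    apt-get install unbound

    # Minimal config: listen everywhere, allow queries from the internal range.
    cat > /etc/unbound/unbound.conf.d/internal.conf <<'EOF'
    server:
        interface: 0.0.0.0
        access-control: 10.0.0.0/8 allow
    EOF

    systemctl restart unbound

That gets you a working resolver and none of the capacity planning, update coordination, or internal/external split discussed above.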
Even the Name servers are not returning any values. That's bad.
dig @8.8.8.8 +short facebook.com NS
These are usually anycasted, meaning that the one IP you get back for a name server is in fact several servers spread across several regions. Traffic is routed to the nearest one via BGP, through agreements with ISPs. Very interesting, because it seems it took a single DNS misconfiguration to withdraw millions of dollars' worth of devices from the internet.
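If you want to see the anycast part for yourself: many anycast DNS operators answer a CHAOS-class query that tells you which physical instance you actually reached (1.1.1.1 is just a convenient example here, and whether a given operator answers these at all is up to them):

    # Run this from two different networks/regions: the same IP will typically
    # name different sites, because BGP routes each of you to the nearest
    # anycast instance. The conventional names are "id.server" and
    # "hostname.bind"; support varies by operator.
    dig @1.1.1.1 id.server TXT CH +short
    dig @1.1.1.1 hostname.bind TXT CH +short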
So far the pattern isn't the same. Slack published a DNSSEC record that got cached, then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case the records are just completely gone, as if "facebook.com", "instagram.com", et al. just didn't exist.
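The difference shows up directly in dig, at least in principle (answers depend on timing and on which resolver you ask, so treat this as a sketch):

    # The Slack pattern: a validation failure. A validating resolver returns
    # SERVFAIL, but the same query with DNSSEC checking disabled (+cd) still
    # gets an answer, because the data exists -- it just doesn't validate.
    dig @8.8.8.8 slack.com A
    dig @8.8.8.8 slack.com A +cd

    # The Facebook pattern: the authoritative servers themselves are unreachable,
    # so the query fails with or without +cd -- there's simply nothing to hand back.
    dig @8.8.8.8 facebook.com A
    dig @8.8.8.8 facebook.com A +cd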