
For Facebook and WhatsApp it looks like a DNS issue, name resolution fails with SERVFAIL:

    $ dig facebook.com

    ; <<>> DiG 9.16.21 <<>> facebook.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23982
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;facebook.com.   IN A

    ;; Query time: 16 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)
    ;; WHEN: Mon Oct 04 17:53:00 CEST 2021
    ;; MSG SIZE  rcvd: 41
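For anyone who wants to see where dig gets that "status: SERVFAIL" from, here is a minimal sketch (not dig's actual implementation) that builds a bare A query by hand and reads the RCODE from the low 4 bits of the response header's flags word. The function names and the query id are made up for illustration, and `query_rcode` assumes the resolver is reachable over UDP port 53.

```python
import socket
import struct

# RCODE values from RFC 1035 (low 4 bits of the header flags word).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode(response: bytes) -> str:
    """Return the status name encoded in a DNS response header."""
    if len(response) < 12:
        raise ValueError("DNS header is 12 bytes; response too short")
    _id, flags = struct.unpack("!HH", response[:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def query_rcode(resolver: str, name: str, timeout: float = 3.0) -> str:
    """Send the query over UDP port 53 and return the status name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver, 53))
        response, _addr = sock.recvfrom(512)
    return parse_rcode(response)
```

Calling `query_rcode("8.8.8.8", "facebook.com")` and then the same name against another resolver is a quick way to tell a problem with one resolver apart from a problem with the domain itself.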


John Graham-Cumming:

>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.

https://twitter.com/jgrahamc/status/1445065270272434176 (thread)

Update:

>About five minutes before Facebook's DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook's ASN.

https://twitter.com/jgrahamc/status/1445068309288951820


Maybe they tried everything else before that.

At first it was working but they couldn't serve responses: https://i.imgur.com/UaCtOiX.png

Notice the "2020"


The servers struggle to return even a basic 5xx response.

Two possibilities:

- the DNS services internally have issues (most likely, as this could explain the snowball effect)

- it could also be a core storage issue that all their VMs rely on. If they expect it to last a long time and don't want to block third-party websites, they may prefer to answer nothing in DNS for now, so that requests fail instantly on the client and the application/database servers drain and can reboot under less load


I was on a video call during the incident. The service was working but with super-low bandwidth for 30 minutes, then I got disconnected and every FB property went down suddenly. Seems more suggestive of someone pulling the plug than a DNS issue, although it could also be both.


It isn't just DNS. If you happen to have cached entries, the site is returning errors as well.


Presumably the DNS being down also wreaks havoc in their internal infrastructure as services can no longer resolve each other's names.


I wonder if Facebook has circular 'boot' dependencies on their microservices or something? I.e. they can't restart stuff now when everything is down.


For sure. Reminds me of the difficulties of starting a power grid from total blackout, bringing generators and power stations to sync...


Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
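To make the circular-dependency point concrete, here is a rough sketch of how you could check a service dependency map for a boot cycle with a depth-first search. The service names are invented for illustration and say nothing about Facebook's actual architecture.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, done
    state: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        stack.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path: that is a cycle.
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical services: each one needs the next to be up before it can start.
services = {
    "dns": ["config-store"],
    "config-store": ["auth"],
    "auth": ["dns"],
    "web": ["dns"],
}
print(find_cycle(services))
```

If this returns a cycle, none of the services on it can be cold-started in dependency order without a bootstrap path that bypasses at least one edge.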


This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).

The most successful systems rely on the property of feedback (https://en.wikipedia.org/wiki/Feedback): evolution, untrained learning, genetic algorithms, the diagonal arguments (https://en.wikipedia.org/wiki/Diagonal_argument), artificial general intelligence (https://en.wikipedia.org/wiki/Technological_singularity), financial markets according to no less than George Soros (https://en.wikipedia.org/wiki/Reflexivity_(social_theory)#In...), etc.

That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to do a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until it goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).


Internal services using public dns records?


Probably not, but their external and internal DNS may share infrastructure that's at the root of the failure


Yikes, seems like an easy redundancy split.


It seems like an easy redundancy split, but imagine driving two cars down the freeway at the same time, because you got a flat tire in one, the other day.

In order to actually be redundant you need to have two sets of infrastructure to serve, and then if the internal one goes down, the external one's basically useless when the internal resolution's down anyway. Capacity planning (because you're inside Facebook and can't pretend that all data centers everywhere are connected via an infinitely fast network) becomes twice as much work. How you do updates for a couple thousand teams isn't trivial in the first place, now you have to cordon them off appropriately?

I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.


Yes, all of that (imo) is an argument in favor.

I never said it was free, but it's worth it as long as it's cheaper than failure.

I don't keep backups because I enjoy having multiple copies of my data. I do it because losing that data would be devastating.


agreed, they fell off the internet according to routeviews


I'm seeing similar DNS errors for many non-Facebook sites.


My ISP's DNS server went down a few minutes after the Facebook outage, presumably because all the residential customers' devices keep querying.


Seeing the same thing with 8.8.8.8 name servers. Everything I query returns an error


Do you have some examples?


I am getting DNS fails for wikipedia


Wikipedia wfm.


wfm


normashooting.com - but only when, like the parent poster, using Google's DNS servers. Just switched to Cloudflare and it works.

Using Google DNS:

    $ nslookup
    > normashooting.com
    Server:   8.8.8.8
    Address:  8.8.8.8#53

    * server can't find normashooting.com: SERVFAIL

Using Cloudflare DNS servers:

    > normashooting.com
    Server:   1.1.1.1
    Address:  1.1.1.1#53

    Non-authoritative answer:
    Name:     normashooting.com
    Address:  104.22.56.165
    Name:     normashooting.com
    Address:  104.22.57.165
    Name:     normashooting.com
    Address:  172.67.43.70


Can't log in to the AWS console either.


aws.amazon.com is down as well



Even the Name servers are not returning any values. That's bad.

    $ dig @8.8.8.8 +short facebook.com NS

These are usually anycasted, meaning that a single IP behind an NS record is in fact several servers spread across several regions. Traffic is routed to the closest one through BGP announcements agreed with ISPs. Very interesting, because it seems that it took one DNS entry misconfiguration to withdraw millions of dollars' worth of devices from the internet.
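As a toy illustration of why withdrawals take anycast name servers off the map, here is a sketch of a routing table where several sites announce the same prefix. This is a simplification, not real BGP (no path selection, no peering), and the prefix is from the documentation range, not one of Facebook's.

```python
import ipaddress
from typing import Dict, Optional, Set


class RouteTable:
    """Toy routing table: prefix -> set of sites currently announcing it."""

    def __init__(self) -> None:
        self.routes: Dict[ipaddress.IPv4Network, Set[str]] = {}

    def announce(self, prefix: str, site: str) -> None:
        self.routes.setdefault(ipaddress.ip_network(prefix), set()).add(site)

    def withdraw(self, prefix: str, site: str) -> None:
        net = ipaddress.ip_network(prefix)
        sites = self.routes.get(net, set())
        sites.discard(site)
        if not sites:
            # Last announcement gone: the prefix disappears entirely.
            self.routes.pop(net, None)

    def lookup(self, address: str) -> Optional[Set[str]]:
        """Longest-prefix match; None means there is no route at all."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RouteTable()
# Two hypothetical sites announce the same anycast prefix.
table.announce("192.0.2.0/24", "site-a")
table.announce("192.0.2.0/24", "site-b")
table.withdraw("192.0.2.0/24", "site-a")   # one site down: still reachable
table.withdraw("192.0.2.0/24", "site-b")   # all withdrawn: no route left
print(table.lookup("192.0.2.53"))
```

Losing one site only shrinks the set of places the address can be reached. Once every site withdraws, the lookup finds no route, which from the outside is indistinguishable from the servers not existing.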



It's always DNS


>It's always DNS

How is this not the top comment? Underrated


Even Google's 8.8.8.8 DNS server says can't find, SERVFAIL.


Is this related in any way to what happened to Slack recently in their DNS?



So far the pattern isn't the same. Slack published a DNSSEC record that got cached and then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case, the records are just completely gone. As if "facebook.com", "instagram.com", et al just didn't exist.


Thank god we have DoH.


It's DNS over HTTPS. It relies on the same system as plain DNS, so DoH won't really help in this case...
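A small sketch of that point, assuming Google's public JSON interface at `https://dns.google/resolve` and its `Status` field (which carries the same RCODE numbers as plain DNS). The helper names are made up. DoH only changes the transport between you and the resolver, so the resolver still has to reach the same authoritative servers.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: Google Public DNS JSON API.
DOH_ENDPOINT = "https://dns.google/resolve"
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def doh_url(name: str, rtype: str = "A") -> str:
    """Build the query URL for the JSON-over-HTTPS resolver."""
    query = urllib.parse.urlencode({"name": name, "type": rtype})
    return f"{DOH_ENDPOINT}?{query}"


def status_of(payload: dict) -> str:
    """Map the numeric Status field of a JSON answer to an RCODE name."""
    code = payload.get("Status")
    return RCODES.get(code, f"RCODE{code}")


def resolve_over_https(name: str, rtype: str = "A", timeout: float = 5.0) -> str:
    """Ask the resolver over HTTPS and return the status name."""
    with urllib.request.urlopen(doh_url(name, rtype), timeout=timeout) as resp:
        return status_of(json.load(resp))
```

If the authoritative servers are unreachable, the resolver behind the HTTPS endpoint has nothing to answer with either, so you would expect the same failure status as over port 53.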


Same here on facebook.com, [api]whatsapp.com (instagram.com works)



