Sorry if this sounds dickish, but renting 3 servers @ $75 apiece from 3 different dedicated server companies in the USA, putting TinyDNS on them, and using them as backup servers, would have solved your problems hours ago.
Even a single quad-core server with 4GB RAM running TinyDNS could serve 10K queries per second, based on extrapolation and assumed improvements since this 2001 test, which showed nearly 4K/second performance on 700 MHz PIII CPUs: https://lists.isc.org/pipermail/bind-users/2001-June/029457....
EDIT to add: and temporarily lengthening TTLs would mean those 10K queries/second would quickly lessen the outage, since each answer could stay cached for up to 12 hours; and large ISPs like Comcast cache answers for all their customers, so a single successful query delivered to Comcast would have (some amount of) multiplier effect.
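The multiplier effect above can be sketched with a back-of-the-envelope calculation. The resolver count is a made-up assumption, not a measurement from the outage; the point is only how the steady-state load on the authoritative servers scales with TTL:

```python
# Rough model: each caching resolver re-asks the authoritative server
# about once per TTL window, so steady-state load ~ resolvers / TTL.

def steady_state_qps(distinct_resolvers: int, ttl_seconds: int) -> float:
    """Approximate authoritative query rate once caches are warm."""
    return distinct_resolvers / ttl_seconds

# Hypothetical figure: 1M distinct caching resolvers hitting the zone.
RESOLVERS = 1_000_000

print(steady_state_qps(RESOLVERS, 300))        # 5-minute TTL: ~3333 qps
print(steady_state_qps(RESOLVERS, 12 * 3600))  # 12-hour TTL:  ~23 qps
```

Under those assumed numbers, stretching the TTL from 5 minutes to 12 hours cuts the authoritative load by more than two orders of magnitude, which is why even a handful of rented boxes could plausibly keep up.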
You're asserting that your (or their) homegrown DNS service will have better reliability than Dyn and Route53 combined. That assertion gets even worse when it's a backup because people never, ever test backups. And "ready to go" means an extremely low TTL on NS records if you need to change them (which, for a hidden backup, you will), and many resolvers ignore that when it suits them, so have fun getting back to 100% of traffic.
Spoiler: I'd bet my complete net worth against your assertion and give you incredible odds.
Golden rule: Fixing a DNS outage with actions that require DNS propagation = game over. You might as well hop in the car and start driving your content to people's homes.
I don't know how big PagerDuty is; IIRC over 200 employees, so, a decent size.
I was giving a bare-minimum example of how this (or some other backup solution) should already have been set up and ready to switch over.
DNS is bog-simple to serve and secure (provided you don't try to do the fancier stuff and just serve DNS records): it is basically like serving static HTML in terms of difficulty.
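For a sense of how bog-simple: a TinyDNS zone is just a flat data file you compile with tinydns-data into a constant database. A minimal sketch (example.com and the IPs are placeholders, not real records):

```
# NS + SOA for example.com, served by a.ns.example.com at 203.0.113.1
.example.com:203.0.113.1:a:259200
# A record plus matching PTR
=example.com:203.0.113.10:3600
# plain A record
+www.example.com:203.0.113.10:3600
```

Run tinydns-data over that file and it produces data.cdb, which tinydns serves directly; no query-time parsing, no dynamic logic, nothing fancier to secure than the static records themselves.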
Having a backup of all important hostnames/IP addresses locally available, ready to deploy to some other service or even to build by hand on some quickly rented servers, is I think quite a reasonable thing for a company to do. It would also be simple to run on GCE or Azure, if you don't like the idea of dedicated servers.
Not necessarily. Granted, this is how I would configure a system (two providers), but it is just as sensible to use one major provider that falls back to company-run servers in the event of an attack like this. It comes down to sysadmin preference: while it is smart to delegate low-level tasks to managed providers, it is also smart to have a backup solution under your full control, in case that control needs to be taken at some point.
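One simple way to get that fallback behavior is just to list both sets of servers in the delegation at the parent zone; resolvers that time out on the dead provider's nameservers will retry the others. A sketch, with placeholder hostnames standing in for the real provider and company boxes:

```
example.com.  86400  IN  NS  ns1.dnsprovider.example.
example.com.  86400  IN  NS  ns2.dnsprovider.example.
example.com.  86400  IN  NS  backup-ns.example.com.
```

The trade-off is that the self-hosted server takes a share of normal traffic too and must stay in sync with the provider's zone, which is exactly why it has to be tested continuously rather than treated as a cold spare.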
That would be a quick fix, similar to adding another NS provider. Of course, if Dyn is out completely they might not have their master zone anywhere else; then it's like any service rebuilding without a backup.