http://ee.lbl.gov/papers/sync_94.pdf I posted this yesterday, with the conjectur...

tptacek · on April 25, 2011

This is a great paper. If you haven't read it, it describes a common scenario where endemic network delays tend to nudge all participants in a periodic broadcast protocol into sending their broadcasts at the same time. Some hours after you start the participants, they have all synchronized, and on every timer tick they saturate the network with simultaneous updates.

The solution (I didn't reread so this is from memory) is to add random jitter to each participant's timer.
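A minimal sketch of what jittering a timer could look like, assuming each participant picks a fresh random delay every time it re-arms its broadcast timer. The function name `next_interval`, the `jitter_fraction` parameter, and its 0.5 default are hypothetical, chosen for illustration only; they are not taken from the paper or from any real routing implementation.

```python
import random


def next_interval(period, jitter_fraction=0.5, rng=random):
    """Return the delay (in seconds) until the next broadcast.

    The delay is drawn uniformly from
    [period * (1 - jitter_fraction), period * (1 + jitter_fraction)].
    jitter_fraction=0.5 is an illustrative assumption, not a recommendation.
    """
    if not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("jitter_fraction must be between 0 and 1")
    low = period * (1.0 - jitter_fraction)
    high = period * (1.0 + jitter_fraction)
    return rng.uniform(low, high)


if __name__ == "__main__":
    # Each participant re-arms its timer with a fresh random delay,
    # so no two participants stay locked to the same schedule.
    for _ in range(3):
        print(round(next_interval(30.0), 2))
```

The point of drawing a new value on every cycle, rather than a one-time random offset, is that a fixed offset can still be pulled back into lockstep by the same delays that caused the synchronization in the first place.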

However, is there evidence to suggest that's what happened to Amazon? I can see this being a big issue in '93, when high-latency, low-bandwidth links were commonplace. But do we really think Amazon wasn't engineered well enough to deal with spikes of multiple orders of magnitude in C&C traffic?

Thank you, though, for posting a (much needed) technical comment to this discussion.

pumpmylemma · on April 25, 2011

I don't think it was a symptom of routing synchronization specifically, but I'd be curious to know if it was a case of unexpected and undesired synchronization. (E.g., an independent and random cluster of blocks suddenly updated; the network was saturated; it pulled in more updates; ...)

And yes, the paper talked about randomization. It also pointed out the magnitude of randomization required was larger than expected.

pandakar · on April 25, 2011

Has there been an official explanation?

pumpmylemma · on April 25, 2011

As far as I'm aware, no. That's why RightAWS said they get an F for communication.