Skype has finally released some details on its massive network outage. From what they’ve said, it appears to have been a self-propagating restart failure. We’ve seen these before.
The first part of the trap is a massive number of near-simultaneous client restarts. This is relatively easy to design for if you plan for it; I’ve been in more than one meeting where someone has asked something like "what happens if we power-cycle Chicago?" Not having sufficient capacity to handle very rare events isn’t necessarily wrong. The events are very rare; in many circumstances, it’s perfectly acceptable to shed load by denying service to some clients during the recovery phase.
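The load-shedding idea is simple to sketch. Here is a toy illustration (not Skype's actual design; the capacity number and class are invented for the example): a server that refuses connections beyond its capacity during recovery, so a restart storm produces denied clients rather than a crashed server.

```python
CAPACITY = 100  # hypothetical per-server connection limit

class Server:
    """Toy server that sheds load instead of crashing under a restart storm."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.active = 0

    def connect(self):
        # Deny service rather than exceed capacity during recovery;
        # the rejected client can retry later, after the storm passes.
        if self.active >= self.capacity:
            return False
        self.active += 1
        return True

server = Server(CAPACITY)
results = [server.connect() for _ in range(150)]
accepted = sum(results)
print(accepted)  # 100 accepted; the other 50 are shed, not crashed into
```

The point of the sketch is the failure mode: the overloaded server stays up, which is what prevents the cascade described below.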
What appears to have happened here, as best I can tell from the Skype statement, is more subtle. Suppose that the excess load causes a server to crash. All of the clients who were using that server will notice the problem and attempt to reconnect to a different server. That puts more load on the surviving servers, causing another one to crash — and its clients, plus the original ones, then pile onto the remaining servers. The failure propagates.
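This cascade dynamic can be captured in a few lines. The following is a back-of-the-envelope simulation with made-up numbers (server counts, loads, and the crash threshold are all assumptions, not anything from the Skype statement): each crash spreads that server's clients over the survivors, and whether the cascade stops depends on how close the pool was running to capacity.

```python
def simulate(servers, clients_per_server, capacity):
    """Toy model of a cascading reconnect failure.

    One server crashes as the trigger; its clients reconnect elsewhere.
    Any server pushed over `capacity` crashes in turn.
    Returns the number of servers still up when the cascade stops.
    """
    total = servers * clients_per_server
    up = servers - 1  # the triggering crash
    while up > 0:
        load = total / up  # clients spread evenly over the survivors
        if load <= capacity:
            break          # survivors absorb the load; cascade stops
        up -= 1            # overloaded, another server crashes
    return up

# 20 servers, crash threshold 100 clients each.
print(simulate(20, 90, 100))  # modest headroom: cascade stops at once
print(simulate(20, 96, 100))  # almost no headroom: total meltdown
```

With 90 clients per server there is enough slack to absorb one failure; at 96, a single crash tips every survivor over the threshold and the whole pool goes down. The non-linearity is the trap: a few percent more load turns a one-server incident into a total outage.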
As I’ve noted, this sort of thing has happened before. Perhaps the best-known incident was the Martin Luther King Day meltdown of the AT&T long distance network. In that case, the problem was that if a phone switch crashed and restarted, the recovery message could crash its neighbor. That one would restart, generating messages that crashed its neighbors, including of course the one that crashed it.
The hardest problem, though, is that it’s so difficult to test a load-sensitive failure. How many client machines do you have in your test lab? Do you really know what resource your servers are going to run out of first, especially if there’s non-linear behavior?