The Skype Outage

20 August 2007

Skype has finally released some details on its massive network outage. From what they've said, it appears to have been a self-propagating restart failure. We've seen these before.

The first part of the trap is a massive number of near-simultaneous client restarts. This is relatively easy to design for if you plan for it; I've been in more than one meeting where someone has asked something like "what happens if we power-cycle Chicago?" Not having sufficient capacity to handle very rare events isn't necessarily wrong. The events are very rare; in many circumstances, it's perfectly acceptable to shed load by denying service to some clients during the recovery phase.
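To make the load-shedding idea concrete, here is a minimal sketch in Python. It is not how Skype or any real system does it; the `Server` class, its `capacity` field, and the `reconnect_storm` helper are all hypothetical names invented for illustration. The point is only that a server which refuses clients beyond a fixed limit stays up, at the cost of turning some of them away.

```python
from dataclasses import dataclass


@dataclass
class Server:
    capacity: int      # hypothetical limit on concurrent sessions
    active: int = 0    # sessions currently admitted

    def try_connect(self) -> bool:
        """Admit a client if there is room; otherwise shed the request."""
        if self.active >= self.capacity:
            return False   # deny service rather than overload and crash
        self.active += 1
        return True


def reconnect_storm(server: Server, clients: int) -> tuple[int, int]:
    """Return (admitted, rejected) for a burst of simultaneous reconnects."""
    admitted = sum(1 for _ in range(clients) if server.try_connect())
    return admitted, clients - admitted


if __name__ == "__main__":
    server = Server(capacity=100)
    print(reconnect_storm(server, 250))
```

In this toy model the rejected clients would simply retry later, ideally after a randomized delay so that they don't all come back at the same instant.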

What appears to have happened here, as best I can tell from the Skype statement, is more subtle. Suppose that the excess load causes a server to crash. All of the clients who were using that server will notice the problem and attempt to reconnect to a different server. That puts more load on the second server, causing it to crash in turn; its clients then move on to a third, and the failure propagates.
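The dynamics are easy to see in a toy simulation. The sketch below assumes a deliberately crude model that is mine, not Skype's: every server has the same hard `capacity`, a crashed server's clients are spread evenly over the survivors, and any server pushed past its capacity crashes next. The function name `simulate_cascade` and all of its parameters are made up for the example.

```python
def simulate_cascade(num_servers, capacity, clients_per_server, first_failure=0):
    """Return the order in which servers crash in a toy cascade model."""
    load = {s: clients_per_server for s in range(num_servers)}
    crashed = []
    pending = [first_failure]          # servers that are about to crash
    while pending:
        victim = pending.pop(0)
        if victim not in load:
            continue                   # already crashed
        orphans = load.pop(victim)     # clients that must reconnect elsewhere
        crashed.append(victim)
        if not load:
            break                      # nobody left to take the orphans
        # Spread the orphaned clients as evenly as possible over survivors.
        share, extra = divmod(orphans, len(load))
        for i, s in enumerate(sorted(load)):
            load[s] += share + (1 if i < extra else 0)
        # Any survivor now over capacity is the next to fall.
        for s in sorted(load):
            if load[s] > capacity and s not in pending:
                pending.append(s)
    return crashed


if __name__ == "__main__":
    # Little headroom: one failure takes down everything.
    print(simulate_cascade(5, capacity=100, clients_per_server=90))
    # More headroom: the survivors absorb the load.
    print(simulate_cascade(5, capacity=100, clients_per_server=70))
```

With five servers at 90% utilization, losing one server pushes the other four over the limit and the whole cluster falls over; at 70% utilization the same failure is absorbed. Real systems are messier, but the sensitivity to headroom is the important part.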

As I've noted, this sort of thing has happened before. Perhaps the best-known incident was the 1990 Martin Luther King Day meltdown of the AT&T long-distance network. In that case, the problem was that if a phone switch crashed and restarted, the recovery message could crash its neighbor. That one would restart, generating messages that crashed its neighbors, including of course the one that crashed it.

The hardest problem, though, is that it's so difficult to test a load-sensitive failure. How many client machines do you have in your test lab? Do you really know what resource your servers are going to run out of first, especially if there's non-linear behavior?