The Amtrak Ticket System Outage

26 August 2007

While I was getting ready to leave on a trip this morning, my wife heard a brief radio report: Amtrak was having problems with its ticketing system; as a result, there were long lines at train stations. We were unable to get any more information, so we left the house 30 minutes early to get to the station.

As it turned out, by that point the ticketing system had been down for almost 24 hours. From what the ticket clerk told me, it had failed early Saturday morning, come back a few hours later, then failed hard around 1:30 PM EDT. He had no information on what the problem was or when the system would be back.

The lines came about because at many stations, agents were hand-writing tickets. I was spared that: I was told to board the train without a ticket, and simply give the conductor my reservation number. Passengers who didn’t know theirs were told to call up to get it; given how crowded the phone lines were, it’s not clear to me that that would have worked very well. In any event, I had mine. (Amtrak reservation numbers are six hexadecimal digits…) The conductor went through the train asking people in my situation to write down their name and reservation number; presumably, they’ll follow up with me somehow.

The lack of communication by Amtrak was quite frustrating. There were no notices on amtrak.com; all you knew was that some functions weren’t working that well. The same was true of the automated phone system. There was virtually no coverage by the mainstream media, even though Amtrak ridership is up significantly. Rumors spread. One fellow rider told me she heard the problem was caused by a lightning strike.

Ultimately, it was determined to be a software issue: a system upgrade didn’t work properly. Apparently, diagnosing the problem took close to 12 hours; repairing it — that is, deleting the "upgrade" and reinstalling the old software — took another 12-15 hours.

It’s tempting to blame Amtrak for the entire fiasco. Certainly, they should have communicated better with their passengers. I think their failures in that are inexcusable. But it’s a fact of life that software upgrades often break things. Perhaps Amtrak didn’t test the new code adequately; it will take a detailed investigation to find out. That said, even the best testing is often not good enough. More worrisome is how long it took to revert to the old system. That may have been a case of poor planning by Amtrak; however, in practice it turns out to be a surprisingly difficult thing to do on complex systems. Just as they’re not the only ones to have been victimized by bad upgrades, they’re not the only ones who had trouble backing them out.

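There are, then, three lessons: communicate with your customers when something breaks, test upgrades before deploying them, and plan in advance how to back an upgrade out.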

The first is doable today. The second remains difficult in practice, despite a lot of attention from researchers and developers. The third hasn’t even received much attention, and it should.
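
To make the point about backing out upgrades a little more concrete, here is a minimal sketch of one common pattern: keep every release in its own directory, switch between them with an atomic symlink swap, and switch back if a health check fails. Nothing here reflects how Amtrak's system is actually built; the directory layout (`releases/`, `current`), the function names, and the `health_check` callback are all hypothetical, and the swap relies on POSIX rename semantics.

```python
import os
import tempfile


def activate(root, version):
    """Atomically point root/current at root/releases/<version> (hypothetical layout)."""
    target = os.path.join("releases", version)
    if not os.path.isdir(os.path.join(root, target)):
        raise FileNotFoundError(f"no such release: {version}")
    tmp_link = os.path.join(root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted switch
    os.symlink(target, tmp_link)  # relative link, so root itself can be moved
    # Renaming over the old link is atomic on POSIX: readers see the old
    # release or the new one, never a missing "current".
    os.replace(tmp_link, os.path.join(root, "current"))


def active_version(root):
    """Return the name of the active release, or None if none is active."""
    link = os.path.join(root, "current")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def upgrade(root, new_version, health_check):
    """Switch to new_version; switch back to the previous release if the check fails."""
    previous = active_version(root)
    activate(root, new_version)
    try:
        healthy = bool(health_check(os.path.join(root, "current")))
    except Exception:
        healthy = False  # a crashing check counts as a failed upgrade
    if healthy:
        return True
    if previous is not None:
        activate(root, previous)  # the old release was never deleted
    return False


def demo():
    """Simulate a bad upgrade from v1 to v2 in a scratch directory."""
    root = tempfile.mkdtemp()
    for version in ("v1", "v2"):
        os.makedirs(os.path.join(root, "releases", version))
    activate(root, "v1")
    ok = upgrade(root, "v2", lambda path: False)  # the check always fails
    return ok, active_version(root)
```

Calling `demo()` returns `(False, "v1")`: the upgrade is reported as failed and the old release is active again. The sketch only reverts code. If an upgrade has also changed stored data or a database schema, swapping a link back does not undo that, which is part of why backing out is so hard on complex systems.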

I look forward to seeing the investigation report on this incident.

https://www.cs.columbia.edu/~smb/blog/2007-08/2007-08-26.html