An Outage from Managing P2P Traffic?

6 April 2008

I have cable modem service from Comcast. Yesterday, that service was down for eight hours. In fact, many of their Internet customers in New Jersey, Pennsylvania, and Delaware were off the air for that long — and I believe that the problem was due to Comcast's attempts to manage peer-to-peer traffic.

Network outages at 6:00 on a Saturday morning are neither unknown nor unreasonable. In fact, that's probably a good time for routine maintenance and upgrades. This one, however, went on far too long. (I know the article says the problem started at 7:00. I noticed the problem at 6:00 and my log files are quite unambiguous; my connectivity problems started an hour earlier than the story indicates.)

The symptoms were odd. I could reach a few sites, but not many. I could, however, ping web sites I couldn't connect to. I ran traceroute; it showed a normal-looking path to failing sites. However, both ping and Windows traceroute use ICMP; web connections use TCP. Was ICMP working, when TCP was not?
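One way to test TCP separately from ICMP is to attempt a plain TCP handshake to the site and see whether it completes. The following is a minimal sketch of that idea, not the tool used for the tests described here; the function name `tcp_reachable`, the default port, and the timeout are illustrative assumptions.

```python
import socket


def tcp_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP handshake to (host, port) completes in time.

    Hypothetical helper for illustration: it says nothing about ICMP or
    UDP, only whether a TCP connection can be established.
    """
    try:
        # create_connection resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refusals, unreachable networks, and DNS failures.
        return False
```

If a host answers ping but `tcp_reachable(host, 80)` keeps returning False for a site known to run a web server, that suggests TCP and ICMP are being handled differently somewhere along the path, though it does not show where.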

I used a traceroute variant that I wrote, which can use UDP or TCP. The results were decidedly odd. UDP traceroutes would go 8 or 9 hops within Comcast's network, then stop. TCP traceroutes went nowhere. But ICMP traceroute went the full distance. Other tests I did, by using twisty paths to log in to a server I control, showed the same thing: TCP packets were not making it from my house to the server; ICMP packets were.

Treating ICMP, UDP, and TCP differently is not a normal mode of operation for an ISP backbone. In fact, I have no idea why an ISP would do it, unless they were trying to treat some traffic differently. And we know that Comcast is trying to manage peer-to-peer traffic. Is that what happened?

Boxes and distributed systems can fail. The more boxes you rely on, the greater the risk of an outage, and perhaps of a widespread one. It seems likely that this is exactly what happened yesterday: because they are trying to restrict peer-to-peer traffic, many people were off the air for many hours. A major underpinning of the Internet's design was a desire to get away from "must be there" elements. We seem to have taken a step backwards.
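The arithmetic behind "more boxes, more risk" can be sketched briefly. Assuming, purely for illustration, that every box on a path must be working and that boxes fail independently with the same availability, the availability of the whole path is the product of the individual availabilities. The function name and the numbers below are hypothetical, not measurements of any real network.

```python
def series_availability(per_box, boxes):
    """Availability of a path that needs every one of `boxes` devices.

    Assumes independent failures and identical per-box availability,
    which is a simplification made for illustration only.
    """
    if not 0.0 <= per_box <= 1.0:
        raise ValueError("per_box must be a probability between 0 and 1")
    if boxes < 0:
        raise ValueError("boxes must be non-negative")
    return per_box ** boxes


if __name__ == "__main__":
    # Hypothetical figure: each box is up 99.9% of the time.
    for n in (1, 5, 20):
        print(n, series_availability(0.999, n))
```

Under these assumptions, each additional "must be there" element can only lower the availability of the path as a whole.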

Update: In a mailing list message, Declan Forde, a Comcast executive, stated that
We had a routing issue in the PA and NJ areas yesterday that impacted some customers' ability to reach certain sites. This had nothing to do with P2P traffic management.
I've asked him why TCP, UDP, and ICMP appear to behave differently; that, after all, is why I speculated on the connection to P2P. When I get an answer, I'll post it here.