From: wirish@parc.xerox.com (Wes Irish)
Newsgroups: comp.dcom.lans.ethernet
Subject: Performance problems on high utilization Ethernets
Summary: High utilization Ethernet performance problems traced to controller implementation bugs
Keywords: Ethernet, communications, interframe gap, IFG, collisions, controller, interface, packet loss, data link

For the past year or so I have been investigating performance problems on the Ethernets here at PARC. This work has uncovered problems with a number of Ethernet controllers in common use today. These low-level controller problems can lead to serious performance problems for many of the systems involved.

A full paper on this work, "Investigations into observed performance problems on high utilization Ethernet networks", will be released soon (initially as a PARC Blue & White report). But, since I have been giving talks on this work and news of it has begun to hit the Internet, I feel that I should post a preliminary report in order to reduce speculation and to make sure that the facts are correctly stated. Below is a short summary of some of the key facts and issues.

The Ethernet specifications talk about making sure that transmitters enforce a 9.6 microsecond interframe gap (IFG) between frames (packets). This is straightforward in the case of a gap following a just-completed good packet. But gaps following collision events are less straightforward. I do not want to debate the details of what is and is not "correct" in this case -- that is a discussion for another time and place. The reality of the situation is that there are a number of controllers in widespread use on networks today that do not interoperate very well in the face of collisions. In general, the problems arise when the gap following a collision is too short for a particular implementation of a receiver.
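As a rough illustration of the numbers involved, the sketch below converts gap durations into bit times. It assumes the 10 Mb/s signalling rate of the Ethernets of this era, which the post does not state explicitly; the function and constant names are made up for the example.

```python
# Assumption: 10 Mb/s Ethernet, i.e. one bit time = 0.1 microseconds.
BIT_RATE_MBPS = 10
STANDARD_IFG_US = 9.6  # gap a transmitter is expected to enforce

def ifg_bit_times(gap_us):
    """Convert a gap in microseconds to whole bit times at 10 Mb/s."""
    return round(gap_us * BIT_RATE_MBPS)

def is_short_gap(gap_us):
    """True if the gap is shorter than the standard 9.6 usec IFG."""
    return gap_us < STANDARD_IFG_US
```

Under that assumption the standard gap works out to 96 bit times, so a gap cut to half its length after a collision leaves a receiver only about 48 bit times to get ready for the next frame.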

In addition to uncovering controllers that simply generate short IFGs, I have also uncovered a major implementation bug in a particular chip that injects short signal bursts onto the network. These bursts can damage the IFG "enforced" by other machines. Either way, the result is the same -- a short IFG preceding a packet, which can result in a missed packet.

It is important to note that when a controller misses a packet due to a short IFG, THE FACT THAT THE PACKET WAS MISSED IS NEITHER DETECTED NOR REPORTED TO THE SYSTEM. System and driver statistics will claim no packets lost (unless some are lost for other reasons). Even most network analyzers are subject to the same undetected and therefore unreported packet loss. I have resorted to using a digital oscilloscope to capture and analyze these events.

Let me emphasize that these problems are almost exclusively related to dealing with collision events. On a lightly loaded network, where collisions are few and far between, these problems are virtually non-existent. But these problems do indeed come into play on moderately to heavily loaded networks. Based on my observations, a VERY ROUGH network load dividing line is about 25% load (using 0.1 or 1.0 sec samples).

Here is an enumeration of some of the facts related to particular controllers that I have uncovered so far. There may be problems with other controllers, but they may not appear on the networks that I have inspected.

Controller: Intel 82586
Commonly found in: SUN 3's and SUN 4's ("ie" interfaces), many other machines
Problem: Can generate a short IFG following a collision
Cause: Starts IFG timer on CS (carrier sense) dropout

Controller: Intel 82596
Commonly found in: Network General Sniffer using Cogent interface card
Problem: Will not hear packet unless preceding IFG is 4.6 usec or larger

Controller: SEEQ 8003
Commonly found in: Cisco MEC and MCI interfaces, older SGI (Silicon Graphics) including 4D/35 and Indigo (but not Indigo2)
Problem: Can generate a short IFG following a collision
Cause: Starts 9.6 usec timer at end of its own jam and not end of collision
Problem: Generates 24 bit signal burst onto network following some collisions. This burst lands in the IFG following the collision and will often result in two short IFGs, causing other controllers to miss the packet. NB: this can happen even if the chip has nothing to transmit!

Controller: AMD 7990 "LANCE"
Commonly found in: SUN SPARCStation machines (SS-1, SS-1+, SS-2, SS-10, ...), many DEC machines, Cisco/SynOptics routers, Cisco IGS, many other machines
Problem: Will not hear packet unless preceding IFG is 4.1 usec or larger
Cause: Implementation state machine
Problem: Many other problems including lock-up, transmit gaps greater than 9.6 usec under load, etc.
Fix: A new version of the controller, the 79C90 CLANCE, fixes many of these problems but is not in common use like the LANCE.

Interface chip: AT&T T7213
Commonly found in: SUN SPARCStation 10 and other newer SUN machines
Problem: Will hold the collision (and kill data) sent to the controller chip across IFGs of roughly 1.0 usec or less. It will also do this if a "manchester coding violation" is detected in a packet -- a job that should be left to the controller.

The result of all of these implementation details is that it is very possible, even probable, to put together a network that results in "undetected" packet loss.
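To make the receiver-side thresholds in the list above concrete, here is a minimal sketch that reports which receivers would silently miss a packet for a given preceding gap. The two thresholds are taken from the list; the table and function names are hypothetical, and the T7213 is left out because its figure is only given as "roughly 1.0 usec".

```python
# Minimum preceding IFG (usec) each receiver needs to hear a packet,
# as reported in the controller list above.
MIN_RX_IFG_US = {
    "Intel 82596": 4.6,
    "AMD 7990 LANCE": 4.1,
}

def receivers_that_miss(ifg_us):
    """Return the controllers that would miss a packet preceded by ifg_us."""
    return sorted(
        name for name, needed in MIN_RX_IFG_US.items() if ifg_us < needed
    )

if __name__ == "__main__":
    for gap in (9.6, 4.3, 3.0):
        print(gap, receivers_that_miss(gap))
```

A 4.3 usec gap, for example, would be heard by a LANCE but missed by an 82596, which is one way two machines on the same wire can disagree about what traffic was present.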
Packet loss rates of even less than 1% can result in performance hits as high as 80%, depending on a multitude of factors including the protocols and implementations being used. I have clocked the potential packet drop rate at PARC due to these problems to be in the 1% - 5% range at times.

I have been working with many of these vendors for a number of months now in an attempt to get these various bugs fixed so that different equipment interoperates properly. Most of the vendors have been very receptive to making things work now that they know there is a problem. Some have already identified solutions while others are still working on them.

Wesley Irish
Network Scientist
Xerox PARC
wirish@parc.xerox.com

[Please send any replies via e-mail as I do not normally read netnews]