First, read the man pages for ip(4), icmp(4), route(4), and route(8). Then read this text. Then go back to the man pages, and then come back to this text. Repeat as necessary. Internet hosts have multiple interfaces; one or more real interfaces (e.g., Ethernet, wireless, fiber, etc.), and zero or more virtual interfaces (e.g., loopback interfaces, tunneling interfaces). Typical hosts have just one real interface, less typical hosts may be attached to two, or even more, ethernets (say, one for regular traffic and one for accessing dedicated servers). These are called multi-homed hosts, and have different IP addresses on each interface, which are, of course, different from the loopback address. To complicate things further, each interface may have many IP addresses (called "aliases"), and may have addresses for different protocols (an ethernet interface would have a MAC address, one or more IP addresses, and even IPv6 addresses. A node can receive a packet on one of its interfaces in a number of ways: if the interface is for a point-to-point link (e.g., PPP), because the other end of the link sent the packet; if the interface is an ethernet interface, because the MAC address of the packet matched the unicast MAC address of the interface, one of the multicast MAC addresses assigned to the interface, or the MAC address was the broadcast address (ff:ff:ff:ff:ff:ff). Once a host has received an IP packet on one of its interfaces, it checks to see whether the packet was actually meant for that host (i.e., whoever sent it did not accidentally or maliciously put the appropriate MAC address on the packet). If the destination IP address of the received packet matches any of the IP addresses assigned to any of the interfaces of the host (including the virtual interfaces), matches any multicast group the host belongs to, or matches a broadcast address (either directed or limited (255.255.255.255)), the packet is accepted and passed to a higher-layer protocol (TCP, UDP, etc.). If it turns out that the packet was not for the host, it is silently discarded. Also, some implementations insist that the destination IP address of a packet should match one of the addresses of the interface it was actually received on (Solaris has an option to force that). Now, a router is simply a node connected to multiple networks (just like a multihomed host), but instead of discarding packets that were not meant for itself, it actually forwards them to the appropriate interface. The way a router forwards packets is identical to the way a host decides which interface to send packets on (obviously!). There is just one difference: when a router receives a packet that it has to forward, it decrements the TTL field by 1; if at that point the field becomes zero, it drops the packet and sends an ICMP "Time Exceeded" message to the source of the offending packet. The purpose of the forwarding code is to decide what interface to send each packet outgoing IP packet on, and, when relevant, what the destination MAC address of that packet should be. The destination host can be on a directly attached network (no router is involved), or can be on a remote network (in which case the packet needs to be sent to the next-hop router), or can simply be on a virtual interface, in which case the packet is delivered locally (never makes it to the wire). We thus distinguish four cases: 1. Destination on the same ethernet. 2. Destination is the other end of a point-to-point link. 3. Destination is on a different network, and the first-hop router is on the same ethernet. 4. Destination is on a different network, and the first-hop router is on the other end of point-to-point link. In all four cases, the lookup operation in the Forwarding Information Base (FIB; also called, somewhat incorrectly, the "routing table" in BSD-derived Unix systems) is the same, but yields different results: In case 1, the lookup operation either returns the interface that the packet should be sent out on and MAC address of the destination (because it is already known), and the packet is sent out to that MAC address, or returns just the interface along with an indication that the host is directly attached to that interface. In that case, the host sends an ARP request on that interface, and puts the packet back in the IP output queue while waiting for an ARP reply. In case 2, there is no MAC address to be found, so the packet is just sent down the point-to-point link. In case 3, the lookup returns the IP address of the next-hop router, along with an indication that this is a router and not a directly-attached host. The router must, of course, be directly attached (no recursive lookups in Unix), so a second lookup for the IP address of the router (identical to case 1) returns its MAC address, which is used as the destination MAC address of the packet. Thus, a packet that is to be routed has the IP address of its real destination, but the MAC address of the first-hop router. THIS IS AN IMPORTANT THING TO REMEMBER FOR WHEN WE TALK ABOUT TUNNELING. In case 4, again the lookup returns the IP address of the next-hop router, along with an indication that this router is the other side of point-to-point link. The packet is then sent down that link, just like in case 2. Each entry in the FIB has several fields: the destination prefix, the interface the packet matching that prefix should be sent on, and what the router (if any) and the MAC address (if appropriate) should be. There are various flags to assist in the selection process, and additional fields to keep the size of the table manageable (so that unused entries can time out, etc.). An entry with a prefix that is the same size as the address (/32 for IPv4, /128 for IPv6) is called a "host route"; every thing else is called a "network route". In older releases of BSD, before CIDR and VLSM, there were two routing tables: one for host routes, and one for network routes; they were both implemented as hash tables. There was also a default route kept separately. For each outgoing packet, the host routes table was consulted first; if the destination IP address of the packet was not found, the network routes table was consulted; if the address was not found, the default route was used. The ARP table was kept separately. How were prefixes decided for the network routes? That was essentially kludged when we started subnetting, but didn't have real VLSM support. There were two kinds of network entries. In general, the "class" of the address was used to determine the implicit prefix, so that there were, e.g., entries for net 10, 127, 128.59, 192.4.13. The exception was when an interface was using a netmask longer than the "natural" mask for the class of the interface address. For example, if interface 128.59.16.20 had 255.255.255.0 as its mask rather than the "natural" 255.255.0.0. In that case, all other network routes sharing the same classful address used the netmask of that interface. In other words, the netmask for these routes was inferred from the netmask of the interface; the netmask was not actually stored in the routing table. In this example, all other routes for subnets in 128.59/16 (e.g., 128.59.33.0/24, 128.59.122.0/24, etc) had to be listed explicitly, or handled by the default route. To make matters harder, some releases didn't even support prefixes that were not a multiple of 8 bits (so you could break up a class B into 256 /24s, but not 128 /23s), and even then addresses with a subnet part of 0 (i.e., 128.59.0/24) were not allowed because of the confusion they would create. Now Go to your Algorithms book and read about tries and Patricia trees. Then read the Sklower paper; then read the rest of this. The idea is that host routes are the same as network routes with a /32 mask. The default route is just an entry with a /0 netmask. Since the netmask is always anded to both the entry in the trie and the lookup key, there is no reason to actually keep anything but the already-anded part ("network part") of an address. A separate Patricia tree is kept for each address family (AF_INET, AF_INET6, and so on). An obvious question here is, why do we do this if the first 32 bits of a struct sockaddr_in are always the same? Remember that the same routines are used to manipulate the trees regardless of address type; different address types have different lengths, and some address types (e.g., international phone numbers) may even have variable-length addresses, and as a result the size of the corresponding struct sockaddr would change. Therefore, the routines need to be written with an address format where the length is embedded in the address, and that's exactly what a struct sockaddr does. In case you are confused about struct sockaddr versus struct sockaddr_in; all general-purpose networking routines are written to take arguments of type struct sockaddr (or pointer to struct sockaddr). Each address family has a corresponding structure whose first two bytes are address family and length. Examples are struct sockaddr_in, struct sockaddr_in6, etc. Whenever pointers to such structures need to be passed to the general purpose routines, they have to be cast into (struct sockaddr *)s. This is why a typical call to bind(2) reads: bind(fd, (struct sockaddr *)&me, sizeof me); where me is of type struct sockaddr_in (if this is an internet domain socket).