First, read the man pages for ip(4), icmp(4), route(4), and route(8).
Then read this text.  Then go back to the man pages, and then come
back to this text.  Repeat as necessary.

Internet hosts have multiple interfaces; one or more real interfaces
(e.g., Ethernet, wireless, fiber, etc.), and zero or more virtual
interfaces (e.g., loopback interfaces, tunneling interfaces).  Typical
hosts have just one real interface, less typical hosts may be attached
to two, or even more, ethernets (say, one for regular traffic and one
for accessing dedicated servers).  These are called multi-homed hosts,
and have different IP addresses on each interface, which are, of
course, different from the loopback address.  To complicate things
further, each interface may have many IP addresses (called "aliases"),
and may have addresses for different protocols (an ethernet interface
would have a MAC address, one or more IP addresses, and even IPv6
addresses.

A node can receive a packet on one of its interfaces in a number of
ways: if the interface is for a point-to-point link (e.g., PPP),
because the other end of the link sent the packet; if the interface is
an ethernet interface, because the MAC address of the packet matched
the unicast MAC address of the interface, one of the multicast MAC
addresses assigned to the interface, or the MAC address was the
broadcast address (ff:ff:ff:ff:ff:ff).
Once a host has received an IP packet on one of its interfaces, it checks
to see whether the packet was actually meant for that host (i.e.,
whoever sent it did not accidentally or maliciously put the appropriate
MAC address on the packet).  If the destination IP address of the
received packet matches any of the IP addresses assigned to
any of the interfaces of the host (including the virtual interfaces),
matches any multicast group the host belongs to, or matches a
broadcast address (either directed or limited (255.255.255.255)), the
packet is accepted and passed to a higher-layer protocol (TCP, UDP,
etc.).  If it turns out that the packet was not for the host, it is
silently discarded.  Also, some implementations insist that the
destination IP address of a packet should match one of the addresses
of the interface it was actually received on (Solaris has an option to
force that).

Now, a router is simply a node connected to multiple networks (just
like a multihomed host), but instead of discarding packets that were
not meant for itself, it actually forwards them to the appropriate
interface.  The way a router forwards packets is identical to the way
a host decides which interface to send packets on (obviously!).  There
is just one difference: when a router receives a packet that it has to
forward, it decrements the TTL field by 1; if at that point the field
becomes zero, it drops the packet and sends an ICMP "Time Exceeded"
message to the source of the offending packet.

The purpose of the forwarding code is to decide what interface to send
each packet outgoing IP packet on, and, when relevant, what the
destination MAC address of that packet should be.  The destination
host can be on a directly attached network (no router is involved), or
can be on a remote network (in which case the packet needs to be sent
to the next-hop router), or can simply be on a virtual interface, in
which case the packet is delivered locally (never makes it to the wire).


We thus distinguish four cases:

1. Destination on the same ethernet.

2. Destination is the other end of a point-to-point link.

3. Destination is on a different network, and the first-hop router is
on the same ethernet.

4. Destination is on a different network, and the first-hop router is
on the other end of point-to-point link.

In all four cases, the lookup operation in the Forwarding Information
Base (FIB; also called, somewhat incorrectly, the "routing table" in
BSD-derived Unix systems) is the same, but yields different results:

In case 1, the lookup operation either returns the interface that the
packet should be sent out on and MAC address of the destination
(because it is already known), and the packet is sent out to that MAC
address, or returns just the interface along with an indication that
the host is directly attached to that interface. In that case, the
host sends an ARP request on that interface, and puts the packet back
in the IP output queue while waiting for an ARP reply.

In case 2, there is no MAC address to be found, so the packet is just
sent down the point-to-point link.

In case 3, the lookup returns the IP address of the next-hop router,
along with an indication that this is a router and not a
directly-attached host.  The router must, of course, be directly
attached (no recursive lookups in Unix), so a second lookup for the IP
address of the router (identical to case 1) returns its MAC address,
which is used as the destination MAC address of the packet.  Thus, a
packet that is to be routed has the IP address of its real
destination, but the MAC address of the first-hop router.  THIS IS AN
IMPORTANT THING TO REMEMBER FOR WHEN WE TALK ABOUT TUNNELING.

In case 4, again the lookup returns the IP address of the next-hop
router, along with an indication that this router is the other side of
point-to-point link.  The packet is then sent down that link, just
like in case 2.

Each entry in the FIB has several fields: the destination prefix, the
interface the packet matching that prefix should be sent on, and what
the router (if any) and the MAC address (if appropriate) should be.
There are various flags to assist in the selection process, and
additional fields to keep the size of the table manageable (so that
unused entries can time out, etc.).  An entry with a prefix that is
the same size as the address (/32 for IPv4, /128 for IPv6) is called a
"host route"; every thing else is called a "network route".

In older releases of BSD, before CIDR and VLSM, there were two routing
tables: one for host routes, and one for network routes; they were
both implemented as hash tables.  There was also a default route kept
separately.  For each outgoing packet, the host routes table was
consulted first; if the destination IP address of the packet was not
found, the network routes table was consulted; if the address was not
found, the default route was used.  The ARP table was kept separately.
How were prefixes decided for the network routes?  That was
essentially kludged when we started subnetting, but didn't have real
VLSM support.  There were two kinds of network entries.  In general,
the "class" of the address was used to determine the implicit prefix,
so that there were, e.g., entries for net 10, 127, 128.59, 192.4.13.
The exception was when an interface was using a netmask longer than
the "natural" mask for the class of the interface address.  For
example, if interface 128.59.16.20 had 255.255.255.0 as its mask
rather than the "natural" 255.255.0.0.  In that case, all other
network routes sharing the same classful address used the netmask of
that interface.  In other words, the netmask for these routes was
inferred from the netmask of the interface; the netmask was not
actually stored in the routing table.  In this example, all other
routes for subnets in 128.59/16 (e.g., 128.59.33.0/24,
128.59.122.0/24, etc) had to be listed explicitly, or handled by the
default route.  To make matters harder, some releases didn't even
support prefixes that were not a multiple of 8 bits (so you could
break up a class B into 256 /24s, but not 128 /23s), and even then
addresses with a subnet part of 0 (i.e., 128.59.0/24) were not allowed
because of the confusion they would create.

Now Go to your Algorithms book and read about tries and Patricia
trees.  Then read the Sklower paper; then read the rest of this.

The idea is that host routes are the same as network routes with a /32
mask.  The default route is just an entry with a /0 netmask.  Since
the netmask is always anded to both the entry in the trie and the
lookup key, there is no reason to actually keep anything but the
already-anded part ("network part") of an address.  A separate
Patricia tree is kept for each address family (AF_INET, AF_INET6, and
so on).  An obvious question here is, why do we do this if the first
32 bits of a struct sockaddr_in are always the same?  Remember that
the same routines are used to manipulate the trees regardless of
address type; different address types have different lengths, and some
address types (e.g., international phone numbers) may even have
variable-length addresses, and as a result the size of the
corresponding struct sockaddr would change.  Therefore, the routines
need to be written with an address format where the length is embedded
in the address, and that's exactly what a struct sockaddr does.

In case you are confused about struct sockaddr versus struct
sockaddr_in; all general-purpose networking routines are written to
take arguments of type struct sockaddr (or pointer to struct
sockaddr).  Each address family has a corresponding structure whose
first two bytes are address family and length.  Examples are struct
sockaddr_in, struct sockaddr_in6, etc. Whenever pointers to such structures need
to be passed to the general purpose routines, they have to be cast
into (struct sockaddr *)s.  This is why a typical call to bind(2)
reads:

	bind(fd, (struct sockaddr *)&me, sizeof me);

where me is of type struct sockaddr_in (if this is an internet domain
socket).