Computer Networking: A Top-Down Approach Featuring the Internet Chapter 5 -- 4.4: Internet Protocol

4.4: Internet Protocol

So far in this chapter we've focused on underlying principles of the network layer, without reference to any specific network architecture. We've discussed various network-layer service models, the routing algorithms commonly used to deter mine paths between source and destination, and the use of hierarchy to address the problem of scale. In this section, we turn our attention to the network layer of the Internet, the pieces of which are often collectively referred to as the IP layer (named after the Internet's IP protocol). We'll see, though, that the IP protocol itself is just one piece (albeit a very important piece) of the Internet's network layer.

As noted in Section 4.1, the Internet's network layer provides connectionless datagram service rather than virtual-circuit service. When the network layer at the sending host receives a segment from the transport layer, it encapsulates the segment within an IP datagram, writes the destination host address as well as other fields in the datagram, and sends the datagram to the first router on the path toward the destination host. Our analogy from Chapter 1 was that this process is similar to a person writing a letter, inserting the letter in an envelope, writing the destination address on the envelope, and dropping the envelope into a mailbox. Neither the Internet's network layer nor the postal service make any preliminary contact with the destination before moving its "parcel" (datagram or letter, respectively) toward the destination. Furthermore, as discussed in Section 4.1, the Internet's network-layer service and the postal delivery service both provide so-called best-effort service: neither guarantee that a parcel will arrive within a certain time at the destination, nor does either guarantee that a series of parcels will arrive in the order sent. Indeed, neither even guarantee that a parcel will ever arrive at its destination!

As shown in Figure 4.14, the network layer in a datagram-oriented network such as the Internet has three major components.

Figure 4.14: A look inside the Internet's network layer

The first component is the network protocol, which defines network-layer addressing, the fields in the datagram (that is, the network-layer PDU), and the actions taken by routers and end systems on a datagram based on the values in these fields. The network protocol in the Internet is called the Internet Protocol, or more commonly, the IP Protocol. There are two versions of the IP protocol in use today. We'll examine the widely deployed Internet Protocol version 4, more commonly known simply as IPv4 [RFC 791] in Sections 4.4.1 through 4.4.4. In Section 4.7 we'll examine IP version 6 [RFC 2373; RFC 2460], which has been proposed to replace IPv4 in upcoming years.
The second major component of the network layer is the path determination component; it determines the route a datagram follows from source to destination. We saw in Section 4.2 that routing protocols compute the routing tables that are used to route packets through the network. We'll study the Internet's path determination component in Section 4.5.
The final component of the network layer is a facility to report errors in datagrams and respond to requests for certain network-layer information. We'll cover the Internet's network-layer error and information reporting protocol, ICMP, in Section 4.4.5.

4.4.1: IPv4 Addressing

Let's begin our study of IPv4 by considering IPv4 addressing. Although addressing may seem a rather straightforward and perhaps tedious topic, the coupling between addressing and network-layer routing is both crucial and subtle. Excellent treatments of IPv4 addressing are [Semeria 1996] and the first chapter in [Stewart 1999].

Before discussing IP addressing, however, we'll need to say a few words about how hosts and routers are connected into the network. A host typically has only a single link into the network. When IP in the host wants to send a datagram, it will do so over this link. The boundary between the host and the physical link is called an interface. A router, on the other hand, is fundamentally different from a host. Because a router's job is to receive a datagram on an "incoming" link and forward the datagram on some "outgoing" link, a router necessarily has two or more links to which it is connected. The boundary between the router and any one of its links is also called an interface. A router thus has multiple interfaces, one for each of its links. Because every host and router is capable of sending and receiving IP datagrams, IP requires each interface to have an IP address. Thus, an IP address is technically associated with an interface, rather than with the host or router containing that interface

Each IP address is 32 bits long (equivalently, four bytes), and there are thus a total of 2³² possible IP addresses. These addresses are typically written in so-called dotted-decimal notation, in which each byte of the address is written in its decimal form and is separated by a period ("dot") from other bytes in the address. For example, consider the IP address 193.32.216.9. The 193 is the decimal equivalent of the first eight bits of the address; the 32 is the decimal equivalent of the second eight bits of the address, and so on. Thus, the address 193.32.216.9 in binary notation is:

11000001 00100000 11011000 00001001.

Each interface on every host and router in the global Internet must have an IP address that is globally unique. These addresses cannot be chosen in a willy-nilly manner, however. To a large extent, an interface's IP address will be determined by the "network" to which it is connected. In this context, the term "network" does not refer to the general infrastructure of hosts, routers, and links that make up a network. Instead, the term has a very precise meaning that is closely tied to IP addressing, as we will see. Figure 4.15 provides an example of IP addressing and interfaces. In this figure, one router (with three interfaces) is used to interconnect seven hosts. Take a close look at the IP addresses assigned to the host and router interfaces; there are several things to be noted. The three hosts in the upper-left portion of Figure 4.15, and the router interface to which they are connected all have an IP address of the form 223.1.1.xxx. That is, they share a common leftmost 24 bits of their IP address. They are also interconnected to each other by a single physical link (in this case, a broadcast link such as an Ethernet cable to which they are all physically attached) with no intervening routers. In the jargon of IP, the interfaces in these hosts and the upper-left interface in the router form an IP network or more simply a network. The 24 address bits that they share in common constitute the network portion of their IP address; the remaining eight bits are the host portion of the IP address. (We would prefer to use the terminology "interface part of the address" rather than "host part of the address" because an IP address is really for an interface rather than a host; but the terminology "host part" is commonly used in practice.) The network itself also has an address: 223.1.1.0/24, where the "/24" notation, sometimes known as a network mask, indicates that the leftmost 24 bits of the 32-bit quantity define the network address. These leftmost bits that define the network address are also often referred to as the network prefix. The network 223.1.1.0/24 thus consists of the three host interfaces (223.1.1.1, 223.1.1.2, and 223.1.1.3) and one router interface (223.1.1.4). Any additional hosts attached to the 223.1.1.0/24 network would be required to have an address of the form 223.1.1.xxx. There are two additional networks shown in Figure 4.15: the 223.1.2.0/24 network and the 223.1.3.0/24 network. Figure 4.16 illustrates the three IP networks present in Figure 4.15.

Figure 4.15: Interface addresses

Figure 4.16: Network addresses

The IP definition of a "network" is not restricted to Ethernet segments that connect multiple hosts to a router interface. To get some insight here, consider Figure 4.17, which shows three routers that are interconnected with each other by point-to-point links. Each router has three interfaces, one for each point-to-point link, and one for the broadcast link that directly connects the router to a pair of hosts. What IP networks are present here? Three networks, 223.1.1.0/24, 223.1.2.0/24, and 223.1.3.0/24 are similar in spirit to the networks we encountered in Figure 4.15. But note that there are three additional networks in this example as well: one network, 223.1.9.0/24, for the interfaces that connect routers R1 and R2; another network, 223.1.8.0/24, for the interfaces that connect routers R2 and R3; and a third network, 223.1.7.0/24, for the interfaces that connect routers R3 and R1.

Figure 4.17: Three routers interconnecting six hosts

For a general interconnected system of routers and hosts, we can use the following recipe to define the networks in the system. We first detach each interface from its host or router. This creates islands of isolated networks, with interfaces terminating the endpoints of the isolated networks. We then call each of these isolated networks a network. If we apply this procedure to the interconnected system in Figure 4.17, we get six islands or networks. The current Internet consists of millions of such networks. The notion of a network and a network address is an important one, and plays a central role in the Internet's routing architecture.

Now that we have defined a network, we are ready to discuss IP addressing in more detail. The original Internet addressing architecture defined four classes of address, as shown in Figure 4.18. A fifth address class, beginning with 11110, was reserved for future use. For a class A address, the first eight bits identify the network, and the last 24 bits identify the interface within that network. Thus, within class A we can have up to 2⁷ networks (the first of the eight bits is fixed as 0), each with up to 2²⁴ interfaces. The class B address space allows for 2¹⁴ networks, with up to 2¹⁶ interfaces within each network. A class C address uses 21 bits to identify the network and leaves only eight bits for the interface identifier. Class D addresses are reserved for so-called multicast addresses; we'll defer our discussion of class D addresses until Section 4.7.

Figure 4.18: IPv4 address formats

The four address classes shown in Figure 4.18 (sometimes known as classful addressing) are no longer formally part of the IP addressing architecture. The requirement that the network portion of an IP address be exactly one, two, or three bytes long turned out to be problematic for supporting the rapidly growing number of organizations with small and medium-sized networks. A class C (/24) network could only accommodate up to 2⁸ - 2 = 254 hosts (two of the 2⁸= 256 addresses are reserved for special use)--too small for many organizations. However, a class B (/16) network, which supports up 65,634 hosts was too large. Under classful addressing, an organization with, say, 2,000 hosts was typically allocated a class B (/16) network address. This led to a rapid depletion of the class B address space and poor utilization of the assigned address space. For example, the organization that used a class B address for its 2,000 hosts was allocated enough of the address space for up to 65,534 interfaces--leaving more than 63,000 unused addresses that could not be used by other organizations.

In 1993, the IETF standardized on Classless Interdomain Routing (CIDR--pronounced the same as "cider") [RFC 1519]. With so-called CIDRized network addresses, the network part of an IP address can be any number of bits long, rather than being constrained to 8, 16, or 24 bits. A CIDRized network address has the dotted-decimal form a.b.c.d/x, where x indicates the number of leading bits in the 32-bit quantity that constitutes the network portion of the address. In our example above, the organization needing to support 2,000 hosts could be allocated a block of only 2,048 host addresses of the form a.b.c.d/21, allowing the approximately 63,000 addresses that would have been allocated and unused under classful addressing to be allocated to a different organization. In this case, the first 21 bits specify the organization's network address and are common in the IP addresses of all hosts in the organization. The remaining 11 bits then identify the specific hosts in the organization. In practice, the organization could further divide these 11 rightmost bits using a procedure known as subnetting [RFC 950] to create its own internal networks within the a.b.c.d/21 network.

Assigning Addresses

Having introduced IP addressing, a question that immediately comes to mind is how a host gets its own IP address. We have just learned that an IP address has two parts, a network part and a host part. The host part of the address can be assigned in several different ways, including:

Manual configuration. The IP address is configured into the host (typically in a file) by the system administrator.
Dynamic Host Configuration Protocol (DHCP) [RFC 2131]. DHCP is an extension of the BOOTP [RFC 1542] protocol, and is sometimes referred to as Plug and Play. With DHCP, a DHCP server in a network (for example, in a LAN) receives DHCP requests from a client and, in the case of dynamic address allocation, allocates an IP address back to the requesting client. DHCP is used extensively in LANs and in residential Internet access.

Obtaining a network address is not as simple. An organization's network administrator might first contact its ISP, which would provide addresses from a larger block of addressees that had already been allocated to the ISP. For example, the ISP may itself have been allocated the address block 200.23.16.0/20. The ISP, in turn could divide its address block into eight equal-size smaller address blocks and give one of these address blocks out to each of up to eight organizations that are supported by this ISP, as shown below. (We have underlined the network part of these addresses for visual convenience.)

ISP's block 11001000 00010111 00010000 00000000 200.23.16.0/20

Organization 0 11001000 00010111 00010000 00000000 200.23.16.0/23

Organization 1 11001000 00010111 00010010 00000000 200.23.18.0/23

Organization 2 11001000 00010111 00010100 00000000 200.23.20.0/23

...

Organization 7 11001000 00010111 00011110 00000000 200.23.30.0/23

Let's conclude our discussion of addressing by considering how an ISP itself gets a block of addresses. IP addresses are managed under the authority of The Internet Corporation for Assigned Names and Numbers (ICANN) [ICANN 2000] based on guidelines set forth in RFC 2050. The role of the nonprofit ICANN organization [NTIA 1998] is to allocate not only IP addresses, but also to manage the DNS root servers. It also has the very contentious job of assigning domain names and resolving domain name disputes. The actual assignment of addresses is now managed by regional Internet registries. As of mid-2000, there are three such regional registries: the American Registry for Internet Number (ARIN, which handles registrations for North and South America, as well as parts of Africa. ARIN has recently taken over a number of the functions previously provided by Network Solutions), the Reseaux IP Europeans (RIPE, which covers Europe and nearby countries), and the Asia Pacific Network Information Center (APNIC).

Before leaving our discussion of addressing, we want to mention that mobile hosts may change the network to which they are attached, either dynamically while in motion or on a longer time scale. Because routing is to a network first, and then to a host within the network, this means that the mobile host's IP address must change when the host changes networks. Techniques for handling such issues are now under development within the IETF and the research community [RFC 2002; RFC 2131].

Principles in Practice

This example of an ISP that connects eight organizations into the larger Internet also nicely illustrates how carefully allocated CIDRized addresses facilitate hierarchical routing. Suppose, as shown in Figure 4.19, that the ISP (which we'll call Fly-By-Night-ISP) advertises to the outside world that it should be sent any datagrams whose first 20 address bits match 200.23.16.0/20. The rest of the world need not know that within the address block 200.23.16.0/20 there are in fact eight other organizations, each with their own networks. This ability to use a single network prefix to advertise multiple networks is often referred to as route aggregation or route summarization.

Figure 4.19: Hierarchical addressing and route aggregation

Route aggregation works extremely well when addresses are allocated in blocks to ISPs and then from ISPs to client organizations. But what happens when addresses are not allocated in such a hierarchical manner? What would happen, for example, if Organization 1 becomes discontent with the poor service provided by Fly-By-Night-ISP and decides to switch over to a new ISP, say, ISPs-R-Us? As shown in Figure 4.19, ISPs-R-Us owns the address block 199.31.0.0/16, but Organization 2's IP addresses are unfortunately outside of this address block. What should be done here? Certainly, Organization 1 could renumber all of its routers and hosts to have addresses within the ISPs-R-Us address block. But this is a costly solution, and Organization 1 might well choose to switch from ISPs-R-Us to yet another ISP in the future. The solution typically adopted is for Organization 1 to keep its IP addresses in 200.23.18.0/23. In this case, as shown in Figure 4.20, Fly-By-Night-ISP continues to advertise the address block 200.23.16.0/20 and ISPs-R-Us continues to advertise 199.31.0.0/16. However, ISPs-R-Us now also advertises the block of addresses for Organization 1, 200.23.18.0/23. When other routers in the larger Internet see the address blocks 200.23.16.0/20 (from Fly-By-Night-ISP) and 200.23.18.0/23 (from ISPs-R-Us) and want to route to an address in the block 200.23.18.0/23, they will use a longest prefix matching rule, and route toward ISPs-R-Us, as it advertises the longest (more specific) address prefix that matches the destination address.

Figure 4.20: ISPs-R-Us has a more specific route to Organization 1

4.4.2: Transporting a Datagram from Source to Destination: Addressing and Routing

Now that we have defined interfaces and networks and have a basic understanding of IP addressing, we take a step back and examine how hosts and routers transport an IP datagram from source to destination. To this end, a high-level view of an IP datagram is shown in Figure 4.21. Every IP datagram has a source address field and a destination address field. The source host fills a datagram's source address field with its own 32-bit IP address. It fills the destination address field with the 32-bit IP address of the final destination host to which the datagram is being sent. The data field of the datagram is typically filled with a TCP or UDP segment. We'll discuss the remaining IP datagram fields a little later in this section.

Figure 4.21: The key fields of an IP datagram

Once the source host creates the IP datagram, how does the network layer transport the datagram from the source host to the destination host? The answer to this question depends on whether the source and destination reside on the same network (where the term "network" is used here in the precise, addressing sense discussed in Section 4.4.1). Let's consider this question in the context of the network shown in Figure 4.22. First suppose host A wants to send an IP datagram to host B, which resides on the same network, 223.1.1.0/24, as A. This is accomplished as follows. IP in host A first consults its internal routing table, shown in Figure 4.22, and finds an entry, 223.1.1.0/24, whose network address matches the leading bits in the IP address of host B. The routing table shows that the number of hops to network 223.1.1.0 is 1, indicating that B is on the very same network to which A itself is attached. Host A thus knows that destination host B can be reached directly via A's outgoing interface, without the need for any intervening routers. Host A then passes the IP datagram to the link-layer protocol for the interface, which then has the responsibility of transporting the datagram to host B. (We'll study how the link layer transports a datagram between two interfaces on the same network in Chapter 5.)

Figure 4.22: Routing table in host A

Let's next consider the more interesting case that host A wants to send a datagram to another host, say E, that is on a different network. Host A again consults its routing table and finds an entry, 223.1.2.0/24, whose network address matches the leading bits in the IP address of host E. Because the number of hops to the destination is 2, host A knows that the destination is on another network and thus an intervening router will necessarily be involved. The routing table also tells host A that in order to get the datagram to host E, host A should first send the datagram to IP address 223.1.1.4, the router interface to which A's own interface is directly connected. IP in host A then passes the datagram down to the link layer and indicates to the link layer that it should send the datagram to IP address 223.1.1.4. It's important to note here that although the datagram is being sent (via the link layer) to the router's interface, the destination address of the datagram remains that of the ultimate destination (host E,) not that of the intermediate router interface.

The datagram is now in the router, and it is the job of the router to move the datagram toward its ultimate destination. As shown in Figure 4.23, the router consults it own routing table and finds an entry, 223.1.2.0/24, whose network address matches the leading bits in the IP address of host E. The routing table indicates that the datagram should be forwarded on router interface 223.1.2.9. Since the number of hops to the destination is 1, the router knows that destination host E is on the same network as its own interface, 223.1.2.9. The router thus moves the datagram to this interface, which then transmits the datagram to host E.

Figure 4.23: Routing table in router

In Figure 4.23, note that the entries in the "next router" column are all empty since each of the networks (223.1.1.0/24., 223.1.2.0/24, and 223.1.3.0/24) is directly attached to the router. In this case, there is no need to go through an intermediate router to get to a destination host. However, if host A and host E were separated by two routers, then within the routing table of the first router along the path from A to B, the appropriate row would indicate 2 hops to the destination and would specify the IP address of the second router along the path. The first router would then forward the datagram to the second router, using the link-layer protocol that connects the two routers. The second router would then forward the datagram to the destination host, using the link-layer protocol that connects the second router to the destination host.

You may recall from Chapter 1 that we said that routing a datagram in the Internet is similar to a person driving a car and asking gas station attendants at each intersection along the way how to get to the ultimate destination. It should now be clear why this an appropriate analogy for routing in the Internet. As a datagram travels from source to destination, it visits a series of routers. At each router in the series, it stops and asks the router how to get to its ultimate destination. Unless the router is on the same network as the ultimate destination, the routing table essentially says to the datagram: "I don't know exactly how to get to the ultimate destination, but I do know that the ultimate destination is in the direction of the link (analogous to a road) connected to one of my interfaces." The datagram then sets out on the link connected to this interface, arrives at a new router, and again asks for new directions.

From this discussion we see that the routing tables in the routers play a central role in routing datagrams through the Internet. But how are these routing tables configured and maintained for large networks with multiple paths between sources and destinations (such as in the Internet)? Clearly, these routing tables should be configured so that the datagrams follow "good" routes from source to destination. As you probably guessed, routing algorithms--like those studied in Section 4.2--have the job of configuring and maintaining the routing tables. We will discuss the Internet's routing algorithms in Section 4.5. But before moving on to routing algorithms, we cover three more important topics for the IP protocol, namely, the datagram format, datagram fragmentation, and the Internet Control Message Protocol (ICMP).

4.4.3: Datagram Format

The IPv4 datagram format is shown in Figure 4.24.

Figure 4.24: IPv4 datagram format

The key fields in the IPv4 datagram are the following:

Version Number. These four bits specify the IP protocol version of the datagram. By looking at the version number, the router can then determine how to interpret the remainder of the IP datagram. Different versions of IP use different datagram formats. The datagram format for the current version of IP, IPv4, is shown in Figure 4.24. The datagram format for the new version of IP (IPv6) is discussed in Section 4.7.
Header Length. Because an IPv4 datagram can contain a variable number of options (that are included in the IPv4 datagram header) these four bits are needed to determine where in the IP datagram the data actually begins. Most IP datagrams do not contain options so the typical IP datagram has a 20-byte header.
TOS. The type of service (TOS) bits were included in the IPv4 header to allow different "types" of IP datagrams to be distinguished from each other, presumably so that they could be handled differently in times of overload. When the network is overloaded, for example, it would be useful to be able to distinguish network-control datagrams (for example, see the ICMP discussion in Section 4.4.5) from datagrams carrying data (for example, HTTP messages). It would also be useful to distinguish real-time datagrams (for example, used by an IP telephony application) from non-real-time traffic (for example, FTP). More recently, one major routing vendor (Cisco) interprets the first three TOS bits as defining differential levels of service that can be provided by the router. The specific level of service to be provided is a policy issue determined by the router's administrator. We'll explore the topic of differentiated service in detail in Chapter 6.
Datagram Length. This is the total length of the IP datagram (header plus data) measured in bytes. Since this field is 16 bits long, the theoretical maximum size of the IP datagram is 65,535 bytes. However, datagrams are rarely greater than 1,500 bytes and are often limited in size to 576 bytes.
Identifier, Flags, Fragmentation Offset. These three fields have to do with so-called IP fragmentation, a topic we will consider in depth shortly. Interestingly, the new version of IP, IPv6, does not allow for fragmentation at routers.
Time-to-live. The time-to-live (TTL) field is included to ensure that datagrams do not circulate forever (due to, for example, a long-lived router loop) in the network. This field is decremented by one each time the datagram is processed by a router. If the TTL field reaches 0, the datagram must be dropped.
Protocol. This field is used only when an IP datagram reaches its final destination. The value of this field indicates the transport-layer protocol at the destination to which the data portion of this IP datagram will be passed. For example, a value of 6 indicates that the data portion is passed to TCP, while a value of 17 indicates that the data is passed to UDP. For a listing of all possible numbers, see RFC 1700. Note that the protocol number in the IP datagram has a role that is fully analogous to the role of the port number field in the transport-layer segment. The protocol number is the "glue" that binds the network and transport layers together, whereas the port number is the "glue" that binds the transport and application layers together. We will see in Chapter 5 that the link-layer frame also has a special field that binds the link layer to the network layer.
Header Checksum. The header checksum aids a router in detecting bit errors in a received IP datagram. The header checksum is computed by treating each two bytes in the header as a number and summing these numbers using 1's complement arithmetic. As discussed in Section 3.3, the 1's complement of this sum, known as the Internet checksum, is stored in the checksum field. A router computes the Internet checksum for each received IP datagram and detects an error condition if the checksum carried in the datagram does not equal the computed checksum. Routers typically discard datagrams for which an error has been detected. Note that the checksum must be recomputed and restored at each router, as the TTL field, and possibly options fields as well, may change. An interesting discussion of fast algorithms for computing the Internet checksum is RFC 1071. A question often asked at this point is, why does TCP/IP perform error checking at both the transport and network layers? There are many reasons for this repetition. First, routers are not required to perform error checking, so the transport layer cannot count on the network layer to do the job. Second, TCP/UDP and IP do not necessarily both have to belong to the same protocol stack. TCP can, in principle, run over a different protocol (for example, ATM) and IP can carry data that will not be passed to TCP/UDP.
Source and Destination IP Address. These fields carry the 32-bit IP address of the source and final destination for this IP datagram. The use and importance of the destination address is clear. Recall from Section 3.2 that the source IP address (along with the source and destination port numbers) is used at the destination host to direct the application data to the proper socket.
Options. The options fields allow an IP header to be extended. Header options were meant to be used rarely--hence the decision to save overhead by not including the information in options fields in every datagram header. However, the mere existence of options does complicate matters--since datagram headers can be of variable length, one cannot determine a priori where the data field will start. Also, since some datagrams may require options processing and others may not, the amount of time needed to process an IP datagram at a router can vary greatly. These considerations become particularly important for IP processing in high-performance routers and hosts. For these reasons and others, IP options were dropped in the IPv6 header.
Data (payload). Finally, we come to the last, and most important field--the raison d'être for the datagram in the first place! In most circumstances, the data field of the IP datagram contains the transport-layer segment (TCP or UDP) to be delivered to the destination. However, the data field can carry other types of data, such as ICMP messages (discussed in Section 4.4.5).

Note that an IP datagram has a total of 20 bytes of header (assuming it has no options). If the datagram carries a TCP segment, then each (non-fragmented) datagram carries a total of 40 bytes of header (20 IP header bytes and 20 TCP header bytes) along with the application-layer message.

4.4.4: IP Fragmentation and Reassembly

We will see in Chapter 5 that not all link-layer protocols can carry packets of the same size. Some protocols can carry "big" packets, whereas other protocols can only carry "little" packets. For example, Ethernet packets can carry no more than 1,500 bytes of data, whereas packets for many wide-area links can carry no more than 576 bytes. The maximum amount of data that a link-layer packet can carry is called the MTU (maximum transfer unit). Because each IP datagram is encapsulated within the link-layer packet for transport from one router to the next router, the MTU of the link-layer protocol places a hard limit on the length of an IP datagram. Having a hard limit on the size of an IP datagram is not much of a problem. What is a problem is that each of the links along the route between sender and destination can use different link-layer protocols, and each of these protocols can have different MTUs.

To understand the problem better, imagine that you are a router that interconnects several links, each running different link-layer protocols with different MTUs. Suppose you receive an IP datagram from one link, you check your routing table to determine the outgoing link, and this outgoing link has an MTU that is smaller than the length of the IP datagram. Time to panic--how are you going to squeeze this oversized IP packet into the payload field of the link-layer packet? The solution to this problem is to "fragment" the data in the IP datagram among two or more smaller IP datagrams, and then send these smaller datagrams over the outgoing link. Each of these smaller datagrams is referred to as a fragment.

Fragments need to be reassembled before they reach the transport layer at the destination. Indeed, both TCP and UDP are expecting to receive complete, unfragmented segments from the network layer. The designers of IPv4 felt that reassembling (and possibly refragmenting) datagrams in the routers would introduce significant complication into the protocol and put a damper on router performance. (If you were a router, would you want to be reassembling fragments on top of everything else you have to do?) Sticking to the principle of keeping the network layer simple, the designers of IPv4 decided to put the job of datagram reassembly in the end systems rather than in network routers.

When a destination host receives a series of datagrams from the same source, it needs to determine if any of these datagrams are fragments of some original larger datagram. If it does determine that some datagrams are fragments, it must further determine when it has received the last fragment and how the fragments it has received should be pieced back together to form the original datagram. To allow the destination host to perform these reassembly tasks, the designers of IP (version 4) put identification, flag, and fragmentation fields in the IP datagram. When a datagram is created, the sending host stamps the datagram with an identification number as well as a source and destination address. The sending host increments the identification number for each datagram it sends. When a router needs to fragment a datagram, each resulting datagram (that is, "fragment") is stamped with the source address, destination address, and identification number of the original datagram. When the destination receives a series of datagrams from the same sending host, it can examine the identification numbers of the datagrams to determine which of the datagrams are actually fragments of the same, larger datagram. Because IP is an unreliable service, one or more of the fragments may never arrive at the destination. For this reason, in order for the destination host to be absolutely sure it has received the last fragment of the original datagram, the last fragment has a flag bit set to 0 whereas all the other fragments have this flag bit set to 1. Also, in order for the destination host to determine if a fragment is missing (and also to be able to reassemble the fragments in their proper order), the offset field is used to specify where the fragment fits within the original IP datagram.

Figure 4.25 illustrates an example. A datagram of 4,000 bytes arrives at a router, and must be forwarded to a link with an MTU of 1,500 bytes. This implies that the 3,980 data bytes in the original datagram must be allocated to three separate fragments (each of which are also IP datagrams). Suppose that the original datagram is stamped with an identification number of 777. The characteristics of the three fragments are shown in Table 4.3.

Figure 4.25: IP fragmentation and reassembly

Table 4.3: IP Fragments

Fragment Bytes ID Offset Flag

1st fragment 1480 bytes in the data field of the IP datagram identification=777 offset=0 (meaning the data should be inserted beginning at byte 0) flag=1 (meaning there is more)

2nd fragment 1480 byte information field identification=777 offset=1,480 (meaning the data should be inserted beginning at byte 1,480) flag=1 (meaning there is more)

3rd fragment 1020 byte (=3980 - 1480 - 1480) information field identification=777 offset=2,960 (meaning the data should be inserted beginning at byte 2,960) flag=0 (meaning this is the last fragment)

The payload of the datagram is only passed to the transport layer at the destination once the IP layer has fully reconstructed the original IP datagram. If one or more of the fragments does not arrive to the destination, the datagram is discarded and not passed to the transport layer. But, as we learned in the previous chapter, if TCP is being used at the transport layer, then TCP will recover from this loss by having the source retransmit the data in the original datagram.

Fragmentation and reassembly puts an additional burden on Internet routers (the additional effort to create fragments out of a datagram) and on the destination hosts (the additional effort to reassemble fragments). For this reason it is desirable to keep fragmentation to a minimum. This is often done by limiting the TCP and UDP segments to a relatively small size, so that fragmentation of the corresponding datagrams is unlikely. Because all data-link protocols supported by IP are supposed to have MTUs of at least 576 bytes, fragmentation can be entirely eliminated by using an MSS of 536 bytes, 20 bytes of TCP segment header and 20 bytes of IP datagram header. This is why most TCP segments for bulk data transfer (such as with HTTP) are 512-536 bytes long. (You may have noticed while surfing the Web that 500 or so bytes of data often arrive at a time.)

Following this section, we provide a Java applet that generates fragments. You provide the incoming datagram size, the MTU, and the incoming datagram identification. It automatically generates the fragments for you. Click here to open it in a new window, or select it from the menu bar at the left.

4.4.5: ICMP: Internet Control Message Protocol

We conclude this section with a discussion of the Internet Control Message Protocol, ICMP, which is used by hosts, routers, and gateways to communicate network layer information to each other. ICMP is specified in RFC 792. The most typical use of ICMP is for error reporting. For example, when running a Telnet, FTP, or HTTP session, you may have encountered an error message such as "Destination network unreachable." This message had its origins in ICMP. At some point, an IP router was unable to find a path to the host specified in your Telnet, FTP, or HTTP application. That router created and sent a type-3 ICMP message to your host indicating the error. Your host received the ICMP message and returned the error code to the TCP code that was attempting to connect to the remote host. TCP, in turn, returned the error code to your application.

ICMP is often considered part of IP, but architecturally lies just above IP, as ICMP messages are carried inside IP packets. That is, ICMP messages are carried as IP payload, just as TCP or UDP segments are carried as IP payload. Similarly, when a host receives an IP packet with ICMP specified as the upper-layer protocol, it demultiplexes the packet to ICMP, just as it would demultiplex a packet to TCP or UDP.

ICMP messages have a type and a code field, and also contain the first eight bytes of the IP datagram that caused the ICMP message to be generated in the first place (so that the sender can determine the packet that caused the error). Selected ICMP messages are shown below in Figure 4.26. Note that ICMP messages are used not only for signaling error conditions. The well-known ping program sends an ICMP type 8 code 0 message to the specified host. The destination host, seeing the echo request, sends back a type 0 code 0 ICMP echo reply. Another interesting ICMP message is the source quench message. This message is seldom used in practice. Its original purpose was to perform congestion control--to allow a congested router to send an ICMP source quench message to a host to force that host to reduce its transmission rate. We have seen in Chapter 3 that TCP has its own congestion-control mechanism that operates at the transport layer, without the use of network-layer feedback such as the ICMP source quench message.

ICMP Type Code Description

0 0 echo reply (to ping)

3 0 destination network unreachable

3 1 destination host unreachable

3 2 destination protocol unreachable

3 3 destination port unreachable

3 6 destination network unknown

3 7 destination host unknown

4 0 source quench (congestion control)

8 0 echo request

9 0 router advertisement

10 0 router discovery

11 0 TTL expired

12 0 IP header bad

Figure 4.26: IP Fragmentation

In Chapter 1 we introduced the Traceroute program, which enabled you to trace the route from a few given hosts to any host in the world. Interestingly, Traceroute also uses ICMP messages. To determine the names and addresses of the routers between source and destination, Traceroute in the source sends a series of ordinary IP datagrams to the destination. The first of these datagrams has a TTL of 1, the second of 2, the third of 3, etc. The source also starts timers for each of the datagrams. When the nth datagram arrives at the nth router, the nth router observes that the TTL of the datagram has just expired. According to the rules of the IP protocol, the router discards the datagram and sends an ICMP warning message to the source (type 11 code 0). This warning message includes the name of the router and its IP address. When this ICMP message arrives at the source, the source obtains the round-trip time from the timer and the name and IP address of the nth router from the ICMP message. Now that you understand how Traceroute works, you may want to go back and play with it some more.