In the previous
section we learned that the sender side of a multimedia application appends
header fields to the audio/video chunks before passing them to the transport
layer. These header fields include sequence numbers and timestamps. Since
most multimedia networking applications can make use of sequence numbers
and timestamps, it is convenient to have a standardized packet structure
that includes fields for audio/video data, sequence number, and timestamp,
as well as other potentially useful fields. RTP, defined in RFC 1889, is
such a standard. RTP can be used for transporting common formats such as
PCM or GSM for sound and MPEG1 and MPEG2 for video. It can also be used
for transporting proprietary sound and video formats.
In this section
we provide a short introduction to RTP and to its companion protocol, RTCP.
We also discuss the role of RTP in the H.323 standard for real-time interactive
audio and video conferencing. The reader is encouraged to visit Henning
Schulzrinne's RTP site [Schulzrinne
1999], which provides a wealth of information on the subject. Also,
readers may want to visit the Free Phone site [Freephone
1999], which describes an Internet phone application that uses RTP.
6.4.1: RTP Basics
RTP typically runs
on top of UDP. Specifically, chunks of audio or video data that are generated
by the sending side of a multimedia application are encapsulated in RTP
packets. Each RTP packet is in turn encapsulated in a UDP segment. Because
RTP provides services (such as timestamps and sequence numbers) to the
multimedia application, RTP can be viewed as a sublayer of the transport
layer, as shown in Figure 6.9.
Figure 6.9:
RTP can be viewed as a sublayer of the transport layer
From the application
developer's perspective, however, RTP is not part of the transport layer
but instead part of the application layer. This is because the developer
must integrate RTP into the application. Specifically, for the sender side
of the application, the developer must write application code that creates
the RTP encapsulating packets. The application then sends the RTP packets
into a UDP socket interface. Similarly, at the receiver side of the application,
RTP packets enter the application through a UDP socket interface. The developer
therefore must write code into the application that extracts the media
chunks from the RTP packets. This is illustrated in Figure 6.10.
Figure 6.10:
From a developer's perspective, RTP is part of the application layer
As an example,
consider the use of RTP to transport voice. Suppose the voice source is
PCM encoded (that is, sampled, quantized, and digitized) at 64 Kbps. Further
suppose that the application collects the encoded data in 20 msec chunks,
that is, 160 bytes in a chunk. The application precedes each chunk of the
audio data with an RTP header that includes the type of audio encoding,
a sequence number, and a timestamp. The audio chunk along with the RTP
header form the RTP packet. The RTP packet is then sent into the
UDP socket interface. At the receiver side, the application receives the
RTP packet from its socket interface. The application extracts the audio
chunk from the RTP packet, and uses the header fields of the RTP packet
to properly decode and play back the audio chunk.
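The arithmetic in this PCM example can be sketched in a few lines of Python; the constant and function names below are illustrative, not part of any RTP library.

```python
# PCM voice example from the text: 64 Kbps = 8,000 samples/sec x 1 byte/sample.
SAMPLE_RATE_HZ = 8000     # 8 kHz sampling clock
BYTES_PER_SAMPLE = 1      # 8-bit PCM
CHUNK_MSEC = 20           # the application collects 20 msec of audio per chunk

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MSEC // 1000   # 160 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE    # 160 bytes

# Each chunk becomes one RTP packet: the sequence number advances by 1,
# and the timestamp advances by the number of samples in the chunk.
def next_header_fields(seq, timestamp):
    return seq + 1, timestamp + samples_per_chunk

print(bytes_per_chunk)                # 160
print(next_header_fields(86, 1600))   # (87, 1760)
```

Note that the timestamp advances by samples, not bytes, so it would still increase by 160 per packet even if a compressed encoding shrank the chunk below 160 bytes.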
If an application
incorporates RTP--instead of a proprietary scheme to provide payload type,
sequence numbers, or timestamps--then the application will more easily
interoperate with other networked multimedia applications. For example,
if two different companies develop Internet phone software and they both
incorporate RTP into their product, there may be some hope that a user
using one of the Internet phone products will be able to communicate with
a user using the other Internet phone product. At the end of this section
we'll see that RTP has been incorporated into an important part of an Internet
telephony standard.
It should be
emphasized that RTP in itself does not provide any mechanism to ensure
timely delivery of data or provide other quality of service guarantees;
it does not even guarantee delivery of packets or prevent out-of-order
delivery of packets. Indeed, RTP encapsulation is only seen at the end
systems. Routers do not distinguish between IP datagrams that carry RTP
packets and IP datagrams that don't.
RTP allows each
source (for example, a camera or a microphone) to be assigned its own independent
RTP stream of packets. For example, for a videoconference between two participants,
four RTP streams could be opened--two streams for transmitting the audio
(one in each direction) and two streams for the video (again, one in each
direction). However, many popular encoding techniques--including MPEG1
and MPEG2--bundle the audio and video into a single stream during the encoding
process. When the audio and video are bundled by the encoder, then only
one RTP stream is generated in each direction.
RTP packets
are not limited to unicast applications. They can also be sent over one-to-many
and many-to-many multicast trees. For a many-to-many multicast session,
all of the session's senders typically use the same multicast
group for sending their RTP streams. RTP multicast streams that belong together,
such as the audio and video streams emanating from multiple senders in a videoconference
application, form an RTP session.
6.4.2: RTP Packet
Header Fields
As shown in Figure
6.11, the four main RTP packet header fields are the payload type, sequence
number, timestamp, and the source identifier fields.
Figure 6.11:
RTP header fields
The payload
type field in the RTP packet is seven bits long. For an audio stream, the
payload type field is used to indicate the type of audio encoding (for
example, PCM, adaptive delta modulation, linear predictive encoding) that
is being used. If a sender decides to change the encoding in the middle
of a session, the sender can inform the receiver of the change through
this payload type field. The sender may want to change the encoding in
order to increase the audio quality or to decrease the RTP stream bit rate.
Table 6.1 lists some of the audio payload types currently supported by
RTP.
Table 6.1:
Some audio payload types supported by RTP

| Payload Type Number | Audio Format | Sampling Rate | Throughput |
|---|---|---|---|
| 0 | PCM μ-law | 8 kHz | 64 Kbps |
| 1 | 1016 | 8 kHz | 4.8 Kbps |
| 3 | GSM | 8 kHz | 13 Kbps |
| 7 | LPC | 8 kHz | 2.4 Kbps |
| 9 | G.722 | 8 kHz | 48-64 Kbps |
| 14 | MPEG Audio | 90 kHz | -- |
| 15 | G.728 | 8 kHz | 16 Kbps |
For a video
stream, the payload type is used to indicate the type of video encoding
(for example, motion JPEG, MPEG1, MPEG2, H.261). Again, the sender can
change video encoding on-the-fly during a session. Table 6.2 lists some
of the video payload types currently supported by RTP.
Table 6.2:
Some video payload types supported by RTP

| Payload Type Number | Video Format |
|---|---|
| 26 | Motion JPEG |
| 31 | H.261 |
| 32 | MPEG1 video |
| 33 | MPEG2 video |
The other important
fields are:
-
Sequence number
field. The sequence number field is 16 bits long. The sequence number
increments by one for each RTP packet sent, and may be used by the receiver
to detect packet loss and to restore packet sequence. For example, if the
receiver side of the application receives a stream of RTP packets with
a gap between sequence numbers 86 and 89, then the receiver knows that
packets 87 and 88 are missing. The receiver can then attempt to conceal
the lost data.
-
Timestamp field.
The timestamp field is 32 bits long. It reflects the sampling instant of
the first byte in the RTP data packet. As we saw in the previous section,
the receiver can use timestamps in order to remove packet jitter introduced
in the network and to provide synchronous playout at the receiver. The
timestamp is derived from a sampling clock at the sender. As an example,
for audio, the timestamp clock increments by one for each sampling period
(for example, each 125 μsec
for an 8 kHz
sampling clock); if the audio application generates chunks consisting of
160 encoded samples, then the timestamp increases by 160 for each RTP packet
when the source is active. The timestamp clock continues to increase at
a constant rate even if the source is inactive.
-
Synchronization
source identifier (SSRC). The SSRC field is 32 bits long. It identifies
the source of the RTP stream. Typically, each stream in an RTP session
has a distinct SSRC. The SSRC is not the IP address of the sender, but
instead a number that the source assigns randomly when the new stream is
started. The probability that two streams get assigned the same SSRC is
very small. Should this happen, the two sources pick a new SSRC value.
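A minimal sketch of how these four fields could be packed into the 12-byte fixed header, assuming the RFC 1889 layout (version 2, no padding, no header extension, no CSRC list); the function name is illustrative only.

```python
import random
import struct

def make_rtp_header(payload_type, seq, timestamp, ssrc):
    """Pack the 12-byte fixed RTP header (RFC 1889), with no CSRC list."""
    version = 2
    byte0 = version << 6             # padding=0, extension=0, CSRC count=0
    byte1 = payload_type & 0x7F      # marker=0, 7-bit payload type
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

ssrc = random.getrandbits(32)        # the source picks its SSRC at random
header = make_rtp_header(payload_type=0,   # 0 = PCM μ-law (Table 6.1)
                         seq=87, timestamp=1760, ssrc=ssrc)
print(len(header))   # 12
```

The audio or video chunk would simply be appended after these 12 bytes before the packet is written into the UDP socket.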
6.4.3: RTP Control
Protocol (RTCP)
RFC 1889 also specifies
RTCP, a protocol that a multimedia networking application can use in conjunction
with RTP. As shown in the multicast scenario in Figure 6.12, RTCP packets
are transmitted by each participant in an RTP session to all other participants
in the session using IP multicast. For an RTP session, there is typically
a single multicast address, and all RTP and RTCP packets belonging to the
session use this multicast address. RTP and RTCP packets are distinguished
from each other through the use of distinct port numbers.
Figure 6.12:
Both senders and receivers send RTCP messages
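As an aside on those port numbers: RFC 1889 recommends that RTP use an even-numbered UDP port and that the associated RTCP stream use the next higher (odd) port. The helper below is an illustrative sketch of that pairing convention.

```python
def rtp_rtcp_ports(rtp_port):
    """Given an even RTP port, return the (RTP, RTCP) port pair
    following the RFC 1889 convention: RTCP = RTP port + 1 (odd)."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP conventionally uses an even port number")
    return rtp_port, rtp_port + 1

print(rtp_rtcp_ports(5004))   # (5004, 5005)
```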
RTCP packets
do not encapsulate chunks of audio or video. Instead, RTCP packets are
sent periodically and contain sender and/or receiver reports that announce
statistics that can be useful to the application. These statistics include
number of packets sent, number of packets lost, and interarrival jitter.
The RTP specification [RFC
1889] does not dictate what the application should do with this feedback
information; this is up to the application developer. Senders can use the
feedback information, for example, to modify their transmission rates.
The feedback information can also be used for diagnostic purposes; for
example, receivers can determine whether problems are local, regional,
or global.
RTCP Packet
Types
For each RTP
stream that a receiver receives as part of a session, the receiver generates
a reception report. The receiver aggregates its reception reports into
a single RTCP packet. The packet is then sent into the multicast tree that
connects together all the session's participants. The reception report
includes several fields, the most important of which are listed below.
-
The SSRC of the
RTP stream for which the reception report is being generated.
-
The fraction of
packets lost within the RTP stream. Each receiver calculates the number
of RTP packets lost divided by the number of RTP packets sent as part of
the stream. If a sender receives reception reports indicating that the
receivers are receiving only a small fraction of the sender's transmitted
packets, it can switch to a lower encoding rate, with the aim of decreasing
network congestion and improving the reception rate.
-
The last sequence
number received in the stream of RTP packets.
-
The interarrival
jitter, which is an estimate of the statistical variance of the interarrival
times of successive packets in the RTP stream.
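The first two of these statistics can be sketched as follows. The fraction-lost computation follows directly from sequence numbers, and the jitter update uses the smoothed estimator defined in RFC 1889 (gain 1/16); the function names and the choice to keep all times in timestamp units are illustrative assumptions.

```python
def fraction_lost(highest_seq_received, base_seq, packets_received):
    """Fraction of packets lost, inferred from sequence numbers."""
    expected = highest_seq_received - base_seq + 1
    lost = expected - packets_received
    return max(lost, 0) / expected

def update_jitter(jitter, send_prev, send_now, arrival_prev, arrival_now):
    """One RFC 1889-style jitter update; all times in the same units.
    D is the change in transit time between consecutive packets."""
    d = abs((arrival_now - arrival_prev) - (send_now - send_prev))
    return jitter + (d - jitter) / 16.0

# Example from the text: packets 1..89 expected, but 87 and 88 are missing.
print(fraction_lost(89, 1, 87))        # 2/89, about 0.022
print(update_jitter(0.0, 0, 160, 0, 200))   # 2.5
```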
For each RTP stream
that a sender is transmitting, the sender creates and transmits RTCP sender
report packets. These packets include information about the RTP stream,
including:
-
The SSRC of the
RTP stream.
-
The timestamp and
wall clock time of the most recently generated RTP packet in the stream.
-
The number of packets
sent in the stream.
-
The number of bytes
sent in the stream.
Sender reports
can be used to synchronize different media streams within an RTP session.
For example, consider a videoconferencing application for which each sender
generates two independent RTP streams, one for video and one for audio.
The timestamps in these RTP packets are tied to the video and audio sampling
clocks, and are not tied to the wall clock time (that is, to real
time). Each RTCP sender report contains, for the most recently generated
packet in the associated RTP stream, the timestamp of the RTP packet and
the wall clock time for when the packet was created. Thus the RTCP sender
report packets associate the sampling clock to the real-time clock. Receivers
can use this association in RTCP sender reports to synchronize the playout
of audio and video.
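The synchronization idea can be made concrete with a small sketch: given the (RTP timestamp, wall-clock time) pairing from the most recent sender report, a receiver can place any packet's timestamp on the wall-clock axis. The function name and numeric values below are illustrative.

```python
def rtp_ts_to_wallclock(ts, sr_rtp_ts, sr_wallclock, clock_rate_hz):
    """Map an RTP timestamp to wall-clock seconds, using the
    (sr_rtp_ts, sr_wallclock) pairing from an RTCP sender report."""
    return sr_wallclock + (ts - sr_rtp_ts) / clock_rate_hz

# An audio packet (8 kHz clock) 160 samples after the sender-report
# reference point plays 20 msec later on the wall-clock axis.
audio_t = rtp_ts_to_wallclock(1760, sr_rtp_ts=1600,
                              sr_wallclock=100.0, clock_rate_hz=8000)
print(audio_t)   # 100.02
```

Applying the same mapping to the video stream (with its own sender-report pairing and a 90 kHz clock) puts both media on one time axis, which is exactly what synchronized playout needs.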
For each RTP
stream that a sender is transmitting, the sender also creates and transmits
source description packets. These packets contain information about the
source, such as e-mail address of the sender, the sender's name, and the
application that generates the RTP stream. It also includes the SSRC of
the associated RTP stream. These packets provide a mapping between the
source identifier (that is, the SSRC) and the user/host name.
RTCP packets
are stackable, that is, receiver reception reports, sender reports, and
source descriptors can be concatenated into a single packet. The resulting
packet is then encapsulated into a UDP segment and forwarded into the multicast
tree.
RTCP Bandwidth
Scaling
The astute reader
will have observed that RTCP has a potential scaling problem. Consider,
for example, an RTP session that consists of one sender and a large number
of receivers. If each of the receivers periodically generates RTCP packets,
then the aggregate transmission rate of RTCP packets can greatly exceed
the rate of RTP packets sent by the sender. Observe that the amount of
RTP traffic sent into the multicast tree does not change as the number
of receivers increases, whereas the amount of RTCP traffic grows linearly
with the number of receivers. To solve this scaling problem, RTCP modifies
the rate at which a participant sends RTCP packets into the multicast tree
as a function of the number of participants in the session. Also, since
each participant sends control packets to everyone else, each participant
can estimate the total number of participants in the session [Friedman
1999].
RTCP attempts
to limit its traffic to 5% of the session bandwidth. For example, suppose
there is one sender, which is sending video at a rate of 2 Mbps. Then RTCP
attempts to limit its traffic to 5% of 2 Mbps, or 100 Kbps, as follows.
The protocol gives 75% of this rate, or 75 Kbps, to the receivers; it gives
the remaining 25% of the rate, or 25 Kbps, to the sender. The 75 Kbps devoted
to the receivers is equally shared among the receivers. Thus, if there
are R receivers, then each receiver gets to send RTCP traffic at
a rate of 75/R Kbps and the sender gets to send RTCP traffic at
a rate of 25 Kbps. A participant (a sender or receiver) determines the
RTCP packet transmission period by dynamically calculating the average
RTCP packet size (across the entire session) and dividing the average RTCP
packet size by its allocated rate. In summary, the period for transmitting
RTCP packets for a sender is
T = (number of senders) × (average RTCP packet size) / (0.25 × 0.05 × session bandwidth)
And the period
for transmitting RTCP packets for a receiver is
T = (number of receivers) × (average RTCP packet size) / (0.75 × 0.05 × session bandwidth)
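This bandwidth-sharing rule can be checked numerically with the example from the text (one sender at 2 Mbps); the function name and the 125-byte average RTCP packet size are illustrative assumptions.

```python
def rtcp_period_sec(avg_packet_bytes, num_members, share, session_bw_bps):
    """RTCP transmission period: members x avg packet size, divided by
    the bandwidth allocated to this group (share of the 5% RTCP budget)."""
    allocated_bps = 0.05 * share * session_bw_bps
    return num_members * (avg_packet_bytes * 8) / allocated_bps

# One sender, 2 Mbps session: RTCP budget is 100 Kbps, of which the
# sender gets 25% (25 Kbps) and the receivers share 75% (75 Kbps).
sender_T = rtcp_period_sec(avg_packet_bytes=125, num_members=1,
                           share=0.25, session_bw_bps=2_000_000)
print(sender_T)   # 0.04 sec: 1,000 bits / 25,000 bps

recv_T = rtcp_period_sec(avg_packet_bytes=125, num_members=200,
                         share=0.75, session_bw_bps=2_000_000)
print(recv_T)     # with 200 receivers, each waits about 2.7 sec
```

Notice that the receiver period grows linearly with the number of receivers, which is precisely how RTCP keeps its aggregate traffic bounded as the session scales.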
6.4.4: H.323
H.323 is a standard
for real-time audio and video conferencing among end systems on the Internet.
As shown in Figure 6.13, the standard also covers how end systems attached
to the Internet communicate with telephones attached to ordinary circuit-switched
telephone networks. In principle, if manufacturers of Internet telephony
and video conferencing all conform to H.323, then all their products should
be able to interoperate, and should be able to communicate with ordinary
telephones. We discuss H.323 in this section, as it provides an application
context for RTP. Indeed, we'll see below that RTP is an integral part of
the H.323 standard.
Figure 6.13:
H.323 end systems attached to the Internet can communicate with telephones
attached to a circuit-switched telephone network
H.323 end
points (terminals) can be standalone devices (for example, Web phones
and Web TVs) or applications in a PC (for example, Internet phone or video
conferencing software). H.323 equipment also includes gateways and
gatekeepers. Gateways permit communication among H.323 end points
and ordinary telephones in a circuit-switched telephone network. Gatekeepers,
which are optional, provide address translation, authorization, bandwidth
management, accounting, and billing. We will discuss gatekeepers in more
detail at the end of this section.
The H.323 standard
is an umbrella specification that includes:
-
A specification
for how endpoints negotiate common audio/video encodings. Because H.323
supports a variety of audio and video encoding standards, a protocol is
needed to allow the communicating endpoints to agree on a common encoding.
-
A specification
for how audio and video chunks are encapsulated and sent over the network.
As you may have guessed, this is where RTP comes into the picture.
-
A specification
for how endpoints communicate with their respective gatekeepers.
-
A specification
for how Internet phones communicate through a gateway with ordinary phones
in the public circuit-switched telephone network.
Figure 6.14 shows
the H.323 protocol architecture.
Figure 6.14:
H.323 protocol architecture
Minimally, each
H.323 endpoint must support the G.711 speech compression standard.
G.711 uses PCM to generate digitized speech at either 56 Kbps or 64 Kbps.
Although H.323 requires every endpoint to be voice capable (through G.711),
video capabilities are optional. Because video support is optional, manufacturers
of terminals can sell simpler speech terminals as well as more complex
terminals that support both audio and video.
As shown in
Figure 6.14, H.323 also requires that all H.323 end points use the following
protocols:
-
RTP. The
sending side of an endpoint encapsulates all media chunks within RTP packets.
The sending side then passes the RTP packets to UDP.
-
H.245. An
"out-of-band" control protocol for controlling media between H.323 endpoints.
This protocol is used to negotiate a common audio or video compression
standard that will be employed by all the participating endpoints in a
session.
-
Q.931. A
signaling protocol for establishing and terminating calls. This protocol
provides traditional telephone functionality (for example, dial tones and
ringing) to H.323 endpoints and equipment.
-
RAS (Registration/Admission/Status)
channel protocol. A protocol that allows end points to communicate
with a gatekeeper (if a gatekeeper is present).
Audio and Video
Compression
The H.323 standard
supports a specific set of audio and video compression techniques. Let's
first consider audio. As we just mentioned, all H.323 end points must support
the G.711 speech encoding standard. Because of this requirement, two H.323
end points will always be able to default to G.711 and communicate. But
H.323 allows terminals to support a variety of other speech compression
standards, including G.723.1, G.722, G.728, and G.729. Many of these standards
compress speech to rates that are suitable for 28.8 Kbps dial-up modems.
For example, G.723.1 compresses speech to either 5.3 Kbps or 6.3 Kbps,
with sound quality that is comparable to G.711.
As we mentioned
earlier, video capabilities for an H.323 endpoint are optional. However,
if an endpoint does support video, then it must (at the very least) support
the QCIF H.261 (176 × 144 pixels) video standard. A video-capable endpoint
may optionally support other H.261 schemes, including CIF, 4CIF, 16CIF,
and the H.263 standard. As the H.323 standard evolves, it will likely support
a longer list of audio and video compression schemes.
H.323 Channels
When an end
point participates in an H.323 session, it maintains several channels,
as shown in Figure 6.15. Examining Figure 6.15, we see that an end point
can support many simultaneous RTP media channels. For each media type,
there will typically be one send media channel and one receive media channel;
thus, if audio and video are sent in separate RTP streams, there will typically
be four media channels. Accompanying the RTP media channels, there is one
RTCP media control channel, as discussed in Section 6.4.3. All of the RTP
and RTCP channels run over UDP. In addition to the RTP/RTCP channels, two
other channels are required: the call control channel and the call signaling
channel. The H.245 call control channel is a TCP connection that carries
H.245 control messages. Its principal tasks are (1) opening and closing
media channels, and (2) capability exchange, that is, before sending media,
endpoints agree on an encoding algorithm. H.245, being a control protocol
for real-time interactive applications, is analogous to RTSP, the control
protocol for streaming of stored multimedia that we studied in Section
6.2.3. Finally, the Q.931 call signaling channel provides classical telephone
functionality, such as dial tone and ringing.
Figure 6.15:
H.323 channels
Gatekeepers
The gatekeeper
is an optional H.323 device. Each gatekeeper is responsible for an H.323
zone. A typical deployment scenario is shown in Figure 6.16. In this scenario,
the H.323 terminals and the gatekeeper are all attached to the same LAN,
and the H.323 zone is the LAN itself. If a zone has a gatekeeper, then
all H.323 terminals in the zone are required to communicate with it using
the RAS protocol, which runs over UDP. Address translation is one of the
more important gatekeeper services. Each terminal can have an alias address,
such as the name of the person at the terminal, the e-mail address of the
person at the terminal, and so on. The gatekeeper translates these alias addresses
to IP addresses. This address translation service is similar to the DNS
service, covered in Section 2.5. Another gatekeeper service is bandwidth
management: The gatekeeper can limit the number of simultaneous real-time
conferences in order to save some bandwidth for other applications running
over the LAN. Optionally, H.323 calls can be routed through the gatekeeper,
which is useful for billing.
Figure 6.16:
H.323 terminals and gatekeeper on the same LAN
The H.323 terminal
must register itself with the gatekeeper in its zone. When the H.323 application
is invoked at the terminal, the terminal uses RAS to send its IP address
and alias (provided by the user) to the gatekeeper. If the gatekeeper is present
in a zone, each terminal in the zone must contact the gatekeeper to ask
permission to make a call. Once it has permission, the terminal can send
the gatekeeper an e-mail address, alias string, or phone extension for
the terminal it wants to call, which may be in another zone. If necessary,
a gatekeeper will poll other gatekeepers in other zones to resolve an IP
address.
An excellent
tutorial on H.323 is provided by [WebProForum
1999]. The reader is also encouraged to see [Rosenberg
1999] for an alternative architecture to H.323 for providing telephone
service in the Internet.