In the previous
section we learned that the sender side of a multimedia application appends
header fields to the audio/video chunks before passing them to the transport
layer. These header fields include sequence numbers and timestamps. Since
most multimedia networking applications can make use of sequence numbers
and timestamps, it is convenient to have a standardized packet structure
that includes fields for audio/video data, sequence number, and timestamp,
as well as other potentially useful fields. RTP, defined in RFC 1889, is
such a standard. RTP can be used for transporting common formats such as
PCM or GSM for sound and MPEG1 and MPEG2 for video. It can also be used
for transporting proprietary sound and video formats.
In this section
we provide a short introduction to RTP and to its companion protocol, RTCP.
We also discuss the role of RTP in the H.323 standard for real-time interactive
audio and video conferencing. The reader is encouraged to visit Henning
Schulzrinne's RTP site [Schulzrinne
1999], which provides a wealth of information on the subject. Also,
readers may want to visit the Free Phone site [Freephone
1999], which describes an Internet phone application that uses RTP.
6.4.1: RTP Basics
RTP typically runs
on top of UDP. Specifically, chunks of audio or video data that are generated
by the sending side of a multimedia application are encapsulated in RTP
packets. Each RTP packet is in turn encapsulated in a UDP segment. Because
RTP provides services (such as timestamps and sequence numbers) to the
multimedia application, RTP can be viewed as a sublayer of the transport
layer, as shown in Figure 6.9.
Figure 6.9:
RTP can be viewed as a sublayer of the transport layer
From the application
developer's perspective, however, RTP is not part of the transport layer
but instead part of the application layer. This is because the developer
must integrate RTP into the application. Specifically, for the sender side
of the application, the developer must write application code that creates
the RTP encapsulating packets. The application then sends the RTP packets
into a UDP socket interface. Similarly, at the receiver side of the application,
RTP packets enter the application through a UDP socket interface. The developer
therefore must write code into the application that extracts the media
chunks from the RTP packets. This is illustrated in Figure 6.10.
Figure 6.10:
From a developer's perspective, RTP is part of the application layer
As an example,
consider the use of RTP to transport voice. Suppose the voice source is
PCM encoded (that is, sampled, quantized, and digitized) at 64 Kbps. Further
suppose that the application collects the encoded data in 20 msec chunks,
that is, 160 bytes in a chunk. The application precedes each chunk of the
audio data with an RTP header that includes the type of audio encoding,
a sequence number, and a timestamp. The audio chunk along with the RTP
header form the RTP packet. The RTP packet is then sent into the
UDP socket interface. At the receiver side, the application receives the
RTP packet from its socket interface. The application extracts the audio
chunk from the RTP packet, and uses the header fields of the RTP packet
to properly decode and play back the audio chunk.
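The arithmetic in this PCM example can be sketched in a few lines of Python; the constant and function names below are illustrative, not part of any RTP library.

```python
# PCM voice example from the text: 64 Kbps = 8,000 samples/sec x 1 byte/sample.
SAMPLE_RATE_HZ = 8000     # 8 kHz sampling clock
BYTES_PER_SAMPLE = 1      # 8-bit PCM
CHUNK_MSEC = 20           # the application collects 20 msec of audio per chunk

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MSEC // 1000   # 160 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE    # 160 bytes

# Each chunk becomes one RTP packet: the sequence number advances by 1,
# and the timestamp advances by the number of samples in the chunk.
def next_header_fields(seq, timestamp):
    return seq + 1, timestamp + samples_per_chunk

print(bytes_per_chunk)                # 160
print(next_header_fields(86, 1600))   # (87, 1760)
```

Note that the timestamp advances by samples, not bytes, so it would still increase by 160 per packet even if a compressed encoding shrank the chunk below 160 bytes.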
If an application
incorporates RTP--instead of a proprietary scheme to provide payload type,
sequence numbers, or timestamps--then the application will more easily
interoperate with other networked multimedia applications. For example,
if two different companies develop Internet phone software and they both
incorporate RTP into their product, there may be some hope that a user
using one of the Internet phone products will be able to communicate with
a user using the other Internet phone product. At the end of this section
we'll see that RTP has been incorporated into an important part of an Internet
telephony standard.
It should be
emphasized that RTP in itself does not provide any mechanism to ensure
timely delivery of data or provide other quality of service guarantees;
it does not even guarantee delivery of packets or prevent out-of-order
delivery of packets. Indeed, RTP encapsulation is only seen at the end
systems. Routers do not distinguish between IP datagrams that carry RTP
packets and IP datagrams that don't.
RTP allows each
source (for example, a camera or a microphone) to be assigned its own independent
RTP stream of packets. For example, for a videoconference between two participants,
four RTP streams could be opened--two streams for transmitting the audio
(one in each direction) and two streams for the video (again, one in each
direction). However, many popular encoding techniques--including MPEG1
and MPEG2--bundle the audio and video into a single stream during the encoding
process. When the audio and video are bundled by the encoder, then only
one RTP stream is generated in each direction.
RTP packets
are not limited to unicast applications. They can also be sent over one-to-many
and many-to-many multicast trees. For a many-to-many multicast session,
all of the session's senders typically use the same multicast
group for sending their RTP streams. RTP multicast streams that belong together,
such as the audio and video streams emanating from multiple senders in a videoconference
application, form an RTP session.
6.4.2: RTP Packet
Header Fields
As shown in Figure
6.11, the four main RTP packet header fields are the payload type, sequence
number, timestamp, and the source identifier fields.
Figure 6.11:
RTP header fields
The payload
type field in the RTP packet is seven bits long. For an audio stream, the
payload type field is used to indicate the type of audio encoding (for
example, PCM, adaptive delta modulation, linear predictive encoding) that
is being used. If a sender decides to change the encoding in the middle
of a session, the sender can inform the receiver of the change through
this payload type field. The sender may want to change the encoding in
order to increase the audio quality or to decrease the RTP stream bit rate.
Table 6.1 lists some of the audio payload types currently supported by
RTP.
Table 6.1:
Some audio payload types supported by RTP

| Payload Type Number | Audio Format | Sampling Rate | Throughput |
|---|---|---|---|
| 0 | PCM μ-law | 8 kHz | 64 Kbps |
| 1 | 1016 | 8 kHz | 4.8 Kbps |
| 3 | GSM | 8 kHz | 13 Kbps |
| 7 | LPC | 8 kHz | 2.4 Kbps |
| 9 | G.722 | 8 kHz | 48-64 Kbps |
| 14 | MPEG Audio | 90 kHz | -- |
| 15 | G.728 | 8 kHz | 16 Kbps |
For a video
stream, the payload type is used to indicate the type of video encoding
(for example, motion JPEG, MPEG1, MPEG2, H.261). Again, the sender can
change video encoding on-the-fly during a session. Table 6.2 lists some
of the video payload types currently supported by RTP.
Table 6.2:
Some video payload types supported by RTP

| Payload Type Number | Video Format |
|---|---|
| 26 | Motion JPEG |
| 31 | H.261 |
| 32 | MPEG1 video |
| 33 | MPEG2 video |
The other important
fields are:
-
Sequence number
field. The sequence number field is 16 bits long. The sequence number
increments by one for each RTP packet sent, and may be used by the receiver
to detect packet loss and to restore packet sequence. For example, if the
receiver side of the application receives a stream of RTP packets with
a gap between sequence numbers 86 and 89, then the receiver knows that
packets 87 and 88 are missing. The receiver can then attempt to conceal
the lost data.
-
Timestamp field.
The timestamp field is 32 bits long. It reflects the sampling instant of
the first byte in the RTP data packet. As we saw in the previous section,
the receiver can use timestamps in order to remove packet jitter introduced
in the network and to provide synchronous playout at the receiver. The
timestamp is derived from a sampling clock at the sender. As an example,
for audio, the timestamp clock increments by one for each sampling period
(for example, each 125 μsec
for an 8 kHz
sampling clock); if the audio application generates chunks consisting of
160 encoded samples, then the timestamp increases by 160 for each RTP packet
when the source is active. The timestamp clock continues to increase at
a constant rate even if the source is inactive.
-
Synchronization
source identifier (SSRC). The SSRC field is 32 bits long. It identifies
the source of the RTP stream. Typically, each stream in an RTP session
has a distinct SSRC. The SSRC is not the IP address of the sender, but
instead a number that the source assigns randomly when the new stream is
started. The probability that two streams get assigned the same SSRC is
very small. Should this happen, the two sources pick a new SSRC value.
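A minimal sketch of how these four fields could be packed into the 12-byte fixed header, assuming the RFC 1889 layout (version 2, no padding, no header extension, no CSRC list); the function name is illustrative only.

```python
import random
import struct

def make_rtp_header(payload_type, seq, timestamp, ssrc):
    """Pack the 12-byte fixed RTP header (RFC 1889), with no CSRC list."""
    version = 2
    byte0 = version << 6             # padding=0, extension=0, CSRC count=0
    byte1 = payload_type & 0x7F      # marker=0, 7-bit payload type
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

ssrc = random.getrandbits(32)        # the source picks its SSRC at random
header = make_rtp_header(payload_type=0,   # 0 = PCM μ-law (Table 6.1)
                         seq=87, timestamp=1760, ssrc=ssrc)
print(len(header))   # 12
```

The audio or video chunk would simply be appended after these 12 bytes before the packet is written into the UDP socket.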
6.4.3: RTP Control
Protocol (RTCP)
RFC 1889 also specifies
RTCP, a protocol that a multimedia networking application can use in conjunction
with RTP. As shown in the multicast scenario in Figure 6.12, RTCP packets
are transmitted by each participant in an RTP session to all other participants
in the session using IP multicast. For an RTP session, there is typically
a single multicast address, and all RTP and RTCP packets belonging to the
session use this multicast address. RTP and RTCP packets are distinguished
from each other through the use of distinct port numbers.
Figure 6.12:
Both senders and receivers send RTCP messages
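As an aside on those port numbers: RFC 1889 recommends that RTP use an even-numbered UDP port and that the associated RTCP stream use the next higher (odd) port. The helper below is an illustrative sketch of that pairing convention.

```python
def rtp_rtcp_ports(rtp_port):
    """Given an even RTP port, return the (RTP, RTCP) port pair
    following the RFC 1889 convention: RTCP = RTP port + 1 (odd)."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP conventionally uses an even port number")
    return rtp_port, rtp_port + 1

print(rtp_rtcp_ports(5004))   # (5004, 5005)
```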
RTCP packets
do not encapsulate chunks of audio or video. Instead, RTCP packets are
sent periodically and contain sender and/or receiver reports that announce
statistics that can be useful to the application. These statistics include
number of packets sent, number of packets lost, and interarrival jitter.
The RTP specification [RFC
1889] does not dictate what the application should do with this feedback
information; this is up to the application developer. Senders can use the
feedback information, for example, to modify their transmission rates.
The feedback information can also be used for diagnostic purposes; for
example, receivers can determine whether problems are local, regional,
or global.
RTCP Packet
Types
For each RTP
stream that a receiver receives as part of a session, the receiver generates
a reception report. The receiver aggregates its reception reports into
a single RTCP packet. The packet is then sent into the multicast tree that
connects together all the session's participants. The reception report
includes several fields, the most important of which are listed below.
-
The SSRC of the
RTP stream for which the reception report is being generated.
-
The fraction of
packets lost within the RTP stream. Each receiver calculates the number
of RTP packets lost divided by the number of RTP packets sent as part of
the stream. If a sender receives reception reports indicating that the
receivers are receiving only a small fraction of the sender's transmitted
packets, it can switch to a lower encoding rate, with the aim of decreasing
network congestion and improving the reception rate.
-
The last sequence
number received in the stream of RTP packets.
-
The interarrival
jitter, which is an estimate of the statistical variance of the interarrival
times of successive packets in the RTP stream.
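The first two of these statistics can be sketched as follows. The fraction-lost computation follows directly from sequence numbers, and the jitter update uses the smoothed estimator defined in RFC 1889 (gain 1/16); the function names and the choice to keep all times in timestamp units are illustrative assumptions.

```python
def fraction_lost(highest_seq_received, base_seq, packets_received):
    """Fraction of packets lost, inferred from sequence numbers."""
    expected = highest_seq_received - base_seq + 1
    lost = expected - packets_received
    return max(lost, 0) / expected

def update_jitter(jitter, send_prev, send_now, arrival_prev, arrival_now):
    """One RFC 1889-style jitter update; all times in the same units.
    D is the change in transit time between consecutive packets."""
    d = abs((arrival_now - arrival_prev) - (send_now - send_prev))
    return jitter + (d - jitter) / 16.0

# Example from the text: packets 1..89 expected, but 87 and 88 are missing.
print(fraction_lost(89, 1, 87))        # 2/89, about 0.022
print(update_jitter(0.0, 0, 160, 0, 200))   # 2.5
```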
For each RTP stream
that a sender is transmitting, the sender creates and transmits RTCP sender
report packets. These packets include information about the RTP stream,
including:
-
The SSRC of the
RTP stream.
-
The timestamp and
wall clock time of the most recently generated RTP packet in the stream.
-
The number of packets
sent in the stream.
-
The number of bytes
sent in the stream.
Sender reports
can be used to synchronize different media streams within an RTP session.
For example, consider a videoconferencing application for which each sender
generates two independent RTP streams, one for video and one for audio.
The timestamps in these RTP packets are tied to the video and audio sampling
clocks, and are not tied to the wall clock time (that is, to real
time). Each RTCP sender report contains, for the most recently generated
packet in the associated RTP stream, the timestamp of the RTP packet and
the wall clock time for when the packet was created. Thus the RTCP sender
report packets associate the sampling clock to the real-time clock. Receivers
can use this association in RTCP sender reports to synchronize the playout
of audio and video.
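The synchronization idea can be made concrete with a small sketch: given the (RTP timestamp, wall-clock time) pairing from the most recent sender report, a receiver can place any packet's timestamp on the wall-clock axis. The function name and numeric values below are illustrative.

```python
def rtp_ts_to_wallclock(ts, sr_rtp_ts, sr_wallclock, clock_rate_hz):
    """Map an RTP timestamp to wall-clock seconds, using the
    (sr_rtp_ts, sr_wallclock) pairing from an RTCP sender report."""
    return sr_wallclock + (ts - sr_rtp_ts) / clock_rate_hz

# An audio packet (8 kHz clock) 160 samples after the sender-report
# reference point plays 20 msec later on the wall-clock axis.
audio_t = rtp_ts_to_wallclock(1760, sr_rtp_ts=1600,
                              sr_wallclock=100.0, clock_rate_hz=8000)
print(audio_t)   # 100.02
```

Applying the same mapping to the video stream (with its own sender-report pairing and a 90 kHz clock) puts both media on one time axis, which is exactly what synchronized playout needs.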
For each RTP
stream that a sender is transmitting, the sender also creates and transmits
source description packets. These packets contain information about the
source, such as e-mail address of the sender, the sender's name, and the
application that generates the RTP stream. It also includes the SSRC of
the associated RTP stream. These packets provide a mapping between the
source identifier (that is, the SSRC) and the user/host name.
RTCP packets
are stackable, that is, receiver reception reports, sender reports, and
source descriptors can be concatenated into a single packet. The resulting
packet is then encapsulated into a UDP segment and forwarded into the multicast
tree.
RTCP Bandwidth
Scaling
The astute reader
will have observed that RTCP has a potential scaling problem. Consider,
for example, an RTP session that consists of one sender and a large number
of receivers. If each of the receivers periodically generates RTCP packets,
then the aggregate transmission rate of RTCP packets can greatly exceed
the rate of RTP packets sent by the sender. Observe that the amount of
RTP traffic sent into the multicast tree does not change as the number
of receivers increases, whereas the amount of RTCP traffic grows linearly
with the number of receivers. To solve this scaling problem, RTCP modifies
the rate at which a participant sends RTCP packets into the multicast tree
as a function of the number of participants in the session. Also, since
each participant sends control packets to everyone else, each participant
can estimate the total number of participants in the session [Friedman
1999].
RTCP attempts
to limit its traffic to 5% of the session bandwidth. For example, suppose
there is one sender, which is sending video at a rate of 2 Mbps. Then RTCP
attempts to limit its traffic to 5% of 2 Mbps, or 100 Kbps, as follows.
The protocol gives 75% of this rate, or 75 Kbps, to the receivers; it gives
the remaining 25% of the rate, or 25 Kbps, to the sender. The 75 Kbps devoted
to the receivers is equally shared among the receivers. Thus, if there
are R receivers, then each receiver gets to send RTCP traffic at
a rate of 75/R Kbps and the sender gets to send RTCP traffic at
a rate of 25 Kbps. A participant (a sender or receiver) determines the
RTCP packet transmission period by dynamically calculating the average
RTCP packet size (across the entire session) and dividing the average RTCP
packet size by its allocated rate. In summary, the period for transmitting
RTCP packets for a sender is
T = (number of senders) × (average RTCP packet size) / (0.25 × 0.05 × session bandwidth)
And the period
for transmitting RTCP packets for a receiver is
T = (number of receivers) × (average RTCP packet size) / (0.75 × 0.05 × session bandwidth)
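This bandwidth-sharing rule can be checked numerically with the example from the text (one sender at 2 Mbps); the function name and the 125-byte average RTCP packet size are illustrative assumptions.

```python
def rtcp_period_sec(avg_packet_bytes, num_members, share, session_bw_bps):
    """RTCP transmission period: members x avg packet size, divided by
    the bandwidth allocated to this group (share of the 5% RTCP budget)."""
    allocated_bps = 0.05 * share * session_bw_bps
    return num_members * (avg_packet_bytes * 8) / allocated_bps

# One sender, 2 Mbps session: RTCP budget is 100 Kbps, of which the
# sender gets 25% (25 Kbps) and the receivers share 75% (75 Kbps).
sender_T = rtcp_period_sec(avg_packet_bytes=125, num_members=1,
                           share=0.25, session_bw_bps=2_000_000)
print(sender_T)   # 0.04 sec: 1,000 bits / 25,000 bps

recv_T = rtcp_period_sec(avg_packet_bytes=125, num_members=200,
                         share=0.75, session_bw_bps=2_000_000)
print(recv_T)     # with 200 receivers, each waits about 2.7 sec
```

Notice that the receiver period grows linearly with the number of receivers, which is precisely how RTCP keeps its aggregate traffic bounded as the session scales.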
6.4.4: H.323
H.323 is a standard
for real-time audio and video conferencing among end systems on the Internet.
As shown in Figure 6.13, the standard also covers how end systems attached
to the Internet communicate with telephones attached to ordinary circuit-switched
telephone networks. In principle, if manufacturers of Internet telephony
and video conferencing all conform to H.323, then all their products should
be able to interoperate, and should be able to communicate with ordinary
telephones. We discuss H.323 in this section, as it provides an application
context for RTP. Indeed, we'll see below that RTP is an integral part of
the H.323 standard.
Figure 6.13:
H.323 end systems attached to the Internet can communicate with telephones
attached to a circuit-switched telephone network
H.323 end
points (terminals) can be standalone devices (for example, Web phones
and Web TVs) or applications in a PC (for example, Internet phone or video
conferencing software). H.323 equipment also includes gateways and
gatekeepers. Gateways permit communication among H.323 end points
and ordinary telephones in a circuit-switched telephone network. Gatekeepers,
which are optional, provide address translation, authorization, bandwidth
management, accounting, and billing. We will discuss gatekeepers in more
detail at the end of this section.
The H.323 standard
is an umbrella specification that includes:
-
A specification
for how endpoints negotiate common audio/video encodings. Because H.323
supports a variety of audio and video encoding standards, a protocol is
needed to allow the communicating endpoints to agree on a common encoding.
-
A specification
for how audio and video chunks are encapsulated and sent over the network.
As you may have guessed, this is where RTP comes into the picture.
-
A specification
for how endpoints communicate with their respective gatekeepers.
-
A specification
for how Internet phones communicate through a gateway with ordinary phones
in the public circuit-switched telephone network.
Figure 6.14 shows
the H.323 protocol architecture.
Figure 6.14:
H.323 protocol architecture
Minimally, each
H.323 endpoint must support the G.711 speech compression standard.
G.711 uses PCM to generate digitized speech at either 56 Kbps or 64 Kbps.
Although H.323 requires every endpoint to be voice capable (through G.711),
video capabilities are optional. Because video support is optional, manufacturers
of terminals can sell simpler speech terminals as well as more complex
terminals that support both audio and video.
As shown in
Figure 6.14, H.323 also requires that all H.323 end points use the following
protocols:
-
RTP. The
sending side of an endpoint encapsulates all media chunks within RTP packets.
The sending side then passes the RTP packets to UDP.
-
H.245. An
"out-of-band" control protocol for controlling media between H.323 endpoints.
This protocol is used to negotiate a common audio or video compression
standard that will be employed by all the participating endpoints in a
session.
-
Q.931. A
signaling protocol for establishing and terminating calls. This protocol
provides traditional telephone functionality (for example, dial tones and
ringing) to H.323 endpoints and equipment.
-
RAS (Registration/Admission/Status)
channel protocol. A protocol that allows end points to communicate
with a gatekeeper (if a gatekeeper is present).
Audio and Video
Compression
The H.323 standard
supports a specific set of audio and video compression techniques. Let's
first consider audio. As we just mentioned, all H.323 end points must support
the G.711 speech encoding standard. Because of this requirement, two H.323
end points will always be able to default to G.711 and communicate. But
H.323 allows terminals to support a variety of other speech compression
standards, including G.723.1, G.722, G.728, and G.729. Many of these standards
compress speech to rates that are suitable for 28.8 Kbps dial-up modems.
For example, G.723.1 compresses speech to either 5.3 Kbps or 6.3 Kbps,
with sound quality that is comparable to G.711.
As we mentioned
earlier, video capabilities for an H.323 endpoint are optional. However,
if an endpoint does support video, then it must (at the very least) support
the QCIF H.261 (176 × 144 pixels) video standard. A video-capable endpoint
may optionally support other H.261 schemes, including CIF, 4CIF, 16CIF,
and the H.263 standard. As the H.323 standard evolves, it will likely support
a longer list of audio and video compression schemes.
H.323 Channels
When an end
point participates in an H.323 session, it maintains several channels,
as shown in Figure 6.15. Examining Figure 6.15, we see that an end point
can support many simultaneous RTP media channels. For each media type,
there will typically be one send media channel and one receive media channel;
thus, if audio and video are sent in separate RTP streams, there will typically
be four media channels. Accompanying the RTP media channels, there is one
RTCP media control channel, as discussed in Section 6.4.3. All of the RTP
and RTCP channels run over UDP. In addition to the RTP/RTCP channels, two
other channels are required: the call control channel and the call signaling
channel. The H.245 call control channel is a TCP connection that carries
H.245 control messages. Its principal tasks are (1) opening and closing
media channels, and (2) capability exchange, that is, before sending media,
endpoints agree on an encoding algorithm. H.245, being a control protocol
for real-time interactive applications, is analogous to RTSP, the control
protocol for streaming of stored multimedia that we studied in Section
6.2.3. Finally, the Q.931 call signaling channel provides classical telephone
functionality, such as dial tone and ringing.
Figure 6.15:
H.323 channels
Gatekeepers
The gatekeeper
is an optional H.323 device. Each gatekeeper is responsible for an H.323
zone. A typical deployment scenario is shown in Figure 6.16. In this scenario,
the H.323 terminals and the gatekeeper are all attached to the same LAN,
and the H.323 zone is the LAN itself. If a zone has a gatekeeper, then
all H.323 terminals in the zone are required to communicate with it using
the RAS protocol, which runs over UDP. Address translation is one of the
more important gatekeeper services. Each terminal can have an alias address,
such as the name of the person at the terminal, the e-mail address of the
person at the terminal, and so on. The gatekeeper translates these alias addresses
to IP addresses. This address translation service is similar to the DNS
service, covered in Section 2.5. Another gatekeeper service is bandwidth
management: The gatekeeper can limit the number of simultaneous real-time
conferences in order to save some bandwidth for other applications running
over the LAN. Optionally, H.323 calls can be routed through the gatekeeper,
which is useful for billing.
Figure 6.16:
H.323 terminals and gatekeeper on the same LAN
The H.323 terminal
must register itself with the gatekeeper in its zone. When the H.323 application
is invoked at the terminal, the terminal uses RAS to send its IP address
and alias (provided by the user) to the gatekeeper. If the gatekeeper is present
in a zone, each terminal in the zone must contact the gatekeeper to ask
permission to make a call. Once it has permission, the terminal can send
the gatekeeper an e-mail address, alias string, or phone extension for
the terminal it wants to call, which may be in another zone. If necessary,
a gatekeeper will poll other gatekeepers in other zones to resolve an IP
address.
An excellent
tutorial on H.323 is provided by [WebProForum
1999]. The reader is also encouraged to see [Rosenberg
1999] for an alternative architecture to H.323 for providing telephone
service in the Internet.