In recent years,
audio/video streaming has become a popular application and a major consumer
of network bandwidth. This trend is likely to continue for several reasons.
First, the cost of disk storage is decreasing rapidly, even faster than
processing and bandwidth costs. Cheap storage will lead to a significant
increase in the amount of stored audio/video in the Internet. For example,
the sharing of MP3 audio files of rock music via [Napster
2000] has become incredibly popular among college and high school students.
Second, improvements in Internet infrastructure, such as high-speed residential
access (that is, cable modems and ADSL, as discussed in Chapter 1), network
caching of video (see Section 2.2), and new QoS-oriented Internet protocols
(see Sections 6.5-6.9) will greatly facilitate the distribution of stored
audio and video. And third, there is an enormous pent-up demand for high-quality
video streaming, an application that combines two existing killer communication
technologies--television and the on-demand Web.
In audio/video
streaming, clients request compressed audio/video files that are resident
on servers. As we'll discuss in this section, these servers can be "ordinary"
Web servers, or can be special streaming servers tailored for the audio/video
streaming application. Upon client request, the server directs an audio/video
file to the client by sending the file into a socket. Both TCP and UDP
socket connections are used in practice. Before sending the audio/video
file into the network, the file is segmented, and the segments are typically
encapsulated with special headers appropriate for audio/video traffic.
The Real-time Transport Protocol (RTP), discussed in Section 6.4, is a public-domain
standard for encapsulating such segments. Once the client begins to receive
the requested audio/video file, the client typically begins to render the file
within a few seconds. Most existing products also provide for user interactivity,
for example, pause/resume and temporal jumps within the audio/video file.
This user interactivity also requires a protocol for client/server interaction.
The Real-Time Streaming Protocol (RTSP), discussed at the end of this
section, is a public-domain protocol for providing user interactivity.
Audio/video
streaming is often requested by users through a Web client (that is, browser).
But because audio/video playout is not integrated directly in today's Web
clients, a separate helper application is required for playing out
the audio/video. The helper application is often called a media player,
the most popular of which are currently RealNetworks' RealPlayer and the
Microsoft Windows Media Player. The media player performs several functions,
including:
-
Decompression.
Audio/video is almost always compressed to save disk storage and network
bandwidth. A media player must decompress the audio/video on the fly during
playout.
-
Jitter removal.
Packet jitter is the variability of source-to-destination delays of packets
within the same packet stream. Since audio and video must be played out
with the same timing with which they were recorded, a receiver will buffer
received packets for a short period of time to remove this jitter. We'll
examine this topic in detail in Section 6.3 (see also the playout-buffer sketch following this list).
-
Error correction.
Due to unpredictable congestion in the Internet, a fraction of packets
in the packet stream can be lost. If this fraction becomes too large, user-perceived
audio/video quality becomes unacceptable. To this end, many streaming systems
attempt to recover from losses by (1) reconstructing lost packets
through the transmission of redundant packets, (2) having the client
explicitly request retransmissions of lost packets, or (3) masking the loss by
interpolating the missing data from the received data.
-
Graphical user
interface with control knobs. This is the actual interface that the
user interacts with. It typically includes volume controls, pause/resume
buttons, sliders for making temporal jumps in the audio/video stream, and
so on.
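To make the jitter-removal function more concrete, here is a minimal sketch in Python of a client-side playout buffer. It relies on hypothetical hooks: the network code calls on_packet_arrival() for each received packet, and the audio device calls next_chunk_to_play() at the media's encoded rate; the 4-second prefetch threshold is likewise an illustrative choice, not a recommendation.

from collections import deque

PLAYOUT_DELAY = 4.0    # seconds of media to prefetch before playout begins
buffer = deque()       # holds (media_timestamp, chunk) pairs in arrival order
playout_started = False

def on_packet_arrival(media_timestamp, chunk):
    """Hypothetical hook called by the network code for each received packet."""
    global playout_started
    buffer.append((media_timestamp, chunk))
    # Begin draining only once PLAYOUT_DELAY seconds of media are buffered.
    if not playout_started and buffer[-1][0] - buffer[0][0] >= PLAYOUT_DELAY:
        playout_started = True

def next_chunk_to_play():
    """Hypothetical hook called by the audio device at the media's encoded rate."""
    if playout_started and buffer:
        return buffer.popleft()[1]
    return None    # buffer ran dry: playout pauses until more data arrives

By starting playout a few seconds late, packets that are delayed (but not lost) still arrive in time to be played; Section 6.3 examines the choice of playout delay in detail.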
Plug-ins may be
used to embed the user interface of the media player within the window
of the Web browser. For such embeddings, the browser reserves screen space
on the current Web page, and it is up to the media player to manage the
screen space. But whether it appears in a separate window or within the browser
window (as a plug-in), the media player is a program that executes
separately from the browser.
6.2.1: Accessing
Audio and Video from a Web Server
Stored audio/video
can reside either on a Web server that delivers the audio/video to the
client over HTTP, or on an audio/video streaming server that delivers the
audio/video over non-HTTP protocols (protocols that can be either proprietary
or open standards). In this subsection, we examine delivery of audio/video
from a Web server; in the next subsection, we examine delivery from a streaming
server.
Consider first
the case of audio streaming. When an audio file resides on a Web server,
the audio file is an ordinary object in the server's file system, just
as are HTML and JPEG files. When a user wants to hear the audio file, the
user's host establishes a TCP connection with the Web server and sends
an HTTP request for the object (see Section 2.2). Upon receiving a request,
the Web server bundles the audio file in an HTTP response message and sends
the response message back into the TCP connection. The case of video can
be a little trickier, because the audio and video parts of the "video"
may be stored in two different files, that is, they may be two different
objects in the Web server's file system. In this case, two separate HTTP
requests are sent to the server (over two separate TCP connections for
HTTP/1.0), and the audio and video files arrive at the client in parallel.
It is up to the client to manage the synchronization of the two streams.
It is also possible that the audio and video are interleaved in the same
file, so that only one object need be sent to the client. To keep our discussion
simple, for the case of "video" we assume that the audio and video are
contained in one file.
A naive architecture
for audio/video streaming is shown in Figure 6.1. In this architecture:
Figure 6.1:
A naive implementation for audio streaming
-
The browser process
establishes a TCP connection with the Web server and requests the audio/video
file with an HTTP request message.
-
The Web server
sends to the browser the audio/video file in an HTTP response message.
-
The content-type
header line in the HTTP response message indicates a specific audio/video
encoding. The client browser examines the content-type of the response
message, launches the associated media player, and passes the file to the
media player.
-
The media player
then renders the audio/video file.
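As a rough illustration of these four steps, the sketch below (in Python; the URL, file suffix, and media player command are hypothetical) downloads the entire audio/video object over HTTP, inspects its content type, and only then hands it to the helper application. This is precisely the behavior that causes the startup delay discussed next.

import subprocess, tempfile, urllib.request

# Steps 1-2: the browser fetches the complete audio/video object over HTTP.
with urllib.request.urlopen("http://www.example.com/song.au") as resp:   # hypothetical URL
    content_type = resp.headers.get("Content-Type")   # e.g., "audio/basic"
    body = resp.read()       # the entire file must arrive before we continue

# Steps 3-4: save the object and launch the media player associated with its type.
with tempfile.NamedTemporaryFile(delete=False, suffix=".au") as f:
    f.write(body)
    saved_path = f.name

subprocess.run(["mediaplayer", saved_path])   # hypothetical helper application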
Although this approach
is very simple, it has a major drawback: The media player (that is, the
helper application) must interact with the server through the intermediary
of a Web browser. This can lead to many problems. In particular, when the
browser is an intermediary, the entire object must be downloaded before
the browser passes the object to a helper application. The resulting delay
before playout can begin is typically unacceptable for audio/video clips
of moderate length. For this reason, audio/video streaming implementations
typically have the server send the audio/video file directly to the media
player process. In other words, a direct socket connection is made between
the server process and the media player process. As shown in Figure 6.2,
this is typically done by making use of a meta file, a file that
provides information (for example, URL, type of encoding) about the audio/video
file that is to be streamed.
Figure 6.2:
Web server sends audio/video directly to the media player
A direct TCP
connection between the server and the media player is obtained as follows:
-
The user clicks
on a hyperlink for an audio/video file.
-
The hyperlink does
not point directly to the audio/video file, but instead to a meta file.
The meta file contains the URL of the actual audio/video file. The HTTP
response message that encapsulates the meta file includes a content-type
header line that indicates the specific audio/video application.
-
The client browser
examines the content-type header line of the response message, launches
the associated media player, and passes the entire body of the response
message (that is, the meta file) to the media player.
-
The media player
sets up a TCP connection directly with the HTTP server. The media player
sends an HTTP request message for the audio/video file into the TCP connection.
-
The audio/video
file is sent within an HTTP response message to the media player. The media
player streams out the audio/video file.
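A minimal sketch of steps 4 and 5 from the media player's point of view might look as follows (in Python; the one-URL-per-line meta-file format and the decode_and_play() routine are hypothetical simplifications). The player reads the audio/video URL out of the meta file handed to it by the browser, contacts the server directly, and begins playout as data arrives rather than waiting for the entire file.

import urllib.request

def decode_and_play(chunk):
    """Stand-in for the decompression and playout stages (hypothetical)."""
    pass

def play_from_metafile(metafile_path):
    # The browser has already saved the meta file (step 3); assume it holds
    # the URL of the actual audio/video file on its first line.
    with open(metafile_path) as f:
        media_url = f.readline().strip()

    # Step 4: the media player opens its own connection to the HTTP server.
    with urllib.request.urlopen(media_url) as resp:
        # Step 5: read the response in small chunks and begin playout
        # immediately, instead of buffering the whole object first.
        while True:
            chunk = resp.read(4096)
            if not chunk:
                break
            decode_and_play(chunk)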
The importance
of the intermediate step of acquiring the meta file is clear. When the
browser sees the content-type for the file, it can launch the appropriate
media player, and thereby have the media player directly contact the server.
We have just
learned how a meta file can allow a media player to dialogue directly with
a Web server housing an audio/video file. Yet many companies that sell products
for audio/video streaming do not recommend the architecture we just described.
This is because the architecture has the media player communicate with
the server over HTTP and hence also over TCP. HTTP is often considered
insufficiently rich to allow for satisfactory user interaction with the
server; in particular, HTTP does not easily allow a user (through the media
player) to send pause/resume, fast-forward, and temporal jump commands
to the server.
6.2.2: Sending
Multimedia from a Streaming Server to a Helper Application
In order to get
around HTTP and/or TCP, audio/video can be stored on and sent from a streaming
server to the media player. This streaming server could be a proprietary
streaming server, such as those marketed by RealNetworks and Microsoft,
or could be a public-domain streaming server. With a streaming server,
audio/video can be sent over UDP (rather than TCP) using application-layer
protocols that may be better tailored than HTTP to audio/video streaming.
This architecture
requires two servers, as shown in Figure 6.3. One server, the HTTP server,
serves Web pages (including meta files). The second server, the streaming
server, serves the audio/video files. The two servers can run on the
same end system or on two distinct end systems. The steps for this architecture
are similar to those described in the previous architecture. However, now
the media player requests the file from a streaming server rather than
from a Web server, and now the media player and streaming server can interact
using their own protocols. These protocols can allow for rich user interaction
with the audio/video stream.
Figure 6.3:
Streaming from a streaming server to a media player
In the architecture
of Figure 6.3, there are many options for delivering the audio/video from
the streaming server to the media player. A partial list of the options
is given below:
-
The audio/video
is sent over UDP at a constant rate equal to the drain rate at the receiver
(which is the encoded rate of the audio/video). For example, if the audio
is compressed using GSM at a rate of 13 Kbps, then the server clocks out
the compressed audio file at 13 Kbps. As soon as the client receives compressed
audio/video from the network, it decompresses the audio/video and plays
it back. (A server-side sketch of this constant-rate pacing appears after this list.)
-
This is the same
as option 1, but the media player delays playout for 2-5 seconds in order
to eliminate network-induced jitter. The client accomplishes this task
by placing the compressed media that it receives from the network into
a client buffer, as shown in Figure 6.4. Once the client has "prefetched"
a few seconds of the media, it begins to drain the buffer. For this and
the previous option, the fill rate x(t) is equal to the drain
rate d, except when there is packet loss, in which case x(t)
is momentarily less than d.
Figure 6.4:
Client buffer being filled at rate x(t) and drained at rate d
-
The media is sent
over TCP. The server pushes the media file into the TCP socket as quickly
as it can; the client (i.e., media player) reads from the TCP socket as
quickly as it can, and places the compressed video into the media player
buffer. After an initial 2-5 second delay, the media player reads from
its buffer at a rate d and forwards the compressed media to decompression
and playback. Because TCP retransmits lost packets, it has the potential
to provide better sound quality than UDP. On the other hand, the fill rate
x(t) now fluctuates with time due to TCP congestion control
and window flow control. In fact, after packet loss, TCP congestion control
may reduce the instantaneous rate to less than d for long periods
of time. This can empty the client buffer and introduce undesirable pauses
into the output of the audio/video stream at the client.
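The first option above amounts to a timed loop on the server. The following server-side sketch (in Python; the client address, port, and file name are hypothetical) clocks out a 13 Kbps GSM audio file over a UDP socket at its encoded rate.

import socket, time

ENCODED_RATE_BPS = 13_000     # 13 Kbps GSM audio
CHUNK_SECONDS = 0.5           # send half a second of audio per datagram
CHUNK_BYTES = int(ENCODED_RATE_BPS / 8 * CHUNK_SECONDS)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_addr = ("192.0.2.10", 5004)          # hypothetical client address and port

with open("twister_audio.gsm", "rb") as f:  # hypothetical compressed audio file
    while True:
        chunk = f.read(CHUNK_BYTES)
        if not chunk:
            break
        sock.sendto(chunk, client_addr)
        time.sleep(CHUNK_SECONDS)           # pace transmission at the drain rate d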
For the
third option, the behavior of x(t) will very much depend
on the size of the client buffer (which is not to be confused with the
TCP receive buffer). If this buffer is large enough to hold all of the
media file (possibly within disk storage), then TCP will make use of all
the instantaneous bandwidth available to the connection, so that x(t)
can become much larger than d. If x(t) becomes much
larger than d for long periods of time, then a large portion of
media is prefetched into the client, and subsequent client starvation is
unlikely. If, on the other hand, the client buffer is small, then x(t)
will fluctuate around the drain rate d. Risk of client starvation
is much larger in this case.
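A rough client-side sketch of the third (TCP) option is given below, assuming a hypothetical server address and request format. The two decoupled activities are visible: the fill thread reads from the TCP socket as fast as TCP delivers data, so the fill rate x(t) is whatever congestion and flow control currently permit, while the drain loop removes data at (roughly) the constant rate d after an initial prefetch delay.

import socket, threading, time

DRAIN_RATE_BPS = 13_000    # encoded rate d
INITIAL_DELAY = 3.0        # seconds of prefetch before playout begins
buffer = bytearray()
lock = threading.Lock()

def fill():
    # Fill rate x(t) is dictated by TCP congestion and flow control.
    s = socket.create_connection(("streaming.example.com", 8554))   # hypothetical server
    s.sendall(b"GET /twister HTTP/1.0\r\n\r\n")                     # hypothetical request
    while True:
        data = s.recv(65536)
        if not data:
            break
        with lock:
            buffer.extend(data)

def drain():
    time.sleep(INITIAL_DELAY)                   # let the buffer prefetch first
    bytes_per_tick = int(DRAIN_RATE_BPS / 8 * 0.1)
    while True:
        with lock:
            chunk = bytes(buffer[:bytes_per_tick])
            del buffer[:bytes_per_tick]
        # An empty chunk means the buffer ran dry: playout pauses (starvation).
        if chunk:
            pass                                # hand the chunk to the decoder (omitted)
        time.sleep(0.1)                         # drain at roughly the constant rate d

threading.Thread(target=fill, daemon=True).start()
drain()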
6.2.3: Real-Time
Streaming Protocol (RTSP)
Many Internet multimedia
users (particularly those who grew up with a TV remote control in hand)
will want to control the playback of continuous media by pausing
playback, repositioning playback to a future or past point in time, visual
fast-forwarding playback, visual rewinding playback, and so on. This functionality
is similar to what a user has with a VCR when watching a video cassette
or with a CD player when listening to a music CD. To allow a user to control
playback, the media player and server need a protocol for exchanging playback
control information. RTSP, defined in RFC 2326, is such a protocol.
But before getting
into the details of RTSP, let us first indicate what RTSP does not do:
-
RTSP does not define
compression schemes for audio and video.
-
RTSP does not define
how audio and video is encapsulated in packets for transmission over a
network; encapsulation for streaming media can be provided by RTP or by
a proprietary protocol. (RTP is discussed in Section 6.4.) For example,
RealNetworks' G2 server and player use RTSP to send control information to
each other. But the media stream itself can be encapsulated in RTP packets
or in some proprietary data format.
-
RTSP does not restrict
how streamed media is transported; it can be transported over UDP or TCP.
-
RTSP does not restrict
how the media player buffers the audio/video. The audio/video can be played
out as soon as it begins to arrive at the client, it can be played out
after a delay of a few seconds, or it can be downloaded in its entirety
before playout.
So if RTSP doesn't
do any of the above, what does RTSP do? RTSP is a protocol that allows
a media player to control the transmission of a media stream. As mentioned
above, control actions include pause/resume, repositioning of playback,
fast forward and rewind. RTSP is a so-called out-of-band protocol.
In particular, the RTSP messages are sent out-of-band, whereas the media
stream, whose packet structure is not defined by RTSP, is considered "in-band."
RTSP messages use a different port number, 554, than the media stream.
The RTSP specification [RFC
2326] permits RTSP messages to be sent over either TCP or UDP.
Recall from
Section 2.3 that the File Transfer Protocol (FTP) also uses the out-of-band
notion. In particular, FTP uses two client/server pairs of sockets, each
pair with its own port number: one client/server socket pair supports a
TCP connection that transports control information; the other client/server
socket pair supports a TCP connection that actually transports the file.
The RTSP channel is in many ways similar to FTP's control channel.
Let us now walk
through a simple RTSP example, which is illustrated in Figure 6.5. The
Web browser first requests a presentation description file from a Web server.
The presentation description file can have references to several continuous-media
files as well as directives for synchronization of the continuous-media
files. Each reference to a continuous-media file begins with the URL scheme
rtsp://.
Figure 6.5:
Interaction between client and server using RTSP
Below we provide
a sample presentation file that has been adapted from [Schulzrinne
1997]. In this presentation, an audio stream and a video stream are played in
parallel and in lip sync (as part of the same "group"). For the audio stream,
the media player can choose ("switch") between two audio recordings, a
low-fidelity recording and a high-fidelity recording.
<title>Twister</title>
<session>
<group language=en lipsync>
<switch>
<track type=audio
e="PCMU/8000/1"
src="rtsp://audio.example.com/twister/audio.en/lofi">
<track type=audio
e="DVI4/16000/2" pt="90 DVI4/8000/1"
src="rtsp://audio.example.com/twister/audio.en/hifi">
</switch>
<track type="video/jpeg"
src="rtsp://video.example.com/twister/video">
</group>
</session>
The Web server
encapsulates the presentation description file in an HTTP response message
and sends the message to the browser. When the browser receives the HTTP
response message, the browser invokes a media player (that is, the helper
application) based on the content-type field of the message. The presentation
description file includes references to media streams, using the URL scheme
rtsp://, as shown in the above sample. As shown in Figure
6.5, the player and the server then send each other a series of RTSP messages.
The player sends an RTSP SETUP request, and the server sends an RTSP SETUP
response. The player sends an RTSP PLAY request, say, for low-fidelity
audio, and the server sends an RTSP PLAY response. At this point, the streaming
server pumps the low-fidelity audio into its own in-band channel. Later,
the media player sends an RTSP PAUSE request, and the server responds with
an RTSP PAUSE response. When the user is finished, the media player sends
an RTSP TEARDOWN request, and the server responds with an RTSP TEARDOWN
response.
Each RTSP session
has a session identifier, which is chosen by the server. The client initiates
the session with the SETUP request, and the server responds to the request
with an identifier. The client repeats the session identifier for each
request, until the client closes the session with the TEARDOWN request.
The following is a simplified example of an RTSP session between a client
(C:) and a server (S:).
C: SETUP rtsp://audio.example.com/twister/audio RTSP/1.0
   Transport: rtp/udp; compression; port=3056; mode=PLAY
S: RTSP/1.0 200 1 OK
   Session: 4231
C: PLAY rtsp://audio.example.com/twister/audio.en/lofi RTSP/1.0
   Session: 4231
   Range: npt=0-
C: PAUSE rtsp://audio.example.com/twister/audio.en/lofi RTSP/1.0
   Session: 4231
   Range: npt=37
C: TEARDOWN rtsp://audio.example.com/twister/audio.en/lofi RTSP/1.0
   Session: 4231
S: 200 3 OK
Notice that in
this example, the player chose not to play back the complete presentation,
but instead only the low-fidelity portion of the presentation. The RTSP
protocol is actually capable of doing much more than described in this
brief introduction. In particular, RTSP has facilities that allow clients
to stream toward the server (for example, for recording). RTSP has been
adopted by RealNetworks, currently the industry leader in audio/video streaming.
Henning Schulzrinne makes available a Web page on RTSP [Schulzrinne
1999].
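To tie the RTSP exchange above to actual socket operations, here is a minimal client-side sketch in Python. It sends the control messages over a TCP connection to port 554 and simply prints the replies; the hard-coded session identifier, the CSeq handling, and the transport parameters are simplifying assumptions, and a real player would parse each response rather than assume its contents.

import socket

def rtsp_request(sock, request):
    sock.sendall(request.encode("ascii"))
    return sock.recv(4096).decode("ascii", "replace")   # naive: assumes one read suffices

# RTSP control messages travel out-of-band on their own TCP connection (port 554).
sock = socket.create_connection(("audio.example.com", 554))
url = "rtsp://audio.example.com/twister/audio.en/lofi"

print(rtsp_request(sock, f"SETUP {url} RTSP/1.0\r\nCSeq: 1\r\n"
                         "Transport: RTP/AVP;unicast;client_port=3056-3057\r\n\r\n"))
# A real player would extract the Session: header from the SETUP reply;
# here we assume (hypothetically) that the server returned "Session: 4231".
print(rtsp_request(sock, f"PLAY {url} RTSP/1.0\r\nCSeq: 2\r\n"
                         "Session: 4231\r\nRange: npt=0-\r\n\r\n"))
# ... media now arrives in-band (for example, as RTP over UDP) on the client ports ...
print(rtsp_request(sock, f"TEARDOWN {url} RTSP/1.0\r\nCSeq: 3\r\nSession: 4231\r\n\r\n"))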