Internet Engineering Task Force                 Don Hoffman
INTERNET-DRAFT                                  Gerard Fernando
                                                Sun Microsystems, Inc.

                                                Vivek Goyal
                                                University of Southern
                                                  California

                                                June, 1995
                                                Expires: December 1, 1995


               RTP Payload Format for MPEG1/MPEG2 Video

                          Status of this Memo

This document is an Internet-Draft.  Internet-Drafts are working documents of
the Internet Engineering Task Force (IETF), its areas, and its working
groups.  Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may
be updated, replaced, or obsoleted by other documents at any time.  It is
inappropriate to use Internet-Drafts as reference material or to cite them
other than as "work in progress."

To learn the current status of any Internet-Draft, please check the
"1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Distribution of this memo is unlimited.

                                Abstract

This draft describes a packetization scheme for MPEG video and audio
streams.  The scheme proposed can be used to transport such a video or audio
flow over the transport protocols supported by RTP.  Two approaches are
described. The first is designed to support maximum interoperability with
MPEG2 System environments.  The second is designed to maximize simplicity of
implementation and to provide maximum compatibility with other
RTP-encapsulated media streams and future conference control work of the IETF.


draft-ietf-avt-mpeg-00.txt                                      [Page 1]

INTERNET-DRAFT  RTP Payload Format for MPEG1/MPEG2 Video      June, 1995


0. What's Changed Since Last Version

        1) Redesign of MPEG Transport Stream encapsulation to use direct
           MPEG Transport Stream (MTS) encapsulation rather than Packetized
           Elementary Stream (PES) level encapsulation.

        2) Dropped header fields to provide macro-block state recovery
           in Elementary Stream (ES) encapsulation and added header fields to
           provide better recovery of picture and group-of-picture headers.

        3) Further specification of framing rules for ES encapsulation.

        4) Provide suggested recovery strategies for missing packets
           in the ES encapsulation.


1. Introduction

ISO/IEC JTC1/SC29 WG11 (also referred to as the MPEG committee) has defined
the MPEG1 standard (ISO/IEC 11172)[1] and the MPEG2 standard (ISO/IEC
13818)[2].

The MPEG1 specification is defined in three parts: System, Video and Audio.
It is designed primarily for CD-ROM-based applications, and is optimized for
approximately 1.5 Mbits/sec combined data rates. The video and audio portions
of the specification describe the basic format of the video or audio stream.
These formats define the Elementary Streams (ES).  The MPEG1 System
specification defines an encapsulation of the ES's that contains
Presentation Time Stamps (PTS), Decoding Time Stamps and System Clock
references, and performs multiplexing of MPEG1 compressed video and audio
ES's with user data.

The MPEG2 specification is structured in a similar way. However, it is not
restricted to CD-ROM applications. The MPEG2 System specification
defines two system stream formats:  the MPEG2 Transport Stream (MTS) and the
MPEG2 Program Stream (MPS).  The MTS is tailored for communicating or storing
one or more programs of MPEG2 compressed data and also other data in
relatively error-prone environments. The MPS is tailored for relatively
error-free environments.

We seek to achieve interoperability among 4 types of end-systems in the
following specification. The 4 types are:

        1. Transmitting Interworking Unit (TIU)

           Receives MPEG information from a native MTS system for
           distribution over packet networks using a native RTP-based system
           layer (such as an IP-based internetwork). Examples: real-time
           encoder, MTS satellite link to Internet, video server with
           MTS-encoded source material.

        2. Receiving Interworking Unit (RIU)

           Receives MPEG information in real time from an RTP-based network
           for forwarding to a native MTS environment. Examples:
           Internet-based video server to MTS-based cable distribution
           plant.

        3. Transmitting Internet End-System (TAES)

           Transmits MPEG information generated or stored within the internet
           end-system itself, or received from internet-based computer networks.
           Example: video server.


        4. Receiving Internet End-System (RAES)

           Receives MPEG information over an RTP-based internet for
           consumption at the internet end-system or forwarding to a
           traditional computer network.  Example: desktop PC or workstation
           viewing training video.

Each of the 2 types of transmitters must work with each of the 2 types of
receivers.  Because it is probable that the TAES, and certain that the RAES,
will be based on existing and planned internet-connected computers, it is
highly desirable for the interoperable protocol to be based on RTP.

Because of the range of applications that might employ MPEG streams, we
propose to define two profiles.

Much interest in the MPEG community is in the use of MTS, and hence, in
Section 2 we propose an encapsulation of MPEG2 Transport Stream with the
Real-time Transport Protocol (RTP) [3, 4]. This profile supports the full
semantics of MPEG System and offers basic interoperability among all four
end-system types.  MPEG1 System streams will not be supported in this
profile.

When operating only among internet-based end-systems (i.e., TAES and RAES) a
profile that provides greater compatibility with the Internet architecture is
desired, deferring some of the system issues to other protocols being defined
in the Internet community (such as the MMUSIC WG).  In Section 3 we propose
an encapsulation of compressed video and audio data (referred to in MPEG
documentation as "Elementary Streams" (ES)) complying with either MPEG1 or
MPEG2. Here, neither the MPEG1 nor the MPEG2 System standard is used.
The ES's are directly encapsulated with RTP.


Throughout this specification, we make extensive use of MPEG terminology.
The reader should consult the primary MPEG references for definitive
descriptions of this terminology.


2. Encapsulation of MPEG2 Transport Streams

To avoid end system inefficiencies, data from multiple small MTS packets
(normally fixed in size at 188 bytes) are aggregated into a single RTP
packet.

Each RTP packet will contain a timestamp derived from a 90 kHz clock
reference at the sender.  This clock may or may not be synchronized to the MTS
Program Clock Reference (PCR) and should represent when the packet is
presented to the RTP packetizer.  The RTP timestamp will not be passed to the
MPEG decoder.  This use of the timestamp differs somewhat from normal RTP
practice, in that it is not considered to be the media display or
presentation timestamp.  The primary purpose of the RTP timestamp will be
to estimate network-induced jitter and reduce it below the very
stringent levels required by many MPEG-compliant decoders.  In general, the
relationship between the MPEG Presentation Time Stamp (PTS) and this
RTP timestamp is complex and depends on the design and tolerances of
the MPEG encoder and decoder.  Consequently, the RTP timestamp should
only be used for inter-stream display synchronization in the context
of specific encoder/decoder implementations.

The RTP payload will contain an integral number of MPEG transport packets.
The number of transport packets contained is computed by dividing RTP payload
length by the length of an MTS packet (188).

Each RTP packet may contain a different number of MTS packets.  If an MTS
packet contains a non-zero payload_unit_start_indicator, it must begin a new
RTP packet.  A one in this field means that the MTS packet contains the
start of a new PES payload.
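
The aggregation rule above can be sketched as follows.  This is a minimal
illustration, not part of the specification: the function names and the
max_payload default of 1316 bytes (seven MTS packets) are assumptions chosen
for the example, and the payload_unit_start_indicator is read from bit 0x40
of the second byte of the MTS packet header.

```python
TS_PACKET_SIZE = 188  # fixed MTS packet length in bytes


def pusi(ts_packet: bytes) -> bool:
    """True if payload_unit_start_indicator is set in this MTS packet."""
    return bool(ts_packet[1] & 0x40)


def aggregate(ts_packets, max_payload=1316):
    """Group MTS packets into RTP payloads.

    max_payload is an assumed size limit, not a value from the draft.
    Returns a list of (marker_bit, payload_bytes) tuples.
    """
    groups = []
    current = []
    for pkt in ts_packets:
        if len(pkt) != TS_PACKET_SIZE:
            raise ValueError("MTS packets must be exactly 188 bytes")
        full = (len(current) + 1) * TS_PACKET_SIZE > max_payload
        # A packet that starts a new PES payload must begin a new RTP packet.
        if current and (pusi(pkt) or full):
            groups.append(current)
            current = []
        current.append(pkt)
    if current:
        groups.append(current)
    # M bit: 1 when the first MTS packet in the payload has the indicator set.
    return [(1 if pusi(g[0]) else 0, b"".join(g)) for g in groups]


def mts_packet_count(payload: bytes) -> int:
    """Number of MTS packets in an RTP payload (payload length / 188)."""
    if len(payload) % TS_PACKET_SIZE:
        raise ValueError("payload is not an integral number of MTS packets")
    return len(payload) // TS_PACKET_SIZE
```

A receiver would call mts_packet_count() on each payload and then split it
at 188-byte boundaries before handing the packets to the MTS demultiplexer.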


2.1 RTP header usage

The RTP header fields are used as follows:

        M bit:  Set to 1 when the MTS payload_unit_start_indicator
          of the first MTS packet in the RTP payload is non-zero.

        timestamp: 32-bit 90 kHz timestamp representing when the MTS payload
          was encapsulated in RTP.


3. Encapsulation of MPEG Elementary Streams

MPEG1 or MPEG2 Elementary Streams (ES) shall be encapsulated with RTP.
This encapsulation shall provide suitable timestamps, identification of
packet loss and other functions for transport of MPEG1 and MPEG2 streams
over IP networks.

Each encapsulated MPEG video or audio header shall be completely contained
within one packet.  Consequently, a minimum RTP payload size of 261 bytes
must be supported to contain the largest single header defined in the ES
(that is, the extension_data() header containing the quant_matrix_extension()).

Presentation Time Stamps (PTS) of 32 bits with a resolution of 90 kHz
are carried in the fixed RTP header. All packets that make up an
audio or video frame shall have the same time stamp.

The following ES types may be encapsulated:
        (a) MPEG1 Video (ISO/IEC 11172-2)
        (b) MPEG2 Video (ISO/IEC 13818-2)
        (c) MPEG1 Audio (ISO/IEC 11172-3)
        (d) MPEG2 Audio (ISO/IEC 13818-3)

A distinct payload type is assigned to MPEG1/MPEG2 Video and MPEG1/MPEG2
Audio, respectively. Further indication as to whether the data is MPEG1 or
MPEG2 need not be provided in the RTP or MPEG-specific headers of this
encapsulation, as this information is available in the ES headers.

3.1 MPEG Video elementary streams

MPEG1 Video can be distinguished from MPEG2 Video at the video sequence
header, i.e. for MPEG2 Video a sequence_header() is followed by
sequence_extension().  The particular profile and level of MPEG2 Video
(MAIN_Profile@MAIN_Level, HIGH_Profile@HIGH_Level, etc) are determined
by the profile_and_level_indicator field of the sequence_extension
header of MPEG2 Video.

Since MPEG pictures can be large, they will normally be fragmented into
packets of size less than a typical LAN/WAN MTU.  Each picture is made up of
one or more "slices," and a slice is intended to be the unit of recovery from
data loss or corruption. An MPEG-compliant decoder will normally advance to
the beginning of the next slice whenever an error is encountered in the MPEG
compressed video bit-stream.

The MPEG Video_Sequence_Header, when present, will always be at the beginning
of an RTP payload.  An MPEG GOP_header, when present, will always be at the
beginning of the RTP payload, or will follow a Video_Sequence_Header.  An
MPEG Picture_Header, when present, will always be at the beginning of an RTP
payload, or will follow a GOP_header.


An RTP-encapsulated MPEG video packet can carry one of the following slice
payload contents:

        1. EXACTLY ONE slice of the frame.
        2. MULTIPLE INTEGRAL slices.
        3. A FRAGMENT of a slice. This fragment could be the first fragment
           of the  slice, last fragment or any fragment in between.
        4. MULTIPLE INTEGRAL slices and the FIRST FRAGMENT of the
           following slice.

Specifically, the LAST FRAGMENT of a slice followed by any other data is
disallowed.
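
The four permitted layouts can be checked mechanically.  The following is a
minimal sketch under stated assumptions: the piece labels and the function
name are hypothetical, and layout 4 is read here as "one or more" integral
slices followed by a first fragment, which is an interpretation rather than
something the text states.

```python
# Hypothetical labels for the pieces of slice data placed in one RTP payload.
WHOLE, FIRST, MIDDLE, LAST = "whole", "first", "middle", "last"


def payload_is_allowed(pieces):
    """Check a list of piece labels against the four permitted layouts."""
    if not pieces:
        return False
    *head, tail = pieces
    if not head:
        # Layout 1 (one integral slice) or layout 3 (a single fragment).
        return tail in (WHOLE, FIRST, MIDDLE, LAST)
    # Layouts 2 and 4: integral slices, optionally ending in a first fragment.
    # Assumption: one integral slice plus a first fragment is also accepted.
    return all(p == WHOLE for p in head) and tail in (WHOLE, FIRST)
```

Under this check a payload such as [LAST, WHOLE] is rejected, which matches
the rule that the last fragment of a slice may not be followed by other data.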

An implementation based on this encapsulation assumes that the
Video_Sequence_Header is repeated periodically in the MPEG bit-stream. In
practice (though not required by the MPEG standard) this repetition is used to
allow channel switching and to allow a receiver to start decoding a
continuously relayed MPEG bit-stream at arbitrary points in the media stream.
It is suggested that, when playing back an MPEG stream from a file format
(where the Video_Sequence_Header may only be present at the beginning of the
stream), the packetizer save the first Video_Sequence_Header for periodic
injection into the network stream.

The MPEG bit-stream semantics were designed for relatively error-free
environments, and there is a significant amount of dependency (both temporal
and spatial) within the stream such that loss of some data makes other
uncorrupted data useless.  This encapsulation is designed to provide for some
limited set of recovery procedures. Appendix 1 suggests several recovery
strategies based on the redundant encoding of MPEG header information in
the RTP and MPEG-specific RTP headers.

3.2 MPEG Audio elementary streams

MPEG1 Audio can be distinguished from MPEG2 Audio from the MPEG ancillary_data()
header.  For either MPEG1 or MPEG2 Audio, distinct PTS's may be present for
frames which correspond to either 384 samples for Layer-I, or 1152 samples
for Layer-II or Layer-III.

Multiple frames may be encapsulated within one RTP packet.  Also, if
relatively short packets need to be used, one frame may be so large that it
may straddle multiple RTP packets.  For example, for Layer-II MPEG audio
sampled at a rate of 44.1 kHz each frame would represent a time slot of 26.1
msec. At this sampling rate, if the compressed bit-rate is 384 kbits/sec
(i.e., 48 kBytes/sec), then the average audio frame size would be 1.25 kBytes.
If packets were to be 500 bytes long, then each audio frame would straddle 3
RTP packets.  The audio fragmentation indicator header (see Section 3.5)
shall be present for an MPEG1/2 Audio payload type to provide for
this fragmentation.
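
The figures in the example above follow from the samples-per-frame values
given for each layer.  The sketch below reproduces the arithmetic; the
function names are illustrative, and the 2-byte audio-specific header is
ignored when counting packets.

```python
import math

# Samples represented by one audio frame, per MPEG audio layer.
SAMPLES_PER_FRAME = {1: 384, 2: 1152, 3: 1152}


def frame_duration_ms(layer, sample_rate_hz):
    """Time slot covered by one audio frame, in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME[layer] / sample_rate_hz


def average_frame_bytes(layer, sample_rate_hz, bit_rate_bps):
    """Average compressed frame size in bytes at the given bit rate."""
    return bit_rate_bps / 8 * SAMPLES_PER_FRAME[layer] / sample_rate_hz


def packets_per_frame(frame_bytes, packet_payload_bytes):
    """Number of RTP packets that one frame of this size straddles."""
    return math.ceil(frame_bytes / packet_payload_bytes)


# Layer-II at 44.1 kHz and 384 kbits/sec, carried in 500-byte packets.
duration = frame_duration_ms(2, 44100)            # about 26.1 msec
size = average_frame_bytes(2, 44100, 384000)      # about 1.25 kBytes
count = packets_per_frame(size, 500)              # 3 packets
```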


3.3 RTP Fixed Header for MPEG ES encapsulation

The RTP header fields are used as follows:

        M bit:  Set to 1 on packet containing MPEG frame end code.

        PT:  MPEG video or audio stream ID.

        timestamp: 32-bit 90 kHz timestamp representing presentation time
          of MPEG picture or audio frame.  Same for all packets that make up a
          picture or audio frame.  May not be monotonically increasing if B
          pictures are present in the stream.  For packets that contain only
          a video sequence and/or GOP header, the timestamp is that of the
          subsequent picture.

3.4 MPEG Video specific headers

This header shall be attached to each RTP packet after the RTP fixed header.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|S|B|E| P |         TR        |      MBZ      | | BFC | | FFC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                FBV     FFV

        U: Unused.  Must be set to zero in current format.

        S: Sequence-header-present (1 bit). Normally 0 and set to 1 at
           the occurrence of each MPEG sequence header.  Used to detect
           presence of sequence header in RTP packet.

        B: Beginning-of-slice (BS) (1 bit). Set when the start of the
           packet payload is a slice start code, or when a slice start code
           is preceded only by one or more of a Video_Sequence_Header,
           GOP_header and/or Picture_Header.

        E: End-of-slice (ES) (1 bit). Set when the last byte of the payload
           is the end of an MPEG slice.

        P: Picture-Type (2 bits). I (1), P (2), B (3) or D (4). This value
           is constant for each RTP packet of a given picture.

        TR: Temporal-Reference (10 bits). The temporal reference of the
            current picture within the current GOP. This value ranges from
            0-1023 and is constant for all RTP packets of a given
            picture.


        MBZ: Unused. Must be set to zero in current profile. This space is
             reserved for future use.

        FBV: full_pel_backward_vector
        BFC: backward_f_code
        FFV: full_pel_forward_vector
        FFC: forward_f_code
              These values are obtained from the most recent picture header
              and are constant for each RTP packet of a given picture. None
              of these values is used for I frames, and all four must be set
              to zero in the RTP header. For P frames only the last two
              values (FFV and FFC) are present, and FBV and BFC must be set
              to zero in the RTP header. For B frames all four values are
              present.
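
The 32-bit layout above can be packed and parsed as shown in the following
sketch.  The function names are illustrative only.  Bit positions are taken
from the diagram (bit 0 is the most significant bit).  Since the P field is
2 bits wide in this layout, the sketch accepts only values 0 through 3; a
picture-type value of 4 does not fit and is rejected here.

```python
import struct


def pack_video_header(s, b, e, p, tr, fbv=0, bfc=0, ffv=0, ffc=0):
    """Build the 4-byte MPEG video-specific header (U and MBZ are zero)."""
    if not 0 <= p <= 3:
        raise ValueError("picture type must fit in the 2-bit P field")
    if not 0 <= tr <= 1023:
        raise ValueError("temporal reference must fit in 10 bits")
    if not (0 <= bfc <= 7 and 0 <= ffc <= 7):
        raise ValueError("f_code values must fit in 3 bits")
    word = (
        int(bool(s)) << 30      # S: sequence-header-present
        | int(bool(b)) << 29    # B: beginning-of-slice
        | int(bool(e)) << 28    # E: end-of-slice
        | p << 26               # P: picture type
        | tr << 16              # TR: temporal reference
        | int(bool(fbv)) << 7   # FBV: full_pel_backward_vector
        | bfc << 4              # BFC: backward_f_code
        | int(bool(ffv)) << 3   # FFV: full_pel_forward_vector
        | ffc                   # FFC: forward_f_code
    )
    return struct.pack("!I", word)


def unpack_video_header(data):
    """Parse the first 4 bytes of the payload into a dict of field values."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "U": (word >> 31) & 0x1,
        "S": (word >> 30) & 0x1,
        "B": (word >> 29) & 0x1,
        "E": (word >> 28) & 0x1,
        "P": (word >> 26) & 0x3,
        "TR": (word >> 16) & 0x3FF,
        "MBZ": (word >> 8) & 0xFF,
        "FBV": (word >> 7) & 0x1,
        "BFC": (word >> 4) & 0x7,
        "FFV": (word >> 3) & 0x1,
        "FFC": word & 0x7,
    }
```

A receiver can use the parsed S, B, P and TR values directly in the recovery
strategies of Appendix 1 without inspecting the MPEG bit-stream itself.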


3.5 MPEG Audio specific headers

This header shall be attached to each RTP packet at the start of the payload
and after any RTP headers for an MPEG1/2 Audio payload type.

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|0|        Frag_offset        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        Frag_offset: Byte offset into the frame for the data
                     in this packet.
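
As an illustration of how Frag_offset might be used, the sketch below splits
one audio frame across several payloads and reassembles it.  The function
names and the max_data parameter are assumptions made for the example; the
only facts taken from the header definition are the two leading zero bits and
the 14-bit byte offset.

```python
import struct

MAX_FRAG_OFFSET = (1 << 14) - 1  # Frag_offset is a 14-bit field


def fragment_audio_frame(frame: bytes, max_data: int):
    """Split one audio frame into payloads, each led by the 2-byte header."""
    payloads = []
    for offset in range(0, len(frame), max_data):
        if offset > MAX_FRAG_OFFSET:
            raise ValueError("offset does not fit in Frag_offset")
        # The two high-order bits stay zero because offset < 2**14.
        header = struct.pack("!H", offset)
        payloads.append(header + frame[offset:offset + max_data])
    return payloads


def frag_offset(payload: bytes) -> int:
    """Extract Frag_offset from the first two bytes of a payload."""
    (value,) = struct.unpack("!H", payload[:2])
    return value & MAX_FRAG_OFFSET


def reassemble(payloads):
    """Rebuild a frame from its payloads, in whatever order they arrived."""
    frame = bytearray()
    for payload in sorted(payloads, key=frag_offset):
        if frag_offset(payload) != len(frame):
            raise ValueError("missing or overlapping fragment")
        frame += payload[2:]
    return bytes(frame)
```

Because each payload carries its own byte offset, a receiver can detect a
missing fragment from the gap in offsets as well as from the RTP sequence
number.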


Appendix 1. Error Recovery and Resynchronization Strategies.

The following error recovery and resynchronization strategies are intended
to be guidelines only.  A compliant receiver is free to employ alternative
(or no) strategies.

When initially decoding an RTP-encapsulated MPEG Elementary Stream, the
receiver may discard all packets until the Sequence-header-present bit is set
to 1.  At this point, sufficient state information is contained in the stream
to allow processing by an MPEG decoder.

Loss of packets containing the GOP_header and/or Picture_Header is detected
by an unexpected change in the Temporal-Reference and Picture-Type values.
Consider the following example GOP sequence:

        In display order: 0B 1B 2I 3B 4B 5P 6B 7B 8P GOP_HDR 0B ...
        In stream order:  2I 0B 1B 5P 3B 4B 8P 6B 7B GOP_HDR 2I ...

Consider also two counters:

        ref_pic_temp (Reference Picture (I,P) Temporal Reference)
        dep_pic_temp (Dependent Picture (B) Temporal Reference)

At each GOP beginning, set these counters to the temporal reference value of
the corresponding picture type. For our example GOP sequence, ref_pic_temp =
2 and dep_pic_temp = 0. Keep incrementing BOTH counters by unity with each
following picture. Ref_pic_temp should match the temporal references of
the I and P frames, and dep_pic_temp should match the temporal references
of the B frames.

    dep_pic_temp: -  0  1  2  3  4  5  6  7        8  9
In stream order:  2I 0B 1B 5P 3B 4B 8P 6B 7B GOP_H 2I 0B 1B ...
    ref_pic_temp: 2  3  4  5  6  7  8  9  10  ^    11
                  --------------------------  |    ^
                             Match            Drop |
                                                   Mismatch
                                                    in ref_pic_temp

The loss of a GOP header can be detected by matching the appropriate counter
(based on picture type) to the temporal reference value. A mismatch indicates
a lost GOP header. If desired, a GOP header can be re-constructed using a
"null" time_code, repeating the closed_gop flag from previous GOP headers,
and setting the broken_link flag to 1.
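
The counter scheme can be sketched as follows.  This is a literal rendering
of the description above, exercised only on the example GOP sequence; the
function name and the input representation (a list of received items, where
"GOP" marks a GOP header that actually arrived) are assumptions made for the
illustration.  Counters wrap at 1024 because Temporal-Reference is a 10-bit
field.

```python
TR_MODULUS = 1024  # Temporal-Reference is a 10-bit field


def find_tr_mismatches(stream):
    """Return indices of pictures whose TR disagrees with its counter.

    stream holds "GOP" for each received GOP header and (type, tr) tuples
    for pictures in stream order, with type one of "I", "P" or "B".
    """
    ref_pic_temp = None  # expected TR for I and P pictures
    dep_pic_temp = None  # expected TR for B pictures
    mismatches = []
    for index, item in enumerate(stream):
        if item == "GOP":
            # Counters are re-seeded by the first pictures of the new GOP.
            ref_pic_temp = dep_pic_temp = None
            continue
        picture_type, tr = item
        # Both counters advance by one with every picture once seeded.
        if ref_pic_temp is not None:
            ref_pic_temp = (ref_pic_temp + 1) % TR_MODULUS
        if dep_pic_temp is not None:
            dep_pic_temp = (dep_pic_temp + 1) % TR_MODULUS
        if picture_type in ("I", "P"):
            if ref_pic_temp is None:
                ref_pic_temp = tr
            elif ref_pic_temp != tr:
                mismatches.append(index)
                # Treat the mismatch as the start of a new GOP and resync.
                ref_pic_temp, dep_pic_temp = tr, None
        else:
            if dep_pic_temp is None:
                dep_pic_temp = tr
            elif dep_pic_temp != tr:
                mismatches.append(index)
                dep_pic_temp = tr
    return mismatches
```

For the example sequence with the second GOP header dropped, the only
mismatch reported is at the second "2I" picture, where ref_pic_temp has
reached 11.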

The loss of a Picture_Header can also be detected by a mismatch between the
Temporal Reference contained in the RTP packet and the appropriate
dep_pic_temp or ref_pic_temp counter at the receiver.  After scanning to the
next Beginning-of-slice, the Picture_Header is reconstructed from the P, TR,
FBV, BFC, FFV and FFC contained in that packet, and from stream-dependent
default values.

Any time an RTP packet is lost (as indicated by a gap in the RTP sequence
number), the receiver may discard all packets until the Beginning-of-slice
bit is set.  At this point, sufficient state information is contained in the
stream to allow processing by an MPEG decoder starting at the next slice
boundary (possibly after reconstruction of the GOP_header and/or
Picture_Header as described above).
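
The two discard rules in this appendix (wait for the Sequence-header-present
bit at start-up, and wait for the Beginning-of-slice bit after a sequence
number gap) can be combined in a simple receive filter.  The sketch below is
one possible arrangement, not a required behavior; the function name and the
tuple representation of a packet are assumptions made for the example.

```python
def accepted_sequence_numbers(packets):
    """Return the RTP sequence numbers a receiver would pass to the decoder.

    packets is an iterable of (seq, s_bit, b_bit) tuples in arrival order,
    where seq is the 16-bit RTP sequence number.
    """
    accepted = []
    expected = None
    waiting_for_sequence_header = True
    waiting_for_slice_start = False
    for seq, s_bit, b_bit in packets:
        gap = expected is not None and seq != expected
        expected = (seq + 1) & 0xFFFF  # sequence numbers wrap at 2**16
        if waiting_for_sequence_header:
            # Start-up: discard everything until a sequence header arrives.
            if not s_bit:
                continue
            waiting_for_sequence_header = False
            waiting_for_slice_start = False
        elif gap:
            # A packet was lost: resume only at the next slice boundary.
            waiting_for_slice_start = True
        if waiting_for_slice_start:
            if not b_bit:
                continue
            waiting_for_slice_start = False
        accepted.append(seq)
    return accepted
```

Header reconstruction, as described earlier in this appendix, would take
place at the point where the filter resumes after a gap.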


References:

[1] ISO/IEC International Standard 11172; "Coding of moving pictures and
    associated audio for digital storage media up to about 1,5 Mbits/s",
    November 1993.

[2] ISO/IEC International Standard 13818; "Generic coding of moving pictures
    and associated audio information", November 1994.

[3] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson,
    "RTP: A Transport Protocol for Real-Time Applications",
    Internet Draft, March 21st, 1995

[4] H. Schulzrinne, "RTP Profile for Audio and Video Conferences
    with Minimal Control", Internet Draft, March 24th, 1995


Authors' Addresses:

        Gerard Fernando
        Sun Microsystems, Inc.
        Mail-stop UMPK14-305
        2550 Garcia Avenue
        Mountain View, California 94043-1100
        USA
        phone: +1 415-786-6373
        email: gerard.fernando@eng.sun.com

        Vivek Goyal
        Computer Science Department
        University of Southern California
        941 W. 37th Place
        Los Angeles, CA 90089-0781
        USA
        phone: +1 213-740-7287
        e-mail: goyal@usc.edu

        Don Hoffman
        Sun Microsystems, Inc.
        Mail-stop UMPK14-305
        2550 Garcia Avenue
        Mountain View, California 94043-1100
        USA
        phone: +1 503-297-1580
        email: don.hoffman@eng.sun.com

