Internet Engineering Task Force                              Don Hoffman
INTERNET-DRAFT                                           Gerard Fernando
                                                  Sun Microsystems, Inc.
                                                             Vivek Goyal
                                       University of Southern California
                                                          November, 1994
                                                    Expires: May 1, 1995


                    RTP Encapsulation of MPEG1/MPEG2

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   To learn the current status of any Internet-Draft, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this memo is unlimited.

Abstract

   This draft describes a packetization scheme for MPEG video and audio
   streams. The scheme proposed can be used to transport such a video
   or audio flow over the transport protocols supported by RTP. Two
   profiles are described. The first is designed to support maximum
   interoperability with MPEG2 System environments. The second is
   designed to maximize simplicity of implementation and to leverage
   other efforts within IETF.

draft-hoffman-rtp-mpeg-encap-01.txt                             [Page 1]

INTERNET-DRAFT      RTP Encapsulation of MPEG1/MPEG2       November 1994

1. Introduction

   ISO/IEC JTC1/SC29 WG11 (also referred to as the MPEG committee) has
   defined the MPEG1 standard (ISO/IEC 11172)[1] and the MPEG2 standard
   (ISO/IEC 13818)[2].

   The MPEG1 specification is defined in three parts: System, Video and
   Audio. It is designed primarily for CD-ROM-based applications, and
   is optimized for approximately 1.5 Mbits/sec combined data rates.
   The video and audio portions of the specification describe the basic
   format of the video or audio stream. These formats define the
   Elementary Streams (ES). The MPEG1 System specification defines an
   encapsulation of the ES's that contains Presentation Time Stamps
   (PTS), Decoding Time Stamps and System Clock references, and performs
   multiplexing of MPEG1 compressed video and audio ES's with user data.

   The MPEG2 specification is structured in a similar way. However, it
   is not restricted to CD-ROM applications. The MPEG2 System
   specification defines two system stream formats: the MPEG2 Transport
   Stream (MTS) and the MPEG2 Program Stream (MPS). The MTS is tailored
   for communicating or storing one or more programs of MPEG2 compressed
   data and also other data in relatively error-prone environments. The
   MPS is tailored for relatively error-free environments.

   We seek to achieve interoperability among 4 types of end-systems in
   the following specification. The 4 types are:

   1. Transmitting Interworking Unit (TIU)

      Receives MPEG information in real time from a native MTS link for
      distribution over an IP-based internet. Examples: real-time
      encoder, MTS satellite link to Internet.

   2. Receiving Interworking Unit (RIU)

      Receives MPEG information in real time from an IP-based internet
      for forwarding to a native MTS environment. Examples:
      Internet-based video server to MTS-based cable distribution plant.

   3. Transmitting Internet End-System (TAES)

      Transmits MPEG information generated or stored within the internet
      end-system itself, or received from internet-based computer
      networks. Example: video server.

   4. Receiving Internet End-System (RAES)

      Receives MPEG information over an IP-based internet for
      consumption at the internet end-system or forwarding to a
      traditional computer network. Example: desktop PC or workstation
      viewing training video.

   Each of the 2 types of transmitters must work with each of the 2
   types of receivers. Because it is probable that the TAES, and
   certain that the RAES, will be based on existing and planned
   internet-connected computers, it is highly desirable for the
   interoperable protocol to be based on RTP.

   Because of the range of applications that might employ MPEG streams,
   we propose to define two profiles.

   Most interest in the MPEG community is in the use of MTS, and hence,
   in Section 3 we propose an encapsulation of MPEG2 Transport Stream
   with the Real-time Transport Protocol (RTP) [3]. This profile
   supports the full semantics of MPEG System and offers maximum
   interoperability among all four end-system types. The Presentation
   Time Stamp (PTS) is carried in the fixed RTP header. MPEG-specific
   headers are defined which carry the Program Clock Reference (PCR) as
   well as other information from the transport header and adaptation
   header. MPEG1 System streams will not be supported in this profile.

   When operating only among internet-based end-systems (i.e., TAES and
   RAES) a simpler profile that focuses just on media-stream transport
   is desired, deferring some of the system issues to other protocols
   being defined in the Internet community (such as those of the MMUSIC
   WG). In Section 4 we propose an encapsulation of compressed video
   and audio data (referred to in MPEG documentation as "Elementary
   Streams" (ES)) complying with either MPEG1 or MPEG2. Here, neither
   of the System standards of MPEG1 or MPEG2 is utilized. The ES's are
   directly encapsulated with RTP.

   Throughout this specification, we make extensive use of MPEG
   terminology. The reader should consult the primary MPEG references
   for definitive descriptions of this terminology.

2. Use of clock information

   Clock information in the form of presentation timestamps (PTS) is
   carried in the System headers of both MPEG1 and MPEG2. These provide
   relative synchronization between compressed data streams.
   Lip synchronization with video is one obvious use of these
   timestamps. In the MPEG standards the PTS has an accuracy of 90 kHz.

   Another use of timestamps is to achieve some degree of "locking" of
   the decoder sample clock to that of the encoder. This ensures that
   decoder buffers do not overflow (or underflow) due to differences in
   decoder and encoder clocks. If such buffer overflow (or underflow)
   were to occur then pictures and audio samples may need to be
   presented more than once or not at all. This is referred to as
   "frame slipping" and "sample slipping". The program clock reference
   (PCR) in MTS, and the system clock reference (SCR) in MPS and MPEG1
   System, may be used to lock the decoder clock to the encoder, thereby
   preventing buffer overflow and underflow.

   Clock information from the encoder may also be used to generate the
   chroma sub-carrier. The MPEG2 System standard (both MTS and MPS)
   provides the necessary accuracy of PCR and SCR respectively to
   generate the chroma sub-carrier. This is of interest where video is
   to be displayed in composite format on NTSC or PAL monitors.

   Many of the practical issues involved in implementing end-end clock
   locking are not well understood. In particular, the MPEG model
   assumes a network with zero jitter. It is understood that this will
   never be the case in a real network, and work is underway to
   determine how this mechanism would work in the face of end-end
   jitter.

   Although it is not expected that this mechanism will be employed
   where video is displayed on computer monitors, the RTP encapsulation
   of MPEG2 Transport Streams in Section 3 is intended to provide
   equivalent functionality to MTS. In the case of clock information,
   this means that all necessary timestamp information is available to a
   decoder with the full accuracy.

   It should be noted by implementors of this encapsulation that the PTS
   does not monotonically increase in MPEG streams that contain B
   pictures. In general, B pictures are placed by the encoder into the
   data stream after any I or P pictures on which they depend, even
   though the B picture may be temporally before the I or P picture in
   the presentation sequence.

3. Encapsulation of MPEG2 Transport Streams

   The basic approach will be to map the MPEG transport headers to
   appropriate RTP and MPEG-specific headers. For example, the MPEG
   Presentation Time Stamp (PTS) is represented as the timestamp
   contained in the fixed part of the RTP header. Optional
   MPEG-specific headers are defined which would carry the Program Clock
   Reference (PCR) as well as other information from the transport
   header and adaptation header.

   To avoid end system inefficiencies, data from multiple small MTS
   packets (normally fixed in size at 188 bytes) may be aggregated into
   a single RTP packet. These multiple MTS packets must be from the
   same MTS PID, and, for audio and video, correspond to the same
   PES-level packets.

   This RTP encapsulation should enable high quality video and audio to
   be reconstructed from compressed data, and provide equivalent
   functionality to the MTS. In addition to this equivalent
   functionality, it provides translation, without loss of any
   information, to and from the MTS format.

   MTS are constructed from packetized elementary streams (PES). These
   PES packets contain presentation time stamps (PTS). The MPEG2 System
   standard specifies that the time interval between contiguous PTS's
   should not be greater than 0.7 seconds. We propose that PTS's should
   be present in every RTP fixed header. In the MPEG2 System standard
   the PTS is defined as a 33-bit value with an accuracy of 90 kHz.
   However, in this RTP encapsulation we use only the 32 LSB's.
   Consequently, the PTS carried in RTP would cycle every 2^32 / 90000
   seconds, or about 13.26 hours.

   The PCR and OPCR are encapsulated in two optional MPEG-specific
   headers. The MPEG2 System standard specifies that the time interval
   between contiguous PCR's (or OPCR's) should not be greater than 0.1
   seconds. Consequently, an RTP packet stream must contain the PCR
   optional header (or OPCR header, where appropriate) within this time
   interval where the application wishes to use the transport bit stream
   to provide end-end clock synchronization. These headers are not
   required when end-end clock "locking" is not implemented.

3.1 RTP fixed header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |T=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source identifier (SSRC)            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |              content source identifiers (CSRCs)               |
   |                             ....                              |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

   M bit: Set to 1 at end of each PES packet.

   PT: (value TBD) Fixed value for all packet streams of this type.

   sequence number: As per RTP spec. Incremented on each packet.

   timestamp: 32 LSB's of MPEG Presentation Time Stamp (PTS) (32 bits).
      Note - May not be monotonically increasing if B pictures are
      present in stream.

   SSRC/CSRC: As per RTP spec.

3.2 MPEG Transport Header

   This header shall be attached to each RTP packet at the start of the
   payload and after any RTP headers for an MPEG2 Transport stream
   payload type (i.e. PT=TBD).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |A|B|C|           PID           |1|2|SC |3|4|5|6|splice counter |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   A: transport_error_indicator (1 bit)

   B: payload_unit_start_indicator (normally start of PES packet)
      (1 bit)

   C: transport_priority (1 bit)

   PID (Program ID): 13 bits

   1: PCR present

   2: OPCR present

   SC: transport scrambling control (2 bits) (default = 00)

   3: discontinuity indicator (1 bit) (default = 0)

   4: random access indicator (1 bit) (default = 0)

   5: elementary stream priority indicator (1 bit) (default = 0)

   6: splice point flag (1 bit) (default = 0)

   splice counter: 8 bits (default = 0)

3.3 PCR (optional)

   This header shall be attached to each RTP packet immediately after
   the RTP and MPEG Transport headers.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      PCR_base (32 LSB's)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Unused |W| PCR_clock_ext | Unused |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   W: MSB of program_clock_reference_base (1 bit)

   PCR_clock_ext: program_clock_reference_extension (9 bits)
      Modulo 300 27MHz clock (default = 0 if not represented in MTS)

   PCR_base: 32 LSB's of program_clock_reference_base (32 bits)
      90kHz clock

3.4 OPCR (optional)

   This header shall be attached to each RTP packet after the RTP and
   MPEG Transport headers, and after the PCR header (if present).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     OPCR_base (32 LSB's)                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Unused |W| OPCR_clock_ext | Unused |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   W: MSB of original_program_clock_reference_base (1 bit)

   OPCR_clock_ext: original_program_clock_reference_extension (9 bits)
      Modulo 300 27MHz clock (default = 0 if not represented in MTS)

   OPCR_base: 32 LSB's of original_program_clock_reference_base
      (32 bits) 90kHz clock

4. Encapsulation of MPEG Elementary Streams

   MPEG1 or MPEG2 Elementary Streams (ES) shall be encapsulated with
   RTP. This encapsulation shall provide suitable timestamps,
   identification of packet loss and other functions for transport of
   MPEG1 and MPEG2 streams over IP networks.

   Each encapsulated MPEG video or audio header shall be completely
   contained within one packet. Consequently, a minimum RTP payload
   size of 261 bytes must be supported to contain the largest single
   header defined in the ES (that is, the extension_data() header
   containing the quant_matrix_extension()).

   Presentation Time Stamps (PTS) of 32 bits with an accuracy of 90 kHz
   would be carried in the fixed RTP header. All packets that make up
   an audio or video frame shall have the same time stamp.

   The following ES types may be encapsulated:

   (a) MPEG1 Video (ISO/IEC 11172-2)
   (b) MPEG2 Video (ISO/IEC 13818-2)
   (c) MPEG1 Audio (ISO/IEC 11172-3)
   (d) MPEG2 Audio (ISO/IEC 13818-3)

   A payload type (PT) of TBD corresponds to MPEG1/MPEG2 Video, and a PT
   of TBD corresponds to MPEG1/MPEG2 Audio (see Table-1 in Section 4.3).
   Further indication as to whether the data is MPEG1 or MPEG2 need not
   be provided in the RTP or MPEG-specific headers of this
   encapsulation, as this information is available in the ES headers.
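
   The timestamp handling described in this section can be illustrated
   with a minimal Python sketch. It assumes only what the text states:
   a 90 kHz PTS of which the 32 least significant bits are carried in
   the RTP timestamp field. The function names are hypothetical and are
   not defined by this draft. The signed difference helper is one
   possible way for a receiver to compare timestamps across a wrap, and
   to tolerate the non-monotonic PTS values that B pictures produce.

```python
PTS_CLOCK_HZ = 90_000      # PTS tick rate given in the text
MASK32 = 0xFFFFFFFF        # only the 32 LSBs of the PTS are carried


def pts_to_rtp_timestamp(pts33: int) -> int:
    """Drop the MSB of a 33-bit MPEG PTS, keeping the 32 LSBs."""
    return pts33 & MASK32


def wrap_period_hours() -> float:
    """Time for a 32-bit counter running at 90 kHz to wrap, in hours."""
    return (1 << 32) / PTS_CLOCK_HZ / 3600.0


def timestamp_delta(later: int, earlier: int) -> int:
    """Signed difference of two 32-bit timestamps, modulo 2**32.

    A negative result is legitimate: in a stream with B pictures the
    PTS of a packet can be earlier than that of the preceding packet.
    """
    diff = (later - earlier) & MASK32
    return diff - (1 << 32) if diff >= (1 << 31) else diff
```

   Rounded to two decimals, wrap_period_hours() reproduces the 13.26
   hour cycle quoted in Section 3.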

4.1 MPEG Video elementary streams

   MPEG1 Video can be distinguished from MPEG2 Video at the video
   sequence header, i.e. for MPEG2 Video a sequence_header() is followed
   by sequence_extension(). The particular profile and level of MPEG2
   Video (MAIN_Profile@MAIN_Level, HIGH_Profile@HIGH_Level, etc) is
   determined by the profile_and_level_indication field of the
   sequence_extension header of MPEG2 Video.

   A header containing the slice counter and Sequence_header_state shall
   be attached to each RTP packet at the start of the video payload and
   after any RTP fixed or optional headers (See Section 4.4).

   For either MPEG1 or MPEG2 Video, loss of packets containing sequence
   headers is identified by the Sequence_header_state. Loss of packets
   containing a group-of-pictures (GOP) header is identified by the
   temporal_reference field of the picture header. Therefore, such
   information need not be available in the profile-specific header.

   An optional profile-specific header is available which contains
   beginning-of-slice and end-of-slice flags, as well as macro-block
   absolute position.

   For the case where a slice extends over more than one RTP packet the
   beginning-of-slice and the end-of-slice flags in this optional header
   may be used to indicate beginning or end of slice within an RTP
   packet.

   To provide greater robustness to packet loss at the macro-block
   level, the absolute position of the first macro-block in a slice
   shall be indicated. This information can be used to recover
   macro-block location mid-slice from the relative location information
   present in each macro-block.

4.2 MPEG Audio elementary streams

   MPEG1 Audio can be distinguished from MPEG2 Audio by the
   ancillary_data() header. For either MPEG1 or MPEG2 Audio distinct
   PTS's may be present for frames which correspond to either 384
   samples for Layer-I, or 1152 samples for Layer-II or Layer-III.
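
   The frame sizes above determine how long each audio frame lasts and,
   for a given bit rate, how many bytes it occupies on average. The
   following Python sketch works through that arithmetic. The function
   names are hypothetical. The inputs used in the checks (44.1 kHz,
   384 kbits/sec, 500-byte packets) are the figures from this section's
   own fragmentation example, and the sketch ignores per-packet header
   overhead.

```python
import math

# Samples per audio frame, as stated in the text.
SAMPLES_PER_FRAME = {"I": 384, "II": 1152, "III": 1152}


def frame_duration_ms(layer: str, sample_rate_hz: float) -> float:
    """Playback time covered by one audio frame, in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME[layer] / sample_rate_hz


def average_frame_bytes(layer: str, sample_rate_hz: float,
                        bitrate_bps: float) -> float:
    """Average compressed frame size: bytes per second times duration."""
    return (bitrate_bps / 8) * SAMPLES_PER_FRAME[layer] / sample_rate_hz


def packets_per_frame(frame_bytes: float, payload_bytes: int) -> int:
    """RTP packets needed for one frame, ignoring header overhead."""
    return math.ceil(frame_bytes / payload_bytes)


duration = frame_duration_ms("II", 44_100)           # about 26.1 ms
size = average_frame_bytes("II", 44_100, 384_000)    # about 1.25 KBytes
count = packets_per_frame(size, 500)                 # 3 packets
```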

   Multiple frames may be encapsulated within one RTP packet. Also, if
   relatively short packets need to be used, one frame may be so large
   that it straddles multiple RTP packets. For example, for Layer-II at
   a sampling rate of 44.1 kHz each frame would represent a time slot of
   26.1 msec. At this sampling rate, if the compressed bit-rate is 384
   kbits/sec (i.e. 48 kBytes/sec) then the average audio frame size
   would be 1.25 KBytes. If packets were to be 500 Bytes long, then
   each audio frame would straddle 3 RTP packets. The audio
   fragmentation indicator header (See Section 4.5) shall be present for
   an MPEG1/2 Audio payload type (PT=TBD) to provide for this
   fragmentation.

4.3 RTP Fixed Header for MPEG ES encapsulation

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |T=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source identifier (SSRC)            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |              content source identifiers (CSRCs)               |
   |                             ....                              |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

   M bit: Set to 1 on packet containing MPEG frame end code.

   PT: MPEG video or audio stream ID (See Table-1 below).

   sequence number: As per RTP spec. Incremented on each packet.

   timestamp: 32 LSB's of MPEG Presentation Time Stamp (PTS) (32 bits).
      Note - May not be monotonically increasing if B pictures are
      present in stream.

   SSRC/CSRC: As per RTP spec.
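
   As an illustration of the fixed header described above, the
   following Python sketch serializes one header in network byte order.
   It is a sketch under stated assumptions rather than a reference
   implementation: it assumes the first 32-bit word is laid out as T
   (2 bits), P (1), X (1), CC (4), M (1), PT (7) and a 16-bit sequence
   number, and it sets P and X to zero. The function name is
   hypothetical, and since the payload type values are still TBD, any
   PT passed in is a placeholder chosen by the caller.

```python
import struct


def pack_rtp_fixed_header(pt: int, seq: int, pts: int, ssrc: int,
                          marker: bool = False, csrcs=()) -> bytes:
    """Serialize the fixed header, assuming a 2/1/1/4/1/7/16 bit split."""
    if not 0 <= pt < 128:
        raise ValueError("PT must fit in 7 bits")
    if len(csrcs) > 15:
        raise ValueError("CC is a 4-bit count, so at most 15 CSRCs")
    byte0 = (2 << 6) | len(csrcs)         # T=2, P=0, X=0, CC
    byte1 = (int(marker) << 7) | pt       # M bit, then payload type
    header = struct.pack(
        "!BBHII",
        byte0,
        byte1,
        seq & 0xFFFF,                     # 16-bit sequence number
        pts & 0xFFFFFFFF,                 # 32 LSBs of the MPEG PTS
        ssrc & 0xFFFFFFFF,
    )
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


# PT=96 is an arbitrary placeholder, not an assignment made by this draft.
example = pack_rtp_fixed_header(pt=96, seq=1, pts=0x100000010,
                                ssrc=0xDEADBEEF, marker=True)
```

   Passing a 33-bit PTS is harmless in this sketch because the mask
   keeps only the 32 LSB's, matching the timestamp field definition.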

   Table-1: Stream ID assignments:

   +---------+-----------------------------------------------------+
   |   PT    |                       ES type                       |
   +---------+-----------------------------------------------------+
   |   TBD   | ITU-T Rec. H.262 | ISO/IEC 13818-2 or               |
   |         | ISO/IEC 11172-2 video stream                        |
   +---------+-----------------------------------------------------+
   |   TBD   | ISO/IEC 11172-3 or ISO/IEC 13818-3 audio stream     |
   +---------+-----------------------------------------------------+

4.4 MPEG Video specific headers

   This header shall be attached to each RTP packet at the start of the
   payload and after any RTP headers for an MPEG1/2 Video payload type
   (i.e. PT=TBD).

   Slice/Macro-block fragmentation information:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|U|            SC             |A|B|            AP             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   S: Sequence_header_state (1 bit). Initially 0 and toggled at the
      occurrence of each MPEG sequence header. Used to detect loss of
      a sequence header.

   U: Unused. Must be set to zero in current profile.

   SC: Slice counter (14 bits). Reset to 0 at beginning of each picture
      and incremented on each slice. Fourteen bits allows pictures
      with up to 16384 slices (one macro-block each) to be covered by
      this profile. This should be sufficient to cover the High level
      of MPEG2 (1920 pixels X 1152 lines), as well as the highest
      resolution for pictures from the US HDTV system from the "Grand
      Alliance".

   A: Beginning-of-slice (BS) (1 bit).

   B: End-of-slice (ES) (1 bit).

   AP: Absolute position (expressed as absolute macro-block number) of
      1st macro-block in a slice (14 bits). This value shall be
      constant for each RTP packet for a given slice. Fourteen bits
      allows slices with up to 16384 macro-blocks to be covered by this
      profile.
      This should be sufficient to cover the High level of MPEG2 (1920
      pixels X 1152 lines), as well as the highest resolution for
      pictures from the US HDTV system from the "Grand Alliance". Used
      to recover synchronization at the macro-block level in the event
      of packet loss.

4.5 MPEG Audio specific headers

   This header shall be attached to each RTP packet at the start of the
   payload and after any RTP headers for an MPEG1/2 Audio payload type
   (i.e. PT=TBD).

   Audio fragmentation information:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0|0|        Frag_offset        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Frag_offset: Byte offset into the frame for the data in this packet.

References:

   [1] ISO/IEC International Standard 11172; "Coding of moving pictures
       and associated audio for digital storage media up to about 1,5
       Mbits/s", November 1993.

   [2] ISO/IEC International Standard 13818; "Generic coding of moving
       pictures and associated audio information", November 1994.

   [3] RTP: A Transport Protocol for Real-Time Applications, IETF
       Internet Draft,
       ftp://ds.internic.net/internet-drafts/draft-ietf-avt-rtp-05.txt

Authors' Addresses:

   Gerard Fernando
   Sun Microsystems, Inc.
   2550 Garcia Avenue
   Mail-stop UMPK14-305
   Mountain View, California 94043-1100
   USA

   phone: +415-786-6373
   email: gerard.fernando@eng.sun.com

   Vivek Goyal
   Computer Science Department
   University of Southern California
   941 W. 37th Place
   Los Angeles, CA 90089-0781
   USA

   phone: +213-740-7287
   email: goyal@usc.edu

   Don Hoffman
   Sun Microsystems, Inc.
   2550 Garcia Avenue
   Mail-stop UMPK14-305
   Mountain View, California 94043-1100
   USA

   phone: +415-786-6370
   email: don.hoffman@eng.sun.com