Internet Engineering Task Force                              Jason Flaks
Internet Draft                                        Dolby Laboratories
Document: draft-flaks-avt-rtp-ac3-02.txt                       July 2002
Expires: January 2003


                  RTP Payload Format for AC-3 Streams


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Abstract

   This document describes an RTP payload format for transporting AC-3
   encoded audio data. AC-3 is a high quality multichannel audio coding
   system fully described in [2] by the Advanced Television Systems
   Committee (ATSC). The RTP payload format presented in this document
   provides mechanisms for interleaving redundant data, which can
   increase packet loss resilience. An intelligent method for
   fragmenting AC-3 frames that exceed the maximum transmission unit
   (MTU) is also described.

1. Introduction

   AC-3 is a high quality audio codec designed to encode multiple
   channels of audio into a low bit-rate format. AC-3 achieves its
   large compression ratios by encoding a multiplicity of channels as a
   single entity. Dolby Digital, a branded version of AC-3, encodes up
   to 5.1 channels of audio.

   AC-3 has been adopted as an audio compression scheme for many
   consumer and professional applications. AC-3 is the mandatory codec
   for DVD-Video, ATSC digital terrestrial television, and laser disc,
   and an optional multichannel audio format for DVD-Audio. AC-3 is
   also a common audio format for film.

   A large amount of content is already encoded in AC-3, and the
   majority of it has more than two channels. It is likely that people
   will wish to stream AC-3 data over computer networks. Applications
   for streaming AC-3 range from video on demand to multichannel
   Internet radio. RTP provides a mechanism for stream synchronization
   and hence is a natural transport for AC-3, which is a codec
   primarily used in audio for video applications. The RTP payload
   described in this document also provides a method of ensuring a
   continuous high quality AC-3 stream.

1.1 Overview of AC-3

   AC-3 can deliver up to 5.1 channels of audio at data rates
   approximately equal to half of one PCM channel [2], [3], [4]. The
   ".1" refers to an optional band-limited low-frequency enhancement
   channel. AC-3 was designed for signals sampled at rates of 32, 44.1,
   or 48 kHz. Data rates can vary between 64 kbps and 640 kbps
   depending on the number of channels and the desired quality.

   AC-3 exploits psychoacoustic phenomena that reveal large amounts of
   inaudible information contained in a typical audio signal.
   Substantial data reduction occurs via the removal of inaudible
   information contained in an audio stream. Source coding techniques
   are further used to reduce the data needed to code an audio signal.

   Like most perceptual coders, AC-3 operates in the frequency domain.
   A 512-point TDAC transform is taken with 50% overlap, providing 256
   new frequency samples. Frequency samples are then converted to
   exponents and mantissas. Exponents are differentially encoded.
   Each mantissa is allocated a varying number of bits depending on the
   audibility of the spectral component associated with it. Audibility
   is determined via a masking curve. Bits for mantissas are allocated
   from a global bit pool.

1.2 AC-3 Bitstream

   AC-3 bitstreams are organized into synchronization frames. Each AC-3
   sync frame contains a Sync Information (SI) field, a Bit Stream
   Information (BSI) field, and six audio blocks (AB), each
   representing 256 PCM samples for each channel. The entire frame
   represents a time duration of 1536 PCM samples across all coded
   channels (32 msec @ 48 kHz) [2]. Figure 1 shows the AC-3 frame
   format.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SI |BSI|  AB0  |  AB1  |  AB2  |  AB3  |  AB4  |  AB5  |AUX|CRC|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Synchronization Information field contains information needed to
   acquire and maintain synchronization. The Bit Stream Information
   field contains parameters that describe the coded audio service [2].
   Each audio block also contains fields that determine the usage of
   block switching, dither, dynamic range control, coupling, and
   exponent strategy. Figure 2 shows the format of an AC-3 audio block.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Block  |Dither |Dynamic   |Coupling |Coupling   |Exponent     |
   | switch |Flags  |Range Ctrl|Strategy |Coordinates|Strategy     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Exponents      |   Bit Allocation   |      Mantissas      |
   |                    |     Parameters     |                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2. RTP AC-3 Payload Format

   According to [5], RTP payload formats should contain an integral
   number of application data units (ADUs). With audio compression
   algorithms an ADU typically coincides with a codec frame. In this
   case an ADU is equivalent to an AC-3 sync frame.
   Hence each RTP packet will contain an integral number of AC-3 frames
   unless the AC-3 frame exceeds the maximum transmission unit (MTU) of
   the underlying network.

      RTP_Payload = x * AC-3_Frame,

   where x is a positive integer and RTP_Payload < MTU.

2.1 RTP Header Extension

2.1.1 Main Header Extension

   The following header extension should be at the front of every AC-3
   RTP payload. The primary purpose of this main header is to indicate
   the number of frames or fragments present in the packet. The term
   "Data Unit" is used to refer to AC-3 data, be it a full frame or a
   fragment. There is also a 4-bit field that is reserved for future
   use and ensures byte alignment.

       0 1 2 3 4 5 6 7
      +-+-+-+-+-+-+-+-+
      |  NDU  |  RSV  |
      +-+-+-+-+-+-+-+-+

   Number of data units (NDU): A 4-bit field used to indicate the
   number of AC-3 frames or fragments present in the RTP payload.

   Reserved (RSV): This 4-bit field is reserved for later use.

2.1.2 Data Unit Header Extension

   The following header should be in front of each audio data unit
   (i.e. AC-3 frame or fragment) present in the RTP packet. The fields
   aid in handling redundant data and fragmented AC-3 frames.

       0 1 2 3 4 5 6 7
      +-+-+-+-+-+-+-+-+
      |TYP|F|B| RDT |T|
      +-+-+-+-+-+-+-+-+

   Type field (TYP): This field specifies the type of data associated
   with this header, which can be AC-3 data, AC-3 data plus redundant
   data, or redundant data alone. The following table shows the setting
   for each type of data.

      00 - AC-3 data
      01 - AC-3 data + redundant data
      10 - redundant data
      11 - reserved

   Fragment bit (F): This bit is set to 1 if the corresponding data
   unit is an AC-3 fragment.

   Block "0" Bit (B): This bit is set to 1 if the packet contains an
   AC-3 fragment consisting of the first 5/8ths of the frame, which is
   guaranteed to contain blocks 0 and 1. If an AC-3 fragment is
   received with the B bit not set, and the previous fragment was lost,
   then the frame is useless and can be discarded.
   However, if the first fragment is received and the later fragment is
   lost, block 1 can be repeated to complete the frame.

   Redundant Data field (RDT): This 3-bit field indicates the type of
   redundant data associated with the frames. The following table shows
   the setting for each type of redundant data.

      000 - Full frame/Lower bit rate
      001 - Full frame/Lower bit rate/Fewer channels
      010 - 5/8ths fragment
      011 - 3/8ths fragment
      100 - 5/8ths fragment/Lower bit rate
      101 - 3/8ths fragment/Lower bit rate
      110 - 5/8ths fragment/Lower bit rate/Fewer channels
      111 - 3/8ths fragment/Lower bit rate/Fewer channels

   Time Code Bit (T): This bit is set to 1 if the AC-3 data contains
   time code.

   Figure 4 shows how a full AC-3 RTP payload format should appear.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  NDU  |  RSV  |TYP|F|B| RDT |T|         AC-3 Frame(1)         |
   +=+=+=+=+=+=+=+=+-+-+-+-+-+-+-+-+       +-+-+-+-+-+-+-+-+-+-+-+-+
   |                                       |    Redundant data     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |TYP|F|B| RDT |T|                      ...                      |
   +-+-+-+-+-+-+-+-+                                               +
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |TYP|F|B| RDT |T|                 AC-3 Frame(N)                 |
   +-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+-+-+-+-+
   |                                       |    Redundant data     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.2 Fragmentation of AC-3 Frames

   The size of AC-3 frames remains constant throughout the encoding of
   a particular piece of audio, but the frame size can be chosen from a
   large number of possibilities. According to table 5.13 in [2], frame
   sizes range from 128 bytes to 3840 bytes, depending on the desired
   bit rate and the sample rate of the uncompressed audio. AC-3 frames
   can therefore be quite large, which may require fragmentation. For
   example, an audio file sampled at 32 kHz and compressed at a bit
   rate of 640 kbps would have a frame size of 3840 bytes.
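
   The 3840-byte figure can be reproduced from the frame duration given
   in section 1.2 (1536 samples per frame). The sketch below is
   illustrative only: the function name frame_size_bytes is
   hypothetical, and the arithmetic assumes a constant bit rate and a
   sample rate at which one frame is a whole number of bytes, as is the
   case at 32 and 48 kHz. Where the division is not exact (for example
   at 44.1 kHz) the sizes tabulated in [2] should be used instead.

```python
SAMPLES_PER_FRAME = 1536  # six audio blocks of 256 samples (section 1.2)


def frame_size_bytes(bit_rate_bps: int, sample_rate_hz: int) -> int:
    """Bytes in one AC-3 sync frame, assuming the division is exact."""
    bits, remainder = divmod(bit_rate_bps * SAMPLES_PER_FRAME, sample_rate_hz)
    if remainder or bits % 8:
        # Not a whole number of bytes: look the size up in [2] instead.
        raise ValueError("frame size is not a whole number of bytes")
    return bits // 8


if __name__ == "__main__":
    # 640 kbps at 32 kHz, the example used in section 2.2.
    print(frame_size_bytes(640_000, 32_000))
    # 384 kbps at 48 kHz, a common configuration.
    print(frame_size_bytes(384_000, 48_000))
```

   For 640 kbps at 32 kHz the function returns 3840, the largest frame
   size listed above; for 384 kbps at 48 kHz it returns 1536.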
   A 3840-byte frame exceeds the standard 1500-byte MTU of an Ethernet
   network and the 1492-byte MTU of the PPPoE protocol. In [5] it is
   specified that fragmentation should not be left to the IP layer, but
   should instead be handled by the application itself.

   AC-3 frames were designed with the possibility of buffers being
   smaller than an entire AC-3 frame. For this reason each AC-3 frame
   contains two 16-bit CRC words. CRC1 is contained in the
   synchronization information (SI) header located at the beginning of
   each AC-3 frame. CRC1 is the second 16-bit word of the frame. Figure
   5 shows the structure of the SI header.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           SYNC WORD           |             CRC1              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |FSC| FRMSIZECD |
   +-+-+-+-+-+-+-+-+

   CRC2 is the last 16-bit word of an AC-3 frame, as shown in Figure 1.
   CRC1 applies to the first 5/8ths of the frame, excluding the sync
   word. CRC2 covers the remaining 3/8ths of the frame as well as the
   entire frame (excluding the sync word). All AC-3 encoders enforce
   specific block size restrictions that guarantee blocks 0 and 1 are
   completely covered by CRC1 [2]. Ensuring that blocks 0 and 1 are in
   the first 5/8ths of the frame is necessary because block 0 contains
   information that is shared with the remaining five blocks. The dual
   CRC allows decoders to begin processing block 0 as soon as the
   5/8ths point is reached.

   This 5/8ths split in all AC-3 frames, which was intended for the
   possibility of smaller input buffers (assuming a guaranteed
   transport such as S/PDIF), provides a very logical fragmentation
   unit. Using the 5/8ths point provides two possible gains over
   arbitrary fragmentation:

   1) Using 5/8ths fragmentation, if the second fragment is dropped,
      the first fragment can still be decoded by an AC-3 decoder. Block
      1 will be repeated in place of any blocks lost with the second
      fragment.
   2) In closed networks with no QoS problems, it may be possible to
      use smaller buffers, as was intended in the original design of
      the 5/8ths split.

   In [2] the 5/8ths point is defined to be:

      5/8-framesize = truncate(framesize/2) + truncate(framesize/8)

   According to table 7.34 in [2], 5/8ths frame sizes can range from 80
   bytes to 2400 bytes. Hence there are still instances where the
   5/8ths boundary may exceed the MTU of the underlying network. In an
   Ethernet network this would be rare, because the majority of
   publicly available AC-3 data is sampled at 48 kHz and encoded at a
   data rate of 384 kbps or 448 kbps. This gives a 5/8ths point of 960
   bytes and 1120 bytes respectively, both of which are less than the
   MTU of a typical Ethernet network. In the rare instances where even
   the 5/8ths point exceeds the MTU, AC-3 frames should be arbitrarily
   fragmented to a length that is less than the MTU.

   It should be noted that using 5/8ths fragmentation to allow smaller
   buffer sizes is only useful in networks where the inter-arrival
   jitter is less than the time it takes to decode blocks 0 and 1 of
   the AC-3 stream and play the uncompressed audio.

      Jitter-Bound = Decode-Time(Block 0 & 1) + 2 * (256/Fs),

   where Fs is the sample rate of the uncompressed audio.

2.3 Data Resiliency

   This section describes how to encapsulate redundant data in an RTP
   payload to improve the chances that all of the AC-3 data being sent
   is received. There are several types of redundant data that can be
   sent; they are defined in section 2.1.2 and specified for each data
   unit in the data unit header. The various types of redundant data
   are discussed further in the following sections. As a general rule,
   redundant data of any type should never repeat audio information
   carried in the same RTP payload.
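
   The 5/8ths point and jitter bound formulas of section 2.2 can be
   checked with a short sketch. It is illustrative only: the function
   names are hypothetical, the frame size is assumed to be expressed in
   bytes, and the decode time passed to jitter_bound_seconds is a
   made-up figure rather than a measurement of any decoder.

```python
def five_eighths_point(frame_size: int) -> int:
    """truncate(framesize/2) + truncate(framesize/8), per section 2.2."""
    return frame_size // 2 + frame_size // 8


def jitter_bound_seconds(decode_time_s: float, sample_rate_hz: int) -> float:
    """Decode-Time(Block 0 & 1) + 2 * (256 / Fs), per section 2.2."""
    return decode_time_s + 2 * (256 / sample_rate_hz)


if __name__ == "__main__":
    # Frame sizes in bytes for 384 kbps and 448 kbps at 48 kHz.
    for frame_size in (1536, 1792):
        print(frame_size, five_eighths_point(frame_size))
    # Assumed (hypothetical) decode time of 1 ms for blocks 0 and 1.
    print(jitter_bound_seconds(0.001, 48_000))
```

   For frame sizes of 1536 and 1792 bytes the first function returns
   960 and 1120, the two values quoted in section 2.2.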
   For a given RTP payload carrying data units n through n+k and
   redundant data m through m+k, where k = (number of data units) - 1,
   the two ranges must not overlap, so m < n-k. In the example below,
   k = 3 and m = n-4.

          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RTP(X):   |        Data Unit(n)         |   Redundant Data(m = n-4)   |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          |       Data Unit(n+1)        |     Redundant Data(m+1)     |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          |       Data Unit(n+2)        |     Redundant Data(m+2)     |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
          |       Data Unit(n+3)        |     Redundant Data(m+3)     |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.3.1 Lower Bit Rate Data

   A previously defined AC-3 RTP payload format presents a method for
   data resiliency. It suggests that AC-3 frames encoded at 32 kbps be
   interleaved with the higher quality AC-3 frames, allowing the AC-3
   decoder to decode the lower quality frame if the high quality packet
   is dropped, lost, or arrives with errors.

   This is a workable way to send redundant data. However, it may be
   bandwidth intensive, and the redundant data can be of extremely low
   quality, especially when a large number of channels is used.

2.3.2 Lower Bit Rate Data With Fewer Channels

   AC-3 data is often used for film audio. The audio track is stored
   between the sprocket holes of the film. Over time, wear can render
   sections of the AC-3 track unreadable. When no other error
   correction technique can recover the lost data, the two-channel
   audio track is used in its place.

   We present a similar method here for multichannel audio. When
   encoding multichannel audio, a secondary two-channel version of the
   audio can also be encoded at a lower bit rate. Since the audio is
   reduced to two channels, it is still possible to maintain high
   quality even at a lower bit rate.
   The lower bit-rate two-channel version can be interleaved with the
   multichannel audio, and when a packet is lost or corrupted the
   two-channel version can be used in its place.

2.3.3 5/8ths and 3/8ths Fragment

   Another method of sending redundant data is to fragment packets at
   the 5/8ths split and interleave fragments from previous frames. This
   ensures that all data is sent twice, which decreases the likelihood
   of lost data.

          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RTP(X):   |   AC-3 5/8ths Fragment(n)   |  AC-3 3/8ths Fragment(n-1)  |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RTP(X+1): |   AC-3 3/8ths Fragment(n)   |   AC-3 5/8ths Fragment(n)   |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   In addition, one may wish to send only the 5/8ths fragment as
   redundant data. Since the 5/8ths fragment can be decoded on its own,
   this allows for redundant data at a lower overall bit rate. However,
   because block repeats are used when only the first 5/8ths is
   present, the quality would be significantly reduced if the redundant
   data were to be used.

          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RTP(X):   |        AC-3 Frame(n)        |  AC-3 5/8ths Fragment(n-1)  |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RTP(X+1): |       AC-3 Frame(n+1)       |   AC-3 5/8ths Fragment(n)   |
          +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

2.3.4 5/8ths Fragment Lower Bit Rate

   Following the methods listed in the previous section, it may also be
   beneficial to send the redundant fragments at a lower bit rate.
   Ideally a lower bit rate version of the previous frame's 5/8ths
   fragment would be sent along, which would provide a very low bit
   rate redundant data channel.
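
   The interleaving pattern of section 2.3.3, together with the header
   layouts of section 2.1, can be sketched as follows. This is an
   illustration under stated assumptions rather than a normative
   encoding: all helper names are hypothetical, bit 0 of each header is
   taken to be the most significant bit, the RDT field of a
   non-redundant data unit is left at zero, and the sketch pairs each
   fragment with the one sent in the previous packet without
   serializing a complete RTP payload.

```python
from typing import List, Optional, Tuple

TYP_AC3 = 0b00        # AC-3 data
TYP_REDUNDANT = 0b10  # redundant data
RDT_5_8 = 0b010       # 5/8ths fragment
RDT_3_8 = 0b011       # 3/8ths fragment


def main_header(ndu: int) -> int:
    """NDU in the upper four bits; the reserved bits are left at zero."""
    if not 0 <= ndu <= 15:
        raise ValueError("NDU is a 4-bit field")
    return ndu << 4


def data_unit_header(typ: int, f: int, b: int, rdt: int, t: int) -> int:
    """Pack TYP(2) F(1) B(1) RDT(3) T(1), assuming bit 0 is the MSB."""
    return ((typ & 0x3) << 6) | ((f & 1) << 5) | ((b & 1) << 4) \
        | ((rdt & 0x7) << 1) | (t & 1)


def split_frame(frame: bytes) -> Tuple[bytes, bytes]:
    """Split a frame at truncate(size/2) + truncate(size/8) bytes."""
    point = len(frame) // 2 + len(frame) // 8
    return frame[:point], frame[point:]


def interleave(frames: List[bytes]) -> List[Tuple[bytes, Optional[bytes]]]:
    """Pair each new fragment with the fragment sent one packet earlier."""
    fragments = [frag for frame in frames for frag in split_frame(frame)]
    packets: List[Tuple[bytes, Optional[bytes]]] = []
    previous: Optional[bytes] = None
    for frag in fragments:
        # (primary fragment, redundant copy of the previous fragment)
        packets.append((frag, previous))
        previous = frag
    return packets


if __name__ == "__main__":
    # Headers for RTP(X) in section 2.3.3: a 5/8ths fragment of frame n
    # followed by a redundant 3/8ths fragment of frame n-1.
    print(hex(main_header(2)))
    print(hex(data_unit_header(TYP_AC3, 1, 1, 0, 0)))
    print(hex(data_unit_header(TYP_REDUNDANT, 1, 0, RDT_3_8, 0)))
```

   Every fragment appears once as primary data and once as redundant
   data in the following packet, so no packet repeats its own audio.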

2.3.5 5/8ths and 3/8ths Fragment Lower Bit Rate and Fewer Channels

   Combining the methods from 2.3.2 and 2.3.3, a version of the 5/8ths
   fragment that is lower in bit rate and composed of fewer channels
   may be sent as redundant data. This provides an opportunity for low
   bit rate redundant data that has fewer channels but less quality
   degradation.

3. RTP Header Fields

   Payload Type (PT): It is expected that the RTP profile for a
   particular class of applications will assign a payload type for this
   encoding, or alternatively a payload type in the dynamic range
   [96,127] shall be chosen.

   Marker (M) bit: The M bit is set for the last fragment of an AC-3
   frame. In instances where one or more full AC-3 frames are
   encapsulated in an RTP packet, the M bit will be set, and the full
   frame itself will be considered the last fragment.

   Extension (X) bit: Defined by the RTP profile used.

   Timestamp: A 32-bit word that corresponds to the sampling instant
   for the first AC-3 frame in an RTP packet. AC-3 encodes data sampled
   at 32 kHz, 44.1 kHz, and 48 kHz. Fragmented frames shall maintain
   the same timestamp until the last fragment is sent. The starting
   timestamp is selected at random.

4. Types and Names

4.1 MIME type registration

   MIME media type name: audio

   MIME subtype name: ac3

   Required parameters:

      Rate: Equal to the RTP timestamp clock rate for the particular
      AC-3 stream of a given RTP session. In the case of a single frame
      per packet, with the AC-3 stream encoded at 48 kHz at a bit rate
      of 384 kbps, the rate parameter would equal 32 milliseconds. In
      the case of interleaving the 5/8ths and 3/8ths fragments, again
      assuming the AC-3 file was encoded at 48 kHz with a bit rate of
      384 kbps, the clock rate would need to be one half of 32
      milliseconds, or 16 milliseconds.

   Optional parameters:

      Channels: How many channels are present in the AC-3 stream. This
      will be a number between 1 and 6.

      Ptime: The recommended length of time in milliseconds represented
      by the AC-3 frame(s) in the packet.

      Maxptime: The maximum amount of media which can be encapsulated
      in each RTP packet, expressed as time in milliseconds.

   Encoding considerations: The AC-3 bitstream shall be generated
   according to the AC-3 specification [2]. This bitstream is binary
   data and MUST be encoded for non-binary transport (for Email or any
   transport that cannot accommodate binary directly, the Base64
   encoding is sufficient). This type is also defined for transfer via
   RTP. All RTP packets MUST be packetized using the RTP payload format
   described in this document.

   Security considerations: see section 5 of this document

   Interoperability considerations: none

   Published specification: see [2]

   Applications: Multichannel audio compression for audio and audio for
   video

   Additional Information: none

   Magic number(s): none

   File extension(s): .ac3

   Macintosh File Type Code(s): none

   Object Identifier(s) or OID(s): none

   Personal information: Jason Flaks
   Email: jsf@dolby.com

   Intended Usage: COMMON

   Author/Change controller:
      Author: jsf@dolby.com
      Change Controller: IETF AVT WG

4.2 SDP usage

   The encoding name when using SDP [6] SHALL be "ac3" (MIME subtype).
   An example of the media representation in SDP is given below.

      m=audio 49000 RTP/AVP 100
      a=rtpmap:100 ac3/48000
      a=fmtp:100 number-channels=[1-6]

5. Security Considerations

   In order to protect copyrighted material, certain security
   precautions may be necessary. The payload format described in this
   document is subject to the security considerations defined in the
   RTP specification [7]. The security considerations discussed in [7]
   imply the usage of encryption to protect the confidentiality of
   content. Such an encryption scheme is harmless to the encoded audio
   data, presuming the data is decrypted before being sent to the
   decoder.

6. References

   [1] Bradner, S., "Key Words for use in RFCs to Indicate Requirement
       Levels", RFC 2119, Internet Engineering Task Force, March 1997.

   [2] U.S.
       Advanced Television Systems Committee (ATSC), "Digital Audio
       Compression (AC-3) Standard," Doc A/52, December 1995.

   [3] Todd, C. et al., "AC-3: Flexible Perceptual Coding for Audio
       Transmission and Storage," Preprint 3796, Presented at the 96th
       Convention of the Audio Engineering Society, May 1994.

   [4] Fielder, L. et al., "AC-2 and AC-3: Low-Complexity
       Transform-Based Audio Coding," Collected Papers on Digital Audio
       Bit-Rate Reduction, pp. 54-72, Audio Engineering Society,
       September 1996.

   [5] Handley, M. and Perkins, C., "Guidelines for Writers of RTP
       Payload Format Specifications," RFC 2736, Internet Engineering
       Task Force, December 1999.

   [6] Handley, M. and Jacobson, V., "SDP: Session Description
       Protocol," RFC 2327, Internet Engineering Task Force, April
       1998.

   [7] Schulzrinne, Casner, Frederick, and Jacobson, "RTP: A Transport
       Protocol for Real-Time Applications," RFC 1889, Internet
       Engineering Task Force, February 1996.

7. Author's Address

   Jason Flaks
   Dolby Laboratories
   100 Potrero Ave
   San Francisco, CA 94103
   Email: jsf@dolby.com
   www.dolby.com