Internet Engineering Task Force          Audio-Video Transport Working Group
Internet Draft                                                H. Schulzrinne
ietf-avt-profile-05.txt                                             GMD Fokus
                                                                 July 7, 1995
Expires: 12/1/95

     RTP Profile for Audio and Video Conferences with Minimal Control

STATUS OF THIS MEMO

This document is an Internet-Draft.  Internet-Drafts are working documents
of the Internet Engineering Task Force (IETF), its areas, and its working
groups.  Note that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and
may be updated, replaced, or obsoleted by other documents at any time.  It
is inappropriate to use Internet-Drafts as reference material or to cite
them other than as ``work in progress''.

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au
(Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).

Distribution of this document is unlimited.

ABSTRACT

This note describes a profile for the use of the real-time transport
protocol (RTP) and the associated control protocol, RTCP, within audio and
video multiparticipant conferences with minimal control.  It provides
interpretations of generic fields within the RTP specification suitable for
audio and video conferences.  In particular, this document defines a set of
default mappings from payload type numbers to encodings.

The document also describes how audio and video data may be carried within
RTP.  It defines a set of standard encodings and their names when used
within RTP.  However, the definitions are independent of the particular
transport mechanism used.  The descriptions provide pointers to reference
implementations and the detailed standards.  This document is meant as an
aid for implementors of audio, video and other real-time multimedia
applications.

Changes

(This section will not become part of the RFC.)

o Video timestamp frequency changed to 90 kHz.

o Short reference labels for profile definitions.

o Explained the differences between the Intel/IMA DVI format and the one
  used for this profile; the name was changed from IDVI to DVI4.

o Minor editorial clarifications.

1. Introduction

This profile defines aspects of RTP left unspecified in the RTP protocol
definition (RFC TBD).  It is intended for use within audio and video
conferences with minimal session control.  In particular, no support for
the negotiation of parameters or membership control is provided.  Other
profiles may make different choices for the items specified here.

The profile specifies the use of RTP over unicast and multicast UDP.  (This
does not preclude the use of these definitions when RTP is carried by other
lower-layer protocols.)  Use of this profile occurs by use of the
appropriate applications; there is no explicit indication by port number,
protocol identifier or the like.

2. RTP and RTCP Packet Forms and Protocol Behavior

This profile follows the default and/or recommended aspects of the RTP
specification for these items:

Header:
     The standard format of the fixed RTP data header is used (one marker
     bit).

Extension:
     No additional fixed fields are appended to the RTP data header.

RTCP report interval:
     The suggested constants are to be used for the RTCP report interval
     calculation.
SR/RR extension:
     No extension section is defined for the RTCP SR or RR packet.

RTCP packet types:
     No additional RTCP packet types are defined by this profile
     specification.

Security:
     The RTP default security services are also the default under this
     profile.

Mapping:
     The standard mapping of RTP and RTCP to transport-level addresses is
     used.

Encapsulation:
     No encapsulation of RTP packets is specified.

RTP header extensions:
     No RTP header extensions are defined, but applications operating under
     this profile may use such extensions.  Thus, applications should not
     assume that the RTP header X bit is always zero and should be prepared
     to ignore the header extension.  If a header extension is defined in
     the future, that definition must specify the contents of the first 16
     bits.

SDES use:
     Applications may use any of the SDES items described.

New encodings are to be registered with the Internet Assigned Numbers
Authority.  When registering a new encoding, the following information
should be provided:

o name and description of the encoding, in particular the RTP timestamp
  clock rate;

o an indication of who has change control over the encoding (for example,
  CCITT/ITU, other international standardization bodies, a consortium or a
  particular company or group of companies);

o any operating parameters;

o a reference to a further description, if available, for example (in order
  of preference) an RFC, a published paper, a patent filing, a technical
  report or a computer manual;

o for proprietary encodings, contact information (postal and email
  address);

o the payload type value for this profile.

3. Audio

3.1. Encoding-independent recommendations

The first packet of a talkspurt is distinguished by a set marker bit in the
RTP data header.

The following recommendations are default operating parameters.
Applications should be prepared to handle other values.  The ranges given
are meant to give guidance to application writers, allowing a set of
applications conforming to these guidelines to interoperate without
additional negotiation.  These guidelines are not intended to restrict
operating parameters for applications that can negotiate a set of
interoperable parameters, e.g., through a conference control protocol.

For packetized audio, the default packetization interval should have a
duration of 20 ms, unless otherwise noted when describing the encoding.
The packetization interval determines the minimum end-to-end delay; longer
packets introduce less header overhead but higher delay and make packet
loss more noticeable.  For non-interactive applications such as lectures,
or for links with severe bandwidth constraints, a higher packetization
delay may be appropriate.

For N-channel encodings, each sampling period (say, 1/8000 of a second)
generates N samples.  (This terminology is standard, but somewhat
confusing, as the total number of samples generated per second is then the
sampling rate times the channel count.)
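As a non-normative illustration of the preceding recommendations, the
following C sketch shows how the packetization interval, sampling rate,
channel count and bits per sample determine the audio payload size; the
helper name is hypothetical and not part of this profile.

    #include <stdio.h>

    /* Illustrative only: payload octets implied by a packetization
     * interval, rounded up to a whole octet count as required for
     * sample-based encodings (Section 3.2). */
    static unsigned payload_octets(unsigned rate_hz, unsigned channels,
                                   unsigned bits_per_sample, unsigned ms)
    {
        unsigned samples_per_channel = rate_hz * ms / 1000;
        return (samples_per_channel * channels * bits_per_sample + 7) / 8;
    }

    int main(void)
    {
        /* 20 ms of PCMU (8000 Hz, 1 channel, 8 bits): 160 octets */
        printf("%u\n", payload_octets(8000, 1, 8, 20));
        /* 20 ms of two-channel L16 at 44100 Hz: 3528 octets */
        printf("%u\n", payload_octets(44100, 2, 16, 20));
        return 0;
    }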
If multiple audio channels are used, channels are numbered left to right,
starting at one.  In RTP audio packets, information from lower-numbered
channels precedes that from higher-numbered channels.  For more than two
channels, the convention followed by the AIFF-C audio interchange format
should be followed [1]: for two-channel stereo, the ordering is left,
right; for three channels, left, right, center; for quadraphonic systems,
front left, front right, rear left, rear right; for four-channel systems,
left, center, right, surround; for six-channel systems, left, left center,
center, right, right center, surround.  All channels belonging to a single
sampling instant must be carried within the same packet.

The sampling frequency should be drawn from the set 8000, 11025, 16000,
22050, 44100 and 48000 Hz.  (Apple Macintosh computers have native sample
rates of 22254.54 and 11127.27 Hz, which can be converted to 22050 and
11025 Hz with acceptable quality by dropping 4 or 2 samples within a 20 ms
frame.)

A receiver should accept packets representing between 0 and 200 ms of audio
data; this restriction allows reasonable buffer sizing at the receiver.
Receivers should be prepared to accept multi-channel audio, but may choose
to play out only a single channel.

3.2. Guidelines for Sample-Based Audio Encodings

In sample-based encodings, each audio sample is represented by a fixed
number of bits.  Within the compressed audio data, codes for individual
samples may span octet boundaries.  An RTP audio packet may contain any
number of audio samples, subject to the constraint that the number of bits
per sample times the number of samples per packet yields an integral octet
count.  Fractional encodings produce less than one octet per sample.

For sample-based encodings producing one or more octets per sample, samples
from different channels but belonging to the same sampling instant are
consecutive.  For example, for a two-channel encoding, the octet sequence
is (left channel, first sample), (right channel, first sample), (left
channel, second sample), (right channel, second sample), and so on.  For
multi-octet encodings, octets are transmitted in network byte order (i.e.,
most significant octet first).

The packing order for fractional encodings is that described for the IMA
Wave types [2].  For audio encodings yielding four bits per sample, eight
such compressed samples from channel 1 are packed into one 32-bit word,
followed by eight compressed samples from channel 2, and so on until all
channels have been accommodated, after which packing resumes with channel
1.  For audio encodings yielding three bits per sample, 32 such compressed
samples (96 bits) from channel 1 are packed into 12 octets, followed by 32
samples from channel 2, etc.

3.3. Guidelines for Frame-Based Audio Encodings

Frame-based encodings encode a fixed-length block of audio into another
block of compressed data, typically also of fixed length.  For frame-based
encodings, the sender may choose to combine several such frames into a
single RTP packet.  The receiver can tell the number of frames contained in
a packet since the frame duration is defined as part of the encoding.

For frame-based codecs, the channel order is defined for the whole block.
That is, for two-channel audio, the left and right samples are coded
independently, with the encoded frame for the left channel preceding that
for the right channel.

All frame-oriented audio codecs should be able to encode and decode several
consecutive frames within a single packet.  Since the frame size for a
frame-oriented codec is fixed, there is no need for a separate designation
for the same encoding carried with a different number of frames per packet.
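As a non-normative illustration of the preceding paragraphs, the following
C sketch recovers the frame count of a packet from the payload length
alone.  The 33-octet packing of a 20 ms GSM 06.10 frame is an assumption
made solely for this example and is not defined by this profile.

    #include <stddef.h>

    /* Illustrative only: because the frame duration (and hence size) is
     * fixed by the encoding, a receiver can derive the frame count from
     * the payload length.  The 33-octet, 20 ms GSM frame below is an
     * assumption for this example. */
    #define GSM_FRAME_OCTETS 33
    #define GSM_FRAME_MS     20

    /* Returns the number of whole frames, or 0 for a malformed payload
     * whose length is not a multiple of the frame size. */
    static size_t gsm_frames_in_payload(size_t payload_len)
    {
        if (payload_len == 0 || payload_len % GSM_FRAME_OCTETS != 0)
            return 0;
        return payload_len / GSM_FRAME_OCTETS;
    }

    /* The audio duration covered by the packet is then
     * gsm_frames_in_payload(len) * GSM_FRAME_MS milliseconds. */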
3.4. Audio Encodings

      encoding    sample/frame    bits/sample    ms/frame
      ____________________________________________________
      1016        frame           N/A            30
      G721        sample          4
      G722        sample          8
      G728        frame           N/A            2.5
      GSM         frame           N/A            20
      DVI4        sample          4
      LPC         frame           N/A            20
      L8          sample          8
      L16         sample          16
      MPA         frame           N/A
      PCMU        sample          8
      PCMA        sample          8

              Table 1: Properties of Audio Encodings

1016: Encoding 1016 is a frame-based encoding using code-excited linear
     prediction (CELP) and is specified in Federal Standard FED-STD 1016
     [3,4,5,6].  The U.S. DoD's Federal-Standard-1016 based 4800 bps code
     excited linear prediction voice coder version 3.2 (CELP 3.2) Fortran
     and C simulation source codes are available for worldwide distribution
     at no charge (on DOS diskettes, but configured to compile on Sun SPARC
     stations) from: Bob Fenichel, National Communications System,
     Washington, D.C. 20305, phone +1-703-692-2124, fax +1-703-746-4960,
     and from ftp://ftp.super.org/pub/speech/celp_3.2a.tar.Z.

G721: G721 is specified in CCITT/ITU-T recommendation G.721.  Reference
     implementations for G.721 are available as part of the CCITT/ITU-T
     Software Tool Library (STL) from the ITU General Secretariat, Sales
     Service, Place des Nations, CH-1211 Geneve 20, Switzerland.  The
     library is covered by a license and is available at
     ftp://gaia.cs.umass.edu/pub/hgschulz/ccitt/ccitt_tools.tar.Z.

G722: G722 is specified in ITU-T recommendation G.722, "7 kHz audio-coding
     within 64 kbit/s".

G728: G728 is specified in ITU-T recommendation G.728, "Coding of speech at
     16 kbit/s using low-delay code excited linear prediction".

GSM: GSM (groupe spécial mobile) denotes the European GSM 06.10 provisional
     standard for full-rate speech transcoding, prI-ETS 300 036, which is
     based on RPE/LTP (residual pulse excitation/long term prediction)
     coding at a rate of 13 kb/s.  A reference implementation was written
     by Carsten Bormann and Jutta Degener (TU Berlin, Germany) and is
     available at ftp://ftp.cs.tu-berlin.de/pub/local/kbs/tubmik/gsm/.

DVI4: DVI4 is specified, with pseudo-code, in [2] as the ADPCM wave type.
     However, the encoding defined here as DVI4 differs in two respects
     from the IMA recommendation:

     - The header contains the predicted value rather than the first sample
       value.

     - IMA ADPCM blocks contain an odd number of samples, since the first
       sample of a block is contained just in the header (uncompressed),
       followed by an even number of compressed samples.  DVI4 contains an
       even number of compressed samples only, using the 'predict' word
       from the header to decode the first sample.

     Each packet contains a single DVI block.  This profile only defines
     the 4-bit-per-sample version, while IMA also specifies a
     3-bit-per-sample encoding.  The "header" word for each channel has the
     following structure:

        int16  predict;   /* predicted value of first sample
                             from the previous block (L16 format) */
        u_int8 index;     /* current index into stepsize table */
        u_int8 reserved;  /* set to zero by sender, ignored by receiver */

     Header words for all channels precede the compressed data.  An
     implementation is available from Jack Jansen via anonymous ftp from
     ftp://ftp.cwi.nl/local/pub/audio/adpcm.shar.

L8: L8 denotes linear audio data, using 8 bits of precision and an offset
     of 128; that is, the most negative signal is encoded as zero.

L16: L16 denotes uncompressed audio data, using 16-bit signed
     representation with 65535 equally divided steps between minimum and
     maximum signal level, ranging from -32768 to 32767.  The value is
     represented in two's complement notation and transmitted in network
     byte order.
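As a non-normative illustration of the L8 and L16 definitions above, the
following C sketch converts between the two representations; the helper
names are hypothetical and not part of this profile.

    #include <stdint.h>

    /* Illustrative only: relationship between L8 (unsigned, offset 128)
     * and L16 (16-bit two's complement). */
    static int16_t l8_to_l16(uint8_t s)
    {
        /* 0 (most negative) -> -32768, 128 -> 0, 255 -> 32512 */
        return (int16_t)(((int)s - 128) * 256);
    }

    static uint8_t l16_to_l8(int16_t s)
    {
        /* Keep the most significant octet and flip the sign bit to
         * re-apply the offset of 128. */
        return (uint8_t)((((uint16_t)s) >> 8) ^ 0x80);
    }

    /* Note that an L16 value is transmitted most significant octet
     * first (network byte order). */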
MPA: MPA denotes MPEG-I or MPEG-II audio encapsulated as elementary
     streams.  The encoding is defined in ISO standards ISO/IEC 11172-3 and
     13818-3.  The encapsulation is specified in RFC TBD, Section 3.
     Sampling rate and channel count are contained in the payload.

PCMU: PCMU is specified in CCITT/ITU-T recommendation G.711.  Audio data is
     encoded as eight bits per sample, after companding.  Code to convert
     between linear and mu-law companded data is available in [2].

PCMA: PCMA is specified in CCITT/ITU-T recommendation G.711.  Audio data is
     encoded as eight bits per sample, after companding.  Code to convert
     between linear and A-law companded data is available in [2].

LPC: LPC designates an experimental linear predictive encoding written by
     Ron Frederick, Xerox PARC, available from
     ftp://parcftp.xerox.com/pub/net-research/lpc.tar.Z.

VDVI: VDVI is a variable-rate version of DVI4, yielding speech bit rates of
     between 10 and 25 kbps.  It is specified for single-channel operation
     only.  It uses the following encoding:

        DVI4 codeword   VDVI bit pattern
        ________________________________
              0         00
              1         010
              2         1100
              3         11100
              4         111100
              5         1111100
              6         11111100
              7         11111110
              8         10
              9         011
             10         1101
             11         11101
             12         111101
             13         1111101
             14         11111101
             15         11111111
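Purely as an illustration, the VDVI code table above can be captured as C
data; the structure and names below are hypothetical, and how the
variable-length codes are packed into octets is not specified here.

    /* Illustrative only: the VDVI variable-length codes, recorded as
     * (value, length-in-bits) pairs. */
    struct vdvi_code { unsigned bits; unsigned len; };

    static const struct vdvi_code vdvi_table[16] = {
        { 0x00, 2 },  /*  0 -> 00       */
        { 0x02, 3 },  /*  1 -> 010      */
        { 0x0c, 4 },  /*  2 -> 1100     */
        { 0x1c, 5 },  /*  3 -> 11100    */
        { 0x3c, 6 },  /*  4 -> 111100   */
        { 0x7c, 7 },  /*  5 -> 1111100  */
        { 0xfc, 8 },  /*  6 -> 11111100 */
        { 0xfe, 8 },  /*  7 -> 11111110 */
        { 0x02, 2 },  /*  8 -> 10       */
        { 0x03, 3 },  /*  9 -> 011      */
        { 0x0d, 4 },  /* 10 -> 1101     */
        { 0x1d, 5 },  /* 11 -> 11101    */
        { 0x3d, 6 },  /* 12 -> 111101   */
        { 0x7d, 7 },  /* 13 -> 1111101  */
        { 0xfd, 8 },  /* 14 -> 11111101 */
        { 0xff, 8 },  /* 15 -> 11111111 */
    };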
TSP0: TSP0 designates the proprietary variable-rate, frame-based encoding
     called True Speech.  The encoding is defined for a sampling rate of
     7200 Hz and has an average data rate of 7200 bits per second.  Further
     information is available by contacting VocalTec (see the VSC encoding)
     or the address:

        DSP Group, Inc.
        email: tsplayer@dsgp.com

VSC: VSC designates the proprietary variable-rate encoding called VocalTec
     Software Compression.  The encoding is defined for a sampling rate of
     5500 Hz and has an average data rate of 963 bytes per second.  Further
     information is available by contacting:

        Alon Cohen
        VocalTec Ltd.
        Maskit 1, Herzliya
        Israel
        phone: +972-9-5612121
        email: alon@vocaltec.com

The standard audio encodings and their payload types are listed in Table 2.

4. Video

The following video encodings are currently defined, with their abbreviated
names used for identification:

CelB: The CELL-B encoding is a proprietary encoding proposed by Sun
     Microsystems.  The byte stream format is described in RFC TBD.

CPV: This proprietary encoding, "Compressed Packet Video", is implemented
     by Concept, Bolter, and ViewPoint Systems video codecs.  For further
     information, contact:

        Glenn Norem, President
        ViewPoint Systems, Inc.
        2247 Wisconsin Street, Suite 110
        Dallas, TX 75229-2037
        United States
        Phone: +1-214-243-0634

JPEG: The encoding is specified in ISO Standards 10918-1 and 10918-2.  The
     RTP payload format is as specified in RFC TBD.

H261: The encoding is specified in CCITT/ITU-T standard H.261.  The
     packetization and RTP-specific properties are described in RFC TBD.

HDCC: The HDCC encoding is a proprietary encoding used by Silicon Graphics.
     Contact inperson@sgi.com for further details.

MPV: MPV designates the use of MPEG-I and MPEG-II video elementary streams
     as specified in ISO Standards ISO/IEC 11172-2 and 13818-2,
     respectively.  The RTP payload format is as specified in RFC TBD,
     Section 3.

MP2T: MP2T designates the use of MPEG-II transport streams, for either
     audio or video.  The encapsulation is described in RFC TBD, Section 2.

nv: The encoding is implemented in the program 'nv' developed at Xerox PARC
     by Ron Frederick.

CUSM: The encoding is implemented in the program CU-SeeMe developed at
     Cornell University by Dick Cogger, Scott Brim, Tim Dorcey and John
     Lynn.

PicW: The encoding is implemented in the program PictureWindow developed at
     Bolt Beranek and Newman (BBN).

RGB8: 8-bit encoding of RGB values, sequenced TBD.  Each pixel can assume
     values from 0 to 255.  Each frame is prefixed by a header containing
     TBD.

5. Payload Type Definitions

Table 2 defines this profile's static payload type values for the PT field
of the RTP data header.  To assign a new value from the range marked
'unassigned' in the table, register the RTP payload format specification
with the IANA.  In addition, payload type values in the range 96--127 may
be defined dynamically through a conference control protocol, which is
beyond the scope of this document.  The payload type range marked
'reserved' has been set aside so that RTCP and RTP packets can be reliably
distinguished (see the section "Summary of Protocol Constants" of the RTP
protocol specification).

An RTP source emits a single RTP payload type at any given time; the
interleaving of several RTP payload types in a single RTP session is not
allowed, but multiple RTP sessions may be used in parallel to send multiple
media.  The payload types currently defined in this profile carry either
audio or video, but not both.  However, it is allowed to define payload
types that combine several media, e.g., audio and video, with appropriate
separation in the payload format.  Session participants agree, through
mechanisms beyond the scope of this specification, on the set of allowable
payload types in a given session.  This set may, for example, be defined by
the capabilities of the applications used, negotiated by a conference
control protocol or established by agreement between the human
participants.

Audio applications operating under this profile SHOULD, at a minimum, be
able to send and receive payload types 0 (PCMU, mu-law) and 5 (DVI4 at 8000
Hz).  This allows interoperability without format negotiation and ensures
successful negotiation with a conference control protocol.

All current video encodings use a timestamp frequency of 90000 Hz, the same
as the MPEG presentation time stamp frequency.  This frequency yields exact
integer timestamp increments for the typical 24, 25 and 30 Hz frame rates
and the 50 and 60 Hz field rates, and only about 1 ppm error for the 29.97
Hz NTSC frame rate.  While 90 kHz is the recommended rate for future video
encodings used within this profile, other rates are possible.  However, it
is not sufficient to use the video frame rate (typically between 15 and 30
Hz) because that does not provide adequate resolution for typical
synchronization requirements when calculating the RTP timestamp
corresponding to the NTP timestamp in an RTCP SR packet [8].  The timestamp
resolution must also be sufficient for the jitter estimate contained in the
receiver reports.
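As a non-normative illustration of the 90 kHz timestamp rate, the following
C sketch prints the per-frame RTP timestamp increment for several common
frame rates:

    #include <stdio.h>

    /* Illustrative only: RTP timestamp ticks per video frame at the
     * 90 kHz media clock used by the video encodings in this profile. */
    int main(void)
    {
        const double fps[] = { 24.0, 25.0, 30.0, 30000.0 / 1001.0 };
        for (unsigned i = 0; i < sizeof fps / sizeof fps[0]; i++)
            printf("%8.3f fps -> %8.3f ticks per frame\n",
                   fps[i], 90000.0 / fps[i]);
        /* Prints 3750, 3600, 3000 and exactly 3003 ticks per frame,
         * since 90000 * 1001 / 30000 = 3003. */
        return 0;
    }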
The standard video encodings and their payload types are also listed in
Table 2.

      PT      encoding    audio/video   clock rate   channels
              name        (A/V)         (Hz)         (audio)
      _________________________________________________________
      0       PCMU        A             8000         1
      1       1016        A             8000         1
      2       G721        A             8000         1
      3       GSM         A             8000         1
      4       unassigned  A             8000         1
      5       DVI4        A             8000         1
      6       DVI4        A             16000        1
      7       LPC         A             8000         1
      8       PCMA        A             8000         1
      9       G722        A             8000         1
      10      L16         A             44100        2
      11      L16         A             44100        1
      12      TSP0        A             7200         1
      13      VSC         A             5500         1
      14      MPA         A             90000        (see text)
      15      G728        A             8000         1
      16--22  unassigned  A
      23      RGB8        V             90000        N/A
      24      HDCC        V             90000        N/A
      25      CelB        V             90000        N/A
      26      JPEG        V             90000        N/A
      27      CUSM        V             90000        N/A
      28      nv          V             90000        N/A
      29      PicW        V             90000        N/A
      30      CPV         V             90000        N/A
      31      H261        V             90000        N/A
      32      MPV         V             90000        N/A
      33      MP2T        V             90000        N/A
      34--71  unassigned  V                          N/A
      72--76  reserved    N/A           N/A          N/A
      77--95  unassigned  ?
      96--127 dynamic     ?                          N/A

   Table 2: Payload types (PT) for standard audio and video encodings

6. Port Assignment

As specified in the RTP protocol definition, RTP data is to be carried on
an even UDP port number and the corresponding RTCP packets are to be
carried on the next higher (odd) port number.

Applications operating under this profile may use any such UDP port pair.
For example, the port pair may be allocated randomly by a session
management program.  A single fixed port number pair cannot be required
because multiple applications using this profile are likely to run on the
same host, and there are some operating systems that do not allow multiple
processes to use the same UDP port with different multicast addresses.

However, port numbers 5004 and 5005 have been registered for use with this
profile for those applications that choose to use them as the default pair.
Applications that operate under multiple profiles may use this port pair as
an indication to select this profile if they are not subject to the
constraint of the previous paragraph.  Applications need not have a default
and may require that the port pair be explicitly specified.

The particular port numbers were chosen to lie in the range above 5000 to
accommodate port number allocation practice within the Unix operating
system, where port numbers below 1024 can only be used by privileged
processes and port numbers between 1024 and 5000 are automatically assigned
by the operating system.

7. Acknowledgements

The comments and careful review of Steve Casner are gratefully
acknowledged.

8. Address of Author

Henning Schulzrinne
GMD Fokus
Hardenbergplatz 2
D-10623 Berlin
Germany
electronic mail: schulzrinne@fokus.gmd.de