Internet Engineering Task Force     Audio-Video Transport Working Group
Internet Draft                                           H. Schulzrinne
ietf-avt-profile-04.txt                                       GMD Fokus
                                                         March 23, 1995
                                                        Expires: 9/1/95

    RTP Profile for Audio and Video Conferences with Minimal Control

STATUS OF THIS MEMO

This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as ``work in progress''.

To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
ftp.isi.edu (US West Coast).

Distribution of this document is unlimited.

ABSTRACT

This note describes a profile for the use of the real-time transport
protocol (RTP) and the associated control protocol, RTCP, within audio
and video multiparticipant conferences with minimal control. It
provides interpretations of generic fields within the RTP
specification suitable for audio and video conferences. In particular,
this document defines a set of default mappings from payload type
numbers to encodings.

The document also describes how audio and video data may be carried
within RTP. It defines a set of standard encodings and their names
when used within RTP. However, the definitions are independent of the
particular transport mechanism used. The descriptions provide pointers
to reference implementations and the detailed standards. This document
is meant as an aid for implementors of audio, video and other
real-time multimedia applications.

H. Schulzrinne                                                 [Page 1]

Internet Draft                 AV Profile                March 23, 1995

1. Introduction

This profile defines aspects of RTP left unspecified in the RTP
protocol definition (RFC TBD). This profile is intended for use within
audio and video conferences with minimal session control. In
particular, no support for the negotiation of parameters or membership
control is provided. Other profiles may make different choices for the
items specified here. The profile specifies the use of RTP over
unicast and multicast UDP as well as ST-II.

(Ed.: How to indicate usage of the profile? Port numbers are not
likely to be well-defined.)

2. RTP and RTCP Packet Forms and Protocol Behavior

This profile follows the default and/or recommended aspects of the RTP
specification for these items:

(Ed.: Maybe the main spec should number these items, so that they can
be easily aligned between spec and profile?)

  o The standard format of the fixed RTP data header is used (one
    marker bit).

  o No additional fixed fields are appended to the RTP data header.

  o The suggested constants are to be used for the RTCP report
    interval calculation.

  o No extension section is defined for the RTCP SR or RR packet.

  o No additional RTCP packet types are defined by this profile
    specification.

  o The RTP default security services are also the default under this
    profile.

  o The standard mapping of RTP and RTCP to transport-level addresses
    is used.

  o No encapsulation of RTP packets is specified.

  o No RTP header extensions are defined, but applications operating
    under this profile may use such extensions. Thus, applications
    should not assume that the RTP header X bit is always zero and
    should be prepared to ignore the header extension. Extensions
    should register the content of the first 16 bits with IANA. (Ed.:
    Yet another IANA space? Other ideas?)

  o Applications may use any of the SDES items described.

New encodings are to be registered with the Internet Assigned Numbers
Authority. When registering a new encoding, the following information
should be provided:

  o name and description of the encoding, in particular the RTP
    timestamp clock rate;

  o indication of who has change control over the encoding (for
    example, CCITT/ITU, other international standardization bodies, a
    consortium or a particular company or group of companies);

  o any operating parameters;

  o a reference to a further description, if available, for example
    (in order of preference) an RFC, a published paper, a patent
    filing, a technical report or a computer manual;

  o for proprietary encodings, contact information (postal and email
    address);

  o the payload type value for this profile.

3. Audio

3.1. Encoding-independent recommendations

The following recommendations are default operating parameters.
Applications should be prepared to handle other values. The ranges
given are meant to give guidance to application writers, allowing a
set of applications conforming to these guidelines to interoperate
without additional negotiation. These guidelines are not intended to
restrict operating parameters for applications that can negotiate a
set of interoperable parameters, e.g., through a conference control
protocol.

For packetized audio, the default packetization interval should have a
duration of 20 ms, unless otherwise noted when describing the
encoding. The packetization interval determines the minimum end-to-end
delay; longer packets introduce less header overhead but higher delay
and make packet loss more noticeable. For non-interactive applications
such as lectures, or for links with severe bandwidth constraints, a
higher packetization delay may be appropriate.

For N-channel encodings, each sampling period (say, 1/8000 of a
second) generates N samples.
(This terminology is standard, but somewhat confusing, as the total
number of samples generated per second is then the sampling rate times
the channel count.)

If multiple audio channels are used, channels are numbered
left-to-right, starting at one. In RTP audio packets, information from
lower-numbered channels precedes that from higher-numbered channels.
For more than two channels, the convention followed by the AIFF-C
audio interchange format should be followed [1]. For two-channel
stereo, the numbering sequence is left, right; for three channels,
left, right, center; for quadrophonic systems, front left, front
right, rear left, rear right; for four-channel systems, left, center,
right, and surround sound; for six-channel systems, left, left center,
center, right, right center and surround sound. All channels belonging
to a single sampling instant must be within the same packet.

The sampling frequency should be drawn from the set: 8000, 11025,
16000, 22050, 44100 and 48000 Hz. (The Apple Macintosh computers have
native sample rates of 22254.54 and 11127.27 Hz, which can be
converted to 22050 and 11025 Hz with acceptable quality by dropping 4
or 2 samples, respectively, in a 20 ms frame.)

A receiver should accept packets representing between 0 and 200 ms of
audio data.[1] Receivers should be prepared to accept multi-channel
audio, but may choose to play only a single channel.

3.2. Guidelines for Sample-Based Audio Encodings

In sample-based encodings, each audio sample is represented by a fixed
number of bits. Within the compressed audio data, codes for individual
samples may span octet boundaries. An RTP audio packet may contain any
number of audio samples, subject to the constraint that the number of
bits per sample times the number of samples per packet yields an
integral octet count. Fractional encodings produce less than one octet
per sample.
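
The integral-octet constraint above can be checked mechanically. The
following Python sketch is illustrative only and is not part of the
profile; the function name payload_octets and its parameters are
assumptions introduced here. It counts one sample per channel for each
sampling instant and checks only the generic constraint, not the
per-channel packing rules of fractional encodings.

```python
def payload_octets(bits_per_sample, sampling_instants, channels=1):
    """Return the payload size in octets for one packet.

    Hypothetical helper: raises ValueError when bits per sample times
    the number of samples does not give a whole number of octets.
    """
    if bits_per_sample <= 0 or channels <= 0 or sampling_instants < 0:
        raise ValueError("bits and channels must be positive, "
                         "sample count must not be negative")
    # Each sampling instant contributes one sample per channel.
    total_bits = bits_per_sample * sampling_instants * channels
    octets, leftover_bits = divmod(total_bits, 8)
    if leftover_bits:
        raise ValueError("samples do not fill an integral octet count")
    return octets


if __name__ == "__main__":
    # 160 sampling instants correspond to 20 ms at 8000 Hz.
    for bits in (8, 4, 3):
        print(bits, "bits/sample:", payload_octets(bits, 160), "octets")
```

With 160 sampling instants (20 ms at 8000 Hz), an 8-bit encoding
yields 160 octets, a 4-bit encoding 80 octets and a 3-bit encoding 60
octets. By contrast, 441 four-bit samples (20 ms at 22050 Hz) do not
fill a whole number of octets, so the sketch rejects that combination.
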
For sample-based encodings producing one or more octets per sample,
samples from different channels but the same sampling instant are
consecutive. For example, for a two-channel encoding, the octet
sequence is (left channel, first sample), (right channel, first
sample), (left channel, second sample), (right channel, second
sample), .... For multi-octet encodings, octets are transmitted in
network byte order (i.e., most significant octet first).

The packing order for fractional encodings is that described for the
IMA Wave types [2]. For audio encodings yielding four bits per sample,
eight such compressed samples from channel 1 are packed into one
32-bit word, followed by eight compressed samples from channel 2,
until all channels have been accommodated and the packing resumes at
channel 1. For audio encodings yielding three bits per sample, 32 such
compressed samples from channel 1 are packed into 12 octets, followed
by 32 samples from channel 2, etc.

_________________________
[1] This restriction (the 0 to 200 ms bound given in Section 3.1)
allows reasonable buffer sizing for the receiver.

3.3. Guidelines for Frame-Based Audio Encodings

Frame-based encodings encode a fixed-length block of audio into
another block of compressed data, typically also of fixed length. For
frame-based encodings, the sender may choose to combine several such
frames into a single message. The receiver can tell the number of
frames contained in a message since the frame duration is defined as
part of the encoding.

For frame-based codecs, the channel order is defined for the whole
block. That is, for two-channel audio, left and right samples are
coded independently, with the encoded frame for the left channel
preceding that for the right channel.

All frame-oriented audio codecs should be able to encode and decode
several consecutive frames within a single packet.
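
How a receiver might recover the frame count and playout duration of a
multi-frame packet can be sketched as follows. This Python fragment is
illustrative only: the function names are hypothetical, the frame
durations are those listed for the frame-based encodings in Table 1,
and the encoded frame size in octets is not given by this profile, so
it is passed in by the caller. The value of 33 octets per GSM frame
used in the example is an assumption.

```python
# Frame durations in ms, as listed for frame-based encodings in Table 1.
FRAME_MS = {"1016": 30, "GSM": 20, "LPC": 20}


def frames_in_payload(payload_len, frame_octets):
    """Return how many fixed-size frames a payload holds.

    frame_octets is an assumed, caller-supplied encoded frame size;
    a payload that is not a whole number of frames is rejected.
    """
    if frame_octets <= 0 or payload_len < 0:
        raise ValueError("invalid sizes")
    frames, leftover = divmod(payload_len, frame_octets)
    if leftover:
        raise ValueError("payload is not a whole number of frames")
    return frames


def payload_duration_ms(encoding, frames):
    """Return the audio duration represented by `frames` frames."""
    return FRAME_MS[encoding] * frames


if __name__ == "__main__":
    n = frames_in_payload(99, 33)  # 33 octets per frame is assumed
    print(n, "frames,", payload_duration_ms("GSM", n), "ms")
```

Under these assumptions a 99-octet GSM payload holds three frames, or
60 ms of audio, which is within the 200 ms that receivers are asked to
accept.
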
Since the frame size for frame-oriented codecs is fixed by the
encoding, there is no need for a separate designation for the same
encoding with a different number of frames per packet.

3.4. Audio Encodings

encoding    sample/frame    bits/sample    ms/frame
______________________________________________________
1016        frame                          30
G721        sample          4
G723        sample          3
GSM         frame                          20
IDVI        sample          4
LPC         frame                          20
L8          sample          8
L16         sample          16
MPA         frame
PCMU        sample          8
PCMA        sample          8

Table 1: Properties of Audio Encodings

1016: Encoding 1016 is a frame-based encoding using code-excited
      linear prediction (CELP) and is specified in Federal Standard
      FED-STD 1016 [3,4,5,6].

      The U. S. DoD's Federal-Standard-1016 based 4800 bps code
      excited linear prediction voice coder version 3.2 (CELP 3.2)
      Fortran and C simulation source codes are available for
      worldwide distribution at no charge (on DOS diskettes, but
      configured to compile on Sun SPARC stations) from: Bob Fenichel,
      National Communications System, Washington, D.C. 20305, phone
      +1-703-692-2124, fax +1-703-746-4960, and from
      ftp://ftp.super.org/pub/speech/celp_3.2a.tar.Z

G721: G721 is specified in ITU recommendation G.721. Reference
      implementations for G.721 and G.723 are available as part of the
      CCITT/ITU-T Software Tool Library (STL) from the ITU General
      Secretariat, Sales Service, Place des Nations, CH-1211 Geneve
      20, Switzerland. The library is covered by a license and is
      available for anonymous ftp on gaia.cs.umass.edu, file
      pub/hgschulz/ccitt/ccitt_tools.tar.Z

G723: G723 is specified in ITU recommendation G.723. See G721 for
      information about a reference implementation. As listed in
      Table 1, the encoding uses 3 bits per sample.

GSM:  GSM (group speciale mobile) denotes the European GSM 06.10
      provisional standard for full-rate speech transcoding, prI-ETS
      300 036, which is based on RPE/LTP (residual pulse
      excitation/long term prediction) coding at a rate of 13 kb/s.
      A reference implementation was written by Carsten Bormann and
      Jutta Degener (TU Berlin, Germany) and is available for
      anonymous ftp from ftp.cs.tu-berlin.de, directory
      pub/local/kbs/tubmik/gsm

IDVI: IDVI is specified, with reference implementation, in [2]. Each
      packet contains a single DVI block. The "header" word for each
      channel has the following structure:

        int16  valpred;  /* previous predicted value,
                            network byte order */
        u_int8 index;    /* index into stepsize table */

      Header words for all channels precede the compressed data. Note
      that the first 16 bits differ in definition from the IMA and
      Microsoft DVI ADPCM Wave type [7]. There, the first 16 bits
      contain the first (uncompressed) sample. (Ed.: This discrepancy
      is unfortunate, creating all kinds of problems with
      hardware-based codecs common with PCs.)

L8:   L8 denotes linear audio data, using 8 bits of precision with an
      offset of 128, that is, the most negative signal is encoded as
      0.

L16:  L16 denotes uncompressed audio data, using 16-bit signed
      representation with 65535 equally divided steps between minimum
      and maximum signal level, ranging from -32768 to 32767. The
      value is represented in two's complement notation and network
      byte order.

MPA:  MPA denotes MPEG-I or MPEG-II audio. The encoding is defined in
      ISO standards ISO/IEC 11172-3 and 13818-3. The encapsulation is
      specified in RFC TBD. Sampling rate and channel count are
      contained in the payload.

PCMU: PCMU is specified in CCITT/ITU-T recommendation G.711. Audio
      data is encoded as eight bits per sample, after companding. Code
      to convert between linear and mu-law companded data is available
      in [2].

PCMA: PCMA is specified in CCITT/ITU-T recommendation G.711. Audio
      data is encoded as eight bits per sample, after companding. Code
      to convert between linear and A-law companded data is available
      in [2].

LPC:  LPC designates an experimental linear predictive encoding
      written by Ron Frederick, Xerox PARC, available from
      ftp://parcftp.xerox.com/pub/net-research/lpc.tar.Z

VDVI: VDVI is a variable-rate version of IDVI, yielding speech bit
      rates of between 10 and 25 kbps. It is specified for
      single-channel operation only. It uses the following encoding:

        IDVI codeword    VDVI bit pattern
        __________________________________
         0               00
         1               010
         2               1100
         3               11100
         4               111100
         5               1111100
         6               11111100
         7               11111110
         8               10
         9               011
        10               1101
        11               11101
        12               111101
        13               1111101
        14               11111101
        15               11111111

TSP0: TSP0 designates the proprietary variable-rate, frame-based
      encoding called True Speech. The encoding is defined for a
      sampling rate of 7200 Hz and has an average data rate of 7200
      bits per second. Further information is available by contacting
      VocalTec (see VSC encoding) or the address:

        DSP Group, Inc.
        email: tsplayer@dsgp.com

VSC:  VSC designates the proprietary variable-rate encoding called
      VocalTec Software Compression. The encoding is defined for a
      sampling rate of 5500 Hz and has an average data rate of 963
      bytes per second. Further information is available by
      contacting:

        Alon Cohen
        VocalTec Ltd.
        Maskit 1, Herzliya
        Israel
        phone: +972-9-5612121
        email: alon@vocaltec.com

The standard audio encodings and their payload types are listed in
Table 2.

4. Video

The following video encodings are currently defined, with their
abbreviated names used for identification:

CelB: The CELL-B encoding is a proprietary encoding proposed by Sun
      Microsystems. The byte stream format is described in RFC TBD.

CPV:  This proprietary encoding, "Compressed Packet Video", is
      implemented by Concept, Bolter, and ViewPoint Systems video
      codecs. For further information, contact:

        Glenn Norem, President
        ViewPoint Systems, Inc.
        2247 Wisconsin Street, Suite 110
        Dallas, TX 75229-2037
        United States
        Phone: +1-214-243-0634

JPEG: The encoding is specified in ISO Standards 10918-1 and 10918-2.
      The RTP payload format is as specified in RFC TBD.

H261: The encoding is specified in CCITT/ITU-T standard H.261. The
      packetization and RTP-specific properties are described in RFC
      TBD.

HDCC: The HDCC encoding is a proprietary encoding used by Silicon
      Graphics. [TBD: Need contact information.]

MPV:  The MPEG-I and MPEG-II video encodings are specified in ISO
      Standards ISO/IEC 11172 and 13818-2, respectively. The RTP
      payload format is as specified in RFC TBD.

nv:   The encoding is implemented in the program 'nv' developed at
      Xerox PARC by Ron Frederick.

CUSM: The encoding is implemented in the program CU-SeeMe developed
      at Cornell University by Dick Cogger, Scott Brim, Tim Dorcey and
      John Lynn.

PicW: The encoding is implemented in the program PictureWindow
      developed at Bolt, Beranek and Newman (BBN).

RGB8: 8-bit encoding of RGB values, sequenced TBD. Each pixel can
      assume values from 0 to 255. Each frame is prefixed by a header
      containing TBD.

If there is no strong technical reason to the contrary, all video
encodings use a timestamp frequency of 65536 Hz.

The standard video encodings and their payload types are listed in
Table 2.

PT        encoding      audio/video    clock rate    channels
          name          (A/V)          (Hz)          (audio)
___________________________________________________________________
0         PCMU          A              8000          1
1         1016          A              8000          1
2         G721          A              8000          1
3         GSM           A              8000          1
4         G723          A              8000          1
5         IDVI          A              8000          1
6         IDVI          A              16000         1
7         LPC           A              8000          1
8         unassigned    A
9         unassigned    A
10        L16           A              44100         2
11        L16           A              44100         1
12        TSP0          A              7200          1
13        VSC           A              5500          1
14        MPA           A              90000         (see text)
15--22    unassigned    A
23        RGB8          V              65536         N/A
24        HDCC          V              65536         N/A
25        CelB          V              65536         N/A
26        JPEG          V              65536         N/A
27        CUSM          V              65536         N/A
28        nv            V              65536         N/A
29        PicW          V              65536         N/A
30        CPV           V              65536         N/A
31        H261          V              65536         N/A
32        MPV           V              90000         N/A
33--71    unassigned    V              65536         N/A
72--76    reserved      N/A            N/A           N/A
77--127   unassigned    ?              N/A

Table 2: Payload types (PT) for standard audio and video encodings

5. Port Assignment

As specified in the RTP protocol definition, RTP data is to be carried
on an even UDP port number and the corresponding RTCP packets are to
be carried on the next higher (odd) port number. Applications
operating under this profile may use any such UDP port pair or ST-II
SAP pair. For example, the port pair may be allocated randomly by a
session management program. A single fixed port number pair cannot be
required because multiple applications using this profile are likely
to run on the same host, and there are some operating systems that do
not allow multiple processes to use the same UDP port with different
multicast addresses.

However, port numbers 5004 and 5005 have been registered for use with
this profile for those applications that choose to use them as the
default pair. Applications that operate under multiple profiles may
use this port pair as an indication to select this profile if they are
not subject to the constraint of the previous paragraph. Applications
need not have a default and may require that the port pair be
explicitly specified.

The particular port numbers were chosen to lie in the range above 5000
to accommodate port number allocation practice within the Unix
operating system, where port numbers below 1024 can only be used by
privileged processes and port numbers between 1024 and 5000 are
automatically assigned by the operating system.

6. Address of Author

Henning Schulzrinne
GMD Fokus
Hardenbergplatz 2
D-10623 Berlin
Germany
electronic mail: hgs@fokus.gmd.de