Network Working Group                                      Johan Sjoberg
INTERNET-DRAFT                                         Magnus Westerlund
Expires: January 2005                                           Ericsson
                                                           Ari Lakaniemi
                                                                   Nokia
                                                            July 9, 2004


         Real-Time Transport Protocol (RTP) Payload Format for
              Extended AMR Wideband (AMR-WB+) Audio Codec

Status of this memo

   By submitting this Internet-Draft, I (we) certify that any applicable
   patent or other IPR claims of which I am (we are) aware have been
   disclosed, and any of which I (we) become aware will be disclosed, in
   accordance with RFC 3668 (BCP 79).

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is a submission of the IETF AVT WG. Comments should be
   directed to the AVT WG mailing list, avt@ietf.org.

Abstract

   This document specifies a real-time transport protocol (RTP) payload
   format to be used for Extended AMR Wideband (AMR-WB+) encoded audio
   signals. The AMR-WB+ codec is an audio extension of the AMR-WB codec
   providing additional modes designed to give higher quality of music
   and speech than the original modes. A MIME type registration is
   included for AMR-WB+.

Sjoberg, et. al.                                                [Page 1]

INTERNET-DRAFT       RTP payload format for AMR-WB+         July 9, 2004

TABLE OF CONTENTS

   1. Definitions
      1.1. Glossary
      1.2. Terminology
   2. Introduction
   3. Background on AMR-WB+ and Design Principles
      3.1. The AMR-WB+ Audio Codec
      3.2. Multi-rate Encoding and Mode Adaptation
      3.3. Voice Activity Detection and Discontinuous Transmission
      3.4. Support for Multi-Channel Session
      3.5. Unequal Bit-error Detection and Protection
      3.6. Robustness against Packet Loss
         3.6.1. Use of Forward Error Correction (FEC)
         3.6.2. Use of Frame Interleaving
      3.7. AMR-WB+ Audio over IP scenarios
   4. RTP Payload Format for AMR-WB+
      4.1. RTP Header Usage
      4.2. Payload Structure
      4.3. Payload definitions
         4.3.1. The Payload Table of Contents
         4.3.2. Audio Data
         4.3.3. Methods for Forming the Payload
         4.3.4. Payload Examples
      4.4. Interleaving Considerations
      4.5. Implementation Considerations
   5. Congestion Control
   6. Security Considerations
      6.1. Confidentiality
      6.2. Authentication and Integrity
      6.3. Decoding Validation
   7. Payload Format Parameters
      7.1. MIME Registration
      7.2. Mapping MIME Parameters into SDP
         7.2.1. Offer-Answer Model Considerations
         7.2.2. Examples
   8. IANA Considerations
   9. Acknowledgements
   10. References
      10.1. Normative references
      10.2. Informative References
   11. Authors' Addresses
   12. IPR Notice
   13. Copyright Notice
   14. Changes

1. Definitions

1.1. Glossary

   3GPP    - the Third Generation Partnership Project
   AMR     - Adaptive Multi-Rate Codec
   AMR-WB  - Adaptive Multi-Rate Wideband Codec
   AMR-WB+ - Extended Adaptive Multi-Rate Wideband Codec
   CMR     - Codec Mode Request
   CN      - Comfort Noise
   DTX     - Discontinuous Transmission
   FEC     - Forward Error Correction
   ISF     - Internal Sampling Frequency
   MI      - Mode Index
   SCR     - Source Controlled Rate Operation
   SID     - Silence Indicator (the frames containing only CN
             parameters)
   TS      - Timestamp
   VAD     - Voice Activity Detection
   UED     - Unequal Error Detection
   UEP     - Unequal Error Protection

1.2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [2].

   n^r is exponentiation where n is multiplied by itself r times; n and
   r are integers.

   k%m denotes the modulo operation (k mod m), i.e. the remainder part
   from the operation k/m; k and m are integers.

2. Introduction

   This document specifies the payload format for packetization of
   Extended Adaptive Multi-Rate Wideband (AMR-WB+) encoded audio signals
   into the Real-time Transport Protocol (RTP) [3].
   The payload format supports transmission of mono or stereo audio,
   aggregation of multiple frames per payload, and mechanisms that
   enhance robustness against packet loss.

   The AMR-WB+ codec is an extension of the Adaptive Multi-Rate Wideband
   (AMR-WB) codec and therefore has some features not available in
   AMR-WB. From a transport point of view, the new features are native
   support for stereophonic audio and the possibility of using different
   internal sampling frequencies. The primary usage scenario for AMR-WB+
   is transport over IP, so the interworking with other transport
   networks that AMR-WB requires is not needed.

   AMR-WB+ will mainly be used in streaming scenarios, where an
   octet-aligned format brings a substantial benefit by decreasing
   server complexity. Therefore nothing similar to the bandwidth
   efficient mode defined in [7] is specified for AMR-WB+. The bandwidth
   saved by a bandwidth efficient mode would also be very small for all
   extension modes, as they are octet aligned. The codec's built-in
   support for stereo encoding makes multi-channel support as in AMR and
   AMR-WB [7] difficult to implement, but also less needed. Therefore,
   the multi-channel support specified in the AMR and AMR-WB payload
   format is not specified for AMR-WB+. Because of all these changes and
   the different scope of the AMR-WB+ codec, this document defines a new
   RTP payload format that differs significantly from the ones for AMR
   and AMR-WB [7].

   No file format for AMR-WB+ is defined within this specification.
   Instead, the 3GPP-defined, ISO-based 3GP file format [14] will
   support AMR-WB+ and provides all functionality needed from a file
   format. This format also supports storage of AMR and AMR-WB, plus
   other multimedia formats, allowing for synchronized playback.
   As the 3GP format provides much greater capability than the
   previously defined formats for AMR and AMR-WB, this format is
   expected to be used and to be sufficient for all use cases.

   Background on AMR-WB+ and design principles can be found in Section
   3. The payload format itself is specified in Section 4 and follows
   the principles used in [3], [9], and [7]. In Section 7, a MIME type
   registration is provided.

3. Background on AMR-WB+ and Design Principles

   The Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] audio codec
   is designed for compression of speech and audio, achieving a low
   bit-rate with good quality. The codec is being specified by 3GPP, and
   the primary target applications within 3GPP are the packet-switched
   streaming service (PSS) [13] and the multimedia messaging service
   (MMS). However, due to its flexibility and robustness, AMR-WB+ is
   very well suited for streaming services in highly varying transport
   environments, e.g. the Internet.

   Because of the flexibility of this codec, the behavior in a
   particular application is controlled by several parameters that
   select options or specify the acceptable values for a variable. These
   options and variables are described in general terms at appropriate
   points in the text of this specification as parameters to be
   established through out-of-band means. In Section 7, all of the
   parameters are specified in the form of a MIME subtype registration
   for the AMR-WB+ encoding. The method used to signal these parameters
   at session setup or to arrange prior agreement of the participants is
   beyond the scope of this document; however, Section 7 provides a
   mapping of the parameters into the Session Description Protocol (SDP)
   [6] for those applications that use SDP.

   Note that the AMR-WB+ design and specification work in 3GPP is still
   work in progress.
   The target is to finalize the codec specifications within the 3GPP
   Release 6 timeline; the release will be frozen in September 2004 at
   the earliest. Because the codec work is not finished, some of the
   issues discussed in this Internet-Draft are still subject to change,
   but the draft presents the situation according to the authors' best
   knowledge at the time of writing.

3.1. The AMR-WB+ Audio Codec

   The AMR-WB+ audio codec was originally developed by 3GPP to be used
   for streaming and messaging services in GSM and 3G cellular systems.

   AMR-WB+ is designed as an audio extension to the AMR-WB speech codec.
   Thus, it includes the nine coding modes specified for AMR-WB,
   extended with additional new modes with bit rates ranging from 5.2 to
   53.3 kbit/s. Whereas the AMR-WB modes employ a 16000 Hz sampling
   frequency and operate on monophonic signals in all modes, the
   extension modes operate at a number of internal sampling
   frequencies, both in mono and stereo.

   The audio processing is performed on equal-size super frames, each
   corresponding to 2048 samples. The codec performs a number of
   encoding decisions for each super frame. Each super frame is then
   encoded in 4 transport frames, each corresponding to 512 samples and
   each individually decodable. For a transport frame to be decodable,
   its position within the super frame must be known. If the internal
   sampling rate is set at 25600 Hz, a transport frame is equal to 20 ms
   and the super frame to 80 ms. The encoder is only capable of changing
   the internal sampling frequency and encoding mode (both core and
   stereo) at the boundary between two super frames. This limitation
   does not apply for modes with index 0-9.

   The AMR-WB+ codec includes the AMR-WB modes, as shown in Table 1
   below.

                                Sampling     Mono/    Number of
   Index   Mode                 rate [kHz]   stereo   bits per frame
   -----------------------------------------------------------------
     0     WB 6.60 kbps         16           mono     132
     1     WB 8.80 kbps         16           mono     177
     2     WB 12.65 kbps        16           mono     253
     3     WB 14.25 kbps        16           mono     285
     4     WB 15.85 kbps        16           mono     317
     5     WB 18.25 kbps        16           mono     365
     6     WB 19.85 kbps        16           mono     397
     7     WB 23.05 kbps        16           mono     461
     8     WB 23.85 kbps        16           mono     477
     9     WB SID               16           mono     40
    10     WB+ 13.6 kbps        16/24        mono     272
    11     WB+ 18 kbps          16/24        stereo   360
    12     WB+ 24 kbps          16/24        mono     480
    13     WB+ 24 kbps          16/24        stereo   480
    14     LOST_SPEECH          -            -        0
    15     NO_DATA              -            -        0

   Table 1: AMR-WB modes.

   There are four special extension modes (Index 10-13 in Table 1) that
   have a fixed internal sampling frequency (25600 Hz) and fixed audio
   input frequencies (16 or 24 kHz). These modes share with the AMR-WB
   modes the property that each frame can only represent 20 ms.

   The remaining extension modes are specified by three parameters:
   mono bit-rate, stereo bit-rate and internal sampling frequency.
   There are eight mono bit-rates and 16 stereo bit-rates available, see
   Tables 2 and 3 below. Note that the mode naming below assumes an
   internal sampling frequency of 25600 Hz.

                              Number of
   Index   Mode               bits per frame
   -----------------------------------------
     0     WB+ 10.4 kbps      208
     1     WB+ 12.0 kbps      240
     2     WB+ 13.6 kbps      272
     3     WB+ 15.2 kbps      304
     4     WB+ 16.8 kbps      336
     5     WB+ 19.2 kbps      384
     6     WB+ 20.8 kbps      416
     7     WB+ 24.0 kbps      480

   Table 2: AMR-WB+ core mono modes.

                              Number of
   Index   Mode               bits per frame
   -----------------------------------------
     0     WB+_s 2.0 kbps     40
     1     WB+_s 2.4 kbps     48
     2     WB+_s 2.8 kbps     56
     3     WB+_s 3.2 kbps     64
     4     WB+_s 3.6 kbps     72
     5     WB+_s 4.0 kbps     80
     6     WB+_s 4.4 kbps     88
     7     WB+_s 4.8 kbps     96
     8     WB+_s 5.2 kbps     104
     9     WB+_s 5.6 kbps     112
    10     WB+_s 6.0 kbps     120
    11     WB+_s 6.4 kbps     128
    12     WB+_s 6.8 kbps     136
    13     WB+_s 7.2 kbps     144
    14     WB+_s 7.6 kbps     152
    15     WB+_s 8.0 kbps     160

   Table 3: AMR-WB+ stereo modes.

   When using the codec in an extension mode, the number of samples each
   frame corresponds to is always the same, but the duration of each
   frame varies depending on the internal sampling frequency. There is
   no preferred sampling frequency for the codec to operate at, but in
   order to limit the possible settings for an effective transmission,
   the following sampling frequencies are supported in this payload
   format.

           Internal    Frame      Frame
   ISF     Sampling    duration   duration         Bit-rate
   Index   Rate [Hz]   [ms]       [RTP TS ticks]   factor
   --------------------------------------------------------
     0     N/A         20         1440             N/A
     1     12800       40         2880             1/2
     2     14400       35.55      2560             9/16
     3     16000       32         2304             5/8
     4     17067       30         2160             2/3
     5     19200       26.67      1920             3/4
     6     21333       24         1728             5/6
     7     24000       21.33      1536             15/16
     8     25600       20         1440             1
     9     28800       17.78      1280             9/8
    10     32000       16         1152             5/4
    11     34133       15         1080             4/3
    12     38400       13.33      960              3/2
    13     42667       12         864              5/3

   Table 4: The relation between internal sampling frequency and frame
   lengths in time and RTP timestamp ticks. Note also that the RTP TS
   ticks assume a TS clock rate of 72000 Hz. Index 0 is used for AMR-WB
   and the 4 extension modes in Table 1.

   The duration of one AMR-WB+ audio transport frame is variable and
   depends on the internal sampling frequency. The frame durations are
   all between 12 and 40 ms per transport frame. A transport frame
   always represents 512 samples at the internal sampling frequency in
   use.
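
   As an illustration only (not part of the payload format), the tick
   values in Table 4 can be reproduced from the bit-rate factor column:
   a frame at 25600 Hz lasts 1440 ticks of the 72000 Hz clock, and for
   the other ISF indexes that value is divided by the bit-rate factor.
   The names ISF_FACTOR, frame_ticks and frame_ms in this sketch are
   chosen here and do not come from the codec specification.

```python
from fractions import Fraction

RTP_CLOCK_HZ = 72000   # timestamp clock rate assumed by Table 4
BASE_TICKS = 1440      # 20 ms frame at 25600 Hz internal sampling

# Bit-rate factor per ISF index, copied from the last column of Table 4.
ISF_FACTOR = {
    1: Fraction(1, 2), 2: Fraction(9, 16), 3: Fraction(5, 8),
    4: Fraction(2, 3), 5: Fraction(3, 4), 6: Fraction(5, 6),
    7: Fraction(15, 16), 8: Fraction(1), 9: Fraction(9, 8),
    10: Fraction(5, 4), 11: Fraction(4, 3), 12: Fraction(3, 2),
    13: Fraction(5, 3),
}


def frame_ticks(isf_index):
    """Return the transport frame duration in RTP timestamp ticks."""
    if isf_index == 0:
        return BASE_TICKS  # index 0: AMR-WB and the fixed 20 ms modes
    ticks = BASE_TICKS / ISF_FACTOR[isf_index]
    # Every entry of Table 4 is a whole number of ticks.
    assert ticks.denominator == 1
    return int(ticks)


def frame_ms(isf_index):
    """Return the transport frame duration in milliseconds."""
    return frame_ticks(isf_index) * 1000 / RTP_CLOCK_HZ
```

   For example, frame_ticks(10) gives 1152 and frame_ms(10) gives 16.0,
   matching the 32000 Hz row of Table 4.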
   Consequently, the length of an AMR-WB+ transport frame in RTP ticks
   depends on the internal sampling frequency and varies between 864 and
   2880. The bit-rate also depends on the internal sampling frequency:
   the last column of Table 4 gives the factor by which any bit-rate
   value stated for a 25600 Hz internal sampling frequency must be
   multiplied. The ISF index is carried in the payload format to
   indicate which internal sampling frequency is used for each AMR-WB+
   encoded frame.

   The mode index is used to identify the content of an AMR-WB+ encoded
   frame. The mode index indicates whether the frame is an AMR-WB mode,
   comfort noise, NO_DATA, an AMR-WB+ core mode in mono usage, or a
   combination of a core mode and a stereo mode. The mode indexes are
   presented in Table 5 below. The core mode and stereo mode index
   values are according to Tables 2 and 3 respectively. The bit-rate
   value assumes an internal sampling frequency of 25600 Hz.

           Core   Stereo   Total             Number of
   Index   mode   mode     bit-rate [kbps]   bits per frame
   ---------------------------------------------------------
   0-15    As specified in Table 1.
    16     0      None     10.4              208
    17     1      None     12.0              240
    18     2      None     13.6              272
    19     3      None     15.2              304
    20     4      None     16.8              336
    21     5      None     19.2              384
    22     6      None     20.8              416
    23     7      None     24.0              480
    24     0      0        12.4              248
    25     0      1        12.8              256
    26     0      4        14                280
    27     1      1        14.4              288
    28     1      3        15.2              304
    29     1      5        16                320
    30     2      2        16.4              328
    31     2      4        17.2              344
    32     2      6        18                360
    33     3      3        18.4              368
    34     3      5        19.2              384
    35     3      7        20                400
    36     4      4        20.4              408
    37     4      6        21.2              424
    38     4      9        22.4              448
    39     5      5        23.2              464
    40     5      7        24                480
    41     5      11       25.6              512
    42     6      8        26                520
    43     6      10       26.8              536
    44     6      15       28.8              576
    45     7      9        29.6              592
    46     7      10       30                600
    47     7      15       32                640
   48-127  Reserved

   Table 5: The normative mode index table. Bit-rates assume a 25600 Hz
   internal sampling frequency.
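
   The scaling of the Table 5 bit-rates with the internal sampling
   frequency can be checked with a few lines of arithmetic: the number
   of bits per frame stays the same while the frame duration changes.
   The following is a rough sketch under that reading; the helper name
   bit_rate_bps is hypothetical and the tick values are taken from
   Table 4.

```python
RTP_CLOCK_HZ = 72000   # timestamp clock rate used by this payload format


def bit_rate_bps(bits_per_frame, frame_ticks):
    """Bit-rate of a stream whose frames each carry bits_per_frame bits
    and each last frame_ticks RTP timestamp ticks."""
    return bits_per_frame * RTP_CLOCK_HZ / frame_ticks


# MI=41 carries 512 bits per frame (Table 5).
print(bit_rate_bps(512, 1440))   # ISF index 8 (25600 Hz): 1440 ticks
print(bit_rate_bps(512, 1152))   # ISF index 10 (32000 Hz): 1152 ticks
```

   The first call corresponds to the 25.6 kbps entry of Table 5, the
   second to the same mode index at a 32000 Hz internal sampling
   frequency, i.e. 25.6 kbps multiplied by the factor 5/4.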
   The actual bit-rate of the audio encoding is, as indicated, dependent
   on the combination of core mode and stereo mode (mode index) and the
   internal sampling frequency (ISF). A number of combinations produce
   the same bit-rate. For example, one possible way of producing a 32
   kbps audio stream is to use MI=41, i.e. 25.6 kbps, with an internal
   sampling frequency of 32 kHz (5/4 * 25.6 = 32 kbps).

3.2. Multi-rate Encoding and Mode Adaptation

   The multi-rate encoding (i.e., multi-mode) capability of AMR-WB+ is
   designed for preserving high audio quality under a wide range of
   bandwidth requirements and transmission conditions.

   AMR-WB+ enables seamless switching between modes using the same
   number of audio channels and the same internal sampling frequency.
   Every AMR-WB+ codec implementation is required to support all the
   respective audio coding modes defined by the codec and must be able
   to handle mode switching between any two modes. Switching between
   modes employing a different number of audio channels or a different
   internal sampling frequency is possible, but may not be seamless.
   Therefore it is recommended to perform any such switch during periods
   where the input is silent, or to take other precautions when
   performing a switch to ensure that good audio quality is maintained.

3.3. Voice Activity Detection and Discontinuous Transmission

   AMR-WB+ supports the same algorithms for voice activity detection
   (VAD) and generation of comfort noise (CN) parameters during silence
   periods as used by the AMR-WB codec. However, they can only be used
   together with the AMR-WB modes (MI=0-8). Hence, the AMR-WB+ codec
   also sometimes has the option to reduce the number of transmitted
   bits and packets during silence periods to a minimum.
   The operation of sending CN parameters at regular intervals during
   silence periods is usually called discontinuous transmission (DTX) or
   source controlled rate (SCR) operation. The AMR-WB+ frames containing
   CN parameters are called Silence Indicator (SID) frames. See [4] and
   [5] for more details about VAD and DTX functionality.

3.4. Support for Multi-Channel Session

   Some of the AMR-WB+ modes support encoding of stereophonic audio.
   Because of this native support for two-channel stereophonic signals,
   it does not seem necessary to support multi-channel transport with
   separate codecs as done in the AMR-WB RTP payload format [7].

   The codec has the capability of stereo to mono downmixing. Thus a
   receiver that is only capable of mono playout can still decode and
   play stereo signals. However, to avoid spending bit-rate on stereo
   encoding that will not be utilized, a mechanism for signalling
   mono-only support is defined.

3.5. Unequal Bit-error Detection and Protection

   The audio bits encoded in each AMR-WB frame have different perceptual
   sensitivity to bit errors. This property can be exploited, e.g. in
   cellular systems, to achieve better voice quality by using unequal
   error protection and detection (UEP and UED) mechanisms. However, as
   the extension modes in the AMR-WB+ codec do not have this property,
   UEP or UED cannot be utilized. An application that wants to use UEP
   or UED and needs payload format support for it should use the RTP
   payload format for the AMR-WB modes defined in RFC 3267 [7].

3.6. Robustness against Packet Loss

   The payload format supports several means, including forward error
   correction (FEC) and frame interleaving, to increase robustness
   against packet loss.

3.6.1. Use of Forward Error Correction (FEC)

   The simple scheme of repetition of previously sent data is one way of
   achieving FEC.
   Another possible scheme, which can be more bandwidth efficient, is to
   use payload-external FEC, e.g. RFC 2733 [11], which generates extra
   packets containing repair data. For the AMR-WB+ extension modes, it
   is only possible to use the codec to send redundant copies using the
   same mode index and internal sampling frequency. We describe such a
   scheme next.

   This involves the simple retransmission of previously transmitted
   frames together with the current frame(s). This is done by using a
   sliding window to group the audio frames to be sent in each payload.
   Figure 1 below shows an example.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--

     <---- p(n-1) ---->
              <----- p(n) ----->
                       <---- p(n+1) ---->
                                <---- p(n+2) ---->
                                         <---- p(n+3) ---->
                                                  <---- p(n+4) ---->

   Figure 1: An example of redundant transmission.

   In this example each frame is retransmitted once in the following RTP
   payload packet. Here, f(n-2)..f(n+4) denotes a sequence of audio
   frames and p(n-1)..p(n+4) a sequence of payload packets.

   The use of this approach does not require signaling at the session
   setup. In other words, the audio sender can choose to use this scheme
   without consulting the receiver. This is because a packet containing
   redundant frames will not look different from a packet with only new
   frames. If no packet is lost, the receiver may receive multiple
   copies of a frame for a certain timestamp, or both a NO_DATA
   indication and encoded audio for that frame.

   This redundancy scheme provides the same functionality as the one
   described in RFC 2198 "RTP Payload for Redundant Audio Data" [12]. In
   most cases the mechanism in this payload format is more efficient and
   simpler than requiring both endpoints to support RFC 2198 in
   addition.
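
   The sliding-window grouping of Figure 1 can be sketched in a few
   lines. This is a minimal illustration and not a normative procedure;
   the function name redundant_payloads and the parameter copies are
   assumptions introduced here, with copies=2 corresponding to the
   figure, where each frame is sent twice.

```python
def redundant_payloads(frames, copies=2):
    """Group frames into payloads with a sliding window.

    Each payload carries the newest frame plus the (copies - 1) frames
    sent before it, so every frame is transmitted up to `copies` times.
    """
    payloads = []
    for newest in range(len(frames)):
        oldest = max(0, newest - copies + 1)
        payloads.append(frames[oldest:newest + 1])
    return payloads
```

   With frames f0..f3 and copies=2, this yields the payloads [f0],
   [f0, f1], [f1, f2] and [f2, f3]: the loss of any single packet still
   leaves one copy of every frame.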
   There is one situation in which use of RFC 2198 is indicated: if
   some codec other than AMR-WB+ is desired for the redundant encoding,
   the AMR-WB+ payload format will not be able to carry it.

   The sender is responsible for selecting an appropriate amount of
   redundancy based on feedback about the channel, e.g., in RTCP
   receiver reports. The sender is also responsible for avoiding
   congestion, which may be exacerbated by redundancy (see Section 5 for
   more details).

3.6.2. Use of Frame Interleaving

   To decrease protocol overhead, the payload design allows several
   audio frames to be encapsulated into a single RTP packet. One of the
   drawbacks of such an approach is that a packet loss then means the
   loss of several consecutive audio frames, which usually causes
   clearly audible distortion in the reconstructed audio. Interleaving
   of frames can improve the audio quality in such cases by distributing
   the consecutive losses into a series of single frame losses. However,
   interleaving and bundling several frames per payload also increases
   end-to-end delay and sets higher buffering requirements, and it is
   therefore not appropriate for all usage scenarios. Nevertheless,
   streaming applications will most likely be able to exploit
   interleaving to improve audio quality in lossy transmission
   conditions.

   This payload design supports the use of frame interleaving as an
   option. The usage of this feature needs to be negotiated or at least
   signalled.

   The interleaving supported by this format is rather flexible. For
   example, a continuous pattern can be defined, as the example below
   shows.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--

              [  P(n)  ]
     [ P(n+1) ]                 [ P(n+1) ]
                       [ P(n+2) ]                 [ P(n+2) ]
                                         [ P(n+3) ]                 [P(
                                                           [ P(n+4) ]

   Figure 2: Example of interleaving pattern that has constant delay.

   In Figure 2 the consecutive frames, denoted f(n-2) to f(n+4), are
   aggregated two per packet with interleaving. The packets, P(n) to
   P(n+4), contain a pattern that allows for constant delay in both the
   interleaving and the deinterleaving process. The deinterleaving
   buffer in this example needs to have room for at least 3 frames,
   including the one that is ready to be consumed. This is needed, for
   example, when f(n) is the next frame to be played: the receiver has
   then consumed all previous frames and needs to have f(n), f(n+1) and
   f(n+3) in the buffer. When it is time to consume f(n+1), no further
   RTP packet is needed. When f(n+2) is to be consumed, P(n+3) is needed
   and the deinterleaving buffer will contain f(n+2), f(n+3) and f(n+5).

3.7. AMR-WB+ Audio over IP scenarios

   Since the primary target for the AMR-WB+ codec is packet switched
   streaming, the most relevant usage scenario for this payload format
   is IP end-to-end between a server and a terminal, as shown in Figure
   3.

   +----------+                          +----------+
   |          |    IP/UDP/RTP/AMR-WB+    |          |
   |  SERVER  |<------------------------>| TERMINAL |
   |          |                          |          |
   +----------+                          +----------+

   Figure 3: Server to terminal IP scenario

4. RTP Payload Format for AMR-WB+

   The AMR-WB+ payload format is different from the AMR and AMR-WB
   payload formats [7]. The structure is simpler and consists only of a
   table of contents and the audio data. The payload format has two
   modes, the basic mode and the interleaved mode. The main structural
   difference between the two modes is the extension of the table of
   contents with a timestamp offset field in the interleaved mode.
   As the AMR-WB+ codec contains all the functionality of the AMR-WB
   codec, anyone supporting the AMR-WB+ codec and this payload format is
   RECOMMENDED to also implement the payload format in RFC 3267 [7] for
   the AMR-WB modes. This will significantly help interoperability with
   other devices that only support AMR-WB, in applications and scenarios
   where this is possible. Otherwise an end-point that is in fact
   capable of everything except the RTP payload format for AMR-WB will
   not be able to communicate with such devices.

   The basic mode supports aggregation of multiple consecutive frames in
   a payload. The interleaved mode supports aggregation of multiple
   frames that are non-consecutive in time. It is possible to have
   frames of different internal sampling frequencies in the same
   payload. However, frequent switching of the internal sampling
   frequency is not expected. For the extension modes, the codec is
   restricted to switching ISF on super frame boundaries. However, to
   avoid any limitation on how many frames are present in a payload, the
   payload format allows for switching at any frame in the payload.

   The payload format is designed around the property that the AMR-WB+
   frames can be sorted and identified based on the RTP timestamp of
   each audio frame. For example, the timestamp of the audio frames is
   used to identify duplicates. The timestamp is also used in the
   deinterleaving buffer to regenerate the correct order of the frames
   before decoding.

   The interleaving scheme of this payload format is significantly more
   flexible than the one present in RFC 3267. The AMR and AMR-WB payload
   format is only capable of using periodic patterns with frames taken
   from an interleaving group at fixed intervals. This interleaving
   scheme allows for any pattern as long as the time difference between
   any two frames that are adjacent in the payload is not more than 0.91
   seconds, i.e. the maximum field value divided by the RTP timestamp
   rate (65535/72000). By using extra NO_DATA frames, even that limit
   can be extended.
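
   How a receiver might use the per-frame RTP timestamp to reorder
   frames and drop duplicates can be sketched as follows. This is only
   one possible approach under stated assumptions: frames are
   represented as (timestamp, mode index, data) tuples, timestamps do
   not wrap around 2^32 within the buffer, and a copy carrying audio is
   preferred over a NO_DATA entry for the same timestamp. The function
   name merge_frames is hypothetical.

```python
NO_DATA = 15   # mode index of a NO_DATA frame (Table 1)


def merge_frames(received):
    """Order frames by RTP timestamp and drop duplicates.

    `received` is an iterable of (timestamp, mode_index, data) tuples in
    arrival order. When the same timestamp arrives more than once, a
    copy with audio data is preferred over a NO_DATA placeholder.
    Assumption: timestamps do not wrap around 2**32 within the buffer.
    """
    by_ts = {}
    for ts, mode_index, data in received:
        kept = by_ts.get(ts)
        if kept is None or (kept[1] == NO_DATA and mode_index != NO_DATA):
            by_ts[ts] = (ts, mode_index, data)
    return [by_ts[ts] for ts in sorted(by_ts)]
```

   A real deinterleaving buffer would additionally bound its size and
   release frames as their playout time arrives; that is left out here
   to keep the sketch short.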
   To allow for error resiliency through redundant transmission, the
   periods covered by multiple packets MAY overlap in time. A receiver
   MUST be prepared to receive any audio frame multiple times. All
   multiply sent frames MUST use the same mode (or NO_DATA) and internal
   sampling frequency and have the same RTP timestamp.

   The payload is always made an integral number of octets long by
   padding with zero bits if necessary. If additional padding is
   required to bring the payload length to a larger multiple of octets
   or for some other purpose, then the P bit in the RTP header MAY be
   set and padding appended as specified in [3].

4.1. RTP Header Usage

   The format of the RTP header is specified in [3]. This payload format
   uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first frame in the packet. The timestamp clock
   frequency SHALL be 72000 Hz. This frequency allows the frame duration
   to be an integer number of RTP timestamp ticks for the supported
   internal sampling frequencies, and also gives reasonable conversion
   factors to commonly used audio sampling frequencies. See Section
   4.3.1 for how to derive the RTP timestamp for any audio frame beyond
   the first one.

   The RTP header marker bit (M) SHALL be set to 1 if the first frame
   carried in the packet is the first audio frame in a talkspurt. For
   all other packets the marker bit SHALL be set to zero (M=0).

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.
   The MIME parameter "channels" is used to indicate the maximum number
   of channels allowed to be used for a given payload. A payload type
   where channels=1 (mono) SHALL only carry mono content, while a
   payload type for which channels=2 has been declared MAY carry both
   mono and stereo content.

4.2. Payload Structure

   The complete payload consists of a payload table of contents and
   audio data representing one or more audio frames. The following
   diagram shows the general payload format layout:

   +-------------------+----------------
   | table of contents | audio data ...
   +-------------------+----------------

   Payloads containing more than one audio frame are called compound
   payloads.

   The following sections describe the variations taken by the payload
   format depending on whether the AMR-WB+ session is set up to use the
   basic mode or the interleaved mode.

4.3. Payload definitions

4.3.1. The Payload Table of Contents

   The table of contents (ToC) consists of a list of ToC entries where
   each entry corresponds to an audio frame carried in the payload,
   i.e.,

   +----------------+----------------+- ... -+----------------+
   |  ToC entry #1  |  ToC entry #2  |          ToC entry #N  |
   +----------------+----------------+- ... -+----------------+

   When multiple frames are present in a packet, the ToC entries SHALL
   be placed in the packet in order of their creation time.

   All fields in the RTP payload are in network byte order, i.e. with
   the leftmost bit being the most significant.
   A ToC entry takes the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Mode Index  |TFI|R|ISF mode |  Timestamp offset (optional)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this frame is followed by
      another audio frame in this payload; if set to 0, indicates that
      this frame is the last frame in this payload.

   Mode Index (7 bits): Indicates the audio codec mode used for the
      corresponding frame, i.e. the combination of AMR-WB+ core and
      stereo mode, the AMR-WB mode, or comfort noise, as specified by
      Table 5 in Section 3.1.

   Transport Frame Index (TFI) (2 bits): An index from 0 (first) to 3
      (last) indicating this transport frame's position in the super
      frame.

   ISF mode (5 bits): Indicates the internal sampling frequency employed
      for the corresponding frame. The index values correspond to the
      internal sampling frequencies specified in Table 4 in Section 3.1.
      This field SHALL be set to 0 for Mode Index values 0-13.

   Timestamp offset (16 bits): When using interleaved mode, this field
      SHALL be present, otherwise not. The field indicates the number of
      RTP timestamp ticks by which this frame is offset from the
      previous frame's RTP timestamp value. The RTP timestamp offset for
      the first audio frame SHALL be 0. The field is in network byte
      order and is a 16 bit unsigned integer.

   R (1 bit): Reserved bit, SHALL be set to 0 and SHALL be ignored by
      receivers.

   The RTP timestamp value for a frame is the timestamp value of the
   first sample encoded in the frame. The timestamp value for a frame is
   derived differently depending on whether basic or interleaved mode is
   used. In both cases the first frame in a compound packet has an RTP
   timestamp equal to the one given in the RTP header.
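
   The ToC entry layout defined above can be unpacked with a few shifts
   and masks. The following is a minimal sketch, not a complete
   depacketizer: the function name parse_toc and the dictionary keys are
   chosen here for illustration, and the caller is assumed to know from
   session setup whether interleaved mode is in use.

```python
import struct


def parse_toc(payload, interleaved=False):
    """Parse the table of contents at the start of a payload.

    Returns (entries, toc_length), where each entry is a dict holding
    the fields of one ToC entry and toc_length is the ToC size in
    octets. Raises struct.error if the payload is truncated.
    """
    entries = []
    pos = 0
    while True:
        b0, b1 = struct.unpack_from("!BB", payload, pos)
        pos += 2
        entry = {
            "F": b0 >> 7,            # 1 = another frame follows
            "MI": b0 & 0x7F,         # mode index, 7 bits
            "TFI": b1 >> 6,          # transport frame index, 2 bits
            "R": (b1 >> 5) & 0x01,   # reserved, ignored by receivers
            "ISF": b1 & 0x1F,        # ISF mode, 5 bits
        }
        if interleaved:
            # 16 bit unsigned timestamp offset in network byte order.
            (entry["offset"],) = struct.unpack_from("!H", payload, pos)
            pos += 2
        entries.append(entry)
        if entry["F"] == 0:
            return entries, pos
```

   The audio data of the first frame starts at octet toc_length of the
   payload. Validation of the mode index against Table 5 is not shown.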
In the basic mode, the RTP timestamp of any subsequent frame is derived by adding the durations of all preceding frames in the payload to the RTP header timestamp value. For example, if the RTP header timestamp value is 12345 and the frame duration is 16 ms (internal sampling frequency = 32 kHz, i.e., 1152 ticks of the 72000 Hz RTP clock), then the RTP timestamp of the fourth frame in the payload is 12345 + 3 * 1152 = 15801.

In interleaved mode, the RTP timestamp of a frame is derived, in modulo 2^32 arithmetic, from the RTP header timestamp and the sum of the timestamp offset fields of the ToC entries up to and including the frame in question. For example, consider deriving the RTP timestamp for the third frame in a compound packet with the following header and ToC information:

   RTP header TS:         12345
   Frame 1 offset field:      0
   Frame 2 offset field:  13824
   Frame 3 offset field:  18432

In this case one simply adds the offset values up to and including the current frame to the header timestamp. Frame 3's timestamp is thus (12345 + 0 + 13824 + 18432) % 2^32 = 44601, where % denotes the modulo operation.

The value of the mode index is defined in Table 5 in Section 3.1. MI=14 (AUDIO_LOST) is used to indicate frames that are lost. A NO_DATA (MI=15) frame means either that no data was produced by the audio encoder for that frame or that no data for that frame is transmitted in the current payload (i.e., valid data for that frame could be sent in either an earlier or a later packet). The duration of these non-included frames depends on the internal sampling frequency indicated by the ISF mode field.

For modes with index 0-13, the ISF mode field SHALL be set to 0 and has no meaning. The frame length for these modes is fixed at 20 ms, corresponding to an RTP timestamp duration of 1440 ticks. For modes with index 0-9, the TFI field SHALL be set to 0 and has no meaning.
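The two timestamp derivations above can be sketched as follows. This is a minimal illustration rather than a normative algorithm; the function names are hypothetical, and the per-frame durations in RTP ticks are assumed to have been determined already from the Mode Index and ISF mode fields.

```python
from typing import List

TS_MODULO = 2 ** 32  # RTP timestamps are 32-bit values


def basic_mode_timestamps(header_ts: int, frame_ticks: List[int]) -> List[int]:
    """frame_ticks[i] is the duration of frame i in RTP timestamp ticks."""
    timestamps = []
    ts = header_ts
    for ticks in frame_ticks:
        timestamps.append(ts % TS_MODULO)
        ts += ticks  # the next frame starts where this one ends
    return timestamps


def interleaved_mode_timestamps(header_ts: int, offsets: List[int]) -> List[int]:
    """offsets[i] is the timestamp offset field of ToC entry i."""
    timestamps = []
    ts = header_ts
    for offset in offsets:
        # Each offset is relative to the previous frame's timestamp.
        ts = (ts + offset) % TS_MODULO
        timestamps.append(ts)
    return timestamps


# The worked examples from the text:
print(basic_mode_timestamps(12345, [1152, 1152, 1152, 1152])[3])  # 15801
print(interleaved_mode_timestamps(12345, [0, 13824, 18432])[2])   # 44601
```

Keeping a running sum in the interleaved case makes the modulo wrap-around explicit at every step, which matters when the RTP header timestamp is close to 2^32.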
If a ToC entry with an undefined MI value is received, the whole packet SHOULD be discarded. This is to avoid the loss of data synchronization in the depacketization process, which can result in a severe degradation in audio quality.

Note that packets containing only NO_DATA frames SHOULD NOT be transmitted. Also, NO_DATA frames at the end of a frame sequence to be carried in a payload SHOULD NOT be included in the transmitted packet. The AMR-WB+ SCR/DTX is identical to the AMR-WB SCR/DTX described in [5] and SHALL only be used in combination with the AMR-WB modes (0-8).

When multiple frames are present, their ToC entries are placed in the ToC in order of their creation time, independent of the payload mode. In basic mode the frames are consecutive in time, while in interleaved mode the frames may not only be non-consecutive in time but may even have varying inter-frame distances.

The following figure shows an example of a ToC of three entries in basic mode.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index1 | 0 |0|ISF mode1|1| Mode Index2 | 1 |0|ISF mode2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Mode Index3 | 2 |0|ISF mode3|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following figure shows an example of a ToC of three entries in interleaved mode.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index1 | 2 |0|ISF mode1|      Timestamp offset 1       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index2 | 0 |0|ISF mode2|      Timestamp offset 2       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Mode Index3 | 3 |0|ISF mode3|      Timestamp offset 3       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.2. Audio Data

Audio data of a payload contains one or more audio frames or comfort noise frames, as described in the ToC of the payload.

Note that for ToC entries with MI=14 or 15, there is no corresponding audio frame present in the audio data.

Each audio frame for an extension mode represents an AMR-WB+ transport frame containing the encoding of 512 samples of audio sampled at the internal sampling frequency specified by the ISF mode indicator. Modes with index 10-13 are the exception, as they are only capable of using a single internal sampling frequency (25600 Hz). The encoding modes (core and stereo) are indicated in the mode index field of the corresponding ToC entry. The octet length of the audio frame is implicitly defined by the mode indicated in the mode index field. The order and numbering notation of the bits are as specified in [1]. As specified there, the bits of the AMR-WB audio frames (mode indices in the range 0-8) have been rearranged in order of decreasing sensitivity. For the AMR-WB+ modes and comfort noise frames, the bits are in the order produced by the encoder. The resulting bit sequence for a frame of length K bits is denoted d(0), d(1), ..., d(K-1).

The last octet of each audio frame MUST be padded with zeroes at the end if not all bits in the octet are used. In other words, each audio frame MUST be octet-aligned.

4.3.3. Methods for Forming the Payload

The payload begins with the table of contents, consisting of a list of ToC entries of two or four bytes each.

The audio data follows the table of contents; all of the octets comprising an audio frame are appended to the payload as a unit. The audio frames are packed in the same order as their corresponding ToC entries are arranged in the ToC list, with the exception that if a given frame has a ToC entry with MI=14 or 15, there are no data octets present for that frame.
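The payload forming rules of Section 4.3.3 can be illustrated with the following Python sketch. It is not a reference implementation: the names Frame, pad_bits, and build_payload are hypothetical, and the encoded frame bits are assumed to be supplied by an encoder that is outside the scope of the sketch.

```python
import struct
from dataclasses import dataclass
from typing import List

NO_AUDIO_MODES = (14, 15)  # AUDIO_LOST and NO_DATA carry no data octets


@dataclass
class Frame:
    mode_index: int
    tfi: int
    isf_mode: int
    data: bytes = b""   # octet-aligned frame data (empty for MI=14/15)
    ts_offset: int = 0  # only used in interleaved mode


def pad_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters, zero-padding the last octet."""
    padded = bits + "0" * (-len(bits) % 8)
    return int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""


def build_payload(frames: List[Frame], interleaved: bool) -> bytes:
    toc = bytearray()
    audio = bytearray()
    for i, frame in enumerate(frames):
        f_bit = 1 if i < len(frames) - 1 else 0  # 0 marks the last entry
        word = (f_bit << 15) | (frame.mode_index << 8) \
            | (frame.tfi << 6) | frame.isf_mode  # reserved bit R stays 0
        toc += struct.pack("!H", word)
        if interleaved:
            toc += struct.pack("!H", frame.ts_offset)
        if frame.mode_index not in NO_AUDIO_MODES:
            audio += frame.data  # each frame is appended as a unit
    return bytes(toc + audio)
```

Under these assumptions, three frames of 280 bits each (35 octets after alignment) sent in basic mode would yield a payload of 6 + 3 * 35 = 111 octets.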
4.3.4. Payload Examples

4.3.4.1. Example 1, Basic Payload Carrying Multiple Frames

The following diagram shows a payload from a session that carries three AMR-WB+ frames of the 14 kbps coding mode (MI=26), each with a frame length of 280 bits. The internal sampling frequency in this example is 25.6 kHz (ISF mode = 8). The TFI for the first frame is 2, indicating that the first transport frame in this payload is the third in a super frame. The following frames are consecutive, i.e., the fourth transport frame of that super frame and the first transport frame of the next.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index1 | 2 |0|ISF mode1|1| Mode Index2 | 3 |0|ISF mode2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Mode Index3 | 0 |0|ISF mode3|   f1(0..7)    |   f1(8..15)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f1(272..279)  |   f2(0..7)    |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      ...                      | f2(272..279)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f3(0..7)    |                      ...                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f3(272..279)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.4.2. Example 2, Payload in Interleaved mode

This example shows a payload carrying three frames of the 24 kbps stereo coding mode (MI=40). This payload uses the interleaved mode.
Frames 1, 2 and 3 are not consecutive; in playout order they are frames 1, 9, and 17 of a sequence, and the TFI values match this. The internal sampling frequency in this example is 32 kHz (ISF mode = 10).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index1 | 1 |0|ISF mode1|      Timestamp offset 1       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Mode Index2 | 1 |0|ISF mode2|      Timestamp offset 2       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Mode Index3 | 1 |0|ISF mode3|      Timestamp offset 3       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f1(0..7)    |   f1(8..15)   |  f1(16..23)   |  f1(24..31)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f1(448..455)  | f1(456..463)  | f1(464..471)  | f1(472..479)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f2(0..7)    |   f2(8..15)   |  f2(16..23)   |  f2(24..31)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f2(448..455)  | f2(456..463)  | f2(464..471)  | f2(472..479)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f3(0..7)    |   f3(8..15)   |  f3(16..23)   |  f3(24..31)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f3(448..455)  | f3(456..463)  | f3(464..471)  | f3(472..479)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.4. Interleaving Considerations

The new, more flexible interleaving scheme requires some further usage considerations. As presented in the example in Section 3.6.2, an interleaving pattern requires a deinterleaving buffer of a certain size.
This required buffer space, expressed as a number of frame slots, is indicated using the "interleaving" MIME parameter. The number of frame slots needed can be converted into an actual memory requirement by considering the largest (in bytes) combination of AMR-WB+ core and stereo modes.

However, the frame buffer size is not always sufficient to determine when it is appropriate to start consuming frames from the interleaving buffer. Two cases exist: switching of the internal sampling frequency and changes of the interleaving pattern. For this reason the "int-delay" MIME parameter is defined. It allows a sender to indicate the minimum media time that needs to be present in the buffer before the receiver starts consuming media from it.

4.5. Implementation Considerations

An application implementing this payload format MUST understand all the payload parameters in the out-of-band signaling used. For example, if an application uses SDP, all the SDP and MIME parameters in this document MUST be understood. This requirement ensures that an implementation can always decide whether or not it is capable of communicating.

Both basic and interleaved mode SHALL be implemented. The implementation burden of both is rather small, and requiring both ensures interoperability. It is also RECOMMENDED to implement the AMR-WB format in RFC 3267 [7] for applications or scenarios where interoperability with AMR-WB-only codecs is necessary.

When doing error concealment, certain precautions are needed due to the possibility of switching of the internal sampling frequency. The first problem is that unless at least one audio frame later than the frame to conceal, together with its timestamp value, is available when performing error concealment, the receiver can conceal using incorrect frame lengths, which can in the worst case make some of the subsequent frames unusable.
Example:

   Frame nr      :  1   2   3   4
   Frame Len (ms): 20  15  15  15

Assume that the receiver has received frame 1 but none of the following frames. When it is time to decode the next frame, the decoder is going to conceal frame 2. However, as this frame was lost, the receiver does not know that it represents 15 ms instead of the previous 20 ms. When the receiver then gets frame 4, it can determine that it should have concealed 30 ms to cover the missing frames 2 and 3, either as one 30 ms frame or as several frames adding up to 30 ms. This is something a receiver implementation will need to consider and handle appropriately for the application. A rather basic way to solve this is to be capable of removing the extra time generated by the wrongly concealed frame, thus allowing a receiver to at least maintain synchronization. The problem is due to the switching of the internal sampling frequency.

5. Congestion Control

The general congestion control considerations for transporting RTP data apply to AMR-WB+ audio over RTP as well; see RTP [3] and any applicable RTP profile such as AVP [9]. However, the multi-rate capability of AMR-WB+ audio coding provides a mechanism for controlling congestion, since the bandwidth demand can be adjusted by selecting a different coding mode or a lower internal sampling rate.

Another parameter that may impact the bandwidth demand for AMR-WB+ is the number of frames that are encapsulated in each RTP payload. Packing more frames in each RTP payload can reduce the number of packets sent and hence the overhead from IP/UDP/RTP headers, at the expense of increased delay and reduced robustness against packet losses.

If forward error correction (FEC) is used to combat packet loss, the amount of redundancy added by FEC will need to be regulated so that the use of FEC itself does not cause a congestion problem.

6.
Security Considerations

RTP packets using the payload format defined in this specification are subject to the general security considerations discussed in RTP [3].

As this format transports encoded audio, the main security issues include confidentiality, integrity protection, and authentication of the audio itself. The payload format itself does not have any built-in security mechanisms. Any suitable external mechanisms, such as SRTP [10], MAY be used.

Neither this payload format nor the AMR-WB+ decoder exhibits any significant non-uniformity in the receiver-side computational complexity for packet processing, and thus they are unlikely to pose a denial-of-service threat due to the receipt of pathological data.

6.1. Confidentiality

To achieve confidentiality of the encoded AMR-WB+ audio, all audio data bits will need to be encrypted. There is less need to encrypt the payload header or the table of contents, because 1) they only carry information about the frame type, and 2) this information could be useful to a third party, e.g., for quality monitoring.

As long as the AMR-WB+ payload is only packed and unpacked at either end, encryption may be performed after packet encapsulation so that there is no conflict between the two operations.

6.2. Authentication and Integrity

To authenticate the sender of the audio and provide integrity protection, an external mechanism has to be used. It is RECOMMENDED that such a mechanism protect all the audio data bits and the RTP header.

Data tampering by a man-in-the-middle attacker could result in erroneous depacketization/decoding that could lower the audio quality.

To prevent a man-in-the-middle attacker from tampering with the payload packets, some additional information besides the audio bits SHOULD be protected. This may include the ToC, RTP timestamp, RTP sequence number, RTP payload type, and the RTP marker bit.

6.3.
Decoding Validation

When processing a received payload packet, if the receiver finds that the calculated payload length, based on the information of the session and the values found in the payload header fields, does not match the size of the received packet, the receiver SHOULD discard the packet. This is because decoding a packet that has errors in its length field could severely degrade the audio quality.

7. Payload Format Parameters

This section defines the parameters that may be used to select features of the AMR-WB+ payload format. The parameters are defined here as part of the MIME subtype registration for the AMR-WB+ audio codec. A mapping of the parameters into the Session Description Protocol (SDP) [6] is also provided for those applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use MIME or SDP.

The data format and parameters are only specified for real-time transport in RTP.

7.1. MIME Registration

The MIME subtype for the Extended Adaptive Multi-Rate Wideband (AMR-WB+) codec is allocated from the IETF tree since AMR-WB+ is expected to be a widely used audio codec in general streaming applications.

Note that any unspecified parameter MUST be ignored by the receiver.

Media Type name: audio

Media subtype name: AMR-WB+

Required parameters: None

Optional parameters:

   These parameters apply to RTP transfer only.

   channels: The maximum number of audio channels present in the audio frames. Permissible values are 1 (mono) or 2 (stereo). If no parameter is present, the maximum number of channels is 2 (stereo).

   interleaving: Indicates that frame level interleaving mode SHALL be used for the payload; its value defines the maximum number of frames allowed in an interleaving buffer (see Section 4.4). If this parameter is not present, interleaving SHALL NOT be used.
   int-delay: The minimum media time delay, in RTP timestamp ticks, that is needed in the deinterleaving buffer to ensure correct decoding, i.e., the difference in RTP timestamp between the earliest and latest audio frame present in the deinterleaving buffer.

   ptime: see RFC 2327 [6].

   maxptime: see Section 8 in RFC 3267 [7].

Encoding considerations: This type is only defined for transfer via RTP (STD 64) and as described in Section 4 of RFC XXXX.

Security considerations: See Section 6 of RFC XXXX.

Public specification: Please refer to Section 10 of RFC XXXX.

Additional information:

   File storage of the AMR-WB+ format is to be specified within the 3GPP defined ISO based multimedia file format defined in 3GPP TS 26.244, see reference [14] of RFC XXXX. The file format has the MIME types "audio/3GPP" or "video/3GPP" as defined by RFC YYYY [15].

   To maintain interoperability with AMR-WB capable end-points, in cases where negotiation is possible and the AMR-WB+ end-point supporting this format also supports RFC 3267 for AMR-WB transport, an AMR-WB+ end-point SHOULD also declare itself as AMR-WB capable (i.e., as also supporting "audio/AMR-WB" as specified in RFC 3267).

   As the AMR-WB+ decoder is capable of performing stereo to mono conversions, all receivers of AMR-WB+ should be able to receive both stereo and mono, even if the receiver is only capable of playing out mono signals.

Person & email address to contact for further information:
   johan.sjoberg@ericsson.com
   ari.lakaniemi@nokia.com

Intended usage: COMMON. It is expected that many IP based streaming applications will use this type.

Author/Change controller:
   johan.sjoberg@ericsson.com
   ari.lakaniemi@nokia.com
   IETF Audio/Video transport working group

7.2.
Mapping MIME Parameters into SDP

The information carried in the MIME media type specification has a specific mapping to fields in the Session Description Protocol (SDP) [6], which is commonly used to describe RTP sessions. When SDP is used to specify sessions employing the AMR-WB+ codec, the mapping is as follows:

- The MIME type ("audio") goes in SDP "m=" as the media name.

- The MIME subtype (payload format name) goes in SDP "a=rtpmap" as the encoding name. The RTP clock rate in "a=rtpmap" SHALL be 72000 for AMR-WB+, and the encoding parameter number of channels MUST either be explicitly set to 1 or 2, or be omitted, implying the default value of 2.

- The parameters "ptime" and "maxptime" go in the SDP "a=ptime" and "a=maxptime" attributes, respectively.

- Any remaining parameters go in the SDP "a=fmtp" attribute by copying them directly from the MIME media type string as a semicolon separated list of parameter=value pairs.

7.2.1. Offer-Answer Model Considerations

To achieve good interoperability for the AMR-WB+ RTP payload when SDP is used with the Offer-Answer model [8], the following considerations should be made.

For negotiable offer/answer usage, the parameters SHALL be interpreted as follows:

- The "interleaving" parameter is declarative. For streams declared as sendrecv or recvonly, the receiver will accept payloads using the interleaved mode of the payload format, and the value declares the amount of buffer space the receiver has available for the sender to utilize. For sendonly streams, the parameter indicates the desired configuration and amount of buffer space. An answerer is RECOMMENDED to accept the offered values if capable of using them.

- The "int-delay" parameter is declarative. For streams declared as sendrecv or recvonly, the value indicates the maximum initial delay the receiver will accept in the deinterleaving buffer.
  For sendonly streams, the value is the amount of media time the sender desires to use; the value SHOULD be copied into any response.

- The "channels" parameter is declarative. For "sendonly" streams it indicates the desired channel usage, stereo and mono, or mono only. For "recvonly" and "sendrecv" streams the parameter indicates what the receiver accepts to use. As any receiver will be capable of receiving stereo mode and performing local mixing with the AMR-WB+ decoder, there is normally only one reason to restrict a stream to mono only: to avoid spending bit-rate on data that is not utilized when the front-end is only capable of mono.

- The "ptime" parameter works as indicated by the offer/answer model [8]; "maxptime" SHALL be used in the same way.

- To maintain interoperability with AMR-WB in cases where negotiation is possible, an AMR-WB+ capable end-point that also implements the AMR-WB payload format [7] is RECOMMENDED to also declare itself capable of AMR-WB, as it is a subset of the AMR-WB+ codec.

In declarative usage, like SDP in RTSP [16] or SAP [17], the parameters SHALL be interpreted as follows:

- The "interleaving" parameter, if present, configures the payload format in that mode, and the value indicates the number of frames that the deinterleaving buffer is required to support in order to handle this session correctly.

- The "int-delay" parameter indicates the initial buffering delay required to receive this stream correctly.

- The "channels" parameter indicates whether the content being transmitted can contain both stereo and mono modes, or only mono.

- All other parameters indicate the values that are being used by the sending entity.

7.2.2. Examples

One example SDP session description utilizing AMR-WB+ mono and stereo encoding follows.
   m=audio 49120 RTP/AVP 99
   a=rtpmap:99 AMR-WB+/72000/2
   a=fmtp:99 interleaving=30; int-delay=86400
   a=maxptime:100

Note that the payload format (encoding) names are commonly shown in upper case. MIME subtypes are commonly shown in lower case. These names are case-insensitive in both places. Similarly, parameter names are case-insensitive both in MIME types and in the default mapping to the SDP a=fmtp attribute.

8. IANA Considerations

It is requested that one new MIME subtype (audio/amr-wb+) be registered by IANA, see Section 7.

9. Acknowledgements

The authors would like to thank Redwan Salami and Stefan Bruhn for their significant contributions made throughout the writing and reviewing of this document. Anisse Taleb and Ingemar Johansson contributed by implementing the payload format, and thus helped to locate some flaws. We would also like to acknowledge Qiaobing Xie, coauthor of RFC 3267, on which this document is based.

10. References

10.1. Normative references

[1] 3GPP TS 26.290 "Audio codec processing functions; Extended AMR Wideband codec; Transcoding functions", version 1.0.0 (2004-05), 3rd Generation Partnership Project (3GPP).

[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, Internet Engineering Task Force, March 1997.

[3] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, Internet Engineering Task Force, July 2003.

[4] 3GPP TS 26.192 "AMR Wideband speech codec; Comfort Noise aspects", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[5] 3GPP TS 26.193 "AMR Wideband speech codec; Source Controlled Rate operation", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[6] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, Internet Engineering Task Force, April 1998.
[7] Sjoberg, J., Westerlund, M., Lakaniemi, A., and Q. Xie, "Real-Time Transport Protocol (RTP) Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs", RFC 3267, Internet Engineering Task Force, June 2002.

[8] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with the Session Description Protocol (SDP)", RFC 3264, Internet Engineering Task Force, June 2002.

10.2. Informative References

[9] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control", STD 65, RFC 3551, Internet Engineering Task Force, July 2003.

[10] Baugher, et al., "The Secure Real Time Transport Protocol", RFC 3711, Internet Engineering Task Force, March 2004.

[11] Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, Internet Engineering Task Force, December 1999.

[12] Perkins, C., Kouvelas, I., Hodson, O., Hardman, V., Handley, M., Bolot, J., Vega-Garcia, A., and S. Fosse-Parisis, "RTP Payload for Redundant Audio Data", RFC 2198, Internet Engineering Task Force, September 1997.

[13] 3GPP TS 26.233 "Packet Switched Streaming service", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[14] 3GPP TS 26.244 "Transparent end-to-end packet switched streaming service (PSS); 3GPP file format (3GP)", version 6.0.0 (2004-03), 3rd Generation Partnership Project (3GPP).

[15] Singer, D. and R. Castagno, "MIME Type Registrations for 3GPP Multimedia files", RFC YYYY (draft-singer-avt-3gpp-mime-02.txt), Internet Engineering Task Force, September 2003.

[16] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, Internet Engineering Task Force, April 1998.

[17] Handley, M., Perkins, C., and E. Whelan, "Session Announcement Protocol", RFC 2974, Internet Engineering Task Force, October 2000.
Any 3GPP document can be downloaded from the 3GPP webserver, "http://www.3gpp.org/", see specifications.

11. Authors' Addresses

   Johan Sjoberg
   Ericsson Research
   Ericsson AB
   SE-164 80 Stockholm, SWEDEN

   Phone: +46 8 7570000
   EMail: Johan.Sjoberg@ericsson.com

   Magnus Westerlund
   Ericsson Research
   Ericsson AB
   SE-164 80 Stockholm, SWEDEN

   Phone: +46 8 7570000
   EMail: Magnus.Westerlund@ericsson.com

   Ari Lakaniemi
   Nokia Research Center
   P.O.Box 407
   FIN-00045 Nokia Group, FINLAND

   Phone: +358-71-8008000
   EMail: ari.lakaniemi@nokia.com

12. IPR Notice

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

13. Copyright Notice

Copyright (C) The Internet Society (2004). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

This Internet-Draft expires in January 2005.

RFC Editor Considerations

The RFC editor is requested to replace all occurrences of XXXX with the RFC number this document receives. It is also requested that all occurrences of YYYY be replaced with the RFC number that [15] receives when published. The reference [15] is also requested to be updated with the correct information upon publication of that document.

The RFC editor is also requested to remove the next section, "Changes".

14. Changes

Compared to draft-ietf-avt-rtp-amrwbplus-00.txt, the following has been changed in this version:

- Extended the description of the codec to explain the super frame and transport frame concepts used.

- Added the Transport Frame Index field.

- Clarified what the "channels" parameter is useful for.

- Fixed a number of editorial errors.