Internet Draft draft-ietf-avt-rtp-3gpp-timed-text-01.txt J. Rey Y. Matsui Matsushita Expires: November 10, 2004 May 10, 2004 RTP Payload Format for 3GPP Timed Text Status of this document This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. IPR Disclosure Agreement By submitting this Internet-Draft, I certify that any applicable patent or other IPR claims of which I am aware have been disclosed, and any of which I become aware will be disclosed, in accordance with RFC 3668. Copyright Notice Copyright (C) The Internet Society (2004). All Rights Reserved. Abstract This document specifies an RTP payload format for the transmission of 3GPP (3rd Generation Partnership Project) timed text. 3GPP timed text is a time-lined decorated text media format with defined storage in a 3GP file. Timed Text can be synchronized with audio/video contents. As of today, 3GP files containing timed text contents can only be downloaded via HTTP. There is no available mechanism for streaming 3GPP timed text contents neither out of 3GP files nor directly from live content. 
In the following sections the problems of streaming timed text are addressed and a payload format for streaming 3GPP timed text over RTP is specified.

Table of Contents

1. Terminology
2. Introduction
3. RTP Payload Format for 3GPP Timed Text
4. Resilient Transport
5. Congestion control
6. Scene Description
7. MIME Type usage Registration
8. SDP usage
9. IANA Considerations
10. Security considerations
11. References
12. Annexes
13. Acknowledgements
14. Author's Addresses
15. IPR Notices
16. Full Copyright Statement
17. Acknowledgement

[Note to the RFC Editor: please delete the Change Log section upon publication of this document as RFC]

[Note to the RFC Editor: please replace "RFCXXXX" with the RFC designation of this document when published]

Change Log

Changes from draft-rey-avt-rtp-3gpp-timed-text-00

Major changes:
- completed empty sections from -00 draft.
- abstract and introduction re-arranged. Moved section "Basics of the 3GP File Structure" to end of the document as Annex B.
- SLEN, SIDX and SDUR lengths fixed to 16, 16 and 24 bits, respectively.
- New OPTIONAL header, SPLDESC, added to transport sample description in-band. - Section 4 on payload format expanded: text header, fragment header and sample description header are fully specified. - SMIL usage section added. Changes from draft-rey-avt-rtp-3gpp-timed-text-01 Rey & Matsui [Page 2] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 Major changes: - Terminology, some terms introduced to clarify text. - Section 4 - rules and recommendations on fragmentation are given. - payload headers were classified into five types, with a common field section and specific fields for each type. - header structure similar to RFC 3640 for easy transformation. Changes from draft-rey-avt-rtp-3gpp-timed-text-02 Major changes: - IPR Disclosure Agreement added to boilerplate, IPR Notices and Copyright Statement modified as per BCP 78. - SIDX usage re-defined. - "spldesc" parameter semantics lightly changed. - LEN field made MANDATORY, therefore TYPE header 2 rearranged to ease processing in 32-bit machines. - clarify that TYPE 5 SHOULD be implemented and, at least, a receiver MUST be able to discard it, if not implemented. - some guidelines on the clockrate for live streaming and within 3GP files. - Offer/Answer section - Extended glossary in the Terminology section - new fmtp parameter, "version", to indicate compliance to a particular version of 3GPP Timed Text specification. Changes from draft-ietf-avt-rtp-3gpp-timed-text-00 - editorial nits and clarifications. 1. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [5]. 
Furthermore, the following terms are used and have specific meaning within the context of this document:

text sample or whole text sample:
   this refers to a unit of timed text data as contained in the source 3GP file. Its equivalent in audio/video would be a frame. A text sample contains text strings followed by zero or more modifier boxes.

fragment or text sample fragment:
   a fraction of a text sample. A fragment may contain either text strings or modifier (decoration) contents, but not both at the same time.

sample contents:
   general term to identify timed text data transported when using this payload format.

text strings:
   the term used to denote the concatenation of a 16 bit byte count value, followed by a 16 bit byte order mark (0xFEFF) if UTF-16 encoding is used, and the actual text characters encoded either as UTF-8 or UTF-16.

decoration/modifiers:
   the terms "decoration" and "modifiers" are used interchangeably throughout the document to denote the contents of the text sample that modify the default text formatting. Modifiers may, for example, specify a different font size for a particular sequence of characters or define karaoke timing for the sample.

sample description:
   this term is used to denote information that applies by default to a text sample as a whole. Examples are scrolling direction, text box position, delay value, default font, background colour, etc. This information may also apply to different text samples.

units or access units:
   the payload headers specified in this document encapsulate text samples, fragments thereof and sample descriptions by prepending a specific payload header and so building what is called a unit.

aggregation / aggregate packet:
   an aggregate RTP packet consists of several units.
track / stream 3GP files contain audio/video and text tracks. This document enables to stream these tracks using RTP. Therefore both terms are exchanged in this document in the context of 3GP files. Media Header Box / Track Header Box / ... the 3GP file format makes use of these structures defined in the ISO Base File Format [2]. When referring to these in this document, initials are capitalized for clarity. 2. Introduction 3GPP timed text is a media format for time-lined decorated text specified in [1]. 3GPP Timed text contents may be stored in 3GP files or may be generated in real time. The 3GP file format itself is based on the ISO Base Media File Format recommendation [2]. Section 12.2 gives some insight in the 3GP file structure. The purpose of this draft is to provide a means to stream 3GPP timed text contents using RTP. This includes the streaming of timed text being read out of a 3GP file as well as the streaming of timed text generated in real time, a.k.a. live streaming. 2.1 General Overview of the 3GPP Timed Text format The 3GPP timed text format was developed for use in the services specified in the 3GPP Transparent End-to-end Packet-switched Streaming Services (3GPP PSS) [18]. Besides plain text, the 3GPP timed text format allows the display of decorated text: like for karaoke applications, scrolling text for newscasts or hyperlinked text. Furthermore, these contents may or may not be synchronized with other media, like audio or video. The scope of the 3GPP PSS includes both downloading and streaming of multimedia content over 3G packet-switched networks. However, due to the lack of an appropriate RTP payload format, the current usage of the 3GPP timed text file format is limited to downloading via HTTP. 
Rey & Matsui [Page 5] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 The 3GPP PSS adopts multimedia codecs (such as MPEG-4 Visual, AMR, MPEG-4 AAC, and JPEG) and protocols like SMIL [9] for presentation layouts or RTP [3] for streaming. In general, a multimedia presentation might consist of several audio/video/text streams (or tracks in ISO file format jargon). Different streams may have different contents. The media may be spatially synchronised either using the information within the streams or a scene description language like SMIL. An example of this would be a media session with three different media streams: 1 audio, 1 video and 1 timed text that reproduces a music video with karaoke subtitles. For each stream some information is needed, which defines the regions where each media is displayed, how the media looks like and its synchronization, among other things. In karaoke, for example, the song lyrics are displayed below the music video and the words are highlighted synchronized with the music track. In order to achieve these goals different functional elements are defined. Four differentiated functional components might be identified: o initial spatial layout information related to the text track: these are the height and width of the text region where text is displayed, the position of the text region in the display and the layer or proximity of the text to the user. These pieces of information are contained in the Track Header Box. Sections 6.1 and 12 provide further details. o default settings for formatting and positioning the text: default style (font, size, colour,...), default background colour, default horizontal and vertical justification, default line width, default scrolling, etcetera. Sample descriptions contain such default settings. o the actual text: encoded characters using either UTF-8 or UTF-16 encoding and, o the decoration inside the modifier boxes. 
Whether some characters have a different style, some delay, blink, etcetera, needs to be indicated by appending modifier boxes to the text strings. Modifier boxes are only present in the text samples if they are needed. Otherwise, the default settings in the corresponding sample description apply.

At the time of writing this payload format the following decorations or modifiers are specified in the 3GPP timed text media format [1]:

- text highlight,
- highlight color,
- blinking text,
- karaoke feature,
- hyperlink,
- text delay,
- text style,
- positioning of the text box and,
- text wrap indication.

Section 12.3 specifies where to find these values in the 3GP file and how these are mapped to the payload format. For live streaming, appropriate values using the same formats and units shall be used. For further details on the 3GPP Timed Text media format, refer to [1].

2.2 Requirements for a timed text payload format

In this section a set of requirements is listed. A justification for each of them is also given. An RTP Payload Format for 3GPP timed text SHALL:

1. Keep the 3GP text sample structure. A text sample consists of text strings and zero or more modifier boxes. This requirement means that it SHALL be possible for an RTP receiver using this payload format to rebuild the text samples from the received RTP packets.

2. Transmit the text sample size, sample duration and sample description index in-band. In RTP it is important to transmit this information in-band because it might change from sample to sample. This is also important for buffering purposes as described in Section 3.1.1.

3. Enable the transmission of the sample descriptions both by out-of-band and in-band means. In general, a single sample description may be used by different text samples.
Therefore, to save overhead it is reasonable to transmit a default formatting once at the initialization phase and update this upon demand. These pieces of information may become large so that out-of-band transmission might not be the most appropriate transport method. Additionally, out-of-band channels might not be always available. For these reasons, the payload format SHALL enable in-band transmission of sample description information. This is especially useful for live streaming, where contents are not known a priori. 4. Enable the aggregation of units into an RTP packet. In a mobile communication environment a typical text sample size is around 100-200 bytes. Thus, transporting several units in one RTP packet makes the transport more efficient. 5. Enable the fragmentation and reassembly of a text sample into several RTP packets in order to cover a wide range of applications and network environments. In general, fragmentation should be a rare event given the low bit rates and text sample sizes. Rey & Matsui [Page 7] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 However, the 3GPP Timed Text media format does allow for larger text samples. The payload format SHALL take this into account. 6. Enable the use of resilient transport mechanisms, such as repetition, retransmissions and FEC. Additional mechanisms like FEC [7] or retransmission [13] can be used to protect the information. RFC 2354 [8] discusses available mechanisms for packet loss resiliency. 3. 
RTP Payload Format for 3GPP Timed Text

The format of an RTP packet containing 3GPP timed text is shown below:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                          RTP payload                          |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Marker bit (M): the marker bit must be set to 1 if the RTP packet includes one or more whole text samples or the last fragment of a text sample; otherwise it is set to 0.

Timestamp: the timestamp MUST indicate the sampling instant of the earliest (or unique) text sample contained in the RTP packet. The initial value MUST be randomly determined. Text samples MUST be placed in play-out order, i.e. earliest first in the payload. The timestamp of each subsequent sample (or fragment thereof) MUST be obtained by adding the durations of all preceding samples in the packet to the RTP timestamp value.

Example: let sdur(0), sdur(1) and sdur(2) be the durations of three subsequent timed text samples included in an RTP packet. Let rtpts be the timestamp as present in the RTP header. The timestamp ts(i) for each sample (i=0,1,2) would be:

   ts(i) = rtpts + sum[sdur(0) ... sdur(i-1)]

   ts(0) = rtpts
   ts(1) = rtpts + sdur(0)
   ts(2) = rtpts + (sdur(0) + sdur(1))

Some text samples may become large and have to be fragmented into several RTP packets. In this case, the receiver needs to associate fragments of the same text sample. This is done using the timestamp. The order of the fragments is resolved using the payload header defined in this document.
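
The timestamp rule above can be illustrated with a minimal sketch. The function name and the explicit modulo-2^32 handling are assumptions made for illustration; the draft itself only gives the summation, while the 32-bit width of the RTP timestamp field follows from the header layout.

```python
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide


def sample_timestamps(rtpts, durations):
    """Return the timestamp of each whole text sample in one RTP packet.

    rtpts     -- timestamp from the RTP header (applies to the first sample)
    durations -- sample durations (SDUR) in timestamp units, in play-out order
    """
    timestamps = []
    ts = rtpts % RTP_TS_MODULUS
    for sdur in durations:
        timestamps.append(ts)
        # The next sample starts where the current one ends
        # (wraparound handling is an assumption of this sketch).
        ts = (ts + sdur) % RTP_TS_MODULUS
    return timestamps


if __name__ == "__main__":
    # Three samples with durations sdur(0)=500, sdur(1)=250, sdur(2)=100
    print(sample_timestamps(1000, [500, 250, 100]))
```

With rtpts = 1000 this yields ts(0) = 1000, ts(1) = 1500 and ts(2) = 1750, matching the formulas in the example.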
Unlike in other media such as audio or video, the timestamp clockrate does not match the sampling rate. If the timed text is streamed from a 3GP file, the timestamp clockrate MUST be copied directly from the value of "timescale" in the Media Header Box for that text track. Note that each track in a 3GP file MAY have its own clockrate as specified in the Media Header Box.

For live streaming an appropriate timestamp clockrate SHALL be used. A default value of 1000 Hz is RECOMMENDED. This value should provide enough timing resolution for synchronizing text with other media and expressing the duration of text samples. Other clockrates MAY be used. Timestamp clockrates MUST be signaled by out-of-band means at session setup, e.g. using SDP.

The 3GPP Timed Text format does not mandate any sampling rate; the real-time encoder SHALL choose an appropriate sampling rate such that the text samples meet the application needs. For example, samples may be tailored to match the packet MTU as closely as possible or to provide a given redundancy for the available bit rate. The encoding application MUST also take into account the delay constraints of the real-time session and assess whether FEC, retransmission or other similar techniques are reasonable options for repair.

The following example illustrates how a real-time encoder may choose its settings. Imagine a news program scenario, where the news is transcribed and synchronized with the image of the reporter and the headlines in the background. Assuming that a person can read an average of 4-6 words per second, at an average word length of 5 characters plus one space per word, an available IP MTU of 576 bytes, characters encoded using 2 bytes each, no modifiers used and an available rate of 576*8 bits per second = 4.6 kbps, a text sample covering 60 seconds of text would theoretically be optimum: IP/UDP/RTP + (text sample) = 20 + 8 + 18 (12+6, TYPE 1 header) + ~250*2 = ~546 bytes < 576 bytes.
However, a delay of sixty seconds might be too much, and a single packet per sample provides too little redundancy. In practice, the allowed delay for real time communications is typically a few seconds, e.g. 3s. Thus, the encoder could sample text every 1s (yielding RTP payloads of ~14-18 bytes), encapsulate the current and last two samples in every RTP packet (amounting to an IP packet size of 98 bytes) and send the packet six times, thus exhausting the available bit rate and increasing packet loss resilience.

These examples illustrate how the encoding application shall adapt to the scenario constraints.

Payload Type (PT): the payload type is set dynamically and sent by out-of-band means.

The usage of the remaining RTP header fields follows the rules of RTP [3] and the profile in use.

3.1 General Remarks

Before going into the details of the payload headers, some general observations are made in this section. These should help the reader in understanding the design decisions.

3.1.1 Character Counting

This payload format does not enable a receiver to find out the exact number of text characters lost. The reason for this is that UTF-8/16 encodings yield a variable number of bytes per character, and so the fragment size does not help in finding the number of lost characters.

3.1.2 Fragmentation of Timed Text Samples

This section justifies why text samples may have to be fragmented and discusses some of the possible approaches to do it. A solution is proposed together with rules and recommendations for fragmenting and transporting text samples using this payload format.

3GPP Timed Text applications are expected to operate at low bit rates. This fact, together with the small size of timed text samples (typically one or two hundred bytes), makes fragmentation of text samples a rare event. Samples should usually fit into the MTU size of the used network path. Nevertheless, some text strings (e.g.
ending roll in a movie) and some modifier boxes (e.g. for hyperlinks, for karaoke or for styles) might become large and might need fragmentation. This may also apply to future modifier boxes.

In order to transport these larger text samples using RTP, it could be argued that a careful encoding be used to transform the original large sample into smaller self-contained text samples that fit into the given transport MTU. This would comply with the ALF principle, as described in the guidelines for RTP payload formats, RFC 2736 [14]. It would also require additional pre-processing prior to RTP encapsulation and that senders understand the modifiers format. However, given the low probability of fragmentation, it is believed that the overhead of this pre-processing is not worthwhile and that it is more appropriate to encode text samples without taking the path MTU into account. In this manner, this payload format makes a trade-off by intentionally leaving out this pre-processing and making the fragmented samples less robust to packet losses. The most important consequence of this design choice is that while text string fragments can be displayed in the absence of a previous text fragment, modifiers for that text string are useless if they are not completely received.

A minimum set of fragmentation rules and recommendations SHALL be observed:

o whenever possible, whole text samples SHOULD be aggregated into RTP packets, using the payload headers defined in this document. This increases transport efficiency.

o since fragmentation cannot be avoided in all cases, it is RECOMMENDED that text samples are fragmented as seldom as possible. As an example, if a packet has some free space, which would fit only a small part of the next text sample, a new RTP packet SHOULD be sent, instead of sending two or more fragments out of the sample. This reduces complexity by minimizing the number of fragments.
o in order to fill up the remaining bits of a packet, piggybacking of sample descriptions MAY be performed. Also fragments of past samples MAY be piggybacked. For this purpose the server MAY reserve a certain amount of buffer to store already sent units for piggybacking.

o text strings MUST be split at character boundaries. Otherwise, it is not possible to display the text strings of a fragment if a previous fragment was lost.

o sample descriptions SHALL NOT be fragmented, because they contain important information that may affect several text samples.

o unlike text strings, the modifier boxes are NOT REQUIRED to be split at meaningful boundaries, nor is there a possibility to apply partial modifier contents to the text strings. Note that enabling this would require that: a) senders understand the semantics of the modifier boxes and b) specific fragment headers for each of the modifier boxes are defined. As explained previously in this section, this is not considered worthwhile.

o as a consequence of the above, the modifier fragments are only useful if all of them are received. Therefore, for enhanced resiliency against packet loss it is RECOMMENDED that fragments containing decoration be especially protected using FEC [7], retransmission [13], packet repetition or an equivalent technique. Similarly, these techniques MAY also be applied to text strings and sample descriptions.

o furthermore, when fragmenting samples containing modifiers, the start of the modifiers MUST be indicated using the payload header defined for that purpose, i.e. a new TYPE 3 unit MUST be used (see below). Otherwise, if packets are lost, a client may be unable to identify where the modifiers start and the text ends.

3.1.3 On the length indication in the units

Usually, RTP applications use the information on packet size from UDP or lower layers to find out the length of the RTP payload.
This payload format does not use this information but includes an explicit length indication for each unit in the payload. While this is technically not needed for every unit (those placed last in the payload could leave it out) it is considered that the overhead added is minimum and the overall complexity remains low. At the same time, this design choice allows easy interoperability with the RTP Payload Format for Transport of MPEG-4 Elementary Streams, RFC 3640 [15], which does require an explicit length indication for each unit (see AU-header in RFC 3640). 3.1.4 On the ordering and interleaving of units in aggregate payloads As stated in the timestamp definition, the order of the units in an aggregate payload is important. In general, older units MUST precede newer ones. However, not all units are provided with timing attributes: units containing sample descriptions (TYPE 5) or modifier fragments (TYPE 3&4) lack these. Therefore, relaxed ordering constraints as follow apply: o Units containing sample descriptions MAY be placed in any order (no timing requirements) and MAY be present as often as needed, e.g. piggybacked. o Although units containing modifier boxes or fragments thereof do not include a duration field, they make use of the RTP timestamp to group together. Therefore, they SHOULD be transmitted in the same order as they appear in the sample and be placed as near as possible to the text to which they apply. Logically, this does not apply for retransmitted or redundant packets or for units that are piggybacked into other packets. The latter requirement targets at avoiding (or minimizing) the dispersal of fragments of a text sample over several RTP packets, a.k.a. interleaving. Interleaving of units SHOULD NOT be used with this payload format due to the variable packet size of the timed text samples, which would yield unpredictable latencies. This decreases the robustness against packet losses. 
As we have seen, units with and without duration MAY be part of the aggregate payload. Logically, units without timing attributes SHALL NOT be used to resolve the timestamp of subsequent units. For this purpose, they SHALL be ignored, i.e. by jumping to the next unit with duration or until the end of the packet is reached. Otherwise, the algorithm as specified in the timestamp definition above applies.

On the other hand, units with unknown duration have some ordering constraints: they MAY only precede units that do not have a duration value (TYPE 3, 4 & 5 below). Otherwise, it would not be clear when the following units should be displayed due to the unknown duration.

3.1.5 Live streaming vs. Streaming from a 3GP file

This section clarifies the differences between streaming live content and streaming text tracks from a 3GP file.

For the purpose of this document, the term live streaming refers to those scenarios where the sender application creates the media contents without necessarily storing them in a 3GP file. Usually, the generated contents are stored for a limited amount of time in a buffer. This buffer is used to cancel the network delay and delay jitter.

Section 12.3 specifies how the 3GP file parameter values are mapped to the fields of the payload header. For live streaming, appropriate values complying with the format and units described in [1] shall be used. Where needed, clarifications on appropriate values are given in this document.
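
The timestamp resolution rule of Section 3.1.4 (units without timing attributes are skipped when deriving the timestamps of later units) can be sketched as follows. This is an illustration only: the tuple representation of a unit is hypothetical, the sketch assumes the aggregate contains only whole text samples (TYPE 1) and untimed units (TYPE 3, 4 and 5), and it does not cover text string fragments or units of unknown duration.

```python
UNTIMED_TYPES = {3, 4, 5}  # modifier units and sample descriptions
RTP_TS_MODULUS = 1 << 32   # the RTP timestamp field is 32 bits wide


def resolve_timestamps(rtpts, units):
    """Derive per-unit timestamps for an aggregate payload.

    units -- list of (unit_type, sdur) tuples in payload order; sdur is
             None for units without a duration field.
    Returns one entry per unit: a timestamp for TYPE 1 units and None
    for units that carry no timing attributes.
    """
    resolved = []
    ts = rtpts % RTP_TS_MODULUS
    for unit_type, sdur in units:
        if unit_type in UNTIMED_TYPES:
            # Ignored for timestamp resolution: jump to the next unit.
            resolved.append(None)
            continue
        if unit_type != 1:
            raise ValueError("sketch only handles TYPE 1, 3, 4 and 5 units")
        resolved.append(ts)
        ts = (ts + sdur) % RTP_TS_MODULUS
    return resolved


if __name__ == "__main__":
    # sample description, text sample, text sample, sample description, text sample
    units = [(5, None), (1, 2000), (1, 500), (5, None), (1, 300)]
    print(resolve_timestamps(9000, units))
```

Here the piggybacked sample descriptions do not advance the running timestamp, so the three text samples resolve to 9000, 11000 and 11500.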
3.2 Payload Header Definitions

An RTP packet using the payload headers defined in this document has the following format:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |                                               :
+-+-+-+-+-+-+-+-+                                               :
:       (variable payload header depending on TYPE value)       :
:                                                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                        SAMPLE CONTENTS                        :
:                                                               :
:                                                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1 RTP Packet Format.

The payload headers specified in this document consist of a set of common fields followed by specific fields for each header type and sample contents. See Figure 2. In this manner, the structure of the payload headers resembles that of the 'access units' (AU) in RFC 3640. This similarity is intentional to improve interoperability. The 'AU header' of that document finds an equivalent in the common header fields for all TYPE values: R, U, TYPE and LEN. Similarly, the specific fields plus the sample contents would be equivalent to the 'AU data section' in [15]. Thus, RTP packets complying with this payload format can be seen as consisting of a unit header and a unit payload, as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |              LEN              |   specific    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   fields (variable)   |
+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2 Payload Header Format.
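
As a minimal sketch of how a receiver might walk the units of an aggregate payload, the following code reads the common fields of Figure 2 and uses LEN to find the next unit. The field widths (U 1 bit, R 4 bits, TYPE 3 bits, LEN 16 bits) are those given in Section 3.2.1; the class and function names are hypothetical, and the sketch assumes that LEN counts its own two bytes plus everything after it, so that a unit occupies 1 + LEN bytes.

```python
import struct
from typing import NamedTuple


class UnitHeader(NamedTuple):
    u: int          # UTF flag (1 bit)
    r: int          # reserved (4 bits)
    unit_type: int  # TYPE (3 bits)
    length: int     # LEN (16 bits, network byte order)


def parse_unit_header(buf, offset=0):
    """Read the 3-byte common unit header starting at offset."""
    first, length = struct.unpack_from("!BH", buf, offset)
    return UnitHeader(u=first >> 7,
                      r=(first >> 3) & 0x0F,
                      unit_type=first & 0x07,
                      length=length)


def split_units(payload):
    """Split an RTP payload into (header, rest_of_unit) pairs.

    rest_of_unit holds the TYPE-specific fields plus the sample contents.
    """
    units = []
    offset = 0
    while offset < len(payload):
        if len(payload) - offset < 3:
            raise ValueError("truncated unit header")
        hdr = parse_unit_header(payload, offset)
        end = offset + 1 + hdr.length  # assumption: LEN includes itself
        if hdr.length < 2 or end > len(payload):
            raise ValueError("inconsistent LEN field")
        units.append((hdr, payload[offset + 3:end]))
        offset = end
    return units


if __name__ == "__main__":
    unit1 = bytes([0x01, 0x00, 0x06, 0x00, 0x00, 0x00, 0x64])  # TYPE 1, LEN 6
    unit2 = bytes([0x82, 0x00, 0x0A]) + b"abcdefgh"            # U=1, TYPE 2, LEN 10
    for hdr, rest in split_units(unit1 + unit2):
        print(hdr, rest)
```

Because every unit carries its own length, the loop never needs the UDP datagram size to find unit boundaries; it only uses the payload length to detect truncation.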
An aggregate RTP packet containing two text samples and a text sample fragment would schematically look like this:

+----------------------+
|                      |
|      RTP Header      |
|                      |
+----------------------+ --.
|   Payload Header 1   |   |
|......................|   | UNIT 1
|    Text Sample 1     |   |
+----------------------+ --+
|   Payload Header 2   |   |
|......................|   | UNIT 2
|    Text Sample 2     |   |
+----------------------+ --+
|   Payload Header 3   |   |
|......................|   | UNIT 3
| Text Sample Fragment |   |
+----------------------+ --'

Figure 3 Example RTP packet.

3.2.1 Unit Header Format

The unit header has the following format:

 0                   1                   2
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |              LEN              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 4 Unit Header Format.

Where:

o U (1 bit) "UTF Transformation flag": indicates whether the text characters are encoded using UTF-8 (U=0) or UTF-16 (U=1). This is used to inform RTP receivers whether UTF-8 or UTF-16 was used to encode the text string, so that text string fragments can be displayed. The U bit is only meaningful in the TYPE 2 header; otherwise it MUST be set to zero and ignored. This is because complete text samples already contain an implicit indication of the encoding (byte order mark) in the text string itself (unit payload), which is understood by the decoding application.

o R (4 bits) "Reserved bits": for future extensions. This field MUST be set to zero (0x0).

o TYPE (3 bits) "Type Field": this field specifies which specific header fields follow. The following TYPE values are defined:

- TYPE 1, for whole text samples
- TYPE 2, for text string fragments
- TYPE 3, for whole modifier boxes or first modifier fragments
- TYPE 4, for modifier fragments other than first.
- TYPE 5, for sample descriptions. One header per sample description.
- TYPE 0, 6 and 7 are reserved.

Two TYPEs (1 & 2) are defined for units containing text strings, another two (3 & 4) for units not containing text strings (thus no timing attributes) and a final TYPE 5 for sample descriptions (also lacking timing attributes). See details in subsections below.

o Finally, the LEN (16 bits) "Length Field": indicates the size (in bytes) of this header field and all the fields following, i.e. the LEN field followed by the unit payload. For whole text samples stored in 3GP files, the sample length is given by the SLEN value (see Section 12.3) and the LEN value is easily obtained by adding SLEN to the length of the LEN field (2). For live streaming, both the sample length and the LEN value for the current fragment MUST be calculated during fragmentation or during the sampling process.

LEN may take the following values:

- TYPE = 1, LEN >= 6,
- TYPE = 2, LEN > 9,
- TYPE = 3, LEN > 3,
- TYPE = 4, LEN > 3 and,
- TYPE = 5, LEN > 3.

In the following subsections the different payload headers for the values of TYPE are specified.

3.2.2 TYPE 1 Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |       LEN (always >=6)        |     SIDX      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     SDUR                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header type is used to transport whole text samples. If several text samples are sent in an RTP packet, every sample has its own header. See Figure 3.

Empty text samples are considered whole text samples, although they do not contain sample contents. In this case, TYPE 1 units MUST NOT have contents. This means that the LEN field MUST have a value of 6 (0x0006). Otherwise, the LEN field MUST always be greater than 6 (0x0006).
The fields above have the following meaning:

o SIDX (8 bits) "Text Sample Entry Index": this is an index used to identify the sample descriptions. The SIDX field is used to find the sample description corresponding to the unit's payload. There are two types of SIDX values: static and dynamic.

  Static SIDX values are used to identify sample descriptions that MUST be sent out-of-band and MUST remain active during the whole session. The transport of sample descriptions out-of-band is a MANDATORY feature. A static SIDX value is unequivocally linked to one particular sample description during the whole session. Carrying many sample descriptions out-of-band SHOULD be avoided, since these may become large and, ultimately, transport is not the goal of the out-of-band channel. Thus, this feature MUST be limited to those sample descriptions that provide a set of minimum default format settings. Static SIDX values MUST fall in the interval [129,254]. The first SIDX value assigned to a static sample description MUST be 129.

  Dynamic SIDX values are used for sample descriptions sent in-band. Sample descriptions MAY be sent in-band for several reasons: because they are generated in real time, for transport resiliency or both. A dynamic SIDX value is unequivocally linked to one particular sample description during the period in which this is active in the session and it SHALL NOT be modified during that period. This period MAY be smaller than or equal to the session duration. A maximum of 64 dynamic SIDX values may be active at any moment. Dynamic SIDX values MUST fall in the interval [0,127]. This should be enough for both recorded content and live streaming applications. Nevertheless, a wraparound mechanism is provided in Section 12 to handle sessions where more than 64 SIDX values might be needed.

  SIDX values 128 and 255 are reserved for future use.
o SDUR (24 bits) "Text Sample Duration": indicates the duration of the text sample in timestamp units. For this field, a length of 3 bytes is preferred to 2 bytes. This is because, for a typical clockrate of 1000 Hz, 16 bits would allow for a maximum duration of just 65 seconds, which might be too short for some streams.

  Apart from defining the time period during which the text is displayed, the duration field is also used to find the timestamp of any subsequent units within the RTP packet. See the timestamp definition for details.

  Text samples generally have a known duration at the time of transmission. However, in some cases, e.g. live streaming, the time for which a text piece shall be shown might not be known. Let us revisit the previous example: imagine you are in an airport watching the latest news report while you wait for your plane. Airports are loud, so the news report is transcribed in the lower area of the screen. This area displays two lines of text: the headlines and the words spoken by the news speaker. As usual, the headlines are shown for a longer time than the rest. This time is, in principle, unknown to the stream server. A headline is just replaced when the next headline arrives. As seen in this example, units of unknown duration MUST remain valid until the next unit arrives.

  Additionally, samples of unknown duration SHALL NOT use features, such as scrolling or karaoke, which would need to know the duration of the sample up front. They also SHALL NOT precede any unit with the SDUR field.

  For text stored in 3GP files, see Section 12.3 for details on how to extract the duration value. For live streaming, live encoders SHALL assign appropriate values and units according to [1] and later releases.
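The SIDX ranges and the SDUR sizing argument above can be checked with a few lines of code. The following is only an illustrative sketch: the function names classify_sidx and max_duration_seconds are hypothetical and not part of this specification, and the maximum duration is taken as the largest value the field can hold divided by the clockrate.

```python
def classify_sidx(sidx: int) -> str:
    """Classify an 8-bit SIDX value using the ranges given in the text."""
    if not 0 <= sidx <= 255:
        raise ValueError("SIDX is an 8-bit field")
    if sidx <= 127:
        return "dynamic"      # interval [0,127]
    if 129 <= sidx <= 254:
        return "static"       # interval [129,254]
    return "reserved"         # 128 and 255


def max_duration_seconds(field_bits: int, clockrate_hz: int) -> float:
    """Longest duration a field of the given width can express."""
    return (2 ** field_bits - 1) / clockrate_hz


if __name__ == "__main__":
    for value in (0, 127, 128, 129, 254, 255):
        print(value, classify_sidx(value))
    # 16 bits versus 24 bits at the typical 1000 Hz clockrate
    print(max_duration_seconds(16, 1000), "seconds")
    print(max_duration_seconds(24, 1000) / 3600, "hours")
```

At 1000 Hz a 16-bit field tops out at roughly 65 seconds, as stated above, whereas the 24-bit SDUR field reaches several hours.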
3.2.3 TYPE 2 Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |        LEN (always >9)        |     SIDX      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     SDUR                      | TOTAL | THIS  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             SLEN              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header type is used to transport text sample fragments containing text strings. In detail:

o The LEN field (16 bits) has the same meaning as above. The value of LEN MUST be greater than 9 (0x0009).

o The SLEN field (16 bits) indicates the size (in bytes) of the original (whole) text sample to which this fragment belongs. Clients MAY use SLEN to reserve buffer space for the remaining fragments of the text sample. For stored content, see Section 12.3 for details on how to find the SLEN value in a 3GP file. For live content, the SLEN is obtained during the sampling process.

o The fields TOTAL (4 bits) and THIS (4 bits) indicate, respectively, the total number of fragments into which the original text sample has been fragmented and the position of the current fragment in that sequence. The usual "byte offset" field is not used here for two reasons: a) it would take one more byte and b) it does not provide any information on the character offset. UTF-8/16 text strings have, in general, a variable character length ranging from 1 to 6 bytes. Therefore, the TOTAL/THIS solution is preferred.

o The U, R, TYPE, SIDX, and SDUR fields have identical interpretation as above. The U, SIDX and SDUR fields are meaningful since partial text strings can also be displayed.
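As an illustration of the header layouts defined so far, the following sketch decodes the common unit header and the additional TYPE 1 and TYPE 2 fields. It is not a normative parser. It assumes the usual network bit order of such diagrams (bit 0 is the most significant bit of the first octet) and reads LEN as counting from the LEN field itself to the end of the unit, so a complete unit occupies 1 + LEN octets. The names UnitHeader, MIN_LEN and parse_unit_header are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Smallest legal LEN per TYPE, from the list in Section 3.2.1.
MIN_LEN = {1: 6, 2: 10, 3: 4, 4: 4, 5: 4}


@dataclass
class UnitHeader:
    u: int
    unit_type: int
    length: int
    sidx: Optional[int] = None
    sdur: Optional[int] = None
    total: Optional[int] = None
    this: Optional[int] = None
    slen: Optional[int] = None


def parse_unit_header(buf: bytes) -> UnitHeader:
    """Decode the unit header at the start of buf (TYPE 1/2 fields only)."""
    if len(buf) < 3:
        raise ValueError("truncated unit header")
    first, length = struct.unpack_from("!BH", buf, 0)
    u = first >> 7                    # 1 bit
    reserved = (first >> 3) & 0x0F    # 4 bits, MUST be zero
    unit_type = first & 0x07          # 3 bits
    if reserved != 0:
        raise ValueError("reserved bits are not zero")
    if unit_type not in MIN_LEN:
        raise ValueError("reserved TYPE value")
    if length < MIN_LEN[unit_type]:
        raise ValueError("LEN too small for this TYPE")
    if len(buf) < 1 + length:         # assumption: unit spans 1 + LEN octets
        raise ValueError("truncated unit")
    hdr = UnitHeader(u=u, unit_type=unit_type, length=length)
    if unit_type in (1, 2):
        hdr.sidx = buf[3]
        hdr.sdur = int.from_bytes(buf[4:7], "big")   # 24-bit duration
    if unit_type == 2:
        hdr.total = buf[7] >> 4
        hdr.this = buf[7] & 0x0F
        hdr.slen = int.from_bytes(buf[8:10], "big")
    return hdr


if __name__ == "__main__":
    # Empty whole text sample: TYPE 1, LEN=6, SIDX=129, SDUR=5000.
    print(parse_unit_header(bytes([0x01, 0x00, 0x06, 0x81, 0x00, 0x13, 0x88])))
```

For TYPE 3, 4 and 5 units the sketch returns only the common fields; a fuller parser would decode their remaining fields in the same manner.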
3.2.4 TYPE 3 Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |        LEN (always >3)        | TOTAL | THIS  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header type is used to transport either the entire modifier contents in a sample or just the first fragment of these. This depends on whether the modifier boxes fit in the current RTP packet. As explained above, the rules for fragmentation require that the start of the modifier boxes be signaled.

o The TOTAL/THIS field indicates whether the unit contains a part of or the whole of the modifiers: if TOTAL=THIS, then all modifiers are included here. In this case, TOTAL=THIS MUST be greater than one, because there cannot be a sample of modifiers without text strings. Otherwise, this unit just contains the first fragment.

o The U, R, TOTAL/THIS and LEN fields are used as above. The LEN field MUST be greater than three (0x0003).

Note that the SLEN, SIDX and SDUR fields are not present. This is because: a) these fragments do not contain text strings and b) these types of fragments are applied over text string fragments, which already contain this information.

3.2.5 TYPE 4 Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |        LEN (always >3)        | TOTAL | THIS  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header type is used to transport modifier fragments, other than the first one. The U, R, TOTAL/THIS and LEN fields are used as above. The LEN field MUST be greater than three (0x0003).
3.2.6 TYPE 5 Header

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|U|   R   |TYPE |        LEN (always >3)        |     SIDX      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This header type is used to transport (dynamic) sample descriptions. The LEN field MUST be greater than three (0x0003). Every sample description MUST have its own TYPE 5 header.

This header SHOULD be supported, since it adds minimum complexity and it may increase the robustness of the streaming session. At the very least, every client implementation MUST be able to discard a TYPE 5 unit, if the unit payload cannot be used. Note that the implementation of this header is only RECOMMENDED, since some text streaming applications might never use dynamic sample descriptions.

4. Resilient Transport

Apart from the basic fragmentation measures described in the section above, the simplest option for packet loss resilient transport is to send the same RTP packet or the same text samples (or fragments) again.

A server MAY decide to use repetition as a measure for packet loss resilience. Repetition of text samples (or fragments) is only allowed if exactly the same units are sent as in the first transmission. Only then can a receiver use the already received and the newly repeated fragments to reconstruct the original text samples. Note that the RTP timestamp is used to group together the fragments of a sample. This measure also reduces complexity, as fragmentation of any given text sample is only done once.

E.g. if a text sample was originally sent as a unique non-fragmented text sample, a repetition of that sample MUST also be sent as a single non-fragmented text sample in one unit.
Likewise, if the original text sample was fragmented and spread over several RTP packets, say a total of 3 units, then the repeated fragments SHALL also have the same byte boundaries and use the same headers and bytes per fragment. With repetition, repeated units resolve to the same timestamp as their originals. Where redundant units are available, the receiver SHOULD use those units received in the RTP packet with the highest sequence number and discard the rest. If single units are repeated in packets different from their originals, care SHALL be taken to preserve their original timing. Regarding the RTP header fields: o in repeated packets, all RTP header fields MUST keep their original values except the sequence number that MUST be increased to comply with RTP. o in packets containing repeated units, the general rules in Section 3 for assigning values to the RTP header fields apply. Rey & Matsui [Page 20] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 Finally, if sample descriptions for a given SIDX value are not available at the receiver, it is a matter of implementation whether the text sample contents are displayed. A possible solution MAY be that the encoder provides a static default sample description to be used for these cases. 5. Congestion control The RTP profile under which this payload format is used defines an appropriate congestion control mechanism in different environments. Following the rules under the profile, an RTP application can determine its acceptable bitrate and packet rate in order to be fair to other TCP or RTP flows. If an RTP application using this payload format uses retransmission, the acceptable packet rate and bitrate includes both the original and retransmitted data. This guarantees that an application using retransmission achieves the same fairness as one that does not. 
Such a rule may translate in practice into the following actions: If enhanced service is used, it should be made sure that the total bitrate and packet rate do not exceed that of the requested service. It should be further monitored that the requested services are actually delivered. In a best-effort environment, the sender SHOULD NOT send retransmission packets without ensuring first that enough bandwidth for retransmission is available. Other solutions like reducing the packet rate and bitrate of the original stream (for example by encoding the data at a lower rate) MAY be used. Similar considerations apply, if an RTP application using this payload format implements forward error correction, FEC [7]. Hereby, the sender should take care that the amount of FEC does not actually worsen the problem. Therefore, it is RECOMMENDED that applications implementing this payload format also implement congestion control. The actual mechanism for congestion control is out of the scope of this document but should be suitable for real-time flows. As an example, RFC 3448 [11] specifies an equation-based congestion control that fulfils this requirement. 6. Scene Description 6.1 Text rendering position and composition In order to stream timed text, either stored in a 3GP file or streamed live, some initial layout information is needed by the client to correctly display the text. These are the width, height and position of the text area and the layer or proximity of the text to the user. Rey & Matsui [Page 21] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 These pieces of information MUST be conveyed in a reliable form previous to the start of the session. An example of a reliable transport may be the out-of-band channel used for SDP. Any SDP description containing a 3GPP timed text stream MUST include the parameters listed above. Section 7 provides details on the usage in SDP descriptions. 
For stored content, some values contained in the Track Header Box SHALL be used. See Section 12.3 for details on finding these values in a 3GP file. For live streaming appropriate values SHALL be used. 6.2 SMIL usage The attributes contained in the Track Header Boxes of a 3GP file only specify the spatial relationship of the tracks within the given 3GP file. If several media streams are sent, they require spatial synchronization. For such purpose, SMIL SHOULD be used. SMIL assigns regions in the display to each of those files and places the tracks within those regions. The original track header information is used for each track within its region. Therefore, even if SMIL scene description is used, the track header information pieces SHOULD be sent anyway as they represent the intrinsic media properties. See [1] and the 3GPP SMIL Language Profile in [18] for details. 7. MIME Type usage Registration 7.1 3GPP Timed Text MIME Registration MIME type: video MIME subtype: 3gpp-tt Required parameters: rate: the RTP timestamp clockrate is equal to the clockrate of the media. If RTP packets are generated out of a 3GP file, the clockrate of the text media MUST be copied from the 3GP file, i.e. the clockrate is the value of "timescale" parameter in the Media Header Box describing that text track. Other tracks (audio/video/text) in the 3GP file may have their own clockrates as indicated in their corresponding Media Header Box. For live encoding, a clockrate of 1000 Hz is RECOMMENDED but other values MAY be used. version=, indicates the version of the 3GPP TS 26.245 specification after which the timed text is encoded. "Z" is the number of the Release, "x" and "y" are taken from the 3GPP specification version, vZ.x.y. E.g. for 3GPP TS 26.245 v6.0.0, 6(x*256+y)=6(0), the version value is "60". Rey & Matsui [Page 22] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 spldesc= indicates the way the server sends the sample descriptions. 
There are three possibilities: o "out": all sample descriptions are sent out-of-band, e.g. in the SDP. This may be used when the total number of sample descriptions used is low. This value MUST always be present. o "both":, where both, in- and out-of-band, mechanisms are used. All clients and servers MUST understand this parameter. Additionally, the server MUST always include the "spldesc" parameter in the session description and it MUST include the supported mechanisms in order of preference. The server MUST include, at least, the value "out". tx3g=,,...This parameter MUST be used for conveying sample descriptions out-of-band. The list of sample entries MAY follow any particular order and it MAY be empty. The represents the base64 encoding of the concatenation of the SIDX and the sample description for that SIDX, in this order. The format of a sample description entry can be found in 3GPP TS 26.245 Release 6 and later releases. All servers and clients MUST understand this parameter and MUST be capable of using the sample description(s) contained in it. width=, indicates the width in pixels of the text track or area where the text is actually displayed. height=, indicates the height in pixels of the text track. tx=, indicates the horizontal translation offset in pixels of the text track with respect to the origin of the video track. ty=, indicates the vertical translation offset in pixels of the text track. layer=, indicates the proximity of the text track to the viewer. Higher values means closer to the viewer. This parameter has no units. Optional parameters: brand=, where indicates the "best use" of the original 3GP file from which the timed text contents are read. Rey & Matsui [Page 23] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 cbrand=,,...indicates the list of compatible brands. mver=, "Minor version" where is a positive integer. 
It identifies the oldest compatible version of the 3GP file format specification in 3GPP TS 26.234 Release and corresponding specifications in later Releases. Note these parameters are merely informational, as they only provide information about the original 3GP file being read from. Details on these can be found in the 3GP file format section of 3GPP TS 26.234 Release 5 specification and corresponding specifications in later Releases. Encoding considerations: this type is only defined for transfer via RTP. Security considerations: please refer to Section 10 of RFCXXXX. Interoperability considerations: the 3GPP Timed Text media format for which this payload format is defined is specified in Release 6 of 3GPP TS 26.245 "Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)". The 3GPP file format (3GP) referred to in this document and the used SMIL language profile can be found in Release 5 of 3GPP TS 26.234 and in the corresponding specifications for later Releases. Note also that 3GPP may in future Releases specify extensions or updates to the media format in a backwards-compatible way, e.g. new modifier boxes or extensions to the sample descriptions. The payload format defined in RFCXXXX allows for such extensions. For future 3GPP Releases of the Timed Text Format, the parameter "version" is used to identify the Release and exact specification used. Published specification: RFC XXXX Applications which use this media type: multimedia streaming applications. Additional information: the 3GPP Timed Text media format is specified in 3GPP TS 26.245 "Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)". This document and future extensions to the 3GPP Timed Text format are publicly available at http://www.3gpp.org. Magic number(s): None. File extension(s): 3GPP Timed Text tracks are stored in files conforming the 3GP file format. 
The 3GPP file format (3GP) referred to in this document can be found in Release 5 of 3GPP TS 26.234 and in the corresponding specifications for later Releases. Macintosh File Type Code(s): None. Rey & Matsui [Page 24] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 Person & email address to contact for further information: Jose Rey, rey@panasonic.de Yoshinori Matsui, matsui.yoshinori@jp.panasonic.com Audio/Video Transport Working Group. Intended usage: COMMON Author/Change controller: Jose Rey Yoshinori Matsui IETF AVT WG 8. SDP usage 8.1 Mapping to SDP The information carried in the MIME media type specification has a specific mapping to fields in SDP [4]. If SDP is used to specify sessions using this payload format, the mapping is done as follows: o The MIME type ("video") goes in the SDP "m=" as the media name. The "video" MIME Type is used as timed text is considered visual media. m=video RTP/ o The MIME subtype ("3gpp-tt") and the timestamp rate go in SDP "a=rtpmap" line as the encoding name and (clock) rate, respectively: a=rtpmap: 3gpp-tt/ o The REQUIRED payload-format-specific parameters "width", "height", "tx", "ty", "layer", "spldesc", "version" and "tx3g" go in the SDP "a=fmtp" as a semicolon separated list of parameter= pairs or parameter= , for "tx3g" and "spldesc". The format is: a=fmtp: =[ ; =] o The OPTIONAL payload-format-specific parameter "brand", "cbrand", and "mver" go in the SDP "a=fmtp" as a semicolon-separated list of parameter= pairs. Details on the versioning are found in Release 5 of 3GPP TS 26.234 and corresponding specifications for later Releases. o Any remaining parameters go in the SDP "a=fmtp" attribute by copying them directly from the MIME media type string as a semicolon separated list of parameter=value pairs. Rey & Matsui [Page 25] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 o Any unknown parameters SHALL be ignored. 
8.2 Usage in Offer/Answer

In this section the meaning of the SDP parameters defined in this document within the Offer/Answer (O/A) [16] context is explained.

In unicast, sender and receiver typically negotiate the streams, i.e. which codecs and parameter values are used in the session. This is also possible in multicast to a lesser extent.

As stated in the O/A model, some "fmtp" (payload-format-specific) parameters have a clear meaning and shall be processed by the answerer as they are contained in the offer. Other parameters may need to be set among parties, because it is not clear that offerer and answerer SHALL use the same values. The only parameter whose value MAY be negotiated is "spldesc". An offerer may offer to send sample descriptions in two modes:

o "both": sample descriptions are sent in the current session both out-of-band and in-band. It is the responsibility of the server to decide which are sent using which method. The server SHALL ensure that the indispensable descriptions are sent out-of-band and, at the same time, that the out-of-band channel is not overloaded with large sample descriptions. Additionally, the contents SHALL still be useful if some in-band descriptions are lost, i.e. redundancy in some form (FEC [7], retransmission [13], repetition or a similar technique) SHOULD be applied.

o "out": sample descriptions MUST be sent out-of-band only. When included in a client's setup message, this is a way for a client to tell the server that it need not send in-band sample descriptions because the client will not use them anyway. Servers offering solely this method SHALL ensure that it is possible to rely on a reduced number of sample descriptions sent out-of-band so that the text is still useful.

Upon receiving the session description with this parameter containing a list of supported mechanisms, the answerer MAY decide to use one of these or none.
E.g., if a client only supports out-of-band and the server only offers "both", then the client MUST reject the offer by leaving the "spldesc" parameter empty. Otherwise, the client MUST include the "spldesc" with the desired value (MUST be just one) in its answer. The offerer MUST then use the preferred mechanism.

8.3 Usage outside of Offer/Answer

SDP may also be employed outside of the Offer/Answer context, for instance for multimedia sessions that are announced through the Session Announcement Protocol (SAP) [17], or streamed through the Real Time Streaming Protocol (RTSP) [18].

In this case, the only change with respect to the above is that the answerer cannot negotiate the "spldesc" value. If the answerer accepts the session as announced, it MUST be prepared to receive sample descriptions using both methods. This is compliant with the requirement for clients and servers to understand the "spldesc" parameter as well as static sample descriptions and, at the same time, to be able to discard units with dynamic sample descriptions, if not supported.

9. IANA Considerations

IANA is requested to register the MIME subtype name "3gpp-tt" for the media type "video" as specified in Section 7 of this document.

10. Security considerations

RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [3].

In particular, an attacker may invalidate the current set of valid sample descriptions at the client by repeating a packet with an old sample description, i.e. a replay attack. This would mean that the display of the text would be corrupted, if displayed at all. Another form of attack may consist of sending redundant fragments whose boundaries do not match the exact boundaries of the originals. This may cause a decoder to crash.

These types of attack may easily be avoided by using authentication.
Additionally, peers in a timed text session may desire to retain privacy in their communication, i.e. confidentiality.

This payload format does not provide any mechanism for achieving either. Both confidentiality and authentication have to be solved by a mechanism external to this payload format, e.g. SRTP [10].

11. References

11.1 Normative References

[1] Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6), TS 26.245 v 0.1.6, Working Draft, July 2003.

[2] ISO/IEC 14496-1:2001/AMD5, "Information technology - Coding of audio-visual objects - Part 1: Systems, ISO Base Media File Format", 2003.

[3] H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.

[4] M. Handley, V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.

[5] S. Bradner, "Key words for use in RFCs to indicate requirement levels," BCP 14, RFC 2119, IETF, March 1997.

11.2 Informative References

[6] C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J.C. Bolot, A. Vega-Garcia, S. Fosse-Parisis, "RTP Payload for Redundant Audio Data", September 1997.

[7] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, December 1999.

[8] C. Perkins, O. Hodson, "Options for Repair of Streaming Media", RFC 2354, June 1998.

[9] W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)", August, 2001.

[10] M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M. Naslund, K. Norrman, "The Secure Real-Time Transport Protocol", draft-ietf-avt-srtp-05.txt, June 2002.

[11] Handley, et al., "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 3448, January 2003.

[12] R. Hovey, S. Bradner, "The Organizations involved in the IETF Standards Process", BCP 11, RFC 2028, October 1996.

[13] J.
Rey et al., "RTP Retransmission Payload Format", draft-ietf- avt-rtp-retransmission-10.txt, work in progress, January 2004. [14] M. Handley, C. Perkins, "Guidelines for Writers of RTP Payload Format Specifications", RFC 2736, December 1999. [15] Van der Meer et al., "RTP Payload Format for Transport of MPEG-4 Elementary Streams ", RFC3640, November 2003. [16] J. Rosenberg., H. Schulzrinne, " An Offer/Answer Model with the Session Description Protocol (SDP)", RFC 3264, June 2002. Rey & Matsui [Page 28] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 [17] Transparent end-to-end packet switched streaming service (PSS); Protocols and codecs (Release 6), TS 26.234 v 0.4.0, Working Draft, February 2004. [18] Transparent end-to-end packet switched streaming service (PSS); Protocols and codecs (Release 5), TS 26.234 v 5.6.0, Working Draft, September 2003. 12. Annexes 12.1 Dynamic SIDX wraparound mechanism This mechanism MUST be implemented if the implementation shall use TYPE 5 units. As mentioned in Section 3.2.2, dynamic SIDX values remain active either during the entire duration of the session (if used just once) or in different intervals of it (if used once or more). Although 64 sample descriptions should cover the needs of most timed text applications, a wraparound mechanism to handle the exception is described here. In the following, SIDX value means dynamic SIDX value. There is a sliding window of 64 active SIDX values. Values within the window are active, all others are considered inactive. An SIDX value becomes "active" if at least one sample description identified by that SIDX has been received. Since sample descriptions MAY be sent redundantly, it is possible that a client receives a given SIDX several times. However, the receiver SHALL ignore redundant sample descriptions and it MUST use the already cached copy. The guard range of inactive values ensures that always the correct association SIDX <-> sample description is used. 
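The sliding window described in this section, together with the numbered algorithm steps given in this section, can be sketched as follows. This is only an illustration under stated assumptions: the class name DynamicSidxWindow and its method are hypothetical, the guard range is taken to be the 64 values following X modulo 128, and discarding cached descriptions whose SIDX falls into the inactive range is an assumption of this sketch rather than a stated requirement.

```python
WINDOW = 64      # size of the inactive guard range and of the active window
MODULUS = 128    # dynamic SIDX values live in [0,127]


class DynamicSidxWindow:
    """Tracks which dynamic SIDX values are active at a receiver."""

    def __init__(self) -> None:
        self.active = set()   # step 1: all dynamic SIDX values start inactive
        self.cache = {}       # SIDX -> sample description bytes

    def receive(self, sidx: int, description: bytes) -> bool:
        """Process an in-band sample description; True if the window moved."""
        if not 0 <= sidx < MODULUS:
            raise ValueError("not a dynamic SIDX value")
        moved = sidx not in self.active
        if moved:
            # Step 3: [X+1, X+64] modulo 128 becomes inactive, the rest active.
            inactive = {(sidx + k) % MODULUS for k in range(1, WINDOW + 1)}
            self.active = set(range(MODULUS)) - inactive
            for stale in inactive:            # assumption: drop stale entries
                self.cache.pop(stale, None)
        # Redundant copies are ignored; the already cached copy is kept.
        self.cache.setdefault(sidx, description)
        return moved


if __name__ == "__main__":
    window = DynamicSidxWindow()
    window.receive(4, b"description-4")
    print(sorted(window.active))
```

After receiving SIDX=4, the active set printed is [0,4] plus [69,127], matching the example given with the algorithm.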
The following algorithm is used to maintain the dynamic SIDX values:

Let X be the SIDX of the last received sample description. Let Y be a value within the allowed range for dynamic SIDX, [0,127], and different from X.

1. Initialize all dynamic SIDX values as inactive. For stored content, read the sample description index in the Sample to Chunk box ("stsc") for that sample. For live streaming, the first value MAY be zero or any other value in the interval above. The initial value is SIDX=X. Go to step 2.

2. First in-band sample description with SIDX=X is received. Go to step 3.

3. Set SIDX=Y inactive for every Y inside the interval [X+1 modulo(128), X+64 modulo(128)]. Otherwise, set SIDX=Y as active. Go to step 4.

4. Wait for next sample description. Upon reception of a sample description with SIDX=X do:

   a. If X is currently active, then wait for next SIDX (do nothing).

   b. Else go to step 3.

For example, if X=4, any SIDX in the interval [5,68] is inactive. Active SIDX values are in the complementary interval [69,127] plus [0,4]. Once the client is initialized, the interval of active SIDX values MUST change whenever a sample description with an inactive SIDX value is received. E.g., if the client receives a SIDX=6, then the active interval is now different: [0,6] plus [71,127]. However, if the received SIDX is in the current valid interval, no change SHALL be applied.

This means that at any instant a maximum of 64 SIDX values are valid, whereas the total of values used might be over 64.

12.2 Basics of the 3GP File Structure

This section provides a coarse overview of the 3GP file structure. Each 3GP file consists of "Boxes". Boxes start with a header, which indicates both the size and the type of the box. In general, a 3GP file contains the File Type Box (ftyp), the Movie Box (moov), and the Media Data Box (mdat).
The Movie Box and the Media Data Box, serving as containers, include own boxes for each media. Similarly, each box type may include a number of boxes. See ISO Base Media file Format [2] for a complete list of possibilities. In the following, only those boxes are mentioned, which are useful for the purposes of this payload format. The File Type Box identifies the type and properties of a 3GP file. The File Type Box contents comprise the major brand, the minor version and the compatible brands. When streamed with RTP, these are communicated via out-of-band means, such as SDP. The Movie Box (moov) contains one or more Track Boxes (trak) which include information about each track. A Track Box contains, among others, the Track Header Box (tkhd), the Media Header Box (mdhd) and the Media Information Box (minf). The Track Header Box specifies the characteristics of a single track, where a track is, in this case, the streamed text during a session. Exactly one Track Header Box is present for a track. It contains information about the track, such as the spatial layout (width and height), the video transformation matrix and the layer number. Since these pieces of information are essential and static, i.e. constant for the duration of the session, they MUST be sent prior to the transmission of any text samples. See the ISO base media file format [2] for details about the definition of the conveyed information. Rey & Matsui [Page 30] Internet Draft RTP Payload Format for 3GPP Timed Text May 10, 2004 The Media Header Box contains the timescale or number of time units that pass in one second, i.e. cycles per second or Hertz. The Media Information Box includes the Sample Table Box (stbl) which itself contains the Sample Description Box (stsd), the Decoding Time to Sample Box (stts), the Sample Size Box (stsz) and the Sample to Chunk Box (stsc). Sample descriptions for each text sample are encoded as "tx3g" sample entries in the Sample Description Box (stsd). 
The Sample Table Box (stbl) contains all the time and data indexing
of the media samples in a track. Using the tables here, it is
possible to locate samples in time, determine their type, and
determine their size, container, and offset into that container.

Finally, the Media Data Box contains the media data itself. In timed
text tracks this box contains text samples. The equivalents for audio
and video are audio frames and video frames, respectively. The text
sample consists of the text length, the text string, and one or
several Modifier Boxes. The text length is the size of the text in
bytes. The text string is the plain text to render. A Modifier Box
carries additional rendering information for the text, such as colour
or font.

12.3 Usage of 3GP file information for transport in RTP

For the purpose of streaming timed text contents, some values in the
boxes contained in a 3GP file are mapped to fields of this payload
header. This section explains where to find and how to use those
values.

From the Track Header Box (tkhd):

o  tx, ty: these are the third-to-last and second-to-last values,
   respectively, of the matrix in the Track Header Box. All 32 bits
   are used.
o  width, height, layer: these have the same names in the box and in
   the payload header. All 32 bits are used.

From the Sample Table Box (stbl) the following information is carried
in each RTP packet using this payload format:

o  the Sample Description Box (stsd): this box provides information
   on the basic characteristics of text samples. Each entry is a
   sample entry box of type "tx3g". Examples of the information
   contained in a sample entry are the font size and the background
   color. These pieces of information are commonly shared by many
   text samples during the session. Each "tx3g" sample entry is
   transported either in-band or out-of-band.
o  the Decoding Time to Sample Box (stts): the 24 least significant
   bits of the "sample_delta" are mapped to the field SDUR (Text
   Sample Duration).
o  the Sample Size Box (stsz): the 16 least significant bits of the
   "sample_size" or "entry_size" (depending on whether the sample
   size is fixed or variable) are mapped to the SLEN field for that
   sample.
o  the Sample to Chunk Box (stsc): the value of the
   "sample_description_index" for that sample in the Sample to Chunk
   Box is mapped to the field SIDX (Text Sample Entry Index).

The Sample to Chunk Box (stsc) associates each text sample with its
corresponding sample description entry in the Sample Description Box
(stsd, see above). Since the sample description may vary during the
session, this association, the SIDX, is sent together with the text
samples using this payload format.

13. Acknowledgements

The authors would like to thank Dave Singer, Jan van der Meer, Magnus
Westerlund and Colin Perkins for their comments and suggestions to
this document.

14. Author's Addresses

Jose Rey                               rey@panasonic.de
Panasonic European Laboratories GmbH
Monzastr. 4c
D-63225 Langen, Germany
Phone: +49-6103-766-134
Fax:   +49-6103-766-166

Yoshinori Matsui                       matsui.yoshinori@jp.panasonic.com
Matsushita Electric Industrial Co., LTD.
1006 Kadoma
Kadoma-shi, Osaka, Japan
Phone: +81 6 6900 9689
Fax:   +81 6 6900 9699

15. IPR Notices

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights.
Information on the procedures with respect to rights in RFC documents
can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

16. Full Copyright Statement

Copyright (C) The Internet Society (2004). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

17. Acknowledgement

Funding for the RFC Editor function is currently provided by the
Internet Society.