Internet Draft                                                    J. Rey
draft-rey-avt-3gpp-timed-text-01.txt                           Y. Matsui
                                                                  D. Ido
                                                               Y. Notoya
                                                              Matsushita
Expires: March 2004                                       September 2003


                 RTP Payload Format for 3GPP Timed Text

Status of this document

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This document specifies an RTP payload format for the transmission
   of 3GPP (3rd Generation Partnership Project) timed text.  3GPP timed
   text is a time-lined, decorated text media format based on the ISO
   (International Organization for Standardization) Base Media File
   Format.  As of today, 3GPP timed text contents can be downloaded via
   HTTP and synchronised with audio/video contents.  There is, however,
   no mechanism available for streaming 3GPP timed text.  The following
   sections address the problems of streaming timed text and specify a
   payload format for streaming 3GPP timed text over RTP.

Table of Contents

   1.  Change Log
   2.  Introduction
   3.  Terminology
   4.  RTP Payload Format for 3GPP Timed Text
   5.  Error Resilient Transport
   6.  Congestion control
   7.  SMIL usage
   8.  MIME Type usage Registration
   9.  SDP usage
   10. Examples of RTP packet structure
   11. IANA Considerations
   12. Security considerations
   13. References
   Annex A Fragmentation cases
   Annex B Basics of 3GP File Structure
   Author's Addresses
   IPR Notices
   Full Copyright Statement

1. Change Log

1.1 Changes from draft-rey-rtp-avt-3gpp-tt-00.txt

   Major changes:

   - Completed empty sections from the -00 draft.
   - Abstract and introduction re-arranged.  Moved section "Basics of
     the 3GP File Structure" to the end of the document as Annex B.
   - SLEN, SIDX and SDUR lengths fixed to 16, 16 and 24 bits,
     respectively.
   - New OPTIONAL header, SPLDESC, added to transport the sample
     description in-band.
   - Section 4 on payload format expanded: text header, fragment header
     and sample description header are fully specified.
   - SMIL usage section added.

2. Introduction

   The purpose of this draft is to provide a means to stream 3GPP timed
   text using RTP.

   3GPP timed text is a 3GP file format for time-lined decorated text
   specified in Annex D.8.a of [1].  The 3GP file format itself follows
   the ISO Base Media File Format recommendation [2].
   Besides plain text, the 3GPP timed text format allows the display of
   decorated text (e.g. blinking text, scrolling, hyperlinks), whether
   or not it is synchronised with other media.

   The 3GPP timed text format was developed for 3GPP Transparent
   End-to-end Packet-switched Streaming Services (PSS) [1].  The scope
   of the 3GPP PSS includes downloading and streaming of multimedia
   content over 3G packet-switched networks.  The PSS adopts multimedia
   codecs (such as MPEG-4 Visual, AMR, MPEG-4 AAC, and JPEG) and
   protocols like SMIL [3] for presentation layouts or RTP for
   streaming.

   The current usage of the 3GPP timed text file format is limited to
   downloading via HTTP (with or without audio contents) due to the
   lack of an appropriate RTP payload format.

   In general, a multimedia presentation might consist of several
   audio/video/text streams, or tracks (the name used in the 3GP file
   format).  Different tracks have different contents, and tracks of
   different media may be spatially synchronised using the information
   contained in the tracks or a scene description language like SMIL.
   An example would be a media session with three media tracks (one
   audio, one video and one timed text) that displays a music video
   with karaoke subtitles.  The information contained in each track
   defines the region where each medium is displayed, what the medium
   looks like and how it is synchronised; e.g., the song lyrics are
   displayed below the video and the words are highlighted in synchrony
   with the soundtrack.

   The 3GPP timed text format can be summarised as consisting of four
   distinct functional components:

   - initial setup information for text tracks: the height and width of
     the text region where the text track contents are displayed, the
     translation offsets tx and ty relative to the video track region,
     and the layer or proximity of the text to the user.  In the 3GPP
     timed text format, these pieces of information are extracted from
     the Track Header Box, "tkhd".
   - general formatting information about the text track: default font,
     default background colour, default horizontal and vertical
     justification, default line width, default scrolling, etcetera.
     In the 3GPP timed text format, these pieces of information are
     extracted from the Sample Description Box, "stsd".

   - the actual text, conveyed as plain text using either UTF-8 or
     UTF-16 encoding, and

   - the "decoration": whether it is highlighted text, blinking text,
     karaoke, hypertext, scroll delay, text styles/formatting other
     than the defaults, etcetera.  In the 3GPP timed text format, these
     pieces of information are extracted from the various Modifier
     Boxes: "hlit", "blnk", "krok", "href", "dlay", "styl" or "tbox".

   For details refer to Annex B, which summarises the basics of the 3GP
   file structure, and to [1], which contains a more detailed
   description of the setup information, format parameters and
   decoration.

2.1 Requirements

   This section lists a set of requirements, each with its
   justification.  An RTP Payload Format for 3GPP timed text SHALL:

   1. Keep the 3GP text sample structure.  The text sample consists of
      the text length, the text string (either UTF-8 or UTF-16
      encoded), and one or several "Modifier Boxes" containing the
      "decoration", as defined in [1].  This is important to foster
      interoperability of 3GP and RTP transmission formats.

   2. Transmit the text sample size, sample duration and sample
      description index in-band.  In the 3GP format this information is
      included in the header part.  In RTP it is important to transmit
      it in-band because this information might change from sample to
      sample.

   3. Enable the transmission of the information contained in the
      Sample Description Box (stsd) both out-of-band and in-band.  The
      reason for out-of-band transmission is that a sample description
      is usually referenced by many different text samples.
      To save overhead it is sensible to transmit these pieces of
      information once at the initialisation phase and update them upon
      demand, if needed.  However, the payload format SHALL also enable
      in-band transmission for applications like streaming of
      live-created content.

   4. Enable the aggregation of text samples in one single RTP packet.
      In a mobile communication environment a typical text sample size
      is around 100 bytes.  Thus, multiplexing several text samples
      makes the transport over RTP more efficient.

   5. Enable the fragmentation of a text sample into several RTP
      packets in order to cover a wide range of applications and
      network environments.

   6. Enable the use of resilient transport mechanisms, such as
      repetition, retransmission and FEC.

3. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [11].

4. RTP Payload Format for 3GPP Timed Text

   The format of an RTP packet containing 3GPP timed text is shown
   below:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                          RTP payload                          +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Marker bit (M): the marker bit must be set to 1 if the RTP packet
   includes the last part of a text sample; otherwise it is set to 0.

   Timestamp: the timestamp indicates the sampling instant of the timed
   text sample contained in the RTP packet.  The initial value is
   randomly determined.
   If the RTP packet includes more than one text sample (text sample
   aggregation), the timestamp indicates the sampling instant of the
   first text sample in the RTP packet.  The timestamp of each
   subsequent sample is obtained by adding the durations of all
   preceding samples in the packet to the timestamp value.  For
   example, let sdur(0), sdur(1) and sdur(2) be the durations of three
   consecutive timed text samples included in an RTP packet, and let
   rtpts be the timestamp in the RTP header.  The timestamp ts(i) of
   each sample (i = 0, 1, 2) is:

      ts(i) = rtpts + sum[sdur(j)], for j = 0 .. i-1

      ts(0) = rtpts
      ts(1) = rtpts + sdur(0)
      ts(2) = rtpts + (sdur(0) + sdur(1))

   Some text samples may become large and have to be fragmented and so
   spread over several RTP packets.  In this case, the receiver needs
   to associate fragments of the same text sample.  This is done using
   the timestamp.

   The default value of the timestamp clock rate is 1000 Hz.  Other
   values may be specified by out-of-band means.

   Payload Type (PT): the payload type is set dynamically and sent by
   out-of-band means.

   The usage of the remaining RTP header fields follows the rules of
   RTP [7] and the profile in use.

   This payload format defines three payload headers: the text header,
   THDR, the fragment header, FHDR, and the sample description header,
   SPLDESC.  Which of these payload headers are used depends on the
   contents of the payload and on how sample description information is
   transmitted.  Note that both in-band and out-of-band transmission of
   sample description information are possible.  Support for
   out-of-band transmission is MANDATORY, while support for in-band
   transmission is OPTIONAL.

   This payload format is used to convey both fragmented and
   non-fragmented text samples.  When an RTP packet contains one or
   more (non-fragmented) text samples, no FHDR is used.  When an RTP
   packet contains one text sample fragment, FHDR is always present and
   precedes the THDR, if present.
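
   As an illustration of the timestamp rule for aggregated text samples
   given earlier in this section, the following is a minimal sketch and
   not part of the payload format.  The function name and its arguments
   are hypothetical, and the wrap-around at 2**32 is an assumption based
   on the RTP timestamp being a 32-bit field.

```python
def aggregated_sample_timestamps(rtpts, sdurs):
    """Return ts(i) for each text sample aggregated in one RTP packet.

    rtpts -- timestamp taken from the RTP header (applies to sample 0)
    sdurs -- SDUR values of the aggregated samples, in packet order
    """
    timestamps = []
    ts = rtpts
    for sdur in sdurs:
        # Assumption: wrap at 32 bits, the width of the RTP timestamp field.
        timestamps.append(ts % 2**32)
        # The next sample starts when the current one ends.
        ts += sdur
    return timestamps


# Three samples lasting 2000, 1500 and 3000 ticks of the default 1000 Hz clock.
print(aggregated_sample_timestamps(90000, [2000, 1500, 3000]))
```

   With the values above, ts(0) is the RTP timestamp itself and each
   later sample is offset by the sum of the durations that precede it.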

   Note that only one text sample fragment is allowed in an RTP packet
   that includes the fragment header, FHDR.  Moreover, if the sample
   description is conveyed in-band, the SPLDESC header is placed
   between FHDR and THDR.  The use of the SPLDESC header is signalled
   by out-of-band means, see Section 9.

   The RTP sender implementing this payload format sends fragmented and
   non-fragmented text samples using two different payload types which
   are mapped dynamically, i.e. payload type multiplexing.  For this
   purpose, a new SDP parameter, "fragment", is specified in this
   document, see Section 9.  The receiver recognises a fragmented text
   sample by the payload type value.  Note that this does not conflict
   with Section 5.2 of RTP [7] because it is the same media that is
   being transmitted.

   The following drawings illustrate the different RTP payload
   compositions.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                            THDR #1                            +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                        text sample #1                         |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                            THDR #2                            |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                        text sample #2                         |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                              ...                              |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 1 RTP payload structure when it contains one or more text
   samples.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        FHDR (variable)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             THDR                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                     text sample fragment                      +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 2 RTP payload structure containing a text sample fragment.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +      SPLDESC (variable)       |             THDR              +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        text sample #1                         |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 3 RTP payload structure containing a text sample with in-band
   sample description entries.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +     FHDR      |              SPLDESC (variable)               +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +             THDR              |                               +
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+     text sample fragment      +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 4 RTP payload structure containing a text sample fragment
   with in-band sample description entries.

4.1 Text Header

   The Text Header, THDR, is used to convey both (whole) text samples
   and text sample fragments.  It gives basic characteristics of each
   text sample.  The THDR consists of three fields, SLEN, SIDX and
   SDUR, in this order:

   - SLEN (16 bits) "Text Sample Length": indicates the size of the
     text sample in bytes, which corresponds to the entry value in the
     Sample Size Box, "stsz", for that sample.  Note that the text
     sample as such includes the text string length field, the text
     string and the modifier boxes (if present).

   - SIDX (16 bits) "Text Sample Entry Index": indicates the reference
     index for the text sample, which corresponds to the index field in
     the Sample to Chunk Box, "stsc", for the sample.

   - SDUR (24 bits) "Text Sample Duration": indicates the duration of
     the text sample in timestamp units, which corresponds to the entry
     value in the Decoding Time to Sample Box, "stts", for that sample.

   See Annex B, [1] and [2] for details on the boxes.

   The composition of the THDR depends on whether the text sample is
   fragmented or not.  In particular:

   - SLEN MUST NOT be present if the RTP packet contains a text sample
     fragment.  If the RTP packet carries one or several non-fragmented
     samples, SLEN MUST be present for every text sample, as in
     Figure 1.

   - SIDX and SDUR MUST be present whenever the packet carries text
     string data.  For fragments, this is whenever T=1 in the fragment
     header.

   Some examples follow:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             SLEN              |             SIDX              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     SDUR                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   (a)

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             SIDX              |             SDUR              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     SDUR      |
   +-+-+-+-+-+-+-+-+

   (b)

   Figure 5 Illustration of THDR

   Figure 5 (a) shows the THDR when the "fragment" parameter is set to
   0.  In (b), "fragment" is set to 1 and SLEN is therefore not
   present.

4.2 Fragment Header

   The fragment header, FHDR, is used for packets containing a text
   sample fragment.  In this section motivation is given for the
   specification of a fragment header, and some fragmentation rules for
   3GPP timed text samples are defined.  Following that, the fragment
   header itself, FHDR, is specified.

4.2.1 Fragmentation of Timed Text Samples

   It is expected that timed text samples will usually fit into the MTU
   size of the network path used.  However, the text string and some
   modifier boxes, i.e. those for hyperlinks ("href"), karaoke ("krok")
   and text styles ("styl"), might become large and need fragmentation.

   In order to provide packet loss resilience for fragmented text
   samples, fragmentation rules are defined in this document for each
   allowed fragmentation case.  These cases are outlined in Annex A.
   The rules allow the receiver to use the received fragments without
   having to receive the whole text sample or complete modifier boxes.

4.2.2 Fragmentation Rules

   - An RTP packet MAY contain more than one text sample.  In this
     case, all text samples MUST be whole (non-fragmented).  This is
     called aggregation.  The timestamps of the aggregated text samples
     can be resolved as explained at the beginning of Section 4.

   - The fragment header MUST inform the receiver about the location of
     the text string fragment within the original (whole) text string.
     This applies to cases f) through j) in Annex A.

   - Potentially large modifier boxes, such as the "href", "krok" and
     "styl" modifier boxes, SHOULD be fragmented at meaningful
     boundaries if they do not fit into the path MTU.  The reason is
     that the fragmentation of these boxes should allow modifier box
     fragments to be used even if previous fragments have not arrived.
     Therefore, rules are defined for the fragmentation of each
     modifier box, to enable the client application to recognise the
     contents in each of the cases that can arise.  As a general rule,
     a box should be separated at the boundary of smaller objects, such
     as an entry or a child-box.
     The modifier box fragmentation header might have to duplicate
     important information that guarantees the usefulness of the
     decoration information contained in each fragment.  See cases h)
     through r) in Annex A.

4.2.3 Fragment Header Format

   The use of the FHDR is signalled in SDP by setting the parameter
   "fragment" to 1 (see Section 9 for details).  The FHDR SHALL be
   present in all RTP packets containing text sample fragments.  If
   present, the FHDR precedes both the SPLDESC header and the THDR, see
   Figure 4.  The format of this header is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|          SCO/MBP/MTSF (conditional)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   - R (3 bits) "Reserved": SHALL be set to zero.

   - T (1 bit) "Text indicator": SHALL be set if text strings of the
     text sample fragment are sent in this RTP packet.

   - U (1 bit) "UTF text type indicator": SHALL be set if the text
     strings of the text sample fragment are encoded in UTF-16.
     Otherwise, this bit SHALL be set to zero.  Furthermore, it is
     REQUIRED that, for non-fragmented text samples, the decoder is
     able to identify the text type by looking at the first two bytes
     of text.  If the first two bytes are 0xFE followed by 0xFF, then
     the text is encoded using UTF-16 big-endian.  Otherwise, the text
     is UTF-8.  3GPP terminals are not required to recognise the
     reversed byte order mark, i.e. 0xFF followed by 0xFE.  Since this
     indication is present only at the beginning of each text string in
     the text sample, the U bit is needed in the fragment header.

   - S (1 bit) "Start Character Offset Indicator": if set, it indicates
     the presence of the "Start Character Offset" field, SCO.  This bit
     MUST be set if the text sample fragment does not contain the first
     bytes of the text sample.

   - M (1 bit) "Modifier-Box-Pointer Indicator": if set, it indicates
     the presence of the "Modifier Box Pointer" field, MBP.  This bit
     must be set if the text sample fragment contains a text string
     fragment without its initial bytes followed by one or more
     modifier boxes.  In these cases the first byte of the first whole
     modifier box cannot be found without the MBP information.  See
     Annex A, cases h)-j) and p)-r).

   - F (1 bit) "Modifier Type Specific Fields indicator": this bit
     SHALL be set to 1 if the text sample fragment starts with a
     fragment of a modifier box that does not contain the first bytes
     of that modifier box.  In this case, the modifier type specific
     fields, MTSF, are present.  The MTSF is specified below in this
     section.  See Annex A, cases n) through r).

   - SCO (16 bits) "Start Char Offset": specifies the offset (in
     characters) of the first character of the text string fragment in
     the original (complete) text string.  This field is present only
     when the S bit is set.  Note that, since multi-byte characters are
     permitted, it is REQUIRED that the sender of RTP packets finds out
     the exact SCO value for the second and subsequent fragments.  The
     offset of the first character is zero.

   - MBP (16 bits) "Modifier Box Pointer": a value in bytes used to
     point to the first byte of the first whole modifier box in the
     text sample fragment.  This field must be present if the M bit is
     set.  This value does not include the length of any payload header
     (THDR or SPLDESC) or MTSF fields that may be present after this
     field.  If both SCO and MBP fields are present, the SCO field
     precedes the MBP field, see Figure 6 c).

   - MTSF "Modifier Type Specific Fields": these fields are used to
     decode the contents of modifier boxes when RTP packets conveying
     preceding fragments of a modifier box are lost.  The MTSF is
     present if, and only if, the F bit is set to 1.
     The use of the MTSF enables the transport of potentially larger
     modifier boxes.  At the time of writing this document, the
     following boxes are allowed to be fragmented and use this
     extension: the "krok" Karaoke Box, the "styl" Style Box and the
     "href" Hypertext Box.  Other boxes like "blnk", "dlay", "tbox",
     "hclr" and "hlit" are considered small enough not to need
     fragmentation.  MTSF fields defined for other modifier boxes
     specified in the future should follow the MTSF structure outlined
     in this document.

   Note that the MTSF fields and the SCO field are mutually exclusive,
   since a text string fragment without its first byte cannot be in the
   same text sample fragment as a modifier box fragment which also does
   not contain its first byte.  See Annex A for a visualisation of the
   different fragmentation possibilities.

   The MTSF is composed of a modifier box type field (TYPE), a length
   field (LEN) and modifier-specific fields.  The TYPE and LEN fields
   have the following format:

   - TYPE (32 bits): contains the 4 ASCII characters corresponding to
     the modifier box abbreviation.  This field indicates the box type
     of the fragmented modifier box.  For example, the values
     0x7374796C, 0x6B726F6B and 0x68726566 are used for the "styl",
     "krok" and "href" modifier boxes, respectively.

   - LEN (8 bits): indicates the length in bytes of the modifier-
     specific fields that follow this field.  This length field may be
     used to skip unknown or unsupported MTSF headers.

   The modifier-specific fields for some modifier boxes are defined in
   the next sections.

   In the following, different possibilities for the FHDR are outlined:

    0
    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|
   +-+-+-+-+-+-+-+-+

   (a) T = 1, S = 0, M = 0 and F = 0 in the FHDR.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|              SCO              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   (b) T = 1, S = 1, M = 0 and F = 0 in the FHDR.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|              SCO              |      MBP      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      MBP      |
   +-+-+-+-+-+-+-+-+

   (c) T = 1, S = 1, M = 1 and F = 0 in the FHDR.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|              MBP              |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   |                             MTSF                              |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   (d) T = 0, S = 0, M = 1 and F = 1 in the FHDR.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|                     TYPE                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     TYPE      |      LEN      |   modifier specific fields    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   (e) T = 0, S = 0, M = 0 and F = 1 in the FHDR.

   Figure 6 Examples of the Fragment Header, FHDR

4.2.4 Modifier-specific fields for the "styl" fragment

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TYPE ="styl"                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      LEN      |
   +-+-+-+-+-+-+-+-+

   Figure 7 MTSF for "styl".

   No modifier-specific fields are needed for the "styl" fragment.  The
   fragmented box data in the payload SHALL begin with the first byte
   of an entry of the "styl" box.
   In other words, fragmentation MUST be done at entry boundaries.

4.2.5 Modifier-specific fields for the "krok" fragment

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TYPE ="krok"                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      LEN      |                     KRSTO                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     KRSTO     |
   +-+-+-+-+-+-+-+-+

   Figure 8 MTSF for "krok".

   - KRSTO (32 bits) "Karaoke Start Time Offset": specifies the
     execution time of the first entry in the "krok" fragment.

   The fragmented box data in the payload SHALL begin with the first
   byte of an entry of the "krok" box.  In other words, fragmentation
   MUST be done at entry boundaries.

4.2.6 Modifier-specific fields for the "href" fragment

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TYPE ="href"                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      LEN      |             HRSCO             |     HRECO     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     HRECO     |
   +-+-+-+-+-+-+-+-+

   Figure 9 MTSF for "href".

   - HRSCO (16 bits) "Hypertext Start Char Offset": specifies the start
     offset of the text to be linked.  This field shall be equal to the
     "startcharoffset" field of the original "href" box.  See Annex D.8
     of [1] for details.

   - HRECO (16 bits) "Hypertext End Char Offset": specifies the end
     offset of the text to be linked.  This field shall be equal to the
     "endcharoffset" field of the original "href" box.  See Annex D.8
     of [1] for details.

   Fragmentation MUST be done between the URL and the altLength fields.

4.3 SPLDESC Header

   A streaming server MAY transport the sample description in-band
   using the SPLDESC header.
   This information is extracted for each text sample from its
   corresponding "tx3g" sample entry in the Sample Description Box
   "stsd" (see [1]) and relates to the display options and formatting
   of the text itself.  Examples are the exact positioning of the text
   box within the text region, the complete set of fonts used in the
   text sample, the type of scrolling or the background colour of the
   text box.  Note that the text display region is specified by the
   Track Header Box "tkhd" information as described in Section 9.

   This header is useful for real-time streaming of timed text,
   especially when text samples are created and transmitted in real
   time, e.g. live streaming with captions.

   This header is placed before the THDR and after the FHDR, if
   present.  The use of in-band sample description transmission is
   indicated in SDP.

   Schematically, the SPLDESC header consists of an initial entry count
   byte followed by a number of (SIDX,SPLATTR) pairs:

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | entry count (1 Byte) | SIDX #1 (variable) | SPLATTR #1 (var.) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | SIDX #2 (variable) |    SPLATTR #2 (variable)     |...........:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+...........+
   :...............................................................:
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   where,

   - entry count: the initial 8 bits indicate the number of entries.
     An entry is the pair (SIDX,SPLATTR) corresponding to a text
     sample.

   - SIDX: this is the sample description index as used in the THDR.
     The same SIDX value as in that header MUST be copied here.  This
     field is used to map the sample description attributes (see next
     field, SPLATTR) to the text sample.

   - SPLATTR: this field contains the sample description attributes
     conveyed by the SPLDESC header for each text sample.  The format
     of this field is described in the next section.
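
   The entry structure described above can be walked as in the
   following minimal sketch, which is an illustration only and not part
   of the payload format.  It assumes that SIDX occupies 16 bits in
   network byte order, as it does in the THDR (the schematic marks the
   field as variable), and it leaves the decoding of SPLATTR to a
   caller-supplied function.  The names walk_spldesc and parse_splattr
   are hypothetical.

```python
import struct


def walk_spldesc(payload, offset, parse_splattr):
    """Walk an SPLDESC header that starts at payload[offset].

    parse_splattr(payload, pos) is a caller-supplied (hypothetical) function
    that decodes one SPLATTR field and returns (attributes, bytes_consumed).
    Returns ({SIDX: attributes}, offset of the first byte after SPLDESC).
    """
    entry_count = payload[offset]  # initial 8-bit entry count
    pos = offset + 1
    entries = {}
    for _ in range(entry_count):
        # Assumption: SIDX is 16 bits, network byte order, as in the THDR.
        (sidx,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        attributes, consumed = parse_splattr(payload, pos)
        pos += consumed
        entries[sidx] = attributes
    return entries, pos
```

   A receiver could then look up the attributes for a text sample by the
   SIDX value found in its THDR.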

   For in-band transmission, the following rules apply:

   - All SIDX values present in the SPLDESC field of an RTP packet MUST
     be present in at least one of the text samples in the payload.

   - The contents of the SPLATTR fields for a given SIDX value MUST NOT
     be changed during the timed text session.

   These rules ensure that received packets can be decoded without
   dependencies upon other packets.

4.3.1 SPLATTR field

   This field contains the text sample default attributes as in the
   "tx3g" sample entry defined in Annex D.8a.16 of [1].  The length of
   this field is variable.  It contains an initial byte with 1-bit
   flags.  Each flag indicates whether the corresponding field is
   present in the bits that follow.  If all flags were set, the SPLATTR
   field would look like this:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | | | | | | | | |       |F|F|F|T|                               |
   |R|D|H|V|B|T|S|F|   R   |I|S|S|C|         displayFlags          |
   | | | | | | | | |       |D|F|Z|R|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         displayFlags          |  hor. just.   |  vert. just.  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       background colour                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       default text box                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       default text box                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         default style                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         default style                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         default style                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          font table                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          font table                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          font table           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   where,

   - R (1 bit): reserved bit,
   - D (1 bit): displayFlags flag,
   - H (1 bit): horizontal justification flag,
   - V (1 bit): vertical justification flag,
   - B (1 bit): background rgba colour flag,
   - T (1 bit): default text box flag,
   - S (1 bit): default style flag,
   - F (1 bit): font table flag.

   If the D bit is set, the "displayFlags" field (16 bits) is present.
   Its values indicate display options of the text: scroll in/out,
   scroll direction, karaoke or vertical text.

   If the H (V) bit is set, the horizontal (vertical) justification
   field (8 bits) is present.

   If the B bit is set, four octets (32 bits) indicate the rgba
   background colour.  The order of the octets is: red, green, blue and
   alpha (transparency).

   If the T bit is set, the default text box field (64 bits) is
   present.  This field consists of four 16-bit values (top, left,
   bottom, right, in this order) that define the position of the text
   box in pixels relative to the text region origin.

   If the S bit is set, the default style box is present.
   To indicate which fields of the default style are present, an
   additional byte of flags (see figure above) is used as follows:

    0
    0 1 2 3 4 5 6 7
   +-+-+-+-+---+---+---+---+
   |   R   |FID|FSF|FSZ|TCR|
   +-+-+-+-+---+---+---+---+

   where,

   - R (4 bits): reserved
   - FID (1 bit): font-ID flag
   - FSF (1 bit): face-style-flags flag
   - FSZ (1 bit): font-size flag
   - TCR (1 bit): text-color-rgba flag

   If the FID bit is set, the font-ID field (16 bits) is present.
   If the FSF bit is set, the face-style-flags field (8 bits) is present.
   If the FSZ bit is set, the font-size field (8 bits) is present.
   If the TCR bit is set, the text-color-rgba field (32 bits) is present.

   If the F bit is set, the font table (variable size, 10 bytes in this
   example) is present. The font table contains an entry count field
   (16 bits) followed by a number of font entries. A font entry
   consists of:

   - font-ID (16 bits): font identifier from the font table.

   - font-name-length (8 bits): gives the length of the font name in
     bytes.

   - font name, expressed as a string of 8-bit UTF-8 characters, unless
     preceded by a UTF-16 byte-order-mark, whereupon the rest of the
     string is in 16-bit Unicode characters. The string is a
     comma-separated list of font names to be used as alternative
     fonts, in preference order.

   For details refer to [1].

5. Error Resilient Transport

   3GPP Timed Text operates at low bit rates. For this reason the use of
   some form of transport redundancy is RECOMMENDED, unless the
   underlying transport layer guarantees error-free transmission. In
   addition, retransmission [14] MAY be used to re-send lost packets.

5.1 Basic Error Resilient Transport through Repetition

   The simplest option for error resilient transport is to send the same
   text samples (or fragments) again. A server MAY decide to send the
   same RTP packets again as a measure for error resilience or as an
   update of the original information.
   If the same RTP packet, or a packet that updates the original, is
   sent, all RTP header fields keep their original values except the
   sequence number, which MUST be increased to comply with RTP. If
   several text samples with the same timestamp are received, the
   receiver SHOULD use the one received in the RTP packet with the
   highest sequence number.

6. Congestion control

   The RTP profile under which this payload format is used defines an
   appropriate congestion control mechanism for different environments.
   Following the rules of the profile, an RTP application can determine
   its acceptable bitrate and packet rate in order to be fair to other
   TCP or RTP flows.

   If an RTP application using this payload format uses retransmission,
   the acceptable packet rate and bitrate include both the original and
   the retransmitted data. This guarantees that an application using
   retransmission achieves the same fairness as one that does not. In
   practice, such a rule translates into the following actions. If
   enhanced service is used, it should be ensured that the total bitrate
   and packet rate do not exceed those of the requested service. It
   should further be monitored that the requested services are actually
   delivered. In a best-effort environment, the sender SHOULD NOT send
   retransmission packets without reducing the packet rate and bitrate
   of the original stream (for example by encoding the data at a lower
   rate).

   Similar considerations apply if an RTP application using this payload
   format implements forward error correction (FEC) [4]. Here, the
   sender should take care that the amount of FEC does not actually
   worsen the problem.

   Therefore, it is RECOMMENDED that applications implementing this
   payload format also implement congestion control. The actual
   mechanism for congestion control is out of the scope of this document
   but should be suitable for real-time flows.
   As an example, RFC 3448 [10] specifies an equation-based congestion
   control that fulfils this requirement.

7. SMIL usage

   The SMIL recommendation [6] specifies a means for synchronising
   different media streams. This payload format defines the spatial
   layout parameters for a timed text stream. These specify the location
   of the text display area relative to the top left corner of the video
   display area when a text stream is played with a single video stream
   without SMIL. In cases where several media streams are to be
   synchronised, SMIL MAY be used to specify the spatial layout
   parameters. Note that even if a SMIL scene description is used, the
   track header information SHOULD be sent anyway, as it represents the
   intrinsic media properties.

8. MIME Type usage Registration

8.1 Registration of parameters for the MIME video/3gpp-tt

   MIME type: video

   MIME subtype: 3gpp-tt

   Required parameters:

      rate: the RTP timestamp clock rate is equal to the clock rate of
      the media. The default value of the clock frequency of the
      timestamp is 1000 Hz. Other values may be specified by out-of-band
      means. See Section 9.

      "fragment=" followed by a value that is either zero or one. If
      set, it indicates the presence of FHDR in RTP packets with that
      payload type value. Otherwise FHDR is not present. See Section 4.

      "brand=" followed by a value that identifies the specification of
      Timed Text being transmitted in RTP, e.g. "3gp5". See Section 9.

      "version=" followed by a positive integer. See Section 9.

      "spldesc-inband=" followed by a value that is either zero or one,
      to indicate out-of-band or in-band transmission of sample
      description information, respectively. Note that although the
      support of the SPLDESC header is OPTIONAL, this parameter MUST be
      implemented by servers and clients. Clients SHALL understand this
      parameter in order to be able to discard a streaming session
      offering in-band sample description.

      "tx3g=" followed by a comma-separated list of sample description
      entries in base64 encoding. This parameter is only required if
      "spldesc-inband" is zero. The list of sample entries is not
      required to follow any particular order.

      "width=" followed by a value that indicates the width of the text
      track, i.e. the area where the text is actually displayed.

      "height=" followed by a value that indicates the height of the
      text track.

      "tx=" followed by a value that indicates the horizontal
      translation offset of the text track with respect to the origin of
      the video track.

      "ty=" followed by a value that indicates the vertical translation
      offset of the text track.

      "layer=" followed by a value that indicates the proximity of the
      text track to the viewer. Higher values mean closer to the viewer.

   Encoding considerations: this type is only defined for transfer via
   RTP.

   Security considerations: see Section 12 of this document.

   Interoperability considerations: none.

   Published specification: RFC XXXX

   Applications which use this media type: multimedia streaming
   applications.

   Additional information: the 3GPP Timed Text format is specified in
   Annex D.8a of [1].

   Person & email address to contact for further information:
      rey@panasonic.de
      matsui@drl.mei.co.jp

   Intended usage: COMMON

   Author/Change controller:
      Jose Rey
      Yoshinori Matsui
      IETF AVT WG

9. SDP usage

   This document defines the MIME type name "video/3gpp-tt" and
   introduces ten REQUIRED payload-format-specific parameters:
   "fragment", "brand", "version", "width", "height", "tx", "ty",
   "layer", "spldesc-inband" and "tx3g".

   The parameter "fragment" indicates the presence of the FHDR. The
   syntax is:

   - "fragment=" followed by a value that is either zero or one. If set,
     it indicates the presence of FHDR in all RTP packets with that
     payload type value. Otherwise FHDR is not present.

   The parameter "brand" has the following format:

   - "brand=" followed by a value that identifies the specification of
     Timed Text being transmitted in RTP, e.g. "3gp5".

   A brand indicates a specification.
   For example, the brand value "3gp5" indicates 3GPP Technical
   Specification (TS) 26.234 [1].

   The parameter "version" has the following syntax:

   - "version=" followed by a positive integer.

   How the version is defined is out of the scope of this document. For
   example, in 3GPP, the version for TS 26.234 is calculated as
   256 * x + y for 3GPP TS 26.234 version Z.x.y. Therefore, the version
   value 768 means version 5.3.0.

   The parameter "spldesc-inband" is used to indicate whether the
   transport of sample description information takes place in-band. If
   set, the SPLDESC header MUST be used. The format is:

   - "spldesc-inband=" followed by a value that is either zero or one,
     to indicate out-of-band or in-band transmission, respectively.

   The support of out-of-band transmission of sample description is
   MANDATORY. The support of in-band transmission is OPTIONAL. However,
   clients and servers MUST implement and understand this parameter.

   The parameter "tx3g" is used to convey the contents of the sample
   description entry for each text sample, in base64 encoding format.
   The syntax is as follows:

   - "tx3g=" followed by a comma-separated list of base64-encoded
     sample description entries.

   This parameter is only required if "spldesc-inband" is zero. See
   Section 4. The list of sample entries is not required to follow any
   particular order. Compression is used because the sample description
   information may comprise hundreds of bytes.

   The parameters "width" and "height" represent the width and height of
   the text box (where text is actually viewed) within the text region;
   "tx" and "ty" represent the horizontal and vertical translation
   offsets relative to the origin of the video track; and "layer"
   indicates the proximity to the viewer. The format is:

   - a=fmtp:xy width=;height=;tx=;ty=;layer=

   where all values are integers and xy is a dynamic payload type
   integer value.
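   The parameter syntax and the version arithmetic described in this
   section can be sketched as follows. This is an illustrative sketch,
   not part of the payload format: the function names are hypothetical,
   the sample parameter string is made up for the example, and the
   decoding step assumes that y is smaller than 256 so that x and y can
   be recovered unambiguously from 256 * x + y.

```python
from typing import Dict, Tuple


def parse_fmtp_params(param_string: str) -> Dict[str, str]:
    """Split a semicolon-separated list of name=value pairs.

    Only the first "=" separates name from value, so base64 padding
    inside a "tx3g" value is preserved.
    """
    params: Dict[str, str] = {}
    for item in param_string.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        params[name.strip()] = value.strip()
    return params


def encode_3gpp_version(x: int, y: int) -> int:
    """Compute 256 * x + y for a TS 26.234 version Z.x.y."""
    return 256 * x + y


def decode_3gpp_version(version: int) -> Tuple[int, int]:
    """Recover (x, y) from the version value, assuming y < 256."""
    if version <= 0:
        raise ValueError("version must be a positive integer")
    return divmod(version, 256)


# Hypothetical parameter string, used only for illustration.
example = "brand=3gp5;version=768;fragment=0;width=176;height=144"
params = parse_fmtp_params(example)
x, y = decode_3gpp_version(int(params["version"]))
```

   With the value 768 used in the text, the sketch yields x = 3 and
   y = 0, matching version 5.3.0 of the Release 5 specification.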
   As pointed out in Section 7, even if a SMIL scene description is
   used, the track header information SHOULD be sent anyway, as it
   represents the intrinsic media properties.

9.1 Mapping to SDP

   The information carried in the MIME media type specification has a
   specific mapping to fields in SDP [8], which is commonly used to
   describe RTP sessions. When SDP is used to specify transmission using
   this payload format, the mapping is done as follows:

   - The MIME type ("video") goes in the SDP "m=" as the media name. The
     "video" MIME type is used because timed text is considered visual
     media.

   - The MIME subtype ("3gpp-tt") goes in SDP "a=rtpmap" as the encoding
     name. The default RTP clock rate for this payload format is
     1000 Hz. Other values MAY be specified by out-of-band means.

   - The REQUIRED payload-format-specific parameters "fragment",
     "brand", "version", "width", "height", "tx", "ty", "layer", "tx3g"
     and "spldesc-inband" go in the SDP "a=fmtp" as a semicolon
     separated list of parameter=value pairs (for "tx3g" the value is a
     list).

   - Any remaining parameters go in the SDP "a=fmtp" attribute by
     copying them directly from the MIME media type string as a
     semicolon separated list of parameter=value pairs.

   In the following sections some example SDP descriptions are
   presented.

10. Examples of RTP packet structure

   In this section, some examples of RTP packet structure are explained
   for a better understanding of this payload format. The wrap-around of
   long lines is indicated by the backslash character "\". The examples
   assume aggregate control of stream container files. The session
   descriptions are not complete; they are limited to what the examples
   require.
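   The mapping rules of Section 9.1 can be sketched in Python as
   follows. This is a minimal illustration under stated assumptions:
   the function names encode_tx3g and build_media_description are
   hypothetical, the two-byte entry used in the usage example is a
   placeholder and not a valid "tx3g" sample entry, and only the "m=",
   "a=rtpmap" and "a=fmtp" lines are produced.

```python
import base64
from typing import Dict, List, Sequence


def encode_tx3g(entries: Sequence[bytes]) -> str:
    """Base64-encode each sample description entry and join with commas."""
    return ",".join(base64.b64encode(e).decode("ascii") for e in entries)


def build_media_description(port: int,
                            payload_type: int,
                            params: Dict[str, str],
                            clock_rate: int = 1000) -> List[str]:
    """Map the MIME type, subtype and parameters onto SDP lines.

    - the MIME type "video" goes in "m=" as the media name,
    - the MIME subtype "3gpp-tt" goes in "a=rtpmap" with the clock rate,
    - the parameters go in "a=fmtp" as semicolon-separated pairs.
    """
    fmtp = ";".join(f"{name}={value}" for name, value in params.items())
    return [
        f"m=video {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} 3gpp-tt/{clock_rate}",
        f"a=fmtp:{payload_type} {fmtp}",
    ]


# Usage with a placeholder entry (NOT a real tx3g sample entry).
lines = build_media_description(
    49170,
    98,
    {
        "brand": "3gp5",
        "fragment": "0",
        "spldesc-inband": "0",
        "tx3g": encode_tx3g([b"\x00\x01"]),
    },
)
```

   Keeping the parameters in a dictionary preserves their insertion
   order in the "a=fmtp" line; the payload format itself does not
   require any particular order.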
10.1 An RTP packet containing multiple text samples

   Figure 10 shows an example of an RTP packet composed of two timed
   text samples. Only two sample description entries are described, and
   symbolic values are used for clarity.

   m=video 49170 RTP/AVP 98 99
   a=control:rtsp://server/example.3gp/text
   a=rtpmap:98 3gpp-tt/1000
   a=rtpmap:99 3gpp-tt/1000
   a=fmtp:98 brand=3gp5;fragment=0;width=176;height=144;\
   layer=1;tx=0;ty=0;spldesc-inband=0;tx3g=,\
   a=fmtp:99 brand=3gp5;fragment=1;width=176;height=144;\
   layer=1;tx=0;ty=0;spldesc-inband=0;tx3g=,\

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|    PT=98    |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  timestamp of text sample #1                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            SLEN #1            |            SIDX #1            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   (SDUR #1)                   |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                        text sample #1                         |
   +                               +-------------------------------+
   |                               |            SLEN #2            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            SIDX #2            |            SDUR #2            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    SDUR #2    |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                        text sample #2                         |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 10. An RTP packet containing two text samples.

10.2 RTP packets containing a fragmented text sample

   In Figure 11 below, a single text sample is split into three text
   sample fragments and each fragment is sent in a different RTP packet.
   The text string is encoded in UTF-8, as the U bit is set to zero.
   Note that the timestamp is identical in these three packets and that
   the M bit of the RTP packet header is set to 1 only in the last RTP
   packet. RTP packets MUST contain a single text sample fragment. The
   RTP packet payload MUST start with a single-byte FHDR, which is
   composed of the R, T, U, S, M and F bits. Following the FHDR is the
   THDR, which in the case of a text sample fragment is composed of only
   two fields, SIDX and SDUR. Note that in this case the SLEN field is
   not needed and thus MUST NOT be present. Note also that the FHDR is
   present in the RTP packet carrying the first text sample fragment.
   This is done for parsing simplicity:

   m=video 49170 RTP/AVP 98 99
   a=control:rtsp://server/example.3gp/text
   a=rtpmap:98 3gpp-tt/1000
   a=rtpmap:99 3gpp-tt/1000
   a=fmtp:98 brand=3gp5;fragment=0;width=176;height=144;\
   layer=1;tx=0;ty=0;spldesc-inband=0;tx3g=,
   a=fmtp:99 brand=3gp5;fragment=1;width=176;height=144;\
   layer=1;spldesc-inband=0;tx3g=,

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|    PT=99    |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             SLEN              |             SIDX              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     SDUR                      |  R  |T|U|S|M|F|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   +                    text sample fragment #1                    +
   +        (text-length and the first part of text string)        +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Here, flags in the FHDR are set to T=1, U=0, S=0, M=0, F=0.

   (a) An RTP packet example containing the first fragment.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|    PT=99    |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|              SCO              |      MBP      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      MBP      |             SIDX              |     SDUR      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             SDUR              |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                    text sample fragment #2                    |
   +            (last part of text string and the first            +
   +             fragment of the modifier box "href")              +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Here, flags in the FHDR are set to T=1, U=0, S=1, M=1, F=0.

   (b) An RTP packet example containing the second fragment.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|    PT=99    |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  R  |T|U|S|M|F|                   TYPE=href                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  (TYPE=href)  |      LEN      |             HRSCO             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             HRECO             |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                    text sample fragment #3                    |
   +          (last fragment of the modifier box "href")           +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Here, flags in the FHDR are set to T=0, U=0, S=0, M=0, F=1.

   (c) An RTP packet example containing the last fragment.
   Figure 11. RTP packet examples containing a text sample fragment.

11. IANA Considerations

   The following REQUIRED parameters are introduced in this document:
   "fragment", "brand", "version", "tx", "ty", "width", "height",
   "layer", "tx3g" and "spldesc-inband". See Section 9 for details.

12. Security considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [7]. This implies that confidentiality of the media
   streams is achieved by encryption. Furthermore, the main security
   issues are confidentiality and authentication of the text itself. The
   payload format itself does not have any support for security. These
   issues have to be solved by a payload-external mechanism, e.g.
   SRTP [9].

13. References

13.1 Normative References

   1  3GPP, "Transparent end-to-end packet switched streaming service
      (PSS); Protocols and codecs (Release 5)", TS 26.234 v 5.3.0,
      December 2002.

   2  ISO/IEC 14496-1:2001/AMD5, "Information technology - Coding of
      audio-visual objects - Part 1: Systems, ISO Base Media File
      Format", 2003.

13.2 Informative References

   3  C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J.C.
      Bolot, A. Vega-Garcia, S. Fosse-Parisis, "RTP Payload for
      Redundant Audio Data", RFC 2198, September 1997.

   4  J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
      Forward Error Correction", RFC 2733, December 1999.

   5  C. Perkins, O. Hodson, "Options for Repair of Streaming Media",
      RFC 2354, June 1998.

   6  W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)",
      August 2001.

   7  H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A
      Transport Protocol for Real-Time Applications", RFC 3550, July
      2003.

   8  M. Handley, V. Jacobson, "SDP: Session Description Protocol", RFC
      2327, April 1998.

   9  M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M.
      Naslund, K.
      Norrman, "The Secure Real-Time Transport Protocol",
      draft-ietf-avt-srtp-05.txt, June 2002.

   10 Handley, et al., "TCP Friendly Rate Control (TFRC): Protocol
      Specification", RFC 3448, January 2003.

   11 S. Bradner, "Key words for use in RFCs to indicate requirement
      levels", BCP 14, RFC 2119, IETF, March 1997.

   12 R. Hovey, S. Bradner, "The Organizations involved in the IETF
      Standards Process", BCP 11, RFC 2028, October 1996.

   13 J. Ott et al., "Extended RTP Profile for RTCP-based Feedback
      (RTP/AVPF)", draft-ietf-avt-rtcp-feedback-07.txt, work in
      progress, June 2003.

   14 J. Rey et al., "RTP Retransmission Payload Format",
      draft-ietf-avt-rtp-retransmission-09.txt, work in progress,
      August 2003.

Annex A Fragmentation cases

   These drawings describe the different fragmentation possibilities.

     +------+------+----------------------------+------+--------+----+
     |  txt-header |        text string         | mb1  |  mb2   |mb3 |
     +------+------+----------------------------+------+--------+----+

     +------+------+--+
   a)|             |  )
     +------+------+--+

     +------+------+----------------------------+
   b)|             |        text string         |
     +------+------+----------------------------+

     +------+------+----------------------------+-+
   c)|             |        text string         | )
     +------+------+----------------------------+-+

     +------+------+----------------------------+------+
   d)|             |        text string         | mb1  |
     +------+------+----------------------------+------+

     +------+------+----------------------------+------+--------+-+
   e)|             |        text string         | mb1  |  mb2   | )
     +------+------+----------------------------+------+--------+-+

      +--------------+
   f) ( text string  )
      +--------------+

      +-------------------+
   g) (    text string    |
      +-------------------+

      +-------------------+--+
   h) (    text string    |  )
      +-------------------+--+

      +------------+------+
   i) ( text string| mb1  |
      +------------+------+

      +-------------+------+--+
   j) ( text string | mb1  |  )
      +-------------+------+--+

      +--+
   k) |  )
      +--+

      +------+
   l) | mb1  |
      +------+

      +------+----+
   m) | mb1  |    )
      +------+----+

      +-+
   n) ( )
      +-+

      +---+
   o) (   |
      +---+

      +---+-+
   p) (   | )
      +---+-+

      +-+--------+
   q) ( |  mb2   |
      +-+--------+

      +-+--------+-+
   r) ( |  mb2   | )
      +-+--------+-+

   The following table reflects which fields (and in some cases, which
   values) are filled in the payload format defined in this document for
   each of the fragmentation possibilities outlined in the drawings
   above.
   +---+-------------------+--------------------+------------------+
   |   |       FHDR        |        THDR        |       FHDR       |
   |   +---+---+---+---+---+------+------+------+-----+-----+------+
   |   | T | F | U | S | M | SLEN | SIDX | SDUR | SCO | MBP | MSTF |
   +---+---+---+---+---+---+------+------+------+-----+-----+------+
   | a | 1 | 0 |0/1| 0 | 0 |  nn  |  N   |  N   | nn  | nn  |  nn  |
   | b | 1 | 0 |0/1| 0 | 0 |  nn  |  N   |  N   | nn  | nn  |  nn  |
   | c | 1 | 0 |0/1| 0 | 0 |  nn  |  N   |  N   | nn  | nn  |  nn  |
   | d | 1 | 0 |0/1| 0 | 0 |  nn  |  N   |  N   | nn  | nn  |  nn  |
   | e | 1 | 0 |0/1| 0 | 0 |  nn  |  N   |  N   | nn  | nn  |  nn  |
   | f | 1 | 0 |0/1| 1 | 0 |  nn  |  N   |  N   |  n  | nn  |  nn  |
   | g | 1 | 0 |0/1| 1 | 0 |  nn  |  N   |  N   |  n  | nn  |  nn  |
   | h | 1 | 0 |0/1| 1 | 1 |  nn  |  N   |  N   |  n  |  N  |  nn  |
   | i | 1 | 0 |0/1| 1 | 1 |  nn  |  N   |  N   |  n  |  N  |  nn  |
   | j | 1 | 0 |0/1| 1 | 1 |  nn  |  N   |  N   |  n  |  N  |  nn  |
   | k | 0 | 0 | 0 | 0 | 0 |  nn  |  nn  |  nn  | nn  | nn  |  nn  |
   | l | 0 | 0 | 0 | 0 | 0 |  nn  |  nn  |  nn  | nn  | nn  |  nn  |
   | m | 0 | 0 | 0 | 0 | 0 |  nn  |  nn  |  nn  | nn  | nn  |  nn  |
   | n | 0 | 1 | 0 | 0 | 0 |  nn  |  nn  |  nn  | nn  | nn  |  N   |
   | o | 0 | 1 | 0 | 0 | 0 |  nn  |  nn  |  nn  | nn  | nn  |  N   |
   | p | 0 | 1 | 0 | 0 | 1 |  nn  |  nn  |  nn  | nn  |  N  |  N   |
   | q | 0 | 1 | 0 | 0 | 1 |  nn  |  nn  |  nn  | nn  |  N  |  N   |
   | r | 0 | 1 | 0 | 0 | 1 |  nn  |  nn  |  nn  | nn  |  N  |  N   |
   +---+---+---+---+---+---+------+------+------+-----+-----+------+

   where,

   - "nn" means field is not needed and thus not present. In this case
     it MUST be set to 0.

   - "0/1" means it can only take the values 0 or 1.

   - "N" means this field is needed and set by the server to the
     appropriate value.

   - "n" means this field is needed but MAY be optionally set to all
     zeros "0000" if the server does not implement the feature of
     finding out the SCO value. See Section 3.

Annex B Basics of 3GP File Structure

   Each 3GP file consists of "Boxes". Boxes start with a header which
   indicates both the size and the type of the box. The 3GP file
   contains the File Type Box (ftyp), the Movie Box (moov), and the
   Media Data Box (mdat).
   The Movie Box and the Media Data Box, serving as containers, include
   their own boxes for each medium. Similarly, each box type may include
   a number of boxes; see the ISO Base Media File Format [2] for a
   complete list of possibilities. In the following, only those boxes
   are mentioned which are useful for the purposes of this payload
   format.

   The File Type Box identifies the type and properties of a 3GP file.
   The File Type Box contents comprise the major brand, the minor
   version and the compatible brands. These are communicated via
   out-of-band means, such as SDP, when streamed with RTP. For the 3GPP
   timed text file format, the set of compatible brands MUST include
   "3gp5".

   The Movie Box contains one or more Track Boxes (trak) which include
   information about each track. A Track Box contains the Track Header
   Box (tkhd) and the Media Information Box (minf). The latter includes
   the Sample Table Box (stbl), which itself contains the Sample
   Description Box (stsd), the Decoding Time to Sample Box (stts), the
   Sample Size Box (stsz) and the Sample to Chunk Box (stsc). Sample
   descriptions for each text sample are encoded as "tx3g" sample
   entries in the Sample Description Box (stsd).

   The Track Header Box specifies the characteristics of a single track,
   where a track is, in this case, the streamed text during a session.
   Exactly one Track Header Box is needed for a track. It contains
   information about the track, such as the spatial layout (width and
   height), the video transformation matrix and the layer number. Since
   these pieces of information are essential and static, i.e. constant
   for the duration of the session, they MUST be sent prior to the
   transmission of any text samples. See the ISO Base Media File Format
   [2] for details about the definition of the conveyed information.
   When using scene description in SMIL [6], it is possible to specify
   the layer and the position of the text track.
   However, in this case, the transmission of the Track Header Box
   (tkhd) is still RECOMMENDED, as the intrinsic track information is
   specified there. Otherwise, the Track Header Box information MUST be
   sent prior to the start of the text streaming.

   The Sample Table Box (stbl) contains all the time and data indexing
   of the media samples in a track. Using the tables here, it is
   possible to locate samples in time, determine their type, and
   determine their size, container, and offset into that container. From
   the Sample Table Box (stbl) the following information is carried in
   each RTP packet using this payload format: the Sample Description Box
   (stsd), the Decoding Time to Sample Box (stts), the Sample Size Box
   (stsz) and the Sample to Chunk Box (stsc). The Decoding Time to
   Sample Box (stts) is mapped to the field SDUR (Text Sample Duration);
   the Sample Size Box (stsz) is mapped to the field SLEN (Text Sample
   Length); and the Sample to Chunk Box is mapped to the field SIDX
   (Text Sample Entry Index). The Sample to Chunk Box (stsc) associates
   a text sample with its corresponding sample description entry in the
   Sample Description Box (stsd, see below). Since the sample
   description may vary during the session, the association SIDX must be
   sent together with the text samples using this payload format.

   The Sample Description Box (stsd) provides information on the basic
   characteristics of text samples. Each entry is a sample entry box of
   type "tx3g". An example of the information contained in a sample
   entry could be the font size or the background colour. Since these
   pieces of information are commonly used by many text samples during
   the session, they are sent by out-of-band means. A complete list of
   text characteristics can be found in [1].
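   The mapping of stts, stsz and stsc onto SDUR, SLEN and SIDX described
   above can be illustrated with a small sketch of a text header (THDR)
   encoder and decoder. It is only a sketch under stated assumptions:
   the field widths (16, 16 and 24 bits) are those given in the Change
   Log, the field order SLEN, SIDX, SDUR follows the packet example in
   Section 10.1, network byte order is assumed, and the function names
   pack_thdr and parse_thdr are hypothetical. Fragmented samples, where
   SLEN is absent, are not handled.

```python
import struct
from typing import Tuple

THDR_LENGTH = 7  # 2 bytes SLEN + 2 bytes SIDX + 3 bytes SDUR


def pack_thdr(slen: int, sidx: int, sdur: int) -> bytes:
    """Serialise SLEN (16 bits), SIDX (16 bits) and SDUR (24 bits)."""
    if not 0 <= sdur < 1 << 24:
        raise ValueError("SDUR must fit in 24 bits")
    # struct.pack raises struct.error if SLEN or SIDX exceed 16 bits.
    return struct.pack("!HH", slen, sidx) + sdur.to_bytes(3, "big")


def parse_thdr(payload: bytes, offset: int = 0) -> Tuple[int, int, int, int]:
    """Return (SLEN, SIDX, SDUR, offset of the byte after the header)."""
    if len(payload) - offset < THDR_LENGTH:
        raise ValueError("payload too short for a text header")
    slen, sidx = struct.unpack_from("!HH", payload, offset)
    sdur = int.from_bytes(payload[offset + 4:offset + 7], "big")
    return slen, sidx, sdur, offset + THDR_LENGTH


# Usage: a header announcing a 5-byte sample that uses sample
# description entry 1 and has a duration value of 1000.
header = pack_thdr(5, 1, 1000)
slen, sidx, sdur, next_offset = parse_thdr(header + b"hello")
```

   Returning the offset of the first byte after the header lets a
   caller continue parsing at the text sample that follows.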
   Finally, the Media Data Box contains the media data itself. In 3GPP
   timed text tracks this box contains text samples. Their equivalents
   in audio and video are audio and video frames, respectively. The text
   sample consists of the text length, the text string, and one or
   several Modifier Boxes. The text length is the size of the text in
   bytes. The text string is the plain text to render. A Modifier Box
   carries additional rendering information for the text, such as
   colour, font, etc.

   In general, text samples do not exceed the maximum transfer unit
   (MTU) of a particular network, but in some cases, as explained later
   on in this document, text samples may become large and might need to
   be fragmented. This document defines a method to convey both
   fragmented and non-fragmented text samples in an error resilient way.

Author's Addresses

   Jose Rey
   rey@panasonic.de
   Panasonic European Laboratories GmbH
   Monzastr. 4c
   D-63225 Langen, Germany
   Phone: +49-6103-766-134
   Fax:   +49-6103-766-166

   Yoshinori Matsui
   matsui.yoshinori@jp.panasonic.com
   Matsushita Electric Industrial Co., LTD.
   1006 Kadoma
   Kadoma-shi, Osaka, Japan
   Phone: +81 6 6900 9689
   Fax:   +81 6 6900 9699

   Daiji Ido
   ido.daiji@jp.panasonic.com
   Panasonic Mobile Communications Co., Ltd.
   5-3, Hikarinooka, Yokosuka-shi, Kanagawa, 239-0847, Japan
   Phone: +81 46 840 5416
   Fax:   +81 46 840 5183

   Youji Notoya
   notoya.youji@jp.panasonic.com
   Matsushita Electric Industrial Co., LTD.
   1006 Kadoma
   Kadoma-shi, Osaka, Japan
   Phone: +81 6 6900 9689
   Fax:   +81 6 6900 9699

IPR Notices

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.
   Information on the IETF's procedures with respect to rights in
   standards-track and standards-related documentation can be found in
   BCP 11 [12]. Copies of claims of rights made available for
   publication and any assurances of licenses to be made available, or
   the result of an attempt made to obtain a general license or
   permission for the use of such proprietary rights by implementers or
   users of this specification can be obtained from the IETF
   Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   "Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."