Internet Draft draft-rey-avt-3gpp-timed-text-02.txt J. Rey Y. Matsui Matsushita Expires: August 2004 February 2004 RTP Payload Format for 3GPP Timed Text Status of this document This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2003). All Rights Reserved. Abstract This document specifies an RTP payload format for the transmission of 3GPP (3rd Generation Partnership Project) timed text. 3GPP timed text is time-lined decorated text media format whose defined storage format is the ISO (International Standardisation Organisation) Base Media File Format. As of today, 3GP files containing timed text contents can be downloaded via HTTP and be synchronised with audio/video contents. There is however no available mechanism for streaming 3GPP timed text contents. In the following sections the problems of streaming timed text are addressed and a payload format for streaming 3GPP timed text over RTP is specified. IETF draft - Expires August 2004 [Page 1] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 Table of Contents 1. Change Log......................................................2 2. Introduction....................................................3 3. Terminology.....................................................5 4. RTP Payload Format for 3GPP Timed Text..........................5 5. Resilient Transport............................................13 6. Congestion control.............................................13 7. SMIL usage.....................................................14 8. MIME Type usage Registration...................................14 9. SDP usage......................................................16 10. Examples of RTP packet structure..............................17 11. IANA Considerations...........................................17 12. Security considerations.......................................17 13. References....................................................18 14. Annex A Basics of 3GP File Structure..........................19 15. Author's Addresses............................................20 16. IPR Notices...................................................21 17. Full Copyright Statement......................................21 1. Change Log 1.1 Changes from draft-rey-rtp-avt-3gpp-tt-00.txt Major changes: - completed empty sections from -00 draft. - abstract and introduction re-arranged. Moved section "Basics of the 3GP File Structure" to end of the document as Annex B. - SLEN, SIDX and SDUR lengths fixed to 16, 16 and 24 bits, respectively. - New OPTIONAL header, SPLDESC, added to transport sample description in-band. - Section 4 on payload format expanded: text header, fragment header and sample description header are fully specified. - SMIL usage section added. 1.2 Changes from draft-rey-rtp-avt-3gpp-tt-01.txt Major changes: - Terminology, some terms introduced to clarify text. - Section 4 - rules and recommendations on fragmentation are given. - payload headers were calssified into five types, with a common field section and specific fields for each type. Rey & Matsui [Page 2] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 - header structure similar to RFC 3640 for easy transformation. 2. Introduction The purpose of this draft is to provide a means to stream the 3GPP timed text using RTP. 3GPP timed text is a 3GP file format for time-lined decorated text specified in [1]. The 3GP file format itself follows the ISO Base Media File Format recommendation [2]. Besides plain text, the 3GPP timed text format allows the display of decorated text (e.g. blinking text, scrolling, hyperlinks) synchronised or not with other media. The 3GPP timed text format was developed for 3GPP Transparent End-to- end Packet-switched Streaming Services (PSS) [1]. The scope of the 3GPP PSS includes downloading and streaming of multimedia content over 3G packet-switched networks. The PSS adopts multimedia codecs (such as MPEG-4 Visual, AMR, MPEG-4 AAC, and JPEG) and protocols like SMIL [9] for presentation layouts or RTP for streaming. The current usage of the 3GPP timed text file format is limited to downloading via HTTP (with or without audio contents) due to the lack of an appropriate RTP payload format. In general, a multimedia presentation might consist of several audio/video/text streams or tracks (in 3GP file format jargon). Different tracks may have different contents and tracks of different media may be spatially synchronised using the information within the tracks or a scene description language like SMIL. An example of this would be a media session with three different media tracks: 1 audio, 1 video and 1 timed text that reproduces a music video with karaoke subtitles. The information contained in each track defines the regions where each media is displayed, how the media looks like and how it is synchronised, e.g., the song lyrics is displayed below the video and the words are highlighted and synchronised with the soundtrack. Basically the 3GPP timed text format can be summarised as consisting of four differentiated functional components: - initial setup information for text tracks: these are the height and width of the text region where the text track (contents) are displayed, the translation offsets tx and ty relative to the video track region and the layer or proximity of the text to the user. In the 3GPP timed text format, these pieces of information are extracted from Track Header Box, "tkhd". - general formatting information about the text track: default font, default background colour, default horizontal and vertical justification, default line width, default scrolling, etcetera. In the 3GPP timed text format, these pieces of information are extracted from the Sample Description Box, "stsd". - the actual text, conveyed as plain text using either UTF-8 or UTF- 16 encoding and, Rey & Matsui [Page 3] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 - the "decoration": whether it is highlighted text, blinking text, karaoke, hypertext, scroll delay, other text styles/formatting than the defaults, etcetera. In the 3GPP timed text format, these pieces of information are extracted from the various Modifier Boxes: "hlit", "hclr", "blnk", "krok", "href", "dlay", "styl" or "tbox". For details refer to Annex A that summarises the basics of the 3GP file format and to [1], where a more detailed description of the setup information, format parameters and modifiers is contained. 2.1 Requirements In this section a set of requirements is listed. A justification for each of them is also given. An RTP Payload Format for 3GPP timed text SHALL: 1. Keep the 3GP text sample structure. A text sample consists of the text length, the text string (either UTF-8 or UTF-16 encoded), and one or several Modifier Boxes containing the text "decoration", as defined in [1]. This is important to foster interoperability of 3GP file and RTP payload formats. 2. Transmit the text sample size, sample duration and sample description index in-band. In RTP it is important to transmit it in- band because this information might change from sample to sample. This is also important for buffering purposes as described in Section 4.1. 3. Enable the transmission of the formatting information (contained in the Sample Description Box, "stsd") by out-of-band and in-band means. In general, a single sample description may be used by different text samples. Therefore, to save overhead it is sensible to transmit a default formatting once at the initialisation phase and update this on demand. On the other hand, these pieces of information may become large so that out-of-band transmission might not be the most appropriate method. Also, out-of-band channels might not be always available. For these reasons, the payload format SHALL enable also the in-band transmission of sample description information. This is especially useful for live streaming (where the contents are not known a priori) or to protect this information through other mechanisms like FEC [4] or retransmission [13]. RFC 2354 [8] discusses available mechanisms for packet loss resiliency. 4. Enable the aggregation of text samples in one single RTP packet. In a mobile communication environment a typical text sample size is around 100-200 bytes. Thus, transporting several text samples in one RTP packet makes the transport over RTP more efficient. 5. Enable the fragmentation and reassembly of a text sample into several RTP packets in order to cover a wide range of applications and network environments. In general, fragmentation is Rey & Matsui [Page 4] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 a rare event given the low bit rates and text sample sizes. However, the 3GPP Timed Text media format allows for larger text samples and so SHALL the payload format cover this possibility. 6. Enable the use of resilient transport mechanisms, such as repetition, retransmissions and FEC. 3. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [5]. Furthermore, the following definitions are used in this document: - text sample or whole text sample: this refers to a unit of text data as contained in the source 3GP file. Its equivalent in audio/video would be an audio/video frame (see Section 14 for details). A text sample contains a text length indication, a text string, and zero or several modifier boxes (the decoration of the text). - fragment or text sample fragment: a fraction of a text sample as defined above. A fragment may either text strings and modifiers or just one of these. - sample contents: general term to identify timed text data transported when using this payload format. In addition to text samples and fragments, it may also be used to refer to the sample descriptions associated to the text samples. 4. RTP Payload Format for 3GPP Timed Text The format of an RTP packet containing 3GPP timed text is shown below: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |V=2|P|X| CC |M| PT | sequence number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | synchronization source (SSRC) identifier | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + RTP payload | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Marker bit (M): the marker bit must be set to 1 if the RTP packet includes the last fragment of a text sample; otherwise set to 0. For Rey & Matsui [Page 5] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 RTP packets containing several text samples, being at least one of them non-fragmented, the marker bit MUST be set to 1. Timestamp: the timestamp indicates the sampling instant of the timed text sample contained in the RTP packet. The initial value is randomly determined. If the RTP packet includes more than one text sample (i.e. text sample aggregation), the timestamp indicates the sampling instant of the oldest text sample in the RTP packet. The samples MUST be placed in playout order, whereas the oldest sample is placed first in the payload. The timestamp of the subsequent samples is obtained by adding the timed text sample duration (if present) to the timestamp value. For example, let sdur(0), sdur(1) and sdur(2) be the durations of three subsequent timed text samples included in an RTP packet. Let rtpts be the timestamp as present in the RTP header. The timestamp ts(i) for each sample (i=0,1,2) would be: ts(i)=rtpts + sum[sdur (i-1)]; ts(0)=rtpts, ts(1)=rtpts + sdur(0) ts(2)=rtpts + (sdur(0)+ sdur(1)) Some text samples may become large and have to be fragmented and so spread over several RTP packets. In this case, the receiver needs to associate fragments of the same text sample. This is done using the timestamp. The order of the fragments is resolved using the fields available in the following headers. The value timestamp clockrate is copied directly from the 3GP file: the value of "timescale" in the Media Header Box is to be used. Payload Type (PT): the payload type is set dynamically and sent by out-of-band means. The usage of the remaining RTP header fields follows the rules of RTP [3] and the profile in use. 4.1 Fragmentation of Timed Text Samples This section justifies why text samples MAY be fragmented and discusses some of the possible approaches to do it. A solution is proposed together with rules and recommendations for fragmenting and transporting text samples using this payload format. 3GPP Timed Text applications are expected to operate at low bit rates. This fact added to the small size of timed text samples (typically one or two hundred bytes) makes fragmentation of text samples a rare event. Samples should usually fit into the MTU size of the used network path. Nevertheless, the text string (e.g. ending roll in a movie) and some modifier boxes, i.e. for hyperlinks ("href"), for karaoke ("krok") or Rey & Matsui [Page 6] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 for fonts ("styl") might become large and need fragmentation. This may also apply for future modifier boxes. While the text string is recommended as per [1] to take a maximum of 2048 bytes for maximum client interoperability, there is no recommendation on the amount of space occupied by modifier boxes. In order to transport these larger text samples using RTP, it could be argued that a careful encoding be used to transform the original large sample into smaller self-contained text samples that fit into the transport MTU. This would comply with the ALF principle, as per RFC 2367 [14]. It would also need additional pre-processing previous to RTP encapsulation. Given the low probability of fragmentation, it is believed that the overhead of this pre-processing (careful encoding) is not worth. It appears more appropriate to encode text samples without taking the path MTU into account. In this manner, this payload format meets a trade-off by intentionally leaving out this pre-processing and making some text samples more sensitive to packet losses. However, a minimum set of fragmentation rules and recommendations SHALL be observed to guarantee a minimum resiliency and guide in the task of fragmentation. Text samples and fragments thereof are aggregated in the RTP payload according to the rules and recommendations specified as follows: - it is RECOMMENDED that text samples are fragmented as seldom as possible. E.g. if a previous packet has some free space and a new text sample fits in one MTU, a new RTP packet SHOULD be sent, instead of sending two or more fragments out of it. In order to fill up the remaining bits, piggybacking of sample descriptions MAY be performed. - text strings MUST split at character boundaries. Otherwise, it is not possible to display the text of a fragment if the previous is lost. - it is RECOMMENDED to include as fewer text sample fragments as possible in an RTP packet. This reduces the effects of packet loss. RTP packets using this payload format MAY include zero or more whole text samples, zero or more text sample fragments and zero or more sample descriptions. - sample descriptions SHALL NOT be fragmented, since they contain important information that may affect several text samples. - for enhanced resiliency against packet loss it is RECOMMENDED that fragments containing decoration are especially protected using FEC [4], retransmission [13], packet repetition or a similar technique. - when fragmenting text samples, the start of the decoration (modifiers) MUST be indicated. Otherwise, if packets are lost, a client may be unable to identify where the modifiers start and the text ends. Rey & Matsui [Page 7] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 Usually, RTP applications use the information on packet size from UDP or lower layers to find out the length of the RTP payload. This means that if several text samples (or fragments) are contained in the payload a length indication MUST be present for all fragments, but the last one. Similarly, those transported as unique payload do not need of a length indication. However, some transport schemes for RTP, e.g. RFC 3640 [15], require that the length of each fragment is indicated. This payload format does not mandate to comply with such requirement, but OPTIONALLY allows to do so. Implementations subject to such requirement MUST include an explicit length indication,i.e. the LEN field, by setting the L bit in all cases. Note also that the order of the text samples and fragments in the RTP payload is important. As described above in the definition of the RTP timestamp usage, these MUST be placed chronologically in the RTP payload, so that the SDUR field allows calculating the timestamp of the samples following. At the same time, other samples MAY follow if the current and following samples contain each a length indication. Otherwise, the sample is either placed at the end or as unique RTP payload. Fragments carrying modifier box contents and sample descriptions MAY be placed in any order (no timing requirements) and MAY be present as often as needed. Modifier box fragments SHOULD be placed as close as possible to the text strings, which they belong to. 4.2 Payload Header Definitions An RTP packet using the payload headers defined in this document has the following format: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |V=2|P|X| CC |M| PT | sequence number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | timestamp | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | synchronization source (SSRC) identifier | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Res.|U|L|TYPE | | +-+-+-+-+-+-+-+-+ | : (variable payload header depending on TYPE value) : : : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : SAMPLE CONTENTS = Text Sample(s), Fragment(s), Sample : : Description(s) : : : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Rey & Matsui [Page 8] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 The payload headers specified in this document consist of a set of common fields followed by specific fields for each header type. The structure of the payload headers resembles that of the 'access units' in RFC 3640. This similarity is intentional, in order to ease the transport using MPEG4 elementary streams. In this manner, the 'AU header' of that document finds an equivalent in the common header fields for all TYPE values: R, U, L, TYPE and LEN. The specific fields plus the sample contents would be similar to the 'AU data section'. The payload header the following format: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN (if L=1) | specific | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | fields (variable)...| +-+-+-+-+-+-+-+-+-+-+-+-+ where, - R (3 bits) "Reserved bits": this field MUST be set to zero. - U (1 bit) "UTF Transformation": indicates whether the text characters are encoded using UTF-8 (U=0) or UTF-16 (U=1). It MUST also be present in text sample fragments if they contain text strings. If the payload does not carry text strings, this bit has no meaning and SHALL be disregarded. - L (1 bit) "Length Field" flag: if set indicates the presence of the OPTIONAL length field, LEN. By default this flag is reset. Where several 'access units' are included in an RTP packet, the length MUST be used to parse the payload. In this case, it SHALL be present in all 'access units' but the last one. In some cases, as in RFC 3640 [15], the length field is always needed, and so MUST L flag be always set. - LEN (16 bits) "Length Field": indicates the size (in bytes) of this field and all the bits following, i.e. specific payload header fields for each TYPE (excluding initial byte) plus the sample contents. If present, LEN has the following values: - TYPE = 1, LEN >= 6, - TYPE = 2, LEN > 9, - TYPE = 3, LEN > 3, - TYPE = 4, LEN > 3, - TYPE = 5, LEN > 3. For whole text samples (TYPE=1), the sample contents length corresponds to the entry value in the Sample Size Box, "stsz" (see also SLEN below). Otherwise the sample contents length MUST be Rey & Matsui [Page 9] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 calculated when fragmenting the sample, in order to obtain the correct LEN value. Empty text samples do not have sample contents. For this case, TYPE 1 with a LEN of 6 (0x0006) MUST be used. The following payload header compositions (including the specific fields) result for the different values of TYPE: - TYPE = 0,6 and 7 are reserved. - TYPE = 1, 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN (if L=1, always >=6)| SIDX | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | SDUR | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This header type is used to transport whole text samples. If several text samples are sent in an RTP packet, every sample has its own header. The LEN field MUST be always equal (for empty text samples) or greater than 6 (0x0006). The fields above have the following meaning: - SIDX (8 bits) "Text Sample Entry Index": indicates the reference index for the text sample, which corresponds to the index field in the Sample to Chunk Box, "stsc", for the sample. This field is used to map the corresponding sample description information. The SIDX is unequivocally linked to one particular sample description. Therefore, sample descriptions SHALL not be modified during a streaming session. A maximum of 126 SIDX values is allowed per text stream. To allow SIDX wrap-up, clients SHALL keep as valid only those values of SIDX outside of the interval (X+128) modulo 255, where X is the last SIDX value received. The SIDX values 0 (0x00) and 255 (0xff) are reserved for possible future extensions. - SDUR (24 bits) "Text Sample Duration": indicates the sample duration in timestamp units of the text sample, which corresponds to the entry value in the Decoding Time to Sample Box, "stts", for that sample. This field allows by a clockrate of 1000 Hz a maximum duration of approximately 279 hours (16 bits is would allow for just 65 seconds, which might be too short for some streams). It is assumed that all text samples have a known duration at the time of transmission. In some cases however, e.g. live streaming, the SDUR value might not be known. To cover this exception, the value zero (0x000000) is reserved to signal unknown duration. For all other cases SDUR MUST be different from 0x000000. Rey & Matsui [Page 10] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 The ordering of 'access units' in the RTP payload is important. Logically, samples of unknown duration SHALL NOT precede any text samples or text sample fragments in the RTP payload. Otherwise it would not be possible to find out the timestamp of these. However, samples of unknown duration MAY precede sample descriptions, as they have no duration. In general, text sample contents expire when the next sample becomes valid. As an exception, samples of unknown duration (SDUR=0x000000) are valid until new packets arrive. Note also, that samples of unknown duration SHALL NOT use features, such as scrolling or karaoke, which would need to know the duration of the sample up-front. - TYPE = 2, 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN(if L=1, always >9) | SIDX | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | SLEN | SDUR | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | SDUR (cont.) |TOTAL | THIS | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This header type is used to transport text sample fragments containing text strings. The LEN field (16 bits) has the same meaning as above. If present, the length of LEN MUST be greater than a value of nine (0x0009). The SLEN field (16 bits) indicates the size (in bytes) of the (whole) text sample to which this fragment belongs. As seen above, the text sample length corresponds to the entry value in the Sample Size Box, "stsz". Clients MAY use SLEN to buffer space for the remaining fragments of the text sample. The fields TOTAL (4 bits) and THIS (4 bits) indicate the total number of fragments in which the original text sample has been fragmented and which order occupies the current fragment in that sequence, respectively. The usual 'byte offset' field is not used here for two reasons: a) it would take one more byte and b) it does not provide any useful information on the character offset. UTF-8/16 text strings have, in general, a variable character length ranging from 1 to 6 bytes. Therefore, the TOTAL/THIS solution is preferred. The R, U, L, TYPE, SIDX, and SDUR fields have identical interpretation as above. The U, SIDX and SDUR fields are useful since partial text strings MAY also be displayed with the corresponding decoration. Rey & Matsui [Page 11] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 - TYPE = 3, 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN(if L=1, always >3) |TOTAL | THIS | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This header type is used to transport either whole modifier boxes or just the first fragment of these. This depends on whether the modifier boxes fit into one RTP packet. In case fragmentation is needed, this header type identifies first fragment. As explained above, the rules for fragmentation require that the start of the modifier boxes be signaled. The R, U, L, TOTAL/THIS and LEN fields are used as above. If present, the LEN field MUST be greater than three (0x0003). Note that the SLEN, SIDX and SDUR fields are not present. This is because: a) these fragments do not contain text strings and b) these types of fragments are applied over text string fragments, which already contain this information. - TYPE = 4, 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN(if L=1, always >3) |TOTAL | THIS | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This header type is used to transport modifier fragments, other than the first one. The R, U, L, TOTAL/THIS and LEN fields are used as above. If present, the LEN field MUST be greater than three (0x0003). Note that the SLEN, SIDX and SDUR fields are not present. This is because: a) these fragments do not contain text strings and b) these types of fragments are applied over text string fragments, which already contain this information. - TYPE = 5 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | R |U|L|TYPE | LEN(if L=1, always >3) | SIDX | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Rey & Matsui [Page 12] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 This header type is used to transport sample descriptions. The L flag MUST be always set. Consequently, the LEN field MUST be greater than three (0x0003). Every sample description MUST have its own TYPE 5 header. 5. Resilient Transport Apart from the basic fragmentation measures described in the section above, the simplest option for packet loss resilient transport is to send the same text samples (or fragments) again, i.e. repetition. A server MAY decide to send the same contents again as a measure for error resilience. Repetition of text samples (or fragments) is only allowed if exactly the same RTP payloads are sent, so that a receiver can use original and repeated fragments together to reconstruct the text samples. If a text sample was originally sent as non-fragmented text sample, a repetition of that sample MUST be sent also as a single non- fragmented text sample. Likewise, if the original text sample was fragmented and spread over several RTP packets, repeated fragments SHALL also be observe the same byte boundaries and use the same headers and bytes per fragment. Finally, if several text samples resolve to the same timestamp, the receiver SHOULD use the one received in the RTP packet with the highest sequence number and discard the rest. In repeated text samples (or fragments), all RTP header fields MUST keep their original values except the sequence number that MUST be increased to comply with RTP. 6. Congestion control The RTP profile under which this payload format is used defines an appropriate congestion control mechanism in different environments. Following the rules under the profile, an RTP application can determine its acceptable bitrate and packet rate in order to be fair to other TCP or RTP flows. If an RTP application using this payload format uses retransmission, the acceptable packet rate and bitrate includes both the original and retransmitted data. This guarantees that an application using retransmission achieves the same fairness as one that does not. Such a rule may translate in practice into the following actions: If enhanced service is used, it should be made sure that the total bitrate and packet rate do not exceed that of the requested service. It should be further monitored that the requested services are actually delivered. In a best-effort environment, the sender SHOULD NOT send retransmission packets without ensuring first that enough bandwidth for retransmission is available. Other solutions like Rey & Matsui [Page 13] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 reducing the packet rate and bitrate of the original stream (for example by encoding the data at a lower rate) MAY be used. Similar considerations apply, if an RTP application using this payload format implements forward error correction, FEC [4]. Hereby, the sender should take care that the amount of FEC does not actually worsen the problem. Therefore, it is RECOMMENDED that applications implementing this payload format also implement congestion control. The actual mechanism for congestion control is out of the scope of this document but should be suitable for real-time flows. As an example, RFC 3448 [11] specifies an equation-based congestion control that fulfils this requirement. 7. SMIL usage The SMIL recommendation [9] specifies a means for synchronising different media streams. This payload format defines the spatial layout parameters for a timed text stream. These specify the location of the text display area relative to the top left corner of the video display area, when a text stream is played with a single video stream without SMIL. In cases where several media streams shall be synchronized, SMIL MAY be used to specify the spatial layout parameters. It shall be noted that even if SMIL scene description is used the track header information pieces SHOULD be sent anyway as they represent the intrinsic media properties. 8. MIME Type usage Registration 8.1 3GPP Timed Text MIME Registration MIME type: video MIME subtype: 3gpp-tt Required parameters: rate: the RTP timestamp clockrate is equal to the clockrate of the media. The value timestamp clockrate is copied directly from the 3GP file: the value of "timescale" in the Media Header Box is to be used. brand=, where identifies a Release specification of 3GPP Timed Text being transmitted over RTP. A brand indicates the "best use" of the contents: the brand value "3gp5" indicates Release 5 of 3GPP Technical Specification (TS) 26.245 [1]. Rey & Matsui [Page 14] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 spldesc=, where may take three different values: - spldesc=in, for in-band transmission of sample descriptions - spldesc=out, for out-of-band and, - spldesc=both, when both methods are allowed. tx3g=, ,... where represents a list of sample description entries using base64 encoding. This parameter MAY be used to convey sample descriptions out-of-band. The list of sample entries is not required to follow any particular order. Each value represents the concatenation of the SIDX and sample descriptions contents for that SIDX. The LEN field is not needed. width= indicates the width in pixels of the text track or area where the text is actually displayed. height= indicates the height in pixels of the text track. tx=, indicates the horizontal translation offset in pixels of the text track with respect to the origin of the video track. ty=, indicates the vertical translation offset in pixels of the text track. layer=, indicates the proximity of the text track to the viewer. Higher values means closer to the viewer. This parameter has no units. Optional parameters: mver=, "Minor version" where is a positive integer. It identifies the oldest compatible version. How the version is defined can be found in TS 26.245 "3GPP Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)". cbrand=, "List of Compatible Brands" where value1 is a brand. This list MUST at least contain the as in "brand". Encoding considerations: this type is only defined for transfer via RTP. Security considerations: please refer to Section 12 of RFCXXXX. Interoperability considerations: the 3GPP Timed Text media format is specified in 3GPP Release 5 version of TS 26.245 "Transparent end-to- Rey & Matsui [Page 15] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 end packet switched streaming service (PSS); Timed Text Format (Release 6)". In later releases, 3GPP may specify extensions or updates to the media format in a backwards-compatible way, e.g. new modifier boxes or extensions to the sample descriptions. The payload format RFCXXXX allows for such extensions. For future 3GPP Releases of the Timed Text Format, the therein parameters "brand", "mver" and "cbrand" are used to identify the current version of the media format, the oldest compatible version and a list of compatible versions. Published specification: RFC XXXX Applications which use this media type: multimedia streaming applications. Additional information: the 3GPP Timed Text media format is specified in 3GPP TS 26.245 "Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)". This document and future extensions to the 3GPP Timed Text format are publicly available at http://www.3gpp.org. Person & email address to contact for further information: rey@panasonic.de matsui.yoshinori@jp.panasonic.com Intended usage: COMMON Author/Change controller: Jose Rey Yoshinori Matsui IETF AVT WG 9. SDP usage This document defines the MIME subtype name "3gpp-tt" and introduces several REQUIRED payload-format-specific parameters: "brand", "width", "height", "tx", "ty", "layer", "spldesc" and "tx3g" and two OPTIONAL parameters "mver" and "cbrand". 9.1 Mapping to SDP The information carried in the MIME media type specification has a specific mapping to fields in SDP [4], which is commonly used to describe RTP sessions. When SDP is used to specify transmission using this payload format, the mapping is done as follows: - The MIME type ("video") goes in the SDP "m=" as the media name. The "video" MIME Type is used as timed text is considered visual media. - The MIME subtype ("3gpp-tt") goes in SDP "a=rtpmap" as the encoding name. The value timestamp clockrate is copied directly from Rey & Matsui [Page 16] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 the 3GP file, the value of "timescale" in the Media Header Box is to be used. Other values MAY be specified by out-of-band means. - The REQUIRED payload-format-specific parameters "brand", "width", "height", "tx", "ty", "layer", "tx3g" and "spldesc" go in the SDP "a=fmtp" as a semicolon separated list of parameter= (or parameter= for "tx3g") pairs. - The OPTIONAL payload-format-specific parameters "mver", "cbrand" go in the SDP "a=fmtp" as a semicolon-separated list of parameter= pairs. - Any remaining parameters go in the SDP "a=fmtp" attribute by copying them directly from the MIME media type string as a semicolon separated list of parameter=value pairs. In the following sections some example SDP descriptions are presented. 10. Examples of RTP packet structure In this section, some examples of RTP packet structure are explained for better understanding of this payload format. The wrap-around of the long lines is indicated by the backslash character "\". The examples assume aggregate control of stream container files. The session descriptions are not complete but limited to the example purposes. 10.1 An RTP packet containing multiple text samples 11. IANA Considerations This document introduces the MIME subtype name "3gpp-tt" in Section 8. 12. Security considerations RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [3]. This implies that confidentiality of the media streams is achieved by encryption. Furthermore, the main security issues are confidentiality and authentication of the text itself. The payload format itself does not have any support for security. These issues have to be solved by a payload external mechanism, e.g. SRTP [10]. Rey & Matsui [Page 17] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 13. References 13.1 Normative References 1 3GPP, "Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)", TS 26.245 v 0.1.6, Working Draft, July 2003. 2 ISO/IEC 14496-1:2001/AMD5, "Information technology û Coding of audio-visual objects û Part 1: Systems, ISO Base Media File Format", 2003. 3 H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003. 4 M. Handley, V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998. 5 S. Bradner, "Key words for use in RFCs to indicate requirement levels," BCP 14, RFC 2119, IETF, March 1997. 13.2 Informative References 6 C. Perkins, I. Kouvelas, O. Hodson, V. Hardman, M. Handley, J.C. Bolot, A. Vega-Garcia, S. Fosse-Parisis, "RTP Payload for Redundant Audio Data", September 1997. 7 J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, December 1999. 8 C. Perkins, O. Hodson, "Options for Repair of Streaming Media", RFC 2354, June 1998. 9 W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)", August, 2001. 10 M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M. Naslund, K. Norrman, "The Secure Real-Time Transport Protocol", draft-ietf-avt-srtp-05.txt, June 2002. 11 Handley, et al., "TCP Friendly Rate Control (TFRC): Protocol Specification ", RFC 3448, January 2003. 12 R. Hovey, S. Bradner, "The Organizations involved in the IETF Standards Process", BCP 11, RFC 2028, October 1996. 13 J. Rey et al., "RTP Retransmission Payload Format", draft-ietf- avt-rtp-retransmission-10.txt, work in progress, January 2004. 14 M. Handley, C. Perkins, "Guidelines for Writers of RTP Payload Format Specifications", RFC 2367, December 1999. Rey & Matsui [Page 18] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 15 Van der Meer et al., "RTP Payload Format for Transport of MPEG-4 Elementary Streams ", RFC3640, November 2003. 14. Annex A Basics of 3GP File Structure Each 3GP file consists of "Boxes". Boxes start with a header which indicates both size and type contained. The 3GP file contains the File Type Box (ftyp), the Movie Box (moov), and the Media Data Box (mdat). The Movie Box and the Media Data Box, serving as containers, include own boxes for each media. Similarly, each box type may include a number of boxes, see ISO Base Media file Format [2] for a complete list of possibilities. In the following, only those boxes are mentioned, which are useful for the purposes of this payload format. The File Type Box identifies the type and properties of a 3GP file. The File Type Box contents comprise the major brand, the minor version and the compatible brands. These are communicated via out- of-band means, such as SDP, when streamed with RTP. For the 3GPP timed text file format, the set of compatible-brands MUST include "3gp5". The Movie Box contains one or more Track Boxes (trak) which include information about each track. A Track Box contains, among others, the Track Header Box (tkhd), the Media Header Box (mdhd) and the Media Information Box (minf). The Media Header Box contains the timescale or number of time units that pass in one second. The Media Information Box includes the Sample Table Box (stbl) which itself contains the Sample Description Box (stsd), the Decoding Time to Sample Box (stts), the Sample Size Box (stsz) and the Sample to Chunk Box (stsc). Sample descriptions for each text sample are encoded as "tx3g" sample entries in the Sample Description Box (stsd). The Track Header Box specifies the characteristics of a single track, where a track is, in this case, the streamed text during a session. Exactly one Track Header Box is needed for a track. It contains information about the track, such as the spatial layout (width and height), the video transformation matrix and the layer number. Since these pieces of information are essential and static, i.e. constant for the duration of the session, they MUST be sent prior to the transmission of any text samples. See the ISO base media file format [2] for details about the definition of the conveyed information. When using scene description in SMIL [9], it is possible to specify the layer and the position of the text track. However, in this case, the transmission of the Track Header Box (tkhd) is still RECOMMENDED, as the intrinsic track information is specified there. Otherwise, the Track Header Box information MUST be sent prior to the start of the text streaming. Rey & Matsui [Page 19] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 The Sample Table Box (stbl) contains all the time and data indexing of the media samples in a track. Using the tables here, it is possible to locate samples in time, determine their type, and determine their size, container, and offset into that container. From the Sample Table Box (stbl) the following information is carried in each RTP packet using this payload format: the Sample Description Box (stsd), the Decoding Time to Sample Box (stts), the Sample Size Box (stsz) and the Sample to Chunk Box (stsc). The Decoding Time to Sample Box (stts) is mapped to the field SDUR (Text Sample Duration); the Sample Size Box (stsz) is mapped the field SLEN (Text Sample Length) and the Sample to Chunk Box is mapped to the field SIDX (Text Sample Entry Index). The Sample to Chunk Box (stsc) associates the text sample and its corresponding sample description entry in the Sample Description Box (stsd, see below). The Sample to Chunk Box can be used to associate a text sample with a sample description entry. Since the sample description may vary during the session, the association SDIX must be sent together with the text samples using this payload format. The Sample Description Box (stsd) provides information on the basic characteristics of text samples. Each entry is a sample entry box of type "tx3g". An example of the information contained in a sample entry could be the font size or the background colour. Since these pieces of information are commonly used by many text samples during the session, it is sent by out-of-bands means. A complete list of text characteristics can be found in [1]. Finally, the Media Data Box contains the media data itself. In 3GPP timed text tracks this box contains text samples. Its equivalent to audio and video is audio and video frames, respectively. The text sample consists of the text length, the text string, and one or several Modifier Boxes. The text length is the size of the text in bytes. The text string is plain text to render. The Modifier Box is information to render in addition to the text such as colour, font, etc. 15. Author's Addresses Jose Rey rey@panasonic.de Panasonic European Laboratories GmbH Monzastr. 4c D-63225 Langen, Germany Phone: +49-6103-766-134 Fax: +49-6103-766-166 Yoshinori Matsui matsui.yoshinori@jp.panasonic.com Matsushita Electric Industrial Co., LTD. 1006 Kadoma Kadoma-shi, Osaka, Japan Phone: +81 6 6900 9689 Fax: +81 6 6900 9699 Rey & Matsui [Page 20] Internet Draft RTP Payload Format for 3GPP Timed Text February 2004 16. IPR Notices The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP 11 [12]. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF Secretariat. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director. 17. Full Copyright Statement "Copyright (C) The Internet Society (2003). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE." Rey & Matsui [Page 21]