Internet Draft                                               S. Wenger
Document: draft-wenger-avt-rtp-jvt-00.txt                M. Hannuksela
Expires: August 2002                                    T. Stockhammer
                                                         February 2002

                   RTP Payload Format for JVT Video

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Abstract

This memo describes an RTP payload format for the JVT codec. This codec is being designed as a joint project of the ITU-T SG 16 VCEG and the ISO/IEC JTC1/SC29/WG11 MPEG groups.

Wenger et al.             Expires August 2002                  [Page 1]

Internet Draft                                        21 February 2002

1. The JVT codec

This memo specifies an RTP payload format for a new video codec that is currently under development by the Joint Video Team (JVT), which is formed of video coding experts from MPEG and the ITU-T. After the likely approval by the two parent bodies, the codec specification will have the status of an ITU-T Recommendation (likely H.264) and become part of the MPEG-4 specification (ISO/IEC 14496 Part 10). The current JVT project timeline foresees a technically frozen specification (pending bug fixes) in July 2002, in the form of an ISO/IEC Final Committee Draft (FCD).
Before JVT was formed in late 2001, this project used the ITU-T project name H.26L, and the JVT project inherited all the technical concepts of the H.26L project. The JVT video codec has a very broad application range, covering everything from low-bit-rate Internet streaming applications to HDTV broadcast and Digital Cinema applications with near-lossless coding. Most, if not all, relevant companies in all of these fields (including TV broadcast) have participated in the standardization, which gives hope that this wide application range is more than an illusion and may materialize in a relatively short time frame.

The overall performance of the JVT codec is such that bit-rate savings of 50% or more, compared to the current state of technology, are reported. Digital satellite TV quality, for example, was reported to be achievable at 1.5 Mbit/s, compared to the current operating point of MPEG-2 video at around 3.5 Mbit/s [1].

The codec specification [2] itself distinguishes between a video coding layer (VCL) and a network adaptation layer (NAL). The VCL contains the signal processing functionality of the codec: mechanisms such as transform, quantization, motion search/compensation, and the loop filter. It follows the general concept of most of today's video codecs, a macroblock-based coder that utilizes inter-picture prediction with motion compensation and transform coding of the residual signal. The output of the VCL consists of Slices in the sense of MPEG-1: a bit string that contains the macroblock data of an integer number of macroblocks in scan order, plus the information of the slice header (containing information such as the spatial address of the first macroblock in the slice, or the initial quantization parameter). The NAL encapsulates the Slice output of the VCL in a form suitable for transmission over networks or use in multiplex environments.
This encapsulation process can be rather trivial, like the one described in this memo, or relatively complex for byte-stream oriented transports, where additional framing is required. Neither the VCL nor the NAL is claimed to be media or network independent: the VCL needs to know the transmission characteristics in order to appropriately select the error resilience strength, slice size, etc., whereas the NAL needs information, such as the importance of a bit string provided by the VCL, to select the appropriate application layer protection.

Internally, the NAL uses NAL packets, or NALPs. A NALP consists of a one-byte header, which indicates the type of the NALP and the (potential) presence of bit errors in the NALP payload, followed by the NALP payload itself. The RTP payload specification is designed to be unaware of the bit string in the NALP payload.

One of the main properties of the JVT codec is the possibility of using Reference Picture Selection. For each macroblock, the reference picture to be used can be selected independently. The reference pictures may be used in a first-in, first-out fashion, but it is also possible to handle the reference picture buffers explicitly. A consequence of this new feature (previously available only in H.263++ [3]) is the complete decoupling of the transmission time, the decoding time, and the sampling or presentation time of slices and pictures. For this reason, the handling of the RTP timestamp follows the approach introduced in RFC 2250 [4], which seems to preserve the integrity of the RTP buffer model without restricting the functionality of the JVT codec.

2. Scope

This payload specification can only be used to carry the ``naked'' JVT NALP stream over RTP. The first applications of a Standards Track RFC resulting from this draft are likely to be in the conversational multimedia field: video telephony or video conferencing.
The draft is not intended for use in conjunction with the MPEG-4 system layer [5] or other multiplexing schemes.

3. NALP basics

Tutorial information on the NALP design can be found in [6] and [7]. For the precise definition of the NAL, the reader is referred to [2]. This section provides a very short overview of the concepts used.

3.1. Parameter Set Concept

One fundamental design concept of the JVT codec is to generate self-contained packets, making mechanisms such as the header duplication of RFC 2429 [8] or MPEG-4's HEC [9] unnecessary. This is achieved by decoupling information that is relevant for more than one slice from the media stream. This higher layer meta information should be sent reliably and asynchronously from the RTP packet stream that contains the slice packets. The number of higher layer parameters identified as necessary is a surprisingly short list, and the combination of all these list elements is called a Parameter Set. The Parameter Set contains information such as

o picture size,
o display window,
o optional coding modes employed,
o and others.

In order to change picture parameters (such as the picture size) without having to transmit Parameter Set updates synchronously with the slice packet stream, the encoder and decoder can maintain a list of more than one Parameter Set. Each slice header contains a codeword that indicates the Parameter Set to be used. This mechanism allows the transmission of Parameter Sets to be decoupled from the packet stream and performed by external means, e.g. as a side effect of the capability exchange, or through a (reliable or unreliable) control protocol. It may even be that they are never transmitted at all, but fixed by an application design specification.
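As an illustration of this mechanism, the sketch below keeps a table of Parameter Sets established out of band and looks one up by the identifier a slice header would carry. All names (`ParameterSet`, its fields, `install_parameter_set`) are hypothetical; the actual Parameter Set syntax is defined in [2].

```python
# Hedged sketch of the Parameter Set table described above.
# Field and function names are illustrative only, not from [2].

class ParameterSet:
    def __init__(self, pic_width, pic_height, coding_modes):
        self.pic_width = pic_width
        self.pic_height = pic_height
        self.coding_modes = coding_modes

class Decoder:
    def __init__(self):
        # Populated out of band (capability exchange, control
        # protocol) or, where permitted, by PUPs in the RTP stream.
        self.parameter_sets = {}

    def install_parameter_set(self, ps_id, ps):
        self.parameter_sets[ps_id] = ps

    def parameter_set_for_slice(self, slice_header_ps_id):
        # Each slice header carries a codeword naming its set.
        return self.parameter_sets[slice_header_ps_id]
```

Because each slice names its set explicitly, a parameter change (e.g. a new picture size) only requires installing a new set before the first slice that references it.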
Although, conceptually, Parameter Set updates are not designed to be sent in the synchronous packet stream, this memo contains a means to convey them in the RTP packet stream.

3.2. Network Adaptation Layer Packet (NALP) Types

There are eight types of NAL packets (NALPs); see JVT WD Annex x(TBD) [2] for details. They all consist of a single NALP type byte and a byte buffer containing the coded video bits. The NALP type byte itself distinguishes between the eight defined NALP types and includes one bit indicating the presence of errors in the NALP. JVT-video-aware network elements such as gateways can perform many operations by handling only those byte buffers. A wireless-to-wireline gateway can, for example, adjust the packet size to the different MTU sizes by transporting several basic NALPs in a single compound packet. See section 9, ``Application Examples'', below.

The following list briefly describes the NALP types; please see [2] for details:

Single Slice Packets (SSPs) contain all the information belonging to a slice.

Type A, B, C Data Partitioning Packets (DPAs, DPBs, DPCs) contain the information of the data partitions A, B, C of a single slice. Hence, when using data partitioning, a typical Inter slice consists of one packet each of types A, B, and C. Of these partitions, the A partition contains header information and is the most important one for reproduction; the B partition contains Intra coefficients and is more important than the C partition, which contains Inter coefficients. It was found that the use of Data Partitioning is inadvisable in environments without means for uneven error protection [10]. However, such means are available in RTP environments; see section 9 below.

An Instantaneous Decoder Refresh Packet (IDERP) indicates a random access position, from which decoding and displaying can be re-started without reception of any prior coded slices or data partitions. An IDERP contains a single I or SI slice. Once an IDERP is used for one slice of a picture, all other slices of the picture are also encapsulated in IDERPs.

Supplemental Enhancement Information Packets (SEIPs) contain information that is not necessary to maintain the integrity of the encoder's and decoder's reference pictures, but is helpful for decoding or presentation purposes. A typical example of information carried in SEIPs is the presentation timestamp of a picture.

Parameter Update Packets (PUPs) contain information to update the Parameter Sets for the video stream. Normally, the transmission and update of Parameter Sets is a function of a control protocol and, hence, PUPs SHOULD NOT be used in systems where adequate protocol support is available. However, there are applications where the packet stream has to be self-contained. In such cases PUPs MAY be used. Severe synchronization problems between the RTP stream containing PUPs and control protocol messages can occur if PUPs and control protocol messages are used in the same RTP session. For this reason, PUPs MUST NOT be used in an RTP session whose Parameter Sets were already changed by control protocol messages during the lifetime of the RTP session. Similarly, control protocol messages MUST NOT be used to affect any RTP session on which at least one PUP was sent.

The Parameter Set mechanism is designed to decouple the transmission of picture/GOP/sequence header information from the picture data that is composed of SSPs, IDERPs, and/or DPAs, DPBs, DPCs. To successfully decode a picture, all Parameter Sets (referenced by the slice header in an SSP, IDERP, or DPA) need to be available.
Hence, PUPs (when used) SHOULD be conveyed significantly before their content is first referenced.

Compound Packets (CPs) are the built-in multiplex mechanism of the JVT codec. A Compound Packet consists of a variable number of the basic packet types SSP, DPA, DPB, DPC, IDERP, SEIP, and PUP. In the Unrestricted Mode, Compound Packets MAY carry information belonging to more than one picture. The timestamp of a CP MUST be set corresponding to the latest timestamp of any basic packet the Compound Packet is composed of. Ignoring any wrap-around of the timestamp field, this implies that the timestamp of a CP is the highest timestamp of all carried basic packets. The time association of the basic packets inside the CP is performed through the internal timing information of the basic NALPs.

3.3. NALP Type Definition

The structure of the first byte, which indicates the NALP type and the status of the error indication (EI) flag, could not be chosen as a linear enumeration because of design constraints in other networks. In particular, this byte, preceded by a fixed two-byte start code prefix, is used as a start code in the MPEG-2 transport environment. To avoid start code emulations there, [2] reserves the values 0x00 and 0xb9 to 0xff. Hence it specifies the NALP type in a table that is reproduced below for convenience (please see [2] for up-to-date information):

   NALP type   EI-Flag   NALP First Byte
   SSP            0          0x10
   SSP            1          0x11
   DPA            0          0x20
   DPA            1          0x21
   DPB            0          0x30
   DPB            1          0x31
   DPC            0          0x40
   DPC            1          0x41
   SEIP           0          0x50
   SEIP           1          0x51
   PUP            0          0x60
   PUP            1          0x61
   CP             0          0x70
   CP             1          0x71
   IDERP          0          0x80
   IDERP          1          0x81

4. RTP Packetization Process

The packetization process of the JVT codec using the RTP/UDP/IP Network Adaptation Layer (NAL) is straightforward and follows the general principles outlined in RFC 1889. The RTP payload consists of the bit buffer containing the coded bits as prepared by the NAL. There is no specific RTP payload header.
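For illustration, a minimal sender-side sketch of this packetization: a fixed 12-byte RTP header (V=2, P=0, X=0, CC=0, no CSRC list, as set out in the next section) followed directly by the NALP, whose first byte already carries the type and EI flag per the table in section 3.3. Function and parameter names are illustrative, not from this draft.

```python
import struct

def pack_rtp_packet(marker, payload_type, seq, timestamp, ssrc, nalp):
    """Pack one NALP into an RTP packet (sketch, not normative).

    The 12-byte header is V=2, P=0, X=0, CC=0, then the M bit and
    7-bit payload type, 16-bit sequence number, 32-bit timestamp,
    and 32-bit SSRC, all in network byte order. The NALP bytes are
    the payload; no extra payload header is added.
    """
    first = 0x80                                   # V=2, P=0, X=0, CC=0
    second = ((1 if marker else 0) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + nalp
```

A de-packetizer reverses this by stripping the 12-byte header and treating the remainder as one NALP.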
The RTP header information is set as follows:

Timestamp: 32 bits
  The RTP timestamp is set similarly as in RFC 2250 [4]: the timestamp is the target transmission time of the first byte of the packet payload. Note: see the open issues at the end of this draft.

Marker bit (M): 1 bit
  Set for the very last packet of the picture indicated by the RTP timestamp, in line with the normal use of the M bit, to allow efficient playout buffer handling. Note: see the open issues at the end of this draft.

Sequence number (Seq): 16 bits
  Increased by one for each sent packet. Set to a random value at startup, as per RFC 1889.

Version (V): 2 bits
  Set to 2.

Padding (P): 1 bit
  Set to 0.

Extension (X): 1 bit
  Set to 0.

Payload Type (PT): 7 bits
  Established dynamically during connection establishment.

All other RTP header fields are set as per RFC 1889.

5. Packetization Rules

Two modes of packetization rules are distinguished by whether packets belonging to more than a single picture may be put into a single Compound Packet.

5.1. Unrestricted Mode (Multiple Picture Model)

This mode MAY be supported by some receivers. Usually, the capability of a receiver to support this mode is indicated by one of the profiles of the JVT codec. The following packetization rules MUST be enforced by the sender:

o Single Slice Packets belonging to the same picture MAY be sent in any order, although, for delay-critical systems, they SHOULD be sent in their original coding order to minimize delay. Note that the coding order is not necessarily the scan order, but the order in which the NAL packets become available to the RTP stack.

o SEIPs MAY be sent at any time.

o PUPs MUST NOT be sent in an RTP session whose Parameter Sets were already changed by control protocol messages during the lifetime of the RTP session. If PUPs are allowed by this condition, they MAY be sent at any time.
o All allowed NALP types MAY be mixed freely, provided that the above rules are obeyed. In particular, it is allowed to mix slices in data-partitioned and single-slice mode.

o Compound-packet-aware network elements MAY convert NALPs of all other types into sub-packets of Compound Packets, convert sub-packets into individual RTP packets carrying single NALPs, or mix both concepts. However, when doing so they SHOULD take into account at least the following parameters: path MTU size, unequal protection mechanisms (e.g. through packet duplication, or packet-based FEC carried by RFC 2198, especially for header and Type A Data Partitioning packets), the bearable latency of the system, and the buffering capabilities of the receiver.

o NALPs of all types MAY be conveyed as sub-packets of a Compound Packet rather than as individual RTP packets. Special care SHOULD be taken (particularly in gateways) to avoid more than a single copy of identical NALPs in a single Compound Packet, in order to avoid unnecessary data transfers without any improvement in QoS.

5.2. Restricted Mode (Single Picture Model)

This mode MUST be supported by all receivers. It is primarily intended for low-delay applications. Its main difference from the Unrestricted Mode is that it forbids the packetization of data belonging to more than one picture in a single RTP packet. The following packetization rules MUST be enforced by the sender:

o All rules of the Unrestricted Mode above.

o Compound Packets MUST NOT include SSPs, IDERPs, or DP[ABC]s belonging to different pictures. A sender naturally has access to this information. A video-aware network element has to rely on the slice or data partition header (part of the NALP payload) in order to ensure that the NALPs belong to one single picture (the information in the RTP header is not sufficient).

6. De-Packetization Process

The de-packetization process is implementation dependent.
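One possible implementation of the rules given in the remainder of this section can be sketched as follows: present NALPs in sequence-number order, expand Compound Packets in place, and discard B/C partitions whose A partition was lost. Compound Packets are modeled here as pre-split lists of basic NALPs, and each NALP as a (first-byte, slice-id, payload) tuple, because the actual framing and slice-header syntax are defined in [2]; sequence-number wrap-around is ignored for brevity.

```python
# Hedged de-packetizer sketch; all names and the NALP tuple model
# are illustrative assumptions, not part of this specification.

DPB_BYTES = {0x30, 0x31}
DPC_BYTES = {0x40, 0x41}

def depacketize(packets, lost_dpa_slice_ids=frozenset()):
    """packets: list of (seq, [nalp, ...]); each nalp is a tuple
    (first_byte, slice_id, payload). Returns NALPs in decoding
    order; DPB/DPC partitions whose matching DPA was lost are
    dropped, since they are meaningless to the JVT decoder."""
    out = []
    for _, nalps in sorted(packets, key=lambda p: p[0]):
        for nalp in nalps:
            first_byte, slice_id, _ = nalp
            if (first_byte in DPB_BYTES or first_byte in DPC_BYTES) \
                    and slice_id in lost_dpa_slice_ids:
                continue  # useless without the A partition
            out.append(nalp)
    return out
```

A gateway applying only the DPB/DPC filtering step can shed useless packets by inspecting the NALP type byte, without parsing the coded bit stream.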
Hence, the following description should be seen as an example of a suitable implementation. Other schemes MAY be used as well, and optimizations relative to the described algorithm are likely possible.

The general concept behind these de-packetization rules is to collect all packets belonging to a picture, bring them into a reasonable order, discard anything that is unusable, and pass the rest to the decoder. Compound Packets are handled by unloading their payload into individual NALPs. Those NALPs are processed as if they were received in separate RTP packets, in the order in which they were arranged in the Compound Packet.

The following de-packetization rules MAY be used to implement an operational JVT de-packetizer:

o NALPs are presented to the JVT decoder in the order of their sequence numbers.

o NALPs carried in a Compound Packet are presented in their order within the Compound Packet. All NALPs of a Compound Packet are processed before the next RTP packet is processed.

o Intelligent RTP receivers (e.g. in gateways) MAY identify lost DPAs. If a lost DPA is found, the gateway MAY discard the corresponding DPB and DPC partitions, as their information is meaningless to the JVT decoder. In this way a network element can reduce the network load by discarding useless packets without parsing a complex bit stream.

7. MIME Considerations

This section is to be completed later.

8. Security Considerations

So far, no security considerations beyond those of RFC 1889 have been identified. Currently, the JVT WD does not allow carrying any type of active payload. However, the inclusion of a ``user data'' mechanism is under consideration, which could potentially be used for mechanisms such as remote software updates of the video decoder and similar tasks.

9.
Informative Appendix: Application Examples

This payload specification is very flexible in its use, in order to cover the extremely wide application space anticipated for the JVT codec. However, such great flexibility also makes it difficult for an implementer to decide on a reasonable packetization scheme. Some information on how to apply this specification to real-world scenarios is likely to appear in the form of academic publications and a Test Model in the near future. Nevertheless, some preliminary usage scenarios are described here as well.

9.1. Video Telephony, no Data Partitioning, no Compound Packets

The RTP part of this scheme is implemented and tested (though not the control-protocol part; see below). In most real-world video telephony applications, picture parameters such as the picture size or optional modes never change during the lifetime of a connection. Hence, all necessary Parameter Sets (usually only one) are sent as a side effect of the capability exchange/announcement process. An example of such a capability exchange with an SDP-like syntax can be found in [11], but other schemes such as ASN.1 are possible as well.

Since all necessary Parameter Set information is established before the RTP session starts, there is no need to send any PUPs. Data Partitioning is not used either. Hence, the RTP packet stream consists basically of SSPs that carry the video information. The size of those SSPs is chosen by the encoder such that they offer the best performance. Often, this is done by adapting the coded slice size to the MTU size of the IP network. For small picture sizes this may result in a one-picture-per-packet strategy. The loss of packets and the resulting drift-related artifacts are cleaned up by Intra refresh algorithms.

9.2.
Video Telephony, Interleaved Packetization using Compound Packets

This scheme allows better error concealment and is widely used in H.263-based designs using RFC 2429 packetization. It is also implemented, and good results were reported [6]. The source picture is coded by the VCL such that all MBs of one MB line are assigned to one slice. All slices with even MB row addresses are combined into one Compound Packet, and all slices with odd MB row addresses into another. Those Compound Packets are transmitted as RTP packets. The establishment of the Parameter Sets is performed as discussed above.

Note that the use of Compound Packets is essential here, because the high number of individual slices (18 for a CIF picture) would otherwise lead to unacceptably high IP/UDP/RTP header overhead. Furthermore, some wireless video transmission systems, such as H.324M and the IP-based video telephony specified in 3GPP, are likely to use a relatively small transport packet size. For example, a typical MTU size of an H.223 AL3 SDU is around 100 bytes [12]. Coding individual slices according to this packetization scheme provides a further advantage in communication between wired and wireless networks, as individual slices are likely to be smaller than the preferred maximum packet size of wireless systems. Consequently, a gateway can convert a Compound Packet used in a wired network into several basic-NALP packets preferred in a wireless network, and vice versa.

9.3. Video Telephony, with Data Partitioning

This scheme is implemented and was shown to offer good performance, especially at higher packet loss rates [6]. Data Partitioning is known to be useful only when some form of unequal error protection is available. Normally, in single-session RTP environments, even error characteristics are assumed: statistically, the packet loss probability of all packets of the session is the same. However, there are means to reduce the packet loss probability of individual packets in an RTP session.
One simple way is known as Packet Duplication: simply send the to-be-protected packet twice, with the same sequence number. If both packets survive, the receiver will assume a packet duplication by UDP and discard one of the two packets. Other means of unequal protection within the same RTP session include the use of RFC 2198 [13] (for this application it is essentially a packet duplication process as well, with some bytes saved on the second RTP header), or packet-based Forward Error Correction [14] carried in RFC 2198.

The implemented software uses the simple packet duplication process to increase the delivery probability of all DPA NALPs. The incurred overhead is substantial, but of the same order of magnitude as the number of bits that would otherwise have been spent on Intra information. Moreover, this mechanism does not add any delay to the system. Again, the complete Parameter Set establishment is performed through control protocol means.

9.4. MPEG-2 Transport to RTP Gateway

This example is not implemented completely, but the basic mechanisms are part of the interim file format the JVT group uses and are, hence, well tested. When using JVT video in satellite/cable broadcast environments, there is no control protocol available that can be used for the transmission of Parameter Sets. Furthermore, a receiver has to be able to ``tune'' into an ongoing packet stream at any time, without much delay or many artifacts. For this reason, PUPs that contain all Parameter Set information are included in the packet stream at every Instantaneous Decoder Refresh point (which is similar to a Key Frame in earlier coding standards). IDERP packets are used to signal these ``key frames'' so that a decoder can most easily determine where to start its decoding process.
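A receiver tuning into such a stream can key purely off the NALP type byte (section 3.3): keep PUPs (they carry the Parameter Sets), discard everything else until the first IDERP, and decode normally from there. A sketch under these assumptions; names are illustrative, and each NALP is modeled as a byte string starting with its type byte.

```python
# Hedged sketch of ``tuning in'' at an IDR point using only the
# NALP first byte (values from the table in section 3.3).

PUP_BYTES = (0x60, 0x61)      # Parameter Update Packets
IDERP_BYTES = (0x80, 0x81)    # Instantaneous Decoder Refresh Packets

def tune_in(nalps):
    """Yield the NALPs a late-joining decoder can use: PUPs are
    always kept (they refresh the Parameter Sets); everything else
    is dropped until the first IDERP has been seen."""
    started = False
    for nalp in nalps:
        t = nalp[0]
        if t in PUP_BYTES:
            yield nalp
        elif t in IDERP_BYTES:
            started = True
            yield nalp
        elif started:
            yield nalp
```

No bit-stream parsing is needed, which is the same property a gateway exploits to apply type-dependent protection.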
The simplest possible MPEG-2-transport-to-RTP gateway could take the NALPs as they come from the MPEG-2 transport stream (after de-framing) and send them, each NALP in one RTP packet, with increasing RTP sequence numbers. However, non-negligible packet loss rates would lead to very poor performance of such a system. Instead, a gateway could use the protection mechanisms discussed above to unequally protect the most important packets, e.g. all PUPs (very strong protection) and IDERPs (weak protection), and transmit everything else best-effort. The gateway can do this without parsing the bit stream, by simply using the NALP type byte. A more sophisticated gateway may be able to combine several small NALPs into one large Compound Packet in order to save the bytes used for the IP/UDP/RTP headers. A similar mechanism is, of course, also possible in H.320-to-RTP gateways.

9.5. Low-Bit-Rate Streaming

This scheme has been implemented with H.263 and gave good results [15]. There is no technical reason why similarly good results could not be achieved using the JVT codec.

In today's Internet streaming, some of the offered bit rates are relatively low, in order to allow terminals with dial-up modems to access the content. In wired IP networks, relatively large packets, say 500 to 1500 bytes, are preferred over smaller and more frequent packets in order to reduce network congestion. Moreover, the use of large packets decreases the amount of RTP/UDP/IP header overhead. For low-bit-rate video, the use of large packets means that sometimes up to a few pictures should be encapsulated in one packet. However, the loss of such a packet would have drastic consequences for visual quality, as there is practically no way to conceal the loss of an entire picture other than repeating the previous one.

One way to construct relatively large packets and still maintain possibilities for successful loss concealment is to construct Compound Packets that contain slices from several pictures in an interleaved manner. A Compound Packet should not contain spatially adjacent slices from the same picture or spatially overlapping slices from any picture. If a packet is lost, it is then likely that a lost slice is surrounded by spatially adjacent slices of the same picture and by spatially corresponding slices of the temporally previous and succeeding pictures. Consequently, concealment of the lost slice is likely to succeed relatively well.

10. Open Issues

There are several open issues on which the authors would like to receive opinions. They are listed below.

10.1. Timestamp per RFC 2250

RTP payload specifications for video coding schemes normally use either the ``sampling instant'' or the ``presentation timestamp'' as the base for the calculation of the RTP timestamp. Both seem to be inappropriate for the JVT codec, for the same reason: the decoupling of sampling, coding, transmission, and presentation in a JVT packet stream.

One example may (hopefully) clarify the issue. Consider a live broadcast of a sports event. At unforeseeable times there is a short break, which is used for a commercial. The sports event itself is live (real-time encoding), but the commercials are available beforehand. When Enhanced Reference Picture Selection is available (e.g. in H.263++ or JVT), a video codec can send the bit-expensive I slices of the first picture of a commercial (necessary for the scene change) time-interleaved with the slices of the real-time encoder. Such interleaving can cover seconds or even minutes, which cannot be compensated by the RTP jitter buffer without losing the ``live'' feeling of the sports event.

The authors considered the alternatives ``Decoding Timestamp'' and ``Sending Timestamp'' (the latter being what RFC 2250 does, and what we propose in this draft).
Using the ``Decoding Timestamp'' has the problem that, currently, no such concept exists in the JVT design. It could be introduced, if this is deemed beneficial by AVT (or by someone else). Clearly, using the RFC 2250-like timestamp precludes exact media synchronization between different RTP receivers -- which is one of the main properties of RTP. However, before diving into the treacherous waters of using the ``Decoding Timestamp'', we would like to hear opinions from AVT, especially with respect to implementation experience with RFC 2250.

10.2. Parameter Set Updates, and the availability of such packets

This is a relatively minor problem. Should we forbid PUPs in RTP environments? Is this mechanism necessary for RTP environments? In other words: do ALL protocol stacks that rely on RTP have a sufficiently capable control protocol to transmit this information (about 30 lines of SDP per Parameter Set, probably 5 such sets for a nice application)? Note: the authors believe that RTCP sender reports are NOT an appropriate means for such transport, because of their rare use in H.323 environments and because of gateway considerations between MPEG-2 transport and H.323 systems.

10.3. Marker Bit issue

In JVT, a picture may or may not be sent as one block of packets. It is acceptable to spread its packets over a large period of time (minutes) and interleave them with data of other pictures, if the delay constraints allow this. Does the current Marker Bit definition make sense in such an environment? Note: the additional clarification ``indicated by the RTP timestamp'' is needed to specify which picture of a Compound Packet the marker bit is associated with. It is unclear whether this is a sufficient condition for normal M bit use with Compound Packets, but at least it is better than nothing.
In particular, an M bit signaled for a picture does not necessarily mean that all data of a previous picture in coding order has been received. Does this cause a contradiction with conventional playout buffer handling?

10.4. User Data

Most newer video compression schemes allow carrying ``user data'' in the bit stream. ``User data'' is normally composed of a tag that identifies a vendor, and a vendor-specific byte string. Decoders of other vendors that receive user data they do not understand are free to ignore it. Practically, the standardization process has no influence on the type of user data carried. Should the payload specification explicitly forbid certain types of user data, e.g. active content?

11. Full Copyright Statement

Copyright (C) The Internet Society (2002). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

12. Bibliography

[1] P. Borgwardt, "Handling Interlaced Video in H.26L", VCEG-N57r2, available from ftp://standard.pictel.com/video-site/0109_San/VCEG-N57r2.doc, September 2001
[2] JVT Working Draft Version 1, available from ftp://standard.pictel.com/video-site/H26L/jvt11.doc
[3] ITU-T Recommendation H.263 (2000)
[4] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar, "RTP Payload Format for MPEG1/MPEG2 Video", RFC 2250, January 1998
[5] ISO/IEC IS 14496-1
[6] S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and Systems for Video Technology, to appear (April 2002)
[7] S. Wenger, "H.26L over IP: The IP Network Adaptation Layer", Proceedings Packet Video Workshop 02, April 2002, to appear
[8] C. Bormann et al., "RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998
[9] ISO/IEC IS 14496-2
[10] T. Stockhammer et al., "Simulation Results for Common Conditions for H.323/Internet Case", VCEG-N50, available from ftp://standard.pictel.com/video-site/0109_San/VCEG-N50.doc, September 2001
[11] S. Wenger, T. Stockhammer, "H.26L over IP and H.324 Framework", VCEG-N52, available from ftp://standard.pictel.com/video-site/0109_San/VCEG-N52.doc, September 2001
[12] ITU-T Recommendation H.223 (1999)
[13] C. Perkins et al., "RTP Payload for Redundant Audio Data", RFC 2198, September 1997
[14] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, December 1999
[15] V. Varsa, M. Karczewicz, "Slice interleaving in compressed video packetization", Packet Video Workshop 2000

Author's Addresses

Stephan Wenger
TU Berlin / Teles AG
Franklinstr. 28-29
D-10587 Berlin, Germany
Phone: +49-172-300-0813
Email: stewe@cs.tu-berlin.de

Thomas Stockhammer
Institute for Communications Eng.
Munich University of Technology
D-80290 Munich, Germany
Phone: +49-89-28923474
Email: stockhammer@ei.tum.de

Miska M. Hannuksela
Nokia Mobile Phones
P.O. Box 68
33721 Tampere, Finland
Phone: +358 40 5212845
Email: miska.hannuksela@nokia.com