Internet Engineering Task Force          Audio-Video Transport Working Group
INTERNET-DRAFT                                                H. Schulzrinne
                                                      AT&T Bell Laboratories
December 15, 1992                                            Expires: 5/1/93

        Issues in Designing a Transport Protocol for Audio and Video
       Conferences and other Multiparticipant Real-Time Applications

Status of this Memo

This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress." Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft.

Distribution of this document is unlimited.

Abstract

This draft is a companion document to the RTP protocol draft draft-ietf-avt-rtp-00.{txt,ps}. It discusses aspects of transporting real-time services (such as voice or video) over the Internet. It compares and evaluates design alternatives for a real-time transport protocol, providing rationales for the design decisions made for RTP. Also covered are issues of port assignment and multicast address allocation. A comprehensive glossary of terms related to multimedia conferencing is provided.

Acknowledgments

This draft is based on discussion within the AVT working group chaired by Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments. This work was supported in part by the Office of Naval Research under contract N00014-90-J-1293, the Defense Advanced Research Projects Agency under contract NAG2-578 and a National Science Foundation equipment grant, CERDCR 8500332.

Contents

1  Introduction
2  Goals
3  Services
   3.1   Duplex or Simplex?
   3.2   Framing
   3.3   Version Identification
   3.4   Conference Identification
         3.4.1  Demultiplexing
         3.4.2  Aggregation
   3.5   Media Encoding Identification
         3.5.1  Audio Encodings
         3.5.2  Video Encodings
   3.6   Playout Synchronization
         3.6.1  Synchronization Methods
         3.6.2  Detection of Synchronization Units
         3.6.3  Interpretation of Synchronization Bit
         3.6.4  Interpretation of Timestamp
         3.6.5  End-of-talkspurt Indication
         3.6.6  Recommendation
   3.7   Segmentation and Reassembly
   3.8   Source Identification
         3.8.1  Gateways, Reflectors and End Systems
         3.8.2  Address Format Issues
   3.9   Energy Indication
   3.10  Error Control
   3.11  Security
         3.11.1 Encryption
         3.11.2 Authentication
   3.12  Quality of Service Control
         3.12.1 Monitoring by Receiver
         3.12.2 Monitoring by Sender
         3.12.3 Monitoring by Third Party
4  Conference Control Protocol
5  The Use of Profiles
6  Port Assignment
7  Multicast Address Allocation
   7.1   Channel Sensing
   7.2   Global Reservation Channel with Scoping
   7.3   Local Reservation Channel
         7.3.1  Hierarchical Allocation with Servers
         7.3.2  Distributed Hierarchical Allocation
   7.4   Restricting Scope by Limiting Time-to-Live
A  Glossary
B  Address of Author

1 Introduction

The real-time transport protocol (RTP) discussed in this draft aims to provide services commonly required by interactive multimedia conferences, such as playout synchronization, demultiplexing, media identification and active-party identification. However, RTP is not restricted to multimedia conferences; it is anticipated that other real-time services such as remote data acquisition and control may find its services of use.

In this context, a conference describes associations that are characterized by the participation of two or more agents, interacting in real time with one or more media of potentially different types. The agents are anticipated to be human, but may also be measurement devices, remote media servers, simulators and the like. Both two-party and multiple-party associations are to be supported, where one or more agents can take active roles, i.e., generate data. Thus, applications not commonly considered a conference fall under this wider definition, for example, one-way media such as the network equivalent of closed-circuit television or radio, traditional two-party telephone conversations or real-time distributed simulations. Even though intended for real-time interactive applications, the use of RTP for the storage and transmission of recorded real-time data should be possible, with the understanding that the interpretation of some fields such as timestamps may be affected by this off-line mode of operation.

RTP uses the services of an end-to-end transport protocol such as UDP, TCP, OSI TPx, ST-II or the like.(1) The services used are: end-to-end delivery, framing, demultiplexing and multicast. The underlying network is not assumed to be reliable and can be expected to lose, corrupt, arbitrarily delay and reorder packets. However, the use of RTP within quality-of-service (e.g., rate) controlled networks is anticipated to be of particular interest. Network layer support for multicasting is desirable, but not required.

RTP is supported by a real-time control protocol (RTCP) in a relationship similar to that between IP and ICMP. However, RTP can be used, with reduced functionality, without a control protocol.
The control protocol RTCP provides minimum functionality for maintaining conference state for one or more flows within a single transport association. RTCP is not guaranteed to be reliable; each participant simply sends the local information periodically to all other conference participants.

------------------------------
1. ST-II is not properly a transport protocol, as it is visible to intermediate nodes, but it provides services such as process demultiplexing commonly associated with transport protocols.

As an alternative, RTP could be used as a transport protocol layered directly on top of IP, potentially increasing performance and reducing header overhead. This may be attractive as the services provided by UDP, checksumming and demultiplexing, may not be needed for multicast real-time conferencing applications. This aspect remains for further study. The relationships of RTP and RTCP to other protocols of the Internet protocol suite are depicted in Fig. 1.

   +-------------------+-----------------------+
   |                   | conference controller |
   | media application |----------------+      |
   |                   |      CCP       |      |
   +------------+------+------+---------+------+
   |            |    RTCP     |
   |            +-------------+
   |     RTP                  |
   +---------------+----------+--+
   |               |             |
   |      UDP      |    ST-II    |
   +---------------+             |
   |      IP       |             |
   +---------------+-------------+

   Figure 1: Embedding of RTP and RTCP in the Internet protocol stack

Conferences encompassing several media are managed by a (reliable) conference control protocol, whose definition is outside the scope of this note. Some aspects of its functionality, however, are described in Section 4.

Within this working group, some common encoding rules and algorithms for media have been specified, keeping in mind that this aspect is largely independent of the remainder of the protocol. Without this specification, interoperability cannot be achieved. It is intended, however, to keep the two aspects as separate RFCs, as changes in media encoding should be independent of the transport aspects. The encoding specification includes issues such as byte order for multi-byte samples, sample order for multi-channel audio, the format of state information for differential encodings, the segmentation of encoded video frames into packets, and the like.

When used for multimedia services, RTP sources will have to be able to convey the type of media encoding used to the receivers. The number of encodings potentially used is rather large, but a single application will likely restrict itself to a small subset of that. To allow the participants in conferences to unambiguously communicate to each other the current encoding, the working group is defining a set of encoding names to be registered with the Internet Assigned Numbers Authority (IANA). Also, short integers for a default mapping of common encodings are specified.

The issue of port assignment will be discussed in more detail in Section 6. It should be emphasized, however, that UDP port assignment does not imply that all underlying transport mechanisms share this or a similar port mechanism.

This draft aims to summarize some of the discussions held within the audio-video transport (AVT) working group chaired by Stephen Casner, but the opinions are the author's own. Where possible, references to previous work are included, but the author realizes that the attribution of ideas is far from complete.
The draft builds on operational experience with Van Jacobson's and Steve McCanne's vat audio conferencing tool as well as implementation experience with the author's Nevot network voice terminal. This note will frequently refer to NVP [1], the network voice protocol, a protocol used in two versions for early Internet wide-area packet voice experiments. CCITT has standardized, as recommendations G.764 and G.765, a packet voice protocol stack for use in digital circuit multiplication equipment.

The name RTP was chosen to reflect the fact that audio and video conferences may not be the only applications employing its services, while the real-time nature of the protocol is important, setting it apart from other multimedia-transport mechanisms, such as the MIME multimedia mail effort [2].

The remainder of this draft is organized as follows. Section 2 summarizes the design goals of this real-time transport protocol. Then, Section 3 describes the services to be provided in more detail. Section 4 briefly outlines some of the services added by a higher-layer conference control protocol; a more detailed description is outside the scope of this document. Two appendices discuss the issues of port assignment and multicast address allocation, respectively. A glossary defines terms and acronyms, providing references for further detail. The actual protocol specification embodying the recommendations and conclusions of this report is contained in a separate document.

2 Goals

Design decisions should be measured against the following goals, not necessarily listed in order of importance:

content flexibility: While the primary applications that motivate the protocol design are conference voice and video, it should be anticipated that other applications may also find the services provided by the protocol useful. Some examples include distribution audio/video (for example, the "Radio Free Ethernet" application by Sun), distributed simulation and some forms of (loss-tolerant) remote data acquisition (for example, active badge systems [3, 4]). Note that it is possible that the same packet header field may be interpreted in different ways depending on the content (e.g., a synchronization bit may be used to indicate the beginning of a talkspurt for audio and the beginning of a frame for video). Also, new formats of established media, for example, high-quality multi-channel audio or combined audio and video sources, should be anticipated where possible.

extensible: Researchers and implementors within the Internet community are currently only beginning to explore real-time multimedia services such as video conferences. Thus, RTP should be able to incorporate additional services as operational experience with the protocol accumulates and as applications not originally anticipated find its services useful. The same mechanisms should also allow experimental applications to exchange application-specific information without jeopardizing interoperability with other applications. Extensibility is also desirable as it will hopefully speed along the standardization effort, making the consequences of leaving out some group's favorite fixed header field less drastic. It should be understood that extensibility and flexibility may conflict with the goals of bandwidth and processing efficiency.

independent of lower-layer protocols: RTP should make as few assumptions about the underlying transport protocol as possible.
It should, for example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and experimental protocols, for example, protocols that support resource reservation and quality-of-service guarantees. Naturally, not all transport protocols are equally suited for real-time services; in particular, TCP may introduce unacceptable delays over anything but low-error-rate LANs. Also, protocols that deliver streams rather than packets need additional framing services, as discussed in Section 3.2. It remains to be discussed whether RTP may use services provided by the lower-layer protocols for its own purposes (time stamps and sequence numbers, for example). The goal of independence from lower-layer considerations also affects the issue of address representation. In particular, anything too closely tied to the current IP 4-byte addresses may face early obsolescence. It is to be anticipated, however, that experience gained will suggest a new protocol revision in any event by that time.

gateway-compatible: Operational experience has shown that RTP-level gateways are necessary and desirable for a number of reasons. First, it may be desirable to aggregate several media streams into a single stream and then retransmit it with possibly different encoding, packet size or transport protocol. A packet "reflector" that achieves multicasting by user-level copying may be needed where multicast tunnels or IP connectivity are unavailable or the end systems are not multicast-capable.

bandwidth efficient: It is anticipated that the protocol will be used in networks with a wide range of bandwidths and with a variety of media encodings. Despite increasing bandwidths within the national backbone networks, bandwidth efficiency will continue to be important for transporting conferences across 56 kb/s links, office-to-home high-speed modem connections and international links. To minimize end-to-end delay and the effect of lost packets, packetization intervals have to be limited, which, in combination with efficient media encodings, leads to short packet sizes. Generally, packets containing 16 to 32 ms of speech are considered optimal [5, 6, 7]. For example, even with a 65 ms packetization interval, a 4800 b/s encoding produces 39-byte packets. Current Internet voice experiments use packets containing around 20 ms of audio, which translates into 160 bytes of audio information coded at 64 kb/s. Video packets are typically much longer, so that header overhead is less of a concern. For UDP multicast (without counting the overhead of source routing as currently used in tunnels or a separate IP encapsulation as planned), IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead, to which datalink layer headers of at least 4 bytes must be added. With RTP header lengths between 4 and 8 bytes, the total overhead amounts to between 36 and 40 (or more) bytes per audio or video packet. For 160-byte audio packets, the overhead of 8-byte RTP headers together with UDP, IP and PPP (as an example of a datalink protocol) headers is 25%. For low bitrate coding, packet headers can easily double the necessary bit rate. (A worked example of this arithmetic appears at the end of this section.) Thus, it appears that any fixed headers beyond eight bytes would have to make a significant contribution to the protocol's capabilities, as such long headers could stand in the way of running RTP applications over low-speed links. The current fixed header lengths for NVP and vat are 4 and 8 bytes, respectively. It is interesting to note that G.764 has a total header overhead, including the LAPD data link layer, of only 8 bytes, as the voice transport is considered a network-layer protocol. The overhead is split evenly between layers 2 and 3. Bandwidth efficiency can be achieved by transporting non-essential or slowly changing protocol state in optional fields or in a separate low-bandwidth control protocol. Also, header compression [8] may be used.
international: Even now, audio and video conferencing tools are used far beyond the North American continent. It would seem appropriate to give consideration to internationalization concerns, for example to allow for the European A-law audio companding and non-US-ASCII character sets in textual data such as site identification.

processing efficient: With arrival rates on the order of 40 to 50 packets per second for a single voice or video source, per-packet processing overhead may become a concern, particularly if the protocol is to be implemented on other than high-end workstations. Multiplication and division operations should be avoided where possible and fields should be aligned to their natural size, i.e., an n-byte integer is aligned on an n-byte multiple, where possible.

implementable now: Given the anticipated lifetime and experimental nature of the protocol, it must be implementable with current hardware and operating systems. That does not preclude that hardware and operating systems geared towards real-time services may improve the performance or capabilities of the protocol, e.g., allow better intermedia synchronization.
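The following fragment makes the overhead arithmetic of the bandwidth-efficiency goal concrete. It is an illustrative sketch only: the 8-byte RTP and 4-byte PPP header sizes are the assumed example values from the discussion above, not normative constants.

   #include <stdio.h>

   /*
    * Illustrative sketch: per-packet header overhead for audio.
    * Header sizes follow the discussion above (IPv4 = 20, UDP = 8,
    * RTP assumed 8, PPP assumed 4 bytes).
    */
   int
   main()
   {
       double rate_bps  = 64000.0;          /* codec output, bits/second */
       double interval  = 0.020;            /* packetization interval, s */
       int    hdr_bytes = 20 + 8 + 8 + 4;   /* IP + UDP + RTP + PPP      */

       double payload  = rate_bps * interval / 8.0;  /* bytes of audio  */
       double overhead = 100.0 * hdr_bytes / payload;

       printf("payload %.0f bytes, header overhead %.1f%%\n",
              payload, overhead);
       /* prints: payload 160 bytes, header overhead 25.0% */
       return 0;
   }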
3 Services

The services that may be provided by RTP are summarized below. Note that not all services have to be offered. Services anticipated to be optional are marked with an asterisk.

 o framing (*)
 o demultiplexing by conference/association (*)
 o demultiplexing by media source
 o demultiplexing by conference
 o determination of media encoding
 o playout synchronization between a source and a set of destinations
 o error detection (*)
 o encryption (*)
 o quality-of-service monitoring (*)

In the following sections, we will discuss how these services are reflected in the proposed packet header.

Information to be conveyed within the conference can be roughly divided into information that changes with every data packet and other information that stays constant for longer time periods. State information that does not change with every packet can be carried in several different ways:

as a fixed part of the RTP header: This method is easiest to decode and ensures state synchronization between sender and receiver(s), but can be bandwidth inefficient or restrict the amount of state information to be conveyed.

as a header option: The information is only carried when needed. It requires more processing by the sending and receiving application. If contained in every packet, it is also less bandwidth-efficient than the first method.

within RTCP packets: This approach is roughly equivalent to header options in terms of processing and bandwidth efficiency. Some means of identifying when a particular option takes effect within the data stream may have to be provided.

within a multicast conference announcement: Instead of residing at a well-known conference server, information about on-going or upcoming conferences may be multicast to a well-known multicast address.

within conference control: The state information is conveyed when the conference is established or when the information changes. As for RTCP packets, a synchronization mechanism between data and control may be required for certain information.

through a conference directory: This is a variant of the conference control mechanism, with a (distributed) directory at a well-known (multicast) address maintaining state information about on-going or scheduled conferences. Changing state information during a conference is probably more difficult than with conference control, as participants need to be told to look at the directory for changed information. Thus, a directory is probably best suited to hold information that will persist through the life of the conference, for example, its multicast group, list of media encodings, title and organizer.

The first two methods are examples of in-band signaling, the others of out-of-band signaling.

Options can be encoded in a number of ways, resulting in different tradeoffs between flexibility, processing overhead and space requirements. In general, options consist of a type field, possibly a length field, and the actual option value. The length field can be omitted if the length is implied by the option type. Implied-length options save space, but require special treatment while processing. While options with explicit length that are added in later protocol versions are backwards-compatible (the receiver can just skip them), implied-length options cannot be added without modifying all receivers, unless they are marked as such and all have a known length. As an example, IP defines two implied-length options, no-op and end-of-option, both with a length of one octet. For indicating the extent of options, a number of alternatives have been suggested; a parsing sketch follows the list.

option length: The fixed header contains a field containing the length of the options, as used for IP. This makes skipping over options easy, but consumes precious header space.

end-of-options bit: Each option contains a special bit that is set only for the last option in the list. In addition, the fixed header contains a flag indicating that options are present. This conserves space in the fixed header, at the expense of reducing usable space within options, e.g., reducing the number of possible option types or the maximum option length. It also makes skipping options somewhat more processing-intensive, particularly if some options have implied lengths and others explicit lengths.

end-of-options option: A special option type indicates the end of the option list, with a bit in the fixed header indicating the presence of options. The properties of this approach are similar to the previous one, except that it can be expected to take up more header space.

options directory: An options-present bit in the fixed header indicates the presence of an options directory. The options directory in turn contains a length field for the options list and possibly bits indicating the presence of certain options or option classes. The option length makes skipping options fast, while the presence bits allow a quick decision whether the options list should be scanned for relevant options. If all options have a known, fixed length, the bit mask can be used to directly access certain options, without having to traverse parts of the options list. The drawback is increased header space and the necessity to create the directory. If options are explicitly coded in the bit mask, the number and numbering of options is restricted.
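As an illustration of the implied-length versus explicit-length tradeoff, the sketch below walks an option list in which a type with the high bit set carries an explicit length octet, while all other types have an implied length of one octet (as for the IP no-op and end-of-option options). The type values and layout are hypothetical, chosen only for this example.

   #include <stddef.h>

   #define OPT_END      0     /* implied length 1: end of option list   */
   #define OPT_NOP      1     /* implied length 1: padding              */
   #define OPT_EXPLICIT 0x80  /* types with high bit set carry a length */

   /* Walk a hypothetical option list of 'len' octets; returns the
    * number of octets consumed, or -1 on a malformed list. */
   int
   walk_options(const unsigned char *opt, size_t len)
   {
       size_t i = 0;

       while (i < len) {
           unsigned char type = opt[i];

           if (type == OPT_END)
               return (int)(i + 1);
           if (type & OPT_EXPLICIT) {
               /* explicit length: type, length, value octets */
               if (i + 1 >= len || opt[i + 1] < 2 || i + opt[i + 1] > len)
                   return -1;           /* truncated or bogus length */
               /* ... process option 'type' here ... */
               i += opt[i + 1];
           } else {
               /* implied length of one octet (e.g., no-op) */
               i += 1;
           }
       }
       return (int)i;
   }

Note how an unknown implied-length type cannot be skipped safely, while an unknown explicit-length option can: this is the backwards-compatibility argument made above.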
3.1 Duplex or Simplex?

In terms of information flow, protocols can be roughly divided into three categories:

1. For one instance of a protocol, packets travel only in one direction; i.e., the receiver has no way to directly influence the sender. UDP is an example of such a protocol.

2. While data only travels in one direction, the receiver can send back control packets, for example, to accept or reject a connection, or request retransmission. ST-II in its standard simplex mode is an example; TCP is symmetric (see next item), but during a file transfer it typically operates in this mode, where one side sends data and the receiver of the data returns acknowledgments.

3. The protocol is fully symmetric during the data transfer phase, with user data and control information travelling in both directions. TCP is a symmetric protocol.

Note that bidirectional data flow can usually be simulated by two or more one-directional data flows in opposite directions. However, if the data sinks need to transmit control information to the source, a decoupled stream in the reverse direction will not do without additional machinery to bridge the gap between the two protocol state machines.

For most of the anticipated applications of a real-time transport protocol, one-directional data flow appears sufficient. Also, in general, bidirectional flows may be difficult to maintain in the one-to-many settings commonly found in conferences. Real-time requirements combined with network latency make achieving reliability through retransmission difficult, eliminating another reason for a bidirectional communication channel. Thus, we will focus only on control flow from the receiver of a data flow to its sender. For brevity, we will refer to packets of this control flow as reverse control packets.

There are at least two areas within multimedia conferences where a receiver needs to communicate control information back to the source. First, the sender may want or need to know how well the transmission is proceeding, as traditional feedback through acknowledgments is missing (and usually infeasible due to acknowledgment implosion). Secondly, the receiver should be able to request a selective update of its state, for example, to obtain missing image blocks after joining an on-going conference. Note that for both uses, unicast rather than multicast is appropriate.

Three approaches allowing the sender to distinguish reverse control packets from data packets are compared here:

sender port equals reverse port, marked packet: The same port number is used both for data and return control messages. Packets then have to be marked to allow distinguishing the two. Either the presence of certain options would indicate a reverse control packet, or the options themselves would be interpreted as reverse control information, with the rest of the packet treated as regular data. The latter approach appears to be the most flexible and symmetric, and is similar in spirit to transport protocols with piggy-backed acknowledgments as in TCP. Also, since several conferences with different multicast addresses may be using the same port number, the receiver has to include the multicast address in its reverse control messages. As a final identification, the control packets have to bear the flow identifier they belong to.
The scheme has the grave disadvantage that every application on a host has to receive the reverse control messages and decide whether they involve a flow it is responsible for.

single reverse port: Reverse control packets for all flows use a single port that differs from the data port. Since the type of the packet (control vs. data) is identified by the port number, only the multicast address and flow number still need to be included, without a need for a distinguishing packet format. Adding a port means that port negotiation is somewhat more complicated; also, as in the first scheme, the application still has to demultiplex incoming control messages.

different reverse port for each flow: This method requires that each source makes it known to all receivers on which port it wishes to receive reverse control messages. Demultiplexing based on flow and multicast address is no longer necessary. However, each participant sending data and expecting return control messages has to communicate the port number to all other participants. Since the reverse control port number should remain constant throughout the conference (except after application restarts), a periodic dissemination of that information is sufficient. Distributing the port information has the advantage that it gives applications the flexibility to designate only certain flows as potential recipients of reverse control information. Unfortunately, the delay in acquiring the reverse control port number when joining an on-going conference may make one of the more interesting uses of a reverse control channel difficult to implement, namely the request by a new arrival to the sender to transmit the complete current state (e.g., image) rather than changes only.

3.2 Framing

To satisfy the goal of transport independence, we cannot assume that the lower layer provides framing. (Consider TCP as an example; it would probably not be used for real-time applications except possibly on a local network, but it may be useful in distributing recorded audio or video segments.) It may also be desirable to pack several RTPDUs into a single TPDU. The obvious solution is to provide for an optional message length prefixed to the actual packet. If the underlying protocol does not provide message delineation, both sender and receiver would know to use the message length. If used to carry multiple RTPDUs, all participants would have to arrive at a mutual agreement as to its use. A 16-bit field should cover most needs, but appears to break the 4-byte alignment for the rest of the header. However, an application would read the message length first and then copy the appropriate number of bytes into a buffer, suitably aligned; a sketch of such a reader appears below.
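The following sketch shows how a receiver might recover RTPDU boundaries from a stream transport such as TCP, assuming a 16-bit length prefix in network byte order. The helper read_full() (which would loop over read() until the requested count arrives) and the buffer size are assumptions of this example, not part of the protocol.

   #include <sys/types.h>
   #include <unistd.h>
   #include <stdint.h>
   #include <arpa/inet.h>   /* ntohs() */

   extern ssize_t read_full(int fd, void *buf, size_t n);
                            /* assumed helper: loops over read() */

   /* Read one length-prefixed RTPDU from a stream; returns its
    * length, 0 on orderly close, -1 on error or oversized PDU. */
   ssize_t
   read_rtpdu(int fd, unsigned char *buf, size_t bufsize)
   {
       uint16_t netlen;
       size_t   len;

       if (read_full(fd, &netlen, sizeof(netlen)) != sizeof(netlen))
           return 0;            /* connection closed or short read */
       len = ntohs(netlen);     /* prefix is in network byte order */
       if (len > bufsize)
           return -1;           /* PDU larger than our buffer */
       if (read_full(fd, buf, len) != (ssize_t)len)
           return -1;           /* truncated PDU */
       return (ssize_t)len;     /* buf now holds one aligned RTPDU */
   }

Reading the two-byte prefix separately is what allows the PDU itself to land at the start of a suitably aligned buffer, as noted above.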
3.3 Version Identification

Humility suggests that we anticipate that we may not get the first iteration of the protocol right. In order to avoid "flag days" where everybody shifts to a new protocol, a version identifier could ensure continued interoperability. Alternatively, a new port could be used, as long as only one port (or at most a few ports) is used for all media. The difficulty in interworking between the current vat and NVP protocols further affirms the desirability of a version identifier. However, the version identifier can be anticipated to be the most static of all proposed header fields. Since the length of the header and the location and meaning of the option length field may be affected by a version change, encoding the version within an optional field is not feasible. Putting the version number into the control protocol packets would make RTCP mandatory and would make rapid scanning of conferences significantly more difficult. vat currently offers a 2-bit version field, while this capability is missing from NVP. Given the low bit usage and the utility of version fields in other contexts (IP, ST-II), it may be prudent to include a version identifier. To be useful, any version field must be placed at the very beginning of the header. Assigning an initial version value of one to RTP allows interoperability with the current vat protocol.

3.4 Conference Identification

A conference identifier (conference ID) could serve two mutually exclusive functions: providing another level of demultiplexing or a means of logically aggregating flows with different network addresses and port numbers. vat specifies a 16-bit conference identifier.

3.4.1 Demultiplexing

Demultiplexing by RTP allows one association characterized by destination address and port number to carry several distinct conferences. However, this appears to be necessary only if the number of conferences exceeds the demultiplexing capability available through (multicast) addresses and port numbers. Efficiency arguments suggest that combining several conferences or media within a single multicast group is not desirable. Combining several conferences or media within a single multicast address reduces the bandwidth efficiency afforded by multicasting if the sets of destinations are different. Also, applications that are not interested in a particular conference or not capable of dealing with a particular medium are still forced to handle the packets delivered for that conference or medium. Consider as an example two separate applications, one for audio, one for video. If both share the same multicast address and port, being differentiated only by the conference identifier, the operating system has to copy each incoming audio and video packet into two application buffers and perform a context switch to both applications, only to have one immediately discard the incoming packet. Given that application-layer demultiplexing has strong negative efficiency implications and given that multicast addresses are not an extremely scarce commodity, there seems to be no reason to burden every application with maintaining and checking conference identifiers for the purpose of demultiplexing. However, if this protocol is to be used as a transport protocol, demultiplexing capability is required. It is also not recommended to use a conference identifier to distinguish between different encodings, as it would be difficult for the application to decide whether a new conference identifier means that a new conference has arrived or simply that all participants should be moved to the new conference with a different encoding. Since the encoding may change for some but not all participants, we could find ourselves breaking a single logical conference into several pieces, with a fairly elaborate control mechanism to decide which conferences logically belong together.
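The efficiency argument above relies on the host demultiplexing packets by multicast group and port before they reach the application. The sketch below shows the standard IP multicast socket setup that accomplishes this, so that an audio tool receives only packets for the group it has joined; the group address and port are made-up example values, and error handling is omitted.

   #include <string.h>
   #include <sys/types.h>
   #include <sys/socket.h>
   #include <netinet/in.h>
   #include <arpa/inet.h>

   /* Join one medium's multicast group; the kernel then delivers
    * only packets addressed to this (group, port) pair to this
    * socket.  Group and port are hypothetical example values. */
   int
   join_audio_group(void)
   {
       int sock = socket(AF_INET, SOCK_DGRAM, 0);
       struct sockaddr_in sin;
       struct ip_mreq mreq;

       memset(&sin, 0, sizeof(sin));
       sin.sin_family      = AF_INET;
       sin.sin_addr.s_addr = htonl(INADDR_ANY);
       sin.sin_port        = htons(3456);            /* example port  */
       bind(sock, (struct sockaddr *)&sin, sizeof(sin));

       mreq.imr_multiaddr.s_addr = inet_addr("224.2.0.1"); /* example */
       mreq.imr_interface.s_addr = htonl(INADDR_ANY);
       setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                  &mreq, sizeof(mreq));
       return sock;
   }

With one group per conference and medium, an application never sees traffic it would only discard.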
3.4.2 Aggregation

Particularly within a network with a wide range of capacities, different multicast groups for each media component of a conference allow the media distribution to be tailored to the network bandwidths and end-system capabilities. It appears useful, however, to have a means of identifying groups that logically belong together, for example for purposes of time synchronization. A conference identifier used in this manner would have to be globally unique. It appears that such logical connections would better be identified as part of the higher-layer control protocol by identifying all multicast addresses belonging to the same logical conference, thereby avoiding the assignment of globally unique identifiers.

3.5 Media Encoding Identification

This field plays a similar role to the protocol field in data link or network protocols, indicating the next higher layer (here, the media decoder) that the data is meant for. For RTP, this field would indicate the audio, video or other media encoding. In general, the number of distinct encodings should be kept as small as possible to increase the chance that applications can interoperate. A new encoding should only be recognized if it significantly enhances the range of media quality or the types of networks conferences can be conducted over. The unnecessary proliferation of encodings can be reduced by making reference implementations of standard encoders and decoders widely available. It should be noted that encodings may not be enumerable as easily as, say, transport protocols. A particular family of related encoding methods may be described by a set of parameters, as discussed below in the sections on audio and video encoding.

Encodings may change during the duration of a conference. This may be due to changed network conditions, changed user preference or because the conference is joined by a new participant that cannot decode the current encoding. If the information necessary for the decoder is conveyed out-of-band, some means of indicating when the change is effective needs to be incorporated. Also, the indication that the encoding is about to change must reach all receivers reliably before the first packet employing the new encoding. Each receiver needs to track pending changes of encodings and check for every incoming packet whether an encoding change is to take effect with this packet. Conveying media encodings rapidly is also important to allow scanning of conferences or broadcast media. Note that it is not necessary to convey the whole encoder description, with all parameters; an index into a table of well-known encodings is probably preferable. An index would also make it easier to detect whether the encoding has changed. Alternatively, a directory or announcement service could provide encoding information for on-going conferences, without carrying the information in every packet. This may not be sufficient, however, unless all participants within a conference use the same encoding. As soon as the encoding information is separated from the media data, a synchronization mechanism has to be devised that ensures that sender and receiver interpret the data in the same manner after the out-of-band information has been updated.
There are at least two approaches to indicating media encoding, either in-band or out-of-band:

conference-specific: Here, the media identifier is an index into a table designating the approved or anticipated encodings (together with any particular version numbers or other parameters) for a particular conference or user community. The table can be distributed through RTCP, a higher-layer conference control protocol, a conference announcement service or some other out-of-band means. Since the number of encodings used during a single conference is likely to be small, the field width in the header can likewise be small. Also, there is no need to agree on an Internet-wide list of encodings. It should be noted that conveying the table of encodings through RTCP forces the application to maintain a separate mapping table for each sender, as there can be no guarantee that all senders will use the same table. Since the control protocol proposed here is unreliable, changing the meaning of encoding indices dynamically is fraught with possibilities for misinterpretation and lost data unless this mapping is carried in every packet.

global: Here, the media identifier is an index into a global table of encodings. A global list reduces the need for out-of-band information. Transmitting the parameters associated with an encoding may be difficult, however, if it has to be done within the header space constraints of per-packet signaling. To make detecting coder mismatches easier, encodings for all media should be drawn from the same numbering space. To facilitate experimentation with new encodings, a part of any global encoding numbering space should be set aside for experimental encodings, with numbers agreed upon within the community experimenting with the encoding, with no Internet-wide guarantee of uniqueness.
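An encoding table of either kind might be represented at the receiver along the lines of the sketch below. The structure fields, the entry names and the sample entries are assumptions for illustration, not a proposed format.

   /* Hypothetical mapping from the small in-band encoding index to
    * the parameters a decoder needs.  Fields and entries are
    * illustrative only. */
   struct encoding {
       const char *name;      /* registered encoding name            */
       long sample_rate;      /* samples per second                  */
       int  channels;         /* 1 = mono, 2 = stereo, ...           */
       long bps;              /* nominal bits per second per channel */
   };

   /* Table distributed out-of-band (e.g., via conference control);
    * the in-band header field carries only the index. */
   static const struct encoding enc_table[] = {
       { "PCMU",   8000, 1, 64000 },   /* mu-law PCM, G.711 */
       { "PCMA",   8000, 1, 64000 },   /* A-law PCM, G.711  */
       { "LPC10E", 8000, 1,  2400 },   /* FS 1015 LPC-10E   */
   };

   /* Look up the parameters for an in-band encoding index; a change
    * of index is also an easy way to detect an encoding change. */
   const struct encoding *
   lookup_encoding(int index)
   {
       if (index < 0 ||
           index >= (int)(sizeof(enc_table) / sizeof(enc_table[0])))
           return 0;   /* unknown: discard or report out-of-band */
       return &enc_table[index];
   }

For the conference-specific approach, one such table would be kept per sender, as argued above.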
3.5.1 Audio Encodings

Audio data is commonly characterized by three independent descriptors: the encoding (the translation of one or more audio samples into a channel symbol), the number of channels (mono, stereo, ...) and the sampling rate. Theoretically, sampling rate and encoding are (largely) independent. We could, for example, apply mu-law encoding to any sampling rate even though it is traditionally used with a rate of 8,000 Hz. In practical terms, it may be desirable to limit the combinations of encoding and sampling rate to the values the encoding was designed for.(2) Channel counts between 1 and 6 should be sufficient even for surround sound.

------------------------------
2. Given the wide availability of mu-law encoding and its low overhead, using it with a sampling rate of 16,000 or 32,000 Hz might be quite appropriate for high-quality audio conferences, even though there are other encodings, such as G.722, specifically designed for such applications. Note that the signal-to-noise ratio of mu-law encoding is about 38 dB, equivalent to an AM receiver. The "telephone quality" associated with G.711 is due primarily to the limitation in frequency response to the 200 to 3500 Hz range.

The audio encodings listed in Table 1 appear particularly interesting, even though the list is by no means exhaustive and does not include some experimental encodings currently in use, for example a non-standard form of LPC. The bit rate is shown per channel. k samples/s, b/sample and kb/s denote kilosamples per second, bits per sample and kilobits per second, respectively. If sampling rates are to be specified separately, the values of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other values (11.025 and 22.05 kHz) are supported on some workstations (the Silicon Graphics audio hardware and the Apple Macintosh, for example). Clearly, little is to be gained by allowing arbitrary sampling rates, as conversion particularly between rates not related by simple fractions is quite cumbersome and processing-intensive [9].

   Org.      Name     k samples/s  b/sample   kb/s   description
   ----------------------------------------------------------------------
   CCITT     G.711        8.0         8         64   mu-law PCM
   CCITT     G.711        8.0         8         64   A-law PCM
   CCITT     G.721        8.0         4         32   ADPCM
   Intel     DVI          8.0         4         32   ADPCM
   CCITT     G.723        8.0         3         24   ADPCM
   CCITT     G.726                                   ADPCM
   CCITT     G.727                                   ADPCM
   NIST/GSA  FS 1015      8.0                  2.4   LPC-10E
   NIST/GSA  FS 1016      8.0                  4.8   CELP
   NADC      IS-54        8.0                 7.95   North American Digital
                                                     Cellular, VSELP
   CCITT     G.728        8.0                   16   LD-CELP
   GSM                    8.0                   13   RPE-LTP
   CCITT     G.722       16.0                   64   7 kHz, SB-ADPCM
   ISO       11172-3                           256   MPEG audio
                         32.0        16        512   DAT
                         44.1        16      705.6   CD, DAT playback
                         48.0        16        768   DAT record

   Table 1: Standardized and common audio encodings

3.5.2 Video Encodings

Common video encodings are listed in Table 2. Encodings with tunable rate can be configured for different rates, but produce a fixed-rate stream. The average bit rate produced by variable-rate codecs depends on the source material.

   Org.         Name      Rate                Remarks
   ---------------------------------------------------------------
   CCITT        JPEG      tunable
   CCITT        MPEG      variable, tunable
   CCITT        H.261     tunable             p x 64 kb/s
   Bolter                 variable, tunable
   PictureTel             ??
   Cornell U.   CU-SeeMe  variable
   Xerox PARC   nv        variable, tunable
   BBN          DVC       variable, tunable   block differences

   Table 2: Common video encodings

3.6 Playout Synchronization

A major purpose of RTP is to provide the support for various forms of synchronization, without necessarily performing the synchronization itself. We can distinguish three kinds of synchronization:

playout synchronization: The receiver plays out the medium a fixed time after it was generated at the source (end-to-end delay). This end-to-end delay may vary from synchronization unit to synchronization unit. In other words, playout synchronization assures that a constant-rate source at the sender again becomes a constant-rate source at the receiver, despite delay jitter in the network.

intra-media synchronization: All receivers play the same segment of a medium at the same time. Intra-media synchronization may be needed during simulations and wargaming.

inter-media synchronization: The timing relationship between several media sources is reconstructed at the receiver. The primary example is the synchronization between audio and video (lip-sync). Note that different receivers may experience different delays between the media generation time and their playout time.

Playout synchronization is required for most media, while intra-media and inter-media synchronization may or may not be implemented. In connection with playout synchronization, we can group packets into playout units, a number of which in turn form a synchronization unit. More specifically, we define:

synchronization unit: A synchronization unit consists of one or more playout units (see below) that, as a group, share a common fixed delay between generation and playout of each part of the group. The delay may change at the beginning of such a synchronization unit.
The most common synchronization units are talkspurts for voice and frames for video transmission.

playout unit: A playout unit is a group of packets sharing a common timestamp. (Naturally, packets whose timestamps are identical due to timestamp wrap-around are not considered part of the same playout unit.) For voice, the playout unit would typically be a single voice segment, while for video a video frame could be broken down into subframes, each consisting of packets sharing the same timestamp and ordered by some form of sequence number.

Two concepts related to synchronization and playout units are absolute and relative timing. Absolute timing maintains a fixed timing relationship between sender and receiver, while relative timing ensures that the spacing between packets at the sender is the same as that at the receiver, measured in terms of the sampling clock. Playout units within the synchronization unit maintain relative timing with respect to each other; absolute timing is undesirable if the receiver clock runs at a (slightly) different rate than the sender clock.

Most proposed synchronization methods require a timestamp. The timestamp has to have a sufficient range that wrap-arounds are infrequent. It is desirable that the range exceed the maximum expected inactive (e.g., silence) period. Otherwise, if the silence period lasts a full timestamp range, the first packet of the next talkspurt would have a timestamp one larger than the last packet of the current talkspurt. In that case, the new talkspurt could not be readily discerned if the difference in increment between timestamps and sequence numbers is used to detect a new talkspurt. The 10-bit timestamp used by NVP is generally agreed to be too small as it wraps around after only 20.5 s (for 20 ms audio packets), while a 32-bit timestamp should serve all anticipated needs, even if the timestamp is expressed in units of samples or other sub-packet entities.

A timestamp may be useful not only at the transport, but also at the network layer, for example, for scheduling packets based on urgency. The playout timestamp would be appropriate for such a scheduling timestamp, as it would better reflect urgency than a network-level departure timestamp. Thus, it may make sense to use a network-level timestamp such as the one provided by ST-II at the transport layer.

3.6.1 Synchronization Methods

The necessary header components are determined to some extent by the method of synchronizing sender and receivers. In this section, we formally describe some of the popular approaches, building on the exposition and terminology of Montgomery [10].

We define a number of variables describing the synchronization process. In general, the subscript n represents the nth packet in a synchronization unit, n = 1, 2, .... Let a_n, d_n, p_n and t_n be the arrival time, variable delay, playout time and generation time of the nth packet, respectively. Let o denote the fixed delay from sender to receiver. Finally, d_max describes the estimated maximum variable delay within the network. The estimate is typically chosen in such a way that only a very small fraction (on the order of 1%) of packets take more than o + d_max time units. For best performance under changing network load conditions, the estimate should be refined based on the actual delays experienced.
The variable delay in a network consists of queueing and media access delays, while propagation and processing delays make up the fixed delay. Additional end-to-end fixed delay is unavoidably introduced by packetization; the non-real-time nature of most operating systems adds a variable delay both at the transmitting and receiving end. All variables are expressed in the same unit of time, be that seconds or samples, for example. For simplicity, we ignore that the sender and receiver clocks may not run at exactly the same speed.

The relationship between the variables is depicted in Fig. 2. The arrows in the figure indicate the transmission of the packet across the network, occurring after the packetization delay. The packet with sequence number 5 misses the playout deadline and, depending on the algorithm used by the receiver, is either dropped or treated as the beginning of a new talkspurt.

   Figure only available in PostScript version of document.

   Figure 2: Playout synchronization variables

Given the above definitions, the relationship

   a_n = t_n + d_n + o                                             (1)

holds for every packet. For brevity, we also define l_n as the "laxity" of packet n, i.e., the time p_n - a_n between arrival and playout. Note that it may be difficult to measure a_n with resolution below a packetization interval, particularly if the measurement is to be in units related to the playback process (e.g., samples).

All synchronization methods differ only in how much they delay the first packet of a synchronization unit. All packets within a synchronization unit are played out based on the position of the first packet:

   p_n = p_{n-1} + (t_n - t_{n-1})   for n > 1

Three synchronization methods are of interest. We describe below how they compute the playout time for the first packet in a synchronization unit and what measurement is used to update the delay estimate d_max.

blind delay: This method assumes that the first packet in a talkspurt experiences only the fixed delay, so that the full d_max has to be added to allow for other packets within the talkspurt experiencing more delay:

   p_1 = a_1 + d_max                                               (2)

The estimate for the variable delay is derived from measurements of the laxity l_n, so that the new estimate after n packets is computed as d_max,n = f(l_1, ..., l_n), where the function f(.) is a suitably chosen smoothing function. Note that blind delay does not require timestamps to determine p_1, only an indication of the beginning of a synchronization unit. Timestamps may be required to compute p_n, however, unless t_n - t_{n-1} is a known constant.

absolute timing: If the packet carries a timestamp measured in time units known to the receiver, we can improve our determination of the playout point:

   p_1 = t_1 + o + d_max

This is, clearly, the best that can be accomplished. Here, instead of estimating d_max, we estimate o + d_max as some function of p_n - t_n. For this computation, it does not matter whether p and t are measured with clocks sharing a common starting point.

added variable delay: Each node adds the variable delay experienced within it to a delay accumulator within the packet, yielding d_n:

   p_1 = a_1 - d_1 + d_max

From Eq. 1, it is readily apparent that absolute timing and added variable delay yield the same playout time. The estimate for d_max is based on the measurements of d. Given a clock with suitably high resolution, these estimates can be better than those based on the difference between a and p; however, this method requires that all routers can recognize RTP packets. Also, determining the residence time within a router may not be feasible.

In summary, absolute timing is to be preferred due to its lower delays compared to blind delay, while synchronization using added variable delays is currently not feasible within the Internet (it is, however, used for G.764).
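A minimal sketch of the two feasible methods, transcribing the formulas above; the smoothing of d_max and all bookkeeping around it are omitted, and the function names are invented for the example. All times are in the same unit (e.g., samples).

   /* blind delay: no timestamp needed, only the start-of-unit flag */
   long
   playout_first_blind(long a1, long d_max)
   {
       return a1 + d_max;                  /* Eq. 2 */
   }

   /* absolute timing: timestamp t1 in units known to the receiver;
    * o_dmax is the current estimate of o + d_max */
   long
   playout_first_absolute(long t1, long o_dmax)
   {
       return t1 + o_dmax;                 /* p_1 = t_1 + (o + d_max) */
   }

   /* all later packets keep relative timing within the unit */
   long
   playout_next(long p_prev, long t_n, long t_prev)
   {
       return p_prev + (t_n - t_prev);     /* p_n = p_{n-1} + (t_n - t_{n-1}) */
   }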
3.6.2 Detection of Synchronization Units

The receiver must have a way of readily detecting the beginning of a synchronization unit, as the playout scheduling of the first packet in a synchronization unit differs from that in the remainder of the unit. This detection has to work reliably even with packet reordering; for example, reordering at the beginning of a talkspurt is particularly likely since common silence detection algorithms send a group of stored packets at the beginning of the talkspurt to prevent front clipping. Two basic methods have been proposed:

timestamp and sequence number: The sequence number increases by one with each packet transmitted, while the timestamp reflects the total time covered, measured in some appropriate unit. A packet is declared to start a new synchronization unit if (a) it has the highest timestamp and sequence number seen so far (within this wrap-around cycle) and (b) the difference in timestamp values (converted into a packet count) between this and the previous packet is greater than the difference in sequence number between those two packets. (A sketch of this rule appears at the end of this section.) This approach has the disadvantage that it may lead to erroneous packet scheduling with blind delay if packets are reordered. An example is shown in Table 3. In the example, the playout delay is set at 50 time units for blind timing and 550 time units for absolute timing. The packet inter-generation time is 20 time units.

                       blind timing               absolute timing
                 no reordering  with reordering
   seq. timestamp arrival playout arrival playout arrival playout
   200    1020     1520    1570    1520    1570    1520    1570
   201    1040     1530    1590    1530    1590    1530    1590
   202    1220     1720    1770    1725    1750    1725    1770
   203    1240     1725    1790    1720    1770    1720    1790
   204    1260     1792    1810    1791    1790    1791    1810

   Table 3: Example where out-of-order arrival leads to packet loss
   for blind timing

More significantly, detecting synchronization units requires that the playout mechanism can translate timestamp differences into packet counts, so that it can compare timestamp and sequence number differences. If the timespan "covered" by a packet changes with the encoding or even varies for each packet, this may be cumbersome. NVP provides the timestamp/sequence number combination for detecting talkspurts. The following method avoids these drawbacks, at the cost of one additional header bit.

synchronization bit: The beginning of a synchronization unit is indicated by setting a synchronization bit within the header. The receiver, however, can only use this information if no later packet has already been processed. Thus, packet reordering at the beginning of a talkspurt leads to missed opportunities for delay adjustment. With the synchronization bit, a sequence number is not necessary to detect the beginning of a synchronization unit, but a sequence number remains useful for detecting packet loss and ordering packets bearing the same timestamp. With just a timestamp, it is impossible for the receiver to get an accurate count of the number of packets that it should have received. While gaps within a talkspurt give some indication of packet loss, the receiver cannot tell what part of the tail of a talkspurt has been transmitted. (Example: consider the talkspurts with time stamps 100, 101, 102, 110, 111. Packets with timestamps 100 and 110 have the synchronization bit set. The receiver has no way of knowing whether it was supposed to have received two talkspurts with a total of five packets, or two or more talkspurts with up to 12 packets.) The synchronization bit is used by vat, without a sequence number. A special sequence number, as used by G.764, is equivalent.
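The timestamp-and-sequence-number rule might be coded as sketched below, assuming for simplicity that every packet covers a fixed number of timestamp units (ts_per_packet), so that timestamp differences translate directly into packet counts; as noted above, this translation is what becomes cumbersome when the timespan per packet varies. Wrap-around handling reduces to unsigned modulo arithmetic.

   #include <stdint.h>

   /* State kept per sender; 16-bit sequence numbers wrap naturally
    * in unsigned arithmetic. */
   struct sync_state {
       uint16_t max_seq;   /* highest sequence number seen */
       uint32_t max_ts;    /* timestamp of that packet     */
   };

   /* Returns 1 if this packet starts a new synchronization unit
    * under conditions (a) and (b) above.  ts_per_packet is the
    * (assumed fixed) timestamp increment per packet. */
   int
   starts_sync_unit(struct sync_state *s, uint16_t seq, uint32_t ts,
                    uint32_t ts_per_packet)
   {
       uint16_t dseq = (uint16_t)(seq - s->max_seq);  /* mod 2^16 */
       uint32_t dts  = ts - s->max_ts;                /* mod 2^32 */
       int new_unit  = 0;

       /* (a) highest sequence number and timestamp seen so far */
       if (dseq >= 1 && dseq < 0x8000) {
           /* (b) timestamp advanced by more packets than the
            * sequence number did: a gap in time, not in packets */
           if (dts / ts_per_packet > dseq)
               new_unit = 1;
           s->max_seq = seq;
           s->max_ts  = ts;
       }
       return new_unit;
   }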
3.6.3 Interpretation of Synchronization Bit

Two possibilities for implementing a synchronization bit are discussed here.

start of synchronization unit: The first packet in a synchronization unit is marked with a set synchronization bit. With this use of the synchronization bit, the receiver detects the beginning of a synchronization unit with the following simple algorithm:

   if synchronization bit = 1 and
      current sequence number > maximum sequence number seen so far then
     this packet starts a new synchronization unit
   endif
   if current sequence number > maximum sequence number then
     maximum sequence number := current sequence number
   endif

Comparisons and arithmetic operations are modulo the sequence number range.

end of synchronization unit: The last packet in a synchronization unit is marked. As pointed out elsewhere, this information may be useful for initiating appropriate fill-in during silence periods and for starting to process a completed video frame. If a voice silence detector uses no hangover, it may have difficulty deciding which is the last packet in a talkspurt until it judges the first packet to contain no speech. The detection of a new synchronization unit by the receiver is only slightly more complicated than with the previous method:

   if sync_flag then
     if sequence number >= sync_seq then
       sync_flag := FALSE
     endif
     if sequence number = sync_seq then
       signal beginning of synchronization unit
     endif
   endif
   if synchronization bit = 1 then
     sync_seq := sequence number + 1
     sync_flag := TRUE
   endif

By changing the second comparison to 'if sequence number >= sync_seq', a new synchronization unit is detected even if packets at the beginning of the synchronization unit are reordered. As reordering at the beginning of a synchronization unit is particularly likely, for example when transmitting the packets preceding the beginning of a talkspurt, this should significantly reduce the number of missed talkspurt beginnings. A C rendering of this algorithm, with the modulo comparisons made explicit, is sketched below.
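The pseudocode above glosses over the modulo comparisons; a sketch with 16-bit sequence numbers (an assumed field width) might look as follows. It implements the reordering-tolerant '>=' variant discussed above.

   #include <stdint.h>

   struct eos_state {
       int      sync_flag;  /* expecting the start of a new unit      */
       uint16_t sync_seq;   /* expected first sequence number of unit */
   };

   /* seq_ge: a >= b in the modulo-2^16 sense, i.e., a is not more
    * than half the sequence space behind b. */
   static int
   seq_ge(uint16_t a, uint16_t b)
   {
       return (uint16_t)(a - b) < 0x8000;
   }

   /* Process one packet under end-of-unit marking; returns 1 when
    * the packet begins a new synchronization unit. */
   int
   on_packet(struct eos_state *s, uint16_t seq, int sync_bit)
   {
       int begins = 0;

       if (s->sync_flag && seq_ge(seq, s->sync_seq)) {
           s->sync_flag = 0;
           begins = 1;        /* first packet at or past sync_seq */
       }
       if (sync_bit) {        /* this packet ends the current unit */
           s->sync_seq  = (uint16_t)(seq + 1);
           s->sync_flag = 1;
       }
       return begins;
   }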
3.6.4 Interpretation of Timestamp

Three proposals for the interpretation of the timestamp have been advanced:

packet or frame interval: Each packetization or (video/audio) frame interval increments the timestamp. This approach is very efficient in terms of processing and bit use, but cannot be used without out-of-band information if the time interval of media ``covered'' by a packet varies from packet to packet. This occurs, for example, with variable-rate encoders or if the packetization interval is changed during a conference. This interpretation of the timestamp is assumed by NVP, which defines a frame as a block of PCM samples or a single LPC frame.

Note that there is no inherent necessity that all participants within a conference use the same packetization interval. Local implementation considerations such as available clocks may suggest different intervals. As another example, consider a conference with feedback. For the lecture audio, a long packetization interval may be desirable to better amortize packet headers. For side chats, delays are more important, suggesting a shorter packetization interval.(3)

sample: This method simply counts samples, allowing a direct translation between timestamp and playout buffer insertion point. It is just as easily computable as the per-packet timestamp. However, for some media and encodings(4), it may not be quite clear what a sample is. Also, some care must be taken at the receiver if incoming streams use different sampling rates. This method is currently used by vat.

------------------------------
3. Nevot, for example, allows each participant to have a different packetization interval, independent of the packetization interval used by Nevot for its outgoing audio. Only the packetization interval for outgoing audio must be the same for all conferences this Nevot participates in.
4. Examples include frame-based encodings such as LPC and CELP. Here, given that these encodings are based on 8,000 Hz input samples, the preferred interpretation would probably be in terms of audio samples, not frames, as samples would be used for reconstruction and mixing.

subset of NTP timestamp: 16 bits encode seconds relative to midnight (0 hours), January 1, 1900 (modulo 65536) and 16 bits encode fractions of a second, with a resolution of approximately 15.2 microseconds, which is smaller than any anticipated audio sampling or video frame interval. This timestamp is the same as the middle 32 bits of the 64-bit NTP timestamp [11]. It wraps around every 18.2 hours. If it should be desirable to reconstruct absolute transmission time at the receiver for logging or recording purposes, it should be easy to determine the most significant 16 bits of the timestamp. Otherwise, wrap-arounds are not a significant problem as long as they occur 'naturally', i.e., at a 16 or 32 bit boundary, so that explicit checking during arithmetic operations is not required. Also, since the translation mechanism would probably treat the timestamp as a single integer without accounting for its division into whole and fractional parts, the exact bit allocation between seconds and fractions thereof is less important. However, the 16/16 approach simplifies extraction from a full NTP timestamp.

The NTP-like timestamp has the disadvantage that its resolution does not map into any of the common sample intervals. Thus, there is a potential uncertainty of one sample at the receiver as to where to place the beginning of the received packet, resulting in the equivalent of a one-sample slip. CCITT recommendation G.821 postulates a mean slip rate of less than 1 slip in 5 hours, with degraded but acceptable service for less than 1 slip in 2 minutes. Tests with appropriate rounding conducted by the author showed that this most likely does not cause problems. In any event, a double-precision floating point multiplication is needed to translate between this timestamp and the integer sample count available on transmission and required for playout.(5)

------------------------------
5. The multiplication with an appropriate factor can be approximated to the desired precision by an integer multiplication and division, but multiplication by a floating point value is actually much faster on some modern processors.
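By way of illustration, this translation might be written as follows; the code is a sketch only (wraparound handling is omitted), and the 8,000 Hz rate is merely an example:

    #include <stdint.h>

    #define TS_UNITS_PER_SEC 65536.0   /* 16-bit binary fraction */

    /* Convert between a sample count and 16/16 timestamp units with
     * a single double-precision multiplication; rounding keeps the
     * placement uncertainty to at most one sample. */
    static uint32_t samples_to_ts(uint32_t samples, double rate)
    {
        return (uint32_t)(samples * (TS_UNITS_PER_SEC / rate) + 0.5);
    }

    static uint32_t ts_to_samples(uint32_t ts, double rate)
    {
        return (uint32_t)(ts * (rate / TS_UNITS_PER_SEC) + 0.5);
    }

Computing the timestamp from the running sample count in one multiplication, rather than repeatedly adding a per-packet increment, also avoids the cumulative round-off errors mentioned below in connection with sample time.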
It also needs to be decided whether the timestamp should reflect real time or sample time. A real time timestamp is defined to track wallclock time, plus or minus a constant offset. Sample time increases by the nominal sampling interval for each sample. The two clocks in general do not agree, since the clock source used for sampling will in all likelihood be slightly off the nominal rate. For example, typical crystals without temperature control are only accurate to 50 to 100 ppm (parts per million), yielding a potential drift of up to 0.36 seconds per hour between the sampling clock and wallclock time.

It has been suggested to use timestamps relative to the beginning of the first transmission from a source. This makes correlation between media from different participants difficult and seems to have no technical or implementation advantages, except for avoiding wrap-around during most conferences. As pointed out above, that seems to be of little benefit. Clearly, the reliability of wallclock-synchronized timestamps depends on how closely the system clocks are synchronized, but that does not argue for giving up potential real-time synchronization in all cases. Using real time rather than sample time makes it easier to synchronize different media and to compensate for slow or fast sample clocks.

Note that it is neither desirable nor necessary to obtain the wallclock time when each packet was sampled. Rather, the sender determines the wallclock time at the beginning of each synchronization unit (e.g., a talkspurt for voice, a frame for video) and adds the nominal sample clock duration for all packets within the talkspurt to arrive at the timestamp value carried in packets. The real time at the beginning of a talkspurt is determined by estimating the true sample rate for the duration of the conference. The sample rate estimate has to be accurate enough to allow placing the beginning of a talkspurt to within, say, at most 50 to 100 ms; otherwise, the lack of synchronization may be noticeable, delay computations are confused and successive talkspurts may be concatenated.

Estimating the true sampling instant to within a few milliseconds is surprisingly difficult for current operating systems. The sample rate r can be estimated as

    r = (s + q) / (t - t0)

Here, t is the current time, t0 the time at which the first sample was acquired, s the number of samples read, and q the number of samples ready to be read (queued) at time t. Let p denote the number of samples in a packet. The timestamp in the synchronization packet reflects the sampling instant of the first sample of that packet and is computed as

    t - (p + q) / r

Unfortunately, only s and p are known precisely. The accuracy of the estimates of t0 and t depends on how accurately the beginning of sampling and the last reading from the audio device can be measured. There is a non-zero probability that the process gets preempted between the time the audio data is read and the instant the system clock is sampled. It also remains unclear whether indications of current buffer occupancy, if available, can be trusted. Even with increasing sample count, the absolute accuracy of the timestamp is roughly the same as the measurement accuracy of t, as differentiating with respect to t shows.
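A minimal sketch of this computation is given below; the interface is hypothetical, since how s, q and t are obtained is entirely system-dependent:

    /* Estimate the true sample rate and the sampling instant of the
     * first sample of the current packet, using the formulas above.
     * Inputs are assumed to come from system-dependent audio code. */
    typedef struct {
        double t0;       /* time the first sample was acquired (s) */
        double s;        /* total number of samples read so far    */
    } rate_estimator;

    /* t: current time (s); q: samples queued in the audio device;
     * p: number of samples in the packet about to be stamped.     */
    static double first_sample_instant(const rate_estimator *e,
                                       double t, double q, double p)
    {
        double r = (e->s + q) / (t - e->t0);   /* r = (s+q)/(t-t0) */
        return t - (p + q) / r;                /* t - (p+q)/r      */
    }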
Experiments with the SunOS audio driver showed significant variations of the estimated sample rate, with discontinuities in the computed timestamps of up to 25 ms. Kernel support is probably required for meaningful real time measurements.

Sample time increments by the sampling interval for every sample or (sub)frame received from the audio or video hardware. It is easy to determine, as long as care is taken to avoid the cumulative round-off errors incurred by simply adding the approximate packetization interval repeatedly. However, synchronization between media and end-to-end delay measurements are then no longer feasible. (Example: consider an audio and a video stream. If the audio sample clock is slightly faster than both the real clock and the video sampling clock, a video and an audio frame belonging together would be marked with different timestamps and thus be played out at different instants.)

If we choose to use sample time, the advantage of using an NTP-format timestamp disappears, as the receiver can easily reconstruct an NTP-format timestamp from the sample count if needed, but would not have to if no cross-media synchronization is required. RTCP could convey the time increment per sample in full precision. The definition of a ``sample'' will depend on the particular medium and could be an audio sample, a video frame or a voice frame (as produced by a non-waveform coder). The mapping fails if there is no time-invariant mapping between sample units and time.

It should be noted that it may not be possible to associate a meaningful notion of time with every packet. For example, if a video frame is broken into several fragments, there is no natural timestamp associated with anything but the first fragment, particularly if there is not even a sequential mapping from screen scan location into packets. Thus, any timestamp used would be purely artificial. A synchronization bit could be used in this particular case to mark the beginning of synchronization units. For packets within synchronization units, there are two possible approaches: first, we can introduce an auxiliary sequence number that is only used to order packets within a frame. Secondly, we could abuse the timestamp field by incrementing it by a single unit for each packet within the frame, thus allowing a variable number of packets per frame. The latter approach is barely workable and rather kludgy.

3.6.5 End-of-talkspurt indication

An end-of-talkspurt indication is useful to distinguish silence from lost packets. The receiver would want to replace silence by an appropriate background noise level to avoid the ``noise-pumping'' associated with silence detection. Missing packets, on the other hand, should be reconstructed from previous packets. If the silence detector makes use of hangover, the transmitter can easily set the end-of-talkspurt indicator in the last hangover packet. If talkspurts follow each other back to back, the end-of-talkspurt indicator has no effect except in the case where the first packet of a talkspurt is lost. In that case, the indicator would erroneously trigger noise fill instead of loss recovery. The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit, which is set to one for all but the last packet within a talkspurt.

3.6.6 Recommendation

Given the ease of cross-media synchronization and the media independence, the use of 32-bit 16/16 timestamps representing the middle part of the NTP timestamp is suggested.
Generally, a wallclock-based timestamp appears to be preferable to a sample-based one, but it may be only approximately realizable on some current operating systems. Inter-media synchronization to below 10 to 20 ms has to await mechanisms that can accurately determine when a particular sample was actually received by the A/D converter. Particularly with sample- or wallclock-based timestamps, a synchronization bit simplifies the detection of the beginning of a synchronization unit. Marking either the end or the beginning of a synchronization unit is roughly equivalent, with the tradeoffs discussed in section 3.6.3.

3.7 Segmentation and Reassembly

For high-bandwidth video, a single frame may not fit into the maximum transmission unit (MTU). Thus, some form of frame sequence number is needed. If possible, the same sequence number should be used for synchronization and fragmentation. Six possibilities suggest themselves:

overload the timestamp: No sequence number is used. Within a frame, the timestamp has no meaning. Since it is used for synchronization only when the synchronization bit is set, the other timestamps can simply increase by one for each packet. However, as soon as the first packet of a frame gets lost or reordered, determining positions and timing becomes difficult or impossible.

packet count: The sequence number is incremented for every packet, without regard to frame boundaries. If a frame consists of a variable number of packets, it may not be clear what position a packet occupies within the frame if packets are lost or reordered. Continuous sequence numbers make it possible to determine whether all packets for a particular frame have arrived, but only after the first packet of the next frame, distinguished by a new timestamp, has arrived.

packet count within a frame: The sequence number is reset to zero at the beginning of each frame. This approach has properties complementary to continuous sequence numbers.

packet count and first-packet sequence number: Packets use a continuously incrementing sequence number plus an option field in every packet indicating the initial sequence number within the playout unit.(6) Carrying both a continuous and a packet-within-frame count achieves the same effect.

packet count with last-packet sequence number: Packets carry a continuous sequence number plus an option in every packet indicating the last sequence number within the playout unit. This has the advantage that the receiver can readily detect when the last packet for a playout unit has been received. The transmitter may not know, however, at the beginning of a playout unit how many packets it will comprise. Also, the position within the playout unit is more difficult to determine if the initial packet and the previous frame are lost.

packet count and frame count: The sequence number counts packets, without regard to frame boundaries. A separate counter increments with each frame. Detecting the end of a frame is delayed until the first packet belonging to the next frame arrives. Also, the frame count cannot help to determine the position of a packet within a frame.

------------------------------
6. Suggested by Steve Casner.

It could be argued that encoding-specific location information should be contained within the media part, as it will likely vary in format and use from one medium to the next.
Thus, frame count, the sequence number of the last or first packet in a frame, etc. belong in the media-specific header.

The size of the sequence number field should be large enough to allow unambiguous counting of expected vs. received packets. A 16-bit sequence number would wrap around roughly every 22 minutes at a 20 ms packetization interval. Using 16 bits may also simplify modulo arithmetic.

3.8 Source Identification

3.8.1 Gateways, Reflectors and End Systems

It is necessary to be able to identify the origin of the real-time data in terms meaningful to the application. First, this is required to demultiplex sites (or sources) within the same conference. Secondly, it allows an indication of the currently active source. Currently, NVP makes no explicit provisions for this, assuming that the network source address can be used. This may fail if intermediate agents intervene between the media source and the final destination. Consider the example in Fig. 3.

An RTP-level gateway is defined as an entity that transforms either the RTP header or the RTP media data or both. Such a gateway could, for example, merge two successive packets for increased transport efficiency or, probably the most common case, translate media encodings for each stream, say from PCM to LPC (called transcoding). A synchronizing gateway is defined here as a gateway that recreates a synchronous media stream, possibly after mixing several sources. An application that mixes all incoming streams for a particular conference, recreates a synchronous audio stream and then forwards it to a set of receivers is an example of a synchronizing gateway. A synchronizing gateway could be built from two end system applications, with the first application feeding its media output to the media input of the second application and vice versa.

In Figure 3, the gateways are used to translate audio encodings, from PCM and ADPCM to LPC. The gateways could be either synchronizing or not. Note that a resynchronizing gateway is only necessary if audio packets depend on their predecessors and thus cannot be transcoded independently. A gateway may also be advantageous if the packetization interval can be increased. Also, for low-speed links that are barely able to handle one active source at a time, mixing at the gateway avoids excessive queueing delays when several sources are active at the same time. A synchronizing gateway has the disadvantage that it always increases the end-to-end delay.

We define reflectors as transport-level entities that translate between transport protocols, but leave the RTP protocol unit untouched. In the figure, the reflector connects a multicast group to a group of hosts that are not multicast-capable by performing transport-level replication. We define an end system as an entity that receives and generates media content, but does not forward it.

We define three types of sources: the media source is the actual origin of the media, e.g., the talker in an audiocast; a synchronization source is the combination of several media sources with its own timing; the network source is the network-level origin as seen by the end system receiving the media. The end system has to synchronize its playout with the synchronization source, indicate the active party according to the media source and return media to the network source.
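To make the distinction concrete, a receiver might keep per-stream bookkeeping along the following lines. This is purely illustrative; it assumes IPv4 addresses as identifiers (an assumption section 3.8.2 calls into question) and borrows vat's limit of 64 media sources:

    #include <stdint.h>

    /* Per-stream source bookkeeping; a sketch, not a packet format. */
    struct source_info {
        uint32_t net_src;        /* where packets arrive from; also
                                    where returned media is sent     */
        uint32_t sync_src;       /* stream whose timing drives jitter
                                    estimation and playout delay     */
        uint32_t media_src[64];  /* contributing talkers, e.g. for
                                    the active-speaker indication    */
        int      n_media_src;
    };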
If an end system receives media through a resynchronizing gateway, the end system will see the gateway as the network and synchronization source, but the media sources should not be affected. The reflector does not affect the media or synchronization sources, but the reflector itself becomes the network source. (Note that having the reflector change the IP source address is not possible since the end systems need to be able to return their media to the reflector.)

    /-------\       +------+
    |       | ADPCM |      |
    | group |<----->|  GW  |--\  LPC
    |       |       |      |   \           /------ end system
    \-------/       +------+    \|\/ reflector
                                 |    >----------- end system
    /-------\       +------+    /|/\
    |       |  PCM  |      |   /           \------ end system
    | group |<----->|  GW  |--/  LPC
    |       |       |      |
    \-------/       +------+         <---> multicast

                 Figure 3: Gateway topology

vat audio packets include a variable-length list of at most 64 4-byte identifiers containing all media sources of the packet. However, there is no convenient way to distinguish the synchronization source from the network source. The end system needs to be able to distinguish synchronization sources because jitter computation and playout delay differ for each synchronization source. Rather than having the gateway (which may be unaware of the existence of reflectors downstream) insert a synchronization source identifier, or having the reflector know about the internal structure of RTP packets, the current ad-hoc encapsulation solution used by Nevot may be sufficient: the reflector simply prefixes the true network address (and port?) of the last source (either the gateway or media source, i.e., the synchronization source) to the RTP packet. Thus, each end system and gateway has to be aware whether it is being served by a reflector. Also, multiple concatenated reflectors are difficult to handle.

3.8.2 Address Format Issues

The limitation to four bytes of addressing information may not be desirable for a number of reasons. Currently, the field is used to hold an IP address. This works as long as four bytes are sufficient to hold an identifier that is unique throughout the conference and as long as there is only one media source per IP address. The latter assumption tends to be true for many current workstations, but it is easy to imagine scenarios where it might not be, e.g., a system could hold a number of audio cards, could have several audio channels (Silicon Graphics systems, for example) or could serve as a multi-line telephone interface.(7)

The combination of IP address and source port can identify multiple sources per site if each media source uses a different source port. For a small number of sources, it appears feasible, if inelegant, to allocate ports just to distinguish sources. In the PBX (multi-line telephone) example, a single output port would appear to be the appropriate method for sending all incoming calls across the network. The mechanisms for allocating unique file names could also be used. The difficult part will be to convince all applications to draw from the same numbering space.

Given the discussion of longer address formats, at least for the longer term, it seems appropriate to consider allowing for variable-length identifiers. Ideally, the identifier would identify the agent, not a computer or network interface.(8) A currently viable implementation is the concatenation of the IP address and some locally unique number.
The meaning of the local discriminator is opaque to the outside world; it appears to be generally easier to provide a local unique-id service than a distributed version thereof. Possibilities for the local discriminator include the numeric process identifier (plus some distinguishing information within the application), the network source port number or a numeric user identifier. For efficiency in the common case of one source per workstation, the convention (used in vat) of using the network source address, possibly combined with the user id or source port, as media and synchronization source should be maintained.

------------------------------
7. If we are willing to forego the identification with a site, we could have a site with multiple audio channels pick unused IP addresses from the local network and associate them with the second and following audio ports.
8. In the United States, a one-way encryption function applied to the social security number would serve to identify human agents without compromising the SSN itself, given that the likelihood of identical SSNs is sufficiently small. The use of a telephone number may be less controversial and is applicable world-wide, but may require some local coordination if numbers are shared.

3.9 Energy Indication

G.764 contains a 4-bit noise energy field, which encodes the white noise energy to be played by the receiver in the silences between talkspurts. Playing silence periods as white noise reduces the noise pumping, where the background noise audible during a talkspurt is audibly absent at the receiver during silence periods. Substituting white noise for silence periods at the receiver is not recommended for multi-party conferences, as the summed background noise from all silent parties would be distracting. Determining the proper noise level appears to be difficult. It is suggested that the receiver simply take the energy of the last packet received before the beginning of a silence period as an indication of the background noise. With this mechanism, an explicit indication in the packet header is not required.

3.10 Error Control

In principle, the receiver has four choices in handling packets with bit errors [12]:

no checking: the receiver provides no indication whether a data packet contains bit errors, either because a checksum is not present or because it is not checked.

discard: the receiver discards errored packets, with no indication to the application.

receive: the receiver delivers and flags errored packets to the application.

correct: the receiver drops errored packets and requests retransmission.

It remains to be decided whether the header, the whole packet or neither should be protected by checksums. NVP protects its header only, while G.764 has a single 16-bit check sequence covering both the data link and packet voice headers. However, if UDP is used as the transport protocol, a checksum over the whole packet is already computed by the receiver. (Checksumming for UDP can typically be disabled by the sending or receiving host, but usually not on a per-port basis.) ST-II does not compute checksums for its payload. Many data link protocols already discard packets with bit errors, so that packets are rarely rejected due to higher-layer checksums. Bit errors within the data part may be easier to tolerate than a lost packet, particularly since some media encoding formats may provide built-in error correction.
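The UDP checksum just mentioned is the one's-complement Internet checksum of RFC 1071 (ignoring the pseudo-header that UDP adds). For reference, a minimal sketch, which could equally well be computed over the header alone or over the whole packet:

    #include <stddef.h>
    #include <stdint.h>

    /* One's-complement Internet checksum (RFC 1071) over 'len'
     * bytes.  Error handling and the byte-order subtleties of a
     * real implementation are omitted. */
    static uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += ((uint32_t)data[0] << 8) | data[1];
            data += 2;
            len -= 2;
        }
        if (len > 0)                  /* odd trailing byte */
            sum += (uint32_t)data[0] << 8;
        while (sum >> 16)             /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }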
The impact of bit errors within the header can vary; for example, errors within the timestamp may cause the audio packet to be played out at the wrong time, which is probably much more noticeable than discarding the packet. Other noticeable effects are caused by a wrong flow or encoding identifier.

If a separate checksum is desired for the cases where the underlying protocols do not already provide one, it should be optional. Once optional, it would be easy to define several checksum options, covering just the header, the header plus a certain part of the body, or the whole packet. A checksum can also be used to detect whether the receiver has the correct decryption key, avoiding noise or (worse) denial-of-service attacks. For that application, the checksum should be computed across the whole packet, before encrypting the content. Alternatively, a well-known signature could be added to the packet and included in the encryption, as long as known plaintext does not weaken the encryption security.

3.11 Security

3.11.1 Encryption

Only encryption can provide privacy as long as intruders can monitor the channel. It is desirable to specify an encryption algorithm and provide implementations without export restrictions. Although DES is widely available outside the United States, its use within software in both source and binary form remains difficult.

We have the choice of encrypting either both header and data or only the data. Encrypting the header denies the intruder knowledge about some conference details (for example, who the participants are, although this is only true as long as the UDP source address does not already reveal that information). It also allows some heuristic detection of key mismatches, as the version identifier, timestamp and other header information are somewhat predictable. However, header encryption makes packet traces and debugging by external programs difficult.

Public key cryptography does not work for true multicast systems, since the public encryption key differs for every recipient, but it may be appropriate for two-party conversations or application-level multicast. In that case, mechanisms similar to privacy-enhanced mail will probably be appropriate. Key distribution for symmetric-key encryption such as DES is beyond the scope of this recommendation, but the services of privacy-enhanced mail [13, 14] may be appropriate.

For one-way applications, it may be desirable to prohibit listeners from interrupting the broadcast. (After all, since live lectures on campus get disrupted fairly often, there is reason to fear that a sufficiently controversial lecture carried on the Internet would suffer a similar fate.) Again, asymmetric encryption can be used. Here, the decryption key is made available to all receivers, while the encryption key is known only to the legitimate sender. Current public-key algorithms are probably too computationally intensive for all but low-bit-rate voice. In most cases, filtering based on sources will be sufficient.

3.11.2 Authentication

The usual message digest methods are applicable if only the integrity of the message is to be protected against spoofing. Again, services similar to those of privacy-enhanced mail [15] may be appropriate.

3.12 Quality of Service Control

Because real-time services cannot afford retransmissions, they are immediately affected by packet loss and delays.
Delay jitter and packet loss, for example, provide a good indication of network congestion and may suggest switching to a lower-bandwidth coding. To aid in fault isolation and performance monitoring, quality-of-service measurement support is useful. We can distinguish three scenarios:

o monitoring by receiver

o monitoring by sender

o monitoring by a third party

Network providers, for example, would use the third method for quality assurance, as the delays and losses within their network may be quite different from those experienced by a customer. Clearly, more than one of these methods may be employed simultaneously.

3.12.1 Monitoring by Receiver

Monitoring by the receiver requires that the receiver can determine how many packets were actually sent and when. As long as packet losses are small, tracking the sequence numbers of arriving packets provides sufficient information to determine packet loss. Only with synchronized clocks can the receiver measure absolute delays, but delay jitter is readily available. If a sequence number is not available, it is difficult or impossible for the receiver to get an accurate count of the packets transmitted. The sender can help out by occasionally transmitting a timestamp and the cumulative packet count up to that timestamp. To make it easier for the receiver to use that information, the sample should be taken at the beginning of a synchronization point. The receiver simply stores the number of received packets at each synchronization point and then, after receiving the timestamp/count packet, can determine the fraction of packets lost so far. Packet reordering may introduce a slight inaccuracy if a packet sent before the synchronization point arrives afterwards. Given that there typically is a gap between that last packet and the synchronization point, this occurrence should be sufficiently unlikely to leave the loss measurement accurate enough for QOS monitoring.

3.12.2 Monitoring by Sender

In order to monitor how well the media data arrive at their destinations, the sender should be able to request all or a subset of the receivers to return periodic reception reports indicating loss and delay. A subset may be limited to the receivers most likely to have difficulties, avoiding reports from well-placed receivers on the local network. Based on this information, the sender may decide to adjust the encoding, for example by reducing the video frame rate. It is probably best to let the monitor convert raw packet counts and delay measurements into more meaningful measures such as loss rate or delay variance.

To measure packet loss, the receiver could return a triple consisting of the starting and ending sequence numbers and the number of packets received in that range. If the ending sequence number differs too much from the one most recently sent, this indicates to the sender a temporary loss of one-way connectivity. For constant-packet-rate services, absolute delay can be estimated as long as delays can be assumed to be symmetric. Sending the number of expected and received packets may be sufficient for most cases, however. A more complete report would also encompass the starting and ending timestamps, allowing delay estimates for variable-packet-rate services. One possible indication of delay jitter could be the minimum and maximum difference between departure and arrival timestamps. This has the advantage that the fixed delay can also be estimated if sender and receiver clocks are known to be synchronized. Unfortunately, delay extrema are noisy measurements that give only a limited indication of the delay variability. The receiver could also return the playout delay value it uses, although for absolute timing, that again depends on the clock differential, as well as on the particular delay estimation algorithm employed by the receiver. In summary, a minimal set of useful measurements appears to be the expected and received packet count, combined with the minimum and maximum timestamp difference.
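As an illustration, a receiver could assemble and evaluate such a report as sketched below; the layout and field names are hypothetical, not a proposed format:

    #include <stdint.h>

    /* Illustrative reception report covering one sequence number
     * range, as discussed above. */
    struct reception_report {
        uint16_t first_seq;   /* starting sequence number            */
        uint16_t last_seq;    /* ending sequence number              */
        uint32_t received;    /* packets received in that range      */
        int32_t  min_diff;    /* min (arrival - departure) timestamp */
        int32_t  max_diff;    /* max (arrival - departure) timestamp */
    };

    /* Packets expected in [first_seq, last_seq], modulo 2^16. */
    static uint32_t expected(const struct reception_report *r)
    {
        return (uint32_t)(uint16_t)(r->last_seq - r->first_seq) + 1;
    }

    /* Fraction of packets lost (duplicates could drive this below
     * zero; a real implementation would clamp or track them). */
    static double loss_fraction(const struct reception_report *r)
    {
        uint32_t n = expected(r);
        return ((double)n - (double)r->received) / (double)n;
    }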
3.12.3 Monitoring by Third Party

Except for delay estimates based on sequence number ranges, the above section applies to this case as well.

4 Conference Control Protocol

Currently, only conference control functions used for loosely controlled conferences (open admission, no explicit conference set-up) have been considered in depth. Support for the following functionality needs to be specified:

o authentication

o floor control, token passing

o invitations, calls

o call forwarding, call transfer

o discovery of conferences and resources (directory service)

o media, encoding and quality-of-service negotiation

o voting

o conference scheduling

o user locator

The functional specification of a conference control protocol is beyond the scope of this draft.

5 The Use of Profiles

RTP is intended to be a rather 'thin' protocol, partially because it aims to serve a wide variety of real-time services. The RTP specification intentionally leaves a number of issues open for other documents (profiles), which in turn have the goal of making it easy to build interoperable applications for a particular application domain, for example audio and video conferences. Some of the issues that a profile should address include:

o the interpretation of the 'content' field within the CDESC option

o the structure of the content-specific part at the end of the CDESC option

o the mechanism by which applications learn about and define the mapping between the 'content' field in the RTP fixed header and its meaning

o the use of the optional framing field prefixed to RTP packets (not used; used only if the underlying transport protocol does not provide framing; used by some negotiation mechanism; always used)

o any RTP-over-x issues, that is, definitions needed to allow RTP to use a particular underlying protocol

o content-specific RTP, RTCP or reverse control options

o port assignments for data and reverse control

6 Port Assignment

Since it is anticipated that UDP and similar port-oriented protocols will play a major role in carrying RTP traffic, the issue of port assignment needs to be addressed. The way ports are assigned mainly affects how applications can extract the packets destined for them. For each medium, there also needs to be a mechanism for distinguishing data from control packets. For unicast UDP, only the port number is available for demultiplexing. Thus, each medium will need a separate port number pair unless a separate demultiplexing agent is used. However, for one-to-one connections, dynamically negotiating a port number is easy. If several UDP streams are used to provide multicast by transport-level replication, the port number issue becomes somewhat more difficult.
For ST-II, a common port number has to be agreed upon by all participants, which may be difficult particularly if a new site wants to join an on-going connection but is already using that port number in a different connection. For UDP multicast, an application can elect to receive only packets with a particular port number and multicast address by binding to the appropriate multicast address.(9) Thus, for UDP multicast, there is no need to distinguish media by port numbers, as each medium could have its own designated and unique multicast group. Any dynamic port allocation mechanism would fail for large, dynamic multicast groups, but might be appropriate for small conferences and two-party conversations.

Data and control packets for a single medium can either share a single port or use two different port numbers. (Currently, two adjacent port numbers, 3456 and 3457, are used.) A single port for data and control simplifies the receiver code and reflectors and, less importantly, conserves port numbers. With the proliferation of firewalls, limiting the number of ports has assumed additional importance. Sharing a single port requires some other means of identifying control packets, for example a special encoding code. Alternatively, all control data could be carried as options within data packets, akin to the NVP protocol options. Since control messages are also transmitted when no actual media data are available, the header content of packets without media data needs to be determined. With the use of a synchronization bit, the issue of how sequence numbers and timestamps are to be treated for these packets is less critical. It is suggested to use a zero timestamp and to increment the sequence number normally. Due to the low bandwidth requirements of typical control information, the issue of accommodating control information in any bandwidth reservation scheme should be manageable. The penalty paid is the eight-byte overhead of the RTP header for control packets that do not require timestamp, encoding and sequence number information.

Using a single RTCP stream for several media may be advantageous to avoid duplicating, for example, the same identification information for voice, video and whiteboard streams. This works only if there is one multicast group that all members of a conference subscribe to. Given the relatively low frequency of control messages, the coordination effort between applications and the necessity to designate control messages for a particular medium are probably reason enough to have each application send control messages to the same multicast group as the data.

In conclusion, for multicast UDP, one assigned port number for both data and control seems to offer the most advantages, although the data/control split may offer some bandwidth savings.

------------------------------
9. This extension to the original multicast socket semantics is currently in the process of being deployed.

7 Multicast Address Allocation

A fixed, permanent allocation of network multicast addresses to individual conferences by some naming authority such as the Internet Assigned Numbers Authority is clearly not feasible, since the lifetime of conferences is unknown, the potential number of conferences is rather large and the available number space is limited to about 2^28 addresses, of which 2^16 have been set aside for dynamic allocation by conferences.
The alternative to permanent allocation is dynamic allocation, where the initiator of a multicast application obtains an unused multicast address in some manner (discussed below). The address is then made available again, either implicitly or explicitly, as the application terminates. The address allocation may or may not be handled by the same mechanism that provides conference naming and discovery services. Separating the two has the advantage that dynamic (multicast) address allocation may be useful to applications other than conferencing. Also, different mechanisms (for example, periodic announcements vs. servers) may be appropriate for each.

We can distinguish two methods of multicast address assignment:

function-based: All applications of a certain type share a common, global address space. The current reservation of a 16-bit address space for conferences is one example. The advantage of this scheme is that directory functions and allocation can readily be combined, as is done in the sd tool by Van Jacobson. A single namespace spanning the globe makes it necessary to restrict the scope of addresses so that allocation does not require knowing about and distributing information about the existence of all global conferences.

hierarchical: Based on the location of the initiator, only a subset of addresses is available. This limits the number of hosts that could be involved in resolving collisions, but, like most hierarchical assignments, leads to sparse allocation. Allocation is independent of the function the address is used for.

Clearly, combinations are possible; for example, each local namespace could be functionally divided if sufficiently large. With the current allocation of 2^16 addresses to conferences, hierarchical division except on a very coarse scale is not feasible.

To a limited extent, multicast address allocation can be compared to the well-known channel multiple access problem. The multicast address space plays the role of the common channel, with each address representing a time slot. All the following schemes require cooperation from all potential users of the address space. There is no protection against an ignorant or malicious user joining a multicast group.

7.1 Channel Sensing

In this approach, the initiator randomly selects a multicast address from a given range, joins the multicast group with that address and listens whether some other host is already transmitting on that address. This approach does not require a separate address allocation protocol or an address server, but it is probably infeasible for a number of reasons. First, a user process can only bind to a single port at a time, making 'channel sensing' difficult. Secondly, unlike listening to a typical broadcast channel, the act of joining a multicast group can be quite expensive both for the listening host and the network. Consider what would happen if a host attached through a low-bandwidth connection joins a multicast group carrying video traffic, say. Channel sensing may also fail if two sections of the network that were separated at the time of address allocation rejoin later. Changes in time-to-live values can make multicast groups 'visible' to hosts that previously were outside their scope.
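For illustration only, the sensing step might look roughly as follows with the BSD multicast socket extensions; the sensing interval is arbitrary, error handling is omitted, and, as noted above, the approach itself is probably infeasible:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Join 'group' on 'port' and listen briefly; returns 1 if the
     * address appears to be in use, 0 if it seems free. */
    static int channel_busy(const char *group, unsigned short port)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sin;
        struct ip_mreq mr;
        struct timeval tv = { 5, 0 };   /* 5 s sensing interval */
        fd_set fds;
        char buf[2048];
        int busy;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(port);
        bind(s, (struct sockaddr *)&sin, sizeof(sin));

        mr.imr_multiaddr.s_addr = inet_addr(group);
        mr.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mr, sizeof(mr));

        FD_ZERO(&fds);
        FD_SET(s, &fds);
        busy = select(s + 1, &fds, NULL, NULL, &tv) > 0 &&
               recv(s, buf, sizeof(buf), 0) > 0;
        close(s);                       /* also leaves the group */
        return busy;
    }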
7.2 Global Reservation Channel with Scoping

Each range of multicast addresses has an associated well-known multicast address and port where all initiators (and possibly users) advertise the use of multicast addresses. An initiator first picks a multicast address at random, avoiding those already known to be in use. Some mechanism for collision resolution has to be provided in the unlikely event that two initiators simultaneously choose the same address. Also, since address advertisements will have to be sent at fairly long intervals to keep traffic down, an application wanting to start a conference, for example, has to wait for an extended period of time unless it continuously monitors the allocation multicast group.

To limit traffic, it may seem advisable to have only the initiator multicast the address usage advertisement. This, however, means that there needs to be a mechanism for another site to take over advertising the group if the initiator leaves while the multicast group continues to exist. Time-to-live restrictions pose another problem. If only a single source advertises the group, the advertisement may not reach all those sites that could be reached by the multicast transmissions themselves.

The possibility of collisions can be reduced by address reuse with scoping, discussed further below, and by adding port numbers and other identifiers as further discriminators. The latter approach appears to defeat the purpose of using multicast, namely to avoid transmitting information to hosts that have no interest in receiving it. Routers can only filter based on group membership, not on ports or other higher-layer demultiplexing identifiers. Thus, even though two conferences with the same multicast address and different ports, say, could coexist at the application layer, this would force hosts and networks that are interested in only one of the conferences to deal with the combined traffic of both.

7.3 Local Reservation Channel

Instead of sharing a global namespace for each application, this scheme divides the multicast address space hierarchically, allowing an initiator within a given network to choose from a smaller set of multicast addresses, independent of the application. As with many allocation problems, we can devise both server-based and fully distributed versions.

7.3.1 Hierarchical Allocation with Servers

By some external means, address servers, distributed throughout the network, are provided with non-overlapping regions of the multicast address space. An initiator asks its favorite address server for an address when needed. When it no longer needs the address, it returns it to the server. To prevent addresses from disappearing when the requestor crashes and loses its memory of allocated addresses, requests should have an associated time-out period. This would also (to some extent) cover the case where the initiator leaves the conference without the conference itself disbanding. To decrease the chances that an initiator cannot be provided with an address, the local server could either 'borrow' an address from another server or point the initiator to another server, somewhat akin to the methods used by the Domain Name System (DNS). Provisions have to be made for servers that crash and may lose knowledge about the status of their block of addresses, in particular the expiration times. The impact of such failures could be mitigated by limiting the maximum expiration time to a few hours. Also, the server could try to request status by multicast from its clients.

7.3.2 Distributed Hierarchical Allocation

Instead of using servers, each network is allocated a set of multicast addresses.
Within the current IP address space, class A, B and C networks alike would get roughly 120 addresses, taking into account those that have been permanently assigned. Contention for addresses works like the global reservation channel discussed earlier, but the reservation group is strictly limited to the local network. (Since the address ranges are disjoint, address information that inadvertently leaks outside the network is harmless.) This method avoids the use of servers and the attendant failure modes, but introduces other problems. The division of the address space leads to a barely adequate supply of addresses (although larger address formats will probably make that less of an issue in the future). As for any distributed algorithm, the splitting of networks into temporarily unconnected parts can easily destroy the uniqueness of addresses. Handling initiators that leave on-going conferences is probably the most difficult issue.

7.4 Restricting Scope by Limiting Time-to-Live

Regardless of the address allocation method, it may be desirable to distinguish multicast addresses with different reach. A local address would be given out with the restriction of a maximum time-to-live value and could thus be reused at a network sufficiently removed, akin to the combination of cell reuse and power limitation in cellular telephony. Given that many conferences will be local or regional (e.g., broadcasting classes to nearby campuses of the same university or a regional group of universities, or an electronic town meeting), this should allow significant reuse of addresses. Reuse of addresses requires careful engineering of thresholds and would probably only be useful for very small time-to-live values that restrict the reach to a single local area network.

Using time-to-live fields to restrict scope rather than just prevent looping introduces difficult-to-diagnose failure modes into multicast sessions. In particular, reachability is no longer transitive: B may have A and C in its scope, while A and C are outside each other's scope (or A may be in the scope of B, but not vice versa, due to asymmetric routes, etc.). This problem is aggravated by the fact that routers (for obvious reasons) are not supposed to return ICMP time exceeded messages, so that the sender can only guess why multicast packets do not reach certain receivers.

A Glossary

The glossary below briefly defines the acronyms used within the text. Further definitions can be found in the Internet draft draft-ietf-userglos-glossary-00.txt, available for anonymous ftp from nnsc.nsf.net and other sites. Some of the general Internet definitions below are copied from that glossary. The quoted passages followed by a reference of the form ``(G.701)'' are drawn from the CCITT Blue Book, Fascicle I.3, Definitions. The glossary of the document ``Recommended Practices for Enhancing Digital Audio Compatibility in Multimedia Systems'', published by the Interactive Multimedia Association, was used for some terms, marked with [IMA].

16/16 timestamp: a 32-bit integer timestamp consisting of a 16-bit field containing the number of seconds followed by a 16-bit field containing the binary fraction of a second. This timestamp can measure about 18.2 hours with a resolution of approximately 15.2 microseconds.

n/m timestamp: an (n+m)-bit timestamp consisting of an n-bit second count and an m-bit fraction.
ADPCM: adaptive differential pulse code modulation. Rather than transmitting --> PCM samples directly, the difference between the estimate of the next sample and the actual sample is transmitted. This difference is usually small and can thus be encoded in fewer bits than the sample itself. The --> CCITT recommendations G.721, G.723, G.726 and G.727 describe ADPCM encodings. ``A form of differential pulse code modulation that uses adaptive quantizing. The predictor may be either fixed (time invariant) or variable. When the predictor is adaptive, the adaptation of its coefficients is made from the quantized difference signal.'' (G.701)

adaptive quantizing: ``Quantizing in which some parameters are made variable according to the short term statistical characteristics of the quantized signal.'' (G.701)

A-law: a type of audio --> companding popular in Europe.

CCITT: Comite Consultatif International Telegraphique et Telephonique. This organization is part of the United Nations International Telecommunications Union (ITU) and is responsible for making technical recommendations about telephone and data communications systems. X.25 is an example of a CCITT recommendation. Every four years, CCITT holds plenary sessions where it adopts new recommendations. Recommendations are known by the color of the cover of the book they are contained in.

CELP: code-excited linear prediction; audio encoding method for low-bit-rate codecs; --> LPC.

CD: compact disc.

CIF: common interchange format; interchange format for video images with 352 x 288 pixels. --> QCIF

codec: short for coder/decoder; device or software that --> encodes and decodes audio or video information.

companding: contraction of compressing and expanding; reducing the dynamic range of audio or video by a non-linear transformation of the sample values. The best-known methods for audio are mu-law, used in North America, and A-law, used in Europe and Asia. --> G.711 For a given number of bits, companded data uses a greater number of binary codes to represent small signal levels than linear data, resulting in a greater dynamic range at the expense of a poorer signal-to-noise ratio. [16]

DAT: digital audio tape.

decimation: reduction of sample rate by removal of samples [IMA].

delay jitter: the variation in end-to-end network delay, caused principally by varying media access delays, e.g., in an Ethernet, and queueing delays. Delay jitter needs to be compensated for by adding a variable delay (referred to as --> playout delay) at the receiver.

DVI: (trademark) digital video interactive. Audio/video compression technology developed by Intel's DVI group. [IMA]

dynamic range: the ratio of the largest encodable audio signal to the smallest encodable signal, expressed in decibels. For linear audio data types, the dynamic range is approximately six times the number of bits, measured in dB.

encoding: transformation of the media content for transmission, usually to save bandwidth, but also to decrease the effect of transmission errors. Well-known encodings are G.711 (mu-law PCM) and ADPCM for audio, JPEG and MPEG for video. --> encryption

encryption: transformation of the media content to ensure that only the intended recipients can make use of the information. --> encoding

end system: host where conference participants are located. RTP packets received by an end system are played out, but not forwarded to other hosts (in a manner visible to RTP).
FIR: finite (duration) impulse response. A signal processing filter that does not use any feedback components [IMA].

frame: unit of information. Commonly used for video to refer to a single picture. For audio, it refers to data that forms an encoding unit. For example, an LPC frame consists of the coefficients necessary to generate a specific number of audio samples.

frequency response: a system's ability to encode the spectral content of audio data. The sample rate has to be at least twice as large as the maximum possible signal frequency.

G.711: --> CCITT recommendation for --> PCM audio encoding at 64 kb/s using mu-law or A-law companding.

G.721: --> CCITT recommendation for 32 kbit/s adaptive differential pulse code modulation (--> ADPCM, PCM).

G.722: --> CCITT recommendation for audio coding at 64 kbit/s; the audio bandwidth is 7 kHz instead of 3.5 kHz for G.711, G.721, G.723 and G.728.

G.723: --> CCITT recommendation for extensions of Recommendation G.721 adapted to 24 and 40 kbit/s for digital circuit multiplication equipment.

G.728: --> CCITT recommendation for voice coding using code-excited linear prediction (CELP) at 16 kbit/s.

G.764: --> CCITT recommendation for packet voice; specifies both an --> HDLC-like data link layer and a network layer. In the draft stage, this standard was referred to as G.PVNP. The standard is primarily geared towards digital circuit multiplication equipment used by telephone companies to carry more voice calls on transoceanic links.

G.821: --> CCITT recommendation for the error performance of an international digital connection forming part of an integrated services digital network.

G.822: --> CCITT recommendation for the controlled --> slip rate objective on an international digital connection.

G.PVNP: designation of CCITT recommendation --> G.764 while in draft status.

GSM: Groupe Special Mobile. In general, the designation for the European mobile telephony standard. In particular, often used to denote the audio coding employed, formally known as the European GSM 06.10 provisional standard for full-rate speech transcoding, prI-ETS 300 036. It uses RPE/LTP (residual pulse excitation/long term prediction) coding at 13 kb/s, with frames of 160 samples covering 20 ms.

H.261: --> CCITT recommendation for the compression of motion video at rates of p x 64 kb/s (where p = 1, ..., 30). Originally intended for narrowband --> ISDN.

hangover: audio data transmitted after the silence detector indicates that no audio data is present [17]. Hangover ensures that the ends of words, important for comprehension, are transmitted even though they are often of low energy.

HDLC: high-level data link control; standard data link layer protocol (closely related to LAPD and SDLC).

IMA: Interactive Multimedia Association; trade association located in Annapolis, MD.

ICMP: Internet Control Message Protocol; ICMP is an extension to the Internet Protocol. It allows for the generation of error messages, test packets and informational messages related to --> IP.

in-band: signaling information is carried together (in the same channel or packet) with the actual data. --> out-of-band

interpolation: increase in sample rate by the introduction of processed samples.

IP: internet protocol; the Internet Protocol, defined in RFC 791, is the network layer for the TCP/IP Protocol Suite. It is a connectionless, best-effort packet switching protocol [18].
IP address: four-byte binary host interface identifier used by --> IP for addressing. An IP address consists of a network portion and a host portion. RTP treats IP addresses as globally unique, opaque identifiers.

IPv4: current version (4) of --> IP.

ISDN: integrated services digital network; refers to an end-to-end circuit-switched digital network intended to replace the current telephone network. ISDN offers circuit-switched bandwidth in multiples of 64 kb/s (B or bearer channel), plus a 16 kb/s packet-switched data (D) channel.

ISO: International Standards Organization. A voluntary, nontreaty organization founded in 1946. Its members are the national standards organizations of the 89 member countries, including ANSI for the U.S. (Tanenbaum)

ISO 10646: --> ISO standard for the encoding of characters from all languages into a single 32-bit code space (Universal Character Set). For transmission and storage, a one-to-five octet code (UTF) has been defined which is upwardly compatible with US-ASCII.

JPEG: ISO/CCITT joint photographic experts group. Designation of a variable-rate compression algorithm using discrete cosine transforms for still-frame color images.

jitter: --> delay jitter.

linear encoding: a mapping from signal values to binary codes where each binary level represents the same signal increment. --> companding

loosely controlled conference: participants can join and leave the conference without connection establishment or notifying a conference moderator. The identity of conference participants may or may not be known to other participants. See also: tightly controlled conference.

low-pass filter: a signal processing function that removes spectral content above a cutoff frequency. [IMA]

LPC: linear predictive coder. Audio encoding method that models speech as the parameters of a linear filter; used for very low bit rate codecs.

MPEG: ISO/CCITT motion picture experts group JTC1/SC29/WG11. Designates a variable-rate compression algorithm for full motion video at low bit rates; uses both intraframe and interframe coding.

MPEG-1: informal name of the proposed --> MPEG standard (ISO DIS 11172).

media source: entity (user and host) that produced the media content. It is the entity that is shown as the active participant by the application.

MTU: maximum transmission unit; the largest frame length which may be sent on a physical medium.

Nevot: network voice terminal; application written by the author.

network source: entity denoted by the address and port number from which the --> end system receives RTP packets and to which the end system sends any RTP packets for that conference in return.

NTP timestamp: ``NTP timestamps are represented as a 64-bit unsigned fixed-point number, in seconds relative to 0 hours on 1 January 1900. The integer part is in the first 32 bits and the fraction part in the last 32 bits.'' [11] NTP timestamps do not include leap seconds, i.e., each and every day contains exactly 86,400 NTP seconds.

NVP: network voice protocol; original packet format used in early packet voice experiments; defined in [1].

octet: an octet is an 8-bit datum, which may contain values 0 through 255 decimal. Commonly used in ISO and CCITT documents; also known as a byte.

OSI: Open Systems Interconnection; a suite of protocols, designed by ISO committees, to be the international standard computer network architecture.
out-of-band: signaling and control information is carried in a separate channel or in separate packets from the actual data. For example, ICMP carries control information for !IP out-of-band, that is, in separate packets, but ICMP and IP usually use the same communication channel (in band).

parametric coder: coder that encodes the parameters of a model representing the input signal. For example, LPC models a voice source as segments of voiced and unvoiced speech, represented by filter parameters. Examples include LPC, CELP and GSM. !waveform coder.

PCM: pulse-code modulation; speech coding where speech is represented by a given number of fixed-width samples per second. Often used for the coding employed in the telephone network: 8,000 eight-bit samples per second, or 64 kb/s.

pel, pixel: picture element. ``Smallest graphic element that can be independently addressed within a picture; (an alternative term for raster graphics element).'' (T.411)

playout: delivery of the medium content to the final consumer within the receiving host. For audio, this implies digital-to-analog conversion; for video, display on a screen.

playout unit: a playout unit is a group of packets sharing a common timestamp. (Naturally, packets whose timestamps are identical due to timestamp wrap-around are not considered part of the same playout unit.) For voice, the playout unit would typically be a single voice segment, while for video a video frame could be broken down into subframes, each consisting of packets sharing the same timestamp and ordered by some form of sequence number. !synchronization unit. (See the grouping sketch below.)

plesiochronous: ``The essential characteristic of time-scales or signals such that their corresponding significant instants occur at nominally the same rate, any variation in rate being constrained within specified limits. Two signals having the same nominal digit rate, but not stemming from the same clock or homochronous clocks, are usually plesiochronous. There is no limit to the time relationship between corresponding significant instants.'' (G.701, Q.9) In other words, plesiochronous clocks have (almost) the same rate, but possibly different phase.

pulse code modulation (PCM): ``A process in which a signal is sampled, and each sample is quantized independently of other samples and converted by encoding to a digital signal.'' (G.701)

PVP: packet video protocol; extension of !NVP to video data [19].

QCIF: quarter common interchange format; format for exchanging video images of 176 x 144 pixels. !CIF, SIF.

RTCP: real-time control protocol; adjunct to !RTP.

RTP: real-time transport protocol; discussed in this draft.

sampling rate: ``The number of samples taken of a signal per unit time.'' (G.701)

SB: subband; as in subband codec. Audio or video encoding that splits the frequency content of a signal into several bands and encodes each band separately, with the encoding fidelity matched to human perception for that particular frequency band.

SIF: standard interchange format; format for exchanging video images of 352 x 240 pixels. !CIF, QCIF.
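The following C fragment illustrates the playout unit definition above by walking a timestamp-ordered packet sequence and starting a new unit whenever the timestamp changes. It is a minimal sketch under the stated definition: the packet structure and all names are invented for illustration and do not reflect the actual RTP header layout.

    /* Minimal sketch: group timestamp-ordered packets into playout
     * units, i.e., maximal runs of packets sharing one timestamp.
     * The "pkt" structure is invented; it is not the RTP header. */
    #include <stdio.h>

    struct pkt {
        unsigned long ts;       /* media timestamp of the packet */
    };

    static void mark_playout_units(const struct pkt *p, int n)
    {
        int i, unit = 0;

        for (i = 0; i < n; i++) {
            if (i == 0 || p[i].ts != p[i - 1].ts)
                unit++;         /* timestamp changed: new unit starts */
            printf("packet %d -> playout unit %d\n", i, unit);
        }
    }

    int main(void)
    {
        /* e.g., a video frame split into packets of equal timestamp */
        struct pkt p[] = { {0}, {0}, {160}, {320}, {320}, {320} };

        mark_playout_units(p, 6);
        return 0;
    }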
slip: in digital communications, slip refers to bit errors caused by the different clock rates of nominally synchronous sender and receiver. If the sender clock is faster than the receiver clock, occasionally a bit will have to be dropped; conversely, a faster receiver will need to insert extra bits. The problem also occurs if the clock rates of encoder and decoder are not matched precisely. As an example, two nominally 8 kHz sampling clocks that differ by 50 parts per million drift apart by one sample every 2.5 seconds. Information loss can be avoided if the duration of pauses (silence periods between talkspurts or the inter-frame duration) can be adjusted by the receiver. ``The repetition or deletion of a block of bits in a synchronous or plesiochronous bit stream due to a discrepancy in the read and write rates at a buffer.'' (G.810) !G.821, G.822.

ST-II: stream protocol; connection-oriented, unreliable, non-sequenced, packet-oriented network and transport protocol with process demultiplexing and provisions for establishing flow parameters for resource control; defined in RFC 1190 [20, 21].

Super CIF: video format defined in Annex IV of !H.261 (1992), comprising 704 by 576 pixels.

synchronization unit: a synchronization unit consists of one or more !playout units that, as a group, share a common fixed delay between generation and playout of each part of the group. The delay may change at the beginning of such a synchronization unit. The most common synchronization units are talkspurts for voice and frames for video transmission.

TCP: transmission control protocol; an Internet Standard transport layer protocol defined in RFC 793. It is connection-oriented and stream-oriented, as opposed to UDP [22].

TPDU: transport protocol data unit.

tightly controlled conference: participants can join the conference only after an invitation from a conference moderator. The identity of all conference participants is known to the moderator. !loosely controlled conference.

transcoder: device or application that translates between several encodings, for example between !LPC and !PCM.

UDP: user datagram protocol; unreliable, non-sequenced connectionless transport protocol defined in RFC 768 [23].

vat: visual audio tool written by Steve McCanne and Van Jacobson, Lawrence Berkeley Laboratory.

vt: voice terminal software written at the Information Sciences Institute.

VMTP: versatile message transaction protocol; defined in RFC 1045 [24].

waveform coder: a coder that tries to reproduce the waveform after decompression; examples include PCM and ADPCM for audio and discrete-cosine-transform based coders for video. !parametric coder.

B Address of Author

Henning Schulzrinne
AT&T Bell Laboratories, MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974
telephone: 908 582-2262
electronic mail: hgs@research.att.com

References

[1] D. Cohen, ``A network voice protocol NVP-II,'' technical report, University of Southern California/ISI, Marina del Rey, CA, Apr. 1981.

[2] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail extensions) mechanisms for specifying and describing the format of internet message bodies,'' Network Working Group Request for Comments RFC 1341, Bellcore, June 1992.

[3] R. Want, A. Hopper, V. Falcao, and J. Gibbons, ``The active badge location system,'' ACM Transactions on Information Systems, vol. 10, pp. 91--102, Jan. 1992.

[4] R. Want and A. Hopper, ``Active badges and personal interactive computing objects,'' Technical Report ORL 92-2, Olivetti Research, Cambridge, England, Feb. 1992. Also in IEEE Transactions on Consumer Electronics, Feb. 1992.

[5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable delay and speech clipping in dynamically managed voice systems,'' IEEE Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.
[6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and improvements due to an odd-even sample-interpolation procedure,'' IEEE Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.

[7] D. Minoli, ``Optimal packet length for packet voice communication,'' IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar. 1979.

[8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial links,'' Network Working Group Request for Comments RFC 1144, Lawrence Berkeley Laboratory, Feb. 1990.

[9] IMA Digital Audio Focus and Technical Working Groups, ``Recommended practices for enhancing digital audio compatibility in multimedia systems,'' tech. rep., Interactive Multimedia Association, Annapolis, MD, Oct. 1992.

[10] W. A. Montgomery, ``Techniques for packet voice synchronization,'' IEEE Journal on Selected Areas in Communications, vol. SAC-1, pp. 1022--1028, Dec. 1983.

[11] D. L. Mills, ``Network time protocol (version 3) -- specification, implementation and analysis,'' Network Working Group Request for Comments RFC 1305, University of Delaware, Mar. 1992.

[12] L. Delgrossi, C. Halstrick, R. G. Herrtwich, and H. Stüttgen, ``HeiTP: a transport protocol for ST-II,'' in Proceedings of the Conference on Global Communications (GLOBECOM), (Orlando, FL), pp. --, IEEE, Dec. 1992.

[13] J. Linn, ``Privacy enhancement for Internet electronic mail: Part III --- algorithms, modes and identifiers,'' Network Working Group Request for Comments RFC 1115, IETF, Aug. 1989.

[14] S. T. Kent and J. Linn, ``Privacy enhancement for Internet electronic mail: Part II --- certificate-based key management,'' Network Working Group Request for Comments RFC 1114, IETF, Aug. 1989.

[15] J. Linn, ``Privacy enhancement for Internet electronic mail: Part I --- message encipherment and authentication procedures,'' Network Working Group Request for Comments RFC 1113, IETF, Aug. 1989.

[16] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice Hall, 1984.

[17] P. T. Brady, ``A model for generating on-off speech patterns in two-way conversation,'' Bell System Technical Journal, vol. 48, pp. 2445--2472, Sept. 1969.

[18] J. Postel, ``Internet protocol,'' Network Working Group Request for Comments RFC 791, Information Sciences Institute, Sept. 1981.

[19] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information Sciences Institute, University of Southern California, Los Angeles, CA, Aug. 1981.

[20] C. Topolcic, S. Casner, C. Lynn, Jr., P. Park, and K. Schroder, ``Experimental internet stream protocol, version 2 (ST-II),'' Network Working Group Request for Comments RFC 1190, BBN Systems and Technologies, Oct. 1990.

[21] C. Topolcic, ``ST II,'' in First International Workshop on Network and Operating System Support for Digital Audio and Video, no. TR-90-062 in ICSI Technical Reports, (Berkeley, CA), 1990.

[22] J. B. Postel, ``DoD standard transmission control protocol,'' Network Working Group Request for Comments RFC 761, Information Sciences Institute, Jan. 1980.

[23] J. B. Postel, ``User datagram protocol,'' Network Working Group Request for Comments RFC 768, ISI, Aug. 1980.

[24] D. R.
Cheriton, ``VMTP: Versatile Message Transaction Protocol specification,'' Network Working Group Request for Comments RFC 1045, SRI International, Menlo Park, CA, Feb. 1988.