Network Working Group                                     C.F. Harrison
Internet-Draft                                Far Field Associates, LLC
Expires: August 23, 2001                              February 22, 2001


              Audiovisual Transport with Precision Timing
                   draft-harrison-avt-precision-av-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 23, 2001.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   This memo discusses methods for transporting audiovisual content
   over the Internet while meeting professional-quality temporal
   performance.  It gives information about the timing requirements and
   synchronization practices in professional audiovisual production and
   exhibition.  It is intended to initiate a discussion which may
   result in new or modified IETF standards which address the needs of
   this field.

1. Background

   When audiovisual content is acquired or rendered by a computer
   system, there are strong "real-time" requirements on the process.
   For example, in order to provide a satisfactory listening
   experience, the timing jitter in audio acquisition and playback
   clocks must be small; this is especially true for high-fidelity or
   professional applications.
   Furthermore, when two or more signal sources ("tracks") must be
   combined (e.g. audio mixing, video cross-fades, soundtrack-to-
   picture "lip sync"), the system must ensure that time
   synchronization is maintained among all the tracks.  The degree of
   required precision varies from tens of milliseconds (general-
   purpose lip sync) to tens of microseconds (musical audio) to a few
   nanoseconds (broadcast video).

   Professional audiovisual equipment has achieved this level of
   accuracy by a variety of means over the last century.  Usually a
   dedicated isochronous transport method is used, and a separate
   channel (e.g. "black burst" or "SMPTE time code") is used as a
   timing reference.  These methods have historically been specialized
   and application-specific.  Heroic efforts are sometimes required to
   obtain synchronization between sound and picture elements which
   originated under incompatible systems.

   It is now possible to carry audiovisual streams over general-purpose
   network transport protocols such as IP.  In such networks, non-
   deterministic delays occur between transmitter and receiver.  The
   transport-delay jitter can be removed by providing adequate buffer
   capacity at the receiving terminal, and the correct signal timing
   can be recovered by means of timestamp information embedded in the
   data stream.  General-purpose networks also suffer from errors,
   congestion, and packet loss; those concerns, however, are outside
   the scope of this memo, which discusses synchronization and timing.

   The RTP[1] protocol was developed for multimedia teleconference
   applications and provides a flexible framework for transporting
   multimedia content over the Internet.  In the existing RTP model,
   each source contains an internal free-running timebase, from which
   its video or audio sampling is derived.  In the case of an audio
   channel and a video channel which must maintain lip sync, the
   receiver must correlate the independent media timestamps on the
   audio and video streams.
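   As a non-normative illustration, the receiver-side correlation
   described above can be sketched as follows.  The function and
   figures here are hypothetical; the mechanism assumed is the
   standard RTCP Sender Report, which pairs an NTP-format wallclock
   time with the concurrent RTP media timestamp.

```python
# Sketch: mapping independent RTP media timestamps onto a common
# wallclock using RTCP Sender Report (NTP time, RTP timestamp) pairs,
# as a lip-sync receiver must do.  All concrete values are
# illustrative, not normative.

def rtp_to_wallclock(rtp_ts, sr_ntp, sr_rtp, clock_rate):
    """Convert a media timestamp to wallclock seconds using the most
    recent SR correspondence (sr_ntp seconds <-> sr_rtp ticks)."""
    return sr_ntp + (rtp_ts - sr_rtp) / clock_rate

# Audio stream, 48 kHz sampleclock: its last SR reported that RTP
# tick 96000 occurred at wallclock 1000.0 s.
audio_wall = rtp_to_wallclock(144000, 1000.0, 96000, 48000)

# Video stream, 90 kHz RTP clock: its last SR reported that tick
# 450000 occurred at wallclock 1000.5 s.
video_wall = rtp_to_wallclock(495000, 1000.5, 450000, 90000)

# Both events map to wallclock 1001.0 s; lip sync holds if the
# renderer presents them at the same instant.
skew = audio_wall - video_wall
```

   Note that the precision of this correlation is bounded by the
   accuracy of each sender's wallclock, which motivates the more
   explicit timing machinery discussed below.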
   This may be achieved by referring to the "wallclock" timestamps
   which are periodically provided by each source in its SR (Sender
   Report) messages.

   RTP has provided acceptable service for its original "live
   teleconference" application.  However, RTP is being applied in a
   range of applications, including playback of prerecorded streaming
   content, for which the original RTP timing model is insufficient.

   It is possible to build on the existing RTP framework and thereby
   support audiovisual transport timing at full professional
   precision.  This memo discusses overall design considerations to
   achieve this goal, and speculates on some specific implementations.
   It suggests that new profiles and messages should be added to the
   existing RTP toolkit.  In principle, two new concepts need to be
   incorporated:

   o  multiple, concurrent timebases referring to a single stream

   o  the ability for source timebases to speed up or slow down in
      response to external commands

2. Concurrent Timebases

   In a traditional teleconference situation, the content is
   transient: it is created, transported, and rendered in real time,
   then lost.  In this situation, the idea of a single timeline --
   "now" -- is adequate for most purposes.

   However, when content has lasting value, it is likely to be
   recorded, edited, and played back many times.  Immediately, four
   timebases become apparent:

   1.  Capture time (wallclock time during the original recording).

   2.  Program time (offset from the start of this album or show).

   3.  Presentation time (wallclock time during this playback).

   4.  Sampleclock time (numerical count of samples, with an arbitrary
       zero reference).

   Depending on the application, additional concurrent timebases (e.g.
   offset from the start of this song) may be relevant as well.
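   When the timebases move in lock-step, each one is recoverable from
   the sampleclock count by a constant offset.  A minimal sketch of
   that relationship, with purely hypothetical offset values:

```python
# Illustrative sketch of several concurrent timebases attached to one
# recorded stream.  Under a constant-offset assumption, any timebase
# value is derivable from the sampleclock count; the offsets change
# only at discrete instants (e.g. song boundaries).  All figures are
# invented for illustration.

SAMPLE_RATE = 48000            # audio sampleclock, samples/sec

# Hypothetical offsets, in seconds, relating each timebase to the
# sampleclock zero reference:
offsets = {
    "capture":      1_000_000.0,   # wallclock at original recording
    "program":      120.0,         # offset from the start of the show
    "presentation": 2_000_000.0,   # wallclock during this playback
}

def timebase_value(sample_count, name):
    """Map a sampleclock count onto one of the other timebases."""
    return offsets[name] + sample_count / SAMPLE_RATE

# Sample 480000, i.e. ten seconds into the stream, falls 130 seconds
# into the program (the stream began 120 s into the show):
print(timebase_value(480_000, "program"))
```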
   We may assume that, by definition, a "timebase" advances uniformly
   and monotonically.  Usually, then, the several timebases attached
   to a particular stream are moving in lock-step, and their
   relationship can be described by constant offsets.  Only at certain
   instants -- e.g. at the end of one song and the beginning of
   another -- will the offsets change.  This suggests that a low-
   bandwidth information stream carrying this offset data would be
   adequate to express everything that needs to be known about the
   stream timing.

   This timing information stream (TIS) can provide precise
   synchronization information among the several media streams in a
   session by correlating the sampleclock time of each media stream to
   a common program timeline.

   There are certain situations in which prerecorded programs are
   intentionally played out off-speed.  For example, material
   originated on film at 24.00 fps may be played in a European
   television environment at 25 fps, or in a U.S. television
   environment at 23.98 fps.  In these situations the timing
   information stream would carry offset information which changes
   slowly over time as the presentation timebase drifts uniformly
   relative to the capture and/or program timebase.

   A proposed implementation of the concurrent timebase model, using
   the RTP framework, would support a new media stream type: TIS.  A
   single TIS stream can carry information about several timebases.  A
   timebase which belongs to an RTP media stream may be identified by
   its SSRC (Synchronization Source) identifier.  "Virtual" timebases,
   like program time, may be identified by a label, unique within that
   TIS.  A typical message within a TIS stream would state,
   effectively, "point A on timebase X is coincident with point B on
   timebase Y."  Fractional clock resolution in these messages is
   appropriate.
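   The semantics of such a correspondence message, including the
   slowly drifting offsets of off-speed playout, can be sketched as
   follows.  The structure and field names are hypothetical and are
   not a proposed wire format.

```python
# Sketch of a TIS correspondence message and its use.  Each message
# asserts "point_a on timebase_x is coincident with point_b on
# timebase_y"; an optional rate term covers off-speed playout, e.g.
# 24.00 fps film material presented at 25 fps.  All names and values
# here are illustrative.

from dataclasses import dataclass

@dataclass
class TisCorrespondence:
    timebase_x: str      # SSRC of a media sampleclock, or a label
    point_a: float       # fractional clock value on timebase x
    timebase_y: str
    point_b: float       # coincident value on timebase y
    rate: float = 1.0    # d(timebase_y)/d(timebase_x); 1.0 = lock-step

def map_point(msg, value_on_x):
    """Translate a point on timebase x to its equivalent on y."""
    return msg.point_b + (value_on_x - msg.point_a) * msg.rate

# Film program time played out in a 25 fps environment: presentation
# runs 24/25 as long as the 24.00 fps program material.
msg = TisCorrespondence("program", 0.0, "presentation", 3600.0,
                        rate=24.0 / 25.0)
# 100 s of program time occupies 96 s of presentation time:
print(map_point(msg, 100.0))   # 3696.0
```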
   It is highly desirable that new sources be able to join an ongoing
   session and synchronize properly.  That is one reason that multiple
   concurrent TIS streams should be supported.  A reference to the
   label of a virtual timebase may be made unique within a session by
   pairing it with the SSRC of the corresponding TIS stream.

   While genuine sampleclock timebases are constrained by the RTP
   standard to move smoothly forward in time, this is not generally
   the case with the virtual timebases which appear in a TIS.  For
   example, the "time offset within song" timebase will jump back to
   zero at the beginning of each song.  It may be useful to append an
   arbitrary instance identifier to each virtual-timebase label, so
   that this type of event is treated as the termination of one
   timebase instance and the initiation of another timebase instance.
   This is one way to retain monotonicity.  Messages about the
   upcoming initiation and termination of timebases could be embedded
   in the TIS data stream.

   A particularly important timebase in multimedia applications is the
   presentation timebase.  A presentation timebase exists at each
   location where content is seen or heard.  Sometimes there are
   several separate pieces of playback hardware at the same location;
   such cases can lead to critical requirements for inter-equipment
   synchronization.

   For example, two broadcast-video streams might be brought back to a
   TV studio, and converted to analog video by two separate
   workstations.  The two analog video signals are connected to a
   video switcher, where an operator may perform wipes, crossfades, or
   cuts.  This functionality requires that the two video signals be
   synchronized within a few tens of nanoseconds.  In practice, each
   workstation will receive a reference signal (black burst) which
   serves as a presentation timebase for this studio.
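   The naming scheme suggested above -- session-unique virtual
   timebases identified by TIS SSRC, label, and a per-instance
   identifier -- can be sketched as a small data structure.  The
   class, field layout, and values are hypothetical.

```python
# Sketch of session-unique identification of virtual timebases.  A
# genuine sampleclock timebase is named by the SSRC of its media
# stream; a virtual timebase is named here by the tuple
# (TIS SSRC, label, instance), where a fresh instance identifier is
# assigned whenever the timebase would otherwise jump backward --
# e.g. "time offset within song" returning to zero at each new song.

class VirtualTimebaseNamer:
    def __init__(self, tis_ssrc):
        self.tis_ssrc = tis_ssrc
        self._counters = {}     # label -> next instance number

    def new_instance(self, label):
        """Terminate the previous instance of this label and start a
        fresh one, so each instance remains monotonic."""
        n = self._counters.get(label, 0)
        self._counters[label] = n + 1
        return (self.tis_ssrc, label, n)

namer = VirtualTimebaseNamer(tis_ssrc=0x1234ABCD)
song1 = namer.new_instance("song-offset")   # first song
song2 = namer.new_instance("song-offset")   # next song, new instance
assert song1 != song2
```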
   Such a reference signal may carry absolute time code in accordance
   with SMPTE standards, and this time code can be referenced as the
   presentation timebase in a TIS data stream.  Any existing
   professional audiovisual "hardwired" synchronization scheme can be
   linked with an RTP session in a similar way.

3. Controllable Source Timebases

   When several sources contribute to a single session, source
   synchronization becomes a concern.  If the individual source
   timebases free-run, they will in practice drift in phase relative
   to each other.  At a subsequent stage of digital mixing, some
   signals will therefore need to be resampled.  The resampling
   process interpolates new data at time points between the original
   signal samples.

   Resampling is surprisingly difficult when professional quality
   standards are to be maintained.  This is particularly true for
   video signals.  There is a wide range of applications in which
   existing resampling techniques are adequate; teleconferencing falls
   in this category.  However, in the professional audiovisual world
   resampling is always a last choice, and considerable effort and
   ingenuity is expended in avoiding it.  Primarily, this means
   controlling the timebases of all sources -- speeding some up
   slightly, or slowing others down -- so that all sources are
   sampling in a phase-coherent way.  It is worthwhile to note that it
   is not so important that all the sources are precisely "on spec" --
   e.g. 48000.000 samples/sec for digital audio -- rather, it is
   critical that all clocks are running together.

   For this reason, professional audio and video gear provides some
   type of speed controllability.  An edit controller or chase
   controller connects (often over a proprietary interface) to the
   guts of the tape deck, and provides the "hooks" that allow a room
   full of equipment to operate in perfect sync.
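   A networked analogue of the chase-controller function can be
   sketched as a simple feedback loop: the receiver observes how far
   its buffer is from a target fill level and asks the source to trim
   its clock rate accordingly.  The function, gain, and limits below
   are hypothetical, not a proposed control law or packet format.

```python
# Sketch of receiver-side speed-control feedback: compute a small
# rate correction, in parts per million, that a receiver might send
# to a source so the source's sampleclock tracks the sink's
# consumption rate.  Positive means "run faster".  All parameters are
# illustrative.

def speed_command_ppm(buffer_fill, target_fill,
                      gain_ppm_per_sample=0.01, limit_ppm=50.0):
    """Proportional controller on jitter-buffer occupancy.  The
    correction is clamped so the sampling clock is only nudged,
    never stepped."""
    error = target_fill - buffer_fill      # samples short of target
    ppm = error * gain_ppm_per_sample
    return max(-limit_ppm, min(limit_ppm, ppm))

# Buffer running 1000 samples low: ask the source for +10 ppm.
print(speed_command_ppm(buffer_fill=23000, target_fill=24000))
# Buffer far over target: the request is clamped to -50 ppm.
print(speed_command_ppm(buffer_fill=40000, target_fill=24000))
```

   Clamping the correction matters most in the recording case
   discussed below, where the sampling clock must respond very
   smoothly to speed-change commands.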
   A similar functionality, provided over a generic network transport,
   would be very useful.  In essence, we need to support messages to a
   source, commanding it to speed up or slow down slightly.  Such
   messages might be carried over the existing RTCP port assignment,
   by adding a new packet type to the RTCP[1] standard.

   It is useful to distinguish two situations in which timebase
   control is used: digital playback and digital recording.

   In the first situation, playing back prerecorded digital material,
   there is little need for precise short-term control of the playback
   speed.  A typical RTP implementation is designed with a large
   buffer which removes the effect of jitter, regardless of whether it
   occurs at the playback device or in the network.  Timing alignment
   can be sample-perfect, guided by timestamps, at the output of the
   buffer.  In this case, relatively crude control of the source
   timebase can be perfectly satisfactory, provided that the buffer
   does not over- or underflow.

   In the second case, the recording timebase is being used for
   digitization of a real-world, analog signal.  In particular, when
   audio signals are being digitized, the sampling timebase must have
   very low jitter: a few hundred picoseconds of random sampling
   jitter can introduce audible distortion.  Thus, the clock
   generating the sampling timebase must respond very smoothly to
   speed-change commands.  Obtaining such performance is the
   responsibility of the manufacturer of the recording equipment;
   similar problems have been successfully faced in the manufacture of
   digital studio microphones.

4. Security Considerations

   The proposals in this memo present few new security considerations.
   It is possible that a defective or malicious application could
   disrupt the performance of a signal source by means of source
   timebase control messages.

References

   [1]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications",
        RFC 1889, January 1996.

Author's Address

   Chuck Harrison
   Far Field Associates, LLC
   18815 111th Pl SE
   Snohomish, WA 98290
   US

   Phone: +1 360 863 8340
   EMail: chuck_harrison@iname.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.
   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC editor function is currently provided by the
   Internet Society.