Network Working Group                                     C.F. Harrison
Internet-Draft                                Far Field Associates, LLC
Expires: August 23, 2001                              February 22, 2001


              Audiovisual Transport with Precision Timing
                   draft-harrison-avt-precision-av-00

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 23, 2001.

Copyright Notice

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

Abstract

   This memo discusses methods for transporting audiovisual content
   over the Internet while meeting professional-quality temporal
   performance.  It gives information about the timing requirements and
   synchronization practices in professional audiovisual production and
   exhibition.  It is intended to initiate a discussion which may
   result in new or modified IETF standards which address the needs of
   this field.

1. Background

   When audiovisual content is acquired or rendered by a computer
   system, there are strong "real-time" requirements on the process.
   For example, in order to provide a satisfactory listening
   experience, the timing jitter in audio acquisition and playback
   clocks must be small; this is especially true for high-fidelity or
   professional applications.
   Furthermore, when two or more signal sources ("tracks") must be
   combined (e.g. audio mixing, video cross-fades, soundtrack-to-
   picture "lip sync"), the system must ensure that time
   synchronization is maintained among all the tracks.  The degree of
   required precision varies from tens of milliseconds (general-
   purpose lip sync) to tens of microseconds (musical audio) to a few
   nanoseconds (broadcast video).

   Professional audiovisual equipment has achieved this level of
   accuracy by a variety of means over the last century.  Usually a
   dedicated isochronous transport method is used, and a separate
   channel (e.g. "black burst" or "SMPTE time code") is used as a
   timing reference.  These methods have historically been specialized
   and application-specific.  Heroic efforts are sometimes required to
   obtain synchronization between sound and picture elements which
   originated under incompatible systems.

   It is now possible to carry audiovisual streams over general-purpose
   network transport protocols such as IP.  In such networks, non-
   deterministic delays occur between transmitter and receiver.  The
   transport-delay jitter can be removed by providing adequate buffer
   capacity at the receiving terminal, and the correct signal timing
   can be recovered by means of timestamp information embedded in the
   data stream.  General-purpose networks also suffer from errors,
   congestion, and packet loss; those concerns, however, are outside
   the scope of this memo, which discusses synchronization and timing.

   The RTP[1] protocol was developed for multimedia teleconference
   applications and provides a flexible framework for transporting
   multimedia content over the Internet.  In the existing RTP model,
   each source contains an internal free-running timebase, from which
   its video or audio sampling is derived.  In the case of an audio
   channel and a video channel which must maintain lip sync, the
   receiver must correlate the independent media timestamps on the
   audio and video streams.
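   As a non-normative illustration, the receiver-side correlation
   described above can be sketched as follows.  The function and
   figures here are hypothetical; the mechanism assumed is the
   standard RTCP Sender Report, which pairs an NTP-format wallclock
   time with the concurrent RTP media timestamp.

```python
# Sketch: mapping independent RTP media timestamps onto a common
# wallclock using RTCP Sender Report (NTP time, RTP timestamp) pairs,
# as a lip-sync receiver must do.  All concrete values are
# illustrative, not normative.

def rtp_to_wallclock(rtp_ts, sr_ntp, sr_rtp, clock_rate):
    """Convert a media timestamp to wallclock seconds using the most
    recent SR correspondence (sr_ntp seconds <-> sr_rtp ticks)."""
    return sr_ntp + (rtp_ts - sr_rtp) / clock_rate

# Audio stream, 48 kHz sampleclock: its last SR reported that RTP
# tick 96000 occurred at wallclock 1000.0 s.
audio_wall = rtp_to_wallclock(144000, 1000.0, 96000, 48000)

# Video stream, 90 kHz RTP clock: its last SR reported that tick
# 450000 occurred at wallclock 1000.5 s.
video_wall = rtp_to_wallclock(495000, 1000.5, 450000, 90000)

# Both events map to wallclock 1001.0 s; lip sync holds if the
# renderer presents them at the same instant.
skew = audio_wall - video_wall
```

   Note that the precision of this correlation is bounded by the
   accuracy of each sender's wallclock, which motivates the more
   explicit timing machinery discussed below.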
   This may be achieved by referring to the "wallclock" timestamps
   which are periodically provided by each source in its SR (Sender
   Report) messages.

   RTP has provided acceptable service for its original "live
   teleconference" application.  However, RTP is being applied in a
   range of applications, including playback of prerecorded streaming
   content, for which the original RTP timing model is insufficient.

   It is possible to build on the existing RTP framework and thereby
   support audiovisual transport timing at full professional
   precision.  This memo discusses overall design considerations to
   achieve this goal, and speculates on some specific implementations.
   It suggests that new profiles and messages should be added to the
   existing RTP toolkit.  In principle, two new concepts need to be
   incorporated:

   o  multiple, concurrent timebases referring to a single stream

   o  the ability for source timebases to speed up or slow down in
      response to external commands

2. Concurrent Timebases

   In a traditional teleconference situation, the content is
   transient: it is created, transported, and rendered in real time,
   then lost.  In this situation, the idea of a single timeline --
   "now" -- is adequate for most purposes.

   However, when content has lasting value, it is likely to be
   recorded, edited, and played back many times.  Immediately, four
   timebases become apparent:

   1.  Capture time (wallclock time during the original recording).

   2.  Program time (offset from the start of this album or show).

   3.  Presentation time (wallclock time during this playback).

   4.  Sampleclock time (numerical count of samples, with an arbitrary
       zero reference).

   Depending on the application, additional concurrent timebases (e.g.
   offset from the start of this song) may be relevant as well.
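   When the timebases move in lock-step, each one is recoverable from
   the sampleclock count by a constant offset.  A minimal sketch of
   that relationship, with purely hypothetical offset values:

```python
# Illustrative sketch of several concurrent timebases attached to one
# recorded stream.  Under a constant-offset assumption, any timebase
# value is derivable from the sampleclock count; the offsets change
# only at discrete instants (e.g. song boundaries).  All figures are
# invented for illustration.

SAMPLE_RATE = 48000            # audio sampleclock, samples/sec

# Hypothetical offsets, in seconds, relating each timebase to the
# sampleclock zero reference:
offsets = {
    "capture":      1_000_000.0,   # wallclock at original recording
    "program":      120.0,         # offset from the start of the show
    "presentation": 2_000_000.0,   # wallclock during this playback
}

def timebase_value(sample_count, name):
    """Map a sampleclock count onto one of the other timebases."""
    return offsets[name] + sample_count / SAMPLE_RATE

# Sample 480000, i.e. ten seconds into the stream, falls 130 seconds
# into the program (the stream began 120 s into the show):
print(timebase_value(480_000, "program"))
```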
   We may assume that, by definition, a "timebase" advances uniformly
   and monotonically.  Usually, then, the several timebases attached
   to a particular stream are moving in lock-step, and their
   relationship can be described by constant offsets.  Only at certain
   instants -- e.g. at the end of one song and the beginning of
   another -- will the offsets change.  This suggests that a low-
   bandwidth information stream carrying this offset data would be
   adequate to express everything that needs to be known about the
   stream timing.

   This timing information stream (TIS) can provide precise
   synchronization information among the several media streams in a
   session by correlating the sampleclock time of each media stream to
   a common program timeline.

   There are certain situations in which prerecorded programs are
   intentionally played out off-speed.  For example, material
   originated on film at 24.00 fps may be played in a European
   television environment at 25 fps, or in a U.S. television
   environment at 23.98 fps.  In these situations the timing
   information stream would carry offset information which changes
   slowly over time as the presentation timebase drifts uniformly
   relative to the capture and/or program timebase.

   A proposed implementation of the concurrent timebase model, using
   the RTP framework, would support a new media stream type: TIS.  A
   single TIS stream can carry information about several timebases.  A
   timebase which belongs to an RTP media stream may be identified by
   its SSRC (Synchronization Source) identifier.  "Virtual" timebases,
   like program time, may be identified by a label, unique within that
   TIS.  A typical message within a TIS stream would state,
   effectively, "point A on timebase X is coincident with point B on
   timebase Y."  Fractional clock resolution in these messages is
   appropriate.
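   The semantics of such a correspondence message, including the
   slowly drifting offsets of off-speed playout, can be sketched as
   follows.  The structure and field names are hypothetical and are
   not a proposed wire format.

```python
# Sketch of a TIS correspondence message and its use.  Each message
# asserts "point_a on timebase_x is coincident with point_b on
# timebase_y"; an optional rate term covers off-speed playout, e.g.
# 24.00 fps film material presented at 25 fps.  All names and values
# here are illustrative.

from dataclasses import dataclass

@dataclass
class TisCorrespondence:
    timebase_x: str      # SSRC of a media sampleclock, or a label
    point_a: float       # fractional clock value on timebase x
    timebase_y: str
    point_b: float       # coincident value on timebase y
    rate: float = 1.0    # d(timebase_y)/d(timebase_x); 1.0 = lock-step

def map_point(msg, value_on_x):
    """Translate a point on timebase x to its equivalent on y."""
    return msg.point_b + (value_on_x - msg.point_a) * msg.rate

# Film program time played out in a 25 fps environment: presentation
# runs 24/25 as long as the 24.00 fps program material.
msg = TisCorrespondence("program", 0.0, "presentation", 3600.0,
                        rate=24.0 / 25.0)
# 100 s of program time occupies 96 s of presentation time:
print(map_point(msg, 100.0))   # 3696.0
```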
   It is highly desirable that new sources be able to join an ongoing
   session and synchronize properly.  That is one reason that multiple
   concurrent TIS streams should be supported.  A reference to the
   label of a virtual timebase may be made unique within a session by
   pairing it with the SSRC of the corresponding TIS stream.

   While genuine sampleclock timebases are constrained by the RTP
   standard to move smoothly forward in time, this is not generally
   the case with the virtual timebases which appear in a TIS.  For
   example, the "time offset within song" timebase will jump back to
   zero at the beginning of each song.  It may be useful to append an
   arbitrary instance identifier to each virtual-timebase label, so
   that this type of event is treated as the termination of one
   timebase instance and the initiation of another timebase instance.
   This is one way to retain monotonicity.  Messages about the
   upcoming initiation and termination of timebases could be embedded
   in the TIS data stream.

   A particularly important timebase in multimedia applications is the
   presentation timebase.  A presentation timebase exists at each
   location where content is seen or heard.  Sometimes there are
   several separate pieces of playback hardware at the same location;
   such cases can lead to critical requirements for inter-equipment
   synchronization.

   For example, two broadcast-video streams might be brought back to a
   TV studio, and converted to analog video by two separate
   workstations.  The two analog video signals are connected to a
   video switcher, where an operator may perform wipes, crossfades, or
   cuts.  This functionality requires that the two video signals be
   synchronized within a few tens of nanoseconds.  In practice, each
   workstation will receive a reference signal (black burst) which
   serves as a presentation timebase for this studio.
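   The naming scheme suggested above -- session-unique virtual
   timebases identified by TIS SSRC, label, and a per-instance
   identifier -- can be sketched as a small data structure.  The
   class, field layout, and values are hypothetical.

```python
# Sketch of session-unique identification of virtual timebases.  A
# genuine sampleclock timebase is named by the SSRC of its media
# stream; a virtual timebase is named here by the tuple
# (TIS SSRC, label, instance), where a fresh instance identifier is
# assigned whenever the timebase would otherwise jump backward --
# e.g. "time offset within song" returning to zero at each new song.

class VirtualTimebaseNamer:
    def __init__(self, tis_ssrc):
        self.tis_ssrc = tis_ssrc
        self._counters = {}     # label -> next instance number

    def new_instance(self, label):
        """Terminate the previous instance of this label and start a
        fresh one, so each instance remains monotonic."""
        n = self._counters.get(label, 0)
        self._counters[label] = n + 1
        return (self.tis_ssrc, label, n)

namer = VirtualTimebaseNamer(tis_ssrc=0x1234ABCD)
song1 = namer.new_instance("song-offset")   # first song
song2 = namer.new_instance("song-offset")   # next song, new instance
assert song1 != song2
```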
   Such a reference signal may carry absolute time code in accordance
   with SMPTE standards, and this time code can be referenced as the
   presentation timebase in a TIS data stream.  Any existing
   professional audiovisual "hardwired" synchronization scheme can be
   linked with an RTP session in a similar way.

3. Controllable Source Timebases

   When several sources contribute to a single session, source
   synchronization becomes a concern.  If the individual source
   timebases free-run, they will in practice drift in phase relative
   to each other.  At a subsequent stage of digital mixing, some
   signals will therefore need to be resampled.  The resampling
   process interpolates new data at time points between the original
   signal samples.

   Resampling is surprisingly difficult when professional quality
   standards are to be maintained.  This is particularly true for
   video signals.  There is a wide range of applications in which
   existing resampling techniques are adequate; teleconferencing falls
   in this category.  However, in the professional audiovisual world
   resampling is always a last choice, and considerable effort and
   ingenuity is expended in avoiding it.  Primarily, this means
   controlling the timebases of all sources -- speeding some up
   slightly, or slowing others down -- so that all sources are
   sampling in a phase-coherent way.  It is worthwhile to note that it
   is not so important that all the sources are precisely "on spec" --
   e.g. 48000.000 samples/sec for digital audio -- rather, it is
   critical that all clocks are running together.

   For this reason, professional audio and video gear provides some
   type of speed controllability.  An edit controller or chase
   controller connects (often over a proprietary interface) to the
   guts of the tape deck, and provides the "hooks" that allow a room
   full of equipment to operate in perfect sync.
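   A networked analogue of the chase-controller function can be
   sketched as a simple feedback loop: the receiver observes how far
   its buffer is from a target fill level and asks the source to trim
   its clock rate accordingly.  The function, gain, and limits below
   are hypothetical, not a proposed control law or packet format.

```python
# Sketch of receiver-side speed-control feedback: compute a small
# rate correction, in parts per million, that a receiver might send
# to a source so the source's sampleclock tracks the sink's
# consumption rate.  Positive means "run faster".  All parameters are
# illustrative.

def speed_command_ppm(buffer_fill, target_fill,
                      gain_ppm_per_sample=0.01, limit_ppm=50.0):
    """Proportional controller on jitter-buffer occupancy.  The
    correction is clamped so the sampling clock is only nudged,
    never stepped."""
    error = target_fill - buffer_fill      # samples short of target
    ppm = error * gain_ppm_per_sample
    return max(-limit_ppm, min(limit_ppm, ppm))

# Buffer running 1000 samples low: ask the source for +10 ppm.
print(speed_command_ppm(buffer_fill=23000, target_fill=24000))
# Buffer far over target: the request is clamped to -50 ppm.
print(speed_command_ppm(buffer_fill=40000, target_fill=24000))
```

   Clamping the correction matters most in the recording case
   discussed below, where the sampling clock must respond very
   smoothly to speed-change commands.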
   A similar functionality, provided over a generic network transport,
   would be very useful.  In essence, we need to support messages to a
   source, commanding it to speed up or slow down slightly.  Such
   messages might be carried over the existing RTCP port assignment,
   by adding a new packet type to the RTCP[1] standard.

   It is useful to distinguish two situations in which timebase
   control is used: digital playback and digital recording.

   In the first situation, playing back prerecorded digital material,
   there is little need for precise short-term control of the playback
   speed.  A typical RTP implementation is designed with a large
   buffer which removes the effect of jitter, regardless of whether it
   occurs at the playback device or in the network.  Timing alignment
   can be sample-perfect, guided by timestamps, at the output of the
   buffer.  In this case, relatively crude control of the source
   timebase can be perfectly satisfactory, provided that the buffer
   does not over- or underflow.

   In the second case, the recording timebase is being used for
   digitization of a real-world, analog signal.  In particular, when
   audio signals are being digitized, the sampling timebase must have
   very low jitter: a few hundred picoseconds of random sampling
   jitter can introduce audible distortion.  Thus, the clock
   generating the sampling timebase must respond very smoothly to
   speed-change commands.  Obtaining such performance is the
   responsibility of the manufacturer of the recording equipment;
   similar problems have been successfully faced in the manufacture of
   digital studio microphones.

4. Security Considerations

   The proposals in this memo present few new security considerations.
   It is possible that a defective or malicious application could
   disrupt the performance of a signal source by means of source
   timebase control messages.

References

   [1]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications",
        RFC 1889, January 1996.

Author's Address

   Chuck Harrison
   Far Field Associates, LLC
   18815 111th Pl SE
   Snohomish, WA 98290
   US

   Phone: +1 360 863 8340
   EMail: chuck_harrison@iname.com

Full Copyright Statement

   Copyright (C) The Internet Society (2001).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.
   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC editor function is currently provided by the
   Internet Society.