Network Working Group                                         Y(J) Stein
Internet-Draft                                   RAD Data Communications
Expires: April 2, 2006                                     Sept 29, 2005


                   Great Real-Time Problem Statement
                        draft-stein-great-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 2, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   VoIP is commonly perceived to be a low quality, but low cost,
   alternative to standard telephony.  This poor perception is often
   well deserved, being fueled by implementations designed without
   regard to characteristics of IP networks.  This problem statement
   attempts to catalog the shortcomings of current implementations, in
   order to explore the IETF community's interest in working to improve
   this situation.


Stein                     Expires April 2, 2006                 [Page 1]

Internet-Draft                    great                        Sept 2005


1.  Introduction

   Consider the placing of a phone call over the PSTN.  The end-user
   terminal is extremely simple and inexpensive, the scalability of
   the PSTN being based on 'dumb' terminals at end-points, with all the
   intelligence concentrated in the core.  From the moment the user
   requests service by off-hooking, an imperceptible amount of time
   passes before the network indicates that it is ready to receive
   signaling by delivering audible dial-tone.  Since the service
   availability is 'five nines' (i.e. 99.999 percent) the user will
   probably not remember an event where dial-tone was not heard
   immediately after off-hooking.  The user then enters the required
   part of a hierarchical destination address, and will then receive
   feedback as to the usage status of the destination terminal, in the
   form of ringback or busy tone, usually within seconds.  Assuming that
   the destination terminal is not in use and the called party is
   present and decides to accept the call, a session is established an
   imperceptible time after off-hooking of the destination terminal.

   For the duration of the conversation the voice is guaranteed to be
   'toll quality', defined as at least 4 on the Mean Opinion Score (MOS)
   scale from 1 to 5.  This quality is admittedly imperfect, due to the
   audio spectrum being truncated at 4 kHz (thus making differentiation
   of various unvoiced fricatives impossible, and distorting music) but
   preserves speaker identity and does not impede understandability for
   native speakers of the language spoken.  There will, in general, be
   no unusual noises or audible artifacts (unless due to sources
   radiating close to the end-user terminals), and no gaps or
   discontinuities in the received information.  Furthermore, the one-
   way propagation delay is usually close to the physical minimum
   (i.e. the time taken for light to travel between the two points)
   and no perceivable echo is introduced due to the telephone
   electronics.  With extremely high probability the session will only
   be terminated when either the originator or the called party decides
   to terminate.

   Now, for comparison, let us consider a typical VoIP call over the
   public Internet.  The end-user terminal may be either a personal
   computer (PC) or an IP-phone, the former being a multifunctional
   computational device and the latter a smaller, less computationally
   able, but relatively expensive, terminal.  Assuming a PC as terminal,
   the user initiates a call by typing an identifier or IP address, or
   by choosing the desired destination from a list.  Thereafter follows
   a rather prolonged period during which the user has no call progress
   feedback; the duration is usually longer for peer-to-peer systems,
   but is often considerable even for systems with centralized
   registries.  Afterwards a simulation of ringback or busy-tone is
   commonly played, and assuming the destination terminal is powered and
   the called party is willing to take the call, a bi-directional
   session is set up.

   Once the session commences the voice quality will usually not be as
   good as that experienced in the PSTN (see Section 3 for a
   discussion).  In fact, the quality may be variable ranging from
   telephone-like to incomprehensible.  Depending on network
   characteristics there will often be gaps when the sound completely
   disappears, or becomes metallic, or sounds like Martians are
   speaking.  At times artifacts such as beeps may be heard.  When using
   the public Internet the round-trip delay will often be so high (over
   one half second) that free conversation is impossible, and the
   parties to the conversation may repeatedly speak at the same time, or
   may purposely leave long pauses (for the other to interrupt or to
   aver that the connection is still operational), or say 'over' as in
   push-to-talk systems.  The session may also terminate unexpectedly,
   and then may or may not be restored by reconnecting.

   The sum total of the user's perception of the audio quality, delay,
   reliability and other factors is sometimes called the 'user
   experience'.  Why would anyone use a VoIP system if the user
   experience is significantly inferior to that of the standard
   telephone system?  The situation is analogous to that of cellular
   phones, which also have noticeably lower audio quality and may
   unexpectedly disconnect, but the mediocre user experience is
   tolerated due to a new feature, namely mobility.  Here some
   enthusiasts have suggested that the attraction of VoIP is due to the
   additional functionality that is, or will be, available (e.g. instant
   messaging, video).  However, in most cases it is probably either the
   economics (free calls) or the ready accessibility for people already
   seated at a PC (along with presence indications) that induces users
   to tolerate the poor quality.  In fact, many times the latter
   type of user will start a conversation on VoIP in order to ask
   whether they can call over the PSTN.  The feeling of most users is
   that the quality is good enough for casual, hobby type conversations
   (reminding some of us of our ham radio origins), and thus such users
   are willing to use it to speak with remote acquaintances, mothers-in-
   law, etc.  They might not, however, choose to use VoIP to call their
   bank branch, an important client, or their boss.

   Of course, much of what was said above is specific to the present
   state of the public Internet, while well engineered, highly
   overprovisioned, networks suffer much less from these troubles.
   However, this does not mean that the public Internet is inherently
   unsuitable for quality transport of voice traffic, nor that it is
   imperative to make major changes in order for it to become suitable
   (although such changes may help).  Many of the above problems can be
   amended, although not completely solved, by taking the
   characteristics of the Internet into account at all stages of the
   VoIP implementation.  We call such an implementation and its
   components, 'PSN-aware'.

   The above discussion focused on VoIP, but similar statements could be
   made concerning other forms of real-time traffic transported over the
   Internet, such as videoconferencing.  On the other hand not all real-
   time traffic is as problematic.  For example, streaming audio that
   can be delivered after a certain delay may be able to exploit
   retransmission mechanisms, and thus be immunized to many of the above
   hindrances.  The essential ingredients of the problem are real-time
   constraints and delay sensitivity, both of which are present in
   interactive real-time applications.

2.  Characteristics of PSNs

   The design philosophy of the Public Switched Telephone Network (PSTN)
   presumes that routing is expensive but bandwidth plentiful, while
   that of Packet Switched Networks (PSNs), such as the Internet,
   presupposes bandwidth to be dear while routing affordable.  The
   former tenets lead to a circuit switched network that naturally
   supports reliable and high quality interactive audio sessions, while
   the resource sharing required by the latter postulates makes
   providing such services a challenge.

   The very fact that PSN users share bandwidth means that no user
   traffic receives treatment identical to that of a PSTN circuit.  The
   major sources of performance degradation for real-time delay-
   sensitive PSN traffic can be identified as follows:
   *  packet creation time
   *  network propagation delay
   *  packet delay variation
   *  packet loss and mis-ordering
   *  congestion events
   *  lack of inherent timing transport
   *  bandwidth conservation algorithms
   *  emulation mechanisms

   Unlike PSTN traffic, PSN traffic is sent in packets.  The first byte
   of data placed in the packet experiences latency corresponding to the
   time required to fill the packet at the source.  Although the last
   byte placed in the packet experiences only minimal delay, it is the
   last to be played out, and thus all data experiences latency equal to
   the packet creation time (PCT).  In VoIP systems this may be less
   than 1 millisecond (for example, when using the G.728 LD-CELP
   encoder), but it is typically tens of milliseconds (for example, 10
   milliseconds for G.729, 60 milliseconds for a two-frame superpacket
   of G.723.1).  PCT is a frame-size related latency introduced by the
   source, but additional delay is usually added at the destination.
   Most speech decoders require 'lookahead', and (as will be discussed
   below) jitter buffer based systems require storing of packets.  These
   additional delays may greatly increase the overall one-way delay.
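
   As a rough illustration of the figures above, the following Python
   sketch computes PCT and the combined PCT-plus-lookahead latency for
   the encoders mentioned.  The 10 and 30 millisecond frame durations
   follow the text; the 0.625 millisecond G.728 frame and the lookahead
   values (5 milliseconds for G.729, 7.5 milliseconds for G.723.1, none
   for G.728) are assumptions based on commonly quoted codec
   parameters, and all names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    frame_ms: float      # audio duration carried by one codec frame
    lookahead_ms: float  # assumed algorithmic lookahead (not from the text)


def packet_creation_time_ms(codec, frames_per_packet=1):
    """Time needed to collect enough audio to fill one packet."""
    return codec.frame_ms * frames_per_packet


def codec_delay_ms(codec, frames_per_packet=1):
    """PCT plus lookahead; excludes network and jitter buffer delay."""
    return packet_creation_time_ms(codec, frames_per_packet) + codec.lookahead_ms


# Assumed parameters (see the lead-in); not taken from the draft itself.
G728 = Codec("G.728", 0.625, 0.0)
G729 = Codec("G.729", 10.0, 5.0)
G7231 = Codec("G.723.1", 30.0, 7.5)

for codec, frames in ((G728, 1), (G729, 1), (G7231, 2)):
    print(codec.name,
          packet_creation_time_ms(codec, frames),
          codec_delay_ms(codec, frames))
```

   Under these assumptions a two-frame G.723.1 superpacket already
   costs 67.5 milliseconds before the packet leaves the source.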

   While TDM switches typically add 1/8000 of a second (125
   microseconds) of latency per switch, queuing delay in IP routers may
   be orders of magnitude higher.

   The aforementioned latency is not constant from packet to packet,
   and successive packets do not even necessarily follow the same route.
   For these reasons packets injected into the PSN at a constant rate
   exit it at stochastic intervals.  As we wish to play out audio at a
   constant rate, this packet delay variation (PDV) must be compensated.
   There are two ways this may be accomplished.  In jitter buffer based
   systems incoming packets are not directly played out, but rather
   placed in a 'jitter buffer' and later played out at a constant rate.
   The jitter buffer is usually configured to be able to absorb the
   maximum expected PDV, and thus introduces a significant amount of
   delay.  In 'shock absorber' based systems packets are played out as
   they arrive, and when a packet is not yet available, a signal
   processing algorithm is employed to extrapolate based on previous
   packets, until such time as a packet arrives.  These systems
   introduce only minimal additional latency, but require considerably
   more computational power.
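
   The trade-off made by a jitter buffer can be sketched as follows.
   This is a minimal model of a fixed (non-adaptive) buffer, assuming
   the first packet is held for the configured buffer depth and one
   packet is played out per packetization period thereafter.  The
   function names and the example delays are hypothetical.

```python
def playout_times(arrivals_ms, period_ms, buffer_ms):
    """Fixed schedule: the first packet is held for buffer_ms, then one
    packet is played every period_ms.  Assumes packet 0 arrived."""
    start = arrivals_ms[0] + buffer_ms
    return [start + i * period_ms for i in range(len(arrivals_ms))]


def arrived_in_time(arrivals_ms, period_ms, buffer_ms):
    """True for each packet that is in the buffer by its playout time.
    An arrival of None marks a packet lost in the network."""
    schedule = playout_times(arrivals_ms, period_ms, buffer_ms)
    return [a is not None and a <= t for a, t in zip(arrivals_ms, schedule)]


# Packets sent every 20 ms with one-way delays of 50, 55, 90, 52, 60 ms.
arrivals = [50, 75, 130, 112, 140]
print(arrived_in_time(arrivals, 20, 20))  # third packet misses its slot
print(arrived_in_time(arrivals, 20, 40))  # deeper buffer absorbs the PDV
```

   In this example doubling the buffer depth rescues the late packet,
   at the price of 20 additional milliseconds of one-way delay for
   every packet.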

   IP networks are intrinsically best-effort, and thus there is no
   guarantee that a packet injected into the PSN is actually received.
   In fact, all PSNs introduce some percentage of packet loss (PL):
   packets are rejected because of detectable errors, dropped because of
   congested resources, or dropped because of policy decisions.
   Packet loss due to random errors will be independently distributed,
   but other types may cause bursts of lost packets.  In addition, when
   parallel paths exist, packets may be received out-of-order, and must
   be either reordered (may be possible in jitter buffer based systems)
   or treated as lost.  When a packet has not been received a decision
   must be made as to what to play out.  One possibility is silence, but
   this will lead to reduced perceived audio quality.  Depending on the
   expected percentage of packet loss, packet loss concealment (PLC)
   mechanisms may need to be employed.
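
   To make the playout decision concrete, here is a minimal sketch of
   one very simple PLC strategy: repeating the last received frame
   with progressive attenuation, and falling back to silence when no
   frame has yet been received.  It is not the PLC of any particular
   codec; the function name and the attenuation factor are assumptions
   chosen for illustration.

```python
def conceal(frames, frame_len, attenuation=0.5):
    """Replace each missing frame (None) with a faded copy of the last
    received frame; use silence if nothing has been received yet."""
    output = []
    last_good = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good, gain = list(frame), 1.0
            output.append(list(frame))
        elif last_good is None:
            output.append([0.0] * frame_len)
        else:
            gain *= attenuation  # fade further with each consecutive loss
            output.append([s * gain for s in last_good])
    return output


received = [[100.0, -100.0], None, None, [40.0, 40.0]]
print(conceal(received, frame_len=2))
```

   The fade means that a long loss burst decays toward silence rather
   than repeating the same frame indefinitely.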

   Another consequence of the bandwidth sharing of PSNs is the
   possibility of congestion events, statistically infrequent peaks of
   activity during which there is insufficient bandwidth or processing
   power to transport all packets.  For non-real-time traffic there are
   self-regulating rate control mechanisms, but for real-time traffic it
   is not clear that such mechanisms can be useful.


   The PSTN is based on TDM networks that inherently transport timing
   information in the physical layer along with the data.  PSNs do not
   include such a physical layer clock, and when such a clock is
   required, an appropriate mechanism must be supplied.  This mechanism
   may rely on a clock source external to the PSN (e.g.  GPS
   satellites), or may involve clock recovery over the PSN itself (e.g.
   NTP).
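
   As an illustration of recovery over the PSN itself, the sketch
   below applies the standard NTP-style four-timestamp calculation of
   clock offset and round-trip delay.  It assumes the network delay is
   the same in both directions; any asymmetry (for example due to PDV)
   appears directly as offset error.  The function and variable names
   are illustrative.

```python
def offset_and_delay(t1, t2, t3, t4):
    """t1: request sent (client clock), t2: request received (server
    clock), t3: reply sent (server clock), t4: reply received (client
    clock).  Returns (server-minus-client offset, round-trip delay)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Server clock 5 units ahead, 10 units of delay each way, 1 unit of
# server processing time.
print(offset_and_delay(t1=100, t2=115, t3=116, t4=121))
```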

   By bandwidth conservation algorithms we mean all source coding
   schemes employed to reduce the data rate toward the Shannon rate.
   These range from lossless data compression, through speech encoding,
   fax image encoding, to video encoding.  Except for lossless
   compression, all such mechanisms introduce some quality reduction,
   and all (including lossless compression) reduce robustness to errors
   and packet loss.

   The final source of degradation is emulation mechanisms internal to
   gateways that enable access to the PSN.  These mechanisms may try to
   simulate behavior of a PSTN system, to terminate or relay PSTN-
   specific signaling, or to optimize operation of interactive real-time
   traffic over the PSN.  These mechanisms are typically required to
   detect various characteristics of the incoming real-time signals, and
   need to do so rapidly, with high probability of detection, and with
   low false alarm rate.  When such a mechanism fails, the gateway may
   enter a state from which it may take time to exit, creating a severe
   anomaly in user perceived performance.

3.  Bandwidth and Audio Quality Problems

   Even assuming a perfect PSN, i.e. one with no packet loss (PL), no
   packet mis-ordering, and only minimal packet delay variation (PDV),
   the perceived voice quality of VoIP calls is highly dependent on
   bandwidth reduction mechanisms.  First, in order to minimize
   bandwidth consumption speech encoding algorithms are employed that
   reduce the MOS to somewhere between 3.5 and 3.8.  Second, voice
   activity detection (VAD) is typically employed to mute (or replace
   with locally generated 'comfort noise') one direction of the
   conversation; this VAD is never perfect and may clip the start of
   voice spurts.  Third, since speech compression does not faithfully
   pass various tones (e.g. DTMF), these are detected and passed using
   special relay functions; false alarms in such detection produce
   annoying beeps known as 'talk-offs'.
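
   The clipping behavior of VAD can be seen in even the simplest
   detector.  The following sketch is an energy-threshold VAD with a
   'hangover' period, offered only as an illustration; the threshold,
   the hangover length, and the function names are assumptions, and
   deployed VADs are considerably more elaborate.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def vad(frames, threshold, hangover_frames=2):
    """Return one boolean per frame: True if the frame is transmitted.
    After speech is detected, the next hangover_frames low-energy
    frames are still transmitted to avoid clipping word endings."""
    decisions = []
    hang = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            hang = hangover_frames
            decisions.append(True)
        elif hang > 0:
            hang -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


# Silence, a soft speech onset, loud speech, then silence.
frames = [[0, 0], [1, 1], [10, 10], [0, 0], [0, 0], [0, 0], [0, 0]]
print(vad(frames, threshold=50))
```

   Here the soft onset frame falls below the threshold and is muted,
   which is exactly the clipped start of a voice spurt described
   above.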

   When the present generation of speech encoders was developed, the
   only design criteria were compression ratio, speech quality (MOS),
   and to a certain degree delay (although G.723.1 was supposedly
   designed with VoIP in mind, its round-trip combined delay of 75
   milliseconds is not conducive to use over the public Internet).  At
   about the same time speech encoders were developed for satellite
   applications that were built to be robust to individual bit errors;
   but no encoders were built to be robust to loss of entire packets.
   Indeed, even the common event of the loss of a single packet may
   cause a disruption to the decoded audio that may last for a long
   time.  Later the iLBC speech coder (described in RFC 3951) was
   designed to eliminate this problem (and today other encoder
   techniques are known that are inherently insensitive to missing
   data).  When the packet loss problem was better understood, PLC
   mechanisms were added to speech encoders used over PSNs, but these
   PLCs helped mainly for loss of isolated packets.  Typical PL patterns
   of IP networks (e.g. loss bursts) were not taken into account.
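
   Loss bursts are often modeled with a two-state (Gilbert) chain, in
   which the network alternates between a 'good' state that delivers
   packets and a 'bad' state that drops them.  The sketch below is one
   such model, given here as an assumption for illustration rather
   than as a characterization of any measured network; the parameter
   names p and r are hypothetical.

```python
import random


def loss_rate(p, r):
    """Long-run fraction of lost packets; p = P(good -> bad),
    r = P(bad -> good), every packet sent in the bad state is lost."""
    return p / (p + r)


def mean_burst_length(r):
    """Average number of consecutive losses once a burst has started."""
    return 1.0 / r


def simulate_losses(p, r, n, seed=0):
    """Return n booleans, True where the packet is lost."""
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n):
        if bad:
            bad = rng.random() >= r  # remain in the bad state
        else:
            bad = rng.random() < p   # enter the bad state
        lost.append(bad)
    return lost


print(loss_rate(0.01, 0.25), mean_burst_length(0.25))
```

   With p = 0.01 and r = 0.25 the average loss is under 4 percent, yet
   losses arrive in bursts averaging four packets, a pattern that a
   PLC tuned for isolated losses handles poorly.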

   As the development of speech encoding algorithms has in general
   proceeded without detailed knowledge of PSN characteristics, required
   functionality, such as PLC, has been added on a posteriori.  Higher
   efficiency and performance may be gained by a priori design of PSN-
   aware speech and other audio (and later video) encoders and PSN-aware
   PLC mechanisms.

   In addition, when the end-user terminals are no longer POTS phones,
   one may ask why we are still limiting ourselves to 4 kHz bandwidth.
   Wideband telephony (8 kHz bandwidth) speech is noticeably superior,
   and may go far toward convincing users that VoIP quality may actually
   exceed that of the PSTN.  Design of standardized PSN-aware wideband
   encoders is a worthwhile task waiting to be tackled.

   Most speech encoders used today take in a constant number of bytes of
   uncompressed audio, and produce a constant number of compressed
   bytes.  Some speech coders are called adaptive multirate, in that
   they may be configured to produce a specified number of compressed
   bytes.  Truly variable rate compression techniques vary in output
   rate according to the character of the input sounds.  While the use
   of constant rate transport infrastructures dictates constant rate
   encoders, PSN packets may vary in size from packet to packet, and
   thus variable rate encoders may be used.  It is an open question as
   to how to match these encoder parameters to PSN characteristics.

4.  Delay and Delay Variation Problems

   Standard PSTN practice places tight constraints on the tolerable end-
   to-end and round-trip delays.  Although the more modern approach is
   to consider the effect of delay along with other degradations, one-
   way transmission times of up to 150 milliseconds are considered
   universally acceptable, assuming adequate echo control is provided.
   Echo cancellation is required when the delay exceeds about 20
   milliseconds.

   The one-way delay in PSNs is greater than that of the PSTN, due at the
   very least to PCT and lookahead, and often to queuing delays and
   jitter buffer latency.  Indeed, network propagation times alone may
   be in the 100 millisecond range, and thus incompatible with the
   minimum delay introduced by G.723.1.  Thus a sensible approach would
   be to start with a specification of the network delay, and to derive
   allowable buffering and processing budgets.  This would probably
   require smaller frame sizes and minimization of lookahead, and
   innovative designs would be needed to keep bit rates reasonable.
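
   The budgeting approach suggested above can be sketched as simple
   arithmetic.  The 150 millisecond one-way limit and the 100
   millisecond network delay are taken from the text; the lookahead
   values (7.5 milliseconds for G.723.1, 5 milliseconds for G.729) are
   assumptions, as are the function and constant names.

```python
ONE_WAY_LIMIT_MS = 150.0  # one-way limit quoted in the text


def remaining_budget_ms(network_ms, pct_ms, lookahead_ms,
                        limit_ms=ONE_WAY_LIMIT_MS):
    """Delay left for jitter buffering and processing; a negative
    result means the limit is exceeded before any buffering."""
    return limit_ms - network_ms - pct_ms - lookahead_ms


print(remaining_budget_ms(100.0, 30.0, 7.5))  # G.723.1, one frame
print(remaining_budget_ms(100.0, 60.0, 7.5))  # G.723.1, two frames
print(remaining_budget_ms(100.0, 10.0, 5.0))  # G.729, one frame
```

   Under these assumptions a single G.723.1 frame leaves 12.5
   milliseconds for everything else, a two-frame superpacket
   overshoots the limit by 17.5 milliseconds, and a single G.729 frame
   leaves 35 milliseconds.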

   More attention should be paid to perfecting shock absorber based
   systems.  These may need to be more fully integrated into the
   encoder, perhaps more specifically into the PLC mechanism.

5.  Congestion Problems

   When congestion is detected, either by explicit notification or via
   detection of packet loss, even real-time systems should heed the
   network's warning of imminent trouble.  In addition to PLC on any
   missing packet, in the other direction rate cutback needs to be
   attempted, e.g. by lowering VAD thresholds, via adaptation of the
   rate of adaptive multirate encoders or the average output rate
   parameter of variable rate encoders, and in extreme cases by
   deliberate dropping of packets that are likely to be more effectively
   concealed by the PLC.  Although all these activities reduce the
   user's perception of voice quality, they do so less drastically than
   complete loss of all audio.

   Adaptive multirate encoders can generally change rate on a packet by
   packet basis in 'hitless' fashion, but it is unknown how to do this
   when changing encoders.  There has not been sufficient study of how
   to identify packets that may be less harmful to discard.
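
   One possible form of rate cutback for an adaptive multirate encoder
   is sketched below.  The list of rates is that of the AMR narrowband
   modes; the loss thresholds and the one-step-at-a-time policy are
   purely hypothetical and are not proposed here as a recommended
   algorithm.

```python
AMR_MODES_KBPS = (4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2)


def next_mode(current_index, loss_fraction, high=0.05, low=0.01):
    """Step one mode down when reported loss exceeds 'high', one mode
    up when it falls below 'low', and otherwise hold the current mode.
    The thresholds are illustrative assumptions."""
    if loss_fraction > high and current_index > 0:
        return current_index - 1
    if loss_fraction < low and current_index < len(AMR_MODES_KBPS) - 1:
        return current_index + 1
    return current_index


mode = len(AMR_MODES_KBPS) - 1  # start at the highest rate
for reported_loss in (0.08, 0.10, 0.03, 0.0):
    mode = next_mode(mode, reported_loss)
    print(reported_loss, AMR_MODES_KBPS[mode])
```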

6.  Emulation Problems

   The lack of precise clock synchronization between source and
   destination (play out) clocks is usually considered unimportant for
   voice.  This is because even a missed or extra speech sample every
   few minutes is undetectable to the ear.  The situation is different
   when the system is used to transport non-speech data, such as fax and
   data modem transfer without appropriate relays.  In such cases it is
   necessary to match the destination clock to that of the source in
   order to eliminate sample slips.
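
   The rate of sample slips follows directly from the frequency
   offset between the two clocks.  The sketch below assumes a constant
   offset expressed in parts per million (ppm); the function name and
   the example offsets are illustrative.

```python
def seconds_between_slips(sample_rate_hz, offset_ppm):
    """Average time between slipped samples for a constant frequency
    offset between source and destination clocks."""
    if offset_ppm == 0:
        return float("inf")
    return 1e6 / (sample_rate_hz * abs(offset_ppm))


print(seconds_between_slips(8000, 1))   # 125 seconds
print(seconds_between_slips(8000, 50))  # 2.5 seconds
```

   At 8000 samples per second an offset of 1 ppm gives one slip
   roughly every two minutes, which is inaudible in speech, while an
   offset of 50 ppm gives a slip every 2.5 seconds, which a modem or
   fax signal will not tolerate.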

   Accurate (line or acoustic) echo cancellation is essential for high
   ratings of user experience.  At present echo cancellation is
   typically performed where its computational cost is minimized, i.e.
   close to the place where the echo is generated, rather than where it
   would be heard.  It would be useful to be able to employ an echo
   cancellation server anywhere in the network, but there are problems
   that need to be solved before this can be accomplished.  For example,
   the relative timing of the signals flowing in opposite directions
   needs to be determined (including clock synchronization), and the
   fact that neither signal may be echo-free needs to be taken into
   account.

   Real-time monitoring of voice quality has been previously considered.
   Such measures may be based on acoustic models or on measurement of
   network degradations and use of previously determined calibrations.
   Timely feedback of such end-to-end quality information may be useful
   in improving the audio quality, but the precise mechanisms need to be
   worked out.

   Another problem that may be addressed concerns multi-user
   conferencing.  Many present-day systems choose a single dominant
   speaker, squelching others desiring to talk.  This introduces various
   perceived quality degradations, in addition to giving a bad
   impression to the user wanting to 'break in'.  Complete summing of
   audio from all users is problematic for several reasons.  It requires
   decompression and recompression of user audio, and rescaling to avoid
   excessive signal levels.  Advances would be welcome here.
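
   The summing and rescaling step can be sketched as follows, assuming
   that each participant's audio has already been decoded to 16-bit
   linear PCM and that all frames have equal length.  The per-frame
   rescaling shown is one simple choice among many, and the names are
   illustrative.

```python
INT16_MAX = 32767


def mix(frames):
    """Sum equal-length frames of decoded 16-bit linear PCM from all
    participants, scaling the whole frame down if the sum would exceed
    the 16-bit range."""
    summed = [sum(samples) for samples in zip(*frames)]
    peak = max((abs(s) for s in summed), default=0)
    if peak <= INT16_MAX:
        return summed
    # Integer rescaling so that the largest sample maps to INT16_MAX.
    return [s * INT16_MAX // peak for s in summed]


print(mix([[1000, -2000], [500, 500]]))  # no rescaling needed
print(mix([[30000, 100], [30000, 0]]))   # rescaled to fit 16 bits
```

   The mixed frame must then be recompressed for each listener, which
   is the decompression and recompression cost noted above.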

   Reduction of the connection setup delay, and the related delays for
   entering/exiting fax-relay and modem-relay modes, is an important
   signaling problem to be solved.

   Integration of real-time delay-sensitive traffic along a time line
   with other applications may be interesting.  The most important
   application here is lip syncing, but syncing text for Karaoke,
   whiteboard motions to spoken words, etc. may need to be addressed.

7.  Security Considerations

   Although not directly related to the real-time character of the
   traffic, authentication, encryption, and methods for lawful
   interception (e.g. CALEA) need to be integrated in a standard way
   into VoIP systems.

8.  IANA Considerations

   This Internet Draft does not propose a protocol, nor a change to any
   existing protocol, and thus no IANA considerations are raised.


Author's Address

   Yaakov (J) Stein
   RAD Data Communications
   24 Raoul Wallenberg St., Bldg C
   Tel Aviv  69719
   ISRAEL

   Phone: +972 3 645-5389
   Email: yaakov_s@rad.com


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.

