Internet Engineering Task Force                                   AVT WG
Internet Draft                                               Schulzrinne
ietf-avt-dtmf-01.txt                                         Columbia U.
November 18, 1998
Expires: May 15, 1999


                      RTP Payload for DTMF Digits

STATUS OF THIS MEMO

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as ``work in progress''.

   To learn the current status of any Internet-Draft, please check the
   ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
   ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

                                 ABSTRACT


         This memo describes how to carry dual-tone multifrequency
         (DTMF) signaling and other tone signals in RTP packets.

1 Introduction

   This memo defines a payload type for carrying dual-tone
   multifrequency (DTMF) digits in RTP packets. A separate payload type
   is desirable since low-rate voice codecs cannot be guaranteed to
   accurately reproduce DTMF. Defining a separate payload type also
   permits higher redundancy while maintaining a low bit rate.

   The DTMF payload type must be suitable for both gateway and
   end-to-end scenarios. In the gateway scenario, a gateway connecting a


Schulzrinne                                                   [Page 1]

Internet Draft                  Profile                November 18, 1998


   packet voice network with the PSTN recreates the DTMF tones and
   injects them into the PSTN. Since DTMF digit recognition takes
   several tens of milliseconds, careful time and power (volume)
   alignment is needed to avoid generating spurious digits. For
   interactive voice response (IVR) systems directly connected to the
   packet voice network, time alignment and volume levels are not
   important, since the unit will not perform any signal analysis to
   detect DTMF tones from the audio stream.

   DTMF digits are carried as part of the audio stream, and SHOULD use
   the same sequence number and timestamp base as the regular audio
   channel to simplify recreation of analog audio at a gateway. The
   default clock frequency is 8000 Hz, but the clock frequency can be
   redefined when assigning the dynamic payload type.

   Even under sustained packet loss, this format achieves higher
   redundancy than the method proposed for the Voice over Frame Relay
   Implementation Agreement [1].

   In circumstances where exact timing alignment between the audio
   stream and the DTMF digits is not important and data is sent unicast,
   such as the IVR example mentioned earlier, it may be preferable to
   use a reliable control stream such as H.245.

   A source MAY send coded DTMF and coded audio packets for the same
   time instants, using DTMF as the redundant encoding for the audio
   stream, or it MAY block outgoing audio while DTMF tones are active
   and only send DTMF digits as both the primary and redundant
   encodings.

2 Payload Format


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R R R|  digit  |R R| volume    |          duration             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


   digit: The DTMF digits are encoded as follows:

                        DTMF digit    encoding (decimal)
                        ________________________________
                        0             0
                        1             1
                        2             2
                        3             3
                        4             4
                        5             5
                        6             6
                        7             7
                        8             8
                        9             9
                        *             10
                        #             11
                        A             12
                        B             13
                        C             14
                        D             15
                        Flash         16


   volume: The power level of the digit, expressed in dBm0 after
        dropping the sign, with range from 0 to -63 dBm0. The range of
        valid DTMF is from 0 to -36 dBm0 (must accept); lower than -55
        dBm0 must be rejected (TR-TSY-000181, ITU-T Q.24A). Thus, larger
        values denote lower volume.

   Note: since the acceptable dip is 10 dB and the minimum detectable
   loudness variation is 3 dB, this field could be compressed by at
   least a bit by reducing resolution to 2 dB, if needed.

   duration: Duration of this digit, in timestamp units.


        For a sampling rate of 8000 Hz, this field is sufficient to
        express digit durations of up to approximately 8 seconds.

   R: This field is reserved for future use. The sender MUST set it to
        zero; the receiver MUST ignore it.
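
   The field layout above can be illustrated with a minimal sketch,
   assuming network byte order and integer power levels. The helper
   names pack_dtmf and unpack_dtmf are hypothetical and not part of
   this memo; the digit codes are those of the table above.

```python
import struct

# Digit-to-code mapping taken from the table in this section.
DTMF_CODES = {**{str(d): d for d in range(10)},
              "*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15,
              "Flash": 16}
DTMF_DIGITS = {code: digit for digit, code in DTMF_CODES.items()}


def pack_dtmf(digit, level_dbm0, duration):
    """Pack one digit into the 32-bit payload (hypothetical helper)."""
    volume = -level_dbm0            # "dropping the sign": -10 dBm0 -> 10
    if not 0 <= volume <= 63:
        raise ValueError("level must lie between 0 and -63 dBm0")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    # R bits are sent as zero, so each octet holds just the field value.
    return struct.pack("!BBH", DTMF_CODES[digit], volume, duration)


def unpack_dtmf(payload):
    """Return (digit, level_dbm0, duration); reserved bits are ignored."""
    first, second, duration = struct.unpack("!BBH", payload[:4])
    code = first & 0x1F             # low 5 bits of octet 0: digit
    volume = second & 0x3F          # low 6 bits of octet 1: volume
    # Codes above 16 are not defined by the table; report them as None.
    return DTMF_DIGITS.get(code), -volume, duration
```

   Masking on receipt, rather than rejecting non-zero reserved bits,
   follows the rule that the receiver ignores the R field.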

   An audio source SHOULD start transmitting DTMF digit packets as soon
   as it recognizes a DTMF digit and every 50 ms thereafter. (Precise
   spacing between DTMF digit packets is not necessary.)

        Q.24 [2], Table A-1, indicates that all administrations
        surveyed use a minimum signal duration of 40 ms, with
        signaling velocity (tone and pause) of no less than 93 ms.

   If a digit continues for more than one period, the source should send
   a new DTMF packet with the RTP timestamp value corresponding to the
   beginning of the digit and the duration of the digit increased
   correspondingly. (The RTP sequence number is incremented by one for
   each packet.) If there has been no new digit in the last interval,
   the digit SHOULD be retransmitted three times (or until the next
   digit is recognized) to ensure some measure of reliability for the
   last digit.
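
   The transmission rule above might be sketched as follows for a
   single isolated digit. This is an illustration under stated
   assumptions, not a prescribed algorithm: reports are taken to be
   emitted at the end of each 50 ms interval, "three times" is read as
   three additional copies of the final report, and the names
   DigitPacket and packets_for_digit are hypothetical.

```python
from dataclasses import dataclass

CLOCK_HZ = 8000          # default clock frequency from Section 1
INTERVAL_MS = 50         # packet interval from this section
FINAL_REPEATS = 3        # assumed: three extra copies of the last report


@dataclass
class DigitPacket:
    seq: int             # RTP sequence number
    timestamp: int       # start of the digit, in timestamp units
    duration: int        # digit duration so far, in timestamp units


def packets_for_digit(start_ms, length_ms, first_seq):
    """List the packets a sender might emit for one isolated digit."""
    if length_ms <= 0:
        raise ValueError("digit length must be positive")
    timestamp = start_ms * CLOCK_HZ // 1000
    packets = []
    seq = first_seq
    elapsed = 0
    while elapsed < length_ms:
        elapsed = min(elapsed + INTERVAL_MS, length_ms)
        duration = elapsed * CLOCK_HZ // 1000
        # Timestamp stays at the digit start; only duration grows.
        packets.append(DigitPacket(seq, timestamp, duration))
        seq += 1
    # No new digit follows: repeat the final report for reliability.
    final_duration = packets[-1].duration
    for _ in range(FINAL_REPEATS):
        packets.append(DigitPacket(seq, timestamp, final_duration))
        seq += 1
    return packets
```

   For a 200 ms digit this yields four growing reports followed by
   three repeats, all carrying the same RTP timestamp.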


        DTMF digits are sent incrementally to avoid having the
        receiver wait for the completion of the digit. Since some
        tones are two seconds long, this would incur a substantial
        delay. The transmitter does not know if digit length is
        important and thus needs to transmit immediately and
        incrementally. If the receiver application does not care
        about digit length, the incremental transmission mechanism
        avoids delay. Some applications, such as gateways into the
        GSTN, care about both delays and digit duration.
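
   On the receiving side, incremental reports for the same digit share
   one RTP timestamp, so they can be collapsed by keeping the longest
   duration seen for each timestamp. The sketch below assumes reports
   arrive as (timestamp, digit, duration) tuples and ignores timestamp
   wrap-around; merge_reports is a hypothetical name.

```python
def merge_reports(reports):
    """Collapse incremental reports into one entry per digit.

    `reports` is an iterable of (timestamp, digit, duration) tuples in
    arrival order; duplicates and reordering are tolerated.
    """
    digits = {}
    for timestamp, digit, duration in reports:
        known = digits.get(timestamp)
        # Keep the report with the longest duration for each start time.
        if known is None or duration > known[1]:
            digits[timestamp] = (digit, duration)
    return [(ts, d, dur) for ts, (d, dur) in sorted(digits.items())]
```

   An application that does not care about digit length can act on the
   first report for a new timestamp and discard the later ones.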

3 Reliability

   To achieve reliability even when the network loses packets, the audio
   redundancy mechanism described in RFC 2198 [3] is used. The effective
   data rate is r times 64 bits (32 bits for the redundancy header and
   32 bits for the DTMF payload) every 50 ms or r times 1280
   bits/second, where r is the number of redundant DTMF digits carried
   in each packet. The value of r is an implementation trade-off, with a
   value of 5 suggested.
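
   The rate arithmetic above can be checked with a short sketch. It
   follows the accounting used in the text (32 header bits plus 32
   payload bits per digit) and leaves out RTP, UDP and IP overhead;
   the function name is hypothetical.

```python
def redundancy_bit_rate(r, interval_ms=50):
    """Bits per second for r digits carried in every packet."""
    header_bits, payload_bits = 32, 32
    packets_per_second = 1000 // interval_ms
    return r * (header_bits + payload_bits) * packets_per_second
```

   With r = 1 this gives 1280 bits/second, and with the suggested
   r = 5 it gives 6400 bits/second.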


        The timestamp offset in this redundancy scheme has 14 bits,
        so that it allows a single packet to "cover" 2.048 seconds
        of DTMF digits at a sampling rate of 8000 Hz. Including the
        starting time of previous digits allows precise
        reconstruction of the tone sequence at a gateway. The
        scheme is resilient to consecutive packet losses spanning
        this interval of 2.048 seconds or r digits, whichever is
        less. Note that for previous digits, only an average
        loudness can be represented.
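
   As an illustration, the RFC 2198 block header (F bit, 7-bit block
   payload type, 14-bit timestamp offset, 10-bit block length) can be
   packed as shown below. The function names are hypothetical, and the
   sketch assumes the one-octet form of the final (primary) header.

```python
import struct


def pack_redundant_header(block_pt, ts_offset, block_len):
    """Pack a 32-bit header for a redundant block (F bit set)."""
    if not 0 <= block_pt < (1 << 7):
        raise ValueError("block PT must fit in 7 bits")
    if not 0 <= ts_offset < (1 << 14):
        raise ValueError("timestamp offset must fit in 14 bits")
    if not 0 <= block_len < (1 << 10):
        raise ValueError("block length must fit in 10 bits")
    word = (1 << 31) | (block_pt << 24) | (ts_offset << 10) | block_len
    return struct.pack("!I", word)


def pack_primary_header(block_pt):
    """Pack the final one-octet header (F bit clear)."""
    if not 0 <= block_pt < (1 << 7):
        raise ValueError("block PT must fit in 7 bits")
    return bytes([block_pt])


def offset_range_seconds(clock_hz=8000):
    """Time span addressable by the 14-bit timestamp offset."""
    return (1 << 14) / clock_hz
```

   A digit that started further back than the 14-bit range allows
   cannot be carried as a redundant block, which is why the sketch
   rejects larger offsets instead of truncating them.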

   An encoder MAY treat the DTMF payload as a highly-compressed version
   of the current audio frame. In that mode, each RTP packet during a
   DTMF tone would contain the current audio codec rendition (say,
   G.723.1 or G.729) of this digit as well as the representation
   described in Section 2, plus any previous digits as before.


        This approach allows dumb gateways that do not understand
        this format to function. Other reasons?

3.1 Example

   The packet below is a typical RTP packet sent while the user is
   dialing the last digit of the DTMF sequence "911". The first digit
   was 200 ms long and started at time 0, the second digit lasted 250 ms
   and started at time 800 ms, the third digit was pressed at time 1.4 s
   and the packet shown was sent at 1.45 s. The frame duration is 50 ms.
   To make the parts recognizable, the figure below ignores byte
   alignment. Timestamp and sequence number are assumed to have been
   zero at the beginning of the first digit. In this example, the
   dynamic payload types 96 and 97 have been assigned for the redundancy
   mechanism and the DTMF payload, respectively.


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   | 2 |0|0|   0   |0|     96      |              28               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   |                             11200                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   |                            0x5234a8                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|   block PT  |     timestamp offset      |   block length    |
   |1|     97      |            11200          |         4         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|   block PT  |     timestamp offset      |   block length    |
   |1|     97      |             4800          |         4         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|   block PT  |
   |0|     97      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R R R|  digit  |R R| volume    |          duration             |
   |0 0 0|    9    |0 0|     7     |             1600              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R R R|  digit  |R R| volume    |          duration             |
   |0 0 0|    1    |0 0|    10     |             2000              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R R R|  digit  |R R| volume    |          duration             |
   |0 0 0|    1    |0 0|    20     |              400              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
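
   The byte-level form of such a packet can be sketched as follows.
   The sketch takes the RTP timestamp to be the start of the third
   digit (1.4 s, i.e., 11200 units at 8000 Hz), as Section 2 prescribes,
   and derives each timestamp offset from the digit start times given
   in the text. The function names are hypothetical. Unlike the figure,
   the result is byte-exact, so the one-octet final block header leaves
   the DTMF payloads unaligned.

```python
import struct

CLOCK_HZ = 8000
RED_PT, DTMF_PT = 96, 97        # dynamic payload types from the example
SEQ, SSRC = 28, 0x5234A8        # values shown in the figure


def ts(ms):
    """Convert milliseconds to timestamp units."""
    return ms * CLOCK_HZ // 1000


def build_example_packet():
    # (digit code, volume field, start in ms, duration in ms), oldest first
    digits = [(9, 7, 0, 200), (1, 10, 800, 250), (1, 20, 1400, 50)]
    rtp_ts = ts(digits[-1][2])              # start of the current digit
    # V=2, P=0, X=0, CC=0 -> 0x80; M=0 with PT=96 -> 0x60.
    header = struct.pack("!BBHII", 2 << 6, RED_PT, SEQ, rtp_ts, SSRC)
    block_headers = b""
    for _, _, start, _ in digits[:-1]:
        offset = rtp_ts - ts(start)         # distance back to digit start
        word = (1 << 31) | (DTMF_PT << 24) | (offset << 10) | 4
        block_headers += struct.pack("!I", word)
    block_headers += bytes([DTMF_PT])       # final header, F=0
    payloads = b"".join(
        struct.pack("!BBH", code, volume, ts(duration))
        for code, volume, _, duration in digits
    )
    return header + block_headers + payloads
```

   The resulting packet is 33 octets long: 12 for the RTP header, 9 for
   the three block headers and 12 for the three DTMF payloads.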


4 Compact Reliability Scheme

   A more compact representation could be achieved by measuring DTMF
   tones at a different sampling rate from that of the surrounding audio
   codec, e.g., as multiples of 1, 10, 40 or 50 ms. Each RTP payload
   type should have a fixed sampling rate, so choosing a value that
   depends on the frame interval of the surrounding codec is not
   recommended. For a sampling interval of 50 ms, the following payload
   would "cover" 8 seconds of duration and offset:


    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    offset     |R R R|  digit  |R R| volume    |   duration    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
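
   A minimal sketch of this compact layout, with field widths read off
   the diagram above and a 50 ms unit for both offset and duration. The
   function name is hypothetical, and the sketch assumes that times are
   exact multiples of 50 ms and that reserved bits are sent as zero.

```python
import struct

UNIT_MS = 50    # sampling interval assumed in the text above


def pack_compact(offset_ms, digit_code, volume, duration_ms):
    """Pack offset, digit, volume and duration into four octets."""
    if offset_ms % UNIT_MS or duration_ms % UNIT_MS:
        raise ValueError("times must be multiples of 50 ms")
    offset = offset_ms // UNIT_MS
    duration = duration_ms // UNIT_MS
    fields = (("offset", offset, 0xFF), ("digit", digit_code, 0x1F),
              ("volume", volume, 0x3F), ("duration", duration, 0xFF))
    for name, value, limit in fields:
        if not 0 <= value <= limit:
            raise ValueError(name + " out of range")
    return struct.pack("!BBBB", offset, digit_code, volume, duration)
```

   For example, a digit "1" at volume 10 that lasted 250 ms and began
   600 ms before the packet timestamp packs into the octets 12, 1, 10
   and 5.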


5 Changes Since Version -00

        o Uniform interval of 50 ms, since audio frame interval may
          change based on codec.

6 Acknowledgements

   The suggestions of the VoIP working group and Fred Burg are
   gratefully acknowledged.

7 Bibliography

   [1] R. Kocen and T. Hatala, "Voice over frame relay implementation
   agreement," Implementation Agreement FRF.11, Frame Relay Forum,
   Foster City, California, Jan. 1997.

   [2] International Telecommunication Union, "Multifrequency push-
   button signal reception," Recommendation Q.24, Telecommunication
   Standardization Sector of ITU, Geneva, Switzerland, 1988.

   [3] C. Perkins, I. Kouvelas, V. Hardman, M. Handley, and J. Bolot,
   "RTP payload for redundant audio data," RFC 2198, Internet
   Engineering Task Force, Sept. 1997.

