Internet Engineering Task Force                               SIPPING WG
Internet Draft                                              J. Rosenberg
                                                             dynamicsoft
draft-rosenberg-sipping-app-interaction-framework-00.txt
October 28, 2002
Expires: April 2003


     A Framework and Requirements for Application Interaction in SIP

STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   To view the list of Internet-Draft Shadow Directories, see
   http://www.ietf.org/shadow.html.


Abstract

   This document describes a framework and requirements for the
   interaction between users and Session Initiation Protocol (SIP) based
   applications. By interacting with applications, users can guide the
   way in which they operate. The focus of this framework is stimulus
   signaling, which allows a user agent to interact with an application
   without knowledge of the semantics of that application. Stimulus
   signaling can occur to a user interface running locally with the
   client, or to a remote user interface, through media streams.
   Stimulus signaling encompasses a wide range of mechanisms, ranging
   from clicking on hyperlinks, to pressing buttons, to traditional Dual
   Tone Multi Frequency (DTMF) input. In all cases, stimulus signaling
   is supported through the use of markup languages, which play a key
   role in this framework.


J. Rosenberg                                                  [Page 1]

Internet Draft              App Interaction             October 28, 2002


                           Table of Contents


   1          Introduction ........................................    3
   2          Definitions .........................................    3
   3          A Model for Application Interaction .................    6
   3.1        Function vs. Stimulus ...............................    8
   3.2        Real-Time vs. Non-Real Time .........................    8
   3.3        Client-Local vs. Client-Remote ......................    9
   3.4        Interaction Scenarios on Telephones .................   10
   3.4.1      Client Remote .......................................   10
   3.4.2      Client Local ........................................   10
   3.4.3      Flip-Flop ...........................................   11
   4          Framework Overview ..................................   12
   5          Client Local Interfaces .............................   13
   5.1        Discovering Capabilities ............................   14
   5.2        Pushing an Initial Interface Component ..............   14
   5.3        Updating an Interface Component .....................   16
   5.4        Terminating an Interface Component ..................   17
   6          Client Remote Interfaces ............................   17
   6.1        Originating and Terminating Applications ............   18
   6.2        Intermediary Applications ...........................   18
   7          Inter-Application Feature Interaction ...............   18
   7.1        Client Local UI .....................................   19
   7.2        Client-Remote UI ....................................   20
   7.2.1      Centralized Server ..................................   20
   7.2.2      Pipe-and-Filter .....................................   21
   7.2.2.1    Client Resolution ...................................   22
   7.2.3      Comparison ..........................................   31
   8          Intra Application Feature Interaction ...............   33
   9          Examples ............................................   34
   10         Security Considerations .............................   35
   11         Contributors ........................................   35
   12         Authors Address .....................................   35
   13         Normative References ................................   37
   14         Informative References ..............................   37



1 Introduction

   The Session Initiation Protocol (SIP) [1] provides the ability for
   users to initiate, manage, and terminate communications sessions.
   Frequently, these sessions will involve a SIP application. A SIP
   application is defined as a program running on a SIP-based element
   (such as a proxy or user agent) that provides some value-added
   function to a user or system administrator. Examples of SIP
   applications include pre-paid calling card calls, conferencing, and
   presence-based [2] call routing.

   In order for most applications to properly function, they need input
   from the user to guide their operation. As an example, a pre-paid
   calling card application requires the user to input their calling
   card number, their PIN code, and the destination number they wish to
   reach. The process by which a user provides input to an application
   is called "application interaction".

   Application interaction can be either functional or stimulus.
   Functional interaction requires the user agent to understand the
   semantics of the application, whereas stimulus interaction does not.
   Stimulus signaling allows for applications to be built without
   requiring modifications to the client. Stimulus interaction is the
   subject of this framework. The framework provides a model for how
   users interact with applications through user interfaces, and how
   user interfaces and applications can be distributed throughout a
   network. This model is then used to describe how applications can
   instantiate and manage user interfaces.

2 Definitions

        SIP Application: A SIP application is defined as a program
             running on a SIP-based element (such as a proxy or user
             agent) that provides some value-added function to a user or
             system administrator. Examples of SIP applications include
             pre-paid calling card calls, conferencing, and presence-
             based [2] call routing.

        Application Interaction: The process by which a user provides
             input to an application.

        Real-Time Application Interaction: Application interaction that
             takes place while an application instance is executing. For
             example, when a user enters their PIN number into a pre-
             paid calling card application, this is real-time
             application interaction.

        Non-Real Time Application Interaction: Application interaction
             that takes place asynchronously with the execution of the
             application. Generally, non-real time application
             interaction is accomplished through provisioning.

        Functional Application Interaction: Application interaction is
             functional when the user device has an understanding of the
             semantics of the application that the user is interacting
             with.

        Stimulus Application Interaction: Application interaction is
             considered to be stimulus when the user device has no
             understanding of the semantics of the application that the
             user is interacting with.

        User Interface (UI): The user interface provides the user with
             context in order to make decisions about what they want.
             The user enters information into the user interface. The
             user interface interprets the information, and passes it to
             the application.

        User Interface Component: A piece of user interface which
             operates independently of other pieces of the user
             interface. For example, a user might have two separate web
             interfaces to a pre-paid calling card application - one for
             hanging up and making another call, and another for
             entering the username and PIN.

        User Device: The software or hardware system that the user
             directly interacts with in order to communicate with the
             application. An example of a user device is a telephone.
             Another example is a PC with a web browser.

        User Input: The "raw" information passed from a user to a user
             interface. Examples of user input include a spoken word or
             a click on a hyperlink.

        Client-Local User Interface: A user interface which is co-
             resident with the user device.

        Client-Remote User Interface: A user interface which executes
             remotely from the user device. In this case, a standardized
             interface is needed between them. Typically, this is done
             through media sessions - audio, video, or application
             sharing.

        Media Interaction: A means of separating a user and a user
             interface by connecting them with media streams.


        Interactive Voice Response (IVR): An IVR is a type of user
             interface that allows users to speak commands to the
             application, and hear responses to those commands prompting
             for more information.

        Prompt-and-Collect: The basic primitive of an IVR user
             interface. The user is presented with a voice option, and
             the user speaks their choice.

        Barge-In: In an IVR user interface, a user is prompted to enter
             some information. With some prompts, the user may enter the
             requested information before the prompt completes. In that
             case, the prompt ceases. The act of entering the
             information before completion of the prompt is referred to
             as barge-in.

        Focus: A user interface component has focus when user input is
             fed to it, as opposed to any other user interface
             components. This is not to be confused with the term focus
             within the SIP conferencing framework, which refers to the
             center user agent in a conference [3].

        Focus Determination: The process by which the user device
             determines which user interface component will receive the
             user input.

        Focusless User Interface: A user interface which has no ability
             to perform focus determination. An example of a focusless
             user interface is a keypad on a telephone.

        Feature Interaction: A class of problems which result when
             multiple applications or application components are trying
             to provide services to a user at the same time.

        Inter-Application Feature Interaction: Feature interactions that
             occur between applications.

        DTMF: Dual-Tone Multi-Frequency. DTMF refers to a class of tones
             generated by circuit switched telephony devices when the
             user presses a key on the keypad. As a result, DTMF and
             keypad input are often used synonymously, when in fact one
             of them (DTMF) is merely a means of conveying the other
             (the keypad input) to a client-remote user interface (the
             switch, for example).

        Application Instance: A single execution path of a SIP
             application.


        Originating Application: A SIP application which acts as a UAC,
             calling the user.

        Terminating Application: A SIP application which acts as a UAS,
             answering a call generated by a user. IVR applications are
             terminating applications.

        Intermediary Application: A SIP application which is neither the
             caller nor the callee, but rather, a third party involved in a
             call.

3 A Model for Application Interaction


      +---+            +---+            +---+             +---+           
      |   |            |   |            |   |             |   |           
      |   |            | U |            | U |             | A |           
      |   |   Input    | s |   Input    | s |   Results   | p |           
      |   | ---------> | e | ---------> | e | ----------> | p |           
      | U |            | r |            | r |             | l |           
      | s |            |   |            |   |             | i |           
      | e |            | D |            | I |             | c |           
      | r |   Output   | e |   Output   | f |   Update    | a |           
      |   | <--------- | v | <--------- | a | <.......... | t |           
      |   |            | i |            | c |             | i |           
      |   |            | c |            | e |             | o |           
      |   |            | e |            |   |             | n |           
      |   |            |   |            |   |             |   |           
      +---+            +---+            +---+             +---+           
                                                                          
                                                                          
   Figure 1: Model for Real-Time Interactions


   Figure 1 presents a general model for how users interact with
   applications. Generally, users interact with a user interface through
   a user device. A user device can be a telephone, or it can be a PC
   with a web browser. Its role is to pass the user input from the user,
   to the user interface. The user interface provides the user with
   context in order to make decisions about what they want. The user
   enters information into the user interface. The user interface
   interprets the information, and passes it to the application. The
   application may be able to modify the user interface based on this
   information. Whether or not this is possible depends on the type of
   user interface.

   User interfaces are fundamentally about rendering and interpretation.
   Rendering refers to the way in which the user is provided context.
   This can be through hyperlinks, images, sounds, videos, text, and so
   on. Interpretation refers to the way in which the user interface
   takes the "raw" data provided by the user, and returns the result to
   the application in a meaningful format, abstracted from the
   particulars of the user interface. As an example, consider a pre-paid
   calling card application. The user interface worries about details
   such as what prompt the user is provided, whether the voice is male
   or female, and so on. It is concerned with recognizing the speech
   that the user provides, in order to obtain the desired information.
   In this case, the desired information is the calling card number, the
   PIN code, and the destination number. The application needs that
   data, and it doesn't matter to the application whether it was
   collected using a male prompt or a female one.

   User interfaces generally have real-time requirements towards the
   user. That is, when a user interacts with the user interface, the
   user interface needs to react quickly, and that change needs to be
   propagated to the user right away. However, the interface between the
   user interface and the application need not be that fast. Faster is
   better, but the user interface itself can frequently compensate for
   long latencies there. In the case of a pre-paid calling card
   application, when the user is prompted to enter their PIN, the prompt
   should generally stop immediately once the first digit of the PIN is
   entered. This is referred to as barge-in. After the user-interface
   collects the rest of the PIN, it can tell the user to "please wait
   while processing". The PIN can then be gradually transmitted to the
   application. In this example, the user interface has compensated for
   a slow UI to application interface by asking the user to wait.
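
   The barge-in behavior described above can be sketched as follows.
   This is a minimal illustration only, assuming a fixed-length PIN and
   an event-driven user interface; the class PromptAndCollect and its
   method on_key are hypothetical names and are not defined by this
   framework.

```python
from dataclasses import dataclass, field


@dataclass
class PromptAndCollect:
    """Hypothetical user interface component that collects a fixed-length PIN."""
    pin_length: int
    prompt_playing: bool = True
    digits: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def on_key(self, key: str):
        """Handle one keypad input; return the PIN once it is complete."""
        if len(self.digits) >= self.pin_length:
            return None  # collection already finished; ignore extra input
        if self.prompt_playing:
            # Barge-in: the first digit stops the prompt right away.
            self.prompt_playing = False
            self.log.append("prompt stopped")
        self.digits.append(key)
        if len(self.digits) == self.pin_length:
            # Compensate for a slow UI-to-application interface.
            self.log.append("please wait while processing")
            return "".join(self.digits)
        return None


ui = PromptAndCollect(pin_length=4)
results = [ui.on_key(k) for k in "1234"]
```

   In this sketch the prompt stops on the first key, and only the
   completed PIN (the interpreted result, not the raw key presses) would
   be handed to the application.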

   The separation between user interface and application is absolutely
   fundamental to the entire framework provided in this document. Its
   importance cannot be overstated.

   With this basic model, we can begin to taxonomize the types of
   systems that can be built.


3.1 Function vs. Stimulus

   The first way to taxonomize the system is to consider the interface
   between the UI and the application. There are two fundamentally
   different models for this interface. In a functional interface, the
   user interface has detailed knowledge about the application, and is,
   in fact, specific to the application. The interface between the two
   components is through a functional protocol, capable of representing
   the semantics which can be exposed through the user interface.
   Because the user interface has knowledge of the application, it can
   be optimally designed for that application. As a result, functional
   user interfaces are almost always the most user friendly, the
   fastest, and the most responsive. However, in order to allow
   interoperability between user devices and applications, the details
   of the functional protocols need to be specified in standards. This
   slows down innovation and limits the scope of applications that can
   be built.

   An alternative is a stimulus interface. In a stimulus interface, the
   user interface is generic, totally ignorant of the details of the
   application. Indeed, the application may pass instructions to the
   user interface describing how it should operate. The user interface
   translates user input into "stimulus" - which are data understood
   only by the application, and not by the user interface. Because they
   are generic, and because they require communications with the
   application in order to change the way in which they render
   information to the user, stimulus user interfaces are usually slower,
   less user friendly, and less responsive than a functional
   counterpart. However, they allow for substantial innovation in
   applications, since no standardization activity is needed to build a
   new application, as long as it can interact with the user within the
   confines of the user interface mechanism.

   In SIP systems, functional interfaces are provided by extending the
   SIP protocol to provide the needed functionality. For example, the
   SIP caller preferences specification [4] provides a functional
   interface that allows a user to request applications to route the
   call to specific types of user agents. Functional interfaces are
   important, but are not the subject of this framework. The primary
   goal of this framework is to address the role of stimulus interfaces
   to SIP applications.

3.2 Real-Time vs. Non-Real Time

   Application interaction systems can also be real-time or non-real-
   time. Non-real-time interaction allows the user to enter information
   about application operation asynchronously with its invocation.
   Frequently, this is done through provisioning systems. As an example,
   a user can set up the forwarding number for a call-forward on no-
   answer application using a web page. Real-time interaction requires
   the user to interact with the application at the time of its
   invocation.

3.3 Client-Local vs. Client-Remote

   Another axis in the taxonomization is whether the user interface is
   co-resident with the user device (which we refer to as a client-local
   user interface), or the user interface runs in a host separated from
   the client (which we refer to as a client-remote user interface). In
   a client-remote user interface, there exists some kind of protocol
   between the client device and the UI that allows the client to
   interact with the user interface over a network.

   The most important way to separate the UI and the client device is
   through media interaction. In media interaction, the interface
   between the user and the user interface is through media - audio,
   video, messaging, and so on. This is the classic mode of operation
   for VoiceXML [5], where the user interface (also referred to as the
   voice browser) runs on a platform in the network. Users communicate
   with the voice browser through the telephone network (or using a SIP
   session). The voice browser interacts with the application using HTTP
   to convey the information collected from the user.

   We refer to the second sub-case as a client-local user interface. In
   this case, the user interface runs co-located with the user. The
   interface between them is through the software that interprets the
   user's input and passes it to the user interface. The classic
   example of this is the web. In the web, the user interface is a web
   browser, and the interface is defined by the HTML document that it's
   rendering. The user interacts directly with the user interface
   running in the browser. The results of that user interface are sent
   to the application (running on the web server) using HTTP.

   It is important to note that whether or not the user interface is
   local, or remote (in the case of media interaction), is not a
   property of the modality of the interface, but rather a property of
   the system. As an example, it is possible for a web-based user
   interface to be provided with a client-remote user interface. In such
   a scenario, video and application sharing media sessions can be used
   between the user and the user interface. The user interface, still
   guided by HTML, now runs "in the network", remote from the client.
   Similarly, a VoiceXML document can be interpreted locally by a client
   device, with no media streams at all. Indeed, the VoiceXML document
   can be rendered using text, rather than media, with no impact on the
   interface between the user interface and the application.


   It is also important to note that systems can be hybrid. In a hybrid
   user interface, some aspects of it (usually those associated with a
   particular modality) run locally, and others run remotely.

3.4 Interaction Scenarios on Telephones

   This same model can apply to a telephone. In a traditional telephone,
   the user interface consists of a 12-key keypad, a speaker, and a
   microphone. Indeed, from here forward, the term "telephone" is used
   to represent any device that meets, at a minimum, the characteristics
   described in the previous sentence. Circuit-switched telephony
   applications are almost universally client-remote user interfaces. In
   the Public Switched Telephone Network (PSTN), there is usually a
   circuit interface between the user and the user interface. The user
   input from the keypad is conveyed using Dual-Tone Multi-Frequency
   (DTMF), and the microphone input as PCM encoded voice.

   In an IP-based system, there is more variability in how the system
   can be instantiated. Both client-remote and client-local user
   interfaces to a telephone can be provided.

   In this framework, a PSTN gateway can be considered a "user proxy".
   It is a proxy for the user because it can provide, to a user
   interface on an IP network, input taken from a user on a circuit
   switched telephone. The gateway may be able to run a client-local
   user interface, just as an IP telephone might.

3.4.1 Client Remote

   The most obvious instantiation is the "classic" circuit-switched
   telephony model. In that model, the user interface runs remotely from
   the client. The interface between the user and the user interface is
   through media, set up by SIP and carried over the Real Time Transport
   Protocol (RTP) [6]. The microphone input can be carried using any
   suitable voice encoding algorithm. The keypad input can be conveyed
   in one of two ways. The first is to convert the keypad input to DTMF,
   and then convey that DTMF using a suitable encoding algorithm for it
   (such as PCMU). An alternative, and generally the preferred approach,
   is to transmit the keypad input using RFC 2833 [7], which provides an
   encoding mechanism for carrying keypad input within RTP.
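
   As an illustration of the second approach, the sketch below builds
   the 4-byte telephone-event payload that RFC 2833 carries inside an RTP
   packet: an 8-bit event code (0 through 9 for the digits, 10 for "*",
   11 for "#"), an end bit, a reserved bit, a 6-bit volume, and a 16-bit
   duration in timestamp units. The function name, the default volume,
   and the omission of the RTP header are assumptions made for brevity.

```python
import struct

# Event codes for keypad keys as assigned by RFC 2833: 0-9, then * and #.
DTMF_EVENTS = {key: code for code, key in enumerate("0123456789*#")}


def encode_telephone_event(key: str, duration: int, end: bool = False,
                           volume: int = 10) -> bytes:
    """Build the 4-byte telephone-event payload for one keypad key.

    Layout: event (8 bits), E bit, R bit, volume (6 bits), duration (16 bits).
    The default volume is an assumed value chosen for this sketch.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags_and_volume = (0x80 if end else 0x00) | volume  # R bit left at zero
    return struct.pack("!BBH", DTMF_EVENTS[key], flags_and_volume, duration)


final_pound = encode_telephone_event("#", 800, end=True)
```

   With an 8000 Hz timestamp clock, a duration of 800 corresponds to 100
   ms. Because the key identity travels as a code rather than as audio,
   the receiving user interface does not need tone detection.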

   In this classic model, the user interface would run on a server in
   the IP network. It would perform speech recognition and DTMF
   recognition to derive the user intent, feed it through the user
   interface, and provide the result to an application.

3.4.2 Client Local


   An alternative model is for the entire user interface to reside on
   the telephone. The user interface can be a VoiceXML browser, running
   speech recognition on the microphone input, and feeding the keypad
   input directly into the script. As discussed above, the VoiceXML
   script could be rendered using text instead of voice, if the
   telephone had a textual display.

3.4.3 Flip-Flop

   A middle-ground approach is to flip back and forth between a client-
   local and client-remote user interface. Many voice applications are
   of the type which listen to the media stream and wait for some
   specific trigger that kicks off a more complex user interaction. The
   long pound in a pre-paid calling card application is one example.
   Another example is a conference recording application, where the user
   can press a key at some point in the call to begin recording. When
   the key is pressed, the user hears a whisper to inform them that
   recording has started.

   The ideal way to support such an application is to install a client-
   local user interface component that waits for the trigger to kick off
   the real interaction. Once the trigger is received, the application
   connects the user to a client-remote user interface that can play
   announcements, collect more information, and so on.

   The benefit of flip-flopping between a client-local and client-remote
   user interface is cost. The client-local user interface will
   eliminate the need to send media streams into the network just to
   wait for the user to press the pound key on the keypad.

   The Keypad Markup Language (KPML) was designed to support exactly
   this kind of need. It models the keypad on a phone, and allows an
   application to be informed when any sequence of keys has been
   pressed. However, KPML has no presentation component. Since user
   interfaces generally require a response to user input, the
   presentation will need to be done using a client-remote user
   interface that gets instantiated as a result of the trigger.
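
   The trigger-then-connect pattern can be sketched as below. This is
   not KPML; it is a hypothetical client-local component, written only to
   show the control flow. The class name, the callback, and the one
   second long-press threshold are all assumptions.

```python
from typing import Callable


class TriggerWatcher:
    """Hypothetical client-local component that waits for a long pound."""

    def __init__(self, on_trigger: Callable[[], None],
                 long_press_ms: int = 1000):
        self.on_trigger = on_trigger
        self.long_press_ms = long_press_ms  # assumed threshold, not from a spec
        self.fired = False

    def key_released(self, key: str, held_ms: int) -> None:
        """Report one completed key press and how long the key was held."""
        if self.fired:
            return  # the client-remote user interface has taken over
        if key == "#" and held_ms >= self.long_press_ms:
            self.fired = True
            self.on_trigger()


events = []
watcher = TriggerWatcher(lambda: events.append("connect client-remote UI"))
for key, held in [("1", 120), ("#", 200), ("#", 1500), ("#", 2000)]:
    watcher.key_released(key, held)
```

   No media stream is needed while the watcher is idle; only the single
   trigger notification causes the application to bring in a client-
   remote user interface for presentation.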

   It is tempting to use a hybrid model, where a prompt-and-collect
   application is implemented by using a client-remote user interface
   that plays the prompts, and a client-local user interface, described
   by KPML, that collects digits. However, this only complicates the
   application. Firstly, the keypad input will be sent to both the media
   stream and the KPML user interface. This requires the application to
   sort out which user inputs are duplicates, a process that is very
   complicated. Secondly, the primary benefit of KPML is to avoid having
   a media stream towards a user interface. However, there is already a
   media stream for the prompting, so there is no real savings.


   That said, the framework does support this hybrid model.

4 Framework Overview

   In this framework, we use the term "SIP application" to refer to a
   broad set of functionality. A SIP application is a program running on
   a SIP-based element (such as a proxy or user agent) that provides
   some value-added function to a user or system administrator. SIP
   applications can execute on behalf of a caller, a called party, or a
   multitude of users at once.

   Each application has a number of instances that are executing at any
   given time. An instance represents a single execution path for an
   application. Each instance has a well defined lifecycle. It is
   established as a result of some event. That event can be a SIP event,
   such as the reception of a SIP INVITE request, or it can be a non-SIP
   event, such as a web form post or even a timer. Application instances
   also have a specific end time. Some instances have a lifetime that is
   coupled with a SIP transaction or dialog. For example, a proxy
   application might begin when an INVITE arrives, and terminate when
   the call is answered. Other applications have a lifetime that spans
   multiple dialogs or transactions. For example, a conferencing
   application instance may exist so long as there are any dialogs
   connected to it. When the last dialog terminates, the application
   instance terminates. Other applications have a lifetime that is
   completely decoupled from SIP events.
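
   The conferencing example above, in which the instance lifetime spans
   multiple dialogs, might be modeled as in the following sketch. The
   class and method names are hypothetical, and dialog identifiers are
   reduced to plain strings for illustration.

```python
class ConferenceInstance:
    """Hypothetical application instance that lives while dialogs remain."""

    def __init__(self):
        self.dialogs = set()
        self.terminated = False

    def dialog_established(self, dialog_id: str) -> None:
        """Attach a newly established dialog to this instance."""
        if self.terminated:
            raise RuntimeError("instance has already terminated")
        self.dialogs.add(dialog_id)

    def dialog_terminated(self, dialog_id: str) -> None:
        """Detach a dialog; the instance ends when none are left."""
        self.dialogs.discard(dialog_id)
        if not self.dialogs:
            # The last dialog is gone, so the instance ends with it.
            self.terminated = True


conf = ConferenceInstance()
conf.dialog_established("dialog-a")
conf.dialog_established("dialog-b")
conf.dialog_terminated("dialog-a")
still_running = not conf.terminated
conf.dialog_terminated("dialog-b")
```

   An instance coupled to a single transaction or dialog is the special
   case in which the set never holds more than one entry.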

   It is fundamental to the framework described here that multiple
   application instances may interact with a user during a single SIP
   transaction or dialog. Each instance may be for the same application,
   or different applications. Each of the applications may be completely
   independent, in that they may be owned by different providers, and
   may not be aware of each other's existence. Similarly, there may be
   application instances interacting with the caller, and instances
   interacting with the callee, both within the same transaction or
   dialog.

   The first step in the interaction with the user is to instantiate one
   or more user interface components for the application instance. A
   user interface component is a single piece of the user interface that
   is defined by a logical flow that is not synchronously coupled with
   any other component. In other words, each component runs more or less
   independently.

   A user interface component can be instantiated in one of the user
   devices (for a client-local user interface), or within a network
   element (for a client-remote user interface). If a client-local user
   interface is to be used, the application needs to determine whether
   or not the user device is capable of supporting a client-local user
   interface, and in what format. In this framework, all client-local
   user interface components are described by a markup language. A
   markup language describes a logical flow of presentation of
   information to the user, collection of information from the user, and
   transmission of that information to an application. Examples of
   markup languages include HTML, WML, VoiceXML, the Keypad Markup
   Language (KPML) [8] and the Media Server Control Markup Language
   (MSCML) [9].

   The interface between the user interface component and the
   application is typically markup-language specific. However, all of
   the markup languages discussed above use HTTP form POST requests as
   the primary interface [note that this is still an open issue with
   KPML]. As discussed in Section 3, this interface is well suited to
   HTTP, which is a good match for its latency, reliability, and content
   requirements.

   To create a client-local user interface, the application passes the
   markup document (or a reference to it) in a SIP message to that
   client. The SIP message can be one explicitly generated by the
   application (in which case the application has to be a UA or B2BUA),
   or it can be placed in a SIP message that passes by (in which case
   the application can be running in a proxy).

   Client local user interface components are always associated with the
   dialog that the SIP message itself is associated with. Consequently,
   user interface components cannot be placed in messages that are not
   associated with a dialog.

   If a user interface component is to be instantiated in the network,
   there is no need to determine the capabilities of the device on which
   the user interface is instantiated. Presumably, it is on a device on
   which the application knows a UI can be created. However, the
   application does need to connect the user device to the user
   interface. This will require manipulation of media streams in order
   to establish that connection.

   Once a user interface component is created, the application needs to
   be able to change it, and to remove it. Finally, more advanced
   applications may require coupling between application components. The
   framework supports rudimentary capabilities there.

5 Client Local Interfaces

   One key component of this framework is support for client local user
   interfaces.


5.1 Discovering Capabilities

   A client local user interface can only be instantiated on a client if
   the user device has the capabilities needed to do so. Specifically,
   an application needs to know what markup languages, if any, are
   supported by the client. For example, does the client support HTML?
   VoiceXML?  However, that information is not sufficient to determine
   if a client local user interface can be instantiated. In order to
   instantiate the user interface, the application needs to transfer the
   markup document to the client. There are two ways in which the markup
   document can be transferred. The application can send the client a
   URI which the client can use to fetch the markup, or the markup can
   be sent inline within the message. The application needs to know
   which of these modes are supported, and in the case of indirection,
   which URI schemes are supported to obtain the indirection.

   Many applications will need to know these capabilities at the time an
   application instance is first created. Since applications can be
   created through SIP requests or responses, SIP needs to provide a
   means to convey this information. This introduces several concrete
   requirements for SIP:

        REQ 1: A SIP request or response must be capable of conveying
             the set of markup languages supported by the UA that
             generated the request or response.

        REQ 2: A SIP request or response must be capable of indicating
             whether a UA can obtain markups inline, or through an
             indirection. In the case of indirection, the UA must be
             capable of indicating what URI schemes it supports.
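
   As an illustration of how an application might act on the
   information called for by REQ 1 and REQ 2, the following Python
   sketch picks a delivery mode for a markup document. The class
   UACapabilities, the function choose_delivery, and the preference
   for inline delivery over indirection are all assumptions made for
   this example; the framework does not define how capabilities are
   encoded in SIP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UACapabilities:
    """Hypothetical summary of what a UA advertised (REQ 1 and REQ 2)."""
    markup_types: frozenset = frozenset()  # markup languages the UA supports
    inline_ok: bool = False                # UA accepts markup inline
    uri_schemes: frozenset = frozenset()   # schemes usable for indirection


def choose_delivery(caps, markup_type, available_schemes):
    """Return ("inline", None), ("reference", scheme), or None.

    None means no client-local user interface can be instantiated
    with this markup on this UA.
    """
    if markup_type not in caps.markup_types:
        return None
    if caps.inline_ok:
        # Assumption: prefer inline delivery when the UA allows it.
        return ("inline", None)
    for scheme in available_schemes:
        if scheme in caps.uri_schemes:
            return ("reference", scheme)
    return None
```

   An application that can serve its markup over several URI schemes
   would pass them in its own order of preference; a result of None
   tells it to fall back to a client-remote user interface or to
   proceed without one.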

5.2 Pushing an Initial Interface Component

   Once the application has determined that the UA is capable of
   supporting client local user interfaces, the next step is for the
   application to push an interface component to the UA.

   Generally, we anticipate that interface components will need to be
   created at various different points in a SIP session. Clearly, they
   will need to be pushed during an initial INVITE, in both responses
   (so as to place a component into the calling UA) and in the request
   (so as to place a component into the called UA). As an example, a
   conference recording application allows the users to record the media
   for the session at any time. The application would like to push an
   HTML user interface component to both the caller and callee at the
   time the call is set up, allowing either to record the session. The
   HTML component would have buttons to start and stop recording. To
   push the HTML component to the caller, it needs to be pushed in the
   200 OK (and possibly a provisional response), and to push it to the
   callee, in the INVITE itself.

   To state the requirement more concretely:

        REQ 3: An application must be able to add a reference to, or an
             inline version of, a user interface component into any
             request or response that passes through or emanates from
             that application.

   However, there will also be cases where the application needs to push
   a new interface component to a UA, but it is not as a result of any
   SIP message. As an example, a pre-paid calling card application will
   set a timer that determines how long the call can proceed, given the
   availability of funds in the user's account. When the timer fires,
   the application would like to push a new interface component to the
   calling UA, allowing them to click to add more funds.

   In this case, there is no message already in transit that can be used
   as a vehicle for pushing a user interface component. This requires
   that applications can generate their own messages to push a new
   component to a UA:

        REQ 4: A UA application must be able to send a SIP message to
             the UA at the other end of the dialog, asking it to create
             a new interface component.

   In all cases, the information passed from the application to the UA
   must include more than just the interface component itself (or a
   reference to it). The user must be able to decide whether or not it
   wants to proceed with this application. To make that determination,
   the user must have information about the application. Specifically,
   it will need the name of the application, and an identifier of the
   owner or administrator for the application. As an example, a typical
   name would be "Prepaid Calling Card" and the owner could be
   "voiceprovider.com".

        REQ 5: Any user interface component passed to a client (either
             inline or through a reference) must also include markup
             meta-data, including a human readable name of the
             application, and an identifier of the owner of the
             application.

   Clearly, there are security implications. The user will need to
   verify the identity of the application owner, and be sure that the
   user interface component is not being replayed, that is, it actually
   belongs with this specific SIP message.


        REQ 6: It must be possible for the client to validate the
             authenticity and integrity of the markup document (or its
             reference) and its associated meta-data. It must be
             possible for the client to verify that the information has
             not been replayed from a previous SIP message.

   If the user decides not to execute the user interface component, it
   simply discards it. There is no explicit requirement for the user to
   be able to inform the application that the component was discarded.
   Effectively, the application will think that the component was
   executed, but that the user never entered any information.


        OPEN ISSUE: Are we certain? Adding support for this makes
        the system more complicated though. Warning headers may
        make sense here.

5.3 Updating an Interface Component

   Once a user interface component has been created on a client, it can
   be updated in two ways. The first way is the "normal" path inherent
   to that component. The client enters some data, the user interface
   transfers the information to the application (typically through
   HTTP), and the result of that transfer brings a new markup document
   describing an updated interface. This is referred to as a synchronous
   update, since it is synchronized with user interaction.

   However, synchronous updates are not sufficient for many
   applications. Frequently, the interface will need to be updated
   asynchronously by the application, without an explicit user action. A
   good example of this is, once again, the pre-paid calling card
   application. The application might like to update the user interface
   when the timer runs out on the call. This introduces several
   requirements:

        REQ 7: It must be possible for an application to asynchronously
             push an update to an existing user interface component,
             either in a message that was already in transit, or by
             generating a new message.

        REQ 8: It must be possible for the client to associate the new
             interface component with the one that it is supposed to
             replace, so that the old one can be removed.

   Unfortunately, pushing of application components introduces a race
   condition. What if the user enters data into the old component,
   causing an HTTP request to the application, while an update of that
   component is in progress? The client will get an interface component
   in the HTTP response, and also get the new one in the SIP message.
   Which one does the client use? There needs to be a way in which to
   properly order the components:

        REQ 9: It must be possible for the client to relatively order
             user interface updates it receives as the result of
             synchronous and asynchronous messaging.
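
   One way to satisfy REQ 9 can be sketched as follows, assuming
   (this framework does not specify it) that the application stamps
   every update with a monotonically increasing version number,
   whether the update travels in an HTTP response or in a SIP
   message. ComponentState and apply_update are hypothetical names
   used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ComponentState:
    """What the client currently displays for one interface component."""
    version: int = -1   # -1 means nothing has been displayed yet
    markup: str = ""


def apply_update(state, version, markup):
    """Apply an update only if it is newer than what is displayed.

    Returns True if the update was applied, False if it was stale.
    The same check is used for synchronous (HTTP) and asynchronous
    (SIP) updates, so arrival order does not matter.
    """
    if version <= state.version:
        return False
    state.version = version
    state.markup = markup
    return True
```

   With this rule, if the asynchronous SIP push (version 2) reaches
   the client before the HTTP response to the user's earlier input
   (version 1), the late HTTP response is discarded and the newer
   component stays on screen.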

5.4 Terminating an Interface Component

   User interface components have a well defined lifetime. They are
   created when the component is first pushed to the client. User
   interface components are always associated with the SIP dialog on
   which they were pushed. As such, their lifetime is bound by the
   lifetime of the dialog. When the dialog ends, so does the interface
   component.

   This rule applies to early dialogs as well. If a user interface
   component is passed in a provisional response to INVITE, and a
   separate branch eventually answers the call, the component terminates
   with the arrival of the 2xx. That is because the early dialog itself
   terminates with the arrival of the 2xx.

   However, there are some cases where the application would like to
   terminate the user interface component before its natural termination
   point. To do this, the application pushes a "null" update to the
   client. This is an update that replaces the existing user interface
   component with nothing.

        REQ 10: It must be possible for an application to terminate a
             user interface component before its natural expiration.

   The user can also terminate the user interface component. However,
   there is no explicit signaling required in this case. The component
   is simply dismissed. To the application, it appears as if the user
   has simply ceased entering data.
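
   The lifetime rules of Sections 5.3 and 5.4 can be summarized in a
   short Python sketch of client-side bookkeeping. The class
   ComponentRegistry, the use of string identifiers for dialogs and
   components, and the choice of None to represent a "null" update
   are assumptions of this example, not part of the framework.

```python
class ComponentRegistry:
    """Tracks client-local interface components per SIP dialog."""

    def __init__(self):
        # dialog_id -> {component_id: markup}
        self._by_dialog = {}

    def push(self, dialog_id, component_id, markup):
        """Create, replace, or remove a component.

        A push for an existing component_id replaces the old
        component (REQ 8). A markup of None models the "null"
        update that terminates the component early (REQ 10).
        """
        components = self._by_dialog.setdefault(dialog_id, {})
        if markup is None:
            components.pop(component_id, None)
        else:
            components[component_id] = markup

    def dialog_ended(self, dialog_id):
        """Remove every component bound to a dialog that has ended."""
        self._by_dialog.pop(dialog_id, None)

    def active(self, dialog_id):
        """Return the sorted identifiers of components still alive."""
        return sorted(self._by_dialog.get(dialog_id, {}))
```

   The early-dialog case described above falls out of the same rule:
   when the 2xx arrives on another branch, the client calls
   dialog_ended for the early dialog and its components disappear.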

6 Client Remote Interfaces

   As an alternative to, or in conjunction with client local user
   interfaces, an application can make use of client remote user
   interfaces. These user interfaces can execute co-resident with the
   application itself (in which case no standardized interfaces between
   the UI and the application need to be used), or they can run
   separately. This framework assumes that the user interface runs on a
   host that has a sufficient trust relationship with the application.
   As such, the means for instantiating the user interface is not
   considered here.


   The primary issue is to connect the user device to the remote user
   interface. Doing so requires the manipulation of media streams
   between the client and the user interface. Such manipulation can only
   be done by user agents. There are two types of user agent
   applications within this framework - originating/terminating
   applications, and intermediary applications.

6.1 Originating and Terminating Applications

   Originating and terminating applications are applications which are
   themselves the originator or the final recipient of a SIP invitation.
   They are "pure" user agent applications - not back-to-back user
   agents. The classic example of such an application is an interactive
   voice response (IVR) application, which is typically a terminating
   application. It is a terminating application because the user
   explicitly calls it; i.e., it is the actual called party. An example
   of an originating application is a wakeup call application, which
   calls a user at a specified time in order to wake them up.

   Because originating and terminating applications are a natural
   termination point of the dialog, manipulation of the media session by
   the application is trivial. Traditional SIP techniques for adding and
   removing media streams, modifying codecs, and changing the address of
   the recipient of the media streams, can be applied. Similarly, the
   application can directly authenticate itself to the user through
   S/MIME, since it is the peer UA in the dialog.

6.2 Intermediary Applications

   Intermediary applications are, at the same time, more common than
   originating/terminating applications, and more complex. Intermediary
   applications are applications that are neither the actual caller nor
   called party. Rather, they represent a "third party" that wishes to
   interact with the user. The classic example is the ubiquitous pre-
   paid calling card application.

   In order for the intermediary application to add a client remote user
   interface, it needs to manipulate the media streams of the user agent
   to terminate on that user interface. This also introduces a
   fundamental feature interaction issue. Since the intermediary
   application is not an actual participant in the call, how does the
   user interact with the intermediary application, and its actual peer
   in the dialog, at the same time? This is discussed in more detail in
   Section 7. In fact, the choice about how this problem is solved
   completely determines the architecture of the application.

7 Inter-Application Feature Interaction


   The inter-application feature interaction problem is inherent to
   stimulus signaling. Whenever there are multiple applications, there
   are multiple user interfaces. When the user provides an input, to
   which user interface is the input destined? That question is the
   essence of the inter-application feature interaction problem.

   Inter-application feature interaction is not an easy problem to
   resolve. For now, we consider separately the issues for client-local
   and client-remote user interface components.

7.1 Client Local UI

   When the user interface itself resides locally on the client device,
   the feature interaction problem is actually much simpler. The end
   device knows explicitly about each application, and therefore can
   present the user with each one separately. When the user provides
   input, the client device can determine to which user interface the
   input is destined. The user interface to which input is destined is
   referred to as the application in focus, and the means by which the
   focused application is selected is called focus determination.

   Generally speaking, focus determination is purely a local operation.
   In the PC universe, focus determination is provided by window
   managers. Each application does not know about focus, it merely
   receives the user input that has been targeted to it when it is in
   focus. This basic concept applies to SIP-based applications as well.

   Focus determination will frequently be trivial, depending on the user
   interface type. Consider a user that makes a call from a PC. The call
   passes through a pre-paid calling card application, and a call
   recording application. Both of these wish to interact with the user.
   Both push an HTML-based user interface to the user. On the PC, each
   user interface would appear as a separate window. The user interacts
   with the call recording application by selecting its window, and with
   the pre-paid calling card application by selecting its window. Focus
   determination is literally provided by the PC window manager. It is
   clear to which application the user input is targeted.

   As another example, consider the same two applications, but on a
   "smart phone" that has a set of buttons, and next to each button, an
   LCD display that can provide the user with an option. This user
   interface can be represented using the Wireless Markup Language
   (WML).

   The phone would allocate some number of buttons to each application.
   The prepaid calling card would get one button for its "hangup"
   command, and the recording application would get one for its
   "start/stop" command. The user can easily determine which application
   to interact with by pressing the appropriate button. Pressing a
   button determines focus and provides user input, both at the same
   time.

   Unfortunately, not all devices will have these advanced displays. A
   PSTN gateway, or a basic IP telephone, may only have a 12-key keypad.
   The user interfaces for these devices are provided through the Keypad
   Markup Language (KPML). Considering once again the feature
   interaction case above, the pre-paid calling card application and the
   call recording application would both pass a KPML document to the
   device. When the user presses a button on the keypad, to which
   document does the input apply? The user interface does not allow the
   user to select. A user interface where the user cannot provide focus
   is called a focusless user interface. This is quite a hard problem to
   solve. This framework does not make any explicit normative
   recommendation, but concludes that the best option is to send the
   input to both user interfaces. This is a sensible choice by analogy -
   it is exactly what the existing circuit switched telephone network will
   do. It is an explicit non-goal to provide a better mechanism for
   feature interaction resolution than the PSTN on devices which have
   the same user interface as they do on the PSTN. Devices with better
   displays, such as PCs or screen phones, can benefit from the
   capabilities of this framework, allowing the user to determine which
   application they are interacting with.

   Indeed, when a user provides input on a focusless device, the input
   must be passed to all client local user interfaces, AND all client
   remote user interfaces. In the case of KPML, key events are passed to
   remote user interfaces by encoding them in RFC 2833 [7]. Of course,
   since a client cannot determine if a media stream terminates in a
   remote user interface or not, these key events are passed in all
   audio media streams.
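
   The routing rule for user input described in this section can be
   sketched in Python as follows. The function route_key_event and
   its parameters are hypothetical; in particular, the decision to
   deliver input only to the focused local interface when focus is
   available is an assumption of this sketch, since the text above
   only prescribes behavior for the focusless case.

```python
def route_key_event(local_uis, audio_streams, focused=None):
    """Decide where one user input event should be delivered.

    Returns a pair (local_targets, stream_targets):
      - with focus, the event goes to the focused local interface;
      - on a focusless device, it goes to every client-local
        interface and into every audio media stream, because the
        client cannot tell which streams end in a remote interface.
    """
    if focused is not None:
        if focused not in local_uis:
            raise ValueError("focused interface is not active")
        return [focused], []
    return list(local_uis), list(audio_streams)
```

   For the KPML example above, a device with only a 12-key keypad
   would call route_key_event with focused left as None, so both the
   pre-paid calling card document and the call recording document
   see every key press.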

7.2 Client-Remote UI

   When the user interfaces run remotely, the determination of focus can
   be much, much harder. There are three architectures supported in this
   framework for determining focus. The first is a centralized server
   model, the second is a pipe-and-filter model, and the third is a
   client model.

7.2.1 Centralized Server

   One approach to resolving the feature interaction is to deploy a
   centralized server whose goal is to do just that. The user sends a
   single copy of their media to this server, and the server is the sole
   source of media towards the user. Each application that wishes to
   interact with the user does so using a client local user interface.
   However, the user interface is not instantiated on the client; it is
   instantiated on this central server. The central server is presumed
   to know enough about each application so that it can do a good job of
   determining how media should be passed to each user interface
   requested by each application. This is shown pictorially in Figure 2.


   This model has minimal impact on the client, but it only works well
   in a controlled environment where the entire set of applications is
   known ahead of time.

7.2.2 Pipe-and-Filter

   In order to resolve the interaction, each application acts as a B2BUA
   and as a media relay. This is shown in Figure 3. Each application
   takes its media from the "previous hop", which will be an end-user or
   another B2BUA application, and passes some or all of it on to the
   "next hop". Each application can pick off any media input it feels is
   relevant to its operation, passing the result off to the next hop.
   Furthermore, it can inject media in each direction as it so chooses.
   Conceptually, each application pipes the media it receives to the
   next hop, and can filter it appropriately before sending it on. Thus
   the name, pipe-and-filter.
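
   The pipe-and-filter idea can be illustrated with a small Python
   sketch that treats user input as a list of events flowing through
   a chain of applications. The function run_pipeline, the event
   strings, and the assumption that an application removes the
   events it picks off (rather than copying them) are all
   illustrative choices, not requirements of this framework.

```python
def run_pipeline(events, filters):
    """Pass events through a chain of applications, hop by hop.

    filters is an ordered list of (name, wants) pairs, where wants
    is a predicate selecting the events that application picks off.
    Returns (consumed, delivered): a dict mapping each application
    name to the events it took, and the events that reach the far
    end of the chain.
    """
    consumed = {}
    remaining = list(events)
    for name, wants in filters:
        taken = []
        passed_on = []
        for event in remaining:
            # Each event is either picked off here or piped onward.
            (taken if wants(event) else passed_on).append(event)
        consumed[name] = taken
        remaining = passed_on
    return consumed, remaining
```

   For instance, a prepaid application that only wants the long
   pound would use a predicate matching that one event, while a
   recording application that consumes no input would use a
   predicate that always returns False; everything else reaches the
   called party unchanged.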


   The pipe-and-filter model describes the resolution of focus as
   provided in the existing circuit-switched telephony network.

   Of course, it is not strictly necessary for the application to always
   be a focal point for media. The application can allow the media to
   pass directly between participants when the application has no media
   to present to the user. When the application does have media to
   present to the user, it can execute a re-INVITE to move the media
   streams to a central point of control.


   An example of this is shown in Figure 4. In this example, there are
   two applications - a prepaid calling card application and a call
   recording application. The user makes a call to the prepaid number
   (1). The prepaid application acts as a UAS, answering the INVITE (2-
   3). It prompts the user to enter their calling card, PIN, and
   destination number (4). Once the user has done that, the prepaid
   application makes a call towards the destination number (5). This
   passes through the recording application, which acts as a B2BUA with
   media (i.e., it will also be a media intermediary), and forwards the
   INVITE to the called party (6). The called party answers (7), and the
   200 OKs and ACKs are propagated normally (8-10). At this point, both
   the prepaid application and the call recording application are B2BUA,
   so that the media flows between the caller and the prepaid app (11),
   then to the call recording app (12), and then to the called party
   (13).

   However, once the call is established, the prepaid calling card
   application does not really wish to remain on the media path. All it
   wants is to wait for the long-pound which the caller uses to signal
   the end of the call. To do that, it uses a re-INVITE (14) to both
   remove itself from the media path, and to instantiate a client-local
   user interface, using KPML, into the calling UA. That INVITE contains
   no SDP, as it uses flow I from the third party call control
   specification [10]. The 200 OK from the caller contains its SDP (15),
   which is passed from the prepaid application to the call recording
   application (16). Since the call recording application is a B2BUA, it
   modifies the SDP to keep itself on the media path, passing that SDP
   to the called party (17). The called party answers with its updated
   SDP (18), which is passed to the call recording application, modified
   by it, and passed to the prepaid application (19). The prepaid
   application passes this SDP to the caller in an ACK (22), and then
   generates an ACK back towards the call recording application (20-21).
   Now, media flows from the caller to the call recording application
   (23), and from there, towards the called party (24).

   At some point later, the caller presses the long pound. This is
   passed to the KPML document, which has a single rule waiting for that
   sequence. The result is passed to the prepaid calling card
   application (25). The calling card application now knows that it
   needs to terminate the call with the called party. So, it sends a BYE
   (27), which is propagated normally (28-30). Now, the prepaid
   application needs to prompt the user for the next number. To do that,
   it needs to re-establish a media connection to it, in order to
   execute its client-remote user interface. To do that, it uses a re-
   INVITE (31-33), connecting the application to the caller (34).

7.2.3 Client Resolution

   Having the client resolve the interaction represents a fundamentally
   different way of thinking about intermediary applications.

   Instead of having intermediary applications be a B2BUA just to insert
   themselves into the media stream, they are implemented as a UA (i.e.,
   not back-to-back). Each application is a separate UA, and as such,
   will create and maintain a separate dialog with the user that it
   wishes to interact with. How does the user handle this multiplicity
   of dialogs? Simply put, it acts like a focus. A focus, as defined in
   the SIP conferencing framework [3], is a SIP element that terminates
   multiple SIP dialogs, each of which represents a participant in the
   conference. Effectively, the conferencing framework itself provides


          +-+   +-+                                                       
          |A|   |A|                                                       
          |p|   |p|                                                       
          |p|   |p|                                                       
          |1|   |2|                                                       
          | |   | |                                                       
          |U|   |U|                                                       
          |I|   |I|                                                       
          +-+   +-+                                                       
         +---------+         +------+           +------+                  
         |         |         |      |           |      |                  
         | Central |........>| App1 |..........>| App2 |                  
         |  Server |         |      |           |      |                  
         |         |+++      +------+           +------+                  
         +---------+** ++++                           .                   
            ^ + *     **** ++++                       .                   
            . + *         ***  +++++                  .                   
            . + *            ****   ++++              .                   
            . + *                ***    ++++          .                   
            . + *                   ****    ++++      .                   
            . + *                       ***     +++   V                   
          +---+--+                         ****   +------+                
          |      |                             ** |      |                
          |Client|                                |Callee|                
          |      |                                |      |                
          +------+                                +------+                
                                                                          
                                                                          
        +++++++ RTP Path                                                  
                                                                          
        ******* SIP Dialog                                                
                                                                          
        ....... SIP INVITE Path                                           
                                                                          
                                                                          
   Figure 2: Centralized Server Resolution


                +--------+          +--------+                            
                |        |+++++++++ |        |                            
                |  App1  |********* |  App2  |
                |        |........> |        |                            
                +--------+          +--------+                            
               ^   *   +               .  *  +                            
              .   *   +                 .  *  +                           
             .   *   +                   .  *  +                          
            .   *   +                     .  *  +                         
           .   *   +                       .  *  +                        
          .   *   +                         .  *  +                       
         .   *   +                           .  *  +                      
        .   *   +                             .  *  +                     
       .   *   +                               .  *  +                    
      .   *   +                                 .  *  +                   
         *   +                                   V  *  +                  
    +--------+                                 +--------+                 
    |        |                                 |        |                 
    | Caller |                                 | Callee |                 
    |        |                                 |        |                 
    +--------+                                 +--------+                 
                                                                          
                                                                          
       +++++++ RTP Path                                                   
                                                                          
       ******* SIP Dialog                                                 
                                                                          
       ....... SIP INVITE Path                                            


   Figure 3: Pipe-and-Filter Model


   the foundation upon which client resolution of multiple applications
   will take place.

   Each application has particular requirements on how it would like its
   media stream treated in relation to the other media streams that the
   focus may be managing. As an example, a prepaid calling card
   application will generate media towards the client, in order to
   inform them that they are running out of time in the call. The




       Caller         Prepaid App     Recorder App        Callee
          |(1) INVITE      |                |                |
          |--------------->|                |                |
          |(2) 200 OK      |                |                |
          |<---------------|                |                |
          |(3) ACK         |                |                |
          |--------------->|                |                |
          |(4) RTP         |                |                |
          |collect PIN     |                |                |
          |and number      |                |                |
          |................|                |                |
          |                |(5) INVITE      |                |
          |                |--------------->|                |
          |                |                |(6) INVITE      |
          |                |                |--------------->|
          |                |                |(7) 200 OK      |
          |                |                |<---------------|
          |                |                |(8) ACK         |
          |                |                |--------------->|
          |                |(9) 200 OK      |                |
          |                |<---------------|                |
          |                |(10) ACK        |                |
          |                |--------------->|                |
          |(11) RTP        |                |                |
          |................|                |                |
          |                |(12) RTP        |                |
          |                |................|                |
          |                |                |(13) RTP        |
          |                |                |................|
          |(14) INVITE     |                |                |
          |no SDP          |                |                |
          |KPML            |                |                |
          |<---------------|                |                |
          |(15) 200 OK     |                |                |
          |SDP1            |                |                |
          |--------------->|                |                |
          |                |(16) INVITE     |                |
          |                |SDP1            |                |
          |                |--------------->|                |
          |                |                |(17) INVITE     |
          |                |                |SDP2            |
          |                |                |--------------->|
          |                |                |(18) 200 OK     |
          |                |                |SDP3            |
          |                |                |<---------------|
          |                |(19) 200 OK     |                |
          |                |SDP4            |                |
          |                |<---------------|                |
          |                |(20) ACK        |                |
          |                |--------------->|                |
          |                |                |(21) ACK        |
          |                |                |--------------->|
          |(22) ACK        |                |                |
          |SDP4            |                |                |
          |<---------------|                |                |
          |(23) RTP        |                |                |
          |.................................|                |
          |                |                |(24) RTP        |
          |                |                |................|
          |Hit #           |                |                |
          |(25) HTTP POST  |                |                |
          |--------------->|                |                |
          |(26) 200 OK     |                |                |
          |<---------------|                |                |
          |                |(27) BYE        |                |
          |                |--------------->|                |
          |                |                |(28) BYE        |
          |                |                |--------------->|
          |                |                |(29) 200 OK     |
          |                |                |<---------------|
          |                |(30) 200 OK     |                |
          |                |<---------------|                |
          |(31) INVITE     |                |                |
          |<---------------|                |                |
          |(32) 200 OK     |                |                |
          |--------------->|                |                |
          |(33) ACK        |                |                |
          |<---------------|                |                |
          |(34) RTP        |                |                |
          |................|                |                |


   Figure 4: Pre-Paid Application with Pipe-and-Filter



   application would like this announcement to be spoken more loudly
   than the media from the other participants in the call (which is
   usually just the other party in the call, but could include other
   applications too!). Furthermore, the prepaid calling card application
   would like to receive media from just the calling user, not from any
   other applications or from the other participant in the call. To
   implement this, the application uses the media policy control
   protocol [3]. This protocol allows a participant in a conference to
   inform the focus about its desired policies for media handling. Each
   application would act as a client of this protocol, passing its
   request to the media policy server, which actually runs on the end
   user device.

   The media policy server in the end user device would reconcile the
   various requests, and generate the appropriate media streams towards
   each application, and towards the other user in the call. Indeed, the
   media policy server can reconcile the requests in any way it likes,
   so long as it has sufficient information about what each application
   wants to do. When the user device has a powerful user interface, the
   user themselves can be asked to select which application their media
   is targeted to. Effectively, the client determines the application
   focus, just as in the client-local user interface case (Section 7.1).
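
   The reconciliation step described above can be sketched as follows.
   This is only an illustration: the PolicyRequest fields, the
   reconcile() function, and the rule that an exclusive request
   suppresses all other streams are assumptions made for this example,
   and are not defined by the media policy control protocol [3].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRequest:
    """One application's media-handling request (hypothetical shape)."""
    app: str               # name of the requesting application
    exclusive: bool        # True if the app wants to be sole source and sink
    gain_db: float = 0.0   # requested playout gain relative to other streams


def reconcile(requests, user_choice=None):
    """Return a mapping of rendered application -> playout gain.

    Assumed rule: an exclusive request suppresses every other stream.
    If several applications ask for exclusivity, the user's choice wins,
    falling back to the most recent exclusive request.
    """
    exclusive = [r for r in requests if r.exclusive]
    if exclusive:
        chosen = next((r for r in exclusive if r.app == user_choice),
                      exclusive[-1])
        return {chosen.app: chosen.gain_db}
    # Nobody asked for exclusivity: render every stream at its own gain.
    return {r.app: r.gain_db for r in requests}
```

   With a prepaid request marked exclusive and a recorder request that
   is not, reconcile() renders only the prepaid stream. When two
   exclusive requests compete, passing user_choice models the case where
   the user selects the application in focus.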


   Figure 5 depicts this basic model pictorially. The calling device
   makes an initial INVITE to set up a basic call with the called party.
   This INVITE passes through two proxies, both of which kick off
   applications (app1 and app2) as the request is proxied towards the
   called party. The result is a single dialog set up between the caller
   and called party (dialog C). However, the INVITE from the caller
   indicated that the device is capable of acting as a focus. It did so
   by indicating support for the SIP Join extension [11], which allows a
   UA to request to be conferenced into an existing dialog. As such,
   app1 and app2, each acting as a pure UAC, generate an INVITE towards
   this focus, with a Join header requesting to be added to a conference
   which includes the original dialog. The result is two additional
   dialogs, dialog A and dialog B respectively, which join the original
   dialog in their connection to a focus co-resident with the caller.
   Both app1 and app2 use the media policy control protocol to interact
   with the media policy server co-resident with the user device
   (interaction not shown). This requires the caller to have indicated
   that it supports a media policy server.

        REQ 11: There must be a way for a UA to indicate that it
             supports a media policy server function.
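
   As an illustration of how an application might build the Join header
   for its new INVITE towards the endpoint focus, consider the sketch
   below. It assumes the header syntax later published as RFC 3911
   (Call-ID followed by to-tag and from-tag parameters); the extension
   referenced as [11] may differ in detail. The function names are
   hypothetical.

```python
def format_join(call_id: str, to_tag: str, from_tag: str) -> str:
    """Build a Join header naming the dialog to be joined.

    Assumed syntax: Join: <call-id>;to-tag=<tag>;from-tag=<tag>
    """
    return f"Join: {call_id};to-tag={to_tag};from-tag={from_tag}"


def parse_join(line: str):
    """Return (call_id, to_tag, from_tag) from a Join header line."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "join":
        raise ValueError("not a Join header")
    call_id, *params = [part.strip() for part in value.split(";")]
    tags = dict(param.split("=", 1) for param in params)
    return call_id, tags["to-tag"], tags["from-tag"]
```

   An application that has noted the dialog identifiers of dialog C
   would pass them to format_join() and place the result in its INVITE;
   the focus would use something like parse_join() to locate the dialog
   being joined.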

   In this model, there may be a media stream from the called party,
   app1, and app2, towards the mixer present in the calling UA. This
   "may" is important. In many cases, each application is not really
   actively generating media towards the user. It may only need to
   sporadically interact with the user, and during those times, the
   desired effect is for media from other applications, and the peer
   user, to be suppressed. Therefore, a client can support this model of
   resolution without ever needing to actually mix any media!

   Interestingly, this model for resolving the interaction problem does
   not introduce any new requirements into SIP. The existing
   conferencing framework and its associated requirements provide all
   the tools that are needed. For example, the framework will allow an
   application to initiate a new dialog towards the endpoint focus,
   allowing it to join the call without "ringing" the phone again.


   Figure 6 shows a call flow for the example scenario of Section 7.2.2,
   but using the client resolution architecture. The caller sends out an
   initial INVITE to the prepaid application (1). This INVITE contains a
   Supported header indicating the ability to receive INVITE requests
   with Join headers. It also indicates that the UA supports a media
   policy control server. This arrives at the pre-paid application. The
   pre-paid application generates a 183 to the initial INVITE (2). Then,
   it sends a brand new INVITE request (i.e., not a re-INVITE, and not
   with the same dialog identifiers as the original INVITE) towards the
   caller (3). This INVITE has a Join header containing the dialog
   identifiers from the 183. This is received by the caller. The caller
   mutates into a focus [3], and generates a 200 OK to the INVITE (4).
   The Contact header field in this 200 OK contains the conference URI.
   Effectively, the caller is now hosting a conference that has two
   dialogs - one towards the prepaid application, and the other, an
   early dialog. The prepaid application uses the media policy control
   protocol, and informs the caller that it wishes to be the sole source
   and sink of media (6). This media policy request could be presented
   to the user, informing them that the prepaid calling card application
   is now in focus. The application prompts the user for their calling
   card number, their PIN, and the destination number. Once collected,
   the prepaid calling card application acts as a B2BUA on the original
   INVITE request, and forwards it to the call recording application
   (8). Note that the prepaid application is a B2BUA on this dialog
   because it needs to hang up the call. It does not act as a B2BUA with
   media on this dialog; that is, it does not touch the SDP.

   The forwarded INVITE is received by the call recording application.
   At this point, it just proxies the request towards the called party
   (9). It is not a B2BUA on this dialog, although it does record-route.
   The called party receives the INVITE, and answers with a 200 OK (10).
   This is propagated to the call recording application, which carefully




                   +------+               +------+                        
                   |      |        2      |      |                        
                 > | App1 | .............>| App2 |
                .  |      |               |      | .                      
               .   +------+               +------+ .                      
              .       *                     **      .                     
             .      **                   ***        .                     
            .      *                 ****            .                    
           .      *A              ***                .                    
          1.    **             ***                    .                   
          .    *            ***B                      .                   
         .   **          ***                           .3                 
        .   *        ****                              .                  
       .   *      ***                                   .                 
      .  **    ***                                      .                 
   +----*----**---------------+                          .                
   | +----------+             |                          .                
   | | Endpoint | ****        |                           .               
   | |  Focus   |     ******* |                           .               
   | +----------+            *******                       .              
   |      * +-----+ +--------+|     *******                V              
   |      * |mixer| | Media  ||           C*******    +--------+          
   |      * +-----+ | Policy ||                   ****|        |          
   |  +------+      | Server ||                       |+------+|          
   |  | User |      +--------+|                       || User ||          
   |  +------+                |                       |+------+|          
   +--------------------------+                       +--------+          
                Calling Device                     Called Device          
                                                                          
                                                                          
   ........  Path of initial SIP INVITE                                   
                                                                          
   ********  SIP Dialog                                                   
                                                                          
                                                                          
   Figure 5: Architecture for Client Resolution




       Caller         Prepaid App     Recorder App        Callee
          |(1) INVITE      |                |                |
          |--------------->|                |                |
          |(2) 183         |                |                |
          |<---------------|                |                |
          |(3) INVITE      |                |                |
          |Join            |                |                |
          |<---------------|                |                |
          |(4) 200 OK      |                |                |
          |--------------->|                |                |
          |(5) ACK         |                |                |
          |<---------------|                |                |
          |(6) MS-CTRL     |                |                |
          |just me         |                |                |
          |<---------------|                |                |
          |(7) RTP         |                |                |
          |collect PIN     |                |                |
          |and number      |                |                |
          |................|                |                |
          |                |(8) INVITE      |                |
          |                |--------------->|                |
          |                |                |(9) INVITE      |
          |                |                |--------------->|
          |                |                |(10) 200 OK     |
          |                |                |<---------------|
          |                |(11) 200 OK     |                |
          |                |<---------------|                |
          |(12) 200 OK     |                |                |
          |<---------------|                |                |
          |(13) ACK        |                |                |
          |--------------->|                |                |
          |                |(14) ACK        |                |
          |                |--------------->|                |
          |                |                |(15) ACK        |
          |                |                |--------------->|
          |(16) BYE        |                |                |
          |<---------------|                |                |
          |(17) 200 OK     |                |                |
          |--------------->|                |                |
          |(18) INVITE     |                |                |
          |Join,no media   |                |                |
          |KPML            |                |                |
          |<---------------|                |                |
          |(19) 200 OK     |                |                |
          |--------------->|                |                |
          |(20) ACK        |                |                |
          |<---------------|                |                |
          |(21) INVITE     |                |                |
          |Join            |                |                |
          |<--------------------------------|                |
          |(22) 200 OK     |                |                |
          |-------------------------------->|                |
          |(23) ACK        |                |                |
          |<--------------------------------|                |
          |(24) MS-CTRL    |                |                |
          |fork to me      |                |                |
          |<--------------------------------|                |
          |Hits #          |                |                |
          |(25) HTTP POST  |                |                |
          |--------------->|                |                |
          |(26) 200 OK     |                |                |
          |<---------------|                |                |
          |                |(27) BYE        |                |
          |                |--------------->|                |
          |                |                |(28) BYE        |
          |                |                |--------------->|
          |                |                |(29) 200 OK     |
          |                |                |<---------------|
          |                |(30) 200 OK     |                |
          |                |<---------------|                |
          |(31) BYE        |                |                |
          |<--------------------------------|                |
          |(32) 200 OK     |                |                |
          |-------------------------------->|                |
          |(33) INVITE     |                |                |
          |enable          |                |                |
          |media           |                |                |
          |<---------------|                |                |
          |(34) 200 OK     |                |                |
          |--------------->|                |                |
          |(35) ACK        |                |                |
          |<---------------|                |                |
          |(36) MS-CTRL    |                |                |
          |just me         |                |                |
          |<---------------|                |                |
          |(37) RTP        |                |                |
          |................|                |                |




   Figure 6: Prepaid Application with Client Resolution


   notes the dialog identifier. This 200 OK is passed to the prepaid
   application (11), which also notes the dialog identifier. The 200 OK
   is passed towards the caller (12). The ACK is propagated back towards
   the called party normally (13-15). The 200 OK will have the effect of
   terminating the early dialog that was established by the pre-paid
   calling card application. This leaves the caller with a hosted
   conference with itself, and the pre-paid application as members,
   along with a new dialog (outside of the conference) created from the
   200 OK.

   Knowing this is the case, the prepaid calling card application
   terminates its previous dialog with the caller (16-17). This dialog
   is not useful any more, since it is not joined with the dialog which
   was actually created for the call. However, the prepaid calling card
   application would like to be involved in the successful dialog. For
   now, it doesn't need media, but it wishes to install a client-local
   user interface, in KPML, to watch for the long pound. So, it sends an
   INVITE with no media, with a Join header containing the dialog
   identifier for the established call. The INVITE also contains a KPML
   document (18). This INVITE completes successfully (19-20).

   Now, the call recording application needs to receive a copy of the
   media stream, in order to record it. To do that, it also generates an
   INVITE towards the caller (21), with a Join header containing the
   dialog identifiers from message 10. The INVITE indicates a receive
   only media stream. This dialog completes successfully (22-23). Now,
   the caller is hosting a conference which contains itself, the prepaid
   calling card application (which is neither sending nor receiving
   media), the recording application (which is receiving media), and the
   called party (which is sending and receiving media). The call
   recording application instructs the media policy server in the UA
   (24) that it would like to receive a copy of the media, including
   that received from the called party. Note that there is no need for
   endpoint mixing to support this conference.

   The caller has their conversation. Eventually, they hit the long
   pound to hang up. This results in an HTTP POST to the prepaid
   application, based on the rules in the KPML (25). The prepaid calling
   card application sends a BYE towards the recording application (27).
   The recording application proxies it (28), and it completes normally
   (29-30). Now, recall that the call recording application was actually
   a combination of a proxy (for the original dialog), and a pure UA (to
   record the media stream). Now that the call is over, it terminates
   its dialog with the caller (31-32), and it is now out of the loop.




   The prepaid calling card would now like to communicate with the
   caller. It already has a dialog active with it. So, it merely
   generates a re-INVITE on that dialog (33), adding media streams. This
   dialog completes successfully (34-35). Now, the pre-paid application
   uses the media policy control protocol to tell the caller that the
   application is the only party that should be sending or receiving a
   media stream (36). The prepaid application can then prompt for the
   next number.

7.2.3 Comparison

   There are important differences between the three models. Each has
   pros and cons. We generally compare only the client-based and
   pipe-and-filter models; the centralized server model is not generally
   applicable since it assumes centralized coordination of applications.

   The client-based model has many benefits. First, it has excellent
   security properties. Because each application has a direct dialog
   with the user, and that dialog manages media streams directly between
   the user and each application, the existing SIP security tools can be
   directly used. S/MIME and potentially TLS (if there are no
   intervening proxies between each application and the user device) can
   provide authentication services. The client device can know the
   complete set of applications it is interacting with, since each one
   can authenticate directly with the UA (and vice versa). In the
   pipe-and-filter model, there is a single dialog between the user and
   their "first" application. Therefore, the user cannot directly
   authenticate each application, and vice versa.

   Similarly, each media stream can be properly secured using SRTP [12].
   Because each application is a UA, and not a B2BUA, SRTP key exchanges
   (using MIKEY, for example [13]) are done directly with the
   application to which the media is being sent. In the pipe-and-filter
   model, the applications are the terminating point of the signaling,
   but may not even touch the media stream (once again, consider the
   pre-paid calling card application). Such a configuration might
   preclude the use of SRTP, since the intermediary application would
   appear as a man-in-the-middle attacker!

   B2BUAs also have well understood interactions with end-to-end
   encryption. If the caller encrypts their SDP, B2BUA applications
   will not be able to manipulate it, and so the pipe-and-filter model
   will simply fail. However, the endpoint-based (client-based) model
   still works in the presence of end-to-end encryption of SDP. This is,
   of course, because no B2BUA needs to manipulate the SDP.

   That leads to another benefit - feature transparency. B2BUAs can
   interfere with the operations of features when messages are
   propagated through them. This problem is completely eliminated in the
   client-based architecture.

   There is another interesting benefit of the client-based architecture
   - firewall traversal. In the pipe-and-filter architecture, many
   applications will not need to always be on the media path. The
   applications will use re-INVITEs to move the media streams to
   themselves when needed, and then move them back when done. The
   result of this, as far as the user is concerned, is that a single
   media stream will, at times, appear to be coming from different
   source IP addresses. This means that a SIP-enabled firewall (or one
   controlled by MIDCOM [14]) will need to open a "cone" for the media
   stream - allowing it to go to the user, but come from any source
   address. Such cones are less secure, and less desirable, than a
   pinhole. With the client-based architecture, a SIP-enabled firewall
   can open a cone initially, and when the media arrives from the
   application, close the cone to a pinhole by restricting media packets
   to always have the same source IP address from then on. This
   restriction is possible because media on a particular dialog comes
   from a single source - the application or the user, depending on
   which dialog. The source of the media does not change within a single
   dialog, as it does in the pipe-and-filter model.
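
   The cone-to-pinhole behavior can be sketched as a small piece of
   firewall state. The MediaPinhole class and its admit() method are
   hypothetical names used only for this illustration; a real firewall
   would also track ports, protocols and timeouts.

```python
class MediaPinhole:
    """Per-stream firewall state that starts as a cone and narrows."""

    def __init__(self, dest):
        self.dest = dest            # (ip, port) of the user's media sink
        self.latched_source = None  # None while the cone is still open

    def admit(self, src_ip, dest):
        """Return True if a media packet should be let through."""
        if dest != self.dest:
            return False
        if self.latched_source is None:
            # First packet: close the cone to a pinhole on this source.
            self.latched_source = src_ip
            return True
        return src_ip == self.latched_source
```

   Because the source of media on a dialog never changes in the
   client-based architecture, latching on the first packet does not
   block legitimate media. In the pipe-and-filter model the same rule
   would drop media as soon as an application moved the stream to
   itself.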


        TODO: A picture and some more words are needed here to
        explain this.

   Conceptually, the client-based architecture allows for a unified view
   of applications. A SIP application that desires to instantiate a
   remote client user interface is always a normal user agent, whether
   it be a "terminating" type of application, or "intermediary" type of
   application. These two cases therefore become merged into one.
   Furthermore, the inter-application feature interaction problems for
   client-local and client-remote user interfaces become unified - both
   become local focus determination problems.

   In addition, many of the interactions between application components
   (discussed in Section 8) are simplified because of the simple
   correlation of a dialog to a single application.

   Unfortunately, the benefits of the client-based architecture come at
   a cost of complexity. End devices need to support a focus capability,
   a media policy server function, and possibly a media mixer, although
   the latter can probably be avoided. The model also requires the
   client to construct a globally routable URI to represent its focus,
   something which is not trivial in an IP network laden with NATs and
   firewalls.




8 Intra Application Feature Interaction

   An application can instantiate a multiplicity of user interface
   components. For example, a single application can instantiate two
   separate HTML components and one WML component. Furthermore, an
   application can instantiate both client local and client remote user
   interfaces.

   The feature interaction issues between these components within the
   same application are less severe. If an application has multiple
   client user interface components, their interaction is resolved
   identically to the inter-application case - through focus
   determination. However, the problems in focusless user interfaces
   (such as a keypad) generally won't exist, since the application can
   generate user interfaces which do not overlap in their usage of an
   input.

   The real issue is that the optimal user experience frequently
   requires some kind of coupling between the differing user interface
   components. This is a classic problem in multi-modal user interfaces,
   such as those described by SALT [15]. As an example, consider a user
   interface where a user can either press a labeled button to make a
   selection, or listen to a prompt, and speak the desired selection.
   Ideally, when the user presses the button, the prompt should cease
   immediately, since both of them were targeted at collecting the same
   information in parallel. Such interactions are best handled by
   markups which natively support such interactions, such as SALT, and
   thus require no explicit support from this framework.

   There is, however, a very common interaction in voice-based
   applications which merits support from this framework. Many
   interactive voice response (IVR) systems allow a user to
   "interrupt" a prompt by generating a response before the prompt
   finishes. The ideal user experience is achieved by having the prompt
   cease immediately when the user speaks the input. This is known as
   barge-in.

   In a traditional implementation of an IVR system, there would be a
   client-remote user interface, rendered in VoiceXML. VoiceXML has
   native support for barge-in. However, because the VoiceXML script is
   interpreted remotely, there is a fundamental latency between the
   client and the remote user interface. That is, when the user speaks
   or presses a key, the speech or key must be transmitted to the
   platform and interpreted, and then the VoiceXML server ceases playing
   out media. For this to be observed by the client, the last media
   packet must still travel from the VoiceXML server to the client,
   through its playout buffers, and out the speaker system.
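
   The latency components listed above can be added up in a short
   sketch. The function name and the millisecond values in the usage
   note are illustrative assumptions, not measurements or definitions
   from this framework.

```python
def remote_barge_in_latency_ms(uplink_ms, interpret_ms, downlink_ms,
                               playout_buffer_ms):
    """Time from the user's input until the prompt is heard to stop,
    when the VoiceXML script is interpreted remotely.

    uplink_ms:         input travels from the client to the platform
    interpret_ms:      platform interprets the input and stops sending
    downlink_ms:       last media packet travels back to the client
    playout_buffer_ms: last media packet drains from the playout buffer
    """
    return uplink_ms + interpret_ms + downlink_ms + playout_buffer_ms


# Hypothetical values, chosen only to show the arithmetic.
example_ms = remote_barge_in_latency_ms(50, 20, 50, 60)
```

   With these assumed values the prompt continues for 180 ms after the
   user's input. Only the playout buffer term is under the control of
   the client; the network terms are paid twice, once in each direction.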


   This framework allows for better performance. A VoiceXML user
   interface can actually delegate a component of the user interface to
   be interpreted on the client. Specifically, the collection of the
   keypad input from the user can be delegated to the client by placing
   a KPML-based user interface on the client solely for this purpose.
   KPML has a barge-in feature as well. When the barge-in option is
   selected, and user input matches a regular expression, all incoming
   media streams associated with the application are muted, and the
   playout buffers on the client are flushed. This situation persists
   until the beginning of the next talkspurt, framed by the marker bit
   in the RTP stream.


        OPEN ISSUE: Is the marker bit the right way to do this?

   In this framework, a client local user interface is bound to a
   dialog. A media stream is said to be associated with that user
   interface component if the media stream is managed on the same dialog
   the user interface component is bound to. As a result, if a KPML
   script results in a barge-in, all media streams on that dialog are
   muted until the marker bit on each stream signals a new talkspurt.
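
   The client-side behavior described above can be sketched as follows.
   This is a minimal illustration, not KPML itself: the class names, the
   use of a Python regular expression, and the rule that the pattern is
   searched in the accumulated key presses are all assumptions made for
   the example.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MediaStream:
    """An incoming media stream managed on a given dialog (hypothetical)."""
    dialog_id: str
    muted: bool = False
    playout: deque = field(default_factory=deque)

    def on_rtp(self, marker: bool, payload: bytes) -> None:
        # A set marker bit is taken as the start of a new talkspurt,
        # which ends the mute.
        if self.muted and marker:
            self.muted = False
        if not self.muted:
            self.playout.append(payload)


class KeypadUI:
    """A client-local keypad user interface bound to one dialog."""

    def __init__(self, dialog_id, pattern, barge_in, streams):
        self.dialog_id = dialog_id
        self.pattern = re.compile(pattern)
        self.barge_in = barge_in
        self.streams = streams
        self.keys = ""

    def on_key(self, key: str) -> bool:
        """Record a key press; return True if the pattern matched."""
        self.keys += key
        if not self.pattern.search(self.keys):
            return False
        if self.barge_in:
            # Only streams on the dialog this UI is bound to are affected.
            for stream in self.streams:
                if stream.dialog_id == self.dialog_id:
                    stream.muted = True
                    stream.playout.clear()  # flush the playout buffer
        return True
```

   A stream on another dialog is left untouched, which reflects the
   association rule: only media managed on the dialog the user interface
   is bound to is muted.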

   A similar delegation can occur by instantiating a VoiceXML-based
   user interface in the client. That would allow barge-in to operate
   for speech-driven IVR, in addition to keypad-driven IVR.


   This capability can allow VoIP-based IVR applications to operate with
   zero-latency barge-in, better than today's circuit-switched IVR
   applications. This is shown in Figure 7, which demonstrates a call
   flow for this example. The caller sends an INVITE to a VoiceXML
   server (1). The VoiceXML server fetches the script to execute (2).
   The script, returned in (3), indicates that a prompt should be
   played, and that barge-in should occur if the user presses pound.
   So, the VoiceXML server generates a KPML script that looks for pound
   and sets the barge flag to true. This is returned in the 200 OK (4).
   The caller sends an ACK (5), and the prompt is played over RTP (6).
   The user presses pound in the middle of the prompt. The KPML script
   matches this input, and the UA ceases playout of the prompt
   immediately. At the same time, the client generates a POST to the
   VoiceXML server (7). The VoiceXML server now knows that pound has
   been pressed. So, it fetches the next VoiceXML script (8), returned
   in (9), and extracts from it the next KPML script, which is passed in
   the 200 OK response to the POST from the client (10).

9 Examples

   TODO.


10 Security Considerations

   There are many security considerations associated with this
   framework. It allows applications in the network to instantiate user
   interface components on a client device. Such instantiations need to
   be from authenticated applications, and also need to be authorized to
   place a UI into the client.

   The means by which the authentication and authorization are done
   depend on the architectural model in use. A pipe-and-filter model
   will make it difficult for the user device to authenticate each
   application, since there is no direct dialog between them. A direct
   dialog is needed because S/MIME, the primary tool for client
   authentication of a server through proxies, requires one.
   However, authorization is reasonably simple. An application is
   authorized if it was on the original call path. By using a secure SIP
   URI [1], the caller can obtain this guarantee as long as it trusts
   each element on the call setup path.

   With the client-based resolution model, authentication is much
   better, as noted in Section 7.2.2, since it can be done with S/MIME.
   Authorization works identically to the pipe-and-filter model. If the
   caller initiated the call with a secure SIP URI, an application could
   never learn the dialog identifiers unless it was in-path. Therefore,
   an application which generates an INVITE to join a dialog created
   from a SIPS URI must have been on the call path. However, this
   application itself must use SIPS to contact the UA, in order to
   protect the confidentiality of the dialog identifiers.
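
   The authorization rule above can be sketched as a check performed by
   the UA when an application asks to join an existing dialog. The names
   below are hypothetical, and rejecting joins to dialogs that were not
   created from a SIPS URI is a policy assumption of this sketch; the
   framework only states that the in-path guarantee holds for SIPS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogId:
    """Dialog identifiers: Call-ID plus local and remote tags."""
    call_id: str
    local_tag: str
    remote_tag: str


def authorize_join(request_uri: str, target: DialogId,
                   active_dialogs: dict) -> bool:
    """Decide whether a joining application is authorized.

    active_dialogs maps each DialogId known to the UA to True if that
    dialog was created from a SIPS URI, and to False otherwise.
    """
    created_with_sips = active_dialogs.get(target)
    if created_with_sips is None:
        # Unknown identifiers: the sender cannot have been on the path.
        return False
    if not created_with_sips:
        # Identifiers may have been observed off-path (assumed policy).
        return False
    # The application must itself use SIPS to reach the UA, so that the
    # dialog identifiers stay confidential.
    return request_uri.lower().startswith("sips:")
```

   The check grants nothing beyond what knowledge of the dialog
   identifiers implies; authentication of the application is a separate
   step.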

11 Contributors

   This document was produced as a result of discussions amongst the
   application interaction design team. All members of this team
   contributed significantly to the ideas embodied in this document. The
   members of this team were:


   Eric Burger
   Cullen Jennings
   Robert Fairlie-Cuninghame


       Caller         VXML Server      Web Server
          |                |                |
          |                |                |
          |(1) SIP INVITE  |                |
          |--------------->|                |
          |                |                |
          |                |                |
          |                |(2) HTTP GET    |
          |                |--------------->|
          |                |                |
          |                |(3) HTTP 200 OK |
          |                |VXML            |
          |                |<---------------|
          |                |                |
          |(4) SIP 200 OK  |                |
          |KPML            |                |
          |<---------------|                |
          |                |                |
          |                |                |
          |(5) SIP ACK     |                |
          |--------------->|                |
          |                |                |
          |                |                |
          |(6) RTP         |                |
          |................|                |
          |                |                |
          |                |                |
          |press #         |                |
          |                |                |
          |                |                |
          |                |                |
          |playout ends    |                |
          |                |                |
          |                |                |
          |                |                |
          |(7) HTTP POST   |                |
          |--------------->|                |
          |                |                |
          |                |                |
          |                |(8) HTTP POST   |
          |                |--------------->|
          |                |                |
          |                |(9) 200 OK      |
          |                |VXML            |
          |                |<---------------|
          |                |                |
          |(10) 200 OK     |                |
          |KPML            |                |
          |<---------------|                |
          |                |                |
          |                |                |
          |                |                |
          |                |                |


   Figure 7: Zero-Latency Barge In


12 Author's Address


   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com


13 Normative References

14 Informative References

   [1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J.
   Peterson, R. Sparks, M. Handley, and E. Schooler, "SIP: session
   initiation protocol," RFC 3261, Internet Engineering Task Force, June
   2002.

   [2] M. Day, J. Rosenberg, and H. Sugano, "A model for presence and
   instant messaging," RFC 2778, Internet Engineering Task Force, Feb.
   2000.

   [3] J. Rosenberg, "A framework for conferencing in the session
   initiation protocol," Internet Draft, Internet Engineering Task
   Force, Oct. 2002.  Work in progress.

   [4] H. Schulzrinne and J. Rosenberg, "Session initiation protocol
   (SIP) caller preferences and callee capabilities," Internet Draft,
   Internet Engineering Task Force, July 2002.  Work in progress.

   [5] VoiceXML Forum, "Voice extensible markup language (VoiceXML)
   version 1.0," W3C Note NOTE-voicexml-20000505, World Wide Web
   Consortium (W3C), May 2000.  Available at
   http://www.w3.org/TR/voicexml/.

   [6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: a
   transport protocol for real-time applications," RFC 1889, Internet
   Engineering Task Force, Jan. 1996.

   [7] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
   telephony tones and telephony signals," RFC 2833, Internet
   Engineering Task Force, May 2000.

   [8] E. Burger, "The keypad markup language (kpml)," Internet Draft,
   Internet Engineering Task Force, Oct. 2002.  Work in progress.

   [9] J. V. Dyke, E. Burger, and A. Spitzer, "Snowshore media server
   control markup language and protocol," Internet Draft, Internet
   Engineering Task Force, Oct. 2002.  Work in progress.

   [10] J. Rosenberg, J. Peterson, H. Schulzrinne, and G. Camarillo,
   "Best current practices for third party call control in the session
   initiation protocol," Internet Draft, Internet Engineering Task
   Force, June 2002.  Work in progress.

   [11] R. Mahy and D. Petrie, "The session initiation protocol (sip)
   join header," Internet Draft, Internet Engineering Task Force, Oct.
   2002.  Work in progress.

   [12] M. Baugher et al., "The secure real-time transport protocol,"
   Internet Draft, Internet Engineering Task Force, June 2002.  Work in
   progress.

   [13] J. Arkko et al., "MIKEY: Multimedia internet KEYing," Internet
   Draft, Internet Engineering Task Force, Aug. 2002.  Work in progress.

   [14] P. Srisuresh, J. Kuthan, J. Rosenberg, A. Molitor, and A.
   Rayhan, "Middlebox communication architecture and framework," RFC
   3303, Internet Engineering Task Force, Aug. 2002.

   [15] SALT Forum, "Speech application language tags 1.0 specification
   (SALT)," SALT Forum recommendation, SALT Forum, July 2002.  Work in
   progress.

