Requirements for a markup language for HTTP-mediated interactive voice
response services

Peter Danielsen¶,
Nils Klarlund*,
David Ladd§,
Peter Mataga¶,
Christopher Ramming*,
Kenneth Rehor¶

*) AT&T Labs-Research, ¶) Lucent Technologies-Bell Laboratories, §)
Motorola ICSD


Voice browsing involves access to the Web via a device, such as a
telephone, that has no display.  Our joint experience with markup
languages for IVR (Interactive Voice Response) systems suggests that
HTML cannot be easily extended in ways that would make voice browsing
possible.  In fact, voice browsing suffers from many of the same
obstacles that make so many IVR systems unpleasant and difficult to use. 
Web contents should nonetheless be accessible to voice browsing
communities.  This goal can be achieved by a structured markup language
that is expressly designed for IVR services.  Such a language could be
used to create voice browsers along with Web applications that parallel
their visual counterparts.  We offer some requirements for such a
language.


It is possible to browse the Web even in the absence of a usable display
device.  The term voice browsing has been used to describe this form of
Web access, which is characterized by the use of sound to render
documents in an interactive fashion.  Voice browsers offer Web access to
the visually impaired, to those whose eyes must be focused on something
other than a computer screen (automobile drivers), and to those using
inherently limited terminal devices (payphone users).  That is to say,
an effective voice browser would be a valuable tool for many user
communities.  There have been several efforts to support the voice
browsing of arbitrary Web documents; most approaches allow modifications
to the underlying document for a more effective voice rendering
[1, 2, 3].

A voice browser (or voice user agent in Web parlance) must perform audio
rendering and must also provide user input mechanisms that control
hyperlink selection, form field entry, and form submission.  Perhaps the
simplest known terminal device that supports audio browsing is the
common analog telephone.  Telephones have, in fact, long supported
automated information and transaction systems, known in the
telecommunications industry as Interactive Voice Response (IVR) systems. 
It is therefore natural to propose an architecture where IVR systems
become voice browsers that are defined by Web applications.  Indeed, a
recently circulated note on voice browsing [4] describes functionality
that is applicable to IVR services as well as to browsing for the
visually impaired.  The idea that IVR systems and Web browsers are
closely related suggests that HTML should form a basis for IVR systems. 
If the details of this idea can be worked out, IVR systems will be
created as Web applications, and Web applications will be accessible to
voice browsing communities.

In this paper, we examine the requirements for markup-language-based IVR
service creation.  The three groups represented by the authors of this
paper have had extensive experience in providing HTTP-mediated IVR
systems, and we believe this experience is relevant to the voice
browsing community in two ways.  We have found that an approach to voice
browsing and IVR based on the extension of HTML has some important
limitations.  We have also found that the insights derived from experience
with IVR systems may be valuable in guiding the design of voice
browsers.

It is easy to imagine that any structured document written in a markup
language can be translated in a straightforward way to some state
machine that correctly renders the output and gathers appropriate user
input; indeed, such a translation is at the heart of several
HTTP-mediated voice service languages developed at AT&T [5], Lucent [6],
and Motorola [7].  From this perspective, voice browsing appears to be a
straightforward problem.
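
As a minimal sketch of such a translation, the Python fragment below turns a small page into a linear state machine and then walks it.  The element names (dialog, say, ask), the state fields, and the run() helper are all invented for illustration; they are not taken from any of the languages cited above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: element and attribute names are assumptions.
PAGE = """
<dialog>
  <say>Welcome to the weather service.</say>
  <ask name="city">Which city?</ask>
  <say>Thank you.</say>
</dialog>
"""

def compile_page(xml_text):
    """Translate each child element into one state of a linear machine."""
    root = ET.fromstring(xml_text)
    states = []
    for index, element in enumerate(root):
        states.append({
            "id": index,
            # "ask" elements render a prompt and gather input; others only render.
            "action": "prompt_and_collect" if element.tag == "ask" else "play",
            "text": (element.text or "").strip(),
            "field": element.get("name"),
            "next": index + 1 if index + 1 < len(root) else None,
        })
    return states

def run(states, answers):
    """Walk the machine; `answers` stands in for recognized user input."""
    transcript, collected = [], {}
    state_id = 0 if states else None
    while state_id is not None:
        state = states[state_id]
        transcript.append(state["text"])
        if state["action"] == "prompt_and_collect":
            collected[state["field"]] = answers[state["field"]]
        state_id = state["next"]
    return transcript, collected
```

Calling run(compile_page(PAGE), {"city": "Boston"}) plays the three prompts in order and records the city.  The sketch deliberately ignores recognition errors, help requests, and barge-in.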

However, if voice browsing is seen as IVR, then it is hard to imagine
how voice browsing will be any easier to achieve than IVR service
creation.  Indeed, most would agree that few IVR services are
effective and pleasing.  Even if one assumes that speech recognition and
synthesis technology have become so reliable that they are no longer the
limiting factors in designing effective IVR services, experience shows
that the complexity of dialogs and interactions can overwhelm a
programmer not skilled in both concurrent systems and human-computer
interfaces.  Any service creation environment, whether or not based on
HTML, must reflect the fact that IVR service design is a complicated
process that can be parameterized in countless ways.

We believe, however, that effective IVR service creation based on Web
authoring is achievable.  A structured markup language is a convenient
way to hide hardware complexities, to abstract away the difficulties of
concurrent programming, and to codify proven principles of IVR design. 
Thus, a markup language allows the designer to concentrate on the
essential matter:  the contents, choice of pre-defined dialogs, and
associated help menus and prompts.  The details of all of these aspects,
and the way in which they fit together, must be customizable by the
service creator (or, perhaps, the style designer).

We believe it is unlikely that extensions to HTML such as those proposed
in [4] will satisfy all of the requirements for effective IVR services.
The designer's intent with
regard to voice interactions is simply not well supported by such an
approach.  We conclude that IVR service creation can best be achieved
with a specific markup language for specifying interactive voice
rendering.  It seems to us that many of the same issues apply to the
voice browsing problem, and that such a markup language can provide a
useful substrate for voice interaction with existing Web content,
especially as voice customization becomes more prevalent.

Phones as Web terminals: some striking problems

The starting point for this discussion is the assumption that a service
is presented as a sequence of pages of some markup language (e.g., an
XML dialect).  We also assume that any activity that requires intensive
computation or access to private data is executed on the HTTP server. 
However, each page may nonetheless represent complex user interactions,
during which multiple pieces of information are presented to and
collected from the user.  These interactions take place via the limited
interface provided by a telephone, and are strikingly different from
those of a typical visual browser.  This section covers some of the
differences, and it discusses the consequences for service
creation.

Documents must be rendered along a temporal dimension

In the visual world, a full screen of any document may be presented at a
given moment.  The two-dimensional layout of the document is determined
through style sheets or flow-objects.  The reader can instantaneously
select the part of the information that he or she wants to read; in
fact, the effort of intra-page navigating is so negligible that it is
usually unconscious.  In the voice world, information is rendered
linearly, as a function of time.  Thus the analog of layout is a
description of the interactive, temporal process where the listener
selects, through prompts and barge-ins, what linear fragments of the
document to render.

It is not possible to separate content from presentation

In the visual world, the HTML + CSS combination convincingly separates
structured contents (the HTML document) from the presentation details
(the CSS specification).  When we tried to adapt HTML to a voice
setting, we found out early that there is no such strong separation
between contents and presentation.  Even with a library of pre-defined
dialogue structures, a programmer must skillfully add more contents,
much of which concerns presentation.  For example:  What is the initial
prompt?  What is the prompt the nth time that the browser has not been
able to interpret the user's input?  What is the speech recognition
grammar?  Is the grammar merged with grammars for outer elements that
have navigational options?  What if the user says "help"?
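
To make these questions concrete, the following Python sketch shows how much presentation-specific content a single input field might carry.  The FieldDialog class, its attribute names, and the use of a plain set of words in place of a recognition grammar are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class FieldDialog:
    initial_prompt: str
    reprompts: list   # reprompts[n-1] is played after the nth failure (assumed non-empty)
    help_text: str
    grammar: set      # accepted utterances; a stand-in for a recognition grammar

    def respond(self, utterance, failures):
        """Handle one user turn; return (next_prompt, value, failures)."""
        word = utterance.strip().lower()
        if word == "help":
            # Help is content that exists only for the voice presentation.
            return self.help_text, None, failures
        if word in self.grammar:
            return None, word, failures
        failures += 1
        # Escalate through the reprompts, repeating the last one if needed.
        index = min(failures, len(self.reprompts)) - 1
        return self.reprompts[index], None, failures
```

None of the prompt strings could be derived from a bare form field; each must be written by someone who understands the service.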

The "help" problem illustrates a point fundamental to voice rendering: 
that additional contents regarding the presentation itself must be added
in many places.  We say that contents alone, such as represented in an
HTML document, must be augmented with additional representational
contents.  Representational contents cannot be calculated from an HTML
document; such contents, like all others, require human insight.

User input is often ambiguous

In the visual world, user input is usually treated as certain.  For
example, there is no need to reconfirm the spelling of a city name.  In
the voice world, we assume that a full keyboard is not available (for if
it were, a display would be nearby in most cases, and it would surely be
the output medium of choice for most people).  Typically, input arrives
either through speech recognition or through a keypad, such as the 12 keys
of a standard telephone.  Speech input is ambiguous, and so is telephone
keypad input, at least when it comes to spelling (where one key denotes
several letters).  In any case, the interactions or dialogues
necessarily become complicated if they are to deal well with the human
factors involved.
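
The keypad ambiguity can be illustrated with a short Python sketch.  The letter assignment follows the common telephone keypad layout; the vocabulary and the wording of the prompts are hypothetical.

```python
# Letters printed on the digit keys of a common telephone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}

def to_keys(word):
    """Return the key sequence a caller would press to spell `word`."""
    return "".join(LETTER_TO_KEY[c] for c in word.lower() if c in LETTER_TO_KEY)

def candidates(digits, vocabulary):
    """Return every vocabulary word that matches the pressed keys."""
    return [word for word in vocabulary if to_keys(word) == digits]

def confirmation_prompt(matches):
    """Build the follow-up dialogue that the ambiguity makes necessary."""
    if not matches:
        return "Sorry, no match. Please try again."
    if len(matches) == 1:
        return f"Did you mean {matches[0]}?"
    return " ".join(
        f"For {word}, press {number}." for number, word in enumerate(matches, 1)
    )
```

With the vocabulary ["good", "home", "gone", "rome"], the key sequence 4663 matches the first three words, so the service must ask a follow-up question before it can proceed.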

Requirements for a markup language for interactive voice rendering

Our experience with markup languages for HTTP-mediated IVR is that they
are very well-suited for the expression of contents, control flow,
dialogues, and the handling of exceptional situations.  We believe that
these concepts, as a whole, are difficult to reconcile with HTML.  This
point of view, of course, does not conflict with extending HTML with
markup to improve voice rendering, such as pronunciation clues or audio
icons.  In this section, we present requirements for a voice markup
language.  The requirements are intended to guide the development of a
standard markup language, tentatively called VXML (Voice eXtensible
Markup Language).


IVR services are typically written in general purpose languages that
allow for the precise temporal control of all resources at a cost of
high program complexity.  Such resources include timers, speech
recognizers, speech synthesizers, and tone detectors.  In contrast, VXML
should provide abstractions of platform capabilities that enable the
author to focus on behavioral aspects of the service, freeing them from
platform-specific APIs.  VXML elements should encapsulate common IVR
resource usage idioms, such as playing a phrase or collecting input.  It
is an explicit goal to keep the abstraction level high; in particular,
most VXML documents should avoid the use of C++, Java, or
EcmaScript-like code to control a speech or telephony API for reasons of
safety, performance, and portability.

We expect that VXML would have a pre-defined set of higher-level
abstractions.  An author may use the higher-level abstraction capability
to specify interaction behaviors to address human factors,
compatibility, and performance issues.

Control Flow

Conventionally, IVR services are planned and designed in terms of flow
diagrams.  Typically, control flow in IVR systems is affected both by
synchronous completion of an interaction and by asynchronous events. 
Often, control flow is determined by previous inputs.  Therefore,
conditional branching and other basic control flow must be provided by
the language.  It must also be possible to specify the output and
control flow behavior of a variety of asynchronous or exceptional
events.
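
A rough Python sketch of these two kinds of control flow follows.  The state names, the event names (hangup, timeout), and the step() function are invented for illustration and do not describe any existing IVR platform.

```python
# Hypothetical flow diagram: each state names its successor once the
# interaction completes normally, possibly depending on earlier inputs.
FLOW = {
    "ask_account": lambda inputs: "ask_pin",
    # Conditional branch on an input collected earlier in the call.
    "ask_pin": lambda inputs: (
        "operator" if inputs.get("account") == "business" else "main_menu"
    ),
    "main_menu": lambda inputs: "goodbye",
    "operator": lambda inputs: "goodbye",
    "goodbye": lambda inputs: None,   # None ends the call
}

# Asynchronous or exceptional events override the normal transition.
EVENT_HANDLERS = {
    "hangup": None,        # caller disconnected: end immediately
    "timeout": "goodbye",  # no input received: close politely
}

def step(state, inputs, event=None):
    """Return the next state, giving priority to asynchronous events."""
    if event is not None:
        if event not in EVENT_HANDLERS:
            raise ValueError(f"unhandled event: {event}")
        return EVENT_HANDLERS[event]
    return FLOW[state](inputs)
```

In a markup language, the branch would be expressed declaratively and the event table would be scoped to the enclosing element rather than written as code.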

HTML Reuse

There are several concepts of HTML that apply in the voice world.  For
example, the document structure itself defines an implicit sequential
flow of control.  Also, the static scoping in HTML achieved through
nesting of elements should be adopted by VXML.  Many elements make sense
as well:  the LINK element in the header section provides information
about how the document is related to others, and the information can be
used for navigational purposes; the SELECT element may be interpreted as
an IVR menu; the FORM element defines the name/value pairs that are to
be returned to the server.  By reusing HTML as much as possible, we make
VXML easier to learn and easier to use.
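
As one possible reading of the SELECT element as an IVR menu, the Python sketch below extracts the options from an HTML fragment and builds a touch-tone prompt.  The prompt wording, the digit assignment, and the menu_prompt() helper are assumptions; the sketch also expects explicit OPTION end tags.

```python
from html.parser import HTMLParser

class SelectMenu(HTMLParser):
    """Collect the name and the options of a SELECT element."""

    def __init__(self):
        super().__init__()
        self.name = None
        self.options = []      # list of (value, label) pairs
        self._current = None   # [value, label parts] while inside an OPTION

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)   # the parser lowercases tag and attribute names
        if tag == "select":
            self.name = attributes.get("name")
        elif tag == "option":
            self._current = [attributes.get("value"), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._current is not None:
            value, parts = self._current
            self.options.append((value, "".join(parts).strip()))
            self._current = None

def menu_prompt(html):
    """Return (field name, spoken prompt, digit-to-value map)."""
    parser = SelectMenu()
    parser.feed(html)
    parser.close()
    numbered = list(enumerate(parser.options, start=1))
    prompt = " ".join(f"For {label}, press {digit}." for digit, (_, label) in numbered)
    keymap = {str(digit): value for digit, (value, _) in numbered}
    return parser.name, prompt, keymap
```

The field name and the value selected through the digit map form the name/value pair that a FORM would return to the server.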

Speech Standards Reuse

Converting text to audio output is a challenging problem.  Most
text-to-speech engines attempt to infer prosody and pronunciation from
the text, but mistakes are common.  In principle, we agree that specific
speech markup is the correct way to minimize the ambiguity
inherent in text.  In particular, we recognize the efforts of the
Sable Consortium in codifying the capabilities of current text-to-speech
technology.
Similar work is underway for other specialized markup such as Automatic
Speech Recognition grammars.  Such formalisms are outside the scope of
the VXML definition, but a method for their incorporation should be
provided.


Given the huge volume of information published through HTML, and the
range of services that are evolving on the Web, it is attractive to
offer a voice interpretation of HTML content.  We believe that work can
and will progress in this area, but that there will be a desire on the
part of many content providers to target the voice medium precisely and
directly, to create high-quality voice services.  Since it seems to us
that HTML, even with extensive style embellishment, is incapable of
specifying the details of the necessary voice interactions, our
preference would be to standardize on a fully explicit voice interaction
language.  This language could in turn be targeted by applications that
form a continuum from voice browsing of existing HTML to hosting of IVR
services.  We think this approach would serve the purposes of the voice
browsing community by bringing about a robust substrate relatively
quickly.  Once the mechanisms for voice browsing of HTML content emerge,
VXML voice browsers could incorporate the HTML rules and style sheets. 
Even so, our experience suggests that it is likely that different pages
will be developed for voice and graphical rendering. 


1. General Magic
2. Productivity Works
3. Rutgers University AudioWeb
4. Voice Browsers note
5. AT&T PML
6. Lucent Technologies
7. Motorola's VoxML(TM) Voice Markup Language