Web By Phone

Abstract.

WebByPhone is a system that answers a phone call, recognizes touch-tone (DTMF) digits, and reads a web page aloud using a text-to-speech (TTS) synthesizer. It brings Web browsing to anyone with an ordinary PSTN telephone, removing the need for expensive equipment. This not only extends web access to the casual user but also provides an effective aid to visually impaired individuals.

Introduction.

Web technologies have been based on graphical interfaces ever since the days of Mosaic, the first widely used graphical Web browser. Graphical user interfaces have simplified user interaction and contributed to the spread of the web, but they have also made relatively expensive computer systems a prerequisite for access. Televisions and telephones are substantially more widespread than personal computers, and new attempts to offer web services through such common appliances are rapidly emerging. WebTV and IP phones are just two examples.

WebByPhone brings Web browsing to anyone with an ordinary PSTN telephone, removing the need for expensive equipment. This not only extends web access to the casual user but also provides an effective aid to visually impaired individuals. Although WebByPhone is not the only system providing web access through the PSTN, it offers an open architecture and full platform independence, which makes it portable across different systems.
WebByPhone is implemented entirely in Java. At design time I relied on emerging standards such as the Java Speech API (JSAPI) and on well-known, established design patterns such as Observer/Observable and State.

In this paper I introduce the related work and provide background on the technologies and products that interact in the WebByPhone system. I then illustrate the architecture, describing both the overall architecture diagram and the individual components in detail. For each module I describe the design patterns and programming techniques applied, emphasizing the benefits gained from those design choices.
This document also contains a functional description with various examples of how the user can interact with the system. Since WebByPhone is coded entirely in Java, the main program requires no additional compilation and is ready to run. I have added a minimal installation procedure and a list of requirements covering the essential steps needed to install the peripheral software that interacts with WebByPhone. The document concludes with a list of possible directions in which the system may evolve.

Related Work.

There are several related works involving speech synthesis.

Most of these applications are oriented to blind users and provide access to common programs and operating systems. BLYNX [1] and BLINUX [2] are two examples of how a TTS engine can be coupled with an application.
The following is a partial list of applications that use speech synthesis [3]:

Emacspeak Driver for the DoubleTalk and LiteTalk Synthesizers
Emacspeak Driver for Braille 'n Speak, Braille Lite, and Type 'n Speak
Linux Device Driver for the DoubleTalk PC
Screader: a Text-to-Speech Application for Linux
Slackware96 Rootdisk for the Blind

There is also a company, NetPhonic Communications, Inc. [4], that provides two products with similar features: Web-On-Call [5] and Email-On-Call [6].
I have tested these two products through the demos the company offers by phone. Judging from the demos, they appear to be well made and of production quality.
Although WebByPhone is not a complete product, I consider it very close to its competitor Web-On-Call.
The core functionality implemented in WebByPhone, together with its widespread use of design patterns, provides an open architecture that is easy to enrich with new features.

Background

The following is a brief introduction to the various software packages and devices I used to build the WebByPhone system.

Speech and other Java APIs

JavaSoft has recently released a Java API for speech [7]. More information is available on the Java Speech API home page [8].

The Java Speech API is one of the Java Media and Communication APIs, a suite of software interfaces that provide cross-platform access to audio, video and other multimedia playback.

The Java Speech API, in combination with the other Java Media and Communication APIs, allows developers to enrich Java applications and applets with rich media and communications capabilities.

The Java Speech API leverages the capabilities of other Java APIs. The internationalization features of the Java programming language, plus the use of the Unicode character set, simplify the development of multilingual applications. Many of the classes and interfaces of JSAPI follow the patterns of JavaBeans. JSAPI events integrate with the event mechanisms of AWT, JavaBeans and the Java Foundation Classes (JFC).

To use the Java Speech API, a user must have certain minimum software and hardware available. The following is a broad sample of requirements. The individual requirements of speech synthesizers and speech recognizers can vary greatly and users should check product requirements closely.

Speech software: A JSAPI-compliant speech recognizer or synthesizer is required.

System requirements: most speech recognizers and some speech synthesizers require powerful desktop computers to run effectively.

It is important to check the minimum and recommended requirements for CPU, memory and disk space when purchasing a speech product.

Audio Hardware: Speech synthesizers require audio output. Speech recognizers require audio input. Most desktop and laptop computers now sold have satisfactory audio support. Most dictation systems perform better with good quality sound cards.

IBM Speech for Java.

IBM is one of the first companies to implement the JSAPI interfaces.
Speech for Java v0.61 [9] is a free evaluation release available for testing on the web at http://www.alphaworks.ibm.com/formula/speech/.
Speech for Java is a Java programming interface that gives Java application developers access to the IBM ViaVoice [10] speech technology. It supports voice command recognition, dictation, and text-to-speech synthesis.
Speech for Java is an alpha implementation of a core subset of the beta Java Speech API, a cross-platform speech API developed by Sun Microsystems Inc. in collaboration with IBM and other speech technology companies. More information can be found at the Java Speech API home page, http://java.sun.com/products/java-media/speech/.

Requirements

In much the same way that Java implementations on Windows are built on top of the native Windows GUI capabilities, Speech for Java is built on top of the native speech recognition and synthesis capabilities in IBM ViaVoice. Thus Speech for Java requires installation of IBM ViaVoice Gold on the computer. ViaVoice is not provided as part of this package.
More information about ViaVoice can be found at the VoiceType / ViaVoice home page [10].
Minimum requirements for running IBM ViaVoice:

166MHz Pentium or 150MHz Pentium with MMX, running Windows 95 with 32MB of memory or Windows NT with 48MB.
Speech for Java has only been tested with the JavaSoft JDK 1.1.5.
ViaVoice Gold is a commercial IBM product available off the shelf.

Microsoft Speech API.

The Microsoft® Speech Application Programming Interface (API) allows application developers to incorporate both speech recognition and text-to-speech into their applications.
More information can be found at [11] http://www.microsoft.com/directx/pavilion/dsound/speechapi.htm
I am not directly using the SAPI in my project.

Serial Port Java Drivers.

SerialPort from Solutions Consulting [12] http://www.sc-systems.com/serPort.html is a Java class that provides access to serial ports from Java applications. SerialPort is a high-performance class that also provides low-level serial port control.

Web browser Lynx

Lynx is a full-featured World Wide Web (WWW) client for users running cursor-addressable, character-cell display devices (e.g., vt100 terminals, vt100 emulators running on PCs or Macs, or any other character-cell display). It will display Hypertext Markup Language (HTML) documents containing links to files on the local system, as well as files on remote systems running http, gopher, ftp, wais, nntp, finger, or cso/ph/qi servers, and services accessible via logins to telnet, tn3270 or rlogin accounts (see URL Schemes Supported by Lynx). Current versions of Lynx run on Unix, VMS, Windows95/NT, 386DOS and OS/2 EMX.
More information on Lynx [13] can be found in http://www.slcc.edu/lynx/release2-8/lynx2-8/lynx_help/lynx_help_main.html

Telephone Access unit T-311.

With the Teltone T-311 Telephone Access Unit, computers can make and answer telephone calls, and information about those calls can be returned to the computer. The T-311 allows communication between called and calling parties.
This communication is made possible by DTMF-to-ASCII and ASCII-to-DTMF conversion.
With the T-311, computers and other terminal devices can control telephone system functions such as answering and placing calls, observing call status, sending or receiving DTMF signals, "flashing" the line and coupling audio sources, such as speech synthesizers, onto the line.
For more information see [14] http://www.teltone.com/cti/t-311.html

Architecture

WebByPhone is implemented entirely in Java and has been tested with the JDK 1.1.5 JavaSoft VM. The architecture consists of more than 30 classes and about 2500 lines of code. The design and implementation of the whole system took less than 100 hours in total for a single programmer.

Architecture modules:

The following is a diagram of the modules included in the system.

The following sections describe in detail the WebByPhone modules depicted in the architecture diagram.

Phone gateway.

This component handles a physical call coming from the PSTN and establishes an audio link connected to the sound card. The hardware part of this module is the Teltone T-311, which has an RS-232 serial interface and is connected to a PC serial port. The class ph_driver is the phone resource manager and acts as a Java wrapper for the T-311. Since Java by itself does not provide access to the serial port, ph_driver relies on a commercial library for this purpose.
The third-party software I used is called SerialPort and is produced by Solutions Consulting Inc. This package provides access to the serial port, abstracting it as a standard Java I/O stream.
Since this company supports the most common platforms, this solution still preserves portability across platforms.
By design, the ph_driver module generates events according to the Observer/Observable design pattern [15]. The events are modeled by the class SerialEv and carry two units of information: a type and a value. Each module that wishes to receive SerialEv events must implement the interface SerialEvObserver and subscribe to the source.
SerialEv is specialized into two kinds of events: SerialDTMFLineEv and SerialLineEv.
These two events are produced respectively when a DTMF sequence or a line is read from the phone device.
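The event mechanism described above can be illustrated with a minimal sketch. It is written in present-day Java rather than JDK 1.1 and is not the WebByPhone source: the class names SerialEv, SerialDTMFLineEv, SerialLineEv and SerialEvObserver come from the text, while the type codes, the method name serialEvent, and the stand-in class PhDriverSketch are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// An event carries a type and a value, as described in the text.
class SerialEv {
    static final int DTMF_LINE = 0; // assumed type code
    static final int LINE = 1;      // assumed type code
    final int type;
    final String value;
    SerialEv(int type, String value) { this.type = type; this.value = value; }
}

// Produced when a DTMF sequence is read from the phone device.
class SerialDTMFLineEv extends SerialEv {
    SerialDTMFLineEv(String digits) { super(DTMF_LINE, digits); }
}

// Produced when an ordinary line is read from the phone device.
class SerialLineEv extends SerialEv {
    SerialLineEv(String line) { super(LINE, line); }
}

// Implemented by every module that wants to receive events.
interface SerialEvObserver {
    void serialEvent(SerialEv ev); // method name is an assumption
}

// Hypothetical stand-in for ph_driver: keeps a subscriber list and notifies it.
class PhDriverSketch {
    private final List<SerialEvObserver> observers = new ArrayList<SerialEvObserver>();

    void addObserver(SerialEvObserver o) { observers.add(o); }
    void removeObserver(SerialEvObserver o) { observers.remove(o); }

    // Called whenever a complete line has been read from the device.
    void lineRead(String line, boolean isDtmf) {
        SerialEv ev = isDtmf ? new SerialDTMFLineEv(line) : new SerialLineEv(line);
        // Iterate over a copy so observers may unsubscribe while being notified.
        for (SerialEvObserver o : new ArrayList<SerialEvObserver>(observers)) {
            o.serialEvent(ev);
        }
    }
}
```

With this arrangement the driver knows nothing about who consumes its events, which is what allows a GUI or a test harness to stand in for the phone hardware.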

DTMF converter.

This module is responsible for the conversion of DTMF tones into ASCII strings. It allows the user to control the web browser remotely by detecting the keys pressed on a touch-tone phone. Although there are off-the-shelf programs able to perform this kind of conversion, I was satisfied with the quality of the DTMF detection in the Teltone T-311 and chose to use it in WebByPhone.

Mediator.

This module acts as a coordinator among the other entities. I designed this part of the system according to the Mediator design pattern. In practice the Mediator is composed of several objects: webph and the collection of CallState objects.
webph is the main program. It contains the main method and is responsible for initializing the system. During the initialization phase all the required objects are created and initialized.
The mediator handles each phone session as a finite state machine. In designing the mediator I took into account the extensibility of the phone session model: the system is designed so that the state graph is easy to change by adding new states and new transitions between states. This flexibility is achieved with the State design pattern. The session states are represented by distinct objects that inherit from a base CallState, coded as an abstract class. The mediator receives events and forwards them to the current state, which is responsible for reacting according to its embedded business logic. Each state implements the SerialEvObserver interface and receives events from the phone device. The behavior of the system is dictated by how the current state reacts to events. Modifying that behavior is as easy as introducing a new CallState object and linking it to some predecessor state. In this way I introduced user authentication simply by adding a new state, Authentication, between the WaitingCall and CallConect states. It is also possible to take advantage of the state hierarchy by grouping common behavior in the base class CallState, avoiding redundant implementations in every state. This is done, for example, for the DTMF request for online help, for the call termination event, and for the explicit termination of the session by the DTMF sequence '**#'.
This state mechanism can be extended recursively to delegate part of the behavior to other modules. In particular, the menu has a part that is common to every state and a specialized part that is valid only while a web page is being read. During that phase the system must react not only to the standard instructions but also to link selections, fetching the new web page. This behavior is accomplished by introducing a menu object that processes events and reacts when a link selection is detected. The menu is created from the web page metadata object (webDoc) and contains all the links available from that page.
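The State pattern described in this section can be sketched roughly as follows. This is an illustration in present-day Java, not the actual WebByPhone code. The state names CallState, WaitingCall, Authentication and CallConect are taken from the text; CallSession, MediatorSketch, the method names, the spoken messages and the fixed demo PIN are all assumptions. The sketch also assumes that the trailing '#' has already been stripped from each DTMF sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Collects what would be spoken to the caller; a stand-in for the TTS module.
class CallSession {
    final List<String> spoken = new ArrayList<String>();
    void say(String text) { spoken.add(text); }
}

// Base state: behaviour common to all states lives here.
abstract class CallState {
    // '*' requests help and '**' ends the session in every state.
    final CallState onDtmf(String digits, CallSession s) {
        if (digits.equals("**")) { s.say("Good bye"); return new WaitingCall(); }
        if (digits.equals("*")) { s.say(help()); return this; }
        return handle(digits, s);
    }
    // A ring is ignored unless a state overrides this method.
    CallState onRing(CallSession s) { return this; }
    abstract String help();
    abstract CallState handle(String digits, CallSession s);
}

class WaitingCall extends CallState {
    CallState onRing(CallSession s) { s.say("Welcome"); return new Authentication(); }
    String help() { return "No call in progress"; }
    CallState handle(String digits, CallSession s) { return this; }
}

// Inserted between WaitingCall and CallConect without touching either of them.
class Authentication extends CallState {
    private static final String DEMO_PIN = "1234"; // placeholder for illustration only
    String help() { return "Enter your PIN followed by the pound key"; }
    CallState handle(String digits, CallSession s) {
        if (digits.equals(DEMO_PIN)) return new CallConect();
        s.say("Good bye");
        return new WaitingCall();
    }
}

class CallConect extends CallState {
    String help() { return "Enter a link number followed by the pound key"; }
    CallState handle(String digits, CallSession s) {
        s.say("Fetching link " + digits);
        return this;
    }
}

// Forwards every event to the current state and adopts the state it returns.
class MediatorSketch {
    final CallSession session = new CallSession();
    private CallState current = new WaitingCall();
    void ring() { current = current.onRing(session); }
    void dtmf(String digits) { current = current.onDtmf(digits, session); }
    String stateName() { return current.getClass().getSimpleName(); }
}
```

Because each handler returns the next state, adding a state only requires a new CallState subclass and a change in the one predecessor that should lead to it.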

GUI.

Although not strictly required by the project, a GUI is useful, especially in the intermediate phases of the implementation. This module allows the system to be tested without all the hardware. In particular, it simulates the events produced by touch-tone keys.

TTS.

This module provides vendor independence for the text-to-speech (TTS) functionality. The object managing speech generation is called TTS and is currently based on IBM Speech for Java v0.61.
The choice of IBM's product was essentially dictated by the fact that IBM Speech for Java was, at the time, the only available implementation of the Java Speech API.
I also evaluated the Lucent TTS application platform 2.0 beta for the PC platform (PC is the only platform supported…). Although I was extremely satisfied with the audio quality, I decided to discard the product because it lacks a Java interface. In theory, since the Lucent TTS implements SAPI (the Microsoft Speech API), it should be possible to interface it to Java using C++ native methods. This requires a deep understanding of SAPI and a substantial programming effort, which is beyond the scope of this project. The Lucent researchers I contacted confirmed their intention to evolve the product in this direction.
It should be noted that Speech for Java v0.61, which is an alpha version, does not implement the full JSAPI interface and is not yet stable and robust. I would like to thank the IBM researchers for their prompt feedback on the problems I encountered with this implementation and for suggesting ways to work around the unimplemented features I needed. I still occasionally see exceptions raised by the TTS module and am in the process of isolating them.

Web Browser.

This module fetches a web page and extracts its text along with all the meta-information needed to "surf" the page. To decouple the system from any particular browser I introduced an interface, webDriver, that defines one basic method: public webDoc getUrl(String url).
I decided not to implement yet another Java browser and HTML parser, but to rely on a well-known, well-maintained and freely available text web browser: Lynx.
The main advantage of using well-maintained third-party software is the implicit assurance that the system will keep working as HTML evolves. I used a common technique called screen scraping to extract information from Lynx. Lynx is currently executed with the options -number_links, -pseudo_inlines and -dump from inside a class called webDriverLynx, which implements the webDriver interface. webDriverLynx reads Lynx's standard output through an InputStream and parses it using a finite state machine. The result of this process is a webDoc object, which contains the textual representation of the web page and a data structure (links) listing the (URL, description) tuples for the anchors contained in the page.
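The screen-scraping step can be sketched as a small two-state parser. This is only an illustration in present-day Java, under the assumption that the Lynx dump ends with a "References" section whose entries look like "   1. http://...". The names WebDocSketch, WebDriverSketch, LynxDumpParser and WebDriverLynxSketch are hypothetical stand-ins for webDoc, webDriver and webDriverLynx. Extraction of the link descriptions is omitted, and a page whose body contains a line reading exactly "References" would confuse this simple version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for webDoc: page text plus link number -> URL.
class WebDocSketch {
    final StringBuilder text = new StringBuilder();
    final Map<Integer, String> links = new LinkedHashMap<Integer, String>();
}

// Stand-in for the webDriver interface described in the text.
interface WebDriverSketch {
    WebDocSketch getUrl(String url) throws IOException;
}

class LynxDumpParser {
    // Assumed shape of one entry in the trailing reference list.
    private static final Pattern REF = Pattern.compile("^\\s*(\\d+)\\.\\s+(\\S+)\\s*$");

    // Two states: reading the page body, then reading the reference list.
    static WebDocSketch parse(String dump) {
        WebDocSketch doc = new WebDocSketch();
        boolean inReferences = false;
        for (String line : dump.split("\r?\n")) {
            if (!inReferences) {
                if (line.trim().equals("References")) {
                    inReferences = true;
                } else {
                    doc.text.append(line).append('\n');
                }
            } else {
                Matcher m = REF.matcher(line);
                if (m.matches()) {
                    doc.links.put(Integer.valueOf(m.group(1)), m.group(2));
                }
            }
        }
        return doc;
    }
}

class WebDriverLynxSketch implements WebDriverSketch {
    // Runs Lynx with the options named in the text and parses its standard output.
    public WebDocSketch getUrl(String url) throws IOException {
        Process p = new ProcessBuilder(
                "lynx", "-number_links", "-pseudo_inlines", "-dump", url).start();
        InputStream in = p.getInputStream();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return LynxDumpParser.parse(buf.toString());
        } finally {
            in.close();
        }
    }
}
```

Keeping the parser separate from the process launch means the parsing logic can be exercised on a saved dump without Lynx being installed.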

Program Documentation.

WebByPhone prints online help for its command-line flags when started with the option '-' (minus).
One useful parameter that can be specified in the command line is the home-page URL. For example to start WebByPhone with the Columbia web page you can type:
WebByPhone -h http://www.cs.columbia.edu/
Once the program starts it waits for incoming calls.
WebByPhone detects the phone ring and answers at the first ring, playing the welcome message. An authentication phase follows, in which the user is asked to enter a code number and a PIN.
The system performs a lookup among the registered users; if the user is present and the PIN is correct, it grants access and continues the phone session. If either the user is not registered or the PIN is incorrect, WebByPhone terminates the active session with a "Good bye" message.
The authentication phase is very useful for two reasons: it not only provides a basic way of performing access control, but it also allows the system to personalize the session for the particular user. Each user in the system has a user profile and it is possible to store user related information such as home page, bookmarks and voice preferences.
It is possible to achieve a higher level of authentication by gathering the caller ID information from the Teltone T-311 unit. The unit I am using for this prototype does not support caller ID at this time.
Once the user is authenticated WebByPhone refers to the caller by name, using the corresponding user profile information.
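The authentication lookup described above could be modeled along these lines. This is a hedged sketch, not the actual implementation: UserProfile and UserRegistry are hypothetical names, the profile holds only a name, a PIN and a home page, and the PIN is kept in plain text purely to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-user profile; bookmarks and voice preferences could be added here.
class UserProfile {
    final String name;
    final String pin;
    final String homePage;
    UserProfile(String name, String pin, String homePage) {
        this.name = name;
        this.pin = pin;
        this.homePage = homePage;
    }
}

// Hypothetical registry keyed by the code number the caller enters.
class UserRegistry {
    private final Map<String, UserProfile> byCode = new HashMap<String, UserProfile>();

    void register(String code, UserProfile profile) { byCode.put(code, profile); }

    // Returns the profile when the code is registered and the PIN matches, else null.
    UserProfile authenticate(String code, String pin) {
        UserProfile p = byCode.get(code);
        return (p != null && p.pin.equals(pin)) ? p : null;
    }
}
```

A non-null result gives the session both the access decision and the data needed to personalize it, such as the caller's name and home page.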
Upon completion of the initial authentication phase, WebByPhone begins fetching the requested page, playing the user instructions concurrently. When the page is received and processed by the web browser wrapper, WebByPhone will start reading the text.
Each link in the page is numbered, and the keyword 'link' is pronounced before its content is read. At the end of the page the system reminds the user of all the options and reads the links, starting from the last one read.
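One way to produce the spoken 'link' keyword is to rewrite the numeric markers in the page text before handing it to the synthesizer. The sketch below assumes that links appear in the text as bracketed numbers such as [3] and that the number is spoken together with the keyword; both the marker format and the class name LinkAnnouncer are assumptions.

```java
import java.util.regex.Pattern;

class LinkAnnouncer {
    // Assumed marker format: a link number in square brackets, e.g. "[12]".
    private static final Pattern MARKER = Pattern.compile("\\[(\\d+)\\]");

    // Replaces each marker with the spoken form "link N, ".
    static String toSpokenText(String pageText) {
        return MARKER.matcher(pageText).replaceAll("link $1, ");
    }
}
```

For example, the text "See [1]Example" would be passed to the synthesizer as "See link 1, Example".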
By convention each input through the touch tone keypad must terminate with the special character '#'.
To repeat a menu or to get online help, press '*' followed by '#'.
To terminate the call press '**' followed by '#'.
To select a link, just enter the link number followed by '#'.
Other classes of functionality can be implemented by assigning them to codes greater than 90. For instance, 91 could be mapped to a 'go to URL' function. The URL can be supplied either by selecting it from a personalized bookmark list or by typing it on the touch-tone keypad.
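The input conventions listed above can be summarized in a small parser. This is a sketch under stated assumptions rather than the actual WebByPhone code: the class DtmfCommand and its Kind values are hypothetical, link numbers are assumed to start at 1, and numeric input is limited to four digits for simplicity.

```java
class DtmfCommand {
    enum Kind { HELP, HANGUP, LINK, FUNCTION, INVALID }

    final Kind kind;
    final int number; // link number or function code; 0 when not applicable

    private DtmfCommand(Kind kind, int number) {
        this.kind = kind;
        this.number = number;
    }

    // Interprets one complete keypad entry, including its terminating '#'.
    static DtmfCommand parse(String input) {
        if (input == null || !input.endsWith("#")) {
            return new DtmfCommand(Kind.INVALID, 0); // every entry must end with '#'
        }
        String body = input.substring(0, input.length() - 1);
        if (body.equals("*")) return new DtmfCommand(Kind.HELP, 0);
        if (body.equals("**")) return new DtmfCommand(Kind.HANGUP, 0);
        if (!body.matches("[0-9]{1,4}")) return new DtmfCommand(Kind.INVALID, 0);
        int n = Integer.parseInt(body);
        if (n == 0) return new DtmfCommand(Kind.INVALID, 0);
        // Codes greater than 90 are reserved for extra functions such as 'go to URL'.
        return new DtmfCommand(n > 90 ? Kind.FUNCTION : Kind.LINK, n);
    }
}
```

Under these rules "3#" selects link 3, "91#" invokes function 91, "*#" asks for help, and "**#" ends the call.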
Although entering text through a touch-tone keypad is slow and tedious, the following scheme can be used:
The user enters each alphabetical character from a to z with a combination of two keys: the key containing the character, followed by the number specifying the position of the character on that key.
For example to enter 'home' the user will type 42 63 61 32.
It is possible to map special characters like '@' and '.' into particular codes.
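The two-key scheme can be expressed as a pair of encode and decode routines. The sketch below assumes the common keypad layout in which 7 carries "pqrs" and 9 carries "wxyz"; the class name KeypadText is hypothetical, and special characters such as '@' and '.' are left out.

```java
import java.util.Locale;

class KeypadText {
    // Letters printed on keys 2..9 (assumed layout); index = key digit.
    private static final String[] KEYS =
            {"", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

    // Encodes each letter as <key><position on key>, pairs separated by spaces.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            int key = 2;
            while (key <= 9 && KEYS[key].indexOf(c) < 0) {
                key++;
            }
            if (key > 9) {
                throw new IllegalArgumentException("no key for '" + c + "'");
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(key).append(KEYS[key].indexOf(c) + 1);
        }
        return out.toString();
    }

    // Decodes a sequence of digit pairs; spaces between pairs are optional.
    static String decode(String keys) {
        String digits = keys.replace(" ", "");
        if (digits.length() % 2 != 0) {
            throw new IllegalArgumentException("expected pairs of digits");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < digits.length(); i += 2) {
            int key = digits.charAt(i) - '0';
            int pos = digits.charAt(i + 1) - '0';
            if (key < 2 || key > 9 || pos < 1 || pos > KEYS[key].length()) {
                throw new IllegalArgumentException("invalid pair at offset " + i);
            }
            out.append(KEYS[key].charAt(pos - 1));
        }
        return out.toString();
    }
}
```

With this table, KeypadText.encode("home") yields "42 63 61 32", matching the example in the text, and KeypadText.decode("42 63 61 32") returns "home".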

Installation notes:

The whole system code can be packaged and deployed as a single zip archive file.
With this strategy, installing WebByPhone requires nothing more than copying the file to a destination directory, since the Java VM can load the classes as needed directly from the archive.
The following is a brief description of the installation sequence of all the helper applications needed by WebByPhone.
ViaVoice Gold

Speech for Java 0.6

SerialPort

Unzip the SerialPort installation and place the DLL in the windows/System32 directory.

Lynx

Java JDK 1.1.5.

Download the official JDK from JavaSoft and install the self-extracting executable with the wizard.

Future Work.

Although WebByPhone implements all the core functionality that allows the user to navigate a web page by phone, several features should be added to make the system effective in practice.
It would be useful to emulate the behavior of a common browser by making it possible to bookmark pages, to access a page directly from a bookmark, and to keep a history of visited pages for back and forward navigation. This feature can be added by introducing a new module without changing the architecture. Another enhancement is the handling of form data entry, which would allow the user to input data and take advantage of search engines.
An additional refinement concerns the way HTML is rendered. Since JSAPI makes it possible to select a speech synthesizer with different attributes, it should be possible to render different HTML tags with different voice tones, even switching between male and female voices to emphasize tags such as anchors.
All of these enhancements can be introduced without changing the main architecture, simply by adding or modifying a few modules.

DISCLAIMER: This document states my personal opinions, for which I am fully responsible.

References

[1] BLYNX - http://leb.net/blinux/blynx/index.html
[2] BLINUX - http://leb.net/blinux/index.html
[3] BLINUX Project - http://leb.net/blinux/betas.html
[4] NetPhonic Communications - http://www.netphonic.com/company/company.htm
[5] Web-On-Call - http://www.netphonic.com/product/woc/wocprod.htm
[6] Email-On-Call - http://www.netphonic.com/product/eoc/eocprod.htm
[7] Java API for speech - http://java.sun.com/javaone/javaone98/sessions/T604/
[8] java-media - http://java.sun.com/products/java-media/speech
[9] Speech for Java v0.61 - http://www.alphaworks.ibm.com/
[10] IBM ViaVoice - http://www.software.ibm.com/is/voicetype/
[11] Microsoft Speech API - http://www.microsoft.com/directx/pavilion/dsound/speechapi.htm
[12] SerialPort from Solutions Consulting - http://www.sc-systems.com/serPort.html
[13] Lynx - http://www.slcc.edu/lynx/release2-8/lynx2-8/lynx_help/lynx_help_main.html
[14] Telephone Access unit T-311 - http://www.teltone.com/cti/t-311.html
[15] Design Patterns: Elements of Reusable Object-Oriented Software, Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Addison-Wesley.

Last updated: Sunday, May 3, 1998 by Francesco Caruso