VoiceXML, the Voice Extensible Markup Language, is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, and recording of audio for telephony applications. It brings the advantages of web technologies to telephone users by providing an interactive voice response (IVR) mechanism. VoiceXML is being standardized by the VoiceXML Forum. The specification defines a language for presenting voice dialogs.
Session Initiation Protocol (SIP) is a popular Internet telephony protocol. It uses the Real-time Transport Protocol (RTP) to carry real-time multimedia traffic, including audio packets.
A SIP-based VoiceXML browser is a software application that an Internet telephone user can connect to and interact with. The browser is in some ways similar to a web browser: just as a web browser fetches content from a web server and displays it to the user, the VoiceXML browser fetches vxml pages from the web server and presents the interactive dialog to the telephone user.
Software/Hardware Requirements:
You must do the project in C/C++ only. Some of the available libraries work only on Linux, so it is important to get hold of a Linux system. You may also try to compile it on the Windows platform.
Question-1:
We will start by sketching an overall design of the browser. Skim through the VoiceXML specification and present a two-page (including any diagrams) description of the design of such a SIP-based VoiceXML browser. You should point out the roles of the following components in your design and how they interact with the browser (either inside the browser application or through an external interface). You should also describe a simple dialog (e.g., asking for a four-digit PIN to authenticate in a voice mail system) and the interaction among the components.
Components: SIP, RTP, DTMF detection, DTMF generation, HTTP, XML parser, Text-to-speech, Speech recognition, VoiceXML interpreter, Grammar matching rules.
There are many Internet phones (software as well as hardware based) available. We will use a locally developed SIP phone for testing. The goal of the project is to allow a regular telephone user to use the system, so towards the end of the project you will be required to demonstrate the project using a regular telephone. The audio data sent is usually encoded using a low-bitrate speech codec, like G.729 or G.723.1. However, for our homework we will use the most common encoding, PCMU (G.711 mu-law). SIP is used for establishing and terminating an Internet telephony call, whereas RTP is used for transporting the encoded audio packets.
There are a number of ways in which DTMF can be transported in a call. The most common way is not to distinguish it from the spoken voice: the DTMF tones are encoded using the audio codec currently in use and sent to the remote party without making any distinction between DTMF and regular speech. A second way is to define a special RTP packet format (RFC 2833) to carry the DTMF digit. Such a special packet contains the digit(s) instead of encoded audio. In the first case the receiver has to do the DTMF detection, whereas in the second case the sender has to do the DTMF detection. We will implement both methods. (A third method, transporting DTMF along with the SIP signaling messages, will not be considered in this project.)
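For the second transport method, the receiver only has to decode a small fixed-size payload. The following is a minimal sketch of parsing the 4-byte RFC 2833 telephone-event payload (event code, end bit and volume, 16-bit duration). The struct and function names are hypothetical and are not part of the Elemedia RTP library; extracting the payload from the RTP packet is assumed to be done elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded RFC 2833 telephone-event payload (illustrative names). */
struct dtmf_event {
    char     digit;     /* '0'-'9', '*', '#', 'A'-'D' */
    int      end;       /* 1 when the E (end of event) bit is set */
    unsigned volume;    /* tone level, 0..63 (in -dBm0) */
    unsigned duration;  /* in RTP timestamp units (1/8000 s at 8 kHz) */
};

/*
 * Parse the 4-byte payload:
 *   byte 0     event code (0-9 digits, 10 '*', 11 '#', 12-15 'A'-'D')
 *   byte 1     E bit (0x80), R bit (0x40), volume (low 6 bits)
 *   bytes 2-3  duration, network byte order
 * Returns 0 on success, -1 if the payload is too short or not a DTMF event.
 */
int parse_telephone_event(const uint8_t *payload, size_t len,
                          struct dtmf_event *ev)
{
    static const char digits[] = "0123456789*#ABCD";

    assert(ev != NULL);
    if (payload == NULL || len < 4 || payload[0] > 15)
        return -1;

    ev->digit    = digits[payload[0]];
    ev->end      = (payload[1] & 0x80) != 0;
    ev->volume   = payload[1] & 0x3Fu;
    ev->duration = ((unsigned)payload[2] << 8) | payload[3];
    return 0;
}
```

A sender typically repeats the event packet while the key is held, so a receiver would usually report a digit once per event, for example when the end bit is first seen.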
Question-2:
Using the Columbia SIP library and the Elemedia RTP Library, implement a SIP client which receives a call from a SIP phone. Once the connection is established, it receives the audio packets over RTP, does DTMF detection and prints out any digit that is detected. Use the plugnSIP software phone to test your program. Sample code for a SIP user agent, RTP library and DTMF detection will be provided to you. TBD.
After accepting an incoming call, a browser fetches an initial VoiceXML page from a web server using HTTP. Once the page is available, it starts its interpreter and presents any dialog specified in the page. For instance, the vxml page may ask the user to enter a four-digit PIN to authenticate. Such a dialog is written as text in the vxml page and needs to be converted to speech. The browser invokes a text-to-speech converter to convert any prompt and presents it to the telephone user.
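For waveform DTMF detection, one widely used technique is the Goertzel algorithm, which measures the signal power at each of the eight DTMF frequencies. The sketch below is an illustration only and is not the sample detection code mentioned above. It assumes the PCMU payload has already been decoded to 16-bit linear samples at 8000 samples/s, and the function names and the `min_power` threshold parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2*cos(2*pi*f/8000) for the DTMF row and column frequencies. */
static const double ROW_COEFF[4] = {1.7077, 1.6453, 1.5687, 1.4782}; /* 697, 770, 852, 941 Hz */
static const double COL_COEFF[4] = {1.1641, 0.9964, 0.7986, 0.5685}; /* 1209, 1336, 1477, 1633 Hz */

static const char KEYPAD[4][4] = {
    {'1', '2', '3', 'A'},
    {'4', '5', '6', 'B'},
    {'7', '8', '9', 'C'},
    {'*', '0', '#', 'D'}
};

/* Goertzel recurrence: squared magnitude of one frequency over n samples. */
static double goertzel_power(const int16_t *x, size_t n, double coeff)
{
    double s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double s0 = x[i] + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

/*
 * Return the key whose row and column tones are strongest in the frame,
 * or '\0' when either tone is weaker than min_power (e.g. silence).
 */
char detect_dtmf(const int16_t *x, size_t n, double min_power)
{
    int row = 0, col = 0;
    double row_power = 0.0, col_power = 0.0;

    assert(x != NULL);
    for (int i = 0; i < 4; i++) {
        double p = goertzel_power(x, n, ROW_COEFF[i]);
        if (p > row_power) { row_power = p; row = i; }
        p = goertzel_power(x, n, COL_COEFF[i]);
        if (p > col_power) { col_power = p; col = i; }
    }
    if (row_power < min_power || col_power < min_power)
        return '\0';
    return KEYPAD[row][col];
}
```

This sketch has no protection against speech being mistaken for a tone. A more robust detector would also compare the winning tones against the other frequencies and require the same key in several consecutive frames before printing it.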
Many HTTP client libraries are available on the web that allow an application to GET or POST an HTTP query and retrieve web pages or other web content. Search the W3C web site for such a library in C/C++. We will use the IBM text-to-speech API, ViaVoice.
Question-3:
Extend your browser so that, when connected, it fetches a predefined web page from http://www.cs.columbia.edu/~kns10/example.txt containing a prompt. It then calls the TTS API to convert the prompt to speech and sends the converted speech over RTP to the telephone user. When the telephone user sends the four digits, it again uses the TTS to convert them to speech and reads out something like, "you entered four, seven, two and zero. Thank you." The browser should be able to handle abnormal call termination at any point.
Use the callback mechanism of ViaVoice to do text-to-speech, i.e., you give a piece of text to the library and it calls a callback function when the conversion is done. You can use this callback function to packetize the audio and send it to the remote party. Your packets should be 20 ms long, i.e., if you are using G.711 mu-law then the payload will be 160 bytes per packet. If the converted audio is longer than this, you will need to fragment it and send packets of 160 bytes every 20 ms.
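The 160-byte figure follows from the sampling rate: G.711 carries 8000 samples per second at one byte per sample, so 20 ms holds 8000 x 0.020 = 160 bytes. A minimal sketch of the fragmentation step is shown below. It assumes the TTS output has already been converted to mu-law bytes, and the function names are hypothetical. Sending each frame through the RTP library and pacing the frames 20 ms apart are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES  160   /* 20 ms of G.711: 8000 samples/s, 1 byte per sample */
#define ULAW_SILENCE 0xFF  /* mu-law code for a zero-amplitude sample */

/* Number of 20 ms frames needed to carry len bytes of audio. */
size_t frame_count(size_t len)
{
    return (len + FRAME_BYTES - 1) / FRAME_BYTES;
}

/*
 * Copy frame number idx of the audio buffer into out, padding a short
 * final frame with silence. Returns the RTP timestamp offset of the
 * frame relative to the first frame (160 timestamp units per frame).
 */
uint32_t get_frame(const uint8_t *audio, size_t len, size_t idx,
                   uint8_t out[FRAME_BYTES])
{
    size_t off, n;

    assert(audio != NULL && out != NULL && idx < frame_count(len));
    off = idx * FRAME_BYTES;
    n = (len - off < FRAME_BYTES) ? len - off : FRAME_BYTES;
    memcpy(out, audio + off, n);
    memset(out + n, ULAW_SILENCE, FRAME_BYTES - n);
    return (uint32_t)off;
}
```

A sender loop would call `get_frame` for each index from 0 to `frame_count(len) - 1`, add the returned offset to the timestamp of the first packet, and wait 20 ms between packets.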
At this point we have integrated most of the external systems into our browser. However, the most important component, the interpreter, is still not implemented. A VoiceXML interpreter accepts a VoiceXML page and executes it. While executing the page it may use the HTTP client to fetch more pages, TTS to convert text to speech, RTP to send the audio back to the user, and so on. It also uses the grammar matching rules to match the input (either DTMF or spoken voice) from the telephone user against the available set of commands. For instance, if a matching rule specifies "mastercard | visa | amex" then the system should be able to match any of these spoken words: mastercard, visa or amex.
For the purpose of this project we will not worry about spoken voice. Our browser will accept input only using DTMF (both waveform and RFC 2833). Also we will not use the Java API for DTMF as mentioned in the VoiceXML specification. Instead we will define our own grammar for matching DTMF digits.
DTMF Grammar:
A typical dtmf tag in a vxml page may look like:
    <dtmf type="application/x-dtmf"> 1 | 2 | 3 | 4 | * | # </dtmf>
The MIME type for this grammar is "application/x-dtmf". The matching rule is very similar to glob string pattern matching. A special keyword T is used to indicate a timeout. So the following grammar can be used to enter a phone number:
    <dtmf type="application/x-dtmf"> 7??? | [34]???? | 1?????????? | 011*T | ??????? </dtmf>
In this example, the phone numbers are either internal 4- or 5-digit numbers, a local number, a US long-distance number or an international number. The value of T is 1 second. If you expect more delay then use multiple T's, e.g., 011*TTT will wait for 3 seconds before assuming the current set of dialed digits to be the international number. Note that the grammar must be in prefix order, e.g., if you have a matching rule for 3? then any other matching rule of the form 34? will be ignored. However, the matching rules 3?T and 34? can co-exist. The matching rules are applied from left to right when multiple rules are specified using the binary OR ("|") operator.
? is used to match a single character, while * can match any sequence of characters, including none. Square brackets [ ] are used to match one digit from a set, e.g., [345] matches either 3, 4, or 5.
Most scenarios should have a simple grammar similar to the first example. However, some scenarios may require more complex grammars. A formal definition of this grammar is deferred for now.
Complex matching of the form "011*#" is also done. This particular example expects a terminating dtmf tone, "#", at the end of the international number. It is always a good practice to include a terminating character for multi-digit inputs. Note, however, that "*" can appear only once in the matching sequence.
You can also specify the lower case characters, a-z, in the matching rule. The character is mapped to the corresponding digit on a standard telephone dialpad as follows:
abc->2 def->3 ghi->4 jkl->5 mno->6 pqrs->7 tuv->8 wxyz->9
However, be careful to avoid any ambiguity, e.g., "44?" and "hg?" are the same. To match the DTMF digit "*", use "\*" instead, e.g., "\*s" will match the DTMF digit "*" followed by "7".
Question-4:
Implement the DTMF grammar matching program based on the grammar described above. The program should use the grammar from the file dtmf.txt in the current directory. It should accept the user input from stdin and print to the screen whenever a particular grammar rule is matched. We will provide the code for glob-style string matching in C. We will also provide example grammar files. An example program trace is shown below:
    $ cat dtmf.txt
    7??? | [34]???? | 1?????????? | 011*T | ???????
    $ ./dtmfmatch
    71
    246
    Matched 7124->7???
    726
    77843
    Matched 6726778->???????
    Ctrl-C
    $
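As a starting point for Question-4, the matching rules described above can be sketched as follows. This is a stand-in for illustration, not the glob code that will be provided. All function names are hypothetical, and the handling of T is an assumption: trailing T's are counted, and a rule ending in T's is accepted only when the caller reports at least that many seconds of silence since the last key.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Map a lower-case letter to its dialpad digit; other characters are unchanged. */
static char keypad_digit(char c)
{
    static const char map[] = "22233344455566677778889999"; /* a..z */
    return (c >= 'a' && c <= 'z') ? map[c - 'a'] : c;
}

/* Match the digit string s against the pattern in [p, pend). */
static int match_here(const char *p, const char *pend, const char *s)
{
    while (p < pend) {
        if (*p == '*') {                    /* any run of keys, including none */
            for (const char *t = s; ; t++) {
                if (match_here(p + 1, pend, t)) return 1;
                if (*t == '\0') return 0;
            }
        }
        if (*s == '\0') return 0;           /* pattern needs more keys */
        if (*p == '?') { p++; s++; continue; }
        if (*p == '[') {                    /* one key from a set, e.g. [345] */
            const char *q = p + 1;
            int found = 0;
            while (q < pend && *q != ']') {
                if (keypad_digit(*q) == *s) found = 1;
                q++;
            }
            if (q == pend || !found) return 0;
            p = q + 1; s++;
            continue;
        }
        if (*p == '\\' && p + 1 < pend) p++; /* "\*" is the literal star key */
        if (keypad_digit(*p) != *s) return 0;
        p++; s++;
    }
    return *s == '\0';
}

/*
 * Try each '|'-separated rule from left to right. On a match, copy the
 * rule text into matched and return 1; otherwise return 0.
 */
int grammar_match(const char *grammar, const char *digits, int idle_seconds,
                  char *matched, size_t matched_size)
{
    const char *p = grammar;

    assert(grammar && digits && matched && matched_size > 0);
    while (*p != '\0') {
        const char *end = strchr(p, '|');
        const char *b = p, *e, *pe;
        int timeouts = 0;

        if (end == NULL) end = p + strlen(p);
        e = end;
        while (b < e && isspace((unsigned char)*b)) b++;
        while (e > b && isspace((unsigned char)e[-1])) e--;
        pe = e;
        while (pe > b && pe[-1] == 'T') { pe--; timeouts++; }

        if (e > b && idle_seconds >= timeouts && match_here(b, pe, digits)) {
            size_t len = (size_t)(e - b);
            if (len >= matched_size) len = matched_size - 1;
            memcpy(matched, b, len);
            matched[len] = '\0';
            return 1;
        }
        p = (*end != '\0') ? end + 1 : end;
    }
    return 0;
}
```

This sketch checks a complete digit string. A driver that reproduces the trace above would append each key as it arrives, call `grammar_match` after every key (and again as silence accumulates), and clear its buffer once a rule matches.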
A VoiceXML browser needs to parse vxml pages. Vxml pages are written in XML, which describes a document using a tree-like structure. It uses tags similar to HTML tags, and the set of tags forms a tree. Xerces-C is a freely available XML parser. It supports two parsing interfaces, SAX and DOM. We will use the DOM parser, which gives a tree-like data structure for the document.
Question-5:
Use the Xerces-C DOM parser to parse a VoiceXML page. After parsing, the program should list the forms and fields from the page. The vxml page will be passed on stdin. We will provide example vxml pages. Alternatively, you can use some of the examples in the specification.
The VoiceXML specification lists many tags. Although a complete VoiceXML browser implementation should support all the tags, we will implement a subset. In particular, we will support the following tags:
Supported Tags: assign, audio, block, break, catch, choice, clear, disconnect, dtmf, else, elseif, enumerate, error, exit, field, filled, form, goto, help, if, menu, meta, noinput, nomatch, option, prompt, property, record, submit, throw, value, var, vxml.
We will not support JavaScript or any other scripting language in the browser. A detailed requirement for each tag is given below:
Question-6:
Implement a VoiceXML interpreter as given in the specification. Support the tags mentioned above. Your program need not integrate RTP, DTMF and text-to-speech at this point. However, you should use the HTTP client library to fetch the web pages.
The program has a command line interface. The URL of the initial page is passed on the command line. Once the program is started, it outputs any prompts to the user on stdout, based on the vxml page content. The user can enter the DTMF digits on stdin. In short, use stdin and stdout instead of a telephone. Note that your program should not wait for the user to type a newline before the input is accepted. You need to integrate the grammar matching rules too.
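On Linux, one way to accept each key without waiting for a newline is to turn off canonical (line-buffered) mode on the terminal with the POSIX termios interface. The sketch below assumes stdin is a terminal. The function names are hypothetical, and a Windows build would need a different mechanism.

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/* Adjust terminal settings so read() returns after every single key. */
void make_char_mode(struct termios *t)
{
    assert(t != NULL);
    t->c_lflag &= ~(tcflag_t)ICANON;  /* disable line buffering */
    t->c_cc[VMIN]  = 1;               /* read() returns after 1 byte */
    t->c_cc[VTIME] = 0;               /* no inter-byte timeout */
}

/*
 * Switch stdin to per-key input. The previous settings are stored in
 * *saved so the caller can restore them with tcsetattr() before exiting.
 * Returns 0 on success, -1 on failure (e.g. stdin is not a terminal).
 */
int stdin_char_mode(struct termios *saved)
{
    struct termios t;

    assert(saved != NULL);
    if (tcgetattr(STDIN_FILENO, saved) != 0)
        return -1;
    t = *saved;
    make_char_mode(&t);
    return tcsetattr(STDIN_FILENO, TCSANOW, &t);
}
```

After `stdin_char_mode` succeeds, each `read(STDIN_FILENO, &key, 1)` call returns as soon as one key is typed. Echo is left on in this sketch so the typed digits remain visible.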
Question-7:
Integrate SIP, RTP, DTMF and text-to-speech into your interpreter to complete the VoiceXML browser implementation. Now you should be able to make a call to your browser from a SIP Internet phone, or from a regular telephone using a gateway. Use the VoiceXML pages provided to you as an example. TBD. This example will allow you to check your email from your telephone.
After you have completed the implementation, please consider writing a technical report describing the implementation and design choices. This is optional.