Interworking between Instant Messaging and Voice Calls


Naoya Seta
Columbia University
New York, NY 10027
USA
ns2015@columbia.edu

Abstract

The simvoice project implements a translator between SIP-based audio calls and SIP-based instant messaging (IM). It allows an audio-based phone user to exchange messages with a text-based IM user via text-to-speech and speech-to-text conversion.

Introduction

SIP (Session Initiation Protocol), described in RFC 3261 [1], is used for signaling Internet telephony calls. SIP identifies a user by an email-like address called a SIP URI, e.g., sip:alice@cs.columbia.edu. SIP can be used for IP telephony call establishment and termination, instant messaging, and presence indication. Sipc [2] is a software tool that supports these functions. I have implemented a translator, called simvoice, between SIP-based audio calls and SIP-based IM sessions using the CINEMA [3] libraries.

A SIP caller calls the SIP IM user by dialing (sending an INVITE to) the translator and specifying the actual final destination in the user part of the request-URI. Simvoice receives the SIP INVITE message, identifies the final destination, and sends an initial greeting to the IM user using the SIP MESSAGE method [4]. Simvoice maintains the association between the caller and the final destination for the duration of the call. The audio stream spoken by the caller is converted to text using the CMU Sphinx2 [5] speech-to-text engine and sent as a MESSAGE to the IM user. The content of each MESSAGE received from the IM user is converted to speech using the IBM ViaVoice [6] text-to-speech engine and streamed to the phone user. The packetized audio between the caller and simvoice is carried over RTP [7].

The rest of this document describes the background and motivation, the system architecture and flow of events, and the implementation.

Background

Users in networked communities work in a heterogeneous environment in which each has different system capabilities: some have only IP phones, others have PCs, and some may want to call an Internet user from the PSTN through a gateway. In such a diverse environment, users need to interact with each other using different media preferences such as text, voice, and video. An intermediate converter that translates between these media for users with different capabilities would therefore give them considerable flexibility in how they communicate.

One such need is communication support for deaf, hard of hearing, and speech-impaired individuals, who are often unable to use commonly available communication devices because these devices are said to interoperate poorly. The current difficulties of these individuals, the corresponding SIP user requirements, and the multi-functional potential of SIP-based communications are discussed in [10].

Technologies Overview

Architecture

Overview

The diagram below shows the general layout of the components in the system.

   Host A                                                         Host B
 +-------+                +-------------+                       +---------+
 | Voice |     INVITE     |  simvoice   |    Initial MESSAGE    | IM User |
 | Caller| -------------> |  IM / call  | ------------------->  |         |
 |(Alice)|                |  Converter  |                       |  (Bob)  |
 +-------+                +-------------+                       +---------+ 
                                 |                             
                                 |                           
                          +-------------+                  
                          |     TTS     |                
           <------------- | Text2speech | <-------------------
                          |    thread   |
            RTP Session   +-------------+   Instant Messaging
                          |     ASR     |
           -------------> | Speech2text | ------------------->
                          |    thread   |
                          +-------------+

      *simvoice: SIP Instant message / Voice call Converter
      *TTS: Text To Speech
      *ASR: Automatic Speech Recognition
      *IM: Instant Messaging 

Flow of Events

  1. Alice at host A makes a voice call to the simvoice converter, indicating that the final destination is Bob at host B. The final destination is base64-encoded in the user part of the request-URI, e.g., base64(bob@hostB)@simvoice.
  2. Simvoice extracts and decodes the user part and tries to send an initial greeting message to bob@hostB. If simvoice gets a 200 OK response within the timeout, it creates an RTP session, starts an ASR module, and accepts the call from Alice.
  3. When Alice says something and then stays silent for longer than a specified duration, the ASR module, which is listening to Alice's speech over the RTP connection, treats the silence as the end of the talk-spurt and converts the speech to text using an ASR engine. Simvoice then sends the ASR result to bob@hostB. It also sends an instant message with the ASR result to Alice so that she can check what was sent to Bob. If Alice does not speak for a specified duration, simvoice notifies both Alice and Bob of the long silence using instant messages.
  4. When Bob sends a message to simvoice, simvoice knows that the associated caller is Alice because Bob sends the message to session_id@simvoice, the address that simvoice gave as its own in the initial greeting message. Simvoice keeps the association between Alice and Bob, using the call ID with Alice and the session ID with Bob, as long as the call with Alice is active. After receiving a message from Bob, simvoice hands it to the TTS module, which sends synthesized speech to Alice over the RTP connection.
  5. When Alice hangs up, the association is discarded. Simvoice then sends a message to Bob to notify him that he can no longer send messages to Alice. Any message Bob sends after this is discarded, and simvoice replies with an error message saying that the association no longer exists and his message cannot be delivered.
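
Step 1 and step 2 can be illustrated with a minimal sketch of how the final destination might be packed into, and recovered from, the user part of the request-URI. The function names and the host name simvoice.example.com are hypothetical and chosen only for illustration; the sketch also assumes the standard base64 alphabet, which this report does not specify.

```python
import base64


def encode_request_uri(destination: str, translator_host: str) -> str:
    """Build sip:<base64(destination)>@<translator_host> (hypothetical helper)."""
    user = base64.b64encode(destination.encode("utf-8")).decode("ascii")
    return f"sip:{user}@{translator_host}"


def decode_final_destination(request_uri: str) -> str:
    """Recover the final destination from the user part of a request-URI."""
    if not request_uri.startswith("sip:"):
        raise ValueError("not a SIP URI")
    user, separator, _host = request_uri[len("sip:"):].partition("@")
    if not separator:
        raise ValueError("request-URI has no user part")
    # validate=True rejects characters outside the base64 alphabet
    return base64.b64decode(user, validate=True).decode("utf-8")


uri = encode_request_uri("bob@hostB", "simvoice.example.com")
print(uri)
print(decode_final_destination(uri))
```

Encoding the destination keeps its own "@" sign out of the user part, so the request-URI still parses as a single user at the translator's host.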

Program Documentation

Implementation

The implementation of simvoice includes modifications to Libsip++. One modification lets the library format and send a SIP MESSAGE request; it reuses existing code, the policy thread, to provide SIP retransmission and timeout behavior. The other modification makes the library call back an application-level function when it receives a SIP MESSAGE from the network.
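
To give an idea of what formatting a SIP MESSAGE request involves, the following is a minimal sketch that builds the request text with the usual SIP headers. It is not the Libsip++ interface: the function name, its parameters, and the sample values are assumptions made for illustration, and retransmission and timeout handling are left out.

```python
def format_sip_message(from_uri: str, to_uri: str, call_id: str, cseq: int,
                       body: str, via_host: str, branch: str, tag: str) -> bytes:
    """Format a SIP MESSAGE request carrying a plain-text body (illustrative only)."""
    payload = body.encode("utf-8")
    lines = [
        f"MESSAGE {to_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {via_host};branch={branch}",
        "Max-Forwards: 70",
        f"From: <{from_uri}>;tag={tag}",
        f"To: <{to_uri}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} MESSAGE",
        "Content-Type: text/plain",
        # Content-Length counts bytes of the encoded body, not characters
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates headers from the body
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + payload


request = format_sip_message(
    from_uri="sip:session42@simvoice.example.com",  # hypothetical session address
    to_uri="sip:bob@hostB",
    call_id="a84b4c76e66710",
    cseq=1,
    body="Hello Bob",
    via_host="simvoice.example.com",
    branch="z9hG4bK776asdhds",
    tag="49583",
)
print(request.decode("utf-8"))
```

Using the session address in the From header matches the flow of events above, where the IM user replies to session_id@simvoice.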

The ASR module of simvoice is implemented using the Sphinx2 libraries and an RTP library. It receives the voice stream from a caller, detects silence, en-queues each talk-spurt into the ASR engine, and then calls back an application-level function with the result.
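
This report does not say how silence is detected. One common approach is to compare the energy of each audio frame against a threshold, and the sketch below shows that approach only as an assumption. It further assumes that the RTP payload has already been decoded to 16-bit little-endian linear PCM; the function names and the threshold value are hypothetical.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian PCM samples."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_silence(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as silence when its RMS amplitude is below the threshold."""
    return frame_rms(frame) < threshold


quiet = struct.pack("<4h", 10, -10, 10, -10)
loud = struct.pack("<4h", 8000, -8000, 8000, -8000)
print(is_silence(quiet), is_silence(loud))
```

A fixed threshold is the simplest choice; a real deployment would probably need to adapt it to the background noise level of each call.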

The existing code from the project "Email Notification over telephone" [8] was modified and optimized to implement the TTS module of simvoice. The module uses the IBM ViaVoice TTS library to synthesize text messages into speech, and the RTP library to stream the resulting voice data.

The main module of simvoice is a modified sipua [9] client. The basic SIP functionality is kept, but the client is modified to keep the state of associations between a voice caller and an IM user. The module is also in charge of sending messages to the associated IM user on establishment and termination of a call. In addition, it starts the ASR and TTS modules and interacts with them at the end of each talk-spurt of a caller and on receipt of an instant message from an IM user, respectively.
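
The association state kept by the main module can be pictured as a table indexed both by the call ID (used on the caller side) and by the session ID (used on the IM side). The sketch below is an illustration under that assumption and is not taken from the sipua code; all class, field, and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Association:
    call_id: str      # identifies the voice call with the caller
    session_id: str   # user part of the address the IM user replies to
    caller_uri: str
    im_user_uri: str


class AssociationTable:
    """Keeps caller/IM-user associations while the voice call is active."""

    def __init__(self) -> None:
        self._by_call_id: Dict[str, Association] = {}
        self._by_session_id: Dict[str, Association] = {}

    def add(self, assoc: Association) -> None:
        self._by_call_id[assoc.call_id] = assoc
        self._by_session_id[assoc.session_id] = assoc

    def find_by_call(self, call_id: str) -> Optional[Association]:
        """Used at the end of a talk-spurt to find the IM user to notify."""
        return self._by_call_id.get(call_id)

    def find_by_session(self, session_id: str) -> Optional[Association]:
        """Used on an incoming MESSAGE; None means the message cannot be delivered."""
        return self._by_session_id.get(session_id)

    def remove_by_call(self, call_id: str) -> Optional[Association]:
        """Called when the caller hangs up; returns the removed association, if any."""
        assoc = self._by_call_id.pop(call_id, None)
        if assoc is not None:
            self._by_session_id.pop(assoc.session_id, None)
        return assoc
```

With such a table, a MESSAGE that arrives after the caller has hung up finds no entry for its session ID, which corresponds to the error reply described in the flow of events.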

ASR Processing

As shown in the diagram below, a talk-spurt is a series of sound data followed by one second (default) of silence, called the talk-spurt silence. The talk-spurt is en-queued into the ASR engine, and the callback function in the main module is called with the ASR result. The callback function is responsible for sending the result to the associated IM user. Silence shorter than one second that follows sound is regarded as part of the speech and is included in the talk-spurt. Silence that follows the talk-spurt silence is defined as non-talk-spurt silence and is discarded. However, if the non-talk-spurt silence lasts longer than five seconds (default), the callback function is also called with a fake ASR result saying "Silence 5 sec". By default, a talk-spurt is limited to a maximum length of fifteen seconds because Sphinx2 limits the maximum utterance length to thirty seconds.

                          max.15sec (default)            
        |non-talkspurt|<---------------------->|talkspurt| non-talkspurt     
        |  silence    |         voice          | silence |   silence    
 ...____|_____________|________________________|__ ______|____________...
        |<----------->|<---------------------->|<------->|
           discard            talkspurt          1sec (default)
              |                  |                
          [if > 5sec]            |  enqueue   +----------+             
              |                  +<---------> |ASR engine|
              |                  |  result    +----------+
              |                  |             
              |          +-------+--------+
              +--------->| OnEndTalkspurt |------> SIP MESSAGE
         "Silence 5 sec" |   (callback)   |
                         +----------------+
                             main module
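
The segmentation rules above can be sketched as a small state machine that is fed one audio frame at a time together with a silence flag. This is an illustration of the rules as described in this section, not the actual module: the class and method names are hypothetical, events are returned instead of invoking a callback, and reporting a long silence only once per silent period is an assumption.

```python
from typing import List, Tuple


class TalkspurtSegmenter:
    """Splits a frame stream into talk-spurts using the default timings above."""

    def __init__(self, end_silence: float = 1.0, long_silence: float = 5.0,
                 max_talkspurt: float = 15.0) -> None:
        self.end_silence = end_silence      # silence that ends a talk-spurt
        self.long_silence = long_silence    # non-talk-spurt silence worth reporting
        self.max_talkspurt = max_talkspurt  # upper bound on talk-spurt length
        self._frames: List[bytes] = []
        self._in_spurt = False
        self._spurt_len = 0.0
        self._silence_run = 0.0
        self._reported = False

    def feed(self, frame: bytes, silent: bool, duration: float) -> List[Tuple[str, object]]:
        """Consume one frame and return any ("talkspurt", audio) or ("silence", text) events."""
        events: List[Tuple[str, object]] = []
        if self._in_spurt:
            # Short silences stay inside the talk-spurt
            self._frames.append(frame)
            self._spurt_len += duration
            self._silence_run = self._silence_run + duration if silent else 0.0
            if (self._silence_run >= self.end_silence
                    or self._spurt_len >= self.max_talkspurt):
                events.append(("talkspurt", b"".join(self._frames)))
                self._frames = []
                self._in_spurt = False
                self._silence_run = 0.0  # non-talk-spurt silence is counted from here
                self._reported = False
        elif silent:
            # Non-talk-spurt silence: discard the audio, report once if it is long
            self._silence_run += duration
            if self._silence_run > self.long_silence and not self._reported:
                events.append(("silence", f"Silence {self.long_silence:g} sec"))
                self._reported = True
        else:
            # Sound after silence starts a new talk-spurt
            self._in_spurt = True
            self._frames = [frame]
            self._spurt_len = duration
            self._silence_run = 0.0
            self._reported = False
        return events
```

In simvoice terms, each "talkspurt" event corresponds to audio en-queued into the ASR engine, and each "silence" event corresponds to calling OnEndTalkspurt with the fake result.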

Configuration Details

Known Bugs and Implementation Issues

Future Works

Acknowledgements

The author would like to thank the following individuals for their contributions to this project:
Prof. Henning Schulzrinne
Kundan Singh
Xiaotao Wu
at the Department of Computer Science, Columbia University

References

[1]
J. Rosenberg, H. Schulzrinne, et al., SIP: Session Initiation Protocol, Request for Comments, Internet Engineering Task Force, RFC 3261, June 2002.
[2]
Columbia University, SIPC, SIPC web site.
[3]
Kundan Singh, Wenyu Jiang, et al., CINEMA: Columbia InterNet Extensible Multimedia Architecture, CINEMA Technical Report.
[4]
J. Rosenberg, H. Schulzrinne, et al., SIP Extension for Instant Messaging, Internet Draft, Internet Engineering Task Force, September 14, 2002. Work in progress.
[5]
Carnegie Mellon University, Sphinx2, CMU Sphinx web site.
[6]
IBM, ViaVoice TTS, ViaVoice TTS web site.
[7]
H. Schulzrinne, S. Casner, et al., RTP: A Transport Protocol for Real-Time Applications, Request for Comments, Internet Engineering Task Force, RFC 1889, January 1996.
[8]
Joseph Gagliano, Email Notification over telephone, Columbia University, CS Project Report.
[9]
Kundan Singh, Sankaran Narayanan, et al., SIPUA, document.
[10]
N. Charlton, M. Gasson, et al., User Requirements for the SIP in Support of Deaf, Hard of Hearing and Speech-impaired Individuals, Request for Comments, Internet Engineering Task Force, RFC 3351, August 2002.