Interworking between Instant Messaging and Voice Calls
Naoya Seta
Columbia University
New York, NY 10027
USA
ns2015@columbia.edu
Abstract
The simvoice project implements a translator between SIP-based audio
calls and SIP-based instant messaging (IM). It allows an audio-based
phone user to exchange messages with a text-based IM user via
text-to-speech and speech-to-text conversion.
Introduction
SIP (Session Initiation Protocol), described in RFC 3261 [1], is used for
signaling Internet telephony calls. SIP identifies a user by an
email-like address called a SIP URI, e.g., sip:alice@cs.columbia.edu.
SIP can be used for IP telephony call establishment and termination,
instant messaging, and presence indication. Sipc [2] is a
software tool that supports these functions. I have implemented a
translator, called simvoice, between a SIP-based audio call and a
SIP-based IM session using the CINEMA [3] libraries.
A SIP caller calls the SIP IM user by dialing (sending an INVITE to)
the translator and specifying the actual final destination in the user
part of the request-URI. Simvoice receives the SIP INVITE message,
identifies the final destination, and sends an initial greeting to the
destination IM user using the SIP MESSAGE method [4].
Simvoice maintains the association between the caller and the final
destination for the duration of the call. The audio stream spoken by
the caller is converted to text using the CMU Sphinx2 [5]
speech-to-text engine and sent to the IM user in a MESSAGE request.
The content of each MESSAGE request received from the IM user is
converted to speech using the IBM ViaVoice [6] text-to-speech engine
and streamed to the phone user.
The packetized audio between the caller and simvoice is carried over RTP [7].
Users of networked communities work in heterogeneous environments in which
each has different system capabilities: some have only IP phones, others
have PCs, and some may even want to call an Internet user from the PSTN
through a gateway.
In such a diverse environment, users need to interact with each other using
different media preferences such as text, voice and video.
An intermediate converter that translates between these media for users with
different capabilities would therefore give them considerable communication
flexibility.
One such need is support for communication by deaf, hard-of-hearing and
speech-impaired individuals, who are often unable to use commonly available
communication devices because those devices interoperate poorly.
The current difficulties of these individuals, SIP user requirements, and the
multi-functional potential of SIP-based communications are discussed in [10].
- SIP:
The Session Initiation Protocol (SIP) is an application-layer
control (signaling) protocol for creating, modifying and terminating sessions
with one or more participants. These sessions include Internet telephone calls,
multimedia distribution and multimedia conferences. Libsip++, an easy-to-use
C++ interface to Columbia University's SIP implementation, is available [1].
- RTP:
RTP is the Internet-standard protocol for the transport of real-time data,
including audio and video. It can be used for media-on-demand
as well as for interactive services such as Internet telephony. RTP consists
of a data part and a control part; the latter is called RTCP. An RTP library
is available for implementing applications [7].
- IM:
Instant Messaging (IM) is a popular Internet service in which users
exchange messages in near real time. IM is usually used conversationally:
messages are transferred fast enough for users to keep up an interactive
dialogue. The SIP extension for IM is described in [4].
- SIPC:
Sipc is a SIP user agent that can be used for Internet telephony, multimedia
conferences, instant messaging and so on. It supports a variety of media
types, such as audio, video, text and whiteboard, and can easily be extended
to other media types [2].
- IBM ViaVoice TTS:
ViaVoice TTS is a text-to-speech engine developed by IBM that synthesizes
speech from a textual representation of human language. An SDK was
available [6].
- CMU Sphinx:
Sphinx is a DARPA-funded project at Carnegie Mellon University in which a
series of speech recognition engines has been developed. Sphinx2 is a
real-time, large-vocabulary, speaker-independent speech recognition system.
Sphinx3 is also available and is said to be more accurate but slower than
Sphinx2. The libraries, written in C, include the core speech recognition
functions and auxiliary ones such as low-level audio capture functions [5].
Overview
Described below is the general layout of the components in the system.
 Host A                                                         Host B
+-------+                +-------------+                      +---------+
| Voice |     INVITE     |  simvoice   |   Initial MESSAGE    | IM User |
| Caller| -------------> |  IM / call  | -------------------> |         |
|(Alice)|                |  Converter  |                      |  (Bob)  |
+-------+                +-------------+                      +---------+
                                |
                                |
                         +-------------+
                         |     TTS     |
          <------------- | Text2speech | <-------------------
                         |   thread    |
            RTP Session  +-------------+  Instant Messaging
                         |     ASR     |
          -------------> | Speech2text | ------------------->
                         |   thread    |
                         +-------------+
*simvoice: SIP Instant message / Voice call Converter
*TTS: Text To Speech
*ASR: Automatic Speech Recognition
*IM: Instant Messaging
Flow of Events
- Alice at host A makes a voice call to the simvoice converter, indicating that the final destination is Bob at host B. The final destination is base64-encoded in the user part of the request-URI, e.g., base64(bob@hostB)@simvoice.
- Simvoice decodes the user part and tries to send an initial greeting message to bob@hostB. If simvoice receives a 200 OK response before the timeout, it creates an RTP session, starts an ASR module and accepts the call from Alice.
- When Alice says something and then stays silent for longer than a specified duration, the ASR module, which listens to Alice's speech over the RTP connection, treats the silence as the end of the talk-spurt and converts the speech to text using an ASR engine. It then sends the ASR result to bob@hostB. Simvoice also sends an instant message containing the ASR result to Alice so that she can confirm what was sent to Bob. If Alice does not speak for a specified duration, simvoice notifies both Alice and Bob of the long silence using instant messages.
- When Bob sends a message to simvoice, simvoice knows that the associated caller is Alice because Bob addresses the message to session_id@simvoice, the sender address of the initial greeting message from simvoice. Simvoice keeps the association between Alice and Bob, using the call ID with Alice and the session ID with Bob, for as long as the call with Alice is active. After receiving a message from Bob, simvoice hands it to the TTS module, which sends synthesized speech to Alice over the RTP connection.
- If Alice hangs up, the association is removed. Simvoice then sends a message to Bob to notify him that he can no longer send messages to Alice. Any message Bob sends after this is discarded, and simvoice sends him an error message saying that the association no longer exists and that his message cannot be delivered.
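The encoding of the final destination in the first step can be illustrated with a minimal Python sketch. It is not simvoice code: the function names and the host simvoice.example.com are hypothetical, and the use of the standard base64 alphabet is an assumption, since the report does not say which variant simvoice uses.

```python
import base64


def encode_request_uri(destination, translator_host):
    """Build a request-URI whose user part is the base64-encoded destination.

    Assumption: standard base64 alphabet; the report does not specify one.
    """
    user = base64.b64encode(destination.encode("utf-8")).decode("ascii")
    return f"sip:{user}@{translator_host}"


def decode_final_destination(request_uri):
    """Recover the final destination from the user part of a request-URI."""
    # Drop the "sip:" scheme, then split the user part from the host part.
    without_scheme = request_uri.split(":", 1)[1]
    user, _host = without_scheme.rsplit("@", 1)
    return base64.b64decode(user).decode("utf-8")


if __name__ == "__main__":
    uri = encode_request_uri("bob@hostB", "simvoice.example.com")
    print(uri)
    print(decode_final_destination(uri))
```

Encoding the destination keeps its own "@" out of the user part, so the translator can split the request-URI unambiguously before decoding.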
Implementation
The implementation of simvoice includes modifications to Libsip++.
One modification lets the library format and send SIP MESSAGE requests; it
reuses existing code, the policy thread, to provide the retransmission and
timeout features of SIP.
The other makes the library call back an application-level function when it
receives a SIP MESSAGE from the network.
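As an illustration of what formatting a SIP MESSAGE request involves, the following Python sketch assembles a minimal request with a text/plain body. It does not use or reproduce the Libsip++ API; the function name, the default host, branch and tag values are hypothetical, and retransmission and timeout handling are left out.

```python
def build_message_request(from_uri, to_uri, body, call_id, cseq=1,
                          via_host="simvoice.example.com",
                          branch="z9hG4bK-demo", tag="demo"):
    """Return the bytes of a minimal SIP MESSAGE request.

    All header values other than the method and the body are illustrative.
    """
    payload = body.encode("utf-8")
    lines = [
        f"MESSAGE {to_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {via_host};branch={branch}",
        "Max-Forwards: 70",
        f"From: <{from_uri}>;tag={tag}",
        f"To: <{to_uri}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} MESSAGE",
        "Content-Type: text/plain",
        # Content-Length counts bytes of the body, not characters.
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates headers from body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + payload


if __name__ == "__main__":
    request = build_message_request(
        "sip:session1@simvoice.example.com", "sip:bob@hostB",
        "Alice is calling you.", "call-1@simvoice.example.com")
    print(request.decode("utf-8"))
```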
The ASR module of simvoice is implemented using the Sphinx2 libraries and
the RTP library.
It receives the voice stream from a caller, detects silence, enqueues each
talk-spurt into the ASR engine, and then calls back an application-level
function with the results.
Existing code from the project "Email Notification over telephone" [8]
was modified and optimized to implement the TTS module of simvoice.
The module uses the IBM ViaVoice TTS library to synthesize text messages
into speech, and the RTP library to stream the resulting voice data.
The main module of simvoice is a modified sipua [9] client.
The basic SIP functionality is kept, but the client is modified to maintain
the state of associations between a voice caller and an IM user. The module
is also responsible for sending messages to the associated IM user when a
call is established and when it is terminated.
In addition, it starts the ASR and TTS modules and interacts with them at the
end of each talk-spurt from the caller and on receipt of each instant message
from the IM user, respectively.
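The association state described above can be sketched as a small table indexed both by the caller's call ID and by the session ID used toward the IM user. This is an illustrative Python sketch, not the modified sipua code; the class name, method names and field names are hypothetical.

```python
class AssociationTable:
    """Tracks caller/IM-user associations by call ID and by session ID."""

    def __init__(self):
        self._by_call_id = {}
        self._by_session_id = {}

    def add(self, call_id, session_id, caller_uri, im_uri):
        """Record an association when a call is accepted."""
        entry = {"call_id": call_id, "session_id": session_id,
                 "caller": caller_uri, "im_user": im_uri}
        self._by_call_id[call_id] = entry
        self._by_session_id[session_id] = entry

    def im_user_for_call(self, call_id):
        """Where to send an ASR result for this call, or None."""
        entry = self._by_call_id.get(call_id)
        return entry["im_user"] if entry else None

    def caller_for_session(self, session_id):
        """Which caller should hear an incoming instant message, or None."""
        entry = self._by_session_id.get(session_id)
        return entry["caller"] if entry else None

    def remove_call(self, call_id):
        """Forget the association when the caller hangs up."""
        entry = self._by_call_id.pop(call_id, None)
        if entry is not None:
            del self._by_session_id[entry["session_id"]]
        return entry


if __name__ == "__main__":
    table = AssociationTable()
    table.add("call-1", "session-1", "sip:alice@hostA", "sip:bob@hostB")
    print(table.caller_for_session("session-1"))
    table.remove_call("call-1")
    print(table.caller_for_session("session-1"))
```

A lookup that returns None corresponds to the case in which simvoice discards the instant message and reports an error to the IM user.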
ASR Processing
As shown in the diagram below, a talk-spurt is a run of sound data followed
by one second (default) of silence, called the talk-spurt silence.
The talk-spurt is enqueued into the ASR engine, and the callback function in
the main module is called with the ASR results.
The callback function is responsible for sending the results to the
associated IM user.
Silence shorter than one second that follows sound is regarded as part of
the speech and is included in the talk-spurt.
Silence that follows the talk-spurt silence is called non-talk-spurt silence
and is discarded. However, if non-talk-spurt silence lasts longer than five
seconds (default), the callback function is also called with a fake ASR
result saying "Silence 5 sec".
By default, a talk-spurt is limited to a maximum length of fifteen seconds
because Sphinx2 limits an utterance to at most thirty seconds.
                        max.15sec (default)
       |non-talkspurt|<---------------------->|talkspurt| non-talkspurt
       |   silence   |         voice          | silence |    silence
...____|_____________|________________________|_________|____________...
       |<----------->|<---------------------->|<------->|
           discard           talkspurt         1sec (default)
              |                  |
 [if > 5sec]  |                  |  enqueue   +----------+
              |                  +<---------> |ASR engine|
              |                  |  result    +----------+
              |                  |
              |          +-------+--------+
              +--------->| OnEndTalkspurt |------> SIP MESSAGE
 "Silence 5 sec"         |   (callback)   |
                         +----------------+
                             main module
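The segmentation rules above can be sketched in Python as a small state machine over per-frame voice/silence decisions. This is an illustration, not the simvoice ASR module: the function name, the 20 ms frame size and the event tuples are hypothetical, the silence detector itself is assumed to exist elsewhere, and the thresholds are treated as reached at exactly the configured duration. Repeating the silence notification every five seconds is also an assumption.

```python
def segment(frames, frame_ms=20, end_silence_ms=1000,
            long_silence_ms=5000, max_spurt_ms=15000):
    """Turn per-frame voice flags (True = voice) into a list of events.

    Events are ("talkspurt", length_ms) and ("silence", length_ms).
    """
    events = []
    in_spurt = False
    spurt_ms = 0      # current talk-spurt length, including pauses
    trailing_ms = 0   # silence accumulated at the end of the talk-spurt
    idle_ms = 0       # non-talk-spurt silence since the last event
    for is_voice in frames:
        if in_spurt:
            spurt_ms += frame_ms
            trailing_ms = 0 if is_voice else trailing_ms + frame_ms
            # End on talk-spurt silence or on the maximum length.
            if trailing_ms >= end_silence_ms or spurt_ms >= max_spurt_ms:
                events.append(("talkspurt", spurt_ms))
                in_spurt = False
                spurt_ms = trailing_ms = idle_ms = 0
        elif is_voice:
            # First voice frame after silence starts a new talk-spurt.
            in_spurt = True
            spurt_ms = frame_ms
            trailing_ms = idle_ms = 0
        else:
            # Non-talk-spurt silence is discarded, but long runs are reported.
            idle_ms += frame_ms
            if idle_ms >= long_silence_ms:
                events.append(("silence", idle_ms))
                idle_ms = 0
    return events


if __name__ == "__main__":
    stream = ([False] * 50 + [True] * 100 + [False] * 10 +
              [True] * 50 + [False] * 300)
    for event in segment(stream):
        print(event)
```

In this sketch a "talkspurt" event stands for enqueuing the audio into the ASR engine, and a "silence" event stands for the callback being invoked with the "Silence 5 sec" result.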
- CMU Sphinx2 should be downloaded and installed properly on the machine
where simvoice will run. The Sphinx2 ASR engine used by simvoice needs a long
list of arguments at initialization, which give the engine information such
as the location of the dictionary, the input model database and other
configuration parameters.
The current version of simvoice reads the list of arguments from a text file,
whose name should be specified as a command-line argument when simvoice is
started.
See the example argument file. You may want to use a script to prepare an
argument file with basic parameters; see README.txt. For more detail about
the arguments themselves, see the Sphinx2 User Guide.
- IBM ViaVoice TTS should be downloaded and installed properly on the machine
where simvoice will run. The engine's initialization file, eci.ini, which
comes with the installation, should be placed in the simvoice directory.
- Although simvoice is designed to accept multiple voice calls and keep
multiple associations between a caller and an IM user, the ASR and TTS
engines seem too heavy to be invoked simultaneously from multiple threads.
Simvoice might therefore freeze when it has more than two associations.
More tests are needed to evaluate the performance of the implementation in
this respect.
- The recognition performance of Sphinx2 in the test environment was not good
enough. It may be improved by using a bigger dictionary, training the engine
properly, or using the next version of Sphinx.
See the Sphinx2 User Guide.
- Simvoice was tested only with SIPC acting as both the voice caller and the
IM user; SIPC used "ratmedia" to stream voice data.
- This implementation of simvoice only allows a voice caller to initiate a
session; a future implementation should also let an IM user initiate one.
This feature could be implemented simply in the callback function
(OnIncomingSIPMessage) of the main module by having the IM user send the
base64-encoded SIP URI of the phone user.
When an incoming instant message has no associated voice call, simvoice could
place a call to the decoded final destination instead of sending an error
message back to the IM user.
However, if the call is directed to a phone, issues such as billing and
authentication need to be considered.
- This implementation of simvoice sends ASR results back to the voice caller
in an instant message so that the caller can confirm what was sent to the
associated IM user. A future implementation could instead play the ASR
results back through the TTS thread, which might be better when a voice
caller prefers audio or is capable only of audio. In that case the ASR
results should be spoken in a different voice, e.g., a system voice, so that
they are not confused with the IM user's actual messages.
The author would like to thank the following individuals for their contributions
to this project:
Prof. Henning Schulzrinne
Kundan Singh
Xiaotao Wu
at the Department of Computer Science, Columbia University
- [1] J. Rosenberg, H. Schulzrinne, et al., SIP: Session Initiation Protocol,
Request for Comments, Internet Engineering Task Force, RFC 3261, June 2002.
- [2] Columbia University, SIPC, SIPC web site.
- [3] Kundan Singh, Wenyu Jiang, et al., CINEMA: Columbia InterNet Extensible
Multimedia Architecture, CINEMA Technical Report.
- [4] J. Rosenberg, H. Schulzrinne, et al., SIP Extension for Instant Messaging,
Internet Draft, Internet Engineering Task Force, September 14, 2002. Work in progress.
- [5] Carnegie Mellon University, Sphinx2, CMU Sphinx web site.
- [6] IBM, ViaVoice TTS, ViaVoice TTS web site.
- [7] H. Schulzrinne, S. Casner, et al., RTP: A Transport Protocol for Real-Time
Applications, Request for Comments, Internet Engineering Task Force, RFC 1889,
January 1996.
- [8] Joseph Gagliano, Email Notification over telephone, Columbia University,
CS Project Report.
- [9] Kundan Singh, Sankaran Narayanan, et al., SIPUA, document.
- [10] N. Charlton, M. Gasson, et al., User Requirements for the SIP in Support
of Deaf, Hard of Hearing and Speech-impaired Individuals, Request for Comments,
Internet Engineering Task Force, RFC 3351, August 2002.