IM Analysis

Shruti Gandhi
Columbia University

New York, NY 12601
USA
slg2109@columbia.edu
 

Abstract

This paper studies and analyzes instant messaging (IM) behavior. Users are exploring various media of communication, and interaction between two users is no longer restricted to phone or e-mail. More and more users are moving to non-obtrusive forms of communication such as IM and text messaging. This paper analyzes a user's IM-based communication with other users to study how often the user chats with each contact, the amount of time spent on instant messaging with a contact, the frequency of messaging on a per-contact basis, the average duration of an IM conversation, and the distribution of total chat relative to chat per conversation. We analyzed chat logs generated by the IBM Sametime 7.5 IM application. The logs were collected from IBM Sametime users who volunteered to share their chat data by running a chat-analyzer program on their chat transcripts. Analysis of IM usage patterns can be used for server-side capacity planning based on IM usage trends, for automatically adjusting the presence subscription rate for each contact based on the amount of inter-user IM activity, and for providing personalized services to users based on usage-pattern profiling.

Introduction

The use of instant messaging has been increasing over time as newer functionality is added to IM in an effort to ease the pains of virtual communication. Enterprise products like IBM's Lotus Sametime [1] and Microsoft Communicator [2] are widely used for business communication. Usage of end-user IM products like Yahoo, AOL, or Skype has also increased, both in social circles and at work [3].

In this paper, we analyze IM usage patterns and identify applications of such trends and data in other areas. Our work provides the tools and baseline measurements to support the analysis of cross-cultural communications for coalition command processes. We encountered a few problems while performing IM usage analysis. IM usage data cannot be obtained from server logs because typical servers and service providers don’t expose this data, so the analysis requires users to share privacy-sensitive chat transcripts or archives. Also, different users use different, and often multiple, IM clients; the format of each archive is different and some archives are stored encrypted, which makes them infeasible to analyze.

Our initial approach was to analyze client-side chat logs (IM conversations) from Microsoft's MSN Messenger [4], but not many MSN users were willing to share their chat transcripts. To obtain an unbiased analysis from real data, it was crucial that users continue to use their instant messaging clients as they normally would, without going out of their way to alter their behavior for this study. Hence, we switched to studying IM usage behavior using logs from the corporate chat client IBM Sametime 7.5. We plan to make the chat-analyzer program for MSN Messenger available in the public domain so that interested users can run it against their MSN chat transcripts and send us the results.

The chat analyzer program written for Sametime gathers only non-privacy-sensitive data pertaining to each user's IM usage pattern, such as:

·        Per day distribution of total chat with each buddy in contact list;

·        Percentage of time a user chats with each individual buddy;

·        Amount of time a user spends with a buddy per day;

·        Per buddy distribution of total chat each day;

·        Typical length of conversations.

Background

Traditionally, overall IM volume has been used for server capacity planning, with no emphasis on the IM usage patterns of clients. As noted above, more clients now require real-time presence updates, partly because the context that presence provides increases IM usage, so analyzing usage patterns helps to automatically adjust the presence notification rate. Other uses fall in the area of social networking.

There are multiple types of analysis, ranging from simple measures such as message frequency, number of messages, and type of data exchanged (URLs, file transfers, emoticons) to complex linguistic analysis. These may in turn be used to determine the primary use of IM, the trust level between users, and social or work relationship dynamics.

Such an analysis can be used to study trends in IM usage: whether a user is using IM instead of e-mail or phone over a period of time, and whether phone usage decreases as IM usage increases. Additionally, usage patterns in terms of number of messages and time spent in IM communication can be used to automatically adjust the presence subscription rate for a user’s contacts [5].

 

Related Work

 

There has been some previous work in this area, but none of it focused on determining inter-user communication timings. Muller et al. [10] analyze IM usage with a large number of users, but the analysis is based on user self-reports and surveys. They studied the maturity of an IM network over a period of 24 months, showing the development of chat behaviors, social networks, and attitudes.

Another study, by Herbsleb et al. [8], describes experiences from introducing instant messaging and group chat in geographically diverse work groups. In such environments, the informal nature of the communication and cross-site communication are perceived to reduce the utility of IM-based communication.

Isaacs et al. [9] conducted a major study based on logging IM in the workplace. They found that workplace IM was used for complex work discussions. 28% of conversations were simple and single-purpose, and 31% were about scheduling and coordination. Heavy IM users and frequent IM partners were generally working together or collaborating, and their conversations involved many fast-paced discussions. Light and infrequent IM users used IM more for scheduling and coordination.

There have also been studies of instant messaging, interruption, and productivity. This work proposes techniques for queuing such interruptions to avoid productivity loss [12]. Other studies observe how introducing presence affects IM usage [11].

 

Architecture

The software is written for Java 1.5 and comprises approximately 3400 lines of code. The program has a main invocation class, ChatParserANDWriter.java, which loads the chat folders and passes them to FilePersonHistory.java and FileChatTranscript.java to parse and analyze. The analysis creates four files with data in XML format, stored in a chatdata folder. The XML files are then used to plot graphs using an IBM internal charting service and Microsoft Excel (see Figure 1).


The Sametime chat logs are stored in HTML format. The chat analyzer program creates three kinds of objects, Chat, User, and Date, that store all program-related data. A Chat object holds information such as the start time, end time, initiator, and length of a conversation. A User object stores information such as the chats with each user, total chats, and the percentage of chats per user. A Date object stores per-day conversation data to keep track of a user's monthly, weekly, and daily activity. The chat analyzer processes the HTML-based chat logs and generates four XML files. Example XML files are given below:

·         chat_by_user.xml
<userid>
<user creationTime="24 Jan 2007 17:38:46 GMT" id="5" initiator="6" lastActivityTime="24 Jan 2007 17:57:10 GMT" percent="8.44" total="853">
  <date chatdate="20070124">20</date>
  <date chatdate="20070129">21</date>
  <date chatdate="20070214">35</date>
...
...
</user>
</userid>

·         chat_by_time.xml
<userid>
<tally distribution="9" length="15"></tally>
<tally distribution="4" length="62"></tally>  
<tally distribution="2" length="79"></tally>  
<tally distribution="4" length="107"></tally>
...
...
...
</userid>

·         chat_by_date.xml
<userid>
<date chatdate="20070122" total="0" />
<date chatdate="20070123" total="84">
<user id="1">49</user>
<user id="2">24</user>
<user id="3">0</user>
...
</date>
</userid>

·         chat_by_bytes.xml
<userid>
<user chatlength_in_bytes="7" name="1"></user>
<user chatlength_in_bytes="7" name="2"></user>
<user chatlength_in_bytes="9" name="3"></user>
...
</userid>

This XML data is then formatted and passed to a plotting API that generates the histograms, line charts, and column charts displayed in the Measurements section.
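As an illustration of how a downstream plotting step might read one of these files, here is a minimal sketch. It is written in Python for brevity rather than the Java used by the chat analyzer, and it is not the author's code. The function name per_contact_daily_values is hypothetical, the sample values are copied from the chat_by_user.xml excerpt above, and it assumes each <date> element's text is an integer per-day value.

```python
import xml.etree.ElementTree as ET

# Sample mirroring the chat_by_user.xml excerpt (elided entries omitted).
SAMPLE = """<userid>
<user creationTime="24 Jan 2007 17:38:46 GMT" id="5" initiator="6"
      lastActivityTime="24 Jan 2007 17:57:10 GMT" percent="8.44" total="853">
  <date chatdate="20070124">20</date>
  <date chatdate="20070129">21</date>
  <date chatdate="20070214">35</date>
</user>
</userid>"""


def per_contact_daily_values(xml_text):
    """Return {contact id: {chatdate: value}} from chat_by_user-style XML.

    Assumption: the text of each <date> element is an integer.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for user in root.findall("user"):
        days = {d.get("chatdate"): int(d.text) for d in user.findall("date")}
        result[user.get("id")] = days
    return result


values = per_contact_daily_values(SAMPLE)
print(values["5"])
```

A structure like this (contact id mapped to per-date values) could be handed directly to a charting library to produce per-contact time series.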


Fig. 1 Program Architecture

 

Programming Documentation

Please see attached Chat Programming Doc

Measurements

The goal of studying users' instant messaging behavior was to find patterns in usage. Chat logs can contain anything from minimal to extensive data, so a careful decision had to be made about what data could be shared for such analysis without invading user privacy. Only non-confidential user data, such as timestamps, hashed unique user ids, chat length, and number of chat bytes over a span of time, was used for this study. The analysis can answer questions that are useful for a user to know:

·        Which users do I talk to more often? (A-list users)

·        What percentage of users do I talk to more often?

·        What is the typical length of my conversations?

·        What days of the week do I use IM more?

·        What month has more activity over others?

·        Is there a pattern in my usage?
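The study states that only hashed unique user ids were used, but it does not say how the ids were hashed. One common approach, assumed here purely for illustration (in Python rather than the analyzer's Java), is a salted SHA-256 digest truncated to a short pseudonym. The function name pseudonymize, the salt, and the sample address are all hypothetical; the small integer ids in the XML examples suggest the actual tool may use a different scheme.

```python
import hashlib


def pseudonymize(user_id, salt):
    """Return a short, stable pseudonym for a contact id.

    The same (salt, user_id) pair always maps to the same pseudonym, so
    per-contact statistics can be aggregated without exposing the real id.
    """
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability; an assumption, not a requirement


alias = pseudonymize("alice@example.com", salt="per-study-secret")
print(alias)
```

Keeping the salt on the volunteer's machine means the shared results cannot be trivially matched back to known addresses.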

 

The data set covers a period of four months (Jan 2007 – Apr 2007). The XML files generated by the chat analyzer were used to plot data such as the length of chat conversations with users, the total number of chat messages (in bytes and lines) with users, and the pattern of total chats over the four-month span for a set of users.

Fig. 2 shows, for a sample user, how chat data is plotted against time. For this particular user, the moving average of the total number of messages increases with time (date) (chart 1). The per-day chat distribution also reflects the distributed nature of co-workers: a user in the Eastern time zone working with users in the Vietnam, West Coast, and Indian time zones will have IM activity at any point of the day.
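The moving average mentioned above can be computed in several ways; the paper does not state the window or variant used. The sketch below, in Python for illustration, assumes a simple trailing window over daily message totals. The function name, the window of 3 days, and the sample totals are hypothetical.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average over `values`.

    For the first window-1 points, the average is taken over the points
    seen so far, so the output has the same length as the input.
    """
    out = []
    buf = deque(maxlen=window)
    running_total = 0.0
    for v in values:
        if len(buf) == window:
            running_total -= buf[0]  # value about to fall out of the window
        buf.append(v)
        running_total += v
        out.append(running_total / len(buf))
    return out


daily_totals = [0, 84, 40, 60, 20, 100, 120]  # illustrative, not study data
smoothed = moving_average(daily_totals, window=3)
print(smoothed)
```

A rising smoothed series over the observation period would correspond to the upward trend described for the sample user.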

Figure 2 Chat distribution with date

Figure 3a shows chat usage (in number of lines) for five users over four months. We observed an overall increase in the use of instant messaging over time. We also observed drops in usage over weekends, which can be attributed to the fact that this is a workplace IM client. In Figure 3b, an interesting observation is that usage peaks at the beginning and towards the end of the week. For example, total chat messages rise on Monday, March 5th, 2007, drop around Wednesday, March 7th, 2007, and start increasing again on Friday, March 9th, 2007. We believe this happens because people plan their work week on Monday or Tuesday and then catch up with people or make weekend plans on Friday, so IM usage is high at those times. We also noticed that many users at this company work from home on Fridays, which might explain their increased communication with their teams on Fridays.

             
Figure 3a Total chat messages distribution with time

Figure 3b IM usage of 5 users over 2 weeks
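A weekday pattern like the one described for Figures 3a and 3b can be checked by summing daily totals per day of the week. The following is a minimal Python sketch, not the analyzer's Java code. It assumes daily totals keyed by the YYYYMMDD chatdate format used in the XML files; the function name and the sample totals are hypothetical.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def totals_by_weekday(daily_totals):
    """Sum message totals per weekday from a {YYYYMMDD: total} mapping."""
    sums = {name: 0 for name in DAY_NAMES}
    for chatdate, total in daily_totals.items():
        weekday = datetime.strptime(chatdate, "%Y%m%d").weekday()  # Monday == 0
        sums[DAY_NAMES[weekday]] += total
    return sums


# Illustrative values only; these are not measurements from the study.
sample = {
    "20070305": 120,  # Monday
    "20070307": 60,   # Wednesday
    "20070309": 110,  # Friday
    "20070310": 5,    # Saturday
    "20070312": 130,  # Monday
}
by_day = totals_by_weekday(sample)
print(by_day)
```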

 

The graph in Figure 4 shows that most users have short conversations, with few occurrences of longer conversations. A conversation is a sequence of chat messages in a thread, separated from other conversations by an interval of thirty minutes; hence, this study splits a user's conversations on a 30-minute inactivity period. After 30 minutes without messages with a user (determined from timestamps), a new chat with that user is considered a new topic or conversation. This rule is also used to determine the total number of conversations per day. The occurrences of chat conversations range from zero to ten conversations, and this is consistent for all users, as seen below. Zero conversations means that a user pinged another user and got no response from the other party for at least another 30 minutes.
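The 30-minute inactivity rule can be sketched as follows. This is an illustrative Python version, not the analyzer's Java implementation. The function name is hypothetical, and treating a gap of exactly 30 minutes as the start of a new conversation is an assumption about the boundary case.

```python
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=30)


def split_conversations(timestamps, gap=INACTIVITY_GAP):
    """Group message timestamps with one contact into conversations.

    A message starts a new conversation when it arrives `gap` or more
    after the previous message (boundary handling is an assumption).
    """
    conversations = []
    for ts in sorted(timestamps):
        if conversations and ts - conversations[-1][-1] < gap:
            conversations[-1].append(ts)
        else:
            conversations.append([ts])
    return conversations


base = datetime(2007, 3, 5, 9, 0)
stamps = [base + timedelta(minutes=m) for m in (0, 5, 20, 60, 70, 300)]
convs = split_conversations(stamps)
print([len(c) for c in convs])
```

Counting the resulting groups per calendar day gives the number of conversations per day, and the number of messages in each group gives a conversation length that can be tallied into a histogram.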


Figure 4 Length of IM conversations vs.  Occurrences

The graphs in Figures 5(a) and 5(b) show the distribution of the number of chat messages for each user (on the x axis). The total amount of messaging for a user is measured by (1) counting the total lines typed and (2) counting the total number of bytes in the messages. Some users deliver an entire message in one line, while others type multiple lines of one or two words at a time. Although bytes and lines are related, we wanted to see whether the relationship is the same for all of a user's IM contacts or whether a user has a different communication pattern for each contact. This may also depend on the relationship with the contact.

We observe that the number of lines and the number of bytes follow a similar pattern, as can be seen in Fig. 5(a) and 5(b). From Fig. 5(c) we observe that the pattern is similar for many IM users.
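The two measures compared above, lines and bytes per contact, can be gathered in a single pass. The sketch below is illustrative Python rather than the analyzer's Java. It assumes each chat line is available as a (contact, text) pair and that bytes are counted on the UTF-8 encoding of the text; both function names and the sample messages are hypothetical.

```python
def line_and_byte_totals(messages):
    """Return {contact: (line_count, byte_count)} for (contact, text) pairs."""
    totals = {}
    for contact, text in messages:
        lines, size = totals.get(contact, (0, 0))
        totals[contact] = (lines + 1, size + len(text.encode("utf-8")))
    return totals


def bytes_per_line(totals):
    """Average bytes per line for each contact with at least one line."""
    return {c: size / lines for c, (lines, size) in totals.items() if lines}


# Illustrative messages only.
messages = [("a", "hi"), ("a", "ok"), ("b", "hello")]
totals = line_and_byte_totals(messages)
ratios = bytes_per_line(totals)
print(totals, ratios)
```

If the bytes-per-line ratio is roughly constant across contacts, the line and byte distributions will look alike; large differences would indicate a contact-specific typing style.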

Another interesting observation is the distribution of total chat over the number of users. As can be seen, the majority of chat is with very few contacts. For the user under consideration, more than 10% of conversations are with fewer than 10 people, more than 20% are with fewer than 5 people, and 50% are with a single person.
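The concentration of chat among a few contacts can be quantified as the share of total chat carried by the k busiest contacts. The following Python sketch is illustrative only. The function name and the per-contact totals are hypothetical and are not data from the study.

```python
def share_of_top_contacts(totals, k):
    """Fraction of all chat that is with the k busiest contacts.

    `totals` maps contact id to a chat total (messages, lines, or bytes).
    """
    ranked = sorted(totals.values(), reverse=True)
    grand_total = sum(ranked)
    if grand_total == 0:
        return 0.0
    return sum(ranked[:k]) / grand_total


# Illustrative per-contact totals.
totals = {"1": 500, "2": 200, "3": 150, "4": 100, "5": 50}
top1 = share_of_top_contacts(totals, 1)
top2 = share_of_top_contacts(totals, 2)
print(top1, top2)
```

Plotting this share for increasing k gives a cumulative curve; a curve that rises steeply for small k reflects the concentration described above.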

 


Figure 5(a) Length of IM conversations vs.  Users (number of messages)


Figure 5(b) Length of IM conversations vs.  Users (number of bytes)


Figure 5(c) Length of IM conversations vs.  Users

       
Figure 6 Normalized total chats vs.  Users

In Figure 6, we observe the distribution of total chat (the number of instant messages exchanged) for multiple users. As can be seen, most messages are concentrated among a few users. This is an important observation, as it can be used to build presence profiles for the users with the most communication. Fig. 7(a) and 7(b) show the amount of chat (number of IMs, normalized over all users) vs. the number of users; Figure 7(b) is on a log scale.
Figure 7(a) and 7(b, log scale) Normalized total chats vs.  Users

 

Future Work

In the future we plan to extend this work with studies of IM applications such as MSN, Google Talk, Yahoo, and Trillian. Since these are not workplace-specific, they will give better insight into instant messaging usage from a social communication point of view. We also plan to study how a user's IM usage varies against phone and e-mail usage over a period of time. IM is real-time compared with e-mail and non-intrusive compared with the phone, so it would be interesting to see whether IM is replacing either of them to some extent. Linguistic analysis of instant messages could also be done to determine the purpose and type of IM use. Socially connected clouds can be created based on IM usage. We plan to provide tools for the measurement and analysis of such communication patterns, which in turn may improve the quality and effectiveness of communication.

Conclusions

IM-based communication patterns are very useful and can be correlated with situation analysis. A change in the set of users someone communicates with could indicate a changing work or social relationship, or a tentative change in job or work environment. Our observations included the daily variation of chat usage with time, the increase of overall chat over a period of time, and more communication with specific users with whom one shares a work relationship. We measured the distribution of inter-user IM usage over time and found that the bulk of IM usage tends to be concentrated among a few users. We also found that some users have exceptionally high average chat usage. These could typically be project managers or program coordinators, which may need to be considered when analyzing average conversations for a job role. Other questions that such a study could help answer include: Does IM usage increase towards the end of a quarter? Does studying IM patterns demonstrate that no patterns exist in IM usage? Is the newer generation (more so than other age groups) learning to stay online wherever they go (work, school, or home), since IM communication, like text messaging, is non-obtrusive? Can this study be expanded to online behavior in general?

Task List

Most of the code was written by me except:
ValueSortMap.java -
http://www.programmersheaven.com/download/49349/download.aspx

References

[1]   IBM Sametime, www.ibm.com/lotus/sametime

[2]   Microsoft Communicator, office.microsoft.com/communicator

[3]   IM Usage survey, http://www.aim.com/survey/

[4]   MSN, www.msn.com

[5]     Singh V. and Schulzrinne H., “Presence Traffic Optimizations”, Columbia University, CS Tech Report cucs-041-06.

[6]   Houri, A., "Problem Statement for SIP/SIMPLE", draft-ietf-simple-interdomain-scaling-analysis-00.txt (work in progress), Feb 2007.                                                              

[7]   Ljungstrand, P., Hård af Segerstad, Y., “Instant messaging with WebWho,” International Journal of Human-Computer Studies Volume 56, Issue 1 (January 2002), Pages 147-171.

[8]   James D. Herbsleb and David L. Atkins and David G. Boyer and Mark Handel and Thomas A. Finholt, “Introducing instant messaging and chat in the workplace”, CHI '02: Proceedings of the SIGCHI conference on Human factors in computing systems, Pages 171-178.

[9]   Ellen Isaacs and Alan Walendowski and Steve Whittaker and Diane J. Schiano and Candace Kamm, “The character, functions, and styles of instant messaging in the workplace”, CSCW '02: Proceedings of the 2002 ACM conference on Computer supported cooperative work. Pages 11-20.

[10]     Michael J. Muller and Mary Elizabeth Raven and Sandra Kogan and David R. Millen and Kenneth Carey, “Introducing chat into business organizations: toward an instant messaging maturity model” GROUP '03: Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work. Pages 50-57.

[11]     Peter Ljungstrand , Ylva Hård af Segerstad, “Awareness of presence, instant messaging and WebWho,” ACM SIGGROUP Bulletin, v.21 n.3, p.21-27, December 2000.

[12]      Czerwinski, M., Cutrell, E., & Horvitz, E. (2000). Instant Messaging and Interruption: Influence of Task Type on Performance. Proceedings of OZCHI 2000, 356-361.