Spam Analysis and Reputation Project

(Parser and Stand-Alone Framework)

Tejas Nadkarni

Columbia University

New York, NY 10027

USA

tgn2104@columbia.edu

Abstract

Many spam filters are currently employed to classify email messages as spam or non-spam. These filters are based on various hypotheses and are applied collectively to calculate a spam score for incoming mail. This score is then used to distinguish spam from regular mail. The goal of the Spam Analysis and Reputation Project (SARP) is to analyze spam and study the efficacy of the spam filters in current use. We do so by testing the hypotheses on which these filters are based.

Introduction

The initial approach was to set up a HoneyPot to attract spam mail. The idea was to parse the messages trapped by the HoneyPot and run various tests on them. The collected results would be used to generate statistics to prove or disprove various hypotheses.

This approach proved unsuccessful: the HoneyPot did not attract a sufficient amount of spam despite our best efforts to publicize the email address linked to it.

We then adopted a different approach to gain access to a sufficient amount of mail. We built a standalone application that uses the IMAP protocol to access the large volume of mail and spam already stored on the user’s mail server. The application connects to IMAP servers and downloads the headers of all messages. Various modules then act on these headers and run their tests. The results of these tests are used to generate statistics.

Parser

One of the most important steps in the first approach was to build a parser for the mail attracted by the HoneyPot. The parser read the mail header and body to extract various fields from the message. The extracted information was converted into a format more convenient for conducting tests and then stored in a database.

Working

The parsing was implemented in Mail.java. Parsing is triggered by calling the parseMessage() method and passing the message stream as input. This method breaks down the message into Header and Body, and in turn calls parseHeaders() and parseBody().
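The split into header and body can be sketched as follows. This is a minimal illustration and not the code in Mail.java: the class name SplitMessage and its fields are hypothetical, and the sketch takes the message as a String rather than a stream. It relies only on the RFC 2822 rule that the header section ends at the first empty line.

```java
// Hypothetical helper; not the actual Mail.java implementation.
final class SplitMessage {
    final String header;
    final String body;

    private SplitMessage(String header, String body) {
        this.header = header;
        this.body = body;
    }

    // Splits a raw message at the first empty line, which RFC 2822
    // defines as the separator between header section and body.
    static SplitMessage split(String raw) {
        StringBuilder header = new StringBuilder();
        StringBuilder body = new StringBuilder();
        boolean inBody = false;
        for (String line : raw.split("\r?\n")) {
            if (!inBody && line.isEmpty()) {
                inBody = true;   // the separator line itself is dropped
                continue;
            }
            (inBody ? body : header).append(line).append('\n');
        }
        return new SplitMessage(header.toString(), body.toString());
    }
}
```

The two resulting strings would then be handed to the header and body parsers respectively.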

Parsing Headers

The parseHeaders() method breaks the header down into individual header fields and invokes the parser for each field.

The fields to be parsed are:

Message-Id
From
To
Sender
Reply-To
Return-Path
Date
List-Subscribe
Subject
In-Reply-To
References
Received
DomainKey-Signature
Received-SPF
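Breaking the header into individual fields might look like the sketch below. It is an illustration under stated assumptions, not the project's parseHeaders() code: the class name HeaderFields is hypothetical, field names are lower-cased for lookup, and folded continuation lines are joined with a single space, which is a simplification of RFC 2822 unfolding. Values are kept in a list because fields such as Received occur more than once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; names are assumptions made for this sketch.
final class HeaderFields {
    static Map<String, List<String>> parse(String headerBlock) {
        // Step 1: unfold. A line starting with space or tab continues
        // the previous field.
        List<String> logical = new ArrayList<>();
        for (String line : headerBlock.split("\r?\n")) {
            if (line.isEmpty()) continue;
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && !logical.isEmpty()) {
                int last = logical.size() - 1;
                logical.set(last, logical.get(last) + " " + line.trim());
            } else {
                logical.add(line);
            }
        }
        // Step 2: split each logical line at the first colon.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : logical) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue;   // not a "Name: value" line, skip it
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }
}
```

A field that is absent from the message simply has no entry in the map, so callers have to handle a null lookup result.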

Parsing Email Addresses

The InternetAddress class is used for parsing email addresses. The class can parse addresses of the form “Display Name” <userid@domain.com>.

Parsing Date

The MailDateFormat class was used to parse the Date string in the message.
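For readers without JavaMail at hand, the same step can be approximated with the standard library. The sketch below is a stand-in and not the project's code: it uses java.time's RFC_1123_DATE_TIME formatter instead of MailDateFormat, the class name MailDates is hypothetical, and it only handles the common "Tue, 3 Jun 2008 11:05:30 +0200" form, not obsolete forms or trailing comments such as "(PDT)". Unparseable or missing dates yield null.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper using java.time instead of JavaMail's MailDateFormat.
final class MailDates {
    // Returns the parsed date, or null if the header is missing or malformed.
    static OffsetDateTime parseOrNull(String dateHeader) {
        if (dateHeader == null) return null;
        try {
            return OffsetDateTime.parse(dateHeader.trim(),
                    DateTimeFormatter.RFC_1123_DATE_TIME);
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```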

Parsing Trace

The trace (Received) fields were among the most difficult to parse. The first approach parsed them strictly according to the grammar defined in RFC 2822 and RFC 2821. However, it became apparent that SMTP servers rarely follow these standards strictly for the trace field.

The next approach parsed the trace less strictly and had a much higher success ratio. Regular expressions were used to find domains and IP addresses in the “From”, “By” and “With” sections of each trace stamp. This process was repeated for every Received stamp. This approach gave much more reliable results.
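The lenient approach can be sketched as below. The patterns are assumptions made for illustration and are not the expressions used in the project; the class name ReceivedStamp is hypothetical. Each pattern looks for a keyword and captures the token after it, and a section that is not present is reported as null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lenient parser for one Received stamp.
final class ReceivedStamp {
    // Host name after "from" or "by", protocol name after "with".
    private static final Pattern FROM =
            Pattern.compile("\\bfrom\\s+([\\w.-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern BY =
            Pattern.compile("\\bby\\s+([\\w.-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern WITH =
            Pattern.compile("\\bwith\\s+(\\w+)", Pattern.CASE_INSENSITIVE);
    // Dotted quad; octet ranges are deliberately not validated.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{1,3}){3})\\b");

    final String from;
    final String by;
    final String with;
    final String ip;

    private ReceivedStamp(String from, String by, String with, String ip) {
        this.from = from;
        this.by = by;
        this.with = with;
        this.ip = ip;
    }

    static ReceivedStamp parse(String received) {
        return new ReceivedStamp(first(FROM, received), first(BY, received),
                first(WITH, received), first(IPV4, received));
    }

    // Returns the first captured group, or null if the pattern never matches.
    private static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }
}
```

Calling parse() once per Received field gives the hop-by-hop view of the route a message took.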

URL Grabber

One of the modules included in the parser scanned the message body to detect URLs. This was implemented using regular expressions. The following regex was used to match URLs:

"http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&+=]*)?\\b"
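A minimal sketch of how this expression could be applied to a message body is shown here. The regex is the one quoted above; the class name UrlGrabber and the method findUrls are hypothetical and do not come from the project source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the URL regex quoted in the text.
final class UrlGrabber {
    private static final Pattern URL = Pattern.compile(
            "http(s)?://([\\w-]+\\.)+[\\w-]+(/[\\w-./?%&+=]*)?\\b");

    // Returns every URL found in the body, in order of appearance.
    static List<String> findUrls(String body) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(body);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Because of the trailing word boundary, punctuation that follows a URL at the end of a sentence is not included in the match.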

However, since the first approach was not successful, I was not able to collect any measurements using the URL Grabber.

Domain Age Checker (Whois)

Another function embedded in the parser was a Domain Age Checker. The purpose of this tool is to prove or disprove the hypothesis that domains sending spam are generally newly registered. If the hypothesis holds, an effective way of filtering spam would be to check the age of the domain from which the message was sent.

Whois servers on the internet keep a record of a subset of domain names registered on the internet. It is possible to fire whois queries to these servers on port 43. One of the details returned by the query is the creation date of the domain.

Whois.java is the class used to find the age of domains. The module sends a whois query to the “whois.crsnic.net” server and parses the result to find the creation date. From this date, the age of the domain in months is computed and returned.
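The query and the age calculation can be sketched as below. This is not the code in Whois.java: the class name DomainAge is hypothetical, and the reply format assumed by the pattern (a line such as "Creation Date: 1997-09-15T04:00:00Z") is an assumption that may not match what the server actually returns. The network call and the parsing are kept separate so that the parsing can be exercised on a saved reply.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the reply format below is an assumption.
final class DomainAge {
    private static final Pattern CREATED = Pattern.compile(
            "Creation Date:\\s*(\\d{4}-\\d{2}-\\d{2})", Pattern.CASE_INSENSITIVE);

    // Sends one whois query over TCP port 43 and returns the raw reply.
    static String query(String server, String domain) throws IOException {
        try (Socket socket = new Socket(server, 43)) {
            OutputStream out = socket.getOutputStream();
            out.write((domain + "\r\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.US_ASCII));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString();
        }
    }

    // Returns the age in whole months, or -1 if no creation date is found.
    static long ageInMonths(String whoisReply, LocalDate today) {
        Matcher m = CREATED.matcher(whoisReply);
        if (!m.find()) return -1;
        LocalDate created = LocalDate.parse(m.group(1));
        return ChronoUnit.MONTHS.between(created, today);
    }
}
```

A caller would combine the two, for example ageInMonths(query("whois.crsnic.net", domain), LocalDate.now()), and treat -1 as "age unknown".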

However, since the first approach was not successful, I was not able to collect any measurements using the Domain Age Checker.

Issues and solutions (Parser and modules)

Parser

The trace fields added by SMTP servers rarely follow the standards described in RFC 2821 and RFC 2822 strictly. The fields could not be parsed using strict parsing techniques. Instead the trace was broken down into “from”, “by”, and “date”. Then each section was parsed individually, usually searching for domain names and IP addresses inside them.

Almost all of the fields handled by the parser are specified as optional in the RFC standard, so many messages may not contain them at all. In such cases the parser enters a null value into the database, and the modules acting on the data are responsible for handling the null cases.

URL Grabber

There are many ways in which URLs can be inserted into a message. The simplest and most common technique is to place the raw URL in the message text or inside tags. URLs can also be hidden inside JavaScript or obfuscated using code. The URL Grabber can only detect URLs that are present as full strings in the message body. I inspected 20 spam mails across 3 mailboxes and found that none of them had obfuscated URLs. Although this sample is small, it suggests that the approach detects the majority of URLs.

Domain Age Checker (Whois)

The results of a whois query are generally in a human-readable rather than machine-readable format. After analyzing the results of around 15 whois servers, I decided to use the “whois.crsnic.net” server, whose results I was able to parse for the creation date.

Different servers follow different output formats, with little consistency across and even within servers. The “whois.crsnic.net” server follows a standard format to display creation date which enabled it to be parsed correctly.

No whois server has a record of all the domains on the internet. Most servers generally keep a record of a geographical subset of the total number of domains. The “whois.crsnic.net” server keeps a record for all .edu, .com and .org domains. This list should be sufficient for our research purpose. In the future, other whois servers can be added if further capabilities are required.

Mail Statistics Generator

The second approach employed in SARP was to make use of the large number of mails and spam stored in existing email accounts. Many of the email service providers have started offering IMAP access to their mail servers (e.g. Gmail, CubMail). We can use IMAP to retrieve messages and their headers from mail servers and use them as input for our tests.

Framework

This approach required a basic framework to which various test modules could be attached. The application accepts a username, password and mail server address from the user. Using these, it connects to the mail server and retrieves all message headers or message bodies. It then calls the individual modules sequentially to perform tests and generate statistics.

User Interface

A Graphical User Interface (GUI) was developed for the Mail Statistics Generator to make the application easier to use. Since many of our potential users may not be technical, a GUI was preferred over a command-line interface. Another advantage of the GUI was that it made sorting mail folders easier.

Sorting Mail

The modules attached to the framework require the messages to be categorized as “Sent”, “Non-Spam” or “Spam”. The user has to classify the folders on the mail server into these three categories.
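One way to represent this classification is an enum for the three categories and a map from category to folder names. The sketch below is an assumption about a possible data structure, not the project's code; the names MailCategory and FolderSorting and the folder names in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// The three categories named in the text.
enum MailCategory { SENT, NON_SPAM, SPAM }

// Hypothetical record of which server folder belongs to which category.
final class FolderSorting {
    private final Map<MailCategory, List<String>> folders =
            new EnumMap<>(MailCategory.class);

    // Assigns a folder to a category; an earlier assignment is replaced.
    void assign(String folderName, MailCategory category) {
        for (List<String> names : folders.values()) {
            names.remove(folderName);
        }
        folders.computeIfAbsent(category, c -> new ArrayList<>()).add(folderName);
    }

    // Returns the folders in a category, or an empty list if there are none.
    List<String> foldersIn(MailCategory category) {
        List<String> names = folders.get(category);
        return names == null ? Collections.<String>emptyList() : names;
    }
}
```

For example, a user might assign "INBOX" to NON_SPAM, "Junk" to SPAM and "Sent Mail" to SENT before the application starts reading mail.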

Reading Mail

The JavaMail API was used to establish a connection to the IMAP server. After the user has sorted the folders into the three categories, the application reads each folder and retrieves the messages in it. It stores the messages in three message arrays, one per category.

Caching Messages

Since the messages are stored on the server, each call for a header maps to a server call. Because the framework supports multiple modules, each making its own calls for messages, the application would take an unacceptable amount of time to execute. Hence the framework caches message headers in local memory before invoking the modules.

Modules

The framework required a standard interface for calling the modules and for passing messages to these modules. Module.java defines an abstract class to be extended by all modules attaching to the framework. The class contains methods for passing messages, invoking the processing code and retrieving the results of processing.
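The module contract described above might look like the following sketch. It is an illustration only: the class is called AnalysisModule here to avoid a clash with java.lang.Module, the method names are assumptions rather than those in Module.java, and messages are simplified to plain strings instead of JavaMail objects. CountModule and Framework are hypothetical examples of a module and its caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the module base class; method names are assumptions.
abstract class AnalysisModule {
    protected List<String> sent = new ArrayList<>();
    protected List<String> nonSpam = new ArrayList<>();
    protected List<String> spam = new ArrayList<>();

    // Called by the framework to pass in the three message sets.
    void setMessages(List<String> sent, List<String> nonSpam, List<String> spam) {
        this.sent = sent;
        this.nonSpam = nonSpam;
        this.spam = spam;
    }

    abstract void process();      // runs the module's test
    abstract String getResult();  // result text, valid after process()
}

// Example module: reports how many messages each category holds.
final class CountModule extends AnalysisModule {
    private String result = "";

    @Override
    void process() {
        result = "sent=" + sent.size() + " non-spam=" + nonSpam.size()
                + " spam=" + spam.size();
    }

    @Override
    String getResult() {
        return result;
    }
}

// Calls every module in turn and concatenates the result strings.
final class Framework {
    static String runAll(List<AnalysisModule> modules, List<String> sent,
                         List<String> nonSpam, List<String> spam) {
        StringBuilder report = new StringBuilder();
        for (AnalysisModule module : modules) {
            module.setMessages(sent, nonSpam, spam);
            module.process();
            report.append(module.getResult()).append('\n');
        }
        return report.toString();
    }
}
```

With a common base class, adding a new test means adding one subclass; the framework loop does not change.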

Status

Since the application may take a considerable amount of time to run on a large mailbox, it has to keep providing a status update to the user. This was implemented by the use of two progress bars. One progress bar tracks the overall progress of the application over all modules, while the other tracks the progress of individual modules.

Results

Each module produces a result String after finishing execution. The application then concatenates all result strings and displays them to the user. The user can then mail the results back for analysis.

Message Body parsing

Some modules required the body of the message to be parsed. Since going through gigabytes of mail messages may take a really long time, it was decided to build a separate framework for modules parsing the body.

Issues and Solutions (Stand-Alone Mail Statistics Generator)

Many of the modules running on the framework were making the same header calls on the message. However using the Message class meant each call was a separate server call. This meant that the time for execution increased drastically. This problem was solved by caching message headers in memory. This was accomplished by creating the CachedMessage class which extended the Message class. The CachedMessage first downloads all the required headers and stores them in local memory. Every call for a header is then redirected to local memory.
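The caching idea can be illustrated without JavaMail as follows. This is a simplified stand-in, not the CachedMessage class itself: the HeaderSource interface models a server-backed message in which every getHeader() call costs one round trip, and CachedHeaders is a hypothetical wrapper that fetches the required headers once and answers all later calls from memory.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a server-backed message: each call models one server round trip.
interface HeaderSource {
    String getHeader(String name);
}

// Hypothetical caching wrapper, analogous in spirit to CachedMessage.
final class CachedHeaders implements HeaderSource {
    private final Map<String, String> cache = new HashMap<>();

    // Downloads the required headers once, up front.
    CachedHeaders(HeaderSource remote, String... requiredHeaders) {
        for (String name : requiredHeaders) {
            cache.put(name, remote.getHeader(name));
        }
    }

    // Served from local memory; headers that were not requested return null.
    @Override
    public String getHeader(String name) {
        return cache.get(name);
    }
}
```

However many modules ask for the same header, the number of server calls stays equal to the number of headers requested at construction time.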

Since the modules were running on the event dispatch thread of the main GUI, the GUI would freeze during processing. This meant that the application could not be closed when it was processing. The solution to this was to move the processing to a different thread. The SwingWorker class was used to move processing to a different thread, while the Event Dispatch thread could return to its idle state.
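Moving the work off the Event Dispatch Thread with SwingWorker can be sketched as below. The class ProcessingWorker is hypothetical and the "processing" is simulated by counting non-empty header strings; it is not one of the project's modules. The helper runBlocking() exists only so the sketch can be tried outside a GUI; a real GUI would call execute() and react in done() instead of blocking.

```java
import java.util.List;
import javax.swing.SwingWorker;

// Hypothetical worker; the processing step is a stand-in for real module work.
final class ProcessingWorker extends SwingWorker<Integer, Void> {
    private final List<String> headers;

    ProcessingWorker(List<String> headers) {
        this.headers = headers;
    }

    // Runs on a background thread, so the GUI stays responsive.
    @Override
    protected Integer doInBackground() {
        int processed = 0;
        for (int i = 0; i < headers.size(); i++) {
            if (!headers.get(i).isEmpty()) {
                processed++;
            }
            // Progress listeners are notified on the Event Dispatch Thread.
            setProgress(100 * (i + 1) / headers.size());
        }
        return processed;
    }

    // Runs on the Event Dispatch Thread once the background work finishes;
    // this is the safe place to update GUI components.
    @Override
    protected void done() {
    }

    // Demonstration helper only: starts the worker and waits for its result.
    static int runBlocking(List<String> headers) {
        ProcessingWorker worker = new ProcessingWorker(headers);
        worker.execute();
        try {
            return worker.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A progress bar can follow the work by registering a PropertyChangeListener for the "progress" property on the worker.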

The module status display was also updated from the same thread as the processing. This meant the progress bar would not show any updates until the processing had completed. This issue was resolved by employing a separate thread for repainting the status screen.

Some modules required the body of the message in order to execute. Including these modules with the others would drastically slow down the application. Hence a separate framework and application were developed for the modules requiring the message body.

References

RFC 2822 [http://tools.ietf.org/html/rfc2822]
RFC 2821 [http://tools.ietf.org/html/rfc2821]
JavaMail API [http://java.sun.com/products/javamail/]
Spam Analysis and Reputation Project [http://www1.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://www1.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Henning Schulzrinne - Spam Analysis and Reputation Project
Adran Frei - Spam Analysis and Reputation Project: DNS Blacklists
Swati A Kumar - Spam Analysis and Reputation Project: Email Encryption Headers and Database Schema
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module.
Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis Modules
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Preethi Narayan